Submitting

The following classes must be submitted in order to be graded:

  • LinkStrand.java
  • Make sure when you submit, all of the following are present and functional:

    Class: LinkStrand
     Constructor: LinkStrand()
     Constructor: LinkStrand(String)
     Method: public String toString()
     Method: public IDnaStrand append(String)
     Method: public IDnaStrand append(IDnaStrand)
     Method: public long size()
     Method: public IDnaStrand reverse()
     Method: public ArrayList nodeList()
     Method: public IDnaStrand cutAndSplice(String, String)
     Method: public String getStats()
     Method: public String strandInfo()
     Method: public void initializeFrom(String)

    You also must submit your README in order to receive any credit.

    If you create any new classes which are required for the above classes to compile, you should submit those as well.

    Linked List Overview

    You’ll be creating a class LinkStrand that implements a Java interface IDnaStrand. The class simulates cutting a strand of DNA with a restriction enzyme and appending/splicing-in a new strand.

    You must use a linked-list to support the operations — specifically the class LinkStrand should maintain pointers to a linked list used to represent a DNA strand. You should maintain a pointer to the first Node of the linked list and to the last Node of the linked list. These pointers are maintained as class invariants. A class invariant is a property that must hold true after the constructor is finished and at the entry and exit of all public member functions. In this case the property of pointing to first/last nodes must hold after any method in the class executes (and thus before any method in the class executes). A Strand of DNA is initially represented by a linked list with one Node; the Node stores one string representing the entire strand of DNA. Thus, initially the instance variables myFirst and myLast will point to the same node. If any nodes are appended, the value of myLast must be updated to ensure that it correctly points to the last node of the new list.

    You’ll need a nested/inner class to represent a Node of the linked list for storing DNA information. Here is what that should look like:
    //Outer class definition....
    public class Node {
         String info;
         Node next;

        Node(String s){
          info = s;
          next = null;
     }
    }

    private Node myFirst, myLast; //first and last nodes of list
    private long mySize; //# nucleotides in DNA
    ...rest of class

    Since the purpose of this assignment is to utilize Linked Lists, your LinkStrand should not use, nor should you need any global variables besides Nodes and primitive types. Using an unnecessary global data structure, such as an array, a StringBuilder, or an ArrayList, may cost you almost all or all of your correctness and engineering points.

    The diagram below shows the results of cutting an original strand of DNA (represented by "- - - ...")at three points and then splicing-in the strand “GTGATAATTC” at each of the locations at which the original strand was cut. Since splicing into a linked list is a constant-time, O(1) operation this implementation should be more efficient in time and space when compared to the String implementation.
    Node

    How to Test Your Code

    To test your LinkStrand class a JUnit tester has been provided for you. The tester will test individual methods in your class. If you need a refresher on JUnit refer to Discussion 4 Slides and Code.
    Note: Passing these tests doesn't guarantee full credit since the tests are about correctness, not efficiency.
    Implementing LinkStrand is the bulk of the coding work for this assignment. You’ll need to implement every method and use the JUnit tests to help determine if your methods are correctly implemented.

    Overview of LinkStrand Methods

    Because LinkStrand implements the IDnaStrand interface, all the methods you are overriding are documented with comments, so you should check there if you are wondering how any methods should work. You can also refer to the working methods in SimpleStrand, although they will differ in implementation because they use a StringBuilder instead of a linked list. Make sure you correctly implement every method specified in the interface. Below are some considerations for you as you begin to code:

    Benchmarking Part 2

    We're going to see how our newly created LinkStrand fares in the benchmark.

    Let's run virtual experiments to show that LinkStrand is O(B) where B is the number of breaks. First, change SimpleStrand to LinkStrand in the main method of the class DNABenchMark. Then, make different runs to demonstrate this O(B) behavior, changing the number of breaks in your file. You can do this by constructing your own genomic data or by reusing ecoli.dat.

    In your Analysis, describe the process, the results, and the reasoning you used to conclude the code is O(B). Please write down all the data you gathered, including timings, that demonstrate this O(B) behavior; graph the data to strengthen your case, and explain your data by demonstrating you understand the process.

    Analysis

    So, it should contain everything from the sections (and the subsections) labeled Benchmarking Part 1 and Benchmarking Part 2. Read it carefully.

    Grading

    For transparency’s sake, below is a list of aspects of your code the automated tests will check. Your code will still be checked by a TA, and passing all these tests does not guarantee a perfect grade.

  • Does the size method return the correct value?
  • Does the size method take O(1)?
  • Does the toString method return the correct value?
  • Does the append method modify the LinkStrand object correctly?
  • Can the append method be called multiple times in a row?
  • Does the append method append SimpleStrand objects correctly?
  • Does the append method append LinkStrand objects correctly?
  • Does the append method append generic IDnaStrand objects (i.e. implementations which are not SimpleStrand or LinkStrand) correctly?
  • Does the append method take O(1)?
  • Does the initializeFrom method set up the LinkStrand object correctly?
  • Does the initializeFrom method set up the LinkStrand object correctly, if the LinkStrand has already been initialized?
  • Does the reverse method work for a single-node LinkStrand?
  • Does the reverse method work for a multiple-node LinkStrand?
  • Does the reverse method use memoization? (For extra credit)?
  • Does the cutAndSplice method return the correct value?
  • Does the cutAndSplice method throw an exception when called from a multi-node strand?
  • Does the cutAndSplice method take O(B)?
  • Does the nodeList method return the correct value?
  • Does the constructor with no arguments initialize LinkStrand correctly?
  • Does the constructor initialize the values in the LinkStrand correctly?
  • Does the getStats method return the correct value?
  • Does the strandInfo method return the correct value?
  • Does calling methods in a random order modify the LinkStrand object as expected?
  • Does LinkStrand use unnecessary static variables?
  • FAQ

    Here are a few commonly asked questions. This will be updated through the semester and through the years.


    ###Q: Can I have an example of an enzyme and a splicee?

    DNA example

    ###Q: When implementing my LinkStrand class and then running it through the DnaBenchmark, two of my recomb values returned as negative numbers. Any ideas why that might happen?

    ###A: Sounds like an integer overflow, your two largest recomb strings are probably longer in length than int’s max value, and if you try to store a value above the max int value, it will instead overflow to the minimum int value (a large negative number) and continue counting from there. If you want to see the larger numbers, try using a long instead (the same as an int, but has more bits and thus can store larger values).

    ###Q: What should cutAndSplice return if the given enzyme cannot be found within the strand?

    ###A: An empty string.

    ###Q: How do I vary the number of breaks in proving LinkStrand cutAndSplice is O(B)?

    ###A: The easiest way to do this is to take the sample files we gave you and copy/paste them over and over into new text files. So, for example, say ecoli.txt has b breaks. Then, create ecoli2.txt, which has ecoli.txt pasted into it twice. Logically, it should have 2b breaks, since every instance of the enzyme has been pasted twice. Do the same for an ecoli3.txt and it has 3b breaks. This requires little effort, and creates nicely spaced out numbers of breaks, which will give you a more accurate regression.

    If you are unable to create larger files due to memory issues, you can also use the same data file and change the enzyme. Alternatively, instead of copy/pasting all of ecoli.txt, copy/paste part of it.

    Note that DNABenchmark removes non-acgt characters upon loading a new file.

    ###Q: My LinkStrand cutAndSplice looks like it works, but when I run TestStrand it returns the original strand and I get a ‘self alter fail’.

    ###A: Unlike every other method, in LinkStrand.cutAndSplice you should not modify/return this - instead, you should create a new instance of LinkStrand which is a copy of this, modify that, and return that instead. So if your original code for cut and splice looks like

    methoda();
    methodb();
    return this;

    You instead want

    IDNAStrand ret = new LinkStrand(this.toString());
    ret.methoda();
    ret.methodb();
    return ret;

    Theory

    Restriction Enzyme Cleaving


    The science behind the DNA Assignment

    Restriction enzymes cut a strand of DNA at a specific location, the binding site, typically separating the DNA strand into two pieces. In the real chemical process a strand can be split into several pieces at multiple binding sites, we’ll simulate this by repeatedly dividing a strand.

    Given a strand of DNA aatccgaattcgtatc and a restriction enzyme like EcoRI gaattc, the restriction enzyme locates each occurrence of its pattern in the DNA strand and divides the strand into two pieces at that point, leaving either blunt or sticky ends as described below. In the simulation there’s no difference between a blunt and sticky end, and we’ll use a single strand of DNA in the simulation rather than the double-helix/double-strand that’s found in the physical/real process.

    Restriction enzymes have two properties or features: the pattern of DNA that marks a site at which separation occurs and a number/index that indicates how many characters/nucleotides of the pattern attach to the left-part of the split strand. For example, the adjacent diagram shows a strand split by EcoRI. The pattern for EcoRI is gaattc and the index of the split is one indicating that the first nucleotide/character of the restriction enzyme adheres to the left part of the split.

    In some experiments, and in the simulation you’ll run, another strand of DNA will be spliced into the separated strand. The strand spliced in matches the separated strand at each end as shown in the diagram below where the spliced-in strand matches with G on the left and AATTC on the right as you view the strands.

    Your code will be a software simulation of this recombinant process: the restriction enzyme will cut a strand of DNA and new DNA will be spliced-in to create a recombinant strand of DNA. In the simulation the code simply replaces every occurrence of the restriction enzyme with new genetic material/DNA — your code models the process with what is essentially string replacement.