The following classes must be submitted in order to be graded:
Make sure when you submit, all of the following are present and functional:
Class: LinkStrand
Constructor: LinkStrand()
Constructor: LinkStrand(String)
Method: public String toString()
Method: public IDnaStrand append(String)
Method: public IDnaStrand append(IDnaStrand)
Method: public long size()
Method: public IDnaStrand reverse()
Method: public ArrayList nodeList()
Method: public IDnaStrand cutAndSplice(String, String)
Method: public String getStats()
Method: public String strandInfo()
Method: public void initializeFrom(String)
You also must submit your README in order to receive any credit.
If you create any new classes which are required for the above classes to compile, you should submit those as well.
You’ll be creating a class LinkStrand
that implements a Java interface IDnaStrand
. The class simulates cutting a strand of DNA with a restriction enzyme and appending/splicing-in a new strand.
You must use a linked-list to support the operations — specifically the class LinkStrand
should maintain pointers to a linked list used to represent a DNA strand. You should maintain a pointer to the first Node of the linked list and to the last Node of the linked list. These pointers are maintained as class invariants. A class invariant is a property that must hold true after the constructor is finished and at the entry and exit of all public member functions. In this case the property of pointing to first/last nodes must hold after any method in the class executes (and thus before any method in the class executes). A Strand of DNA is initially represented by a linked list with one Node; the Node stores one string representing the entire strand of DNA. Thus, initially the instance variables myFirst
and myLast
will point to the same node. If any nodes are appended, the value of myLast
must be updated to ensure that it correctly points to the last node of the new list.
You’ll need a nested/inner class to represent a Node
of
the linked list for storing DNA information. Here is what that should look like:
//Outer class definition....
public class Node {
String info;
Node next;
Node(String s){
info = s;
next = null;
}
}
private Node myFirst, myLast; //first and last nodes of list
private long mySize; //# nucleotides in DNA
...rest of class
Since the purpose of this assignment is to utilize Linked Lists, your LinkStrand
should not use, nor should you need any global variables besides Nodes and primitive types. Using an unnecessary global data structure, such as an array, a StringBuilder
, or an ArrayList
, may cost you almost all or all of your correctness and engineering points.
The diagram below shows the results of cutting an original strand of DNA (represented by "- - - ...")at three points and then splicing-in the strand “GTGATAATTC” at each of the locations at which the original strand was cut. Since splicing into a linked list is a constant-time, O(1) operation this implementation should be more efficient in time and space when compared to the String implementation.
To test your LinkStrand class a JUnit tester has been provided for you. The tester will test individual methods in your class. If you need a refresher on JUnit refer to Discussion 4 Slides and Code.
Note: Passing these tests doesn't guarantee full credit since the tests are about correctness, not efficiency.
Implementing LinkStrand is the bulk of the coding work for this assignment. You’ll need to implement every method and use the JUnit tests to help determine if your methods are correctly implemented.
Because LinkStrand
implements the IDnaStrand
interface, all the methods you are overriding are documented with comments, so you should check there if you are wondering how any methods should work. You can also refer to the working methods in SimpleStrand
, although they will differ in implementation because they use a StringBuilder instead of a linked list. Make sure you correctly implement every method specified in the interface. Below are some considerations for you as you begin to code:
public LinkStrand(){ this(""); }
if(myFirst.next != null)) {
throw new RuntimeException("link strand has more than one node");
}
We're going to see how our newly created LinkStrand
fares in the benchmark.
Let's run virtual experiments to show that LinkStrand is O(B) where B is the number of breaks. First, change SimpleStrand
to LinkStrand
in the main method of the class DNABenchMark. Then, make different runs to demonstrate this O(B) behavior, changing the number of breaks in your file. You can do this by constructing your own genomic data or by reusing ecoli.dat
.
In your Analysis, describe the process, the results, and the reasoning you used to conclude the code is O(B). Please write down all the data you gathered, including timings, that demonstrate this O(B) behavior; graph the data to strengthen your case, and explain your data by demonstrating you understand the process.
Your analysis should contain:
SimpleStrand.cutAndSplice
. In particular, you should generate data to display O(N) behaviour where N is the size of the recombined strand returned. For this, just run the DNABenchmark as given to you and put your data into README. Explain why O(N) makes sense. Record your results in Analysis. For transparency’s sake, below is a list of aspects of your code the automated tests will check. Your code will still be checked by a TA, and passing all these tests does not guarantee a perfect grade.
Here are a few commonly asked questions. This will be updated through the semester and through the years.
###Q: Can I have an example of an enzyme and a splicee?
###Q: When implementing my LinkStrand class and then running it through the DnaBenchmark, two of my recomb values returned as negative numbers. Any ideas why that might happen?
###A: Sounds like an integer overflow, your two largest recomb strings are probably longer in length than int’s max value, and if you try to store a value above the max int value, it will instead overflow to the minimum int value (a large negative number) and continue counting from there. If you want to see the larger numbers, try using a long instead (the same as an int, but has more bits and thus can store larger values).
###Q: What should cutAndSplice return if the given enzyme cannot be found within the strand?
###A: An empty string.
###Q: How do I vary the number of breaks in proving LinkStrand cutAndSplice is O(B)?
###A: The easiest way to do this is to take the sample files we gave you and copy/paste them over and over into new text files. So, for example, say ecoli.txt has b breaks. Then, create ecoli2.txt, which has ecoli.txt pasted into it twice. Logically, it should have 2b breaks, since every instance of the enzyme has been pasted twice. Do the same for an ecoli3.txt and it has 3b breaks. This requires little effort, and creates nicely spaced out numbers of breaks, which will give you a more accurate regression.
If you are unable to create larger files due to memory issues, you can also use the same data file and change the enzyme. Alternatively, instead of copy/pasting all of ecoli.txt, copy/paste part of it.
Note that DNABenchmark removes non-acgt characters upon loading a new file.
###Q: My LinkStrand cutAndSplice looks like it works, but when I run TestStrand it returns the original strand and I get a ‘self alter fail’.
###A: Unlike every other method, in LinkStrand.cutAndSplice you should not modify/return this - instead, you should create a new instance of LinkStrand which is a copy of this, modify that, and return that instead. So if your original code for cut and splice looks like
methoda();
methodb();
return this;
You instead want
IDNAStrand ret = new LinkStrand(this.toString());
ret.methoda();
ret.methodb();
return ret;
Restriction enzymes cut a strand of DNA at a specific location, the binding site, typically separating the DNA strand into two pieces. In the real chemical process a strand can be split into several pieces at multiple binding sites, we’ll simulate this by repeatedly dividing a strand.
Given a strand of DNA aatccgaattcgtatc
and a restriction enzyme like EcoRI gaattc
, the restriction enzyme locates each occurrence of its pattern in the DNA strand and divides the strand into two pieces at that point, leaving either blunt or sticky ends as described below. In the simulation there’s no difference between a blunt and sticky end, and we’ll use a single strand of DNA in the simulation rather than the double-helix/double-strand that’s found in the physical/real process.
Restriction enzymes have two properties or features: the pattern of DNA that marks
a site at which separation occurs and a number/index that indicates how many
characters/nucleotides of the pattern attach to the left-part of the split strand. For
example, the adjacent diagram shows a strand split by EcoRI. The pattern for EcoRI
is gaattc
and the index of the split is one indicating that the first
nucleotide/character of the restriction enzyme adheres to the left part of the split.
In some experiments, and in the simulation you’ll run, another strand of DNA will be
spliced into the separated strand. The strand spliced in matches the separated
strand at each end as shown in the diagram below where the spliced-in strand
matches with G
on the left and AATTC
on the right as you view the strands.
Your code will be a software simulation of this recombinant process: the restriction enzyme will cut a strand of DNA and new DNA will be spliced-in to create a recombinant strand of DNA. In the simulation the code simply replaces every occurrence of the restriction enzyme with new genetic material/DNA — your code models the process with what is essentially string replacement.