Chloroplast Genome Assembly Pipeline from Solexa/Illumina GA Micro-Reads
Preparing Raw Data for Assembly In order to turn the raw Solexa read data into a usable format, we used Linux commands and Perl scripts as follows:
1) Linux commands used to sort and bin all 36 nucleotide micro-read sequences according to the first three nucleotides (5’ tags), resulting in accession-specific bins of sequence micro-reads. 2) Any micro-reads with tag sequence not matching our tags moved to ‘discard’ bin. 2) Perl script used to remove tags.
At this point, 3 or 4 nucleotides have been removed from the 5’ ends of all usable micro-reads from an original length of 36-40 bp. These 33-37mers were used in all subsequent assembly steps.
De Novo Assembly Steps: Velvet Assembler 0.4 / 0.5.03 For each accession, 33-37 bp micro-reads were assembled into contigs using the following settings on Velvet Assembler (Zerbino and Berney 2008):
1) hash length = 23 or 25 2) minimum coverage = 5X 3) minimum contig length of 100 bp.
This resulted in 100 - 300 contigs per accession, with average contig length ranging from 691 - 1098 bp.
- NOTE: we are also using Edena Assembler 2.1.1 in addition to Velvet. Use of both assemblers increases our de novo assemblies typically by several %. Prior to the next step, a consensus sequence of both de novo assemblies is made using BioEdit.
Reference-Guided Assembly I: Assembling a ‘Chimeric’ Sequence Prior to alignment of contigs to reference genomes, an ‘N’ was added to both ends of all contigs using a text editor (TextPad). This ultimately allowed us to distinguish gaps from true indels (i.e. gaps within de novo contigs) from missing/unassembled sequence (i.e., gaps between de novo contigs). Each accession’s contigs were then aligned to a reference genome using the program CodonCode Aligner 2.0.4 (CodonCode Corporation), as follows:
1) The appropriate reference genome was chosen for each accession. We used Pinus thunbergii for hard pines (subgenus Pinus) and Pinus koraiensis for soft pines (subgenus Strobus). 2) Contigs aligned to reference using standard settings for end-to-end alignments, but with ‘% identity’ requirement decreased to 70%. 3) Each assembly was exported and saved in interleaved Phylip format and the project was saved in case of future need/reference as a .proj file. 4) All remaining unaligned contigs were searched for chloroplast homology using BLASTN on the NCBI website.
Assemblies in their .phy format were then manually checked and manipulated to better align to their reference using the program BioEdit 126.96.36.199 (Hall 1999), as follows:
1) Insertions and deletions were added where needed to improve alignment. 2) Contig ends were adjusted as necessary. 3) Unaligned contigs with chloroplast homology were inserted (these typically contained larger insertions or gaps that precluded their alignment in Codoncode. 4) A consensus sequence was made from this assembly.
We then altered the consensus sequence by filling gaps between contigs with X’s, and used Microsoft Access or a Python script and the P. thunbergii or koraiensis reference sequence to make a chimeric sequence, as follows:
1) All sequence represented by contigs remained the same, including indels. 2) All sequence represented by contig gaps (X’s) was replaced by corresponding reference sequence of P. thunbergii/koraiensis. 3) all nucleotides within 30 bp of contig ends were changed to match the P. thunbergii/koraiensis reference.
Reference-Guided Assembly II: RGA We used the program RGA (Shen and Mockler, in prep) to assemble final versions of our genomes. RGA takes micro-read data and aligns individual reads to a reference genome. The reference for each accession for this assembly was that accession’s ‘chimeric’ sequence. We used coverage minimum of 2X, error rate of 0.03, and SNP call minimum of 70% (meaning SNP calls have a minimum of 3X coverage and three out of four calls).
“Final Alignment” We use MAFFT with increased gap opening penalty (2-2.5) and decreased gap extension penalty (0.0) to align the RGA output sequences of our assemblies.
Quality Controls This is an area on which we’re currently working, although preliminary results suggest most sequences are highly accurate and >90% complete. We have used the following strategies:
1) Visual check using VISTA genome browser (Mayor et al. 2000). This program calculates homology to a reference over the entire length of the genome, and allows for annotation of the reference. Results are displayed in an easy to comprehend chart (y axis = % identity, x axis = length of genome).
2) Comparison to Sanger sequencing. We directly compared sequence from our assemblies to that of Sanger sequencing of ~5 kb of exonic regions (genes rpoC1, rpoB, cemA, rpoA, rps4, psbD/C). The average error rate is ~0.05%. Results also suggest that areas directly adjacent to primers may be problematic.
3) Proportion of exon coverage compared to P. thunbergii. Exon coverage compared to annotated exons of P. thunbergii was calculated using Microsoft Access and Excel. Exon coverage ranged from 90.2 – 99.8%.
4) Regions of low complexity: Mononucleotide repeats. We noticed that contig ends often ended in mononucleotide repeats, and long repeats should present a problem for micro-read assembly. Indeed, analysis using the program Phobos (Mayer 2008) showed that most mononucleotide repeats 12 bp or greater in length were not effectively assembled.
5) Overall Sequence Polymorphism. We are also in the early stages of utilizing the program DnaSP 4.2 (Rozas et al. 2003), which allows us to visualize polymorphism over the entire length of aligned sequences. This should indicate potential areas of misassembly.
6) Manual adjustment of alignment. What can I say? Even though MAFFT does a great job for 98% of the alignment, it takes a lot of time to adjust the 2% it doesn’t get right. Typically this involves slowly scrolling through the entire alignment a number of times and making adjustments as they are needed.
Final Annotation Our initial annotations are done on our hard pine reference genome (P. thunbergii) using DOGMA, with manual adjustments where necessary. Manual adjustments are usually based on comparison to GenBank records and/or ChloroplastDB records. Once this sequence is annotated, we drop down the annotations to all other sequences using BioEdit. All exons for all sequences are then extracted and manually adjusted/edited as we check for internal stop codons, frameshift mutations, stop/start codon shifts, etc., that suggest misassembly. Annotation also serves as another quality control check.