PersonalGenomes@Home

From OpenWetWare



A volunteer-run, distributed computing project for human genome analysis, analogous to SETI@home.



Purpose

The Personal Genome Project (PGP) will publicly sequence the DNA of 100,000 volunteers and make the results freely available. Analyzing the raw data entails a huge computational effort and will require substantial computing power. The PGP is an open source project, and in this spirit we would like the data analysis to become an open source community effort through distributed computing with 'PersonalGenomes@Home'. The part of the data analysis that will initially be distributed to the community is the processing of raw Illumina images to produce reads, and their alignment against a reference. We are currently setting up the open source software packages Swift and BOINC, as well as Tranche and Free Factories.

Collaborators

Running Swift

Swift is designed to process a single tile of Illumina data at a time. It can be run in two modes: 1) base-caller only, which processes intensity files produced by the Solexa pipeline, and 2) full image analysis. The output files are purity-filtered and non-purity-filtered reads in FASTQ format, a run report, an intensity file, and files listing all processed images.


Processing PGP Data

AL - After verifying the functionality of the code with example data (available through Swift), we ran Swift on tile 87 of one PGP2 data set. The images are 7.1MB in size, and the processing resources required exceed the capacity of AL's personal computer (1.5GB RAM). Swift is designed to process one full tile of Illumina data at a time; attempts to process fewer than 36 cycles remain unsuccessful. Keeping in mind that the goal is to perform the data analysis on a 'regular' computer, we did not move to a machine with more memory. Instead, the large images were cropped into three horizontal thirds of 2.4MB each using ImageJ, an open source image processing tool. The crops overlapped by 10 pixels to avoid errors caused by edge effects. The tile was then processed as three individual sub-tiles using Swift in approximately 20 minutes.
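The cropping itself was done with ImageJ; the boundary arithmetic can be sketched as follows. This is a minimal Python illustration of the overlap idea only; the function name and the example image height are assumptions, not part of the actual pipeline:

```python
def crop_bands(height, n=3, overlap=10):
    """Split an image of the given pixel height into n horizontal bands,
    extending each interior cut edge by `overlap` pixels so that a spot
    near a cut line appears whole in at least one sub-image."""
    base = height // n
    bands = []
    for i in range(n):
        top = max(0, i * base - overlap)
        bottom = min(height, (i + 1) * base + overlap)
        if i == n - 1:
            bottom = height  # last band absorbs any rounding remainder
        bands.append((top, bottom))
    return bands
```

With a 10-pixel overlap, any cluster that straddles a cut line is fully contained in one of the neighbouring sub-images, which is why the three sub-tiles can be processed independently without losing edge clusters.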


Finding the MFN2_R364W Variant and Control

To verify that Swift produces correct results, the stack 87 data were searched for the MFN2_R364W variant, and stack 27 for the control. The last two numbers in each read ID are the x and y coordinates on the image. Differences in alignment techniques between the Illumina pipeline and Swift result in minor discrepancies in coordinate assignment.

Previous analysis had shown a T in the 11th cycle:
HWI-EAS0184_1:2:87:783:1097
GCACACGGTCTGGGCCAaGCAGATTGCAGAGGCGGg

The Swift analysis confirms this result:
@mid:784:1097
GCACACGGTCTGGGCCAAGCAGATTGCAGAGGCGGG

Analogously, the control sequence was extracted from stack 27.

Previous analysis had shown a C in the 13th cycle:
HWI-EAS0184_1:2:27:936:1555
CAGCACACGGTCCGGGCCAAGCAGATTGCAGAGGCG

The Swift analysis confirms this result:
@control_bot:935:1555
CAGCACACGGTCCGGGCCAAGCAGATTGCAGAGGCG


The results are illustrated in Figures 1 and 2. Images from the three cycles C, G, and T were loaded into the red, green, and blue color channels of SAOImage ds9 to create a combined three-color image. Each image shows 50x100 pixels centered on the location of the variant or control sequence as determined by Swift.
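The overlay itself was produced in SAOImage ds9; the channel assignment can be sketched in Python. This is a toy illustration assuming the three cycle images are given as same-size 2D grayscale arrays; the function name is hypothetical:

```python
def compose_rgb(c_img, g_img, t_img):
    """Merge three same-size grayscale images (the cycle images for the
    bases C, G and T) into one RGB image, mirroring the ds9 overlay:
    C -> red channel, G -> green channel, T -> blue channel."""
    return [
        [(c, g, t) for c, g, t in zip(row_c, row_g, row_t)]
        for row_c, row_g, row_t in zip(c_img, g_img, t_img)
    ]
```

A bright spot in only one cycle image then shows up in a single pure color, so the base called at that position in that cycle can be read off the overlay directly.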

Fig. 1: Cycles 9 through 13 of stack 87 for the variant MFN2_R364W. Colors are red: C, green: G, blue: T. The expected sequence is TCTGG (blue, red, blue, green, green). The location of the variant is circled.

Fig. 2: Cycles 11 through 15 of stack 27 for the control sequence for variant MFN2_R364W. Colors are red: C, green: G, blue: T. The expected sequence is TCCGG (blue, red, red, green, green). The location of the control is circled.


Swift on Free Factories

We have installed and run the Swift software on two virtual machines running on the http://bio.freelogy.org clusters at the Church lab. These nodes are intended as a production/development environment for coordinating the distributed computing effort, for which we plan to use the BOINC software. For now, we have automated the data analysis with a shell script that downloads all images of the PGP2 data set using wget, crops them into thirds using ImageMagick, and runs Swift on the entire data set. This analysis was first run with version 140 of Swift; with the new quality score recalibration in version 149, we reprocessed the entire data set. The reads produced by Swift for the PGP2 data set can be found below.

PGP2 Sample Data

We currently have raw images, two sets of processed reads, and draft assemblies (from each read set against both the whole HG18 reference assembly and against just the annotated capture regions) for about 10% of the exome of PGP2. We also have Affymetrix 500K data (binary) with two sets of calls for this participant. These are summarized below:

  • Illumina data
    • 07fb - Swift processed images placed against annotated 55K
    • d0e5 - Swift processed images placed against HG18
    • baac - Processed images (default) placed against annotated 55K
    • 51a4 - Processed images (default) placed against HG18
    • Files in each directory:
      • input.gz - input FASTQ
      • all.map.gz - binary placement
      • mapcheck.txt.gz - statistics
      • all.aln.txt.gz - placed reads
      • cns.fq.txt.gz - consensus assembly
      • cns.snp.txt.gz - SNPs
      • cns.win.txt.gz - coverage
      • cns.final.snp.txt.gz - filtered SNPs

The above information is preliminary. The current release is here.


PGP2 Sample Data Analysis

After using Swift to produce reads, AWZ ran Mega BLAST to get alignments.

Gapped Alignments

First, we were interested in finding all gapped alignments together with their read IDs. AL wrote a Perl script that reads the original Mega BLAST output and inserts the read ID before each line beginning with '>', which marks an alignment. Gapped alignments are marked 'Gaps', so a simple grep -B5 -A7 'Gaps' can extract all gapped alignments together with the read IDs they are associated with. The results are available here: PGP2 Gapped Alignments. This is an older data product from Swift, and the reads are only 35 base pairs in length.
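The ID-insertion step was done with a Perl script that is not reproduced on this page; the same idea in a Python sketch, assuming Mega BLAST output in which read IDs appear on 'Query=' lines and alignment headers start with '>':

```python
def tag_alignments(lines):
    """Insert the most recent read-ID line (the 'Query=' line) before
    every alignment header ('>' line), so each alignment block carries
    the ID of the read it belongs to."""
    read_id = None
    out = []
    for line in lines:
        if line.startswith("Query="):
            read_id = line  # remember the current read's ID
        if line.startswith(">") and read_id is not None:
            out.append(read_id)  # tag the alignment with its read ID
        out.append(line)
    return out
```

After tagging, a context grep such as grep -B5 -A7 'Gaps' pulls out each gapped alignment together with the read-ID line inserted just above its '>' header.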

Unique Alignments

We are also interested in finding all reads with a unique alignment.
The data format for all reads (PGP2 Reads) looks like this; example for one read with 3 alignments:

 Query= pgp2_bot_stack_100:121:0
 Query= pgp2_bot_stack_100:121:0
           (35 letters)
                                                      Score     E
 Sequences producing significant alignments:          (bits)  Value
 
 30436=chr20@32125008-32125124                           70   2e-13
 54581=chrX@111761282-111761507                          26     2.1
 54288=chrX@74249424-74249554                            26     2.1
 
 Query= pgp2_bot_stack_100:121:0
 >30436=chr20@32125008-32125124
           Length = 117
 
  Score = 69.9 bits (35), Expect = 2e-13
  Identities = 35/35 (100%)
  Strand = Plus / Minus
 
 Query: 1   ggtctgtcaggcttaggctctccagccatgttgat 35
 Sbjct: 59  ggtctgtcaggcttaggctctccagccatgttgat 25
 
 Query= pgp2_bot_stack_100:121:0
 >54581=chrX@111761282-111761507
           Length = 226
 
  Score = 26.3 bits (13), Expect = 2.1
  Identities = 13/13 (100%)
  Strand = Plus / Minus
 
 Query: 23  cagccatgttgat 35
 Sbjct: 153 cagccatgttgat 141
 
 Query= pgp2_bot_stack_100:121:0
 >54288=chrX@74249424-74249554
           Length = 131
 
  Score = 26.3 bits (13), Expect = 2.1
  Identities = 13/13 (100%)
  Strand = Plus / Minus
 
 Query: 13  ttaggctctccag 25
 Sbjct: 113 ttaggctctccag 101

The order of relevant information is:

 read ID
 read ID 
 n chromosome numbers (i.e. n lines starting with a number)
 read ID
 >chromosome number
 13 lines of information per chromosome. 

For a unique alignment there will only be one line starting with a number, and we need to extract the 13 lines following it.

To extract the unique alignments, AL processes the file line by line with a Perl script:

 read and save ID
count number of lines beginning with [0-9]
if line starts with '>' && number of lines beginning with [0-9]==1 grep -B1 -A12 this line, pipe into grep -B1 -A13 ID, pipe into output
reset ID, counter


The Script (oh boy!)

 #!/usr/bin/perl -w
 use strict;
 use Getopt::Long;
 my $input   = "reads_plusid";
 my $output  = "reads_plusid_unique_aligns";
 my $id      = "foo";
 my $suche   = "bar";
 my $counter = 0;
 GetOptions("input=s" => \$input, "output=s" => \$output);
 open(EINGABE, "<", $input) or die "cannot open $input: $!";
 while (defined(my $line = <EINGABE>)) {
     if ($line =~ /pgp2/)   { $id = $line; }   # save read ID
     if ($line =~ /^[0-9]/) { $counter++; }    # count alignments
     # if only one alignment, grep it and select the one with the current read ID
     if (($line =~ /^>/) && ($counter == 1)) {
         chomp($suche = $line);
         chomp(my $read = $id);
         system("grep -B2 -A12 -F -- \"$suche\" \"$input\" | grep -B1 -A13 -F -- \"$read\" >> \"$output\"");
         $counter = 0;
         $suche   = "foo";
     }
     if ($line =~ /^Sequences/) {              # reset values for next read
         $id = "foo"; $counter = 0; $suche = "foo";
     }
 }
 close(EINGABE);


Problems

Speed: grepping on the chromosome number alone is not unique, since the same chromosome region can be aligned to several reads; therefore the first grep is piped into a second grep for the ID associated with the unique read, as found before. The input file has 1,602,962,364 lines, and the double grep does not help with speed. In tests, 10^7 lines are processed in 10 minutes, which extrapolates to roughly 27 hours for the full file.
Is there a way to grep [pattern_1 in line_1 && pattern_2 in line_2]? For example:

 if line starts with '>' && previous line contains ID
    grep -B1 -A12 line > output
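grep itself cannot easily condition on two consecutive lines, but a single streaming pass can. The sketch below (function name hypothetical; it assumes read-ID lines contain 'pgp2' and that the '>' line plus a fixed number of following lines are wanted) answers only the two-line pattern question, not the uniqueness counting, and replaces the double grep with one scan of the file:

```python
def grep_pairs(lines, id_marker="pgp2", n_after=12):
    """Emit an alignment block only when a '>' header line is
    immediately preceded by its read-ID line, in one pass over the
    input instead of one grep invocation per alignment."""
    out = []
    prev = ""
    emit = 0  # how many more lines of the current block to copy
    for line in lines:
        if line.startswith(">") and id_marker in prev:
            out.append(prev)       # the read-ID line (the -B1 context)
            emit = n_after + 1     # the '>' line plus n_after lines after it
        if emit > 0:
            out.append(line)
            emit -= 1
        prev = line
    return out
```

Because every input line is touched exactly once, the runtime is a single pass over the 1.6-billion-line file rather than a grep over the whole file for each candidate alignment.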

False alarms: I created a subset of the data consisting of the first 10^7 lines of the input (using 'head'). While the script does find all the unique alignments, there are false alarms: a read can be flagged as 'unique' even though the output contains several alignments associated with it. In this subset of the data, this happened for only one read.

Stats

37GB of input data contain 1,602,962,364 lines.
The first 2*10^6 lines contain 12 unique alignments and produce around 300 lines of output in 1 min.
The first 10^7 lines contain 44 unique alignments and produce around 1600 lines of output in 10 min.


Alignment with Bowtie

Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences to the human genome at a rate of over 25 million 35-bp reads per hour. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: typically about 2.2 GB for the human genome.

See also: Bowtie paper in Genome Biology.

We have used Bowtie with its default settings to align the PGP2 reads from the Swift analysis described above. Each 1GB stack was processed in 50 seconds.

Results can be found here: PGP2 Bowtie Alignments
