User:Lindenb/Notebook/UMR915/20110222

=Belgium= re-aligning on hg19. see Makefile in /GENOTYPAGE/data/users/lindenb/20101108_belgium/20110222

/usr/local/package/mosaik-aligner/bin/MosaikBuild -fr /GENOTYPAGE/data/pubdb/ucsc/hg19/chromosomes/hg19.fa -oa _ignore.backup/hg19/reference.dat -- MosaikBuild 1.1.0018                                               2010-10-29 Michael Stromberg & Wan-Ping Lee Marth Lab, Boston College Biology Department --

- converting /GENOTYPAGE/data/pubdb/ucsc/hg19/chromosomes/hg19.fa to a reference sequence archive.

- parsing reference sequences: ref seqs: 25 (0.2884 ref seqs/s)

- writing reference sequences: 100%[===========================================================================================================]   0.8177 ref seqs/s        in 30 s

- calculating MD5 checksums: 100%[===========================================================================================================]     1.43 ref seqs/s        in 17 s

- writing reference sequence index: 100%[===========================================================================================================]     25.0 ref seqs/s        in  1 s

- creating concatenated reference sequence: 100%[===========================================================================================================]     7.13 ref seqs/s        in  3 s

- writing concatenated reference sequence...       finished. - creating concatenated 2-bit reference sequence... finished. - writing concatenated 2-bit reference sequence... finished. - writing masking vector...                        finished.

MosaikBuild CPU time: 187.420 s, wall time: 200.587 s /usr/local/package/mosaik-aligner/bin/MosaikJump -ia _ignore.backup/hg19/reference.dat -hs 15 -out _ignore.backup/hg19/reference_hs15 -- MosaikJump 1.1.0018                                                2010-10-29 Michael Stromberg                Marth Lab, Boston College Biology Department --

- retrieving reference sequence... finished.

- hashing reference sequence: 100%[=============================================================================================================] 2,301,657 hashes/s      in 22:24

- serializing final sorting vector... finished.

- writing jump positions database: 100%[=====================================================================================================] 1,423,918 hash positions/s      in 33:29

- serializing jump keys database (17 blocks): blocks: 17 (0.9169 blocks/s)

MosaikJump CPU time: 3327.950 s, wall time: 3405.920 s

mean length for sample 1 is 385.282 gunzip -c 1.TCA.454Reads.fna.gz | egrep ">" | tr " " "\n" | grep length | cut -d '=' -f 2 | awk '{n+=1.0; total+=int($1)*1.0;} END { print total/n;}' using -mmp 0.05 -act 26 -bw 51 -mhp 100

for the Align param ( http://code.google.com/p/mosaik-aligner/wiki/ParameterSettings )

Align
/usr/local/package/mosaik-aligner/bin/MosaikAligner -mmp 0.05 -act 26 -bw 51 -mhp 100 -p 7 -in ../reads1.mkb -ia _ignore.backup/hg19/reference.dat -j _ignore.backup/hg19/reference_hs15 -out align1.hg19.mka -- MosaikAligner 1.1.0018                                             2010-10-29 Michael Stromberg & Wan-Ping Lee Marth Lab, Boston College Biology Department --

- Using the following alignment algorithm: all positions - Using the following alignment mode: aligning reads to all possible locations - Using a maximum mismatch percent threshold of 0.05 - Using 7 processors - Using a Smith-Waterman bandwidth of 51 - Using an alignment candidate threshold of 26bp. - Setting hash position threshold to 100 - Using a jump database for hashing. Storing keys & positions in memory. - Using a homo-polymer gap open penalty of 4 - loading reference sequence... finished. - loading jump key database into memory... finished. - loading jump positions database into memory... finished.

Aligning read library (317992): 100%[============================================================================================================================]     83.7 reads/s   in   1:03:18

Alignment statistics (mates):

=
====================== --- total:            317992 total aligned:    298049 ( 93.7 %)
 * 1) filtered out:     19943 (  6.3 %)
 * 2) unique:          288449 ( 90.7 %)
 * 3) non-unique:        9600 (  3.0 %)

MosaikAligner CPU time: 26744.380 s, wall time: 4015.994 s /usr/local/package/mosaik-aligner/bin/MosaikAligner -mmp 0.05 -act 26 -bw 51 -mhp 100 -p 7 -in ../reads2.mkb -ia _ignore.backup/hg19/reference.dat -j _ignore.backup/hg19/reference_hs15 -out align2.hg19.mka -- MosaikAligner 1.1.0018                                             2010-10-29 Michael Stromberg & Wan-Ping Lee Marth Lab, Boston College Biology Department --

- Using the following alignment algorithm: all positions - Using the following alignment mode: aligning reads to all possible locations - Using a maximum mismatch percent threshold of 0.05 - Using 7 processors - Using a Smith-Waterman bandwidth of 51 - Using an alignment candidate threshold of 26bp. - Setting hash position threshold to 100 - Using a jump database for hashing. Storing keys & positions in memory. - Using a homo-polymer gap open penalty of 4 - loading reference sequence... finished. - loading jump key database into memory... finished. - loading jump positions database into memory... finished.

Aligning read library (563204): 100%[============================================================================================================================]    131.1 reads/s   in   1:11:35

Alignment statistics (mates):

=
====================== --- total:            563204 total aligned:    534455 ( 94.9 %)
 * 1) filtered out:     28749 (  5.1 %)
 * 2) unique:          519486 ( 92.2 %)
 * 3) non-unique:       14969 (  2.7 %)

MosaikAligner CPU time: 30219.450 s, wall time: 4500.977 s

Result
A few reads on the other chromosomes: e.g:

CTCACCAGACCTCCTAGGCGACCCAGACAATTATACCCTAGCCAACCCCTTAAACACCCCTCCCCACATCAAGCCCGAATGATA TTTCCTATTCGCCTACACAATTCTCCGATCCGTCCCTAACAAACTAGGAGGCGTCCTTGCCCTATTACTATCCATCCTCATCCT AGCAATAATCCCCATCCTCCATATATCCAAACAACAAAGCATAATATTTCGCCCACTAAGCCAATCACTTTATTGACTCCTAGC CGCAGACCTCCTCATTCTAACCTGAATCGGAGGACAACCAGTAAGCTACCCTTTTACCATCATTGGACAAGTAGCATCCGTACT ATACTTCACAACAATCCTAATCCTAATACCAACTATCTCCCTAATTGAAAACAAAATACTCAAATGGGCCTGTCCTTGTAGTAT AAACTAATACACCAGTCTTGTAAACCGGAGATGAAAACCTTTTTCCAAGGACAAATCAGAGAAAAAGTCTTTAACTCCACCATT AGCACCCAAAGCTAAGATTCTAATTTAAACTATTCTCTG

e.g. BLAT confirms on chrM (!):

ACTIONS     QUERY           SCORE START  END QSIZE IDENTITY CHRO STRAND  START    END      SPAN --- browser details YourSeq         541     1   543   543  99.9%     M   +      15482     16024    543

00001 ctcaccagacctcctaggcgacccagacaattataccctagccaacccct 00050 >>>>> |||||||||||||||||||||||||||||||||||||||||||||||||| >>>>> 15482 ctcaccagacctcctaggcgacccagacaattataccctagccaacccct 15531

00051 taaacacccctccccacatcaagcccgaatgatatttcctattcgcctac 00100 >>>>> |||||||||||||||||||||||||||||||||||||||||||||||||| >>>>> 15532 taaacacccctccccacatcaagcccgaatgatatttcctattcgcctac 15581

00101 acaattctccgatccgtccctaacaaactaggaggcgtccttgccctatt 00150 >>>>> |||||||||||||||||||||||||||||||||||||||||||||||||| >>>>> 15582 acaattctccgatccgtccctaacaaactaggaggcgtccttgccctatt 15631

00151 actatccatcctcatcctagcaataatccccatcctccatatatccaaac 00200 >>>>> |||||||||||||||||||||||||||||||||||||||||||||||||| >>>>> 15632 actatccatcctcatcctagcaataatccccatcctccatatatccaaac 15681

00201 aacaaagcataatatttcgcccactaagccaatcactttattgactccta 00250 >>>>> |||||||||||||||||||||||||||||||||||||||||||||||||| >>>>> 15682 aacaaagcataatatttcgcccactaagccaatcactttattgactccta 15731

00251 gccgcagacctcctcattctaacctgaatcggaggacaaccagtaagcta 00300 >>>>> |||||||||||||||||||||||||||||||||||||||||||||||||| >>>>> 15732 gccgcagacctcctcattctaacctgaatcggaggacaaccagtaagcta 15781

00301 cccttttaccatcattggacaagtagcatccgtactatacttcacaacaa 00350 >>>>> |||||||||||||||||||||||||||||||||||||||||||||||||| >>>>> 15782 cccttttaccatcattggacaagtagcatccgtactatacttcacaacaa 15831

00351 tcctaatcctaataccaactatctccctaattgaaaacaaaatactcaaa 00400 >>>>> |||||||||||||||||||||||||||||||||||||||||||||||||| >>>>> 15832 tcctaatcctaataccaactatctccctaattgaaaacaaaatactcaaa 15881

00401 tgggcctgtccttgtagtataaactaatacaccagtcttgtaaaccggag 00450 >>>>> |||||||||||||||||||||||||||||||||||||||||||||||||| >>>>> 15882 tgggcctgtccttgtagtataaactaatacaccagtcttgtaaaccggag 15931

00451 atgaaaacctttttccaaggacaaatcagagaaaaagtctttaactccac 00500 >>>>> | |||||||||||||||||||||||||||||||||||||||||||||||| >>>>> 15932 acgaaaacctttttccaaggacaaatcagagaaaaagtctttaactccac 15981

00501 cattagcacccaaagctaagattctaatttaaactattctctg 00543 >>>>> ||||||||||||||||||||||||||||||||||||||||||| >>>>> 15982 cattagcacccaaagctaagattctaatttaaactattctctg 16024