Etchevers:Notebook/STRA6 in eye development/2009/06/21

{| width="800"
 * style="background-color: #EEE"|[[Image:C14.jpg|128px]] Genetics of human eye development
 * style="background-color: #F2F2F2" align="center"|  |Main project page
 * style="background-color: #F2F2F2" align="center"|  |Main project page


 * colspan="2"|
 * colspan="2"|

End of PAX6 mapping
First, when I compare the results from OTX2-1 and PAX6, there are all the expected files in the outdirPAX6 whereas the cns.fq for instance is not present in outdir (the first directory I made, for OTX2-1 [GDZ-1 for Fasteris].

The log for OTX2-1 is:

- maq-0.7.1

[ma_load_reads] loading reads...

[ma_load_reads] set length of the first read as 71.

[ma_load_reads] 780016*2 reads loaded.

[ma_longread2read] encoding reads... 1560032 sequences processed.

[ma_match] set the minimum insert size as 72.

[match_core] Total length of the reference: 3095693983

[match_core] round 1/3...

[match_core] making index...

[match_search]  0% processed in 5.268 sec: 0 / 0 = 0.000 [match_search]  1% processed in 136.225 sec: 252710 / 18804593 = 0.013 [match_search]  2% processed in 255.760 sec: 425682 / 17166538 = 0.025 [match_search]  3% processed in 351.322 sec: 531698 / 13530050 = 0.039

(...) [match_search] 97% processed in 9514.435 sec: 800762 / 14316347 = 0.056 [match_search] 98% processed in 9607.204 sec: 677102 / 13661122 = 0.050 [match_search] 99% processed in 9694.450 sec: 604232 / 13506399 = 0.045 [match_search] 100% processed in 9700.078 sec: 67240 / 621474 = 0.108

[match_core] round 2/3...

[match_core] making index...

[match_search]  0% processed in 9701.598 sec: 216 / 3418 = 0.063 [match_search]  1% processed in 9828.930 sec: 248328 / 17657956 = 0.014 [match_search]  2% processed in 9945.282 sec: 430350 / 16183714 = 0.027 [match_search]  3% processed in 10039.319 sec: 531947 / 12874851 = 0.041

(...) [match_search] 98% processed in 19135.004 sec: 677459 / 13038677 = 0.052 [match_search] 99% processed in 19222.537 sec: 603627 / 12796895 = 0.047 [match_search] 100% processed in 19228.034 sec: 67351 / 586443 = 0.115

[match_core] round 3/3...

[match_core] making index...

[match_search]  0% processed in 19229.582 sec: 217 / 3454 = 0.063 [match_search]  1% processed in 19359.666 sec: 247251 / 17461382 = 0.014 [match_search]  2% processed in 19478.005 sec: 430582 / 16019918 = 0.027 [match_search]  3% processed in 19572.679 sec: 532315 / 12743822 = 0.042

(...) [match_search] 98% processed in 28779.071 sec: 676271 / 12874849 = 0.053 [match_search] 99% processed in 28864.708 sec: 603668 / 12688061 = 0.048 [match_search] 100% processed in 28870.408 sec: 67255 / 582084 = 0.116

[match_core] sorting the hits and dumping the results...

[ma_load_reads] loading reads...

[ma_load_reads] 780016*2 reads loaded.

[mapping_count_single] 9, 16, 26, 57

[maq_indel_pe] the indel detector only works with short-insert mate-pair reads.

-- Dumping unmapped or poorly mapped reads [match_data2mapping] 657880 out of 1560032 raw reads are mapped with 0 in pairs.

-- (total, isPE, mapped, paired) = (780016, 0, 657880, 0)

I used all the defaults for mapping; this might be optimizable.

I only have these files in the “outdir” folder:


 * aln1@4000001.map	43.4 Mb
 * aln1@4000001.map.log
 * read1@1.bfq	107.1 Mb
 * read1@2000001.bfq	106.9 Mb
 * read1@4000001.bfq	42.1Mb
 * ref.bfa	1400 Mb
 * unmap1@4000001.txt	20.5 Mb

The computer thinks that the *.map and *.bfq files are gzip archives and would uncompress them to be a lot bigger.

Meanwhile, sudo apt-get install subversion, which allows me to use the “svn” command that was recommended on the install instructions for PeakFinder. After installation, get the line “Checked out revision 1274”.

Now check out “parameters”: http://vancouvershortr.wiki.sourceforge.net/FP4Parameters

There is a lot to do here. If I just want to look at the text, and I generate it from the aln1@4000001.map file by doing $ maq mapview [-b] aln1@4000001.map > OTX2-1.txt

is it tab-delimited? And if so, can I open it in OpenOffice? Only one way to find out.

When I try, though, I get “Segmentation fault”. But that is because aln1@4000001.map is not in that folder. Hm. It seems to work now, although I launched the command from within the “outdir” folder. I get lines like this:

HWI-EAS038:3:98:854:1277#0/1	chr1	160527014	-	0	099	99	99	0	0	1	0	71

HWI-EAS038:3:88:1470:427#0/1	chr1	160537928	+	0	064	64	64	1	0	0	1	71

HWI-EAS038:3:98:1400:931#0/1	chr1	160538243	-	0	099	99	99	0	0	1	0	71

They are tab-delimited, but it's a shame that there is not a length. I suppose I could manipulate things to add or subtract 76 nt from the first position, in the direction of the read.

Also, changed to maqview folder, and read README file. This suggests to compile as so:

$> ./autogen.sh

$> ./configure

$> make

But at the first step, I get:

running: aclocal

eval: 1: aclocal: not found

error: while running 'aclocal'

and can not do configure either. Indeed, the autogen.sh specifies to look for an “aclocal” as well as other absent files. Perhaps they are part of the mentioned “OpenGL and GLUT” programs?

Meanwhile, putzed around by taking the OTZ2-1.txt file produced by maq mapview above, and opening in the spreadsheet. Need to learn to do this in another way so can keep more than 65K records. Meanwhile, added 76 bp to every coordinate, and removed all other columns to keep a barebones *.bed format and renamed as such. Zipped it to *.gz format and threw up just chromosome 1 onto UCSC.

All the reads look on top of one another when look chr band by band; but as one focuses in, there are usually spaces between them. A couple sites do seem to have a tight superposition of reads. It will be very important to see if these correspond to predicted and higher-than-threshold peaks once I get to generating such peaks.

In particular, as examples,

This one is 12-deep and in a functionally relevant gene:

>hg18_dna range=chr1:91852810-91853120 5'pad=0 3'pad=0 strand=+ repeatMasking=none CCCAGGCTGGAGTACAGTGGCACCAACATAGCTCACTGTAACTTCAAACTCCAGGACTCAAGCGATCCTACCGCTTCAGCTTCCTGAGAAGCTGGGACTACAGGCGCATGCCACCACGTCTGACTAATTTTTGTTTTTATTTTTTGTGGAGATGGGGTCTCAAACTCCTGTCCTCAAGCAATCCTCCCATCTTAGAGTTTGCCTTTTAAAGTCATCATCCTTAGAACTGGAAGAAACTAAAAGAGATGTCCTGTTCTAGCCCACAGATTTCACACAGGAAAGAGCTACCTTCTTCATTTTACAAAATAAAC

This one was 8-deep but very telomeric, and function?:

>hg18_dna range=chr1:10000-10155 5'pad=0 3'pad=0 strand=+ repeatMasking=none CCCAGGGGCCAGCACTGCTCGAAATGTACAGCATTTCTCTTTGTAACAGGATTATTAGCCTGCTGTGCCCGGGGAAAACATGCAGCACAGTGCATCTCGAGTCAGCAGGATTTTGACGGCTTCTAACAAAATCTTGTAGACAAGATGGAGCTATGG

Otx consensus may be more TAATCC like in Ogino 2008 paper, or more Bicoid-like with the GxCTAATT of Binato 2007.

Likewise, should take Christelle's ChIP-seq peaks and see if they correspond to tight superposition of reads.

For PAX6 [GDZ-5] the log is similar, but the folder contains:

all.map              assemble.log    cns.snp        read1@2000001.bfq aln1@1.map           cns.filter.snp  cns.win        ref.bfa aln1@1.map.log       cns.final.snp   consensus.cns  unmap1@1.txt aln1@2000001.map     cns.fq          mapcheck.txt   unmap1@2000001.txt aln1@2000001.map.log cns.indelse     read1@1.bfq    unmap.indel

And the cns.fq file, when I run "cat", looks like this: nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnatgactctccttggccaacgcaagcccgtccagctgcccaccctccctggcccagtc ctcccacaaaaccannnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnagtaacagacagggtctcagctgagttgtctcctgatccgctggcctcaccacccc taaatggaaaccagtnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnn... with lots of "n"s.

Is that the entire consensus sequence, masked everywhere except where my sequences map? If so, the sequences are of variable lengths, but do not go over the 76 bp original.

The "all.map" seems to be the file I will want.
 * Heather 17:16, 21 June 2009 (EDT):


 * }