Etchevers:Notebook/STRA6 in eye development/2009/06/20

From OpenWetWare
Jump to navigationJump to search
C14.jpg Genetics of human eye development Report.pngMain project page
Resultset previous.pngPrevious entry      Next entryResultset next.png

Bioinformatics 2

Re-downloaded all the hg19 chromosome files. Concatenated them into refhg19.fasta.

The easyrun command actually executes the following commands in order (see also the workflow chart): maq fasta2bfa ref.fasta ref.bfa Convert the reference sequences to the binary fasta format maq fastq2bfq reads.fastq reads-1.bfq Convert the reads to the binary fastq format maq match ref.bfa reads-1.bfq Align the reads to the reference. The match command aligns reads in the BFQ format to the reference in the BFA format. If only one read file reads.bfq is provided, match command will do single-end alignment: maq match ref.bfa reads.bfq maq mapcheck ref.bfa >mapcheck.txt Statistics from the alignment maq assemble consensus.cns ref.bfa 2>assemble.log Build the mapping assembly maq cns2fq consensus.cns >cns.fq Extract consensus sequences and qualities maq cns2snp consensus.cns >cns.snp Extract list of SNPs When I do this, I will get among other files in the “outdir” directory, the precious cns.fq file, which corresponds to Consensus sequences and their qualities (in fastq format). How do I then move from there to a visualization on a chromosome?

“The read sequences together with the positions can be extracted from with command mapview” and the format supposedly explained in the manual: maq mapview As written:

“Display the read alignment in plain text. For reads aligned before the Smith-Waterman alignment, each line consists of

  1. read name,
  2. chromosome,
  3. position,
  4. strand,
  5. insert size from the outer coordinates of a pair,
  6. paired flag [1]
  7. mapping quality,
  8. single-end mapping quality,
  9. alternative mapping quality [2],
  10. number of mismatches of the best hit,
  11. sum of qualities of mismatched bases of the best hit,
  12. number of 0-mismatch hits of the first 24bp,
  13. number of 1-mismatch hits of the first 24bp on the reference,
  14. length of the read,
  15. read sequence
  16. and its quality.

[1] The fifth column [sixth?], paired flag, is a bitwise flag. Its lower 4 bits give the orientation: 1 stands for FF, 2 for FR, 4 for RF, and 8 for RR, where FR means that the read with smaller coordinate is on the forward strand, and its mate is on the reverse strand. Only FR is allowed for a correct pair. The higher bits of this flag give further information. If the pair meets the paired end requirement, 16 will be set. If the two reads are mapped to different chromosomes, 32 will be set. If one of the two reads cannot be mapped at all, 64 will be set. The flag for a correct pair always equals to 18. [Given that I am not doing pairwise reads, this is not an issue for me?]

For reads aligned by the Smith-Waterman alignment afterwards, the flag is always 130. A line consists of read name, chromosome, position, strand, insert size, flag (always 130), position of the indel on the read (0 if no indel), length of the indels (positive for insertions and negative for deletions), mapping quality of its mate, number of mismatches of the best hit, sum of qualities of mismatched bases of the best hit, two zeros, length of the read, read sequence and its quality. The mate of a 130-flagged read always gets a flag 18. Flag 192 indicates that the read is not mapped but its mate is mapped. For such a read pair, one read has flag 64 and the other has 192.

[2] Alternative mapping quality always equals to mapping quality if the reads are not paired. If reads are paired, it equals to the smaller mapping quality of the two ends. This alternative mapping quality is actually the mapping quality of an abnormal pair.

OPTIONS: -b do not display the read sequence and the quality”

(end quote).

In passing, there appears to be another? MapView for aligning short reads (I only thought it is another because of the mention of the special binary format):

That page gives an example of Maq output as text: 8:18:1354:1553 chr1 597 - 0 0 99 99 99 0 01 0 35 GGTGGGACCGTTCGTGAAGGCTGGCCCATTGAGGA GGBFPOLQOPLHYRDOCLYM`OO^VSP``YR`_]T

8:22:173:821 chr1 2597 - 0 0 99 99 99 0 01 0 35 GGTGGGACCGTTCGTGAAGGCTGGCCCATTGAGGA NHELOUPVTZUT_WU^```````````````````

8:29:309:1409 chr1 5397 - 0 0 99 99 99 0 01 0 35 GGTGGGACCGTTCGTGAAGGCTGGCCCATTGAGGA HLHOBQCOOEGTP]LNJXVX`````J`````````

I suppose that all the middle numbers will look like that if the reads are not paired-end reads. And that I would rather have the version “-b” for trying to make a file in a format that I could display on the Genome Browser by UCSC, or for making an ALN file compatible with CisGenome.

Brought all the .fastq files over from my laptop to the tower at home. Have decompressed them. Copied GDZ1_OTX2.fastq into MAQ folder. (Renamed.)

Launch the $ easyrun -d outdir ref.fasta reads.fastq as $ easyrun -d outdir refhg19.fasta GDZ1_OTX2.fastq

It's working! But since I have more than 2 million reads, the program seems to have chopped them up into three files: “-- finish writing file 'outdir/read1@1.bfq' -- finish writing file 'outdir/read1@2000001.bfq'

-- finish writing file 'outdir/read1@4000001.bfq'”

Will paste in the whole log when it's done running. Launched around 10:45 am.

When returned at 1:10PM had wanted to cut and paste by using Ctrl-C. This clearly aborted the process.

heather@heather-desktop:~/maq-0.7.1$ easyrun -d outdir refhg19.fasta GDZ1_OTX2.fastq

-- CMD: /home/heather/maq-0.7.1/maq fasta2bfa /home/heather/maq-0.7.1/refhg19.fasta outdir/ref.bfa 2> /dev/null

-- CMD: /home/heather/maq-0.7.1/maq fastq2bfq -n 2000000 /home/heather/maq-0.7.1/GDZ1_OTX2.fastq outdir/read1

-- finish writing file 'outdir/read1@1.bfq'

-- finish writing file 'outdir/read1@2000001.bfq'

-- finish writing file 'outdir/read1@4000001.bfq'

-- 4780016 sequences were loaded.

-- CMD: (cd outdir; /home/heather/maq-0.7.1/maq map -n 2 -e 70 -u unmap1@4000001.txt ref.bfa read1@4000001.bfq 2>

^C** fail to run command '(cd outdir; /home/heather/maq-0.7.1/maq map -n 2 -e 70 -u unmap1@4000001.txt ref.bfa read1@4000001.bfq 2>' at /usr/local/bin/ line 842.

Serves me right for being impatient. Start again, perhaps by trying what is between the last parentheses

Try that for a few hours and see what happens.

At 2:43 is written in the [match_search] 56% processed in 5424.175 sec: 391292 / 14087008 = 0.028

This is =90 minutes, which corresponds to the second launch, and the update time was at 14:40. So it seems to be underway still. 87% processed in 8524 seconds: 2.4 hours so far. Oh; at 2.5 hours that was only round 1/3; round 2 has just started. I suppose that is what they meant by 2x10⁶ and no more at a time.

"!!" will repeat the command, as in $ cat so that I can see that at 5PM, I am at 38% of the second part of the file (OTX2-1) –

Here at this link we have the idea that a short sequence read in text format can be rendered into a .bed format, suitable for upload to UCSC:

“Data that is uniquely aligned to a genome could be viewed as a custom track in the UCSC genome browser (viewable only from the machine it was uploaded). To generate a track in the BED format from *_realign.txt files (The alignment files have names of the form "s_1_0013_realign.txt". There is one such file per tile on the flowcell. Let's say for lane 3, assuming 25 nt sequences): cat s_3_????_realign.txt|egrep -v '^#'|perl -ane 'if (@F>3){$_=~/(chr.+):(\d+)\s([F|R])/;print $1,"\t",$2,"\t",($2+25),"\n"}'> s3_customTrack.txt”

Also have a look here:

I particularly liked the comment, “Most of the fields are not useful for any form of analysis, and what's given is mostly incomprehensible.”

Which brought me here: and to the FindPeaks program, which seems to be just what I need and want. To start with,

(But it only converts maq-type and not .fq files to .bed).