Etchevers:Cardiac morphogenesis/2009/02/25

From OpenWetWare
Jump to: navigation, search
C14.jpg Cardiac morphogenesis Report.pngMain project page
Resultset previous.pngPrevious entry      

Steps taken in CisGenome to analyze ChIP-seq data

First, loaded the reference sequence – hg18. Everything has been moved to a filepath with no odd characters or spaces (D:\Heather\Downloads\cisgenome_v1\cisgenome_v1\2009-02-13_GCM2_results\ et cetera. The reference sequence used by Fasteris was Homo Sapiens full genome, Version NCBI 36.47.

Second, tried to upload the whole eland_multi file by doing the Alignment -> BAR in Sequencing menu. This goes too fast to actually be effective. Indeed, the created files are empty.

Instead do File -> Load data -> Genomic Region and enter the separate BED files per chromosome. Possible to select them all at once and load them in. Saved the whole project under 2009-feb*_heart.cgw.

Right now, am converting them using File -> File Conversion from BED -> COD. If you click on one BED file first, then the conversion file name is automatically filled in. Probably I did not need to load the BED files after all. This COD format is just tab-delimited chrN, start, end and which strand. These COD files are with the older, uncompressed BED files in 2009-02-11_GCM-2_eland_unique_Bed_by_Chr (which I can not rename because it’s open by CisGenome). But apparently can just load the BED files into UCSC Genome browser.

Rather resembles what I remember of the ALN files, so it is probably possible to upload these when I am done into Alignment -> BAR once more. Can select Signals one by one and erase them – I did this for the and F and R files. Since it is possible to choose different criteria for the alignments (window sizes and so forth) it is probably good to not keep useless files on hand. Also removed the BED files just leaving the COD ones.

The BED files look like this:

  • browser position chr22:1-1000000
  • track name="Insert mapped" description="Inserts with orientation" visibility=2 itemRgb=On
  • chr22 14489803 14489838 1 700 - 14489803 14489838 10,150,50

I wonder to get the Alignment files one does not again have to convert the COD files (previously converted from the BED format) by removing the first column?

These look like:

1 chr15 18263955 18263990 +

The eland_multi files actually has the original sequence data in it as well as the mapping. (ELAND: Efficient Local Alignment of Nucleotide Data, GAPipeline Version 1.0)

According to the provided PDF (2009-02-13_GCM-2_analysis_report.pdf), QC is quality control (not counted because of the N’s), NM is not mapped, etc. Format is: Reference sequence ID, Match start, Match end, Multiplicity (number of target best positions), Match score (i.e. 700 = 0 mismatch, 400 = 1 mismatch and 200 = 2 mismatches), Strand, Genome display values for the last three.


>HWI-EAS269:3:1:5:1887#0/1 TTGAACTAACATTTTTAATTCCACTGNNGATGCATT 1:0:0 Homo_sapiens.NCBI36.47.dna.chromosome.1.fa:190460483R0




>HWI-EAS269:3:1:5:1788#0/1 TGTGTGTGTATGTGGTGTGCGTGTATGNGGTGTGTT 3:4:3 Homo_sapiens.NCBI36.47.dna.chromosome.13.fa:30314503F0,30314716F1,30315772F1,30315787F0,30316595F0,30316714F1,30316798F1

>HWI-EAS269:3:1:5:131#0/1 TGAACCTTTGATTTTGGCCTCTCCCACNGCTTAGGT 1:0:0 Homo_sapiens.NCBI36.47.dna.chromosome.13.fa:26350209R0


>HWI-EAS269:3:1:6:202#0/1 TAGCATGTTGGCCAGGATGGTCTCTCTCTCCTTACC 0:1:11 Homo_sapiens.NCBI36.47.dna.chromosome.14.fa:60239872R1

>HWI-EAS269:3:1:6:1720#0/1 TTGCAGGAATAACAAGCTTTACCTCATAATACACAC 1:0:0 Homo_sapiens.NCBI36.47.dna.chromosome.10.fa:125675143R0


>HWI-EAS269:3:1:6:1455#0/1 AATATACATTCACCAAGCACAAGAATATTCTAGTCA 1:0:0 Homo_sapiens.NCBI36.47.dna.chromosome.14.fa:48422240F0


>HWI-EAS269:3:1:6:724#0/1 AAGAACACCTCTGATTACTCCTGCCATCATGACCCT 2:0:0 Homo_sapiens.NCBI36.47.dna.chromosome.12.fa:40043740R0,Homo_sapiens.NCBI36.47.dna.chromosome.MT.fa:3813F0

>HWI-EAS269:3:1:6:149#0/1 TCCCCAGCTAGCCCTGGTGTCAGTCCTCTTCCCCTG 0:0:1 Homo_sapiens.NCBI36.47.dna.chromosome.16.fa:56383652F2





>HWI-EAS269:3:1:7:1254#0/1 ATTCACATGAAAGGACCATTACGAACCTCTGTTCCC 1:0:1 Homo_sapiens.NCBI36.47.dna.chromosome.3.fa:76061917F0

>HWI-EAS269:3:1:7:1539#0/1 ACCCCAGGAGTCACCCTTGAGTCCTTTCCCTTGCCT 1:0:0 Homo_sapiens.NCBI36.47.dna.chromosome.8.fa:131521449F0


>HWI-EAS269:3:69:263:567#0/1 GAGAGAGAGAGAGAGAGAGACATAGAGAGAGAGAGA 4:255:255 Homo_sapiens.NCBI36.47.dna.chromosome.1.fa:52674160R0,Homo_sapiens.NCBI36.47.dna.chromosome.9.fa:28345286R0,Homo_sapiens.NCBI36.47.dna.chromosome.15.fa:32298922F0,Homo_sapiens.NCBI36.47.dna.chromosome.2.fa:83220059F0

>HWI-EAS269:3:68:1164:429#0/1 TCCTTCAGATGTTCAGACGTGTCCAGAGTTTCTTCC 5:130:151 Homo_sapiens.NCBI36.47.dna.chromosome.7.fa:36510453F0,Homo_sapiens.NCBI36.47.dna.chromosome.13.fa:36943894F0,Homo_sapiens.NCBI36.47.dna.chromosome.5.fa:17038144F0,32713630F0,55580497F0


>HWI-EAS269:3:68:794:1065#0/1 CATGAATACACACACACACACACACACACACACACA 10:255:255 Homo_sapiens.NCBI36.47.dna.chromosome.13.fa:46747609R0,Homo_sapiens.NCBI36.47.dna.chromosome.5.fa:111173484R0,126295155F0,170732551R0,Homo_sapiens.NCBI36.47.dna.chromosome.6.fa:142962946R0,Homo_sapiens.NCBI36.47.dna.chromosome.1.fa:217133694R0,Homo_sapiens.NCBI36.47.dna.chromosome.2.fa:46489958F0,70592524F0,143445348R0,Homo_sapiens.NCBI36.47.dna.chromosome.3.fa:117860203R0

>HWI-EAS269:3:68:794:1349#0/1 TAAACTGACAGGGCACACTGGTGGGCATGTCTTATG 2:10:37 Homo_sapiens.NCBI36.47.dna.chromosome.16.fa:76670940R1,Homo_sapiens.NCBI36.47.dna.chromosome.5.fa:56840413R0,Homo_sapiens.NCBI36.47.dna.chromosome.X.fa:10287329F1,15568207R1,Homo_sapiens.NCBI36.47.dna.chromosome.20.fa:46585522F1,Homo_sapiens.NCBI36.47.dna.chromosome.8.fa:143217613R1,Homo_sapiens.NCBI36.47.dna.chromosome.1.fa:212822624F1,Homo_sapiens.NCBI36.47.dna.chromosome.19.fa:55636348R1,Homo_sapiens.NCBI36.47.dna.chromosome.2.fa:74726161R0,87332758F1,112083484R1,Homo_sapiens.NCBI36.47.dna.chromosome.3.fa:62447703F1


Total mapped: 1,798,668 = 36.1% of the 4,980,779 original reads, representing initially 179,308,044 basepairs. But only 26.1% of them are actually unique on the genome, meaning useful for peak calculation, while the others map more than once.

THIS MAY BE CONTAMINATION BY ssDNA – à voir si certains de ceux-la mappent.

Fasteris also got a similar fraction using MAQ software instead of ELAND to read the 090203_s_3_seq_GCM2.txt original sequence file. Think to ask if this is typical and why? Sequencing errors? Illegible? Rules of ELAND are: “The ELAND software finds all matches of the reads with a maximum of 2 errors in the first 32 bases on the reference sequence. The matches are sorted in regards of their “quality” [(unique or repeats, zero, one or two errors) – see Results in the PDF report]. Up to [the]10 position best match position[s] on the genome are recorded. » Speak with “author: LBA” about it. (Actually spoke with M. Baerlocher, see below.)

In the files Fasteris provided, they grouped the uniquely mapped reads in various peak configurations and provided series where there are at least 10, 20 or 50 hits coverage. I think that uniquely mapped here means unambiguous as far as chromosomal location is concerned. This means that the “repeats” during mapping mean sequences repeated elsewhere in the genome, not repeats of the sequenced bit of DNA. Of the mappable reads, 72.2% are unique with 0, 1 or 2 “errors” (probably due to reading the sequences) and the rest are repeats. (Confirmed.)

If I get CisGenome to work, I can compare with the peak-mapping approach from Fasteris. Indeed, they have also confirmed that they developed internal (house) scripts that find peaks. This is how they approach it: "the coverage obtained by the ELAND mapping for each sample is screened for first position matches with a maximum gap (user defined) in between them. Successive position[s] fulfilling the gap criteria are considered as boundary for the peaks." A peak is bigger than a sequence. To call it a peak, there have to be X number of hits (coverage) and "A minimum of independent events (different starting position) of half the minimum coverage is also required at this step." I suppose that means that a fragment of sequenced DNA corresponds to a peak of small tiled sequences corresponding to that fragment, perhaps falling off at the very edges. If a peak is defined as at least eg. 20 little sequences (coverage) in a given area (the maximum gap?) then there have to be at least ten different starting positions among those 20 or more sequences.

So if sequences are 32 bases long, this explains why you get similar counts for gaps of 10 or 20 or even 50bp intervals, whereas if the gaps are longer and you accept 10 hits coverage, you’ll get many more than if you require 50 (the coverage is kind of a threshold). This is why they only provided the three files with gaps of 50, 100 and 200 but with respective thresholds of 10, 20 and 50; the 33 peaks with more than 50 hits coverage at a gap of 200 may or may not be represented among the 124 peaks with more than 20 hits coverage at a gap of 100, or the 456 peaks with more than 10 hits coverage at a gap of 50, as this last one is a quite tight gap. I also think it’s the last category which would allow me to see if the consensus sequence had been respected or not.

I’m unclear what "MaxCoverage To Size" means in the results from their peaks. Has something I presume to do with how much coverage (height) relative to the width of the peak and therefore, perhaps, confidence. This width comes under the “Peak Size” column and the coverage, under “GCM2” (the name of their sequencer).

So for example, in Cov10_Gap50, chr3:199384683-199384719 is only 36bp wide but has 126 sequences that cover it. In Cov50_Gap200 the only peak is chr3:199384524-199384719 and it is 195bp wide (note it finishes in same place) with 135 covering sequences. Ah. Of the other sequences in Cov10_Gap50, that was the only one with >50 coverage. I expect therefore to find all the 7 other peaks in the Cov20_Gap100 file. But it is not so easy, because that would have only been true for Cov20_Gap50. There is one additional peak in Cov20_Gap100 that was not in the Cov10_Gap50, and three of the latter in the former. So generally, this is true, but not specifically.

Perhaps peak sizes (widths) that are less than 32bp are not so reliable in Cov10_Gap50? (then again, the 36bp one is enormous – though perhaps not very specific. Check out the sequence if it’s not a TAATAATAA etc.). Those are the ones that are not present in Cov20_Gap100.

Fasteris (M. Baerlocher via M. Farinelli +41-22-794 22 23) has explained to me that even thin peaks ARE reliable because they are in fact composed only of the accumulation of first nucleotides in a read. That is why a gap is tolerated. So in the direction of a (series of) read(s), a peak is prolonged by 32 more nucleotides.

Go back to CisGenome help/FAQ to see what format the motif is supposed to be in. Probably just a text file, but as is or saved with the .cons suffix, doesn’t seem to do much. Being on the Internet could be helpful as I don’t have a genome browser offline.

  • Heather 10:16, 25 February 2009 (EST):