User:Timothee Flutre/Notebook/Postdoc/2012/03/30

From OpenWetWare
Jump to: navigation, search
Owwnotebook icon.png Project name Report.pngMain project page
Resultset previous.pngPrevious entry      Next entryResultset next.png

Make a BED file with all exons in the human genome

We go to the UCSC genome browser and download the genes annotations in two files:

wc -l < ens_transcripts_hg19.bed # 181,648
wc -l < ens_transcripts_hg19.txt # 181,649

We filter out the sex chromosomes and the "specific" chromosomes to keep only the "generic" autosomes:

awk '{if(match($1,/chr[0-9]*$/) != 0) print $0}' ens_transcripts_hg19.bed > ens_transcripts_hg19_autosomes.bed
awk 'NR>1 {if(match($3,/chr[0-9]*$/) != 0) print $0}' ens_transcripts_hg19.txt > ens_transcripts_hg19_autosomes.txt
wc -l < ens_transcripts_hg19_autosomes.bed # 165952
wc -l < ens_transcripts_hg19_autosomes.txt # 165952

We replace the transcripts names of the BED file with the genes names of the txt file:

diff <(cut -f4 ens_transcripts_hg19_autosomes.bed) <(cut -f2 ens_transcripts_hg19_autosomes.txt) # check the order of the lines
join -1 4 -2 2 ens_transcripts_hg19_autosomes.bed ens_transcripts_hg19_autosomes.txt \
| awk 'BEGIN{OFS="\t";} {print $2, $3, $4, $24, $5, $6, $7, $8, $9, $10, $11, $12}' > ens_genes_hg19_autosomes.bed
wc -l < ens_genes_hg19_autosomes.bed # 165952

We convert BED12 to BED6:

bed12ToBed6 -i ens_genes_hg19_autosomes.bed > ens_exons_hg19_autosomes.bed
wc -l < ens_exons_hg19_autosomes.bed # 1097861

We use mergeBed to consolidate all the exons per gene. The transcripts of each gene will use many of the same exons, or will partially use the same exon, and mergeBed merges these overlapping entries into single entries.

mergeBed -nms -i ens_exons_hg19_autosomes.bed > ens_uniq_exons_hg19_autosomes.bed
wc -l < ens_uniq_exons_hg19_autosomes.bed # 281821

Some exons will overlap with exons of other genes. The genes could be transcribed on opposite strands or partially use the same exon. Either way we want to remove these exons because they will confound the gene-level estimates of expression. Because of the -nms command to mergeBed, the 4th column contains the gene names of the merged exons. The awk command below removes all rows whose 4th column have multiple gene names:

awk 'BEGIN{OFS="\t"} {split($4,a,";"); if(length(a) == 1){print $1,$2,$3,a[1]} \
else{delete b; for(i in a){b[a[i]]++}; if(length(b) == 1){for(j in b)g=j; print $1,$2,$3,g}}}' \
ens_uniq_exons_hg19_autosomes.bed > ens_uniq_exons_clean_hg19_autosomes.bed
wc -l < ens_uniq_exons_clean_hg19_autosomes.bed # 270541

At the end, the file ens_uniq_exons_clean_hg19_autosomes.bed contains the coordinate of the exons with the name of the genes. Such a file is useful to get the number of reads mapped per exon after a RNA-seq experiment.