User:Timothee Flutre/Notebook/Postdoc/2012/03/30
Project name | Main project page Previous entry Next entry |
Make a BED file with all exons in the human genomeWe go to the UCSC genome browser and download the genes annotations in two files: wc -l < ens_transcripts_hg19.bed # 181,648 wc -l < ens_transcripts_hg19.txt # 181,649 We filter out the sex chromosomes and the "specific" chromosomes to keep only the "generic" autosomes: awk '{if(match($1,/chr[0-9]*$/) != 0) print $0}' ens_transcripts_hg19.bed > ens_transcripts_hg19_autosomes.bed awk 'NR>1 {if(match($3,/chr[0-9]*$/) != 0) print $0}' ens_transcripts_hg19.txt > ens_transcripts_hg19_autosomes.txt wc -l < ens_transcripts_hg19_autosomes.bed # 165952 wc -l < ens_transcripts_hg19_autosomes.txt # 165952 We replace the transcripts names of the BED file with the genes names of the txt file: diff <(cut -f4 ens_transcripts_hg19_autosomes.bed) <(cut -f2 ens_transcripts_hg19_autosomes.txt) # check the order of the lines join -1 4 -2 2 ens_transcripts_hg19_autosomes.bed ens_transcripts_hg19_autosomes.txt \ | awk 'BEGIN{OFS="\t";} {print $2, $3, $4, $24, $5, $6, $7, $8, $9, $10, $11, $12}' > ens_genes_hg19_autosomes.bed wc -l < ens_genes_hg19_autosomes.bed # 165952 We convert BED12 to BED6: bed12ToBed6 -i ens_genes_hg19_autosomes.bed > ens_exons_hg19_autosomes.bed wc -l < ens_exons_hg19_autosomes.bed # 1097861 We use mergeBed to consolidate all the exons per gene. The transcripts of each gene will use many of the same exons, or will partially use the same exon, and mergeBed merges these overlapping entries into single entries. mergeBed -nms -i ens_exons_hg19_autosomes.bed > ens_uniq_exons_hg19_autosomes.bed wc -l < ens_uniq_exons_hg19_autosomes.bed # 281821 Some exons will overlap with exons of other genes. The genes could be transcribed on opposite strands or partially use the same exon. Either way we want to remove these exons because they will confound the gene-level estimates of expression. Because of the -nms command to mergeBed, the 4th column contains the gene names of the merged exons. The awk command below removes all rows whose 4th column have multiple gene names: awk 'BEGIN{OFS="\t"} {split($4,a,";"); if(length(a) == 1){print $1,$2,$3,a[1]} \ else{delete b; for(i in a){b[a[i]]++}; if(length(b) == 1){for(j in b)g=j; print $1,$2,$3,g}}}' \ ens_uniq_exons_hg19_autosomes.bed > ens_uniq_exons_clean_hg19_autosomes.bed wc -l < ens_uniq_exons_clean_hg19_autosomes.bed # 270541 At the end, the file ens_uniq_exons_clean_hg19_autosomes.bed contains the coordinate of the exons with the name of the genes. Such a file is useful to get the number of reads mapped per exon after a RNA-seq experiment. |