User:Timothee Flutre/Notebook/Postdoc/2012/02/01: Difference between revisions

From OpenWetWare
Revision as of 05:03, 11 April 2012


Find SNPs in cis of genes

  • retrieve the Ensembl gene annotations (hg19) from UCSC:
wget -O Ensembl_hg19_UCSC_20111019.txt.gz ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ensGene.txt.gz

To view the description of the format, go to the UCSC Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables?command=start). Then choose the relevant parameters, in our case: clade="Mammal", genome="Human", assembly="hg19", group="Genes and Gene Prediction Tracks", track="Ensembl Genes" and table="ensGene". Finally, click on "describe table schema".
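
As a quick check of the schema, the sketch below labels the columns that the commands in this entry rely on. The helper name `label_ensgene_fields` is hypothetical, and the column positions (2 = transcript name, 3 = chrom, 4 = strand, 5 = txStart, 6 = txEnd, 13 = name2, i.e. the gene identifier) are assumed to match what "describe table schema" reports for ensGene.

```shell
# Hypothetical helper: label the ensGene columns used in this entry.
# Reads tab-separated ensGene rows on stdin.
label_ensgene_fields() {
  awk -F"\t" '{
    printf "transcript=%s gene=%s chrom=%s strand=%s txStart=%s txEnd=%s\n", $2, $13, $3, $4, $5, $6
  }'
}

# Example (assumes the file downloaded above is in the current directory):
# zcat Ensembl_hg19_UCSC_20111019.txt.gz | head -n 3 | label_ensgene_fields
```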

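  • convert transcripts into the BED format (http://genome.ucsc.edu/FAQ/FAQformat.html#format1), and then gather coordinates at the gene level (transcription start site, TSS, and transcription end site, TES):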
zcat Ensembl_hg19_UCSC_20111019.txt.gz | awk '{print $3"\t"$5"\t"$6"\t"$13"|"$2}' | gzip > Ensembl_transcripts.bed.gz
transcripts2genes.py Ensembl_hg19_UCSC_20111019.txt.gz Ensembl_genes.bed.gz
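
The script transcripts2genes.py is not shown in this entry. As a rough idea of what the gene-level step involves, here is a minimal awk sketch that keeps, for each gene, the smallest txStart and the largest txEnd over its transcripts. The helper name `transcripts_to_genes` and the five-column output (chrom, start, end, gene, strand) are assumptions, not the actual script; five columns were chosen only because that layout is compatible with the column indices used in the windowBed step.

```shell
# Hypothetical stand-in for the gene-level step (NOT the author's transcripts2genes.py).
# Reads tab-separated ensGene rows on stdin, writes one line per gene:
# chrom, min(txStart), max(txEnd), gene id, strand (fifth column is an assumption).
transcripts_to_genes() {
  awk -F"\t" 'BEGIN { OFS = "\t" }
  {
    gene = $13; start = $5 + 0; end = $6 + 0
    if (!(gene in gstart)) {
      # first transcript seen for this gene
      gchrom[gene] = $3; gstrand[gene] = $4
      gstart[gene] = start; gend[gene] = end
    } else {
      # widen the gene span if this transcript extends it
      if (start < gstart[gene]) gstart[gene] = start
      if (end > gend[gene]) gend[gene] = end
    }
  }
  END {
    for (gene in gstart)
      print gchrom[gene], gstart[gene], gend[gene], gene, gstrand[gene]
  }' | sort -k1,1 -k2,2n
}

# Example (assumes the file downloaded above is in the current directory):
# zcat Ensembl_hg19_UCSC_20111019.txt.gz | transcripts_to_genes | gzip > Ensembl_genes.bed.gz
```

For a gene on the + strand the TSS is the start and the TES is the end of the resulting interval; for a gene on the - strand it is the reverse.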
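  • identify SNPs in cis of each gene (from 500 kb upstream of the TSS to 500 kb downstream of the TES), assuming the SNP coordinates are taken from a file in the IMPUTE format (http://www.stats.ox.ac.uk/~marchini/software/gwas/file_format.html) and that BEDTools (http://code.google.com/p/bedtools/) is installed: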
for i in {1..22}; do echo "chr"${i}"..."; awk -v i=${i} -F" " '{print "chr"i"\t"$3-1"\t"$3"\t"$2}' /path/to/chr${i}.impute | \
windowBed -w 500000 -a Ensembl_genes.bed.gz -b stdin | \
awk '{print $4"\t"$9"|"$8}' | \
gzip > chr${i}_genes_cisSNPs.txt.gz; done
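
To sanity-check the output on a handful of genes and SNPs, the window test can be reproduced in plain awk. This is a naive sketch (every SNP is compared with every gene), not a replacement for windowBed; the helper name `naive_cis_snps` is hypothetical, the gene file is assumed to be uncompressed with chrom, start, end and gene name in its first four columns, and the handling of SNPs lying exactly on the window boundary may differ from BEDTools.

```shell
# Hypothetical helper: naive cis-window check, for small test files only.
# $1: uncompressed gene BED file (chrom, start, end, gene, ...)
# $2: window size in bp (default 500000)
# stdin: SNPs in BED format (chrom, pos-1, pos, SNP id)
naive_cis_snps() {
  awk -v w="${2:-500000}" 'BEGIN { FS = OFS = "\t" }
  NR == FNR {
    # first input (gene file): store each gene with its window
    chrom[NR] = $1; lo[NR] = $2 - w; hi[NR] = $3 + w; gene[NR] = $4; n = NR
    next
  }
  {
    # second input (SNPs on stdin): report every gene whose window overlaps the SNP
    for (g = 1; g <= n; g++)
      if ($1 == chrom[g] && $2 < hi[g] && $3 > lo[g])
        print gene[g], $4 "|" $3
  }' "$1" -
}

# Example (hypothetical file names):
# zcat Ensembl_genes.bed.gz | head -n 5 > few_genes.bed
# awk '{print "chr1\t"$3-1"\t"$3"\t"$2}' /path/to/chr1.impute | naive_cis_snps few_genes.bed
```

The output has the same two columns as the pipeline above: the gene name, then the SNP identifier and position joined by "|".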