Talk:Wikiomics:EMBO Tunis 1

From OpenWetWare

Exercises 17th Sep 2014 afternoon

This part is meant to involve active participation from the students (and the teacher), so it resembles real-life situations a bit more closely: the working directories do not contain all the files needed at a given step.

To get the missing files, either create symbolic links or copy them:

ln -s /path/to/desired/file .
#or copy them
cp -i /path/to/desired/file .
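
The difference between the two options can be seen in a throwaway directory. This is only a sketch: the directory layout and the file names (toy.fa, copy.fa) are invented for the demonstration and are not part of the course data.

```shell
# Demo in a temporary directory; all names below are hypothetical.
demo=$(mktemp -d)
mkdir "$demo/source" "$demo/work"
printf '>seq1\nACGT\n' > "$demo/source/toy.fa"

cd "$demo/work"
ln -s "$demo/source/toy.fa" .          # symbolic link: no data is duplicated
cp -i "$demo/source/toy.fa" copy.fa    # independent copy; -i asks before overwriting

ls -l                # the link is listed as toy.fa -> .../source/toy.fa
readlink toy.fa      # prints the path the link points to

# A link breaks when the original is moved or deleted; a copy survives.
rm "$demo/source/toy.fa"
[ -e toy.fa ] || echo "toy.fa is now a broken link"
[ -e copy.fa ] && echo "copy.fa is still there"
```

Links save disk space for large genome and BAM files, while copies are safer when you plan to modify the file.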

If you feel lost on the Unix command line, try this tutorial: http://openwetware.org/wiki/Wikiomics:WinterSchool_day1#Introduction_to_Linux_and_the_command_line

Starting

There will be a 0917_kd directory in which you will find a set of subdirectories:

000_igv
00_fasta
01_fastq
02_sambam
03_gtf
04_bed
05_vcf
06_wig
07_bwa
08_last
09_last_genomes

Let's go there:

#jump to home directory if you happen to be somewhere else
cd 

#go to this directory
cd 0917_kd

#list the contents
ls
ls -R | less
#press q to quit less

Configure a new genome to use in IGV

#go to dir with required files
cd 000_igv

#see what is there in one column
ls -1

#let's use a pet-sized, one-chromosome genome:
LmxM.01_single.fa
LmxM.01_single.fa.fai
LmxM.01_single.gff

#check the corresponding section of the main page, and use these files as input
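
Before loading a genome, it can help to confirm that the sequence names agree across the FASTA file, its .fai index and the annotation, since tracks generally only line up when the names match. The sketch below uses invented toy files (toy.fa, toy.fa.fai, toy.gff) as stand-ins for the LmxM.01_single files; the contents are made up for illustration.

```shell
# Toy stand-ins for the genome, index and annotation (contents invented).
demo=$(mktemp -d) && cd "$demo"
printf '>chrToy\nACGTACGT\nACGT\n' > toy.fa
# .fai columns: name, length, byte offset of first base, bases per line, bytes per line
printf 'chrToy\t12\t8\t8\t9\n' > toy.fa.fai
printf '##gff-version 3\nchrToy\tdemo\tgene\t2\t10\t.\t+\t.\tID=gene1\n' > toy.gff

# Sequence names as seen in each file
grep '^>' toy.fa | sed 's/^>//; s/[[:space:]].*//' | sort -u > fa.names
cut -f1 toy.fa.fai | sort -u > fai.names
grep -v '^#' toy.gff | cut -f1 | sort -u > gff.names

# Names used in the annotation but missing from the genome (ideally prints nothing)
comm -23 gff.names fa.names

# The index should list exactly the sequences present in the FASTA
diff fa.names fai.names && echo "FASTA and index agree"
```

The same three commands can be pointed at the real files by swapping in their names.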

Load BAM, VCF, WIG and IGV files into browser

In the 000_igv and 02_sambam directories there are a number of files compatible with the LmxM.01_single chromosome. Check which files can be loaded. You should be able to see GC content across the chromosome in 100 bp windows, the "mappability" of 100 bp reads to different parts of the chromosome, mutations, and BAMs containing mapped genomic DNA reads. Scroll along the chromosome and look at the differences. Note places enriched in SNPs, gaps in mapping, and indels.
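
To make the GC content track less of a black box, here is a minimal sketch of how GC fraction per fixed window could be computed with awk. It assumes a FASTA file holding a single sequence; the toy file and the window size of 4 are invented for the demo (the track in the course uses 100 bp windows and may have been produced with a different tool).

```shell
# Toy single-sequence FASTA (invented); use win=100 for a real chromosome.
demo=$(mktemp -d) && cd "$demo"
printf '>chrToy\nGGCCAATT\nGCAT\n' > toy.fa

win=4
awk -v win="$win" '
  /^>/ { next }                        # skip the header line
  { seq = seq toupper($0) }            # join the wrapped sequence lines
  END {
    for (start = 1; start <= length(seq); start += win) {
      chunk = substr(seq, start, win)
      n = length(chunk)                # the last window may be shorter
      gc = gsub(/[GC]/, "", chunk)     # gsub returns the number of G/C removed
      printf "%d\t%d\t%.2f\n", start, start + n - 1, gc / n
    }
  }
' toy.fa > gc.tsv

# Columns: window start, window end, GC fraction
cat gc.tsv
```

For the toy sequence this gives three windows with GC fractions 1.00, 0.00 and 0.50.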

Close IGV and go back to the command line.

Working with FASTA

The genome files are in the 00_fasta directory.

Using the web page, can you do the following:

  1. What is the total number of sequences in each of the TriTrypDB-8.0*fasta files?
  2. Which genome do you think is badly assembled, and why?
  3. Select KE148271.1 from the L. enriettii assembly and dump it as a single FASTA file. If you run into problems, think about why.
  4. Can you "fix" this genome's naming scheme? If yes, dump KE148271.1 and ATAF01001110.1 in a single step.
  5. You have the Lmex_genome.sort.fa file without chromosome sizes in the sequence names. Can you get them in the format:
chr_name<tab>size
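
As a starting point for questions 1 and 5, here is one possible approach using only grep and awk. It is a sketch on an invented toy file (toy.fa), not the intended solution; it assumes Unix line endings and takes the sequence name to be the header text up to the first whitespace.

```shell
# Toy multi-sequence FASTA (contents invented for the demo).
demo=$(mktemp -d) && cd "$demo"
printf '>chrA some description\nACGTAC\nGT\n>chrB\nAC\n' > toy.fa

# Number of sequences = number of header lines
grep -c '^>' toy.fa

# chr_name<tab>size: add up the line lengths of each record
awk '
  /^>/ { if (name != "") print name "\t" len   # flush the previous record
         name = substr($1, 2); len = 0; next } # drop ">" and any description
  { len += length($0) }
  END  { if (name != "") print name "\t" len } # flush the last record
' toy.fa > toy.sizes
cat toy.sizes
```

If samtools is available, the first two columns of the .fai index written by samtools faidx hold the same name and length information.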