Talk:Wikiomics:EMBO Tunis 1
Exercises 17th Sep 2014 afternoon
This part is meant to be done with active participation of students (and teacher), so it resembles real-life situations a bit more closely: working directories do not contain all the files needed at a given step.
You should either create symbolic links or copy the files:
    ln -s /path/to/desired/file .
    #or copy them
    cp -i /path/to/desired/file .
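To see the difference between linking and copying, here is a minimal sketch that works in a scratch area. The file name demo.fa and the temporary directories are made up for this example and are not part of the course data.

```shell
# Scratch area so nothing in the course directories is touched (hypothetical names).
src=$(mktemp -d)
work=$(mktemp -d)
printf '>demo\nACGT\n' > "$src/demo.fa"

cd "$work"
# Option 1: symbolic link, which only points back at the original file
ln -s "$src/demo.fa" .
# Option 2: real copy under another name; -i asks before overwriting an existing file
cp -i "$src/demo.fa" demo_copy.fa

# The long listing marks the link with 'l' and shows its target
ls -l
# readlink prints where the link points
readlink demo.fa
```

A link takes no extra disk space but breaks if the original is moved or deleted. A copy is independent of the original.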
If you feel stranded on the Unix command line, try this tutorial: http://openwetware.org/wiki/Wikiomics:WinterSchool_day1#Introduction_to_Linux_and_the_command_line
Starting
There will be a 0917_kd directory in which you will find a set of subdirectories:
000_igv 00_fasta 01_fastq 02_sambam 03_gtf 04_bed 05_vcf 06_wig 07_bwa 08_last 09_last_genomes
Let's go there:
    #jump to home directory if you happen to be somewhere else
    cd
    #go to this directory
    cd 0917_kd
    #list the content
    ls
    ls -R | less
    #press q if you need to quit
Configure new genome to use in IGV
    #go to dir with required files
    cd 000_igv
    #see what is there in one column
    ls -1
    #let's use the pet-sized one-chromosome genome:
    #  LmxM.01_single.fa
    #  LmxM.01_single.fa.fai
    #  LmxM.01_single.gff
    #check the section of the main page, and use these files as input
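Sequence, index and annotation generally line up in a genome browser only when they use the same sequence names, so a quick check before loading can save some confusion. This is a minimal sketch. It assumes a standard .fai index and a GFF file, both tab-separated with the sequence name in column 1. The function names fasta_names and column1_names are made up for this example.

```shell
# fasta_names FILE: sequence names from FASTA headers
# ('>' removed, everything after the first whitespace dropped)
fasta_names() {
    grep '^>' "$1" | sed 's/^>//; s/[[:space:]].*$//' | sort -u
}

# column1_names FILE: unique values of column 1 in a tab-separated file.
# Skips '#' comment lines and empty lines, and stops at an embedded
# '##FASTA' section, which some GFF files carry at the end.
column1_names() {
    awk -F'\t' '/^##FASTA/ {exit} /^#/ || NF == 0 {next} {print $1}' "$1" | sort -u
}

# Example (run inside 000_igv); all three lists should show the same names:
#   fasta_names LmxM.01_single.fa
#   column1_names LmxM.01_single.fa.fai
#   column1_names LmxM.01_single.gff
```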
Load BAM, VCF, WIG and IGV files into browser
In the 000_igv and 02_sambam directories there is a bunch of files compatible with the LmxM.01_single chromosome. Check which files can be loaded. We can see GC content across the chromosome in 100 bp windows, "mappability" of 100 bp long reads to different parts of the chromosome, mutations, and BAMs containing mapped genomic DNA reads. Scroll along the chromosome and look at the differences. Note places enriched in SNPs, gaps in mapping, and indels.
Close IGV and go back to the command line.
Working with FASTA
The genome files are in the 00_fasta directory.
Using the web page, can you do the following:
- what is the total number of sequences in each of the TriTrypDB-8.0*fasta files?
- which genome do you think is badly assembled, and why?
- select KE148271.1 from the L. enriettii assembly and dump it as a single FASTA file. If you are having problems, think about why.
- can you "fix" this genome naming scheme? If yes, dump KE148271.1 and ATAF01001110.1 in a single step.
- you have the Lmex_genome.sort.fa file without chromosome sizes in the sequence names. Can you get them in the format:
chr_name<tab>size
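One possible way to approach the first and the last question with nothing but grep and awk is sketched below. It is a sketch under simple assumptions (every record starts with a '>' line, and the name is the header text up to the first whitespace), not necessarily the method described on the main page. The function names count_fasta_seqs and fasta_sizes and the output file name are invented here.

```shell
# count_fasta_seqs FILE: number of sequences = number of header lines starting with '>'
count_fasta_seqs() {
    grep -c '^>' "$1"
}

# fasta_sizes FILE: print "name<tab>length" for every sequence.
# Length is the total number of characters on the sequence lines,
# with line breaks (and Windows carriage returns) excluded.
fasta_sizes() {
    awk '
        { sub(/\r$/, "") }                      # tolerate Windows line endings
        /^>/ {
            if (name != "") print name "\t" len # finish the previous record
            name = substr($1, 2)                # drop ">" and any description
            len = 0
            next
        }
        { len += length($0) }
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# Example:
#   for f in TriTrypDB-8.0*fasta; do
#       printf '%s\t%s\n' "$f" "$(count_fasta_seqs "$f")"
#   done
#   fasta_sizes Lmex_genome.sort.fa > Lmex_genome.sizes.tsv
```

If samtools is installed, running samtools faidx on the FASTA file and then cut -f1,2 on the resulting .fai file gives the same two columns.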