Harvard:Biophysics 101/2007/Notebook:Xiaodi Wu/2007-5-1

Happy May Day!
Much of my attention these last few weeks has been focused on moving the existing code to an actual server, with actual local data that we can call. This should make our scripts a lot faster and a lot more consistent. Although it's doubtful that anyone else would have the facilities to duplicate this process, I thought I'd first document the steps taken so far in this direction. Previous writeups have been unusually short; this is a more detailed accounting of all the activity that's been going on behind the scenes, now that I have a moment to document it all online.

Installation and setup of server

 * Installed necessary Python and Apache2 modules via apt-get (apache2, apache2-threaded-dev, python2.4, python2.4-dev).
 * Downloaded, compiled, and configured mod_python (requires Apache2).
 * Altered the Apache config file to load mod_python.
 * Installed python-mysqldb via apt-get (this is for interfacing between Python and our database).
 * Installed Django (our web framework, so that we can serve an interface).

More familiar stuff:


 * Installed egenix-mx-base-2.0.6 and Numeric-24.2 as BioPython prerequisites.
 * Installed BioPython 1.43.

Not yet completed tasks
I still need to set up and configure the MySQL database, so that the relevant components can start using it to store information.

Web files and local data files
Installed custom error pages (try navigating to, say, a random url) as well as a placeholder page, having chosen a nice, blue colour scheme. Over the last few days I also downloaded a few valuable sources of info, so that some queries can now be done locally:
 * NCBI has some very intriguing files if you dig deep enough, with pretty much all the data we need in some very accessible formats. So far, I've been able to download gene sequences in FASTA format for each chromosome (big files!), as well as files that convert between gene IDs and PubMed IDs, UniGene IDs, accession numbers, and quite a few other things. I've also been able to get a large file containing all consensus coding sequences (CCDS), so we know the name of every gene and where it maps on each chromosome (and by extension, since we have the sequence of each chromosome, pretty much all the info we need) -- ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/ and ftp://ftp.ncbi.nih.gov/pub/CCDS/current_human/
 * UCSC has a very interesting collection of data too. Many of their files correspond to ones that NCBI also has, so it would be redundant to fetch those as well. However, there is a file mapping chromosome positions in base pairs to cytobands: it'll tell you, for instance, that base pairs 120938000-122032000 on chr 10 correspond to 10q12.3 (I just made that specific one up). This file could become useful at some point -- ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Homo_sapiens/database/cytoBand.txt.gz
 * TCAG (a Toronto-based genomic centre, hurrah!) has a list of CNVs that are known (http://projects.tcag.ca/variation, I believe, is the correct URL), which I will also make part of our local data
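
To make the cytoband idea concrete, here is a minimal sketch of how the UCSC file could be used to turn a base-pair range into band names. It assumes the usual cytoBand.txt layout (tab-separated chromosome, start, end, band name, stain, with 0-based half-open coordinates). The function names are my own inventions for illustration, and it is written for a current Python rather than the 2.4 install described above.

```python
import gzip


def parse_cytobands(lines):
    """Return {chrom: [(start, end, band_name), ...]} from cytoBand.txt lines."""
    bands = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Assumed columns: chrom, chromStart, chromEnd, name, gieStain
        chrom, start, end, name, _stain = line.split("\t")[:5]
        bands.setdefault(chrom, []).append((int(start), int(end), name))
    for chrom in bands:
        bands[chrom].sort()  # order bands by start position
    return bands


def load_cytobands(path):
    """Read a gzipped cytoBand file (hypothetical local path) into a dict."""
    with gzip.open(path, "rt") as handle:
        return parse_cytobands(handle)


def bands_for_range(bands, chrom, start, end):
    """Names of all bands on `chrom` overlapping the half-open range [start, end)."""
    short = chrom[3:] if chrom.startswith("chr") else chrom
    return [
        short + name
        for (b_start, b_end, name) in bands.get(chrom, [])
        if b_start < end and start < b_end
    ]
```

A call such as bands_for_range(bands, "chr10", 120938000, 122032000) would then list every band the range touches, which is handy when a CNV or gene spans a band boundary.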

Not yet completed tasks
I have yet to find local, downloadable versions of dbSNP data that can be easily dumped into a database. There are certainly files available, but they seem to come in an icky format that requires additional processing. I'm confident this will be solved soon, though, and in any case I haven't looked too closely, so I could be wrong about the level of ickiness.

As for the data we already have, everything but the actual chromosome sequences can be searched more effectively once it's dumped into a database. I will work on a database schema that fits and then import the data into it (LOAD DATA INFILE seems to be the correct SQL command).
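
As a rough sketch of what that import might look like, the snippet below builds a CREATE TABLE statement and a LOAD DATA INFILE statement for one of the NCBI conversion files. Everything specific here is an assumption made for illustration: the table name gene2pubmed, its three columns, the file path, and the single header line to skip. Nothing has actually been loaded yet.

```python
# Hypothetical schema for a tab-separated gene ID <-> PubMed ID file.
CREATE_GENE2PUBMED = (
    "CREATE TABLE gene2pubmed ("
    " tax_id INT NOT NULL,"
    " gene_id INT NOT NULL,"
    " pubmed_id INT NOT NULL,"
    " INDEX (gene_id),"
    " INDEX (pubmed_id)"
    ")"
)


def load_data_statement(path, table, columns, skip_header_lines=0):
    """Build a MySQL LOAD DATA statement for a tab-separated text file.

    `path`, `table` and `columns` are interpolated as-is, so they must come
    from trusted configuration and never from user input.
    """
    column_list = ", ".join(columns)
    return (
        "LOAD DATA LOCAL INFILE '%s' INTO TABLE %s "
        "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n' "
        "IGNORE %d LINES (%s)" % (path, table, skip_header_lines, column_list)
    )


# Example: the path below is a placeholder, not the real location on the server.
statement = load_data_statement(
    "/data/ncbi/gene2pubmed",
    "gene2pubmed",
    ["tax_id", "gene_id", "pubmed_id"],
    skip_header_lines=1,
)
```

Both strings could then be passed to cursor.execute() through python-mysqldb once the database exists. Indexing both ID columns is a guess at the lookups we'll want (gene to papers, and paper to genes).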

Regarding other data, the HapMap has plenty of info, but it is not very easily accessible. They do report that most of what they find also gets submitted to dbSNP, so it will be important to look into how much data we need directly from them and how much will trickle down through dbSNP. Integrating this data would be useful, but I guess this could come later, and for now we could continue using web queries.

ClustalW seems to be web-only for the moment, but BioPython handles that very transparently and I don't think this will be a problem. OMIM data is another beast I've yet to look at, but I suppose we will stick to web querying for that unless there's some Harvard server we can access with a mirrored copy; likewise Blast. I don't feel that maintaining a separate Blast server just for this one project is worth the time, but if there's some Harvard mirror, that would be wonderful.

PolyPhen, being a Harvard project, should be easy enough to integrate into our tool. As it is, though, their data seems to be available via HTML output only, but that can be easily captured through quick-and-dirty screenscraping.
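
For what it's worth, the quick-and-dirty screenscraping could be as simple as pulling the cells out of an HTML table. The sketch below uses only the standard library and assumes the results arrive in ordinary tr/td markup. I haven't checked what PolyPhen's pages actually look like, so the class and function names, and the sample row in the comment, are hypothetical.

```python
from html.parser import HTMLParser


class TableCellScraper(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def _finish_cell(self):
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._finish_cell()  # tolerate a previous cell left unclosed
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._finish_cell()
        elif tag == "tr" and self._row is not None:
            self._finish_cell()
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the rows of all tables in `html` as lists of stripped strings."""
    scraper = TableCellScraper()
    scraper.feed(html)
    scraper.close()
    return scraper.rows


# Made-up input, for illustration only:
# scrape_table("<table><tr><td>R123W</td><td>probably damaging</td></tr></table>")
```

Fetching the page itself is left out here; the point is only that once we have the HTML as a string, getting rows of values out of it takes very little code.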