User:Lindenb/Notebook/UMR915/20100901

Map UCSC knownGene to EMBL strings
java program scanning XML of http://string.embl.de/ might be too slow. trying using flat files. First download human data:

curl -s "http://string.embl.de:8080/newstring_download/protein.links.v8.3.txt.gz" | gunzip -c | egrep '^9606\.' | egrep ' 9606\.' > ~/jeter.links how many interactions ? wc jeter.links 2577772  7733316 118577512 jeter.links how many proteins ? tr " " "\n" < jeter.links | egrep '^9606\.' |sort | uniq | wc -l 17369

does STRING only use ensembl identifiers ? YES: tr " " "\n" < jeter.links | egrep '^9606\.' | grep -v ENSP (nothing)

cleanup 'links', remove '9606.' sed 's/9606\.E/E/g' jeter.links | tr " " " " > A; mv A jeter.links

download mapping ENSP - knownGene:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -P 3306 -D hg18 -e 'select distinct K.name,E.protein from knownToEnsembl as K, ensGtp as E where E.transcript=K.value and E.protein like "ENSP%" ' > jeter.ensembl wc -l jeter.ensembl 59310 sort this mapping sort -t ' ' -k 2,2 jeter.ensembl | grep ENSP > a ; mv a jeter.ensembl sort links on first column sort -t ' ' -k1,1 jeter.links > a; mv a jeter.links join links with ensembl for first column, sort result join -t ' ' -1 1 -2 2 jeter.links jeter.ensembl | cut -d '        ' -f 2- | sort -t '     ' -k 1,1 > a  mv a jeter.links

head jeter.links ENSP00000000233	151	uc002lfr.1 ENSP00000000233	151	uc003frn.1 ENSP00000000233	154	uc001ndt.1 ENSP00000000233	154	uc001ndu.1 ENSP00000000233	155	uc001rzo.1 ENSP00000000233	155	uc001rzp.1 ENSP00000000233	157	uc002qtg.1 ENSP00000000233	160	uc003fri.1 ENSP00000000233	160	uc003frj.1 ENSP00000000233	162	uc001nnb.1

join for second protein: join -t ' ' -1 1 -2 2 jeter.links  jeter.ensembl  | cut -d '      ' -f 2- > jeter.strings.txt wc jeter.strings.txt 7896492 23689476 205308792 jeter.strings.txt

result: head jeter.strings.txt 151	uc002lfr.1	uc003vmb.1 151	uc002lfr.1	uc010llb.1 151	uc003frn.1	uc003vmb.1 151	uc003frn.1	uc010llb.1 154	uc001ndt.1	uc003vmb.1 154	uc001ndt.1	uc010llb.1 154	uc001ndu.1	uc003vmb.1 154	uc001ndu.1	uc010llb.1 155	uc001rzo.1	uc003vmb.1 155	uc001rzo.1	uc010llb.1

put in mysql
CREATE TABLE `kg2kg` ( `confidence` smallint(6) NOT NULL,  `kg1` varchar(11) NOT NULL,  `kg2` varchar(11) NOT NULL,  KEY `kg1` (`kg1`),  KEY `kg2` (`kg2`)  ) ENGINE=InnoDB DEFAULT CHARSET=latin1 ;

then mysql (...) -e 'load data local infile "jeter.strings.txt" into table kg2kg'

Load missing tables from UCSC
load http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ensGtp.txt.gz and http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/knownToEnsembl.txt.gz from UCSC

mysql -D hg18 < ensGtp.sql mysql -D hg18 < knownToEnsembl.sql mysql -D hg18 -e 'load data local infile "ensGtp.txt" into table ensGtp' mysql -D hg18 -e 'load data local infile "knownToEnsembl.txt" into table knownToEnsembl'

Download Links
curl -s http://string.embl.de:8080/newstring_download/protein.links.v8.3.txt.gz | gunzip -c  |\ egrep '^9606\.' | egrep ' 9606\.' | sed 's/9606\.//g' | tr " " "      " > jeter.links

CREATE TABLE `ensp2ensp` ( `protein1` varchar(16) NOT NULL,  `protein2` varchar(16) NOT NULL,  `confidence` smallint(6) NOT NULL,  KEY `protein1` (`protein1`),  KEY `protein2` (`protein2`)  ) ENGINE=InnoDB DEFAULT CHARSET=latin1 load mysql -u root -D emblstring -e 'load data local infile "jeter.links" into table ensp2ensp'

Fact knownGene to ENSEMBL is a 1:1 relation
select K.name,count(name) as C from knownToEnsembl as K, ensGtp as E where E.transcript=K.value  group by K.name having C!=1; Empty set (0.78 sec)