User:Lindenb/Notebook/UMR915/20101115

Working for Cedric, wrote a small tool to print the annotations of a Swiss-Prot entry as ASCII tracks above the sequence, 60 residues per block:

Z3H7B_HUMAN
http://www.uniprot.org/uniprot/Q9UGR2.xml
============================================================ [1-993] Zinc finger CCCH domain-containing protein 7B chain
...................................========================= [36-69] TPR 1 repeat
.....................................==..................... [38-39] In Ref. 5; BAA82983. sequence conflict
MERQKRKADIEKGLQFIQSTLPLKQEEYEAFLLKLVQNLFAEGNDLFREKDYKQALVQYM 1-60

============================================================ [1-993] Zinc finger CCCH domain-containing protein 7B chain
=========................................................... [36-69] TPR 1 repeat
.....................==================================..... [82-115] TPR 2 repeat
.......................................................===== [116-149] TPR 3 repeat
...........................=................................ [88] In Ref. 4; AAF05541. sequence conflict
EGLNVADYAASDQVALPRELLCKLHVNRAACYFTMGLYEKALEDSEKALGLDSESIRALF 61-120

============================================================ [1-993] Zinc finger CCCH domain-containing protein 7B chain
=============================............................... [116-149] TPR 3 repeat
RKARALNELGRHKEAYECSSRCSLALPHDESVTQLGQELAQKLGLRVRKAYKRPQELETF 121-180

============================================================ [1-993] Zinc finger CCCH domain-containing protein 7B chain
....................................................=....... [233] Phosphoserine modified residue
............................================................ [209-224] In isoform 2. splice variant
...................................................=........ [232] In Ref. 4; AAF05541. sequence conflict
SLLSNGTAAGVADQGTSNGLGSIDDIETGNVPDTREQVEIGAPRDCYVDPRGSPALLPST 181-240

============================================================ [1-993] Zinc finger CCCH domain-containing protein 7B chain
.......................=========............................ [264-272] LD motif; interaction with NSP3 short sequence motif
................===......................................... [257-259] Almost no effect on NSP3 binding. mutagenesis site
...........................===.............................. [268-270] Complete loss of NSP3 binding. mutagenesis site
......=..................................................... [247] In Ref. 3; AAI52559. sequence conflict
PTMPLFPHVLDLLAPLDSSRTLPSTDSLDDFSDGDVFGPELDTLLDSLSLVQGGLSGSGV 241-300

============================================================ [1-993] Zinc finger CCCH domain-containing protein 7B chain
PSELPQLIPVFPGGTPLLPPVVGGSIPVSSPLPPASFGLVMDPSKKLAASVLDALDPPGP 301-360

============================================================ [1-993] Zinc finger CCCH domain-containing protein 7B chain
......................................................=..... [415] Phosphoserine modified residue
..................=......................................... [379] In dbSNP:rs9607793. sequence variant
.........................=.................................. [386] In Ref. 4; AAF05541. sequence conflict
.............................................=.............. [406] In Ref. 3; AAI52559. sequence conflict
TLDPLDLLPYSETRLDALDSFGSTRGSLDKPDSFMEETNSQDHRPPSGAQKPAPSPEPCM 361-420

============================================================ [1-993] Zinc finger CCCH domain-containing protein 7B chain
PNTALLIKNPLAATHEFKQACQLCYPKTGPRAGDYTYREGLEHKCKRDILLGRLRSSEDQ 421-480

============================================================ [1-993] Zinc finger CCCH domain-containing protein 7B chain
...................=========================................ [500-524] C3H1-type 1 zinc finger region
TWKRIRPRPTKTSFVGSYYLCKDMINKQDCKYGDNCTFAYHQEEIDVWTEERKGTLNRDL 481-540

============================================================ [1-993] Zinc finger CCCH domain-containing protein 7B chain
LFDPLGGVKRGSLTIAKLLKEHQGIFTFLCEICFDSKPRIISKGTKDSPSVCSNLAAKHS 541-600

============================================================ [1-993] Zinc finger CCCH domain-containing protein 7B chain
...............................=======================...... [632-654] C3H1-type 2 zinc finger region
FYNNKCLVHIVRSTSLKYSKIRQFQEHFQFDVCRHEVRYGCLREDSCHFAHSFIELKVWL 601-660

============================================================ [1-993] Zinc finger CCCH domain-containing protein 7B chain
...=........................................................ [664] Phosphotyrosine modified residue
...............=............................................ [676] In Ref. 4; AAF05541. sequence conflict
LQQYSGMTHEDIVQESKKYWQQMEAHAGKASSSMGAPRTHGPSTFDLQMKFVCGQCWRNG 661-720

============================================================ [1-993] Zinc finger CCCH domain-containing protein 7B chain
.................................................=========== [770-798] C3H1-type 3 zinc finger region
QVVEPDKDLKYCSAKARHCWTKERRVLLVMSKAKRKWVSVRPLPSIRNFPQQYDLCIHAQ 721-780

============================================================ [1-993] Zinc finger CCCH domain-containing protein 7B chain
==================.......................................... [770-798] C3H1-type 3 zinc finger region
NGRKCQYVGNCSFAHSPEERDMWTFMKENKILDMQQTYDMWLKKHNPGKPGEGTPISSRE 781-840

============================================================ [1-993] Zinc finger CCCH domain-containing protein 7B chain
.................=========================.................. [858-882] C2H2-type zinc finger region
GEKQIQMPTDYADIMMGYHCWLCGKNSNSKKQWQQHIQSEKHKEKVFTSDSDASGWAFRF 841-900

============================================================ [1-993] Zinc finger CCCH domain-containing protein 7B chain
.=============================.............................. [902-930] C3H1-type 4 zinc finger region
.....................===================================.... [922-956] null coiled-coil region
.................=.......................................... [918] In Ref. 3; AAI52559. sequence conflict
PMGEFRLCDRLQKGKACPDGDKCRCAHGQEELNEWLDRREVLKQKLAKARKDMLLCPRDD 901-960

================================= [1-993] Zinc finger CCCH domain-containing protein 7B chain
.....................========.... [982-989] Poly-Ala compositionally biased region
..........=...................... [971] In Ref. 1; BAG37501 and 3; AAI52559. sequence conflict
DFGKYNFLLQEDGDLAGATPEAPAAAATATTGE 961-993

Source code
import java.net.URL;
import java.util.List;

import javax.xml.bind.JAXBContext;

import uniprot.Entry;
import uniprot.FeatureType;
import uniprot.LocationType;
import uniprot.PositionType;
import uniprot.Uniprot;

// The classes of the 'uniprot' package are generated with:
// xjc -p "uniprot" "http://www.uniprot.org/support/docs/uniprot.xsd"
public class UniprotAscii {
	private UniprotAscii() throws Exception {
	}

	/** Fetches the UniProt XML for 'id' and prints its features, 60 residues per block. */
	private void run(String id) throws Exception {
		final int line_length=60;
		JAXBContext jc = JAXBContext.newInstance("uniprot");
		String uri="http://www.uniprot.org/uniprot/"+id+".xml";
		Uniprot uniprot=(Uniprot)jc.createUnmarshaller().unmarshal(new URL(uri));
		for(Entry entry:uniprot.getEntry()) {
			System.out.println(entry.getName().get(0));
			System.out.println(uri);
			String sequence= entry.getSequence().getValue().replaceAll("[ \n\t\r]", "");
			List<FeatureType> features=entry.getFeature();
			int start=0;
			while(start< sequence.length()) {
				// current block is [start,end), 0-based
				int end=Math.min(start+line_length, sequence.length());
				for(FeatureType feat:features) {
					int x0=0;
					int x1=0;
					LocationType t=feat.getLocation();
					if(t==null) continue;
					PositionType begT=t.getBegin();
					PositionType endT=t.getEnd();
					PositionType posT=t.getPosition();
					String range=null;
					if(begT!=null && endT!=null) {
						// feature spanning a range (1-based, inclusive)
						x0=begT.getPosition().intValue();
						x1=endT.getPosition().intValue()+1;
						range="["+begT.getPosition()+"-"+endT.getPosition()+"]";
					}
					else if(posT!=null) {
						// feature at a single position
						x0=posT.getPosition().intValue();
						x1=x0+1;
						range="["+posT.getPosition()+"]";
					}
					else {
						System.err.println("BOUM");
						continue;
					}
					// convert to 0-based half-open [x0,x1) before testing the overlap,
					// otherwise a feature starting on the last residue of a block is skipped
					x0--;
					x1--;
					if(x0>=end) continue;
					if(x1<=start) continue;
					int x=start;
					x0=Math.max(start, x0);
					x1=Math.min(end,x1);
					while(x<x0) {x++;System.out.print(".");}
					while(x<x1) {x++;System.out.print("=");}
					while(x<end) {x++;System.out.print(".");}
					System.out.println(" "+range+" " + feat.getDescription()+" "+feat.getType());
				}
				System.out.println(sequence.subSequence(start, end)+" "+(start+1)+"-"+end);
				System.out.println();
				start+=line_length;
			}
		}
		//System.err.println("OK");
	}

	public static void main(String[] args) {
		try {
			UniprotAscii app=new UniprotAscii();
			int optind=0;
			while(optind<args.length) {
				if(args[optind].equals("-h")) {
					return;
				}
				else if(args[optind].equals("-L")) {
					//app.readLength=Integer.parseInt(args[++optind]);
				}
				else if(args[optind].equals("--")) {
					optind++;
					break;
				}
				else if(args[optind].startsWith("-")) {
					System.err.println("Unknown option: "+args[optind]);
					return;
				}
				else {
					break;
				}
				++optind;
			}
			if(optind==args.length) {
				return;
			}
			else {
				// remaining arguments are UniProt accessions or entry names
				while(optind< args.length) {
					String inputName=args[optind++];
					app.run(inputName);
				}
			}
		}catch(Throwable err) {
			err.printStackTrace();
		}
	}
}
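
The coordinate handling is the fiddly part of the tool: UniProt feature positions are 1-based and inclusive, while the blocks of 60 residues are easier to handle as 0-based half-open intervals. The sketch below isolates that step so it can be checked without JAXB or network access. The class name `FeatureTrack` and the method `render` are hypothetical and not part of the tool above; this is one possible way to write the windowing, not necessarily how the original behaves on block boundaries.

```java
// Sketch only: FeatureTrack and render(...) are hypothetical names used for illustration.
final class FeatureTrack {
    /**
     * Renders one feature as a row of '.' and '=' for a sequence window.
     * featBegin/featEnd: 1-based inclusive feature coordinates (UniProt style).
     * winStart/winEnd:   0-based half-open window [winStart, winEnd).
     * Returns null when the feature does not overlap the window.
     */
    static String render(int featBegin, int featEnd, int winStart, int winEnd) {
        int x0 = featBegin - 1; // 0-based, inclusive
        int x1 = featEnd;       // 0-based, exclusive
        if (x0 >= winEnd || x1 <= winStart) {
            return null;
        }
        // clip the feature to the window
        x0 = Math.max(winStart, x0);
        x1 = Math.min(winEnd, x1);
        StringBuilder sb = new StringBuilder(winEnd - winStart);
        for (int x = winStart; x < winEnd; x++) {
            sb.append(x >= x0 && x < x1 ? '=' : '.');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "TPR 1 repeat" [36-69] over the first three 60-residue windows
        System.out.println(render(36, 69, 0, 60));
        System.out.println(render(36, 69, 60, 120));
        System.out.println(render(36, 69, 120, 180)); // no overlap: prints "null"
    }
}
```

Keeping the clipping in a pure function makes the boundary cases (a feature starting on residue 60, or ending on residue 120) easy to test on their own.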

Received update from IG
for the indels: new vs old headers

 1  Position.Build36     1  Position.Build36
 2  chrom                2  chrom
 3  Depth                3  Depth
 4  CIGAR                4  CIGAR
 5  ref_upstream         5  ref_upstream
 6  ref.indel            6  ref.indel
 7  ref_downstream       7  ref_downstream
 8  Q.indel.             8  Q.indel.
 9  max_gtype            9  max_gtype
10  Q.max_gtype.        10  Q.max_gtype.
11  max2_gtype          11  max2_gtype
12  bp1_reads           12  bp1_reads
13  ref_reads           13  ref_reads
14  indel_reads         14  indel_reads
15  other_reads         15  other_reads
16  repeat_unit         16  repeat_unit
17  ref_repeat_count    17  ref_repeat_count
18  indel_repeat_count  18  indel_repeat_count
19  Gene.name           19  Gene.name
20  Gene.start          20  Gene.start
21  Gene.end            21  Gene.end
22  Strand              22  Strand
23  Nbr.exon            23  Nbr.exon
24  refseq              24  refseq
25  UCSC.ID             25  type
26  type                26  type.pos
27  type.pos            27  Intron.start
28  Intron.start        28  Intron.end
29  Intron.end          29  region.splice
30  region.splice

The two lists are identical up to column 24. The left-hand list has one extra column, UCSC.ID, at position 25, so every following column is shifted by one.
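
Comparing the two header lists by eye is error-prone once columns shift. A minimal sketch of an automated check follows, assuming each header line has already been split into a list of column names. `HeaderDiff` and `onlyIn` are hypothetical names introduced here, and only the tail of the two lists (from `refseq` onwards) is used as sample data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch only: HeaderDiff and onlyIn(...) are hypothetical names used for illustration.
final class HeaderDiff {
    /** Column names present in 'a' but absent from 'b', in the order they appear in 'a'. */
    static List<String> onlyIn(List<String> a, List<String> b) {
        Set<String> rest = new LinkedHashSet<>(a);
        rest.removeAll(b);
        return new ArrayList<>(rest);
    }

    public static void main(String[] args) {
        // Columns 24 onwards of the two header lists
        List<String> first = Arrays.asList("refseq", "UCSC.ID", "type", "type.pos",
                "Intron.start", "Intron.end", "region.splice");
        List<String> second = Arrays.asList("refseq", "type", "type.pos",
                "Intron.start", "Intron.end", "region.splice");
        System.out.println(onlyIn(first, second)); // prints [UCSC.ID]
        System.out.println(onlyIn(second, first)); // prints []
    }
}
```

Looking columns up by name rather than by index in downstream scripts would also avoid breakage when a column such as UCSC.ID is added or dropped.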