Gene Family Notes

From OpenWetWare

100 genomes (Dongying Wu)

  • Summary
    • Took 100 genomes (bacterial and archaeal)
    • Did all vs all blastp
    • Identified protein families using MCL
    • Annotated families according to evenness, universality, etc.
    • Ranked families

To identify potential phylogenetic markers for metagenomic studies, we are developing methods to analyze gene distributions across organisms with complete genome sequences. We selected 85 bacterial and 15 archaeal genomes for the initial study. A maximum likelihood tree of 720 bacterial genomes, based on 31 concatenated genome markers, was used for the bacterial genome selection, while the archaeal genomes were selected from a tree of the archaeal radA genes. A program called maxPD was developed to automatically select taxa from a phylogenetic tree, listing the taxa according to their contributions to the phylogenetic diversity of the tree. The genomes that contribute the most phylogenetic diversity were selected. During selection, we chose only genomes with publications, and we avoided genomes that have undergone severe genome reduction.
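The internals of maxPD are not described here, so the following is only a minimal sketch of one common way to rank taxa by their contribution to phylogenetic diversity: greedily pick the leaf that adds the most not-yet-covered branch length. The tree encoding (`parent` and `length` dictionaries) and the function name `rank_by_pd` are hypothetical and chosen for illustration; they are not taken from maxPD.

```python
def rank_by_pd(parent, length, leaves):
    """Greedily order leaves by the branch length each adds to the selection.

    parent: maps every node to its parent (the root maps to None)
    length: maps every node to the length of the branch above it
    leaves: names of the taxa to rank
    """
    covered = set()            # nodes whose branch is already counted
    remaining = set(leaves)
    ranking = []

    def gain(leaf):
        # Sum branch lengths from the leaf up to the first covered node.
        total, node = 0.0, leaf
        while node is not None and node not in covered:
            total += length[node]
            node = parent[node]
        return total

    while remaining:
        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(remaining), key=gain)
        ranking.append((best, gain(best)))
        node = best
        while node is not None and node not in covered:
            covered.add(node)
            node = parent[node]
        remaining.remove(best)
    return ranking


# Toy tree (hypothetical): root -> X (0.1), C (1.0); X -> A (0.5), B (0.2)
parent = {"root": None, "X": "root", "C": "root", "A": "X", "B": "X"}
length = {"root": 0.0, "X": 0.1, "C": 1.0, "A": 0.5, "B": 0.2}
ranking = rank_by_pd(parent, length, ["A", "B", "C"])
for taxon, added in ranking:
    print(f"{taxon}\t{added:.2f}")
```

Taking the first N entries of such a ranking gives a subset of taxa that covers a large share of the tree's branch length. Filters such as "published genomes only" would be applied by restricting `leaves` before ranking.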

A protocol has been developed to classify gene families automatically. The protocol is based on Blastall and the Markov clustering algorithm (MCL), so it is automated and robust. We used an e-value cutoff of 1e-10 in our test run. From the 313139 genes in the 100 selected genomes, we identified 23336 gene families that include a total of 239453 genes. To select gene families automatically as potential markers, we developed a program that not only evaluates the universality of the gene families, but also estimates the evenness of the distribution of the genes across a selected group of genomes (see the figure below).
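As an illustration of the step between the all-vs-all BLAST run and the clustering, here is a minimal sketch that filters tabular BLAST hits at the 1e-10 e-value cutoff and emits label/label/weight edges of the kind MCL accepts in its `--abc` input mode. It assumes the standard 12-column tabular BLAST output (e-value in the 11th column). The function name `blast_to_abc`, the use of -log10(e-value) as the edge weight, and the cap of 200 are assumptions made for this example, not details of the protocol described above.

```python
import math


def blast_to_abc(lines, max_evalue=1e-10, cap=200.0):
    """Turn 12-column tabular BLAST hits into (query, subject, weight) edges."""
    edges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if query == subject or evalue > max_evalue:
            continue  # drop self-hits and hits weaker than the cutoff
        # Assumed weighting: -log10(e-value), capped because BLAST
        # reports 0.0 for very strong hits.
        weight = cap if evalue == 0.0 else min(cap, -math.log10(evalue))
        edges.append((query, subject, weight))
    return edges


# Made-up hits for illustration only.
example = [
    "geneA\tgeneB\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "geneA\tgeneC\t30.0\t50\t35\t0\t1\t50\t1\t50\t1e-5\t40",
    "geneA\tgeneA\t100.0\t100\t0\t0\t1\t100\t1\t100\t0.0\t300",
]
edges = blast_to_abc(example)
for q, s, w in edges:
    print(f"{q}\t{s}\t{w:.1f}")
```

Only the geneA/geneB hit survives: the geneA/geneC hit is weaker than the cutoff and the third line is a self-hit. The printed three-column lines could be written to a file and clustered with a command such as `mcl hits.abc --abc -o families.txt`, where the file names are placeholders.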

Protein Family Metrics

Across all 100 selected genomes, we identified 30 gene families that appear in almost all the genomes with even gene distributions. Of these, 12 were ribosomal protein subunits already included in our gene marker collections, which indicates that the approach works in principle. The rest of the candidates are mostly tRNA synthetases, among others. Different cutoffs will be applied to select more potential markers, and the evenness and distribution estimation program was developed so that potential markers for specific phylogenetic groups (for example, a phylum or a class) can be easily identified.
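The page does not give formulas for the metrics, so the sketch below covers universality only and rests on an assumption: that universality is the percentage of genomes in the chosen group that contain at least one member of the family. This reading is consistent with the bacterial values reported on this page (for example, 84 of 85 genomes gives 98.82), but it is an inference, not a stated definition. Evenness is left out because its definition cannot be recovered from the text. The function name and genome labels are hypothetical.

```python
def universality(family_genomes, group_genomes):
    """Percentage of genomes in the group with at least one family member.

    family_genomes: genome label of each gene in the family (repeats allowed)
    group_genomes:  labels of the genomes in the group being evaluated
    """
    group = set(group_genomes)
    present = {g for g in family_genomes if g in group}
    return 100.0 * len(present) / len(group)


# Hypothetical family: present in 84 of 85 genomes, with two copies in genome0.
group = [f"genome{i}" for i in range(85)]
members = [f"genome{i}" for i in range(84)] + ["genome0"]
score = universality(members, group)
print(round(score, 2))  # 98.82
```

Because the group is passed in explicitly, the same function can be applied to a single phylum or class by handing it only the genomes of that group.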

Bacterial Families

We've identified 503 bacterial families with a universality of at least 70. Of these, 56 can serve as phylogenetic markers (including 31 AMPHORA markers). The families (sequences, alignments, HMMs) can be downloaded from http://www.biotorrents.net/details.php?id=39 (the password for unzipping is iseem).
The family characteristics are listed below (#: phylogenetic markers, #*: AMPHORA markers).
The columns are:

[0] family_ID
[1] family size
[2] universality
[3] evenness
[4] independent_evenness (independent of universality)
[5] family description

bactF10020	93	98.82	49.59	50.10	hypothetical protein
bactF10042	105	92.94	21.21	25.18	hypothetical protein
#*bactF10154	84	98.82	91.02	100.00	rpsB 30S ribosomal protein S2
bactF10345	106	100.00	29.76	29.76	polA DNA polymerase I
bactF10373	77	88.24	39.02	81.69	trpA tryptophan synthase subunit alpha
bactF10442	69	80.00	20.19	89.20	hypothetical protein
bactF10511	154	100.00	21.01	21.01	Peptidase M24
bactF10512	120	100.00	20.91	20.91	methionine aminopeptidase
bactF10567	90	95.29	47.10	49.11	atpB F0F1 ATP synthase subunit A
#*bactF10580	86	100.00	91.22	91.22	rplN 50S ribosomal protein L14
bactF10640	79	74.12	12.61	29.10	ribosome-associated GTPase
bactF10684	89	100.00	70.99	70.99	miaA tRNA delta(2)-isopentenylpyrophosphate transferase
bactF10712	75	87.06	35.51	90.01	50S ribosomal protein L25/general stress protein Ctc
bactF10714	73	85.88	32.32	100.00	ruvC Holliday junction resolvase
bactF10736	67	77.65	16.73	88.91	mutL DNA mismatch repair protein
bactF10744	124	81.18	9.40	22.51	hydrolase carbon-nitrogen family
bactF10745	89	98.82	65.16	65.53	nadE NAD synthetase
bactF10759	78	76.47	15.22	33.72	hypothetical protein
bactF10881	83	95.29	68.63	82.86	gidB glucose-inhibited division protein B
bactF10918	161	100.00	34.43	34.43	chaperone protein dnaJ
bactF10945	86	98.82	83.21	83.39	GrpE protein
#*bactF10964	85	100.00	100.00	100.00	rplK 50S ribosomal protein L11
bactF10992	81	90.59	47.10	68.76	riboflavin synthase subunit alpha
#*bactF11013	85	100.00	100.00	100.00	rpmA 50S ribosomal protein L27
bactF11020	77	71.76	10.45	27.78	NAD-dependent deacetylase
bactF11067	138	72.94	3.88	9.13	putative peptidase
bactF11068	103	76.47	8.28	18.64	succinyl-diaminopimelate desuccinylase
bactF11191	4632	100.00	16.12	16.12	ABC transporter ATP-binding protein
bactF11192	1975	100.00	5.41	5.41	sensor histidine kinase
bactF11193	1298	100.00	5.29	5.29	short chain dehydrogenase
bactF11194	1072	100.00	9.67	9.67	two-component response regulator
bactF11195	1038	77.65	2.68	5.24	hypothetical protein
bactF11196	794	97.65	4.59	4.88	hypothetical protein
bactF11199	623	100.00	53.21	53.21	infB translation initiation factor IF-2
bactF11200	660	70.59	1.45	3.32	hypothetical protein
bactF11201	500	97.65	10.42	11.30	Glycosyl transferase group 1
bactF11203	515	91.76	3.32	4.28	Aldehyde dehydrogenase
bactF11204	488	98.82	16.12	16.79	transcriptional regulator GntR family
bactF11205	489	77.65	2.63	5.64	3-hydroxybutyryl-CoA dehydrogenase
bactF11206	447	97.65	14.10	15.20	NAD-dependent epimerase/dehydratase
bactF11207	397	97.65	14.69	15.97	cation transport ATPase
bactF11208	396	95.29	14.29	17.13	NADH oxidase
bactF11209	424	83.53	3.09	4.91	two-component response regulator
bactF11210	429	98.82	4.78	4.88	beta-ketoacyl synthase
bactF11211	380	96.47	12.71	14.28	argD acetylornithine aminotransferase
bactF11212	391	74.12	1.77	3.51	permease of the major facilitator superfamily
bactF11213	410	82.35	3.59	7.08	AcrB/AcrD/AcrF family protein
bactF11214	356	83.53	3.78	6.35	putative anaerobic dehydrogenase
bactF11215	327	85.88	4.77	7.29	alcohol dehydrogenase zinc-containing
bactF11218	286	96.47	16.75	18.96	D-isomer specific 2-hydroxyacid dehydrogenase family protein
bactF11219	256	100.00	21.84	21.84	nucleotidyl transferase
bactF11220	262	100.00	81.98	81.98	ileS isoleucyl-tRNA synthetase
bactF11221	279	81.18	3.69	6.77	extracellular solute-binding protein family 5
bactF11222	258	90.59	5.38	6.68	binding-protein-dependent transport systems inner membrane component
bactF11223	257	89.41	5.62	7.18	oligopeptide ABC transporter permease protein
bactF11225	261	74.12	2.68	6.20	oxidoreductase aldo/keto reductase family
bactF11226	254	100.00	34.71	34.71	fliI flagellum-specific ATP synthase
bactF11227	266	98.82	22.83	23.61	aminotransferase class V
bactF11230	212	100.00	12.45	12.45	ATPase AAA family
bactF11231	267	98.82	14.42	14.93	RNA polymerase sigma factor
bactF11232	253	95.29	13.20	15.22	ATP-dependent RNA helicase
bactF11233	263	70.59	2.66	8.15	RNA polymerase sigma-70 factor ECF subfamily
bactF11234	254	98.82	25.31	26.24	hypothetical protein
bactF11235	253	100.00	62.88	62.88	trmE tRNA modification GTPase TrmE
bactF11237	247	83.53	9.27	17.58	hypothetical protein
bactF11241	215	74.12	3.64	8.53	Oxidoreductase
bactF11242	227	97.65	14.13	15.41	serine protease
bactF11243	221	94.12	15.33	18.51	xerC site-specific tyrosine recombinase XerC
bactF11245	198	98.82	42.02	43.55	ffh signal recognition particle protein
bactF11246	226	100.00	24.34	24.34	ATPase AAA-2 domain protein
bactF11247	196	89.41	8.81	12.45	acetolactate synthase large subunit
bactF11249	182	100.00	51.67	51.67	purF amidophosphoribosyltransferase
bactF11250	192	97.65	7.23	7.62	hypothetical protein
bactF11251	214	72.94	4.04	12.84	general secretion pathway protein E
bactF11253	198	89.41	13.36	20.19	cystathionine gamma-synthase
bactF11254	211	95.29	15.51	18.42	penicillin-binding protein
bactF11255	204	100.00	26.57	26.57	UvrD/REP helicase
bactF11256	182	98.82	38.20	39.15	Phosphomannomutase
bactF11258	183	91.76	12.01	14.58	phosphoenolpyruvate-protein phosphotransferase
bactF11259	204	97.65	22.71	23.93	penicillin-binding protein
bactF11260	185	70.59	3.71	12.87	glycosyl transferase group 2 family protein
bactF11261	186	92.94	21.51	26.27	trpE anthranilate synthase component I
bactF11262	173	92.94	14.96	18.02	hypothetical protein
bactF11263	174	71.76	4.27	12.97	amino acid permease family protein
bactF11265	176	83.53	7.13	11.77	oxidoreductase
bactF11266	179	98.82	24.17	24.74	thioredoxin
bactF11267	180	95.29	16.59	18.38	cysteine synthase
bactF11268	178	92.94	23.95	30.54	hypothetical protein
bactF11269	166	85.88	30.87	41.48	phosphate ABC transporter permease protein
bactF11271	182	92.94	19.14	22.72	peptide methionine sulfoxide reductase
bactF11272	182	94.12	13.13	15.39	pyruvate carboxylase
bactF11273	170	100.00	100.00	100.00	obgE GTPase ObgE
bactF11274	188	91.76	12.59	15.55	peptidase M16 family
bactF11275	156	100.00	54.90	54.90	DNA polymerase III subunits gamma and tau
bactF11276	164	76.47	5.13	8.88	MATE efflux family protein
bactF11279	158	76.47	7.93	21.81	uroporphyrin-III C-methyltransferase
bactF11280	157	98.82	17.65	18.59	thioredoxin reductase
bactF11281	174	76.47	6.16	14.26	DegT/DnrJ/EryC1/StrS aminotransferase
bactF11282	177	78.82	1.87	2.54	hypothetical protein
bactF11283	178	97.65	44.60	45.73	ftsW Cell division protein FtsW
bactF11284	164	94.12	25.23	32.49	sdhA succinate dehydrogenase flavoprotein subunit
bactF11285	169	77.65	5.19	10.05	glycosyl hydrolase family 3
bactF11287	171	100.00	79.36	79.36	prfA peptide chain release factor 1
bactF11289	154	90.59	11.23	17.10	ParA family protein
bactF11290	149	71.76	4.47	13.13	Na+/solute symporter
bactF11293	158	81.18	7.16	14.69	inositol monophosphatase family protein
bactF11295	164	100.00	72.81	72.81	mfd transcription-repair coupling factor
bactF11296	157	78.82	10.95	26.48	branched-chain alpha-keto acid dehydrogenase subunit E2
bactF11297	143	91.76	14.68	21.47	hisC histidinol-phosphate aminotransferase
bactF11299	160	82.35	10.54	19.20	hypothetical protein
bactF11300	133	85.88	10.18	18.62	leuA 2-isopropylmalate synthase
bactF11301	160	97.65	16.19	17.97	Pseudouridine synthase Rsu
bactF11303	134	88.24	12.53	21.23	leuB 3-isopropylmalate dehydrogenase
bactF11304	156	97.65	28.39	32.38	mraY phospho-N-acetylmuramoyl-pentapeptide-transferase
bactF11305	125	100.00	16.92	16.92	topA DNA topoisomerase I
bactF11306	138	78.82	5.99	14.29	Sodium/hydrogen exchanger
bactF11307	144	96.47	29.88	36.49	aroA 3-phosphoshikimate 1-carboxyvinyltransferase
bactF11311	144	89.41	14.01	23.66	pta phosphate acetyltransferase
bactF11314	146	100.00	10.22	10.22	dnaK molecular chaperone DnaK
bactF11315	141	97.65	17.89	19.68	ilvE branched-chain amino acid aminotransferase
bactF11317	134	87.06	11.41	20.24	hypothetical protein
bactF11318	141	90.59	26.88	49.29	secD preprotein translocase subunit SecD
bactF11319	145	96.47	24.37	29.25	rpsA 30S ribosomal protein S1
bactF11322	137	100.00	17.37	17.37	uvrA excinuclease ABC subunit A
bactF11323	137	95.29	18.97	23.38	purN phosphoribosylglycinamide formyltransferase
bactF11324	138	97.65	25.13	28.15	clpX ATP-dependent protease ATP-binding subunit
bactF11327	130	89.41	14.63	23.62	peptidyl-prolyl cis-trans isomerase
bactF11328	135	94.12	12.57	15.35	Phospholipid/glycerol acyltransferase
bactF11330	131	80.00	8.55	22.97	Undecaprenyl-phosphate galactosephosphotransferase
bactF11331	132	96.47	21.10	24.55	coproporphyrinogen III oxidase
bactF11332	118	95.29	17.01	19.37	aspartate kinase
bactF11333	131	100.00	22.60	22.60	polyA polymerase family protein
bactF11334	117	89.41	13.46	19.02	ribA GTP cyclohydrolase II
bactF11335	129	84.71	16.51	41.32	ispD 2-C-methyl-D-erythritol 4-phosphate cytidylyltransferase
bactF11338	120	96.47	17.12	18.86	lysA diaminopimelate decarboxylase
bactF11340	119	78.82	6.61	13.77	gltD glutamate synthase subunit beta
bactF11342	111	100.00	23.86	23.86	hydrolase TatD family
bactF11345	108	76.47	7.86	19.85	permease of the major facilitator superfamily
bactF11346	115	98.82	20.86	21.43	inositol-5-monophosphate dehydrogenase
bactF11347	114	85.88	11.54	18.36	gltB glutamate synthase large subunit
bactF11348	123	97.65	17.74	18.99	tRNA-dihydrouridine synthase A
bactF11350	108	78.82	8.93	19.56	hypothetical protein
bactF11351	120	77.65	6.58	15.65	bioF 8-amino-7-oxononanoate synthase
bactF11353	111	97.65	23.58	25.01	ribonucleotide-diphosphate reductase subunit alpha
bactF11357	111	81.18	9.60	18.88	gltA citrate synthase
bactF11359	120	83.53	9.18	16.62	carboxyl-terminal protease
bactF11360	102	80.00	9.66	15.83	bcp bacterioferritin comigratory protein
bactF11362	105	76.47	7.57	16.06	ATP-dependent protease La
bactF11364	98	89.41	21.84	27.91	tryptophan synthase subunit beta
bactF11366	105	80.00	9.68	18.27	ammonium transporter
bactF11367	109	92.94	18.96	22.83	methylated-DNA--protein-cysteine methyltransferase
bactF11368	117	100.00	23.06	23.06	dnaE DNA polymerase III subunit alpha
bactF11371	113	77.65	7.57	18.01	hypothetical protein
bactF11372	99	78.82	10.94	22.20	peroxiredoxin
bactF11374	112	70.59	4.40	12.19	sodium dicarboxylate symporter family protein
bactF11377	93	100.00	52.75	52.75	metG methionyl-tRNA synthetase
bactF11383	94	97.65	42.14	42.92	nth endonuclease III
bactF11386	101	94.12	23.70	25.89	aroE shikimate 5-dehydrogenase
bactF11390	107	95.29	21.91	24.27	transketolase
bactF11391	112	97.65	24.33	25.98	murC UDP-N-acetylmuramate--L-alanine ligase
bactF11393	112	92.94	15.96	18.81	Cell division FtsK/SpoIIIE
bactF11395	107	87.06	14.30	20.87	hypothetical protein
bactF11396	98	97.65	34.14	35.08	carB carbamoyl phosphate synthase large subunit
bactF11397	91	92.94	36.37	38.23	hypothetical protein
bactF11398	105	74.12	6.98	18.71	recQ ATP-dependent DNA helicase RecQ
bactF11399	109	100.00	28.25	28.25	N5-glutamine S-adenosyl-L-methionine-dependent methyltransferase
bactF11400	94	100.00	50.42	50.42	cysS cysteinyl-tRNA synthetase
bactF11401	107	100.00	27.35	27.35	D-alanine--D-alanine ligase
bactF11402	103	85.88	14.44	20.93	dxs 1-deoxy-D-xylulose-5-phosphate synthase
#bactF11403	85	98.82	91.02	91.12	guaA bifunctional GMP synthase/glutamine amidotransferase protein
bactF11405	96	98.82	41.42	41.94	purB adenylosuccinate lyase
bactF11406	104	91.76	19.76	23.79	RNA methyltransferase TrmA family
bactF11407	97	75.29	9.66	26.76	gcvP glycine dehydrogenase
bactF11410	105	89.41	17.18	22.70	ParB-like partition protein
bactF11413	91	85.88	22.98	30.35	Phosphomethylpyrimidine kinase
bactF11414	90	95.29	47.10	49.11	ppnK inorganic polyphosphate/ATP-NAD kinase
bactF11417	97	98.82	38.90	39.39	adk adenylate kinase
bactF11421	97	82.35	14.77	25.46	exodeoxyribonuclease III
bactF11422	88	100.00	76.87	76.87	gcp O-sialoglycoprotein endopeptidase
bactF11423	93	97.65	45.50	46.44	HIT family protein
bactF11424	101	80.00	11.80	25.06	exopolyphosphatase
bactF11425	103	97.65	28.44	29.63	murA UDP-N-acetylglucosamine 1-carboxyvinyltransferase
bactF11427	91	97.65	52.01	52.97	carA carbamoyl phosphate synthase small subunit
bactF11428	91	84.71	21.17	29.24	argB acetylglutamate kinase
bactF11429	88	96.47	59.08	60.32	putative deoxyribonucleotide triphosphate pyrophosphatase
bactF11430	97	88.24	20.33	25.80	fumC fumarate hydratase
bactF11432	90	90.59	33.53	37.70	pyruvate kinase
bactF11433	89	87.06	27.67	34.13	trpG anthranilate synthase component II
bactF11434	94	76.47	9.61	14.42	hypothetical protein
bactF11436	93	76.47	10.57	18.87	threonine dehydratase
bactF11437	96	78.82	11.31	19.73	kdsA 2-dehydro-3-deoxyphosphooctonate aldolase
bactF11439	99	96.47	31.32	33.10	alr alanine racemase
bactF11442	97	100.00	40.80	40.80	rpe ribulose-phosphate 3-epimerase
bactF11443	91	77.65	11.60	14.02	gcvT glycine cleavage system aminomethyltransferase T
bactF11444	94	98.82	46.72	47.25	uvrC excinuclease ABC subunit C
#bactF11446	85	100.00	100.00	100.00	pyrH uridylate kinase
bactF11449	85	97.65	82.84	83.22	purE phosphoribosylaminoimidazole carboxylase catalytic subunit
bactF11450	84	89.41	42.87	50.07	ribD riboflavin biosynthesis protein RibD
bactF11451	97	100.00	42.74	42.74	fmt methionyl-tRNA formyltransferase
bactF11454	96	78.82	11.68	21.98	mreB rod shape-determining protein MreB
bactF11455	95	78.82	11.42	18.47	cysE serine acetyltransferase
bactF11461	95	97.65	40.59	41.62	pnp polynucleotide phosphorylase/polyadenylase
bactF11462	96	100.00	43.12	43.12	dnaB replicative DNA helicase
bactF11464	84	89.41	42.87	50.07	hflX GTP-binding protein hflX
bactF11467	83	95.29	68.63	82.86	purD phosphoribosylamine--glycine ligase
bactF11472	82	96.47	75.40	100.00	purM phosphoribosylaminoimidazole synthetase
bactF11475	80	76.47	15.22	30.12	succinyl-CoA synthetase subunit alpha
bactF11478	91	98.82	56.44	56.89	FolC Folylpolyglutamate synthase
bactF11480	79	89.41	42.87	74.69	tmk thymidylate kinase
bactF11482	87	97.65	69.83	70.46	purH bifunctional phosphoribosylaminoimidazolecarboxamide formyltransferase/IMP cyclohydrolase
bactF11483	87	91.76	44.47	47.58	aroB 3-dehydroquinate synthase
bactF11484	86	92.94	52.54	55.24	folP dihydropteroate synthase
bactF11485	87	100.00	83.56	83.56	uvrB excinuclease ABC subunit B
bactF11486	79	89.41	42.87	74.69	Biotin--acetyl-CoA-carboxylase ligase
bactF11487	79	83.53	26.78	48.18	prephenate dehydratase
bactF11489	90	100.00	65.82	65.82	rpoA DNA-directed RNA polymerase subunit alpha
#*bactF11490	90	100.00	65.13	65.13	dnaG DNA primase
bactF11492	81	75.29	13.86	29.14	purK phosphoribosylaminoimidazole carboxylase ATPase subunit
bactF11493	89	100.00	70.69	70.69	pth peptidyl-tRNA hydrolase
bactF11494	89	100.00	70.99	70.99	dnaN DNA polymerase III subunit beta
bactF11495	89	100.00	70.69	70.69	recA recombinase A
bactF11496	78	87.06	35.51	67.84	trpC indole-3-glycerol phosphate synthase
bactF11498	88	98.82	70.41	70.73	gmk guanylate kinase
bactF11504	76	87.06	35.51	81.48	argH argininosuccinate lyase
bactF11506	75	75.29	13.86	36.48	sucC succinyl-CoA synthetase subunit beta
bactF11507	87	97.65	69.83	70.46	trmU tRNA (5-methylaminomethyl-2-thiouridylate)-methyltransferase
bactF11508	84	97.65	82.84	91.02	murD UDP-N-acetylmuramoyl-L-alanyl-D-glutamate synthetase
bactF11509	87	97.65	69.83	70.46	hypothetical protein
bactF11510	85	94.12	62.46	64.33	pgi glucose-6-phosphate isomerase
bactF11512	86	77.65	15.87	25.84	dnaQ DNA polymerase III epsilon subunit
bactF11513	79	80.00	20.19	39.31	thiE thiamine-phosphate pyrophosphorylase
bactF11515	86	96.47	69.23	70.19	hypothetical protein
bactF11516	82	77.65	16.73	26.59	DNA internalization-related competence protein ComEC/Rec2
bactF11517	85	90.59	47.10	50.43	ubiE ubiquinone/menaquinone biosynthesis methyltransferase
bactF11520	84	95.29	68.63	75.95	gpsA NAD(P)H-dependent glycerol-3-phosphate dehydrogenase
bactF11522	73	78.82	18.38	54.96	homoserine dehydrogenase
bactF11523	74	83.53	26.78	73.30	trpD anthranilate phosphoribosyltransferase
bactF11528	82	81.18	22.18	33.81	folE GTP cyclohydrolase I
bactF11529	85	97.65	82.84	83.22	murE UDP-N-acetylmuramoylalanyl-D-glutamate--2 6-diaminopimelate ligase
bactF11530	84	97.65	82.84	91.02	murG N-acetylglucosaminyl transferase
#bactF11531	85	100.00	100.00	100.00	priA primosome assembly protein PriA
bactF11533	74	85.88	32.32	89.89	hisB imidazoleglycerol-phosphate dehydratase
bactF11534	73	85.88	32.32	100.00	phosphoribosylformylglycinamidine synthase
#bactF11535	84	98.82	91.02	100.00	trmD tRNA (guanine-N(1)-)-methyltransferase
#bactF11536	84	98.82	91.02	100.00	rplU 50S ribosomal protein L21
#bactF11537	83	97.65	82.84	100.00	ruvB Holliday junction DNA helicase B
#bactF11538	83	97.65	82.84	100.00	radA DNA repair protein RadA
bactF11540	83	82.35	24.37	34.78	hypothetical protein
#bactF11541	83	97.65	82.84	100.00	coaE dephospho-CoA kinase
bactF11545	82	95.29	68.63	90.81	recombination factor protein RarA
bactF11546	81	88.24	39.02	57.97	aroK shikimate kinase
bactF11547	82	96.47	75.40	100.00	recN DNA repair protein RecN
bactF11548	82	75.29	13.86	28.30	ribonuclease Rne/Rng family
bactF11549	82	96.47	75.40	100.00	hypothetical protein
bactF11552	81	84.71	29.42	45.94	2-amino-4-hydroxy-6-hydroxymethyldihydropteridine pyrophosphokinase
bactF11553	81	89.41	42.87	62.64	murI glutamate racemase
bactF11554	81	94.12	62.46	90.71	UDP-N-acetylmuramoylalanyl-D-glutamyl-2 6-diaminopimelate--D-alanyl-D-alanyl ligase
bactF11555	81	88.24	39.02	57.97	rho transcription termination factor Rho
bactF11557	66	75.29	13.86	79.07	thiC thiamine biosynthesis protein ThiC
bactF11558	80	94.12	62.46	100.00	hypothetical protein
bactF11562	77	90.59	47.10	100.00	glmU UDP-N-acetylglucosamine pyrophosphorylase
bactF11568	78	82.35	24.37	48.35	ftsA cell division protein FtsA
bactF11569	77	85.88	32.32	67.52	proB gamma-glutamyl kinase
bactF11583	75	84.71	29.42	73.59	proA gamma-glutamyl phosphate reductase
bactF11584	76	84.71	29.42	66.41	panC pantoate--beta-alanine ligase
bactF11586	64	75.29	13.86	100.00	purL phosphoribosylformylglycinamidine synthase II
bactF11592	74	81.18	22.18	60.10	Mg chelatase-related protein
bactF11594	73	85.88	32.32	100.00	cmk cytidylate kinase
bactF11595	67	77.65	16.73	88.91	mutY A/G-specific adenine glycosylase
bactF11599	73	85.88	32.32	100.00	hypothetical protein
bactF11601	73	83.53	26.78	80.82	prmA ribosomal protein L11 methyltransferase
bactF11603	61	70.59	9.51	87.90	thiL thiamine-monophosphate kinase
bactF11611	71	82.35	24.37	89.49	accA acetyl-CoA carboxylase carboxyltransferase subunit alpha
bactF11613	71	75.29	13.86	48.33	hypothetical protein
bactF11615	63	74.12	12.61	100.00	smc chromosome segregation protein SMC
bactF11618	70	78.82	18.38	72.07	ispH 4-hydroxy-3-methylbut-2-enyl diphosphate reductase
bactF11620	70	74.12	12.61	49.11	hypoxanthine-guanine phosphoribosyltransferase
bactF11630	68	76.47	15.22	71.42	rnhA ribonuclease H
bactF11631	68	80.00	20.19	100.00	accD acetyl-CoA carboxylase subunit beta
bactF11635	67	75.29	13.86	71.08	accB acetyl-CoA carboxylase biotin carboxyl carrier protein
bactF11637	67	70.59	9.51	47.79	enoyl-(acyl carrier protein) reductase
bactF11648	66	77.65	16.73	100.00	GTPase EngB
bactF11650	65	76.47	15.22	100.00	hemE uroporphyrinogen decarboxylase
bactF11669	64	75.29	13.86	100.00	thiG thiazole synthase
bactF11676	62	71.76	10.45	88.08	apt adenine phosphoribosyltransferase
bactF11677	63	74.12	12.61	100.00	hypothetical protein
bactF11681	62	71.76	10.45	88.08	hypothetical protein
bactF17238	76	78.82	18.38	44.04	nadC nicotinate-nucleotide pyrophosphorylase
#bactF17299	83	97.65	82.84	100.00	purA adenylosuccinate synthetase
bactF17311	94	80.00	13.83	25.49	maf protein
#bactF17421	86	100.00	91.22	91.22	rpsH 30S ribosomal protein S8
#*bactF17450	86	98.82	83.21	83.39	rpsM 30S ribosomal protein S13
bactF17522	92	98.82	52.44	52.85	metK S-adenosylmethionine synthetase
bactF17582	87	98.82	76.39	76.64	rpoC DNA-directed RNA polymerase subunit beta'
#*bactF17583	83	96.47	75.40	90.92	rpoB DNA-directed RNA polymerase subunit beta
bactF17639	72	78.82	18.38	58.82	dut deoxyuridine 5'-triphosphate nucleotidohydrolase
bactF17655	91	76.47	11.16	17.23	formamidopyrimidine-DNA glycosylase
#*bactF17663	85	100.00	100.00	100.00	tsf elongation factor Ts
bactF17690	83	74.12	12.61	21.00	radC DNA repair protein RadC
bactF17824	82	95.29	68.63	90.81	atpG F0F1 ATP synthase subunit gamma
bactF17894	86	100.00	91.22	91.22	pheS phenylalanyl-tRNA synthetase subunit alpha
bactF17908	194	75.29	4.36	11.53	cold shock protein
bactF17983	69	78.82	18.38	79.85	hemH ferrochelatase
bactF18052	65	74.12	12.61	78.79	aroQ 3-dehydroquinate dehydratase
bactF18085	75	85.88	32.32	81.26	hemB delta-aminolevulinic acid dehydratase
#*bactF18161	89	97.65	59.56	60.24	pyrG CTP synthetase
bactF18167	187	94.12	8.65	10.57	DNA-binding protein HU
bactF18247	61	71.76	10.45	100.00	D-tyrosyl-tRNA deacylase
bactF18384	95	90.59	26.25	31.30	pgsA CDP-diacylglycerol--glycerol-3-phosphate 3-phosphatidyltransferase
#*bactF18443	85	98.82	91.02	91.12	pgk phosphoglycerate kinase
bactF18443B	85	98.82	91.02	91.12	tpiA triosephosphate isomerase
bactF18518	185	98.82	44.31	45.21	Polyprenyl synthetase
bactF18671	90	98.82	60.53	60.94	truA tRNA pseudouridine synthase A
#*bactF18679	87	100.00	83.56	83.56	rpsK 30S ribosomal protein S11
#bactF18711	86	100.00	91.22	91.22	rplR 50S ribosomal protein L18
bactF18724	364	71.76	3.03	10.17	efflux transporter RND family MFP subunit
bactF18779	77	88.24	39.02	81.69	argG argininosuccinate synthase
bactF18936	103	76.47	8.56	20.41	Periplasmic solute binding protein
bactF18970	76	76.47	15.22	35.59	cation efflux family protein
bactF18971	74	70.59	9.51	28.33	cation diffusion facilitator family transporter
bactF18983	116	90.59	13.73	18.06	pyrC dihydroorotase
bactF19008	94	100.00	50.42	50.42	pheT phenylalanyl-tRNA synthetase subunit beta
#*bactF19017	84	98.82	91.02	100.00	rplS 50S ribosomal protein L19
bactF19049	75	87.06	35.51	90.01	dapF diaminopimelate epimerase
bactF19091	78	85.88	32.32	61.59	hisD histidinol dehydrogenase
bactF19178	86	91.76	47.83	50.31	hypothetical protein
#*bactF19221	85	100.00	100.00	100.00	nusA transcription elongation factor NusA
#bactF19267	86	100.00	91.22	91.22	secY preprotein translocase subunit SecY
bactF19313	91	75.29	10.68	21.06	6-phosphofructokinase
bactF19542	236	100.00	35.41	35.41	aspS aspartyl-tRNA synthetase
bactF19613	63	74.12	12.61	100.00	bioB biotin synthase
bactF19618	70	80.00	20.19	79.83	hypothetical protein
bactF19638	67	71.76	10.45	51.20	lspA lipoprotein signal peptidase
bactF19647	81	95.29	68.63	100.00	aroC chorismate synthase
bactF19689	135	95.29	12.88	15.01	3-oxoacyl-(acyl carrier protein) synthase III
bactF19750	127	100.00	18.05	18.05	clpP ATP-dependent Clp protease proteolytic subunit
bactF19792	92	98.82	52.82	53.29	eno enolase
bactF19793	116	85.88	10.69	16.74	ilvD dihydroxy-acid dehydratase
#*bactF19849	86	100.00	91.22	91.22	rplP 50S ribosomal protein L16
bactF19916	69	81.18	22.18	100.00	hypothetical protein
bactF19931	119	100.00	19.92	19.92	ssb single-strand binding protein
bactF20164	73	84.71	29.42	89.76	hisG ATP phosphoribosyltransferase
bactF20289	106	78.82	8.56	16.51	DNA polymerase IV
bactF20308	111	91.76	18.77	24.59	hypothetical protein
bactF20343	197	94.12	6.57	7.61	Amidase
bactF20367	153	76.47	6.59	17.40	PpiC-type peptidyl-prolyl cis-trans isomerase
bactF20411	126	76.47	13.47	78.79	hypothetical protein
bactF20417	73	78.82	18.38	54.42	ilvH acetolactate synthase 3 regulatory subunit
bactF20428	86	98.82	83.21	83.39	coaD phosphopantetheine adenylyltransferase
bactF20436	74	85.88	32.32	89.89	hemC porphobilinogen deaminase
bactF20461	91	98.82	56.44	56.89	hypothetical protein
bactF20478	69	80.00	20.19	89.20	plsX fatty acid/phospholipid synthesis protein
#*bactF20480	85	100.00	100.00	100.00	rplM 50S ribosomal protein L13
bactF20507	75	84.71	29.42	73.59	panB 3-methyl-2-oxobutanoate hydroxymethyltransferase
#*bactF20724	85	98.82	91.02	91.12	rplF 50S ribosomal protein L6
bactF20735	87	82.35	21.50	27.38	gcvH glycine cleavage system H protein
bactF20742	167	98.82	54.75	59.41	pyrB aspartate carbamoyltransferase catalytic subunit
bactF20792	200	95.29	17.45	20.53	hypothetical protein
#*bactF20837	85	100.00	100.00	100.00	rpsI 30S ribosomal protein S9
bactF20895	65	75.29	13.86	88.59	nadA quinolinate synthetase
bactF20907	86	98.82	83.21	83.39	tig trigger factor
bactF20971	82	96.47	75.40	100.00	recR recombination protein RecR
bactF21058	86	100.00	91.22	91.22	nusG transcription antitermination protein NusG
bactF21117	74	76.47	15.22	42.60	phoU phosphate transport system regulatory protein PhoU
bactF21269	92	98.82	52.82	53.29	dnaA chromosomal replication initiation protein
bactF21320	70	82.35	24.37	100.00	rimM 16S rRNA processing protein rimM
bactF21359	88	100.00	76.87	76.87	ribF riboflavin biosynthesis protein RibF
bactF21462	112	71.76	5.79	21.17	Ald Alanine dehydrogenase
#*bactF21517	85	100.00	100.00	100.00	rplA 50S ribosomal protein L1
bactF21529	85	84.71	29.42	35.47	fabZ (3R)-hydroxymyristoyl-(acyl carrier protein) dehydratase
bactF21566	101	100.00	35.74	35.74	efp elongation factor P
bactF21687	118	100.00	20.08	20.08	groEL chaperonin GroEL
bactF21696	82	95.29	68.63	90.81	rplX 50S ribosomal protein L24
bactF21704	84	97.65	82.84	91.02	rnc ribonuclease III
bactF21732	103	87.06	16.39	23.92	sun Sun protein
#bactF21733	83	97.65	82.84	100.00	nusB transcription antitermination protein NusB
#*bactF21809	85	98.82	91.02	91.12	rpsJ 30S ribosomal protein S10
bactF21961	85	96.47	75.40	76.18	acpP acyl carrier protein
bactF22029	74	87.06	35.51	100.00	hypothetical protein
bactF22047	88	78.82	15.70	24.75	leucyl aminopeptidase
#bactF22056	86	100.00	91.22	91.22	rplO 50S ribosomal protein L15
bactF22162	81	95.29	68.63	100.00	rplJ 50S ribosomal protein L10
#*bactF22164	85	98.82	91.02	91.12	rpsS 30S ribosomal protein S19
bactF22267	99	98.82	35.38	35.90	groES co-chaperonin GroES
#bactF22312	86	98.82	83.21	83.39	rplQ 50S ribosomal protein L17
bactF22315	71	76.47	15.22	54.14	lipA lipoyl synthase
bactF22443	64	74.12	12.61	88.42	hypothetical protein
#*bactF22468	84	98.82	91.02	100.00	infC translation initiation factor IF-3
bactF22558	91	78.82	13.23	18.84	fructose-bisphosphate aldolase
bactF22567	77	87.06	35.51	74.15	hypothetical protein
bactF22605	61	71.76	10.45	100.00	glyQ glycyl-tRNA synthetase subunit alpha
bactF22610	98	85.88	17.53	25.41	hisI phosphoribosyl-AMP cyclohydrolase
bactF22704	61	71.76	10.45	100.00	rpsT 30S ribosomal protein S20
bactF22751	80	90.59	47.10	74.95	ribH riboflavin synthase subunit beta
bactF22862	65	71.76	10.45	63.13	ppk polyphosphate kinase
bactF22947	75	85.88	32.32	81.26	gidA tRNA uridine 5-carboxymethylaminomethyl modification enzyme GidA
#bactF23022	83	97.65	82.84	100.00	ruvA Holliday junction DNA helicase RuvA
#*bactF23090	85	100.00	100.00	100.00	rplL 50S ribosomal protein L7/L12
bactF23255	187	92.94	24.58	29.60	aconitate hydratase
bactF23256	80	83.53	26.78	45.00	leuD isopropylmalate isomerase small subunit
bactF23299	86	100.00	91.22	91.22	argS arginyl-tRNA synthetase
bactF6105	69	81.18	22.18	100.00	Holliday junction resolvase-like protein
bactF6109	143	89.41	12.64	20.63	mutS DNA mismatch repair protein
bactF6118	73	78.82	18.38	54.42	tatC Sec-independent protein translocase TatC
bactF6295	123	96.47	14.80	16.18	signal peptidase I
bactF6336	68	80.00	20.19	100.00	trmB tRNA (guanine-N(7))-methyltransferase
#*bactF6375	86	100.00	91.22	91.22	rpsE 30S ribosomal protein S5
bactF6392	184	98.82	27.67	28.52	rRNA methylase
bactF6393	63	74.12	12.61	100.00	RNA methyltransferase TrmH family group 2
bactF6506	98	100.00	40.20	40.20	putative inner membrane protein translocase component YidC
bactF6510	149	95.29	11.13	13.16	fur ferric uptake regulation protein
#bactF6533	85	100.00	100.00	100.00	rpsP 30S ribosomal protein S16
bactF6558	121	100.00	24.66	24.66	hisS histidyl-tRNA synthetase
#bactF6590	85	100.00	100.00	100.00	mraW S-adenosyl-methyltransferase MraW
bactF6622	101	74.12	7.05	15.52	glutamate dehydrogenase
bactF6647	92	97.65	48.76	49.77	asd aspartate-semialdehyde dehydrogenase
bactF6648	75	84.71	29.42	73.59	argC N-acetyl-gamma-glutamyl-phosphate reductase
bactF6693	78	88.24	39.02	74.42	recJ single-stranded-DNA-specific exonuclease RecJ
bactF6698	83	87.06	35.51	46.13	uppP undecaprenyl pyrophosphate phosphatase
bactF6725	79	90.59	47.10	82.10	xseA exodeoxyribonuclease VII large subunit
bactF6747	88	84.71	24.50	31.62	Superoxide dismutase
bactF6774	169	100.00	28.28	28.28	gltX glutamyl-tRNA synthetase
bactF6775	82	91.76	51.75	69.06	gatB aspartyl/glutamyl-tRNA amidotransferase subunit B
bactF6854	113	78.82	7.75	16.56	ABC transporter permease protein
#*bactF6908	86	100.00	91.22	91.22	rplE 50S ribosomal protein L5
bactF6955	89	98.82	65.16	65.53	ligA NAD-dependent DNA ligase LigA
bactF6969	65	75.29	13.86	88.59	argJ bifunctional ornithine acetyltransferase/N-acetylglutamate synthase protein
bactF7081	69	81.18	22.18	100.00	dxr 1-deoxy-D-xylulose 5-phosphate reductoisomerase
bactF7088	67	78.82	18.38	100.00	recF recombination protein F
#*bactF7097	86	100.00	91.22	91.22	rpsC 30S ribosomal protein S3
bactF7146	91	100.00	60.87	60.87	ksgA dimethyladenosine transferase
bactF7167	85	96.47	75.40	76.18	coaBC phosphopantothenoylcysteine decarboxylase/phosphopantothenate--cysteine ligase
bactF7182	81	92.94	56.85	82.49	dapB dihydrodipicolinate reductase
bactF7194	70	81.18	22.18	89.35	ispG 4-hydroxy-3-methylbut-2-en-1-yl diphosphate synthase
bactF7351	86	94.12	57.60	59.67	lgt prolipoprotein diacylglyceryl transferase
bactF7382	91	85.88	22.83	29.70	rnr ribonuclease R
bactF7487	75	87.06	35.51	90.01	ndk nucleoside diphosphate kinase
bactF7509	83	92.94	56.85	69.35	tgt queuine tRNA-ribosyltransferase
bactF7547	97	98.82	39.36	39.89	uppS undecaprenyl diphosphate synthase
bactF7565	98	98.82	37.51	38.06	folD methylenetetrahydrofolate dehydrogenase/methenyltetrahydrofolate cyclohydrolase
bactF7576	208	72.94	3.15	7.65	iron compound ABC transporter permease protein
bactF7581	100	100.00	37.22	37.22	alaS alanyl-tRNA synthetase
bactF7582	94	100.00	50.42	50.42	thrS threonyl-tRNA synthetase
bactF7583	88	100.00	76.87	76.87	proS prolyl-tRNA synthetase
#bactF7584	86	100.00	91.22	91.22	serS seryl-tRNA synthetase
bactF7609	109	94.12	20.83	24.41	greA transcription elongation factor GreA
bactF7645	83	96.47	75.40	90.92	hypothetical protein
#bactF7657	86	100.00	91.22	91.22	rpsQ 30S ribosomal protein S17
#*bactF7728	85	100.00	100.00	100.00	frr ribosome recycling factor
bactF7745	142	100.00	31.00	31.00	gyrB DNA gyrase subunit B
bactF7763	77	84.71	29.42	61.67	ilvC ketol-acid reductoisomerase
#*bactF7875	85	98.82	91.02	91.12	rplC 50S ribosomal protein L3
#bactF7911	85	100.00	100.00	100.00	rpsO 30S ribosomal protein S15
bactF7988	142	100.00	34.72	34.72	gyrA DNA gyrase subunit A
bactF8002	93	100.00	53.61	53.61	rpsD 30S ribosomal protein S4
#*bactF8007	86	100.00	91.22	91.22	smpB SsrA-binding protein
bactF8117	110	90.59	17.99	24.60	PhoH family protein
bactF8120	112	96.47	21.65	23.76	relA GTP pyrophosphokinase
bactF8137	153	85.88	31.00	68.17	hisA phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase
bactF8138	80	87.06	35.51	57.15	hisH imidazole glycerol phosphate synthase subunit HisH
bactF8219	89	92.94	42.10	44.58	proC pyrroline-5-carboxylate reductase
bactF8221	92	98.82	52.44	52.85	Phosphatidate cytidylyltransferase
bactF8238	71	83.53	26.78	100.00	4-diphosphocytidyl-2-C-methyl-D-erythritol kinase
bactF8259	553	90.59	4.25	5.46	transcriptional regulator LysR family
bactF8326	101	74.12	7.48	18.78	predicted inner membrane peptidase
bactF8344	90	100.00	65.82	65.82	infA translation initiation factor IF-1
bactF8417	92	100.00	56.79	56.79	secA preprotein translocase subunit SecA
bactF8482	144	81.18	9.14	21.99	UDP-glucose 6-dehydrogenase
bactF8533	80	92.94	56.85	90.60	nadD nicotinic acid mononucleotide adenyltransferase
bactF8534	73	85.88	32.32	100.00	hypothetical protein
bactF8549	88	100.00	76.87	76.87	tyrS tyrosyl-tRNA synthetase
bactF8550	90	100.00	65.82	65.82	trpS tryptophanyl-tRNA synthetase
#bactF8626	86	100.00	91.22	91.22	rplV 50S ribosomal protein L22
bactF8659	62	72.94	11.48	100.00	mazG nucleoside triphosphate pyrophosphohydrolase
bactF8667	76	89.41	42.87	100.00	hypothetical protein
bactF8696	60	70.59	9.51	100.00	atpH ATP synthase F1 delta subunit
bactF8702	101	97.65	29.17	30.05	glyA serine hydroxymethyltransferase
#*bactF8791	86	100.00	91.22	91.22	rplD 50S ribosomal protein L4
#*bactF8792	85	100.00	100.00	100.00	rplT 50S ribosomal protein L20
bactF8795	61	71.76	10.45	100.00	rbfA ribosome-binding factor A
#*bactF8827	86	100.00	91.22	91.22	rplB 50S ribosomal protein L2
bactF8838	64	75.29	13.86	100.00	hemA glutamyl-tRNA reductase
bactF8981	132	96.47	11.16	12.17	dapA dihydrodipicolinate synthase
bactF9061	86	97.65	75.90	76.42	rpsR 30S ribosomal protein S18
bactF9089	86	96.47	69.23	70.19	hypothetical protein
bactF9111	77	90.59	47.10	100.00	rpsF 30S ribosomal protein S6
bactF9178	105	100.00	30.63	30.63	ribose-phosphate pyrophosphokinase
bactF9202	138	98.82	16.65	17.35	glyceraldehyde-3-phosphate dehydrogenase
bactF9217	129	72.94	4.98	15.92	glpK glycerol kinase
#bactF9258	86	100.00	91.22	91.22	rpsG 30S ribosomal protein S7
#bactF9262	85	100.00	100.00	100.00	rplI 50S ribosomal protein L9
bactF9304	124	100.00	19.06	19.06	def peptide deformylase
bactF9353	111	74.12	5.96	15.46	nitrogen regulatory protein p-II
bactF9419	77	84.71	29.42	61.67	queA S-adenosylmethionine tRNA ribosyltransferase-isomerase
bactF9424	88	97.65	64.50	65.24	ftsZ cell division protein FtsZ
bactF9477	83	96.47	75.40	90.92	truB tRNA pseudouridine synthase B
bactF9500	356	81.18	8.75	18.12	putative monovalent cation/H+ antiporter subunit D
#bactF9641	85	97.65	82.84	83.22	rpsL 30S ribosomal protein S12
bactF9703	84	96.47	75.40	83.04	rnhB ribonuclease HII
bactF9872	87	97.65	69.68	70.15	murB UDP-N-acetylenolpyruvoylglucosamine reductase
bactF9900	132	88.24	8.35	12.02	glnA glutamine synthetase
bactF9905	61	71.76	10.45	100.00	hypothetical protein
bactF9916	77	90.59	47.10	100.00	rplW 50S ribosomal protein L23

##end###
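
The rows above can be screened programmatically. The following is a minimal Python sketch, not the program used to produce the table. It assumes each row is tab-separated with six fields (family ID, gene count, three percentage scores, description), that the third field is the universality score, and that any leading `#` or `*` characters on the ID are marker flags. The field names `n_genes`, `universality`, `score_a` and `score_b` are hypothetical labels and should be checked against the table's own column header.

```python
from typing import Iterable, List, NamedTuple, Optional


class FamilyRow(NamedTuple):
    family_id: str      # e.g. "bactF6375", with any leading flags removed
    flags: str          # leading "#" / "*" characters, kept verbatim
    n_genes: int        # assumed: second column is a gene count
    universality: float # assumed: third column is the universality score
    score_a: float      # hypothetical label for the fourth column
    score_b: float      # hypothetical label for the fifth column
    description: str


def parse_family_row(line: str) -> Optional[FamilyRow]:
    """Parse one table row; return None for blanks, end markers, or headers."""
    line = line.rstrip("\n")
    if not line.strip() or line.startswith("##"):
        return None
    fields = line.split("\t")
    if len(fields) < 6:
        return None
    raw_id = fields[0]
    family_id = raw_id.lstrip("#*")
    flags = raw_id[: len(raw_id) - len(family_id)]
    try:
        return FamilyRow(
            family_id,
            flags,
            int(fields[1]),
            float(fields[2]),
            float(fields[3]),
            float(fields[4]),
            "\t".join(fields[5:]),
        )
    except ValueError:
        # Non-numeric columns, e.g. a header line.
        return None


def select_by_universality(lines: Iterable[str], cutoff: float = 70.0) -> List[FamilyRow]:
    """Keep rows whose assumed universality column exceeds the cutoff."""
    rows = (parse_family_row(line) for line in lines)
    return [row for row in rows if row is not None and row.universality > cutoff]
```

Calling `select_by_universality(open(path))` on a file holding the table (where `path` is whatever file the rows were saved to) would return the parsed rows above the cutoff. The flags are kept in a separate field so that their meaning can be applied later without re-parsing.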

== ToDo1: Eukaryotes and Viruses ==

== ToDo2: Run Zorro on families to build HMMs ==

IMG genomes (Guillaume Jospin, Morgan Langille, Thomas Sharpton, Dongying Wu)

  • Summary
    • Selected 707 families from DY's 100-genome families (with universality > 70).
    • Ran hmmsearch with the HMMs of those 707 families against the bacterial and archaeal IMG sequences.
    • Filtered the results for 80% coverage.
    • Took the whole IMG database and filtered out the sequences that hit DY's 100-genome families.
    • Did all vs all blastp on the remaining sequences.
    • Filtered for 80% coverage on query and hit.
    • Identified protein families using MCL.
    • Regrouped all families (from this step forward, the 100-genome families and the IMG families go through the same steps).
    • Aligned the families using MUSCLE.
      • No representative selection for small families (size <= 250 members).
      • Tried to pick representatives for the very large families (up to 45,000 members, which are virtually impossible to align well) using DY's pick_rep_by_mcl.pl script.
  • In progress
    • Aligning some larger families (still running on merlot).
    • Determine a good quality-control metric or method to rate alignments (good alignments will provide usable HMMs). So far we are using DY's pro_ali_mask.pl script.
    • Create HMMs for the good alignments.
    • Find the bad alignments and a method to improve them.
      • If representatives need to be picked, a "seed" alignment will be built and the sequences aligned to it with hmmalign.
    • Update the MySQL DB with all the appropriate, current information.
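
The 80% coverage filter on query and hit mentioned in the summary can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the script used for the IMG run. It assumes BLAST tabular output in the default 12-column order (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), defines coverage as the aligned span divided by the full sequence length, and reads sequence lengths from a hypothetical `seq_lengths` dictionary that would have to be built separately from the FASTA input.

```python
from typing import Dict, Iterable, List, Sequence


def span_coverage(start: int, end: int, seq_length: int) -> float:
    """Fraction of a sequence covered by a 1-based, inclusive alignment span."""
    return (abs(end - start) + 1) / seq_length


def passes_coverage(
    fields: Sequence[str], seq_lengths: Dict[str, int], min_cov: float = 0.8
) -> bool:
    """True if the hit covers at least min_cov of both the query and the subject."""
    query, hit = fields[0], fields[1]
    qstart, qend, sstart, send = (int(x) for x in fields[6:10])
    return (
        span_coverage(qstart, qend, seq_lengths[query]) >= min_cov
        and span_coverage(sstart, send, seq_lengths[hit]) >= min_cov
    )


def filter_hits(
    lines: Iterable[str], seq_lengths: Dict[str, int], min_cov: float = 0.8
) -> List[List[str]]:
    """Return the tab-split hit rows that pass the coverage filter."""
    kept = []
    for line in lines:
        # Skip blank lines and comment lines (present in commented tabular output).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if passes_coverage(fields, seq_lengths, min_cov):
            kept.append(fields)
    return kept
```

Each line is treated as an independent hit, so several partial alignments between the same pair of sequences are not merged. Whether the original filter combined them is not stated in the notes.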