Gene Family Notes
100 genomes (Dongying Wu)
- Summary
- Took 100 genomes (bacterial and archaeal)
- Did all vs all blastp
- Identified protein families using MCL
- Annotated families according to evenness, universality, etc.
- Ranked families
To identify potential phylogenetic markers for metagenomic studies, we are developing methods to analyze gene distributions across organisms with complete genome sequences. We selected 85 bacterial and 15 archaeal genomes for the initial study. A maximum likelihood tree of 720 bacterial genomes, based on 31 concatenated gene markers, was used for the bacterial genome selection, while the archaeal genomes were selected from a tree of the archaeal radA genes. A program called maxPD was developed to automatically select taxa from a phylogenetic tree, listing the taxa according to their contributions to the phylogenetic diversity of the tree. The genomes that contributed the most phylogenetic diversity were selected. During selection, we chose only genomes with publications and avoided genomes that have undergone severe genome reduction.
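The idea of ranking taxa by their contribution to phylogenetic diversity can be sketched as a greedy procedure. The code below is only an illustration and is not the maxPD program itself: the tree encoding (a dictionary mapping each node to its parent and branch length), the function names, and the toy tree are all assumptions made for this example. At each step it picks the taxon whose path to the root adds the most branch length that is not yet covered by previously selected taxa.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical tree encoding: node -> (parent, length of the branch to the parent).
# The root has parent None. Names and lengths below are invented for illustration.
Tree = Dict[str, Tuple[Optional[str], float]]


def path_to_root(tree: Tree, node: str) -> List[str]:
    """Return the nodes from `node` up to, but not including, the root.

    Each returned node stands for the branch that connects it to its parent.
    """
    path = []
    while tree[node][0] is not None:
        path.append(node)
        node = tree[node][0]
    return path


def rank_by_pd(tree: Tree, taxa: List[str]) -> List[Tuple[str, float]]:
    """Greedily rank taxa by the new branch length each one adds."""
    covered = set()          # branches already spanned by selected taxa
    remaining = list(taxa)
    ranking = []
    while remaining:
        def gain(taxon: str) -> float:
            return sum(tree[n][1] for n in path_to_root(tree, taxon)
                       if n not in covered)

        best = max(remaining, key=gain)
        ranking.append((best, gain(best)))
        covered.update(path_to_root(tree, best))
        remaining.remove(best)
    return ranking


if __name__ == "__main__":
    toy_tree: Tree = {
        "root": (None, 0.0),
        "X": ("root", 1.0),
        "A": ("X", 0.1),
        "B": ("X", 0.2),
        "C": ("root", 2.0),
    }
    for taxon, added in rank_by_pd(toy_tree, ["A", "B", "C"]):
        print(taxon, round(added, 3))
```

In the toy tree, A and B are close relatives, so once one of them is chosen the other adds very little diversity and falls to the end of the ranking. Taking the top of such a ranking is one way to obtain a phylogenetically broad genome set.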
A protocol has been developed to classify gene families automatically. The protocol is based on blastall and the Markov clustering algorithm (MCL), so it is automated and robust. We used an E-value cutoff of 1e-10 in our test run. For the 313139 genes from the 100 selected genomes, we identified 23336 gene families that include a total of 239453 genes. To select gene families as potential markers automatically, we developed a program that not only evaluates the universality of the gene families but also estimates the evenness of the distribution of the genes across a selected group of genomes (see the figure below).
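Two steps of this protocol can be sketched in a few lines: filtering BLAST hits at the E-value cutoff before clustering, and scoring the universality of a resulting family. This is a minimal sketch under stated assumptions, not the actual pipeline. It assumes 12-column tab-delimited BLAST output (E-value in the 11th column), uses -log10(E-value) capped at 200 as the edge weight (a common convention, not something stated in these notes), and takes universality to be the percentage of genomes with at least one family member, which appears consistent with the bacterial table (for example, 84 of 85 genomes gives 98.82). The function names and the `gene_to_genome` mapping are hypothetical.

```python
import math
from typing import Dict, Iterable, Iterator, List, Set, Tuple

E_VALUE_CUTOFF = 1e-10  # cutoff quoted in the text
MAX_WEIGHT = 200.0      # assumed cap so that an E-value of 0 stays finite


def blast_to_edges(lines: Iterable[str],
                   cutoff: float = E_VALUE_CUTOFF
                   ) -> Iterator[Tuple[str, str, float]]:
    """Turn tabular BLAST hits into (query, subject, weight) edges.

    Self-hits and hits with an E-value above `cutoff` are dropped.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        query, subject, evalue = cols[0], cols[1], float(cols[10])
        if query == subject or evalue > cutoff:
            continue
        if evalue == 0.0:
            weight = MAX_WEIGHT
        else:
            weight = min(MAX_WEIGHT, -math.log10(evalue))
        yield query, subject, weight


def universality(family: List[str],
                 gene_to_genome: Dict[str, str],
                 genomes: Set[str]) -> float:
    """Percentage of `genomes` that contain at least one gene of `family`."""
    present = {gene_to_genome[gene] for gene in family} & genomes
    return 100.0 * len(present) / len(genomes)


if __name__ == "__main__":
    hits = [
        "g1\tg2\t90.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t200",
        "g1\tg3\t30.0\t50\t30\t2\t1\t50\t1\t50\t1e-5\t40",
    ]
    # Each edge can be written as "query<TAB>subject<TAB>weight" for clustering.
    for query, subject, weight in blast_to_edges(hits):
        print(f"{query}\t{subject}\t{weight:.1f}")
```

The three-column edge list is the kind of label-based input that clustering tools such as MCL accept. Restricting `genomes` to a subset gives the universality of a family within a particular group of genomes.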
Across all 100 selected genomes, we identified 30 gene families that appear in almost all the genomes with even gene distributions. Twelve of them were ribosomal protein subunits that were already included in our gene marker collections, which indicates that the approach works in principle. The rest of the candidates are mostly tRNA synthetases, among others. Different cutoffs will be applied to select more potential markers, and the evenness and distribution estimation program was designed so that potential markers for specific phylogenetic groups (for example, a phylum or a class) can be easily identified.
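Applying cutoffs to the family statistics can be illustrated with a small parser for rows in the format used by the bacterial family table in these notes (optional # or #* marker, family ID, family size, universality, evenness, independent evenness, description). This is a sketch only: it assumes one row per line, the `Family` type and function names are hypothetical, and the cutoff values of 95 and 90 are arbitrary examples rather than the cutoffs used in the study.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class Family(NamedTuple):
    marker: str                  # "" (none), "#" (phylogenetic marker), "#*" (AMPHORA marker)
    family_id: str
    size: int
    universality: float
    evenness: float
    independent_evenness: float
    description: str


# marker? id size universality evenness independent_evenness description
ROW = re.compile(
    r"^(#\*?)?(\S+)\s+(\d+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+(.*)$"
)


def parse_row(line: str) -> Optional[Family]:
    """Parse one table row; return None for lines that do not fit the format."""
    match = ROW.match(line.strip())
    if match is None:
        return None
    marker, fid, size, uni, even, indep, desc = match.groups()
    return Family(marker or "", fid, int(size), float(uni),
                  float(even), float(indep), desc.strip())


def select_candidates(families: Iterable[Family],
                      min_universality: float = 95.0,   # example cutoff
                      min_evenness: float = 90.0        # example cutoff
                      ) -> List[Family]:
    """Keep families that pass both cutoffs."""
    return [f for f in families
            if f.universality >= min_universality
            and f.evenness >= min_evenness]


if __name__ == "__main__":
    rows = [
        "#*bactF10964 85 100.00 100.00 100.00 rplK 50S ribosomal protein L11",
        "bactF10345 106 100.00 29.76 29.76 polA DNA polymerase I",
        "#bactF11403 85 98.82 91.02 91.12 guaA bifunctional GMP synthase/glutamine amidotransferase protein",
        "bactF10640 79 74.12 12.61 29.10 ribosome-associated GTPase",
    ]
    parsed = [f for f in map(parse_row, rows) if f is not None]
    for fam in select_candidates(parsed):
        print(fam.family_id, fam.description)
```

Loosening or tightening the two thresholds changes how many families pass, which is the kind of cutoff adjustment described above. Incomplete rows are skipped because they do not match the pattern.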
Bacterial Families
We identified 503 bacterial families with universality >= 70. Of these, 56 can serve as phylogenetic markers (31 AMPHORA markers). The families (sequences, alignments, HMMs) can be downloaded from http://www.biotorrents.net/details.php?id=39 (the password for unzipping is iseem). The family characteristics are listed below (#: phylogenetic markers, #*: AMPHORA markers).
The columns are:
[0] family_ID
[1] family size
[2] universality
[3] evenness
[4] independent_evenness (independent of universality)
[5] family description
bactF10020 93 98.82 49.59 50.10 hypothetical protein
bactF10042 105 92.94 21.21 25.18 hypothetical protein
#*bactF10154 84 98.82 91.02 100.00 rpsB 30S ribosomal protein S2
bactF10345 106 100.00 29.76 29.76 polA DNA polymerase I
bactF10373 77 88.24 39.02 81.69 trpA tryptophan synthase subunit alpha
bactF10442 69 80.00 20.19 89.20 hypothetical protein
bactF10511 154 100.00 21.01 21.01 Peptidase M24
bactF10512 120 100.00 20.91 20.91 methionine aminopeptidase
bactF10567 90 95.29 47.10 49.11 atpB F0F1 ATP synthase subunit A
#*bactF10580 86 100.00 91.22 91.22 rplN 50S ribosomal protein L14
bactF10640 79 74.12 12.61 29.10 ribosome-associated GTPase
bactF10684 89 100.00 70.99 70.99 miaA tRNA delta(2)-isopentenylpyrophosphate transferase
bactF10712 75 87.06 35.51 90.01 50S ribosomal protein L25/general stress protein Ctc
bactF10714 73 85.88 32.32 100.00 ruvC Holliday junction resolvase
bactF10736 67 77.65 16.73 88.91 mutL DNA mismatch repair protein
bactF10744 124 81.18 9.40 22.51 hydrolase carbon-nitrogen family
bactF10745 89 98.82 65.16 65.53 nadE NAD synthetase
bactF10759 78 76.47 15.22 33.72 hypothetical protein
bactF10881 83 95.29 68.63 82.86 gidB glucose-inhibited division protein B
bactF10918 161 100.00 34.43 34.43 chaperone protein dnaJ
bactF10945 86 98.82 83.21 83.39 GrpE protein
#*bactF10964 85 100.00 100.00 100.00 rplK 50S ribosomal protein L11
bactF10992 81 90.59 47.10 68.76 riboflavin synthase subunit alpha
#*bactF11013 85 100.00 100.00 100.00 rpmA 50S ribosomal protein L27
bactF11020 77 71.76 10.45 27.78 NAD-dependent deacetylase
bactF11067 138 72.94 3.88 9.13 putative peptidase
bactF11068 103 76.47 8.28 18.64 succinyl-diaminopimelate desuccinylase
bactF11191 4632 100.00 16.12 16.12 ABC transporter ATP-binding protein
bactF11192 1975 100.00 5.41 5.41 sensor histidine kinase
bactF11193 1298 100.00 5.29 5.29 short chain dehydrogenase
bactF11194 1072 100.00 9.67 9.67 two-component response regulator
bactF11195 1038 77.65 2.68 5.24 hypothetical protein
bactF11196 794 97.65 4.59 4.88 hypothetical protein
bactF11199 623 100.00 53.21 53.21 infB translation initiation factor IF-2
bactF11200 660 70.59 1.45 3.32 hypothetical protein
bactF11201 500 97.65 10.42 11.30 Glycosyl transferase group 1
bactF11203 515 91.76 3.32 4.28 Aldehyde dehydrogenase
bactF11204 488 98.82 16.12 16.79 transcriptional regulator GntR family
bactF11205 489 77.65 2.63 5.64 3-hydroxybutyryl-CoA dehydrogenase
bactF11206 447 97.65 14.10 15.20 NAD-dependent epimerase/dehydratase
bactF11207 397 97.65 14.69 15.97 cation transport ATPase
bactF11208 396 95.29 14.29 17.13 NADH oxidase
bactF11209 424 83.53 3.09 4.91 two-component response regulator
bactF11210 429 98.82 4.78 4.88 beta-ketoacyl synthase
bactF11211 380 96.47 12.71 14.28 argD acetylornithine aminotransferase
bactF11212 391 74.12 1.77 3.51 permease of the major facilitator superfamily
bactF11213 410 82.35 3.59 7.08 AcrB/AcrD/AcrF family protein
bactF11214 356 83.53 3.78 6.35 putative anaerobic dehydrogenase
bactF11215 327 85.88 4.77 7.29 alcohol dehydrogenase zinc-containing
bactF11218 286 96.47 16.75 18.96 D-isomer specific 2-hydroxyacid dehydrogenase family protein
bactF11219 256 100.00 21.84 21.84 nucleotidyl transferase
bactF11220 262 100.00 81.98 81.98 ileS isoleucyl-tRNA synthetase
bactF11221 279 81.18 3.69 6.77 extracellular solute-binding protein family 5
bactF11222 258 90.59 5.38 6.68 binding-protein-dependent transport systems inner membrane component
bactF11223 257 89.41 5.62 7.18 oligopeptide ABC transporter permease protein
bactF11225 261 74.12 2.68 6.20 oxidoreductase aldo/keto reductase family
bactF11226 254 100.00 34.71 34.71 fliI flagellum-specific ATP synthase
bactF11227 266 98.82 22.83 23.61 aminotransferase class V
bactF11230 212 100.00 12.45 12.45 ATPase AAA family
bactF11231 267 98.82 14.42 14.93 RNA polymerase sigma factor
bactF11232 253 95.29 13.20 15.22 ATP-dependent RNA helicase
bactF11233 263 70.59 2.66 8.15 RNA polymerase sigma-70 factor ECF subfamily
bactF11234 254 98.82 25.31 26.24 hypothetical protein
bactF11235 253 100.00 62.88 62.88 trmE tRNA modification GTPase TrmE
bactF11237 247 83.53 9.27 17.58 hypothetical protein
bactF11241 215 74.12 3.64 8.53 Oxidoreductase
bactF11242 227 97.65 14.13 15.41 serine protease
bactF11243 221 94.12 15.33 18.51 xerC site-specific tyrosine recombinase XerC
bactF11245 198 98.82 42.02 43.55 ffh signal recognition particle protein
bactF11246 226 100.00 24.34 24.34 ATPase AAA-2 domain protein
bactF11247 196 89.41 8.81 12.45 acetolactate synthase large subunit
bactF11249 182 100.00 51.67 51.67 purF amidophosphoribosyltransferase
bactF11250 192 97.65 7.23 7.62 hypothetical protein
bactF11251 214 72.94 4.04 12.84 general secretion pathway protein E
bactF11253 198 89.41 13.36 20.19 cystathionine gamma-synthase
bactF11254 211 95.29 15.51 18.42 penicillin-binding protein
bactF11255 204 100.00 26.57 26.57 UvrD/REP helicase
bactF11256 182 98.82 38.20 39.15 Phosphomannomutase
bactF11258 183 91.76 12.01 14.58 phosphoenolpyruvate-protein phosphotransferase
bactF11259 204 97.65 22.71 23.93 penicillin-binding protein
bactF11260 185 70.59 3.71 12.87 glycosyl transferase group 2 family protein
bactF11261 186 92.94 21.51 26.27 trpE anthranilate synthase component I
bactF11262 173 92.94 14.96 18.02 hypothetical protein
bactF11263 174 71.76 4.27 12.97 amino acid permease family protein
bactF11265 176 83.53 7.13 11.77 oxidoreductase
bactF11266 179 98.82 24.17 24.74 thioredoxin
bactF11267 180 95.29 16.59 18.38 cysteine synthase
bactF11268 178 92.94 23.95 30.54 hypothetical protein
bactF11269 166 85.88 30.87 41.48 phosphate ABC transporter permease protein
bactF11271 182 92.94 19.14 22.72 peptide methionine sulfoxide reductase
bactF11272 182 94.12 13.13 15.39 pyruvate carboxylase
bactF11273 170 100.00 100.00 100.00 obgE GTPase ObgE
bactF11274 188 91.76 12.59 15.55 peptidase M16 family
bactF11275 156 100.00 54.90 54.90 DNA polymerase III subunits gamma and tau
bactF11276 164 76.47 5.13 8.88 MATE efflux family protein
bactF11279 158 76.47 7.93 21.81 uroporphyrin-III C-methyltransferase
bactF11280 157 98.82 17.65 18.59 thioredoxin reductase
bactF11281 174 76.47 6.16 14.26 DegT/DnrJ/EryC1/StrS aminotransferase
bactF11282 177 78.82 1.87 2.54 hypothetical protein
bactF11283 178 97.65 44.60 45.73 ftsW Cell division protein FtsW
bactF11284 164 94.12 25.23 32.49 sdhA succinate dehydrogenase flavoprotein subunit
bactF11285 169 77.65 5.19 10.05 glycosyl hydrolase family 3
bactF11287 171 100.00 79.36 79.36 prfA peptide chain release factor 1
bactF11289 154 90.59 11.23 17.10 ParA family protein
bactF11290 149 71.76 4.47 13.13 Na+/solute symporter
bactF11293 158 81.18 7.16 14.69 inositol monophosphatase family protein
bactF11295 164 100.00 72.81 72.81 mfd transcription-repair coupling factor
bactF11296 157 78.82 10.95 26.48 branched-chain alpha-keto acid dehydrogenase subunit E2
bactF11297 143 91.76 14.68 21.47 hisC histidinol-phosphate aminotransferase
bactF11299 160 82.35 10.54 19.20 hypothetical protein
bactF11300 133 85.88 10.18 18.62 leuA 2-isopropylmalate synthase
bactF11301 160 97.65 16.19 17.97 Pseudouridine synthase Rsu
bactF11303 134 88.24 12.53 21.23 leuB 3-isopropylmalate dehydrogenase
bactF11304 156 97.65 28.39 32.38 mraY phospho-N-acetylmuramoyl-pentapeptide-transferase
bactF11305 125 100.00 16.92 16.92 topA DNA topoisomerase I
bactF11306 138 78.82 5.99 14.29 Sodium/hydrogen exchanger
bactF11307 144 96.47 29.88 36.49 aroA 3-phosphoshikimate 1-carboxyvinyltransferase
bactF11311 144 89.41 14.01 23.66 pta phosphate acetyltransferase
bactF11314 146 100.00 10.22 10.22 dnaK molecular chaperone DnaK
bactF11315 141 97.65 17.89 19.68 ilvE branched-chain amino acid aminotransferase
bactF11317 134 87.06 11.41 20.24 hypothetical protein
bactF11318 141 90.59 26.88 49.29 secD preprotein translocase subunit SecD
bactF11319 145 96.47 24.37 29.25 rpsA 30S ribosomal protein S1
bactF11322 137 100.00 17.37 17.37 uvrA excinuclease ABC subunit A
bactF11323 137 95.29 18.97 23.38 purN phosphoribosylglycinamide formyltransferase
bactF11324 138 97.65 25.13 28.15 clpX ATP-dependent protease ATP-binding subunit
bactF11327 130 89.41 14.63 23.62 peptidyl-prolyl cis-trans isomerase
bactF11328 135 94.12 12.57 15.35 Phospholipid/glycerol acyltransferase
bactF11330 131 80.00 8.55 22.97 Undecaprenyl-phosphate galactosephosphotransferase
bactF11331 132 96.47 21.10 24.55 coproporphyrinogen III oxidase
bactF11332 118 95.29 17.01 19.37 aspartate kinase
bactF11333 131 100.00 22.60 22.60 polyA polymerase family protein
bactF11334 117 89.41 13.46 19.02 ribA GTP cyclohydrolase II
bactF11335 129 84.71 16.51 41.32 ispD 2-C-methyl-D-erythritol 4-phosphate cytidylyltransferase
bactF11338 120 96.47 17.12 18.86 lysA diaminopimelate decarboxylase
bactF11340 119 78.82 6.61 13.77 gltD glutamate synthase subunit beta
bactF11342 111 100.00 23.86 23.86 hydrolase TatD family
bactF11345 108 76.47 7.86 19.85 permease of the major facilitator superfamily
bactF11346 115 98.82 20.86 21.43 inositol-5-monophosphate dehydrogenase
bactF11347 114 85.88 11.54 18.36 gltB glutamate synthase large subunit
bactF11348 123 97.65 17.74 18.99 tRNA-dihydrouridine synthase A
bactF11350 108 78.82 8.93 19.56 hypothetical protein
bactF11351 120 77.65 6.58 15.65 bioF 8-amino-7-oxononanoate synthase
bactF11353 111 97.65 23.58 25.01 ribonucleotide-diphosphate reductase subunit alpha
bactF11357 111 81.18 9.60 18.88 gltA citrate synthase
bactF11359 120 83.53 9.18 16.62 carboxyl-terminal protease
bactF11360 102 80.00 9.66 15.83 bcp bacterioferritin comigratory protein
bactF11362 105 76.47 7.57 16.06 ATP-dependent protease La
bactF11364 98 89.41 21.84 27.91 tryptophan synthase subunit beta
bactF11366 105 80.00 9.68 18.27 ammonium transporter
bactF11367 109 92.94 18.96 22.83 methylated-DNA--protein-cysteine methyltransferase
bactF11368 117 100.00 23.06 23.06 dnaE DNA polymerase III subunit alpha
bactF11371 113 77.65 7.57 18.01 hypothetical protein
bactF11372 99 78.82 10.94 22.20 peroxiredoxin
bactF11374 112 70.59 4.40 12.19 sodium dicarboxylate symporter family protein
bactF11377 93 100.00 52.75 52.75 metG methionyl-tRNA synthetase
bactF11383 94 97.65 42.14 42.92 nth endonuclease III
bactF11386 101 94.12 23.70 25.89 aroE shikimate 5-dehydrogenase
bactF11390 107 95.29 21.91 24.27 transketolase
bactF11391 112 97.65 24.33 25.98 murC UDP-N-acetylmuramate--L-alanine ligase
bactF11393 112 92.94 15.96 18.81 Cell division FtsK/SpoIIIE
bactF11395 107 87.06 14.30 20.87 hypothetical protein
bactF11396 98 97.65 34.14 35.08 carB carbamoyl phosphate synthase large subunit
bactF11397 91 92.94 36.37 38.23 hypothetical protein
bactF11398 105 74.12 6.98 18.71 recQ ATP-dependent DNA helicase RecQ
bactF11399 109 100.00 28.25 28.25 N5-glutamine S-adenosyl-L-methionine-dependent methyltransferase
bactF11400 94 100.00 50.42 50.42 cysS cysteinyl-tRNA synthetase
bactF11401 107 100.00 27.35 27.35 D-alanine--D-alanine ligase
bactF11402 103 85.88 14.44 20.93 dxs 1-deoxy-D-xylulose-5-phosphate synthase
#bactF11403 85 98.82 91.02 91.12 guaA bifunctional GMP synthase/glutamine amidotransferase protein
bactF11405 96 98.82 41.42 41.94 purB adenylosuccinate lyase
bactF11406 104 91.76 19.76 23.79 RNA methyltransferase TrmA family
bactF11407 97 75.29 9.66 26.76 gcvP glycine dehydrogenase
bactF11410 105 89.41 17.18 22.70 ParB-like partition protein
bactF11413 91 85.88 22.98 30.35 Phosphomethylpyrimidine kinase
bactF11414 90 95.29 47.10 49.11 ppnK inorganic polyphosphate/ATP-NAD kinase
bactF11417 97 98.82 38.90 39.39 adk adenylate kinase
bactF11421 97 82.35 14.77 25.46 exodeoxyribonuclease III
bactF11422 88 100.00 76.87 76.87 gcp O-sialoglycoprotein endopeptidase
bactF11423 93 97.65 45.50 46.44 HIT family protein
bactF11424 101 80.00 11.80 25.06 exopolyphosphatase
bactF11425 103 97.65 28.44 29.63 murA UDP-N-acetylglucosamine 1-carboxyvinyltransferase
bactF11427 91 97.65 52.01 52.97 carA carbamoyl phosphate synthase small subunit
bactF11428 91 84.71 21.17 29.24 argB acetylglutamate kinase
bactF11429 88 96.47 59.08 60.32 putative deoxyribonucleotide triphosphate pyrophosphatase
bactF11430 97 88.24 20.33 25.80 fumC fumarate hydratase
bactF11432 90 90.59 33.53 37.70 pyruvate kinase
bactF11433 89 87.06 27.67 34.13 trpG anthranilate synthase component II
bactF11434 94 76.47 9.61 14.42 hypothetical protein
bactF11436 93 76.47 10.57 18.87 threonine dehydratase
bactF11437 96 78.82 11.31 19.73 kdsA 2-dehydro-3-deoxyphosphooctonate aldolase
bactF11439 99 96.47 31.32 33.10 alr alanine racemase
bactF11442 97 100.00 40.80 40.80 rpe ribulose-phosphate 3-epimerase
bactF11443 91 77.65 11.60 14.02 gcvT glycine cleavage system aminomethyltransferase T
bactF11444 94 98.82 46.72 47.25 uvrC excinuclease ABC subunit C
#bactF11446 85 100.00 100.00 100.00 pyrH uridylate kinase
bactF11449 85 97.65 82.84 83.22 purE phosphoribosylaminoimidazole carboxylase catalytic subunit
bactF11450 84 89.41 42.87 50.07 ribD riboflavin biosynthesis protein RibD
bactF11451 97 100.00 42.74 42.74 fmt methionyl-tRNA formyltransferase
bactF11454 96 78.82 11.68 21.98 mreB rod shape-determining protein MreB
bactF11455 95 78.82 11.42 18.47 cysE serine acetyltransferase
bactF11461 95 97.65 40.59 41.62 pnp polynucleotide phosphorylase/polyadenylase
bactF11462 96 100.00 43.12 43.12 dnaB replicative DNA helicase
bactF11464 84 89.41 42.87 50.07 hflX GTP-binding protein hflX
bactF11467 83 95.29 68.63 82.86 purD phosphoribosylamine--glycine ligase
bactF11472 82 96.47 75.40 100.00 purM phosphoribosylaminoimidazole synthetase
bactF11475 80 76.47 15.22 30.12 succinyl-CoA synthetase subunit alpha
bactF11478 91 98.82 56.44 56.89 FolC Folylpolyglutamate synthase
bactF11480 79 89.41 42.87 74.69 tmk thymidylate kinase
bactF11482 87 97.65 69.83 70.46 purH bifunctional phosphoribosylaminoimidazolecarboxamide formyltransferase/IMP cyclohydrolase
bactF11483 87 91.76 44.47 47.58 aroB 3-dehydroquinate synthase
bactF11484 86 92.94 52.54 55.24 folP dihydropteroate synthase
bactF11485 87 100.00 83.56 83.56 uvrB excinuclease ABC subunit B
bactF11486 79 89.41 42.87 74.69 Biotin--acetyl-CoA-carboxylase ligase
bactF11487 79 83.53 26.78 48.18 prephenate dehydratase
bactF11489 90 100.00 65.82 65.82 rpoA DNA-directed RNA polymerase subunit alpha
#*bactF11490 90 100.00 65.13 65.13 dnaG DNA primase
bactF11492 81 75.29 13.86 29.14 purK phosphoribosylaminoimidazole carboxylase ATPase subunit
bactF11493 89 100.00 70.69 70.69 pth peptidyl-tRNA hydrolase
bactF11494 89 100.00 70.99 70.99 dnaN DNA polymerase III subunit beta
bactF11495 89 100.00 70.69 70.69 recA recombinase A
bactF11496 78 87.06 35.51 67.84 trpC indole-3-glycerol phosphate synthase
bactF11498 88 98.82 70.41 70.73 gmk guanylate kinase
bactF11504 76 87.06 35.51 81.48 argH argininosuccinate lyase
bactF11506 75 75.29 13.86 36.48 sucC succinyl-CoA synthetase subunit beta
bactF11507 87 97.65 69.83 70.46 trmU tRNA (5-methylaminomethyl-2-thiouridylate)-methyltransferase
bactF11508 84 97.65 82.84 91.02 murD UDP-N-acetylmuramoyl-L-alanyl-D-glutamate synthetase
bactF11509 87 97.65 69.83 70.46 hypothetical protein
bactF11510 85 94.12 62.46 64.33 pgi glucose-6-phosphate isomerase
bactF11512 86 77.65 15.87 25.84 dnaQ DNA polymerase III epsilon subunit
bactF11513 79 80.00 20.19 39.31 thiE thiamine-phosphate pyrophosphorylase
bactF11515 86 96.47 69.23 70.19 hypothetical protein
bactF11516 82 77.65 16.73 26.59 DNA internalization-related competence protein ComEC/Rec2
bactF11517 85 90.59 47.10 50.43 ubiE ubiquinone/menaquinone biosynthesis methyltransferase
bactF11520 84 95.29 68.63 75.95 gpsA NAD(P)H-dependent glycerol-3-phosphate dehydrogenase
bactF11522 73 78.82 18.38 54.96 homoserine dehydrogenase
bactF11523 74 83.53 26.78 73.30 trpD anthranilate phosphoribosyltransferase
bactF11528 82 81.18 22.18 33.81 folE GTP cyclohydrolase I
bactF11529 85 97.65 82.84 83.22 murE UDP-N-acetylmuramoylalanyl-D-glutamate--2 6-diaminopimelate ligase
bactF11530 84 97.65 82.84 91.02 murG N-acetylglucosaminyl transferase
#bactF11531 85 100.00 100.00 100.00 priA primosome assembly protein PriA
bactF11533 74 85.88 32.32 89.89 hisB imidazoleglycerol-phosphate dehydratase
bactF11534 73 85.88 32.32 100.00 phosphoribosylformylglycinamidine synthase
#bactF11535 84 98.82 91.02 100.00 trmD tRNA (guanine-N(1)-)-methyltransferase
#bactF11536 84 98.82 91.02 100.00 rplU 50S ribosomal protein L21
#bactF11537 83 97.65 82.84 100.00 ruvB Holliday junction DNA helicase B
#bactF11538 83 97.65 82.84 100.00 radA DNA repair protein RadA
bactF11540 83 82.35 24.37 34.78 hypothetical protein
#bactF11541 83 97.65 82.84 100.00 coaE dephospho-CoA kinase
bactF11545 82 95.29 68.63 90.81 recombination factor protein RarA
bactF11546 81 88.24 39.02 57.97 aroK shikimate kinase
bactF11547 82 96.47 75.40 100.00 recN DNA repair protein RecN
bactF11548 82 75.29 13.86 28.30 ribonuclease Rne/Rng family
bactF11549 82 96.47 75.40 100.00 hypothetical protein
bactF11552 81 84.71 29.42 45.94 2-amino-4-hydroxy-6-hydroxymethyldihydropteridine pyrophosphokinase
bactF11553 81 89.41 42.87 62.64 murI glutamate racemase
bactF11554 81 94.12 62.46 90.71 UDP-N-acetylmuramoylalanyl-D-glutamyl-2 6-diaminopimelate--D-alanyl-D-alanyl ligase
bactF11555 81 88.24 39.02 57.97 rho transcription termination factor Rho
bactF11557 66 75.29 13.86 79.07 thiC thiamine biosynthesis protein ThiC
bactF11558 80 94.12 62.46 100.00 hypothetical protein
bactF11562 77 90.59 47.10 100.00 glmU UDP-N-acetylglucosamine pyrophosphorylase
bactF11568 78 82.35 24.37 48.35 ftsA cell division protein FtsA
bactF11569 77 85.88 32.32 67.52 proB gamma-glutamyl kinase
bactF11583 75 84.71 29.42 73.59 proA gamma-glutamyl phosphate reductase
bactF11584 76 84.71 29.42 66.41 panC pantoate--beta-alanine ligase
bactF11586 64 75.29 13.86 100.00 purL phosphoribosylformylglycinamidine synthase II
bactF11592 74 81.18 22.18 60.10 Mg chelatase-related protein
bactF11594 73 85.88 32.32 100.00 cmk cytidylate kinase
bactF11595 67 77.65 16.73 88.91 mutY A/G-specific adenine glycosylase
bactF11599 73 85.88 32.32 100.00 hypothetical protein
bactF11601 73 83.53 26.78 80.82 prmA ribosomal protein L11 methyltransferase
bactF11603 61 70.59 9.51 87.90 thiL thiamine-monophosphate kinase
bactF11611 71 82.35 24.37 89.49 accA acetyl-CoA carboxylase carboxyltransferase subunit alpha
bactF11613 71 75.29 13.86 48.33 hypothetical protein
bactF11615 63 74.12 12.61 100.00 smc chromosome segregation protein SMC
bactF11618 70 78.82 18.38 72.07 ispH 4-hydroxy-3-methylbut-2-enyl diphosphate reductase
bactF11620 70 74.12 12.61 49.11 hypoxanthine-guanine phosphoribosyltransferase
bactF11630 68 76.47 15.22 71.42 rnhA ribonuclease H
bactF11631 68 80.00 20.19 100.00 accD acetyl-CoA carboxylase subunit beta
bactF11635 67 75.29 13.86 71.08 accB acetyl-CoA carboxylase biotin carboxyl carrier protein
bactF11637 67 70.59 9.51 47.79 enoyl-(acyl carrier protein) reductase
bactF11648 66 77.65 16.73 100.00 GTPase EngB
bactF11650 65 76.47 15.22 100.00 hemE uroporphyrinogen decarboxylase
bactF11669 64 75.29 13.86 100.00 thiG thiazole synthase
bactF11676 62 71.76 10.45 88.08 apt adenine phosphoribosyltransferase
bactF11677 63 74.12 12.61 100.00 hypothetical protein
bactF11681 62 71.76 10.45 88.08 hypothetical protein
bactF17238 76 78.82 18.38 44.04 nadC nicotinate-nucleotide pyrophosphorylase
#bactF17299 83 97.65 82.84 100.00 purA adenylosuccinate synthetase
bactF17311 94 80.00 13.83 25.49 maf protein
#bactF17421 86 100.00 91.22 91.22 rpsH 30S ribosomal protein S8
#*bactF17450 86 98.82 83.21 83.39 rpsM 30S ribosomal protein S13
bactF17522 92 98.82 52.44 52.85 metK S-adenosylmethionine synthetase
bactF17582 87 98.82 76.39 76.64 rpoC DNA-directed RNA polymerase subunit beta'
#*bactF17583 83 96.47 75.40 90.92 rpoB DNA-directed RNA polymerase subunit beta
bactF17639 72 78.82 18.38 58.82 dut deoxyuridine 5'-triphosphate nucleotidohydrolase
bactF17655 91 76.47 11.16 17.23 formamidopyrimidine-DNA glycosylase
#*bactF17663 85 100.00 100.00 100.00 tsf elongation factor Ts
bactF17690 83 74.12 12.61 21.00 radC DNA repair protein RadC
bactF17824 82 95.29 68.63 90.81 atpG F0F1 ATP synthase subunit gamma
bactF17894 86 100.00 91.22 91.22 pheS phenylalanyl-tRNA synthetase subunit alpha
bactF17908 194 75.29 4.36 11.53 cold shock protein
bactF17983 69 78.82 18.38 79.85 hemH ferrochelatase
bactF18052 65 74.12 12.61 78.79 aroQ 3-dehydroquinate dehydratase
bactF18085 75 85.88 32.32 81.26 hemB delta-aminolevulinic acid dehydratase
#*bactF18161 89 97.65 59.56 60.24 pyrG CTP synthetase
bactF18167 187 94.12 8.65 10.57 DNA-binding protein HU
bactF18247 61 71.76 10.45 100.00 D-tyrosyl-tRNA deacylase
bactF18384 95 90.59 26.25 31.30 pgsA CDP-diacylglycerol--glycerol-3-phosphate 3-phosphatidyltransferase
#*bactF18443 85 98.82 91.02 91.12 pgk phosphoglycerate kinase
bactF18443B 85 98.82 91.02 91.12 tpiA triosephosphate isomerase
bactF18518 185 98.82 44.31 45.21 Polyprenyl synthetase
bactF18671 90 98.82 60.53 60.94 truA tRNA pseudouridine synthase A
#*bactF18679 87 100.00 83.56 83.56 rpsK 30S ribosomal protein S11
#bactF18711 86 100.00 91.22 91.22 rplR 50S ribosomal protein L18
bactF18724 364 71.76 3.03 10.17 efflux transporter RND family MFP subunit
bactF18779 77 88.24 39.02 81.69 argG argininosuccinate synthase
bactF18936 103 76.47 8.56 20.41 Periplasmic solute binding protein
bactF18970 76 76.47 15.22 35.59 cation efflux family protein
bactF18971 74 70.59 9.51 28.33 cation diffusion facilitator family transporter
bactF18983 116 90.59 13.73 18.06 pyrC dihydroorotase
bactF19008 94 100.00 50.42 50.42 pheT phenylalanyl-tRNA synthetase subunit beta
#*bactF19017 84 98.82 91.02 100.00 rplS 50S ribosomal protein L19
bactF19049 75 87.06 35.51 90.01 dapF diaminopimelate epimerase
bactF19091 78 85.88 32.32 61.59 hisD histidinol dehydrogenase
bactF19178 86 91.76 47.83 50.31 hypothetical protein
#*bactF19221 85 100.00 100.00 100.00 nusA transcription elongation factor NusA
#bactF19267 86 100.00 91.22 91.22 secY preprotein translocase subunit SecY
bactF19313 91 75.29 10.68 21.06 6-phosphofructokinase
bactF19542 236 100.00 35.41 35.41 aspS aspartyl-tRNA synthetase
bactF19613 63 74.12 12.61 100.00 bioB biotin synthase
bactF19618 70 80.00 20.19 79.83 hypothetical protein
bactF19638 67 71.76 10.45 51.20 lspA lipoprotein signal peptidase
bactF19647 81 95.29 68.63 100.00 aroC chorismate synthase
bactF19689 135 95.29 12.88 15.01 3-oxoacyl-(acyl carrier protein) synthase III
bactF19750 127 100.00 18.05 18.05 clpP ATP-dependent Clp protease proteolytic subunit
bactF19792 92 98.82 52.82 53.29 eno enolase
bactF19793 116 85.88 10.69 16.74 ilvD dihydroxy-acid dehydratase
#*bactF19849 86 100.00 91.22 91.22 rplP 50S ribosomal protein L16
bactF19916 69 81.18 22.18 100.00 hypothetical protein
bactF19931 119 100.00 19.92 19.92 ssb single-strand binding protein
bactF20164 73 84.71 29.42 89.76 hisG ATP phosphoribosyltransferase
bactF20289 106 78.82 8.56 16.51 DNA polymerase IV
bactF20308 111 91.76 18.77 24.59 hypothetical protein
bactF20343 197 94.12 6.57 7.61 Amidase
bactF20367 153 76.47 6.59 17.40 PpiC-type peptidyl-prolyl cis-trans isomerase
bactF20411 126 76.47 13.47 78.79 hypothetical protein
bactF20417 73 78.82 18.38 54.42 ilvH acetolactate synthase 3 regulatory subunit
bactF20428 86 98.82 83.21 83.39 coaD phosphopantetheine adenylyltransferase
bactF20436 74 85.88 32.32 89.89 hemC porphobilinogen deaminase
bactF20461 91 98.82 56.44 56.89 hypothetical protein
bactF20478 69 80.00 20.19 89.20 plsX fatty acid/phospholipid synthesis protein
#*bactF20480 85 100.00 100.00 100.00 rplM 50S ribosomal protein L13
bactF20507 75 84.71 29.42 73.59 panB 3-methyl-2-oxobutanoate hydroxymethyltransferase
#*bactF20724 85 98.82 91.02 91.12 rplF 50S ribosomal protein L6
bactF20735 87 82.35 21.50 27.38 gcvH glycine cleavage system H protein
bactF20742 167 98.82 54.75 59.41 pyrB aspartate carbamoyltransferase catalytic subunit
bactF20792 200 95.29 17.45 20.53 hypothetical protein
#*bactF20837 85 100.00 100.00 100.00 rpsI 30S ribosomal protein S9
bactF20895 65 75.29 13.86 88.59 nadA quinolinate synthetase
bactF20907 86 98.82 83.21 83.39 tig trigger factor
bactF20971 82 96.47 75.40 100.00 recR recombination protein RecR
bactF21058 86 100.00 91.22 91.22 nusG transcription antitermination protein NusG
bactF21117 74 76.47 15.22 42.60 phoU phosphate transport system regulatory protein PhoU
bactF21269 92 98.82 52.82 53.29 dnaA chromosomal replication initiation protein
bactF21320 70 82.35 24.37 100.00 rimM 16S rRNA processing protein rimM
bactF21359 88 100.00 76.87 76.87 ribF riboflavin biosynthesis protein RibF
bactF21462 112 71.76 5.79 21.17 Ald Alanine dehydrogenase
#*bactF21517 85 100.00 100.00 100.00 rplA 50S ribosomal protein L1
bactF21529 85 84.71 29.42 35.47 fabZ (3R)-hydroxymyristoyl-(acyl carrier protein) dehydratase
bactF21566 101 100.00 35.74 35.74 efp elongation factor P
bactF21687 118 100.00 20.08 20.08 groEL chaperonin GroEL
bactF21696 82 95.29 68.63 90.81 rplX 50S ribosomal protein L24
bactF21704 84 97.65 82.84 91.02 rnc ribonuclease III
bactF21732 103 87.06 16.39 23.92 sun Sun protein
#bactF21733 83 97.65 82.84 100.00 nusB transcription antitermination protein NusB
#*bactF21809 85 98.82 91.02 91.12 rpsJ 30S ribosomal protein S10
bactF21961 85 96.47 75.40 76.18 acpP acyl carrier protein
bactF22029 74 87.06 35.51 100.00 hypothetical protein
bactF22047 88 78.82 15.70 24.75 leucyl aminopeptidase
#bactF22056 86 100.00 91.22 91.22 rplO 50S ribosomal protein L15
bactF22162 81 95.29 68.63 100.00 rplJ 50S ribosomal protein L10
#*bactF22164 85 98.82 91.02 91.12 rpsS 30S ribosomal protein S19
bactF22267 99 98.82 35.38 35.90 groES co-chaperonin GroES
#bactF22312 86 98.82 83.21 83.39 rplQ 50S ribosomal protein L17
bactF22315 71 76.47 15.22 54.14 lipA lipoyl synthase
bactF22443 64 74.12 12.61 88.42 hypothetical protein
#*bactF22468 84 98.82 91.02 100.00 infC translation initiation factor IF-3
bactF22558 91 78.82 13.23 18.84 fructose-bisphosphate aldolase
bactF22567 77 87.06 35.51 74.15 hypothetical protein
bactF22605 61 71.76 10.45 100.00 glyQ glycyl-tRNA synthetase subunit alpha
bactF22610 98 85.88 17.53 25.41 hisI phosphoribosyl-AMP cyclohydrolase
bactF22704 61 71.76 10.45 100.00 rpsT 30S ribosomal protein S20
bactF22751 80 90.59 47.10 74.95 ribH riboflavin synthase subunit beta
bactF22862 65 71.76 10.45 63.13 ppk polyphosphate kinase
bactF22947 75 85.88 32.32 81.26 gidA tRNA uridine 5-carboxymethylaminomethyl modification enzyme GidA
#bactF23022 83 97.65 82.84 100.00 ruvA Holliday junction DNA helicase RuvA
#*bactF23090 85 100.00 100.00 100.00 rplL 50S ribosomal protein L7/L12
bactF23255 187 92.94 24.58 29.60 aconitate hydratase
bactF23256 80 83.53 26.78 45.00 leuD isopropylmalate isomerase small subunit
bactF23299 86 100.00 91.22 91.22 argS arginyl-tRNA synthetase
bactF6105 69 81.18 22.18 100.00 Holliday junction resolvase-like protein
bactF6109 143 89.41 12.64 20.63 mutS DNA mismatch repair protein
bactF6118 73 78.82 18.38 54.42 tatC Sec-independent protein translocase TatC
bactF6295 123 96.47 14.80 16.18 signal peptidase I
bactF6336 68 80.00 20.19 100.00 trmB tRNA (guanine-N(7))-methyltransferase
#*bactF6375 86 100.00 91.22 91.22 rpsE 30S ribosomal protein S5
bactF6392 184 98.82 27.67 28.52 rRNA methylase
bactF6393 63 74.12 12.61 100.00 RNA methyltransferase TrmH family group 2
bactF6506 98 100.00 40.20 40.20 putative inner membrane protein translocase component YidC
bactF6510 149 95.29 11.13 13.16 fur ferric uptake regulation protein
#bactF6533 85 100.00 100.00 100.00 rpsP 30S ribosomal protein S16
bactF6558 121 100.00 24.66 24.66 hisS histidyl-tRNA synthetase
#bactF6590 85 100.00 100.00 100.00 mraW S-adenosyl-methyltransferase MraW
bactF6622 101 74.12 7.05 15.52 glutamate dehydrogenase
bactF6647 92 97.65 48.76 49.77 asd aspartate-semialdehyde dehydrogenase
bactF6648 75 84.71 29.42 73.59 argC N-acetyl-gamma-glutamyl-phosphate reductase
bactF6693 78 88.24 39.02 74.42 recJ single-stranded-DNA-specific exonuclease RecJ
bactF6698 83 87.06 35.51 46.13 uppP undecaprenyl pyrophosphate phosphatase
bactF6725 79 90.59 47.10 82.10 xseA exodeoxyribonuclease VII large subunit
bactF6747 88 84.71 24.50 31.62 Superoxide dismutase
bactF6774 169 100.00 28.28 28.28 gltX glutamyl-tRNA synthetase
bactF6775 82 91.76 51.75 69.06 gatB aspartyl/glutamyl-tRNA amidotransferase subunit B
bactF6854 113 78.82 7.75 16.56 ABC transporter permease protein
#*bactF6908 86 100.00 91.22 91.22 rplE 50S ribosomal protein L5
bactF6955 89 98.82 65.16 65.53 ligA NAD-dependent DNA ligase LigA
bactF6969 65 75.29 13.86 88.59 argJ bifunctional ornithine acetyltransferase/N-acetylglutamate synthase protein
bactF7081 69 81.18 22.18 100.00 dxr 1-deoxy-D-xylulose 5-phosphate reductoisomerase
bactF7088 67 78.82 18.38 100.00 recF recombination protein F
#*bactF7097 86 100.00 91.22 91.22 rpsC 30S ribosomal protein S3
bactF7146 91 100.00 60.87 60.87 ksgA dimethyladenosine transferase
bactF7167 85 96.47 75.40 76.18 coaBC phosphopantothenoylcysteine decarboxylase/phosphopantothenate--cysteine ligase
bactF7182 81 92.94 56.85 82.49 dapB dihydrodipicolinate reductase
bactF7194 70 81.18 22.18 89.35 ispG 4-hydroxy-3-methylbut-2-en-1-yl diphosphate synthase
bactF7351 86 94.12 57.60 59.67 lgt prolipoprotein diacylglyceryl transferase
bactF7382 91 85.88 22.83 29.70 rnr ribonuclease R
bactF7487 75 87.06 35.51 90.01 ndk nucleoside diphosphate kinase
bactF7509 83 92.94 56.85 69.35 tgt queuine tRNA-ribosyltransferase
bactF7547 97 98.82 39.36 39.89 uppS undecaprenyl diphosphate synthase
bactF7565 98 98.82 37.51 38.06 folD methylenetetrahydrofolate dehydrogenase/methenyltetrahydrofolate cyclohydrolase
bactF7576 208 72.94 3.15 7.65 iron compound ABC transporter permease protein
bactF7581 100 100.00 37.22 37.22 alaS alanyl-tRNA synthetase
bactF7582 94 100.00 50.42 50.42 thrS threonyl-tRNA synthetase
bactF7583 88 100.00 76.87 76.87 proS prolyl-tRNA synthetase
#bactF7584 86 100.00 91.22 91.22 serS seryl-tRNA synthetase
bactF7609 109 94.12 20.83 24.41 greA transcription elongation factor GreA
bactF7645 83 96.47 75.40 90.92 hypothetical protein
#bactF7657 86 100.00 91.22 91.22 rpsQ 30S ribosomal protein S17
#*bactF7728 85 100.00 100.00 100.00 frr ribosome recycling factor
bactF7745 142 100.00 31.00 31.00 gyrB DNA gyrase subunit B
bactF7763 77 84.71 29.42 61.67 ilvC ketol-acid reductoisomerase
#*bactF7875 85 98.82 91.02 91.12 rplC 50S ribosomal protein L3
#bactF7911 85 100.00 100.00 100.00 rpsO 30S ribosomal protein S15
bactF7988 142 100.00 34.72 34.72 gyrA DNA gyrase subunit A
bactF8002 93 100.00 53.61 53.61 rpsD 30S ribosomal protein S4
#*bactF8007 86 100.00 91.22 91.22 smpB SsrA-binding protein
bactF8117 110 90.59 17.99 24.60 PhoH family protein
bactF8120 112 96.47 21.65 23.76 relA GTP pyrophosphokinase
bactF8137 153 85.88 31.00 68.17 hisA phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase
bactF8138 80 87.06 35.51 57.15 hisH imidazole glycerol phosphate synthase subunit HisH
bactF8219 89 92.94 42.10 44.58 proC pyrroline-5-carboxylate reductase
bactF8221 92 98.82 52.44 52.85 Phosphatidate cytidylyltransferase
bactF8238 71 83.53 26.78 100.00 4-diphosphocytidyl-2-C-methyl-D-erythritol kinase
bactF8259 553 90.59 4.25 5.46 transcriptional regulator LysR family
bactF8326 101 74.12 7.48 18.78 predicted inner membrane peptidase
bactF8344 90 100.00 65.82 65.82 infA translation initiation factor IF-1
bactF8417 92 100.00 56.79 56.79 secA preprotein translocase subunit SecA
bactF8482 144 81.18 9.14 21.99 UDP-glucose 6-dehydrogenase
bactF8533 80 92.94 56.85 90.60 nadD nicotinic acid mononucleotide adenyltransferase
bactF8534 73 85.88 32.32 100.00 hypothetical protein
bactF8549 88 100.00 76.87 76.87 tyrS tyrosyl-tRNA synthetase
bactF8550 90 100.00 65.82 65.82 trpS tryptophanyl-tRNA synthetase
#bactF8626 86 100.00 91.22 91.22 rplV 50S ribosomal protein L22
bactF8659 62 72.94 11.48 100.00 mazG nucleoside triphosphate pyrophosphohydrolase
bactF8667 76 89.41 42.87 100.00 hypothetical protein
bactF8696 60 70.59 9.51 100.00 atpH ATP synthase F1 delta subunit
bactF8702 101 97.65 29.17 30.05 glyA serine hydroxymethyltransferase
#*bactF8791 86 100.00 91.22 91.22 rplD 50S ribosomal protein L4
#*bactF8792 85 100.00 100.00 100.00 rplT 50S ribosomal protein L20
bactF8795 61 71.76 10.45 100.00 rbfA ribosome-binding factor A
#*bactF8827 86 100.00 91.22 91.22 rplB 50S ribosomal protein L2
bactF8838 64 75.29 13.86 100.00 hemA glutamyl-tRNA reductase
bactF8981 132 96.47 11.16 12.17 dapA dihydrodipicolinate synthase
bactF9061 86 97.65 75.90 76.42 rpsR 30S ribosomal protein S18
bactF9089 86 96.47 69.23 70.19 hypothetical protein
bactF9111 77 90.59 47.10 100.00 rpsF 30S ribosomal protein S6
bactF9178 105 100.00 30.63 30.63 ribose-phosphate pyrophosphokinase
bactF9202 138 98.82 16.65 17.35 glyceraldehyde-3-phosphate dehydrogenase
bactF9217 129 72.94 4.98 15.92 glpK glycerol kinase
#bactF9258 86 100.00 91.22 91.22 rpsG 30S ribosomal protein S7
#bactF9262 85 100.00 100.00 100.00 rplI 50S ribosomal protein L9
bactF9304 124 100.00 19.06 19.06 def peptide deformylase
bactF9353 111 74.12 5.96 15.46 nitrogen regulatory protein p-II
bactF9419 77 84.71 29.42 61.67 queA S-adenosylmethionine tRNA ribosyltransferase-isomerase
bactF9424 88 97.65 64.50 65.24 ftsZ cell division protein FtsZ
bactF9477 83 96.47 75.40 90.92 truB tRNA pseudouridine synthase B
bactF9500 356 81.18 8.75 18.12 putative monovalent cation/H+ antiporter subunit D
#bactF9641 85 97.65 82.84 83.22 rpsL 30S ribosomal protein S12
bactF9703 84 96.47 75.40 83.04 rnhB ribonuclease HII
bactF9872 87 97.65 69.68 
70.15 murB UDP-N-acetylenolpyruvoylglucosamine reductase bactF9900 132 88.24 8.35 12.02 glnA glutamine synthetase bactF9905 61 71.76 10.45 100.00 hypothetical protein bactF9916 77 90.59 47.10 100.00 rplW 50S ribosomal protein L23 ##end### == ToDo1: Eukaryotes and Viruses == == ToDo2: Run Zorro on families to build HMMs ==
IMG genomes (Guillaume Jospin, Morgan Langille, Thomas Sharpton, Dongying Wu)
- Summary
- Selected 707 families from DY's 100-genome families (universality > 70%).
- Ran hmmsearch with those 707 families' HMMs against the bacterial and archaeal IMG sequences.
- Filtered the results for 80% coverage.
- Took the whole IMG database and filtered out the sequences that hit DY's 100-genome families.
- Did all vs all blastp on the remaining sequences.
- Filtered for 80% coverage on both query and hit.
- Identified protein families using MCL.
- Regrouped all families (from this step forward, the 100-genome families and the IMG families go through the same steps).
- Aligned the families using MUSCLE.
- No selection on small families (size <= 250 members).
- Tried to pick representatives for the very large families (up to 45000 members, virtually impossible to align decently) using DY's pick_rep_by_mcl.pl script.
- In progress
- Aligning some larger families (still running on merlot).
- Determine a good quality-control metric/method to rate alignments (good alignments will provide usable HMMs). So far we are using DY's pro_ali_mask.pl script.
- Create HMMs for the good alignments.
- Find the bad alignments and a method to improve them.
- If representatives need to be picked, a "seed" alignment will be used and the remaining sequences will be aligned to it with hmmalign.
- Update the MySQL DB with all the appropriate/current information.
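
The 80% coverage filters in the steps above could be sketched roughly as follows. The notes do not say how coverage was computed, so this assumes coverage = aligned span / full sequence length, with 1-based inclusive coordinates. The `Hit` record and its field names are hypothetical and only stand in for whatever the blastp or hmmsearch parser actually produces.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical parsed alignment record (field names are assumptions)."""
    query: str
    subject: str
    q_start: int
    q_end: int
    q_len: int
    s_start: int
    s_end: int
    s_len: int


def coverage(start: int, end: int, length: int) -> float:
    """Fraction of a sequence spanned by the aligned region (1-based, inclusive)."""
    lo, hi = min(start, end), max(start, end)
    return (hi - lo + 1) / length


def passes_coverage(hit: Hit, min_cov: float = 0.80) -> bool:
    """Keep a hit only if both query and subject are covered at >= min_cov."""
    return (
        coverage(hit.q_start, hit.q_end, hit.q_len) >= min_cov
        and coverage(hit.s_start, hit.s_end, hit.s_len) >= min_cov
    )


# Made-up example hits, for illustration only.
hits = [
    Hit("seqA", "seqB", 1, 95, 100, 5, 100, 110),  # both sides well covered
    Hit("seqA", "seqC", 1, 95, 100, 1, 95, 300),   # subject mostly unaligned
]
kept = [h for h in hits if passes_coverage(h)]
```

Requiring coverage on both sides, as in the all-vs-all step, keeps fragments and multi-domain proteins from being pulled into a family through a short shared region. A one-sided variant would check only one of the two conditions.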