Polysat: Difference between revisions

From OpenWetWare
(let me know about bugs)
(→‎Source code: version 1.3)
(33 intermediate revisions by the same user not shown)
Line 1: Line 1:
<code>polysat</code> is an [[R_Statistics|R]] package for polyploid microsatellite analysis in ecological genetics.  Version 1.2-0 is available on CRAN as of May 2011. [[Image:polysat_screenshot_101021.jpg  | thumb | right | 450px | Screenshot of polysat from RGui ]]   
<code>polysat</code> is an [[R_Statistics|R]] package for polyploid microsatellite analysis in ecological genetics.  Version 1.3-3 is on CRAN as of September 2014. [[Image:polysat_screenshot_101021.jpg  | thumb | right | 450px | Screenshot of polysat from RGui ]]   


Since the paper on polysat was recently published in the May issue of ''Molecular Ecology Resources'', a lot of people are now using it for the first time.  This is occasionally bringing up issues that I didn't find when I tested the software myself.  '''If you think you have found a bug, please feel free to let me know about it!'''  I plan to release an updated version this summer with any bug fixes that are needed.  In the meantime I will send you the corrected source code so that you can proceed with your analysis.
Version 1.3 handles information about ploidy differently than earlier versions.  Due to this, there are minor changes to the source code of most functions with respect to earlier versions.  '''If you think you have found a bug, please feel free to let me know about it!'''  If I fix the bug but am not immediately ready to release a new version of polysat, I will send you the corrected source code so that you can proceed with your analysis in the meantime.


== What polysat does ==
== What polysat does ==
Line 7: Line 7:
* Handles data of any ploidy, including mixed ploidy samples.
* Handles data of any ploidy, including mixed ploidy samples.
* Stores genotype data in a simple format that can be easily manipulated to exclude or add samples and loci.
* Stores genotype data in a simple format that can be easily manipulated to exclude or add samples and loci.
* Imports and exports data in [http://www.appliedbiosystems.com/genemapper ABI GeneMapper Genotypes Table], [http://www.bentleydrummer.nl/software/software/GenoDive.html GenoDive], [http://pritch.bsd.uchicago.edu/structure.html Structure], [http://ebe.ulb.ac.be/ebe/Software.html SPAGeDi], [http://www.vub.ac.be/APNA/ATetra.html ATetra], [http://markwith.freehomepage.com/tetrasat.html Tetrasat]/[http://ecology.bnu.edu.cn/zhangdy/TETRA/TETRA.htm Tetra], [http://gbi.agrsci.dk/~bernt/popgen/ POPDIST], and binary presence/absence formats.
* Imports and exports data in [http://www.appliedbiosystems.com/genemapper ABI GeneMapper Genotypes Table], [http://www.bentleydrummer.nl/software/software/GenoDive.html GenoDive], [http://pritch.bsd.uchicago.edu/structure.html Structure], [http://ebe.ulb.ac.be/ebe/Software.html SPAGeDi], [http://www.vub.ac.be/APNA/ATetra.html ATetra], [http://markwith.freehomepage.com/tetrasat.html Tetrasat]/[http://ecology.bnu.edu.cn/zhangdy/TETRA/TETRA.htm Tetra], [http://gbi.agrsci.dk/~bernt/popgen/ POPDIST], and binary presence/absence formats.  Import is also available for [http://www.vgl.ucdavis.edu/informatics/strand.php STRand] format, and export is available for [http://adegenet.r-forge.r-project.org/ adegenet] presence/absence format.
* Calculates pairwise distances between individuals using a stepwise mutation model or infinite alleles model.
* Calculates pairwise distances between individuals using a stepwise mutation model or infinite alleles model.
* Calculates Shannon and Simpson indexes of genotype diversity.
* Calculates Shannon and Simpson indexes of genotype diversity, and can calculate a confidence interval for the Simpson index.  Counts alleles for measures of allelic diversity.
* Counts alleles to assist user in estimating ploidy.
* Counts alleles to assist user in estimating ploidy.
* Estimates allele frequencies in autopolyploids using either an iterative or non-iterative algorithm.  Calculates pairwise F<sub>ST</sub> based on these estimates.  Mixed ploidy population size is measured in genomes rather than individuals.
* Estimates allele frequencies in autopolyploids using either an iterative or non-iterative algorithm.  Calculates pairwise F<sub>ST</sub> based on these estimates.  Mixed ploidy population size is measured in genomes rather than individuals.
Line 17: Line 17:
== Author and Maintainer ==
== Author and Maintainer ==
[[User:Lindsay V. Clark]]
[[User:Lindsay V. Clark]]
Note that I have changed institutions since the paper was published!  I no longer check email at my UC Davis address.  If you replace "ucdavis.edu" with "illinois.edu", that is my new email address.
== Obtaining polysat ==
== Obtaining polysat ==


Line 31: Line 34:
== Documentation ==
== Documentation ==


[[Media: Polysattutorial_1.2.pdf | Tutorial manual]]: Most users will want to read this first to get a general idea of how to use the package.  It starts with a broad tutorial to familiarize users with the package, then goes into more detail about how data are stored in polysat and which analyses are appropriate for autopolyploid and allopolyploid data.
[[Media: Polysattutorial_1.3-3.pdf | Tutorial manual]]: Most users will want to read this first to get a general idea of how to use the package.  It starts with a broad tutorial to familiarize users with the package, then goes into more detail about how data are stored in polysat and which analyses are appropriate for autopolyploid and allopolyploid data.
 
[[Media: polysat1-2tutorialcode.R.txt | R code from tutorial manual]]: You can copy and paste this code into the R console in order to follow along with the tutorial, or edit it to work with your own data.  [http://ess.r-project.org/ Emacs Speaks Statistics] is a really handy program for editing this type of file and sending lines directly to R, but you can also use a simpler text editor such as Notepad to view and edit this file.


[[Media: Polysat-manual_1.2.pdf | Reference manual]]: This is an alphabetized collection of all of the help files provided with the package.  It contains more details about each function, as well as additional examples.
[[Media: Polysat-manual_1.3-3.pdf | Reference manual]]: This is an alphabetized collection of all of the help files provided with the package.  It contains more details about each function, as well as additional examples.


== Graphical Front End for Import/Export ==
[[Media: polysattutorial_R_1.3-3.txt | Code from the tutorial manual]]: Download this if you want to be able to easily run and edit code from the tutorial manual.  (This file is also included with the package installation.)
I have made a limited graphical front end (GUI) for interacting with <code>polysat</code>. It may be expanded in the future. Currently, it can assist the user with importing and exporting data to and from text files, as well as editing the dataset.  The GUI does not yet perform any analyses (distance matrices, allele frequencies) but creates a <code>"genambig"</code> object, named <code>genobject</code>, that can be used for analysis from the R command prompt.


Notes on use of the GUI: [[Media: polysat_front_end_notes101017.txt]]
=== Example files for data import ===


To obtain the GUI:
Example files referred to in the tutorial manual are linked below:
# If you haven't already, follow the instructions above for installing <code>polysat</code>.
# Install the package <code>tcltk2</code>.  (Type <code>install.packages("tcltk2")</code> at the R prompt.)
# Save a copy of the following file to your computer: [[Media: polysat_front_end101017.R.txt]]
# Every time you want to launch the GUI, load the text file using the <code>source</code> function.  For example: <code>source("C:/Users/lvclark/Desktop/polysat_front_end101017.R.txt")</code>


Note that the GUI has not gone through the same quality control (i.e. extensive checks on CRAN) that <code>polysat</code> itself has. I am offering it here "as is".
* [[Media: GeneMapperCBA15.txt]]
* [[Media: GeneMapperCBA23.txt]]
* [[Media: GeneMapperCBA28.txt]]
* [[Media: structureExample.txt]]
* [[Media: ATetraExample.txt]]
* [[Media: genodiveExample.txt]]
* [[Media: spagediExample.txt]]
* [[Media: GeneMapperExample.txt]]
* [[Media: dominantExample.txt]]
* [[Media: POPDISTexample1.txt]]
* [[Media: POPDISTexample2.txt]]
* [[Media: tetrasatExample.txt]]
* [[Media: STRandExample.txt]]


== How to cite polysat ==
== How to cite polysat ==


Clark, LV and Jasieniuk, M, 2011. POLYSAT: an R package for polyploid microsatellite analysis. ''Molecular Ecology Resources'' 11(3): 562-566.  DOI: [[doi:10.1111/j.1755-0998.2011.02985.x | 10.1111/j.1755-0998.2011.02985.x]]
Clark, LV and Jasieniuk, M, 2011. POLYSAT: an R package for polyploid microsatellite analysis. ''Molecular Ecology Resources'' 11(3): 562-566.  DOI: [[doi:10.1111/j.1755-0998.2011.02985.x | 10.1111/j.1755-0998.2011.02985.x]]
== Upcoming in Version 1.3 ==
Below are some functions that I am adding because I will be using them in my dissertation work.  They are in various stages of development.  If you have an immediate need for something on this list, email me and I may be able to send you the source code and documentation.
* A scoring system to determine whether an offspring was sexually or asexually produced, as seen in [[Media:LVCpag2011.pdf | my poster at PAG XIX]].
* Some relatedness coefficients for unambiguous genotypes, to be used with <code>meandistance.matrix2</code>.
* <code>Bruvo.distance</code> under the genome addition or genome loss models, rather than having virtual alleles equal to infinity.


== Wish List ==
== Wish List ==
Line 66: Line 66:
This section lists additional functionality that I'm thinking of adding to polysat.  If you have any additional requests (please be specific), or would like to "vote" for one of the items below to be a top priority, just send me an email!  If you have created your own functions to interface with the package and would like to be added as a contributor, I am open to that as well.
This section lists additional functionality that I'm thinking of adding to polysat.  If you have any additional requests (please be specific), or would like to "vote" for one of the items below to be a top priority, just send me an email!  If you have created your own functions to interface with the package and would like to be added as a contributor, I am open to that as well.


* For allopolyploids, assign alleles to one genome or the other based on what genotypes are found in the population. (This is a complex problem and not on the to-do list for my dissertation, but could be very useful.  Want to hire me to do this as a post-doc?) Use these allele assignments to re-code allopolyploid data into autopolyploid data by splitting each locus into two or more loci.
* For allopolyploids, assign alleles to one genome or the other based on what genotypes are found in the population.  Use these allele assignments to re-code allopolyploid data into autopolyploid data by splitting each locus into two or more loci. (I've got some ideas on how to do this, but for now I can only work on it in my free time.)
* On a related note, test whether genotype distributions in a population are consistent with autopolyploid or allopolyploid inheritance.
* Related to the above functionality, use the distribution of genotypes to give a probabilistic estimate of whether a locus is polysomic or disomic.
* Improve computational efficiency on inter-individual distance measures.  (I tried vectorizing some of the code for Bruvo.distance and actually made it slower, but if I ever learn C maybe a compiled version would be faster.)
* Given probabilities of unambiguous genotypes (<code>genotypeProbs</code> function), randomly generate an unambiguous dataset.  This could then be passed to software such as <code>adegenet</code> that allows for polyploidy but not allele copy number ambiguity.
* Given probabilities of unambiguous genotypes (<code>genotypeProbs</code> function), randomly generate an unambiguous dataset.  This could then be passed to software such as <code>adegenet</code> that allows for polyploidy but not allele copy number ambiguity.
* More population statistics (Weir and Cockerham 1984, etc.).
* More population statistics (Weir and Cockerham 1984, etc.).
* Parentage analysis
* Parentage analysis
* Options for handling data where allele copy number is known.  At the same time, I will probably make it so that different loci can have different ploidies (for sex chromosomes, SSRs that only amplify in one homeologous genome, etc.).
* Options for handling data where allele copy number is known.
* Estimate selfing rate under polysomic inheritance, based on observed and expected frequencies of fully heterozygous genotypes.  I wrote a function to do this, but the results were imprecise due to stochastic effects in simulated datasets.  I can email you the source code and documentation if you would like to tinker with it.
* Estimate selfing rate under polysomic inheritance, based on observed and expected frequencies of fully heterozygous genotypes.  I wrote a function to do this, but the results were imprecise due to stochastic effects in simulated datasets.  I can email you the source code and documentation if you would like to tinker with it.
 
* Some relatedness coefficients for unambiguous genotypes, to be used with <code>meandistance.matrix2</code>.


== Frequently asked questions ==
== Frequently asked questions ==
Line 79: Line 80:
If you have never used R before, particularly if you find command-line software to be intimidating, you may need to spend a day or two just learning R before you even touch <code>polysat</code>.  (Look for the ''An Introduction to R'' manual on the CRAN website.)  I have tried to make <code>polysat</code> as user-friendly as possible, but that cannot substitute for a basic understanding of how R works.  Trust me, learning R is worth it!  R is very powerful and efficient software for data analysis, and if you take the time to learn it for the sake of using <code>polysat</code>, you may find yourself using R in other areas of your research.  If you are not sure how something works, try experimenting to see if it does what you think it does.
If you have never used R before, particularly if you find command-line software to be intimidating, you may need to spend a day or two just learning R before you even touch <code>polysat</code>.  (Look for the ''An Introduction to R'' manual on the CRAN website.)  I have tried to make <code>polysat</code> as user-friendly as possible, but that cannot substitute for a basic understanding of how R works.  Trust me, learning R is worth it!  R is very powerful and efficient software for data analysis, and if you take the time to learn it for the sake of using <code>polysat</code>, you may find yourself using R in other areas of your research.  If you are not sure how something works, try experimenting to see if it does what you think it does.


If you get an error message or if the software behaves in an unexpected way, I'm happy to help you figure out what went wrong.  (Particularly since several users have helped me to find bugs in this way.)  In this case, please send your dataset (or a subset of your data that you don't mind sharing) and code that can be used to duplicate the error.
=== polysat ===
* '''Is missing data allowed in polysat?'''  Yes it is!  For the Structure, GenoDive, SPAGeDi, and Tetrasat/Tetra formats, you can code the missing data as you normally would for that format.  For the GeneMapper format, you can either delete rows with missing data, or fill in a <code>-9</code> in the first allele column for that row.
* '''Is missing data allowed in polysat?'''  Yes it is!  For the Structure, GenoDive, SPAGeDi, and Tetrasat/Tetra formats, you can code the missing data as you normally would for that format.  For the GeneMapper format, you can either delete rows with missing data, or fill in a <code>-9</code> in the first allele column for that row.
* '''I have made my PCA plot.  Can I add a label for each sample?'''  Yes.  See <code>?text</code>.
* '''In <code>read.GeneMapper</code> I got the error "line 2 did not have X elements".'''  Each line of the file needs to have the same number of tab stops.  You can add these manually in a text editor, or if you open and save the file in a spreadsheet program it should automatically insert the right number of tab stops.
* '''In <code>read.GeneMapper</code> I got the error "line 2 did not have X elements".'''  Each line of the file needs to have the same number of tab stops.  You can add these manually in a text editor, or if you open and save the file in a spreadsheet program it should automatically insert the right number of tab stops.
* '''I tried to do PCoA (cmdscale) but got the error "NA values not allowed in d."'''  If you only have one or two loci, you will need to exclude all individuals with missing data from your analysis.  If you have three or more loci and still see this error, you may need to exclude individuals that are missing genotypes at multiple loci.
* '''Do all populations need to have the same number of individuals?  Do the samples have to be ordered by population?'''  No and no.  [[Media:polysat_tables.txt | Example code]]
 
 
=== R base ===
* '''I have made my PCoA plot.  Can I add a label for each sample?'''  Yes.  See <code>?text</code>.
* '''How can I assign each population its own color or symbol for plotting?'''  See additional [[Media:polysat_colors.txt | Example code]].
* '''How can I find the percentage of variation represented by the axes in Principal Coordinate Analysis?''' When you run cmdscale, use <code>eig=TRUE</code>.  Then (assuming here you name the output <code>pca</code>) you can do <code>pca$eig/sum(pca$eig)</code>.  When you plot the points, you will now have to use <code>pca$points[,1]</code> and <code>pca$points[,2]</code> instead of <code>pca[,1]</code> and <code>pca[,2]</code>.  See <code>?cmdscale</code>.
* '''I tried to do PCoA (cmdscale) but got the error "NA values not allowed in d."'''  If you only have one or two loci, you will need to exclude all individuals with missing data from your analysis.  If you have three or more loci and still see this error, you may need to exclude individuals that are missing genotypes at multiple loci.  This will prevent missing data (<code>NA</code>) in the distance matrix (<code>d</code>).
* '''Can I use my distance matrix to make a neighbor-joining tree instead of a PCoA plot?'''  Yes.  The R package <code>ape</code> works well for making neighbor-joining trees, and I have gotten it to work on distance matrices produced by <code>polysat</code>.


== Known issues ==
== Known issues ==
===Current version (1.2-0)===
===Current version (1.3-3)===
* <code>Bruvo.distance</code>: If both genotypes are missing, returns 0 rather than NA.  Source code with fix, until new version is released: [[Media: Bruvo110504.R.txt]]
* Locus names should not contain a period.  For now this is going to remain a "feature" of polysat.
* <code>read.GenoDive</code>: The function does not expect a "Clones" column, and will simply take sample names from whichever column is second.  This bug will be fixed in the next version of polysat to be released.


===Older versions===
===Older versions===
* <code>read.Structure</code> gives ploidy at each sample*locus based on allele count.  However, <code>reformatPloidies</code> and <code>Ploidies</code> can be used to fix ploidy after data has been imported. (1.3-1 and 1.3-2)
* Example files for data import do not install with the package as intended. (1.3-2 only.)  I have made them available on this web page.
* read.STRand has problems if any one locus never has more than one allele per individual (1.3-1 and earlier).
* Lynch.distance does not give the right answer if a genotype contains duplicated alleles (e.g. AAAB instead of AB; 1.3-1 and earlier).
* There is a problem with <code>"genbinary"</code> objects if a particular locus has only one allele for the whole dataset.  (Versions 1.2 and earlier.)
* Rounding errors in R can cause errors when the Simpson or Shannon indices are used. ("p should sum to 1.")  (Version 1.2)
* <code>Bruvo.distance</code>: In version 1.2-0 and earlier, if both genotypes are missing, returns 0 rather than NA.
* <code>read.GenoDive</code>: The function does not expect a "Clones" column, and will simply take sample names from whichever column is second. (Version 1.2-0 and earlier)
* If one locus name is a shorter version of another locus name, ''e.g.'' "ABC1" and "ABC12", there will be some issues with the "genbinary" class and with the allele frequency functions.  (Version 1.2-0 and earlier)
* In version 1.1, <code>write.GeneMapper</code> has problems when genotypes have more alleles than the maximum ploidy in the dataset.
* In version 1.1, <code>write.GeneMapper</code> has problems when genotypes have more alleles than the maximum ploidy in the dataset.
* <code>editGenotypes</code> in version 1.0 rearranges the genotypes if the samples and loci are not in alphabetical order.
* <code>editGenotypes</code> in version 1.0-0 rearranges the genotypes if the samples and loci are not in alphabetical order.
* In version 0.1, <code>read.SPAGeDi</code> will not work with <code>missing=0</code>, <code>missing=00</code>, etc.  This should not be an issue in version 1.0 because of the change in data structure.  (In either version, even if the missing data symbol is at the default, -9, the software still knows that zero indicates missing data in a SPAGeDi file.)
* In version 0.1, <code>read.SPAGeDi</code> will not work with <code>missing=0</code>, <code>missing=00</code>, etc.  This should not be an issue in version 1.0 because of the change in data structure.  (In either version, even if the missing data symbol is at the default, -9, the software still knows that zero indicates missing data in a SPAGeDi file.)


Line 98: Line 117:
For advanced R users, here is the source code for the functions in the package, so that you may tweak them or create new functions for your own use:
For advanced R users, here is the source code for the functions in the package, so that you may tweak them or create new functions for your own use:


=== Current version (1.2-0) ===
=== Current version (1.3-3) ===
*[[Media: classes_generics_methods_polysat_1-0-1.R.txt]]
*[[Media: classes_generics_methods_polysat_1-3-3.R.txt]]
*[[Media: class_conversion_polysat_1-3-1.R.txt]]
*[[Media: dataimport_polysat_1-3-3.R.txt]]
*[[Media: dataexport_polysat_1-3-3.R.txt]]
*[[Media: individual_distance_polysat_1-3-3.R.txt]]
*[[Media: population_stats_polysat_1-3-3.R.txt]]
 
=== Older versions ===
*[[Media: dataimport_polysat_1-3-2.R.txt]]
*[[Media: dataexport_polysat_1-3-1.R.txt]]
*[[Media: individual_distance_polysat_1-3-2.R.txt]]
*[[Media: population_stats_polysat_1-3-2.R.txt]]
*[[Media: classes_generics_methods_polysat_1-3.R.txt]]
*[[Media: individual_distance_polysat_1-3.R.txt]]
*[[Media: population_stats_polysat_1-3-1.R.txt]]
*[[Media: dataimport_polysat_1-3-1.R.txt]]
*[[Media: classes_generics_methods_polysat_1-2-1.R.txt]]
*[[Media: class_conversion_polysat_1-0.R.txt]]
*[[Media: class_conversion_polysat_1-0.R.txt]]
*[[Media: dataimport_polysat_1-1.R.txt]]
*[[Media: dataimport_polysat_1-2-1.R.txt]]
*[[Media: dataexport_polysat_1-2.R.txt]]
*[[Media: dataexport_polysat_1-2.R.txt]]
*[[Media: individual_distance_polysat_1-2-1.R.txt]]
*[[Media: population_stats_polysat_1-2-1.R.txt]]
*[[Media: individual_distance_polysat_1-2.R.txt]]
*[[Media: individual_distance_polysat_1-2.R.txt]]
*[[Media: population_stats_polysat_1-2.R.txt]]
*[[Media: population_stats_polysat_1-2.R.txt]]
 
*[[Media: classes_generics_methods_polysat_1-0-1.R.txt]]
=== Older versions ===
*[[Media: dataimport_polysat_1-1.R.txt]]
 
*[[Media: dataexport_polysat_1-1.R.txt]]
*[[Media: dataexport_polysat_1-1.R.txt]]
*[[Media: individual_distance_polysat_1-0.R.txt]]
*[[Media: individual_distance_polysat_1-0.R.txt]]

Revision as of 09:28, 5 October 2014

polysat is an R package for polyploid microsatellite analysis in ecological genetics. Version 1.3-3 is on CRAN as of September 2014.

Screenshot of polysat from RGui

Version 1.3 handles information about ploidy differently than earlier versions. Due to this, there are minor changes to the source code of most functions with respect to earlier versions. If you think you have found a bug, please feel free to let me know about it! If I fix the bug but am not immediately ready to release a new version of polysat, I will send you the corrected source code so that you can proceed with your analysis in the meantime.

What polysat does

  • Assumes allele copy number ambiguity in partial heterozygotes.
  • Handles data of any ploidy, including mixed ploidy samples.
  • Stores genotype data in a simple format that can be easily manipulated to exclude or add samples and loci.
  • Imports and exports data in ABI GeneMapper Genotypes Table, GenoDive, Structure, SPAGeDi, ATetra, Tetrasat/Tetra, POPDIST, and binary presence/absence formats. Import is also available for STRand format, and export is available for adegenet presence/absence format.
  • Calculates pairwise distances between individuals using a stepwise mutation model or infinite alleles model.
  • Calculates Shannon and Simpson indexes of genotype diversity, and can calculate a confidence interval for the Simpson index. Counts alleles for measures of allelic diversity.
  • Counts alleles to assist user in estimating ploidy.
  • Estimates allele frequencies in autopolyploids using either an iterative or non-iterative algorithm. Calculates pairwise FST based on these estimates. Mixed ploidy population size is measured in genomes rather than individuals.
  • Exports allele frequencies in SPAGeDi and adegenet formats.
  • Easily extensible; ordinary users can write new functions to interface with the package.

Author and Maintainer

User:Lindsay V. Clark

Note that I have changed institutions since the paper was published! I no longer check email at my UC Davis address. If you replace "ucdavis.edu" with "illinois.edu", that is my new email address.

Obtaining polysat

If you don't already have R, download it from CRAN and install it.

At the prompt in the R console, type:

install.packages("combinat")

install.packages("polysat")

library(polysat)

Documentation

Tutorial manual: Most users will want to read this first to get a general idea of how to use the package. It starts with a broad tutorial to familiarize users with the package, then goes into more detail about how data are stored in polysat and which analyses are appropriate for autopolyploid and allopolyploid data.

Reference manual: This is an alphabetized collection of all of the help files provided with the package. It contains more details about each function, as well as additional examples.

Code from the tutorial manual: Download this if you want to be able to easily run and edit code from the tutorial manual. (This file is also included with the package installation.)

Example files for data import

Example files referred to in the tutorial manual are linked below:

How to cite polysat

Clark, LV and Jasieniuk, M, 2011. POLYSAT: an R package for polyploid microsatellite analysis. Molecular Ecology Resources 11(3): 562-566. DOI: 10.1111/j.1755-0998.2011.02985.x

Wish List

This section lists additional functionality that I'm thinking of adding to polysat. If you have any additional requests (please be specific), or would like to "vote" for one of the items below to be a top priority, just send me an email! If you have created your own functions to interface with the package and would like to be added as a contributor, I am open to that as well.

  • For allopolyploids, assign alleles to one genome or the other based on what genotypes are found in the population. Use these allele assignments to re-code allopolyploid data into autopolyploid data by splitting each locus into two or more loci. (I've got some ideas on how to do this, but for now I can only work on it in my free time.)
  • Related to the above functionality, use the distribution of genotypes to give a probabilistic estimate of whether a locus is polysomic or disomic.
  • Improve computational efficiency on inter-individual distance measures. (I tried vectorizing some of the code for Bruvo.distance and actually made it slower, but if I ever learn C maybe a compiled version would be faster.)
  • Given probabilities of unambiguous genotypes (genotypeProbs function), randomly generate an unambiguous dataset. This could then be passed to software such as adegenet that allows for polyploidy but not allele copy number ambiguity.
  • More population statistics (Weir and Cockerham 1984, etc.).
  • Parentage analysis
  • Options for handling data where allele copy number is known.
  • Estimate selfing rate under polysomic inheritance, based on observed and expected frequencies of fully heterozygous genotypes. I wrote a function to do this, but the results were imprecise due to stochastic effects in simulated datasets. I can email you the source code and documentation if you would like to tinker with it.
  • Some relatedness coefficients for unambiguous genotypes, to be used with meandistance.matrix2.

Frequently asked questions

If you have never used R before, particularly if you find command-line software to be intimidating, you may need to spend a day or two just learning R before you even touch polysat. (Look for the An Introduction to R manual on the CRAN website.) I have tried to make polysat as user-friendly as possible, but that cannot substitute for a basic understanding of how R works. Trust me, learning R is worth it! R is very powerful and efficient software for data analysis, and if you take the time to learn it for the sake of using polysat, you may find yourself using R in other areas of your research. If you are not sure how something works, try experimenting to see if it does what you think it does.

If you get an error message or if the software behaves in an unexpected way, I'm happy to help you figure out what went wrong. (Particularly since several users have helped me to find bugs in this way.) In this case, please send your dataset (or a subset of your data that you don't mind sharing) and code that can be used to duplicate the error.

polysat

  • Is missing data allowed in polysat? Yes it is! For the Structure, GenoDive, SPAGeDi, and Tetrasat/Tetra formats, you can code the missing data as you normally would for that format. For the GeneMapper format, you can either delete rows with missing data, or fill in a -9 in the first allele column for that row.
  • In read.GeneMapper I got the error "line 2 did not have X elements". Each line of the file needs to have the same number of tab stops. You can add these manually in a text editor, or if you open and save the file in a spreadsheet program it should automatically insert the right number of tab stops.
  • Do all populations need to have the same number of individuals? Do the samples have to be ordered by population? No and no. Example code


R base

  • I have made my PCoA plot. Can I add a label for each sample? Yes. See ?text.
  • How can I assign each population its own color or symbol for plotting? See additional Example code.
  • How can I find the percentage of variation represented by the axes in Principal Coordinate Analysis? When you run cmdscale, use eig=TRUE. Then (assuming here you name the output pca) you can do pca$eig/sum(pca$eig). When you plot the points, you will now have to use pca$points[,1] and pca$points[,2] instead of pca[,1] and pca[,2]. See ?cmdscale.
  • I tried to do PCoA (cmdscale) but got the error "NA values not allowed in d." If you only have one or two loci, you will need to exclude all individuals with missing data from your analysis. If you have three or more loci and still see this error, you may need to exclude individuals that are missing genotypes at multiple loci. This will prevent missing data (NA) in the distance matrix (d).
  • Can I use my distance matrix to make a neighbor-joining tree instead of a PCoA plot? Yes. The R package ape works well for making neighbor-joining trees, and I have gotten it to work on distance matrices produced by polysat.
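
The base R steps mentioned in the answers above can be sketched roughly as follows. This is only an illustration: the distance matrix is built from made-up coordinates rather than from polysat output, <code>drop_na_samples</code> and <code>plot_pcoa</code> are hypothetical helper names, and removing samples directly from the distance matrix is an assumed shortcut rather than the genotype-level exclusion described in the answer.

```r
# Toy data: four samples at made-up 2-D coordinates, so the distances
# are purely illustrative (not real genotype distances).
coords <- matrix(c(0, 0,
                   1, 0,
                   4, 3,
                   5, 3),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("S1", "S2", "S3", "S4"), NULL))
d <- as.matrix(dist(coords))
pops <- c(S1 = 1, S2 = 1, S3 = 2, S4 = 2)  # hypothetical population numbers

# Hypothetical helper: remove the sample with the most NA distances,
# repeating until the matrix contains no NA (cmdscale rejects NA in d).
drop_na_samples <- function(d) {
  while (any(is.na(d))) {
    worst <- which.max(rowSums(is.na(d)))
    d <- d[-worst, -worst, drop = FALSE]
  }
  d
}

# Simulate one missing pairwise distance and clean it up.
d_missing <- d
d_missing["S1", "S4"] <- d_missing["S4", "S1"] <- NA
d_clean <- drop_na_samples(d_missing)  # ties go to the first sample, so S1 is dropped

# PCoA with eigenvalues, and the proportion of variation on each axis.
pca <- cmdscale(d, k = 2, eig = TRUE)
pct <- pca$eig / sum(pca$eig)

# Hypothetical helper: one color and symbol per population, one label per sample.
plot_pcoa <- function(pca, pops) {
  pts <- pca$points
  p <- pops[rownames(pts)]  # align population numbers with the plotted samples
  plot(pts[, 1], pts[, 2],
       col = c("red", "blue")[p], pch = c(16, 17)[p],
       xlab = "Axis 1", ylab = "Axis 2")
  text(pts[, 1], pts[, 2], labels = rownames(pts), pos = 3)
}
```

Calling <code>plot_pcoa(pca, pops)</code> draws the labeled plot, and <code>round(100 * pct[1:2])</code> gives percentages that could be pasted into the axis titles. With more than two populations, the color and symbol vectors inside the helper would need one entry per population.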

Known issues

Current version (1.3-3)

  • Locus names should not contain a period. For now this is going to remain a "feature" of polysat.
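
A quick way to spot locus names that contain a period, sketched in base R. The locus names here are invented for illustration, and replacing the period with an underscore is just one possible convention; the renaming itself would need to happen in the input file or before the names are assigned to a dataset.

```r
# Hypothetical locus names; two of them contain a period.
loci <- c("ABC1", "loc.2", "X.17")

# fixed = TRUE makes "." a literal period instead of a regex wildcard.
has_period <- grepl(".", loci, fixed = TRUE)
bad_names <- loci[has_period]

# One possible fix: swap each period for an underscore.
safe_loci <- gsub(".", "_", loci, fixed = TRUE)

# Make sure the renaming did not create duplicate locus names.
stopifnot(anyDuplicated(safe_loci) == 0)
```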

Older versions

  • read.Structure gives ploidy at each sample*locus based on allele count. However, reformatPloidies and Ploidies can be used to fix ploidy after data has been imported. (1.3-1 and 1.3-2)
  • Example files for data import do not install with the package as intended. (1.3-2 only.) I have made them available on this web page.
  • read.STRand has problems if any one locus never has more than one allele per individual (1.3-1 and earlier).
  • Lynch.distance does not give the right answer if a genotype contains duplicated alleles (e.g. AAAB instead of AB; 1.3-1 and earlier).
  • There is a problem with "genbinary" objects if a particular locus has only one allele for the whole dataset. (Versions 1.2 and earlier.)
  • Rounding errors in R can cause errors when the Simpson or Shannon indices are used. ("p should sum to 1.") (Version 1.2)
  • Bruvo.distance: In version 1.2-0 and earlier, if both genotypes are missing, returns 0 rather than NA.
  • read.GenoDive: The function does not expect a "Clones" column, and will simply take sample names from whichever column is second. (Version 1.2-0 and earlier)
  • If one locus name is a shorter version of another locus name, e.g. "ABC1" and "ABC12", there will be some issues with the "genbinary" class and with the allele frequency functions. (Version 1.2-0 and earlier)
  • In version 1.1, write.GeneMapper has problems when genotypes have more alleles than the maximum ploidy in the dataset.
  • editGenotypes in version 1.0-0 rearranges the genotypes if the samples and loci are not in alphabetical order.
  • In version 0.1, read.SPAGeDi will not work with missing=0, missing=00, etc. This should not be an issue in version 1.0 because of the change in data structure. (In either version, even if the missing data symbol is at the default, -9, the software still knows that zero indicates missing data in a SPAGeDi file.)

Source code

For advanced R users, here is the source code for the functions in the package, so that you may tweak them or create new functions for your own use:

Current version (1.3-3)

Older versions

External links

  • You can rate and review polysat on Crantastic. (I am of course also open to questions and comments via email.)
  • CRAN page with source and binary downloads.