Alignreads.py README
Alignreads.py Readme - Last Modified 9/22/2010 for v2.23
What it is: A pipeline of five scripts for making reference-guided assemblies from microreads. It is meant to provide a one-step, automatable script that produces full alignments, along with all of the associated files, using a single command. It takes a minimum of 2 arguments, but supports over 40 modifiers from the constituent scripts. Since the first step of the pipeline (YASRA) takes the majority of the computing time, the script can also take the output of a previous YASRA run and continue the pipeline from there. It can also rerun the pipeline on the YASRA data in a finished alignreads output; each analysis will have its own subfolder within the main folder containing the YASRA data. If no arguments are supplied, a help menu is printed with the most common options displayed. The full help menu with all of the options can be reached by adding the -H/--advanced-help option. Alignreads uses the following scripts from the Liston lab: runyasra.py, sumqual.py, and qualtofa.py. It also uses YASRA and its binaries, including lastz. The Python interpreter must be able to find these scripts in order to run alignreads; this can be done by saving copies of the scripts in the Python26 folder or your bin folder, or by modifying your .cshrc file to search where they are saved on your system (necessary if the lastz binaries cannot be found).
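Before a long run it can be worth confirming that the helper programs named above can be located. The following is a minimal sketch, not part of alignreads itself. It assumes Python 3 (for shutil.which) and only checks the executable search path, so scripts that are found some other way (for example, copies saved in the Python folder) may be reported as missing even though alignreads can use them.

```python
import shutil

# Names taken from the paragraph above; adjust the list to match your installation.
REQUIRED_TOOLS = ["runyasra.py", "sumqual.py", "qualtofa.py", "lastz"]


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be located on the executable search path."""
    return [tool for tool in tools if which(tool) is None]


def report(missing):
    """Build a one-line, human-readable summary of the check."""
    if missing:
        return "Not found on the search path: " + ", ".join(missing)
    return "All helper programs were found."
```

Calling print(report(find_missing_tools())) prints the summary. The which parameter exists only so the lookup can be swapped out when testing.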
How to use it: alignreads.py is an uncompiled Python script, so it must be run with the Python interpreter. alignreads.py takes either 2 fasta files or 1 directory as its required arguments. Since this is a pipeline of five independent programs, there are many options that can be used; the full list is at the end of this document. The following configuration will run the full pipeline:
Syntax: python alignreads.py [options] <microreads> <reference>
Example: python alignreads.py --single-step --mask-contigs 35 -e -b 200 ../myReads.fa ../myRef.fa
To start the pipeline with data from a previous YASRA or alignreads run, use the following:
Syntax: python alignreads.py [options] <YASRA output directory>
Example: python alignreads.py --mask-contigs 35 -e -b 200 ../myYasraData
The input directory is used to hold the output of the other 4 scripts.
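The two invocation forms above differ only in their required arguments: two fasta files for a full run, or one directory to continue from existing YASRA data. The sketch below shows one way a wrapper script could tell them apart. It is an illustration under that assumption and is not taken from the alignreads source; the function name and the returned labels are hypothetical.

```python
import os


def classify_arguments(positional, isdir=os.path.isdir):
    """Decide which form of the command line was used.

    Returns "full" for <microreads> <reference>, or "resume" for a single
    directory holding the output of a previous run.
    """
    if len(positional) == 2:
        return "full"
    if len(positional) == 1 and isdir(positional[0]):
        return "resume"
    raise ValueError("expected 2 fasta files or 1 existing directory")
```

With the examples above, classify_arguments(["../myReads.fa", "../myRef.fa"]) gives "full", while a single path gives "resume" only if it is an existing directory.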
How it works: YASRA, an open-source short read assembler, is the first step in the pipeline. It uses a reference and a set of microreads to produce various files, including a set of contigs and their quality values. These contigs are then aligned to the same reference by NUCmer, which outputs a ".delta" file that contains the locations and indels of the contigs within the alignment. sumqual.py, a script from the Liston lab, takes the output of NUCmer, the reference, and the quality values for the contigs produced by YASRA, and makes a "consensus" quality value file, which contains multiple contigs aligned to the reference. Most of the important aspects of the alignment are represented in this data-rich format. Another script from the Liston lab, qualtofa.py, takes the "consensus" quality file and produces a fasta-format alignment that is easily viewed in programs such as BioEdit.
Why use it: It is easier to use than running the constituent scripts independently. It is composed entirely of open-source code, so it's free! It is a single command, so it is easy to run multiple assemblies in a batch script, and time-consuming calculations can be done automatically and unsupervised.
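As an illustration of the batch use described above, here is a minimal sketch of a driver that runs several assemblies one after another. It assumes that alignreads.py is in the current directory and that "python" starts a suitable interpreter. The function names are hypothetical, and any read or reference file names you pass in are your own.

```python
import subprocess


def build_command(reads, reference, options=()):
    """Assemble the argument list for one alignreads.py run."""
    return ["python", "alignreads.py"] + list(options) + [reads, reference]


def run_batch(jobs, options=(), runner=subprocess.call):
    """Run each (reads, reference) pair in turn and return the exit codes."""
    return [runner(build_command(reads, ref, options)) for reads, ref in jobs]
```

For example, run_batch([("../myReads.fa", "../myRef.fa")], options=["--single-step", "-e", "-b", "200"]) runs the first example command from this document without the masking option. The runs are sequential, so each assembly finishes before the next one starts.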
What it can do: Here is the full list of available options. Options represented by a lowercase letter are those that we anticipate will be used most frequently, so only these are shown in the standard help menu, which is reached by giving no arguments. To see information on all of the options, type either 'python alignreads.py -H' or 'python alignreads.py --advanced-help'.
Usage: python alignreads.py [options] <Reads in .fa file> <Reference> OR...
python alignreads.py [options] <YASRA folder>
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-H, --advanced-help Display help information for all supported options
(Default: only basic options are shown)
-z STRING, --import-options=STRING
Specify the path to a Command_Line_Record.txt file from
a previous run, or the folder that contains one. Any
other options used with this one are overwritten.
(Default: use options supplied)
-Z, --debug Save the debug log to the current working directory.
(Default: don't save)
YASRA-Related Modifiers:
-d, --silent Nothing is printed to the screen (Default: print the
output of yasra to the screen)
-t 454 or solexa, --read-type=454 or solexa
Specify the type of reads. (Default: solexa)
-o circular or linear, --read-orientation=circular or linear
Specify orientation of the sequence. (Default:
circular)
-p same, high, medium, low or very low, --percent-identity=same, high, medium, low or very low
The percent identity (PID in yasra). The settings
correspond to different percent values depending on
the read type (-t). (Default: same)
-a, --single-step Activate yasra's single_step option (Default: run
yasra normally)
-E FILEPATH, --external-makefile=FILEPATH
Specify path to external makefile used by YASRA.
(Default: use the makefile built in to runyasra)
-Q, --no-dot-replace-reads
Do NOT replace N's with dots (.) in the microreads
file before running yasra. (Default: replace N's with dots)
-I, --no-dos2unix-ref
Do NOT run dos2unix on the reference before running
yasra. (Default: run dos2unix)
NUCmer-Related Modifiers:
-f STRING, --prefix=STRING
Set the output file prefix (Default: out)
-b INT, --break-length=INT
Distance an alignment extension will attempt to extend
poor scoring regions before giving up (Default: 200)
-j INT, --alternate-ref=INT
Specify a new reference to be used in the rest of the
alignment after yasra. (Default: use YASRA's
reference)
-A mum, ref, or max, --anchor-uniqueness=mum, ref, or max
Specify how NUCmer chooses anchor matches using one of
three settings: mum = Use anchor matches that are
unique in both the reference and query, ref = Use
anchor matches that are unique in the reference but
not necessarily unique in the query, max = Use all
anchor matches regardless of their uniqueness.
(Default = ref)
-T INT, --min-cluster=INT
Minimum cluster length used in the NUCmer analysis.
(Default: 65)
-D FLOAT, --diag-factor=FLOAT
Maximum diagonal difference factor for clustering,
i.e. diagonal difference / match separation used by
NUCmer. (Default: 0.12)
-J, --no-extend Prevent alignment extensions from their anchoring
clusters but still align the DNA between clustered
matches in NUCmer. (Default: extend)
-F, --forward-only Align only the forward strands of each sequence.
(Default: forward and reverse)
-X INT, --max-gap=INT
Maximum gap between two adjacent matches in a cluster.
(Default: 90)
-M INT, --min-match=INT
Minimum length of a maximal exact match. (Default:
20)
-C, --coords Automatically generate the <prefix>.coords file using
the 'show-coords' program with the -r option.
(Default: don't)
-O, --no-optimize Toggle alignment score optimization. Setting
--no-optimize will prevent alignment score optimization
and result in sometimes longer, but lower scoring
alignments (default: optimize)
-S, --no-simplify Do NOT simplify alignments by removing shadowed
clusters. Use this option if aligning a sequence to
itself to look for repeats. (Default: simplify)
Delta-Filter-Related Modifiers:
-y INT, --min-identity=INT
Set the minimum alignment identity [0, 100], (Default:
80)
-l INT, --min-align-length=INT
Set the minimum alignment length (Default: 100)
-K FLOAT, --max-overlap=FLOAT
Set the maximum alignment overlap for -r and -q
options as a percent of the alignment length [0, 100].
(Default 100)
-B, --query-alignment
Query alignment using length*identity weighted LIS.
For each query, leave only the alignments which form
the longest consistent set for the query. (Default:
global alignment)
-R, --ref-alignment
Reference alignment using length*identity weighted
LIS. For each reference, leave only the alignments
which form the longest consistent set for the
reference. (Default: global alignment)
-G, --global-alignment
Global alignment using length*identity weighted LIS
(longest increasing subset). For every reference-query
pair, leave only the alignments which form the longest
mutually consistent set. (this is the default)
-U FLOAT, --min-uniqueness=FLOAT
Set the minimum alignment uniqueness, i.e. percent of
the alignment matching to unique reference AND query
sequence [0, 100]. (Default 0)
sumqual-Related Modifiers:
-Y, --save-ref-dels
Save the sequence of the reference that corresponds to
empty gaps in the consensus in a fasta file. (Default:
don't save)
qualtofa-Related Modifiers:
-c, --exclude-contigs
Don't include each contig on its own line (Default:
include contigs)
-i, --no-match-overlap
Add deletions (i.e. -'s) to the reference to
accommodate any overlapping matches. (Default:
Condense all overlapping regions of the consensus into
IUPAC ambiguity codes.)
-e, --no-overlap Add deletions (i.e. -'s) to the reference to
accommodate any overlapping sequence, including
unmatched sequence. (Default: Condense all overlapping
regions of the consensus into IUPAC ambiguity codes.)
-k, --keep-contained
Include contained contigs (Default: save sequences of
contained contigs to a separate file)
-q INT, --end-trim-qual=INT
Trim all the bases on either end of all contigs that
have a quality value less than the specified amount
(Default: 0)
-s, --dont-save-SNPs
Don't save SNPs to a .qual file (Default: save SNP file)
-W, --dont-align-contigs
Do NOT align contigs to the reference using '-'s at
the start of each contig; independent of the
consensus. (Default: align contigs)
-N INT, --end-trim-num=INT
Trim the ends of the contigs by the specified number
of bases. (Default: 0)
-L INT, --min-match-length=INT
Set minimum length of the matching region of the
contigs. (Default: 50)
Coverage and Call Proportion Masking:
The following options take one integer argument and one decimal
argument between 0 and 1; if the second is not supplied, it is assumed
to be 0.
-m, --mask-contigs Set minimum coverage depth and call proportion for
contig masking; independent of the consensus. Cannot
be used with the -c modifier. (Default: 0, 0)
-n, --mask-contig-SNPs
Set minimum coverage depth and call proportion for
contig SNP masking; independent of the consensus.
Cannot be used with the -c modifier. (Default: 0, 0)
-w, --mask-consensus
Set minimum coverage depth and call proportion for the
consensus; a new masked sequence will be added to the
output file. (Default: 0, 0)
-x, --mask-SNPs Set minimum coverage depth and call proportion for
SNPs in the consensus; a new masked sequence will be
added to the output file. (Default: 0, 0)
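
By default (without -i or -e), overlapping regions of the consensus are condensed into IUPAC ambiguity codes. The sketch below illustrates what that condensing means for a single alignment column, using the standard IUPAC nucleotide codes. It is an illustration only, not the code used by qualtofa.py; the function name is hypothetical, and how qualtofa.py treats gaps or quality values at such positions is not covered here.

```python
# Standard IUPAC nucleotide codes, keyed by the set of bases each one represents.
IUPAC_CODES = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def condense(bases):
    """Return the single IUPAC code covering every base seen at one position."""
    return IUPAC_CODES[frozenset(base.upper() for base in bases)]
```

For example, a position where one contig has A and an overlapping contig has G condenses to R, and a position where the overlapping contigs agree keeps its original base.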