Alignreads.py README

Alignreads.py Readme - Last Modified 9/22/2010 for v2.23

What it is: A pipeline of five scripts used for making reference-guided assemblies from microreads. It is meant to provide a one-step, automatable script that produces full alignments, along with all of the associated files, using a single command. It takes a minimum of 2 arguments, but supports over 40 modifiers from the constituent scripts. Since the first step of the pipeline (YASRA) takes the majority of the computing time, the script can also take the output of a previous YASRA run and continue the pipeline from there. It can also rerun the pipeline on the YASRA data in a finished alignreads output; each analysis will have its own subfolder within the main folder containing the YASRA data. If no arguments are supplied, a help menu is printed with the most common options displayed. The full help menu with all of the options can be reached by adding the -H/--advanced-help option. Alignreads uses the following scripts from the Liston lab: runyasra.py, sumqual.py, and qualtofa.py. It also uses YASRA and its binaries, including lastz. The Python interpreter must be able to find these scripts in order to run alignreads; this can be done by saving copies of the scripts in the Python26 folder or your bin folder, or by modifying your .cshrc file to search where they are saved on your system (necessary if the lastz binaries cannot be found).
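As a quick sanity check before a long run, the following sketch reports which of the tools named in this README cannot be found on the shell's search path. It is written for Python 3 (not the Python 2.6 mentioned above) and makes two assumptions: that the scripts are installed as executables on PATH, and that the binaries are named "lastz" and "nucmer" on your system. Scripts placed only in the Python folder would not be detected this way.

```python
import shutil

# Names taken from this README; the exact executable names on a given
# system (for example "nucmer" or "lastz") are assumptions and may differ.
REQUIRED_TOOLS = ["runyasra.py", "sumqual.py", "qualtofa.py", "lastz", "nucmer"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("All required tools were found.")
```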

How to use it: alignreads.py is an uncompiled Python script, so it must be run with the Python interpreter. It takes either 2 FASTA files or 1 directory as its required arguments. Since this is a pipeline of five independent programs, there are many options that can be used; the full list is at the end of this document. The following configuration will run the full pipeline:

Syntax: python alignreads.py [options]

Example: python alignreads.py --single-step --mask-contigs 35 -e -b 200 ../myReads.fa ../myRef.fa

In order to start the pipeline with data from a previous yasra or alignreads run, use the following:

Syntax: python alignreads.py [options] 

Example: python alignreads.py --mask-contigs 35 -e -b 200 ../myYasraData

The input directory is used to hold the output of the other 4 scripts.
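The two invocation modes can also be assembled programmatically. The sketch below is a hypothetical helper (build_command is not part of alignreads) that only checks the number of positional inputs and puts the options before them, as in the examples above. It does not verify that the files or the directory exist.

```python
def build_command(inputs, options=()):
    """Build an argument list for alignreads.py.

    inputs  -- [reads_fasta, reference_fasta] for a full run, or
               [previous_run_directory] to continue from earlier YASRA data
    options -- option strings taken from the option list in this README
    """
    inputs = list(inputs)
    if len(inputs) not in (1, 2):
        raise ValueError(
            "expected 2 FASTA files or 1 directory, got %d inputs" % len(inputs)
        )
    return ["python", "alignreads.py"] + list(options) + inputs


# The two examples from this README, rebuilt as argument lists.
full_run = build_command(
    ["../myReads.fa", "../myRef.fa"],
    ["--single-step", "--mask-contigs", "35", "-e", "-b", "200"],
)
resumed_run = build_command(
    ["../myYasraData"],
    ["--mask-contigs", "35", "-e", "-b", "200"],
)
print(" ".join(full_run))
print(" ".join(resumed_run))
```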

How it works: YASRA, an open-source short read assembler, is the first step in the pipeline. It uses a reference and a set of microreads to produce various files, including a set of contigs and their quality values. These contigs are then aligned to the same reference by NUCmer, which outputs a ".delta" file that contains the locations and indels of the contigs within the alignment. sumqual.py, a script from the Liston lab, takes the output of NUCmer, the reference, and the quality values for the contigs produced by YASRA, and makes a "consensus" quality value file, which contains multiple contigs aligned to the reference. Most of the important aspects of the alignment are represented in this data-rich format. Another script from the Liston lab, qualtofa.py, takes the "consensus" quality file and produces a FASTA-format alignment, which is easily viewed in programs such as BioEdit.
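To make the role of the ".delta" file more concrete, here is a minimal sketch of a reader for it. It is not the code used by sumqual.py; it assumes the usual NUCmer layout (two header lines, a ">reference query reference_length query_length" line per sequence pair, a seven-field line per alignment, then one signed indel offset per line ending with a 0). The sample data is invented for illustration.

```python
def parse_delta(lines):
    """Parse the lines of a NUCmer .delta file into alignment records."""
    alignments = []
    ref_name = qry_name = None
    current = None
    for line in lines[2:]:  # skip the file-path line and the "NUCMER" line
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith(">"):
            # New reference/contig pair.
            ref_name, qry_name = fields[0][1:], fields[1]
        elif len(fields) == 7:
            # Alignment header: coordinates in the reference and in the contig.
            current = {
                "ref": ref_name,
                "query": qry_name,
                "ref_start": int(fields[0]),
                "ref_end": int(fields[1]),
                "query_start": int(fields[2]),
                "query_end": int(fields[3]),
                "indels": [],
            }
            alignments.append(current)
        elif len(fields) == 1 and current is not None:
            offset = int(fields[0])
            if offset != 0:  # a single 0 closes the indel list
                current["indels"].append(offset)
    return alignments


# Invented sample: one contig aligned to one reference with two indels.
SAMPLE = [
    "/data/myRef.fa /data/contigs.fa",
    "NUCMER",
    ">myRef Contig1 5000 800",
    "100 900 1 800 5 5 0",
    "-250",
    "1",
    "0",
]
records = parse_delta(SAMPLE)
print(records[0]["ref_start"], records[0]["ref_end"], records[0]["indels"])
```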

Why use it: It is easier to use than running the constituent scripts independently. It is composed entirely of open-source code, so it's free! It is a single command, so it is easy to run multiple assemblies in a batch script, allowing time-consuming calculations to be done automatically and unsupervised.
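A batch run could look something like the sketch below, which calls alignreads.py once per reads/reference pair and records each exit code. The function name, the file names, and the runner parameter (there so the loop can be tried without launching anything) are all hypothetical; only the command-line options come from this README.

```python
import subprocess


def run_batch(jobs, options, runner=subprocess.call):
    """Run alignreads.py once per (reads, reference) pair.

    Returns a dict mapping each pair to whatever the runner returns
    (the exit code when the default subprocess.call is used).
    """
    results = {}
    for reads, reference in jobs:
        command = ["python", "alignreads.py"] + list(options) + [reads, reference]
        results[(reads, reference)] = runner(command)
    return results


# Hypothetical input files. The dry-run runner below only prints each
# command; drop the runner argument to actually launch the pipeline.
jobs = [("sampleA.fa", "myRef.fa"), ("sampleB.fa", "myRef.fa")]
dry_run = run_batch(
    jobs,
    ["--mask-contigs", "35", "-e", "-b", "200"],
    runner=lambda command: print(" ".join(command)),
)
```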

What it can do: Here is the full list of available options. Options that are represented by a lowercase letter are those that we anticipate will be used most frequently; thus only these are shown in the standard help menu, which is reached by giving no arguments. To see information on all of the options, type either 'python alignreads.py -H' or 'python alignreads.py --advanced-help'.

Usage: python alignreads.py [options]   OR... python alignreads.py [options] 

Options:
  --version
      show program's version number and exit
  -h, --help
      show this help message and exit
  -H, --advanced-help
      Display help information for all supported options. (Default: only basic options are shown)
  -z STRING, --import-options=STRING
      Specify the path to a Command_Line_Record.txt file from a previous run, or the folder that contains one. Any other options used with this one are overwritten. (Default: use options supplied)
  -Z, --debug
      Save the debug log to the current working directory. (Default: don't save)

YASRA-Related Modifiers:
  -d, --silent
      Nothing is printed to the screen. (Default: print the output of yasra to the screen)
  -t 454 or solexa, --read-type=454 or solexa
      Specify the type of reads. (Default: solexa)
  -o circular or linear, --read-orientation=circular or linear
      Specify orientation of the sequence. (Default: circular)
  -p same, high, medium, low or very low, --percent-identity=same, high, medium, low or very low
      The percent identity (PID in yasra). The settings correspond to different percent values depending on the read type (-t). (Default: same)
  -a, --single-step
      Activate yasra's single_step option. (Default: run yasra normally)
  -E FILEPATH, --external-makefile=FILEPATH
      Specify path to external makefile used by YASRA. (Default: use the makefile built in to runyasra)
  -Q, --no-dot-replace-reads
      Do NOT replace N's with dots (.) in the microreads file before running yasra. (Default: replace N's with dots)
  -I, --no-dos2unix-ref
      Do NOT run dos2unix on the reference before running yasra. (Default: run dos2unix)

NUCmer-Related Modifiers:
  -f STRING, --prefix=STRING
      Set the output file prefix. (Default: out)
  -b INT, --break-length=INT
      Distance an alignment extension will attempt to extend poor scoring regions before giving up. (Default: 200)
  -j INT, --alternate-ref=INT
      Specify a new reference to be used in the rest of the alignment after yasra. (Default: use YASRA's reference)
  -A mum, ref, or max, --anchor-uniqueness=mum, ref, or max
      Specify how NUCmer chooses anchor matches using one of three settings:
        mum = Use anchor matches that are unique in both the reference and query
        ref = Use anchor matches that are unique in the reference but not necessarily unique in the query
        max = Use all anchor matches regardless of their uniqueness
      (Default: ref)
  -T INT, --min-cluster=INT
      Minimum cluster length used in the NUCmer analysis. (Default: 65)
  -D FLOAT, --diag-factor=FLOAT
      Maximum diagonal difference factor for clustering, i.e. diagonal difference / match separation used by NUCmer. (Default: 0.12)
  -J, --no-extend
      Prevent alignment extensions from their anchoring clusters but still align the DNA between clustered matches in NUCmer. (Default: extend)
  -F, --forward-only
      Align only the forward strands of each sequence. (Default: forward and reverse)
  -X INT, --max-gap=INT
      Maximum gap between two adjacent matches in a cluster. (Default: 90)
  -M INT, --min-match=INT
      Minimum length of a maximal exact match. (Default: 20)
  -C, --coords
      Automatically generate the .coords file using the 'show-coords' program with the -r option. (Default: don't)
  -O, --no-optimize
      Turn off alignment score optimization. Setting --no-optimize will prevent alignment score optimization and result in sometimes longer, but lower scoring alignments. (Default: optimize)
  -S, --no-simplify
      Do not simplify alignments by removing shadowed clusters. Use this option if aligning a sequence to itself to look for repeats. (Default: simplify)

Delta-Filter-Related Modifiers:
  -y INT, --min-identity=INT
      Set the minimum alignment identity [0, 100]. (Default: 80)
  -l INT, --min-align-length=INT
      Set the minimum alignment length. (Default: 100)
  -K FLOAT, --max-overlap=FLOAT
      Set the maximum alignment overlap for the -R and -B options (delta-filter's -r and -q) as a percent of the alignment length [0, 100]. (Default: 100)
  -B, --query-alignment
      Query alignment using length*identity weighted LIS. For each query, leave only the alignments which form the longest consistent set for the query. (Default: global alignment)
  -R, --ref-alignment
      Reference alignment using length*identity weighted LIS. For each reference, leave only the alignments which form the longest consistent set for the reference. (Default: global alignment)
  -G, --global-alignment
      Global alignment using length*identity weighted LIS (longest increasing subset). For every reference-query pair, leave only the alignments which form the longest mutually consistent set. (This is the default)
  -U FLOAT, --min-uniqueness=FLOAT
      Set the minimum alignment uniqueness, i.e. percent of the alignment matching to unique reference AND query sequence [0, 100]. (Default: 0)

sumqual-Related Modifiers:
  -Y, --save-ref-dels
      Save the sequence of the reference that corresponds to empty gaps in the consensus in a fasta file. (Default: don't save)

qualtofa-Related Modifiers:
  -c, --exclude-contigs
      Don't include each contig on its own line. (Default: include contigs)
  -i, --no-match-overlap
      Add deletions (i.e. -'s) to the reference to accommodate any overlapping matches. (Default: condense all overlapping regions of the consensus into IUPAC ambiguity codes)
  -e, --no-overlap
      Add deletions (i.e. -'s) to the reference to accommodate any overlapping sequence, including unmatched sequence. (Default: condense all overlapping regions of the consensus into IUPAC ambiguity codes)
  -k, --keep-contained
      Include contained contigs. (Default: save sequences of contained contigs to a separate file)
  -q INT, --end-trim-qual=INT
      Trim all the bases on either end of all contigs that have a quality value less than the specified amount. (Default: 0)
  -s, --dont-save-SNPs
      Don't save SNPs to a .qual file. (Default: save SNP file)
  -W, --dont-align-contigs
      Do NOT align contigs to the reference using '-'s at the start of each contig; independent of the consensus. (Default: align contigs)
  -N INT, --end-trim-num=INT
      Trim the ends of the contigs by the specified number of bases. (Default: 0)
  -L INT, --min-match-length=INT
      Set minimum length of the matching region of the contigs. (Default: 50)
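The default behaviour of condensing overlapping regions into IUPAC ambiguity codes can be illustrated with a short sketch. This is not the code in qualtofa.py; it uses the standard IUPAC nucleotide table and assumes the two overlapping stretches have equal length and contain only A, C, G and T.

```python
# Standard IUPAC nucleotide codes, keyed by the set of bases they stand for.
IUPAC_CODES = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def condense(bases):
    """Return the single IUPAC code that covers every base in `bases`."""
    return IUPAC_CODES[frozenset(base.upper() for base in bases)]


def condense_overlap(seq_a, seq_b):
    """Merge two equal-length overlapping stretches position by position."""
    if len(seq_a) != len(seq_b):
        raise ValueError("overlapping stretches must have the same length")
    return "".join(condense(pair) for pair in zip(seq_a, seq_b))


# Position 3 differs (G vs T), so it becomes the ambiguity code K.
print(condense_overlap("ACGT", "ACTT"))
```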

Coverage and Call Proportion Masking:
  The following options take one integer argument and one decimal argument between 0 and 1; if the second is not supplied, it is assumed to be 0.
  -m, --mask-contigs
      Set minimum coverage depth and call proportion for contig masking; independent of the consensus. Cannot be used with the -c modifier. (Default: 0, 0)
  -n, --mask-contig-SNPs
      Set minimum coverage depth and call proportion for contig SNP masking; independent of the consensus. Cannot be used with the -c modifier. (Default: 0, 0)
  -w, --mask-consensus
      Set minimum coverage depth and call proportion for the consensus; a new masked sequence will be added to the output file. (Default: 0, 0)
  -x, --mask-SNPs
      Set minimum coverage depth and call proportion for SNPs in the consensus; a new masked sequence will be added to the output file. (Default: 0, 0)
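As a rough illustration of how a coverage depth and call proportion threshold pair might be applied, the sketch below masks every position that falls under either minimum. The details are assumptions rather than a description of the actual implementation: the call proportion is taken to be the fraction of reads supporting the called base, and masked positions are written as "N".

```python
def mask_sequence(sequence, depths, proportions,
                  min_depth=0, min_proportion=0.0, mask_char="N"):
    """Mask positions whose depth or call proportion is below the minimum.

    sequence    -- the called bases
    depths      -- coverage depth at each position
    proportions -- assumed fraction of reads agreeing with each call
    """
    masked = []
    for base, depth, proportion in zip(sequence, depths, proportions):
        if depth < min_depth or proportion < min_proportion:
            masked.append(mask_char)
        else:
            masked.append(base)
    return "".join(masked)


# Thresholds of 35 and 0.8: position 2 fails on depth and position 3
# fails on call proportion.
print(mask_sequence("ACGT", [40, 10, 50, 35], [1.0, 1.0, 0.5, 0.9], 35, 0.8))
```

With the default thresholds of 0 and 0, nothing is masked, which matches the "(Default: 0, 0)" entries above.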