Making alignments with cactus

From OpenWetWare

To run Cactus you'll need your genomes and an input file. The first line of the input file is a Newick-style tree of the names of the genomes you want to align, followed by one line per genome with a tab-delimited name and path. Each name must be unique, and it is what that genome will be called in the alignment. The path is where that genome's FASTA file is located. The genomes are supposed to be soft-masked, so mask them before running Cactus if you can.
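Since soft-masking marks repeats as lowercase bases, a quick way to see whether a FASTA file is already soft-masked is to count its lowercase letters. This is only a minimal sketch, assuming a standard awk is available; the function name softmask_counts is made up for illustration and is not part of Cactus or haltools.

```shell
# Print "<lowercase bases> <total bases>" for a FASTA file.
# A count of 0 lowercase bases suggests the genome is not soft-masked.
softmask_counts() {
    LC_ALL=C awk '!/^>/ {
            total += length($0)            # count every base on sequence lines
            lower += gsub(/[a-z]/, "")     # gsub returns how many lowercase letters it found
        }
        END { print lower + 0, total + 0 }' "$1"
}

# Example call (uncomment and point it at your own genome):
# softmask_counts xbir-COAC-16-VIII-22-M_v2023.1.fa
```

If the first number is zero or very small relative to the second, the genome probably still needs masking.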

Here's an example of the name/path lines for aligning the Xbir_pacbio_v2023.1 genome with the X. maculatus genome:

xbir-COAC-16-VIII-22-M_v2023.1	xbir-COAC-16-VIII-22-M_v2023.1.fa
GCA_002775205.2_X_maculatus-5.0-male_genomic	GCA_002775205.2_X_maculatus-5.0-male_genomic.fna
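The full input file, including the Newick tree on the first line, could be written with a short helper like the one below. This is a sketch under assumptions: the two-leaf tree without branch lengths is a guess at a reasonable tree for a pairwise alignment (the tree line actually used is not shown on this page), and write_cactus_input is a hypothetical helper name.

```shell
# Write a Cactus input file for two genomes:
#   line 1: Newick tree of the genome names (assumed here: a simple two-leaf tree)
#   then one "name<TAB>path" line per genome
write_cactus_input() {
    out="$1" name1="$2" path1="$3" name2="$4" path2="$5"
    {
        printf '(%s,%s);\n' "$name1" "$name2"
        printf '%s\t%s\n' "$name1" "$path1"
        printf '%s\t%s\n' "$name2" "$path2"
    } > "$out"
}

# Names and paths taken from the example above.
write_cactus_input xbirXmac_cactusInput.txt \
    xbir-COAC-16-VIII-22-M_v2023.1 xbir-COAC-16-VIII-22-M_v2023.1.fa \
    GCA_002775205.2_X_maculatus-5.0-male_genomic GCA_002775205.2_X_maculatus-5.0-male_genomic.fna
```

Using printf with an explicit \t guarantees real tab characters, which are easy to lose when copying and pasting from a web page.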

Cactus and haltools need to be run in a Python virtual environment, so these commands must be run first.

ml python/3.9.0
virtualenv -p python3.9  /home/groups/schumer/shared_bin/cactus/cactus-bin-v2.2.3/cactus_env
echo "export PATH=/home/groups/schumer/shared_bin/cactus/cactus-bin-v2.2.3/bin:\$PATH" >> /home/groups/schumer/shared_bin/cactus/cactus-bin-v2.2.3/cactus_env/bin/activate
echo "export PYTHONPATH=/home/groups/schumer/shared_bin/cactus/cactus-bin-v2.2.3/lib:\$PYTHONPATH" >> /home/groups/schumer/shared_bin/cactus/cactus-bin-v2.2.3/cactus_env/bin/activate
source /home/groups/schumer/shared_bin/cactus/cactus-bin-v2.2.3/cactus_env/bin/activate
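Because the echo commands above use >>, every rerun appends another copy of the same export line to the activate script. One possible way to avoid the duplicates is a small helper that only appends a line when it is not already there. This is a sketch, not part of the Cactus install instructions, and append_once is a hypothetical name.

```shell
# Append a line to a file only if that exact line is not already present.
append_once() {
    line="$1"
    file="$2"
    # -x: match whole lines only, -F: treat the line as a fixed string, not a regex
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example with the shared install path used above (run after virtualenv has created the env):
# CACTUS_DIR=/home/groups/schumer/shared_bin/cactus/cactus-bin-v2.2.3
# append_once "export PATH=$CACTUS_DIR/bin:\$PATH" "$CACTUS_DIR/cactus_env/bin/activate"
# append_once "export PYTHONPATH=$CACTUS_DIR/lib:\$PYTHONPATH" "$CACTUS_DIR/cactus_env/bin/activate"
```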

Even aligning two genomes takes more than two days, so here's an example script allowing seven days for it to complete.

#!/bin/bash
#SBATCH --job-name=cactus
#SBATCH --time=168:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=64000
#SBATCH -p schumer,hns

echo "xbir_xmac"

ml python/3.9.0
virtualenv -p python3.9 /home/users/qlangdon/cactus-bin-v2.2.3/cactus_env
echo "export PATH=/home/users/qlangdon/cactus-bin-v2.2.3/bin:\$PATH" >> /home/users/qlangdon/cactus-bin-v2.2.3/cactus_env/bin/activate
echo "export PYTHONPATH=/home/users/qlangdon/cactus-bin-v2.2.3/lib:\$PYTHONPATH" >> /home/users/qlangdon/cactus-bin-v2.2.3/cactus_env/bin/activate
source /home/users/qlangdon/cactus-bin-v2.2.3/cactus_env/bin/activate

cactus ./jobstore ./xbirXmac_cactusInput.txt ./xbir-COAC-16-VIII-22-M_v2023.1_GCA_002775205.2_X_maculatus-5.0-male_genomic.hal --realTimeLogging

halValidate xbir-COAC-16-VIII-22-M_v2023.1_GCA_002775205.2_X_maculatus-5.0-male_genomic.hal
halStats xbir-COAC-16-VIII-22-M_v2023.1_GCA_002775205.2_X_maculatus-5.0-male_genomic.hal > xbir-COAC-16-VIII-22-M_v2023.1_GCA_002775205.2_X_maculatus-5.0-male_genomic.hal_stats
halSummarizeMutations xbir-COAC-16-VIII-22-M_v2023.1_GCA_002775205.2_X_maculatus-5.0-male_genomic.hal > xbir-COAC-16-VIII-22-M_v2023.1_GCA_002775205.2_X_maculatus-5.0-male_genomic.hal_sumMut

The final steps are just useful to check that the alignment completed and to get some summary information. The intermediate files are put into the jobstore folder, which should disappear if the job completes. If the job crashes for ambiguous reasons, try removing the jobstore folder and starting the run again (this worked for me, Quinn, a surprising number of times). You may struggle to get a lot of genomes aligned this way, so look into running them step by step or progressively.
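The jobstore cleanup described above can be put at the top of the job script so that a resubmitted job always starts from a clean slate. This is a minimal sketch; reset_jobstore is a hypothetical helper name, and it assumes you really do want to discard the intermediate files from the crashed run.

```shell
# Remove a leftover job store directory from a previous crashed run, if any.
reset_jobstore() {
    jobstore="${1:-./jobstore}"     # default matches the path used in the script above
    if [ -d "$jobstore" ]; then
        echo "Removing leftover job store: $jobstore" >&2
        rm -rf -- "$jobstore"
    fi
}

# Example: call this just before the cactus command in the job script.
# reset_jobstore ./jobstore
```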

From here you can do liftovers or other downstream analyses.

Schumer lab: Commonly used workflows