SDTMPI_Linux32 (Sequence Demarcation Tool - MPI version for Linux 32 bit)
========================================================================

SDTMPI_Linux32 is a free Linux-based program that runs on multiple cores to
allow quick calculation of DNA sequence pairwise identities. The program is
written in Python and uses the mpi4py library to parallelize pairwise
alignment and identity score calculation.

Given a FASTA file containing DNA sequences, the program aligns every possible
pair of sequences using MUSCLE (Edgar, 2004), ClustalW2 (Larkin et al., 2007)
or MAFFT (Katoh et al., 2009), calculates the sequence identity score for each
pair and uses a rooted neighbour joining phylogenetic tree to cluster closely
related sequences based on identity scores. It outputs two text files
containing the scores (one in matrix format, one in column format) and a
".sdt" file that can be opened with the Windows version of SDTv.1.0 to
visualise the pairwise identity distribution plot and the colour-coded
identity matrix.

The identity scores are calculated as 1-(M/N), where M is the number of
mismatching nucleotides and N is the total number of positions along the
alignment where neither sequence has a gap character.

DOWNLOAD AND INSTALLATION
=========================

1. Software requirements: the program requires the following to be installed:
   - Python2.7
   - mpi4py, see installation instructions at
     http://mpi4py.scipy.org/docs/usrman/install.html
   - Mafft v6.923b (optional: only if required by the user), available at
     http://mafft.cbrc.jp/alignment/software/source.html
   - Neighbor (from the package phylip-3.69), available at
     http://evolution.genetics.washington.edu/phylip.html

   NB: the muscle and clustalw executable files are located in the bin
   directory. In the main script "SDTMPI_Linux32.py", the paths to these
   programs can be changed to wherever they are installed on your system.

2. Install mpi4py and test whether it is working on your system.

3. Download SDTMPI_Linux32 from http://web.cbio.uct.ac.za/SDT

4. Extract the SDTMPI_Linux32.tar.gz file into the location where you want to
   place the program. The folder contains:
   - The bin directory, which contains the executable files
     "muscle3.8.31_i86linux32", "clustalw2" and "neighbor".
   - The Bio directory, which contains the Biopython library.
   - The output directory, in which the output files of each run will be
     stored.
   - SDTMPI_Linux32.py, which is the main program file.
   - A sample submission shell file (used to submit a job on the cluster).
   - test.fas, a sample FASTA file to test the program.

5. Change the mode of "muscle3.8.31_i86linux32", "clustalw2" and "neighbor" in
   the bin directory to "executable".

6. In the SDTMPI_Linux32.py script, change the paths to Neighbor and the
   alignment programs (MUSCLE, MAFFT and CLUSTALW) if you already have them
   installed on your system. Otherwise, use those provided in the
   SDTMPI_Linux32/bin directory and remember to enable their executable
   permission.

7. Running commands:

The script, SDTMPI_Linux32.py, takes two parameters: the name of the input
FASTA file and the name of the alignment program to be used. The execution
command for the parallel version is as follows:

    mpiexec -n 8 python SDTMPI_Linux32.py test.fas muscle

This uses MUSCLE as the alignment program and runs on 8 cores. Replace
"muscle" with "clustal" or "mafft" to change the alignment program that is
used. Before using MAFFT, install it on your computer system and, in the main
script, change MAFFT_PATH="/xxx/mafft" to the location of its executable file.
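
For reference, the identity score defined in the introduction, 1-(M/N), can be
sketched for a single pair of already-aligned sequences as shown below. This is
only an illustration of the formula, not the code used in SDTMPI_Linux32.py:
the function name pairwise_identity, the use of "-" as the only gap character
and the case-insensitive comparison are all assumptions made for the example.

```python
def pairwise_identity(aligned_a, aligned_b, gap_chars="-"):
    """Return 1 - (M / N) for two aligned sequences of equal length.

    M = number of mismatching nucleotides
    N = number of alignment positions where neither sequence has a gap
    Assumptions (not taken from the program itself): "-" is the only gap
    character and upper/lower case are treated as the same nucleotide.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    mismatches = 0  # M
    compared = 0    # N
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        # Skip every column in which at least one sequence has a gap.
        if a in gap_chars or b in gap_chars:
            continue
        compared += 1
        if a != b:
            mismatches += 1

    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return 1.0 - float(mismatches) / compared


if __name__ == "__main__":
    # Four ungapped columns, one mismatch (G vs C): 1 - 1/4
    print(pairwise_identity("ACG-TA", "ACCTT-"))
```

In this example two of the six alignment columns contain a gap and are
ignored, so N = 4 and M = 1, giving an identity score of 0.75.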

When the pairwise alignments and identity score calculations are completed,
the scores are written to two text files that are named after the input FASTA
file and saved in the output folder. A ".sdt" file is also produced, which can
be opened using the Windows version of SDTv.1.0 to visualise the program's
output in the form of the pairwise identity plot and the colour-coded pairwise
identity matrix.

WARNING
-------

This code has been used successfully on Scientific Linux 5.4, 5.5 and 5.8,
with OpenMPI 1.4-4 and MPI4PY 1.3. On other platforms where the OpenMPI
implementation does not support a call to the fork() function (needed by the
Python Popen command), the code will generate an error.

------------------------------------------------------------------------

References
----------

1. Edgar RC. (2004). MUSCLE: multiple sequence alignment with high accuracy
   and high throughput. Nucleic Acids Research 32, 1792-1797.
2. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H,
   Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ,
   Higgins DG. (2007). Clustal W and Clustal X version 2.0.
   Bioinformatics 23, 2947-2948.
3. Katoh K, Asimenos G, Toh H. (2009). Multiple Alignment of DNA Sequences
   with MAFFT. Methods in Molecular Biology 537, 39-64.
4. Felsenstein J. (1995). PHYLIP (Phylogeny Inference Package) Version 3.57c,
   available at http://www.med.nyu.edu/rcr/rcr/phylip/main.html#refs

Authors
-------

Brejnev Muhire [1]
Darren Martin [1]
Arvind Varsani [2]

[1] Institute of Infectious Diseases and Molecular Medicine (IIDMM)
    Computational Biology Group, University of Cape Town
    South Africa

[2] School of Biological Sciences
    University of Canterbury
    Private Bag 4800
    Christchurch, 8140
    New Zealand

BM is funded by the University of Cape Town

website: http://web.cbio.uct.ac.za/SDT
email: mhrbre001@myuct.ac.za
email: mubrejnev@gmail.com