Bioinformatics - HPC Portal

Institute for Massively Parallel Applications and Computing Technologies (IMPACT)

Pyramid System Overview

The High Performance Computing Laboratory (HPCL) operates a Sun Opteron cluster (PYRAMID) running the ROCKS cluster distribution. Pyramid contains 1048 CPUs based on AMD quad-core technology, with an aggregate memory capacity of 1TB and an aggregate disk capacity of more than 64TB. The system is interconnected by a QDR InfiniBand network, one of the fastest available networks, with a bandwidth of 40Gbit/s and a latency of ≈1μs. The main purpose of the cluster is to support the computing needs of the faculty, staff and students of the Institute for Massively Parallel Applications and Computing Technologies (IMPACT). The cluster is also made available to the CNMC-GWU Clinical and Translational Science Institute (CNMC-GWU CTSI) faculty and researchers.
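
As a rough illustration of what the quoted network figures mean in practice, the time to deliver one message can be estimated as latency plus message size divided by bandwidth. The sketch below is a simplified model, not a benchmark of Pyramid: the function name and the message sizes are illustrative assumptions, and it uses the nominal 40Gbit/s figure, so real transfers will be somewhat slower.

```python
# Simplified point-to-point cost model: time = latency + size / bandwidth.
# The constants are the nominal figures quoted above, not measured values.

LATENCY_S = 1e-6               # about 1 microsecond
BANDWIDTH_BITS_PER_S = 40e9    # 40 Gbit/s (nominal)


def transfer_time(message_bytes: int) -> float:
    """Estimated seconds to deliver one message of the given size."""
    return LATENCY_S + (message_bytes * 8) / BANDWIDTH_BITS_PER_S


if __name__ == "__main__":
    # Illustrative message sizes: 8 bytes, 1 KiB, 1 MiB, 1 GiB.
    for size in (8, 1024, 1024 ** 2, 1024 ** 3):
        micros = transfer_time(size) * 1e6
        print(f"{size:>12} bytes: {micros:14.2f} microseconds")
```

Under this model, small messages are dominated by latency and large ones by bandwidth, which is why both figures matter for parallel applications.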

More details about Pyramid are available on the Pyramid Wiki.

Access to Pyramid
To access an HPCL machine, you need to create an HPCL account.
The steps to create an account are available at: HPCL Account Creation

After creating an HPCL account, you can follow the steps on the Pyramid Wiki to log in.

Available Software Resources/Packages

HMMER - From Washington University in St. Louis
Description: HMMER is used for searching sequence databases for homologs of protein sequences, and for making protein sequence alignments. It implements methods using probabilistic models called “profile hidden Markov models” (profile HMMs).

Compared to BLAST, FASTA, and other sequence alignment and database search tools based on older scoring methodology, HMMER aims to be significantly more accurate and more able to detect remote homologs because of the strength of its underlying mathematical models. In the past, this strength came at significant computational expense, but in the new HMMER3 project, HMMER is now essentially as fast as BLAST. 

More information: http://hmmer.janelia.org

NCBI BLAST - From National Center for Biotechnology Information
Description: The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.
More information: http://www.ncbi.nlm.nih.gov/BLAST
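
To make "regions of local similarity" concrete, here is a minimal sketch of Smith-Waterman local alignment scoring. BLAST itself uses a faster heuristic and adds statistical significance estimates, so this is not BLAST's algorithm. The function name and the scoring values (match, mismatch, gap) are assumptions chosen only for illustration.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Return the best Smith-Waterman local alignment score of a and b."""
    prev = [0] * (len(b) + 1)   # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            # A cell never drops below 0, so an alignment may start anywhere.
            curr[j] = max(0,
                          prev[j - 1] + step,   # align a[i-1] with b[j-1]
                          prev[j] + gap,        # gap in b
                          curr[j - 1] + gap)    # gap in a
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(local_alignment_score("TTACGTTT", "GGACGGG"))
```

The score reflects only the best-matching region, so unrelated flanking sequence on either side does not lower it.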

mpiBLAST - From Los Alamos National Laboratory
Description: mpiBLAST is a freely available, open-source, parallel implementation of NCBI BLAST. It takes advantage of distributed computational resources (i.e., a cluster) through explicit MPI communication and thereby utilizes all available resources, unlike standard NCBI BLAST, which can only take advantage of shared-memory multi-processors (SMPs).

The primary advantage of using mpiBLAST over traditional NCBI BLAST is performance. mpiBLAST can increase performance by several orders of magnitude while still producing results identical to those of NCBI BLAST.

More Information: http://www.mpiblast.org
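
The general idea of spreading a search over several workers can be sketched as follows: the database is split into fragments, each worker searches one fragment, and the partial results are merged. This is only an illustration of that idea, not mpiBLAST's code. Threads stand in for MPI processes, and all function names, including the toy k-mer score, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_database(records, n_fragments):
    """Deal (name, sequence) records round-robin into n_fragments lists."""
    return [records[i::n_fragments] for i in range(n_fragments)]


def kmers(seq, k):
    """Set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_score(query, subject, k=3):
    """Toy similarity (not a BLAST score): number of shared distinct k-mers."""
    return len(kmers(query, k) & kmers(subject, k))


def search_fragment(query, fragment):
    """Score the query against every record in one database fragment."""
    return [(name, toy_score(query, seq)) for name, seq in fragment]


def segmented_search(query, records, n_workers=4):
    """Search all fragments concurrently and merge hits, best score first."""
    fragments = split_database(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial = list(pool.map(lambda frag: search_fragment(query, frag),
                                fragments))
    hits = [hit for part in partial for hit in part]
    # Sort by descending score, then by name so the order is deterministic.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


if __name__ == "__main__":
    database = [("s1", "ACGTACGT"), ("s2", "TTTTTTTT"),
                ("s3", "ACGTTTTT"), ("s4", "GGGGACGT")]
    print(segmented_search("ACGTAC", database, n_workers=2))
```

Because every record is scored independently, the merged hit list is the same however many workers are used; only the elapsed time changes.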

Biopython
Description: The Biopython Project is an international association of developers of freely available Python (http://www.python.org) tools for computational molecular biology. The web site http://www.biopython.org provides an online resource for modules, scripts, and web links for developers of Python-based software for life science research.

More information: http://www.biopython.org

ClustalW - From the European BioInformatics Institute
Description: ClustalW2 is a general-purpose multiple sequence alignment program for DNA or proteins. It produces biologically meaningful multiple sequence alignments of divergent sequences. It calculates the best match for the selected sequences and lines them up so that the identities, similarities and differences can be seen. Evolutionary relationships can be seen by viewing cladograms or phylograms.

More information: http://www.ebi.ac.uk/clustalw/

MrBayes - From School of Computational Science at the Florida State University
Description: MrBayes is a program for the Bayesian estimation of phylogeny. Bayesian inference of phylogeny is based upon a quantity called the posterior probability distribution of trees, which is the probability of a tree conditioned on the observations. The conditioning is accomplished using Bayes's theorem. The posterior probability distribution of trees is impossible to calculate analytically; instead, MrBayes uses a simulation technique called Markov chain Monte Carlo (or MCMC) to approximate the posterior probabilities of trees.

More Information: http://mrbayes.csit.fsu.edu/
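
The idea of approximating posterior probabilities by simulation can be sketched with a toy Metropolis sampler over three candidate trees. This is not how MrBayes is implemented. The tree strings, the unnormalized weights (standing in for likelihood times prior) and the function name are all made-up assumptions for illustration.

```python
import random
from collections import Counter


def metropolis_sample(unnormalized, n_steps, seed=0):
    """Estimate posterior probabilities from unnormalized weights by MCMC.

    unnormalized maps each state to a positive weight proportional to
    likelihood * prior; the normalizing constant is never computed.
    """
    rng = random.Random(seed)
    states = list(unnormalized)
    current = states[0]
    counts = Counter()
    for _ in range(n_steps):
        proposal = rng.choice(states)              # symmetric proposal
        ratio = unnormalized[proposal] / unnormalized[current]
        if rng.random() < min(1.0, ratio):         # Metropolis acceptance rule
            current = proposal
        counts[current] += 1
    # The fraction of steps spent in a state approximates its posterior.
    return {state: counts[state] / n_steps for state in states}


if __name__ == "__main__":
    # Hypothetical weights for the three rooted trees on taxa A, B and C.
    weights = {"((A,B),C)": 6.0, "((A,C),B)": 3.0, "((B,C),A)": 1.0}
    for tree, prob in metropolis_sample(weights, 100_000).items():
        print(f"{tree}: {prob:.3f}")
```

With only three trees the exact answer is easy to compute (0.6, 0.3 and 0.1 here), which makes it simple to check the sampler. The method matters when the number of possible trees is far too large to enumerate.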

T-Coffee - From Information Genomique et Structurale at Centre National de la Recherche Scientifique
Description: T-Coffee is a multiple sequence alignment package. You can use T-Coffee to align sequences or to combine the output of your favorite alignment methods (Clustal, Mafft, Probcons, Muscle...) into one unique alignment (M-Coffee).
T-Coffee can align protein, DNA and RNA sequences. It is also able to combine sequence information with protein structural information (3D-Coffee/Expresso), profile information (PSI-Coffee) or RNA secondary structures.

More Information: http://www.tcoffee.org/homepage.html

EMBOSS - From European Molecular Biology Institute
Description: EMBOSS is "The European Molecular Biology Open Software Suite". EMBOSS is a free Open Source software analysis package specially developed for the needs of the molecular biology (e.g. EMBnet) user community. The software automatically copes with data in a variety of formats and even allows transparent retrieval of sequence data from the web. Also, as extensive libraries are provided with the package, it is a platform to allow other scientists to develop and release software in true open source spirit. EMBOSS also integrates a range of currently available packages and tools for sequence analysis into a seamless whole. EMBOSS breaks the historical trend towards commercial software packages.

More Information: http://emboss.sourceforge.net/

Phylip - From the Dept. of Biology at the University of Washington
Description: PHYLIP (the PHYLogeny Inference Package) is a package of programs for inferring phylogenies (evolutionary trees). It is available free over the Internet, and written to work on as many different kinds of computer systems as possible. The source code is distributed (in C), and executables are also distributed.

PHYLIP is probably the most widely-distributed phylogeny package. It is the third most frequently cited phylogeny package, after PAUP* and MrBayes, and ahead of MEGA. PHYLIP is also the oldest widely-distributed package. It has been in distribution since October, 1980, and has over 28,000 registered users. It is regularly updated.

More Information: http://evolution.genetics.washington.edu/phylip.html 

FASTA - From the University of Virginia
Description: The FASTA programs find regions of local or global (new) similarity between Protein or DNA sequences, either by searching Protein or DNA databases, or by identifying local duplications within a sequence. Other programs provide information on the statistical significance of an alignment. Like BLAST, FASTA can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.

More Information: http://fasta.bioch.virginia.edu/

Glimmer - From Center for Bioinformatics and Computational Biology at the University of Maryland
Description: Glimmer is a system for finding genes in microbial DNA, especially the genomes of bacteria, archaea, and viruses. Glimmer (Gene Locator and Interpolated Markov ModelER) uses interpolated Markov models (IMMs) to identify the coding regions and distinguish them from noncoding DNA. The IMM approach uses a combination of Markov models from 1st through 8th order, weighting each model according to its predictive power. Glimmer uses 3-periodic nonhomogeneous Markov models in its IMMs.

More Information: http://www.cbcb.umd.edu/software/glimmer/ 
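
The notion of combining Markov models of several orders can be sketched as below. This is a simplified illustration, not Glimmer's implementation. In particular, the weighting rule (trust a context more the more often it was seen, controlled by a made-up `damping` parameter) and all function names are assumptions, and the 3-periodic aspect is omitted.

```python
from collections import defaultdict

ALPHABET = "ACGT"


def train_counts(sequences, max_order):
    """Count how often each base follows each context of length 0..max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, base in enumerate(seq):
            for k in range(min(i, max_order) + 1):
                counts[seq[i - k:i]][base] += 1
    return counts


def imm_probability(counts, context, base, max_order, damping=5.0):
    """Interpolated estimate of P(base | context), mixing orders 0..max_order."""
    prob = 1.0 / len(ALPHABET)           # uniform fallback before any data
    for k in range(min(len(context), max_order) + 1):
        ctx = context[len(context) - k:]  # last k bases of the context
        seen = counts.get(ctx)
        total = sum(seen.values()) if seen else 0
        if total == 0:
            break                         # longer contexts were never seen either
        weight = total / (total + damping)    # more observations, more trust
        estimate = seen.get(base, 0) / total  # plain order-k frequency estimate
        prob = weight * estimate + (1.0 - weight) * prob
    return prob


if __name__ == "__main__":
    model = train_counts(["ACGACGACGACG"], max_order=2)
    for b in ALPHABET:
        print(b, round(imm_probability(model, "AC", b, max_order=2), 3))
```

Each higher order refines the estimate from the order below it, so rarely seen long contexts fall back gracefully on shorter, better-supported ones.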

TIGR Assembler - From the J. Craig Venter Institute
Description: TIGR Assembler is a new tool for assembling large shotgun sequencing projects. It enabled the first published whole-genome assembly of a free-living organism in 1995.

More Information: http://www.jcvi.org/cms/research/software/

POY - Phylogenetic Analysis of DNA and other data using dynamic homology
Description: POY is a phylogenetic analysis program that supports multiple kinds of data (e.g. morphology, nucleotides, genes and gene regions, chromosomes, whole genomes, etc.). POY is distinctive in that it can perform true alignment and phylogeny inference (i.e. input sequences need not be prealigned). Insertions, deletions, and rearrangements can then be included in the overall tree score (under Maximum Parsimony) or in the model (under Maximum Likelihood). A variety of heuristic algorithms have been developed for this purpose and are implemented in POY.

More Information: http://research.amnh.org/scicomp/scripts/download.php

BioPerl
Description: BioPerl is a toolkit of Perl modules useful in building bioinformatics solutions in Perl. It is built in an object-oriented manner so that many modules depend on each other to achieve a task. The collection of modules in the bioperl-live repository makes up the core functionality of BioPerl. Additionally, auxiliary modules for creating graphical interfaces (bioperl-gui), persistent storage in an RDBMS (bioperl-db), running and parsing the results from hundreds of bioinformatics applications (Run package), and software to automate bioinformatic analyses (bioperl-pipeline) are all available as Git modules in the BioPerl repository.

Packages installed: perl-bioperl, perl-bioperl-run, perl-bioperl-gui, perl-bioperl-db 
More Information: http://www.bioperl.org/wiki/Main_Page

The R Project for Statistical Computing
Description: R is a language and environment for statistical computing and graphics. It provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. R is similar to the S language and environment which was developed at Bell Laboratories by John Chambers and colleagues. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.

To run R on Pyramid, the Rmpi package has to be used. Rmpi is an interface (wrapper) to MPI (Message-Passing Interface). Its main goal is to expose low-level MPI functions in R so that users do not have to know C or Fortran.

More Information: R Project homepage, Rmpi homepage

R and Rmpi on Pyramid: User Guide

Rmpi Tutorial (ACADIA University): Tutorial homepage

NAMD - Scalable Molecular Dynamics
Description: NAMD, recipient of a 2002 Gordon Bell Award, is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. Based on Charm++ parallel objects, NAMD scales to hundreds of processors on high-end parallel platforms and tens of processors on commodity clusters using gigabit ethernet. NAMD uses the popular molecular graphics program VMD for simulation setup and trajectory analysis, but is also file-compatible with AMBER, CHARMM, and X-PLOR.
More Information: NAMD homepage

NAMD on Pyramid: User Guide