HaMStR

Feature-aware orthology prediction tool. Provided data package: Quest for Orthologs reference gene sets 2019.

HaMStR-OneSeq

conda-install conda-version GPLv3-license

How to install

0. Basic system tools requirement

You need wget, grep and sed (gsed on MacOS) to install HaMStR, so please install any of them that are missing. For MacOS users, we recommend using Homebrew to install these command line tools. To use the FAS tool (a dependency of HaMStR), you also need Python 3.
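A quick way to see which of these tools are missing is sketched below. The function name missing_tools is only an illustration and is not part of HaMStR; it relies on nothing but the POSIX command -v builtin.

```shell
# Print the name of every given tool that cannot be found on PATH.
# missing_tools is a hypothetical helper, not a HaMStR command.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

For example, missing_tools wget grep sed python3 prints nothing when all four tools are available (on MacOS, check gsed instead of sed).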

1a. Install using Anaconda

Install conda (Anaconda or Miniconda) on your system, following the installation instructions on the conda website.

Add additional channels bioconda and conda-forge:

conda config --add channels bioconda
conda config --add channels conda-forge

Create and activate a conda environment for HaMStR

conda create --name hamstr -y
conda activate hamstr

Install HaMStR

conda install -c BIONF hamstr
setup_hamstr

HaMStR will be installed in the subfolder HaMStR inside your current working directory. After the setup has run successfully, you can start using HaMStR (in some cases you need to restart the terminal first).

1b. Install in Ubuntu/MacOS

Get HaMStR source code from GitHub

git clone --depth=1 https://github.com/BIONF/HaMStR

Run setup.sh script in the HaMStR folder to install HaMStR and its dependencies

cd HaMStR
./setup.sh

You should have the sudo password ready, otherwise some missing dependencies cannot be installed. See the Dependencies section for more info. If you do not have root privileges, ask your admin to install those dependencies using the install_lib.sh script.

After the setup has run successfully, you can start using HaMStR (in some cases you need to restart the terminal first).

To debug the installation, please create a log file by running the setup as e.g. bin/setup.sh | tee log.txt for Linux/MacOS or setup_hamstr | tee log.txt for Anaconda, and send us that log file so that we can troubleshoot the issue. Most problems can be solved by simply re-running the setup.
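Note that a plain pipe into tee records only standard output; error messages written to standard error do not end up in the log file. A minimal sketch of a wrapper that captures both streams follows. The name run_logged is hypothetical and not part of HaMStR.

```shell
# Run a command, show its output on the terminal, and save both
# stdout and stderr to the log file given as first argument.
# run_logged is a hypothetical helper, not a HaMStR command.
run_logged() {
    log_file=$1
    shift
    # 2>&1 merges stderr into stdout so that error messages reach tee as well
    "$@" 2>&1 | tee "$log_file"
}
```

It would be called as run_logged log.txt setup_hamstr, or with the setup script path in place of setup_hamstr.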

Usage

HaMStR will run smoothly with the provided sample input file in ‘HaMStR/data/infile.fa’ if everything is set correctly.

oneSeq -seqFile=infile.fa -seqName=test -refspec=HUMAN@9606@1 -minDist=genus -maxDist=kingdom -coreOrth=5 -cleanup -global -cpu=8

The output files with the prefix test will be saved in your current working directory. You can get an overview of the available options with the command

oneSeq -h

If you get an error message saying that the oneSeq command was not found, restart the terminal or replace oneSeq with perl bin/oneSeq.
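The fallback can also be chosen automatically. The sketch below assumes that it is run from inside the HaMStR folder, so that the relative path bin/oneSeq resolves; the function name oneseq_command is made up for this example.

```shell
# Print the command to use for oneSeq: the installed command if it is
# on PATH, otherwise the Perl fallback (valid inside the HaMStR folder).
# oneseq_command is a hypothetical helper, not a HaMStR command.
oneseq_command() {
    if command -v oneSeq >/dev/null 2>&1; then
        printf '%s\n' "oneSeq"
    else
        printf '%s\n' "perl bin/oneSeq"
    fi
}
```

With this helper, $(oneseq_command) -h shows the option overview in either case.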

The output consists of these text files (note: test is the value of your -seqName parameter):

1) test.extended.fa: a multiple FASTA file containing the ortholog sequences and the query gene

2) test.extended.profile: a tab-delimited file containing the list of orthologous sequences and their corresponding similarity scores, obtained by comparing their feature architectures with that of the query gene (for more info about this score, please read this document)

3) test.phyloprofile: an input file for visualising the phylogenetic profile of the query gene with the PhyloProfile tool

4) test_1.domains (and, optionally, test_0.domains): a protein domain annotation file for all the sequences present in the orthologous group. The _0 or _1 suffix indicates the direction of the feature architecture comparison: _1 (forward) means that the query gene is used as seed and its orthologs as targets for the comparison, while _0 (backward) is the reverse.
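To confirm that a run produced all of these files, a small check such as the following can help. It is a sketch only: the function name missing_outputs is hypothetical, and the optional _0.domains file is deliberately not checked.

```shell
# Print every expected output file of a run that is missing.
# $1: the -seqName value; $2: output directory (defaults to the current one)
# missing_outputs is a hypothetical helper, not a HaMStR command.
missing_outputs() {
    seq_name=$1
    out_dir=${2:-.}
    for suffix in .extended.fa .extended.profile .phyloprofile _1.domains; do
        [ -f "$out_dir/${seq_name}${suffix}" ] || printf '%s\n' "${seq_name}${suffix}"
    done
}
```

After the sample run above, missing_outputs test prints nothing if all four files are in the current directory.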

HaMStR and the utilisation of FAS

HaMStR integrates the prediction of orthologs and the calculation of the Feature Architecture Similarity (FAS) scores. FAS scores are computed pairwise between the query gene and its predicted orthologous genes using the FAS tool, which is installed automatically during the setup of HaMStR.

Output visualization using PhyloProfile

For a rich visualisation of the information in the HaMStR outputs, you can plug them into the PhyloProfile tool.

The main input file for PhyloProfile is seqname.phyloprofile, which contains a list of all orthologous gene names and the taxonomy IDs of their taxa, together with the FAS scores (if available). To analyse additional information such as the FASTA sequences or the domain annotations, you can optionally input seqname.extended.fa and seqname_1.domains (or seqname_0.domains) to PhyloProfile.

You can combine multiple HaMStR runs into a single phylogenetic profile input for data visualisation and data exploration. Each run is identified by its seqname (option -seqName=<>), which is either given by the user or randomly assigned. The following steps are necessary:

# concatenate all desired profile files into one combined profile

cat *.extended.profile > combined.extended.profile

# re-run the parsing script from your current data directory with the combined profile

perl /path/to/HaMStR/bin/visuals/parseOneSeq.pl -i combined.extended.profile -o combined.phyloprofile

To prepare the additional input file (*.domains) you just need to concatenate them with each other (please mind the distinction between forward (1) and backward (0) FAS comparisons and do not mix them up).

cat *_0.domains > combined_0.domains
cat *_1.domains > combined_1.domains

The resulting files combined.phyloprofile, combined_0.domains and combined_1.domains can then be plugged into the PhyloProfile tool for further investigation.
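When the concatenation is repeated in the same folder, the wildcard also matches the combined file from the previous attempt. One way to avoid this is to write the combined files into a separate folder, as in the sketch below. The function name combine_runs and the two-folder layout are assumptions made for this example.

```shell
# Concatenate the per-run files found in $1 into combined.* files in $2.
# $2 should differ from $1 so that a combined file from an earlier call
# is never matched by the wildcards.
# combine_runs is a hypothetical helper, not a HaMStR command.
combine_runs() {
    run_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    cat "$run_dir"/*.extended.profile > "$out_dir/combined.extended.profile"
    # keep backward (_0) and forward (_1) comparisons in separate files
    for direction in 0 1; do
        set -- "$run_dir"/*_"$direction".domains
        # skip a direction for which no run produced a file
        [ -e "$1" ] || continue
        cat "$@" > "$out_dir/combined_$direction.domains"
    done
}
```

The combined.extended.profile written this way still has to be passed through parseOneSeq.pl, as shown above, to obtain combined.phyloprofile.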

Pre-calculated data set

Within the data package (https://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml) we provide a set of 78 reference taxa (gene sets in genome_dir, annotations in weight_dir, BLAST databases in blast_dir). They can be downloaded automatically during the setup. This data comes “ready to use” with the HaMStR-OneSeq framework. Species data must be present in the three directories listed below. For each species/taxon there is a sub-directory named in accordance with the naming schema [Species acronym]@[NCBI ID]@[Proteome version]:

  • genome_dir (Contains sub-directories for proteome fasta files for each species)
  • blast_dir (Contains sub-directories for BLAST databases made with makeblastdb out of your proteomes)
  • weight_dir (Contains sub-directories for feature annotation files for each proteome)
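The layout described above can be checked with a short script. The sketch below makes two assumptions that the text does not state: the species acronym consists of letters and digits only, and both the NCBI ID and the proteome version are plain numbers. The function names are hypothetical.

```shell
# Succeed if the name has the form acronym@NCBI-ID@version.
# Assumption: acronym is letters/digits, both IDs are digits only.
is_valid_taxon_name() {
    printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9]+@[0-9]+@[0-9]+$'
}

# Print each data directory under $1 that has no sub-directory for taxon $2.
missing_taxon_dirs() {
    data_root=$1
    taxon=$2
    for dir in genome_dir blast_dir weight_dir; do
        [ -d "$data_root/$dir/$taxon" ] || printf '%s\n' "$dir"
    done
}
```

For example, missing_taxon_dirs /path/to/HaMStR 'HUMAN@9606@1' prints nothing when the taxon is present in all three directories.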

If needed, you can manually add further gene sets (multi-FASTA format) and place them into the respective directories (genome_dir, weight_dir, blast_dir). Please note that every taxon/species must be present in the NCBI taxonomy. The following steps need to be carried out:

1) Download the gene set of your taxon of interest as amino acid sequences from the NCBI database.

2) Rename the file in accordance with the naming schema of HaMStR: SPECIES@12345@1.fa ([Species acronym]@[NCBI ID]@[Proteome version])

3) FASTA headers must be free of whitespace and unique within the gene set (short headers make your life easier in downstream analysis). The following bash command uses sed (gsed on MacOS) to cut each header at the first whitespace:

sed -i "s/ .*//" SPECIES@12345@1.fa

Example of a FASTA file before the header cleanup:

>EXR66326.1 biofilm-associated domain protein, partial [Acinetobacter baumannii 339786]
MTGEGPVAIHAEAVDAQGNVDVADADVTLTIDTTPQDLITAITVPEDLNGDGILNAAELGTDGSFNAQVALGPDAVDGTV
>EXR66351.1 hypothetical protein J700_4015, partial [Acinetobacter baumannii 339786]
NRRLLITTQPTATDSNYKTPIYINAPNGELYFANQDETSVSSVVFKRVIGATAANAPYVASDSWTKKIRKWNTYNHEVSK
...

and after (this is what your sequence data should look like):

>EXR66326.1
MTGEGPVAIHAEAVDAQGNVDVADADVTLTIDTTPQDLITAITVPEDLNGDGILNAAELGTDGSFNAQVALGPDAVDGTV
>EXR66351.1
NRRLLITTQPTATDSNYKTPIYINAPNGELYFANQDETSVSSVVFKRVIGATAANAPYVASDSWTKKIRKWNTYNHEVSK
...
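Both header requirements can be verified before continuing. The following is a minimal sketch using grep, sort and uniq; the function name fasta_headers_ok is hypothetical.

```shell
# Succeed if every header in the FASTA file $1 is free of whitespace
# and occurs only once.
# fasta_headers_ok is a hypothetical helper, not a HaMStR command.
fasta_headers_ok() {
    fasta=$1
    # a header line containing a space or tab violates the first rule
    if grep '^>' "$fasta" | grep -q '[[:space:]]'; then
        return 1
    fi
    # uniq -d prints each header that occurs more than once
    duplicates=$(grep '^>' "$fasta" | sort | uniq -d)
    [ -z "$duplicates" ]
}
```

A call such as fasta_headers_ok SPECIES@12345@1.fa && echo "headers OK" reports success only for a file like the "after" example.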

4) Place the prepared gene set (proteomic data) into the respective sub-directory of the genome_dir directory, then carry out the following steps.

5) Create a Blast DB for the species within the blast_dir

a) Create a Blast DB using makeblastdb

makeblastdb -dbtype prot -in genome_dir/SPECI@00001@1/SPECI@00001@1.fa -out blast_dir/SPECI@00001@1/SPECI@00001@1

b) Create a symbolic link to the proteome FASTA file within the blast_dir (change into the respective sub-directory of the blast_dir first)

cd blast_dir/SPECI@00001@1
ln -s ../../genome_dir/SPECI@00001@1/SPECI@00001@1.fa SPECI@00001@1.fa

6) Create the annotation files for your taxon with the provided perl script

annoFAS --fasta=/path/to/your/hamstr/genome_dir/SPECI@00001@1/SPECI@00001@1.fa --path=/path/to/your/hamstr/weight_dir --name=SPECI@00001@1

Please take care that all parameter paths are provided as absolute paths. This action takes considerably longer than the BLAST database creation with makeblastdb (it takes about one hour to annotate a gene set with 5000 sequences).
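Since the same taxon name appears many times in steps 5 and 6, it can help to generate the commands from two variables. The sketch below only prints the commands and runs nothing. The function name print_taxon_commands is hypothetical, and the mkdir -p line is an addition that assumes the sub-directory in blast_dir may not exist yet.

```shell
# Print the commands of steps 5 and 6 for one taxon without running them.
# $1: absolute path of the folder holding genome_dir, blast_dir, weight_dir
# $2: taxon name, e.g. SPECI@00001@1
# print_taxon_commands is a hypothetical helper, not a HaMStR command.
print_taxon_commands() {
    root=$1
    taxon=$2
    fasta="$root/genome_dir/${taxon}/${taxon}.fa"
    blast_sub="$root/blast_dir/${taxon}"
    # added assumption: create the BLAST sub-directory if it is missing
    printf 'mkdir -p %s\n' "$blast_sub"
    printf 'makeblastdb -dbtype prot -in %s -out %s\n' "$fasta" "$blast_sub/${taxon}"
    # the relative link target is resolved from inside the BLAST sub-directory
    printf 'ln -s ../../genome_dir/%s/%s.fa %s/%s.fa\n' "$taxon" "$taxon" "$blast_sub" "$taxon"
    printf 'annoFAS --fasta=%s --path=%s --name=%s\n' "$fasta" "$root/weight_dir" "$taxon"
}
```

After reviewing the printed lines, they can be executed by piping them into sh, for example print_taxon_commands /path/to/your/hamstr 'SPECI@00001@1' | sh.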

To check whether your manually added species has been integrated into the HaMStR framework, you can run:

oneSeq -showTaxa

This command simply prints a list of all available taxa.
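For long lists, the output can be searched instead of read. The sketch below assumes that the taxon name appears verbatim somewhere in the printed list; the exact output format of -showTaxa is not described here, and the function name taxon_listed is hypothetical.

```shell
# Read a taxa listing on stdin and succeed if it mentions the taxon $1.
# Assumption: the taxon name appears verbatim in the listing.
# taxon_listed is a hypothetical helper, not a HaMStR command.
taxon_listed() {
    # -F treats the name as a fixed string, so @ and . are matched literally
    grep -qF -- "$1"
}
```

Usage would be oneSeq -showTaxa | taxon_listed 'SPECI@00001@1' && echo "taxon found".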

Dependencies

HaMStR has some dependencies that are either installed automatically by the setup script or must be installed by your system admin if you do not have root privileges. Below you will find the full list of HaMStR’s dependencies for Ubuntu, as well as the alternatives for MacOS. On Ubuntu, you can install these system and bioinformatics tools/libraries using apt-get:

sudo apt-get update -y
sudo apt-get install tool_name -y

On MacOS, we suggest using Homebrew as a replacement for apt-get. Once Homebrew is installed, you can install tools/libraries with the command

brew install tool_name

On both operating systems, you can install Perl modules using cpanm.

# first, install cpanm
curl -L http://cpanmin.us | perl - --sudo App::cpanminus
# then, install perl module using cpanm
sudo cpanm perl_module_name

If you do not have root privileges, ask your admin to install these dependencies using the install_lib.sh script.

cd HaMStR
sudo ./install_lib.sh

Note: After all these dependencies are installed, you still need to run the setup script to configure HaMStR!

System tools/libraries

  • grep (ggrep)
  • sed (gsed)
  • wget (wget)
  • build-essential
  • curl (curl)
  • locales
  • lib32ncurses5
  • lib32z1

(In parentheses are Mac’s alternative tools)

Bioinformatics tools

  • wise (brewsci/bio/genewise)
  • hmmer (hmmer)
  • ncbi-blast+ (blast)
  • blast2
  • clustalw (brewsci/bio/clustal-w)
  • mafft (mafft)
  • muscle (brewsci/bio/muscle)

(In parentheses are Mac’s alternative tools)

Perl modules

  • libdbi-perl
  • libipc-run-perl
  • perl-doc
  • DBI
  • DB_File
  • File::Copy
  • File::Path
  • File::Basename
  • File::Which
  • List::Util
  • Parallel::ForkManager
  • POSIX
  • XML::SAX
  • XML::NamespaceSupport
  • XML::Parser
  • Getopt::Long
  • IO::Handle
  • IPC::Run
  • Statistics::R
  • Term::Cap
  • Time::HiRes
  • Bio::AlignIO
  • Bio::Align::ProteinStatistics
  • Bio::DB::Taxonomy
  • Bio::SearchIO
  • Bio::SearchIO::blastxml
  • Bio::Search::Hit::BlastHit
  • Bio::Seq
  • Bio::SeqIO
  • Bio::SeqUtils
  • Bio::Tree::Tree
  • Bio::Tools::Run::StandAloneBlast

How to cite

Ebersberger, I., Strauss, S. & von Haeseler, A. HaMStR: Profile hidden markov model based search for orthologs in ESTs. BMC Evol Biol 9, 157 (2009), doi:10.1186/1471-2148-9-157

Contributors

Contact

For further support or bug reports please contact: ebersberger@bio.uni-frankfurt.de