PhyloProfile

Introduction

Phylogenetic profiles

presence/absence pattern of orthologs of a gene (the seed) across a set of taxa

presence of ortholog <=> presence of seed's function

=> transfer functions between orthologs

similar profiles <=> functional interaction

=> trace functional protein clusters or metabolic networks across species
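One simple way to quantify "similar profiles" is the Jaccard index over the sets of taxa in which each gene has an ortholog. The sketch below is only an illustration of the idea, not the similarity measure used by PhyloProfile; the gene names and the choice of taxa are hypothetical.

```python
def jaccard(profile_a, profile_b):
    """Jaccard similarity of two presence/absence profiles.

    Each profile is given as the set of taxon IDs in which an
    ortholog of the seed gene was found.
    """
    a, b = set(profile_a), set(profile_b)
    if not a and not b:
        return 1.0  # two empty profiles are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical seed genes and the taxa that carry an ortholog.
profiles = {
    "geneA": {"ncbi9606", "ncbi10090", "ncbi7227"},
    "geneB": {"ncbi9606", "ncbi10090", "ncbi7227", "ncbi3702"},
    "geneC": {"ncbi3702"},
}

sim_ab = jaccard(profiles["geneA"], profiles["geneB"])  # 3 shared of 4 taxa
sim_ac = jaccard(profiles["geneA"], profiles["geneC"])  # no shared taxa
```

Genes with a high pairwise score are candidates for functional interaction; the cutoff to use is a judgment call.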

PROBLEM:

specificity of orthology inference
orthology does not guarantee functional equivalence

=> increase the informativeness of orthology assignment by using complementary information

PhyloProfile tool

Shiny(R)-based
dynamically visualize and explore multi-layered phylogenetic profiles
provide several functions for analyzing phylogenetic profiles
available online & standalone

dynamically filter & analyze profiles with different thresholds

dynamically change the resolution of the profile analysis

dynamically change profile appearance

Multi-layers

presence/absence pattern + two additional layers of information

sequence similarity
domain architecture similarity
semantic similarities of Gene Ontology-terms
taxonomic distances
etc.

Analysis functions

Profile clustering
Gene age estimation
Core gene identification
Distribution analysis

Main input

Required information

for basic phylogenetic profile:
- Gene ID of seed protein or ortholog group ID - geneID
- Taxonomy ID of species having orthologs (ncbi+taxonID, e.g. ncbi7029, ncbi3702) - ncbiID (*)
- Ortholog ID - orthoID
for additional information layers (optional):
- Value for first additional information layer - var1
- Value for second additional information layer - var2

(*) PhyloProfile requires taxonomy IDs in all input files.

PhyloProfile provides a function for looking up taxon IDs from a list of taxon names.

Main input

presence/absence pattern + up to 2 additional variables

FASTA format

Sequence header: >geneID|ncbiID|orthoID|var1|var2
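The header layout above can be parsed with a few lines of Python. This is a minimal sketch, not code shipped with PhyloProfile; it assumes that none of the IDs contain a `|` character and that a missing value is written as an empty field or `NA`. The class name, function names, and example IDs are hypothetical.

```python
from typing import NamedTuple, Optional


class ProfileEntry(NamedTuple):
    gene_id: str
    ncbi_id: str
    ortho_id: str
    var1: Optional[float]
    var2: Optional[float]


def _to_value(text: str) -> Optional[float]:
    """Convert a var field to float; empty or NA (assumed marker) means missing."""
    text = text.strip()
    return None if text in ("", "NA") else float(text)


def parse_header(header: str) -> ProfileEntry:
    """Split a '>geneID|ncbiID|orthoID|var1|var2' header into its fields."""
    fields = header.strip().lstrip(">").split("|")
    if not 3 <= len(fields) <= 5:
        raise ValueError(f"expected 3 to 5 fields, got {len(fields)}: {header!r}")
    gene_id, ncbi_id, ortho_id = fields[:3]
    # Taxonomy IDs have the form 'ncbi' + numeric taxon ID, e.g. ncbi3702.
    if not (ncbi_id.startswith("ncbi") and ncbi_id[4:].isdigit()):
        raise ValueError(f"malformed taxonomy ID: {ncbi_id!r}")
    extras = [_to_value(f) for f in fields[3:]]
    extras += [None] * (2 - len(extras))  # var1/var2 are optional
    return ProfileEntry(gene_id, ncbi_id, ortho_id, extras[0], extras[1])


entry = parse_header(">OG_1001|ncbi3702|ath_0001|0.87|0.95")
```

Headers carrying only the three required fields parse as well; `var1` and `var2` are then `None`.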

orthoXML format

supported by OMA, OrthoMCL, InParanoid, Hieranoid, Panther, Roundup, etc.

* See "How to prepare input files" for how to use OMA orthoXML files.

long format

tab-delimited file containing 5 columns: geneID, ncbiID, orthoID, var1, var2
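A long-format file can be written and read back with Python's `csv` module. The sketch below assumes that the file starts with a header line naming the five columns; the function names and example rows are hypothetical.

```python
import csv
import io

COLUMNS = ["geneID", "ncbiID", "orthoID", "var1", "var2"]


def write_long(rows, handle):
    """Write (geneID, ncbiID, orthoID, var1, var2) tuples as tab-delimited text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)  # header line (assumed to be expected)
    writer.writerows(rows)


def read_long(handle):
    """Read a long-format file back into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = [
    ("OG_1001", "ncbi3702", "ath_0001", 0.87, 0.95),
    ("OG_1001", "ncbi7029", "api_0042", 0.65, 0.80),
    # Two co-orthologs in the same taxon simply become two rows.
    ("OG_1001", "ncbi7029", "api_0043", 0.61, 0.75),
]

buffer = io.StringIO()
write_long(rows, buffer)
buffer.seek(0)
records = read_long(buffer)
```

Because every ortholog occupies its own row, this format handles co-orthologs without special treatment.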

matrix (wide) format

header line contains ncbiID

each cell contains orthoID#var1#var2

NOTE: this format is not suitable for profiles containing paralogs (co-orthologs)
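The following sketch converts long-format rows into the matrix layout and shows why co-orthologs do not fit: each gene/taxon cell has room for only one `orthoID#var1#var2` entry. The label of the first column (`geneID`) and the `NA` marker for empty cells are assumptions, and the function name is hypothetical.

```python
def long_to_matrix(rows):
    """Convert (geneID, ncbiID, orthoID, var1, var2) tuples to matrix text."""
    taxa = sorted({row[1] for row in rows})
    matrix = {}
    for gene_id, ncbi_id, ortho_id, var1, var2 in rows:
        cells = matrix.setdefault(gene_id, {})
        if ncbi_id in cells:
            # A second ortholog in the same taxon has no cell to go into.
            raise ValueError(
                f"{gene_id} has co-orthologs in {ncbi_id}; use the long format"
            )
        cells[ncbi_id] = f"{ortho_id}#{var1}#{var2}"
    lines = ["\t".join(["geneID"] + taxa)]  # header line contains the ncbiIDs
    for gene_id in sorted(matrix):
        cells = matrix[gene_id]
        lines.append("\t".join([gene_id] + [cells.get(t, "NA") for t in taxa]))
    return "\n".join(lines)


rows = [
    ("OG_1001", "ncbi3702", "ath_0001", 0.87, 0.95),
    ("OG_1001", "ncbi7029", "api_0042", 0.65, 0.8),
]
matrix_text = long_to_matrix(rows)
```

Passing rows that contain two orthologs of the same gene in the same taxon raises a `ValueError` instead of silently dropping one of them.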

Annotation input (optional)

domain annotations or structural information

The domain file is a tab-delimited file containing:

pairID: geneID#orthoID
orthoID (required) and seedID: if both are present, the domain plot shows the domain architectures of both proteins for comparison; otherwise only the domains of the ortholog are plotted.
feature name: name of the domain
start and end position of each domain in the protein
weight value for each feature (set to NA if unavailable)

How to prepare input files

Main input:
1. The newest version of HaMStR can directly generate compatible profile and domain files for PhyloProfile.
2. Obtain HOGs from OMA Browser:
```
python ./scripts/get_oma_hogs.py -i data/demo/oma/omaIDs.list > data/demo/oma/omaHogs.orthoxml
```
3. Convert OMA Standalone output:
```
python ./scripts/convert_oma_standalone_orthoxml.py -x data/demo/oma/oma_example.orthoxml -m data/demo/oma/taxon_mapping_oma_orthoxml.csv > data/demo/oma_example_phyloprofile_compatible.orthoxml
```
4. Download orthoXML file from other supported public databases (if available).
5. Prepare the input file manually in FASTA, long or matrix format.

Domain file:

HaMStR can directly generate compatible profile and domain files for PhyloProfile.

Use HMMScan and PfamScan for Pfam annotation:

hmmscan -E 0.001 --noali --domtblout hmmscanOut.txt path_to_Pfam-A_files/Pfam-A.hmm data/demo/test.input.fasta

perl pfamscan.pl -fasta test.input.fasta -dir path_to_Pfam-A_files > pfamscanOut.txt

then parse output files into compatible domain files:

python ./scripts/hmmscanParser.py -i data/demo/pfamAnno/hmmscanOut.txt > hmmscanOut.domains

python ./scripts/pfamscanParser.py -i data/demo/pfamAnno/pfamscanOut.txt > pfamscanOut.domains

Prepare the input file manually by following the domain file format described under "Annotation input".

Input

Variables
Configurations
Seed taxon selection

Variables

2 additional information layers
Use max, min, mean or median values for aggregation
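The aggregation choice matters whenever several values have to be collapsed into one, for example when several orthologs fall into the same displayed taxon. The sketch below illustrates the four options; which values PhyloProfile groups together is an assumption here, and the function name is hypothetical.

```python
from statistics import mean, median

# The four aggregation methods named above, mapped to Python callables.
AGGREGATORS = {"max": max, "min": min, "mean": mean, "median": median}


def aggregate(values, method="max"):
    """Collapse several var1 (or var2) values into one representative value."""
    present = [v for v in values if v is not None]  # skip missing values
    if not present:
        return None
    return AGGREGATORS[method](present)


# Hypothetical var1 values of three orthologs grouped into one taxon.
var1_values = [0.5, 1.0, 0.75]
summary = {name: aggregate(var1_values, name) for name in AGGREGATORS}
```

Using `max` highlights the best-scoring ortholog in a group, while `mean` and `median` describe the group as a whole.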

Configurations

use all genes for analysis or upload a list of genes of interest (Choose genes of interest)
order sequence IDs alphabetically or keep the order of the input (Order sequence IDs)
order taxa based on the NCBI taxonomy tree or on a user-defined tree(*) (Order taxa)
set the path to the sequence files (FASTA config)
change the default colors of the profile plots (Set colors for profile)