PhyloProfile

Introduction

Phylogenetic profiles

  • presence/absence pattern of genes (the seed) across a set of taxa

  • presence of ortholog <=> presence of seed's function

    => transfer functions between orthologs

  • similar profiles <=> functional interaction

    => trace functional protein clusters or metabolic networks across species

PROBLEM:

  • specificity of orthology inference
  • orthology does not guarantee functional equivalence

=> increase the informativeness of orthology assignment by using complementary information

Introduction

PhyloProfile tool

  • Shiny(R)-based
  • dynamically visualize and explore multi-layered phylogenetic profiles
  • provide several functions for analyzing phylogenetic profiles
  • available online & standalone

  • dynamically filter & analyze profiles with different thresholds
  • dynamically change the resolution of the profile analysis
  • dynamically change profile appearance
  • multi-layered profiles = presence/absence pattern + two additional layers of information, e.g.:


    • sequence similarity
    • domain architecture similarity
    • semantic similarities of Gene Ontology-terms
    • taxonomic distances
    • etc.

    Analysis functions

    Input

    Main input

    Required information

    1. for basic phylogenetic profile:
      • Gene ID of seed protein or ortholog group ID - geneID
      • Taxonomy ID of species having orthologs (ncbi+taxonID, e.g. ncbi7029, ncbi3702) - ncbiID (*)
      • Ortholog ID - orthoID
    2. for additional information layers (optional):
      • Value for first additional information layer - var1
      • Value for second additional information layer - var2

    (*) PhyloProfile requires taxonomy IDs in all input files.

    PhyloProfile provides a function for looking up taxonomy IDs from a list of taxon names.
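
As an illustration of the ncbiID convention above ("ncbi" followed by the numeric NCBI taxonomy ID), here is a minimal Python sketch. The helper names `to_ncbi_id` and `taxon_number` are hypothetical and are not part of PhyloProfile.

```python
import re

# Assumption: an ncbiID is exactly the prefix "ncbi" followed by digits,
# as in the examples ncbi7029 and ncbi3702.
NCBI_ID_PATTERN = re.compile(r"ncbi(\d+)")


def to_ncbi_id(taxon_id):
    """Turn a numeric taxonomy ID (int or digit string) into 'ncbi<ID>'."""
    return "ncbi%d" % int(taxon_id)


def taxon_number(ncbi_id):
    """Extract the numeric taxonomy ID from an ncbiID string."""
    match = NCBI_ID_PATTERN.fullmatch(ncbi_id)
    if match is None:
        raise ValueError("expected 'ncbi' followed by digits, got %r" % ncbi_id)
    return int(match.group(1))


print(to_ncbi_id(7029))
print(taxon_number("ncbi3702"))
```

A check like this can catch input rows that carry a bare number or a taxon name instead of an ncbiID before the file is uploaded.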

    Main input

    presence/absence pattern + up to 2 additional variables

    FASTA format

    Sequence header: >geneID|ncbiID|orthoID|var1|var2
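
The header layout above can be split into its five fields with a few lines of Python. This is only a sketch: the example IDs are made up, treating var1 and var2 as optional numeric fields is an assumption, and IDs that themselves contain "|" are not handled.

```python
from typing import NamedTuple, Optional


class ProfileHeader(NamedTuple):
    gene_id: str
    ncbi_id: str
    ortho_id: str
    var1: Optional[float]
    var2: Optional[float]


def _to_float(text):
    # Assumption: a missing value is written as an empty field or "NA".
    return None if text in ("", "NA") else float(text)


def parse_header(line):
    """Parse a header of the form >geneID|ncbiID|orthoID|var1|var2."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header: %r" % line)
    fields = line[1:].strip().split("|")
    if len(fields) < 3:
        raise ValueError("need at least geneID|ncbiID|orthoID: %r" % line)
    fields += [""] * (5 - len(fields))  # pad the optional var1/var2 fields
    gene_id, ncbi_id, ortho_id, var1, var2 = fields[:5]
    return ProfileHeader(gene_id, ncbi_id, ortho_id,
                         _to_float(var1), _to_float(var2))


# Hypothetical IDs, for illustration only.
example = parse_header(">group1|ncbi3702|protA|0.98|0.75")
print(example)
```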

    orthoXML format

    supported by OMA, OrthoMCL, InParanoid, Hieranoid, Panther, Roundup, etc.

    long format

    tab-delimited file with 5 columns: geneID, ncbiID, orthoID, var1, var2
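
A small long-format table can be read with Python's standard `csv` module. The sketch below assumes the file starts with a header row naming the five columns and that var1 and var2 are numeric; the gene and ortholog IDs are invented for illustration.

```python
import csv
import io

# Hypothetical long-format content (tab-delimited, with a header row).
LONG_TEXT = (
    "geneID\tncbiID\torthoID\tvar1\tvar2\n"
    "group1\tncbi3702\tprotA\t0.98\t0.75\n"
    "group1\tncbi7029\tprotB\t0.61\t0.40\n"
    "group2\tncbi3702\tprotC\t0.88\t0.93\n"
)


def read_long(handle):
    """Read long-format rows and convert var1/var2 to floats."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["var1"] = float(row["var1"])
        row["var2"] = float(row["var2"])
        rows.append(row)
    return rows


rows = read_long(io.StringIO(LONG_TEXT))

# Presence/absence pattern: the set of taxa with an ortholog, per gene.
taxa_per_gene = {}
for row in rows:
    taxa_per_gene.setdefault(row["geneID"], set()).add(row["ncbiID"])

print(taxa_per_gene)
```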

    matrix (wide) format

    header line contains ncbiID

    each cell contains orthoID#var1#var2

    NOTE: this format is not suitable for profiles containing paralogs (co-orthologs)
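
To show how the matrix cells relate to the long format, the following sketch unpacks a wide table into (geneID, ncbiID, orthoID, var1, var2) records. It assumes that the first column holds the geneID, that columns are tab-separated, and that an absent ortholog is written as "NA" or an empty cell; none of these details are confirmed here. Because each cell holds a single orthoID, a second co-ortholog in the same taxon has nowhere to go, which is the limitation the note above describes.

```python
# Hypothetical wide-format content: header line with ncbiIDs,
# one row per gene, each cell "orthoID#var1#var2".
WIDE_TEXT = (
    "geneID\tncbi3702\tncbi7029\n"
    "group1\tprotA#0.98#0.75\tprotB#0.61#0.40\n"
    "group2\tprotC#0.88#0.93\tNA\n"
)


def wide_to_long(text):
    """Convert wide-format text into a list of long-format tuples."""
    lines = [line for line in text.splitlines() if line.strip()]
    taxa = lines[0].split("\t")[1:]  # skip the geneID column label
    records = []
    for line in lines[1:]:
        cells = line.split("\t")
        gene_id = cells[0]
        for ncbi_id, cell in zip(taxa, cells[1:]):
            if cell in ("", "NA"):  # assumption: marks an absent ortholog
                continue
            parts = cell.split("#")
            parts += ["NA"] * (3 - len(parts))  # var1/var2 may be missing
            ortho_id, var1, var2 = parts[:3]
            records.append((gene_id, ncbi_id, ortho_id, var1, var2))
    return records


records = wide_to_long(WIDE_TEXT)
for record in records:
    print("\t".join(record))
```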

    Annotation input (optional)

    domain annotations or structural information

    The domain file is a tab-delimited file containing:

    1. pairID: geneID#orthoID
    2. orthoID (required) and seedID: if both are present, domain plot will show domain architectures of both proteins for comparison. Otherwise only domains of orthologs will be plotted.
    3. feature name: name of domains
    4. start and end position of each domain in the protein
    5. weight value for each feature (set as NA if unavailable)

    How to prepare input files

    • Main input:
      1. The newest version of HaMStR can directly generate compatible profile and domain files for PhyloProfile.
      2. Obtain HOGs from OMA Browser:
        python ./scripts/get_oma_hogs.py -i data/demo/oma/omaIDs.list > data/demo/oma/omaHogs.orthoxml
      3. Convert OMA Standalone:
        python ./scripts/convert_oma_standalone_orthoxml.py -x data/demo/oma/oma_example.orthoxml -m data/demo/oma/taxon_mapping_oma_orthoxml.csv > data/demo/oma_example_phyloprofile_compatible.orthoxml
      4. Download orthoXML file from other supported public databases (if available).
      5. Prepare the input file manually in FASTA, long or matrix format.

    • Domain file:
      1. HaMStR can directly generate compatible profile and domain files for PhyloProfile.
      2. Use HMMScan and PfamScan to do PFAM annotation:
        hmmscan -E 0.001 --noali --domtblout hmmscanOut.txt path_to_Pfam-A_files/Pfam-A.hmm data/demo/test.input.fasta
        perl pfamscan.pl -fasta test.input.fasta -dir path_to_Pfam-A_files > pfamscanOut.txt

        then parse output files into compatible domain files:

        python ./scripts/hmmscanParser.py -i data/demo/pfamAnno/hmmscanOut.txt > hmmscanOut.domains
        python ./scripts/pfamscanParser.py -i data/demo/pfamAnno/pfamscanOut.txt > pfamscanOut.domains
      3. Prepare the input file manually, following the domain file format described above.
    Input

    Variables

    • 2 additional information layers
    • Use max, min, mean or median values for aggregation
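
When several values fall into the same cell of the profile (for example several species of one supertaxon), they are summarized with one of the four functions above. The sketch below illustrates the idea with Python's `statistics` module; the grouping key, the supertaxon name and the numbers are hypothetical, and this is not PhyloProfile's own implementation.

```python
from statistics import mean, median

AGGREGATORS = {"max": max, "min": min, "mean": mean, "median": median}


def aggregate(values_by_cell, how="max"):
    """Summarize the list of values in each cell with the chosen function."""
    func = AGGREGATORS[how]
    return {cell: func(values) for cell, values in values_by_cell.items()}


# Hypothetical var1 values, keyed by (geneID, supertaxon).
var1_by_cell = {
    ("group1", "supertaxonA"): [0.98, 0.61, 0.70],
    ("group2", "supertaxonA"): [0.88],
}

for how in ("max", "min", "mean", "median"):
    print(how, aggregate(var1_by_cell, how))
```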

    Configurations

    • use all genes for analysis or upload list of genes of interest (Choose genes of interest)
    • order sequence IDs alphabetically or keep the order as in the input (Order sequence IDs)
    • order taxa based on NCBI taxonomy tree or by user-defined tree(*) (Order taxa)
    • set path to sequence files (FASTA config)
    • change default colors for profile plots (Set colors for profile)

    (*) the input species tree has to be in Newick format and should not contain singletons

    Seed taxon selection

    higher taxonomic level = more general analysis

    Main profile

    Profile appearance

    Variable thresholds

    dynamically filter profile
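
Conceptually, threshold filtering keeps only the orthologs whose additional variables lie within the chosen ranges. The following sketch assumes inclusive lower and upper cutoffs for both variables and uses invented example rows; how the tool itself treats the cutoffs is not specified here.

```python
def filter_profile(rows, var1_range=(0.0, 1.0), var2_range=(0.0, 1.0)):
    """Keep rows whose var1 and var2 fall inside the given inclusive ranges."""
    low1, high1 = var1_range
    low2, high2 = var2_range
    return [
        row for row in rows
        if low1 <= row["var1"] <= high1 and low2 <= row["var2"] <= high2
    ]


# Hypothetical profile rows.
profile_rows = [
    {"geneID": "group1", "ncbiID": "ncbi3702", "var1": 0.98, "var2": 0.75},
    {"geneID": "group1", "ncbiID": "ncbi7029", "var1": 0.61, "var2": 0.40},
    {"geneID": "group2", "ncbiID": "ncbi3702", "var1": 0.88, "var2": 0.93},
]

strict = filter_profile(profile_rows, var1_range=(0.7, 1.0), var2_range=(0.8, 1.0))
print([row["geneID"] for row in strict])
```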

    Point's info

    info of a selected point on the profile

    Detail information

    • detailed scores for 2 additional variables
    • FASTA sequence of ortholog

    Domain architecture plot

    for comparing protein architectures

    Customized profile

    Profile for a subset of genes and/or taxa, which are

    • manually selected from the lists
    • uploaded using the "Browse..." option
    Analysis functions

    Profile clustering



    cluster similar profiles based on a calculated distance matrix

    *drag to select a set of genes to submit to customized profile
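
The idea of clustering profiles from a distance matrix can be sketched in plain Python. The example below swaps in two simple choices that are assumptions, not the tool's documented method: Jaccard distance between the sets of taxa in which each gene is present, and single-linkage grouping at a fixed distance cutoff. Gene IDs and profiles are made up.

```python
from itertools import combinations


def jaccard_distance(taxa_a, taxa_b):
    """1 - |intersection| / |union| of two presence sets."""
    union = len(taxa_a | taxa_b)
    return 0.0 if union == 0 else 1.0 - len(taxa_a & taxa_b) / union


def distance_matrix(profiles):
    """Pairwise distances, keyed by (gene, gene) with sorted gene order."""
    genes = sorted(profiles)
    return {
        (g, h): jaccard_distance(profiles[g], profiles[h])
        for g, h in combinations(genes, 2)
    }


def single_linkage_groups(profiles, cutoff):
    """Merge genes whose distance is <= cutoff (single linkage, union-find)."""
    parent = {gene: gene for gene in profiles}

    def find(gene):
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path compression
            gene = parent[gene]
        return gene

    for (g, h), dist in distance_matrix(profiles).items():
        if dist <= cutoff:
            parent[find(g)] = find(h)

    groups = {}
    for gene in profiles:
        groups.setdefault(find(gene), []).append(gene)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical presence sets (taxa in which each gene has an ortholog).
profiles = {
    "group1": {"ncbi3702", "ncbi7029", "ncbi9606"},
    "group2": {"ncbi3702", "ncbi7029"},
    "group3": {"ncbi559292"},
}

print(single_linkage_groups(profiles, cutoff=0.5))
```

Genes that end up in the same group have similar presence/absence patterns and are candidates for a customized profile.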

    Distribution analysis

    distribution of the additional information and of the percentage of taxa, summarized per supertaxon

    Gene age estimation

    evolutionary age estimated using a last common ancestor (LCA) algorithm

    *list of selected genes that have the same estimated age can be submitted to customized profile
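
One way to picture an LCA-based age estimate: take the taxonomic lineages of the seed taxon and of every taxon in which an ortholog was found, and report the deepest node they all share. The sketch below does this on heavily simplified, hand-written lineages; the function names are hypothetical, and the tool's actual algorithm may differ in detail.

```python
# Simplified lineages from root to species (most ranks omitted).
LINEAGES = {
    "ncbi9606": ["Eukaryota", "Opisthokonta", "Metazoa", "Chordata",
                 "Homo sapiens"],
    "ncbi7029": ["Eukaryota", "Opisthokonta", "Metazoa", "Arthropoda",
                 "Acyrthosiphon pisum"],
    "ncbi559292": ["Eukaryota", "Opisthokonta", "Fungi",
                   "Saccharomyces cerevisiae"],
    "ncbi3702": ["Eukaryota", "Viridiplantae", "Streptophyta",
                 "Arabidopsis thaliana"],
}


def last_common_ancestor(lineages):
    """Return the deepest node shared by all root-to-leaf lineages."""
    lca = None
    for names in zip(*lineages):
        if len(set(names)) > 1:
            break
        lca = names[0]
    return lca


def estimate_age(seed_taxon, taxa_with_orthologs):
    """LCA of the seed taxon and all taxa that have an ortholog."""
    taxa = {seed_taxon} | set(taxa_with_orthologs)
    return last_common_ancestor([LINEAGES[taxon] for taxon in taxa])


print(estimate_age("ncbi9606", {"ncbi7029"}))
print(estimate_age("ncbi9606", {"ncbi7029", "ncbi559292"}))
print(estimate_age("ncbi9606", {"ncbi7029", "ncbi3702"}))
```

The more distantly related the taxa that still carry an ortholog, the closer the LCA lies to the root and the older the estimated gene age.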

    Core gene identification

    search for genes that have orthologs in all selected taxa

    *those genes can then be added to customized profile
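
In set terms, a core gene is one whose presence set contains every selected taxon. A minimal sketch with made-up profiles (the function name `core_genes` is hypothetical):

```python
def core_genes(profiles, selected_taxa):
    """Genes that have an ortholog in every one of the selected taxa."""
    selected = set(selected_taxa)
    return sorted(gene for gene, taxa in profiles.items() if selected <= taxa)


# Hypothetical presence sets (taxa in which each gene has an ortholog).
presence = {
    "group1": {"ncbi3702", "ncbi7029", "ncbi9606"},
    "group2": {"ncbi3702", "ncbi7029"},
    "group3": {"ncbi9606"},
}

print(core_genes(presence, ["ncbi3702", "ncbi7029"]))
```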

    Search NCBI taxonomy IDs

    for a given list of taxa names

    Download filtered data

    filtered data & sequences from main and customized profiles

    * If in-paralogs are present, option "Download representative sequences" will select only one sequence for each species for downloading

    Summary

    Extensive documentation, FAQs and known bugs are described on our GitHub wiki pages.

    Links

    online version: phyloprofile.shinyapps.io/phyloprofile

    standalone version: github.com/BIONF/phyloprofile

    contact: tran@bio.uni-frankfurt.de