MetaPhOrs is a public repository of phylogeny-based orthologs and paralogs computed from phylogenetic trees available in twelve public repositories. Currently, over 6.8 billion unique homologs are deposited in the MetaPhOrs database. These predictions were retrieved from 7 million Maximum Likelihood trees for 2,714 species. For each prediction, MetaPhOrs provides a Consistency Score and an Evidence Level describing its reliability, together with the number of trees and links to their source databases.
Reliable orthology prediction is central to comparative genomics and to the annotation of newly sequenced genomes. Since orthology and paralogy are both evolutionary concepts, phylogeny-based strategies are expected to provide the most accurate predictions. However, given the high computational cost associated with phylogenetic analyses, the majority of automated orthology prediction methods rely on faster but less accurate pairwise sequence comparisons. Only recently, thanks to the availability of faster computers and better algorithms, has it become feasible to use phylogeny-based orthology prediction at genomic scale.
Recently, several projects have addressed the reconstruction of large collections of high-quality phylogenetic trees from which orthology can be inferred. This gives us the opportunity to infer the evolutionary relationship of two genes from multiple, independent phylogenetic trees and to use the consistency across predictions as a reliability measure of an orthology assignment. By using phylogenetic trees available in the PhylomeDB, Ensembl, TreeFam and Fungal Orthogroups databases, and those reconstructed for EggNOG, OrthoMCL and COG, we predict orthology and paralogy relationships for over XXX million proteins in XXXX fully sequenced genomes and provide a reliability score for all of them, based on the number of independent trees and the consistency across predictions.
The species-overlap algorithm is an alternative approach to inferring evolutionary events from gene phylogenies. The only evolutionary information it requires is a rooted gene tree: it needs neither a fully resolved species phylogeny nor a reconciliation step. To decide whether a given node represents a speciation or a duplication event, the algorithm uses the level of overlap between the species found under its two descendant nodes. In brief, a species-overlap score (SOS) is calculated for every node as the number of species shared between the two child branches divided by the total number of organisms under the node. If the SOS is higher than a given threshold, the parental node is mapped as a duplication; otherwise it is mapped as a speciation event. The best performance of the algorithm has been reported with an SOS threshold of 0.0, so that speciation is assumed only when no species overlap is detected between the descendant nodes.
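The node-labelling rule can be illustrated with a minimal Python sketch. It is not the MetaPhOrs implementation: the `Node` class, the `label_events` function and the toy tree are hypothetical, the tree is assumed to be strictly binary, and "the total number of organisms under the node" is interpreted here as the number of distinct species under the node.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    """Hypothetical minimal node of a rooted, binary gene tree."""
    species: Optional[str] = None                 # set on leaves only
    children: List["Node"] = field(default_factory=list)
    event: Optional[str] = None                   # filled in for internal nodes


def label_events(node: Node, threshold: float = 0.0) -> Set[str]:
    """Label every internal node and return the set of species under `node`."""
    if not node.children:
        return {node.species}
    # Assumption: exactly two children per internal node.
    left, right = (label_events(child, threshold) for child in node.children)
    # SOS: species shared by both child branches over all species under the node.
    sos = len(left & right) / len(left | right)
    node.event = "duplication" if sos > threshold else "speciation"
    return left | right


def leaf(name: str) -> Node:
    return Node(species=name)


# Toy tree: "human" appears under both children of the root.
tree = Node(children=[
    Node(children=[leaf("human"), leaf("mouse")]),
    Node(children=[leaf("human"), leaf("yeast")]),
])
label_events(tree)
print(tree.event)               # root: one shared species out of three
print(tree.children[0].event)   # human/mouse split: no shared species
```

With the threshold of 0.0, the root is labelled a duplication because one species is found on both sides, while each inner node is labelled a speciation.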
MetaPhOrs combines information from multiple strains into a single meta-proteome for each species. As a result, the phylogenetic signals from multiple strains of one species present in a given tree are counted multiple times, and the number of trees in the orthology tables may be slightly larger than the number of trees retrieved on the tree page.
Orthology/paralogy assignment in MetaPhOrs is based on the Consistency Score (CS), which ranges from 0 to 1. In brief, the closer the CS is to 1.0, the more confident the prediction.
The Consistency Score is the ratio of the number of trees confirming a given relationship to the total number of trees used to infer the relationship between a particular protein pair. The orthology Consistency Score (CSo) is calculated for orthology searches and the paralogy Consistency Score (CSp) for paralogy queries, as follows:
CSo = To / (To + Tp)
CSp = Tp / (To + Tp)
Here, To is the number of trees confirming an orthology relationship and Tp is the number of trees confirming a paralogy relationship.
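As a worked example of the two formulas, the following Python sketch computes both scores from tree counts. The function name `consistency_scores` and the example counts are illustrative assumptions, not part of MetaPhOrs.

```python
from typing import Tuple


def consistency_scores(t_o: int, t_p: int) -> Tuple[float, float]:
    """Return (CSo, CSp) for a protein pair.

    t_o: number of trees confirming orthology (To)
    t_p: number of trees confirming paralogy (Tp)
    """
    total = t_o + t_p
    if total == 0:
        # No tree covers this pair, so neither score is defined.
        raise ValueError("no trees available for this protein pair")
    return t_o / total, t_p / total


# Hypothetical pair supported as orthologs by 3 trees and as paralogs by 1 tree.
cso, csp = consistency_scores(t_o=3, t_p=1)
print(cso, csp)  # 0.75 0.25
```

Because both scores share the same denominator, CSo and CSp always sum to 1 for a given protein pair.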
The recommended CSo threshold for orthology prediction is 0.5. The CS cut-off can be changed by the user to adjust the sensitivity and specificity of each query. A CS cut-off of 0.0 returns all homology relationships, while a CS cut-off of 1.0 returns only fully consistent predictions.
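The effect of the cut-off can be sketched in Python as follows. The protein identifiers, the tree counts and the `orthologs` helper are hypothetical, and the inclusive comparison (CSo at least equal to the cut-off) is an assumption inferred from the behaviour described for cut-offs of 0.0 and 1.0.

```python
from typing import Dict, List, Tuple

# Hypothetical tree counts per protein pair: (To, Tp)
pairs: Dict[Tuple[str, str], Tuple[int, int]] = {
    ("A", "B"): (4, 0),   # CSo = 1.0
    ("A", "C"): (2, 2),   # CSo = 0.5
    ("A", "D"): (0, 3),   # CSo = 0.0
}


def orthologs(pairs: Dict[Tuple[str, str], Tuple[int, int]],
              cutoff: float = 0.5) -> List[Tuple[str, str]]:
    """Return the pairs whose CSo is at least `cutoff` (inclusive, by assumption)."""
    return [pair for pair, (t_o, t_p) in pairs.items()
            if t_o / (t_o + t_p) >= cutoff]


print(orthologs(pairs))        # default cut-off of 0.5
print(orthologs(pairs, 0.0))   # every pair is returned
print(orthologs(pairs, 1.0))   # only fully consistent predictions
```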
The Evidence Level is the number of independent sources (databases) in which trees confirming a prediction have been found. In general, the higher the Evidence Level, the more reliable the prediction, as more sources were used to infer it.
The Evidence Level may vary from 1 to 12, as trees were retrieved from 12 databases. The Evidence Level cut-off has to be altered with care, as the external databases partially overlap and for some pairs of species there is only one source of data (Evidence Level of 1). It is recommended to start queries with an Evidence Level cut-off of 1 and to increase it afterwards if needed.
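Counting the Evidence Level amounts to counting distinct source databases among the confirming trees. The Python sketch below illustrates this; the `evidence_level` function and the list of trees are made up for the example.

```python
from typing import Iterable, Tuple


def evidence_level(trees: Iterable[Tuple[str, str]], relation: str) -> int:
    """Count the distinct databases with at least one tree confirming `relation`."""
    return len({database for database, rel in trees if rel == relation})


# Hypothetical trees for one protein pair: (source database, relationship supported)
trees = [
    ("PhylomeDB", "orthology"),
    ("PhylomeDB", "orthology"),   # second tree from the same database
    ("Ensembl", "orthology"),
    ("TreeFam", "paralogy"),
]

el = evidence_level(trees, "orthology")
print(el)  # 2: PhylomeDB is counted once, plus Ensembl
```

Using a set means that several trees from one database contribute a single unit of evidence, however many of them confirm the prediction.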
Note that in the first releases (200909 and 200911), the Evidence Level counted different phylomes as independent sources. From release 201405 on, only different databases are counted.