|DaGO-Fun - Database for GO-based Functional Annotation Analysis|
|Guide to Selecting a Semantic Similarity Measure|
Many Gene Ontology (GO) semantic similarity measures have been developed in recent years, and it is often unclear to users which one to use and how the approaches differ. This site provides a guide to selecting an appropriate approach for three tasks: measuring the specificity of a given GO term in the GO structure (a directed acyclic graph, DAG), measuring the similarity in biological content between GO terms, and comparing proteins using their GO annotations.
1. Selecting the best approach for scoring GO term specificity
Depending on how the information content (IC) of a GO term is conceived or computed, GO term semantic similarity approaches fall into two main families: annotation-based and topology-based. Topology-based approaches depend only on the intrinsic topology of the GO structure, whereas annotation-based approaches also depend on the frequencies at which terms occur in the corpus under consideration. This dependence on annotation data has been seriously criticized and constitutes a major drawback of annotation-based approaches. Topology-based approaches aim to remove the dependence and to measure similarity between proteins using only the GO DAG, producing a fixed, well-defined information content for a given GO term, independent of the corpus under consideration. These topology-based approaches include:
Recently, it has been observed that the GO-universal metric is a good solution to the issue of scoring term specificity in the GO DAG (view more details). We therefore suggest choosing the topology-based GO-universal metric when you are not sure which measure is most appropriate for your application.
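To make the annotation dependence discussed in this section concrete, the following minimal sketch computes an annotation-based IC on a toy DAG, assuming the commonly used definition IC(t) = -log p(t), where p(t) is the fraction of proteins annotated to t or to any of its descendants. The DAG, the protein names and both corpora are hypothetical, and the code is an illustration only, not the implementation used by this site.

```python
import math

# Hypothetical toy DAG: each term maps to its parents (the root has none).
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],
}

# Two hypothetical corpora: protein -> directly annotated terms.
CORPUS_1 = {"p1": {"C"}, "p2": {"A"}, "p3": {"B"}, "p4": {"A"}}
CORPUS_2 = {"p1": {"C"}, "p2": {"C"}, "p3": {"A"}, "p4": {"B"}}


def ancestors(term):
    """Return the term together with all of its ancestors in the DAG."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def annotation_ic(annotations):
    """Annotation-based IC, assuming IC(t) = -log(p(t))."""
    counts = {term: 0 for term in PARENTS}
    for terms in annotations.values():
        # A protein annotated to a term is implicitly annotated to its ancestors.
        closure = set()
        for term in terms:
            closure |= ancestors(term)
        for term in closure:
            counts[term] += 1
    total = len(annotations)
    # Terms that never occur have undefined IC under this definition; skip them.
    return {t: -math.log(c / total) for t, c in counts.items() if c > 0}


ic = annotation_ic(CORPUS_1)
ic_other = annotation_ic(CORPUS_2)
for term in sorted(ic):
    print(term, round(ic[term], 3), round(ic_other[term], 3))
```

The same term "C" receives a different score in the two corpora even though the DAG is unchanged, which is the behaviour that topology-based approaches are designed to avoid.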
2. Selecting the best possible GO term semantic similarity measures
As stated above, the GO-universal metric has good mathematical properties and has been shown to perform well biologically compared to other approaches, so we again recommend it. However, if you prefer an annotation-based approach, we advise using the eXtended GraSM (XGraSM) tools, as they have been observed to perform better than other measures (view more details).
3. Selecting the best possible protein functional similarity measures
We have categorized protein functional similarity measures into two models: Direct Term-based and Term Semantic-based. Direct Term-based approaches use term IC scores directly to derive protein functional similarity scores, e.g. SimGIC, SimDIC, SimUIC and SimUI. Term Semantic-based models, on the other hand, use term semantic similarity scores to derive protein functional similarity scores. There are four Term Semantic-based protein similarity measures: average (Avg) (click 1 or 2 for more information), maximum (MAX) (click here for more details), best match average (BMA) (click 1 or 2 for more information), and average best matches (ABM) (click 1 or 2 for details).
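The two models can be sketched as follows. This is a minimal illustration under stated assumptions, not the code behind this site: SimUI is taken as the Jaccard index of two ancestor-closed term sets, SimGIC as the same ratio weighted by IC, and Avg, MAX, BMA and ABM follow their commonly cited formulations over the matrix of pairwise term similarities (definitions of BMA and ABM vary between papers). All term names, IC values and similarity scores below are hypothetical.

```python
def sim_ui(terms_a, terms_b):
    """Direct Term-based: Jaccard index of two ancestor-closed term sets."""
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


def sim_gic(terms_a, terms_b, ic):
    """Direct Term-based: IC-weighted Jaccard index of two term sets."""
    union = sum(ic[t] for t in terms_a | terms_b)
    shared = sum(ic[t] for t in terms_a & terms_b)
    return shared / union if union else 0.0


def term_semantic_scores(terms_a, terms_b, term_sim):
    """Term Semantic-based: combine pairwise term similarity scores."""
    rows = [[term_sim(s, t) for t in terms_b] for s in terms_a]
    row_best = [max(row) for row in rows]        # best match for each term of A
    col_best = [max(col) for col in zip(*rows)]  # best match for each term of B
    n, m = len(terms_a), len(terms_b)
    return {
        "Avg": sum(map(sum, rows)) / (n * m),
        "MAX": max(row_best),
        "BMA": 0.5 * (sum(row_best) / n + sum(col_best) / m),
        "ABM": (sum(row_best) + sum(col_best)) / (n + m),
    }


# Hypothetical inputs for illustration only.
TOY_IC = {"root": 0.0, "A": 1.0, "C": 2.0}
TOY_SIM = {("A", "B"): 0.2, ("C", "B"): 0.6}

ui = sim_ui({"root", "A"}, {"root", "A", "C"})
gic = sim_gic({"root", "A"}, {"root", "A", "C"}, TOY_IC)
scores = term_semantic_scores(["A", "C"], ["B"], lambda s, t: TOY_SIM[(s, t)])
print(ui, gic, scores)
```

With protein annotation sets of unequal size, as in the toy input, BMA and ABM give different values because BMA averages the two directions separately while ABM pools all best matches before averaging.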