Guide to Selecting a Semantic Similarity Measure

Many Gene Ontology (GO) semantic similarity measures have been developed, and it is not always clear to users which one to use or how the approaches differ. This site provides a guide to selecting an approach for three tasks: measuring the specificity of a given GO term in the GO structure (a Directed Acyclic Graph, or DAG), measuring the similarity in biological content between GO terms, and comparing proteins using their GO annotations.

1. Selecting the best approach for scoring GO term specificity

Depending on how the information content (IC) of a GO term is conceived and computed, GO term semantic similarity approaches fall into two main families: annotation-based and topology-based. Topology-based approaches depend only on the intrinsic topology of the GO structure, whereas annotation-based approaches also depend on the frequencies at which terms occur in the corpus under consideration. This dependence on annotation data has been seriously criticized and constitutes a major drawback of annotation-based approaches. Topology-based approaches aim to remove it: they measure similarity between proteins based only on the GO DAG and produce a fixed, well-defined information content for a given GO term, independent of the corpus under consideration. These topology-based approaches include:

The Zhang et al. model (view more details about the model).
The Wang et al. approach (view more details about the approach).
The GO-universal metric (view more details about the metric).

The GO-universal metric has been observed to be a good solution to the problem of scoring term specificity in the GO DAG (view more details). We therefore suggest choosing the topology-based GO-universal metric when you are not sure which measure is most appropriate for your application.
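To make the corpus dependence of annotation-based scoring concrete, here is a minimal sketch of the widely used frequency-based definition IC(t) = -log p(t), where p(t) is the fraction of annotated proteins that carry term t or one of its descendants. The toy DAG, the protein names and the helper `annotation_ic` are hypothetical and are not taken from any of the tools discussed here; the topology-based schemes listed above compute IC differently and are not reproduced.

```python
import math

def annotation_ic(annotations, ancestors, root):
    """Annotation-based IC: IC(t) = -log(p(t)), with p(t) estimated from a corpus.

    annotations: hypothetical corpus, protein -> set of directly annotated terms.
    ancestors:   term -> set of all its ancestors in the DAG (term itself excluded).
    """
    # A protein annotated to a term is implicitly annotated to every ancestor.
    proteins_per_term = {term: set() for term in ancestors}
    for protein, terms in annotations.items():
        for term in terms:
            proteins_per_term[term].add(protein)
            for anc in ancestors[term]:
                proteins_per_term[anc].add(protein)
    total = len(proteins_per_term[root])
    # Terms with no annotated protein have no defined IC here and are left out.
    return {t: -math.log(len(ps) / total)
            for t, ps in proteins_per_term.items() if ps}

# Toy DAG (hypothetical): root -> A -> C and root -> B.
toy_ancestors = {"root": set(), "A": {"root"}, "B": {"root"}, "C": {"A", "root"}}

# Two hypothetical corpora over the same DAG.
corpus_1 = {"P1": {"C"}, "P2": {"A"}, "P3": {"B"}, "P4": {"B", "C"}}
corpus_2 = {"P1": {"C"}, "P2": {"B"}, "P3": {"B"}, "P4": {"B"}}

ic_1 = annotation_ic(corpus_1, toy_ancestors, "root")
ic_2 = annotation_ic(corpus_2, toy_ancestors, "root")
print(ic_1["C"], ic_2["C"])  # same term, different IC in each corpus
```

The same term C receives a different score in each corpus, which is the behaviour that topology-based approaches are designed to avoid.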

2. Selecting the best possible GO term semantic similarity measures

As stated above, the GO-universal metric has good mathematical properties and has been shown to perform well biologically compared to other approaches, so we again recommend it. If you prefer annotation-based approaches, however, we advise using the eXtended GraSM (XGraSM) tools, as they have been observed to perform better than other measures (view more details).

3. Selecting the best possible protein functional similarity measures

We have categorized protein functional similarity measures into two models: Direct Term-based and Term Semantic-based. Direct Term-based approaches use term IC scores directly to compute protein functional similarity scores, e.g. SimGIC, SimDIC, SimUIC and SimUI. Term Semantic-based models, on the other hand, use term semantic similarity scores to derive protein functional similarity scores. There are four Term Semantic-based protein similarity measures: average (Avg) (click 1 or 2 for more information), maximum (MAX) (click here for more details), best match average (BMA) (click 1 or 2 for more information), and average best matches (ABM) (click 1 or 2 for details).
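The four Term Semantic-based measures can be sketched as different ways of combining a table of pairwise term similarity scores. The formulas below follow one common formulation in the literature (BMA averages the two directional means of best matches, ABM pools all best matches into a single mean); definitions vary between papers, so treat them as an assumption rather than as the exact formulas behind the linked references. The function names and the toy similarity table are hypothetical.

```python
def best_matches(sim, terms_p, terms_q):
    """Best score of each term of one protein against all terms of the other."""
    rows = [max(sim(s, t) for t in terms_q) for s in terms_p]
    cols = [max(sim(s, t) for s in terms_p) for t in terms_q]
    return rows, cols

def avg_score(sim, terms_p, terms_q):
    # Avg: mean over all term pairs.
    pairs = [sim(s, t) for s in terms_p for t in terms_q]
    return sum(pairs) / len(pairs)

def max_score(sim, terms_p, terms_q):
    # MAX: the single best-scoring term pair.
    return max(sim(s, t) for s in terms_p for t in terms_q)

def bma_score(sim, terms_p, terms_q):
    # BMA: mean of the two directional averages of best matches.
    rows, cols = best_matches(sim, terms_p, terms_q)
    return 0.5 * (sum(rows) / len(rows) + sum(cols) / len(cols))

def abm_score(sim, terms_p, terms_q):
    # ABM: all best matches from both directions pooled into one average.
    rows, cols = best_matches(sim, terms_p, terms_q)
    return (sum(rows) + sum(cols)) / (len(rows) + len(cols))

# Hypothetical symmetric term-similarity table for three terms.
TOY_SIM = {frozenset(("t1", "t2")): 0.5,
           frozenset(("t1", "t3")): 0.2,
           frozenset(("t2", "t3")): 0.4}

def toy_sim(a, b):
    # Identical terms score 1.0; other pairs are looked up in the table.
    return 1.0 if a == b else TOY_SIM[frozenset((a, b))]

protein_p = ["t1"]
protein_q = ["t2", "t3"]
scores = {
    "Avg": avg_score(toy_sim, protein_p, protein_q),
    "MAX": max_score(toy_sim, protein_p, protein_q),
    "BMA": bma_score(toy_sim, protein_p, protein_q),
    "ABM": abm_score(toy_sim, protein_p, protein_q),
}
print(scores)
```

Under these definitions BMA and ABM coincide when both proteins have the same number of terms, which is why the toy proteins are given annotation sets of different sizes.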

In the context of annotation-based approaches, while the performance of the SimDIC and SimUIC models is still to be assessed, it has been observed that Direct Term-based models, in particular the SimGIC model, perform better than Term Semantic-based models (click 1 or 2 for details). Thus, we suggest choosing the SimGIC model when you opt for annotation-based approaches and are not sure which measure is most appropriate for your application.
On the other hand, if you prefer to use Term Semantic-based models, it will be beneficial to use the Best Match Average (BMA), which has been shown to perform better than other similar approaches and has been suggested to be more biologically relevant (click here for more details).
Finally, for topology-based approaches, these options are not available, as each scheme provides its own model for computing protein functional similarity scores. The GO-universal approach uses the BMA measure, and the Wang et al. approach uses ABM. The Zhang et al. approach, which proposed context-dependent methods, also uses the ABM measure, although its authors initially suggested using the Avg scheme.
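For comparison with the Term Semantic-based models, the Direct Term-based scores SimUI and SimGIC mentioned above can be sketched as a plain and an IC-weighted Jaccard index over the two proteins' term sets. This is a minimal sketch under the assumption that each term set has already been extended with all ancestors of its terms; the IC values and term names are hypothetical, and SimDIC and SimUIC are not shown.

```python
def sim_ui(terms_p, terms_q):
    """SimUI: shared terms divided by all terms (unweighted Jaccard index)."""
    union = terms_p | terms_q
    return len(terms_p & terms_q) / len(union) if union else 0.0

def sim_gic(terms_p, terms_q, ic):
    """SimGIC: Jaccard index in which each term is weighted by its IC."""
    union_weight = sum(ic[t] for t in terms_p | terms_q)
    if union_weight == 0:
        return 0.0  # only zero-IC terms (e.g. the root) are present
    return sum(ic[t] for t in terms_p & terms_q) / union_weight

# Hypothetical IC scores and ancestor-closed annotation sets.
toy_ic = {"root": 0.0, "A": 0.3, "B": 0.7, "C": 1.4}
protein_p_terms = {"root", "A", "C"}
protein_q_terms = {"root", "A", "B"}

ui = sim_ui(protein_p_terms, protein_q_terms)
gic = sim_gic(protein_p_terms, protein_q_terms, toy_ic)
print(ui, gic)
```

In this toy case the two proteins share only uninformative terms, so the IC-weighted score is much lower than the unweighted one.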

Important Note: All of the above should be considered suggestions, and users are free to use any measure suitable for their application. Please refer to the Limited Warranty and Liability section (Terms of Use) for more details.