DaGO-Fun - Database for GO-based Functional Annotation Analysis

GOSP-FCT Help and Description

Welcome to the user guide for GOSP-FCT, a tool for clustering proteins or genes based on their GO annotations using protein semantic similarity measures. The tool provides a stepwise query selection menu: as the user constructs a query, the menu adapts and leaves open only the options that are relevant to the selections already made.

1. User Input step and Result outputs

The user input is a list of UniProt protein accessions or gene names. The identifiers are aligned and either pasted into the Input text area or uploaded from a file. The user input is of the following form:

For the selected protein functional similarity measure, the GOSP-FCT tool produces a comprehensive summary in table format for the graph spectral (k-means) and model-based clustering models, or a dendrogram plot for the hierarchical clustering model, on the next page of the user interface. The table output has two columns as shown below:

Note that clicking on a given UniProt protein accession or gene name opens a new page that displays the associated details about the terms in its GO annotations, as provided by the QuickGO tool.

2. Different Clustering Approaches in GOSP-FCT

GOSP-FCT supports three clustering approaches: hierarchical clustering, graph spectral (k-means) clustering, and a community-detection model, which is referred to as the model-based approach.
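To make the hierarchical approach more concrete, the following is a minimal sketch of agglomerative clustering with average linkage. It is not the GOSP-FCT implementation: the linkage rule, the conversion of similarity to distance as 1 - similarity, the function name `average_linkage`, and the toy similarity scores for the placeholder proteins P1 to P4 are all assumptions made for illustration.

```python
def average_linkage(dist, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    Returns a list of clusters, each a list of row indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average distance over all cross-cluster pairs.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        pairs = [
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        ]
        _, x, y = min(pairs)  # closest pair of clusters
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical functional similarity scores in [0, 1] (toy values).
names = ["P1", "P2", "P3", "P4"]
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
# Assumed conversion: distance = 1 - similarity.
dist = [[1.0 - s for s in row] for row in sim]

groups = [[names[i] for i in c] for c in average_linkage(dist, 2)]
print(groups)
```

Recording the order and height of each merge, rather than stopping at a fixed number of clusters, is what would yield a dendrogram.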

When using the k-means or graph spectral approach, the user must provide the number of clusters as input, and an inappropriate choice of this number may produce poor results. We therefore urge users to perform diagnostic checks to determine the number of clusters for their set of proteins before running this approach. In addition, k-means is very sensitive to the initial, randomly selected cluster centers and can get stuck in a local optimum, which may increase the computational time and miss the optimal cluster configuration. The algorithm should therefore be run multiple times to reduce the effect of the random initialization; if the running time keeps increasing, restart the browser.
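The two recommendations above (checking the number of clusters and running k-means several times) can be sketched as follows. This is a generic illustration, not the code behind GOSP-FCT: it assumes each protein has already been mapped to a numeric vector, and the function names, the toy 2-D points, and the use of the within-cluster sum of squares as the diagnostic are all choices made for this example.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run from random initial centers; returns (labels, inertia)."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    inertia = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, inertia


def best_of_restarts(points, k, n_restarts=50, seed=0):
    """Run k-means n_restarts times and keep the run with the lowest inertia."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(n_restarts))
    return min(runs, key=lambda run: run[1])


# Toy data: three well-separated groups of hypothetical 2-D vectors.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]

# Simple diagnostic: inspect how the inertia drops as k grows.
for k in range(1, 6):
    _, inertia = best_of_restarts(points, k)
    print(k, round(inertia, 2))
```

A common heuristic is to pick the value of k after which the inertia stops decreasing sharply. Keeping the best of several restarts reduces, but does not eliminate, the risk of reporting a local optimum.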

In general, clustering algorithms are computationally complex and may be too slow for a large number of UniProt protein accessions or gene names. The choice of protein functional similarity measure (metric or distance) influences the clustering results, as two proteins may be similar (close) according to one functional similarity measure and farther apart according to another. Users should be aware that different choices of protein functional similarity measure can lead to vastly different results.
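As a small illustration of this point, the sketch below compares the nearest neighbour of one protein under two similarity measures. The measures `measure_a` and `measure_b`, the placeholder proteins P1 to P3, and every score are invented toy values; they do not correspond to any measure offered by GOSP-FCT.

```python
def most_similar(scores, protein):
    """Return the protein with the highest similarity to the given one."""
    others = {p: s for p, s in scores[protein].items() if p != protein}
    return max(others, key=others.get)


# Hypothetical pairwise scores for the same three proteins under two measures.
measure_a = {
    "P1": {"P2": 0.8, "P3": 0.4},
    "P2": {"P1": 0.8, "P3": 0.5},
    "P3": {"P1": 0.4, "P2": 0.5},
}
measure_b = {
    "P1": {"P2": 0.3, "P3": 0.7},
    "P2": {"P1": 0.3, "P3": 0.6},
    "P3": {"P1": 0.7, "P2": 0.6},
}

for name, scores in (("measure_a", measure_a), ("measure_b", measure_b)):
    print(name, "closest to P1:", most_similar(scores, "P1"))
```

Because any clustering approach groups proteins by closeness, a change in which proteins are closest to each other can change the resulting clusters.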

3. Important Note:

We aim to let the GOSP-FCT tool cluster proteins for as many UniProt protein accessions or gene names as possible. However, because of the computational complexity of these clustering approaches, limited resources, and visualization constraints, especially when running the hierarchical clustering approach, we have to limit the maximum number of UniProt protein accessions or gene names per user query. A list of no more than 200 UniProt protein accessions or gene names is recommended for GOSP-FCT. If your data exceed this limit, you can contact the administrators, who are willing to collaborate and run large data sets for analysis.
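Before submitting a query, a list can be checked against the recommended limit with a few lines of code. The sketch below is only a local pre-check: the accepted separators (whitespace, commas, semicolons), the removal of duplicates, and the names `parse_identifiers` and `within_recommended_size` are assumptions, since this guide does not specify how the tool itself parses the input.

```python
import re

MAX_IDENTIFIERS = 200  # recommended upper bound stated in this guide


def parse_identifiers(text):
    """Split pasted text into identifiers, dropping duplicates but keeping order.

    Assumes identifiers are separated by whitespace, commas or semicolons.
    """
    tokens = [t for t in re.split(r"[\s,;]+", text) if t]
    return list(dict.fromkeys(tokens))


def within_recommended_size(identifiers, limit=MAX_IDENTIFIERS):
    """True if the query does not exceed the recommended number of identifiers."""
    return len(identifiers) <= limit


# Placeholder identifiers, one per line, with one duplicate.
pasted = "ID_A\nID_B\nID_A\nID_C\n"
ids = parse_identifiers(pasted)
print(len(ids), "unique identifiers; within limit:", within_recommended_size(ids))
```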

For more information, please refer to the associated publication: Gaston K. Mazandu and Nicola J. Mulder, "DaGO-Fun: Tool for Gene Ontology-based functional analysis using term information content measures", 2013 (DaGO-Fun preliminary paper, currently under review).