DaGO-Fun - Database for GO-based Functional Annotation Analysis

Protein Interaction Sources and Confidence Level

The recent availability of large amounts of biological data from primary genomic sequences, together with the exponential growth of high-throughput experimental datasets, has made it possible to infer an organism's protein interactions with improved confidence. These interactions are discovered by various experimental approaches and are often complemented by prediction techniques. Because of issues inherent in high-throughput data, including noise, the experimental environment and the nature of the approach used for each experiment, a given approach may misclassify interactions: it may fail to detect true interactions (false negatives) or wrongly report interactions that do not occur (false positives). The lack of appropriate techniques to address these shortcomings biases the outputs, and the bias depends on the technology used. To alleviate false negatives, data integration is deployed: information from multiple interaction data sources is combined into one unified network, leading to higher confidence and increased coverage. To alleviate false positives, a reliability threshold is applied, discarding all functional interactions whose reliability or confidence score is below the threshold. Together, these techniques are expected to significantly reduce the false negative and false positive rates of the resulting network, yielding a network of high-confidence interactions.
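The integration step described above can be illustrated with a minimal Python sketch. It is not the DaGO-Fun implementation: the function name `merge_sources`, the source names and the toy scores are all hypothetical, and the only assumption is that each source provides undirected, scored protein pairs.

```python
from collections import defaultdict

def merge_sources(sources):
    """Merge per-source edge lists into one undirected network.

    `sources` maps a (hypothetical) source name to an iterable of
    (protein_a, protein_b, score) tuples. The result maps each unordered
    protein pair to a dict {source_name: score}, so every edge keeps a
    record of which sources support it.
    """
    network = defaultdict(dict)
    for name, edges in sources.items():
        for a, b, score in edges:
            if a == b:
                continue  # ignore self-interactions in this sketch
            pair = frozenset((a, b))  # (a, b) and (b, a) are the same edge
            # If a source lists a pair more than once, keep its best score.
            network[pair][name] = max(score, network[pair].get(name, 0.0))
    return dict(network)

# Toy data, invented for illustration only.
toy_sources = {
    "source_A": [("P1", "P2", 0.8), ("P2", "P3", 0.4)],
    "source_B": [("P2", "P1", 0.5), ("P3", "P4", 0.6)],
}
unified = merge_sources(toy_sources)
```

In this toy case each source contributes two edges, while the unified network holds three distinct edges, one of which is supported by both sources. This is the sense in which integration can increase both coverage and confidence.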

1. Data Sources for Generating Functional Networks

An organism's functional networks were obtained by combining heterogeneous sources of biological data, with protein identifiers mapped to UniProt IDs from the UniProt database. These biological data fall into two classes: genomic data and functional data. Functional data include data from:

  1. The STRING database, comprising interactions derived from genomic context (conserved genomic neighbourhood or gene order, gene fusion events, and gene co-occurrence or phylogenetic profiles across genomes) and text mining.
  2. Other protein interaction databases, which include experimental data from BIND, GRID, HPRD and MINT, co-expressed proteins derived from similar patterns of mRNA expression, and curated data from KEGG, Reactome, BioGRID, MIPS and others.
  3. Interolog data, comprising interacting proteins in one organism whose corresponding orthologs also interact in another organism, and curated physical interactions from the IntAct and DIP databases, among others.

Genomic data consist of sequence data, including:

  1. Sequence Similarity data, and
  2. Protein family and shared domain data derived from the InterPro database.

mRNA expression data measured by DNA arrays were downloaded from the Stanford Microarray Database (SMD) and the NCBI Gene Expression Omnibus (GEO) database. Protein sequences in FASTA format, InterPro data and ortholog files were retrieved from the Integr8 project at the European Bioinformatics Institute (EBI).
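The interolog idea mentioned among the functional data sources can be sketched as follows. This is a simplified illustration, not the procedure used to build the database: the function `transfer_interologs`, the protein names and the ortholog table are hypothetical, and the sketch assumes that an interaction is proposed in the target organism whenever both partners of a known interaction have orthologs there.

```python
def transfer_interologs(known_interactions, orthologs):
    """Propose interactions in a target organism from a source organism.

    known_interactions: iterable of (a, b) protein pairs known to interact
        in the source organism.
    orthologs: dict mapping a source-organism protein to the set of its
        orthologs in the target organism.
    Returns a set of unordered target-organism pairs (frozensets).
    """
    predicted = set()
    for a, b in known_interactions:
        # Proteins without a listed ortholog contribute no prediction.
        for target_a in orthologs.get(a, ()):
            for target_b in orthologs.get(b, ()):
                if target_a != target_b:
                    predicted.add(frozenset((target_a, target_b)))
    return predicted

# Toy data, invented for illustration only.
known = [("A1", "A2"), ("A2", "A3")]
ortholog_table = {"A1": {"B1"}, "A2": {"B2", "B2x"}}
interologs = transfer_interologs(known, ortholog_table)
```

Here `A3` has no ortholog in the target organism, so only the `A1`-`A2` interaction is transferred, giving the pairs `B1`-`B2` and `B1`-`B2x`.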

2. Scoring Interaction Confidence

A major challenge is to produce an accurate map of the protein-protein interactions occurring within a cell, because large-scale experimentally determined interactions are incomplete and have relatively high error rates. To reduce these biases, which are hard to overcome, we integrated information from multiple interaction data sources into a unified network, assessed the reliability of the various protein interaction datasets, and applied a reliability threshold to discard all functional interactions with scores below the threshold. Each interaction is scored according to its source or the computational approach used to derive it. Understanding the properties of these functional interactions is key to successful mathematical modeling of such a system and to developing efficient scoring techniques. Thus,

  • Functional interactions from the STRING database are used with confidence scores as defined by the STRING schemes.
  • Interactions derived from other databases and interologs are scored according to the confidence level or reliability of their sources, and a random partial least squares regression technique was used to score co-expressed genes.
  • Functional interaction pairs predicted from protein sequence similarity and shared domain data were scored using information-theoretic approaches.

The combined link confidence score between two proteins gives an integrated view of all datasets through a unified network, and is computed under the assumption that the data sources are independent. Interactions with scores strictly less than 0.3 are considered low confidence, scores from 0.3 to 0.7 are classified as medium confidence, and scores greater than 0.7 are high confidence. The different functional networks consider only interactions of medium confidence and those predicted by at least two different sources.
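The text does not state the exact formula for the combined score. One common form under an independence assumption is the "noisy-OR" combination, 1 - ∏(1 - s_i), where s_i is the score from source i. The Python sketch below assumes that form and applies the confidence bands given above; the function names are hypothetical and this should not be read as the database's actual code.

```python
from math import prod

def combined_score(scores):
    """Combine per-source scores, assuming independent sources.

    Assumed noisy-OR form: the interaction is missed only if every
    source misses it, so the combined score is 1 - prod(1 - s_i).
    """
    return 1.0 - prod(1.0 - s for s in scores)

def confidence_level(score):
    """Map a score to the confidence bands described in the text."""
    if score < 0.3:
        return "low"
    if score <= 0.7:
        return "medium"
    return "high"

# Two medium-confidence sources reinforce each other.
example = combined_score([0.5, 0.6])
example_level = confidence_level(example)
```

With this form, two sources scoring 0.5 and 0.6 combine to 1 - 0.5 × 0.4 = 0.8, which falls in the high-confidence band even though each source alone is only medium confidence.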