|DaGO-Fun - Database for GO-based Functional Annotation Analysis|
|Protein Interaction Sources and Confidence Level|
The recent availability of large amounts of biological data from primary genomic sequences, together with the exponential growth of high-throughput experimental datasets, has made it possible to infer an organism's protein interactions with improved confidence. These interactions are discovered by various experimental approaches and are often complemented by prediction techniques. Because of issues inherent to high-throughput data, including noise, the experimental environment and the nature of the approach used for each experiment, a given approach may misclassify interactions, either failing to detect true interactions (false negatives) or wrongly reporting interactions that do not occur (false positives). Without appropriate techniques to address these shortcomings, the outputs are biased, and the bias depends on the technology used. To alleviate false negatives, data integration combines information from multiple interaction data sources into one unified network, leading to higher confidence and increased coverage. To alleviate false positives, a reliability threshold is applied, discarding all functional interactions whose reliability or confidence score is below the threshold. Together, these techniques are expected to significantly reduce the false negative and false positive rates of the resulting network, yielding a network of high-confidence interactions.
|1. Data Sources for Generating Functional Networks|
Organism's functional networks were obtained by combining heterogeneous sources of biological data, with protein identifiers mapped to UniProt IDs from the UniProt database. These biological data fall into two classes: genomic data and functional data. Functional data include data from
Genomic data consist of sequence data, including:
mRNA expression data measured by DNA arrays were downloaded from the Stanford Microarray Database (SMD) and the NCBI Gene Expression Omnibus (GEO) database. Protein sequences in FASTA format, InterPro data and ortholog files were retrieved from the Integr8 project at the European Bioinformatics Institute (EBI).
|2. Scoring Interaction Confidence|
A major challenge is to produce an accurate map of the protein-protein interactions occurring within a cell, because large-scale interaction sets determined experimentally are incomplete and have relatively high error rates. To reduce these biases, which are hard to overcome, we integrated information from multiple interaction data sources into a unified network, assessed the reliability of the various protein interaction datasets and applied a reliability threshold to discard all functional interactions scoring below it. Each interaction is scored according to its source or the computational approach used to derive it. Understanding the properties of these functional interactions is key to modeling such a system mathematically and to developing efficient scoring techniques. Thus,
The combined link confidence score between two proteins provides an integrated view of all datasets through a unified network, under the assumption of independence. Interactions with scores strictly less than 0.3 are considered low confidence, scores from 0.3 to 0.7 medium confidence, and scores greater than 0.7 high confidence. The different functional networks consider only interactions of medium confidence or higher and those predicted by at least two different sources.
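The scoring and thresholding steps can be illustrated with a minimal Python sketch. The combination formula is not given here, so the sketch assumes the noisy-OR form 1 - prod(1 - s_i), a common way to combine per-source scores that are treated as independent; it may differ from the formula actually used. The function names and protein identifiers are hypothetical, and only the 0.3 and 0.7 cutoffs come from the text.

```python
from math import prod

# Cutoffs taken from the text: < 0.3 low, 0.3 to 0.7 medium, > 0.7 high.
LOW_MEDIUM_CUTOFF = 0.3
MEDIUM_HIGH_CUTOFF = 0.7


def combined_score(source_scores):
    """Combine per-source scores for one protein pair.

    ASSUMPTION: noisy-OR combination, 1 - prod(1 - s_i), which treats
    the sources as independent. The text does not state the formula.
    """
    return 1.0 - prod(1.0 - s for s in source_scores)


def confidence_level(score):
    """Map a combined score to the confidence classes given in the text."""
    if score < LOW_MEDIUM_CUTOFF:
        return "low"
    if score <= MEDIUM_HIGH_CUTOFF:
        return "medium"
    return "high"


def filter_network(edges, threshold=LOW_MEDIUM_CUTOFF):
    """Keep only protein pairs whose combined score reaches the threshold.

    `edges` maps a (protein_a, protein_b) pair to its list of
    per-source scores; the result maps each kept pair to its combined score.
    """
    kept = {}
    for pair, scores in edges.items():
        score = combined_score(scores)
        if score >= threshold:
            kept[pair] = score
    return kept


# Hypothetical identifiers and scores, for illustration only.
example_edges = {
    ("P1", "P2"): [0.5, 0.5],  # two sources agree: combined score 0.75
    ("P1", "P3"): [0.2],       # single weak source: stays low confidence
}
example_network = filter_network(example_edges)
```

Under this assumption, two medium-confidence observations of the same pair reinforce each other and push the combined score into the high-confidence range, while a single weak observation falls below the 0.3 threshold and is discarded.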