Identifying Gene or Probe ID Type

Description

The S4 generic idtype automatically determines the type of gene/feature identifiers stored in objects, based on a combination of regular expression patterns and test functions.

Usage

idtype(object, ...)

S4 (missing)
`idtype`(object, def = FALSE)

S4 (ProbeAnnDbBimap)
`idtype`(object, limit = 500L, ...)

S4 (ChipDb)
`idtype`(object, limit = 500L, ...)

S4 (AnnDbBimap)
`idtype`(object, limit = 500L, ...)

S4 (MarkerList)
`idtype`(object, each = FALSE, ...)

S4 (vector)
`idtype`(object, each = FALSE, limit = NULL, no.match = "")

Arguments

object
an R object that contains the gene identifiers whose type is to be determined.
...
extra arguments to allow extension, generally passed down to idtype,character-method. See each method's description for more details.
def
a logical or a subsetting vector, used when object is missing. It indicates whether the result should contain the definition of the matching pattern/function of each type, or which types' definitions should be included in the result list.
each
logical indicating whether the type of each element should be returned (TRUE) or only the type of the vector as a whole (default).
limit
specification for limiting which elements are used to detect the type of identifiers. If a single numeric, then only the first limit elements are used. Otherwise it must be a subsetting logical or numeric vector.
no.match
character string that specifies the string to use when the type cannot be determined.

The IDs can be either:
  • probe IDs (e.g. 123456_at or ILMN_123456 for Affymetrix or Illumina chips respectively), for which the type starts with a dot '.', allowing the subsequent handling of such IDs as a group;
  • other biological ID types, for which the results are character strings such as those used as attributes in Bioconductor annotation packages (e.g. "ENTREZID" or "ENSEMBL");
  • names of annotation packages, e.g. "hgu133plus2.db".
This function is able to identify the following ID types using regular expression patterns or a dedicated function:
  • ENSEMBL = "^ENSG[0-9]+$"
  • ENSEMBLTRANS = "^ENST[0-9]+$"
  • ENSEMBLPROT = "^ENSP[0-9]+$"
  • ENTREZID = "^[0-9]+$"
  • IMAGE = "^IMAGE:[0-9]+$"
  • GOID = "^GO:[0-9]+$"
  • PFAM = "^PF[0-9]+$"
  • REFSEQ = "^N[MP]_[0-9]+$"
  • ENZYME = "^[0-9]+(\.(([0-9]+)|-)+){3}$"
  • MAP = "^[0-9XY]+((([pq])|(cen))(([0-9]+(\.[0-9]+)?)|(ter))?(-([0-9XY]+)?(([pq]?)|(cen))((ter)|([0-9]+(\.[0-9]+)?))?)?)?$"
  • GENEBANK (Nucleotide) = "^[A-Z][0-9]{5}$" | "^[A-Z]{2}[0-9]{6}$"
  • GENEBANK (Protein) = "^[A-Z]{3}[0-9]{5}$"
  • GENEBANK (WGS) = "^[A-Z]{4}[0-9]{8}[0-9]?[0-9]?$"
  • GENEBANK (MGA) = "^[A-Z]{5}[0-9]{7}$"
  • GENENAME = " "
  • .Affymetrix = "(^AFFX-)|(^[0-9]+_([abfgilrsx]_)?([as]t)|(i))$"
  • .Illumina = "^ILMN_[0-9]+$"
  • .Agilent = "^A_[0-9]+_P[0-9]+$"
  • .nuID = use the function nuIDdecode to try converting the ids into nucleotide sequences. Identification is positive if no error is thrown during the conversion.
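
The pattern-based part of this detection can be illustrated with base R alone. The following is a minimal sketch, not the package's implementation: the helper match_types is hypothetical, it uses only a few of the patterns listed above, and it simply tests each identifier against each pattern with grepl.

```r
# A few of the regular expression patterns listed above.
patterns <- c(
  ENSEMBL   = "^ENSG[0-9]+$",
  ENTREZID  = "^[0-9]+$",
  GOID      = "^GO:[0-9]+$",
  .Illumina = "^ILMN_[0-9]+$",
  .Agilent  = "^A_[0-9]+_P[0-9]+$"
)

# Hypothetical helper (not part of the package): for each id, return the
# name of the first pattern that matches, or `no.match` when none does.
match_types <- function(ids, no.match = "") {
  vapply(ids, function(id) {
    hit <- names(patterns)[vapply(patterns, grepl, logical(1), x = id)]
    if (length(hit)) hit[1L] else no.match
  }, character(1))
}

match_types(c("ENSG00000139618", "672", "GO:0008150", "ILMN_1234", "foo"))
```

The result is a character vector named by the input identifiers, with an empty string for "foo", which matches none of these patterns.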

Value

a single character string (possibly empty) if each=FALSE (default) or a character vector of the same "length" as object otherwise.
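
Because probe ID types start with a dot, a per-element result can be split into probe and non-probe identifiers with plain string handling. The sketch below hard-codes the value shown for idtype(ids, each=TRUE) in the Examples section, so that it runs without the package; the variable names are illustrative only.

```r
# Per-element types, copied from the each=TRUE output in the Examples
# section (hard-coded so the snippet runs without the package).
types <- c("12345_at" = ".Affymetrix",
           "23232_at" = ".Affymetrix",
           "Hs.1213"  = "UNIGENE")

# Probe ID types start with a dot, so they can be selected as a group.
is_probe  <- startsWith(types, ".")
probe_ids <- names(types)[is_probe]
other_ids <- names(types)[!is_probe]

probe_ids
other_ids
```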

Details

It uses a heuristic based on a set of regular expressions and functions that uniquely match most common types of identifiers, such as Unigene, Entrez Gene, Affymetrix probe IDs, Illumina probe IDs, etc.

Methods

  1. idtype signature(object = "missing"): Method for when idtype is called with its first argument missing, in which case it returns all or a subset of the known type names as a character vector, or optionally as a list that contains their definition, i.e. a regular expression or a matching function.

  2. idtype signature(object = "matrix"): Detects the type of identifiers used in the row names of a matrix.

  3. idtype signature(object = "ExpressionSet"): Detects the type of identifiers used in the feature names of an ExpressionSet object.

  4. idtype signature(object = "NMF"): Detects the type of identifiers used in the row names of the basis matrix of an NMF model.

  5. idtype signature(object = "ProbeAnnDbBimap"): Detects the type of the primary identifiers of a probe annotation bimap object.

    To speed up the identification, only the first 500 probes are used by default, since the IDs are very likely to have been curated and to be of the same type. This can be changed using argument limit.

  6. idtype signature(object = "ChipDb"): Detects the type of the identifiers of a chip annotation object.

    To speed up the identification, only the first 500 probes are used by default, since the IDs are very likely to have been curated and to be of the same type. This can be changed using argument limit.

  7. idtype signature(object = "AnnDbBimap"): Detects the type of the identifiers of an organism annotation object.

    To speed up the identification, only the first 500 probes are used by default, since the IDs are very likely to have been curated and to be of the same type. This can be changed using argument limit.

  8. idtype signature(object = "GeneIdentifierType"): Returns the type of identifier defined by a GeneIdentifierType object. Note that this method is a bit special in the sense that it returns the string "ANNOTATION" for annotation-based identifiers, but does not tell which platform they relate to. This is different from what idtype would do if applied to the primary identifiers of the corresponding annotation package.

  9. idtype signature(object = "list"): Detects the type of all elements in a list, but provides the option of detecting the type of each element separately.

  10. idtype signature(object = "NULL"): Dummy method, defined for convenience, that returns ''.

  11. idtype signature(object = "vector"): This is the workhorse method that determines the type of IDs contained in a character vector.
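
The effect of limit on whole-vector detection can be sketched in base R. This is an illustration under stated assumptions, not the package's code: detect_type is a hypothetical helper, it knows only two patterns, and it assumes that a type is reported for the whole vector only when every inspected element matches it, which is consistent with the mixed-type case in the Examples.

```r
# Two patterns only, for illustration.
patterns <- c(
  UNIGENE   = "^[A-Z][a-z]\\.[0-9]+$",
  .Illumina = "^ILMN_[0-9]+$"
)

# Hypothetical helper (not part of the package).
detect_type <- function(ids, limit = NULL, no.match = "") {
  if (!is.null(limit)) {
    # A single number keeps the first `limit` elements; any other value
    # is used as a subsetting (logical or numeric) vector.
    ids <- if (is.numeric(limit) && length(limit) == 1L) head(ids, limit) else ids[limit]
  }
  if (!length(ids)) return(no.match)
  # Assumption: a type is reported only if all inspected ids match it.
  for (type in names(patterns)) {
    if (all(grepl(patterns[[type]], ids))) return(type)
  }
  no.match
}

ids <- c("ILMN_1", "ILMN_2", "Hs.1213")
detect_type(ids)                                 # all three ids inspected
detect_type(ids, limit = 2L)                     # first two ids only
detect_type(ids, limit = c(FALSE, FALSE, TRUE))  # third id only
```

With these inputs the first call finds no common type, while restricting the inspected elements lets a single type match. This is also why a small limit is only safe when the identifiers are known to be homogeneous, as is assumed for curated annotation objects.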

Examples


# all known types
idtype()
##  [1] "UNIGENE"      "ENSEMBL"      "ENSEMBLTRANS" "ENSEMBLPROT" 
##  [5] "ENTREZID"     "IMAGE"        "GOID"         "PFAM"        
##  [9] "REFSEQ"       "ENZYME"       "MAP"          "GENEBANK"    
## [13] "GENEBANK"     "GENEBANK"     "GENEBANK"     "GENENAME"    
## [17] ".Affymetrix"  ".Illumina"    ".Agilent"     ".nuID"
# with their definitions
idtype(def=TRUE)
## $UNIGENE
## [1] "^[A-Z][a-z]\\.[0-9]+$"
## 
## $ENSEMBL
## [1] "^ENSG[0-9]+$"
## 
## $ENSEMBLTRANS
## [1] "^ENST[0-9]+$"
## 
## $ENSEMBLPROT
## [1] "^ENSP[0-9]+$"
## 
## $ENTREZID
## [1] "^[0-9]+$"
## 
## $IMAGE
## [1] "^IMAGE:[0-9]+$"
## 
## $GOID
## [1] "^GO:[0-9]+$"
## 
## $PFAM
## [1] "^PF[0-9]+$"
## 
## $REFSEQ
## [1] "^[XYN][MPR]_[0-9]+$"
## 
## $ENZYME
## [1] "^[0-9]+(\\.(([0-9]+)|-)+){3}$"
## 
## $MAP
## [1] "^(([0-9]{1,2})|([XY]))((([pq])|(cen))(([0-9]+(\\.[0-9]+)?)|(ter))?(-([0-9]{1,2})|([XY]))?(([pq]?)|(cen))((ter)|([0-9]+(\\.[0-9]+)?))?)?)?$"
## 
## $GENEBANK
## [1] "^[A-Z][0-9]{5}$"    "^[A-Z]{2}[0-9]{6}$"
## 
## $GENEBANK
## [1] "^[A-Z]{3}[0-9]{5}$"
## 
## $GENEBANK
## [1] "^[A-Z]{4}[0-9]{8}[0-9]?[0-9]?$"
## 
## $GENEBANK
## [1] "^[A-Z]{5}[0-9]{7}$"
## 
## $GENENAME
## [1] " "
## 
## $.Affymetrix
## [1] "(^AFFX[-_])|(^[0-9]+_([abfgilrsx]_)?([as]t)|(i))$"
## 
## $.Illumina
## [1] "^ILMN_[0-9]+$"
## 
## $.Agilent
## [1] "^A_[0-9]+_P[0-9]+$"
## 
## $.nuID
## function (x) 
## !is.na(nuIDdecode(x, error = NA))
## <environment: 0xd36c138>
idtype(def='ENTREZID')
## [1] "^[0-9]+$"
idtype(def=c('ENTREZID', 'ENSEMBLTRANS'))
## $ENTREZID
## [1] "^[0-9]+$"
## 
## $ENSEMBLTRANS
## [1] "^ENST[0-9]+$"
# from GeneIdentifierType objects
idtype(NullIdentifier())
## [1] ""
idtype(AnnotationIdentifier('hgu133a.db'))
## [1] "ANNOTATION"
# but
## Not run: 
##D     library(hgu133a.db)
##D     idtype(hgu133a.db)
## End(Not run)
idtype("12345_at")
## [1] ".Affymetrix"
idtype(c("12345_at", "23232_at", "555_x_at"))
## [1] ".Affymetrix"
# mixed types
ids <- c("12345_at", "23232_at", "Hs.1213")
idtype(ids) # not detected
## [1] ""
idtype(ids, each=TRUE)
##      12345_at      23232_at       Hs.1213 
## ".Affymetrix" ".Affymetrix"     "UNIGENE"