Converting Gene or Probeset IDs

Description

The CellMix package implements methods for converting marker identifiers across platforms and organisms. This allows deconvolution analyses to combine multiple independent sources of data, e.g., using a marker gene list obtained from data on one platform to deconvolve gene expression data generated on another platform.

Usage

convertIDs(object, to, from, ...)

S4 (list,GeneIdentifierType,GeneIdentifierType)
`mapIdentifiers`(what, to, from, ..., verbose = FALSE)

S4 (MarkerList,GeneIdentifierType,GeneIdentifierType)
`convertIDs`(object, to, from, verbose = FALSE, nodups = NULL, unlist = TRUE, 
  ...)

S4 (character,GeneIdentifierType,GeneIdentifierType)
`convertIDs`(object, to, from, method = "auto", unlist = TRUE, ...)

S4 (matrix,GeneIdentifierType,GeneIdentifierType)
`convertIDs`(object, to, from, ..., unlist = TRUE, rm.duplicates = NULL)

S4 (ExpressionSet,GeneIdentifierType,GeneIdentifierType)
`convertIDs`(object, to, from, ..., unlist = TRUE, rm.duplicates = NULL)

Arguments

object
object whose identifiers are converted.
to
specification of the type of identifiers to convert to.
from
specification of the type of identifiers of object. This is only needed when the source type cannot be inferred from object itself.
...
extra arguments to allow extension, which are passed down to the workhorse method convertIDs,character,GeneIdentifierType,GeneIdentifierType. See each method's description for more details.
what
identifiers to map
verbose
a logical or integer that sets the verbosity level.
nodups
specifies if marker identifiers that are duplicated across cell types in the result should be removed (TRUE) or not (FALSE). If NULL, then duplicates are removed only if there were no duplicates in the source object.
method
mapping method, passed to mapIDs, that indicates how to carry out the mapping between the original and final identifier type.
unlist
logical that indicates if the result should be flattened, i.e. turned into a vector rather than a list, using unlist2. In this case, the vector's names correspond to the source identifiers.
rm.duplicates
logical that indicates how duplicated matches should be treated. rm.duplicates=FALSE does not allow any duplicated match and throws an error if any is present. If TRUE or NULL, only the first match is kept; a warning is thrown only when NULL.
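
The three settings of rm.duplicates can be illustrated with a small base R sketch. The helper resolve_duplicates below is hypothetical and not part of CellMix; it only mimics the behaviour described above on a plain named vector of converted ids, under the assumption that a "duplicated match" means several source ids mapping to the same target id.

```r
# Hypothetical helper (not part of CellMix): mimics the documented
# rm.duplicates semantics on a named vector of converted ids.
resolve_duplicates <- function(ids, rm.duplicates = NULL) {
  # flag every repeated (non-NA) target id after its first occurrence
  dups <- duplicated(ids) & !is.na(ids)
  if (!any(dups)) return(ids)
  if (identical(rm.duplicates, FALSE)) {
    # FALSE: duplicated matches are not allowed
    stop("duplicated matches found: ",
         paste(unique(ids[dups]), collapse = ", "))
  }
  if (is.null(rm.duplicates)) {
    # NULL: keep the first match, but tell the user
    warning("dropping ", sum(dups), " duplicated match(es); keeping the first")
  }
  # TRUE or NULL: only the first match is kept
  ids[!dups]
}

# Toy converted ids: two source probesets map to the same target id
converted <- c(p1 = "673", p2 = "725", p3 = "673")

kept <- resolve_duplicates(converted, rm.duplicates = TRUE)
print(kept)
```

Calling the helper with rm.duplicates = FALSE on the same input stops with an error, and with rm.duplicates = NULL it returns the same result as TRUE but emits a warning.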

Details

The function convertIDs provides the main interface to convert gene/probeset ids into ids compatible with another given dataset. It is typically useful to convert built-in marker gene lists (see cellMarkers).

The identifier conversion functions and methods defined in the CellMix package can be seen as extending the existing framework defined in the GSEABase package, with the generic mapIdentifiers.
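
Because the methods dispatch on GeneIdentifierType objects, the to and from specifications can presumably be built with the constructors exported by GSEABase. The sketch below uses GSEABase only, not CellMix, and assumes that GSEABase is installed (the block is skipped otherwise); whether CellMix accepts these objects exactly as constructed here is an assumption.

```r
# Sketch using GSEABase only; assumes GSEABase is installed (skipped otherwise).
if (requireNamespace("GSEABase", quietly = TRUE)) {
  # Entrez gene ids, tagged with the name of an organism-level annotation package
  from <- GSEABase::EntrezIdentifier("org.Hs.eg.db")
  # Probeset ids are described by the chip annotation package they belong to
  to <- GSEABase::AnnotationIdentifier("hgu133a.db")

  # Both inherit from GeneIdentifierType, the class used in the method signatures
  print(methods::is(from, "GeneIdentifierType"))
  print(methods::is(to, "GeneIdentifierType"))

  # Printing shows the identifier type and its annotation
  print(from)
  print(to)
}
```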

Methods

  1. convertIDs signature(object = "list", to = "GeneIdentifierType", from = "GeneIdentifierType"): Apply the conversion to each element of the list.

  2. convertIDs signature(object = "MarkerList", to = "GeneIdentifierType", from = "GeneIdentifierType"): Convert IDs from a MarkerList object.

    In this case, argument unlist indicates if the result should be a simple list containing the mapping (a list) for each cell type or a MarkerList-class object (default).

  3. convertIDs signature(object = "character", to = "GeneIdentifierType", from = "GeneIdentifierType"): This is the workhorse method that is eventually called by all other convertIDs methods. The actual conversions are performed by mapIDs, to which all arguments in ... are passed, in particular arguments verbose and method.

  4. convertIDs signature(object = "matrix", to = "GeneIdentifierType", from = "GeneIdentifierType"): Convert the row names of a matrix into other identifiers.

    In this case, argument unlist indicates if the converted ids should be used to subset the original matrix object, or returned directly as a list.

  5. convertIDs signature(object = "ExpressionSet", to = "GeneIdentifierType", from = "GeneIdentifierType"): Convert the feature names of an ExpressionSet into other identifiers.

    In this case, argument unlist indicates if the converted ids should be used to subset the original ExpressionSet object, or returned directly as a list.

  6. convertIDs signature(object = "ANY", to = "ANY", from = "NullIdentifier"): Convert identifiers, inferring the type of origin from the object itself, but keep the annotation specification embedded in from.

  7. convertIDs signature(object = "ANY", to = "ANY", from = "ANY"): Convert identifiers, inferring the type from the specifications in to and from, e.g., to='ENTREZID' or 'UNIGENE'. If not specified in either to or from, the annotation is taken from object. If from is missing, the source type is inferred from object itself.

  8. convertIDs signature(object = "ANY", to = "list", from = "missing"): Convert identifiers using a given map or list of maps.

  9. mapIdentifiers signature(what = "list", to = "GeneIdentifierType", from = "GeneIdentifierType"): Applies mapIdentifiers to each element in a list.

    All arguments in ... are passed to the subsequent calls to mapIdentifiers.
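
The way the list methods delegate to a character-level conversion, and the effect of flattening the result, can be sketched in base R. Everything below is illustrative: id_map is a made-up lookup table and convert_chr is a hypothetical stand-in for the character workhorse, neither of which comes from CellMix.

```r
# Made-up lookup table, for illustration only:
# names are source ids, values are target ids.
id_map <- c(a = "X1", b = "X2", c = "X3")

# Hypothetical stand-in for the character workhorse: look up each id,
# return NA for ids absent from the map, and name the result by source id.
convert_chr <- function(ids, map) {
  setNames(unname(map[ids]), ids)
}

# List-style behaviour: apply the conversion to each element of the list
markers <- list(B = c("a", "b"), T = c("c", "zzz"))
res <- lapply(markers, convert_chr, map = id_map)
str(res)

# Flattening: turn the list into a single vector whose names are the
# source ids (similar in spirit to unlist2, which avoids the mangled
# names produced by base unlist)
flat <- setNames(unlist(res, use.names = FALSE),
                 unlist(lapply(res, names), use.names = FALSE))
print(flat)
```

Unmapped ids stay in the result as NA, which matches the pattern visible in the example outputs of this page.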

Examples


# load a marker list from the registry
m <- MarkerList('IRIS')
summary(m)
##            Length Class  Mode   
## B           121   -none- numeric
## T            94   -none- numeric
## NK           24   -none- numeric
## Dendritic    86   -none- numeric
## Monocyte    103   -none- numeric
## Neutrophil   54   -none- numeric
## Lymphoid    302   -none- numeric
## Myeloid     449   -none- numeric
## Multiple   1037   -none- numeric
head(m[[1]])
##   205267_at 211048_s_at 206398_s_at 217823_s_at 217825_s_at 217826_s_at 
##       6.884       5.298       4.678       4.481       4.374       4.206

# convert the marker ids to Affy probeset ids for chip hgu133b
m2 <- convertIDs(m, 'hgu133b.db', verbose=2)
## # Converting 2270 markers from Annotation (hgu133a.db, hgu133b.db) to Annotation (hgu133b.db) ... OK [1402/2270 (1:1)]
## # Processing 2270 markers from Annotation (hgu133a.db, hgu133b.db) to Annotation (hgu133b.db) ... 
##  ** Processing ids for 'B' ...  OK [69/121 (1:1)]
##  ** Processing ids for 'T' ...  OK [44/94 (1:1)]
##  ** Processing ids for 'NK' ...  OK [8/24 (1:1)]
##  ** Processing ids for 'Dendritic' ...  OK [43/86 (1:1)]
##  ** Processing ids for 'Monocyte' ...  OK [43/103 (1:1)]
##  ** Processing ids for 'Neutrophil' ...  OK [21/54 (1:1)]
##  ** Processing ids for 'Lymphoid' ...  OK [166/302 (1:1)]
##  ** Processing ids for 'Myeloid' ...  OK [238/449 (1:1)]
##  ** Processing ids for 'Multiple' ...  OK [636/1037 (1:1)]
## # Checking for duplicated marker(s) across cell-types ... OK [dropped 83/1268]
## OK [1185/2270 (1:1)]
summary(m2)
##            Length Class  Mode   
## B           66    -none- numeric
## T           40    -none- numeric
## NK           6    -none- numeric
## Dendritic   41    -none- numeric
## Monocyte    40    -none- numeric
## Neutrophil  17    -none- numeric
## Lymphoid   150    -none- numeric
## Myeloid    219    -none- numeric
## Multiple   606    -none- numeric
#----------------------------------------------
# 1. Conversion from biological IDs
#----------------------------------------------
# For this kind of IDs, a source annotation package can often be inferred
# from the ID type, using regular expression patterns (e.g. "^ENS[0-9]+$"
# identifies Ensembl gene IDs)

ids <- c("Hs.1", "Hs.2", "Hs.3")
# get Entrez gene IDs (based on annotation from the org.Hs.eg.db package)
convertIDs(ids, 'ENTREZID', 'org.Hs.eg.db', verbose=TRUE)
##  # Converting from Unigene (org.Hs.eg.db) to EntrezId (org.Hs.eg.db) ...  OK [1/3 mapped (1:1)]
## Hs.1 Hs.2 Hs.3 
##   NA "10"   NA 
## attr(,"from")
## geneIdType: Unigene (org.Hs.eg.db)
## attr(,"to")
## geneIdType: EntrezId (org.Hs.eg.db)

# map to other IDs
convertIDs(ids, 'REFSEQ')
##        Hs.1        Hs.2        Hs.3 
##          NA "NM_000015"          NA 
## attr(,"from")
## geneIdType: Unigene (org.Hs.eg.db)
## attr(,"to")
## geneIdType: Refseq (org.Hs.eg.db)
convertIDs(ids, 'ENSEMBL')
##              Hs.1              Hs.2              Hs.3 
##                NA "ENSG00000156006"                NA 
## attr(,"from")
## geneIdType: Unigene (org.Hs.eg.db)
## attr(,"to")
## geneIdType: ENSEMBL (org.Hs.eg.db)
# convert across organisms
convertIDs(ids, 'rat2302.db')
## Warning: An error occured when converting ids cross-species from Homo
## sapiens to Rattus norvegicus: Error in names(destIDs) = dnames : attempt
## to set an attribute on NULL
## Hs.1 Hs.2 Hs.3 
##   NA   NA   NA 
## attr(,"from")
## geneIdType: Unigene (org.Hs.eg.db)
## attr(,"to")
## geneIdType: Annotation (rat2302.db)
# get Affy probeset IDs for chip hgu133a
affy <- convertIDs(ids, 'hgu133a.db')

# assume we have a vector of IDs, e.g. Entrez gene ids
id <- c("673", "725", "10115")
# get associated probesets on chip hgu133a
convertIDs(id, 'hgu133a.db')
##           673           725         10115 
## "206044_s_at" "208209_s_at"            NA 
## attr(,"from")
## geneIdType: EntrezId (hgu133a.db)
## attr(,"to")
## geneIdType: Annotation (hgu133a.db)
# get all associated probesets on chip hgu133a
convertIDs(id, 'hgu133a.db', method='all')
##           673           725         10115 
## "206044_s_at" "208209_s_at"            NA 
## attr(,"from")
## geneIdType: EntrezId (hgu133a.db)
## attr(,"to")
## geneIdType: Annotation (hgu133a.db)
# same, but returned as a list (one element per source id) rather than a flattened vector
convertIDs(id, 'hgu133a.db', method='all', unlist=FALSE)
## $`673`
##           673 
## "206044_s_at" 
## 
## $`725`
##           725 
## "208209_s_at" 
## 
## $`10115`
## [1] NA
## 
## attr(,"from")
## geneIdType: EntrezId (hgu133a.db)
## attr(,"to")
## geneIdType: Annotation (hgu133a.db)
# specification using ProbeAnnDbBimap objects
library(hgu133b.db)
convertIDs(id, 'hgu133a.db', hgu133bENTREZID, verbose=2)
##  # Converting from EntrezId (hgu133b.db) to Annotation (hgu133a.db) ... 
##  # Limiting query to EntrezId (hgu133b.db) ...  [3 -> 1 id(s)]
##   # Loading map(s) from EntrezId (hgu133b.db) to Annotation (hgu133a.db)   [x-platform  /x-id] ...   OK [1 step(s)]
##   # Mapping from EntrezId (hgu133a.db) to Annotation (hgu133a.db) [43827 entries] ...   [1/1 mapped (1:1)]
##   # Applying filtering strategy 'auto' ... (kept 1 2nd-affy probes)   [1/1 passed (1:1)]
##  OK [1/3 mapped (1:1)]
##           673           725         10115 
## "206044_s_at"            NA            NA 
## attr(,"from")
## geneIdType: EntrezId (hgu133b.db)
## attr(,"to")
## geneIdType: Annotation (hgu133a.db)

#----------------------------------------------
# 2. Conversion from probeset IDs
#----------------------------------------------
# For this kind of IDs, a source annotation package is required, because it
# cannot be easily inferred from the ID type.

# get Affy probeset IDs for chip hgu133a from ids for hgu133b
convertIDs(affy, 'hgu133a.db', 'hgu133b.db')
##      <NA> 206797_at      <NA> 
##        NA        NA        NA 
## attr(,"from")
## geneIdType: Annotation (hgu133b.db)
## attr(,"to")
## geneIdType: Annotation (hgu133a.db)
# across organisms
convertIDs(affy, 'hgu133a.db', 'rat2302.db')
##      <NA> 206797_at      <NA> 
##        NA        NA        NA 
## attr(,"from")
## geneIdType: Annotation (rat2302.db)
## attr(,"to")
## geneIdType: Annotation (hgu133a.db)