Introduction to Bioconductor and the SingleCellExperiment class

Setup

library(SingleCellExperiment)
library(MouseGastrulationData)

The `SingleCellExperiment` class

One of the main strengths of the Bioconductor project lies in the use of a common data infrastructure that powers interoperability across packages.

Users should be able to analyze their data using functions from different Bioconductor packages without the need to convert between formats. To this end, the SingleCellExperiment class (from the SingleCellExperiment package) serves as the common currency for data exchange across 70+ single-cell-related Bioconductor packages.

This class implements a data structure that stores all aspects of our single-cell data - gene-by-cell expression data, per-cell metadata and per-gene annotation - and manipulates them in a synchronized manner.
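To make "synchronized" concrete, here is a minimal sketch on a small simulated object. The object `toy`, its gene and cell names, and the `batch` and `symbol` columns are hypothetical and not part of the example dataset used below. Subsetting the object once subsets the expression matrix, the per-cell metadata and the per-gene annotation together.

```r
library(SingleCellExperiment)

# Hypothetical toy data: 4 genes x 3 cells of simulated counts
set.seed(1)
toy_counts <- matrix(
    rpois(12, lambda = 5), nrow = 4,
    dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3))
)

toy <- SingleCellExperiment(
    assays = list(counts = toy_counts),
    colData = DataFrame(batch = c("a", "b", "a"),
                        row.names = colnames(toy_counts)),
    rowData = DataFrame(symbol = c("A", "B", "C", "D"),
                        row.names = rownames(toy_counts))
)

# Keep the first two genes and the cells from batch "a" in one step
sub <- toy[1:2, toy$batch == "a"]

dim(sub)             # dimensions of the subsetted object
colnames(sub)        # the retained cells
sub$batch            # per-cell metadata follows the cells
rowData(sub)$symbol  # per-gene annotation follows the genes
```

Because all components are indexed together, the count matrix and its annotation cannot drift out of sync after filtering.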

Let’s start with an example dataset.

sce <- WTChimeraData(sample=5)
sce

## class: SingleCellExperiment 
## dim: 29453 2411 
## metadata(0):
## assays(1): counts
## rownames(29453): ENSMUSG00000051951 ENSMUSG00000089699 ...
##   ENSMUSG00000095742 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames(2411): cell_9769 cell_9770 ... cell_12178 cell_12179
## colData names(11): cell barcode ... doub.density sizeFactor
## reducedDimNames(2): pca.corrected.E7.5 pca.corrected.E8.5
## mainExpName: NULL
## altExpNames(0):

We can think of this class (and other Bioconductor classes) as a container that holds several different pieces of data in so-called slots.

Getter methods extract information from the slots, and setter methods add information to them. These are the recommended ways to interact with the objects, rather than accessing the slots directly.

Depending on the object, slots can contain different types of data (e.g., numeric matrices, lists, etc.). We will here review the main slots of the SingleCellExperiment class as well as their getter/setter methods.
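As a small illustration of the getter/setter pattern, the sketch below uses the metadata and colData accessors on a hypothetical three-cell object (the object `toy` and its contents are made up for this example). It also shows one reason to prefer setters over direct slot access: they check that the new value is compatible with the rest of the object.

```r
library(SingleCellExperiment)

# Hypothetical toy object: 2 genes x 3 cells
toy <- SingleCellExperiment(
    assays = list(counts = matrix(
        1:6, nrow = 2,
        dimnames = list(c("geneA", "geneB"), paste0("cell", 1:3))
    ))
)

# Getter: returns the content of the slot (an empty list for a new object)
metadata(toy)

# Setter: the same function name, used on the left-hand side of an assignment
metadata(toy) <- list(note = "toy example")
metadata(toy)$note

# Setters validate their input: a colData with 2 rows cannot describe 3 cells,
# so the assignment fails and the object is left unchanged
bad <- try(colData(toy) <- DataFrame(x = 1:2), silent = TRUE)
inherits(bad, "try-error")
```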

The `assays`

This is arguably the most fundamental part of the object: it contains the count matrix and, potentially, other matrices with transformed data. We can access the list of matrices with the assays function and individual matrices with the assay function. If one of these matrices is called “counts”, we can use the special counts getter (and the analogous logcounts).

identical(assay(sce), counts(sce))

## [1] TRUE

counts(sce)[1:3, 1:3]

## 3 x 3 sparse Matrix of class "dgTMatrix"
##                    cell_9769 cell_9770 cell_9771
## ENSMUSG00000051951         .         .         .
## ENSMUSG00000089699         .         .         .
## ENSMUSG00000102343         .         .         .

You will notice that in this case we have a sparse matrix of class “dgTMatrix” inside the object. More generally, any “matrix-like” object can be used, e.g., dense matrices or HDF5-backed matrices (see “Working with large data”).
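The following sketch shows how a second assay can be stored next to the counts, using a hypothetical simulated object rather than the dataset above. The transformation used here, log2(count + 1) without any size-factor normalization, is only an illustrative assumption and not a recommended normalization.

```r
library(SingleCellExperiment)
library(Matrix)

# Hypothetical toy data: 5 genes x 4 cells, stored as a sparse matrix
set.seed(42)
dense <- matrix(
    rpois(20, lambda = 1), nrow = 5,
    dimnames = list(paste0("gene", 1:5), paste0("cell", 1:4))
)
sparse <- Matrix(dense, sparse = TRUE)

toy <- SingleCellExperiment(assays = list(counts = sparse))

# Store a transformed matrix as a second assay.
# log1p(x) / log(2) equals log2(x + 1) and keeps the matrix sparse.
logcounts(toy) <- log1p(counts(toy)) / log(2)

assayNames(toy)         # names of all assays in the object
assay(toy, "logcounts") # generic getter, equivalent to logcounts(toy)
```

Any assay can be retrieved by name with assay(toy, "name"); the counts and logcounts functions are shortcuts for the two most common names.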

The `colData` and `rowData`

Conceptually, these are two data frames that annotate the columns and the rows of your assay, respectively.

One can interact with them as usual, e.g., by extracting columns or adding additional variables as columns.

colData(sce)

## DataFrame with 2411 rows and 11 columns
##                   cell          barcode    sample       stage    tomato
##            <character>      <character> <integer> <character> <logical>
## cell_9769    cell_9769 AAACCTGAGACTGTAA         5        E8.5      TRUE
## cell_9770    cell_9770 AAACCTGAGATGCCTT         5        E8.5      TRUE
## cell_9771    cell_9771 AAACCTGAGCAGCCTC         5        E8.5      TRUE
## cell_9772    cell_9772 AAACCTGCATACTCTT         5        E8.5      TRUE
## cell_9773    cell_9773 AAACGGGTCAACACCA         5        E8.5      TRUE
## ...                ...              ...       ...         ...       ...
## cell_12175  cell_12175 TTTGGTTAGTCCGTAT         5        E8.5      TRUE
## cell_12176  cell_12176 TTTGGTTAGTGTTGAA         5        E8.5      TRUE
## cell_12177  cell_12177 TTTGGTTGTTAAAGAC         5        E8.5      TRUE
## cell_12178  cell_12178 TTTGGTTTCAGTCAGT         5        E8.5      TRUE
## cell_12179  cell_12179 TTTGGTTTCGCCATAA         5        E8.5      TRUE
##                 pool stage.mapped        celltype.mapped closest.cell
##            <integer>  <character>            <character>  <character>
## cell_9769          3        E8.25             Mesenchyme   cell_24159
## cell_9770          3         E8.5            Endothelium   cell_96660
## cell_9771          3         E8.5              Allantois  cell_134982
## cell_9772          3         E8.5             Erythroid3  cell_133892
## cell_9773          3        E8.25             Erythroid1   cell_76296
## ...              ...          ...                    ...          ...
## cell_12175         3         E8.5             Erythroid3  cell_138060
## cell_12176         3         E8.5 Forebrain/Midbrain/H..   cell_72709
## cell_12177         3        E8.25       Surface ectoderm  cell_100275
## cell_12178         3        E8.25             Erythroid2   cell_70906
## cell_12179         3         E8.5            Spinal cord  cell_102334
##            doub.density sizeFactor
##               <numeric>  <numeric>
## cell_9769    0.02985045    1.41243
## cell_9770    0.00172753    1.22757
## cell_9771    0.01338013    1.15439
## cell_9772    0.00218402    1.28676
## cell_9773    0.00211723    1.78719
## ...                 ...        ...
## cell_12175   0.00129403   1.219506
## cell_12176   0.01833074   1.095753
## cell_12177   0.03104037   0.910728
## cell_12178   0.00169483   2.061701
## cell_12179   0.03767894   1.798687

rowData(sce)

## DataFrame with 29453 rows and 2 columns
##                               ENSEMBL         SYMBOL
##                           <character>    <character>
## ENSMUSG00000051951 ENSMUSG00000051951           Xkr4
## ENSMUSG00000089699 ENSMUSG00000089699         Gm1992
## ENSMUSG00000102343 ENSMUSG00000102343        Gm37381
## ENSMUSG00000025900 ENSMUSG00000025900            Rp1
## ENSMUSG00000025902 ENSMUSG00000025902          Sox17
## ...                               ...            ...
## ENSMUSG00000095041 ENSMUSG00000095041     AC149090.1
## ENSMUSG00000063897 ENSMUSG00000063897          DHRSX
## ENSMUSG00000096730 ENSMUSG00000096730       Vmn2r122
## ENSMUSG00000095742 ENSMUSG00000095742 CAAA01147332.1
## tomato-td                   tomato-td      tomato-td

Note the $ shortcut: sce$name is equivalent to colData(sce)$name.

identical(colData(sce)$sizeFactor, sce$sizeFactor)

## [1] TRUE

sce$my_sum <- colSums(counts(sce))
colData(sce)

## DataFrame with 2411 rows and 12 columns
##                   cell          barcode    sample       stage    tomato
##            <character>      <character> <integer> <character> <logical>
## cell_9769    cell_9769 AAACCTGAGACTGTAA         5        E8.5      TRUE
## cell_9770    cell_9770 AAACCTGAGATGCCTT         5        E8.5      TRUE
## cell_9771    cell_9771 AAACCTGAGCAGCCTC         5        E8.5      TRUE
## cell_9772    cell_9772 AAACCTGCATACTCTT         5        E8.5      TRUE
## cell_9773    cell_9773 AAACGGGTCAACACCA         5        E8.5      TRUE
## ...                ...              ...       ...         ...       ...
## cell_12175  cell_12175 TTTGGTTAGTCCGTAT         5        E8.5      TRUE
## cell_12176  cell_12176 TTTGGTTAGTGTTGAA         5        E8.5      TRUE
## cell_12177  cell_12177 TTTGGTTGTTAAAGAC         5        E8.5      TRUE
## cell_12178  cell_12178 TTTGGTTTCAGTCAGT         5        E8.5      TRUE
## cell_12179  cell_12179 TTTGGTTTCGCCATAA         5        E8.5      TRUE
##                 pool stage.mapped        celltype.mapped closest.cell
##            <integer>  <character>            <character>  <character>
## cell_9769          3        E8.25             Mesenchyme   cell_24159
## cell_9770          3         E8.5            Endothelium   cell_96660
## cell_9771          3         E8.5              Allantois  cell_134982
## cell_9772          3         E8.5             Erythroid3  cell_133892
## cell_9773          3        E8.25             Erythroid1   cell_76296
## ...              ...          ...                    ...          ...
## cell_12175         3         E8.5             Erythroid3  cell_138060
## cell_12176         3         E8.5 Forebrain/Midbrain/H..   cell_72709
## cell_12177         3        E8.25       Surface ectoderm  cell_100275
## cell_12178         3        E8.25             Erythroid2   cell_70906
## cell_12179         3         E8.5            Spinal cord  cell_102334
##            doub.density sizeFactor    my_sum
##               <numeric>  <numeric> <numeric>
## cell_9769    0.02985045    1.41243     27577
## cell_9770    0.00172753    1.22757     29309
## cell_9771    0.01338013    1.15439     28795
## cell_9772    0.00218402    1.28676     34794
## cell_9773    0.00211723    1.78719     38300
## ...                 ...        ...       ...
## cell_12175   0.00129403   1.219506     26680
## cell_12176   0.01833074   1.095753     19013
## cell_12177   0.03104037   0.910728     24627
## cell_12178   0.00169483   2.061701     46162
## cell_12179   0.03767894   1.798687     38398
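The same pattern works for rowData, and the stored annotations can then drive filtering. The sketch below uses a hypothetical 3-gene by 4-cell object; the column names `total` and `n_cells` and the threshold of 2 are arbitrary choices for illustration.

```r
library(SingleCellExperiment)

# Hypothetical toy counts: 3 genes x 4 cells
toy_counts <- matrix(
    c(0, 1, 2,
      3, 0, 0,
      4, 5, 6,
      0, 0, 1),
    nrow = 3,
    dimnames = list(paste0("gene", 1:3), paste0("cell", 1:4))
)
toy <- SingleCellExperiment(assays = list(counts = toy_counts))

# Per-cell annotation: total counts per cell, added with the $ shortcut
toy$total <- colSums(counts(toy))

# Per-gene annotation: number of cells in which each gene is detected
rowData(toy)$n_cells <- rowSums(counts(toy) > 0)

# Use the annotation to filter: keep cells with more than 2 total counts
filtered <- toy[, toy$total > 2]
colnames(filtered)
```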

The `reducedDims`

Everything that we have described so far (except for the counts getter) is part of the SummarizedExperiment class that SingleCellExperiment extends.

A distinctive feature of SingleCellExperiment is its ability to store reduced-dimension matrices within the object, such as PCA, t-SNE or UMAP coordinates.

reducedDims(sce)

## List of length 2
## names(2): pca.corrected.E7.5 pca.corrected.E8.5

As with the other slots, we have the usual getters and setters (reducedDims for the whole list, reducedDim for a single matrix), but it is somewhat rare to call them directly.

It is more common for other functions to store this information in the object, e.g., the runPCA function from the scater package.

Here, we use scater’s plotReducedDim function as an example of how to extract this information indirectly from the object. Note that one could obtain the same result (with somewhat more code) by extracting the corresponding reducedDim matrix and plotting it with ggplot2.

library(scater)

## Loading required package: scuttle

## Loading required package: ggplot2

plotReducedDim(sce, "pca.corrected.E8.5", colour_by = "celltype.mapped")

## Warning: Removed 131 rows containing missing values (`geom_point()`).
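For comparison, here is a minimal sketch of the manual route: store a reduced-dimension matrix with the setter, pull it back out with the getter, and plot it with ggplot2. The data are simulated, the `group` column is hypothetical, and base R's prcomp stands in for runPCA purely to keep the example self-contained.

```r
library(SingleCellExperiment)
library(ggplot2)

# Hypothetical toy data: 50 genes x 20 cells of simulated counts
set.seed(100)
toy_counts <- matrix(
    rpois(50 * 20, lambda = 10), nrow = 50,
    dimnames = list(paste0("gene", 1:50), paste0("cell", 1:20))
)
toy <- SingleCellExperiment(assays = list(counts = toy_counts))
toy$group <- rep(c("A", "B"), each = 10)

# PCA with base R; cells must be in rows, so the matrix is transposed
pca <- prcomp(t(log2(counts(toy) + 1)), rank. = 2)

# Setter: the stored matrix needs one row per cell
reducedDim(toy, "PCA") <- pca$x

# Getter: extract the matrix and combine it with per-cell metadata
pcs <- reducedDim(toy, "PCA")
df <- data.frame(PC1 = pcs[, 1], PC2 = pcs[, 2], group = toy$group)

p <- ggplot(df, aes(x = PC1, y = PC2, colour = group)) +
    geom_point()
p
```

plotReducedDim wraps these steps, including the lookup of the colour_by variable in colData, which is why it is usually the more convenient choice.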

Session Info

sessionInfo()

## R Under development (unstable) (2023-11-22 r85609)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scater_1.31.1                ggplot2_3.4.4               
##  [3] scuttle_1.13.0               MouseGastrulationData_1.17.0
##  [5] SpatialExperiment_1.13.0     SingleCellExperiment_1.25.0 
##  [7] SummarizedExperiment_1.33.0  Biobase_2.63.0              
##  [9] GenomicRanges_1.55.1         GenomeInfoDb_1.39.1         
## [11] IRanges_2.37.0               S4Vectors_0.41.2            
## [13] BiocGenerics_0.49.1          MatrixGenerics_1.15.0       
## [15] matrixStats_1.1.0           
## 
## loaded via a namespace (and not attached):
##   [1] DBI_1.1.3                     bitops_1.0-7                 
##   [3] formatR_1.14                  gridExtra_2.3                
##   [5] rlang_1.1.2                   magrittr_2.0.3               
##   [7] compiler_4.4.0                RSQLite_2.3.3                
##   [9] DelayedMatrixStats_1.25.1     png_0.1-8                    
##  [11] systemfonts_1.0.5             vctrs_0.6.4                  
##  [13] stringr_1.5.1                 pkgconfig_2.0.3              
##  [15] crayon_1.5.2                  fastmap_1.1.1                
##  [17] dbplyr_2.4.0                  magick_2.8.1                 
##  [19] XVector_0.43.0                ellipsis_0.3.2               
##  [21] labeling_0.4.3                utf8_1.2.4                   
##  [23] promises_1.2.1                rmarkdown_2.25               
##  [25] ggbeeswarm_0.7.2              ragg_1.2.6                   
##  [27] purrr_1.0.2                   bit_4.0.5                    
##  [29] xfun_0.41                     zlibbioc_1.49.0              
##  [31] cachem_1.0.8                  beachmat_2.19.0              
##  [33] jsonlite_1.8.7                blob_1.2.4                   
##  [35] highr_0.10                    later_1.3.1                  
##  [37] DelayedArray_0.29.0           BiocParallel_1.37.0          
##  [39] interactiveDisplayBase_1.41.0 irlba_2.3.5.1                
##  [41] parallel_4.4.0                R6_2.5.1                     
##  [43] bslib_0.6.1                   stringi_1.8.2                
##  [45] jquerylib_0.1.4               Rcpp_1.0.11                  
##  [47] knitr_1.45                    httpuv_1.6.12                
##  [49] Matrix_1.6-3                  tidyselect_1.2.0             
##  [51] viridis_0.6.4                 abind_1.4-5                  
##  [53] yaml_2.3.7                    codetools_0.2-19             
##  [55] curl_5.1.0                    lattice_0.22-5               
##  [57] tibble_3.2.1                  shiny_1.8.0                  
##  [59] withr_2.5.2                   KEGGREST_1.43.0              
##  [61] BumpyMatrix_1.11.0            evaluate_0.23                
##  [63] desc_1.4.2                    BiocFileCache_2.11.1         
##  [65] ExperimentHub_2.11.0          Biostrings_2.71.1            
##  [67] pillar_1.9.0                  BiocManager_1.30.22          
##  [69] filelock_1.0.2                generics_0.1.3               
##  [71] rprojroot_2.0.4               RCurl_1.98-1.13              
##  [73] BiocVersion_3.19.1            munsell_0.5.0                
##  [75] scales_1.3.0                  sparseMatrixStats_1.15.0     
##  [77] xtable_1.8-4                  glue_1.6.2                   
##  [79] tools_4.4.0                   AnnotationHub_3.11.0         
##  [81] BiocNeighbors_1.21.0          ScaledMatrix_1.11.0          
##  [83] fs_1.6.3                      grid_4.4.0                   
##  [85] colorspace_2.1-0              AnnotationDbi_1.65.2         
##  [87] GenomeInfoDbData_1.2.11       beeswarm_0.4.0               
##  [89] BiocSingular_1.19.0           vipor_0.4.5                  
##  [91] rsvd_1.0.5                    cli_3.6.1                    
##  [93] rappdirs_0.3.3                textshaping_0.3.7            
##  [95] fansi_1.0.5                   viridisLite_0.4.2            
##  [97] S4Arrays_1.3.1                dplyr_1.1.4                  
##  [99] gtable_0.3.4                  sass_0.4.7                   
## [101] digest_0.6.33                 ggrepel_0.9.4                
## [103] SparseArray_1.3.1             farver_2.1.1                 
## [105] rjson_0.2.21                  memoise_2.0.1                
## [107] htmltools_0.5.7               pkgdown_2.0.7                
## [109] lifecycle_1.0.4               httr_1.4.7                   
## [111] mime_0.12                     bit64_4.0.5

Exercises

Create a SingleCellExperiment object: Try to create a SingleCellExperiment object “from scratch”. Start from a matrix (either randomly generated or filled with some fake data) and add one or more columns as colData.

Hint: the SingleCellExperiment function can be used to create a new SingleCellExperiment object.

Combining two objects: The MouseGastrulationData package contains several datasets. Download sample 6 of the chimera experiment by running sce6 <- WTChimeraData(sample=6). Use the cbind function to combine the new data with the sce object created before.
