vignettes/intro_bioc_sce.Rmd
library(SingleCellExperiment)
library(MouseGastrulationData)
The SingleCellExperiment class
One of the main strengths of the Bioconductor project lies in the use of a common data infrastructure that powers interoperability across packages.
Users should be able to analyze their data using functions from different Bioconductor packages without the need to convert between formats. To this end, the SingleCellExperiment class (from the SingleCellExperiment package) serves as the common currency for data exchange across 70+ single-cell-related Bioconductor packages.
This class implements a data structure that stores all aspects of our single-cell data (gene-by-cell expression data, per-cell metadata and per-gene annotation) and manipulates them in a synchronized manner, so that subsetting the object subsets every component consistently.
Let’s start with an example dataset.
sce <- WTChimeraData(sample=5)
sce
## class: SingleCellExperiment
## dim: 29453 2411
## metadata(0):
## assays(1): counts
## rownames(29453): ENSMUSG00000051951 ENSMUSG00000089699 ...
## ENSMUSG00000095742 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames(2411): cell_9769 cell_9770 ... cell_12178 cell_12179
## colData names(11): cell barcode ... doub.density sizeFactor
## reducedDimNames(2): pca.corrected.E7.5 pca.corrected.E8.5
## mainExpName: NULL
## altExpNames(0):
We can think of this class (and others like it) as a container that holds several different pieces of data in so-called slots.
Getter methods extract information from the slots, and setter methods add or replace information in them. These methods are the intended way to interact with the objects, rather than accessing the slots directly.
Depending on the object, slots can contain different types of data (e.g., numeric matrices, lists, etc.). Here we review the main slots of the SingleCellExperiment class as well as their getter/setter methods.
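To make the getter/setter idea concrete, here is a minimal sketch on a small simulated object rather than the example dataset. The object `toy`, its gene and cell names, and the `batch` and `biotype` columns are invented purely for illustration.

```r
library(SingleCellExperiment)

# Hypothetical toy data: 4 genes x 3 cells of simulated counts.
set.seed(1)
toy_counts <- matrix(
    rpois(12, lambda = 5), nrow = 4,
    dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3))
)

toy <- SingleCellExperiment(
    assays  = list(counts = toy_counts),
    colData = DataFrame(batch = c("a", "a", "b")),
    rowData = DataFrame(biotype = c("coding", "coding", "lncRNA", "coding"))
)

# Getters return the content of a slot.
dim(counts(toy))   # 4 3
colData(toy)$batch

# Setters replace the content of a slot.
colData(toy)$batch <- c("x", "y", "y")

# Subsetting the object keeps assays, rowData and colData in sync.
sub <- toy[c("gene1", "gene3"), toy$batch == "y"]
dim(sub)           # 2 2
rownames(rowData(sub))
colnames(sub)
```

Because rows and columns are subset in a single step, the per-gene and per-cell annotations cannot drift out of register with the count matrix.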
assays
This is arguably the most fundamental part of the object: it contains the count matrix, and potentially other matrices with transformed data. We can access the list of matrices with the assays function and individual matrices with the assay function. If one of these matrices is called “counts”, we can use the special counts getter (and the analogous logcounts).
identical(assay(sce), counts(sce))
## [1] TRUE
counts(sce)[1:3, 1:3]
## 3 x 3 sparse Matrix of class "dgTMatrix"
## cell_9769 cell_9770 cell_9771
## ENSMUSG00000051951 . . .
## ENSMUSG00000089699 . . .
## ENSMUSG00000102343 . . .
You will notice that in this case we have a sparse matrix of class “dgTMatrix” inside the object. More generally, any “matrix-like” object can be used, e.g., dense matrices or HDF5-backed matrices (see “Working with large data”).
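As a rough illustration of the “matrix-like” point, the sketch below stores a sparse and a dense matrix side by side in one object. It uses simulated data; the object `toy` and the assay name `log2_counts` are made up for this example and are not part of the dataset above.

```r
library(SingleCellExperiment)
library(Matrix)

# Hypothetical toy data: 5 genes x 4 cells of simulated counts.
set.seed(2)
dense_counts <- matrix(
    rpois(20, lambda = 1), nrow = 5,
    dimnames = list(paste0("gene", 1:5), paste0("cell", 1:4))
)

# Store the counts as a sparse matrix from the Matrix package.
toy <- SingleCellExperiment(
    assays = list(counts = Matrix(dense_counts, sparse = TRUE))
)

# Add a second assay as an ordinary dense matrix via the assay setter.
assay(toy, "log2_counts") <- log2(as.matrix(counts(toy)) + 1)

assayNames(toy)
is(counts(toy), "sparseMatrix")          # sparse representation is kept
is.matrix(assay(toy, "log2_counts"))     # dense base R matrix

# With no name given, assay() returns the first assay.
identical(assay(toy), counts(toy))
```

The only requirement is that every assay has the same dimensions (and compatible dimnames) as the object itself.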
colData and rowData
Conceptually, these are two data frames that annotate the columns and the rows of your assay, respectively.
One can interact with them as usual, e.g., by extracting columns or adding additional variables as columns.
colData(sce)
## DataFrame with 2411 rows and 11 columns
## cell barcode sample stage tomato
## <character> <character> <integer> <character> <logical>
## cell_9769 cell_9769 AAACCTGAGACTGTAA 5 E8.5 TRUE
## cell_9770 cell_9770 AAACCTGAGATGCCTT 5 E8.5 TRUE
## cell_9771 cell_9771 AAACCTGAGCAGCCTC 5 E8.5 TRUE
## cell_9772 cell_9772 AAACCTGCATACTCTT 5 E8.5 TRUE
## cell_9773 cell_9773 AAACGGGTCAACACCA 5 E8.5 TRUE
## ... ... ... ... ... ...
## cell_12175 cell_12175 TTTGGTTAGTCCGTAT 5 E8.5 TRUE
## cell_12176 cell_12176 TTTGGTTAGTGTTGAA 5 E8.5 TRUE
## cell_12177 cell_12177 TTTGGTTGTTAAAGAC 5 E8.5 TRUE
## cell_12178 cell_12178 TTTGGTTTCAGTCAGT 5 E8.5 TRUE
## cell_12179 cell_12179 TTTGGTTTCGCCATAA 5 E8.5 TRUE
## pool stage.mapped celltype.mapped closest.cell
## <integer> <character> <character> <character>
## cell_9769 3 E8.25 Mesenchyme cell_24159
## cell_9770 3 E8.5 Endothelium cell_96660
## cell_9771 3 E8.5 Allantois cell_134982
## cell_9772 3 E8.5 Erythroid3 cell_133892
## cell_9773 3 E8.25 Erythroid1 cell_76296
## ... ... ... ... ...
## cell_12175 3 E8.5 Erythroid3 cell_138060
## cell_12176 3 E8.5 Forebrain/Midbrain/H.. cell_72709
## cell_12177 3 E8.25 Surface ectoderm cell_100275
## cell_12178 3 E8.25 Erythroid2 cell_70906
## cell_12179 3 E8.5 Spinal cord cell_102334
## doub.density sizeFactor
## <numeric> <numeric>
## cell_9769 0.02985045 1.41243
## cell_9770 0.00172753 1.22757
## cell_9771 0.01338013 1.15439
## cell_9772 0.00218402 1.28676
## cell_9773 0.00211723 1.78719
## ... ... ...
## cell_12175 0.00129403 1.219506
## cell_12176 0.01833074 1.095753
## cell_12177 0.03104037 0.910728
## cell_12178 0.00169483 2.061701
## cell_12179 0.03767894 1.798687
rowData(sce)
## DataFrame with 29453 rows and 2 columns
## ENSEMBL SYMBOL
## <character> <character>
## ENSMUSG00000051951 ENSMUSG00000051951 Xkr4
## ENSMUSG00000089699 ENSMUSG00000089699 Gm1992
## ENSMUSG00000102343 ENSMUSG00000102343 Gm37381
## ENSMUSG00000025900 ENSMUSG00000025900 Rp1
## ENSMUSG00000025902 ENSMUSG00000025902 Sox17
## ... ... ...
## ENSMUSG00000095041 ENSMUSG00000095041 AC149090.1
## ENSMUSG00000063897 ENSMUSG00000063897 DHRSX
## ENSMUSG00000096730 ENSMUSG00000096730 Vmn2r122
## ENSMUSG00000095742 ENSMUSG00000095742 CAAA01147332.1
## tomato-td tomato-td tomato-td
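Note the $ shortcut: sce$name reads or assigns a column of colData(sce) directly. In the output below, a my_sum column has been added to colData in this way.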
## [1] TRUE
## DataFrame with 2411 rows and 12 columns
## cell barcode sample stage tomato
## <character> <character> <integer> <character> <logical>
## cell_9769 cell_9769 AAACCTGAGACTGTAA 5 E8.5 TRUE
## cell_9770 cell_9770 AAACCTGAGATGCCTT 5 E8.5 TRUE
## cell_9771 cell_9771 AAACCTGAGCAGCCTC 5 E8.5 TRUE
## cell_9772 cell_9772 AAACCTGCATACTCTT 5 E8.5 TRUE
## cell_9773 cell_9773 AAACGGGTCAACACCA 5 E8.5 TRUE
## ... ... ... ... ... ...
## cell_12175 cell_12175 TTTGGTTAGTCCGTAT 5 E8.5 TRUE
## cell_12176 cell_12176 TTTGGTTAGTGTTGAA 5 E8.5 TRUE
## cell_12177 cell_12177 TTTGGTTGTTAAAGAC 5 E8.5 TRUE
## cell_12178 cell_12178 TTTGGTTTCAGTCAGT 5 E8.5 TRUE
## cell_12179 cell_12179 TTTGGTTTCGCCATAA 5 E8.5 TRUE
## pool stage.mapped celltype.mapped closest.cell
## <integer> <character> <character> <character>
## cell_9769 3 E8.25 Mesenchyme cell_24159
## cell_9770 3 E8.5 Endothelium cell_96660
## cell_9771 3 E8.5 Allantois cell_134982
## cell_9772 3 E8.5 Erythroid3 cell_133892
## cell_9773 3 E8.25 Erythroid1 cell_76296
## ... ... ... ... ...
## cell_12175 3 E8.5 Erythroid3 cell_138060
## cell_12176 3 E8.5 Forebrain/Midbrain/H.. cell_72709
## cell_12177 3 E8.25 Surface ectoderm cell_100275
## cell_12178 3 E8.25 Erythroid2 cell_70906
## cell_12179 3 E8.5 Spinal cord cell_102334
## doub.density sizeFactor my_sum
## <numeric> <numeric> <numeric>
## cell_9769 0.02985045 1.41243 27577
## cell_9770 0.00172753 1.22757 29309
## cell_9771 0.01338013 1.15439 28795
## cell_9772 0.00218402 1.28676 34794
## cell_9773 0.00211723 1.78719 38300
## ... ... ... ...
## cell_12175 0.00129403 1.219506 26680
## cell_12176 0.01833074 1.095753 19013
## cell_12177 0.03104037 0.910728 24627
## cell_12178 0.00169483 2.061701 46162
## cell_12179 0.03767894 1.798687 38398
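The following sketch shows the same pattern on a small simulated object, assuming `my_sum` is a per-cell total of the counts. The object `toy` and the `n_cells` column are hypothetical names chosen for illustration.

```r
library(SingleCellExperiment)

# Hypothetical toy data: 6 genes x 5 cells of simulated counts.
set.seed(3)
toy <- SingleCellExperiment(
    assays = list(counts = matrix(
        rpois(30, lambda = 3), nrow = 6,
        dimnames = list(paste0("gene", 1:6), paste0("cell", 1:5))
    ))
)

# `$` on the object reads and writes colData columns directly.
toy$my_sum <- colSums(counts(toy))
identical(toy$my_sum, colData(toy)$my_sum)

# For per-gene annotation, go through the rowData getter/setter instead.
rowData(toy)$n_cells <- rowSums(counts(toy) > 0)

colnames(colData(toy))
colnames(rowData(toy))
```

The `$` shortcut applies to per-cell metadata only, which is why the per-gene column is added through `rowData()`.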
reducedDims
Everything that we have described so far (except for the counts getter) is part of the SummarizedExperiment class that SingleCellExperiment extends.
One of the distinguishing features of SingleCellExperiment is its ability to store reduced-dimension matrices within the object. These may include PCA, t-SNE, UMAP, etc.
reducedDims(sce)
## List of length 2
## names(2): pca.corrected.E7.5 pca.corrected.E8.5
As for the other slots, we have the usual setter/getter, but it is somewhat rare to interact directly with these functions.
It is more common for other functions to store this information in the object, e.g., the runPCA function from the scater package.
Here, we use scater’s plotReducedDim function as an example of how to extract this information indirectly from the object. Note that one could obtain the same result (somewhat less efficiently) by extracting the corresponding reducedDim matrix and plotting it with ggplot.
library(scater)
## Loading required package: scuttle
## Loading required package: ggplot2
plotReducedDim(sce, "pca.corrected.E8.5", colour_by = "celltype.mapped")
## Warning: Removed 131 rows containing missing values (`geom_point()`).
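For comparison, here is a sketch of the manual route: store a reduced-dimension matrix with the setter, pull it back out with the getter, and plot it with ggplot2. It runs on simulated data; the object `toy`, the `group` column and the use of `prcomp` are assumptions made for this illustration, not how the coordinates in the example dataset were computed.

```r
library(SingleCellExperiment)
library(ggplot2)

# Hypothetical toy data: 20 genes x 10 cells of simulated log-expression.
set.seed(4)
toy <- SingleCellExperiment(
    assays = list(logcounts = matrix(
        rnorm(200), nrow = 20,
        dimnames = list(paste0("gene", 1:20), paste0("cell", 1:10))
    )),
    colData = DataFrame(group = rep(c("a", "b"), each = 5))
)

# Setter: a reduced-dimension matrix has one row per cell.
pca <- prcomp(t(logcounts(toy)), rank. = 2)
reducedDim(toy, "PCA") <- pca$x

# Getter: extract the coordinates and combine them with per-cell metadata.
coords <- reducedDim(toy, "PCA")
plot_df <- data.frame(
    PC1 = coords[, 1],
    PC2 = coords[, 2],
    group = toy$group
)

p <- ggplot(plot_df, aes(x = PC1, y = PC2, colour = group)) +
    geom_point()
p
```

This is roughly the bookkeeping that plotReducedDim takes care of when given the name of a reduced dimension and a colData column to colour by.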
sessionInfo()
## R Under development (unstable) (2023-11-22 r85609)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] scater_1.31.1 ggplot2_3.4.4
## [3] scuttle_1.13.0 MouseGastrulationData_1.17.0
## [5] SpatialExperiment_1.13.0 SingleCellExperiment_1.25.0
## [7] SummarizedExperiment_1.33.0 Biobase_2.63.0
## [9] GenomicRanges_1.55.1 GenomeInfoDb_1.39.1
## [11] IRanges_2.37.0 S4Vectors_0.41.2
## [13] BiocGenerics_0.49.1 MatrixGenerics_1.15.0
## [15] matrixStats_1.1.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.1.3 bitops_1.0-7
## [3] formatR_1.14 gridExtra_2.3
## [5] rlang_1.1.2 magrittr_2.0.3
## [7] compiler_4.4.0 RSQLite_2.3.3
## [9] DelayedMatrixStats_1.25.1 png_0.1-8
## [11] systemfonts_1.0.5 vctrs_0.6.4
## [13] stringr_1.5.1 pkgconfig_2.0.3
## [15] crayon_1.5.2 fastmap_1.1.1
## [17] dbplyr_2.4.0 magick_2.8.1
## [19] XVector_0.43.0 ellipsis_0.3.2
## [21] labeling_0.4.3 utf8_1.2.4
## [23] promises_1.2.1 rmarkdown_2.25
## [25] ggbeeswarm_0.7.2 ragg_1.2.6
## [27] purrr_1.0.2 bit_4.0.5
## [29] xfun_0.41 zlibbioc_1.49.0
## [31] cachem_1.0.8 beachmat_2.19.0
## [33] jsonlite_1.8.7 blob_1.2.4
## [35] highr_0.10 later_1.3.1
## [37] DelayedArray_0.29.0 BiocParallel_1.37.0
## [39] interactiveDisplayBase_1.41.0 irlba_2.3.5.1
## [41] parallel_4.4.0 R6_2.5.1
## [43] bslib_0.6.1 stringi_1.8.2
## [45] jquerylib_0.1.4 Rcpp_1.0.11
## [47] knitr_1.45 httpuv_1.6.12
## [49] Matrix_1.6-3 tidyselect_1.2.0
## [51] viridis_0.6.4 abind_1.4-5
## [53] yaml_2.3.7 codetools_0.2-19
## [55] curl_5.1.0 lattice_0.22-5
## [57] tibble_3.2.1 shiny_1.8.0
## [59] withr_2.5.2 KEGGREST_1.43.0
## [61] BumpyMatrix_1.11.0 evaluate_0.23
## [63] desc_1.4.2 BiocFileCache_2.11.1
## [65] ExperimentHub_2.11.0 Biostrings_2.71.1
## [67] pillar_1.9.0 BiocManager_1.30.22
## [69] filelock_1.0.2 generics_0.1.3
## [71] rprojroot_2.0.4 RCurl_1.98-1.13
## [73] BiocVersion_3.19.1 munsell_0.5.0
## [75] scales_1.3.0 sparseMatrixStats_1.15.0
## [77] xtable_1.8-4 glue_1.6.2
## [79] tools_4.4.0 AnnotationHub_3.11.0
## [81] BiocNeighbors_1.21.0 ScaledMatrix_1.11.0
## [83] fs_1.6.3 grid_4.4.0
## [85] colorspace_2.1-0 AnnotationDbi_1.65.2
## [87] GenomeInfoDbData_1.2.11 beeswarm_0.4.0
## [89] BiocSingular_1.19.0 vipor_0.4.5
## [91] rsvd_1.0.5 cli_3.6.1
## [93] rappdirs_0.3.3 textshaping_0.3.7
## [95] fansi_1.0.5 viridisLite_0.4.2
## [97] S4Arrays_1.3.1 dplyr_1.1.4
## [99] gtable_0.3.4 sass_0.4.7
## [101] digest_0.6.33 ggrepel_0.9.4
## [103] SparseArray_1.3.1 farver_2.1.1
## [105] rjson_0.2.21 memoise_2.0.1
## [107] htmltools_0.5.7 pkgdown_2.0.7
## [109] lifecycle_1.0.4 httr_1.4.7
## [111] mime_0.12 bit64_4.0.5
SingleCellExperiment object: Try to create a SingleCellExperiment object “from scratch”. Start from a matrix (either randomly generated or with some fake data in it) and add one or more columns as colData. Hint: the SingleCellExperiment function can be used to create a new SingleCellExperiment object.
The MouseGastrulationData package contains several datasets. Download sample 6 of the chimera experiment by running sce6 <- WTChimeraData(sample=6). Use the cbind function to combine the new data with the sce object created before.
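As a hint for the mechanics only, the sketch below combines two small simulated objects with cbind. The helper `make_toy` and all names in it are hypothetical. Real datasets may need their rowData, assays or reducedDims harmonised before cbind succeeds, which this toy case does not cover.

```r
library(SingleCellExperiment)

# Hypothetical helper: simulate a 4-gene object for the given cells.
make_toy <- function(cell_names, sample_id, seed) {
    set.seed(seed)
    genes <- paste0("gene", 1:4)
    SingleCellExperiment(
        assays = list(counts = matrix(
            rpois(4 * length(cell_names), lambda = 2), nrow = 4,
            dimnames = list(genes, cell_names)
        )),
        colData = DataFrame(sample = rep(sample_id, length(cell_names)))
    )
}

toy_a <- make_toy(c("a1", "a2", "a3"), sample_id = 5L, seed = 5)
toy_b <- make_toy(c("b1", "b2"), sample_id = 6L, seed = 6)

# cbind stacks cells (columns); both objects must describe the same genes.
combined <- cbind(toy_a, toy_b)
dim(combined)      # 4 5
colnames(combined)
combined$sample
```

The per-cell metadata of the two objects is concatenated along with the assay columns, so the `sample` column records where each cell came from.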