As mentioned in the introduction, in this tutorial we will use the wild-type data from the Tal1 chimera experiment. These data are available through the MouseGastrulationData Bioconductor package, which contains several datasets.
In particular, the package contains the following samples that we will use for the tutorial:
We start our analysis by selecting only sample 5, which contains the injected cells in one biological replicate. We download the “raw” data that contains all the droplets for which we have sequenced reads.
library(MouseGastrulationData)
sce <- WTChimeraData(samples=5, type="raw")
sce <- sce[[1]]
sce
## class: SingleCellExperiment
## dim: 29453 522554
## metadata(0):
## assays(1): counts
## rownames(29453): ENSMUSG00000051951 ENSMUSG00000089699 ...
## ENSMUSG00000095742 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames(522554): AAACCTGAGAAACCAT AAACCTGAGAAACCGC ...
## TTTGTCATCTTTACGT TTTGTCATCTTTCCTC
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
From the experiment, we expect to have only a few thousand cells, while we can see that we have data for more than 500,000 droplets. It is likely that most of these droplets are empty and are capturing only ambient or background RNA.
library(DropletUtils)
bcrank <- barcodeRanks(counts(sce))
# Only showing unique points for plotting speed.
uniq <- !duplicated(bcrank$rank)
plot(bcrank$rank[uniq], bcrank$total[uniq], log="xy",
xlab="Rank", ylab="Total UMI count", cex.lab=1.2)
## Warning in xy.coords(x, y, xlabel, ylabel, log): 1 y value <= 0 omitted from
## logarithmic plot
abline(h=metadata(bcrank)$inflection, col="darkgreen", lty=2)
abline(h=metadata(bcrank)$knee, col="dodgerblue", lty=2)
legend("bottomleft", legend=c("Inflection", "Knee"),
col=c("darkgreen", "dodgerblue"), lty=2, cex=1.2)
The distribution of total counts exhibits a sharp transition between barcodes with large and small total counts, probably corresponding to cell-containing and empty droplets respectively.
A simple approach would be to apply a threshold on the total count to only retain those barcodes with large totals. However, this may unnecessarily discard libraries derived from cell types with low RNA content.
A better approach is to test whether the expression profile for each cell barcode is significantly different from the ambient RNA pool (Aaron TL Lun et al. 2019). Any significant deviation indicates that the barcode corresponds to a cell-containing droplet. This allows us to discriminate between well-sequenced empty droplets and droplets derived from cells with little RNA, both of which would have similar total counts.
We call cells at a false discovery rate (FDR) of 0.1%, meaning that no more than 0.1% of our called barcodes should be empty droplets on average.
# emptyDrops performs Monte Carlo simulations to compute p-values,
# so we need to set the seed to obtain reproducible results.
set.seed(100)
# this may take a few minutes
e.out <- emptyDrops(counts(sce))
summary(e.out$FDR <= 0.001)
## Mode FALSE TRUE NA's
## logical 6184 3131 513239
sce <- sce[,which(e.out$FDR <= 0.001)]
sce
## class: SingleCellExperiment
## dim: 29453 3131
## metadata(0):
## assays(1): counts
## rownames(29453): ENSMUSG00000051951 ENSMUSG00000089699 ...
## ENSMUSG00000095742 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames(3131): AAACCTGAGACTGTAA AAACCTGAGATGCCTT ... TTTGTCAGTCTGATTG
## TTTGTCATCTGAGTGT
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
The end result confirms our prior expectation: only 3131 droplets contain a cell, while the large majority of droplets are empty.
While we have removed empty droplets, this does not necessarily imply that all the cell-containing droplets should be kept for downstream analysis. In fact, some droplets could contain low-quality samples, due to cell damage or failure in library preparation.
Retaining these low-quality samples in the analysis could be problematic as they could:
To mitigate these problems, we can check a few quality-control metrics and, if needed, remove low-quality samples.
There are many possibile ways to define a set of quality-control metrics, see for instance Cole et al. (2019). Here, we keep it simple and consider only:
In particular, high proportions of mitochondrial genes are indicative of poor-quality cells, presumably because of loss of cytoplasmic RNA from perforated cells. The reasoning is that, in the presence of modest damage, the holes in the cell membrane permit efflux of individual transcript molecules but are too small to allow mitochondria to escape, leading to a relative enrichment of mitochondrial transcripts. For single-nucleus RNA-seq experiments, high proportions are also useful as they can mark cells where the cytoplasm has not been successfully stripped.
First, we need to identify mitochondrial genes. We use the available
EnsDb
mouse package available in Bioconductor, but a more
updated version of Ensembl can be used through the
AnnotationHub
or biomaRt
packages.
library(EnsDb.Mmusculus.v79)
## Loading required package: ensembldb
## Loading required package: GenomicFeatures
## Loading required package: AnnotationDbi
## Loading required package: AnnotationFilter
##
## Attaching package: 'ensembldb'
## The following object is masked from 'package:stats':
##
## filter
chr.loc <- mapIds(EnsDb.Mmusculus.v79, keys=rownames(sce),
keytype="GENEID", column="SEQNAME")
## Warning: Unable to map 2918 of 29453 requested IDs.
is.mito <- which(chr.loc=="MT")
We can use the scuttle
package to compute a set of
quality-control metrics, specifying that we want to use the
mitochondrial genes as a special set of features.
library(scuttle)
df <- perCellQCMetrics(sce, subsets=list(Mito=is.mito))
# include them in the object
colData(sce) <- cbind(colData(sce), df)
colData(sce)
## DataFrame with 3131 rows and 6 columns
## sum detected subsets_Mito_sum subsets_Mito_detected
## <numeric> <numeric> <numeric> <numeric>
## AAACCTGAGACTGTAA 27577 5418 471 10
## AAACCTGAGATGCCTT 29309 5405 679 10
## AAACCTGAGCAGCCTC 28795 5218 480 12
## AAACCTGCATACTCTT 34794 4781 496 12
## AAACCTGGTGGTACAG 262 229 0 0
## ... ... ... ... ...
## TTTGGTTTCGCCATAA 38398 6020 252 12
## TTTGTCACACCCTATC 3013 1451 123 9
## TTTGTCACATTCTCAT 1472 675 599 11
## TTTGTCAGTCTGATTG 361 293 0 0
## TTTGTCATCTGAGTGT 267 233 16 6
## subsets_Mito_percent total
## <numeric> <numeric>
## AAACCTGAGACTGTAA 1.70795 27577
## AAACCTGAGATGCCTT 2.31669 29309
## AAACCTGAGCAGCCTC 1.66696 28795
## AAACCTGCATACTCTT 1.42553 34794
## AAACCTGGTGGTACAG 0.00000 262
## ... ... ...
## TTTGGTTTCGCCATAA 0.656284 38398
## TTTGTCACACCCTATC 4.082310 3013
## TTTGTCACATTCTCAT 40.692935 1472
## TTTGTCAGTCTGATTG 0.000000 361
## TTTGTCATCTGAGTGT 5.992509 267
Now that we have computed the metrics, we have to decide on thresholds to define high- and low-quality samples. We could check how many cells are above/below a certain fixed threshold. For instance,
table(df$sum < 10000)
##
## FALSE TRUE
## 2477 654
table(df$subsets_Mito_percent > 10)
##
## FALSE TRUE
## 2761 370
or we could look at the distribution of such metrics and use a data adaptive threshold.
summary(df$detected)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 98 4126 5168 4455 5670 7908
summary(df$subsets_Mito_percent)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.155 1.608 5.079 2.182 66.968
This can be achieved with the perCellQCFilters
function.
By default, we consider a value to be an outlier if it is more than 3
median absolute deviations (MADs) from the median in the “problematic”
direction. This is loosely motivated by the fact that such a filter will
retain 99% of non-outlier values that follow a normal distribution.
reasons <- perCellQCFilters(df, sub.fields="subsets_Mito_percent")
colSums(as.matrix(reasons))
## low_lib_size low_n_features high_subsets_Mito_percent
## 601 654 484
## discard
## 694
summary(reasons$discard)
## Mode FALSE TRUE
## logical 2437 694
# include in object
sce$discard <- reasons$discard
It is always a good idea to check the distribution of the QC metrics and to visualize the cells that were removed, to identify possible problems with the procedure. In particular, we expect to have few outliers and with a marked difference from “regular” cells (e.g., a bimodal distribution or a long tail). Moreover, if there are too many discarded cells, further exploration might be needed.
## Loading required package: ggplot2
plotColData(sce, y = "sum", colour_by = "discard") + ggtitle("Total count")
plotColData(sce, y = "detected", colour_by = "discard") + ggtitle("Detected features")
plotColData(sce, y = "subsets_Mito_percent", colour_by = "discard") + ggtitle("Mito percent")
While the univariate distribution of QC metrics can give some insight on the quality of the sample, often looking at the bivariate distribution of QC metrics is useful, e.g., to confirm that there are no cells with both large total counts and large mitochondrial counts, to ensure that we are not inadvertently removing high-quality cells that happen to be highly metabolically active.
plotColData(sce, x="sum", y="subsets_Mito_percent", colour_by="discard")
It could also be a good idea to perform a differential expression analysis between retained and discarded cells to check wether we are removing an unusual cell population rather than low-quality libraries (see Section 1.5 of OSCA advanced).
Once we are happy with the results, we can discard the low-quality cells by subsetting the original object.
sce <- sce[,!sce$discard]
sce
## class: SingleCellExperiment
## dim: 29453 2437
## metadata(0):
## assays(1): counts
## rownames(29453): ENSMUSG00000051951 ENSMUSG00000089699 ...
## ENSMUSG00000095742 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames(2437): AAACCTGAGACTGTAA AAACCTGAGATGCCTT ... TTTGGTTTCAGTCAGT
## TTTGGTTTCGCCATAA
## colData names(7): sum detected ... total discard
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
Systematic differences in sequencing coverage between libraries are often observed in single-cell RNA sequencing data. They typically arise from technical differences in cDNA capture or PCR amplification efficiency across cells, attributable to the difficulty of achieving consistent library preparation with minimal starting material (Vallejos et al. 2017). Normalization aims to remove these differences such that they do not interfere with comparisons of the expression profiles between cells. The hope is that the observed heterogeneity or differential expression within the cell population are driven by biology and not technical biases.
We will mostly focus our attention on scaling normalization, which is the simplest and most commonly used class of normalization strategies. This involves dividing all counts for each cell by a cell-specific scaling factor, often called a size factor. The assumption here is that any cell-specific bias (e.g., in capture or amplification efficiency) affects all genes equally via scaling of the expected mean count for that cell. The size factor for each cell represents the estimate of the relative bias in that cell, so division of its counts by its size factor should remove that bias. The resulting “normalized expression values” can then be used for downstream analyses such as clustering and dimensionality reduction.
The simplest and most natural strategy would be to normalize by the total sum of counts across all genes for each cell. This is often called the library size normalization.
The library size factor for each cell is directly proportional to its library size where the proportionality constant is defined such that the mean size factor across all cells is equal to 1. This ensures that the normalized expression values are on the same scale as the original counts.
lib.sf <- librarySizeFactors(sce)
summary(lib.sf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2730 0.7879 0.9600 1.0000 1.1730 2.5598
Library size normalization is not optimal, as it assumes that the total sum of UMI counts differ between cells only for technical and not biological reason. This can be a problem if a highly-expressed subset of genes is differentially expressed between cells or cell types.
Several robust normalization methods have been proposed for bulk RNA-seq. However, these methods may underperform in single-cell data due to the dominance of low and zero counts. To overcome this, one solution is to pool counts from many cells to increase the size of the counts for accurate size factor estimation (Aaron T. Lun, Bach, and Marioni 2016). Pool-based size factors are then deconvolved into cell-based factors for normalization of each cell’s expression profile.
We use a pre-clustering step: cells in each cluster are normalized separately and the size factors are rescaled to be comparable across clusters. This avoids the assumption that most genes are non-DE across the entire population – only a non-DE majority is required between pairs of clusters, which is a weaker assumption for highly heterogeneous populations.
Note that while we focus on normalization by deconvolution here, many other methods have been proposed and lead to similar performance (see Borella et al. (2022) for a comparative review).
library(scran)
set.seed(100)
clust <- quickCluster(sce)
table(clust)
## clust
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 273 159 250 122 187 201 154 252 152 169 199 215 104
deconv.sf <- calculateSumFactors(sce, cluster=clust)
summary(deconv.sf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3100 0.8028 0.9626 1.0000 1.1736 2.7858
plot(lib.sf, deconv.sf, xlab="Library size factor",
ylab="Deconvolution size factor", log='xy', pch=16,
col=as.integer(clust))
abline(a=0, b=1, col="red")
Once we have computed the size factors, we compute the normalized
expression values for each cell by dividing the count for each gene with
the appropriate size factor for that cell. Since we are typically going
to work with log-transformed counts, the function
logNormCounts
also log-transforms the normalized values,
creating a new assay called logcounts
.
sizeFactors(sce) <- deconv.sf
sce <- logNormCounts(sce)
sce
## class: SingleCellExperiment
## dim: 29453 2437
## metadata(0):
## assays(2): counts logcounts
## rownames(29453): ENSMUSG00000051951 ENSMUSG00000089699 ...
## ENSMUSG00000095742 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames(2437): AAACCTGAGACTGTAA AAACCTGAGATGCCTT ... TTTGGTTTCAGTCAGT
## TTTGGTTTCGCCATAA
## colData names(8): sum detected ... discard sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
The typical next steps in the analysis of single-cell data are dimensionality reduction and clustering, which involve measuring the similarity between cells.
The choice of genes to use in this calculation has a major impact on the results. We want to select genes that contain useful information about the biology of the system while removing genes that contain only random noise. This aims to preserve interesting biological structure without the variance that obscures that structure, and to reduce the size of the data to improve computational efficiency of later steps.
The simplest approach to feature selection is to select the most variable genes based on their log-normalized expression across the population.
Calculation of the per-gene variance is simple but feature selection
requires modelling of the mean-variance relationship. The
log-transformation is not a variance stabilizing transformation in most
cases, which means that the total variance of a gene is driven more by
its abundance than its underlying biological heterogeneity. To account
for this, the modelGeneVar
function fits a trend to the
variance with respect to abundance across all genes.
dec.sce <- modelGeneVar(sce)
fit.sce <- metadata(dec.sce)
plot(fit.sce$mean, fit.sce$var, xlab = "Mean of log-expression",
ylab = "Variance of log-expression")
curve(fit.sce$trend(x), col = "dodgerblue", add = TRUE, lwd = 2)
The blue line represents the uninteresting “technical” variance for any given gene abundance. The genes with a lot of additional variance exhibit interesting “biological” variation.
The next step is to select the subset of HVGs to use in downstream
analyses. A larger set will assure that we do not remove important
genes, at the cost of potentially increasing noise. Typically, we
restrict ourselves to the top \(n\)
genes, here we chose \(n=1000\), but
this choice should be guided by prior biological knowledge; for
instance, we may expect that only about 10% of genes to be
differentially expressed across our cell populations and hence select
10% of genes as higly variable (e.g., by setting
prop=0.1
).
hvg.sce.var <- getTopHVGs(dec.sce, n=1000)
head(hvg.sce.var)
## [1] "ENSMUSG00000055609" "ENSMUSG00000052217" "ENSMUSG00000069919"
## [4] "ENSMUSG00000052187" "ENSMUSG00000048583" "ENSMUSG00000051855"
Many scRNA-seq analysis procedures involve comparing cells based on their expression values across multiple genes. For example, clustering aims to identify cells with similar transcriptomic profiles by computing Euclidean distances across genes. In these applications, each individual gene represents a dimension of the data, hence we can think of the data as “living” in a ten-thousand-dimensional space.
As the name suggests, dimensionality reduction aims to reduce the number of dimensions, while preserving as much as possible of the original information. This obviously reduces the computational work (e.g., it is easier to compute distance in lower-dimensional spaces), and more importantly leads to less noisy and more interpretable results (cf. the curse of dimensionality).
Principal component analysis (PCA) is a dimensionality reduction technique that provides a parsimonious summarization of the data by replacing the original variables (genes) by fewer linear combinations of these variables, that are orthogonal and have successively maximal variance. Such linear combinations seek to “separate out” the observations (cells), while loosing as little information as possible.
Without getting into the technical details, one nice feature of PCA is that the principal components (PCs) are ordered by how much variance of the original data they “explain”. Furthermore, by focusing on the top \(k\) PC we are focusing on the most important directions of variability, which hopefully correspond to biological rather than technical variance. (It is however good practice to check this by e.g. looking at correlation between technical QC metrics and PCs).
One simple way to maximize our chance of capturing biological variation is by computing the PCs starting from the highly variable genes identified before.
sce <- runPCA(sce, subset_row=hvg.sce.var)
sce
## class: SingleCellExperiment
## dim: 29453 2437
## metadata(0):
## assays(2): counts logcounts
## rownames(29453): ENSMUSG00000051951 ENSMUSG00000089699 ...
## ENSMUSG00000095742 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames(2437): AAACCTGAGACTGTAA AAACCTGAGATGCCTT ... TTTGGTTTCAGTCAGT
## TTTGGTTTCGCCATAA
## colData names(8): sum detected ... discard sizeFactor
## reducedDimNames(1): PCA
## mainExpName: NULL
## altExpNames(0):
By default, runPCA
computes the first 50 principal
components. We can check how much original variability they explain.
percent.var <- attr(reducedDim(sce), "percentVar")
plot(percent.var, log="y", xlab="PC", ylab="Variance explained (%)")
And we can of course visualize the first 2-3 components, perhaps color-coding each point by an interesting feature, in this case the total number of UMIs per cell.
plotPCA(sce, colour_by="sum")
plotReducedDim(sce, dimred="PCA", ncomponents=3)
While PCA is a simple and effective way to visualize (and interpret!) scRNA-seq data, non-linear methods such as t-SNE (t-stochastic neighbor embedding) and UMAP (uniform manifold approximation and projection) have gained much popularity in the literature.
These methods attempt to find a low-dimensional representation of the data that preserves the distances between each point and its neighbors in the high-dimensional space.
It is easy to over-interpret t-SNE and UMAP plots. We note that the relative sizes and positions of the visual clusters may be misleading, as they tend to inflate dense clusters and compress sparse ones, such that we cannot use the size as a measure of subpopulation heterogeneity.
In addition, these methods are not guaranteed to preserve the global structure of the data (e.g., the relative locations of non-neighboring clusters), such that we cannot use their positions to determine relationships between distant clusters.
Note that the sce
object now includes all the computed
dimensionality reduced representations of the data for ease of reusing
and replotting without the need for recomputing.
sce
## class: SingleCellExperiment
## dim: 29453 2437
## metadata(0):
## assays(2): counts logcounts
## rownames(29453): ENSMUSG00000051951 ENSMUSG00000089699 ...
## ENSMUSG00000095742 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames(2437): AAACCTGAGACTGTAA AAACCTGAGATGCCTT ... TTTGGTTTCAGTCAGT
## TTTGGTTTCGCCATAA
## colData names(8): sum detected ... discard sizeFactor
## reducedDimNames(3): PCA TSNE UMAP
## mainExpName: NULL
## altExpNames(0):
Despite their shortcomings, t-SNE and UMAP may be useful visualization techniques. When using them, it is important to consider that they are stochastic methods that involve a random component (each run will lead to different plots) and that there are key parameters to be set that change the results substantially (e.g., the “perplexity” parameter of t-SNE).
Doublets are artifactual libraries generated from two cells. They typically arise due to errors in cell sorting or capture. Specifically, in droplet-based protocols, it may happen that two cells are captured in the same droplet.
Doublets are obviously undesirable when the aim is to characterize populations at the single-cell level. In particular, doublets can be mistaken for intermediate populations or transitory states that do not actually exist. Thus, it is desirable to identify and remove doublet libraries so that they do not compromise interpretation of the results.
It is not easy to computationally identify doublets as they can be hard to distinguish from transient states and/or cell populations with high RNA content. When possible, it is good to rely on experimental strategies to minimize doublets, e.g., by using genetic variation (e.g., pooling multiple donors in one run) or antibody tagging (e.g., CITE-seq).
There are several computational methods to identify doublets; we describe only one here based on in-silico simulation of doublets.
At a high level, the algorithm can be defined by the following steps:
Intuitively, if a “cell” is surrounded only by simulated doublets is very likely to be a doublet itself.
This approach is implemented below, where we visualize the scores in a t-SNE plot.
library(scDblFinder)
set.seed(100)
dbl.dens <- computeDoubletDensity(sce, subset.row=hvg.sce.var,
d=ncol(reducedDim(sce)))
summary(dbl.dens)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.002014 0.000000 1.886238
sce$DoubletScore <- dbl.dens
plotTSNE(sce, colour_by="DoubletScore")
We can explicitly convert this into doublet calls by identifying large outliers for the score within each sample. Here we use the “griffiths” method to do so.
dbl.calls <- doubletThresholding(data.frame(score=dbl.dens),
method="griffiths", returnType="call")
summary(dbl.calls)
## singlet doublet
## 2407 30
sce$doublet <- dbl.calls
plotColData(sce, y="DoubletScore", colour_by="doublet")
plotTSNE(sce, colour_by="doublet")
One way to determine whether a cell is in a real transient state or it is a doublet is to check the number of detected genes and total UMI counts.
plotColData(sce, "detected", "sum", colour_by = "DoubletScore")
plotColData(sce, "detected", "sum", colour_by = "doublet")
In this case, we only have a few doublets at the periphery of some clusters. It could be fine to keep the doublets in the analysis, but it may be safer to discard them to avoid biases in downstream analysis (e.g., differential expression).
## R Under development (unstable) (2023-11-22 r85609)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] scDblFinder_1.17.0 scran_1.31.0
## [3] scater_1.31.1 ggplot2_3.4.4
## [5] scuttle_1.13.0 EnsDb.Mmusculus.v79_2.99.0
## [7] ensembldb_2.27.1 AnnotationFilter_1.27.0
## [9] GenomicFeatures_1.55.1 AnnotationDbi_1.65.2
## [11] DropletUtils_1.23.0 MouseGastrulationData_1.17.0
## [13] SpatialExperiment_1.13.0 SingleCellExperiment_1.25.0
## [15] SummarizedExperiment_1.33.0 Biobase_2.63.0
## [17] GenomicRanges_1.55.1 GenomeInfoDb_1.39.1
## [19] IRanges_2.37.0 S4Vectors_0.41.2
## [21] BiocGenerics_0.49.1 MatrixGenerics_1.15.0
## [23] matrixStats_1.1.0 BiocStyle_2.31.0
##
## loaded via a namespace (and not attached):
## [1] later_1.3.1 BiocIO_1.13.0
## [3] bitops_1.0-7 filelock_1.0.2
## [5] tibble_3.2.1 R.oo_1.25.0
## [7] XML_3.99-0.15 lifecycle_1.0.4
## [9] edgeR_4.1.2 rprojroot_2.0.4
## [11] MASS_7.3-60.1 lattice_0.22-5
## [13] magrittr_2.0.3 limma_3.59.1
## [15] sass_0.4.7 rmarkdown_2.25
## [17] jquerylib_0.1.4 yaml_2.3.7
## [19] metapod_1.11.0 httpuv_1.6.12
## [21] DBI_1.1.3 abind_1.4-5
## [23] zlibbioc_1.49.0 Rtsne_0.16
## [25] purrr_1.0.2 R.utils_2.12.3
## [27] BumpyMatrix_1.11.0 RCurl_1.98-1.13
## [29] rappdirs_0.3.3 GenomeInfoDbData_1.2.11
## [31] ggrepel_0.9.4 irlba_2.3.5.1
## [33] dqrng_0.3.1 pkgdown_2.0.7
## [35] DelayedMatrixStats_1.25.1 codetools_0.2-19
## [37] DelayedArray_0.29.0 xml2_1.3.5
## [39] tidyselect_1.2.0 farver_2.1.1
## [41] ScaledMatrix_1.11.0 viridis_0.6.4
## [43] BiocFileCache_2.11.1 GenomicAlignments_1.39.0
## [45] jsonlite_1.8.7 BiocNeighbors_1.21.0
## [47] ellipsis_0.3.2 systemfonts_1.0.5
## [49] tools_4.4.0 progress_1.2.2
## [51] ragg_1.2.6 Rcpp_1.0.11
## [53] glue_1.6.2 gridExtra_2.3
## [55] SparseArray_1.3.1 xfun_0.41
## [57] dplyr_1.1.4 HDF5Array_1.31.0
## [59] withr_2.5.2 formatR_1.14
## [61] BiocManager_1.30.22 fastmap_1.1.1
## [63] bluster_1.13.0 rhdf5filters_1.15.1
## [65] fansi_1.0.5 digest_0.6.33
## [67] rsvd_1.0.5 R6_2.5.1
## [69] mime_0.12 textshaping_0.3.7
## [71] colorspace_2.1-0 biomaRt_2.59.0
## [73] RSQLite_2.3.3 R.methodsS3_1.8.2
## [75] utf8_1.2.4 generics_0.1.3
## [77] data.table_1.14.8 FNN_1.1.3.2
## [79] rtracklayer_1.63.0 prettyunits_1.2.0
## [81] httr_1.4.7 S4Arrays_1.3.1
## [83] uwot_0.1.16 pkgconfig_2.0.3
## [85] gtable_0.3.4 blob_1.2.4
## [87] XVector_0.43.0 htmltools_0.5.7
## [89] ProtGenerics_1.35.0 scales_1.3.0
## [91] png_0.1-8 knitr_1.45
## [93] rjson_0.2.21 curl_5.1.0
## [95] cachem_1.0.8 rhdf5_2.47.0
## [97] stringr_1.5.1 BiocVersion_3.19.1
## [99] parallel_4.4.0 vipor_0.4.5
## [101] restfulr_0.0.15 desc_1.4.2
## [103] pillar_1.9.0 grid_4.4.0
## [105] vctrs_0.6.4 promises_1.2.1
## [107] BiocSingular_1.19.0 dbplyr_2.4.0
## [109] beachmat_2.19.0 cluster_2.1.5
## [111] xtable_1.8-4 beeswarm_0.4.0
## [113] evaluate_0.23 magick_2.8.1
## [115] cli_3.6.1 locfit_1.5-9.8
## [117] compiler_4.4.0 Rsamtools_2.19.2
## [119] rlang_1.1.2 crayon_1.5.2
## [121] labeling_0.4.3 fs_1.6.3
## [123] ggbeeswarm_0.7.2 stringi_1.8.2
## [125] viridisLite_0.4.2 BiocParallel_1.37.0
## [127] munsell_0.5.0 Biostrings_2.71.1
## [129] lazyeval_0.2.2 Matrix_1.6-3
## [131] ExperimentHub_2.11.0 hms_1.1.3
## [133] sparseMatrixStats_1.15.0 bit64_4.0.5
## [135] Rhdf5lib_1.25.0 KEGGREST_1.43.0
## [137] statmod_1.5.0 shiny_1.8.0
## [139] interactiveDisplayBase_1.41.0 highr_0.10
## [141] AnnotationHub_3.11.0 igraph_1.5.1
## [143] memoise_2.0.1 bslib_0.6.1
## [145] xgboost_1.7.5.1 bit_4.0.5
Normalization: Here we used the deconvolution method implemented
in scran
based on a previous clustering step. Use the
calculateSumFactors
to compute the size factors without
considering a preliminary clustering. Compare the resulting size factors
via a scatter plot. How do the results change? What are the risks of not
including clustering information?
An alternative to PCA for dimensionality reduction is the NewWave method. Apply the NewWave method to the SingleCellExperiment object and visually compare the first two dimensions with the first two principal components. What are the major differences in terms of results?
Hint: First subset the object to include only highly variable genes
(sce2 <- sce[hvg.sce.var,]
) and then apply the
NewWave
function to the new object setting
K=10
to obtain the first 10 dimensions.
DropletTestFiles
includes the
raw output from Cell Ranger of the peripheral blood mononuclear cell
(PBMC) dataset from 10X Genomics, publicly available from the 10X
Genomics website. Repeat the analysis of this vignette using those
data.