Exploratory data analysis and quality control

Setup and experimental design

As mentioned in the introduction, in this tutorial we will use the wild-type data from the Tal1 chimera experiment. These data are available through the MouseGastrulationData Bioconductor package, which contains several datasets.

In particular, the package contains the following samples that we will use for the tutorial:

Sample 5: E8.5 injected cells (tomato positive), pool 3
Sample 6: E8.5 host cells (tomato negative), pool 3
Sample 7: E8.5 injected cells (tomato positive), pool 4
Sample 8: E8.5 host cells (tomato negative), pool 4
Sample 9: E8.5 injected cells (tomato positive), pool 5
Sample 10: E8.5 host cells (tomato negative), pool 5

We start our analysis by selecting only sample 5, which contains the injected cells in one biological replicate. We download the “raw” data that contains all the droplets for which we have sequenced reads.

library(MouseGastrulationData)
sce <- WTChimeraData(samples=5, type="raw")
sce <- sce[[1]]
sce

## class: SingleCellExperiment 
## dim: 29453 522554 
## metadata(0):
## assays(1): counts
## rownames(29453): ENSMUSG00000051951 ENSMUSG00000089699 ...
##   ENSMUSG00000095742 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames(522554): AAACCTGAGAAACCAT AAACCTGAGAAACCGC ...
##   TTTGTCATCTTTACGT TTTGTCATCTTTCCTC
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

Droplet processing

From the experiment, we expect to have only a few thousand cells, while we can see that we have data for more than 500,000 droplets. It is likely that most of these droplets are empty and are capturing only ambient or background RNA.

library(DropletUtils)
bcrank <- barcodeRanks(counts(sce))

# Only showing unique points for plotting speed.
uniq <- !duplicated(bcrank$rank)
plot(bcrank$rank[uniq], bcrank$total[uniq], log="xy",
    xlab="Rank", ylab="Total UMI count", cex.lab=1.2)

## Warning in xy.coords(x, y, xlabel, ylabel, log): 1 y value <= 0 omitted from
## logarithmic plot

abline(h=metadata(bcrank)$inflection, col="darkgreen", lty=2)
abline(h=metadata(bcrank)$knee, col="dodgerblue", lty=2)

legend("bottomleft", legend=c("Inflection", "Knee"), 
        col=c("darkgreen", "dodgerblue"), lty=2, cex=1.2)

The distribution of total counts exhibits a sharp transition between barcodes with large and small total counts, probably corresponding to cell-containing and empty droplets respectively.

A simple approach would be to apply a threshold on the total count to only retain those barcodes with large totals. However, this may unnecessarily discard libraries derived from cell types with low RNA content.

Testing for empty droplets

A better approach is to test whether the expression profile for each cell barcode is significantly different from the ambient RNA pool (Aaron TL Lun et al. 2019). Any significant deviation indicates that the barcode corresponds to a cell-containing droplet. This allows us to discriminate between well-sequenced empty droplets and droplets derived from cells with little RNA, both of which would have similar total counts.

We call cells at a false discovery rate (FDR) of 0.1%, meaning that no more than 0.1% of our called barcodes should be empty droplets on average.

# emptyDrops performs Monte Carlo simulations to compute p-values,
# so we need to set the seed to obtain reproducible results.
set.seed(100)

# this may take a few minutes
e.out <- emptyDrops(counts(sce))

summary(e.out$FDR <= 0.001)

##    Mode   FALSE    TRUE    NA's 
## logical    6184    3131  513239

sce <- sce[,which(e.out$FDR <= 0.001)]
sce

## class: SingleCellExperiment 
## dim: 29453 3131 
## metadata(0):
## assays(1): counts
## rownames(29453): ENSMUSG00000051951 ENSMUSG00000089699 ...
##   ENSMUSG00000095742 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames(3131): AAACCTGAGACTGTAA AAACCTGAGATGCCTT ... TTTGTCAGTCTGATTG
##   TTTGTCATCTGAGTGT
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

The end result confirms our prior expectation: only 3131 droplets contain a cell, while the large majority of droplets are empty.

Quality control

While we have removed empty droplets, this does not necessarily imply that all the cell-containing droplets should be kept for downstream analysis. In fact, some droplets could contain low-quality samples, due to cell damage or failure in library preparation.

Retaining these low-quality samples in the analysis could be problematic as they could:

form their own cluster, complicating the interpretation of the results
interfere with variance estimation and principal component analysis
contain contaminating transcripts from ambient RNA

To mitigate these problems, we can check a few quality-control metrics and, if needed, remove low-quality samples.

Choice of quality-control metrics

There are many possibile ways to define a set of quality-control metrics, see for instance Cole et al. (2019). Here, we keep it simple and consider only:

the library size, defined as the total sum of counts across all relevant features for each cell;
the number of expressed features in each cell, defined as the number of endogenous genes with non-zero counts for that cell;
the proportion of reads mapped to genes in the mitochondrial genome.

In particular, high proportions of mitochondrial genes are indicative of poor-quality cells, presumably because of loss of cytoplasmic RNA from perforated cells. The reasoning is that, in the presence of modest damage, the holes in the cell membrane permit efflux of individual transcript molecules but are too small to allow mitochondria to escape, leading to a relative enrichment of mitochondrial transcripts. For single-nucleus RNA-seq experiments, high proportions are also useful as they can mark cells where the cytoplasm has not been successfully stripped.

First, we need to identify mitochondrial genes. We use the available EnsDb mouse package available in Bioconductor, but a more updated version of Ensembl can be used through the AnnotationHub or biomaRt packages.

library(EnsDb.Mmusculus.v79)

## Loading required package: ensembldb

## Loading required package: GenomicFeatures

## Loading required package: AnnotationDbi

## Loading required package: AnnotationFilter

## 
## Attaching package: 'ensembldb'

## The following object is masked from 'package:stats':
## 
##     filter

chr.loc <- mapIds(EnsDb.Mmusculus.v79, keys=rownames(sce),
    keytype="GENEID", column="SEQNAME")

## Warning: Unable to map 2918 of 29453 requested IDs.

is.mito <- which(chr.loc=="MT")

We can use the scuttle package to compute a set of quality-control metrics, specifying that we want to use the mitochondrial genes as a special set of features.

library(scuttle)
df <- perCellQCMetrics(sce, subsets=list(Mito=is.mito))

# include them in the object
colData(sce) <- cbind(colData(sce), df)
colData(sce)

## DataFrame with 3131 rows and 6 columns
##                        sum  detected subsets_Mito_sum subsets_Mito_detected
##                  <numeric> <numeric>        <numeric>             <numeric>
## AAACCTGAGACTGTAA     27577      5418              471                    10
## AAACCTGAGATGCCTT     29309      5405              679                    10
## AAACCTGAGCAGCCTC     28795      5218              480                    12
## AAACCTGCATACTCTT     34794      4781              496                    12
## AAACCTGGTGGTACAG       262       229                0                     0
## ...                    ...       ...              ...                   ...
## TTTGGTTTCGCCATAA     38398      6020              252                    12
## TTTGTCACACCCTATC      3013      1451              123                     9
## TTTGTCACATTCTCAT      1472       675              599                    11
## TTTGTCAGTCTGATTG       361       293                0                     0
## TTTGTCATCTGAGTGT       267       233               16                     6
##                  subsets_Mito_percent     total
##                             <numeric> <numeric>
## AAACCTGAGACTGTAA              1.70795     27577
## AAACCTGAGATGCCTT              2.31669     29309
## AAACCTGAGCAGCCTC              1.66696     28795
## AAACCTGCATACTCTT              1.42553     34794
## AAACCTGGTGGTACAG              0.00000       262
## ...                               ...       ...
## TTTGGTTTCGCCATAA             0.656284     38398
## TTTGTCACACCCTATC             4.082310      3013
## TTTGTCACATTCTCAT            40.692935      1472
## TTTGTCAGTCTGATTG             0.000000       361
## TTTGTCATCTGAGTGT             5.992509       267

Now that we have computed the metrics, we have to decide on thresholds to define high- and low-quality samples. We could check how many cells are above/below a certain fixed threshold. For instance,

table(df$sum < 10000)

## 
## FALSE  TRUE 
##  2477   654

table(df$subsets_Mito_percent > 10)

## 
## FALSE  TRUE 
##  2761   370

or we could look at the distribution of such metrics and use a data adaptive threshold.

summary(df$detected)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      98    4126    5168    4455    5670    7908

summary(df$subsets_Mito_percent)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.155   1.608   5.079   2.182  66.968

This can be achieved with the perCellQCFilters function. By default, we consider a value to be an outlier if it is more than 3 median absolute deviations (MADs) from the median in the “problematic” direction. This is loosely motivated by the fact that such a filter will retain 99% of non-outlier values that follow a normal distribution.

reasons <- perCellQCFilters(df, sub.fields="subsets_Mito_percent")
colSums(as.matrix(reasons))

##              low_lib_size            low_n_features high_subsets_Mito_percent 
##                       601                       654                       484 
##                   discard 
##                       694

summary(reasons$discard)

##    Mode   FALSE    TRUE 
## logical    2437     694

# include in object
sce$discard <- reasons$discard

Diagnostic plots

It is always a good idea to check the distribution of the QC metrics and to visualize the cells that were removed, to identify possible problems with the procedure. In particular, we expect to have few outliers and with a marked difference from “regular” cells (e.g., a bimodal distribution or a long tail). Moreover, if there are too many discarded cells, further exploration might be needed.

library(scater)

## Loading required package: ggplot2

plotColData(sce, y = "sum", colour_by = "discard") + ggtitle("Total count")

plotColData(sce, y = "detected", colour_by = "discard") + ggtitle("Detected features")

plotColData(sce, y = "subsets_Mito_percent", colour_by = "discard") + ggtitle("Mito percent")

While the univariate distribution of QC metrics can give some insight on the quality of the sample, often looking at the bivariate distribution of QC metrics is useful, e.g., to confirm that there are no cells with both large total counts and large mitochondrial counts, to ensure that we are not inadvertently removing high-quality cells that happen to be highly metabolically active.

plotColData(sce, x="sum", y="subsets_Mito_percent", colour_by="discard")

It could also be a good idea to perform a differential expression analysis between retained and discarded cells to check wether we are removing an unusual cell population rather than low-quality libraries (see Section 1.5 of OSCA advanced).

Once we are happy with the results, we can discard the low-quality cells by subsetting the original object.

sce <- sce[,!sce$discard]
sce

## class: SingleCellExperiment 
## dim: 29453 2437 
## metadata(0):
## assays(1): counts
## rownames(29453): ENSMUSG00000051951 ENSMUSG00000089699 ...
##   ENSMUSG00000095742 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames(2437): AAACCTGAGACTGTAA AAACCTGAGATGCCTT ... TTTGGTTTCAGTCAGT
##   TTTGGTTTCGCCATAA
## colData names(7): sum detected ... total discard
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

Normalization

Systematic differences in sequencing coverage between libraries are often observed in single-cell RNA sequencing data. They typically arise from technical differences in cDNA capture or PCR amplification efficiency across cells, attributable to the difficulty of achieving consistent library preparation with minimal starting material (Vallejos et al. 2017). Normalization aims to remove these differences such that they do not interfere with comparisons of the expression profiles between cells. The hope is that the observed heterogeneity or differential expression within the cell population are driven by biology and not technical biases.

We will mostly focus our attention on scaling normalization, which is the simplest and most commonly used class of normalization strategies. This involves dividing all counts for each cell by a cell-specific scaling factor, often called a size factor. The assumption here is that any cell-specific bias (e.g., in capture or amplification efficiency) affects all genes equally via scaling of the expected mean count for that cell. The size factor for each cell represents the estimate of the relative bias in that cell, so division of its counts by its size factor should remove that bias. The resulting “normalized expression values” can then be used for downstream analyses such as clustering and dimensionality reduction.

The simplest and most natural strategy would be to normalize by the total sum of counts across all genes for each cell. This is often called the library size normalization.

The library size factor for each cell is directly proportional to its library size where the proportionality constant is defined such that the mean size factor across all cells is equal to 1. This ensures that the normalized expression values are on the same scale as the original counts.

lib.sf <- librarySizeFactors(sce)
summary(lib.sf)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2730  0.7879  0.9600  1.0000  1.1730  2.5598

hist(log10(lib.sf), xlab="Log10[Size factor]", col='grey80', breaks = 30)

Normalization by deconvolution

Library size normalization is not optimal, as it assumes that the total sum of UMI counts differ between cells only for technical and not biological reason. This can be a problem if a highly-expressed subset of genes is differentially expressed between cells or cell types.

Several robust normalization methods have been proposed for bulk RNA-seq. However, these methods may underperform in single-cell data due to the dominance of low and zero counts. To overcome this, one solution is to pool counts from many cells to increase the size of the counts for accurate size factor estimation (Aaron T. Lun, Bach, and Marioni 2016). Pool-based size factors are then deconvolved into cell-based factors for normalization of each cell’s expression profile.

We use a pre-clustering step: cells in each cluster are normalized separately and the size factors are rescaled to be comparable across clusters. This avoids the assumption that most genes are non-DE across the entire population – only a non-DE majority is required between pairs of clusters, which is a weaker assumption for highly heterogeneous populations.

Note that while we focus on normalization by deconvolution here, many other methods have been proposed and lead to similar performance (see Borella et al. (2022) for a comparative review).

library(scran)
set.seed(100)
clust <- quickCluster(sce) 
table(clust)

## clust
##   1   2   3   4   5   6   7   8   9  10  11  12  13 
## 273 159 250 122 187 201 154 252 152 169 199 215 104

deconv.sf <- calculateSumFactors(sce, cluster=clust)
summary(deconv.sf)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3100  0.8028  0.9626  1.0000  1.1736  2.7858

plot(lib.sf, deconv.sf, xlab="Library size factor",
    ylab="Deconvolution size factor", log='xy', pch=16,
    col=as.integer(clust))
abline(a=0, b=1, col="red")

Once we have computed the size factors, we compute the normalized expression values for each cell by dividing the count for each gene with the appropriate size factor for that cell. Since we are typically going to work with log-transformed counts, the function logNormCounts also log-transforms the normalized values, creating a new assay called logcounts.

sizeFactors(sce) <- deconv.sf
sce <- logNormCounts(sce)
sce

## class: SingleCellExperiment 
## dim: 29453 2437 
## metadata(0):
## assays(2): counts logcounts
## rownames(29453): ENSMUSG00000051951 ENSMUSG00000089699 ...
##   ENSMUSG00000095742 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames(2437): AAACCTGAGACTGTAA AAACCTGAGATGCCTT ... TTTGGTTTCAGTCAGT
##   TTTGGTTTCGCCATAA
## colData names(8): sum detected ... discard sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

Feature Selection

The typical next steps in the analysis of single-cell data are dimensionality reduction and clustering, which involve measuring the similarity between cells.

The choice of genes to use in this calculation has a major impact on the results. We want to select genes that contain useful information about the biology of the system while removing genes that contain only random noise. This aims to preserve interesting biological structure without the variance that obscures that structure, and to reduce the size of the data to improve computational efficiency of later steps.

Quantifying per-gene variation

The simplest approach to feature selection is to select the most variable genes based on their log-normalized expression across the population.

Calculation of the per-gene variance is simple but feature selection requires modelling of the mean-variance relationship. The log-transformation is not a variance stabilizing transformation in most cases, which means that the total variance of a gene is driven more by its abundance than its underlying biological heterogeneity. To account for this, the modelGeneVar function fits a trend to the variance with respect to abundance across all genes.

dec.sce <- modelGeneVar(sce)
fit.sce <- metadata(dec.sce)

plot(fit.sce$mean, fit.sce$var, xlab = "Mean of log-expression",
     ylab = "Variance of log-expression")
curve(fit.sce$trend(x), col = "dodgerblue", add = TRUE, lwd = 2)

The blue line represents the uninteresting “technical” variance for any given gene abundance. The genes with a lot of additional variance exhibit interesting “biological” variation.

Selecting highly variable genes

The next step is to select the subset of HVGs to use in downstream analyses. A larger set will assure that we do not remove important genes, at the cost of potentially increasing noise. Typically, we restrict ourselves to the top \(n\) genes, here we chose \(n=1000\), but this choice should be guided by prior biological knowledge; for instance, we may expect that only about 10% of genes to be differentially expressed across our cell populations and hence select 10% of genes as higly variable (e.g., by setting prop=0.1).

hvg.sce.var <- getTopHVGs(dec.sce, n=1000)
head(hvg.sce.var)

## [1] "ENSMUSG00000055609" "ENSMUSG00000052217" "ENSMUSG00000069919"
## [4] "ENSMUSG00000052187" "ENSMUSG00000048583" "ENSMUSG00000051855"

Dimensionality Reduction

Many scRNA-seq analysis procedures involve comparing cells based on their expression values across multiple genes. For example, clustering aims to identify cells with similar transcriptomic profiles by computing Euclidean distances across genes. In these applications, each individual gene represents a dimension of the data, hence we can think of the data as “living” in a ten-thousand-dimensional space.

As the name suggests, dimensionality reduction aims to reduce the number of dimensions, while preserving as much as possible of the original information. This obviously reduces the computational work (e.g., it is easier to compute distance in lower-dimensional spaces), and more importantly leads to less noisy and more interpretable results (cf. the curse of dimensionality).

Principal Component Analysis (PCA)

Principal component analysis (PCA) is a dimensionality reduction technique that provides a parsimonious summarization of the data by replacing the original variables (genes) by fewer linear combinations of these variables, that are orthogonal and have successively maximal variance. Such linear combinations seek to “separate out” the observations (cells), while loosing as little information as possible.

Without getting into the technical details, one nice feature of PCA is that the principal components (PCs) are ordered by how much variance of the original data they “explain”. Furthermore, by focusing on the top \(k\) PC we are focusing on the most important directions of variability, which hopefully correspond to biological rather than technical variance. (It is however good practice to check this by e.g. looking at correlation between technical QC metrics and PCs).

One simple way to maximize our chance of capturing biological variation is by computing the PCs starting from the highly variable genes identified before.

sce <- runPCA(sce, subset_row=hvg.sce.var)
sce

## class: SingleCellExperiment 
## dim: 29453 2437 
## metadata(0):
## assays(2): counts logcounts
## rownames(29453): ENSMUSG00000051951 ENSMUSG00000089699 ...
##   ENSMUSG00000095742 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames(2437): AAACCTGAGACTGTAA AAACCTGAGATGCCTT ... TTTGGTTTCAGTCAGT
##   TTTGGTTTCGCCATAA
## colData names(8): sum detected ... discard sizeFactor
## reducedDimNames(1): PCA
## mainExpName: NULL
## altExpNames(0):

By default, runPCA computes the first 50 principal components. We can check how much original variability they explain.

percent.var <- attr(reducedDim(sce), "percentVar")
plot(percent.var, log="y", xlab="PC", ylab="Variance explained (%)")

And we can of course visualize the first 2-3 components, perhaps color-coding each point by an interesting feature, in this case the total number of UMIs per cell.

plotPCA(sce, colour_by="sum")

plotReducedDim(sce, dimred="PCA", ncomponents=3)

Non-linear methods

While PCA is a simple and effective way to visualize (and interpret!) scRNA-seq data, non-linear methods such as t-SNE (t-stochastic neighbor embedding) and UMAP (uniform manifold approximation and projection) have gained much popularity in the literature.

These methods attempt to find a low-dimensional representation of the data that preserves the distances between each point and its neighbors in the high-dimensional space.

set.seed(100)
sce <- runTSNE(sce, dimred="PCA")
plotTSNE(sce)

set.seed(111)
sce <- runUMAP(sce, dimred="PCA")
plotUMAP(sce)

It is easy to over-interpret t-SNE and UMAP plots. We note that the relative sizes and positions of the visual clusters may be misleading, as they tend to inflate dense clusters and compress sparse ones, such that we cannot use the size as a measure of subpopulation heterogeneity.

In addition, these methods are not guaranteed to preserve the global structure of the data (e.g., the relative locations of non-neighboring clusters), such that we cannot use their positions to determine relationships between distant clusters.

Note that the sce object now includes all the computed dimensionality reduced representations of the data for ease of reusing and replotting without the need for recomputing.

sce

## class: SingleCellExperiment 
## dim: 29453 2437 
## metadata(0):
## assays(2): counts logcounts
## rownames(29453): ENSMUSG00000051951 ENSMUSG00000089699 ...
##   ENSMUSG00000095742 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames(2437): AAACCTGAGACTGTAA AAACCTGAGATGCCTT ... TTTGGTTTCAGTCAGT
##   TTTGGTTTCGCCATAA
## colData names(8): sum detected ... discard sizeFactor
## reducedDimNames(3): PCA TSNE UMAP
## mainExpName: NULL
## altExpNames(0):

Despite their shortcomings, t-SNE and UMAP may be useful visualization techniques. When using them, it is important to consider that they are stochastic methods that involve a random component (each run will lead to different plots) and that there are key parameters to be set that change the results substantially (e.g., the “perplexity” parameter of t-SNE).

Doublet identification

Doublets are artifactual libraries generated from two cells. They typically arise due to errors in cell sorting or capture. Specifically, in droplet-based protocols, it may happen that two cells are captured in the same droplet.

Doublets are obviously undesirable when the aim is to characterize populations at the single-cell level. In particular, doublets can be mistaken for intermediate populations or transitory states that do not actually exist. Thus, it is desirable to identify and remove doublet libraries so that they do not compromise interpretation of the results.

It is not easy to computationally identify doublets as they can be hard to distinguish from transient states and/or cell populations with high RNA content. When possible, it is good to rely on experimental strategies to minimize doublets, e.g., by using genetic variation (e.g., pooling multiple donors in one run) or antibody tagging (e.g., CITE-seq).

There are several computational methods to identify doublets; we describe only one here based on in-silico simulation of doublets.

Computing doublet desities

At a high level, the algorithm can be defined by the following steps:

Simulate thousands of doublets by adding together two randomly chosen single-cell profiles.
For each original cell, compute the density of simulated doublets in the surrounding neighborhood.
For each original cell, compute the density of other observed cells in the neighborhood.
Return the ratio between the two densities as a “doublet score” for each cell.

Intuitively, if a “cell” is surrounded only by simulated doublets is very likely to be a doublet itself.

This approach is implemented below, where we visualize the scores in a t-SNE plot.

library(scDblFinder)
set.seed(100)

dbl.dens <- computeDoubletDensity(sce, subset.row=hvg.sce.var,
                                  d=ncol(reducedDim(sce)))
summary(dbl.dens)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.002014 0.000000 1.886238

sce$DoubletScore <- dbl.dens

plotTSNE(sce, colour_by="DoubletScore")

We can explicitly convert this into doublet calls by identifying large outliers for the score within each sample. Here we use the “griffiths” method to do so.

dbl.calls <- doubletThresholding(data.frame(score=dbl.dens),
                                 method="griffiths", returnType="call")
summary(dbl.calls)

## singlet doublet 
##    2407      30

sce$doublet <- dbl.calls
plotColData(sce, y="DoubletScore", colour_by="doublet")

plotTSNE(sce, colour_by="doublet")

One way to determine whether a cell is in a real transient state or it is a doublet is to check the number of detected genes and total UMI counts.

plotColData(sce, "detected", "sum", colour_by = "DoubletScore")

plotColData(sce, "detected", "sum", colour_by = "doublet")

In this case, we only have a few doublets at the periphery of some clusters. It could be fine to keep the doublets in the analysis, but it may be safer to discard them to avoid biases in downstream analysis (e.g., differential expression).

Session Info

sessionInfo()

## R Under development (unstable) (2023-11-22 r85609)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scDblFinder_1.17.0           scran_1.31.0                
##  [3] scater_1.31.1                ggplot2_3.4.4               
##  [5] scuttle_1.13.0               EnsDb.Mmusculus.v79_2.99.0  
##  [7] ensembldb_2.27.1             AnnotationFilter_1.27.0     
##  [9] GenomicFeatures_1.55.1       AnnotationDbi_1.65.2        
## [11] DropletUtils_1.23.0          MouseGastrulationData_1.17.0
## [13] SpatialExperiment_1.13.0     SingleCellExperiment_1.25.0 
## [15] SummarizedExperiment_1.33.0  Biobase_2.63.0              
## [17] GenomicRanges_1.55.1         GenomeInfoDb_1.39.1         
## [19] IRanges_2.37.0               S4Vectors_0.41.2            
## [21] BiocGenerics_0.49.1          MatrixGenerics_1.15.0       
## [23] matrixStats_1.1.0            BiocStyle_2.31.0            
## 
## loaded via a namespace (and not attached):
##   [1] later_1.3.1                   BiocIO_1.13.0                
##   [3] bitops_1.0-7                  filelock_1.0.2               
##   [5] tibble_3.2.1                  R.oo_1.25.0                  
##   [7] XML_3.99-0.15                 lifecycle_1.0.4              
##   [9] edgeR_4.1.2                   rprojroot_2.0.4              
##  [11] MASS_7.3-60.1                 lattice_0.22-5               
##  [13] magrittr_2.0.3                limma_3.59.1                 
##  [15] sass_0.4.7                    rmarkdown_2.25               
##  [17] jquerylib_0.1.4               yaml_2.3.7                   
##  [19] metapod_1.11.0                httpuv_1.6.12                
##  [21] DBI_1.1.3                     abind_1.4-5                  
##  [23] zlibbioc_1.49.0               Rtsne_0.16                   
##  [25] purrr_1.0.2                   R.utils_2.12.3               
##  [27] BumpyMatrix_1.11.0            RCurl_1.98-1.13              
##  [29] rappdirs_0.3.3                GenomeInfoDbData_1.2.11      
##  [31] ggrepel_0.9.4                 irlba_2.3.5.1                
##  [33] dqrng_0.3.1                   pkgdown_2.0.7                
##  [35] DelayedMatrixStats_1.25.1     codetools_0.2-19             
##  [37] DelayedArray_0.29.0           xml2_1.3.5                   
##  [39] tidyselect_1.2.0              farver_2.1.1                 
##  [41] ScaledMatrix_1.11.0           viridis_0.6.4                
##  [43] BiocFileCache_2.11.1          GenomicAlignments_1.39.0     
##  [45] jsonlite_1.8.7                BiocNeighbors_1.21.0         
##  [47] ellipsis_0.3.2                systemfonts_1.0.5            
##  [49] tools_4.4.0                   progress_1.2.2               
##  [51] ragg_1.2.6                    Rcpp_1.0.11                  
##  [53] glue_1.6.2                    gridExtra_2.3                
##  [55] SparseArray_1.3.1             xfun_0.41                    
##  [57] dplyr_1.1.4                   HDF5Array_1.31.0             
##  [59] withr_2.5.2                   formatR_1.14                 
##  [61] BiocManager_1.30.22           fastmap_1.1.1                
##  [63] bluster_1.13.0                rhdf5filters_1.15.1          
##  [65] fansi_1.0.5                   digest_0.6.33                
##  [67] rsvd_1.0.5                    R6_2.5.1                     
##  [69] mime_0.12                     textshaping_0.3.7            
##  [71] colorspace_2.1-0              biomaRt_2.59.0               
##  [73] RSQLite_2.3.3                 R.methodsS3_1.8.2            
##  [75] utf8_1.2.4                    generics_0.1.3               
##  [77] data.table_1.14.8             FNN_1.1.3.2                  
##  [79] rtracklayer_1.63.0            prettyunits_1.2.0            
##  [81] httr_1.4.7                    S4Arrays_1.3.1               
##  [83] uwot_0.1.16                   pkgconfig_2.0.3              
##  [85] gtable_0.3.4                  blob_1.2.4                   
##  [87] XVector_0.43.0                htmltools_0.5.7              
##  [89] ProtGenerics_1.35.0           scales_1.3.0                 
##  [91] png_0.1-8                     knitr_1.45                   
##  [93] rjson_0.2.21                  curl_5.1.0                   
##  [95] cachem_1.0.8                  rhdf5_2.47.0                 
##  [97] stringr_1.5.1                 BiocVersion_3.19.1           
##  [99] parallel_4.4.0                vipor_0.4.5                  
## [101] restfulr_0.0.15               desc_1.4.2                   
## [103] pillar_1.9.0                  grid_4.4.0                   
## [105] vctrs_0.6.4                   promises_1.2.1               
## [107] BiocSingular_1.19.0           dbplyr_2.4.0                 
## [109] beachmat_2.19.0               cluster_2.1.5                
## [111] xtable_1.8-4                  beeswarm_0.4.0               
## [113] evaluate_0.23                 magick_2.8.1                 
## [115] cli_3.6.1                     locfit_1.5-9.8               
## [117] compiler_4.4.0                Rsamtools_2.19.2             
## [119] rlang_1.1.2                   crayon_1.5.2                 
## [121] labeling_0.4.3                fs_1.6.3                     
## [123] ggbeeswarm_0.7.2              stringi_1.8.2                
## [125] viridisLite_0.4.2             BiocParallel_1.37.0          
## [127] munsell_0.5.0                 Biostrings_2.71.1            
## [129] lazyeval_0.2.2                Matrix_1.6-3                 
## [131] ExperimentHub_2.11.0          hms_1.1.3                    
## [133] sparseMatrixStats_1.15.0      bit64_4.0.5                  
## [135] Rhdf5lib_1.25.0               KEGGREST_1.43.0              
## [137] statmod_1.5.0                 shiny_1.8.0                  
## [139] interactiveDisplayBase_1.41.0 highr_0.10                   
## [141] AnnotationHub_3.11.0          igraph_1.5.1                 
## [143] memoise_2.0.1                 bslib_0.6.1                  
## [145] xgboost_1.7.5.1               bit_4.0.5

Exercises

Normalization: Here we used the deconvolution method implemented in scran based on a previous clustering step. Use the calculateSumFactors to compute the size factors without considering a preliminary clustering. Compare the resulting size factors via a scatter plot. How do the results change? What are the risks of not including clustering information?
An alternative to PCA for dimensionality reduction is the NewWave method. Apply the NewWave method to the SingleCellExperiment object and visually compare the first two dimensions with the first two principal components. What are the major differences in terms of results?

Hint: First subset the object to include only highly variable genes (sce2 <- sce[hvg.sce.var,]) and then apply the NewWave function to the new object setting K=10 to obtain the first 10 dimensions.

PBMC Data: The package DropletTestFiles includes the raw output from Cell Ranger of the peripheral blood mononuclear cell (PBMC) dataset from 10X Genomics, publicly available from the 10X Genomics website. Repeat the analysis of this vignette using those data.

References

Borella, Matteo, Graziano Martello, Davide Risso, and Chiara Romualdi. 2022. “PsiNorm: A Scalable Normalization for Single-Cell RNA-Seq Data.” Bioinformatics 38 (1): 164–72.

Cole, Michael B, Davide Risso, Allon Wagner, David DeTomaso, John Ngai, Elizabeth Purdom, Sandrine Dudoit, and Nir Yosef. 2019. “Performance Assessment and Selection of Normalization Procedures for Single-Cell RNA-Seq.” Cell Systems 8 (4): 315–28.

Lun, Aaron T, Karsten Bach, and John C Marioni. 2016. “Pooling Across Cells to Normalize Single-Cell RNA Sequencing Data with Many Zero Counts.” Genome Biology 17 (1): 1–14.

Lun, Aaron TL, Samantha Riesenfeld, Tallulah Andrews, Tomas Gomes, John C Marioni, et al. 2019. “EmptyDrops: Distinguishing Cells from Empty Droplets in Droplet-Based Single-Cell RNA Sequencing Data.” Genome Biology 20 (1): 1–9.

Vallejos, Catalina A, Davide Risso, Antonio Scialdone, Sandrine Dudoit, and John C Marioni. 2017. “Normalizing Single-Cell RNA Sequencing Data: Challenges and Opportunities.” Nature Methods 14 (6): 565–71.