{cscb.tools}
is a set of utility wrapper functions designed to work with a Seurat-based approach to single-cell analysis. The standard workflow for CSCB’s Seurat-based preprocessing steps are as follows:
This vignette will provide an overview of each/all of these steps in the context of a single-cell analysis workflow.
The following libraries are required for the proper functioning of these tools:
require(Seurat)
require(ggplot2)
require(DoubletFinder)
require(SoupX)
require(sva)
require(reticulate)
Additionally, you will need to set up reticulate to use a conda environment with ScanPy. The default environment is r-reticulate
, so it may be easiest to install to that conda environment:
reticulate::use_condaenv("{Insert Conda Environment Here")
or
reticulate::conda_install("r-reticulate", "ScanPy")
Below, we will overview the basic steps in implementing these tools in a single-cell RNA seq analysis.
If loading 10x CellRanger multi
results as part of a VDJ or multiome experiment, you can use soupx_load_multi
to consolidate all relevant data into a singular SoupX object that can be later converted to a Seurat object.
# Obtain paths for all samples
<-
gex_multi_soup_raw ::getAbsolutePath(Sys.glob(
R.utilsfile.path(
"../../SCTC-X/Data_Analysis/xxx-vdj-gex/code/02-multi/runs/*_VDJ_GEX/outs/multi/count/raw_feature_bc_matrix/"
)
))
<-
gex_multi_soup_filtered ::getAbsolutePath(Sys.glob(
R.utilsfile.path(
"../../SCTC-X/Data_Analysis/xxx-vdj-gex/code/02-multi/runs/*_VDJ_GEX/outs/per_sample_outs/*_VDJ_GEX/count/sample_filtered_feature_bc_matrix/"
)
))
<-
gex_multi_soup_clusters ::getAbsolutePath(Sys.glob(
R.utilsfile.path(
"../../SCTC-X/Data_Analysis/xxx-vdj-gex/code/02-multi/runs/*_VDJ_GEX/outs/per_sample_outs/*_VDJ_GEX/count/analysis/clustering/gene_expression_graphclust/clusters.csv"
)
))
# setting these to null is important for cell-naming purposes;
# Seurat will use the relative path set in the name of the absolute paths
# to generate cell barcodes. Not good.
names(gex_multi_soup_raw) <- NULL
names(gex_multi_soup_filtered) <- NULL
names(gex_multi_soup_clusters) <- NULL
<- list() gex_obj
adjustCounts
generates plots, so be sure to capture them using pdf()
:
pdf(
here("processed/xxx/SoupX_contam_frac.pdf"),
height = 8,
width = 11
)
for (i in 1:length(gex_multi_soup_raw)) {
<-
gex_obj[[i]] soupx_load_multi(gex_multi_soup_raw[i],
gex_multi_soup_filtered[i],
gex_multi_soup_clusters[i])<- autoEstCont(gex_obj[[i]], soupQuantile = 0.5)
gex_obj[[i]] <- adjustCounts(gex_obj[[i]])
gex_obj[[i]]
}
dev.off()
Now, you may use the SoupX ambient RNA calculations to create Seurat objects:
<- c("a", "b", "c", "d", "e", "f")
gex_names
for (i in 1:length(gex_obj)) {
<-
gex_obj[[i]] CreateSeuratObject(gex_obj[[i]], project = gex_names[i])
}
names(gex_obj) <- gex_names
get_percent
calculates both the percent mitochondrial and percent ribosomal genes in every cell. It does so by calling Seurat’s PercentageFeatureSet
for genes beginning with the strings MT
or RPS
.
for (i in 1:length(gex_obj)) {
<- get_percent(gex_obj[[i]])
gex_obj[[i]] }
You can then plot the number of read counts vs the number of genes using seurat_counts
as well as the percentages of mitochondrial and ribosomal genes per cell using plot_mito
:
seurat_counts(gex_obj[[1]], count.cutoff = 22000) + ggtitle(gex_names[[1]])
plot_mito(gex_obj[[1]], title = gex_names[[1]])
The most involved and potentially important function in our single cell analysis workflow is seurat_process
. This function prunes cells that do not meet certain cutoff criteria and optionally adjusts the overall gene expression profile for cell cycle genes. Here, I will explicitly write out the default arguments, but note that as they are defaults, they do not need to be specified each time.
One argument, which you absolutely need to supply, is count.cutoff
. This value is the same as the one used in and can be obtained from seurat_counts
.
<- c(22000, 52000, 10000, 70000, 65000, 60000)
count_cutoffs
for (i in 1:length(gex_obj)) {
<-
gex_obj[[i]] seurat_process(
gex_obj[[i]],counts_name = gex_names[[i]],
count.cutoff = count_cutoffs[i],
mito.cutoff = 10,
rps.cutoff = 10,
cc_adjust = FALSE
) }
<- list()
gex_pk
for (i in 1:length(gex_obj)) {
message(crayon::red$underline$bold(
paste0("Finding Parameter Sweep with DoubletFinder for ", gex_names[[i]])
))<-
gex_pk[[i]] sum.sweep(gex_obj[[i]])
message(crayon::red$underline$bold(paste0("Saving ", gex_names[[i]])))
}
pdf(here("processed/01-5_3-comp/ParamSweepPeaks_count_cutoffs.pdf"))
plot_pk(gex_pk[[1]], title = gex_names[[1]])
plot_pk(gex_pk[[2]], title = gex_names[[2]])
plot_pk(gex_pk[[3]], title = gex_names[[3]])
plot_pk(gex_pk[[4]], title = gex_names[[4]])
plot_pk(gex_pk[[5]], title = gex_names[[5]])
plot_pk(gex_pk[[6]], title = gex_names[[6]])
dev.off()
<- list()
gex_doubs
# options(future.globals.maxSize = 2500 * 1024^2)
for (i in 1:length(gex_obj)) {
message(crayon::red$underline$bold(paste0(
"Finding Doublets with DoubletFinder for ", gex_names[[i]]
)))<-
gex_doubs[[i]] doubFinder(gex_obj[[i]],
sweep.stats.sample = gex_pk[[i]],
rm.doubs = TRUE)
message(crayon::red$underline$bold(paste0("Saving ", gex_names[[i]])))
}
pdf(
here("processed/xxx/GEX_doubletClusters_removed.pdf"),
height = 8,
width = 14
)
viz.doubs(gex_doubs[[1]], title = gex_names[[1]])
viz.doubs(gex_doubs[[2]], title = gex_names[[2]])
viz.doubs(gex_doubs[[3]], title = gex_names[[3]])
viz.doubs(gex_doubs[[4]], title = gex_names[[4]])
viz.doubs(gex_doubs[[5]], title = gex_names[[5]])
viz.doubs(gex_doubs[[6]], title = gex_names[[6]])
dev.off()
The process_combat
function is a helper function that is used to correct for batch effects in single-cell RNA sequencing (scRNA-seq) data using the ComBat algorithm. It takes in a seuratObj
object and a seuratBatch
variable as inputs, and returns the seuratObj
with corrected data in a new assay called combatBatch.
This function can be used to apply the ComBat algorithm in a streamlined manner within a Seurat workflow, making the analysis of scRNA-seq data more accurate by removing batch effects. Batch effect can be caused by technical variations between different batches of scRNA-seq data, and removing them is crucial for proper analysis and interpretation of the data.
Note that you can either merge the desired batches yourself prior to calling process_combat
, or you can create a vector of Seurat objects yourself and have the function do it for you by setting merge = TRUE
(default FALSE). Here we merged first.
for (i in 1:length(rna_seq)) {
<-
rna_seq[[i]] RenameCells(rna_seq[[i]],
new.names = sub("_[^_]+$", "", colnames(rna_seq[[i]])))
}
<-
ComBat_merge merge(
1]],
rna_seq[[y = c(
2]],
rna_seq[[3]],
rna_seq[[4]],
rna_seq[[5]],
rna_seq[[6]],
rna_seq[[1]],
gex_doubs[[2]],
gex_doubs[[3]],
gex_doubs[[4]],
gex_doubs[[5]],
gex_doubs[[6]]
gex_doubs[[
)
)
<- ComBat_merge[, ComBat_merge$doublet == "Singlet"]
ComBat_merge
<-
combat_proccess process_combat(ComBat_merge, seuratObj$comp.ident)
Sometimes you want to remove certain specific genes from your Seurat object, like mitochondrial genes (MT-
) or ribosomal genes (RPS
), or maybe you have a vector of gene names you want to remove instead. You can use del_genes
to do so.
<- del_genes(seuratObj, "HLA")
seuratObj # multiple genes
<- c("HLA", "RPS", "RPL", "IG")
gene_name_prefixes <- del_genes(seuratObj, gene_name_prefixes, assay = "SCT") seuratObj
The package SeuratDisk
is helpful for converting between HDF5 filetypes, but has certain kinks that are hard to work out. One of these kinks is the fact that converting a Seurat object to h5ad
causes an issue with the var names indexing. Fortunately, there is a solution. This function depends on having Python installed with ScanPy.
SaveH5Seurat(
samples.combined,filename = "processed/02-annotation/cellxgene_out_UNFILTERED_no4k_doublets.h5Seurat",
verbose = TRUE,
overwrite = TRUE
)
Convert(
source = "processed/02-annotation/cellxgene_out_UNFILTERED_no4k_doublets.h5Seurat",
dest = "h5ad",
assay = "combatBatch",
verbose = TRUE,
overwrite = TRUE
)
fix_cellxgene(file = "processed/02-annotation/cellxgene_out_UNFILTERED_no4k_doublets.h5ad")