The below code document are tips and tricks for common downstream analysis next steps. Each section contains information on how to perform scRNA-seq analysis, including identification of low quality cells, merging and removal of clusters/cells, reclustering, changing cell type annotations and identities, etc.
Please contact support@honeycomb.bio for all questions.
The below packages needs to be installed prior to running this vignette. Code for package installation is below. There may be additional dependencies required by these packages you may have to install depending on your R environment.
install.packages("htmltools")
install.packages("tidyverse")
install.packages("patchwork")
install.packages("plyr")
install.packages('Seurat')
if (!require("BiocManager", quietly = TRUE)){
install.packages("BiocManager")
}::install("UCell")
BiocManagerinstall.packages("ggpubr")
::install("SingleCellExperiment")
BiocManager::install("scuttle")
BiocManager::install("celldex")
BiocManagerinstall.packages("DT")
if (!requireNamespace("remotes", quietly = TRUE)) {
install.packages("remotes")
}install.packages("HGNChelper")
Load libraries using the code below:
suppressPackageStartupMessages({
library(tidyverse)
library(patchwork)
library(plyr)
library(Seurat)
library(UCell)
library(ggpubr)
library(SingleCellExperiment)
library('scuttle')
library(celldex)
library(DT)
library(scales)
library(HGNChelper)
})
The code below loads your R object from the Google Cloud Platform Bucket. You will need to replace code where outlined below:
system("gsutil -m cp -r gs://resources.honeycomb.bio/honeycomb-public-scripts/HC-logo.png . ")
system("mkdir -p data")
#Replace the URL with the link to the R object output by BeeNetPLUS or your own downstream R object. Navigate to your Google Cloud output directory and locate the .tgz object or R object and copy the path into the code below:
system("gsutil -m cp gs://hcb-external-pipeline-testing/20230426-TestData/Analysis-20230426-300-600/Analysis-20230426-300-600.tgz ./data")
#Unzip the file if it is a .tgz file
system("tar -zxf data/*tgz -C ./data")
Next, load your R object into the environment. This tutorial is based on an R object named “obj”. For BeeNetPLUSv1.0 users, change the name of the object using the code below.
#Replace the file path for the R object and load in the R object
load("./data/Analysis-20230426-300-600/FullDataset-20230426-TestData.Rdata")
#This should load a Seurat object called umi_seurat and several other data frames and variables that contain pipeline parameters and other quality control metrics.
# Rename the umi_seurat object obj for this analysis
<- umi_seurat
objrm(umi_seurat)
# The code below creates a data frame of metadata included in the Seurat object
<- rownames_to_column(obj@meta.data, "cellids")
metadata
datatable(metadata)
## Warning in instance$preRenderHook(instance): It seems your data is too big for
## client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
You can set the Cell Identity to any column in the object metadata. After clustering, the Ident is automatically the cluster number, but this can be changed to cell type, sample name, or group depending on the downstream analysis to be performed.
##Set the Identity of the cells to the seurat cluster
Idents(obj)<- obj$seurat_clusters
DimPlot(obj, reduction="umap")
Change the “level” of the metadata parameter “seurat_clusters” so that the numbers are in order. Releveling can be performed for any metadata and changed at any time.
$seurat_clusters<- factor(obj$seurat_clusters, levels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18"))
objIdents(obj)<- obj$seurat_clusters
DimPlot(obj, reduction="umap")
A high quality UMAP contains:
DimPlot(obj,pt.size = 0.2, cells.highlight=WhichCells(obj, idents = 4), cols.highlight = "orange", cols= "grey")
A low quality cluster on a UMAP may:
DimPlot(obj,pt.size = 0.2, cells.highlight=WhichCells(obj, idents = 12), cols.highlight = "orange", cols= "grey")
In this example, we think Cluster 12 might be low quality due to its positioning on the UMAP. We also want to check to see if there are distinct marker genes that define cluster 12, as well as the overall numbers of genes, transcripts, and mitochondrial reads
#Get the median QC parameters for the whole object
median(obj$nCount_RNA)
## [1] 2357
median(obj$nFeature_RNA)
## [1] 1102
median(obj$percent.mito)
## [1] 0.0620623
#Isolate c12 and get the QC parameters
<- subset(obj, subset = seurat_clusters=="12")
c12median(c12$nCount_RNA)
## [1] 781.5
median(c12$nFeature_RNA)
## [1] 486.5
median(c12$percent.mito)
## [1] 0.07668692
#Find Markers for just c12
<- FindMarkers(object = obj, ident.1 = 12)
markers12 datatable(markers12[1:5,])
As cluster 12 does not isolate on its own, has no biologically significant marker genes (log2fc <-0.58 or >0.58) and has significantly lower genes and transcripts per cell than the rest of the object, this cluster is a good candidate for removal. Please note that certain cell types (e.g. granulocytes) have lower numbers of genes and transcripts than others, and therefore looking at marker genes and UMAP positioning is necessary to help determine whether a cluster should be removed.
#Remove Cluster 12
<- subset(obj, subset = seurat_clusters !="12")
obj
#Recluster
<- NormalizeData(obj)
obj <- ScaleData(obj)
obj <- FindVariableFeatures(obj, selection.method = "vst", nfeatures = 2000)
obj <-RunPCA(obj, verbose = FALSE)
obj<-RunUMAP(obj, dims=1:30,verbose = FALSE)
obj<- FindNeighbors(obj,verbose = FALSE)
obj<- FindClusters(obj, verbose = FALSE)
obj=BuildClusterTree(obj,reorder=TRUE,reorder.numeric=TRUE)
obj
#Add scTransform as another option
<- SCTransform(obj, vars.to.regress = "percent.mito", verbose = FALSE)
obj <-RunPCA(obj, verbose = FALSE)
obj<-RunUMAP(obj, dims=1:18,verbose = FALSE)
obj<- FindNeighbors(obj,verbose = FALSE)
obj<- FindClusters(obj, verbose = FALSE)
obj=BuildClusterTree(obj,reorder=TRUE,reorder.numeric=TRUE)
obj$seurat_clusters<- obj$tree.ident
obj
DimPlot(obj, reduction="umap")
Iterate through removing low quality clusters one at a time until high quality cells remain. Be sure to re-do any Violin Plots, QC plots/tables, and UMAPs with the updated Seurat object.
#Redo the QC
<- function(obj){
qc_metrics_sample=obj@meta.data$orig.ident
I=obj@meta.data$ExonReads
Reads=obj@meta.data$reads.mapped
ReadsM=obj@meta.data$reads.Total
ReadsT=obj@meta.data$nCount_RNA
trans=obj@meta.data$nFeature_RNA
Genes=obj@meta.data$percent.mito
Mito=obj@meta.data$Complexity
Complexity=length(unique(obj$groupSamp))
nSamples=vector(mode="double",length=nSamples)
readEC=totReads.df
readsTot=colnames(readsTot)
readsTotalSamplesnames(readEC)=sort(unique(obj$groupSamp))
=readEC
readTC=readEC
readMC=readEC
transC=readEC
geneC=readEC
mitoC=readEC
nCells=readEC
compl=readEC
readsCellTotal=readEC
readsSampleTotal=readEC
FracCellReads=readEC
readsSampleMapped=readEC
readsSampleFiltered=readEC
readsSampleExon=readEC
percExon=readEC
percFilt=readEC
nF=readEC
r=readEC
v1=readEC
v2=readEC
v3=readEC
date=readEC
sT=readEC
DS=readEC
G=readEC
T=readEC
BC=readEC
seqsat=readEC
sampleName=readEC
sampleGroup<-sort(unique(obj$orig.ident))
sampleslist
for (x in 1:nSamples) {
=median(ReadsT[is.na(match(I,sampleslist[x]))==0])
readTC[x]=median(Reads[is.na(match(I,sampleslist[x]))==0])
readEC[x]=median(ReadsM[is.na(match(I,sampleslist[x]))==0])
readMC[x]=median(trans[is.na(match(I,sampleslist[x]))==0])
transC[x]=median(Genes[is.na(match(I,sampleslist[x]))==0])
geneC[x]=median(Mito[is.na(match(I,sampleslist[x]))==0])
mitoC[x]=median(Complexity[is.na(match(I,sampleslist[x]))==0])
compl[x]=sum(is.na(match(I,sampleslist[x]))==0)
nCells[x]=sum(ReadsT[is.na(match(I,sampleslist[x]))==0])
readsCellTotal[x]=sampleslist[x]
sampleName[x]=readsTot[1,is.na(match(readsTotalSamples,sampleslist[x]))==0]
readsSampleTotal[x]=readsCellTotal[x]/readsSampleTotal[x]
FracCellReads[x]=readsTot[3,is.na(match(readsTotalSamples,sampleslist[x]))==0]
readsSampleMapped[x]=readsTot[2,is.na(match(readsTotalSamples,sampleslist[x]))==0]
readsSampleFiltered[x]=readsTot[4,is.na(match(readsTotalSamples,sampleslist[x]))==0]
readsSampleExon[x]=readEC[x]/readTC[x]*100
percExon[x]=readsSampleFiltered[x]/readsSampleTotal[x]
percFilt[x]=gsub("-","",Sys.Date())
date[x]=nGeneI
G[x]=nTranI
T[x]=1-(sum(trans[is.na(match(I,sampleslist[x]))==0] / sum(ReadsM[is.na(match(I,sampleslist[x]))==0])))
seqsat[x]
}
=data.frame(date,sampleName,G, T, readTC,
clustInfo
readMC,readEC,transC,geneC,mitoC,nCells,
compl,readsSampleTotal,readsSampleFiltered,
readsSampleMapped,readsSampleExon,readsCellTotal,
FracCellReads,percExon,percFilt, seqsat)
colnames(clustInfo)=c("AnalysisDate", "SampleID", "GeneThreshold",
"TranscriptThreshold", "TotalReads", "MappedReads",
"ExonReads", "nTrans", "nGenes", "percMito" ,"nCells",
"Complexity", "SampleTotalReads", "SampleFilteredReads",
"SampleMappedReads", "SampleExonReads", "HQCellsTotalReads",
"FracReadsHQCells", "PctExon", "FracPassFilter", "SeqSat")
write.table(clustInfo,"SamplesMetrics.txt",sep="\t",col.names=NA)
return(clustInfo)
}
<- qc_metrics_sample(obj)
sample_metrics
datatable(sample_metrics)
If two or more clusters have the same marker genes that define them, with no distinct marker genes distinguishing between either of the two clusters (see below section for details on how to compare two clusters), these clusters can be merged to form one cluster. We do not recommend overwriting the original clustering; rather, we suggest making a new column in the metadata and reassigning clusters there.
#Copy the original clusters to a new column called new.cluster.ids, releveling to make sure they are in the correct order
$new.cluster.ids<- factor(obj$seurat_clusters, levels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18"))
obj
#Set the cell identities to the new cell clusters
Idents(obj)<- obj$new.cluster.ids
#Set new Idents to combine original clusters 6 and 7 into one cluster (6), and then renumbering the rest so that they go in order
<- c("1", "2", "3", "4", "5", "6", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17")
new.cluster.idsnames(new.cluster.ids) <- levels(obj)
<- RenameIdents(obj, new.cluster.ids)
obj DimPlot(obj)
Note that when you rename, merge, or remove clusters, subsequent analyses will need to be repeated. Rerun plot and figure generation, marker gene analysis and cluster QC parameters with the new cluster numbers for consistency in your data!
We recommend using a min.pct of 0.25 to find marker genes. This ensures that markers are frequently expressed in at least one group.
#Find markers for cluster 1 (ident.1)
Idents(obj)<- obj$seurat_clusters
<- FindMarkers(object = obj, ident.1 = 1, min.pct = 0.25)
cluster1.markers datatable(cluster1.markers)
#Compare clusters 1 and 2 directly
Idents(obj)<- obj$seurat_clusters
<- FindMarkers(object = obj, ident.1 = 1, ident.2 = 2, min.pct = 0.25)
cluster1v2.markers datatable(cluster1v2.markers)
#Compare markers in B cells versus Monocytes
Idents(obj)<- obj$cell_type
<- FindMarkers(object = obj, ident.1 = "Naive B cells", ident.2 = "Monocyte", min.pct = 0.25)
clusterbvm.markers datatable(clusterbvm.markers)
#Compare markers in B cells in 7.5k group versus the 30k group
Idents(obj)<- obj$cell_type
<- FindMarkers(object = obj, ident.1 = "7.5k", ident.2 = "30k", group.by = "group", subset.ident = "Naive B cells", min.pct = 0.25)
group.markers datatable(group.markers)
We do not recommend overwriting the original cell type annotation; rather, we suggest making a new column in the metadata and reassigning cell types there. In this object, cells that were annotated as “Neutrophil-2” and “Neutrophil-3” are actually monocytes, based on their positioning on the UMAP and marker gene expression. Here we will clean up the data and reassign these cells. Be sure to re-perform all downstream analyses pertaining to cell types after reassignment!
$cell_type_new<- obj$cell_type
obj$cell_type_new[obj$cell_type == "Neutrophil-2"]<- "Monocyte"
obj$cell_type_new[obj$cell_type == "Neutrophil-3"]<- "Monocyte"
objDimPlot(obj, group.by = "cell_type_new")
Be sure to save your final object! Include all metadata tables and
environment variables that you want to include in your final saved R
object Here we saved our object and metadata as a new file called
FilteredDataset-
# The code below preserves environment variables for BeeNetPLUSv1.0.
save(obj, totReads.df, dataNames, input, sampletype, known_markers,nFastq, len, refs, beeV, wdlV, dataset, nGeneI, nTranI, clustGroupCountsN, clustCountsN2, clustCountsN3, clustGroupCountsN2, file=paste0("FilteredDataset-", dataset, ".Rdata"))
Not all the variables above are present for BeeNet users, but consider at least saving the final Seurat object, for downstream use. The following code saves only the Seurat object. Any other variables, functions, or data frames made in this tutorial will have to be regenerated in future analysis if not saved.
save(obj, "FilteredDataset-NAMEofDATASET-SeuratObjectOnly.Rdata)
Setting | Value |
---|---|
version | R version 4.3.0 (2023-04-21) |
os | macOS Monterey 12.6.3 |
system | x86_64, darwin21.6.0 |
ui | unknown |
language | (EN) |
collate | en_US.UTF-8 |
ctype | en_US.UTF-8 |
tz | America/Los_Angeles |
date | 2023-05-22 |
pandoc | 2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown) |
Package | Loaded version | Date |
---|---|---|
Biobase | 2.60.0 | 2023-04-25 |
BiocGenerics | 0.46.0 | 2023-04-25 |
celldex | 1.10.0 | 2023-04-27 |
dplyr | 1.1.2 | 2023-04-20 |
DT | 0.28 | 2023-05-18 |
forcats | 1.0.0 | 2023-01-29 |
GenomeInfoDb | 1.36.0 | 2023-04-25 |
GenomicRanges | 1.52.0 | 2023-04-25 |
ggplot2 | 3.4.2 | 2023-04-03 |
ggpubr | 0.6.0 | 2023-02-10 |
HGNChelper | 0.8.1 | 2019-10-24 |
IRanges | 2.34.0 | 2023-04-25 |
lubridate | 1.9.2 | 2023-02-10 |
MatrixGenerics | 1.12.0 | 2023-04-25 |
matrixStats | 0.63.0 | 2022-11-18 |
patchwork | 1.1.2 | 2022-08-19 |
plyr | 1.8.8 | 2022-11-11 |
purrr | 1.0.1 | 2023-01-10 |
readr | 2.1.4 | 2023-02-10 |
S4Vectors | 0.38.1 | 2023-05-02 |
scales | 1.2.1 | 2022-08-20 |
scuttle | 1.10.1 | 2023-05-02 |
Seurat | 4.3.0 | 2022-11-18 |
SeuratObject | 4.1.3 | 2022-11-07 |
SingleCellExperiment | 1.22.0 | 2023-04-25 |
stringr | 1.5.0 | 2022-12-02 |
SummarizedExperiment | 1.30.1 | 2023-05-01 |
tibble | 3.2.1 | 2023-03-20 |
tidyr | 1.3.0 | 2023-01-24 |
tidyverse | 2.0.0 | 2023-02-22 |
UCell | 2.4.0 | 2023-04-25 |