This tutorial describes how to perform the downstream analysis on single-cell nano CUT&Tag data (2 biological replicates, 3 modalities).
This tutorial assumes that the data have already been preprocessed,
including demultiplexing of the different modalities, running cellranger
and custom cell calling (see Set Up and
Preprocessing).
Please note how in this tutorial we are describing how to
perform the downstream analysis by using peaks, not bins. For
tutorial on bins, see here.
In addition, note how the biological data here analysed are 2
biological replicates of the same sample, thus it is safe to merge the
two samples without any kind of integration. Often experiments are
designed in order to have multiple samples and/or biological conditions
requiring data integration, differential analysis, etc. These are not
part of this vignette, but will be implemented in other vignettes in the
future.
The downstream analysis implemented in this vignette is composed of 6
steps:
1. Data loading (peaks, metadata and
fragments)
2. Quality controls (QC)
3. Merge Seurat objects
4. Normalisation and dimensional
reduction
5. Non-linear dimension reduction and
clustering
6. Annotation
~/NatProt/nanoscope/
folderfragments.tsv.gz
, fragments.tsv.gz.tbi
,
metadata.csv
and peaks.broadPeak
files and
store them in specific folders (see below section “1.1 Setting some
initial variables”)In this section we are going to load the input files for each modality and sample, create a common set of peaks for each sample, create a fragment object for each experiment, generate a peaks x cell matrix for each experiment and finally generate one Seurat object for experiment.
fragments.tsv.gz
,
metadata.csv
and peaks.broadPeak
files.First, load all the required libraries.
library(Signac)
library(Seurat)
library(GenomicRanges)
library(future)
library(stringr)
library(ggplot2)
library(gghalves)
library(ggpubr)
library(EnsDb.Mmusculus.v79)
library(ComplexUpset)
library(regioneR)
library(scales)
library(ggVennDiagram)
First, define some variables that will be used across the entire
vignette. Modify them according to your data.
Extremely important:
repodir
the directory where the nanoscope
Github repo is. This is required in order to upload all the custom
functions needed to run this vignette# directory where the nanoscope repo was cloned
repodir <- "~/Documents/GitHub/nanoscope/" # change accordingly
source(paste0(repodir,"scripts/functions_scCT.R"))
wd
the output root directory.sample_P23209
and sample_P24004
in a directory called result
and set it as working directory
wd <- "~/NatProt/nanoscope/results/"
wd
will be
/path/to/nanoscope_toy_data/results/
wd
is formatted as follow:fragments.tsv.gz
is expected to be in a folder
with this path:
wd/{sample_name}/{modality_barcode}/cellranger/outs/
metadata.csv
is expected to be in a folder with
this path:
wd/{sample_name}/{modality_barcode}/cell_picking/
{modality}_peaks.broadPeak
is expected to be in a
folder with this path:
wd/{sample_name}/{modality_barcode}/peaks/macs_broad/
Please note that nothing but the directories containing the
input data for each samples must be stored in
wd
# directory where the nanoscope output files are stored
wd <- "~/NatProt/nanoscope/results" # change accordingly
setwd(wd)
# Set the used genome version
genome <- "mm10"
# Genome annotations
genome_ann <- EnsDb.Mmusculus.v79
# Assay (peaks or bins)
assay <- "peaks"
# Samples
samples <- dir()
# Modalities
modalities <- dir(samples[1]) # this assumes sample1, sample2, sampleN share the same modalities and barcodes!
# Minimum number of cells a feature must be detected in to be included in the Seurat obj
min_cell <- 1
# Minium number of features a cell must have to be included in the Seurat obj
min_feat <- 1
Here we are going to load the peaks data for each modality, for each
sample.
For convenience, files will be stored in a list.
# Read-in peak coordinates for each modality iterating across samples and modalities
input.ls <- list()
for (smpl in samples) {
cat("Loading peaks for sample",smpl,"\n")
for (mod in modalities) {
cat("\t",mod)
# get mod name withouth barcode
mod2 <- strsplit(mod,"_")[[1]][1]
# Load peaks called by macs for H3K27ac and H3K27me3 and by cellranger for ATAC
if (mod2 != "ATAC") {
cat("\t(macs peaks)\n")
input.ls[[paste0(mod,"_",smpl)]] <- read.table(paste0(smpl,"/",mod,"/peaks/macs_broad/",mod2,"_peaks.broadPeak"))[1:3]
colnames(input.ls[[paste0(mod,"_",smpl)]] ) <- c("chr", "start", "end")
} else {
cat("\t(cellranger peaks)\n")
input.ls[[paste0(mod,"_",smpl)]] <- read.table(paste0(smpl,"/",mod,"/cellranger/outs/peaks.bed"))[1:3]
colnames(input.ls[[paste0(mod,"_",smpl)]] ) <- c("chr", "start", "end")
}
}
}
## Loading peaks for sample sample_P23209
## ATAC_TATAGCCT (cellranger peaks)
## H3K27ac_ATAGAGGC (macs peaks)
## H3K27me3_CCTATCCT (macs peaks)
## Loading peaks for sample sample_P24004
## ATAC_TATAGCCT (cellranger peaks)
## H3K27ac_ATAGAGGC (macs peaks)
## H3K27me3_CCTATCCT (macs peaks)
Since the peaks were identified independently in each experiment, it is unlikely they will perfectly overlap. We thus need to merge into unique intervals the overlapping peaks from the different samples. Please note how the overlapping peaks among different samples will be joined into single intervals, whereas the peaks found in only one, or a subset of, sample will be left unaltered and will be part of the so-called “common set of peaks”.
This will be done by following the Signac vignette.
Having uploaded the peaks for all the modalities and samples, we will convert them into genomic ranges and create a common set of peaks in each sample by using GenomicRanges::reduce function. GenomicRanges::reduce will merge all the overlapping peaks still maintaining in the “common set of peaks” the peaks found in only one sample, as shown below:
First, convert peaks to genomic ranges
input.ls <- lapply(input.ls,function(x){ makeGRangesFromDataFrame(x) })
(We are going to frequently use the lapply
function.
lapply
applies a given R function to all the items (x) of a
given list)
Then, merge all the overlapping peaks, still maintaining in the “common set of peaks” the peaks found in only one sample. This way it is safe to merge objects from different samples into a single Seurat object (see below Merge Seurat objects)
combined.peaks.ls <- list()
for (mod in modalities) {
combined.peaks.ls[[mod]] <- reduce(x = c(input.ls[[paste0(mod,"_",samples[1])]],
input.ls[[paste0(mod,"_",samples[2])]]))
}
combined.peaks.ls
## $ATAC_TATAGCCT
## GRanges object with 116675 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 3094778-3095663 *
## [2] chr1 3333330-3334204 *
## [3] chr1 3367922-3368816 *
## [4] chr1 3399705-3400612 *
## [5] chr1 3466125-3467002 *
## ... ... ... ...
## [116671] JH584304.1 52767-54323 *
## [116672] JH584304.1 55003-55931 *
## [116673] JH584304.1 59270-60187 *
## [116674] JH584304.1 68483-69442 *
## [116675] JH584295.1 1466-1970 *
## -------
## seqinfo: 25 sequences from an unspecified genome; no seqlengths
##
## $H3K27ac_ATAGAGGC
## GRanges object with 70327 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 3042535-3043231 *
## [2] chr1 3325312-3325622 *
## [3] chr1 3368224-3368416 *
## [4] chr1 3412976-3413298 *
## [5] chr1 3514757-3515057 *
## ... ... ... ...
## [70323] JH584304.1 491-44360 *
## [70324] JH584304.1 48370-61344 *
## [70325] JH584304.1 68621-114400 *
## [70326] GL456393.1 40014-41631 *
## [70327] JH584292.1 13043-14058 *
## -------
## seqinfo: 37 sequences from an unspecified genome; no seqlengths
##
## $H3K27me3_CCTATCCT
## GRanges object with 57932 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 3361152-3362524 *
## [2] chr1 3667086-3678017 *
## [3] chr1 4412597-4416942 *
## [4] chr1 4425744-4431988 *
## [5] chr1 4487340-4509495 *
## ... ... ... ...
## [57928] JH584294.1 154717-155488 *
## [57929] JH584304.1 707-17155 *
## [57930] JH584304.1 23562-31962 *
## [57931] JH584304.1 59012-61030 *
## [57932] JH584304.1 68573-114394 *
## -------
## seqinfo: 35 sequences from an unspecified genome; no seqlengths
We might also be interested in knowing how many of the peaks are shared between the samples, and how many are, instead, “private” of each sample. Keep always in mind that all of these peaks, common and private, are part of the “common set of peaks”.
upset_plot <- getUpsetPeaks(modalities = modalities,
samples = samples,
combined_peaks_ls = combined.peaks.ls,
input_ls = input.ls)
upset_plot
To quantify our combined set of peaks in each modality and sample, we
need to create a fragment object for each experiment (i.e., an
object holding all the information related to a single fragment file).
To this end we will use the Signac function
CreateFragmentObject
which requires, for each experiment,
the fragments.tsv.gz
and the metadata.csv
files.
Let’s start by loading the metadata data!
##### Metadata
# Load metadata for each experiment iterating across the sample list
metadata.ls <- list()
for (smpl in samples) {
cat("Loading metadata for sample",smpl,"\n")
for (mod in modalities) {
cat("\t",mod,"\n")
# Load metadata
metadata.ls[[paste0(mod,"_",smpl)]] <- read.csv(paste0(smpl,"/",mod,"/cell_picking/metadata.csv"),
stringsAsFactors = FALSE)
rownames(metadata.ls[[paste0(mod,"_",smpl)]]) <- metadata.ls[[paste0(mod,"_",smpl)]]$barcode
}
}
## Loading metadata for sample sample_P23209
## ATAC_TATAGCCT
## H3K27ac_ATAGAGGC
## H3K27me3_CCTATCCT
## Loading metadata for sample sample_P24004
## ATAC_TATAGCCT
## H3K27ac_ATAGAGGC
## H3K27me3_CCTATCCT
Here we are going to exclude the cells which have not passed the
internal filtering (flagged as passedMB
). This internal
filtering takes into consideration the log10UMI and the percentage of
reads in peaks for each cell and clusterizes the cells based on these
values (by building a Gaussian finite mixture model fitted via EM
algorithm via Mclust
). This will cluster the cells in
n clusters, but only the cells belonging to the top1 cluster
with the highest log10UMI will be classified as
passedMB=TRUE
.
Before doing this, however, we would like to know how many cells we are filtering out:
plotPassed(metadata.ls,xaxis_text = 9,angle_x=60)
And which cells we are filtering out:
plotPassedCells(metadata.ls,samples,modalities)
In this case, you can see how the cells can be easily clustered in
three groups: empty droplets, weird droplets and good cells and how only
the good cells are selected. However, if you notice something weird,
please, re-do the cell selection by applying some hard cutoffs.
Once we have classified the good cells as passedMB=TRUE
,
we can proceed with the selection of the good cells only, discarding all
the others from further analyses:
metadata.ls <- lapply(metadata.ls, function(x){x[x$passedMB,]})
Having loaded the metadta files, we can now create the fragment
objects for each experiment.
As explained in the Signac
vignette, the CreateFragmentObject
function checks that
the file is present on disk, it is compressed and indexed, computes the
MD5 sum for the file and the tabix index so that we can tell if the file
is modified at any point, and checks that the expected cells
(metadata.csv
) are present in the file.
##### Fragments
# Create one fragment object per each experiment iterating across the sample list
fragment.ls <- list()
for (smpl in samples) {
cat("Loading peaks for sample",smpl,"\n")
for (mod in modalities) {
cat("\t",mod,"\n")
# Load fragments
fragment.ls[[paste0(mod,"_",smpl)]] <- CreateFragmentObject(
path = paste0(smpl,"/",mod,"/cellranger/outs/fragments.tsv.gz"),
cells = metadata.ls[[paste0(mod,"_",smpl)]]$barcode)
}
}
## Loading peaks for sample sample_P23209
## ATAC_TATAGCCT
## H3K27ac_ATAGAGGC
## H3K27me3_CCTATCCT
## Loading peaks for sample sample_P24004
## ATAC_TATAGCCT
## H3K27ac_ATAGAGGC
## H3K27me3_CCTATCCT
head(fragment.ls[[paste0("ATAC_TATAGCCT_",samples[1])]])
chrom | start | end | barcode | readCount |
---|---|---|---|---|
chr1 | 3003932 | 3004576 | GCCTAGGCAGTTACAC-1 | 1 |
chr1 | 3003985 | 3004021 | CTACTTAAGTTCAGGG-1 | 7 |
chr1 | 3005376 | 3005689 | CCAGATACAGCGTCGT-1 | 3 |
chr1 | 3005764 | 3005913 | CAAGCTAGTTTGCATG-1 | 1 |
chr1 | 3005861 | 3006012 | TCTCTGGTCCAGAGAG-1 | 1 |
chr1 | 3006223 | 3006256 | AACAGTCCATCTGCAA-1 | 3 |
Now that each experiment has been associated to a fragment object
containing fragments and metadata, we can create a peaks x cell matrix
for each experiment. This will be done by using the Signac function
FeatureMatrix
.
FeatureMatrix
basically constructs a peaks x cell matrix by
parsing: i) fragments, ii) peaks and iii) metadata. Please, note how the
features
here used to build the peaks x cell matrix are the
common set of peaks for each modalities identified in the previous step
(Creating a common set of peaks).
To speed-up the process, we launch FeatureMatrix
with
process_n = 20000
. This runs OK on our 64GB MacBook Pro,
but could go out-of-memory on other systems. If this happens, please,
decrease the process_n
value (2000 should be fine for every
system).
# As usual, append the object to a list
counts.ls <- list()
# Iterate among the experiment names in order to retrieve fragments and cell barcodes from the same experiment
for (experim in names(fragment.ls)) {
cat("Analysing experiment: ",experim,"\n")
# modality name
modal <- paste0(str_split_fixed(experim,"_",4)[,1],"_",str_split_fixed(experim,"_",4)[,2])
cat("\tPeaks from modality: ",modal,"\n")
# Create FeatureMatrix
counts.ls[[experim]] <- FeatureMatrix(fragments = fragment.ls[[experim]], # fragments
features = combined.peaks.ls[[modal]], # modality common set of peaks
cells = metadata.ls[[experim]]$barcode, # metadata
process_n = 20000) # the higher, the faster, the more memory is used
}
## Analysing experiment: ATAC_TATAGCCT_sample_P23209
## Peaks from modality: ATAC_TATAGCCT
## Analysing experiment: H3K27ac_ATAGAGGC_sample_P23209
## Peaks from modality: H3K27ac_ATAGAGGC
## Analysing experiment: H3K27me3_CCTATCCT_sample_P23209
## Peaks from modality: H3K27me3_CCTATCCT
## Analysing experiment: ATAC_TATAGCCT_sample_P24004
## Peaks from modality: ATAC_TATAGCCT
## Analysing experiment: H3K27ac_ATAGAGGC_sample_P24004
## Peaks from modality: H3K27ac_ATAGAGGC
## Analysing experiment: H3K27me3_CCTATCCT_sample_P24004
## Peaks from modality: H3K27me3_CCTATCCT
counts.ls[[paste0(modal,"_",samples[1])]][1:10,1:25]
## 10 x 25 sparse Matrix of class "dgCMatrix"
##
## chr1-3361152-3362524 . . . . . . . . . . . . . . . . . . . . . . . . .
## chr1-3667086-3678017 2 1 . . . . . . . . . . . . . 2 . 2 . . 2 . . . 3
## chr1-4412597-4416942 . . . . . . . . . . . . . 1 . . . . . . . . . . 1
## chr1-4425744-4431988 . . . . . . . . . . . . . . . . . . . . . . . . .
## chr1-4487340-4509495 8 . 2 . 5 10 1 . . . . . . 1 . 1 . . . . 4 13 2 2 7
## chr1-4514289-4515435 1 . . . . . . . . . . . . 1 . . . . . . . . . . 2
## chr1-4568747-4578581 1 . . . 1 . . . . . 1 . . . . . . . . . . 1 . . .
## chr1-4592282-4595271 . . . . . 1 . . . . . . . . . . . . . . . . . . .
## chr1-4687932-4689799 . . . . . . . . . . . . . . . . . . . . . . . . .
## chr1-4785566-4794296 . . . . . . . . . . . . . . . . . . . . . . . . .
Now that we have generated a peaks x cell matrix for each experiment, we can store it in a Seurat object along with the fragment object and the metadata.
This is done for each experiment separately, objects will be merged in the next step. The reason for this is that we would like to first perform QC on each experiment individually and then merge all the experiments in a single Seurat object.
obj.ls <- list()
# Iterate among the experiment names in order to retrieve fragments and cell barcodes from the same experiment
for (experim in names(counts.ls)) {
# Get sample and modality name from the experiment variable
smpl <- str_split_fixed(experim,"_",3)[,3]
modality <- str_split_fixed(experim,"_",2)[,1]
# Create the chromatin assay
chrom.assay <- CreateChromatinAssay(counts = counts.ls[[experim]],
fragments = fragment.ls[[experim]],
genome = genome,
min.cells = min_cell,
min.features = min_feat)
# Create the object
obj.ls[[experim]] <- CreateSeuratObject(counts = chrom.assay,
assay = assay,
meta.data=metadata.ls[[experim]],
project = smpl)
# Add some metadata of origin
obj.ls[[experim]]$dataset <- experim
obj.ls[[experim]]$modality <- modality
obj.ls[[experim]]$sample <- smpl
}
Before merging the objects, we are going to do some QC on the individual experiments. In particular, we are going to drop cells with abnormal number of UMIs and too low percentage of UMIs overlapping peaks. To define what is “abnormal” and what is not, we will use quantiles. Everything greater than the 95th or less than the 5th quantiles will be here considered as “abnormal” and therefore discarded.
Please, consider how applying these filters is optional and the thresholds used should be specifically set for each dataset as they usually are highly sample-specific. The thresholds we are applying here are quite harsh and, if applied to other datasets, might lead to removal of important biological information.
Please note how we are not here applying any filtering on the TSS enrichment score as while this might be relevant for ATAC it is probably not for H3K27ac and H3K27me3. In case, however, you would like to apply this filtering we recommend to follow Signac’s tutorial on this.
obj.ls.qc <- list()
quant_high <- 0.99
quant_low <- 0.01
for (experiment in names(obj.ls)) {
# Calculate 5th and 95th quantiles for logUMI
logUMI_cutoff_high <- quantile(obj.ls[[experiment]]$logUMI, quant_high)
logUMI_cutoff_low <- quantile(obj.ls[[experiment]]$logUMI, quant_low)
# Calculate 5th quantile for logUMI for % fragments in peaks (here indicated by peak_ratio_MB)
peak_ratio_MB_cutoff_low <- quantile(obj.ls[[experiment]]$peak_ratio_MB, quant_low)
# Filter out cells outside these thresholds
obj.ls.qc[[experiment]] <- subset(obj.ls[[experiment]],
logUMI > logUMI_cutoff_low &
logUMI < logUMI_cutoff_high &
peak_ratio_MB > peak_ratio_MB_cutoff_low)
# print number of discarded cells
old_n_cell <- nrow(obj.ls[[experiment]][[]])
new_n_cell <- nrow(obj.ls.qc[[experiment]][[]])
discarded <- old_n_cell - new_n_cell
cat(experiment,"\n")
cat("\tdiscarded",discarded,"cells (",round(discarded/old_n_cell*100,2),"%)\n")
}
## ATAC_TATAGCCT_sample_P23209
## discarded 81 cells ( 3.01 %)
## H3K27ac_ATAGAGGC_sample_P23209
## discarded 80 cells ( 3.01 %)
## H3K27me3_CCTATCCT_sample_P23209
## discarded 77 cells ( 2.86 %)
## ATAC_TATAGCCT_sample_P24004
## discarded 83 cells ( 2.95 %)
## H3K27ac_ATAGAGGC_sample_P24004
## discarded 85 cells ( 3 %)
## H3K27me3_CCTATCCT_sample_P24004
## discarded 85 cells ( 3.01 %)
plotCounts(obj = obj.ls, quantiles = c(quant_low,quant_high), feature = "logUMI")
plotCounts(obj = obj.ls, quantiles = c(quant_low), feature = "peak_ratio_MB",ylabel = "% UMI in peaks")
rm(obj.ls)
Now we can plot, for each sample, the number of cells being simultaneously profiled by all the three modalities. Since we will annotate the cells based on their H3K27ac signal (see 6. Annotation), all the cells lacking the H3K27ac profiling will result as not annotated at the end of the processing of the data. So it is important that the majority of the cells have H3K27ac profiling and that, in general, there is a good overlap among the different modalities.
# Plot number of cells for each modality for each sample
venn1 <- commonCellHistonMarks(mod1 = obj.ls.qc[[paste0(modalities[1],"_",samples[1])]],
name_mod1 = strsplit(modalities[1],"_")[[1]][1],
mod2 = obj.ls.qc[[paste0(modalities[2],"_",samples[1])]],
name_mod2 = gsub("H3","",strsplit(modalities[2],"_")[[1]][1]),
mod3 = obj.ls.qc[[paste0(modalities[3],"_",samples[1])]],
name_mod3 = gsub("H3","",strsplit(modalities[3],"_")[[1]][1]),
sample = samples[1] )
venn2 <- commonCellHistonMarks(mod1 = obj.ls.qc[[paste0(modalities[1],"_",samples[2])]],
name_mod1 = strsplit(modalities[1],"_")[[1]][1],
mod2 = obj.ls.qc[[paste0(modalities[2],"_",samples[2])]],
name_mod2 = gsub("H3","",strsplit(modalities[2],"_")[[1]][1]),
mod3 = obj.ls.qc[[paste0(modalities[3],"_",samples[2])]],
name_mod3 = gsub("H3","",strsplit(modalities[3],"_")[[1]][1]),
sample = samples[2] )
ggarrange(venn1,venn2,ncol=2)
Now that we have one object for experiment containing only QC-passed cells, and that all the experiments share a common set of peaks, we can merge all the objects from different samples obtaining one Seurat object for modality.
combined.obj.ls <- list()
for (mod in modalities) {
# merge objects among the different samples
combined.obj.ls[[mod]] <- merge(obj.ls.qc[[paste0(mod,"_",samples[1])]],
y = obj.ls.qc[[paste0(mod,"_",samples[2])]],
add.cell.ids = c(samples[1], samples[2]) )
}
combined.obj.ls[["ATAC_TATAGCCT"]][["peaks"]]
## ChromatinAssay data with 116675 features for 5345 cells
## Variable features: 0
## Genome: mm10
## Annotation present: FALSE
## Motifs present: FALSE
## Fragment files: 2
combined.obj.ls[["H3K27ac_ATAGAGGC"]][["peaks"]]
## ChromatinAssay data with 69613 features for 5324 cells
## Variable features: 0
## Genome: mm10
## Annotation present: FALSE
## Motifs present: FALSE
## Fragment files: 2
combined.obj.ls[["H3K27me3_CCTATCCT"]][["peaks"]]
## ChromatinAssay data with 57176 features for 5358 cells
## Variable features: 0
## Genome: mm10
## Annotation present: FALSE
## Motifs present: FALSE
## Fragment files: 2
After having filter out low quality cells and having generated one Seurat object per each modality, we can proceed with normalisation and dimensionality reduction. Once again, we will follow Signac vignette.
RunTFIDF
will be used. This is a two-step normalisation.
procedure, that both normalises across cells to correct for differences
in cellular sequencing depth, and across peaks to give higher values to
more rare peaks# Normalisation
combined.obj.ls <- lapply(combined.obj.ls,RunTFIDF)
# a warning (Some features contain 0 total counts) might rise. This is due to the cell filtering performed in the QC step. Should be safe to proceed
# Variable features
combined.obj.ls <- lapply(combined.obj.ls,FindTopFeatures)
# Dim reduction
combined.obj.ls <- lapply(combined.obj.ls,RunSVD)
It usually happens that the first LSI (you can think it as a
principal component, if you are more familiar with scrna-seq) is mostly
associated to the sequencing depth, rather than biological variation. We
will test this by running a modified version of the
DepthCor
function. The modifications simply consists in
plotting the modality name as plot title.
plot.list <- lapply(combined.obj.ls,DepthCorMulMod)
ggarrange(plot.list[[1]],plot.list[[2]],plot.list[[3]],ncol=3)
Here, the correlation between the first LSI and sequencing depth is
quite strong. We will discard it from further analyses.
Then, the dimensionality of the dataset should be determined. Although more sophisticated (and time consuming) methods have been implemented (like the JackStraw procedure), we think that the ‘Elbow plot’ could be informative enough. In the ‘Elbow plot’ the standard deviation explained by each LSI is plotted. When the the plateau is reached, it means that adding more LSI to our analysis does not actually increase the variability explained.
plot.list_elbow <- lapply(combined.obj.ls,ElbowPlot,reduction = "lsi",ndims = 50)
ggarrange(plot.list_elbow[[1]],plot.list_elbow[[2]],plot.list_elbow[[3]],ncol=3)
In this case, we believe that considering the first 40 LSI should be
definitely more than enough for all the three modalities (probably even
less than 40 should be fine).
Now that we have normalised and found variable features in the data, we can perform graph-based clustering and non-linear dimension reduction for visualization. This is done individually for each modality as different modalities might require different parameters (e.g., number of PCs, clustering resolution, ..).
To find the most appropriate parameter we have used, and suggest using, clustertree.
# Run dimension reduction and clustering on each modality individually
# ATAC
combined.obj.ls$ATAC_TATAGCCT <- RunUMAP(combined.obj.ls$ATAC_TATAGCCT,reduction = 'lsi', dims = 2:40)
combined.obj.ls$ATAC_TATAGCCT <- FindNeighbors(combined.obj.ls$ATAC_TATAGCCT,reduction = 'lsi', dims = 2:40)
combined.obj.ls$ATAC_TATAGCCT <- FindClusters(combined.obj.ls$ATAC_TATAGCCT, verbose = FALSE, algorithm = 3,resolution = .5)
# algorithm = 3 means SLM algorithm is used for modularity optimization
# H3K27ac
combined.obj.ls$H3K27ac_ATAGAGGC <- RunUMAP(combined.obj.ls$H3K27ac_ATAGAGGC,reduction = 'lsi', dims = 2:40)
combined.obj.ls$H3K27ac_ATAGAGGC <- FindNeighbors(combined.obj.ls$H3K27ac_ATAGAGGC,reduction = 'lsi', dims = 2:40)
combined.obj.ls$H3K27ac_ATAGAGGC <- FindClusters(combined.obj.ls$H3K27ac_ATAGAGGC, verbose = FALSE, algorithm = 3,resolution = .8)
# H3K27me3
combined.obj.ls$H3K27me3_CCTATCCT <- RunUMAP(combined.obj.ls$H3K27me3_CCTATCCT,reduction = 'lsi', dims = 2:40)
combined.obj.ls$H3K27me3_CCTATCCT <- FindNeighbors(combined.obj.ls$H3K27me3_CCTATCCT,reduction = 'lsi', dims = 2:40)
combined.obj.ls$H3K27me3_CCTATCCT <- FindClusters(combined.obj.ls$H3K27me3_CCTATCCT, verbose = FALSE, algorithm = 3,resolution = .8)
# Plot each modality separately
p1=DimPlot(combined.obj.ls$ATAC_TATAGCCT, label = TRUE) + NoLegend() +
ggtitle("ATAC") +
theme(plot.title = element_text(size=15,hjust=0.5,face='bold'))
p2=DimPlot(combined.obj.ls$H3K27ac_ATAGAGGC, label = TRUE) + NoLegend() +
ggtitle("H3K27ac") +
theme(plot.title = element_text(size=15,hjust=0.5,face='bold'))
p3=DimPlot(combined.obj.ls$H3K27me3_CCTATCCT, label = TRUE) + NoLegend() +
ggtitle("H3K27me3") +
theme(plot.title = element_text(size=15,hjust=0.5,face='bold'))
ggarrange(p1,p2,p3,ncol=3)
Now that we have identified the different cell states, we need to
annotate them, assigning to each of them a biological identity. To this
end, we will estimate the activity of each gene by quantifying the
number of fragments mapping to its gene body and 2kb upstream the TSS.
This will be done by using the Signac function
GeneActivity
.
Of course, this has to be done on modalities marking
transcriptionally active regions. In our case either ATAC or H3K27ac. We
believe that H3K27ac is the active modality which might best predict
gene activity, thus we will calculate the gene activity scores on nano
CUT&Tag H3K27ac modality.
Once again, for most of these processes, we will here follow the Signac
vignette.
Before creating a gene activity matrix we first need to add gene annotations to the H3K27ac Seurat object.
# get gene annotations from Ensembl
annotations <- GetGRangesFromEnsDb(ensdb = genome_ann)
# change to UCSC style since the data was mapped to mm10
seqlevelsStyle(annotations) <- 'UCSC'
# add the gene information to the object
Annotation(combined.obj.ls$H3K27ac_ATAGAGGC) <- annotations
Having added the gene annotations, we can calculate the gene
activity. This is done by the Signac GeneActivity
function
that: i) extracts gene coordinates and extends them to include the 2 kb
upstream region and ii) counts the number of fragments for each cell
that map to each of these regions. Finally, the gene activities are
added to the Seurat object as ‘RNA’
# Calculate gene activities
gene.activities <- GeneActivity(combined.obj.ls$H3K27ac_ATAGAGGC)
# Add the gene activity matrix to the Seurat object as a new assay
combined.obj.ls$H3K27ac_ATAGAGGC[['RNA']] <- CreateAssayObject(counts = gene.activities)
# normalize RNA data
combined.obj.ls$H3K27ac_ATAGAGGC <- NormalizeData(
object = combined.obj.ls$H3K27ac_ATAGAGGC,
assay = 'RNA',
normalization.method = 'LogNormalize',
scale.factor = median(combined.obj.ls$H3K27ac_ATAGAGGC$nCount_RNA)
)
# Set RNA as default assay
DefaultAssay(combined.obj.ls$H3K27ac_ATAGAGGC) <- 'RNA'
By plotting the activities of known canonical marker genes we can get
some hints on the biological identities of most of the H3K27ac clusters.
Although the signal might sometimes be a bit noisy, already by plotting
the gene activity scores we can discriminate some cell states such as
astrocytes (telencephalon and non-telencephalon), vascular cells
(endothelial, smooth muscle, leptomeningeal cells), oligodendrocytes,
neurons (excitatory and inhibitory), microglia, macrophages, Bergmann
glia, olfactory ensheating cells and choroid plexus cells.
(These atlases are a great
resource to manually annotate your favorite cell state!)
For conciseness, we here opted for a manual annotation, but the
probably best practice would be to integrate one of the active
nanoCUT&Tag modalities with scRNA-seq data. If interested in doing
so, please follow this
Signac tutorial.
FeaturePlot(
object = combined.obj.ls$H3K27ac_ATAGAGGC,
features = c('Foxg1','Lhx2','Mfge8'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 3,
label = T)
FeaturePlot(
object = combined.obj.ls$H3K27ac_ATAGAGGC,
features = c('Gpr116','Cldn5','Emcn'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 3,
label = T)
FeaturePlot(object = combined.obj.ls$H3K27ac_ATAGAGGC,
features = c('Irx2','Agt','Slc6a9'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 3,
label = T)
FeaturePlot(
object = combined.obj.ls$H3K27ac_ATAGAGGC,
features = c('Neurod1','Neurod6'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 2,
label = T)
FeaturePlot(
object = combined.obj.ls$H3K27ac_ATAGAGGC,
features = c('Mag','Mbp'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 3,
label = T)
FeaturePlot(
object = combined.obj.ls$H3K27ac_ATAGAGGC,
features = c('P2ry12','Ccr6','Selplg'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 3,
label = T)
FeaturePlot(
object = combined.obj.ls$H3K27ac_ATAGAGGC,
features = c('Gad1','Slc32a1'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 2,
label = T)
FeaturePlot(
object = combined.obj.ls$H3K27ac_ATAGAGGC,
features = c('A2m','Gdf10','Hopx'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 3,
label = T)
FeaturePlot(
object = combined.obj.ls$H3K27ac_ATAGAGGC,
features = c('Map3k7cl','Myh11','Acta2'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 3,
label = T)
FeaturePlot(
object = combined.obj.ls$H3K27ac_ATAGAGGC,
features = c('Pdgfra','Sox10','Olig1'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 3,
label = T)
FeaturePlot(
object = combined.obj.ls$H3K27ac_ATAGAGGC,
features = c('F13a1','Cd74'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 2,
label = T)
FeaturePlot(
object = combined.obj.ls$H3K27ac_ATAGAGGC,
features = c('Frzb','Mgst1','Prss56'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 3,
label = T)
FeaturePlot(
object = combined.obj.ls$H3K27ac_ATAGAGGC,
features = c('Dcn','Slc6a13'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 2,
label = T)
FeaturePlot(
object = combined.obj.ls$H3K27ac_ATAGAGGC,
features = c('Clic6','Folr1'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 2,
label = T)
FeaturePlot(
object = combined.obj.ls$H3K27ac_ATAGAGGC,
features = c('Foxj1','Pifo','Dynlrb2'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 3,
label = T)
Now we are ready to annotate the clusters! We are going to generate
here two layers of annotation, one more general and one more
specific.
Layer 2 (cell-state-specific)
##### Annotation layer2
# create a named annotation vector
l2.ids <- c("0"="AST-TE1","1"="VEC","2"="AST-NTE","3"="EXC1","4"="OL",
"5"="MGL","6"="INH1","7"="AST-TE2","8"="BG","9"="INH2",
"10"="VSMC","11"="OPC","12"="MAC","13"="OEC","14"="VLMC",
"15"="CHP","16"="INH3","17"="INH4","18"="INH5","19"="EXC2",
"20"="EPE","21"="EXC3")
names(l2.ids) <- levels(combined.obj.ls$H3K27ac_ATAGAGGC)
# rename the cluster identities with the layer2 annotations
combined.obj.ls$H3K27ac_ATAGAGGC <- RenameIdents(combined.obj.ls$H3K27ac_ATAGAGGC,l2.ids)
# Add layer2 annotations to the object
combined.obj.ls$H3K27ac_ATAGAGGC <- AddMetaData(combined.obj.ls$H3K27ac_ATAGAGGC,
Idents(combined.obj.ls$H3K27ac_ATAGAGGC),"layer2_annotation")
# UMAP plot
p1=DimPlot(combined.obj.ls$H3K27ac_ATAGAGGC, label = TRUE, repel = T) + NoLegend() +
ggtitle("H3K27ac") +
theme(plot.title = element_text(size=15,hjust=0.5,face='bold'))
Layer 1 (cell-type-specific)
##### General layer, L1
# create a named annotation vector
l1.ids <- c("AST-TE1"="Astroependymal","VEC"="Vascular","AST-NTE"="Astroependymal",
"EXC1"="Neurons","OL"="Oligodendrocytes","MGL"="Immune","INH1"="Neurons",
"AST-TE2"="Astroependymal","BG"="Astroependymal","INH2"="Neurons",
"VSMC"="Vascular","OPC"="Oligodendrocytes","MAC"="Immune",
"OEC"="OEC","VLMC"="Vascular","CHP"="Astroependymal","INH3"="Neurons",
"INH4"="Neurons","INH5"="Neurons","EXC2"="Neurons","EPE"="Astroependymal","EXC3"="Neurons")
names(l1.ids) <- levels(combined.obj.ls$H3K27ac_ATAGAGGC)
# rename the cluster identities with the layer1 annotations
combined.obj.ls$H3K27ac_ATAGAGGC <- RenameIdents(combined.obj.ls$H3K27ac_ATAGAGGC,l1.ids)
# Add layer1 annotations to the object
combined.obj.ls$H3K27ac_ATAGAGGC <- AddMetaData(combined.obj.ls$H3K27ac_ATAGAGGC,
Idents(combined.obj.ls$H3K27ac_ATAGAGGC),"layer1_annotation")
# UMAP plot
p2=DimPlot(combined.obj.ls$H3K27ac_ATAGAGGC, label = F) +
ggtitle("H3K27ac") +
theme(plot.title = element_text(size=15,hjust=0.5,face='bold'))
Finally, transfer the layer1 and layer2 annotations from the H3K27ac to the H3K27me3 and ATAC modalities.
# H3K27ac annotated cell states
idents_l1 <- combined.obj.ls$H3K27ac_ATAGAGGC$layer1_annotation
idents_l2 <- combined.obj.ls$H3K27ac_ATAGAGGC$layer2_annotation
# Transfer to ATAC and H3K27me3
# layer1
combined.obj.ls$ATAC_TATAGCCT <- AddMetaData(combined.obj.ls$ATAC_TATAGCCT,
idents_l1,
col.name='layer1_annotation')
combined.obj.ls$H3K27me3_CCTATCCT <- AddMetaData(combined.obj.ls$H3K27me3_CCTATCCT,
idents_l1,
col.name='layer1_annotation')
# layer2
combined.obj.ls$ATAC_TATAGCCT <- AddMetaData(combined.obj.ls$ATAC_TATAGCCT,
idents_l2,
col.name='layer2_annotation')
combined.obj.ls$H3K27me3_CCTATCCT <- AddMetaData(combined.obj.ls$H3K27me3_CCTATCCT,
idents_l2,
col.name='layer2_annotation')
# Plot each modality separately
p1=DimPlot(combined.obj.ls$ATAC_TATAGCCT, label = TRUE,repel = T, group.by = "layer2_annotation", label.size = 3) +
NoLegend() +
ggtitle("ATAC") +
theme(plot.title = element_text(size=15,hjust=0.5,face='bold'))
p2=DimPlot(combined.obj.ls$H3K27ac_ATAGAGGC, label = TRUE,repel = T, group.by = "layer2_annotation", label.size = 3) +
NoLegend() +
ggtitle("H3K27ac") +
theme(plot.title = element_text(size=15,hjust=0.5,face='bold'))
p3=DimPlot(combined.obj.ls$H3K27me3_CCTATCCT, label = TRUE,repel = T, group.by = "layer2_annotation", label.size = 3) +
NoLegend() +
ggtitle("H3K27me3") +
theme(plot.title = element_text(size=15,hjust=0.5,face='bold'))
ggarrange(p1,p2,p3,ncol=3)
plotConnectModal(seurat = combined.obj.ls, group = 'layer1_annotation')
saveRDS(combined.obj.ls, file = "../nanoscope_final_peaks.rds")
sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.5.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] patchwork_1.1.2 ggbio_1.46.0
## [3] ggVennDiagram_1.2.2 scales_1.2.1
## [5] regioneR_1.30.0 ComplexUpset_1.3.5
## [7] EnsDb.Mmusculus.v79_2.99.0 ensembldb_2.22.0
## [9] AnnotationFilter_1.22.0 GenomicFeatures_1.50.4
## [11] AnnotationDbi_1.60.2 Biobase_2.58.0
## [13] ggpubr_0.6.0 gghalves_0.1.4
## [15] ggplot2_3.4.2 stringr_1.5.0
## [17] future_1.32.0 GenomicRanges_1.50.2
## [19] GenomeInfoDb_1.34.9 IRanges_2.32.0
## [21] S4Vectors_0.36.2 BiocGenerics_0.44.0
## [23] SeuratObject_4.1.3 Seurat_4.3.0
## [25] Signac_1.9.0
##
## loaded via a namespace (and not attached):
## [1] rappdirs_0.3.3 rtracklayer_1.58.0
## [3] scattermore_1.1 GGally_2.1.2
## [5] tidyr_1.3.0 bit64_4.0.5
## [7] knitr_1.42 irlba_2.3.5.1
## [9] DelayedArray_0.24.0 data.table_1.14.8
## [11] rpart_4.1.19 KEGGREST_1.38.0
## [13] RCurl_1.98-1.12 generics_0.1.3
## [15] cowplot_1.1.1 RSQLite_2.3.1
## [17] RANN_2.6.1 proxy_0.4-27
## [19] bit_4.0.5 spatstat.data_3.0-1
## [21] xml2_1.3.4 httpuv_1.6.11
## [23] SummarizedExperiment_1.28.0 xfun_0.39
## [25] hms_1.1.3 jquerylib_0.1.4
## [27] evaluate_0.21 promises_1.2.0.1
## [29] fansi_1.0.4 restfulr_0.0.15
## [31] progress_1.2.2 dbplyr_2.3.2
## [33] igraph_1.4.1 DBI_1.1.3
## [35] htmlwidgets_1.6.2 reshape_0.8.9
## [37] spatstat.geom_3.2-1 purrr_1.0.1
## [39] ellipsis_0.3.2 dplyr_1.1.2
## [41] backports_1.4.1 biomaRt_2.54.1
## [43] deldir_1.0-9 MatrixGenerics_1.10.0
## [45] vctrs_0.6.2 ROCR_1.0-11
## [47] abind_1.4-5 cachem_1.0.8
## [49] withr_2.5.0 RVenn_1.1.0
## [51] BSgenome_1.66.3 progressr_0.13.0
## [53] checkmate_2.2.0 sctransform_0.3.5
## [55] GenomicAlignments_1.34.1 prettyunits_1.1.1
## [57] goftest_1.2-3 cluster_2.1.4
## [59] lazyeval_0.2.2 crayon_1.5.2
## [61] spatstat.explore_3.2-1 units_0.8-2
## [63] pkgconfig_2.0.3 labeling_0.4.2
## [65] nlme_3.1-162 ProtGenerics_1.30.0
## [67] nnet_7.3-19 rlang_1.1.1
## [69] globals_0.16.2 lifecycle_1.0.3
## [71] miniUI_0.1.1.1 filelock_1.0.2
## [73] BiocFileCache_2.6.1 dichromat_2.0-0.1
## [75] polyclip_1.10-4 matrixStats_0.63.0
## [77] lmtest_0.9-40 graph_1.76.0
## [79] Matrix_1.5-4.1 carData_3.0-5
## [81] zoo_1.8-12 base64enc_0.1-3
## [83] ggridges_0.5.4 png_0.1-8
## [85] viridisLite_0.4.2 rjson_0.2.21
## [87] bitops_1.0-7 KernSmooth_2.23-21
## [89] Biostrings_2.66.0 blob_1.2.4
## [91] classInt_0.4-9 parallelly_1.35.0
## [93] spatstat.random_3.1-5 rstatix_0.7.2
## [95] ggsignif_0.6.4 memoise_2.0.1
## [97] magrittr_2.0.3 plyr_1.8.8
## [99] ica_1.0-3 zlibbioc_1.44.0
## [101] compiler_4.2.2 BiocIO_1.8.0
## [103] RColorBrewer_1.1-3 fitdistrplus_1.1-11
## [105] Rsamtools_2.14.0 cli_3.6.1
## [107] XVector_0.38.0 listenv_0.9.0
## [109] pbapply_1.7-0 htmlTable_2.4.1
## [111] Formula_1.2-5 MASS_7.3-60
## [113] tidyselect_1.2.0 stringi_1.7.12
## [115] highr_0.10 yaml_2.3.7
## [117] ggrepel_0.9.3 grid_4.2.2
## [119] sass_0.4.6 VariantAnnotation_1.44.1
## [121] fastmatch_1.1-3 tools_4.2.2
## [123] future.apply_1.11.0 parallel_4.2.2
## [125] rstudioapi_0.14 foreign_0.8-84
## [127] gridExtra_2.3 farver_2.1.1
## [129] Rtsne_0.16 digest_0.6.31
## [131] BiocManager_1.30.20 shiny_1.7.4
## [133] Rcpp_1.0.10 car_3.1-2
## [135] broom_1.0.4 later_1.3.1
## [137] RcppAnnoy_0.0.20 OrganismDbi_1.40.0
## [139] httr_1.4.6 biovizBase_1.46.0
## [141] sf_1.0-12 colorspace_2.1-0
## [143] XML_3.99-0.14 tensor_1.5
## [145] reticulate_1.28 splines_4.2.2
## [147] uwot_0.1.14 RBGL_1.74.0
## [149] RcppRoll_0.3.0 spatstat.utils_3.0-3
## [151] sp_1.6-0 plotly_4.10.1
## [153] xtable_1.8-4 jsonlite_1.8.4
## [155] R6_2.5.1 Hmisc_5.1-0
## [157] pillar_1.9.0 htmltools_0.5.5
## [159] mime_0.12 glue_1.6.2
## [161] fastmap_1.1.1 BiocParallel_1.32.6
## [163] class_7.3-22 codetools_0.2-19
## [165] utf8_1.2.3 lattice_0.21-8
## [167] bslib_0.4.2 spatstat.sparse_3.0-1
## [169] tibble_3.2.1 curl_5.0.0
## [171] leiden_0.4.3 survival_3.5-5
## [173] rmarkdown_2.21 munsell_0.5.0
## [175] e1071_1.7-13 GenomeInfoDbData_1.2.9
## [177] reshape2_1.4.4 gtable_0.3.3