This tutorial describes how to perform the downstream analysis on single-cell nano CUT&Tag data (2 biological replicates, 3 modalities), quantifying the fragment counts in genomic bins instead of peaks.
This tutorial assumes that the data have already been preprocessed,
including demultiplexing of the different modalities, running cellranger
and custom cell calling (see Set Up and
Preprocessing).
Please note how in this tutorial we are describing how to
perform the downstream analysis by using bins, not peaks. For
tutorial on peaks, see here.
In addition, note how the biological data here analysed are 2
biological replicates of the same sample, thus it is safe to merge the
two samples without any kind of integration. Often experiments are
designed in order to have multiple samples and/or biological conditions
requiring data integration, differential analysis, etc. These are not
part of this vignette, but will be implemented in other vignettes in the
future.
The downstream analysis implemented in this vignette is composed of 6
steps:
1. Data loading (metadata and
fragments)
2. Quality controls (QC)
3. Merge Seurat objects
4. Normalisation and dimensional
reduction
5. Non-linear dimension reduction and
clustering
6. Annotation
~/NatProt/nanoscope/
folderfragments.tsv.gz
, fragments.tsv.gz.tbi
and
metadata.csv
files and store them in specific folders (see
below section “1.1 Setting some initial variables”)In this section we are going to load the input files for each modality and sample, create a fragment object for each experiment, generate a bins x cell matrix for each experiment and finally generate one Seurat object for experiment.
fragments.tsv.gz
and
metadata.csv
files.First, load all the required libraries.
rm(list = ls())
library(Signac)
library(Seurat)
library(GenomicRanges)
library(future)
library(stringr)
library(ggplot2)
library(gghalves)
library(ggpubr)
library(EnsDb.Mmusculus.v79)
library(BSgenome.Mmusculus.UCSC.mm10)
library(ComplexUpset)
library(regioneR)
library(scales)
library(ggVennDiagram)
First, define some variables that will be used across the entire
vignette. Modify them according to your data.
Extremely important:
repodir
the directory where the nanoscope
Github repo is. This is required in order to upload all the custom
functions needed to run this vignette# directory where the nanoscope repo was cloned
repodir <- "~/Documents/GitHub/nanoscope/" # change accordingly
source(paste0(repodir,"scripts/functions_scCT.R"))
wd
the output root directory.sample_P23209
and sample_P24004
in a directory called result
and set it as working directory
wd <- "~/NatProt/nanoscope/results/"
wd
will be
/path/to/nanoscope_toy_data/results/
wd
is formatted as follow:fragments.tsv.gz
is expected to be in a folder
with this path:
wd/{sample_name}/{modality_barcode}/cellranger/outs/
metadata.csv
is expected to be in a folder with
this path:
wd/{sample_name}/{modality_barcode}/cell_picking/
Please note that nothing but the directories containing the
input data for each samples must be stored in
wd
# directory where the nanoscope output files are stored
wd <- "~/NatProt/nanoscope/results" # change accordingly
setwd(wd)
# Set the used genome version
genome <- "mm10"
# Genome annotations
genome_ann <- EnsDb.Mmusculus.v79
# BSgenome
bs_genome <- BSgenome.Mmusculus.UCSC.mm10
# Assay (peaks or bins)
assay <- "bins"
# Samples
samples <- dir()
# Modalities
modalities <- dir(samples[1]) # this assumes sample1, sample2, sampleN share the same modalities and barcodes!
# Minimum number of cells a feature must be detected in to be included in the Seurat obj
min_cell <- 1
# Minium number of features a cell must have to be included in the Seurat obj
min_feat <- 1
In order to create one Seurat object for each sample contaning a bin
per cell count matrix we need to combine all the input files together.
To this end we will need to: i) select the cells that passed internal
filtering, ii) get the fragments file reporting the genomic coordinates
of each fragment and the reads supporting each fragment and iii)
construct a bin x cell matrix from a fragments file.
First, we parse the metadata file in order to get the cells that passed the filtering and are actually identified as cells.
##### Metadata
# Load metadata for each experiment iterating across the sample list
metadata.ls <- list()
for (smpl in samples) {
cat("Loading metadata for sample",smpl,"\n")
for (mod in modalities) {
cat("\t",mod,"\n")
# Load metadata
metadata.ls[[paste0(mod,"_",smpl)]] <- read.csv(paste0(smpl,"/",mod,"/cell_picking/metadata.csv"),
stringsAsFactors = FALSE)
rownames(metadata.ls[[paste0(mod,"_",smpl)]]) <- metadata.ls[[paste0(mod,"_",smpl)]]$barcode
}
}
## Loading metadata for sample sample_P23209
## ATAC_TATAGCCT
## H3K27ac_ATAGAGGC
## H3K27me3_CCTATCCT
## Loading metadata for sample sample_P24004
## ATAC_TATAGCCT
## H3K27ac_ATAGAGGC
## H3K27me3_CCTATCCT
Here we are going to exclude the cells which have not passed the
internal filtering (flagged as passedMB
). This internal
filtering takes into consideration the log10UMI and the percentage of
reads in peaks for each cell and clusterizes the cells based on these
values (by building a Gaussian finite mixture mode fitted via EM
algorithm via Mclust
). This will cluster the cells in
n clusters, but only the cells belonging to the top1 cluster
with the highest log10UMI will be classified as
passedMB=TRUE
.
Before doing this, however, we would like to know how many cells we are filtering out:
plotPassed(metadata.ls,xaxis_text = 9,angle_x=60)
And which cells we are filtering out:
plotPassedCells(metadata.ls,samples,modalities)
In this case, you can see how the cells can be easily clustered in
three groups: empty droplets, weird droplets and good cells and how only
the good cells are selected. However, if you notice something weird,
please, re-do the cell selection by applying some hard cutoffs.
Once we have classified the good cells as passedMB=TRUE
,
we can proceed with the selection of the good cells only, discarding all
the others from further analyses:
metadata.ls <- lapply(metadata.ls, function(x){x[x$passedMB,]})
Then, we create a fragment object for each experiment (i.e.,
an object holding all the information related to a single fragment
file). To this end we will use the Signac function
CreateFragmentObject
.
As explained in the Signac
vignette, the CreateFragmentObject
function checks that
the file is present on disk, it is compressed and indexed, computes the
MD5 sum for the file and the tabix index so that we can tell if the file
is modified at any point, and checks that the expected cells
(metadata.csv
) are present in the file.
##### Fragments
# Create one fragment object per each experiment iterating across the sample list
fragment.ls <- list()
for (smpl in samples) {
cat("Loading fragments for sample",smpl,"\n")
for (mod in modalities) {
cat("\t",mod,"\n")
# Load fragments
fragment.ls[[paste0(mod,"_",smpl)]] <- CreateFragmentObject(
path = paste0(smpl,"/",mod,"/cellranger/outs/fragments.tsv.gz"),
cells = metadata.ls[[paste0(mod,"_",smpl)]]$barcode)
}
}
## Loading fragments for sample sample_P23209
## ATAC_TATAGCCT
## H3K27ac_ATAGAGGC
## H3K27me3_CCTATCCT
## Loading fragments for sample sample_P24004
## ATAC_TATAGCCT
## H3K27ac_ATAGAGGC
## H3K27me3_CCTATCCT
head(fragment.ls[[paste0("ATAC_TATAGCCT_",samples[1])]])
chrom | start | end | barcode | readCount |
---|---|---|---|---|
chr1 | 3003932 | 3004576 | GCCTAGGCAGTTACAC-1 | 1 |
chr1 | 3003985 | 3004021 | CTACTTAAGTTCAGGG-1 | 7 |
chr1 | 3005376 | 3005689 | CCAGATACAGCGTCGT-1 | 3 |
chr1 | 3005764 | 3005913 | CAAGCTAGTTTGCATG-1 | 1 |
chr1 | 3005861 | 3006012 | TCTCTGGTCCAGAGAG-1 | 1 |
chr1 | 3006223 | 3006256 | AACAGTCCATCTGCAA-1 | 3 |
Now that each experiment has been associated to a fragment object
containing fragments and metadata, we can create a bins x cell matrix
for each experiment. This will be done by using the Signac function
GenomeBinMatrix
. GenomeBinMatrix
subdivides
the genome in non-overlapping windows of length X (X=5kb here) and
counts for each bin, for each cell, the number of overlapping
fragments.
To speed-up the process, we launch FeatureMatrix
with
process_n = 20000
. This runs OK on our 64GB MacBook Pro,
but could go out-of-memory on other systems. If this happens, please,
decrease the process_n
value (2000 should be fine for every
system).
# As usual, append the object to a list
counts.ls <- list()
# Iterate among the experiment names in order to retrieve fragments and cell barcodes from the same experiment
for (experim in names(fragment.ls)) {
cat("Analysing experiment: ",experim,"\n")
# modality name
modal <- paste0(str_split_fixed(experim,"_",4)[,1],"_",str_split_fixed(experim,"_",4)[,2])
cat("\tBins from modality: ",modal,"\n")
# Create FeatureMatrix
counts.ls[[experim]] <- GenomeBinMatrix(fragments = fragment.ls[[experim]], # fragments
cells = metadata.ls[[experim]]$barcode, # metadata
binsize = 5000, # bin size
genome = seqlengths(bs_genome), # chom sizes
process_n = 20000) # the higher, the faster, the more memory is used
}
## Analysing experiment: ATAC_TATAGCCT_sample_P23209
## Bins from modality: ATAC_TATAGCCT
## Analysing experiment: H3K27ac_ATAGAGGC_sample_P23209
## Bins from modality: H3K27ac_ATAGAGGC
## Analysing experiment: H3K27me3_CCTATCCT_sample_P23209
## Bins from modality: H3K27me3_CCTATCCT
## Analysing experiment: ATAC_TATAGCCT_sample_P24004
## Bins from modality: ATAC_TATAGCCT
## Analysing experiment: H3K27ac_ATAGAGGC_sample_P24004
## Bins from modality: H3K27ac_ATAGAGGC
## Analysing experiment: H3K27me3_CCTATCCT_sample_P24004
## Bins from modality: H3K27me3_CCTATCCT
counts.ls[[names(fragment.ls)[1]]][3990:4000,1:50]
## 11 x 50 sparse Matrix of class "dgCMatrix"
##
## chr1-19945001-19950000 . . . . . . . . . . . . . . . . . . . . . . 1 . . . . .
## chr1-19950001-19955000 . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## chr1-19955001-19960000 . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## chr1-19960001-19965000 . . . . . 1 . . . . . . . . 1 . . . . . . . . . . . . .
## chr1-19965001-19970000 . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## chr1-19970001-19975000 . . . . . . . . . . . . . . . . . . . . . . . . . 1 . .
## chr1-19975001-19980000 . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## chr1-19980001-19985000 . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## chr1-19985001-19990000 . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## chr1-19990001-19995000 . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## chr1-19995001-20000000 . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##
## chr1-19945001-19950000 . . . . . . . . . . . . . . . . . . . . . .
## chr1-19950001-19955000 . . . . . . 2 1 . . . . . . . . . . . . . .
## chr1-19955001-19960000 . . . . . . . . . . 1 . . . . . . . . . . .
## chr1-19960001-19965000 . . . . . . . . . . . . . . . . . . . . . .
## chr1-19965001-19970000 . . . . . . . . . . . . . . . . . . . . . .
## chr1-19970001-19975000 . . . . . . . . . . . . . . . . . . . . . .
## chr1-19975001-19980000 . . . . . . . . . . . . . . . . . . . . . .
## chr1-19980001-19985000 . . . . . . . . . . . . . . . . . . . . . .
## chr1-19985001-19990000 . . . . . . . . . . . . . . . . . . . . . .
## chr1-19990001-19995000 . . . . . . . . . . . . . . . . . . . . . .
## chr1-19995001-20000000 . . . . . . . . . . . . . . . . . . . . . .
Now that we have generated a bins x cell matrix for each experiment, we can store it in a Seurat object along with the fragment object and the metadata.
This is done for each experiment separately, objects will be merged in the next step. The reason for this is that we would like to first perform QC on each experiment individually and then merge all the experiments in a single Seurat object.
obj.ls <- list()
# Iterate among the experiment names in order to retrieve fragments and cell barcodes from the same experiment
for (experim in names(counts.ls)) {
# Get sample and modality name from the experiment variable
smpl <- str_split_fixed(experim,"_",3)[,3]
modality <- str_split_fixed(experim,"_",2)[,1]
# Create the chromatin assay
chrom.assay <- CreateChromatinAssay(counts = counts.ls[[experim]],
fragments = fragment.ls[[experim]],
genome = genome,
min.cells = min_cell,
min.features = min_feat)
# Create the object
obj.ls[[experim]] <- CreateSeuratObject(counts = chrom.assay,
assay = assay,
meta.data=metadata.ls[[experim]],
project = smpl)
# Add some metadata of origin
obj.ls[[experim]]$dataset <- experim
obj.ls[[experim]]$modality <- modality
obj.ls[[experim]]$sample <- smpl
}
Before merging the objects, we are going to do some QC on the individual experiments. In particular, we are going to drop cells with abnormal number of UMIs. To define what is “abnormal” and what is not, we will use quantiles. Everything greater than the 95th or less than the 5th quantiles will be here considered as “abnormal” and therefore discarded.
Please, consider how applying these filters is optional and the thresholds used should be specifically set for each dataset as they usually are highly sample-specific. The thresholds we are applying here are quite harsh and, if applied to other datasets, might lead to removal of important biological information.
obj.ls.qc <- list()
quant_high <- 0.99
quant_low <- 0.01
for (experiment in names(obj.ls)) {
# Calculate 5th and 95th quantiles for logUMI
logUMI_cutoff_high <- quantile(obj.ls[[experiment]]$logUMI, quant_high)
logUMI_cutoff_low <- quantile(obj.ls[[experiment]]$logUMI, quant_low)
# Filter out cells outside these thresholds
obj.ls.qc[[experiment]] <- subset(obj.ls[[experiment]],
logUMI > logUMI_cutoff_low &
logUMI < logUMI_cutoff_high)
# print number of discarded cells
old_n_cell <- nrow(obj.ls[[experiment]][[]])
new_n_cell <- nrow(obj.ls.qc[[experiment]][[]])
discarded <- old_n_cell - new_n_cell
cat(experiment,"\n")
cat("\tdiscarded",discarded,"cells (",round(discarded/old_n_cell*100,2),"%)\n")
}
## ATAC_TATAGCCT_sample_P23209
## discarded 55 cells ( 2.04 %)
## H3K27ac_ATAGAGGC_sample_P23209
## discarded 54 cells ( 2.03 %)
## H3K27me3_CCTATCCT_sample_P23209
## discarded 54 cells ( 2 %)
## ATAC_TATAGCCT_sample_P24004
## discarded 58 cells ( 2.06 %)
## H3K27ac_ATAGAGGC_sample_P24004
## discarded 58 cells ( 2.05 %)
## H3K27me3_CCTATCCT_sample_P24004
## discarded 58 cells ( 2.05 %)
plotCounts(obj = obj.ls, quantiles = c(quant_low,quant_high), feature = "logUMI")
rm(obj.ls)
Now we can plot, for each sample, the number of cells being simultaneously profiled by all the three modalities. Since we will annotate the cells based on their H3K27ac signal (see 6. Annotation), all the cells lacking the H3K27ac profiling will result as not annotated at the end of the processing of the data. So it is important that the majority of the cells have H3K27ac profiling and that, in general, there is a good overlap among the different modalities.
# Plot number of cells for each modality for each sample
venn1 <- commonCellHistonMarks(mod1 = obj.ls.qc[[paste0(modalities[1],"_",samples[1])]],
name_mod1 = strsplit(modalities[1],"_")[[1]][1],
mod2 = obj.ls.qc[[paste0(modalities[2],"_",samples[1])]],
name_mod2 = gsub("H3","",strsplit(modalities[2],"_")[[1]][1]),
mod3 = obj.ls.qc[[paste0(modalities[3],"_",samples[1])]],
name_mod3 = gsub("H3","",strsplit(modalities[3],"_")[[1]][1]),
sample = samples[1] )
venn2 <- commonCellHistonMarks(mod1 = obj.ls.qc[[paste0(modalities[1],"_",samples[2])]],
name_mod1 = strsplit(modalities[1],"_")[[1]][1],
mod2 = obj.ls.qc[[paste0(modalities[2],"_",samples[2])]],
name_mod2 = gsub("H3","",strsplit(modalities[2],"_")[[1]][1]),
mod3 = obj.ls.qc[[paste0(modalities[3],"_",samples[2])]],
name_mod3 = gsub("H3","",strsplit(modalities[3],"_")[[1]][1]),
sample = samples[2] )
ggarrange(venn1,venn2,ncol=2)
Now that we have one object for experiment containing only QC-passed cells, we can merge all the objects from different samples obtaining one Seurat object for modality.
combined.obj.ls <- list()
for (mod in modalities) {
# merge objects among the different samples
combined.obj.ls[[mod]] <- merge(obj.ls.qc[[paste0(mod,"_",samples[1])]],
y = obj.ls.qc[[paste0(mod,"_",samples[2])]],
add.cell.ids = c(samples[1], samples[2]) )
}
combined.obj.ls[[modalities[1]]][["bins"]]
## ChromatinAssay data with 499249 features for 5396 cells
## Variable features: 0
## Genome: mm10
## Annotation present: FALSE
## Motifs present: FALSE
## Fragment files: 2
combined.obj.ls[[modalities[2]]][["bins"]]
## ChromatinAssay data with 501329 features for 5377 cells
## Variable features: 0
## Genome: mm10
## Annotation present: FALSE
## Motifs present: FALSE
## Fragment files: 2
combined.obj.ls[[modalities[3]]][["bins"]]
## ChromatinAssay data with 499058 features for 5408 cells
## Variable features: 0
## Genome: mm10
## Annotation present: FALSE
## Motifs present: FALSE
## Fragment files: 2
After having filter out low quality cells and having generated one Seurat object per each modality, we can proceed with normalisation and dimensionality reduction. Once again, we will follow Signac vignette.
RunTFIDF
will be used# Normalisation
combined.obj.ls <- lapply(combined.obj.ls,RunTFIDF)
# a warning (Some features contain 0 total counts) might rise. This is due to the cell filtering performed in the QC step. Should be safe to proceed
# Variable features
combined.obj.ls <- lapply(combined.obj.ls,FindTopFeatures)
# Dim reduction
combined.obj.ls <- lapply(combined.obj.ls,RunSVD)
It usually happens that the first LSI (you can think it as a
principal component, if you are more familiar with scrna-seq) is mostly
associated to the sequencing depth, rather than biological variation. We
will test this by running a modified version of the
DepthCor
function. The modifications simply consists in
plotting the modality name as plot title.
plot.list <- lapply(combined.obj.ls,DepthCorMulMod)
ggarrange(plot.list[[1]],plot.list[[2]],plot.list[[3]],ncol=3)
Here, the correlation between the first LSI and sequencing depth is
quite strong. We will discard it from further analyses.
Then, the dimensionality of the dataset should be determined. Although more sophisticated (and time consuming) methods have been implemented (like the JackStraw procedure), we think that the ‘Elbow plot’ could be informative enough. In the ‘Elbow plot’ the standard deviation explained by each LSI is plotted. When the the plateau is reached, it means that adding more LSI to our analysis does not actually increase the variability explained.
plot.list_elbow <- lapply(combined.obj.ls,ElbowPlot,reduction = "lsi",ndims = 50)
ggarrange(plot.list_elbow[[1]],plot.list_elbow[[2]],plot.list_elbow[[3]],ncol=3)
pc <- 35
In this case, we believe that considering the first 35 LSI should be
definitely more than enough for all the three modalities (probably even
less than 35 should be fine).
Now that we have normalised and found variable features in the data,
we can perform graph-based clustering and non-linear dimension reduction
for visualization. This is done individually for each modality as
different modalities might require different parameters (e.g.,
number of PCs, clustering resolution, ..).
To find the most appropriate parameter we have used, and suggest
using, clustertree.
As you will notice, the resolution value for the ATAC modality is quite low (.2), but this is necessary in order to avoid over-clustering of the cluster 0. With any value of resolution greater than 0.2, the cluster 0 is subclustered in two or more clusters, but we do not think there is enough variability here to define the cells in cluster 0 as belonging to more than one cluster.
# Run dimension reduction and clustering on each modality individually
# ATAC
combined.obj.ls$ATAC_TATAGCCT <- RunUMAP(combined.obj.ls$ATAC_TATAGCCT,reduction = 'lsi', dims = 2:pc)
combined.obj.ls$ATAC_TATAGCCT <- FindNeighbors(combined.obj.ls$ATAC_TATAGCCT,reduction = 'lsi', dims = 2:pc)
combined.obj.ls$ATAC_TATAGCCT <- FindClusters(combined.obj.ls$ATAC_TATAGCCT, verbose = FALSE, algorithm = 3,resolution = .2)
# algorithm = 3 means SLM algorithm is used for modularity optimization
# H3K27ac
combined.obj.ls$H3K27ac_ATAGAGGC <- RunUMAP(combined.obj.ls$H3K27ac_ATAGAGGC,reduction = 'lsi', dims = 2:pc)
combined.obj.ls$H3K27ac_ATAGAGGC <- FindNeighbors(combined.obj.ls$H3K27ac_ATAGAGGC,reduction = 'lsi', dims = 2:pc)
combined.obj.ls$H3K27ac_ATAGAGGC <- FindClusters(combined.obj.ls$H3K27ac_ATAGAGGC, verbose = FALSE, algorithm = 3,resolution = .6)
# H3K27me3
combined.obj.ls$H3K27me3_CCTATCCT <- RunUMAP(combined.obj.ls$H3K27me3_CCTATCCT,reduction = 'lsi', dims = 2:pc)
combined.obj.ls$H3K27me3_CCTATCCT <- FindNeighbors(combined.obj.ls$H3K27me3_CCTATCCT,reduction = 'lsi', dims = 2:pc)
combined.obj.ls$H3K27me3_CCTATCCT <- FindClusters(combined.obj.ls$H3K27me3_CCTATCCT, verbose = FALSE, algorithm = 3,resolution = .2)
# Plot each modality separately
p1=DimPlot(combined.obj.ls$ATAC_TATAGCCT, label = TRUE) + NoLegend() +
ggtitle("ATAC") +
theme(plot.title = element_text(size=15,hjust=0.5,face='bold'))
p2=DimPlot(combined.obj.ls$H3K27ac_ATAGAGGC, label = TRUE) + NoLegend() +
ggtitle("H3K27ac") +
theme(plot.title = element_text(size=15,hjust=0.5,face='bold'))
p3=DimPlot(combined.obj.ls$H3K27me3_CCTATCCT, label = TRUE) + NoLegend() +
ggtitle("H3K27me3") +
theme(plot.title = element_text(size=15,hjust=0.5,face='bold'))
ggarrange(p1,p2,p3,ncol=3)
Now that we have identified the different cell states, we need to
annotate them, assigning to each of them a biological identity. To this
end, we will estimate the activity of each gene by quantifying the
number of fragments mapping to its gene body and 2kb upstream the TSS.
This will be done by using the Signac function
GeneActivity
.
Of course, this has to be done on modalities marking
transcriptionally active regions. In our case either ATAC or H3K27ac. We
believe that H3K27ac is the active modality which might best predict
gene activity, thus we will calculate the gene activity scores on nano
CUT&Tag H3K27ac modality.
Once again, for most of these processes, we will here follow the Signac
vignette.
Before creating a gene activity matrix we first need to add gene annotations to the H3K27ac Seurat object.
# get gene annotations from Ensembl
annotations <- GetGRangesFromEnsDb(ensdb = genome_ann)
# change to UCSC style since the data was mapped to mm10
seqlevelsStyle(annotations) <- 'UCSC'
# add the gene information to the object
Annotation(combined.obj.ls$H3K27ac_ATAGAGGC) <- annotations
Having added the gene annotations, we can calculate the gene
activity. This is done by the Signac GeneActivity
function
that: i) extracts gene coordinates and extends them to include the 2 kb
upstream region and ii) counts the number of fragments for each cell
that map to each of these regions. Finally, the gene activities are
added to the Seurat object as ‘RNA’
# Calculate gene activities
gene.activities <- GeneActivity(combined.obj.ls$H3K27ac_ATAGAGGC)
# Add the gene activity matrix to the Seurat object as a new assay
combined.obj.ls$H3K27ac_ATAGAGGC[['RNA']] <- CreateAssayObject(counts = gene.activities)
# normalize RNA data
combined.obj.ls$H3K27ac_ATAGAGGC <- NormalizeData(
object = combined.obj.ls$H3K27ac_ATAGAGGC,
assay = 'RNA',
normalization.method = 'LogNormalize',
scale.factor = median(combined.obj.ls$H3K27ac_ATAGAGGC$nCount_RNA)
)
# Set RNA as default assay
DefaultAssay(combined.obj.ls$H3K27ac_ATAGAGGC) <- 'RNA'
By plotting the activities of known canonical marker genes we can get
some hints on the biological identities of most of the H3K27ac clusters.
Although the signal might sometimes be a bit noisy, already by plotting
the gene activity scores we can discriminate some cell states such as
astrocytes (telencephalon and non-telencephalon), vascular cells
(endothelial, smooth muscle, leptomeningeal cells), oligodendrocytes,
neurons (excitatory and inhibitory), microglia, macrophages, Bergmann
glia, olfactory ensheating cells and choroid plexus cells.
(These atlases are a great
resource to manually annotate your favorite cell state!)
For conciseness, we here opted for a manual annotation, but the
probably best practice would be to integrate one of the active
nanoCUT&Tag modalities with scRNA-seq data. If interested in doing
so, please follow this
Signac tutorial.
FeaturePlot(
object = combined.obj.ls$H3K27ac_ATAGAGGC,
features = c('Foxg1','Lhx2','Mfge8'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 3,
label = T)
FeaturePlot(
object = combined.obj.ls$H3K27ac_ATAGAGGC,
features = c('Gpr116','Cldn5','Emcn'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 3,
label = T)
FeaturePlot(
object = combined.obj.ls$H3K27ac_ATAGAGGC,
features = c('P2ry12','Ccr6','Selplg'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 3,
label = T)
FeaturePlot(object = combined.obj.ls$H3K27ac_ATAGAGGC,
features = c('Irx2','Agt','Slc6a9'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 3,
label = T)
FeaturePlot(
object = combined.obj.ls$H3K27ac_ATAGAGGC,
features = c('Mag','Mbp','Sox10'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 3,
label = T)
FeaturePlot(
object = combined.obj.ls$H3K27ac_ATAGAGGC,
features = c('Neurod1','Neurod6'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 2,
label = T)
FeaturePlot(
object = combined.obj.ls$H3K27ac_ATAGAGGC,
features = c('Dcn','Slc6a13'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 3,
label = T)
FeaturePlot(
object = combined.obj.ls$H3K27ac_ATAGAGGC,
features = c('A2m','Gdf10','Hopx'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 3,
label = T)
FeaturePlot(
object = combined.obj.ls$H3K27ac_ATAGAGGC,
features = c('Gad1','Slc32a1'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 2,
label = T)
FeaturePlot(
object = combined.obj.ls$H3K27ac_ATAGAGGC,
features = c('Clic6','Folr1'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 2,
label = T)
FeaturePlot(
object = combined.obj.ls$H3K27ac_ATAGAGGC,
features = c('Frzb','Mgst1','Prss56'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 3,
label = T)
Now we are ready to annotate the clusters! We are going to generate
here two layers of annotations, one more general and one more
specific.
Layer 2 (cell-state-specific)
##### Annotation layer2
# create a named annotation vector
l2.ids <- c("0"="AST-TE","1"="VEC","2"="MGL","3"="AST-NT","4"="OL",
"5"="EXC1","6"="VLMC","7"="EXC2","8"="BG","9"="INH1",
"10"="CHP","11"="INH2","12"="OEC","13"="INH3")
names(l2.ids) <- levels(combined.obj.ls$H3K27ac_ATAGAGGC)
# rename the cluster identities with the layer2 annotations
combined.obj.ls$H3K27ac_ATAGAGGC <- RenameIdents(combined.obj.ls$H3K27ac_ATAGAGGC,l2.ids)
# Add layer2 annotations to the object
combined.obj.ls$H3K27ac_ATAGAGGC <- AddMetaData(combined.obj.ls$H3K27ac_ATAGAGGC,
Idents(combined.obj.ls$H3K27ac_ATAGAGGC),"layer2_annotation")
# UMAP plot
p1=DimPlot(combined.obj.ls$H3K27ac_ATAGAGGC, label = TRUE, repel = T) + NoLegend() +
ggtitle("H3K27ac") +
theme(plot.title = element_text(size=15,hjust=0.5,face='bold'))
Layer 1 (cell-type-specific)
##### General layer, L1
# create a named annotation vector
l1.ids <- c("AST-TE"="Astroependymal","VEC"="Vascular","MGL"="Immune","AST-NT"="Astroependymal","OL"="Oligodendrocytes",
"EXC1"="Neurons","VLMC"="Vascular","EXC2"="Neurons","BG"="Astroependymal","INH1"="Neurons",
"CHP"="Astroependymal","INH2"="Neurons","OEC"="OEC","INH3"="Neurons")
names(l1.ids) <- levels(combined.obj.ls$H3K27ac_ATAGAGGC)
# rename the cluster identities with the layer1 annotations
combined.obj.ls$H3K27ac_ATAGAGGC <- RenameIdents(combined.obj.ls$H3K27ac_ATAGAGGC,l1.ids)
# Add layer1 annotations to the object
combined.obj.ls$H3K27ac_ATAGAGGC <- AddMetaData(combined.obj.ls$H3K27ac_ATAGAGGC,
Idents(combined.obj.ls$H3K27ac_ATAGAGGC),"layer1_annotation")
# UMAP plot
p2=DimPlot(combined.obj.ls$H3K27ac_ATAGAGGC, label = F) +
ggtitle("H3K27ac") +
theme(plot.title = element_text(size=15,hjust=0.5,face='bold'))
Finally, transfer the layer1 and layer2 annotations from the H3K27ac to the H3K27me3 and ATAC modalities.
# H3K27ac annotated cell states
idents_l1 <- combined.obj.ls$H3K27ac_ATAGAGGC$layer1_annotation
idents_l2 <- combined.obj.ls$H3K27ac_ATAGAGGC$layer2_annotation
# Transfer to ATAC and H3K27me3
# layer1
combined.obj.ls$ATAC_TATAGCCT <- AddMetaData(combined.obj.ls$ATAC_TATAGCCT,
idents_l1,
col.name='layer1_annotation')
combined.obj.ls$H3K27me3_CCTATCCT <- AddMetaData(combined.obj.ls$H3K27me3_CCTATCCT,
idents_l1,
col.name='layer1_annotation')
# layer2
combined.obj.ls$ATAC_TATAGCCT <- AddMetaData(combined.obj.ls$ATAC_TATAGCCT,
idents_l2,
col.name='layer2_annotation')
combined.obj.ls$H3K27me3_CCTATCCT <- AddMetaData(combined.obj.ls$H3K27me3_CCTATCCT,
idents_l2,
col.name='layer2_annotation')
# Plot each modality separately
p1=DimPlot(combined.obj.ls$ATAC_TATAGCCT, label = TRUE,repel = T, group.by = "layer2_annotation",label.size=3) +
NoLegend() +
ggtitle("ATAC") +
theme(plot.title = element_text(size=15,hjust=0.5,face='bold'))
p2=DimPlot(combined.obj.ls$H3K27ac_ATAGAGGC, label = TRUE,repel = T, group.by = "layer2_annotation",label.size=3) +
NoLegend() +
ggtitle("H3K27ac") +
theme(plot.title = element_text(size=15,hjust=0.5,face='bold'))
p3=DimPlot(combined.obj.ls$H3K27me3_CCTATCCT, label = TRUE,repel = T, group.by = "layer2_annotation",label.size=3) +
NoLegend() +
ggtitle("H3K27me3") +
theme(plot.title = element_text(size=15,hjust=0.5,face='bold'))
ggarrange(p1,p2,p3,ncol=3)
plotConnectModal(seurat = combined.obj.ls, group = 'layer1_annotation')
saveRDS(combined.obj.ls, file = "../nanoscope_final_bins_5kb.rds")
sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.5.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggVennDiagram_1.2.2 scales_1.2.1
## [3] regioneR_1.30.0 ComplexUpset_1.3.5
## [5] BSgenome.Mmusculus.UCSC.mm10_1.4.3 BSgenome_1.66.3
## [7] rtracklayer_1.58.0 Biostrings_2.66.0
## [9] XVector_0.38.0 EnsDb.Mmusculus.v79_2.99.0
## [11] ensembldb_2.22.0 AnnotationFilter_1.22.0
## [13] GenomicFeatures_1.50.4 AnnotationDbi_1.60.2
## [15] Biobase_2.58.0 ggpubr_0.6.0
## [17] gghalves_0.1.4 ggplot2_3.4.2
## [19] stringr_1.5.0 future_1.32.0
## [21] GenomicRanges_1.50.2 GenomeInfoDb_1.34.9
## [23] IRanges_2.32.0 S4Vectors_0.36.2
## [25] BiocGenerics_0.44.0 SeuratObject_4.1.3
## [27] Seurat_4.3.0 Signac_1.9.0
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.3 spatstat.explore_3.2-1
## [3] reticulate_1.28 tidyselect_1.2.0
## [5] RSQLite_2.3.1 htmlwidgets_1.6.2
## [7] grid_4.2.2 BiocParallel_1.32.6
## [9] Rtsne_0.16 munsell_0.5.0
## [11] units_0.8-2 codetools_0.2-19
## [13] ica_1.0-3 miniUI_0.1.1.1
## [15] withr_2.5.0 spatstat.random_3.1-5
## [17] colorspace_2.1-0 progressr_0.13.0
## [19] filelock_1.0.2 highr_0.10
## [21] knitr_1.42 rstudioapi_0.14
## [23] ROCR_1.0-11 ggsignif_0.6.4
## [25] tensor_1.5 listenv_0.9.0
## [27] labeling_0.4.2 MatrixGenerics_1.10.0
## [29] GenomeInfoDbData_1.2.9 polyclip_1.10-4
## [31] farver_2.1.1 bit64_4.0.5
## [33] parallelly_1.35.0 vctrs_0.6.2
## [35] generics_0.1.3 xfun_0.39
## [37] biovizBase_1.46.0 BiocFileCache_2.6.1
## [39] R6_2.5.1 RVenn_1.1.0
## [41] DelayedArray_0.24.0 bitops_1.0-7
## [43] spatstat.utils_3.0-3 cachem_1.0.8
## [45] promises_1.2.0.1 BiocIO_1.8.0
## [47] nnet_7.3-19 gtable_0.3.3
## [49] globals_0.16.2 goftest_1.2-3
## [51] rlang_1.1.1 RcppRoll_0.3.0
## [53] splines_4.2.2 rstatix_0.7.2
## [55] lazyeval_0.2.2 dichromat_2.0-0.1
## [57] checkmate_2.2.0 spatstat.geom_3.2-1
## [59] broom_1.0.4 yaml_2.3.7
## [61] reshape2_1.4.4 abind_1.4-5
## [63] backports_1.4.1 httpuv_1.6.11
## [65] Hmisc_5.1-0 tools_4.2.2
## [67] ellipsis_0.3.2 jquerylib_0.1.4
## [69] RColorBrewer_1.1-3 proxy_0.4-27
## [71] ggridges_0.5.4 Rcpp_1.0.10
## [73] plyr_1.8.8 base64enc_0.1-3
## [75] progress_1.2.2 zlibbioc_1.44.0
## [77] classInt_0.4-9 purrr_1.0.1
## [79] RCurl_1.98-1.12 prettyunits_1.1.1
## [81] rpart_4.1.19 deldir_1.0-9
## [83] pbapply_1.7-0 cowplot_1.1.1
## [85] zoo_1.8-12 SummarizedExperiment_1.28.0
## [87] ggrepel_0.9.3 cluster_2.1.4
## [89] magrittr_2.0.3 data.table_1.14.8
## [91] scattermore_1.1 lmtest_0.9-40
## [93] RANN_2.6.1 ProtGenerics_1.30.0
## [95] fitdistrplus_1.1-11 matrixStats_0.63.0
## [97] hms_1.1.3 patchwork_1.1.2
## [99] mime_0.12 evaluate_0.21
## [101] xtable_1.8-4 XML_3.99-0.14
## [103] gridExtra_2.3 compiler_4.2.2
## [105] biomaRt_2.54.1 tibble_3.2.1
## [107] KernSmooth_2.23-21 crayon_1.5.2
## [109] htmltools_0.5.5 later_1.3.1
## [111] Formula_1.2-5 tidyr_1.3.0
## [113] DBI_1.1.3 dbplyr_2.3.2
## [115] MASS_7.3-60 rappdirs_0.3.3
## [117] sf_1.0-12 Matrix_1.5-4.1
## [119] car_3.1-2 cli_3.6.1
## [121] parallel_4.2.2 igraph_1.4.1
## [123] pkgconfig_2.0.3 GenomicAlignments_1.34.1
## [125] foreign_0.8-84 sp_1.6-0
## [127] plotly_4.10.1 spatstat.sparse_3.0-1
## [129] xml2_1.3.4 bslib_0.4.2
## [131] VariantAnnotation_1.44.1 digest_0.6.31
## [133] sctransform_0.3.5 RcppAnnoy_0.0.20
## [135] spatstat.data_3.0-1 rmarkdown_2.21
## [137] leiden_0.4.3 fastmatch_1.1-3
## [139] htmlTable_2.4.1 uwot_0.1.14
## [141] restfulr_0.0.15 curl_5.0.0
## [143] shiny_1.7.4 Rsamtools_2.14.0
## [145] rjson_0.2.21 lifecycle_1.0.3
## [147] nlme_3.1-162 jsonlite_1.8.4
## [149] carData_3.0-5 viridisLite_0.4.2
## [151] fansi_1.0.4 pillar_1.9.0
## [153] lattice_0.21-8 KEGGREST_1.38.0
## [155] fastmap_1.1.1 httr_1.4.6
## [157] survival_3.5-5 glue_1.6.2
## [159] png_0.1-8 bit_4.0.5
## [161] class_7.3-22 stringi_1.7.12
## [163] sass_0.4.6 blob_1.2.4
## [165] memoise_2.0.1 dplyr_1.1.2
## [167] e1071_1.7-13 irlba_2.3.5.1
## [169] future.apply_1.11.0