pasta_vignette.Rmd
While scRNA-seq has been widely applied to characterize cellular heterogeneity in gene expression levels, these datasets can also be leveraged to explore variation in transcript structure as well. For example, reads from 3’ scRNA-seq datasets can be leveraged to quantify the relative usage of multiple polyadenylation (polyA) sites within the same gene.
In Kowalski* , Wessels*, Linder* et al, we introduce a statistical framework, based on the Dirichlet multinomial distribution, to characterize heterogeneity in the relative use of polyA sites at single-cell resolution. We calculate polyA-residuals, which indicate - for each polyA site in each cell - the degree of over or underutilization compared to a background model. We use these residuals for both supervised differential analysis (differential polyadenylation between groups of cells), but also unsupervised analysis and visualization.
We have implemented these methods in an R package PASTA (PolyA Site analysis using relative Transcript Abundance) that interfaces directly with Seurat. In this vignette, we demonstrate how to use PASTA and Seurat to analyze a dataset of 49,958 circulating human peripheral blood mononuclear cells (PBMC). This vignette demonstrates how to reproduce results from Figure 6 in our manuscript. While this represents an initial demonstration of PASTA, we will be adding significant functionality and documentation in future releases.
First, we install PASTA and dependencies. Note: At the time of release of this vignette, there was an issue with GenomeInfoDb. You may have to install as described here, if you do not have the latest version of Bioconductor installed.
# install remotes and BiocManager if necessary
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
# additionally install several Bioconductor dependencies
BiocManager::install(c("BSgenome.Hsapiens.UCSC.hg38", "EnsDb.Hsapiens.v86", "rtracklayer", "GenomicFeatures",
"plyranges"))
remotes::install_github("satijalab/PASTA")
# loading PASTA also loads Seurat and Signac
library(PASTA)
library(ggplot2)
library(EnsDb.Hsapiens.v86)
The data for this vignette can be found on the following site: The files include:
# dir = directory where files are downloaded
counts.file <- paste0(dir, "PBMC_pA_counts.tab.gz")
peak.file <- paste0(dir, "PBMC_polyA_peaks.gff")
fragment.file <- paste0(dir, "PBMC_fragments.tsv.gz")
polyAdb.file <- paste0(dir, "human_PAS_hg38.txt")
metadata.file <- paste0(dir, "PBMC_meta_data.csv")
First, we create a Seurat object based on a pre-quantified scRNA-seq expression matrix of human PBMC. We then annotate cell types using our Azimuth reference, as described here.
library(Azimuth)
counts = Read10X(dir)
pbmc <- CreateSeuratObject(counts, min.cells = 3, min.features = 200)
# add meta data
meta_data <- read.csv(file = metadata.file, row.names = 1)
pbmc <- AddMetaData(pbmc, metadata = meta_data)
# Run Azimuth for each Donor
obj.list <- SplitObject(pbmc, split.by = "donor")
obj.list <- lapply(obj.list, FUN = RunAzimuth, reference = "pbmcref")
pbmc <- merge(obj.list[[1]], obj.list[2:length(obj.list)], merge.dr = "ref.umap")
# note that Azimuth provides higher-resolution annotations with pbmc$celltype.l2
Idents(pbmc) <- pbmc$celltype.l1
library(RColorBrewer)
cols <- colorRampPalette(brewer.pal(8, "Dark2"))(30)
cols.l1 <- cols[c(1:4, 9, 11, 25, 29, 30)]
names(cols.l1) <- c("NK", "CD4 T", "CD8 T", "Other T", "B", "Plasma", "Mono", "DC", "Other")
DimPlot(pbmc, reduction = "ref.umap", label = TRUE, cols = cols.l1)
Next, we create a polyA assay that we will add to the Seurat object, which quantifies the usage (counts) of individual polyA sites in single cells. PASTA currently accepts polyA site quantifications from the polyApipe pipeline, but we will be adding support for additional tools in future releases. The ReadPolyApipe
function takes three input files
blocks
function in sinto.Once we read in these files, we can create a polyA assay, and add it to the Seurat pbmc object.
polyA.counts = ReadPolyApipe(counts.file = counts.file, peaks.file = peak.file, filter.chromosomes = TRUE,
min.features = 10, min.cells = 25)
polyA.assay = CreatePolyAsiteAssay(counts = polyA.counts, genome = "hg38", fragments = fragment.file,
validate.fragments = FALSE)
# add annotations to polyA.assay
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
genome(annotations) <- "hg38"
Annotation(polyA.assay) <- annotations
# only use cells in both assays
cells = intersect(colnames(pbmc), colnames(polyA.assay))
polyA.assay = subset(polyA.assay, cells = cells)
pbmc = subset(pbmc, cells = cells)
pbmc[["polyA"]] <- polyA.assay #add polyA assay to Seurat object
DefaultAssay(pbmc) <- "polyA"
pbmc
## An object of class Seurat
## 74239 features across 49958 samples within 6 assays
## Active assay: polyA (41480 features, 0 variable features)
## 5 other assays present: RNA, refAssay, prediction.score.celltype.l1, prediction.score.celltype.l2, prediction.score.celltype.l3
## 1 dimensional reduction calculated: ref.umap
We next annotate each polyA site with information from the PolyA_DB v3 database, which catalogs polyA sites in the human genome based on deep sequencing data. This GetPolyADbAnnotation
function assigns each polyA site to a gene, and assigns a location within the transcript (intron, last exon, etc.). This information is stored in the assay, and can be accessed using the [[]]
feature metadata accessor function.
Note that for polyA sites that receive an NA
for the Gene_Symbol
field, these sites did not map to any site within 50bp in the polyAdbv3 database, and may be spurious.
# max.dist parameter controls maximum distance from polyAdbv3 site
pbmc <- GetPolyADbAnnotation(pbmc, polyAdb.file = polyAdb.file, max.dist = 50)
meta <- pbmc[["polyA"]][[]]
head(meta[, c(1, 13, 15, 18)])
## strand PAS_ID Intron.exon_location
## 10-100000560-100000859 + <NA> <NA>
## 10-100150094-100150393 - chr10:101909851:- 3'_most_exon
## 10-100181533-100181832 - <NA> <NA>
## 10-100188367-100188666 - chr10:101948124:- 3'_most_exon
## 10-100232298-100232597 - chr10:101992055:- 3'_most_exon
## 10-100237155-100237454 - chr10:101996912:- Intron
## Gene_Symbol
## 10-100000560-100000859 <NA>
## 10-100150094-100150393 ERLIN1
## 10-100181533-100181832 <NA>
## 10-100188367-100188666 CHUK
## 10-100232298-100232597 CWF19L1
## 10-100237155-100237454 CWF19L1
We can now calculate polyA residuals, as described in our manuscript. For this analysis we will focus on tandem polyadenylation events reflecting the usage of distinct polyA sites located in the 3’ UTR of the terminal exon. Once we select features, we can run the CalcPolyAResiduals
function. The residuals are stored in the scale.data
slot of the polyA assay.
# extract 14,985 features in 3' UTRs
features.last.exon = rownames(subset(meta, Intron.exon_location == "3'_most_exon"))
length(features.last.exon)
## [1] 14985
# gene names are stored in the Gene_Symbol column of the meta features
pbmc <- CalcPolyAResiduals(pbmc, assay = "polyA", features = features.last.exon, gene.names = "Gene_Symbol",
verbose = TRUE)
Now, we can perform dimension reduction directly on the polyA residuals. The workflow is similar to scRNA-seq: we use FindVariableFeatures
to identify polyA sites that vary across cells, and then run perform dimension reduction on the variable polyA sites. We see that there is a very strong separation between Plasma cell and other celltypes.
pbmc <- FindVariableFeatures(pbmc, selection.method = "residuals", gene.names = "Gene_Symbol")
pbmc <- RunPCA(pbmc)
pbmc <- RunUMAP(pbmc, dims = 1:30, reduction.name = "polyA.umap", reduction.key = "polyAUMAP_")
DimPlot(pbmc, group.by = "celltype.l1", reduction = "polyA.umap", cols = cols.l1) + ggtitle("Level 1 Annotations")
Analagous to differential expression anlaysis, we can also perform differential polyadenylation analysis. As described in our manuscript, this function aims to identify polyA sites whose relative usage within a gene changes across different groups of cells. Here, we identify differentially polyadenylated sites between Plasma cells and B cells.
The FindDifferentialPolyA function returns the following outputs:
Idents(pbmc) <- pbmc$celltype.l1
m.plasma <- FindDifferentialPolyA(pbmc, ident.1 = "Plasma", ident.2 = "B", covariates = "donor")
head(m.plasma, 10)
## Estimate p.value p_val_adj percent.1
## 17-41690699-41690998 1.1450433 0.000000e+00 0.000000e+00 0.592298722
## 6-7882814-7883113 0.6222369 0.000000e+00 0.000000e+00 0.346950866
## 12-49762706-49763005 0.5352492 0.000000e+00 0.000000e+00 0.770397231
## 4-176331946-176332245 -0.4459568 0.000000e+00 0.000000e+00 0.069836204
## 6-7881521-7881820 -0.4895199 0.000000e+00 0.000000e+00 0.587784615
## 12-49764635-49764934 -0.5325040 0.000000e+00 0.000000e+00 0.229602769
## 17-41691345-41691644 -1.1527642 0.000000e+00 0.000000e+00 0.001901237
## 3-57571365-57571664 -0.5136573 2.588878e-313 2.128058e-309 0.228808536
## 3-57572076-57572375 0.5317318 1.823834e-275 1.499191e-271 0.759039715
## 5-177331562-177331861 -0.4413488 2.480992e-258 2.039376e-254 0.320045711
## percent.2 symbol
## 17-41690699-41690998 0.326026942 EIF1
## 6-7882814-7883113 0.117305459 TXNDC5
## 12-49762706-49763005 0.186829268 TMBIM6
## 4-176331946-176332245 0.193798450 SPCS3
## 6-7881521-7881820 0.806039489 TXNDC5
## 12-49764635-49764934 0.813170732 TMBIM6
## 17-41691345-41691644 0.004107767 EIF1
## 3-57571365-57571664 0.722650231 ARF4
## 3-57572076-57572375 0.183359014 ARF4
## 5-177331562-177331861 0.504907975 LMAN2
We can also visualize the read coverage across cell groups, which demonstrates changes in transcript structure across groups of cells. The PolyACoveragePlot
function is based off the CoveragePlot
function in Signac, and uses the previously computed fragment file to compute local coverage. Beneath the coverage plot, we also display the location of polyA sites (peaks), as well as gene annotations. Note that in each of these examples, Plasmablasts exhibit increased utilization of the proximal polyA site (i.e. 3’ UTR shortening)
Idents(pbmc) <- pbmc$celltype.l1
# by default shows all polyAsites where we calculated polyA residuals in a gene
PolyACoveragePlot(pbmc, gene = "EIF1") & scale_fill_manual(values = cols.l1)
PolyACoveragePlot(pbmc, gene = "TMBIM6") & scale_fill_manual(values = cols.l1)
Session Info
## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] RColorBrewer_1.1-3 Azimuth_0.4.6
## [3] shinyBS_0.61.1 rlang_1.0.6
## [5] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.20.2
## [7] AnnotationFilter_1.20.0 GenomicFeatures_1.48.4
## [9] AnnotationDbi_1.58.0 Biobase_2.56.0
## [11] GenomicRanges_1.48.0 GenomeInfoDb_1.35.15
## [13] IRanges_2.30.0 S4Vectors_0.34.0
## [15] BiocGenerics_0.42.0 ggplot2_3.4.0
## [17] PASTA_0.0.0.9000 Signac_1.9.0
## [19] SeuratObject_4.1.3 Seurat_4.3.0
##
## loaded via a namespace (and not attached):
## [1] rappdirs_0.3.3 rtracklayer_1.56.1
## [3] scattermore_0.8 R.methodsS3_1.8.2
## [5] ragg_1.2.4 tidyr_1.3.0
## [7] bit64_4.0.5 knitr_1.42
## [9] irlba_2.3.5.1 DelayedArray_0.22.0
## [11] R.utils_2.12.2 rpart_4.1.19
## [13] data.table_1.14.6 KEGGREST_1.36.0
## [15] RCurl_1.98-1.9 generics_0.1.3
## [17] cowplot_1.1.1 RSQLite_2.2.12
## [19] RANN_2.6.1 future_1.31.0
## [21] bit_4.0.4 spatstat.data_3.0-0
## [23] xml2_1.3.3 httpuv_1.6.8
## [25] SummarizedExperiment_1.26.1 assertthat_0.2.1
## [27] gargle_1.2.0 xfun_0.37
## [29] hms_1.1.2 jquerylib_0.1.4
## [31] evaluate_0.20 promises_1.2.0.1
## [33] fansi_1.0.4 restfulr_0.0.15
## [35] progress_1.2.2 dbplyr_2.2.1
## [37] igraph_1.3.5 DBI_1.1.2
## [39] htmlwidgets_1.6.1 spatstat.geom_3.0-6
## [41] googledrive_2.0.0 purrr_1.0.1
## [43] ellipsis_0.3.2 backports_1.4.1
## [45] dplyr_1.1.0 biomaRt_2.52.0
## [47] deldir_1.0-6 MatrixGenerics_1.8.0
## [49] vctrs_0.5.2 SeuratDisk_0.0.0.9020
## [51] ROCR_1.0-11 adiposeref.SeuratData_1.0.0
## [53] abind_1.4-5 cachem_1.0.6
## [55] withr_2.5.0 BSgenome_1.64.0
## [57] progressr_0.13.0 checkmate_2.1.0
## [59] presto_1.0.0 sctransform_0.3.5
## [61] GenomicAlignments_1.32.1 prettyunits_1.1.1
## [63] goftest_1.2-3 cluster_2.1.4
## [65] lazyeval_0.2.2 crayon_1.5.2
## [67] hdf5r_1.3.5 spatstat.explore_3.0-6
## [69] labeling_0.4.2 pkgconfig_2.0.3
## [71] nlme_3.1-162 ProtGenerics_1.28.0
## [73] nnet_7.3-18 globals_0.16.2
## [75] lifecycle_1.0.3 miniUI_0.1.1.1
## [77] filelock_1.0.2 BiocFileCache_2.4.0
## [79] pbmc3k.SeuratData_3.1.4 SeuratData_0.2.2
## [81] dichromat_2.0-0.1 cellranger_1.1.0
## [83] rprojroot_2.0.3 polyclip_1.10-4
## [85] matrixStats_0.63.0 lmtest_0.9-40
## [87] hcabm40k.SeuratData_3.0.0 Matrix_1.5-1
## [89] zoo_1.8-11 base64enc_0.1-3
## [91] ggridges_0.5.4 googlesheets4_1.0.0
## [93] png_0.1-8 viridisLite_0.4.1
## [95] rjson_0.2.21 mousecortexref.SeuratData_1.0.0
## [97] bitops_1.0-7 shinydashboard_0.7.2
## [99] R.oo_1.25.0 KernSmooth_2.23-20
## [101] Biostrings_2.64.0 blob_1.2.3
## [103] stringr_1.5.0 parallelly_1.34.0
## [105] spatstat.random_3.1-3 jpeg_0.1-10
## [107] scales_1.2.1 memoise_2.0.1
## [109] magrittr_2.0.3 plyr_1.8.8
## [111] ica_1.0-3 zlibbioc_1.42.0
## [113] compiler_4.2.2 BiocIO_1.6.0
## [115] fitdistrplus_1.1-8 Rsamtools_2.12.0
## [117] cli_3.6.0 XVector_0.36.0
## [119] listenv_0.9.0 patchwork_1.1.2
## [121] pbapply_1.7-0 htmlTable_2.4.1
## [123] Formula_1.2-4 formatR_1.12
## [125] MASS_7.3-58.2 tidyselect_1.2.0
## [127] stxBrain.SeuratData_0.1.1 stringi_1.7.12
## [129] textshaping_0.3.6 highr_0.10
## [131] yaml_2.3.7 latticeExtra_0.6-30
## [133] ggrepel_0.9.3 grid_4.2.2
## [135] VariantAnnotation_1.42.1 sass_0.4.5
## [137] fastmatch_1.1-3 tools_4.2.2
## [139] bmcite.SeuratData_0.3.0 future.apply_1.10.0
## [141] parallel_4.2.2 rstudioapi_0.13
## [143] foreign_0.8-82 gridExtra_2.3
## [145] farver_2.1.1 plyranges_1.16.0
## [147] Rtsne_0.16 digest_0.6.31
## [149] rgeos_0.5-9 shiny_1.7.4
## [151] MGLM_0.2.1 Rcpp_1.0.10
## [153] later_1.3.0 RcppAnnoy_0.0.20
## [155] httr_1.4.4 biovizBase_1.44.0
## [157] panc8.SeuratData_3.0.2 colorspace_2.1-0
## [159] XML_3.99-0.9 fs_1.6.0
## [161] tensor_1.5 reticulate_1.28
## [163] splines_4.2.2 uwot_0.1.14
## [165] RcppRoll_0.3.0 spatstat.utils_3.0-1
## [167] pkgdown_2.0.6 sp_1.6-0
## [169] plotly_4.10.1.9000 systemfonts_1.0.4
## [171] xtable_1.8-4 jsonlite_1.8.4
## [173] pbmcref.SeuratData_1.0.0 R6_2.5.1
## [175] Hmisc_4.7-2 pillar_1.8.1
## [177] htmltools_0.5.4 mime_0.12
## [179] glue_1.6.2 fastmap_1.1.0
## [181] DT_0.26 BiocParallel_1.30.0
## [183] codetools_0.2-19 ifnb.SeuratData_3.1.0
## [185] utf8_1.2.3 lattice_0.20-45
## [187] bslib_0.4.2 spatstat.sparse_3.0-0
## [189] tibble_3.1.8 curl_5.0.0
## [191] leiden_0.4.3 interp_1.1-3
## [193] shinyjs_2.1.0 survival_3.4-0
## [195] rmarkdown_2.20 desc_1.4.1
## [197] munsell_0.5.0 GenomeInfoDbData_1.2.8
## [199] thp1.eccite.SeuratData_3.1.5 reshape2_1.4.4
## [201] gtable_0.3.1