The Platypus ecosystem of packages are meant to provide potential pipelines and examples relevant to the broad field of computational immunology. While the majority of functions are currently tailored to single-cell immune repertoire sequencing, we will continue to building the package in a modular structure that allows integration to other bioinformatic methods. Currently, the package can be used to separately analyze gene expression (GEX) or immune receptor repertoire (VDJ) sequencing data, in addition to integrating the two data sets to combine phenotypic features with repertoire analysis. The package is designed to primarily analyze the output from 10x genomics cellranger (output from count for GEX and VDJ for enriched immune receptor libraries). The functions could work with other barcode-based scSeq technologies assuming the input columns are added correctly. The gene expression analysis relies heavily upon Seurat, a commonly used R package for single-cell sequencing (scSeq). The core set of functions can be found at https://github.com/alexyermanos/Platypus and examples of use can be found in the publications https://doi.org/10.1093/nargab/lqab023 and https://academic.oup.com/bioinformatics/article/39/9/btad553/7264179
Stay tuned for updates https://twitter.com/AlexYermanos
Following the original NARGAB publication of Platypus, we have rebuilt the package to entirely revolve around the VDJ_GEX_matrix function (VGM for short). This function integrates both repertoire and transcriptome information (if desired) and serves as the input to all secondary functions in future iterations of the package. The advantage of this is having all repertoire and transcriptome information at a per-cell level. This was originally required due to significant changes to the clonotyping strategies in the default 10x genomics VDJ pipeline. While the format of this object will be detailed below, the core of the function creates a list in R where the first element corresponds to the VDJ information at the per-cell level and the second element corresponds to a Seurat object. The VGM can then be supplied to all other functions that will assume this format by default. All functions should now be updated to take the VGM structure (referred to as “v3” of Platypus) as input in the “platypus.version” argument. “platypus.version” can be set to either “v2” for backcompatibility found in the original manucsript or “v3” for newer and future functions. Some new functions are only compatible with “v3”.
Yermanos A et al. Platypus: an open-access software for integrating lymphocyte single-cell immune repertoires with transcriptomes. NAR Genomics Bioinforma (2021) 3: doi:10.1093/NARGAB/LQAB023
Tudor-Stefan Cotet, Andreas Agrafiotis, Victor Kreiner et al. ePlatypus: an ecosystem for computational analysis of immunogenomics data, Bioinformatics, Volume 39, Issue 9, September 2023, btad553, https://doi.org/10.1093/bioinformatics/btad553
We understand the pain that locating and utilizing public data can inflict on aspiring and established scientists alike. We have therefore created a public database that is integrated within the structure of the Platypus package. This means that single-cell immune repertoire sequencing data can be directly loaded into the R session and even integrated with locally stored data. All examples in these vignettes highlight the strength of this feature by primarily using public datasets within our database. We highly encourage others to contribute to this database by contacting use with the output files from your cellranger runs (hopefully with at least v6 of cellranger and using the standard references). Sharing and organizing public data can help save the world.
We are adding new features constantly. Recent changes may not yet be reflected in the CRAN version. All Github versions are developmental!
Individual R functions can additionally be found on the Github in the folder “R”. These can similarly be downloaded and loaded into the R environment in case not all functions are desired. These functions are actively updated and may include more features than the in tar.gz or Github version file.
### Removing any previous versions of the package
#First can ensure that there is no previous version installed locally
detach("package:Platypus", unload=TRUE)
remove.packages("Platypus")
### Downloading and installing Platypus
# First we need to download the most recent version from the master branch at https://github.com/alexyermanos/Platypus we can install the package using the following command.
# WARNING: This needs to be replaced with your own directory where the downloaded package is found
# For MacOS users it may look like this
install.packages("~/Downloads/Platypus_3.5.0.tar.gz", repos = NULL, type="source")
# For windows it will likely look something like this.
# WARNING: You will need to replace 'YourPCName' with your user name for the windows account in the directory.
#install.packages("C:/Users/YourPCName/Downloads/Platypus_3.5.0.tar.gz", repos = NULL, type="source")
library(Platypus)
Installing from the Github repo
Comprehensive details about the parameters of the VDJ_GEX_matrix() function can be found in the respective function documentation and in the dedicated vignette.
As an exemplary dataset we use GEX data from murine B and T cells and VDJ data from B cells in the aged CNS. This small dataset contains VDJ and GEX libraries from the central nervous system of four murine samples. The corresponding publication can be found here: https://doi.org/10.1098/rspb.2020.2793
For local datasets, the paths to Cellranger output directories are needed. VDJ and GEX are processed simultaneously using the VDJ_GEX_matrix function, any input of the two can be omitted if not necessary.
#Creating a list with local paths to cellranger directories
VDJ.out.directory.list <- list() ### Set directory to the outs folder of cellranger VDJ
VDJ.out.directory.list[[1]] <- c("~/Downloads/yermanos2021b__VDJ_RAW/Aged.CNS.pool.3m.Bcell.S1")
VDJ.out.directory.list[[2]] <- c("~/Downloads/yermanos2021b__VDJ_RAW/Aged.CNS.pool.12m.Bcell.S2")
VDJ.out.directory.list[[3]] <- c("~/Downloads/yermanos2021b__VDJ_RAW/Aged.CNS.pool.18m.Bcell.S3")
VDJ.out.directory.list[[4]] <- c("~/Downloads/yermanos2021b__VDJ_RAW/Aged.CNS.single.18m.Bcell.S4")
GEX.out.directory.list <- list() ### Set directory to the /outs folder of cellranger count
GEX.out.directory.list[[1]] <- c("~/Downloads/yermanos2021b__GEX_RAW/Aged.CNS.pool.3m.Bcell.S1/")
GEX.out.directory.list[[2]] <- c("~/Downloads/yermanos2021b__GEX_RAW/Aged.CNS.pool.12m.Bcell.S2/")
GEX.out.directory.list[[3]] <- c("~/Downloads/yermanos2021b__GEX_RAW/Aged.CNS.pool.18m.Bcell.S3/")
GEX.out.directory.list[[4]] <- c("~/Downloads/yermanos2021b__GEX_RAW/Aged.CNS.single.18m.Bcell.S4/")
#Running the VDJ_GEX_matrix function
VGM <- VDJ_GEX_matrix(VDJ.out.directory.list = VDJ.out.directory.list,
GEX.out.directory.list = GEX.out.directory.list,
GEX.integrate = T,
VDJ.combine = T,
integrate.GEX.to.VDJ = T,
integrate.VDJ.to.GEX = T)
If you are only interested in repertoire analysis, we have made available a function to directly build the VDJ data frame of the VGM object. This will integrate trimmed and raw receptor sequences, consensus sequences, and germlines. For obtaining trimmed germlines, we are aligning the raw germline to the consensus sequence (gapless global-local alignment). However, this can slow down the computation for the VDJ data frame, therefore we recommend you use the parallel option.
Ensure your VDJ output files have been obtained using cellranger v7 (e.g., the filtered_contig_annotations.csv file should have annotations for all FWR and CDR regions). The following files should be present in all sample folders: all_contig_annotations.csv, filtered_contig_annotations.csv, all_contig_annotations.csv, filtered_contig.fasta, consensus.fasta, concat_ref.fasta.
#Creating a list with local paths to Cellranger directories for VDJ outputs for the OVA dataset.
VDJ.out.directory.list <- list() ### Set directory to the outs folder of Cellranger VDJ
VDJ.out.directory.list[[1]] <- c("~/Downloads/ova_v7/S1")
VDJ.out.directory.list[[2]] <- c("~/Downloads/ova_v7/S2")
VDJ.out.directory.list[[3]] <- c("~/Downloads/ova_v7/S3")
VDJ.out.directory.list[[4]] <- c("~/Downloads/ova_v7/S4")
VDJ.out.directory.list[[5]] <- c("~/Downloads/ova_v7/S5")
#Running the minimal_VDJ function with trim.germlines = TRUE
VDJ <- minimal_VDJ(VDJ.directory.list = VDJ.out.directory.list,
trim.germlines = TRUE,
gap.opening.cost = Inf,
parallel = TRUE)
PlatypusDB stores both raw and processed data for each study. Processed data can be loaded directly as a VGM object. Raw data can be downloaded and then processed with the VDJ_GEX_matrix function.
These features are usful for integration of local and DB datasets. For this please refer to the PlatypusDB vignette
Here we load samples from PlatypusDB, and process these.
#Downloading PlatypusDB raw data in a list format
#For structure of PlatypusDB links, please refer to the PlatypusDB vignette
library(Platypus)
yermanos2021_raw <- PlatypusDB_fetch(PlatypusDB.links =
c("yermanos2021b/ALL/ALL"),
load.to.enviroment = F,
load.to.list = T)
#Running the VDJ_GEX_matrix function
VGM <- VDJ_GEX_matrix(Data.in = yermanos2021_raw,
GEX.integrate = T,
VDJ.combine = T,
integrate.GEX.to.VDJ = T,
integrate.VDJ.to.GEX = T,
verbose = F)
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at -2.0899
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 0.30103
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 7.0607e-15
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## Warning: The following requested variables were not found: UMAP_1, UMAP_2
We can briefly look at the datastructure
## [1] "barcode" "sample_id" "group_id"
## [4] "clonotype_id_10x" "clonotype_id" "clonotype_frequency"
## By setting integrate.GEX.to.VDJ and integrate.VDJ.to.GEX to T, VDJ and GEX information will be found in VGM[[1]] and VGM[[2]] objects.
# For example, the seurat-determined cluster is attached to each cell in the VDJ library by
head(VGM[[1]]$seurat_clusters)
## [1] <NA> <NA> <NA> <NA> <NA> <NA>
## Levels: 0 1 2 3 4 5 6 7 8 9 10 11
## [1] "CARIGYAMDYW" "CARDFTTVVARGYFDVW" "CARGITTVVAYYYAMDYW"
## [4] "CTTGPYDYYAMDYW" "CARPHDYDGVDYW" "CARRYYSNYAWFAYW"
# an NA indicates that the cell barcode in the VDJ library was not detected in the GEX object (or was filtered out, depending on mitochondrial gene limits, etc)
Importantly, all samples are integrated in one VGM. Basic stats can be retrieved from the VDJ_GEX_stats output which was run as part of the VGM function (get_VDJ_stats = T)
## Repertoir path
## VDJ.stats N:/reddy/Shared_Folders/AA_AY/DBmanuscript_Aged_CNS/VDJ/Aged.CNS.pool.3m.Bcell.S1
## VDJ.stats1 N:/reddy/Shared_Folders/AA_AY/DBmanuscript_Aged_CNS/VDJ/Aged.CNS.pool.12m.Bcell.S2
## VDJ.stats2 N:/reddy/Shared_Folders/AA_AY/DBmanuscript_Aged_CNS/VDJ/Aged.CNS.pool.18m.Bcell.S3
## VDJ.stats3 N:/reddy/Shared_Folders/AA_AY/DBmanuscript_Aged_CNS/VDJ/Aged.CNS.single.18m.Bcell.S4
## Sample name Nr unique barcodes Nr barcodes is_cell Nr cells 1VDJ 1VJ
## VDJ.stats s1 78 78 61
## VDJ.stats1 s2 160 160 122
## VDJ.stats2 s3 208 208 160
## VDJ.stats3 s4 499 499 372
## Nr cells 1VDJ 0VJ Nr cells 0VDJ 1VJ Nr cells 2 or more VDJ 1VJ
## VDJ.stats 8 1 0
## VDJ.stats1 13 15 1
## VDJ.stats2 7 16 4
## VDJ.stats3 34 60 8
## Nr cells 1VDJ 2 or more VJ Nr cells 2 or more VDJ 2 or more VJ
## VDJ.stats 7 1
## VDJ.stats1 7 2
## VDJ.stats2 18 3
## VDJ.stats3 22 3
## Nr cells full_length Nr cells productive Nr cells high_confidence
## VDJ.stats 78 78 78
## VDJ.stats1 160 160 160
## VDJ.stats2 208 208 208
## VDJ.stats3 499 499 499
## Nr cells all true Nr cells all true and 1VDJ 1VJ Nr clonotypes
## VDJ.stats 78 61 73
## VDJ.stats1 160 122 129
## VDJ.stats2 208 160 200
## VDJ.stats3 499 372 349
## Nr clonotypes 1VDJ 1VJ Nr clonotypes < 1VDJ 1VJ
## VDJ.stats 58 8
## VDJ.stats1 94 25
## VDJ.stats2 154 21
## VDJ.stats3 248 77
## Nr clonotypes > 1VDJ 1VJ % Nr unique barcodes % Nr barcodes is_cell
## VDJ.stats 7 100 100
## VDJ.stats1 10 100 100
## VDJ.stats2 25 100 100
## VDJ.stats3 24 100 100
## % Nr cells 1VDJ 1VJ % Nr cells 1VDJ 0VJ % Nr cells 0VDJ 1VJ
## VDJ.stats 78.2 10.3 1.3
## VDJ.stats1 76.2 8.1 9.4
## VDJ.stats2 76.9 3.4 7.7
## VDJ.stats3 74.5 6.8 12
## % Nr cells 2 or more VDJ 1VJ % Nr cells 1VDJ 2 or more VJ
## VDJ.stats 0 9
## VDJ.stats1 0.6 4.4
## VDJ.stats2 1.9 8.7
## VDJ.stats3 1.6 4.4
## % Nr cells 2 or more VDJ 2 or more VJ % Nr cells full_length
## VDJ.stats 1.3 100
## VDJ.stats1 1.2 100
## VDJ.stats2 1.4 100
## VDJ.stats3 0.6 100
## % Nr cells productive % Nr cells high_confidence % Nr cells all true
## VDJ.stats 100 100 100
## VDJ.stats1 100 100 100
## VDJ.stats2 100 100 100
## VDJ.stats3 100 100 100
## % Nr cells all true and 1VDJ 1VJ % Nr clonotypes
## VDJ.stats 78.2 100
## VDJ.stats1 76.2 100
## VDJ.stats2 76.9 100
## VDJ.stats3 74.5 100
## % Nr clonotypes 1VDJ 1VJ % Nr clonotypes < 1VDJ 1VJ
## VDJ.stats 79.5 11
## VDJ.stats1 72.9 19.4
## VDJ.stats2 77 10.5
## VDJ.stats3 71.1 22.1
## % Nr clonotypes > 1VDJ 1VJ Estimated.Number.of.Cells
## VDJ.stats 9.6 78
## VDJ.stats1 7.8 160
## VDJ.stats2 12.5 208
## VDJ.stats3 6.9 499
## Mean.Read.Pairs.per.Cell
## VDJ.stats 272,337
## VDJ.stats1 114,292
## VDJ.stats2 99,848
## VDJ.stats3 50,928
## Number.of.Cells.With.Productive.V.J.Spanning.Pair
## VDJ.stats 69
## VDJ.stats1 132
## VDJ.stats2 185
## VDJ.stats3 405
## Number.of.Read.Pairs Valid.Barcodes Q30.Bases.in.Barcode
## VDJ.stats 21,242,313 96.5% 91.7%
## VDJ.stats1 18,286,818 96.7% 91.9%
## VDJ.stats2 20,768,413 95.3% 91.7%
## VDJ.stats3 25,413,451 96.1% 91.7%
## Q30.Bases.in.RNA.Read.1 Q30.Bases.in.RNA.Read.2 Q30.Bases.in.UMI
## VDJ.stats 73.3% 79.8% 91.0%
## VDJ.stats1 73.2% 79.8% 91.1%
## VDJ.stats2 73.1% 79.2% 91.0%
## VDJ.stats3 73.3% 79.5% 91.0%
## Reads.Mapped.to.Any.V.D.J.Gene Reads.Mapped.to.IGH
## VDJ.stats 82.6% 26.3%
## VDJ.stats1 82.0% 30.4%
## VDJ.stats2 80.0% 20.5%
## VDJ.stats3 83.2% 24.6%
## Reads.Mapped.to.IGK Reads.Mapped.to.IGL
## VDJ.stats 29.4% 27.0%
## VDJ.stats1 30.3% 21.3%
## VDJ.stats2 38.0% 21.5%
## VDJ.stats3 40.6% 18.1%
## Mean.Used.Read.Pairs.per.Cell Fraction.Reads.in.Cells
## VDJ.stats 36,548 79.0%
## VDJ.stats1 29,068 74.6%
## VDJ.stats2 33,816 84.6%
## VDJ.stats3 18,887 55.1%
## Median.IGH.UMIs.per.Cell Median.IGK.UMIs.per.Cell
## VDJ.stats 11 30
## VDJ.stats1 7 12
## VDJ.stats2 10 31
## VDJ.stats3 4 10
## Median.IGL.UMIs.per.Cell Cells.With.Productive.V.J.Spanning.Pair
## VDJ.stats 47 88.5%
## VDJ.stats1 8 82.5%
## VDJ.stats2 32 88.9%
## VDJ.stats3 6.5 81.2%
## Cells.With.Productive.V.J.Spanning..IGK..IGH..Pair
## VDJ.stats 80.8%
## VDJ.stats1 71.2%
## VDJ.stats2 82.2%
## VDJ.stats3 76.0%
## Cells.With.Productive.V.J.Spanning..IGL..IGH..Pair
## VDJ.stats 9.0%
## VDJ.stats1 11.9%
## VDJ.stats2 6.7%
## VDJ.stats3 5.8%
## Paired.Clonotype.Diversity Cells.With.IGH.Contig
## VDJ.stats 69.14 98.7%
## VDJ.stats1 80 93.8%
## VDJ.stats2 191.43 96.2%
## VDJ.stats3 159.31 90.6%
## Cells.With.IGK.Contig Cells.With.IGL.Contig
## VDJ.stats 91.0% 38.5%
## VDJ.stats1 82.5% 38.1%
## VDJ.stats2 92.3% 41.8%
## VDJ.stats3 89.0% 29.5%
## Cells.With.CDR3.annotated.IGH.Contig
## VDJ.stats 98.7%
## VDJ.stats1 91.9%
## VDJ.stats2 93.3%
## VDJ.stats3 88.6%
## Cells.With.CDR3.annotated.IGK.Contig
## VDJ.stats 82.1%
## VDJ.stats1 78.8%
## VDJ.stats2 88.9%
## VDJ.stats3 86.8%
## Cells.With.CDR3.annotated.IGL.Contig
## VDJ.stats 11.5%
## VDJ.stats1 13.8%
## VDJ.stats2 8.2%
## VDJ.stats3 7.2%
## Cells.With.V.J.Spanning.IGH.Contig
## VDJ.stats 98.7%
## VDJ.stats1 93.1%
## VDJ.stats2 95.7%
## VDJ.stats3 88.8%
## Cells.With.V.J.Spanning.IGK.Contig
## VDJ.stats 89.7%
## VDJ.stats1 81.2%
## VDJ.stats2 92.3%
## VDJ.stats3 88.6%
## Cells.With.V.J.Spanning.IGL.Contig Cells.With.Productive.IGH.Contig
## VDJ.stats 16.7% 98.7%
## VDJ.stats1 17.5% 90.6%
## VDJ.stats2 10.1% 92.3%
## VDJ.stats3 10.4% 88.0%
## Cells.With.Productive.IGK.Contig Cells.With.Productive.IGL.Contig
## VDJ.stats 80.8% 10.3%
## VDJ.stats1 78.8% 13.8%
## VDJ.stats2 88.5% 8.2%
## VDJ.stats3 86.6% 7.2%
## rep_id Estimated.Number.of.Cells Mean.Reads.per.Cell
## VDJ.stats 1 303 819,292
## VDJ.stats1 2 261 919,694
## VDJ.stats2 3 945 413,917
## VDJ.stats3 4 595 369,552
## Median.Genes.per.Cell Number.of.Reads Valid.Barcodes
## VDJ.stats 1,199 248,245,523 83.2%
## VDJ.stats1 970 240,040,317 82.8%
## VDJ.stats2 1,338 391,152,436 85.3%
## VDJ.stats3 766 219,883,523 84.2%
## Sequencing.Saturation Q30.Bases.in.Barcode Q30.Bases.in.RNA.Read
## VDJ.stats 99.1% 97.7% 92.3%
## VDJ.stats1 98.8% 97.8% 89.9%
## VDJ.stats2 98.2% 97.8% 93.1%
## VDJ.stats3 97.9% 97.7% 92.3%
## Q30.Bases.in.UMI Reads.Mapped.to.Genome
## VDJ.stats 97.5% 90.7%
## VDJ.stats1 97.6% 83.1%
## VDJ.stats2 97.6% 92.2%
## VDJ.stats3 97.5% 91.1%
## Reads.Mapped.Confidently.to.Genome
## VDJ.stats 85.4%
## VDJ.stats1 77.3%
## VDJ.stats2 83.2%
## VDJ.stats3 85.7%
## Reads.Mapped.Confidently.to.Intergenic.Regions
## VDJ.stats 10.7%
## VDJ.stats1 15.0%
## VDJ.stats2 11.3%
## VDJ.stats3 15.8%
## Reads.Mapped.Confidently.to.Intronic.Regions
## VDJ.stats 9.7%
## VDJ.stats1 10.2%
## VDJ.stats2 10.5%
## VDJ.stats3 11.1%
## Reads.Mapped.Confidently.to.Exonic.Regions
## VDJ.stats 65.0%
## VDJ.stats1 52.0%
## VDJ.stats2 61.4%
## VDJ.stats3 58.7%
## Reads.Mapped.Confidently.to.Transcriptome
## VDJ.stats 61.5%
## VDJ.stats1 49.1%
## VDJ.stats2 59.7%
## VDJ.stats3 54.9%
## Reads.Mapped.Antisense.to.Gene Fraction.Reads.in.Cells
## VDJ.stats 12.7% 84.2%
## VDJ.stats1 12.9% 59.7%
## VDJ.stats2 11.7% 88.8%
## VDJ.stats3 14.6% 51.9%
## Total.Genes.Detected Median.UMI.Counts.per.Cell rep_id
## VDJ.stats 14,652 2,865 1
## VDJ.stats1 14,033 1,966 2
## VDJ.stats2 16,905 3,151 3
## VDJ.stats3 14,629 1,536 4
Moreover, given the many parameters of the VGM function is may be useful to check back on runtime parameters. These, as well as session info are also saved in the VGM object
The VGM object now contains all information needed for both V(D)J and gene expression analysis. In the examples below we first focus on gene expression
We can visualize the 2D plots for each sample individually. By default, UMAP, PCA, and TSNE embeddings are included in the object. Here it is important to supply the seurat object in VGM[[2]]. For the parameter group.by we supply “sample_id” to color points by the entry in their sample id column
# Ensure all gene names are capitalized for downstream analyses
rownames(VGM[[2]]) <- sapply(rownames(VGM[[2]]), function(x) toupper(x))
# In Platypus version 2, the output from GEX_automate was used as input to other GEX functions. These functions are still compatible with v3 if the VGM[[2]] seurat object is supplied as input.
# For example, the following function can be used to calculate the DE genes for each cluster, as before.
Seurat::DimPlot(VGM[[2]],reduction = "umap", group.by = "sample_id")
Seurat::DimPlot(VGM[[2]],reduction = "pca", group.by = "sample_id")
## Warning: Removed 1761 rows containing missing values (`geom_point()`).
## Warning: Removed 1761 rows containing missing values (`geom_point()`).
#alternatively plot each sample separately
Seurat::DimPlot(VGM[[2]],reduction = "umap", split.by = "sample_id")
#this also works with any other column of VGM[[2]]@meta.data
Seurat::DimPlot(VGM[[2]],reduction = "umap", split.by = "group_id")
We can observe that the majority of clusters have cells from both samples, suggesting a similar distribution of transcriptional properties between the two samples. We can also plot the cluster membership for each of the distinct samples using the GEX_proportions_barplot function.
GEX_proportions_barplot(GEX = VGM[[2]], stacked.plot = T, source.group = "sample_id", target.group = "seurat_clusters")
#This function is very flexible and can be used to plot proportions of cells from and of any groups. For this use the source.group and target.group parameters to specify metadata columns.
Finally, we can also look at common B cell markers.
Seurat::FeaturePlot(VGM[[2]],reduction = "umap",features = c("CD19","PTPRC", "EBF1", "H2-K1"))
#To easily scout through genes in the dataset use:
#View(as.data.frame(rownames(VGM[[2]])))
Platypus also allows us to assign cell type/state identity to different clusters by using GEX_phenotype. This function takes the Seurat object as input and uses canonical markers to easily match the clustering to known cell types. The user also has the possibility to use a custom list of markers and their associated cell types/states. While the default cell states are quite basic, we find that the strength of the function is that a user can label cells with custom phenotypes as demonstrated below. For more information please see the functions documentation.
#using defaults
VGM[[2]] <- GEX_phenotype(VGM[[2]], default = T)
#custom criteria
#VGM[[2]] <- GEX_phenotype(VGM[[2]], default = F,cell.state.markers=c("CD8A+;CCL5+;CD44+;IL7R-;CD19-","CD8A+;CCL5-;CD44+;IL7R+;CD19-"),cell.state.names=c("EffectorCD8","MemoryCD8"))
The resulting Seurat object now contains a “cell.state” column which can be used for annotation in the DimPlot function of the Seurat package.
After scaling, normalizing, and clustering the cells from the GEX libraries we can now investigate which genes are differentially expressed between either clusters or samples. First, we can investigate the genes that define each of the clusters by using the GEX output of the VGM function. Depending on the size of the dataset and the number of cells this function can be quite slow. The output of this function is a list in which each element contains the differentially expressed genes for a given cluster. For example, the first element of the list will correspond to a data frame describing the genes for cluster0 that we previously observed on the UMAP. This list object will correspond to the length of the number of clusters that were previously calculated, and has the same format as the FindMarkers function from Seurat. The pct.1 will correspond to the percentage of cells expressing the gene in the cluster of interest, and the pct.2 to the percentage of cells in all other clusters.
VGM[[2]] <- SeuratObject::JoinLayers(VGM[[2]])
gene_expression_cluster <- GEX_cluster_genes(VGM[[2]], min.pct = 0.2)
## For a (much!) faster implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the presto package
## --------------------------------------------
## install.packages('devtools')
## devtools::install_github('immunogenomics/presto')
## --------------------------------------------
## After installation of presto, Seurat will automatically use the more
## efficient implementation (no further action necessary).
## This message will be shown once per session
## [1] 12
## [1] 12
## [1] 1921 1266 1238 1424 2516 1814 1863 1617 1611 1513 1743 1409
We can now look at some of the genes associated with cluster0 in the previously displayed umap
## p_val avg_logFC pct.1 pct.2 p_val_adj SYMBOL cluster
## NKG7 2.631427e-306 3.732392 0.982 0.104 8.336623e-302 NKG7 0
## CCL5 1.070166e-286 3.434706 0.972 0.112 3.390392e-282 CCL5 0
## CTSW 5.876528e-254 3.307826 0.872 0.079 1.861743e-249 CTSW 0
## CD8A 2.962265e-251 3.746899 0.807 0.059 9.384753e-247 CD8A 0
## CD8B1 1.725846e-237 3.285546 0.842 0.080 5.467653e-233 CD8B1 0
## THY1 1.075640e-235 2.854907 0.941 0.145 3.407735e-231 THY1 0
Raw tables are informative, but not visually appealing. Differentially expressed genes may be plotted in different modes using Platypus.
It is also possible to create a heatmap displaying differentially expressed genes for each cluster. This can be customized to sub-sample cells in case certain clusters are too large for visualization purposes. Additionally, the user can determine the number of genes to display for each cluster based on the n.genes.per.cluster argument. The function GEX_cluster_genes_heatmap can be used to produce a ggplot object, based on the DoHeatmap from Seurat.
agedCNS_heatmap_clusters <- GEX_cluster_genes_heatmap(GEX = VGM[[2]], GEX_cluster_genes.output = gene_expression_cluster,n.genes.per.cluster = 3,max.cell = 30,metric = "avg_logFC", platypus.version = "v3")
print(agedCNS_heatmap_clusters)
After plotting the ggplot object we can clearly see genes enriched in various clusters - mainly indicated by the diagonal.
Further, volcano plots may be used to show and annotate a larger number of differentially expressed genes
agedCNS_heatmap_volcano <- GEX_volcano(DEGs.input = gene_expression_cluster, input.type = "cluster.genes", RP.MT.filter = T, color.p.threshold = 0.01, n.label.up = 10, n.label.down = 10)
## [1] "Top 10 and bottom 10 genes will be labeled"
We can additionally extract the top N genes per cluster directly (with filtering) using the following function:
top_10_genes_per_cluster <- GEX_topN_DE_genes_per_cluster(GEX_cluster_genes.output = gene_expression_cluster, n.genes = 10, by_FC = T)
head(top_10_genes_per_cluster)
## p_val avg_logFC pct.1 pct.2 p_val_adj SYMBOL cluster
## CD81 9.129362e-53 -11.663484 0.000 0.357 2.892273e-48 CD81 0
## BLK 5.337741e-39 -10.261625 0.000 0.278 1.691050e-34 BLK 0
## FCRLA 2.255021e-38 -10.195180 0.000 0.274 7.144132e-34 FCRLA 0
## MZB1 4.144279e-36 -10.111562 0.000 0.260 1.312949e-31 MZB1 0
## H2-DMB2 1.198041e-49 -9.893185 0.002 0.342 3.795514e-45 H2-DMB2 0
## SPI1 5.434681e-34 -9.487847 0.000 0.247 1.721761e-29 SPI1 0
We can extract and test for differential expressed genes between the two samples (or between other subgroups) by using the GEX_DEgenes_persample function.
This function allows us to also create a heatmap displaying the top
most up- or downregulated genes for each cluster based on log fold
change (avg_logFC) or p value (adj_p_value). Additionally, the user can
determine the number of up- and downreulated genes to be displayed for
each sample. In this case, the output returns a list where the first
element contains the dataframe with the differntial expression
information and the second element contains the heatmap displaying the
most up-/downregulated genes.
If more than two samples are to be examined please refer to
GEX_pairwise_degs two chunks below
DE_genes_samples <- GEX_DEgenes(GEX = VGM[[2]],min.pct = .25, grouping.column = "sample_id",group1 = "s1", group2 = "s2",return.plot = "volcano",up.genes = 5,down.genes = 5,logFC = F)
#This function is flexible and takes any column name as grouping.column to allow easy exploration of differences between custom groups
head(DE_genes_samples[[1]])
## p_val avg_logFC pct.1 pct.2 p_val_adj SYMBOL signif_pval
## RGS10 1.564907e-13 1.649797 0.436 0.138 4.957783e-09 RGS10 TRUE
## CTSD 3.522094e-12 2.038039 0.568 0.297 1.115835e-07 CTSD TRUE
## C1QC 7.536629e-11 1.655614 0.375 0.134 2.387680e-06 C1QC TRUE
## CST3 2.673577e-10 1.862194 0.679 0.455 8.470159e-06 CST3 TRUE
## HPGDS 3.129809e-10 2.328784 0.287 0.077 9.915547e-06 HPGDS TRUE
## C1QA 5.307456e-10 1.574627 0.372 0.138 1.681455e-05 C1QA TRUE
## signif_logfc color_threshold
## RGS10 TRUE TRUE
## CTSD TRUE TRUE
## C1QC TRUE TRUE
## CST3 TRUE TRUE
## HPGDS TRUE TRUE
## C1QA TRUE TRUE
The same function can also return DEGs between seurat clusters or any 2 groups of cells Here DEGs between cluster 1 and 3 are calculated
DE_genes_cl1_vs_3 <- GEX_DEgenes(GEX= VGM[[2]],min.pct = .25, grouping.column = "seurat_clusters",group1 = "1", group2 = "3",return.plot = "heatmap",up.genes = 10,down.genes = 10,logFC = F)
## [1] "MS4A4B" "CD3G" "CD3D" "CD3E" "THY1" "SKAP1" "MS4A6B"
## [8] "LCK" "TMSB10" "GM2682" "H2-AA" "H2-AB1" "H2-EB1" "CD79A"
## [15] "CD74" "CD79B" "LY6D" "EBF1" "BANK1" "H2-DMB2"
To gain a complete overview of DEGs between groups or clusters the GEX_pairwise_degs function is used This function calculates DEGs for every possible pairwise comparison between groups of the selected column. It is thereby recommended to use this function with a maximum number of 10 groups/clusters/samples This function automatically saves plots where top label.n.top.genes genes are labeled by their p value. Additionally, genes of special interest that should be labeled irregardless of their p value can be supplied to genes.to.label
DE_clusters_all <- GEX_pairwise_DEGs(GEX = VGM[[2]], group.by = "seurat_clusters", min.pct = 0.25, RP.MT.filter = T, label.n.top.genes = 10, genes.to.label = c("CD74", "EBF1"), save.plot = F)
## Calculating pairwise DEGs 1 of 66
## [1] 1
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 10
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 2 of 66
## [1] 1
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 4
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 3 of 66
## [1] 1
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 9
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 4 of 66
## [1] 1
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 6
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 5 of 66
## [1] 1
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 7
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 6 of 66
## [1] 1
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 2
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 7 of 66
## [1] 1
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 0
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 8 of 66
## [1] 1
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 3
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 9 of 66
## [1] 1
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 5
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 10 of 66
## [1] 1
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 8
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 11 of 66
## [1] 1
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 11
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 12 of 66
## [1] 10
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 4
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 13 of 66
## [1] 10
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 9
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 14 of 66
## [1] 10
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 6
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 15 of 66
## [1] 10
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 7
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 16 of 66
## [1] 10
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 2
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 17 of 66
## [1] 10
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 0
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 18 of 66
## [1] 10
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 3
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 19 of 66
## [1] 10
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 5
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 20 of 66
## [1] 10
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 8
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 21 of 66
## [1] 10
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 11
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 22 of 66
## [1] 4
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 9
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 23 of 66
## [1] 4
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 6
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 24 of 66
## [1] 4
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 7
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 25 of 66
## [1] 4
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 2
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 26 of 66
## [1] 4
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 0
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 27 of 66
## [1] 4
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 3
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 28 of 66
## [1] 4
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 5
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 29 of 66
## [1] 4
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 8
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 30 of 66
## [1] 4
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 11
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 31 of 66
## [1] 9
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 6
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 32 of 66
## [1] 9
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 7
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 33 of 66
## [1] 9
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 2
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 34 of 66
## [1] 9
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 0
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 35 of 66
## [1] 9
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 3
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 36 of 66
## [1] 9
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 5
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 37 of 66
## [1] 9
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 8
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 38 of 66
## [1] 9
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 11
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 39 of 66
## [1] 6
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 7
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 40 of 66
## [1] 6
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 2
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 41 of 66
## [1] 6
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 0
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 42 of 66
## [1] 6
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 3
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 43 of 66
## [1] 6
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 5
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 44 of 66
## [1] 6
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 8
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 45 of 66
## [1] 6
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 11
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 46 of 66
## [1] 7
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 2
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 47 of 66
## [1] 7
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 0
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 48 of 66
## [1] 7
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 3
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 49 of 66
## [1] 7
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 5
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 50 of 66
## [1] 7
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 8
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 51 of 66
## [1] 7
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 11
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 52 of 66
## [1] 2
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 0
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 53 of 66
## [1] 2
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 3
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 54 of 66
## [1] 2
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 5
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 55 of 66
## [1] 2
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 8
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 56 of 66
## [1] 2
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 11
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 57 of 66
## [1] 0
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 3
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 58 of 66
## [1] 0
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 5
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 59 of 66
## [1] 0
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 8
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 60 of 66
## [1] 0
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 11
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 61 of 66
## [1] 3
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 5
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 62 of 66
## [1] 3
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 8
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 63 of 66
## [1] 3
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 11
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 64 of 66
## [1] 5
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 8
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 65 of 66
## [1] 5
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 11
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
## Calculating pairwise DEGs 66 of 66
## [1] 8
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 11
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11
Now that cluster defining genes have been examined, a helper plotting function allows to visualize selected genes.
dottile <- GEX_dottile_plot(GEX = VGM[[2]], genes = c("CD19", "CD74","SDC1", "EBF1","PTPRC","CD93","CD38","CD24A","CD34","CD1D1","CR2","MS4A1","CXCR5","SELL","CD40","CD83","H2-AB1","H2-EB1","CD27","POU2AF1","NT5E","FAS","PDCD1LG2","PRDM1","ITGAM","IL10","IL12A","HAVCR2"), group.by = "seurat_clusters", threshold.to.plot = 5)
#threshold.to.plot specifies how many % of cells have to express a gene to show a dot.
dottile + ggplot2::theme(plot.title = ggplot2::element_blank(), plot.subtitle = ggplot2::element_blank(), legend.position = "bottom")
## Warning: Removed 192 rows containing missing values (`geom_point()`).
To examine patterns of co-expression two functions allow to generate both overview and single-cell resolution plots.
#overview
coexpression_dotmap <- GEX_coexpression_coefficient(GEX = VGM[[2]], genes = c("CD19", "CD74","SDC1", "EBF1","PTPRC","CD93","CD38","CD24A","CD34"), plot.dotmap = T)
## Calculating coexpression for 9 genes with 2064 cells
Now we can analyze the VDJ repertoire data before integrating VDJ and GEX information. This may be useful if only VDJ libraries have been sequenced without the accompanying gene expression data. We first start by examining the structure of VGM[[1]]
## [1] "barcode" "sample_id"
## [3] "group_id" "clonotype_id_10x"
## [5] "clonotype_id" "clonotype_frequency"
## [7] "celltype" "Nr_of_VDJ_chains"
## [9] "Nr_of_VJ_chains" "VDJ_cdr3s_aa"
## [11] "VJ_cdr3s_aa" "VDJ_cdr3s_nt"
## [13] "VJ_cdr3s_nt" "VDJ_umis"
## [15] "VJ_umis" "VDJ_chain_contig"
## [17] "VJ_chain_contig" "VDJ_chain"
## [19] "VJ_chain" "VDJ_vgene"
## [21] "VJ_vgene" "VDJ_dgene"
## [23] "VDJ_jgene" "VJ_jgene"
## [25] "VDJ_cgene" "VJ_cgene"
## [27] "VDJ_sequence_nt_raw" "VJ_sequence_nt_raw"
## [29] "VDJ_sequence_nt_trimmed" "VJ_sequence_nt_trimmed"
## [31] "VDJ_sequence_aa" "VJ_sequence_aa"
## [33] "VDJ_raw_ref" "VJ_raw_ref"
## [35] "VDJ_trimmed_ref" "VJ_trimmed_ref"
## [37] "VDJ_raw_consensus_id" "VJ_raw_consensus_id"
## [39] "orig_barcode" "specifity"
## [41] "affinity" "GEX_available"
## [43] "orig.ident" "seurat_clusters"
## [45] "PC_1" "PC_2"
## [47] "tSNE_1" "tSNE_2"
## [49] "batch_id"
If GEX data has been integrated, this dataframe also contains dimensional reduction coordinates and cluster id. If trim.and.align was set to TRUE, the columns VDJ_sequence_nt_trimmed to VJ_trimmed_ref contain aligned and reference sequences. The clonotype_frequency column takes input from the clonotyping output of cellranger, which is saved int the clonotype_id_10x column.
This object contains both cells with less than 2 V(D)J chains as well as with more than 2 To show how this is done the VDJ_cgene column is taken as an example
## [1] "IGHM" "IGHD" "" "IGHM;IGHM" "IGHG2B"
## [6] "IGHM;IGHD" "IGHG1" "IGHA" "IGHM;IGHG2C"
Apart from classical isotypes, we can see an empty (““) element. This is for cells, for which only a VJ chain is available. Secondly we can see 3 examples of cells containing more than one VDJ chain. These are delimited by”;“. Importantly, this format is maintained throughout all columns.
Both cells with less and more than 2 chains can be cumbersome for analysis. Platypus V3 functions generally support such cells. To check their quantity and, if needed, exclude these, the following line can be used
For advanced filtering and selection options concerning cells with aberrant number of chains, please refer to the detailed VDJ_GEX_matrix vignette.
## Cell count by number of VDJ chains
##
## 0 1 2
## 92 831 22
##
## Cell count by number of VJ chains
##
## 0 1 2
## 62 820 63
Often paired nucleotide CDRH3 and CDRL3 clonotyping may not be the best strategy given somatic hypermutation may occur in the CDR3 region. Therefore, there could be highly similar clones that likely bind the same antigen that are officially part of different clonal families. To address this, we have added a function that allows for various heuristic clonotyping strategies. This involves clonotyping by identical amino acid CDRH3 + CDRL3 seuqence, identical germline usage, or seqeunce homology requirements. This function works both for V3 and V2 of platypus. The version needs to be specified. For V3 (VGM) platypus.version a new clonotyping function was added to improve the existing one. For the full documentation see VDJ_clonotype_v3(). The deprecated function VDJ_clonotype may still be used for Platypus v2 input Three new parameters have been added therein: 1. global.clonotype If set to TRUE, clonotypes will be generated across all samples of the VGM. This may be useful if more than one sample has been taken from the same animal (e.g. spleen and bone marrow). Moreover this can provide an alternative approach to investigate clone sharing across donors or mice. After global clonotyping, the resulting clonotype_id can be used as a grouping variable and summary statistics can be generated per clone (including its presence in multiple repertoires.) A key update as part of the hierarchical clonotyping strategy is the same-sample-criterion. This is explained below. 2. hierarchical For cells with aberrant numbers of VDJ and VJ chains: the functions implements a hierarchical clonotyping strategy for such cells. In brief, clonotypes are determined based on all cells with exactly 1VDJ and 1VJ chain. Other cells are then “joined in” in post. This follows three steps 1. hierarchical = “none” Create clonotypes from 1 VDJ 1 VJ cells. 2. hierarchical = “single.chains” Cells with a single chain (e.g. 0VDJ and 1VJ) are merged into existing clonotypes containing the same or a homologous VJ chain, depending on the clonotyping strategy. Here the same-sample-criterion is applied when using global clonotyping. If multiple matching clonotypes have been found for that cell then ones within the same sample_id are prioritized. Further if multiple matches exist, the cell is merged into the existing clone with the highest frequency 2. hierarchical = “double.and.single.chains” A combination of 1VDJ and 2VJ chain will first be checked for frequency in the dataset (by exact nucleotide CDR3 matching). Only if this count exceeds the triple.chain.count.threshold, the clone is used as a “hub clone”. This protects from merging clonotypes on the basis of rare doublets. Existing clonotypes are then merged into the 1 VDJ 2 VJ clonotypes as they match with the assumption that e.g. a cell with 1 VDJ 1 VJ is part of that same clonotype, but missing a VJ chain due to stochastical sampling
In the example that follows, we use VDJ_clonotype to group cells into clones based on identical CDRH3 + CDRL3 amino acid sequence. We will compare this to the case in which we group B cells by using the same germline genes (both heavy chain and light chain).
VGM[[1]] <- VDJ_clonotype(VDJ = VGM[[1]], clone.strategy = "cdr3.aa", global.clonotype = F, VDJ.VJ.1chain = F, hierarchical = "single.chains")
## Clonotyping strategy: cdr3.aa
## Filtered out 22 cells containing more than one VDJ AND VJ chain or two VDJ chains, as these likely correspond to doublets
## Processing sample 1 of 4
## Attempting to merge in 9 aberrant cells
## Processing sample 2 of 4
## Attempting to merge in 28 aberrant cells
## Processing sample 3 of 4
## Attempting to merge in 23 aberrant cells
## Processing sample 4 of 4
## Attempting to merge in 94 aberrant cells
## Backing up 10x default clonotyping in columns clonotype_id_10x and clonotype_frequency_10x before updating clonotype_id and clonotype_frequency columns
## Nr and distribution of clonotypes using exact CDR3.aa matching
## [1] 321
print(table(VGM[[1]]$clonotype_frequency_cdr3.aa)) #Check distribution of clonotypes with identical CDR3 aa sequences
##
## 1 2 3 4 5 6 7 8 9 12 24
## 593 118 69 24 15 36 7 16 9 12 24
VGM[[1]] <- VDJ_clonotype(VDJ = VGM[[1]], clone.strategy = "VDJJ.VJJ", global.clonotype = F, VDJ.VJ.1chain = F, hierarchical = "double.and.single.chains", triple.chain.count.threshold = 4)
## Clonotyping strategy: VDJJ.VJJ
## Filtered out 0 cells containing more than one VDJ AND VJ chain or two VDJ chains, as these likely correspond to doublets
## Found 0 exact matching clones with 3 chains and a frequency of at least 4. These will be used as high confidence clonotypes.
## Processing sample 1 of 4
## Attempting to merge in 9 aberrant cells
## Processing sample 2 of 4
## Attempting to merge in 28 aberrant cells
## Processing sample 3 of 4
## Attempting to merge in 23 aberrant cells
## Processing sample 4 of 4
## Attempting to merge in 94 aberrant cells
##
## Nr and distribution of clonotypes using germline gene matching
##
## [1] 295
print(table(VGM[[1]]$clonotype_frequency_VDJJ.VJJ)) #Check distribution of clonotypes with identical germline genes
##
## 1 2 3 4 5 6 7 8 10 12 24
## 534 148 84 32 30 12 21 16 10 12 24
Columns with new clonotype information are added to the VGM so to allow for multiple different strategies to be compared easily. For most Platypus downstream functions, the columns default columns “clonotype_id” and “clonotype_frequency” are used. To control which clonotyping strategy is used in these functions, the relevant columns can be copied over
At the start of this vignette the VGM function was run with trim.and.align = F. Here we run the function again, but this time with trim.and.align = T. This does take significantly longer
To accelerate runtime, parallel computing options are available. If running a Windows machine parallel.processing should be set to “parlapply” This does require the package Parallel and its dependencies. The additional parameter numcores allows to set the numbers of cores to use. This is important when running the function on a cluster. If on a MAC or Linux machine, set parallel.processing to “mclapply”
Reference sequences are obtained from the cellranger output and thereby the 10x Genomics reference. Further trimming is made based on annotations by cellranger (feature$region_type). Alignment is done via Biostrings::pairwiseAlignment and the alignment with maximum score is returned. If alignments are not complete, the VGM parameters gap.opening.cost and gap.extension.cost can be modified
VGM <- VDJ_GEX_matrix(Data.in = yermanos2021_raw, #For local dataset input see section 5.1
GEX.integrate = T,
VDJ.combine = T,
integrate.GEX.to.VDJ = T,
integrate.VDJ.to.GEX = T,
trim.and.align = T, #Set this to TRUE for full length sequence recovery
gap.opening.cost = 10,
gap.extension.cost = 4, #Tweak to optimize alignments if neccessary
parallel.processing = "parlapply")
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at -2.0899
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 0.30103
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 7.0607e-15
## Warning: The following requested variables were not found: UMAP_1, UMAP_2
No trimmed and reference sequence columns should be filled.
## barcode sample_id group_id clonotype_id_10x clonotype_id
## 1 s1_AAACGGGGTTTAGGAA s1 1 clonotype7 clonotype7
## clonotype_frequency celltype Nr_of_VDJ_chains Nr_of_VJ_chains VDJ_cdr3s_aa
## 1 1 B cell 1 1 CARIGYAMDYW
## VJ_cdr3s_aa VDJ_cdr3s_nt
## 1 CQQGNTLPPTF TGTGCTCGAATAGGATATGCTATGGACTACTGG
## VJ_cdr3s_nt VDJ_umis VJ_umis
## 1 TGCCAACAGGGTAATACGCTTCCTCCGACGTTC 2 8
## VDJ_chain_contig VJ_chain_contig VDJ_chain VJ_chain
## 1 AAACGGGGTTTAGGAA-1_contig_2 AAACGGGGTTTAGGAA-1_contig_1 IGH IGK
## VDJ_vgene VJ_vgene VDJ_dgene VDJ_jgene VJ_jgene VDJ_cgene VJ_cgene
## 1 IGHV8-8 IGKV10-96 IGHJ4 IGKJ1 IGHM IGKC
## VDJ_sequence_nt_raw
## 1 TGGGAAGTGTGCAGCCATGGGCAGGCTTACTTCTTCATTCCTGTTACTGATTGTCCCTGCATATGTCCTGTCCCAGGTTACTCTGAAAGAGTCTGGCCCTGGGATATTGCAGCCCTCCCAGACCCTCAGTCTGACTTGTTCTTTCTCTGGGTTTTCACTGAGCACTTTTGGTATGGGTGTAGGCTGGATTCGTCAGCCTTCAGGGAAGGGTCTGGAGTGGCTGGCACACATTTGGTGGGATGATGATAAGTACTATAACCCAGCCCTGAAGAGTCGGCTCACAATCTCCAAGGATACCTCCAAAAACCAGGTATTCCTCAAGATCGCCAATGTGGACACTGCAGATACTGCCACATACTACTGTGCTCGAATAGGATATGCTATGGACTACTGGGGTCAAGGAACCTCAGTCACCGTCTCCTCAGAGAGTCAGTCCTTCCCAAATGTCTTCCCCCTCGTCTCCTGCGAGAGCCCCCTGTCTGATAAGAATCTGGTGGCCATGGGCTGCCTGGCCCGGGACTTCCTGCCCAGCACCATTTCCTTCACCTGGAACTACCAGAACAACACTGAAGTCATCCAGGGTATCAGAACCTTCCCAACACTGAGGACAGGGGGCAAGTACCTAGCCACCTCGCA
## VJ_sequence_nt_raw
## 1 TTGGGGCATTGTAATTGAAGTCAAGACTCAGCCTGGACATGATGTCCTCTGCTCAGTTCCTTGGTCTCCTGTTGCTCTGTTTTCAAGGTACCAGATGTGATATCCAGATGACACAGACTACATCCTCCCTGTCTGCCTCTCTGGGAGACAGAGTCACCATCAGTTGCAGGGCAAGTCAGGACATTAGCAATTATTTAAACTGGTATCAGCAGAAACCAGATGGAACTGTTAAACTCCTGATCTACTACACATCAAGATTACACTCAGGAGTCCCATCAAGGTTCAGTGGCAGTGGGTCTGGAACAGATTATTCTCTCACCATTAGCAACCTGGAGCAAGAAGATATTGCCACTTACTTTTGCCAACAGGGTAATACGCTTCCTCCGACGTTCGGTGGAGGCACCAAGCTGGAAATCAAACGGGCTGATGCTGCACCAACTGTATCCATCTTCCCACCATCCAGTGAGCAGTTAACATCTGGAGGTGCCTCAGTCGTGTGCTTC
## VDJ_sequence_nt_trimmed
## 1 ATGGGCAGGCTTACTTCTTCATTCCTGTTACTGATTGTCCCTGCATATGTCCTGTCCCAGGTTACTCTGAAAGAGTCTGGCCCTGGGATATTGCAGCCCTCCCAGACCCTCAGTCTGACTTGTTCTTTCTCTGGGTTTTCACTGAGCACTTTTGGTATGGGTGTAGGCTGGATTCGTCAGCCTTCAGGGAAGGGTCTGGAGTGGCTGGCACACATTTGGTGGGATGATGATAAGTACTATAACCCAGCCCTGAAGAGTCGGCTCACAATCTCCAAGGATACCTCCAAAAACCAGGTATTCCTCAAGATCGCCAATGTGGACACTGCAGATACTGCCACATACTACTGTGCTCGAATAGGATATGCTATGGACTACTGGGGTCAAGGAACCTCAGTCACCGTCTCCTCA
## VJ_sequence_nt_trimmed
## 1 ATGATGTCCTCTGCTCAGTTCCTTGGTCTCCTGTTGCTCTGTTTTCAAGGTACCAGATGTGATATCCAGATGACACAGACTACATCCTCCCTGTCTGCCTCTCTGGGAGACAGAGTCACCATCAGTTGCAGGGCAAGTCAGGACATTAGCAATTATTTAAACTGGTATCAGCAGAAACCAGATGGAACTGTTAAACTCCTGATCTACTACACATCAAGATTACACTCAGGAGTCCCATCAAGGTTCAGTGGCAGTGGGTCTGGAACAGATTATTCTCTCACCATTAGCAACCTGGAGCAAGAAGATATTGCCACTTACTTTTGCCAACAGGGTAATACGCTTCCTCCGACGTTCGGTGGAGGCACCAAGCTGGAAATCAAA
## VDJ_sequence_aa
## 1 MGRLTSSFLLLIVPAYVLSQVTLKESGPGILQPSQTLSLTCSFSGFSLSTFGMGVGWIRQPSGKGLEWLAHIWWDDDKYYNPALKSRLTISKDTSKNQVFLKIANVDTADTATYYCARIGYAMDYWGQGTSVTVSS
## VJ_sequence_aa
## 1 MMSSAQFLGLLLLCFQGTRCDIQMTQTTSSLSASLGDRVTISCRASQDISNYLNWYQQKPDGTVKLLIYYTSRLHSGVPSRFSGSGSGTDYSLTISNLEQEDIATYFCQQGNTLPPTFGGGTKLEIK
## VDJ_raw_ref
## 1 ATGGGCAGGCTTACTTCTTCATTCCTGTTACTGATTGTCCCTGCATATGTCCTGTCCCAGGTTACTCTGAAAGAGTCTGGCCCTGGGATATTGCAGCCCTCCCAGACCCTCAGTCTGACTTGTTCTTTCTCTGGGTTTTCACTGAGCACTTTTGGTATGGGTGTAGGCTGGATTCGTCAGCCTTCAGGGAAGGGTCTGGAGTGGCTGGCACACATTTGGTGGGATGATGATAAGTACTATAACCCAGCCCTGAAGAGTCGGCTCACAATCTCCAAGGATACCTCCAAAAACCAGGTATTCCTCAAGATCGCCAATGTGGACACTGCAGATACTGCCACATACTACTGTGCTCGAATAGATTACTATGCTATGGACTACTGGGGTCAAGGAACCTCAGTCACCGTCTCCTCAGAGAGTCAGTCCTTCCCAAATGTCTTCCCCCTCGTCTCCTGCGAGAGCCCCCTGTCTGATAAGAATCTGGTGGCCATGGGCTGCCTGGCCCGGGACTTCCTGCCCAGCACCATTTCCTTCACCTGGAACTACCAGAACAACACTGAAGTCATCCAGGGTATCAGAACCTTCCCAACACTGAGGACAGGGGGCAAGTACCTAGCCACCTCGCAGGTGTTGCTGTCTCCCAAGAGCATCCTTGAAGGTTCAGATGAATACCTGGTATGCAAAATCCACTACGGAGGCAAAAACAAAGATCTGCATGTGCCCATTCCAGCTGTCGCAGAGATGAACCCCAATGTAAATGTGTTCGTCCCACCACGGGATGGCTTCTCTGGCCCTGCACCACGCAAGTCTAAACTCATCTGCGAGGCCACGAACTTCACTCCAAAACCGATCACAGTATCCTGGCTAAAGGATGGGAAGCTCGTGGAATCTGGCTTCACCACAGATCCGGTGACCATCGAGAACAAAGGATCCACACCCCAAACCTACAAGGTCATAAGCACACTTACCATCTCTGAAATCGACTGGCTGAACCTGAATGTGTACACCTGCCGTGTGGATCACAGGGGTCTCACCTTCTTGAAGAACGTGTCCTCCACATGTGCTGCCAGTCCCTCCACAGACATCCTAACCTTCACCATCCCCCCCTCCTTTGCCGACATCTTCCTCAGCAAGTCCGCTAACCTGACCTGTCTGGTCTCAAACCTGGCAACCTATGAAACCCTGAATATCTCCTGGGCTTCTCAAAGTGGTGAACCACTGGAAACCAAAATTAAAATCATGGAAAGCCATCCCAATGGCACCTTCAGTGCTAAGGGTGTGGCTAGTGTTTGTGTGGAAGACTGGAATAACAGGAAGGAATTTGTGTGTACTGTGACTCACAGGGATCTGCCTTCACCACAGAAGAAATTCATCTCAAAACCCAATGAGGTGCACAAACATCCACCTGCTGTGTACCTGCTGCCACCAGCTCGTGAGCAACTGAACCTGAGGGAGTCAGCCACAGTCACCTGCCTGGTGAAGGGCTTCTCTCCTGCAGACATCAGTGTGCAGTGGCTTCAGAGAGGGCAACTCTTGCCCCAAGAGAAGTATGTGACCAGTGCCCCGATGCCAGAGCCTGGGGCCCCAGGCTTCTACTTTACCCACAGCATCCTGACTGTGACAGAGGAGGAATGGAACTCCGGAGAGACCTATACCTGTGTTGTAGGCCACGAGGCCCTGCCACACCTGGTGACCGAGAGGACCGTGGACAAGTCCACTGGTAAACCCACACTGTACAATGTCTCCCTGATCATGTCTGACACAGGCGGCACCTGCTAT
## VJ_raw_ref
## 1 ATGATGTCCTCTGCTCAGTTCCTTGGTCTCCTGTTGCTCTGTTTTCAAGGTACCAGATGTGATATCCAGATGACACAGACTACATCCTCCCTGTCTGCCTCTCTGGGAGACAGAGTCACCATCAGTTGCAGGGCAAGTCAGGACATTAGCAATTATTTAAACTGGTATCAGCAGAAACCAGATGGAACTGTTAAACTCCTGATCTACTACACATCAAGATTACACTCAGGAGTCCCATCAAGGTTCAGTGGCAGTGGGTCTGGAACAGATTATTCTCTCACCATTAGCAACCTGGAGCAAGAAGATATTGCCACTTACTTTTGCCAACAGGGTAATACGCTTCCTCCGTGGACGTTCGGTGGAGGCACCAAGCTGGAAATCAAACGGGCTGATGCTGCACCAACTGTATCCATCTTCCCACCATCCAGTGAGCAGTTAACATCTGGAGGTGCCTCAGTCGTGTGCTTCTTGAACAACTTCTACCCCAAAGACATCAATGTCAAGTGGAAGATTGATGGCAGTGAACGACAAAATGGCGTCCTGAACAGTTGGACTGATCAGGACAGCAAAGACAGCACCTACAGCATGAGCAGCACCCTCACGTTGACCAAGGACGAGTATGAACGACATAACAGCTATACCTGTGAGGCCACTCACAAGACATCAACTTCACCCATTGTCAAGAGCTTCAACAGGAATGAGTGT
## VDJ_trimmed_ref
## 1 ATGGGCAGGCTTACTTCTTCATTCCTGTTACTGATTGTCCCTGCATATGTCCTGTCCCAGGTTACTCTGAAAGAGTCTGGCCCTGGGATATTGCAGCCCTCCCAGACCCTCAGTCTGACTTGTTCTTTCTCTGGGTTTTCACTGAGCACTTTTGGTATGGGTGTAGGCTGGATTCGTCAGCCTTCAGGGAAGGGTCTGGAGTGGCTGGCACACATTTGGTGGGATGATGATAAGTACTATAACCCAGCCCTGAAGAGTCGGCTCACAATCTCCAAGGATACCTCCAAAAACCAGGTATTCCTCAAGATCGCCAATGTGGACACTGCAGATACTGCCACATACTACTGTGCTCGAATAGATTACTATGCTATGGACTACTGGGGTCAAGGAACCTCAGTCACCGTCTCCTCA
## VJ_trimmed_ref
## 1 ATGATGTCCTCTGCTCAGTTCCTTGGTCTCCTGTTGCTCTGTTTTCAAGGTACCAGATGTGATATCCAGATGACACAGACTACATCCTCCCTGTCTGCCTCTCTGGGAGACAGAGTCACCATCAGTTGCAGGGCAAGTCAGGACATTAGCAATTATTTAAACTGGTATCAGCAGAAACCAGATGGAACTGTTAAACTCCTGATCTACTACACATCAAGATTACACTCAGGAGTCCCATCAAGGTTCAGTGGCAGTGGGTCTGGAACAGATTATTCTCTCACCATTAGCAACCTGGAGCAAGAAGATATTGCCACTTACTTTTGCCAACAGGGTAATACGCTTCCTCCGTGGACGTTCGGTGGAGGCACCAAGCTGGAAATCAAA
## VDJ_raw_consensus_id VJ_raw_consensus_id orig_barcode specifity
## 1 clonotype7_concat_ref_1 clonotype7_concat_ref_2 AAACGGGGTTTAGGAA NA
## affinity GEX_available orig.ident seurat_clusters PC_1 PC_2 tSNE_1 tSNE_2
## 1 NA FALSE <NA> <NA> NA NA NA NA
## batch_id
## 1 N:/reddy/Shared_Folders/AA_AY/DBmanuscript_Aged_CNS/GEX/Aged.CNS.pool.3m.Bcell.S1
Lastly, you can extract full receptor sequences, annotated regions, as well as consensus and germline sequences (trimmed or raw) using the minimal_VDJ function, as demonstrated before. This is intended for use on the output files of cellranger v7.
For a basic view of clonal expansion the VDJ_clonal_donut produces circular plots per sample. Label position and size should only be adjusted after deciding on a plot export format. This function uses the clonotype_id column as input. If VDJ_clonotyping function was used, its result are stored in that column. If not, the default 10x clonotyping is stored there. To retrieve default 10x clonotyping, use the column clonotyping_id_10x
donuts <- VDJ_clonal_donut(VDJ = VGM[[1]], expanded.colors = c("grey50", "grey65", "grey80"), non.expanded.color = "black", counts.to.use = "freq_column")
## Warning in VDJ_clonal_donut(VDJ = VGM[[1]], expanded.colors = c("grey50", :
## Column name for counts.to.use was not found in VDJ. Defaulting to
## 'clonotype_id_10x'
## Using column clonotype_id for counting clones
## Clones: Expanded: 5 / 6.85%; 1 cell 68 / 93.15%; total: 73
## Cells: Expanded: 10 / 12.82%; 1 cell 68 / 87.18%; Total: 78
## Clones: Expanded: 14 / 10.85%; 1 cell 115 / 89.15%; total: 129
## Cells: Expanded: 45 / 28.12%; 1 cell 115 / 71.88%; Total: 160
## Clones: Expanded: 7 / 3.5%; 1 cell 193 / 96.5%; total: 200
## Cells: Expanded: 15 / 7.21%; 1 cell 193 / 92.79%; Total: 208
## Clones: Expanded: 64 / 18.34%; 1 cell 285 / 81.66%; total: 349
## Cells: Expanded: 214 / 42.89%; 1 cell 285 / 57.11%; Total: 499
#Counts to use = "freq_column" uses the counts in the clonotype_frequency column. Counts to use = "VGM" simply counts the rows of a given clonotype in the VGM table (these counts may differ if cells have been filtered out due to overlapping barcodes or if another clonotyping strategy was used)
donuts[[1]]
The black section shows cells in unexpanded clones. The rest shows expanded clones by their frequency. The middle label shows the total number of clones (cells)
This requires the Vegan package. We can calculate the diversity for any column(s) in the VGM. If more than one column is provided, the content will be pasted together before calculating diversity metrics
#Shannon Evenness for the VDJ chain CDR3
diversity_plot <- VDJ_diversity(VDJ = VGM[[1]],feature.columns = c("VDJ_cdr3s_aa"),grouping.column = "sample_id",metric = c("shannonevenness"), subsample.to.same.n = T)
diversity_plot
#Gini-Simpson index for pasted VDJ and VJ chain CDR3s
diversity_plot <- VDJ_diversity(VDJ = VGM[[1]],feature.columns = c("VDJ_cdr3s_aa", "VJ_cdr3s_aa"),grouping.column = "sample_id", metric = c("ginisimpson"), subsample.to.same.n = T)
diversity_plot
#exact values can be retrived as
#Jaccard index between repertoires of the two samples
diversity_plot <- VDJ_diversity(VDJ = VGM[[1]],feature.columns = c("VDJ_cgene"),grouping.column = "sample_id",metric = c("jaccard"), subsample.to.same.n = T)
diversity_plot
Other functions are specifically tailored to repertoire analysis - such as VDJ_network, which creates a sequence similarity network between repertoires or within a repertoire by connecting those clones with sequence similarity. This function relies upon igraph to visually display and construct the graph - which means that networks with high number of sequences will not display easily. In the following example we are using a small dataset. In case of a bigger dataset one may subsample by using the sample.n function on VGM[[1]]. Setting the per.mouse argument to false indicates that one network for multiple repertoires should be produced.
This function is quite specialized and we are working on building a better networks module. But hopefully the code can be used as a template for your own analyses.
#subsampled_VGM <- dplyr::sample_n(VGM[[1]], 60)
cl_network <- VDJ_network(VDJ = VGM[[1]],per.sample = T,distance.cutoff = 8, platypus.version = "v3")
for(i in 1:2){
igraph::plot.igraph(cl_network[[1]][[i]],vertex.label=NA,vertex.size=7)
}
#For a plot including clonotype frequencies
cl_network <- VDJ_network(VDJ = VGM[[1]],per.sample = F,distance.cutoff = 8, platypus.version = "v3")
igraph::plot.igraph(cl_network[[4]],vertex.label=NA,vertex.size=3+(0.95*as.numeric(cl_network[[2]]$clonotype_frequency)),vertex.color= as.factor(VGM[[1]]$sample_id))
For more details see the documentation of the VDJ_network function, but essentially information such as clonal frequency and which sample (here still indicated by the “sample_id” column) are stored in the second element of the output list. Here we can see only a few clones that are showing connections (produced by edges between those with a distance of 8 amino acids or less between heavy and light chain paired CDR3 sequence homology)
It is also possible to produce heatmaps of the germline gene usage in the context of heavy chain V gene and light chain V gene. The output of the VDJ_Vgene_usage function is a matrix for each repertoire corresponding to the order specified by VDJ_analyze. The outer list corresponds to the sample and the inner list corresponds to a matrix, where the rows correspond to the heavy chain V genes and the columns correspond to the light chains of the V genes. Therefore the output[[1]][i,j] corresponds to the number of clones using the combination of IGH-Vgene[i] and IGK/L-Vgene[j].
#First calculate adjacency matrix for V gene usage
Vgene_usage <- VDJ_Vgene_usage(VDJ = VGM[[1]], platypus.version = "v3")
#library(pheatmap)
pheatmap::pheatmap(Vgene_usage[[1]],show_rownames = F,show_colnames = F)
print(class(Vgene_usage[[1]]))
## [1] "matrix" "array"
## [1] "IGHV8-8" "IGHV1-9" "IGHV1-26" "IGHV14-1" "IGHV5-17" "IGHV5-16"
## [1] "IGKV10-96" "IGKV6-32" "IGKV19-93" "IGKV8-24" "IGKV1-110" "IGKV4-55"
This can then be easily plotted as a heatmap to observe patterns between repertoires or can be used to calculate V gene correlation using the “pheatmap” package.
Platypus also allows a separate analysis of V gene usage for HC and LC. The VDJ_Vgene_usage_barplot allows the user to plot most frequently used IgH or IgK/L V genes. By default, this function only provides visualizations for the HC V genes, but can also provide for the LC if LC. Vgene is set to TRUE. The User can also select the number of most used genes to be depicted.
Vgene_usage_barplot <- VDJ_Vgene_usage_barplot(VDJ = VGM[[1]], HC.gene.number = 10, LC.Vgene = T, LC.gene.number = 10, platypus.version = "v3")
## Sample order: s1 ; s2 ; s3 ; s4
Vgene_usage_barplot[[1]]
#VDJ chains
Vgene_usage_stackedbarplot <- VDJ_Vgene_usage_stacked_barplot(VDJ = VGM[[1]], LC.Vgene = F,HC.gene.number = 10, Fraction.HC = 1, platypus.version = "v3")
## Sample order: s1 ; s2 ; s3 ; s4
Furthermore, we can also produce a circular visualization of how V and J genes are combined throughout the repertoire. In the example that follows we use VDJ_VJ_usage_circos to look at the V gene with the corresponding J gene for each expanded clonotype.
#VDJ and VJ V and J gene pairing
vj_circos_bcells <- VDJ_VJ_usage_circos(VGM = VGM[[1]], c.threshold = 5,label.threshold=5,cell.level = T, platypus.version = "v3")
## Chosen clonotype column: clonotype_id_10x
## Chosen clonotype.frequency column: clonotype_frequency_10x
## Chosen clonotype.frequency column: clonotype_frequency
## Processing sample 1
## Processing sample 2
#VDJ and VJ pairing
HL_circos_bcells <- VDJ_alpha_beta_Vgene_circos(VGM = VGM[[1]], c.threshold = 5,label.threshold=5,cell.level = T, V.or.J= "both", platypus.version = "v3")
## Chosen clonotype.frequency column: clonotype_frequency
## [1] "s2"
## [1] "s4"
## Plotting for sample s2
## Plotting for sample s4
Finally, one can also look at any specific HC and LC CDR3 amino acid patterns arising across the different clones. Using the VDJ_logoplot_vector function the user can plot a logoplot of the CDR3 region od a certain length, specyfied by the length_cdr3 argument. For instance, the logoplot below corresponds to all those CDR3 aminoacid sequences of length 25
pasted_CDR3s <- paste0(VGM[[1]]$VDJ_cdr3s_aa, VGM[[1]]$VJ_cdr3s_aa)
agedCNS_CDR3_logoplot <- VDJ_logoplot_vector(cdr3.vector = pasted_CDR3s, seq_type = "aa", length_cdr3 = "auto")
## Returning logoplot based on 106 sequences of a total 945 input sequences
Depending on the clonotype strategy, within a clone several different variants of CDR3s or full length sequences may be contained. To count variants, the function VDJ_variants_per_clone is used. It returns a table per sample_id with stats about variants
variants_agedCNS <- VDJ_variants_per_clone(VDJ = VGM[[1]], variants.of = c("VDJ_cdr3s_aa", "VJ_cdr3s_aa"), clonotypes.col = "clonotype_id_10x", split.by = "sample_id", stringDist.method = "levenshtein")
head(variants_agedCNS[[1]])
## split_group clonotype_id n sequences n variants mean variants distance
## 1 s1 clonotype7 1 1 NA
## 2 s1 clonotype58 1 1 NA
## 3 s1 clonotype65 1 1 NA
## 4 s1 clonotype43 1 1 NA
## 5 s1 clonotype33 1 1 NA
## 6 s1 clonotype49 1 1 NA
## max variants distance min variants distance most frequent variant
## 1 NA NA CARIGYAMDYW/CQQGNTLPPTF
## 2 NA NA CARDFTTVVARGYFDVW/CQQDYSSPWTF
## 3 NA NA CARGITTVVAYYYAMDYW/CLQYDNLYTF
## 4 NA NA CTTGPYDYYAMDYW/CQQHYSTPYTF
## 5 NA NA CARPHDYDGVDYW/CSQSTHVPPWTF
## 6 NA NA CARRYYSNYAWFAYW/CQQWSSYPYTF
## n VDJ cgene variants most frequent VDJ cgene
## 1 1 IGHM
## 2 1 IGHD
## 3 1 IGHD
## 4 1 IGHD
## 5 1 IGHM
## 6 1 IGHM
Public clones or sequences are sequences that are shared between individuals. The VDJ_overlap_heatmap function can quantify this overlap and return a list of overlapping items
#overlap of VDJ V genes
VDJv_overlap <- VDJ_overlap_heatmap(VDJ = VGM[[1]],feature.columns = c("VDJ_vgene") ,grouping.column = "sample_id", axis.label.size = 20, pvalues.label.size = 12, add.barcode.table = T, plot.type = "ggplot")
head(VDJv_overlap[[2]]) #summary of overlap
## V1 V2 overlap
## 1 s1 s2 30
## 2 s1 s3 33
## 3 s1 s4 38
## 4 s2 s3 40
## 5 s2 s4 47
## 6 s3 s4 53
## items.overlapping
## 1 IGHV8-8;IGHV1-9;IGHV1-26;IGHV5-17;IGHV5-16;IGHV1-54;IGHV1-72;IGHV1-64;IGHV5-9-1;IGHV1-19;IGHV1-18;IGHV8-5;IGHV3-6;IGHV1-52;IGHV1-80;IGHV1-74;IGHV1-39;IGHV1-22;IGHV10-1;IGHV1-62-2;IGHV9-3;IGHV1-53;IGHV1-69;IGHV9-1;IGHV1-76;IGHV1-50;IGHV1-81;IGHV1-12;IGHV1-82;IGHV1-85
## 2 IGHV8-8;IGHV1-9;IGHV1-26;IGHV5-17;IGHV5-16;IGHV1-54;IGHV1-72;IGHV1-64;IGHV5-9-1;IGHV1-71;IGHV1-19;IGHV1-18;IGHV1-59;IGHV8-5;IGHV3-6;IGHV1-52;IGHV1-80;IGHV1-74;IGHV1-39;IGHV10-1;IGHV1-62-2;IGHV9-3;IGHV2-5;IGHV1-53;IGHV1-69;IGHV1-76;IGHV1-81;IGHV1-12;IGHV1-66;IGHV14-2;IGHV1-4;IGHV1-82;IGHV1-85
## 3 IGHV8-8;IGHV1-9;IGHV1-26;IGHV14-1;IGHV5-17;IGHV5-16;IGHV1-54;IGHV1-72;IGHV1-64;IGHV5-9-1;IGHV2-3;IGHV1-71;IGHV1-19;IGHV1-18;IGHV1-75;IGHV1-59;IGHV4-1;IGHV3-6;IGHV1-52;IGHV1-80;IGHV1-74;IGHV1-39;IGHV1-22;IGHV10-1;IGHV1-62-2;IGHV9-3;IGHV2-5;IGHV1-53;IGHV1-69;IGHV9-1;IGHV1-76;IGHV1-50;IGHV1-81;IGHV1-12;IGHV1-66;IGHV14-2;IGHV1-82;IGHV1-85
## 4 IGHV5-9-1;IGHV5-16;IGHV9-3;IGHV1-12;IGHV1-39;IGHV14-3;IGHV1-9;IGHV1-82;IGHV1-26;IGHV10-3;IGHV7-3;IGHV1-19;IGHV1-74;IGHV1-76;IGHV3-6;IGHV1-53;IGHV1-69;IGHV1-85;IGHV1-81;IGHV1-18;IGHV1-54;IGHV8-8;IGHV1-15;IGHV2-9-1;IGHV1-64;IGHV1-52;IGHV5-12;IGHV10-1;IGHV1-78;IGHV5-4;IGHV6-3;IGHV6-6;IGHV5-17;IGHV1-72;IGHV1-55;IGHV1-80;IGHV8-5;IGHV5-9;IGHV2-2;IGHV1-62-2
## 5 IGHV5-9-1;IGHV5-16;IGHV9-3;IGHV1-50;IGHV1-12;IGHV1-39;IGHV14-3;IGHV1-2;IGHV8-12;IGHV1-9;IGHV1-82;IGHV1-58;IGHV1-36;IGHV1-26;IGHV10-3;IGHV7-3;IGHV1-19;IGHV1-74;IGHV1-76;IGHV3-6;IGHV1-53;IGHV1-69;IGHV1-85;IGHV1-81;IGHV1-18;IGHV1-54;IGHV8-8;IGHV1-15;IGHV2-9-1;IGHV1-22;IGHV1-64;IGHV1-52;IGHV1-42;IGHV5-12;IGHV10-1;IGHV1-78;IGHV5-4;IGHV6-3;IGHV6-6;IGHV5-17;IGHV9-1;IGHV1-72;IGHV1-55;IGHV1-80;IGHV2-2;IGHV2-6;IGHV1-62-2
## 6 IGHV1-55;IGHV1-9;IGHV1-71;IGHV1-64;IGHV1-53;IGHV1-82;IGHV7-1;IGHV1-18;IGHV5-4;IGHV1-52;IGHV1-26;IGHV1-7;IGHV5-9-1;IGHV2-9-1;IGHV5-12;IGHV13-2;IGHV1-72;IGHV1-69;IGHV1-19;IGHV10-3;IGHV1-81;IGHV1-80;IGHV6-3;IGHV1-5;IGHV1-15;IGHV6-6;IGHV1-84;IGHV5-16;IGHV9-3;IGHV1-78;IGHV1-66;IGHV2-5;IGHV3-6;IGHV1-12;IGHV1-39;IGHV7-3;IGHV1-62-2;IGHV1-74;IGHV10-1;IGHV1-59;IGHV8-8;IGHV1-77;IGHV15-2;IGHV14-3;IGHV14-2;IGHV5-17;IGHV1-61;IGHV3-1;IGHV1-54;IGHV2-2;IGHV1-63;IGHV1-76;IGHV1-85
## overlap_lab
## 1 30
## 2 33
## 3 38
## 4 40
## 5 47
## 6 53
#overlap of clones
clone_overlap <- VDJ_overlap_heatmap(VDJ = VGM[[1]],feature.columns = c("VDJ_cdr3s_aa","VJ_cdr3s_aa") ,grouping.column = "sample_id", axis.label.size = 20, pvalues.label.size = 12, add.barcode.table = T, plot.type = "ggplot")
print(clone_overlap[[1]])
#Pheatmap plots will function only with length(unique(grouping.column)) > 4
Given the small size of the two repertoires, it is unsurprising that no clones are shared
The strength of the current 5’ sequencing protocols are that the gene expression (GEX) and repertoire (VDJ) libraries are extracted from the same sample, which can then be linked back to demonstrate that a given T cell has a certain gene expression pattern and also a certain T cell receptor sequence. The following functions are meant to integrate these two pieces of information.
One thing we may ask is how similar the B or T cells in a given clonal family are on the transcriptional level. The basic work of matching barcodes from V(D)J sequencing to those from GEX sequencing was already done in the VGM function. This part is therefore dedicated to functions that help explore and exploite this data
First we get basic stats on the overlap between GEX and VDJ. In this dataset, GEX data contains both B and T cells while VDJ data contains only BCR sequences.
## [1] 945
## [1] 2064
## [1] 303
To better visualize this, the cells of any clone can also be highlighted on a dimensional reduction
Because the legend of these plots can get quite large, we can split the plot and the legend and draw them separately. For this the gridExtra and cowplot package is required
#here we overlay the top 5 clones
plot_out <- VDJ_GEX_overlay_clones(GEX = VGM[[2]], reduction = "umap", n.clones = 5, by.sample = F, ncol.facet = 1, split.plot.and.legend = F, pt.size = 0.5)
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
plot_out[[1]] # the plot
#We can also plot any clone of interest by adding a column were TRUE values select which clones are to be plotted
interesting_clones <- c("clonotype7", "clonotype42")
VGM[[2]]@meta.data$clones_to_plot <- FALSE
VGM[[2]]@meta.data$clones_to_plot[which(VGM[[2]]$clonotype_id_10x %in% interesting_clones)] <- TRUE
plot_out <- VDJ_GEX_overlay_clones(GEX = VGM[[2]], reduction = "umap", clones.to.plot = "clones_to_plot", by.sample = F, ncol.facet = 1, split.plot.and.legend = T, pt.size = 0.5)
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
We previously integrated the GEX information into the format of the VDJ output. However, we may want to ask how the gene expression looks for certain clonotypes (e.g., how many of the cells in the top clone are expressing markers of activated B cells). This is done by adding an extra column to the VDJ and GEX objects containing binary information on expansion.
## Added new columns to VDJ: expanded_clonotype_frequency
## Added new columns to GEX: expanded_clonotype_frequency
For a more unbiased view, DEGs can be calculated between expanded and non-expanded clones
DE_byexpansion <- GEX_DEgenes(GEX = VGM[[2]], min.pct = 0.25, group1 = "TRUE", group2 = "FALSE", grouping.column = "expanded_clonotype_frequency", return.plot = "volcano", logFC = F, label.n.top.genes = 5)
head(DE_byexpansion[[1]])
## p_val avg_logFC pct.1 pct.2 p_val_adj SYMBOL
## Zc3h15 1.501698e-07 -6.650521 0.000 0.309 0.004757529 Zc3h15
## Ccr7 3.843845e-07 -1.552368 0.129 0.515 0.012177684 Ccr7
## Tspan13 8.735331e-07 -1.692454 0.071 0.412 0.027674401 Tspan13
## Pde4b 1.116669e-06 -1.349421 0.143 0.528 0.035377178 Pde4b
## AC149090.1 1.681616e-06 -6.815884 0.000 0.266 0.053275287 AC149090.1
## Gimap6 1.761742e-06 -1.094514 0.200 0.601 0.055813735 Gimap6
## signif_pval signif_logfc color_threshold
## Zc3h15 TRUE TRUE TRUE
## Ccr7 FALSE TRUE FALSE
## Tspan13 FALSE TRUE FALSE
## Pde4b FALSE TRUE FALSE
## AC149090.1 FALSE TRUE FALSE
## Gimap6 FALSE TRUE FALSE
## R version 4.3.1 (2023-06-16)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.0
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## time zone: Europe/Zurich
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Platypus_3.5.0 Seurat_5.0.1 SeuratObject_5.0.1 sp_2.1-1
## [5] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0 purrr_1.0.2
## [9] readr_2.1.4 tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.4
## [13] tidyverse_2.0.0 dplyr_1.1.3
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.21 splines_4.3.1 later_1.3.1
## [4] bitops_1.0-7 ggplotify_0.1.2 polyclip_1.10-6
## [7] fastDummies_1.7.3 lifecycle_1.0.3 doParallel_1.0.17
## [10] globals_0.16.2 lattice_0.21-8 MASS_7.3-60
## [13] magrittr_2.0.3 limma_3.56.2 plotly_4.10.2
## [16] sass_0.4.7 rmarkdown_2.25 jquerylib_0.1.4
## [19] yaml_2.3.7 useful_1.2.6 httpuv_1.6.11
## [22] sctransform_0.4.1 spam_2.10-0 spatstat.sparse_3.0-2
## [25] reticulate_1.34.0 cowplot_1.1.1 pbapply_1.7-2
## [28] RColorBrewer_1.1-3 zlibbioc_1.46.0 abind_1.4-5
## [31] Rtsne_0.16 RCurl_1.98-1.12 BiocGenerics_0.46.0
## [34] yulab.utils_0.1.0 GenomeInfoDbData_1.2.10 circlize_0.4.15
## [37] IRanges_2.34.1 S4Vectors_0.38.2 ggrepel_0.9.4
## [40] irlba_2.3.5.1 listenv_0.9.0 spatstat.utils_3.0-3
## [43] tidytree_0.4.5 pheatmap_1.0.12 vegan_2.6-4
## [46] goftest_1.2-3 RSpectra_0.16-1 spatstat.random_3.1-6
## [49] fitdistrplus_1.1-11 parallelly_1.36.0 permute_0.9-7
## [52] leiden_0.4.3 codetools_0.2-19 tidyselect_1.2.0
## [55] shape_1.4.6 ggseqlogo_0.1 aplot_0.2.2
## [58] farver_2.1.1 stats4_4.3.1 matrixStats_1.0.0
## [61] spatstat.explore_3.2-3 jsonlite_1.8.7 ellipsis_0.3.2
## [64] progressr_0.14.0 ggridges_0.5.4 survival_3.5-5
## [67] iterators_1.0.14 foreach_1.5.2 tools_4.3.1
## [70] treeio_1.24.3 stringdist_0.9.10 ica_1.0-3
## [73] Rcpp_1.0.11 glue_1.6.2 gridExtra_2.3
## [76] xfun_0.40 mgcv_1.9-0 GenomeInfoDb_1.36.4
## [79] withr_2.5.1 fastmap_1.1.1 fansi_1.0.5
## [82] digest_0.6.33 timechange_0.2.0 R6_2.5.1
## [85] mime_0.12 gridGraphics_0.5-1 colorspace_2.1-0
## [88] scattermore_1.2 tensor_1.5 spatstat.data_3.0-1
## [91] utf8_1.2.3 generics_0.1.3 data.table_1.14.8
## [94] httr_1.4.7 htmlwidgets_1.6.2 uwot_0.1.16
## [97] pkgconfig_2.0.3 gtable_0.3.4 lmtest_0.9-40
## [100] XVector_0.40.0 htmltools_0.5.6.1 dotCall64_1.1-0
## [103] scales_1.2.1 png_0.1-8 ggfun_0.1.3
## [106] knitr_1.44 rstudioapi_0.15.0 tzdb_0.4.0
## [109] reshape2_1.4.4 nlme_3.1-162 cachem_1.0.8
## [112] zoo_1.8-12 GlobalOptions_0.1.2 KernSmooth_2.23-22
## [115] parallel_4.3.1 miniUI_0.1.1.1 pillar_1.9.0
## [118] grid_4.3.1 vctrs_0.6.4 RANN_2.6.1
## [121] promises_1.2.1 xtable_1.8-4 cluster_2.1.4
## [124] evaluate_0.22 cli_3.6.1 compiler_4.3.1
## [127] rlang_1.1.1 crayon_1.5.2 future.apply_1.11.0
## [130] labeling_0.4.3 plyr_1.8.9 fs_1.6.3
## [133] stringi_1.7.12 viridisLite_0.4.2 deldir_1.0-9
## [136] munsell_0.5.0 Biostrings_2.68.1 lazyeval_0.2.2
## [139] spatstat.geom_3.2-5 Matrix_1.6-3 RcppHNSW_0.4.1
## [142] hms_1.1.3 patchwork_1.1.3 future_1.33.0
## [145] shiny_1.7.5.1 ROCR_1.0-11 igraph_1.5.1
## [148] memoise_2.0.1 bslib_0.5.1 ggtree_3.8.2
## [151] ape_5.7-1