Platypus Quick Start

1. Introduction

The Platypus ecosystem of packages are meant to provide potential pipelines and examples relevant to the broad field of computational immunology. While the majority of functions are currently tailored to single-cell immune repertoire sequencing, we will continue to building the package in a modular structure that allows integration to other bioinformatic methods. Currently, the package can be used to separately analyze gene expression (GEX) or immune receptor repertoire (VDJ) sequencing data, in addition to integrating the two data sets to combine phenotypic features with repertoire analysis. The package is designed to primarily analyze the output from 10x genomics cellranger (output from count for GEX and VDJ for enriched immune receptor libraries). The functions could work with other barcode-based scSeq technologies assuming the input columns are added correctly. The gene expression analysis relies heavily upon Seurat, a commonly used R package for single-cell sequencing (scSeq). The core set of functions can be found at https://github.com/alexyermanos/Platypus and examples of use can be found in the publications https://doi.org/10.1093/nargab/lqab023 and https://academic.oup.com/bioinformatics/article/39/9/btad553/7264179

Stay tuned for updates https://twitter.com/AlexYermanos

2. Platypus Version 3 (v3)

Following the original NARGAB publication of Platypus, we have rebuilt the package to entirely revolve around the VDJ_GEX_matrix function (VGM for short). This function integrates both repertoire and transcriptome information (if desired) and serves as the input to all secondary functions in future iterations of the package. The advantage of this is having all repertoire and transcriptome information at a per-cell level. This was originally required due to significant changes to the clonotyping strategies in the default 10x genomics VDJ pipeline. While the format of this object will be detailed below, the core of the function creates a list in R where the first element corresponds to the VDJ information at the per-cell level and the second element corresponds to a Seurat object. The VGM can then be supplied to all other functions that will assume this format by default. All functions should now be updated to take the VGM structure (referred to as “v3” of Platypus) as input in the “platypus.version” argument. “platypus.version” can be set to either “v2” for backcompatibility found in the original manucsript or “v3” for newer and future functions. Some new functions are only compatible with “v3”.

2.1 Citing Platypus

Yermanos A et al. Platypus: an open-access software for integrating lymphocyte single-cell immune repertoires with transcriptomes. NAR Genomics Bioinforma (2021) 3: doi:10.1093/NARGAB/LQAB023

Tudor-Stefan Cotet, Andreas Agrafiotis, Victor Kreiner et al. ePlatypus: an ecosystem for computational analysis of immunogenomics data, Bioinformatics, Volume 39, Issue 9, September 2023, btad553, https://doi.org/10.1093/bioinformatics/btad553

3. PlatypusDB

We understand the pain that locating and utilizing public data can inflict on aspiring and established scientists alike. We have therefore created a public database that is integrated within the structure of the Platypus package. This means that single-cell immune repertoire sequencing data can be directly loaded into the R session and even integrated with locally stored data. All examples in these vignettes highlight the strength of this feature by primarily using public datasets within our database. We highly encourage others to contribute to this database by contacting use with the output files from your cellranger runs (hopefully with at least v6 of cellranger and using the standard references). Sharing and organizing public data can help save the world.

4. Installation…

4.1 … from CRAN

#install.packages("Platypus")
install.packages("~/Downloads/Platypus_3.5.0.tar.gz", repos = NULL, type="source")

#Installing main imports
install.packages("Seurat")
install.packages("Tidyverse")
install.packages("Biostrings")
install.packages("jsonlite")
install.packages("seqinr")

library(Platypus)

4.2 …or from Github or source file

We are adding new features constantly. Recent changes may not yet be reflected in the CRAN version. All Github versions are developmental!

Individual R functions can additionally be found on the Github in the folder “R”. These can similarly be downloaded and loaded into the R environment in case not all functions are desired. These functions are actively updated and may include more features than the in tar.gz or Github version file.

### Removing any previous versions of the package
#First can ensure that there is no previous version installed locally
detach("package:Platypus", unload=TRUE)
remove.packages("Platypus")

### Downloading and installing Platypus
# First we need to download the most recent version from the master branch at https://github.com/alexyermanos/Platypus we can install the package using the following command. 
# WARNING: This needs to be replaced with your own directory where the downloaded package is found
# For MacOS users it may look like this
install.packages("~/Downloads/Platypus_3.5.0.tar.gz", repos = NULL, type="source")
# For windows it will likely look something like this. 
# WARNING: You will need to replace 'YourPCName' with your user name for the windows account in the directory. 
#install.packages("C:/Users/YourPCName/Downloads/Platypus_3.5.0.tar.gz", repos = NULL, type="source")

library(Platypus)

Installing from the Github repo

detach("package:Platypus", unload=TRUE)
remove.packages("Platypus")

install.packages("devtools")
devtools::install_github("alexyermanos/Platypus")
library(Platypus)

5. Extracting and integrating repertoire data with VDJ_GEX_matrix or VGM (Platypus v3)

Comprehensive details about the parameters of the VDJ_GEX_matrix() function can be found in the respective function documentation and in the dedicated vignette.

As an exemplary dataset we use GEX data from murine B and T cells and VDJ data from B cells in the aged CNS. This small dataset contains VDJ and GEX libraries from the central nervous system of four murine samples. The corresponding publication can be found here: https://doi.org/10.1098/rspb.2020.2793

5.1 Process local datasets

For local datasets, the paths to Cellranger output directories are needed. VDJ and GEX are processed simultaneously using the VDJ_GEX_matrix function, any input of the two can be omitted if not necessary.

#Creating a list with local paths to cellranger directories

VDJ.out.directory.list <- list() ### Set directory to the outs folder of cellranger VDJ
VDJ.out.directory.list[[1]] <- c("~/Downloads/yermanos2021b__VDJ_RAW/Aged.CNS.pool.3m.Bcell.S1")
VDJ.out.directory.list[[2]] <- c("~/Downloads/yermanos2021b__VDJ_RAW/Aged.CNS.pool.12m.Bcell.S2")
VDJ.out.directory.list[[3]] <- c("~/Downloads/yermanos2021b__VDJ_RAW/Aged.CNS.pool.18m.Bcell.S3")
VDJ.out.directory.list[[4]] <- c("~/Downloads/yermanos2021b__VDJ_RAW/Aged.CNS.single.18m.Bcell.S4")
GEX.out.directory.list <- list() ### Set directory to the /outs folder of cellranger count
GEX.out.directory.list[[1]] <- c("~/Downloads/yermanos2021b__GEX_RAW/Aged.CNS.pool.3m.Bcell.S1/")
GEX.out.directory.list[[2]] <- c("~/Downloads/yermanos2021b__GEX_RAW/Aged.CNS.pool.12m.Bcell.S2/")
GEX.out.directory.list[[3]] <- c("~/Downloads/yermanos2021b__GEX_RAW/Aged.CNS.pool.18m.Bcell.S3/")
GEX.out.directory.list[[4]] <- c("~/Downloads/yermanos2021b__GEX_RAW/Aged.CNS.single.18m.Bcell.S4/")

#Running the VDJ_GEX_matrix function

VGM <- VDJ_GEX_matrix(VDJ.out.directory.list = VDJ.out.directory.list,
                               GEX.out.directory.list = GEX.out.directory.list,
                               GEX.integrate = T,
                               VDJ.combine = T,
                               integrate.GEX.to.VDJ = T,
                               integrate.VDJ.to.GEX = T)

If you are only interested in repertoire analysis, we have made available a function to directly build the VDJ data frame of the VGM object. This will integrate trimmed and raw receptor sequences, consensus sequences, and germlines. For obtaining trimmed germlines, we are aligning the raw germline to the consensus sequence (gapless global-local alignment). However, this can slow down the computation for the VDJ data frame, therefore we recommend you use the parallel option.

Ensure your VDJ output files have been obtained using cellranger v7 (e.g., the filtered_contig_annotations.csv file should have annotations for all FWR and CDR regions). The following files should be present in all sample folders: all_contig_annotations.csv, filtered_contig_annotations.csv, all_contig_annotations.csv, filtered_contig.fasta, consensus.fasta, concat_ref.fasta.

#Creating a list with local paths to Cellranger directories for VDJ outputs for the OVA dataset.

VDJ.out.directory.list <- list() ### Set directory to the outs folder of Cellranger VDJ
VDJ.out.directory.list[[1]] <- c("~/Downloads/ova_v7/S1")
VDJ.out.directory.list[[2]] <- c("~/Downloads/ova_v7/S2")
VDJ.out.directory.list[[3]] <- c("~/Downloads/ova_v7/S3")
VDJ.out.directory.list[[4]] <- c("~/Downloads/ova_v7/S4")
VDJ.out.directory.list[[5]] <- c("~/Downloads/ova_v7/S5")

#Running the minimal_VDJ function with trim.germlines = TRUE

VDJ <- minimal_VDJ(VDJ.directory.list = VDJ.out.directory.list,
                   trim.germlines = TRUE,
                   gap.opening.cost = Inf, 
                   parallel = TRUE)

5.2 Process raw datasets from PlatypusDB

PlatypusDB stores both raw and processed data for each study. Processed data can be loaded directly as a VGM object. Raw data can be downloaded and then processed with the VDJ_GEX_matrix function.

These features are usful for integration of local and DB datasets. For this please refer to the PlatypusDB vignette

Here we load samples from PlatypusDB, and process these.

#Downloading PlatypusDB raw data in a list format
#For structure of PlatypusDB links, please refer to the PlatypusDB vignette 
library(Platypus)
yermanos2021_raw <- PlatypusDB_fetch(PlatypusDB.links = 
                                     c("yermanos2021b/ALL/ALL"), 
                                     load.to.enviroment = F,
                                     load.to.list = T)
#Running the VDJ_GEX_matrix function

VGM <- VDJ_GEX_matrix(Data.in = yermanos2021_raw,
                      GEX.integrate = T,
                      VDJ.combine = T,
                      integrate.GEX.to.VDJ = T,
                      integrate.VDJ.to.GEX = T,
                      verbose = F)

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at -2.0899

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 0.30103

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 7.0607e-15

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

## Warning: The following requested variables were not found: UMAP_1, UMAP_2

#For output format see 5.3

We can briefly look at the datastructure

head(colnames(VGM[[1]]))

## [1] "barcode"             "sample_id"           "group_id"           
## [4] "clonotype_id_10x"    "clonotype_id"        "clonotype_frequency"

## By setting integrate.GEX.to.VDJ and integrate.VDJ.to.GEX to T, VDJ and GEX information will be found in VGM[[1]] and VGM[[2]] objects.
# For example, the seurat-determined cluster is attached to each cell in the VDJ library by
head(VGM[[1]]$seurat_clusters)

## [1] <NA> <NA> <NA> <NA> <NA> <NA>
## Levels: 0 1 2 3 4 5 6 7 8 9 10 11

# which corresponds to cells with the following VDJ_cdr3 
head(VGM[[1]]$VDJ_cdr3s_aa)

## [1] "CARIGYAMDYW"        "CARDFTTVVARGYFDVW"  "CARGITTVVAYYYAMDYW"
## [4] "CTTGPYDYYAMDYW"     "CARPHDYDGVDYW"      "CARRYYSNYAWFAYW"

# an NA indicates that the cell barcode in the VDJ library was not detected in the GEX object (or was filtered out, depending on mitochondrial gene limits, etc)

Importantly, all samples are integrated in one VGM. Basic stats can be retrieved from the VDJ_GEX_stats output which was run as part of the VGM function (get_VDJ_stats = T)

head(VGM[[3]])

##                                                                                  Repertoir path
## VDJ.stats     N:/reddy/Shared_Folders/AA_AY/DBmanuscript_Aged_CNS/VDJ/Aged.CNS.pool.3m.Bcell.S1
## VDJ.stats1   N:/reddy/Shared_Folders/AA_AY/DBmanuscript_Aged_CNS/VDJ/Aged.CNS.pool.12m.Bcell.S2
## VDJ.stats2   N:/reddy/Shared_Folders/AA_AY/DBmanuscript_Aged_CNS/VDJ/Aged.CNS.pool.18m.Bcell.S3
## VDJ.stats3 N:/reddy/Shared_Folders/AA_AY/DBmanuscript_Aged_CNS/VDJ/Aged.CNS.single.18m.Bcell.S4
##            Sample name Nr unique barcodes Nr barcodes is_cell Nr cells 1VDJ 1VJ
## VDJ.stats           s1                 78                  78                61
## VDJ.stats1          s2                160                 160               122
## VDJ.stats2          s3                208                 208               160
## VDJ.stats3          s4                499                 499               372
##            Nr cells 1VDJ 0VJ Nr cells 0VDJ 1VJ Nr cells 2 or more VDJ 1VJ
## VDJ.stats                  8                 1                          0
## VDJ.stats1                13                15                          1
## VDJ.stats2                 7                16                          4
## VDJ.stats3                34                60                          8
##            Nr cells 1VDJ 2 or more VJ Nr cells 2 or more VDJ 2 or more VJ
## VDJ.stats                           7                                   1
## VDJ.stats1                          7                                   2
## VDJ.stats2                         18                                   3
## VDJ.stats3                         22                                   3
##            Nr cells full_length Nr cells productive Nr cells high_confidence
## VDJ.stats                    78                  78                       78
## VDJ.stats1                  160                 160                      160
## VDJ.stats2                  208                 208                      208
## VDJ.stats3                  499                 499                      499
##            Nr cells all true Nr cells all true and 1VDJ 1VJ Nr clonotypes
## VDJ.stats                 78                             61            73
## VDJ.stats1               160                            122           129
## VDJ.stats2               208                            160           200
## VDJ.stats3               499                            372           349
##            Nr clonotypes 1VDJ 1VJ Nr clonotypes < 1VDJ 1VJ
## VDJ.stats                      58                        8
## VDJ.stats1                     94                       25
## VDJ.stats2                    154                       21
## VDJ.stats3                    248                       77
##            Nr clonotypes > 1VDJ 1VJ % Nr unique barcodes % Nr barcodes is_cell
## VDJ.stats                         7                  100                   100
## VDJ.stats1                       10                  100                   100
## VDJ.stats2                       25                  100                   100
## VDJ.stats3                       24                  100                   100
##            % Nr cells 1VDJ 1VJ % Nr cells 1VDJ 0VJ % Nr cells 0VDJ 1VJ
## VDJ.stats                 78.2                10.3                 1.3
## VDJ.stats1                76.2                 8.1                 9.4
## VDJ.stats2                76.9                 3.4                 7.7
## VDJ.stats3                74.5                 6.8                  12
##            % Nr cells 2 or more VDJ 1VJ % Nr cells 1VDJ 2 or more VJ
## VDJ.stats                             0                            9
## VDJ.stats1                          0.6                          4.4
## VDJ.stats2                          1.9                          8.7
## VDJ.stats3                          1.6                          4.4
##            % Nr cells 2 or more VDJ 2 or more VJ % Nr cells full_length
## VDJ.stats                                    1.3                    100
## VDJ.stats1                                   1.2                    100
## VDJ.stats2                                   1.4                    100
## VDJ.stats3                                   0.6                    100
##            % Nr cells productive % Nr cells high_confidence % Nr cells all true
## VDJ.stats                    100                        100                 100
## VDJ.stats1                   100                        100                 100
## VDJ.stats2                   100                        100                 100
## VDJ.stats3                   100                        100                 100
##            % Nr cells all true and 1VDJ 1VJ % Nr clonotypes
## VDJ.stats                              78.2             100
## VDJ.stats1                             76.2             100
## VDJ.stats2                             76.9             100
## VDJ.stats3                             74.5             100
##            % Nr clonotypes 1VDJ 1VJ % Nr clonotypes < 1VDJ 1VJ
## VDJ.stats                      79.5                         11
## VDJ.stats1                     72.9                       19.4
## VDJ.stats2                       77                       10.5
## VDJ.stats3                     71.1                       22.1
##            % Nr clonotypes > 1VDJ 1VJ Estimated.Number.of.Cells
## VDJ.stats                         9.6                        78
## VDJ.stats1                        7.8                       160
## VDJ.stats2                       12.5                       208
## VDJ.stats3                        6.9                       499
##            Mean.Read.Pairs.per.Cell
## VDJ.stats                   272,337
## VDJ.stats1                  114,292
## VDJ.stats2                   99,848
## VDJ.stats3                   50,928
##            Number.of.Cells.With.Productive.V.J.Spanning.Pair
## VDJ.stats                                                 69
## VDJ.stats1                                               132
## VDJ.stats2                                               185
## VDJ.stats3                                               405
##            Number.of.Read.Pairs Valid.Barcodes Q30.Bases.in.Barcode
## VDJ.stats            21,242,313          96.5%                91.7%
## VDJ.stats1           18,286,818          96.7%                91.9%
## VDJ.stats2           20,768,413          95.3%                91.7%
## VDJ.stats3           25,413,451          96.1%                91.7%
##            Q30.Bases.in.RNA.Read.1 Q30.Bases.in.RNA.Read.2 Q30.Bases.in.UMI
## VDJ.stats                    73.3%                   79.8%            91.0%
## VDJ.stats1                   73.2%                   79.8%            91.1%
## VDJ.stats2                   73.1%                   79.2%            91.0%
## VDJ.stats3                   73.3%                   79.5%            91.0%
##            Reads.Mapped.to.Any.V.D.J.Gene Reads.Mapped.to.IGH
## VDJ.stats                           82.6%               26.3%
## VDJ.stats1                          82.0%               30.4%
## VDJ.stats2                          80.0%               20.5%
## VDJ.stats3                          83.2%               24.6%
##            Reads.Mapped.to.IGK Reads.Mapped.to.IGL
## VDJ.stats                29.4%               27.0%
## VDJ.stats1               30.3%               21.3%
## VDJ.stats2               38.0%               21.5%
## VDJ.stats3               40.6%               18.1%
##            Mean.Used.Read.Pairs.per.Cell Fraction.Reads.in.Cells
## VDJ.stats                         36,548                   79.0%
## VDJ.stats1                        29,068                   74.6%
## VDJ.stats2                        33,816                   84.6%
## VDJ.stats3                        18,887                   55.1%
##            Median.IGH.UMIs.per.Cell Median.IGK.UMIs.per.Cell
## VDJ.stats                        11                       30
## VDJ.stats1                        7                       12
## VDJ.stats2                       10                       31
## VDJ.stats3                        4                       10
##            Median.IGL.UMIs.per.Cell Cells.With.Productive.V.J.Spanning.Pair
## VDJ.stats                        47                                   88.5%
## VDJ.stats1                        8                                   82.5%
## VDJ.stats2                       32                                   88.9%
## VDJ.stats3                      6.5                                   81.2%
##            Cells.With.Productive.V.J.Spanning..IGK..IGH..Pair
## VDJ.stats                                               80.8%
## VDJ.stats1                                              71.2%
## VDJ.stats2                                              82.2%
## VDJ.stats3                                              76.0%
##            Cells.With.Productive.V.J.Spanning..IGL..IGH..Pair
## VDJ.stats                                                9.0%
## VDJ.stats1                                              11.9%
## VDJ.stats2                                               6.7%
## VDJ.stats3                                               5.8%
##            Paired.Clonotype.Diversity Cells.With.IGH.Contig
## VDJ.stats                       69.14                 98.7%
## VDJ.stats1                         80                 93.8%
## VDJ.stats2                     191.43                 96.2%
## VDJ.stats3                     159.31                 90.6%
##            Cells.With.IGK.Contig Cells.With.IGL.Contig
## VDJ.stats                  91.0%                 38.5%
## VDJ.stats1                 82.5%                 38.1%
## VDJ.stats2                 92.3%                 41.8%
## VDJ.stats3                 89.0%                 29.5%
##            Cells.With.CDR3.annotated.IGH.Contig
## VDJ.stats                                 98.7%
## VDJ.stats1                                91.9%
## VDJ.stats2                                93.3%
## VDJ.stats3                                88.6%
##            Cells.With.CDR3.annotated.IGK.Contig
## VDJ.stats                                 82.1%
## VDJ.stats1                                78.8%
## VDJ.stats2                                88.9%
## VDJ.stats3                                86.8%
##            Cells.With.CDR3.annotated.IGL.Contig
## VDJ.stats                                 11.5%
## VDJ.stats1                                13.8%
## VDJ.stats2                                 8.2%
## VDJ.stats3                                 7.2%
##            Cells.With.V.J.Spanning.IGH.Contig
## VDJ.stats                               98.7%
## VDJ.stats1                              93.1%
## VDJ.stats2                              95.7%
## VDJ.stats3                              88.8%
##            Cells.With.V.J.Spanning.IGK.Contig
## VDJ.stats                               89.7%
## VDJ.stats1                              81.2%
## VDJ.stats2                              92.3%
## VDJ.stats3                              88.6%
##            Cells.With.V.J.Spanning.IGL.Contig Cells.With.Productive.IGH.Contig
## VDJ.stats                               16.7%                            98.7%
## VDJ.stats1                              17.5%                            90.6%
## VDJ.stats2                              10.1%                            92.3%
## VDJ.stats3                              10.4%                            88.0%
##            Cells.With.Productive.IGK.Contig Cells.With.Productive.IGL.Contig
## VDJ.stats                             80.8%                            10.3%
## VDJ.stats1                            78.8%                            13.8%
## VDJ.stats2                            88.5%                             8.2%
## VDJ.stats3                            86.6%                             7.2%
##            rep_id Estimated.Number.of.Cells Mean.Reads.per.Cell
## VDJ.stats       1                       303             819,292
## VDJ.stats1      2                       261             919,694
## VDJ.stats2      3                       945             413,917
## VDJ.stats3      4                       595             369,552
##            Median.Genes.per.Cell Number.of.Reads Valid.Barcodes
## VDJ.stats                  1,199     248,245,523          83.2%
## VDJ.stats1                   970     240,040,317          82.8%
## VDJ.stats2                 1,338     391,152,436          85.3%
## VDJ.stats3                   766     219,883,523          84.2%
##            Sequencing.Saturation Q30.Bases.in.Barcode Q30.Bases.in.RNA.Read
## VDJ.stats                  99.1%                97.7%                 92.3%
## VDJ.stats1                 98.8%                97.8%                 89.9%
## VDJ.stats2                 98.2%                97.8%                 93.1%
## VDJ.stats3                 97.9%                97.7%                 92.3%
##            Q30.Bases.in.UMI Reads.Mapped.to.Genome
## VDJ.stats             97.5%                  90.7%
## VDJ.stats1            97.6%                  83.1%
## VDJ.stats2            97.6%                  92.2%
## VDJ.stats3            97.5%                  91.1%
##            Reads.Mapped.Confidently.to.Genome
## VDJ.stats                               85.4%
## VDJ.stats1                              77.3%
## VDJ.stats2                              83.2%
## VDJ.stats3                              85.7%
##            Reads.Mapped.Confidently.to.Intergenic.Regions
## VDJ.stats                                           10.7%
## VDJ.stats1                                          15.0%
## VDJ.stats2                                          11.3%
## VDJ.stats3                                          15.8%
##            Reads.Mapped.Confidently.to.Intronic.Regions
## VDJ.stats                                          9.7%
## VDJ.stats1                                        10.2%
## VDJ.stats2                                        10.5%
## VDJ.stats3                                        11.1%
##            Reads.Mapped.Confidently.to.Exonic.Regions
## VDJ.stats                                       65.0%
## VDJ.stats1                                      52.0%
## VDJ.stats2                                      61.4%
## VDJ.stats3                                      58.7%
##            Reads.Mapped.Confidently.to.Transcriptome
## VDJ.stats                                      61.5%
## VDJ.stats1                                     49.1%
## VDJ.stats2                                     59.7%
## VDJ.stats3                                     54.9%
##            Reads.Mapped.Antisense.to.Gene Fraction.Reads.in.Cells
## VDJ.stats                           12.7%                   84.2%
## VDJ.stats1                          12.9%                   59.7%
## VDJ.stats2                          11.7%                   88.8%
## VDJ.stats3                          14.6%                   51.9%
##            Total.Genes.Detected Median.UMI.Counts.per.Cell rep_id
## VDJ.stats                14,652                      2,865      1
## VDJ.stats1               14,033                      1,966      2
## VDJ.stats2               16,905                      3,151      3
## VDJ.stats3               14,629                      1,536      4

Moreover, given the many parameters of the VGM function is may be useful to check back on runtime parameters. These, as well as session info are also saved in the VGM object

print(VGM[[4]]) #Runtime params
print(VGM[[5]]) #session info

4. Gene expression analysis (Platypus v3)

The VGM object now contains all information needed for both V(D)J and gene expression analysis. In the examples below we first focus on gene expression

We can visualize the 2D plots for each sample individually. By default, UMAP, PCA, and TSNE embeddings are included in the object. Here it is important to supply the seurat object in VGM[[2]]. For the parameter group.by we supply “sample_id” to color points by the entry in their sample id column

# Ensure all gene names are capitalized for downstream analyses
rownames(VGM[[2]]) <- sapply(rownames(VGM[[2]]), function(x) toupper(x))

# In Platypus version 2, the output from GEX_automate was used as input to other GEX functions. These functions are still compatible with v3 if the VGM[[2]] seurat object is supplied as input.
# For example, the following function can be used to calculate the DE genes for each cluster, as before. 
Seurat::DimPlot(VGM[[2]],reduction = "umap", group.by = "sample_id")
Seurat::DimPlot(VGM[[2]],reduction = "pca", group.by = "sample_id")

## Warning: Removed 1761 rows containing missing values (`geom_point()`).

Seurat::DimPlot(VGM[[2]],reduction = "tsne", group.by = "sample_id")

## Warning: Removed 1761 rows containing missing values (`geom_point()`).

#alternatively plot each sample separately
Seurat::DimPlot(VGM[[2]],reduction = "umap", split.by = "sample_id")
#this also works with any other column of VGM[[2]]@meta.data
Seurat::DimPlot(VGM[[2]],reduction = "umap", split.by = "group_id")

We can observe that the majority of clusters have cells from both samples, suggesting a similar distribution of transcriptional properties between the two samples. We can also plot the cluster membership for each of the distinct samples using the GEX_proportions_barplot function.

GEX_proportions_barplot(GEX = VGM[[2]], stacked.plot = T, source.group = "sample_id", target.group = "seurat_clusters")
#This function is very flexible and can be used to plot proportions of cells from and of any groups. For this use the source.group and target.group parameters to specify metadata columns.

Finally, we can also look at common B cell markers.

Seurat::FeaturePlot(VGM[[2]],reduction = "umap",features = c("CD19","PTPRC", "EBF1", "H2-K1"))
#To easily scout through genes in the dataset use:
#View(as.data.frame(rownames(VGM[[2]])))

Platypus also allows us to assign cell type/state identity to different clusters by using GEX_phenotype. This function takes the Seurat object as input and uses canonical markers to easily match the clustering to known cell types. The user also has the possibility to use a custom list of markers and their associated cell types/states. While the default cell states are quite basic, we find that the strength of the function is that a user can label cells with custom phenotypes as demonstrated below. For more information please see the functions documentation.

#using defaults
VGM[[2]] <- GEX_phenotype(VGM[[2]], default = T)
#custom criteria
#VGM[[2]] <- GEX_phenotype(VGM[[2]], default = F,cell.state.markers=c("CD8A+;CCL5+;CD44+;IL7R-;CD19-","CD8A+;CCL5-;CD44+;IL7R+;CD19-"),cell.state.names=c("EffectorCD8","MemoryCD8"))

The resulting Seurat object now contains a “cell.state” column which can be used for annotation in the DimPlot function of the Seurat package.

Seurat::DimPlot(VGM[[2]],reduction = "umap", group.by = "cell.state")

5. Differential Gene Expression Analysis

After scaling, normalizing, and clustering the cells from the GEX libraries we can now investigate which genes are differentially expressed between either clusters or samples. First, we can investigate the genes that define each of the clusters by using the GEX output of the VGM function. Depending on the size of the dataset and the number of cells this function can be quite slow. The output of this function is a list in which each element contains the differentially expressed genes for a given cluster. For example, the first element of the list will correspond to a data frame describing the genes for cluster0 that we previously observed on the UMAP. This list object will correspond to the length of the number of clusters that were previously calculated, and has the same format as the FindMarkers function from Seurat. The pct.1 will correspond to the percentage of cells expressing the gene in the cluster of interest, and the pct.2 to the percentage of cells in all other clusters.

VGM[[2]] <- SeuratObject::JoinLayers(VGM[[2]])
gene_expression_cluster <- GEX_cluster_genes(VGM[[2]], min.pct = 0.2)

## For a (much!) faster implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the presto package
## --------------------------------------------
## install.packages('devtools')
## devtools::install_github('immunogenomics/presto')
## --------------------------------------------
## After installation of presto, Seurat will automatically use the more 
## efficient implementation (no further action necessary).
## This message will be shown once per session

length(gene_expression_cluster) # length of 12, corresponding to 12 clusters

## [1] 12

length(unique(VGM[[2]]$seurat_clusters)) # length of 12

## [1] 12

print(sapply(gene_expression_cluster,nrow)) #Nr of differentially expressed genes per cluster

##  [1] 1921 1266 1238 1424 2516 1814 1863 1617 1611 1513 1743 1409

We can now look at some of the genes associated with cluster0 in the previously displayed umap

head(gene_expression_cluster[[1]])

##               p_val avg_logFC pct.1 pct.2     p_val_adj SYMBOL cluster
## NKG7  2.631427e-306  3.732392 0.982 0.104 8.336623e-302   NKG7       0
## CCL5  1.070166e-286  3.434706 0.972 0.112 3.390392e-282   CCL5       0
## CTSW  5.876528e-254  3.307826 0.872 0.079 1.861743e-249   CTSW       0
## CD8A  2.962265e-251  3.746899 0.807 0.059 9.384753e-247   CD8A       0
## CD8B1 1.725846e-237  3.285546 0.842 0.080 5.467653e-233  CD8B1       0
## THY1  1.075640e-235  2.854907 0.941 0.145 3.407735e-231   THY1       0

Raw tables are informative, but not visually appealing. Differentially expressed genes may be plotted in different modes using Platypus.

It is also possible to create a heatmap displaying differentially expressed genes for each cluster. This can be customized to sub-sample cells in case certain clusters are too large for visualization purposes. Additionally, the user can determine the number of genes to display for each cluster based on the n.genes.per.cluster argument. The function GEX_cluster_genes_heatmap can be used to produce a ggplot object, based on the DoHeatmap from Seurat.

agedCNS_heatmap_clusters <- GEX_cluster_genes_heatmap(GEX = VGM[[2]], GEX_cluster_genes.output = gene_expression_cluster,n.genes.per.cluster = 3,max.cell = 30,metric = "avg_logFC", platypus.version = "v3")
print(agedCNS_heatmap_clusters)

After plotting the ggplot object we can clearly see genes enriched in various clusters - mainly indicated by the diagonal.

Further, volcano plots may be used to show and annotate a larger number of differentially expressed genes

agedCNS_heatmap_volcano <- GEX_volcano(DEGs.input = gene_expression_cluster, input.type = "cluster.genes", RP.MT.filter = T, color.p.threshold = 0.01, n.label.up = 10, n.label.down = 10)

## [1] "Top 10 and bottom 10 genes will be labeled"

print(agedCNS_heatmap_volcano[[1]]) #genes specific to cluster 0

We can additionally extract the top N genes per cluster directly (with filtering) using the following function:

top_10_genes_per_cluster <- GEX_topN_DE_genes_per_cluster(GEX_cluster_genes.output = gene_expression_cluster, n.genes = 10, by_FC = T)
head(top_10_genes_per_cluster)

##                p_val  avg_logFC pct.1 pct.2    p_val_adj  SYMBOL cluster
## CD81    9.129362e-53 -11.663484 0.000 0.357 2.892273e-48    CD81       0
## BLK     5.337741e-39 -10.261625 0.000 0.278 1.691050e-34     BLK       0
## FCRLA   2.255021e-38 -10.195180 0.000 0.274 7.144132e-34   FCRLA       0
## MZB1    4.144279e-36 -10.111562 0.000 0.260 1.312949e-31    MZB1       0
## H2-DMB2 1.198041e-49  -9.893185 0.002 0.342 3.795514e-45 H2-DMB2       0
## SPI1    5.434681e-34  -9.487847 0.000 0.247 1.721761e-29    SPI1       0

We can extract and test for differential expressed genes between the two samples (or between other subgroups) by using the GEX_DEgenes_persample function.

This function allows us to also create a heatmap displaying the top most up- or downregulated genes for each cluster based on log fold change (avg_logFC) or p value (adj_p_value). Additionally, the user can determine the number of up- and downreulated genes to be displayed for each sample. In this case, the output returns a list where the first element contains the dataframe with the differntial expression information and the second element contains the heatmap displaying the most up-/downregulated genes.
If more than two samples are to be examined please refer to GEX_pairwise_degs two chunks below

DE_genes_samples <- GEX_DEgenes(GEX = VGM[[2]],min.pct = .25, grouping.column = "sample_id",group1 = "s1", group2 = "s2",return.plot = "volcano",up.genes = 5,down.genes = 5,logFC = F)
#This function is flexible and takes any column name as grouping.column to allow easy exploration of differences between custom groups
head(DE_genes_samples[[1]])

##              p_val avg_logFC pct.1 pct.2    p_val_adj SYMBOL signif_pval
## RGS10 1.564907e-13  1.649797 0.436 0.138 4.957783e-09  RGS10        TRUE
## CTSD  3.522094e-12  2.038039 0.568 0.297 1.115835e-07   CTSD        TRUE
## C1QC  7.536629e-11  1.655614 0.375 0.134 2.387680e-06   C1QC        TRUE
## CST3  2.673577e-10  1.862194 0.679 0.455 8.470159e-06   CST3        TRUE
## HPGDS 3.129809e-10  2.328784 0.287 0.077 9.915547e-06  HPGDS        TRUE
## C1QA  5.307456e-10  1.574627 0.372 0.138 1.681455e-05   C1QA        TRUE
##       signif_logfc color_threshold
## RGS10         TRUE            TRUE
## CTSD          TRUE            TRUE
## C1QC          TRUE            TRUE
## CST3          TRUE            TRUE
## HPGDS         TRUE            TRUE
## C1QA          TRUE            TRUE

DE_genes_samples[[2]]

The same function can also return DEGs between seurat clusters or any 2 groups of cells Here DEGs between cluster 1 and 3 are calculated

DE_genes_cl1_vs_3 <- GEX_DEgenes(GEX= VGM[[2]],min.pct = .25, grouping.column = "seurat_clusters",group1 = "1", group2 = "3",return.plot = "heatmap",up.genes = 10,down.genes = 10,logFC = F)

##  [1] "MS4A4B"  "CD3G"    "CD3D"    "CD3E"    "THY1"    "SKAP1"   "MS4A6B" 
##  [8] "LCK"     "TMSB10"  "GM2682"  "H2-AA"   "H2-AB1"  "H2-EB1"  "CD79A"  
## [15] "CD74"    "CD79B"   "LY6D"    "EBF1"    "BANK1"   "H2-DMB2"

#head(DE_genes_cl1_vs_3[[1]]) 
DE_genes_cl1_vs_3[[2]]

To gain a complete overview of DEGs between groups or clusters the GEX_pairwise_degs function is used This function calculates DEGs for every possible pairwise comparison between groups of the selected column. It is thereby recommended to use this function with a maximum number of 10 groups/clusters/samples This function automatically saves plots where top label.n.top.genes genes are labeled by their p value. Additionally, genes of special interest that should be labeled irregardless of their p value can be supplied to genes.to.label

DE_clusters_all <- GEX_pairwise_DEGs(GEX = VGM[[2]], group.by = "seurat_clusters", min.pct = 0.25, RP.MT.filter = T, label.n.top.genes = 10, genes.to.label = c("CD74", "EBF1"), save.plot = F)

## Calculating pairwise DEGs 1 of 66

## [1] 1
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 10
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 2 of 66

## [1] 1
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 4
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 3 of 66

## [1] 1
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 9
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 4 of 66

## [1] 1
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 6
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 5 of 66

## [1] 1
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 7
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 6 of 66

## [1] 1
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 2
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 7 of 66

## [1] 1
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 0
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 8 of 66

## [1] 1
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 3
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 9 of 66

## [1] 1
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 5
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 10 of 66

## [1] 1
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 8
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 11 of 66

## [1] 1
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 11
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 12 of 66

## [1] 10
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 4
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 13 of 66

## [1] 10
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 9
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 14 of 66

## [1] 10
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 6
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 15 of 66

## [1] 10
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 7
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 16 of 66

## [1] 10
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 2
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 17 of 66

## [1] 10
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 0
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 18 of 66

## [1] 10
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 3
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 19 of 66

## [1] 10
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 5
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 20 of 66

## [1] 10
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 8
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 21 of 66

## [1] 10
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 11
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 22 of 66

## [1] 4
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 9
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 23 of 66

## [1] 4
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 6
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 24 of 66

## [1] 4
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 7
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 25 of 66

## [1] 4
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 2
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 26 of 66

## [1] 4
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 0
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 27 of 66

## [1] 4
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 3
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 28 of 66

## [1] 4
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 5
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 29 of 66

## [1] 4
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 8
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 30 of 66

## [1] 4
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 11
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 31 of 66

## [1] 9
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 6
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 32 of 66

## [1] 9
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 7
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 33 of 66

## [1] 9
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 2
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 34 of 66

## [1] 9
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 0
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 35 of 66

## [1] 9
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 3
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 36 of 66

## [1] 9
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 5
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 37 of 66

## [1] 9
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 8
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 38 of 66

## [1] 9
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 11
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 39 of 66

## [1] 6
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 7
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 40 of 66

## [1] 6
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 2
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 41 of 66

## [1] 6
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 0
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 42 of 66

## [1] 6
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 3
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 43 of 66

## [1] 6
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 5
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 44 of 66

## [1] 6
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 8
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 45 of 66

## [1] 6
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 11
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 46 of 66

## [1] 7
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 2
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 47 of 66

## [1] 7
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 0
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 48 of 66

## [1] 7
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 3
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 49 of 66

## [1] 7
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 5
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 50 of 66

## [1] 7
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 8
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 51 of 66

## [1] 7
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 11
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 52 of 66

## [1] 2
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 0
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 53 of 66

## [1] 2
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 3
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 54 of 66

## [1] 2
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 5
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 55 of 66

## [1] 2
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 8
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 56 of 66

## [1] 2
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 11
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 57 of 66

## [1] 0
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 3
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 58 of 66

## [1] 0
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 5
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 59 of 66

## [1] 0
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 8
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 60 of 66

## [1] 0
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 11
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 61 of 66

## [1] 3
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 5
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 62 of 66

## [1] 3
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 8
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 63 of 66

## [1] 3
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 11
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 64 of 66

## [1] 5
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 8
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 65 of 66

## [1] 5
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 11
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

## Calculating pairwise DEGs 66 of 66

## [1] 8
## Levels: 11 < 8 < 5 < 3 < 0 < 2 < 7 < 6 < 9 < 4 < 10 < 1
## [1] 11
## Levels: 1 < 10 < 4 < 9 < 6 < 7 < 2 < 0 < 3 < 5 < 8 < 11

Now that cluster defining genes have been examined, a helper plotting function allows to visualize selected genes.

dottile <- GEX_dottile_plot(GEX = VGM[[2]], genes = c("CD19", "CD74","SDC1", "EBF1","PTPRC","CD93","CD38","CD24A","CD34","CD1D1","CR2","MS4A1","CXCR5","SELL","CD40","CD83","H2-AB1","H2-EB1","CD27","POU2AF1","NT5E","FAS","PDCD1LG2","PRDM1","ITGAM","IL10","IL12A","HAVCR2"), group.by = "seurat_clusters", threshold.to.plot = 5) 
#threshold.to.plot specifies how many % of cells have to express a gene to show a dot.
dottile + ggplot2::theme(plot.title = ggplot2::element_blank(), plot.subtitle = ggplot2::element_blank(), legend.position = "bottom")

## Warning: Removed 192 rows containing missing values (`geom_point()`).

#For visualisation purposes in the RMD format

To examine patterns of co-expression two functions allow to generate both overview and single-cell resolution plots.

#overview
coexpression_dotmap <- GEX_coexpression_coefficient(GEX = VGM[[2]], genes = c("CD19", "CD74","SDC1", "EBF1","PTPRC","CD93","CD38","CD24A","CD34"), plot.dotmap = T)

## Calculating coexpression for 9 genes with 2064 cells

coexpression_dotmap + ggplot2::theme(legend.position = "bottom")
#detail of two selected genes
GEX_scatter_coexpression(GEX = VGM[[2]], gene.1 = "CD19", gene.2 = "EBF1", color.theme = "darkred")
#to save use: 
#ggsave(last_plot(), filename = "Coexpression_scatter.png")

5. VDJ Repertoire anaylsis

5.1 Checking VGM output

Now we can analyze the VDJ repertoire data before integrating VDJ and GEX information. This may be useful if only VDJ libraries have been sequenced without the accompanying gene expression data. We first start by examining the structure of VGM[[1]]

print(colnames(VGM[[1]]))

##  [1] "barcode"                 "sample_id"              
##  [3] "group_id"                "clonotype_id_10x"       
##  [5] "clonotype_id"            "clonotype_frequency"    
##  [7] "celltype"                "Nr_of_VDJ_chains"       
##  [9] "Nr_of_VJ_chains"         "VDJ_cdr3s_aa"           
## [11] "VJ_cdr3s_aa"             "VDJ_cdr3s_nt"           
## [13] "VJ_cdr3s_nt"             "VDJ_umis"               
## [15] "VJ_umis"                 "VDJ_chain_contig"       
## [17] "VJ_chain_contig"         "VDJ_chain"              
## [19] "VJ_chain"                "VDJ_vgene"              
## [21] "VJ_vgene"                "VDJ_dgene"              
## [23] "VDJ_jgene"               "VJ_jgene"               
## [25] "VDJ_cgene"               "VJ_cgene"               
## [27] "VDJ_sequence_nt_raw"     "VJ_sequence_nt_raw"     
## [29] "VDJ_sequence_nt_trimmed" "VJ_sequence_nt_trimmed" 
## [31] "VDJ_sequence_aa"         "VJ_sequence_aa"         
## [33] "VDJ_raw_ref"             "VJ_raw_ref"             
## [35] "VDJ_trimmed_ref"         "VJ_trimmed_ref"         
## [37] "VDJ_raw_consensus_id"    "VJ_raw_consensus_id"    
## [39] "orig_barcode"            "specifity"              
## [41] "affinity"                "GEX_available"          
## [43] "orig.ident"              "seurat_clusters"        
## [45] "PC_1"                    "PC_2"                   
## [47] "tSNE_1"                  "tSNE_2"                 
## [49] "batch_id"

If GEX data has been integrated, this dataframe also contains dimensional reduction coordinates and cluster id. If trim.and.align was set to TRUE, the columns VDJ_sequence_nt_trimmed to VJ_trimmed_ref contain aligned and reference sequences. The clonotype_frequency column takes input from the clonotyping output of cellranger, which is saved int the clonotype_id_10x column.

This object contains both cells with less than 2 V(D)J chains as well as with more than 2 To show how this is done the VDJ_cgene column is taken as an example

print(unique(VGM[[1]]$VDJ_cgene))

## [1] "IGHM"        "IGHD"        ""            "IGHM;IGHM"   "IGHG2B"     
## [6] "IGHM;IGHD"   "IGHG1"       "IGHA"        "IGHM;IGHG2C"

Apart from classical isotypes, we can see an empty (““) element. This is for cells, for which only a VJ chain is available. Secondly we can see 3 examples of cells containing more than one VDJ chain. These are delimited by”;“. Importantly, this format is maintained throughout all columns.

Both cells with less and more than 2 chains can be cumbersome for analysis. Platypus V3 functions generally support such cells. To check their quantity and, if needed, exclude these, the following line can be used

For advanced filtering and selection options concerning cells with aberrant number of chains, please refer to the detailed VDJ_GEX_matrix vignette.

cat(" Cell count by number of VDJ chains")

##  Cell count by number of VDJ chains

print(table(VGM[[1]]$Nr_of_VDJ_chains))

## 
##   0   1   2 
##  92 831  22

cat("\n Cell count by number of VJ chains")

## 
##  Cell count by number of VJ chains

print(table(VGM[[1]]$Nr_of_VJ_chains))

## 
##   0   1   2 
##  62 820  63

#Subset the VGM matrix to only include cells with 1 VDJ and 1VJ chain
#VGM[[1]] <- subset(VGM[[1]], Nr_of_VJ_chains == 1 & Nr_of_VJ_chains == 1)

5.2 Changing the clonotype strategy

Often paired nucleotide CDRH3 and CDRL3 clonotyping may not be the best strategy given somatic hypermutation may occur in the CDR3 region. Therefore, there could be highly similar clones that likely bind the same antigen that are officially part of different clonal families. To address this, we have added a function that allows for various heuristic clonotyping strategies. This involves clonotyping by identical amino acid CDRH3 + CDRL3 seuqence, identical germline usage, or seqeunce homology requirements. This function works both for V3 and V2 of platypus. The version needs to be specified. For V3 (VGM) platypus.version a new clonotyping function was added to improve the existing one. For the full documentation see VDJ_clonotype_v3(). The deprecated function VDJ_clonotype may still be used for Platypus v2 input Three new parameters have been added therein: 1. global.clonotype If set to TRUE, clonotypes will be generated across all samples of the VGM. This may be useful if more than one sample has been taken from the same animal (e.g. spleen and bone marrow). Moreover this can provide an alternative approach to investigate clone sharing across donors or mice. After global clonotyping, the resulting clonotype_id can be used as a grouping variable and summary statistics can be generated per clone (including its presence in multiple repertoires.) A key update as part of the hierarchical clonotyping strategy is the same-sample-criterion. This is explained below. 2. hierarchical For cells with aberrant numbers of VDJ and VJ chains: the functions implements a hierarchical clonotyping strategy for such cells. In brief, clonotypes are determined based on all cells with exactly 1VDJ and 1VJ chain. Other cells are then “joined in” in post. This follows three steps 1. hierarchical = “none” Create clonotypes from 1 VDJ 1 VJ cells. 2. hierarchical = “single.chains” Cells with a single chain (e.g. 0VDJ and 1VJ) are merged into existing clonotypes containing the same or a homologous VJ chain, depending on the clonotyping strategy. Here the same-sample-criterion is applied when using global clonotyping. If multiple matching clonotypes have been found for that cell then ones within the same sample_id are prioritized. Further if multiple matches exist, the cell is merged into the existing clone with the highest frequency 2. hierarchical = “double.and.single.chains” A combination of 1VDJ and 2VJ chain will first be checked for frequency in the dataset (by exact nucleotide CDR3 matching). Only if this count exceeds the triple.chain.count.threshold, the clone is used as a “hub clone”. This protects from merging clonotypes on the basis of rare doublets. Existing clonotypes are then merged into the 1 VDJ 2 VJ clonotypes as they match with the assumption that e.g. a cell with 1 VDJ 1 VJ is part of that same clonotype, but missing a VJ chain due to stochastical sampling

In the example that follows, we use VDJ_clonotype to group cells into clones based on identical CDRH3 + CDRL3 amino acid sequence. We will compare this to the case in which we group B cells by using the same germline genes (both heavy chain and light chain).

VGM[[1]] <- VDJ_clonotype(VDJ = VGM[[1]], clone.strategy = "cdr3.aa", global.clonotype = F, VDJ.VJ.1chain = F, hierarchical = "single.chains")

## Clonotyping strategy: cdr3.aa

## Filtered out 22 cells containing more than one VDJ AND VJ chain or two VDJ chains, as these likely correspond to doublets

## Processing sample 1 of 4

## Attempting to merge in 9 aberrant cells

## Processing sample 2 of 4

## Attempting to merge in 28 aberrant cells

## Processing sample 3 of 4

## Attempting to merge in 23 aberrant cells

## Processing sample 4 of 4

## Attempting to merge in 94 aberrant cells

## Backing up 10x default clonotyping in columns clonotype_id_10x and clonotype_frequency_10x before updating clonotype_id and clonotype_frequency columns

cat(" Nr and distribution of clonotypes using exact CDR3.aa matching \n")

##  Nr and distribution of clonotypes using exact CDR3.aa matching

print(length(unique(VGM[[1]]$clonotype_id_cdr3.aa)))

## [1] 321

print(table(VGM[[1]]$clonotype_frequency_cdr3.aa)) #Check distribution of clonotypes with identical CDR3 aa sequences

## 
##   1   2   3   4   5   6   7   8   9  12  24 
## 593 118  69  24  15  36   7  16   9  12  24

VGM[[1]] <- VDJ_clonotype(VDJ = VGM[[1]], clone.strategy = "VDJJ.VJJ", global.clonotype = F, VDJ.VJ.1chain = F, hierarchical = "double.and.single.chains", triple.chain.count.threshold = 4)

## Clonotyping strategy: VDJJ.VJJ

## Filtered out 0 cells containing more than one VDJ AND VJ chain or two VDJ chains, as these likely correspond to doublets

## Found 0 exact matching clones with 3 chains and a frequency of at least 4. These will be used as high confidence clonotypes.

## Processing sample 1 of 4

## Attempting to merge in 9 aberrant cells

## Processing sample 2 of 4

## Attempting to merge in 28 aberrant cells

## Processing sample 3 of 4

## Attempting to merge in 23 aberrant cells

## Processing sample 4 of 4

## Attempting to merge in 94 aberrant cells

cat("\n Nr and distribution of clonotypes using germline gene matching \n ")

## 
##  Nr and distribution of clonotypes using germline gene matching 
##

print(length(unique(VGM[[1]]$clonotype_id_VDJJ.VJJ)))

## [1] 295

print(table(VGM[[1]]$clonotype_frequency_VDJJ.VJJ)) #Check distribution of clonotypes with identical germline genes

## 
##   1   2   3   4   5   6   7   8  10  12  24 
## 534 148  84  32  30  12  21  16  10  12  24

Columns with new clonotype information are added to the VGM so to allow for multiple different strategies to be compared easily. For most Platypus downstream functions, the columns default columns “clonotype_id” and “clonotype_frequency” are used. To control which clonotyping strategy is used in these functions, the relevant columns can be copied over

VGM[[1]]$clonotype_id <- VGM[[1]]$clonotype_id_cdr3.aa
VGM[[1]]$clonotype_frequency <- VGM[[1]]$clonotype_frequency_cdr3.aa

5.3 Extracting full-length sequences from the VDJRegion

At the start of this vignette the VGM function was run with trim.and.align = F. Here we run the function again, but this time with trim.and.align = T. This does take significantly longer

To accelerate runtime, parallel computing options are available. If running a Windows machine parallel.processing should be set to “parlapply” This does require the package Parallel and its dependencies. The additional parameter numcores allows to set the numbers of cores to use. This is important when running the function on a cluster. If on a MAC or Linux machine, set parallel.processing to “mclapply”

Reference sequences are obtained from the cellranger output and thereby the 10x Genomics reference. Further trimming is made based on annotations by cellranger (feature$region_type). Alignment is done via Biostrings::pairwiseAlignment and the alignment with maximum score is returned. If alignments are not complete, the VGM parameters gap.opening.cost and gap.extension.cost can be modified

VGM <- VDJ_GEX_matrix(Data.in = yermanos2021_raw, #For local dataset input see section 5.1
                               GEX.integrate = T,
                               VDJ.combine = T,
                               integrate.GEX.to.VDJ = T,
                               integrate.VDJ.to.GEX = T, 
                               trim.and.align = T, #Set this to TRUE for full length sequence recovery
                               gap.opening.cost = 10,
                               gap.extension.cost = 4, #Tweak to optimize alignments if neccessary
                               parallel.processing = "parlapply")

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at -2.0899

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 0.30103

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 7.0607e-15

## Warning: The following requested variables were not found: UMAP_1, UMAP_2

No trimmed and reference sequence columns should be filled.

print(VGM[[1]][1,])

##               barcode sample_id group_id clonotype_id_10x clonotype_id
## 1 s1_AAACGGGGTTTAGGAA        s1        1       clonotype7   clonotype7
##   clonotype_frequency celltype Nr_of_VDJ_chains Nr_of_VJ_chains VDJ_cdr3s_aa
## 1                   1   B cell                1               1  CARIGYAMDYW
##   VJ_cdr3s_aa                      VDJ_cdr3s_nt
## 1 CQQGNTLPPTF TGTGCTCGAATAGGATATGCTATGGACTACTGG
##                         VJ_cdr3s_nt VDJ_umis VJ_umis
## 1 TGCCAACAGGGTAATACGCTTCCTCCGACGTTC        2       8
##              VDJ_chain_contig             VJ_chain_contig VDJ_chain VJ_chain
## 1 AAACGGGGTTTAGGAA-1_contig_2 AAACGGGGTTTAGGAA-1_contig_1       IGH      IGK
##   VDJ_vgene  VJ_vgene VDJ_dgene VDJ_jgene VJ_jgene VDJ_cgene VJ_cgene
## 1   IGHV8-8 IGKV10-96               IGHJ4    IGKJ1      IGHM     IGKC
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            VDJ_sequence_nt_raw
## 1 TGGGAAGTGTGCAGCCATGGGCAGGCTTACTTCTTCATTCCTGTTACTGATTGTCCCTGCATATGTCCTGTCCCAGGTTACTCTGAAAGAGTCTGGCCCTGGGATATTGCAGCCCTCCCAGACCCTCAGTCTGACTTGTTCTTTCTCTGGGTTTTCACTGAGCACTTTTGGTATGGGTGTAGGCTGGATTCGTCAGCCTTCAGGGAAGGGTCTGGAGTGGCTGGCACACATTTGGTGGGATGATGATAAGTACTATAACCCAGCCCTGAAGAGTCGGCTCACAATCTCCAAGGATACCTCCAAAAACCAGGTATTCCTCAAGATCGCCAATGTGGACACTGCAGATACTGCCACATACTACTGTGCTCGAATAGGATATGCTATGGACTACTGGGGTCAAGGAACCTCAGTCACCGTCTCCTCAGAGAGTCAGTCCTTCCCAAATGTCTTCCCCCTCGTCTCCTGCGAGAGCCCCCTGTCTGATAAGAATCTGGTGGCCATGGGCTGCCTGGCCCGGGACTTCCTGCCCAGCACCATTTCCTTCACCTGGAACTACCAGAACAACACTGAAGTCATCCAGGGTATCAGAACCTTCCCAACACTGAGGACAGGGGGCAAGTACCTAGCCACCTCGCA
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        VJ_sequence_nt_raw
## 1 TTGGGGCATTGTAATTGAAGTCAAGACTCAGCCTGGACATGATGTCCTCTGCTCAGTTCCTTGGTCTCCTGTTGCTCTGTTTTCAAGGTACCAGATGTGATATCCAGATGACACAGACTACATCCTCCCTGTCTGCCTCTCTGGGAGACAGAGTCACCATCAGTTGCAGGGCAAGTCAGGACATTAGCAATTATTTAAACTGGTATCAGCAGAAACCAGATGGAACTGTTAAACTCCTGATCTACTACACATCAAGATTACACTCAGGAGTCCCATCAAGGTTCAGTGGCAGTGGGTCTGGAACAGATTATTCTCTCACCATTAGCAACCTGGAGCAAGAAGATATTGCCACTTACTTTTGCCAACAGGGTAATACGCTTCCTCCGACGTTCGGTGGAGGCACCAAGCTGGAAATCAAACGGGCTGATGCTGCACCAACTGTATCCATCTTCCCACCATCCAGTGAGCAGTTAACATCTGGAGGTGCCTCAGTCGTGTGCTTC
##                                                                                                                                                                                                                                                                                                                                                                                                    VDJ_sequence_nt_trimmed
## 1 ATGGGCAGGCTTACTTCTTCATTCCTGTTACTGATTGTCCCTGCATATGTCCTGTCCCAGGTTACTCTGAAAGAGTCTGGCCCTGGGATATTGCAGCCCTCCCAGACCCTCAGTCTGACTTGTTCTTTCTCTGGGTTTTCACTGAGCACTTTTGGTATGGGTGTAGGCTGGATTCGTCAGCCTTCAGGGAAGGGTCTGGAGTGGCTGGCACACATTTGGTGGGATGATGATAAGTACTATAACCCAGCCCTGAAGAGTCGGCTCACAATCTCCAAGGATACCTCCAAAAACCAGGTATTCCTCAAGATCGCCAATGTGGACACTGCAGATACTGCCACATACTACTGTGCTCGAATAGGATATGCTATGGACTACTGGGGTCAAGGAACCTCAGTCACCGTCTCCTCA
##                                                                                                                                                                                                                                                                                                                                                                          VJ_sequence_nt_trimmed
## 1 ATGATGTCCTCTGCTCAGTTCCTTGGTCTCCTGTTGCTCTGTTTTCAAGGTACCAGATGTGATATCCAGATGACACAGACTACATCCTCCCTGTCTGCCTCTCTGGGAGACAGAGTCACCATCAGTTGCAGGGCAAGTCAGGACATTAGCAATTATTTAAACTGGTATCAGCAGAAACCAGATGGAACTGTTAAACTCCTGATCTACTACACATCAAGATTACACTCAGGAGTCCCATCAAGGTTCAGTGGCAGTGGGTCTGGAACAGATTATTCTCTCACCATTAGCAACCTGGAGCAAGAAGATATTGCCACTTACTTTTGCCAACAGGGTAATACGCTTCCTCCGACGTTCGGTGGAGGCACCAAGCTGGAAATCAAA
##                                                                                                                            VDJ_sequence_aa
## 1 MGRLTSSFLLLIVPAYVLSQVTLKESGPGILQPSQTLSLTCSFSGFSLSTFGMGVGWIRQPSGKGLEWLAHIWWDDDKYYNPALKSRLTISKDTSKNQVFLKIANVDTADTATYYCARIGYAMDYWGQGTSVTVSS
##                                                                                                                    VJ_sequence_aa
## 1 MMSSAQFLGLLLLCFQGTRCDIQMTQTTSSLSASLGDRVTISCRASQDISNYLNWYQQKPDGTVKLLIYYTSRLHSGVPSRFSGSGSGTDYSLTISNLEQEDIATYFCQQGNTLPPTFGGGTKLEIK
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        VDJ_raw_ref
## 1 ATGGGCAGGCTTACTTCTTCATTCCTGTTACTGATTGTCCCTGCATATGTCCTGTCCCAGGTTACTCTGAAAGAGTCTGGCCCTGGGATATTGCAGCCCTCCCAGACCCTCAGTCTGACTTGTTCTTTCTCTGGGTTTTCACTGAGCACTTTTGGTATGGGTGTAGGCTGGATTCGTCAGCCTTCAGGGAAGGGTCTGGAGTGGCTGGCACACATTTGGTGGGATGATGATAAGTACTATAACCCAGCCCTGAAGAGTCGGCTCACAATCTCCAAGGATACCTCCAAAAACCAGGTATTCCTCAAGATCGCCAATGTGGACACTGCAGATACTGCCACATACTACTGTGCTCGAATAGATTACTATGCTATGGACTACTGGGGTCAAGGAACCTCAGTCACCGTCTCCTCAGAGAGTCAGTCCTTCCCAAATGTCTTCCCCCTCGTCTCCTGCGAGAGCCCCCTGTCTGATAAGAATCTGGTGGCCATGGGCTGCCTGGCCCGGGACTTCCTGCCCAGCACCATTTCCTTCACCTGGAACTACCAGAACAACACTGAAGTCATCCAGGGTATCAGAACCTTCCCAACACTGAGGACAGGGGGCAAGTACCTAGCCACCTCGCAGGTGTTGCTGTCTCCCAAGAGCATCCTTGAAGGTTCAGATGAATACCTGGTATGCAAAATCCACTACGGAGGCAAAAACAAAGATCTGCATGTGCCCATTCCAGCTGTCGCAGAGATGAACCCCAATGTAAATGTGTTCGTCCCACCACGGGATGGCTTCTCTGGCCCTGCACCACGCAAGTCTAAACTCATCTGCGAGGCCACGAACTTCACTCCAAAACCGATCACAGTATCCTGGCTAAAGGATGGGAAGCTCGTGGAATCTGGCTTCACCACAGATCCGGTGACCATCGAGAACAAAGGATCCACACCCCAAACCTACAAGGTCATAAGCACACTTACCATCTCTGAAATCGACTGGCTGAACCTGAATGTGTACACCTGCCGTGTGGATCACAGGGGTCTCACCTTCTTGAAGAACGTGTCCTCCACATGTGCTGCCAGTCCCTCCACAGACATCCTAACCTTCACCATCCCCCCCTCCTTTGCCGACATCTTCCTCAGCAAGTCCGCTAACCTGACCTGTCTGGTCTCAAACCTGGCAACCTATGAAACCCTGAATATCTCCTGGGCTTCTCAAAGTGGTGAACCACTGGAAACCAAAATTAAAATCATGGAAAGCCATCCCAATGGCACCTTCAGTGCTAAGGGTGTGGCTAGTGTTTGTGTGGAAGACTGGAATAACAGGAAGGAATTTGTGTGTACTGTGACTCACAGGGATCTGCCTTCACCACAGAAGAAATTCATCTCAAAACCCAATGAGGTGCACAAACATCCACCTGCTGTGTACCTGCTGCCACCAGCTCGTGAGCAACTGAACCTGAGGGAGTCAGCCACAGTCACCTGCCTGGTGAAGGGCTTCTCTCCTGCAGACATCAGTGTGCAGTGGCTTCAGAGAGGGCAACTCTTGCCCCAAGAGAAGTATGTGACCAGTGCCCCGATGCCAGAGCCTGGGGCCCCAGGCTTCTACTTTACCCACAGCATCCTGACTGTGACAGAGGAGGAATGGAACTCCGGAGAGACCTATACCTGTGTTGTAGGCCACGAGGCCCTGCCACACCTGGTGACCGAGAGGACCGTGGACAAGTCCACTGGTAAACCCACACTGTACAATGTCTCCCTGATCATGTCTGACACAGGCGGCACCTGCTAT
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          VJ_raw_ref
## 1 ATGATGTCCTCTGCTCAGTTCCTTGGTCTCCTGTTGCTCTGTTTTCAAGGTACCAGATGTGATATCCAGATGACACAGACTACATCCTCCCTGTCTGCCTCTCTGGGAGACAGAGTCACCATCAGTTGCAGGGCAAGTCAGGACATTAGCAATTATTTAAACTGGTATCAGCAGAAACCAGATGGAACTGTTAAACTCCTGATCTACTACACATCAAGATTACACTCAGGAGTCCCATCAAGGTTCAGTGGCAGTGGGTCTGGAACAGATTATTCTCTCACCATTAGCAACCTGGAGCAAGAAGATATTGCCACTTACTTTTGCCAACAGGGTAATACGCTTCCTCCGTGGACGTTCGGTGGAGGCACCAAGCTGGAAATCAAACGGGCTGATGCTGCACCAACTGTATCCATCTTCCCACCATCCAGTGAGCAGTTAACATCTGGAGGTGCCTCAGTCGTGTGCTTCTTGAACAACTTCTACCCCAAAGACATCAATGTCAAGTGGAAGATTGATGGCAGTGAACGACAAAATGGCGTCCTGAACAGTTGGACTGATCAGGACAGCAAAGACAGCACCTACAGCATGAGCAGCACCCTCACGTTGACCAAGGACGAGTATGAACGACATAACAGCTATACCTGTGAGGCCACTCACAAGACATCAACTTCACCCATTGTCAAGAGCTTCAACAGGAATGAGTGT
##                                                                                                                                                                                                                                                                                                                                                                                                               VDJ_trimmed_ref
## 1 ATGGGCAGGCTTACTTCTTCATTCCTGTTACTGATTGTCCCTGCATATGTCCTGTCCCAGGTTACTCTGAAAGAGTCTGGCCCTGGGATATTGCAGCCCTCCCAGACCCTCAGTCTGACTTGTTCTTTCTCTGGGTTTTCACTGAGCACTTTTGGTATGGGTGTAGGCTGGATTCGTCAGCCTTCAGGGAAGGGTCTGGAGTGGCTGGCACACATTTGGTGGGATGATGATAAGTACTATAACCCAGCCCTGAAGAGTCGGCTCACAATCTCCAAGGATACCTCCAAAAACCAGGTATTCCTCAAGATCGCCAATGTGGACACTGCAGATACTGCCACATACTACTGTGCTCGAATAGATTACTATGCTATGGACTACTGGGGTCAAGGAACCTCAGTCACCGTCTCCTCA
##                                                                                                                                                                                                                                                                                                                                                                                     VJ_trimmed_ref
## 1 ATGATGTCCTCTGCTCAGTTCCTTGGTCTCCTGTTGCTCTGTTTTCAAGGTACCAGATGTGATATCCAGATGACACAGACTACATCCTCCCTGTCTGCCTCTCTGGGAGACAGAGTCACCATCAGTTGCAGGGCAAGTCAGGACATTAGCAATTATTTAAACTGGTATCAGCAGAAACCAGATGGAACTGTTAAACTCCTGATCTACTACACATCAAGATTACACTCAGGAGTCCCATCAAGGTTCAGTGGCAGTGGGTCTGGAACAGATTATTCTCTCACCATTAGCAACCTGGAGCAAGAAGATATTGCCACTTACTTTTGCCAACAGGGTAATACGCTTCCTCCGTGGACGTTCGGTGGAGGCACCAAGCTGGAAATCAAA
##      VDJ_raw_consensus_id     VJ_raw_consensus_id     orig_barcode specifity
## 1 clonotype7_concat_ref_1 clonotype7_concat_ref_2 AAACGGGGTTTAGGAA        NA
##   affinity GEX_available orig.ident seurat_clusters PC_1 PC_2 tSNE_1 tSNE_2
## 1       NA         FALSE       <NA>            <NA>   NA   NA     NA     NA
##                                                                            batch_id
## 1 N:/reddy/Shared_Folders/AA_AY/DBmanuscript_Aged_CNS/GEX/Aged.CNS.pool.3m.Bcell.S1

Lastly, you can extract full receptor sequences, annotated regions, as well as consensus and germline sequences (trimmed or raw) using the minimal_VDJ function, as demonstrated before. This is intended for use on the output files of cellranger v7.

5.4 Visualizing clonal frequencies (V3 only)

For a basic view of clonal expansion the VDJ_clonal_donut produces circular plots per sample. Label position and size should only be adjusted after deciding on a plot export format. This function uses the clonotype_id column as input. If VDJ_clonotyping function was used, its result are stored in that column. If not, the default 10x clonotyping is stored there. To retrieve default 10x clonotyping, use the column clonotyping_id_10x

donuts <- VDJ_clonal_donut(VDJ = VGM[[1]], expanded.colors = c("grey50", "grey65", "grey80"), non.expanded.color = "black", counts.to.use = "freq_column")

## Warning in VDJ_clonal_donut(VDJ = VGM[[1]], expanded.colors = c("grey50", :
## Column name for counts.to.use was not found in VDJ. Defaulting to
## 'clonotype_id_10x'

## Using column clonotype_id for counting clones

## Clones: Expanded: 5 / 6.85%; 1 cell 68 / 93.15%; total: 73

## Cells: Expanded: 10 / 12.82%; 1 cell 68 / 87.18%; Total: 78

## Clones: Expanded: 14 / 10.85%; 1 cell 115 / 89.15%; total: 129

## Cells: Expanded: 45 / 28.12%; 1 cell 115 / 71.88%; Total: 160

## Clones: Expanded: 7 / 3.5%; 1 cell 193 / 96.5%; total: 200

## Cells: Expanded: 15 / 7.21%; 1 cell 193 / 92.79%; Total: 208

## Clones: Expanded: 64 / 18.34%; 1 cell 285 / 81.66%; total: 349

## Cells: Expanded: 214 / 42.89%; 1 cell 285 / 57.11%; Total: 499

#Counts to use = "freq_column" uses the counts in the clonotype_frequency column. Counts to use = "VGM" simply counts the rows of a given clonotype in the VGM table (these counts may differ if cells have been filtered out due to overlapping barcodes or if another clonotyping strategy was used)
donuts[[1]]

The black section shows cells in unexpanded clones. The rest shows expanded clones by their frequency. The middle label shows the total number of clones (cells)

5.5 Calculating common repertoire diversity metrics

This requires the Vegan package. We can calculate the diversity for any column(s) in the VGM. If more than one column is provided, the content will be pasted together before calculating diversity metrics

#Shannon Evenness for the VDJ chain CDR3
diversity_plot <- VDJ_diversity(VDJ = VGM[[1]],feature.columns = c("VDJ_cdr3s_aa"),grouping.column = "sample_id",metric = c("shannonevenness"), subsample.to.same.n = T)
diversity_plot
#Gini-Simpson index for pasted VDJ and VJ chain CDR3s
diversity_plot <- VDJ_diversity(VDJ = VGM[[1]],feature.columns = c("VDJ_cdr3s_aa", "VJ_cdr3s_aa"),grouping.column = "sample_id", metric = c("ginisimpson"), subsample.to.same.n = T)
diversity_plot
#exact values can be retrived as 
#Jaccard index between repertoires of the two samples
diversity_plot <- VDJ_diversity(VDJ = VGM[[1]],feature.columns = c("VDJ_cgene"),grouping.column = "sample_id",metric = c("jaccard"), subsample.to.same.n = T)
diversity_plot

5.6 Sequence similarity networks

Other functions are specifically tailored to repertoire analysis - such as VDJ_network, which creates a sequence similarity network between repertoires or within a repertoire by connecting those clones with sequence similarity. This function relies upon igraph to visually display and construct the graph - which means that networks with high number of sequences will not display easily. In the following example we are using a small dataset. In case of a bigger dataset one may subsample by using the sample.n function on VGM[[1]]. Setting the per.mouse argument to false indicates that one network for multiple repertoires should be produced.

This function is quite specialized and we are working on building a better networks module. But hopefully the code can be used as a template for your own analyses.

#subsampled_VGM <- dplyr::sample_n(VGM[[1]], 60)
cl_network <- VDJ_network(VDJ = VGM[[1]],per.sample = T,distance.cutoff = 8, platypus.version = "v3")
for(i in 1:2){
igraph::plot.igraph(cl_network[[1]][[i]],vertex.label=NA,vertex.size=7)
}
#For a plot including clonotype frequencies
cl_network <- VDJ_network(VDJ = VGM[[1]],per.sample = F,distance.cutoff = 8, platypus.version = "v3")
igraph::plot.igraph(cl_network[[4]],vertex.label=NA,vertex.size=3+(0.95*as.numeric(cl_network[[2]]$clonotype_frequency)),vertex.color= as.factor(VGM[[1]]$sample_id))

For more details see the documentation of the VDJ_network function, but essentially information such as clonal frequency and which sample (here still indicated by the “sample_id” column) are stored in the second element of the output list. Here we can see only a few clones that are showing connections (produced by edges between those with a distance of 8 amino acids or less between heavy and light chain paired CDR3 sequence homology)

5.7 Germline gene usage heatmaps

It is also possible to produce heatmaps of the germline gene usage in the context of heavy chain V gene and light chain V gene. The output of the VDJ_Vgene_usage function is a matrix for each repertoire corresponding to the order specified by VDJ_analyze. The outer list corresponds to the sample and the inner list corresponds to a matrix, where the rows correspond to the heavy chain V genes and the columns correspond to the light chains of the V genes. Therefore the output[[1]][i,j] corresponds to the number of clones using the combination of IGH-Vgene[i] and IGK/L-Vgene[j].

#First calculate adjacency matrix for V gene usage
Vgene_usage <- VDJ_Vgene_usage(VDJ = VGM[[1]], platypus.version = "v3")
#library(pheatmap)
pheatmap::pheatmap(Vgene_usage[[1]],show_rownames = F,show_colnames = F)
print(class(Vgene_usage[[1]]))

## [1] "matrix" "array"

print(head(rownames(Vgene_usage[[1]])))

## [1] "IGHV8-8"  "IGHV1-9"  "IGHV1-26" "IGHV14-1" "IGHV5-17" "IGHV5-16"

print(head(colnames(Vgene_usage[[1]])))

## [1] "IGKV10-96" "IGKV6-32"  "IGKV19-93" "IGKV8-24"  "IGKV1-110" "IGKV4-55"

This can then be easily plotted as a heatmap to observe patterns between repertoires or can be used to calculate V gene correlation using the “pheatmap” package.

Platypus also allows a separate analysis of V gene usage for HC and LC. The VDJ_Vgene_usage_barplot allows the user to plot most frequently used IgH or IgK/L V genes. By default, this function only provides visualizations for the HC V genes, but can also provide for the LC if LC. Vgene is set to TRUE. The User can also select the number of most used genes to be depicted.

Vgene_usage_barplot <- VDJ_Vgene_usage_barplot(VDJ = VGM[[1]], HC.gene.number = 10, LC.Vgene = T, LC.gene.number = 10, platypus.version = "v3")

## Sample order: s1 ; s2 ; s3 ; s4

Vgene_usage_barplot[[1]]
#VDJ chains
Vgene_usage_stackedbarplot <- VDJ_Vgene_usage_stacked_barplot(VDJ = VGM[[1]], LC.Vgene = F,HC.gene.number = 10, Fraction.HC = 1, platypus.version = "v3")

## Sample order: s1 ; s2 ; s3 ; s4

Vgene_usage_stackedbarplot

Furthermore, we can also produce a circular visualization of how V and J genes are combined throughout the repertoire. In the example that follows we use VDJ_VJ_usage_circos to look at the V gene with the corresponding J gene for each expanded clonotype.

#VDJ and VJ V and J gene pairing
vj_circos_bcells <- VDJ_VJ_usage_circos(VGM = VGM[[1]], c.threshold = 5,label.threshold=5,cell.level = T, platypus.version = "v3")

## Chosen clonotype column:  clonotype_id_10x

## Chosen clonotype.frequency column:  clonotype_frequency_10x

## Chosen clonotype.frequency column:  clonotype_frequency

## Processing sample 1

## Processing sample 2

#VDJ and VJ pairing 
HL_circos_bcells <- VDJ_alpha_beta_Vgene_circos(VGM = VGM[[1]], c.threshold = 5,label.threshold=5,cell.level = T, V.or.J= "both", platypus.version = "v3")

## Chosen clonotype.frequency column:  clonotype_frequency

## [1] "s2"
## [1] "s4"

## Plotting for sample s2

## Plotting for sample s4

5.8 Assessing CDR3 sequence similarity

Finally, one can also look at any specific HC and LC CDR3 amino acid patterns arising across the different clones. Using the VDJ_logoplot_vector function the user can plot a logoplot of the CDR3 region od a certain length, specyfied by the length_cdr3 argument. For instance, the logoplot below corresponds to all those CDR3 aminoacid sequences of length 25

pasted_CDR3s <- paste0(VGM[[1]]$VDJ_cdr3s_aa, VGM[[1]]$VJ_cdr3s_aa)
agedCNS_CDR3_logoplot <- VDJ_logoplot_vector(cdr3.vector = pasted_CDR3s, seq_type = "aa", length_cdr3 = "auto")

## Returning logoplot based on 106 sequences of a total 945 input sequences

print(agedCNS_CDR3_logoplot)

5.9 Assessing Nr of variants per clone

Depending on the clonotype strategy, within a clone several different variants of CDR3s or full length sequences may be contained. To count variants, the function VDJ_variants_per_clone is used. It returns a table per sample_id with stats about variants

variants_agedCNS <- VDJ_variants_per_clone(VDJ = VGM[[1]], variants.of = c("VDJ_cdr3s_aa", "VJ_cdr3s_aa"), clonotypes.col = "clonotype_id_10x", split.by = "sample_id", stringDist.method = "levenshtein")
head(variants_agedCNS[[1]])

##   split_group clonotype_id n sequences n variants mean variants distance
## 1          s1   clonotype7           1          1                     NA
## 2          s1  clonotype58           1          1                     NA
## 3          s1  clonotype65           1          1                     NA
## 4          s1  clonotype43           1          1                     NA
## 5          s1  clonotype33           1          1                     NA
## 6          s1  clonotype49           1          1                     NA
##   max variants distance min variants distance         most frequent variant
## 1                    NA                    NA       CARIGYAMDYW/CQQGNTLPPTF
## 2                    NA                    NA CARDFTTVVARGYFDVW/CQQDYSSPWTF
## 3                    NA                    NA CARGITTVVAYYYAMDYW/CLQYDNLYTF
## 4                    NA                    NA    CTTGPYDYYAMDYW/CQQHYSTPYTF
## 5                    NA                    NA    CARPHDYDGVDYW/CSQSTHVPPWTF
## 6                    NA                    NA   CARRYYSNYAWFAYW/CQQWSSYPYTF
##   n VDJ cgene variants most frequent VDJ cgene
## 1                    1                    IGHM
## 2                    1                    IGHD
## 3                    1                    IGHD
## 4                    1                    IGHD
## 5                    1                    IGHM
## 6                    1                    IGHM

#set split.by to "none" if clonotyping was conducted across all samples

5.10 Searching for overlap between repertoires

Public clones or sequences are sequences that are shared between individuals. The VDJ_overlap_heatmap function can quantify this overlap and return a list of overlapping items

#overlap of VDJ V genes
VDJv_overlap <- VDJ_overlap_heatmap(VDJ = VGM[[1]],feature.columns = c("VDJ_vgene") ,grouping.column = "sample_id", axis.label.size = 20, pvalues.label.size = 12, add.barcode.table = T, plot.type = "ggplot") 
head(VDJv_overlap[[2]]) #summary of overlap

##   V1 V2 overlap
## 1 s1 s2      30
## 2 s1 s3      33
## 3 s1 s4      38
## 4 s2 s3      40
## 5 s2 s4      47
## 6 s3 s4      53
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                    items.overlapping
## 1                                                                                                                                                                                                         IGHV8-8;IGHV1-9;IGHV1-26;IGHV5-17;IGHV5-16;IGHV1-54;IGHV1-72;IGHV1-64;IGHV5-9-1;IGHV1-19;IGHV1-18;IGHV8-5;IGHV3-6;IGHV1-52;IGHV1-80;IGHV1-74;IGHV1-39;IGHV1-22;IGHV10-1;IGHV1-62-2;IGHV9-3;IGHV1-53;IGHV1-69;IGHV9-1;IGHV1-76;IGHV1-50;IGHV1-81;IGHV1-12;IGHV1-82;IGHV1-85
## 2                                                                                                                                                                               IGHV8-8;IGHV1-9;IGHV1-26;IGHV5-17;IGHV5-16;IGHV1-54;IGHV1-72;IGHV1-64;IGHV5-9-1;IGHV1-71;IGHV1-19;IGHV1-18;IGHV1-59;IGHV8-5;IGHV3-6;IGHV1-52;IGHV1-80;IGHV1-74;IGHV1-39;IGHV10-1;IGHV1-62-2;IGHV9-3;IGHV2-5;IGHV1-53;IGHV1-69;IGHV1-76;IGHV1-81;IGHV1-12;IGHV1-66;IGHV14-2;IGHV1-4;IGHV1-82;IGHV1-85
## 3                                                                                                                                   IGHV8-8;IGHV1-9;IGHV1-26;IGHV14-1;IGHV5-17;IGHV5-16;IGHV1-54;IGHV1-72;IGHV1-64;IGHV5-9-1;IGHV2-3;IGHV1-71;IGHV1-19;IGHV1-18;IGHV1-75;IGHV1-59;IGHV4-1;IGHV3-6;IGHV1-52;IGHV1-80;IGHV1-74;IGHV1-39;IGHV1-22;IGHV10-1;IGHV1-62-2;IGHV9-3;IGHV2-5;IGHV1-53;IGHV1-69;IGHV9-1;IGHV1-76;IGHV1-50;IGHV1-81;IGHV1-12;IGHV1-66;IGHV14-2;IGHV1-82;IGHV1-85
## 4                                                                                                                   IGHV5-9-1;IGHV5-16;IGHV9-3;IGHV1-12;IGHV1-39;IGHV14-3;IGHV1-9;IGHV1-82;IGHV1-26;IGHV10-3;IGHV7-3;IGHV1-19;IGHV1-74;IGHV1-76;IGHV3-6;IGHV1-53;IGHV1-69;IGHV1-85;IGHV1-81;IGHV1-18;IGHV1-54;IGHV8-8;IGHV1-15;IGHV2-9-1;IGHV1-64;IGHV1-52;IGHV5-12;IGHV10-1;IGHV1-78;IGHV5-4;IGHV6-3;IGHV6-6;IGHV5-17;IGHV1-72;IGHV1-55;IGHV1-80;IGHV8-5;IGHV5-9;IGHV2-2;IGHV1-62-2
## 5                                                     IGHV5-9-1;IGHV5-16;IGHV9-3;IGHV1-50;IGHV1-12;IGHV1-39;IGHV14-3;IGHV1-2;IGHV8-12;IGHV1-9;IGHV1-82;IGHV1-58;IGHV1-36;IGHV1-26;IGHV10-3;IGHV7-3;IGHV1-19;IGHV1-74;IGHV1-76;IGHV3-6;IGHV1-53;IGHV1-69;IGHV1-85;IGHV1-81;IGHV1-18;IGHV1-54;IGHV8-8;IGHV1-15;IGHV2-9-1;IGHV1-22;IGHV1-64;IGHV1-52;IGHV1-42;IGHV5-12;IGHV10-1;IGHV1-78;IGHV5-4;IGHV6-3;IGHV6-6;IGHV5-17;IGHV9-1;IGHV1-72;IGHV1-55;IGHV1-80;IGHV2-2;IGHV2-6;IGHV1-62-2
## 6 IGHV1-55;IGHV1-9;IGHV1-71;IGHV1-64;IGHV1-53;IGHV1-82;IGHV7-1;IGHV1-18;IGHV5-4;IGHV1-52;IGHV1-26;IGHV1-7;IGHV5-9-1;IGHV2-9-1;IGHV5-12;IGHV13-2;IGHV1-72;IGHV1-69;IGHV1-19;IGHV10-3;IGHV1-81;IGHV1-80;IGHV6-3;IGHV1-5;IGHV1-15;IGHV6-6;IGHV1-84;IGHV5-16;IGHV9-3;IGHV1-78;IGHV1-66;IGHV2-5;IGHV3-6;IGHV1-12;IGHV1-39;IGHV7-3;IGHV1-62-2;IGHV1-74;IGHV10-1;IGHV1-59;IGHV8-8;IGHV1-77;IGHV15-2;IGHV14-3;IGHV14-2;IGHV5-17;IGHV1-61;IGHV3-1;IGHV1-54;IGHV2-2;IGHV1-63;IGHV1-76;IGHV1-85
##   overlap_lab
## 1          30
## 2          33
## 3          38
## 4          40
## 5          47
## 6          53

#overlap of clones
clone_overlap <- VDJ_overlap_heatmap(VDJ = VGM[[1]],feature.columns = c("VDJ_cdr3s_aa","VJ_cdr3s_aa") ,grouping.column = "sample_id", axis.label.size = 20, pvalues.label.size = 12, add.barcode.table = T, plot.type = "ggplot") 
print(clone_overlap[[1]])
#Pheatmap plots will function only with length(unique(grouping.column)) > 4

Given the small size of the two repertoires, it is unsurprising that no clones are shared

6 Integrating repertoire and gene expression

The strength of the current 5’ sequencing protocols are that the gene expression (GEX) and repertoire (VDJ) libraries are extracted from the same sample, which can then be linked back to demonstrate that a given T cell has a certain gene expression pattern and also a certain T cell receptor sequence. The following functions are meant to integrate these two pieces of information.

6.1 Integrating transcriptional clusters to the VDJ objects

One thing we may ask is how similar the B or T cells in a given clonal family are on the transcriptional level. The basic work of matching barcodes from V(D)J sequencing to those from GEX sequencing was already done in the VGM function. This part is therefore dedicated to functions that help explore and exploite this data

First we get basic stats on the overlap between GEX and VDJ. In this dataset, GEX data contains both B and T cells while VDJ data contains only BCR sequences.

#Nr of cells for which VDJ info is available
nrow(VGM[[1]])

## [1] 945

#Nr of cells for which GEX info is available
ncol(VGM[[2]])

## [1] 2064

#VDJ sequences for which GEX is available
sum(VGM[[2]]$VDJ_available)

## [1] 303

#We can also plot this 
Seurat::DimPlot(VGM[[2]],reduction = "umap", group.by = "VDJ_available", shuffle = T)

6.2 Visualizing clones on the 2 dimensional landscape

To better visualize this, the cells of any clone can also be highlighted on a dimensional reduction

Because the legend of these plots can get quite large, we can split the plot and the legend and draw them separately. For this the gridExtra and cowplot package is required

#here we overlay the top 5 clones
plot_out <- VDJ_GEX_overlay_clones(GEX = VGM[[2]], reduction = "umap", n.clones = 5, by.sample = F, ncol.facet = 1, split.plot.and.legend = F, pt.size = 0.5)

## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.

plot_out[[1]] # the plot

#We can also plot any clone of interest by adding a column were TRUE values select which clones are to be plotted
interesting_clones <- c("clonotype7", "clonotype42")
VGM[[2]]@meta.data$clones_to_plot <- FALSE
VGM[[2]]@meta.data$clones_to_plot[which(VGM[[2]]$clonotype_id_10x %in% interesting_clones)] <- TRUE
plot_out <- VDJ_GEX_overlay_clones(GEX = VGM[[2]], reduction = "umap", clones.to.plot = "clones_to_plot", by.sample = F, ncol.facet = 1, split.plot.and.legend = T, pt.size = 0.5)

## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.

plot_out[[1]] # the plot
plot(plot_out[[2]]) # the legend. Alternatively use gridExtra::grid.arrange(plot_out[[2]])

6.3 Specific gene expression information on the clonal level

We previously integrated the GEX information into the format of the VDJ output. However, we may want to ask how the gene expression looks for certain clonotypes (e.g., how many of the cells in the top clone are expressing markers of activated B cells). This is done by adding an extra column to the VDJ and GEX objects containing binary information on expansion.

VGM <- VGM_expanded_clones(VGM, add.to.VDJ = T, add.to.GEX = T, expansion.threshold = 2)

## Added new columns to VDJ: expanded_clonotype_frequency

## Added new columns to GEX: expanded_clonotype_frequency

VGM[[2]] <- SeuratObject::JoinLayers(VGM[[2]])

For a more unbiased view, DEGs can be calculated between expanded and non-expanded clones

DE_byexpansion <- GEX_DEgenes(GEX = VGM[[2]], min.pct = 0.25, group1 = "TRUE", group2 = "FALSE", grouping.column = "expanded_clonotype_frequency", return.plot = "volcano", logFC = F, label.n.top.genes = 5)
head(DE_byexpansion[[1]])

##                   p_val avg_logFC pct.1 pct.2   p_val_adj     SYMBOL
## Zc3h15     1.501698e-07 -6.650521 0.000 0.309 0.004757529     Zc3h15
## Ccr7       3.843845e-07 -1.552368 0.129 0.515 0.012177684       Ccr7
## Tspan13    8.735331e-07 -1.692454 0.071 0.412 0.027674401    Tspan13
## Pde4b      1.116669e-06 -1.349421 0.143 0.528 0.035377178      Pde4b
## AC149090.1 1.681616e-06 -6.815884 0.000 0.266 0.053275287 AC149090.1
## Gimap6     1.761742e-06 -1.094514 0.200 0.601 0.055813735     Gimap6
##            signif_pval signif_logfc color_threshold
## Zc3h15            TRUE         TRUE            TRUE
## Ccr7             FALSE         TRUE           FALSE
## Tspan13          FALSE         TRUE           FALSE
## Pde4b            FALSE         TRUE           FALSE
## AC149090.1       FALSE         TRUE           FALSE
## Gimap6           FALSE         TRUE           FALSE

DE_byexpansion[[2]]

9 Version information

sessionInfo()

## R version 4.3.1 (2023-06-16)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.0
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## time zone: Europe/Zurich
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] Platypus_3.5.0     Seurat_5.0.1       SeuratObject_5.0.1 sp_2.1-1          
##  [5] lubridate_1.9.2    forcats_1.0.0      stringr_1.5.0      purrr_1.0.2       
##  [9] readr_2.1.4        tidyr_1.3.0        tibble_3.2.1       ggplot2_3.4.4     
## [13] tidyverse_2.0.0    dplyr_1.1.3       
## 
## loaded via a namespace (and not attached):
##   [1] RcppAnnoy_0.0.21        splines_4.3.1           later_1.3.1            
##   [4] bitops_1.0-7            ggplotify_0.1.2         polyclip_1.10-6        
##   [7] fastDummies_1.7.3       lifecycle_1.0.3         doParallel_1.0.17      
##  [10] globals_0.16.2          lattice_0.21-8          MASS_7.3-60            
##  [13] magrittr_2.0.3          limma_3.56.2            plotly_4.10.2          
##  [16] sass_0.4.7              rmarkdown_2.25          jquerylib_0.1.4        
##  [19] yaml_2.3.7              useful_1.2.6            httpuv_1.6.11          
##  [22] sctransform_0.4.1       spam_2.10-0             spatstat.sparse_3.0-2  
##  [25] reticulate_1.34.0       cowplot_1.1.1           pbapply_1.7-2          
##  [28] RColorBrewer_1.1-3      zlibbioc_1.46.0         abind_1.4-5            
##  [31] Rtsne_0.16              RCurl_1.98-1.12         BiocGenerics_0.46.0    
##  [34] yulab.utils_0.1.0       GenomeInfoDbData_1.2.10 circlize_0.4.15        
##  [37] IRanges_2.34.1          S4Vectors_0.38.2        ggrepel_0.9.4          
##  [40] irlba_2.3.5.1           listenv_0.9.0           spatstat.utils_3.0-3   
##  [43] tidytree_0.4.5          pheatmap_1.0.12         vegan_2.6-4            
##  [46] goftest_1.2-3           RSpectra_0.16-1         spatstat.random_3.1-6  
##  [49] fitdistrplus_1.1-11     parallelly_1.36.0       permute_0.9-7          
##  [52] leiden_0.4.3            codetools_0.2-19        tidyselect_1.2.0       
##  [55] shape_1.4.6             ggseqlogo_0.1           aplot_0.2.2            
##  [58] farver_2.1.1            stats4_4.3.1            matrixStats_1.0.0      
##  [61] spatstat.explore_3.2-3  jsonlite_1.8.7          ellipsis_0.3.2         
##  [64] progressr_0.14.0        ggridges_0.5.4          survival_3.5-5         
##  [67] iterators_1.0.14        foreach_1.5.2           tools_4.3.1            
##  [70] treeio_1.24.3           stringdist_0.9.10       ica_1.0-3              
##  [73] Rcpp_1.0.11             glue_1.6.2              gridExtra_2.3          
##  [76] xfun_0.40               mgcv_1.9-0              GenomeInfoDb_1.36.4    
##  [79] withr_2.5.1             fastmap_1.1.1           fansi_1.0.5            
##  [82] digest_0.6.33           timechange_0.2.0        R6_2.5.1               
##  [85] mime_0.12               gridGraphics_0.5-1      colorspace_2.1-0       
##  [88] scattermore_1.2         tensor_1.5              spatstat.data_3.0-1    
##  [91] utf8_1.2.3              generics_0.1.3          data.table_1.14.8      
##  [94] httr_1.4.7              htmlwidgets_1.6.2       uwot_0.1.16            
##  [97] pkgconfig_2.0.3         gtable_0.3.4            lmtest_0.9-40          
## [100] XVector_0.40.0          htmltools_0.5.6.1       dotCall64_1.1-0        
## [103] scales_1.2.1            png_0.1-8               ggfun_0.1.3            
## [106] knitr_1.44              rstudioapi_0.15.0       tzdb_0.4.0             
## [109] reshape2_1.4.4          nlme_3.1-162            cachem_1.0.8           
## [112] zoo_1.8-12              GlobalOptions_0.1.2     KernSmooth_2.23-22     
## [115] parallel_4.3.1          miniUI_0.1.1.1          pillar_1.9.0           
## [118] grid_4.3.1              vctrs_0.6.4             RANN_2.6.1             
## [121] promises_1.2.1          xtable_1.8-4            cluster_2.1.4          
## [124] evaluate_0.22           cli_3.6.1               compiler_4.3.1         
## [127] rlang_1.1.1             crayon_1.5.2            future.apply_1.11.0    
## [130] labeling_0.4.3          plyr_1.8.9              fs_1.6.3               
## [133] stringi_1.7.12          viridisLite_0.4.2       deldir_1.0-9           
## [136] munsell_0.5.0           Biostrings_2.68.1       lazyeval_0.2.2         
## [139] spatstat.geom_3.2-5     Matrix_1.6-3            RcppHNSW_0.4.1         
## [142] hms_1.1.3               patchwork_1.1.3         future_1.33.0          
## [145] shiny_1.7.5.1           ROCR_1.0-11             igraph_1.5.1           
## [148] memoise_2.0.1           bslib_0.5.1             ggtree_3.8.2           
## [151] ape_5.7-1