Function to infer immune receptor evolutionary networks (trees) or complex/sequence-similarity networks from a Platypus VDJ object/VGM[[1]] - AntibodyForests objects will be created for each sample and each unique clonotype, if the default parameters are used. Networks can be created in a tree-building manner (minimum spanning tree algorithm with custom tie solving methods), by linking sequences with a minimal string distance between them iteratively (and solving distance ties in a hierarchical way, with multiple resolve.ties parameters/configurations). Nodes in the network represent unique sequences per clonotype, edges are determined by the string distance between the nodes/sequences. Sequence types are dictated by the sequence.tyoe parameter, allowing networks to be built for most of the sequence types available in a Platypus VDJ object. Networks can also be created by pruning edges from a fully connected network obtained from all sequences from a specific clonotype - complex similarity networks. Pruning can either be done via a distance threshold (prunes nodes too far apart), a node degree threshold (to prune nodes with a smaller degree/not well connected), or an expansion threshold (to prune nodes for sequences with low expansion/frequency). Lastly, networks can be created by converting a phylogenetic tree inferred via different methods (neighbour-joining, maximum likelihood, maximum parsimony) into an igraph object,

AntibodyForests(
  VDJ,
  sequence.type,
  include.germline,
  network.algorithm,
  directed,
  distance.calculation,
  resolve.ties,
  connect.germline.to,
  pruning.threshold,
  remove.singletons,
  keep.largest.cc,
  VDJ.VJ.1chain,
  node.features,
  filter.na.features,
  filter.specific.features,
  node.limits,
  cell.limits,
  weighted.edges,
  weighted.germline,
  expand.intermediates,
  specific.networks,
  network.level,
  forest.method,
  random.seed,
  parallel,
  as.igraph
)

Arguments

VDJ

VDJ or vgm[[1]] object, as obtained from the VDJ_GEX_matrix function in Platypus.

sequence.type

string denoting the sequence types to create the networks for. 'cdr3.aa' - networks for amino-acid CDR3 sequences, 'cdr3.nt' - networks for nucleotide CDR3 sequences, 'VDJ.VJ.nt.trimmed' - full, trimmed VDJ-VJ sequences, as obtained when setting trin.and.align = T for VDJ_GEX_matrix(), 'VDJ.VJ.nt.raw' - full, raw VDJ-VJ sequences, 'VDJ.VJ.aa.mixcr' and 'VDJ.VJ.nt.mixcr' for the VDJ and VJ chains (nt or aa) as inferred by MIXCR, 'VDJ.aa.mixcr' and 'VDJ.nt.mixcr' for the VDJ chain inferred by MIXCR, 'VJ.aa.mixcr' and 'VJ.nt.mixcr' for the VJ chain inferred by MIXCR, 'VDJ.nt.trimmed' for the trimmed VDJ chain as nucleotides, 'VDJ.nt.raw' for the untrimmed VDJ chain as nucleotides, similarly for the VJ chain ('VJ.nt.trimmed' and 'VJ.nt.raw'), 'VDJ.cdr3s.aa' for the CDRH3 region as amino-acids, VDJ.cdr3s.nt' for the CDRH3 region as nucleotides, similarly for the CDRL3 regions ('VJ.cdr3s.aa', 'VJ.cdr3s.nt'), 'VDJ.aa' and 'VJ.aa' for the full VDJ/VJ sequence as amino-acids. Defaults to 'VDJ.VJ.nt.trimmed'.

include.germline

string or vector of strings, denoting the germline column(s) to be used (in the c('VDJ_germline', 'VJ_germline') order). 'trimmed.ref' - the networks will include a germline node, obtained by pasting the VDJ_trimmed_ref and VJ_trimmed_ref sequences for each clonotype, obtained by calling VDJ_call_MIXCR on VDJ. As reconstructed germlines as usually available for full VDJ.VJ.nt sequences, use this with sequence.type=VDJ.VJ.nt.trimmed. NULL will not include a germline.

network.algorithm

string denoting the algorithm used in constructing the networks. 'tree' - will use a tree evolutionary inference algorithm: nodes denoting unique sequences per clonotype will be linked iteratively, as long as their string distance is the minimum. Use the resolve.ties parameter to further dictate the tree topology (when there are multiple ties in the minimum links). 'prune' will create networks by pruning nodes from a fully connected networks. It must always be followed by a pruning method. For example, 'prune.distance' will prune nodes with a larger string distance than the threshold specified in the pruning.threshold parameter. 'prune.degree' will remove nodes with a lower degree than the threshold specified in pruning.threshold. 'prune.expansion' will remove nodes with a lower sequence frequency/expansion than the threshold specified in pruning.threshold. Multiple pruning methods can be used simultaneously, as long as multiple pruning.threshold is a vector - a threshold for each method. For example, 'prune.distance.degree' with pruning.threshold=c(3,2) will remove edges of nodes with a string distance greater than 3, then will remove nodes with a degree smaller than 2. 'prune.distance.degree.expansion' with pruning.threshold=c(3,2,1) will also remove one-of sequences/nodes (with a single cell). 'phylogenetic.tree.nj' will create phylogenetic (binary) trees using the neighbour-joining algorithm via ape::nj(). 'phylogenetic.tree.ml' will create phylogenetic (binary) trees using a maximum-likelihood algorithm from the phangorn package (phangorn::ml()). 'phylogenetic.tree.mp' will create phylogenetic (binary) trees using a maximum-parsimony algorithm from the phangorn package (phangorn::mp()). 'mst' will create undirected trees using the minimum spanning tree algorithm from igraph (without specific tie solving mechanisms). 'global' is a custom option to easily create whole-repertoire/multi-repertoire similarity networks: it defaults to the 'prune.distance' option, while also changing some other parameters to ensure consistency (directed is set to F, include.germline is set to F, network.level is set to 'global')

directed

boolean, whether networks obtained using network.algorithm='tree' should be directed (from the germline to the leaf nodes) or not. T - directed; F - undirected trees.

distance.calculation

string or function, specifying the method for calculating string distances between the sequences. Must be compatible with the method parameter of the stringdist::stringdistmatrix() function. Will default to 'lv' for Levenshtein distances. Else, if a function of the form distance(seq1, seq2) must be specified, which should output a float = custom distance metric between the selected sequences.

resolve.ties

vector of strings, denoting the manner in which ties should be resolved when assembling trees via network.algorithm='tree'. Ties are defined as edges with the same weights=string distances (as determined by the distance matrix for the fully connected network) between nodes already added in the tree and nodes to be added in the tree. There are multiple default and custom configurations for this parameter: 'first' will pick the first edge from a pool of edges of equal string distance (between the sequences) - these are ordered based on each node's expansion/cell count (therefore 'first' will try to add the most expanded node first); 'random' - resolve ties by picking random tied edges; 'close.germline.edges' and 'far.germline.edges' - will prefer the nodes closer or farther to/from the germline, as a number of edges (unweighted) to be next integrated into the network; 'close.germline.distance' and 'far.germline.distance' - picks nodes closer/farther to/from the germline, determined by the string distance; 'close.germline.weighted' and 'far.germline.weighted' - picks edges with nodes closer/farther to/from the germline, as a weighted path from the germline to the most recent integrated node; 'min.expansion' and 'max.expansion' - will pick the most/least expanded sequences; 'min.degree' and 'max.degree' - picks the nodes with the minimum or maximum degree - a distance threshold must be specified in the pruning.threshold parameter, otherwise all nodes will have the same degree; An additional custom configuration can be used: either min/max/specific feature value, tied to a specific feature column as defined in the node.features parameter, using '-'. For example, 'yes-OVA_binder' will select nodes that are OVA binders when resolving ties; for min and max, the node.feature column should be of numeric class. If a vector is provided, ties will be resolved in a hierarchical manner: for example, if resolve.ties=c('max.expansion', 'close.germline.distance'), it will first try to pick nodes with a max expansion that were not added in the network (with edge ties to those already added), then those closer to the germline (minimum string distance). As these two options do not always fully converge, meaning that there could be also expansion ties and distance ties between the nodes to be added, not just edge ties, a 'first' options is always added at the end of the hierarchical tie resolving algorithm, which always converges/picks a specific edge and resolves a tie. Moreover, the 'first' option is also added when only selecting a single option (still in the form of a vector - for e.g., c('min.expansion') turns automatically into c('min.expansion', 'first')).

connect.germline.to

string defining how the germline should be connected for both the pruning and tree building algorithms. When network.algorithm='tree', two options are available: 'min.adjacent' - will first connect the nodes with the min string distance from the germline, then continue adding nodes and building the tree, and 'threshold.adjacent' - will connect nodes with a string distance value lower than the threshold defined in pruning.threshold. As the pruning algorithm starts by pruning all connections out of the specified boundaries, irrespective if they are germline ones or not, the germline needs to be added at the end if it is removed. If not, then this option is ignored. Thus, there are additional options for including the germline when building a network via the pruning algorithm: 'largest.connected.component' - connects the germline to the largest resulting connected component(s), 'all.connected.components' - to all connected components, 'all.connected.components.non.single' - does not connect the germline to single-node components; 'none' - germline is not connected; 'min.adjacent' - connects to the node(s) with the minimum string distance.

pruning.threshold

vector of max size=3, specifying the thresholds for the pruning algorithm when network.algorithm includes 'prune' (as seen. 'prune' can be followed by either 'expansion', 'degree' or 'distance', or a combination of them - 'prune.distance.degree'). If we have 'prune.degree', we need to first specify a distance threshold (as 'prune.degree'='prune.distance.degree', otherwise the degree is the same for all nodes in a fully connected network). See also network.algorithm. For a direct use, if network.algorithms is set to prune.distance, set pruning.threshold to a single integer denoting the distance between two nodes for which edges will be pruned (equal or more).

remove.singletons

integer - in the case of the pruning network algorithm or 'global' network algorithm, it denotes the minimum connected component node number threshold (for e.g., if remove.singletons = 3, it will remove all nodes from a graph that form a 3-node compnent or less: 2-node and singletons). If NULL, will keep all components (including singletons) in the complex similarity graph.

keep.largest.cc

boolean - if T, will only keep the largest connected component in the similarity network (pruning network algorithm or 'global' option for network.algorithm). If F, will keep all components (including singletons unless removed via remove.singletons).

VDJ.VJ.1chain

boolean, T - excludes cells with an aberrant number of VDJ or VJ chains; F - will be kept in for network inference.

node.features

string or vector of strings, denoting the column name(s) in the original VDJ/VGM[[1]] from which the node features to be extracted. This is done by first pooling the cell cell_barcodes per unique sequence in the VDJ (for each clonotype), then adding the features of those cells.

filter.na.features

string or vector of strings, denoting the same column name(s) as specified in node.features. This will remove netowrks were ALL nodes values for the specific feature are equal to NA.

filter.specific.features

list with two elements - first one denotes the column/feature which you wish to filter on, second denotes the specific feature which HAS TO BE INCLUDED IN THE NETWORK (e.g., list('OVA_binder', 'yes') will result in networks that have at least 1 binder for OVA).

node.limits

list of integers or NULLs. node.limits[[1]] determines the least amount of unique sequences to create a network for, otherwise a network will not be created. If node.limits[[1]] is NULL, then there is no lower bound for the number of sequences in each network. node.limits[[2]] defines the upper bound for the number of sequences - networks with more sequences will have the extra ones removed, keeping only the most abundant sequences/largest sequence frequency.

cell.limits

list of integers or NULLs. cell.limits[[1]] the minimum threshold of cells which should produce a unique sequence (sequences below this threshold are removed from the network). If node.limits[[1]] is NULL, then there is no lower bound for the number of cells per unique sequence. node.limits[[2]] defines the upper bound for the number of cells per sequence - sequence frequency.

weighted.edges

boolean, T - edge weights will be equal to the string distance between a pair of nodes; F - edge weights = 1

weighted.germline

boolean, T - adds weights to the edges connected to the germline, equal to the string distance between the germline and the specific connected nodes; F - edge weights = 1.

expand.intermediates

boolean. T - will add inferred, intermediate nodes between nodes in the original network, determined by the string distance between a pair of nodes (for e.g., 2 nodes with an edge=string distance matrix of 3 will result in in 5 total nodes - 3 inferred nodes and 2 original ones, edges=1)

specific.networks

#either an integer of max sorted clonotypes to be picked for network inference, 'all' for all clonotypes to be used, or list of specific clonotypes to create networks for.

network.level

string determining the level at which networks should be built - 'intraclonal' will create intraclonal networks = networks for each sample and for each clonotype in the VDJ; 'global.clonotype' will create networks for each unique clonotype, irrespective of sample ids; 'global' will pool all clonotypes from all samples into a single global network; 'forest.per.sample' and 'forest.global' are tree-specific methods, used when obtaining networks via network.algorithm='tree': 'forest.per.sample' will join the intraclonal trees in each sample, 'forest.global' will join ALL trees. Joining is determined by the forest.method parameter.

forest.method

string determining how the trees should be joined if network.level='forest.per.sample' or 'forest.global'. 'single.germline' - trees will all be joined at a single germline (the one from the first clonotype), recalculating the string distances for the new adjacent nodes; 'multiple.germlines' - trees will all be joined in the same network, keeping the original germlines; 'multiple.germlines.joined' - same as 'multiple.germlines', but new edges will be added between the germlines.

random.seed

numeric, seed for the random tie resolving method of resolve.ties.

parallel

boolean with T - a parallelized mclapply will be used for each internal function, to accelerate computation; F - normal lapply will be used. Used best when having a large number of networks/clonotypes per sample.

as.igraph

boolean - if T, the resulting networks will be igraph objects. Otherwise, they are converted to tidygraph tibble objects.

Value

nested list of AntibodyForests objects for each sample and each clonotype. For example, output[[1]][[2]] denotes the AntibodyForests object of the first sample, second clonotype. If only a single clonotype and sample are available in the VDJ (or if the networks are joined via network.level = 'forest.global'), will output a single AntibodyForests object.

Examples

if (FALSE) {
AntibodyForests(VDJ, sequence.type='VDJ.VJ.nt.trimmed',
include.germline=T, network.algorithm='tree',
resolve.ties=c('close.germline.distance', 'max.expansion'),
node.features='OVA_binder', expand.intermediates=T, network.level='intraclonal')
}