The PlatypusML_feature_extraction_VDJ function takes the specified features from the first output of the VDJ_GEX_matrix function and encodes them according to the specified strategy. It returns the encoded features as columns, in the order given in the input, with the different cells as rows. Call this function as the first step when modeling VGM data with machine learning.

PlatypusML_feature_extraction_VDJ(
  VGM,
  which.features,
  which.encoding,
  encoding.level,
  unique.sequence,
  parameters.encoding.nt,
  parameters.encoding.aa,
  which.label,
  problem,
  verbose.classes,
  platypus.version
)

Arguments

VGM

VGM output of the VDJ_GEX_matrix function

which.features

String vector. Names of the columns of the VDJ input that the function should encode.

which.encoding

String vector of size 2. Defaults to 'onehot'. Specifies the encoding strategy for the two types of sequences: the first entry applies to nucleotide sequences and the second to amino acid sequences. If one type of sequence is not among the ... Other possible values are 'kmer' for nucleotide sequences and 'kmer', 'blosum', 'dc', 'tc' or 'topoPCA' for amino acid sequences.
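To make the default 'onehot' strategy concrete, the following is a minimal base R sketch of one-hot encoding for amino acid sequences. It is not the package's implementation: the helper `onehot_encode`, the 20-letter alphabet, and the zero-padding to a fixed `max_len` are assumptions made for illustration only.

```r
# Hypothetical helper (not part of Platypus): one-hot encode one sequence.
# Assumption: sequences shorter than max_len are padded with all-zero positions.
aa_alphabet <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]

onehot_encode <- function(s, max_len, alphabet = aa_alphabet) {
  residues <- strsplit(s, "")[[1]]
  # One row per position, one column per letter of the alphabet
  mat <- matrix(0L, nrow = max_len, ncol = length(alphabet),
                dimnames = list(NULL, alphabet))
  n <- min(length(residues), max_len)
  for (i in seq_len(n)) {
    j <- match(residues[i], alphabet)
    if (!is.na(j)) mat[i, j] <- 1L
  }
  # Flatten position by position so each sequence becomes one feature row
  as.vector(t(mat))
}

cdr3 <- c("CARDY", "CAR")
encoded <- t(vapply(cdr3, onehot_encode, integer(5 * 20), max_len = 5))
dim(encoded)      # 2 sequences x (5 positions * 20 letters)
rowSums(encoded)  # number of residues encoded per sequence
```

Each row of `encoded` plays the role of one cell's encoded feature; how the package itself pads or orders the columns may differ.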

encoding.level

String. Specifies the level at which the features are extracted. There are three options: "cell" (all available cells), "clone" (one unique sample per clone) and "unique.sequence" (only unique sequences, based on the sequence named in the unique.sequence argument). Defaults to "cell".

unique.sequence

String. Needs to be specified only when encoding.level is set to "unique.sequence". The name of the sequence on which the unique selection is based.
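The three encoding.level options can be pictured as different row selections on the VDJ table. The sketch below uses a toy data frame; the column names, the helper `select_level`, and the choice to keep the first occurrence per clone or sequence are all assumptions, not the package's documented behavior.

```r
# Toy stand-in for the VDJ table; column names are illustrative assumptions
vdj <- data.frame(
  barcode      = c("c1", "c2", "c3", "c4", "c5"),
  clonotype_id = c("clonotype1", "clonotype1", "clonotype2",
                   "clonotype3", "clonotype3"),
  cdr3_aa      = c("CARDY", "CARDY", "CAKGG", "CARDY", "CTRSS"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pick rows according to the requested level.
# Assumption: the first row seen per clone / per sequence is kept.
select_level <- function(df, level = "cell", unique_col = NULL) {
  switch(level,
    cell = df,
    clone = df[!duplicated(df$clonotype_id), , drop = FALSE],
    unique.sequence = df[!duplicated(df[[unique_col]]), , drop = FALSE],
    stop("unknown level: ", level)
  )
}

nrow(select_level(vdj, "cell"))                        # every cell
nrow(select_level(vdj, "clone"))                       # one row per clonotype
nrow(select_level(vdj, "unique.sequence", "cdr3_aa"))  # one row per distinct CDR3
```

Note that "clone" and "unique.sequence" can select different rows: here the sequence CARDY appears in two clonotypes, so it is kept twice at the clone level but only once at the unique.sequence level.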

parameters.encoding.nt

List. Parameters to be used for encoding, if the chosen encoding requires it.
'onehot' -> no parameters necessary, defaults to NULL
'kmer' -> one parameter necessary to set the length of the subsequence, defaults to 3
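As an illustration of what the 'kmer' length parameter controls, here is a minimal base R sketch that counts overlapping subsequences of length k in a nucleotide sequence. The helper `kmer_counts` is hypothetical, and whether the package reports raw counts or normalized frequencies is not stated here, so raw counts are an assumption.

```r
# Hypothetical helper (not part of Platypus): count overlapping k-mers.
kmer_counts <- function(s, k = 3, alphabet = c("A", "C", "G", "T")) {
  # Enumerate all possible k-mers so every sequence maps to the same columns
  grid <- expand.grid(rep(list(alphabet), k), stringsAsFactors = FALSE)
  all_kmers <- sort(do.call(paste0, grid))
  n <- nchar(s)
  if (n < k) {
    observed <- character(0)
  } else {
    starts <- seq_len(n - k + 1)
    observed <- substring(s, starts, starts + k - 1)
  }
  # Tabulate against the full k-mer set; unseen k-mers get a count of 0
  counts <- table(factor(observed, levels = all_kmers))
  setNames(as.integer(counts), all_kmers)
}

counts <- kmer_counts("ACGTACG", k = 3)
length(counts)      # 4^3 possible 3-mers
counts[counts > 0]  # the 3-mers that actually occur
```

With k = 3 the feature vector has 64 entries per nucleotide sequence regardless of sequence length, which is why a larger k quickly increases the number of columns.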

parameters.encoding.aa

List. Parameters to be used for encoding, if the chosen encoding requires it.
'onehot', 'dc', 'tc' -> no parameters necessary, defaults to NULL
'kmer' -> one parameter necessary to set the length of the subsequence, defaults to 3
'blosum' -> two parameters necessary: k (the number of selected scales, i.e. the first k scales, derived from the substitution matrix; this can be chosen according to the printed relative importance values) and lag (the lag parameter; must be less than the number of amino acids in the sequence). They default to (5, 7).
'topoPCA' -> three parameters necessary: index (integer vector specifying which molecular descriptors to select from the topological descriptors), pc (integer; number of principal components; must be no greater than the number of amino acid properties provided) and lag (the lag parameter; must be less than the number of amino acids in the sequence). They default to (c(1:78), 5, 7).

which.label

String. The name of the column in VDJ which will be used as a label in a chosen model later. If missing, no label will be appended to the encoded features.

problem

String ("classification" or "regression"). Whether the return matrix will be used in a classification problem or a regression one. Defaults to "classification".

verbose.classes

Boolean. Whether to display information on the distribution of samples between classes. Defaults to TRUE. Only applies when problem is set to "classification" (the default).

platypus.version

This function works with "v3" only; there is no need to set this parameter.

Value

A dataframe containing the encoded features and their label, with each row corresponding to a different cell. The encodings are ordered as they were entered in the 'which.features' parameter. The label is in the last column of the returned dataframe.

Examples
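
A possible call, using the argument names from the usage section above, might look like the following sketch. The example dataset `Platypus::small_vgm` and the column name "VDJ_cdr3s_aa" are assumptions about what is available in an installed Platypus version; substitute your own VGM object and column names. The call is guarded and wrapped in try() so that it does nothing harmful if those assumptions do not hold.

```r
features <- NULL
if (requireNamespace("Platypus", quietly = TRUE)) {
  features <- try(
    Platypus::PlatypusML_feature_extraction_VDJ(
      VGM = Platypus::small_vgm,            # assumed example dataset
      which.features = c("VDJ_cdr3s_aa"),   # assumed VDJ column name
      which.encoding = c("onehot", "onehot"),
      encoding.level = "cell",
      parameters.encoding.nt = NULL,
      parameters.encoding.aa = NULL,
      problem = "classification",
      verbose.classes = FALSE,
      platypus.version = "v3"
    ),
    silent = TRUE
  )
}

# Inspect the result only if the call succeeded
if (!is.null(features) && !inherits(features, "try-error")) {
  print(dim(features))
}
```

which.label is left out here, so no label column should be appended to the encoded features.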