R/PlatypusML_feature_extraction_VDJ.R
PlatypusML_feature_extraction_VDJ.Rd
PlatypusML_feature_extraction_VDJ takes the specified features from the first output of the VDJ_GEX_matrix function and encodes them according to the specified strategy. It returns a matrix containing the encoded features as columns, in the order specified in the input, and the cells as rows. Call this function as the first step when modeling VGM data with machine learning.
PlatypusML_feature_extraction_VDJ(
VGM,
which.features,
which.encoding,
encoding.level,
unique.sequence,
parameters.encoding.nt,
parameters.encoding.aa,
which.label,
problem,
verbose.classes,
platypus.version
)
VGM output of the VDJ_GEX_matrix function
String vector. The columns of the VDJ input that the function should encode.
String vector of size 2. Defaults to 'onehot'. Specifies the encoding strategy for the two types of sequences: the first entry is the encoding used for nucleotide sequences and the second entry the encoding used for amino acid sequences. If one type of sequence is not among the Other possible values for amino acid sequences are 'kmer', 'blosum', 'dc', 'tc' or 'topoPCA', and for nucleotide sequences 'kmer'.
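As an illustration of what a 'onehot' encoding of sequences produces, here is a minimal base R sketch. It is not the package's implementation: the helper onehot_encode, the 20-letter alphabet, and the zero-padding to the longest sequence are assumptions made for this example.

```r
# Hypothetical helper (not part of Platypus): one-hot encode amino acid
# sequences, zero-padding shorter sequences to the length of the longest one.
aa_alphabet <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]

onehot_encode <- function(seqs, alphabet = aa_alphabet) {
  max_len <- max(nchar(seqs))
  encoded <- t(vapply(seqs, function(s) {
    chars <- strsplit(s, "")[[1]]
    m <- matrix(0L, nrow = max_len, ncol = length(alphabet))
    idx <- match(chars, alphabet)
    keep <- !is.na(idx)                       # skip letters outside the alphabet
    m[cbind(which(keep), idx[keep])] <- 1L    # set (position, letter) to 1
    as.vector(t(m))                           # flatten position by position
  }, integer(max_len * length(alphabet))))
  colnames(encoded) <- paste0(
    "pos", rep(seq_len(max_len), each = length(alphabet)),
    "_", rep(alphabet, times = max_len)
  )
  rownames(encoded) <- NULL
  encoded
}

cdr3 <- c("CARDY", "CAKW")   # toy sequences, invented for illustration
enc <- onehot_encode(cdr3)
dim(enc)                     # one row per sequence, 5 positions x 20 letters
```

Each row has exactly one 1 per occupied position, so the row sums equal the sequence lengths.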
String. Specifies the level at which the features are extracted. There are three possible options: "cell" (all available cells), "clone" (one unique sample per clone), "unique.sequence" (only unique sequences, based on the sequence named in the unique.sequence argument). Defaults to "cell".
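The three levels can be pictured as row filters on the VDJ table. The sketch below uses a toy data frame and a hypothetical helper select_level; the column names (barcode, clonotype_id, VDJ_cdr3s_aa) and the choice of keeping the first row per clone or sequence are assumptions for illustration, not the package's documented behavior.

```r
# Toy VDJ-like table; column names and values are invented for illustration.
vdj <- data.frame(
  barcode      = c("c1", "c2", "c3", "c4"),
  clonotype_id = c("clonotype1", "clonotype1", "clonotype2", "clonotype3"),
  VDJ_cdr3s_aa = c("CARDY", "CARDY", "CAKW", "CARDY"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of Platypus): pick rows per encoding level.
select_level <- function(df, level = "cell", unique.sequence = NULL) {
  switch(level,
    # every cell is kept
    cell = df,
    # first row of each clone is kept
    clone = df[!duplicated(df$clonotype_id), , drop = FALSE],
    # first row of each distinct value in the named sequence column is kept
    unique.sequence = df[!duplicated(df[[unique.sequence]]), , drop = FALSE],
    stop("unknown level: ", level)
  )
}

by_cell  <- select_level(vdj, "cell")
by_clone <- select_level(vdj, "clone")
by_seq   <- select_level(vdj, "unique.sequence", unique.sequence = "VDJ_cdr3s_aa")
```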
String. Needs to be specified only when encoding.level is set to "unique.sequence". The name of the sequence column on which the unique selection should be based.
List. Parameters to be used for nucleotide encoding, if the chosen encoding requires them. 'onehot' -> no parameters necessary, defaults to NULL; 'kmer' -> one parameter necessary to set the length of the subsequence, defaults to 3.
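To make the role of the 'kmer' length parameter concrete, the following base R sketch counts overlapping subsequences of length k in nucleotide sequences. The helper kmer_counts is hypothetical and only illustrates the general technique; the package may compute or normalize its k-mer features differently.

```r
# Hypothetical helper (not part of Platypus): count overlapping k-mers.
kmer_counts <- function(seqs, k = 3, alphabet = c("A", "C", "G", "T")) {
  # Enumerate all possible k-mers over the alphabet (4^k for nucleotides).
  grid  <- expand.grid(rep(list(alphabet), k), stringsAsFactors = FALSE)
  kmers <- sort(do.call(paste0, grid))
  counts <- t(vapply(seqs, function(s) {
    n <- nchar(s)
    if (n < k) return(integer(length(kmers)))   # too short: all zeros
    windows <- substring(s, 1:(n - k + 1), k:n) # sliding windows of width k
    as.integer(table(factor(windows, levels = kmers)))
  }, integer(length(kmers))))
  dimnames(counts) <- list(NULL, kmers)
  counts
}

nt <- c("ATGATG", "CCC")     # toy sequences, invented for illustration
km <- kmer_counts(nt, k = 3)
dim(km)                      # one row per sequence, 4^3 columns
```

With k = 3 every sequence is mapped to a vector of fixed length 64, regardless of the sequence length.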
List. Parameters to be used for amino acid encoding, if the chosen encoding requires them. 'onehot', 'dc', 'tc' -> no parameters necessary, defaults to NULL; 'kmer' -> one parameter necessary to set the length of the subsequence, defaults to 3; 'blosum' -> two parameters necessary: k (the number of selected scales, i.e. the first k scales, derived from the substitution matrix; it can be chosen according to the printed relative importance values) and lag (the lag parameter; must be less than the number of amino acids in the sequence), defaulting to (5, 7); 'topoPCA' -> three parameters necessary: index (integer vector specifying which molecular descriptors to select from the topological descriptors), pc (integer, the number of principal components; must be no greater than the number of amino acid properties provided) and lag (the lag parameter; must be less than the number of amino acids in the sequence), defaulting to (c(1:78), 5, 7).
String. The name of the column in VDJ which will be used as a label in a chosen model later. If missing, no label will be appended to the encoded features.
String ("classification" or "regression"). Whether the return matrix will be used in a classification problem or a regression one. Defaults to "classification".
Boolean. Whether to display information on the distribution of samples between classes. Defaults to TRUE. For this parameter to have an effect, problem must be set to "classification" (default).
This function works with "v3" only, so there is no need to set this parameter.
A dataframe containing the encoded features and their label, each row corresponding to a different cell. The encodings are ordered as they were entered in the 'which.features' parameter. The label is in the last column of the returned dataframe.
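A call might look like the sketch below. It assumes that the Platypus package is installed and exports this function, that an object named vgm holding the output of VDJ_GEX_matrix exists in the session, and that the VDJ table has columns named "VDJ_cdr3s_nt", "VDJ_cdr3s_aa" and "sample_id"; all of these names are assumptions for illustration. The call is skipped when those assumptions do not hold.

```r
# Arguments for the call; column names are assumed, adjust to your VDJ table.
args <- list(
  which.features  = c("VDJ_cdr3s_aa", "VDJ_cdr3s_nt"),
  which.encoding  = c("kmer", "onehot"),  # first entry: nucleotide, second: amino acid
  encoding.level  = "cell",
  which.label     = "sample_id",
  problem         = "classification",
  verbose.classes = TRUE
)

# Only run when Platypus is available and a VGM object named `vgm` exists.
if (requireNamespace("Platypus", quietly = TRUE) && exists("vgm")) {
  features <- do.call(
    Platypus::PlatypusML_feature_extraction_VDJ,
    c(list(VGM = vgm), args)
  )
  dim(features)  # rows: cells; columns: encoded features, then the label
}
```

The encoding parameters are left at their defaults here, so the 'kmer' subsequence length falls back to 3.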