smiles_utils
learn_label_encoding(tokenized_inputs)
Learn a label encoding from a tokenized dataset. The padding token, "[PAD]", is always assigned the label 0.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tokenized_inputs | List[List[str]] | SMILES of the molecules in the dataset, each tokenized into a list of tokens. | required |
Returns:
Type | Description |
---|---|
Dict[str, int] | A dictionary mapping SMILES tokens to integer labels. |
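
The behavior described above can be illustrated with a minimal sketch. This is not the s4dd implementation: the function name `learn_label_encoding_sketch` is hypothetical, and sorting the tokens before numbering them is an assumption made here only so the result is deterministic. The one property taken from the description is that "[PAD]" maps to 0.

```python
from typing import Dict, List


def learn_label_encoding_sketch(tokenized_inputs: List[List[str]]) -> Dict[str, int]:
    """Hypothetical stand-in that mirrors the documented contract only."""
    # Collect every distinct token in the dataset, leaving out the padding token.
    unique_tokens = {tok for seq in tokenized_inputs for tok in seq if tok != "[PAD]"}
    # The padding token always receives label 0 (stated in the description).
    encoding = {"[PAD]": 0}
    # Assumption: remaining tokens are numbered from 1 in sorted order.
    for label, tok in enumerate(sorted(unique_tokens), start=1):
        encoding[tok] = label
    return encoding


token2label = learn_label_encoding_sketch(
    [["C", "C", "O"], ["c", "1", "c", "c", "c", "c", "c", "1"]]
)
```

The returned dictionary can then be used to turn each token list into a list of integers, for example `[token2label[tok] for tok in tokens]`.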
Source code in s4dd/smiles_utils.py
pad_sequences(sequences, padding_length, padding_value)
Pad sequences to a given length. Padding is appended to the end of each sequence. Sequences longer than padding_length are truncated from the beginning.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sequences | List[List[Union[str, int]]] | A list of sequences, either tokenized or label-encoded SMILES. | required |
padding_length | int | The length to pad the sequences to. | required |
padding_value | Union[str, int] | The value to pad the sequences with. | required |
Returns:
Type | Description |
---|---|
List[List[Union[str, int]]] | The padded sequences. |
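
A minimal sketch of the padding rule may help make the two cases concrete. The name `pad_sequences_sketch` is hypothetical and this is not the library source. It assumes that "truncated from the beginning" means the leading elements are dropped and the last `padding_length` elements are kept.

```python
from typing import List, Union

Token = Union[str, int]


def pad_sequences_sketch(
    sequences: List[List[Token]], padding_length: int, padding_value: Token
) -> List[List[Token]]:
    """Hypothetical stand-in that mirrors the documented contract only."""
    padded = []
    for seq in sequences:
        # Assumption: for long sequences, drop leading elements and keep the tail.
        # The guard avoids seq[-0:], which would return the whole sequence.
        trimmed = list(seq[-padding_length:]) if padding_length > 0 else []
        # Short sequences are filled at the end with the padding value.
        fill = [padding_value] * (padding_length - len(trimmed))
        padded.append(trimmed + fill)
    return padded


padded_labels = pad_sequences_sketch([[1, 2], [1, 2, 3, 4, 5]], 3, 0)
padded_tokens = pad_sequences_sketch([["C", "O"]], 4, "[PAD]")
```

Every returned sequence has exactly `padding_length` elements, so the result can be stacked into a rectangular array.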
Source code in s4dd/smiles_utils.py
segment_smiles(smiles, segment_sq_brackets=True)
Segment a SMILES string into tokens.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
smiles | str | A SMILES string. | required |
segment_sq_brackets | bool | Whether to segment the contents of square brackets into individual tokens. | True |
Returns:
Type | Description |
---|---|
List[str] | A list of tokens. |
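
To show what token-level segmentation can look like, here is a rough regex-based sketch. It is not the s4dd tokenizer: the function name and the pattern are hypothetical, and the pattern covers only a small subset of SMILES (two-letter halogens, `%NN` ring closures, and single characters). It also assumes that `segment_sq_brackets=True` splits a bracketed atom such as `[nH]` into its parts, while `False` keeps it as one token.

```python
import re
from typing import List

# Simplified, hypothetical patterns; a real SMILES tokenizer handles many more elements.
_BRACKET_ATOM = r"\[[^\]]+\]"   # a whole bracketed atom, e.g. [nH] or [O-]
_REST = r"Br|Cl|%\d{2}|."       # two-letter halogens, %NN ring closures, any single character


def segment_smiles_sketch(smiles: str, segment_sq_brackets: bool = True) -> List[str]:
    """Hypothetical stand-in illustrating the effect of the flag."""
    if segment_sq_brackets:
        # Assumption: brackets and their contents become separate tokens.
        pattern = rf"\[|\]|{_REST}"
    else:
        # Assumption: each bracketed atom stays together as a single token.
        pattern = rf"{_BRACKET_ATOM}|{_REST}"
    return re.findall(pattern, smiles)


split_tokens = segment_smiles_sketch("c1cc[nH]c1")
whole_tokens = segment_smiles_sketch("c1cc[nH]c1", segment_sq_brackets=False)
```

Under these assumptions, splitting the brackets yields a smaller vocabulary with longer sequences, while keeping them whole yields shorter sequences with one vocabulary entry per distinct bracketed atom.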
Source code in s4dd/smiles_utils.py
segment_smiles_batch(smiles_batch, segment_sq_brackets=True)
Segment a batch of SMILES strings into tokens.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
smiles_batch | List[str] | A batch of SMILES strings. | required |
segment_sq_brackets | bool | Whether to segment the contents of square brackets into individual tokens. | True |
Returns:
Type | Description |
---|---|
List[List[str]] | A list of lists of tokens. |
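
The shape of the batch output can be sketched as follows. The function name and the token pattern are hypothetical and much simpler than a real SMILES tokenizer. The point is only that the result holds one token list per input string, in input order, and that the lists generally differ in length.

```python
import re
from typing import List

# Simplified, hypothetical pattern: two-letter halogens, %NN ring closures, single characters.
_TOKEN = re.compile(r"Br|Cl|%\d{2}|.")


def segment_smiles_batch_sketch(smiles_batch: List[str]) -> List[List[str]]:
    """Hypothetical stand-in: tokenizes each string independently."""
    # One token list per input string, preserving the order of the batch.
    return [_TOKEN.findall(smiles) for smiles in smiles_batch]


batch_tokens = segment_smiles_batch_sketch(["CCO", "ClCCBr"])
lengths = [len(tokens) for tokens in batch_tokens]
```

Because the token lists are ragged, a batch like this would typically be padded to a common length before being converted to integer labels.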
Source code in s4dd/smiles_utils.py