dataloaders
PaddedLabelEncodedDataset
Bases: torch.utils.data.Dataset
A dataset that returns a tuple of (X, y), where X and y are both torch tensors. X is a sequence of integers representing the SMILES tokens, and y is the same sequence shifted by one position to the right.
The outputs are padded to the same length and label encoded.
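The relationship between X and y can be illustrated with plain Python lists. This is a minimal sketch, not the library's code: it assumes that "shifted by one position" means X drops the last token and y drops the first, so that y[i] is the token following X[i]. The helper name `make_next_token_pair` and the integer labels are hypothetical.

```python
from typing import List, Tuple


def make_next_token_pair(encoded: List[int]) -> Tuple[List[int], List[int]]:
    """Split one padded, label-encoded molecule into (X, y).

    Assumption: X is the sequence without its last token and y is the
    sequence without its first token, so y[i] follows X[i].
    """
    X = encoded[:-1]
    y = encoded[1:]
    return X, y


# Hypothetical labels: 1 = "[BEG]", 2 = "C", 3 = "O", 4 = "[END]", 0 = padding.
encoded = [1, 2, 2, 3, 4, 0, 0]
X, y = make_next_token_pair(encoded)
print(X)  # [1, 2, 2, 3, 4, 0]
print(y)  # [2, 2, 3, 4, 0, 0]
```

Under this reading, X and y always have the same length, one less than the padded molecule.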
Source code in s4dd/dataloaders.py
__getitem__(idx)
Returns a tuple of (X, y), where X and y are both torch tensors. X is a sequence of integers representing the SMILES tokens, and y is the same sequence shifted by one position to the right.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
idx | int | Index of the molecule to return. | required |
Returns:
Type | Description |
---|---|
Tuple[torch.Tensor, torch.Tensor] | A tuple of (X, y). |
__init__(label_encoded_molecules, token2label)
Creates a PaddedLabelEncodedDataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
label_encoded_molecules | List[List[int]] | A list of label encoded and padded molecules, where each molecule is a list of integers representing the SMILES tokens. The integers are the labels of the tokens in the token2label dictionary. All molecules must be padded to the same length. | required |
token2label | Dict[str, int] | A dictionary mapping SMILES tokens to integer labels. | required |
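As an illustration of the two constructor arguments, the sketch below builds a small token2label dictionary and a list of equally long, label encoded molecules by hand. The vocabulary, the integer labels, and the pad token name "[PAD]" are assumptions made for this example only.

```python
from typing import Dict, List

# Hypothetical vocabulary; the pad token name "[PAD]" is an assumption.
token2label: Dict[str, int] = {
    "[PAD]": 0, "[BEG]": 1, "[END]": 2, "C": 3, "O": 4, "N": 5,
}

# Two already-tokenized molecules of different lengths.
tokenized = [
    ["[BEG]", "C", "C", "O", "[END]"],
    ["[BEG]", "C", "N", "[END]"],
]

# Label encode each token, then pad at the end up to the longest molecule.
max_len = max(len(tokens) for tokens in tokenized)
label_encoded_molecules: List[List[int]] = [
    [token2label[t] for t in tokens]
    + [token2label["[PAD]"]] * (max_len - len(tokens))
    for tokens in tokenized
]

# The constructor expects every molecule to have the same length.
same_length = len({len(m) for m in label_encoded_molecules}) == 1
print(label_encoded_molecules)  # [[1, 3, 3, 4, 2], [1, 3, 5, 2, 0]]
```

Given the signature above, these two objects would then be passed as PaddedLabelEncodedDataset(label_encoded_molecules, token2label).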
__len__()
Returns the number of molecules in the dataset.
Returns:
Type | Description |
---|---|
int | Number of molecules in the dataset. |
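To show how __len__ and __getitem__ work together, here is a plain-Python stand-in that follows the same map-style protocol. It is a sketch only: `ToyPairDataset` is a hypothetical class, it returns lists rather than torch tensors, and its slicing is an assumed reading of the one-position shift.

```python
from typing import List, Sequence, Tuple


class ToyPairDataset:
    """Hypothetical stand-in for illustration; not the s4dd class."""

    def __init__(self, molecules: Sequence[Sequence[int]]) -> None:
        self.molecules = [list(m) for m in molecules]

    def __len__(self) -> int:
        # Number of molecules, which bounds the valid indices.
        return len(self.molecules)

    def __getitem__(self, idx: int) -> Tuple[List[int], List[int]]:
        # Assumed shift: inputs drop the last token, targets drop the first.
        molecule = self.molecules[idx]
        return molecule[:-1], molecule[1:]


dataset = ToyPairDataset([[1, 3, 3, 2], [1, 4, 2, 0], [1, 5, 2, 0]])

# A loader draws indices below len(dataset) and fetches one pair per index.
batch_indices = [2, 0]
batch = [dataset[i] for i in batch_indices]
print(len(dataset))  # 3
print(batch[0])      # ([1, 5, 2], [5, 2, 0])
```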
create_dataloader(path_to_data, batch_size, sequence_length=100, num_workers=8, shuffle=True, token2label=None)
Creates a dataloader for a dataset of SMILES strings. The input sequences will be tokenized, prefixed with a "[BEG]" token and suffixed with an "[END]" token, label encoded, and padded to the same length.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path_to_data | str | Path to the dataset. Can be a zip file or a text file. The dataset must be a list of SMILES strings, one per line. | required |
batch_size | int | Batch size. | required |
sequence_length | int, optional | Number of tokens in the tokenized SMILES sequences. If a SMILES sequence has more tokens than this limit, it will be pre-truncated. If a sequence has fewer tokens than this, it will be post-padded with the value ... | 100 |
num_workers | int, optional | Number of workers for the dataloader. The default is 8. | 8 |
shuffle | bool, optional | Whether to shuffle the dataset. The default is True. | True |
token2label | Dict[str, int], optional | A dictionary mapping SMILES tokens to integer labels. If ... | None |
Returns:
Type | Description |
---|---|
torch.utils.data.DataLoader | A dataloader for the dataset. |
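The preprocessing steps described for create_dataloader can be sketched in plain Python. This is not the library's implementation. The regular-expression tokenizer is a simplified, hypothetical one; the pad token name "[PAD]" is an assumption; and "pre-truncated" is read here in the Keras-style sense of dropping tokens from the front of a sequence that is too long.

```python
import re
from typing import Dict, List, Tuple

# Simplified, hypothetical tokenizer: bracket atoms, Br/Cl, else single characters.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")
PAD = "[PAD]"  # assumed pad token name


def preprocess(
    smiles: List[str], sequence_length: int
) -> Tuple[List[List[int]], Dict[str, int]]:
    """Tokenize, add "[BEG]"/"[END]", truncate, pad, and label encode."""
    tokenized = [["[BEG]"] + TOKEN_PATTERN.findall(s) + ["[END]"] for s in smiles]
    # Pre-truncation (assumed): keep only the last sequence_length tokens.
    truncated = [tokens[-sequence_length:] for tokens in tokenized]
    # Post-padding: append pad tokens until every sequence has the same length.
    padded = [t + [PAD] * (sequence_length - len(t)) for t in truncated]
    # Build the vocabulary from the data, as one possible way to get labels.
    vocabulary = sorted({token for seq in padded for token in seq})
    token2label = {token: label for label, token in enumerate(vocabulary)}
    encoded = [[token2label[token] for token in seq] for seq in padded]
    return encoded, token2label


encoded, token2label = preprocess(["CCO", "ClC(=O)[O-]"], sequence_length=6)

# Decode back to tokens to inspect the effect of padding and truncation.
label2token = {label: token for token, label in token2label.items()}
decoded = [[label2token[label] for label in seq] for seq in encoded]
print(decoded[0])  # ['[BEG]', 'C', 'C', 'O', '[END]', '[PAD]']
print(decoded[1])  # ['(', '=', 'O', ')', '[O-]', '[END]']
```

The short molecule gains one pad token at the end, while the long one loses its first three tokens, including "[BEG]", under the assumed truncation rule. The documented function wraps steps of this kind behind a single call that takes path_to_data and batch_size as required arguments.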