tokenizer
tokenize_peptide(peptide)
Tokenize a peptide sequence into its constituent amino acids.
The amino acids are represented by their upper-case one-letter codes.
Post-translational modifications are also supported and are represented as "<aa>_<mod>", where <aa> is the amino acid and <mod> is the modification.
The list of supported amino acids and modifications can be found in the peptidy.biology module.
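The token format described above can be illustrated with a minimal sketch. This is not the code in peptidy/tokenizer.py: the names `sketch_tokenize`, `SKETCH_VOCAB`, and `TOKEN_PATTERN` are hypothetical, the vocabulary holds only the 20 standard amino acids plus the modifications that appear in the examples on this page, and the sketch assumes modification codes are lower-case letters. The authoritative vocabulary is the one in peptidy.biology.

```python
import re
from typing import List

# Hypothetical vocabulary for illustration only: the 20 standard amino acids
# plus the modified residues used in this page's examples.
# The real list of supported tokens lives in peptidy.biology.
SKETCH_VOCAB = set("ACDEFGHIKLMNPQRSTVWY") | {"K_a", "S_p", "T_p", "R_m"}

# One upper-case letter, optionally followed by "_" and a lower-case
# modification code (an assumption about the modification syntax).
TOKEN_PATTERN = re.compile(r"[A-Z](?:_[a-z]+)?")


def sketch_tokenize(peptide: str) -> List[str]:
    """Split a peptide string into "<aa>" or "<aa>_<mod>" tokens."""
    tokens = TOKEN_PATTERN.findall(peptide)

    # findall silently skips characters it cannot match, so the tokens must
    # reassemble into the input exactly; otherwise the syntax is invalid.
    if "".join(tokens) != peptide:
        raise ValueError(f"Syntax error in peptide: {peptide!r}")

    # Well-formed tokens can still be outside the vocabulary.
    unknown = {token for token in tokens if token not in SKETCH_VOCAB}
    if unknown:
        raise ValueError(f"Unknown token(s) in peptide: {sorted(unknown)}")

    return tokens


print(sketch_tokenize("ACK_aDGH"))
```

In this sketch, malformed input such as "A_C_D" fails the reassembly check, while a well-formed but unsupported residue such as "X" fails the vocabulary check. The error messages here are the sketch's own and may differ from those raised by `tokenize_peptide`.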
Parameters:
Name | Type | Description | Default |
---|---|---|---|
peptide | str | A peptide sequence. | required |
Returns:
Type | Description |
---|---|
List[str] | A list of tokens, each representing an amino acid (possibly with a post-translational modification) in the peptide sequence. |
Raises:
Type | Description |
---|---|
ValueError | If the peptide sequence contains unknown amino acids or a syntax error. |
Examples:
>>> tokenize_peptide("ACDEF")
['A', 'C', 'D', 'E', 'F']
>>> tokenize_peptide("ACK_aDGH")
['A', 'C', 'K_a', 'D', 'G', 'H']
>>> tokenize_peptide("S_pT_p")
['S_p', 'T_p']
>>> tokenize_peptide('ACD')
['A', 'C', 'D']
>>> tokenize_peptide('R_mRGD')
['R_m', 'R', 'G', 'D']
>>> tokenize_peptide('AXR')
Traceback (most recent call last):
...
ValueError: Unknown amino acid(s) in peptide: {'X'}
>>> tokenize_peptide('A_C_D')
Traceback (most recent call last):
...
ValueError: Unknown amino acid(s) in peptide: {'A_C_D'}
Source code in peptidy/tokenizer.py