Basics

This page covers the basics of MidiTok and how its tokenizers work.

Tokens and vocabulary

A token is a distinct element, part of a sequence of tokens. In natural language, a token can be a character, a subword or a word. A sentence can then be tokenized into a sequence of tokens representing the words and punctuation. For symbolic music, tokens can represent the values of the note attributes (pitch, velocity, duration) or time events. These are the “basic” tokens, which can be compared to the characters in natural language. With Byte Pair Encoding (BPE), tokens can represent successions of these basic tokens. A token can take three forms, which we name by convention:

Token (string): the form describing it, e.g. Pitch_50.
Id (int): a unique associated integer, used as an index.
Byte (string): a unique associated byte, used internally for Byte Pair Encoding (BPE).

MidiTok works with TokSequence objects to output token sequences represented in these three forms.
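As a minimal sketch of these three forms, the snippet below converts a few tokens to ids and then to bytes using a toy vocabulary. The vocabulary contents and the value of CHR_ID_START are assumptions chosen for illustration, not values read from MidiTok.

```python
# Toy vocabulary (hypothetical contents): token (str) -> id (int).
toy_vocab = {"PAD_None": 0, "Pitch_50": 1, "Velocity_63": 2, "Duration_1.0.8": 3}

# Assumed offset so that ids map to printable characters; the real value may differ.
CHR_ID_START = 33

tokens = ["Pitch_50", "Velocity_63", "Duration_1.0.8"]

# Token form -> id form: a plain dictionary lookup.
ids = [toy_vocab[tok] for tok in tokens]

# Id form -> byte form: one unique character per id, joined in a single string.
bytes_ = "".join(chr(id_ + CHR_ID_START) for id_ in ids)

print(ids)     # [1, 2, 3]
print(bytes_)  # "#$
```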

Vocabulary

The vocabulary of a tokenizer acts as a lookup table, linking tokens (string / byte) to their ids (integer). The vocabulary is an attribute of the tokenizer and can be accessed with tokenizer.vocab. It is a Python dictionary binding tokens (keys) to their ids (values). For tokenizations with embedding pooling (e.g. CPWord or Octuple), tokenizer.vocab will be a list of such dictionaries, and the tokenizer.is_multi_voc property will be True.

With Byte Pair Encoding: tokenizer.vocab holds all the basic tokens describing the note and time attributes of music. By analogy with text, these tokens can be seen as unique characters. After training a tokenizer with Byte Pair Encoding (BPE), a new vocabulary is built with newly created tokens from pairs of basic tokens. This vocabulary can be accessed with tokenizer.vocab_bpe, and binds tokens as bytes (string) to their associated ids (int). This is the vocabulary of the 🤗tokenizers BPE model.
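To make the idea of "new tokens built from pairs of basic tokens" concrete, here is a toy sketch of a single BPE merge step in pure Python. It is not the 🤗tokenizers implementation; the function names are hypothetical and each character simply stands for the byte of one basic token.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Count adjacent symbol pairs over all sequences and return the most frequent one."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq, pair, new_symbol):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_symbol`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


# Each character stands for the byte of a basic token.
corpus = [list("abcab"), list("abd")]
pair = most_frequent_pair(corpus)
new_symbol = "".join(pair)
merged = [merge_pair(seq, pair, new_symbol) for seq in corpus]

print(pair)    # ('a', 'b')
print(merged)  # [['ab', 'c', 'ab'], ['ab', 'd']]
```

Repeating this step until the vocabulary reaches the target size is the core loop of BPE training; a real trainer keeps the counts updated incrementally rather than recounting.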

TokSequence

The methods of MidiTok use miditok.TokSequence objects as inputs and outputs. A TokSequence holds tokens in the three forms described in Tokens and vocabulary. TokSequences are subscriptable and implement __len__ (you can run tok_seq[id] and len(tok_seq)).

You can use the miditok.MIDITokenizer.complete_sequence() method to automatically fill the non-initialized attributes of a TokSequence.

class miditok.TokSequence(tokens: List[str | List[str]] = None, ids: List[int | List[int]] = None, bytes: str = None, events: List[Event | List[Event]] = None, ids_bpe_encoded: bool = False, _ids_no_bpe: List[int | List[int]] = None)

Represents a sequence of tokens. A TokSequence can represent tokens in several forms:

tokens (list of str): tokens as sequence of strings.
ids (list of int): token ids, these are the ones to be fed to models.
events (list of Event): Event objects that can carry time or other information useful for debugging.
bytes (str): ids are converted into unique bytes, all joined together in a single string.

Bytes are used internally by MidiTok for Byte Pair Encoding. The ids_bpe_encoded attribute tells whether ids is encoded with BPE.

See also: miditok.MIDITokenizer.complete_sequence()

MIDI Tokenizer

MidiTok features several MIDI tokenizations, all inheriting from the miditok.MIDITokenizer class. You can customize your tokenizer by creating it with a custom Tokenizer config.

class miditok.MIDITokenizer(tokenizer_config: TokenizerConfig = None, one_token_stream: bool = False, params: str | Path = None)

MIDI tokenizer base class, containing common methods and attributes for all tokenizers.

Parameters:

tokenizer_config – the tokenizer’s configuration, as a miditok.classes.TokenizerConfig object.
one_token_stream – give True if the tokenizer handles all the tracks of a MIDI as a single sequence of tokens. Tokens will be saved as a single sequence. This applies to representations that natively handle multiple tracks such as Octuple or REMIPlus, resulting in a single “stream” of tokens per MIDI. This attribute will be saved in config files of the tokenizer. (default: False)
params – path to a tokenizer config file. This will override other arguments and load the tokenizer based on the config file. This is particularly useful if the tokenizer learned Byte Pair Encoding. (default: None)

add_to_vocab(token: str | Event, vocab_idx: int = None, byte_: str = None, add_to_bpe_model: bool = False)

Adds an event to the vocabulary. Its index (int) will be the length of the vocab.

Parameters:

token – token to add, as a formatted string of the form “Type_Value”, e.g. Pitch_80, or an Event.
vocab_idx – idx of the vocabulary (in case of embedding pooling). (default: None)
byte_ – unique byte associated to the token. This is used when building the vocabulary with fast BPE. If None is given, it will default to chr(id_ + CHR_ID_START). (default: None)
add_to_bpe_model – the token will be added to the bpe_model vocabulary too. (default: False)

apply_bpe(seq: TokSequence | List[TokSequence])

Applies Byte Pair Encoding (BPE) to a TokSequence, or list of TokSequences. If a list is given, BPE will be applied by batch on all sequences at the same time.

Parameters: seq – Sequence(s) to apply BPE.

apply_bpe_to_dataset(dataset_path: Path | str, out_path: Path | str = None)

Applies BPE to an already tokenized dataset (with no BPE).

Parameters:

dataset_path – path to token files to load.
out_path – output directory to save. If none is given, this method will overwrite original files. (default: None)

complete_sequence(seq: TokSequence)

Completes (inplace) a miditok.TokSequence object by converting its attributes. The input sequence can miss some of its attributes (ids, tokens), but needs at least one for reference. This method will create the missing ones from the present ones. The bytes attribute will be created if the tokenizer has been trained with BPE. The events attribute will not be filled as it is only intended for debugging purposes.

Parameters: seq – input miditok.TokSequence, must have at least one attribute defined.
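The completion logic described above can be sketched with a toy sequence class. ToySequence and complete are hypothetical names used only for this illustration; the sketch covers the tokens and ids forms and ignores bytes and events.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ToySequence:
    """Hypothetical stand-in for a token sequence holding two of the forms."""

    tokens: Optional[List[str]] = None
    ids: Optional[List[int]] = None


def complete(seq: ToySequence, vocab: Dict[str, int]) -> None:
    """Fill in place whichever of `tokens` / `ids` is missing, using the other."""
    if seq.tokens is None and seq.ids is None:
        raise ValueError("at least one attribute must be defined")
    if seq.ids is None:
        seq.ids = [vocab[tok] for tok in seq.tokens]
    elif seq.tokens is None:
        id_to_token = {id_: tok for tok, id_ in vocab.items()}
        seq.tokens = [id_to_token[id_] for id_ in seq.ids]


toy_vocab = {"Pitch_50": 0, "Velocity_63": 1}
seq = ToySequence(ids=[1, 0])
complete(seq, toy_vocab)
print(seq.tokens)  # ['Velocity_63', 'Pitch_50']
```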

decode_bpe(seq: TokSequence | List[TokSequence])

Decodes (inplace) a sequence of tokens (miditok.TokSequence) with ids encoded with BPE. This method only modifies the .ids attribute of the input sequence(s) and does not complete it. This method can also receive a list of sequences, in which case it will decompose BPE on each of them recursively.

Parameters: seq – token sequence to decompose.

property is_multi_voc: bool

Returns a bool indicating if the tokenizer uses embedding pooling, and so has multiple vocabularies.

Returns: True if the tokenizer uses embedding pooling, else False.

learn_bpe(vocab_size: int, iterator: Iterable = None, tokens_paths: List[Path | str] = None, start_from_empty_voc: bool = False, **kwargs)

Method to construct the vocabulary from BPE, backed by the 🤗tokenizers library. The data used for training can either be given through the iterator argument as an iterable object yielding strings, or by tokens_paths as a list of paths to token json files that will be loaded. You can read the Hugging Face 🤗tokenizers documentation, 🤗tokenizers API documentation and 🤗tokenizers course for more details about the iterator and input type.

The training progress bar will not appear with non-proper terminals (cf. GitHub issue).

Parameters:

vocab_size – size of the vocabulary to learn / build.
iterator – an iterable object yielding the training data, as lists of strings. It can be a list or a Generator. This iterator will be passed to the BPE model for training. If None is given, you must use the tokens_paths argument. (default: None)
tokens_paths – paths of the token json files to load and use. (default: None)
start_from_empty_voc – the training will start from an empty base vocabulary. The tokenizer will then have a base vocabulary only based on the unique bytes present in the training data. If you set this argument to True, you should use the tokenizer only with the training data, as new data might contain “unknown” tokens missing from the vocabulary. Comparing this to text, setting this argument to True would create a tokenizer that only knows the characters present in the training data, and would not be compatible with other characters. This argument can help optimize the vocabulary size. If you are unsure about this, leave it to False. (default: False)
kwargs – any additional argument to pass to the trainer.

property len: int | List[int]

Returns the length of the vocabulary. If the tokenizer uses embedding pooling / has multiple vocabularies, it will return the list of their lengths. Use the miditok.MIDITokenizer.__len__() magic method (len(tokenizer)) to get the sum of the lengths.

Returns: length of the vocabulary.

static load_tokens(path: str | Path) → List[Any] | Dict

Loads tokens saved as JSON files.

Parameters: path – path of the file to load.
Returns: the tokens, with the associated information saved with them.

midi_to_tokens(midi: MidiFile, apply_bpe_if_possible: bool = True, *args, **kwargs) → TokSequence | List[TokSequence]

Tokenizes a MIDI file. This method returns a miditok.TokSequence, or a list of miditok.TokSequence objects, depending on tokenizer.one_token_stream.

If you are implementing your own tokenization by subclassing this class, override the _midi_to_tokens method. The present method (midi_to_tokens) implements the necessary MIDI preprocessing.

Parameters:

midi – the MIDI object to convert.
apply_bpe_if_possible – will apply BPE if the tokenizer’s vocabulary was learned with BPE. (default: True)

Returns:

a miditok.TokSequence if tokenizer.one_token_stream is true, else a list of miditok.TokSequence objects.

preprocess_midi(midi: MidiFile)

Pre-process (in place) a MIDI file to quantize its time and note attributes before tokenizing it. Its note attributes (times, pitches, velocities) and tempos will be quantized and sorted, and duplicated notes removed. Empty tracks (with no note) will be removed from the MIDI object. Notes with pitches outside of self.pitch_range will be deleted.

Parameters: midi – MIDI object to preprocess.
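The note-level part of this preprocessing can be illustrated with a small pure-Python sketch. It is only an approximation under stated assumptions: notes are plain (start_tick, pitch, velocity) tuples, the pitch range is treated as half-open, two notes are duplicates when they share the same quantized onset and pitch, and velocities are left untouched. The function name and the ticks_per_sample parameter are hypothetical.

```python
def preprocess_notes(notes, ticks_per_sample, pitch_range=(21, 109)):
    """Return a cleaned, sorted copy of `notes`, a list of (start_tick, pitch, velocity)."""
    seen = set()
    cleaned = []
    for start, pitch, velocity in notes:
        # Drop notes whose pitch falls outside the (assumed half-open) range.
        if not pitch_range[0] <= pitch < pitch_range[1]:
            continue
        # Snap the onset to the nearest multiple of ticks_per_sample.
        q_start = int(start / ticks_per_sample + 0.5) * ticks_per_sample
        # Notes sharing a quantized onset and pitch are treated as duplicates.
        if (q_start, pitch) in seen:
            continue
        seen.add((q_start, pitch))
        cleaned.append((q_start, pitch, velocity))
    return sorted(cleaned)


notes = [(130, 60, 90), (118, 60, 80), (0, 10, 64), (61, 72, 100)]
print(preprocess_notes(notes, ticks_per_sample=60))
# [(60, 72, 100), (120, 60, 90)]
```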

save_params(out_path: str | Path, additional_attributes: Dict = None)

Saves the config / parameters of the tokenizer in a json encoded file. This can be useful to keep track of how a dataset has been tokenized. Note: if you override this method, you should probably call it (super()) at the end and use the additional_attributes argument.

Parameters:

out_path – output path to save the file.
additional_attributes – any additional information to store in the config file. It can be used to override the default attributes saved in the parent method. (default: None)

save_tokens(tokens: TokSequence | List | ndarray | Any, path: str | Path, programs: List[Tuple[int, bool]] = None, **kwargs)

Saves tokens as a JSON file. In order to reduce disk space usage, only the ids are saved. Use kwargs to save any additional information within the JSON file.

Parameters:

tokens – tokens, as list, numpy array, torch or tensorflow Tensor.
path – path of the file to save.
programs – (optional), programs of the associated tokens, should be given as tuples (int, bool) for (program, is_drum).
kwargs – any additional information to save within the JSON file.
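As a rough sketch of what such a JSON file might contain, the following standard-library example writes ids, optional programs and extra keyword information to a file and reads them back. The key names "ids" and "programs", and the helper names, are assumptions made for this illustration.

```python
import json
import tempfile
from pathlib import Path


def save_ids(ids, path, programs=None, **kwargs):
    """Write ids (and optional extras) to a JSON file. Key names are assumptions."""
    payload = {"ids": ids, **kwargs}
    if programs is not None:
        # JSON has no tuple type: (program, is_drum) pairs are stored as lists.
        payload["programs"] = [list(p) for p in programs]
    with Path(path).open("w") as f:
        json.dump(payload, f)


def load_ids(path):
    """Read back the dictionary written by save_ids."""
    with Path(path).open() as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    file_path = Path(tmp) / "track.json"
    save_ids([[1, 2, 3], [4, 5]], file_path, programs=[(0, False), (0, True)], source="demo")
    loaded = load_ids(file_path)

print(loaded["ids"])       # [[1, 2, 3], [4, 5]]
print(loaded["programs"])  # [[0, False], [0, True]]
```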

property special_tokens: Sequence[str]

Returns the special tokens of the tokenizer.

Returns: special tokens of the tokenizer

property special_tokens_ids: Sequence[int]

Returns the ids of the special tokens of the tokenizer.

Returns: ids of the special tokens of the tokenizer

token_id_type(id_: int, vocab_id: int = None) → str

Returns the type of the given token id.

Parameters:

id_ – token id to get the type.
vocab_id – index of the vocabulary associated to the token, if applicable. (default: None)

Returns:

the type of the token, as a string

token_ids_of_type(token_type: str, vocab_id: int = None) → List[int]

Returns the list of token ids of the given type.

Parameters:

token_type – token type to get the associated token ids.
vocab_id – index of the vocabulary associated to the token, if applicable. (default: None)

Returns:

list of token ids.
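Both lookups can be pictured on a toy vocabulary, assuming that the type of a token is the part of its “Type_Value” string before the first underscore. The helper names below are hypothetical.

```python
def token_type(token: str) -> str:
    """Return the type part of a "Type_Value" token string."""
    return token.split("_")[0]


def ids_of_type(vocab, type_):
    """Return the ids of all tokens of the given type, in vocabulary order."""
    return [id_ for tok, id_ in vocab.items() if token_type(tok) == type_]


toy_vocab = {"PAD_None": 0, "Pitch_50": 1, "Pitch_51": 2, "Velocity_63": 3}
id_to_token = {id_: tok for tok, id_ in toy_vocab.items()}

print(token_type(id_to_token[3]))       # Velocity
print(ids_of_type(toy_vocab, "Pitch"))  # [1, 2]
```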

tokenize_midi_dataset(midi_paths: List[str] | List[Path], out_dir: str | Path, tokenizer_config_file_name: str = 'tokenizer.conf', validation_fn: Callable[[MidiFile], bool] = None, data_augment_offsets=None, apply_bpe: bool = True, save_programs: bool = None, logging: bool = True)

Converts a dataset / list of MIDI files into their token version and saves them as JSON files. The resulting JSON files will have the shape (T, *): the first dimension is tracks, the second tokens. In order to reduce disk space usage, only the ids are saved. If save_programs is True, the shape will be [(T, *), (T, 2)]: the first dimension is tokens and programs instead; for programs, the first value is the program and the second a bool indicating if the track is drums. The config of the tokenizer will be saved as a “tokenizer.conf” file by default.

Parameters:

midi_paths – paths of the MIDI files.
out_dir – output directory to save the converted files.
tokenizer_config_file_name – name of the tokenizer config file. This file will be saved in out_dir. (default: “tokenizer.conf”)
validation_fn – a function checking if the MIDI is valid on your requirements (e.g. time signature, minimum/maximum length, instruments …).
data_augment_offsets – data augmentation arguments, to be passed to the miditok.data_augmentation.data_augmentation_dataset method. Has to be given as a list / tuple of offsets pitch octaves, velocities, durations, and finally their directions (up/down). (default: None)
apply_bpe – will apply BPE on the dataset to save, if the vocabulary was learned with BPE. (default: True)
save_programs – will save the programs of the tracks of the MIDI as an entry in the JSON file. Note that this option is probably unnecessary when using a multitrack tokenizer (config.use_programs), as the Program information is present within the tokens, and the tracks having the same programs are likely to have been merged. (default: False if config.use_programs, else True)
logging – logs progress bar. (default: True)

validate_midi_time_signatures(midi: MidiFile) → bool

Checks if a MIDI file contains only time signatures supported by the encoding.

Parameters: midi – MIDI file.
Returns: boolean indicating whether the MIDI file can be processed by the encoding.

property vocab: Dict[str, int] | List[Dict[str, int]]

Get the base vocabulary, as a dictionary linking tokens (str) to their ids (int). The different (hidden / protected) vocabulary attributes of the class are:

._vocab_base : Dict[str: int] token -> id - Registers all known base tokens.
.__vocab_base_inv : Dict[int: str] id -> token - Inverse of ._vocab_base , to go the other way.
._vocab_base_id_to_byte : Dict[int: str] id -> byte - Link ids to their associated unique bytes.
._vocab_base_byte_to_token : Dict[str: str] - similar as above but for tokens.
._vocab_bpe_bytes_to_tokens : Dict[str: List[str]] byte(s) -> token(s) used to decode BPE.
._bpe_model.get_vocab() : Dict[str: int] byte -> id - bpe model vocabulary, based on unique bytes

Before training the tokenizer with BPE, bytes are obtained by running chr(id). After training, if we started from an empty vocabulary, some base tokens might be removed from ._vocab_base if they were never found in the training samples. The base vocabulary being changed, chr(id) would then bind to incorrect bytes (on which byte successions would not have been learned). This is why we register the original id/token/byte association in ._vocab_base_id_to_byte and ._vocab_base_byte_to_token.

Returns: the base vocabulary.

property vocab_bpe: None | Dict[str, int]

Returns the vocabulary learnt with BPE. In case the tokenizer has not been trained with BPE, it returns None.

Returns: the BPE model’s vocabulary.

Tokenizer config

All tokenizers are initialized with common parameters, which are held in a TokenizerConfig object, documented below. You can access a tokenizer’s configuration with tokenizer.config. Some tokenizers might take additional specific arguments / parameters when creating them.

class miditok.TokenizerConfig(pitch_range: Tuple[int, int] = (21, 109), beat_res: Dict[Tuple[int, int], int] = {(0, 4): 8, (4, 12): 4}, nb_velocities: int = 32, special_tokens: Sequence[str] = ['PAD', 'BOS', 'EOS', 'MASK'], use_chords: bool = False, use_rests: bool = False, use_tempos: bool = False, use_time_signatures: bool = False, use_programs: bool = False, rest_range: Sequence = (2, 8), chord_maps: Dict[str, Tuple] = {'7aug': (0, 4, 8, 11), '7dim': (0, 3, 6, 9), '7dom': (0, 4, 7, 10), '7halfdim': (0, 3, 6, 10), '7maj': (0, 4, 7, 11), '7min': (0, 3, 7, 10), '9maj': (0, 4, 7, 10, 14), '9min': (0, 4, 7, 10, 13), 'aug': (0, 4, 8), 'dim': (0, 3, 6), 'maj': (0, 4, 7), 'min': (0, 3, 7), 'sus2': (0, 2, 7), 'sus4': (0, 5, 7)}, chord_tokens_with_root_note: bool = False, chord_unknown: Tuple[int, int] = None, nb_tempos: int = 32, tempo_range: Tuple[int, int] = (40, 250), log_tempos: bool = False, delete_equal_successive_tempo_changes: bool = False, time_signature_range: Dict[int, List[int] | Tuple[int, int]] = {4: [5, 6, 3, 2, 1, 4], 8: [3, 12, 6]}, delete_equal_successive_time_sig_changes: bool = False, programs: Sequence[int] = [-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127], **kwargs)

Tokenizer configuration, holding the parameters common to all tokenizers.

Parameters:

pitch_range – range of MIDI pitches to use. Pitches can take values between 0 and 127 (included). The General MIDI 2 (GM2) specifications indicate the recommended ranges of pitches per MIDI program (instrument). These recommended ranges can also be found in miditok.constants. In all cases, the range from 21 to 108 (included) covers all the recommended values. When processing a MIDI, the notes with pitches under or above this range can be discarded. (default: (21, 109))
beat_res – beat resolutions, as a dictionary in the form: {(beat_x1, beat_x2): beat_res_1, (beat_x2, beat_x3): beat_res_2, ...}. The keys are tuples indicating a range of beats, e.g. 0 to 3 for the first bar, and the values are the resolutions (in samples per beat) to apply to the ranges, e.g. 8. This allows using Duration / TimeShift tokens of different lengths / resolutions. Note: for tokenization with Position tokens, the total number of possible positions will be set at four times the maximum resolution given (max(beat_res.values())). (default: {(0, 4): 8, (4, 12): 4})
nb_velocities – number of velocity bins. In the MIDI norm, velocities can take up to 128 values (0 to 127). This parameter allows reducing the number of velocity values. The velocities of the MIDIs will be downsampled to nb_velocities values, equally spaced between 0 and 127. (default: 32)
special_tokens – list of special tokens. This must be given as a list of strings giving only the names of the tokens. (default: ["PAD", "BOS", "EOS", "MASK"])
use_chords – will use Chord tokens, if the tokenizer is compatible. A Chord token indicates the presence of a chord at a certain time step. MidiTok uses a chord detection method based on onset times and duration. This allows MidiTok to detect precisely chords without ambiguity, whereas most chord detection methods in symbolic music based on chroma features can’t. Note that using chords will increase the tokenization time, especially if you are working on music with a high “note density”. (default: False)
use_rests – will use Rest tokens, if the tokenizer is compatible. Rest tokens will be placed whenever a portion of time is silent, i.e. no note is being played. This token type is decoded as a TimeShift event. You can choose the minimum and maximum rest values to represent with the rest_range parameter (default is 1/2 beat to 8 beats). Note that rests shorter than one beat are only divisible by the first beat resolution, e.g. a rest of 5/8th of a beat will be a succession of Rest_0.4 and Rest_0.1, where the first number indicates the rest duration in beats and the second in samples / positions. (default: False)
use_tempos – will use Tempo tokens, if the tokenizer is compatible. Tempo tokens will specify the current tempo. This allows training a model to predict tempo changes. Tempo values are quantized according to the nb_tempos and tempo_range parameters (default is 32 tempos from 40 to 250). (default: False)
use_time_signatures – will use TimeSignature tokens, if the tokenizer is compatible. TimeSignature tokens will specify the current time signature. Note that REMI and REMIPlus add a TimeSignature token at the beginning of each Bar (i.e. after Bar tokens), while TSD and MIDILike will only represent time signature changes (MIDI messages) as they come. If you want more “recalls” of the current time signature within your token sequences, you can preprocess your MIDI file to add more TimeSignatureChange objects. (default: False)
use_programs – will use Program tokens, if the tokenizer is compatible. Used to specify an instrument / MIDI program. The Octuple, MMM and MuMIDI tokenizers natively use Program tokens; for them this option is always enabled. TSD, REMI, MIDILike, Structured and CPWord will add Program tokens before each Pitch / NoteOn token to indicate its associated instrument and will treat all the tracks of a MIDI as a single sequence of tokens. CPWord, Octuple and MuMIDI add a Program token with the stacks of Pitch, Velocity and Duration tokens. (default: False)
rest_range – range of the rests to use, in beats, as a tuple (beat_division, max_rest_in_beats). The beat division divides a beat to determine the minimum rest to represent. The minimum rest must be divisible by 2 and lower than the first beat resolution. (default: (2, 8))
chord_maps – list of chord maps, to be given as a dictionary where keys are chord qualities (e.g. “maj”) and values pitch maps as tuples of integers (e.g. (0, 4, 7)). You can use miditok.constants.CHORD_MAPS as an example.
chord_tokens_with_root_note – to specify the root note of each chord in Chord tokens. Tokens will look as “Chord_C:maj”. (default: False)
chord_unknown – range of number of notes to represent unknown chords. If you want to represent chords that do not match any combination in chord_maps, use this argument. Leave None to not represent unknown chords. (default: None)
nb_tempos – number of tempos “bins” to use. (default: 32)
tempo_range – range of minimum and maximum tempos within which the bins fall. (default: (40, 250))
log_tempos – will use log scaled tempo values instead of linearly scaled. (default: False)
delete_equal_successive_tempo_changes – setting this option True will delete identical successive tempo changes when preprocessing a MIDI file after loading it. For example, if a MIDI has a tempo change to 120 at tick 1000 and the next one to 121 at tick 1200, during preprocessing the tempo values are likely to be downsampled and become identical (120 or 121). If that’s the case, the second tempo change will be deleted and not tokenized. This parameter doesn’t apply for tokenizations that natively inject the tempo information at recurrent timings (e.g. Octuple). For others, note that setting it True might reduce the number of Tempo tokens and in turn the recurrence of this information. Leave it False if you want to have recurrent Tempo tokens, that you might inject yourself by adding TempoChange objects to your MIDIs. (default: False)
time_signature_range – range as a dictionary {denom_i: [num_i1, …, num_in] / (min_num_i, max_num_i)}. (default: {4: [5, 6, 3, 2, 1, 4], 8: [3, 12, 6]})
delete_equal_successive_time_sig_changes – setting this option True will delete identical successive time signature changes when preprocessing a MIDI file after loading it. For example, if a MIDI has a time signature change to 4/4 at tick 1000 and the next one is also 4/4 at tick 1200, the second time signature change will be deleted and not tokenized. This parameter doesn’t apply for tokenizations that natively inject the time signature information at recurrent timings (e.g. Octuple). For others, note that setting it True might reduce the number of TimeSig tokens and in turn the recurrence of this information. Leave it False if you want to have recurrent TimeSig tokens, that you might inject yourself by adding TimeSignatureChange objects to your MIDIs. (default: False)
programs – sequence of MIDI programs to use. Note that -1 is used and reserved for drums tracks. (default: from -1 to 127 included)
**kwargs – additional parameters that will be saved in config.additional_params.
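The effect of nb_velocities, nb_tempos, tempo_range, log_tempos and delete_equal_successive_tempo_changes can be pictured with the sketch below. The exact bin placement is an assumption (evenly spaced integer velocity bins, and linearly or geometrically spaced tempo bins), and all function names are hypothetical; the point is only to show how nearby values collapse onto the same bin.

```python
def velocity_bins(nb_velocities):
    """Evenly spaced integer bins over (0, 127]; the exact spacing rule is an assumption."""
    return [127 * (i + 1) // nb_velocities for i in range(nb_velocities)]


def tempo_bins(nb_tempos, tempo_range, log_tempos=False):
    """Bins between the two bounds, linearly or geometrically (log scale) spaced."""
    low, high = tempo_range
    if log_tempos:
        ratio = high / low
        return [low * ratio ** (i / (nb_tempos - 1)) for i in range(nb_tempos)]
    step = (high - low) / (nb_tempos - 1)
    return [low + i * step for i in range(nb_tempos)]


def nearest(value, bins):
    """Quantize a value to the closest bin."""
    return min(bins, key=lambda b: abs(b - value))


def drop_equal_successive(values):
    """Keep a value only when it differs from the previously kept one."""
    out = []
    for v in values:
        if not out or out[-1] != v:
            out.append(v)
    return out


v_bins = velocity_bins(4)
print(v_bins)                                       # [31, 63, 95, 127]
print([nearest(v, v_bins) for v in (10, 70, 120)])  # [31, 63, 127]

# Tempos 120 and 121 fall into the same bin, so the second change is redundant.
t_bins = tempo_bins(32, (40, 250))
quantized = [nearest(t, t_bins) for t in (120, 121, 90)]
print(len(drop_equal_successive(quantized)))        # 2
```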

classmethod from_dict(input_dict: Dict[str, Any], **kwargs)

Instantiates a TokenizerConfig from a Python dictionary of parameters.

Parameters:

input_dict – Dictionary that will be used to instantiate the configuration object.
kwargs – Additional parameters from which to initialize the configuration object.

Returns:

TokenizerConfig: The configuration object instantiated from those parameters.

classmethod load_from_json(config_file_path: str | Path) → TokenizerConfig

Loads a tokenizer configuration from the config_file_path path.

Parameters: config_file_path – path to the configuration JSON file to load.

save_to_json(out_path: str | Path)

Saves a tokenizer configuration object to the out_path path, so that it can be re-loaded later.

Parameters: out_path – path to the output configuration JSON file.

to_dict(serialize: bool = False) → Dict[str, Any]

Serializes this instance to a Python dictionary.

Parameters: serialize – will serialize the dictionary before returning it, so it can be saved to a JSON file.
Returns: Dictionary of all the attributes that make up this configuration instance.

Additional tokens

MidiTok offers to include additional tokens on music information. You can specify them in the tokenizer_config argument (miditok.TokenizerConfig) when creating a tokenizer. The miditok.TokenizerConfig documentation details the role of each of them, and their associated parameters. Cells with ❕ markers mean the additional token is implemented by default and not optional.

Compatibility table of tokenizations and additional tokens.
Tokenization	Tempo	Time signature	Chord	Rest
MIDILike	✅	✅	✅	✅
REMI	✅	✅	✅	✅
TSD	✅	✅	✅	✅
Structured	❌	❌	❌	❌
CPWord	✅	✅	✅	✅
Octuple	✅	✅	❌	❌
MuMIDI	✅	❌	✅	❌
MMM	✅	✅	✅	❌

Special tokens

MidiTok offers to include some special tokens in the vocabulary. These tokens with no “musical” information can be used for training purposes. To use special tokens, you must specify them with the special_tokens argument when creating a tokenizer. By default, this argument is set to ["PAD", "BOS", "EOS", "MASK"]. Their meanings are:

PAD (PAD_None): a padding token to use when training a model with batches of sequences of unequal lengths. The padding token id is often set to 0. If you use Hugging Face models, be sure to pad inputs with this token, and pad labels with -100.
BOS (BOS_None): “Beginning Of Sequence” token, indicating that a token sequence is beginning.
EOS (EOS_None): “End Of Sequence” token, indicating that a token sequence is ending. For autoregressive generation, this token can be used to stop it.
MASK (MASK_None): a masking token, to use when pre-training a (bidirectional) model with a self-supervised objective like BERT.

Note: you can use the tokenizer.special_tokens property to get the list of the special tokens of a tokenizer, and tokenizer.special_tokens_ids for their ids.
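The padding convention mentioned for the PAD token can be sketched as follows: inputs are padded with the PAD id, while labels are padded with -100. The function name and the PAD id value of 0 are assumptions for this illustration.

```python
def pad_batch(sequences, pad_id, label_pad_id=-100):
    """Right-pad id sequences to the same length, returning (inputs, labels)."""
    max_len = max(len(seq) for seq in sequences)
    inputs = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    labels = [seq + [label_pad_id] * (max_len - len(seq)) for seq in sequences]
    return inputs, labels


PAD_ID = 0  # assumed id of the padding token
inputs, labels = pad_batch([[5, 6, 7], [8]], PAD_ID)
print(inputs)  # [[5, 6, 7], [8, 0, 0]]
print(labels)  # [[5, 6, 7], [8, -100, -100]]
```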

Tokens & TokSequence input / output format

Depending on the tokenizer in use, the format of the tokens returned by the midi_to_tokens method may vary, as well as the expected format for the tokens_to_midi method. The format is given by the tokenizer.io_format property. For any tokenizer, the format is the same for both methods.

The format is deduced from the is_multi_voc and one_token_stream tokenizer properties. one_token_stream being True means that the tokenizer will convert a MIDI file into a single stream of tokens for all instrument tracks, otherwise it will convert each track to a distinct token sequence. is_multi_voc being True means that each token stream is a list of lists of tokens, of shape (T,C) for T time steps and C subtokens per time step.

This results in four situations, where I is the number of tracks, T is the number of tokens (or time steps) and C the number of subtokens per time step:

is_multi_voc and one_token_stream are both False: [I,(T)]
is_multi_voc is False and one_token_stream is True: (T)
is_multi_voc is True and one_token_stream is False: [I,(T,C)]
is_multi_voc and one_token_stream are both True: (T,C)

Note that if there is no I dimension in the format, the output of midi_to_tokens is a miditok.TokSequence object, otherwise it is a list of miditok.TokSequence objects (one per token stream / track).
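The four situations listed above can be derived from the two properties with a few lines of Python. This is a sketch of the deduction rule as described in this section, written as a standalone hypothetical function that returns the dimension names as a tuple.

```python
def io_format(is_multi_voc: bool, one_token_stream: bool) -> tuple:
    """Deduce the dimension names of the token format from the two properties."""
    dims = []
    if not one_token_stream:
        dims.append("I")  # one token sequence per track
    dims.append("T")      # tokens / time steps
    if is_multi_voc:
        dims.append("C")  # subtokens per time step
    return tuple(dims)


for multi_voc in (False, True):
    for one_stream in (False, True):
        print(multi_voc, one_stream, io_format(multi_voc, one_stream))
```

A format starting with "I" corresponds to a list of sequences (one per track); otherwise a single sequence is expected.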

Some tokenizer examples to illustrate:

TSD without config.use_programs will not have multiple vocabularies and will treat each MIDI track as a distinct stream of tokens, hence it will convert MIDI files to a list of TokSequence objects, (I,T) format.
TSD with config.use_programs being True will convert all MIDI tracks to a single stream of tokens, hence one TokSequence object, (T) format.
CPWord is a multi-voc tokenizer, without config.use_programs it will treat each MIDI track as a distinct stream of tokens, hence it will convert MIDI files to a list of TokSequence objects with the (I,T,C) format.
Octuple is a multi-voc tokenizer and converts all MIDI tracks to a single stream of tokens, hence it will convert MIDI files to a TokSequence object, (T,C) format.

You can use the convert_sequence_to_tokseq method to automatically convert an input sequence, of ids (integers) or tokens (string), into a miditok.TokSequence or list of miditok.TokSequence objects with the appropriate format of the tokenizer being used.

miditok.convert_sequence_to_tokseq(tokenizer, input_seq, complete_seq: bool = True, decode_bpe: bool = True) → TokSequence | List[TokSequence]

Converts a sequence into a miditok.TokSequence or list of miditok.TokSequence objects with the appropriate format of the tokenizer being used.

Parameters:

tokenizer – tokenizer being used with the sequence.
input_seq – sequence to convert. It can be a list of ids (integers), tokens (string) or events (Event). It can also be a Pytorch or TensorFlow tensor, or Numpy array representing ids.
complete_seq – will complete the output sequence(s). (default: True)
decode_bpe – if the input sequence contains ids, and that they contain BPE tokens, these tokens will be decoded. (default: True)

Returns:

the converted sequence, as a miditok.TokSequence or a list of miditok.TokSequence objects.

Magic methods

Magic methods allow you to intuitively access a tokenizer’s attributes and methods. We list them here with some examples.

miditok.MIDITokenizer.__call__(self, obj: Any, *args, **kwargs)

Calling a tokenizer directly converts a MIDI to tokens or the other way around. The method automatically detects MIDI and token objects, as well as paths, and can directly load MIDI or token json files before converting them. This will call the miditok.MIDITokenizer.midi_to_tokens() method if you provide a MIDI object or path to a MIDI file, or the miditok.MIDITokenizer.tokens_to_midi() method otherwise.

Parameters: obj – a miditoolkit.MidiFile object, a sequence of tokens, or a path to a MIDI or tokens json file.
Returns: the converted object.

tokens = tokenizer(midi)
midi2 = tokenizer(tokens)

miditok.MIDITokenizer.__getitem__(self, item)

Converts a token (int) to an event (str), or vice-versa.

Parameters: item – a token (int) or an event (str). For tokenizers with embedding pooling / multiple vocabularies (tokenizer.is_multi_voc), you must either provide a string (token) that is within all vocabularies (e.g. special tokens), or a tuple where the first element is the index of the vocabulary and the second the element to index.
Returns: the converted object.

pad_token = tokenizer["PAD_None"]

miditok.MIDITokenizer.__len__(self) → int

Returns the length of the vocabulary. If the tokenizer uses embedding pooling / has multiple vocabularies, it will return the sum of their lengths. If the vocabulary was learned with fast BPE, it will return the length of the BPE vocabulary, i.e. the proper number of possible token ids. Otherwise, it will return the length of the base vocabulary. Use the miditok.MIDITokenizer.len property (tokenizer.len) to have the list of lengths.

Returns: length of the vocabulary.

nb_classes = len(tokenizer)
nb_classes_per_vocab = tokenizer.len  # applicable to tokenizer with embedding pooling, e.g. CPWord or Octuple
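The difference between len(tokenizer) and tokenizer.len for multiple vocabularies can be mimicked with a toy class. ToyMultiVocab is a hypothetical stand-in, not part of MidiTok, and it ignores the BPE case.

```python
class ToyMultiVocab:
    """Hypothetical stand-in for a tokenizer with several vocabularies."""

    def __init__(self, vocabs):
        self.vocabs = vocabs

    @property
    def len(self):
        # One length per vocabulary.
        return [len(vocab) for vocab in self.vocabs]

    def __len__(self):
        # Sum of the lengths of all vocabularies.
        return sum(self.len)


toy = ToyMultiVocab([{"Pitch_50": 0, "Pitch_51": 1}, {"Velocity_63": 0}])
print(toy.len)   # [2, 1]
print(len(toy))  # 3
```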

miditok.MIDITokenizer.__eq__(self, other) → bool

Checks if two tokenizers are identical. This is done by comparing their vocabularies, and configuration.

Parameters: other – tokenizer to compare.
Returns: True if the vocabulary(ies) are identical, False otherwise.

if tokenizer1 == tokenizer2:
    print("The tokenizers have the same vocabulary and configurations!")

Save / Load tokenizer

You can save and load a tokenizer’s parameters and vocabulary. This is especially useful to track tokenized datasets, and to save tokenizers with vocabularies learned with Byte Pair Encoding (BPE).

miditok.MIDITokenizer.save_params(self, out_path: str | Path, additional_attributes: Dict = None)

Saves the config / parameters of the tokenizer in a json encoded file. This can be useful to keep track of how a dataset has been tokenized. Note: if you override this method, you should probably call it (super()) at the end and use the additional_attributes argument.

Parameters:

out_path – output path to save the file.
additional_attributes – any additional information to store in the config file. It can be used to override the default attributes saved in the parent method. (default: None)

To load a tokenizer from saved parameters, just use the params argument when creating it:

tokenizer = REMI(params=Path("to", "params.json"))

Limitations

Some tokenizations using Bar tokens (REMI, CPWord and MuMIDI) only consider a 4/x time signature for now. This means that each bar is considered to cover 4 beats. REMIPlus and Octuple support other time signatures.