MIDI Tokenizer

MidiTok features several MIDI tokenizations, all inheriting from a MIDITokenizer class. Read the documentation of the arguments of miditok.MIDITokenizer to learn how to

class miditok.MIDITokenizer(pitch_range: range = range(21, 109), beat_res: Dict[Tuple[int, int], int] = {(0, 4): 8, (4, 12): 4}, nb_velocities: int = 32, additional_tokens: Dict[str, Union[bool, int, Tuple[int, int]]] = {'Chord': False, 'Program': False, 'Rest': False, 'Tempo': False, 'TimeSignature': False, 'nb_tempos': 32, 'rest_range': (2, 8), 'tempo_range': (40, 250), 'time_signature_range': (8, 2)}, pad: bool = True, sos_eos: bool = False, mask: bool = False, sep: bool = False, unique_track: bool = False, params: Optional[Union[Path, str]] = None)

MIDI tokenizer base class, containing common methods and attributes for all tokenizers.

Parameters:

pitch_range – (default: range(21, 109)) range of MIDI pitches to use. Pitches can take values between 0 and 127 (included). The General MIDI 2 (GM2) specifications indicate the recommended ranges of pitches per MIDI program (instrument). These recommended ranges can also be found in miditok.constants. In all cases, the range from 21 to 108 (included) covers all the recommended values. When processing a MIDI, the notes with pitches under or above this range can be discarded.
beat_res – (default: {(0, 4): 8, (4, 12): 4}) beat resolutions, as a dictionary of the form {(beat_x1, beat_x2): beat_res_1, (beat_x2, beat_x3): beat_res_2, ...}. The keys are tuples indicating a range of beats, e.g. 0 to 3 for the first bar, and the values are the resolutions (in samples per beat) to apply to those ranges, e.g. 8. This makes it possible to use Duration / TimeShift tokens of different lengths / resolutions. Note: for tokenizations with Position tokens, the total number of possible positions is set to four times the maximum resolution given (max(beat_res.values())).
nb_velocities – (default: 32) number of velocity bins. In the MIDI standard, velocities can take up to 128 values (0 to 127). This parameter reduces the number of velocity values: the velocities of a MIDI will be downsampled to nb_velocities values, equally spaced between 0 and 127.
additional_tokens – (default: none enabled) specifies which additional tokens to use. Compatibility between tokenizations and additional tokens may vary. See Additional tokens for the details and available tokens.
pad – will add a special PAD token to the vocabulary, to use to pad sequences when training a model with batches of different sequence lengths. (default: True)
sos_eos – adds special Start Of Sequence (SOS) and End Of Sequence (EOS) tokens to the vocabulary. (default: False)
mask – will add a special MASK token to the vocabulary. (default: False)
sep – will add a special SEP token to the vocabulary. (default: False)
unique_track – set to True if the tokenizer works only with a unique track. Tokens will be saved as a single track. This applies to representations that natively handle multiple tracks such as Octuple, resulting in a single “stream” of tokens for all tracks. This attribute will be saved in config files of the tokenizer. (default: False)
params – path to a tokenizer config file. This will override other arguments and load the tokenizer based on the config file. This is particularly useful if the tokenizer learned Byte Pair Encoding. (default: None)
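
To make the defaults above concrete, the sketch below works out the number of Position tokens implied by beat_res and one possible set of velocity bins. The binning rule is an assumption (evenly spaced values up to 127, with no bin at 0), and nearest_bin is a hypothetical helper, not part of MidiTok.

```python
# Default arguments copied from the MIDITokenizer signature above.
beat_res = {(0, 4): 8, (4, 12): 4}
nb_velocities = 32

# Position tokens: four times the maximum resolution given in beat_res.
nb_positions = 4 * max(beat_res.values())

# Assumption: bins are spread evenly up to 127 and velocity 0 is not a bin.
velocity_bins = [round(127 * i / nb_velocities) for i in range(1, nb_velocities + 1)]


def nearest_bin(velocity: int) -> int:
    """Return the bin value closest to a raw MIDI velocity (hypothetical helper)."""
    return min(velocity_bins, key=lambda b: abs(b - velocity))


print(nb_positions)       # 32
print(velocity_bins[:4])  # [4, 8, 12, 16]
print(nearest_bin(100))   # 99
```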

add_sos_eos_to_seq(seq: List[int])

Adds Start Of Sequence (SOS) and End Of Sequence (EOS) tokens to a sequence of tokens: SOS at the beginning, EOS at the end.

Parameters:: seq – sequence of tokens.

apply_bpe(tokens: List[int]) → List[int]

Converts a sequence of tokens into tokens with BPE.

Parameters:: tokens – tokens to convert.
Returns:: the tokens with BPE applied.

apply_bpe_to_dataset(dataset_path: Union[Path, str], out_path: Union[Path, str])

Applies BPE to an already tokenized dataset (with no BPE).

Parameters:

dataset_path – path to token files to load.
out_path – output directory to save.

decompose_bpe(tokens: List[int]) → List[int]

Decomposes a sequence of tokens containing BPE-encoded tokens into “prime” tokens. It is an in-place operation.

Parameters:: tokens – token sequence to decompose.
Returns:: decomposed token sequence.

events_to_tokens(events: List[Event]) → List[int]

Converts a list of Event objects into a list of tokens. It will apply BPE if it has been learned.

Parameters:: events – list of Event objects to convert.
Returns:: list of corresponding tokens.

property is_multi_voc: bool

Returns a bool indicating whether the tokenizer uses embedding pooling, and so has multiple vocabularies.

Returns:: True if the tokenizer uses embedding pooling, else False

learn_bpe(tokens_path: Union[Path, str], vocab_size: int, out_dir: Union[Path, str], files_lim: Optional[int] = None, save_converted_samples: bool = False, print_seq_len_variation: bool = True) → Tuple[List[float], List[int], List[float]]

Byte Pair Encoding (BPE) method to build the vocabulary. This method will build (modify) the vocabulary by analyzing an already tokenized dataset to find the most recurrent token successions. Note that this implementation is in pure Python and will be slow if you use a large number of token files; in that case, consider limiting them with the files_lim argument.

Parameters:

tokens_path – path to token files to learn the BPE combinations from.
vocab_size – the new vocabulary size.
out_dir – directory to save the tokenizer’s parameters and vocabulary after BPE learning is finished.
files_lim – limit of token files to use. (default: None)
save_converted_samples – will save in out_dir the samples that have been used to create the BPE vocabulary. Files will keep the same name and relative path. (default: False)
print_seq_len_variation – prints the mean sequence length before and after BPE, and the variation in %. (default: True)

Returns:

learning metrics, as lists of:
- the average number of token combinations covered by the newly created BPE tokens
- the maximum number of token combinations
- the average sequence length
Each index in the lists corresponds to a learning step.
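
The idea behind learning BPE on token sequences can be sketched in a few lines. This is a toy illustration under stated assumptions, not MidiTok's implementation: toy_learn_bpe, merge_pair and toy_decompose are hypothetical names, new tokens simply take the next free integer id, and the decomposition returns a new list rather than working in place.

```python
from collections import Counter
from typing import Dict, List, Tuple


def merge_pair(seq: List[int], pair: Tuple[int, int], new_token: int) -> List[int]:
    """Replace every non-overlapping occurrence of `pair` with `new_token`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def toy_learn_bpe(seqs: List[List[int]], base_vocab_size: int, vocab_size: int):
    """Merge the most frequent pair until `vocab_size` is reached or no pair repeats."""
    merges: Dict[int, Tuple[int, int]] = {}
    next_token = base_vocab_size
    while next_token < vocab_size:
        counts = Counter()
        for seq in seqs:
            counts.update(zip(seq, seq[1:]))  # count successive token pairs
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break
        merges[next_token] = pair
        seqs = [merge_pair(seq, pair, next_token) for seq in seqs]
        next_token += 1
    return seqs, merges


def toy_decompose(seq: List[int], merges: Dict[int, Tuple[int, int]]) -> List[int]:
    """Expand learned tokens back into the base tokens they were built from."""
    out, stack = [], list(reversed(seq))
    while stack:
        tok = stack.pop()
        if tok in merges:
            first, second = merges[tok]
            stack.extend([second, first])  # `first` is popped (and expanded) next
        else:
            out.append(tok)
    return out


original = [[1, 2, 3, 1, 2, 3, 1, 2], [1, 2, 3, 4]]
encoded, merges = toy_learn_bpe(original, base_vocab_size=5, vocab_size=7)
print(encoded)  # [[6, 6, 5], [6, 4]]
print(merges)   # {5: (1, 2), 6: (5, 3)}
```

The sequences get shorter as the vocabulary grows, which is the trade-off that the sequence length metrics returned by learn_bpe are meant to track.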

property len: Union[int, List[int]]

Returns the length of the vocabulary. If the tokenizer uses embedding pooling / has multiple vocabularies, it will return the list of their lengths. Use the miditok.MIDITokenizer.__len__() magic method (len(tokenizer)) to get the sum of the lengths.

Returns:: length of the vocabulary.

load_params(config_file_path: Union[str, Path])

Loads the parameters of the tokenizer from a config file.

Parameters:: config_file_path – path to the tokenizer config file (encoded as json).

static load_tokens(path: Union[str, Path]) → Union[List[Any], Dict]

Loads tokens saved as JSON files.

Parameters:: path – path of the file to load.
Returns:: the tokens, with the associated information saved with.

midi_to_tokens(midi: MidiFile, *args, **kwargs) → List[List[Union[int, List[int]]]]

Tokenize a MIDI file. If you override this method, be sure to keep the first lines in your method.

Parameters:: midi – the MIDI object to convert.
Returns:: sequences of tokens.

preprocess_midi(midi: MidiFile)

Pre-process (in place) a MIDI file to quantize its time and note attributes before tokenizing it. Its note attributes (times, pitches, velocities) will be quantized and sorted, and duplicated notes removed. Tempo changes are quantized and deduplicated in the same way. Empty tracks (with no notes) will be removed from the MIDI object. Notes with pitches outside of self.pitch_range will be deleted.

Parameters:: midi – MIDI object to preprocess.
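
The note clean-up described above (dropping out-of-range pitches, sorting, removing duplicates) might look like the following. ToyNote and filter_sort_dedupe are hypothetical stand-ins, and treating "same onset and same pitch" as a duplicate is an assumption made for this sketch.

```python
from typing import List, NamedTuple


class ToyNote(NamedTuple):
    """Hypothetical stand-in for a note: start/end in ticks, pitch and velocity."""
    start: int
    end: int
    pitch: int
    velocity: int


def filter_sort_dedupe(notes: List[ToyNote], pitch_range: range) -> List[ToyNote]:
    # Drop notes whose pitch falls outside the configured range.
    kept = [n for n in notes if n.pitch in pitch_range]
    # Sort by onset, then pitch, then end time.
    kept.sort(key=lambda n: (n.start, n.pitch, n.end))
    # Assumption: two notes with the same onset and pitch are duplicates.
    seen, unique = set(), []
    for note in kept:
        key = (note.start, note.pitch)
        if key not in seen:
            seen.add(key)
            unique.append(note)
    return unique


notes = [
    ToyNote(0, 100, 60, 90),
    ToyNote(0, 120, 60, 80),   # duplicate of the note above
    ToyNote(50, 90, 110, 70),  # pitch outside range(21, 109)
    ToyNote(0, 50, 55, 64),
]
cleaned = filter_sort_dedupe(notes, range(21, 109))
print([(n.start, n.pitch) for n in cleaned])  # [(0, 55), (0, 60)]
```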

quantize_notes(notes: List[Note], time_division: int)

Quantize the notes' attributes: their pitch, velocity, start and end values. It shifts the notes so that they start at times that match the time resolution (e.g. 16 samples per bar). Notes with pitches outside of self.pitch_range will be deleted.

Parameters:

notes – notes to quantize.
time_division – MIDI time division / resolution, in ticks/beat (of the MIDI being parsed).
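
As an illustration of snapping note times to the grid, here is a minimal sketch. It assumes that time_division is divisible by the chosen resolution, that ties round up, and that a note collapsing to zero length is stretched to one sample; quantize_tick and quantize_note are hypothetical helpers, not MidiTok functions.

```python
from typing import Tuple


def quantize_tick(tick: int, time_division: int, samples_per_beat: int) -> int:
    """Snap a tick to the nearest grid point (ties round up)."""
    ticks_per_sample = time_division // samples_per_beat
    return (tick + ticks_per_sample // 2) // ticks_per_sample * ticks_per_sample


def quantize_note(start: int, end: int, time_division: int, samples_per_beat: int) -> Tuple[int, int]:
    """Quantize a note's start and end, keeping a duration of at least one sample."""
    ticks_per_sample = time_division // samples_per_beat
    q_start = quantize_tick(start, time_division, samples_per_beat)
    q_end = quantize_tick(end, time_division, samples_per_beat)
    if q_end <= q_start:
        q_end = q_start + ticks_per_sample
    return q_start, q_end


# 384 ticks per beat at 8 samples per beat gives a grid step of 48 ticks.
print(quantize_tick(100, 384, 8))       # 96
print(quantize_tick(130, 384, 8))       # 144
print(quantize_note(100, 110, 384, 8))  # (96, 144)
```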

quantize_tempos(tempos: List[TempoChange], time_division: int)

Quantize the times and tempo values of tempo change events. Consecutive identical tempo changes will be removed.

Parameters:

tempos – tempo changes to quantize.
time_division – MIDI time division / resolution, in ticks/beat (of the MIDI being parsed).
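
A rough sketch of the tempo side, assuming the nb_tempos values are evenly spaced across tempo_range (the exact spacing MidiTok uses is not stated here). Only the tempo values are handled; the time quantization of each change is left out. toy_tempo_bins and toy_quantize_tempos are hypothetical names.

```python
from typing import List, Tuple


def toy_tempo_bins(tempo_range: Tuple[int, int] = (40, 250), nb_tempos: int = 32) -> List[float]:
    """Evenly spaced tempo values covering tempo_range (assumed spacing)."""
    low, high = tempo_range
    step = (high - low) / (nb_tempos - 1)
    return [low + i * step for i in range(nb_tempos)]


def toy_quantize_tempos(changes: List[Tuple[int, float]], bins: List[float]) -> List[Tuple[int, float]]:
    """Snap each (tick, bpm) to the nearest bin and drop consecutive identical tempos."""
    out: List[Tuple[int, float]] = []
    for tick, bpm in changes:
        snapped = min(bins, key=lambda b: abs(b - bpm))
        if out and out[-1][1] == snapped:
            continue  # same tempo as the previous change: redundant
        out.append((tick, snapped))
    return out


bins = toy_tempo_bins()
changes = [(0, 120.0), (480, 121.0), (960, 90.0)]
quantized = toy_quantize_tempos(changes, bins)
print([tick for tick, _ in quantized])  # [0, 960]
```

Here 120 and 121 BPM fall into the same bin, so the second change is dropped as redundant.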

static quantize_time_signatures(time_sigs: List[TimeSignature], time_division: int)

Quantize the time signature changes, delayed to the next bar. See MIDI 1.0 Detailed specifications, pages 54 - 56, for more information on delayed time signature messages.

Parameters:

time_sigs – time signature changes to quantize.
time_division – MIDI time division / resolution, in ticks/beat (of the MIDI being parsed).
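
Delaying a time signature change to the next bar can be pictured as follows. This is a simplified sketch, assuming the changes are sorted by time and given as (tick, numerator, denominator) tuples; delay_to_next_bar is a hypothetical helper and ignores edge cases such as two changes landing on the same bar.

```python
from typing import List, Tuple

TimeSig = Tuple[int, int, int]  # (tick, numerator, denominator)


def delay_to_next_bar(time_sigs: List[TimeSig], time_division: int) -> List[TimeSig]:
    """Move each change to the next bar line of the previously active time signature."""
    if not time_sigs:
        return []
    delayed = [time_sigs[0]]
    for tick, num, den in time_sigs[1:]:
        prev_tick, prev_num, prev_den = delayed[-1]
        # Length of one bar, in ticks, under the time signature still in effect.
        ticks_per_bar = time_division * 4 * prev_num // prev_den
        elapsed = tick - prev_tick
        bars = -(-elapsed // ticks_per_bar)  # ceiling division
        delayed.append((prev_tick + bars * ticks_per_bar, num, den))
    return delayed


changes = [(0, 4, 4), (2000, 3, 4), (5000, 6, 8)]
print(delay_to_next_bar(changes, time_division=480))
# [(0, 4, 4), (3840, 3, 4), (5280, 6, 8)]
```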

save_params(out_path: Union[str, Path], additional_attributes: Optional[Dict] = None)

Saves the config / parameters of the tokenizer in a json encoded file. This can be useful to keep track of how a dataset has been tokenized. Note: if you override this method, you should probably call it (super()) at the end and use the additional_attributes argument.

Parameters:

out_path – output path to save the file.
additional_attributes – any additional information to store in the config file. It can be used to override the default attributes saved in the parent method. (default: None)

property special_tokens: List[int]

Returns the list of special tokens: PAD, SOS, EOS, MASK, SEP

Returns:: list of special tokens

tokenize_midi_dataset(midi_paths: Union[List[str], List[Path]], out_dir: Union[str, Path], validation_fn: Optional[Callable[[MidiFile], bool]] = None, data_augment_offsets=None, save_programs: bool = True, logging: bool = True)

Converts a dataset / list of MIDI files into their token versions and saves them as JSON files. The resulting JSON files will have the shape (T, *): the first dimension is tracks, the second tokens. If save_programs is True, the shape will be [(T, *), (T, 2)]: the first element holds the tokens and the second the programs; for programs, the first value is the program and the second a bool indicating whether the track is drums. The config of the tokenizer will be saved as a “config.txt” file by default.

Parameters:

midi_paths – paths of the MIDI files.
out_dir – output directory to save the converted files.
validation_fn – a function checking if the MIDI is valid on your requirements (e.g. time signature, minimum/maximum length, instruments …).
data_augment_offsets – data augmentation arguments, to be passed to the miditok.data_augmentation.data_augmentation_dataset method. Has to be given as a list / tuple of offsets for pitch octaves, velocities and durations, and finally their directions (up/down). (default: None)
save_programs – will also save the programs of the tracks of the MIDI. (default: True)
logging – displays a progress bar. (default: True)
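
A validation_fn is any callable that takes the MIDI object and returns a bool. The sketch below is one possible example: it assumes the MIDI object exposes ticks_per_beat, max_tick and time_signature_changes (attribute names as I understand them from miditoolkit), and the 8-beat minimum is an arbitrary threshold chosen for illustration.

```python
from types import SimpleNamespace


def is_valid_midi(midi) -> bool:
    """Keep only MIDIs that are at least 8 beats long and stay in a 4/x time signature."""
    # Hypothetical threshold: reject files shorter than 8 beats.
    if midi.max_tick < 8 * midi.ticks_per_beat:
        return False
    # Reject files containing any time signature whose numerator is not 4.
    return all(ts.numerator == 4 for ts in midi.time_signature_changes)


# Fake objects standing in for parsed MIDI files, for demonstration only.
ok_midi = SimpleNamespace(
    ticks_per_beat=480,
    max_tick=480 * 16,
    time_signature_changes=[SimpleNamespace(numerator=4, denominator=4)],
)
waltz = SimpleNamespace(
    ticks_per_beat=480,
    max_tick=480 * 16,
    time_signature_changes=[SimpleNamespace(numerator=3, denominator=4)],
)
print(is_valid_midi(ok_midi), is_valid_midi(waltz))  # True False
```

Following the signature above, such a function would be passed as tokenizer.tokenize_midi_dataset(midi_paths, out_dir, validation_fn=is_valid_midi).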

tokens_to_events(tokens: List[Union[int, List[int]]]) → List[Union[Event, List[Event]]]

Convert a sequence of tokens in their respective event objects. BPE tokens will be decoded.

Parameters:: tokens – sequence of tokens to convert.
Returns:: the sequence of corresponding events.

abstract tokens_to_track(tokens: List[Union[int, List[int]]], time_division: Optional[int] = 384, program: Optional[Tuple[int, bool]] = (0, False)) → Tuple[Instrument, List[TempoChange]]

Converts a sequence of tokens into a track object. This method is unimplemented and needs to be overridden by inheriting classes.

Parameters:

tokens – sequence of tokens to convert.
time_division – MIDI time division / resolution, in ticks/beat (of the MIDI to create).
program – the MIDI program of the produced track, and whether it is a drum track. (default: (0, False), piano)

Returns:

the miditoolkit instrument object and the possible tempo changes.

abstract track_to_tokens(track: Instrument) → List[Union[int, List[int]]]

Converts a track (miditoolkit.Instrument object) into a sequence of tokens. This method is unimplemented and needs to be overridden by inheriting classes.

Parameters:: track – MIDI track to convert.
Returns:: sequence of corresponding tokens.

Additional tokens

MidiTok offers to include additional tokens on music information. You can specify them in the additional_tokens argument when creating a tokenizer.

Chords: indicates the presence of a chord at a certain time step. MidiTok uses a chord detection method based on onset times and durations. This allows MidiTok to detect chords precisely and without ambiguity, whereas most chord detection methods in symbolic music, based on chroma features, can’t.
Rests: includes Rest tokens whenever a portion of time is silent, i.e. no note is being played. This token type is decoded as a TimeShift event. You can choose the minimum and maximum rest values to represent with the rest_range key in the additional_tokens dictionary (default is 1/2 beat to 8 beats). Note that rests shorter than one beat are only divisible by the first beat resolution, e.g. a rest of 5/8th of a beat will be a succession of Rest_0.4 and Rest_0.1, where the first number indicates the rest duration in beats and the second in samples / positions.
Tempos: specifies the current tempo. This makes it possible to train a model to predict tempo changes. Tempo values are quantized according to the nb_tempos and tempo_range entries in the additional_tokens dictionary (default is 32 tempos from 40 to 250).
Programs: used to specify an instrument / MIDI program. MidiTok only offers the possibility to include these tokens in the vocabulary for you, but won’t use them. If you need to model multitrack symbolic music with methods other than Octuple / MuMIDI, MidiTok leaves you the choice / task of representing the track information the way you want. You can do it as in LakhNES or MMM.
Time Signature: specifies the current time signature. Only implemented with Octuple in MidiTok at the time of writing.

Compatibility table of tokenizations and additional tokens.
Token type	REMI	MIDI-Like	TSD	Structured	CPWord	Octuple	MuMIDI
Chord	✅	✅	✅	✅	❌	❌	✅
Rest	✅	✅	✅	✅	❌	❌	❌
Tempo	✅	✅	✅	✅	❌	✅	✅
Program	✅	✅	✅	✅	✅	✅	✅
Time signature	❌	❌	❌	❌	❌	✅	❌

Special tokens

MidiTok can include some special tokens in the vocabulary. To use them, you must specify them when creating a tokenizer (constructor arguments). These are:

pad (default True) –> PAD_None: a padding token to use when training a model with batches of sequences of unequal lengths. The padding token will be at index 0 of the vocabulary.
sos_eos (default False) –> SOS_None and EOS_None: “Start Of Sequence” and “End Of Sequence” tokens, designed to be placed respectively at the beginning and end of a token sequence during training. At inference, the EOS token tells when to end the generation.
mask (default False) –> MASK_None: a masking token, to use when pre-training a (bidirectional) model with a self-supervised objective like BERT.
sep (default: False) –> SEP_None: a token to use as a separation between sequences.

Note: you can use the tokenizer.special_tokens property to get the list of the special tokens of a tokenizer.
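
To show how these tokens are typically used when batching, here is a small sketch. PAD at index 0 follows the text above; the SOS and EOS ids (1 and 2) and the helper names are assumptions made for the example.

```python
from typing import List, Tuple

PAD_ID = 0             # per the text above, PAD_None sits at index 0
SOS_ID, EOS_ID = 1, 2  # hypothetical ids, chosen only for this example


def add_sos_eos(seq: List[int], sos_id: int = SOS_ID, eos_id: int = EOS_ID) -> List[int]:
    """Return a new sequence with SOS at the beginning and EOS at the end."""
    return [sos_id] + seq + [eos_id]


def pad_batch(seqs: List[List[int]], pad_id: int = PAD_ID) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad sequences to a common length; also return a 1/0 attention mask."""
    max_len = max(len(s) for s in seqs)
    padded = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return padded, mask


batch = [add_sos_eos([5, 6, 7]), add_sos_eos([8])]
padded, mask = pad_batch(batch)
print(padded)  # [[1, 5, 6, 7, 2], [1, 8, 2, 0, 0]]
print(mask)    # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```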

Vocabulary

The Vocabulary class acts as a lookup table, linking tokens (Pitch…) to their index (integer). The vocabulary is an attribute of the tokenizer and can be accessed with tokenizer.vocab. For tokenizations with embedding pooling (e.g. CPWord or Octuple), tokenizer.vocab will be a list of Vocabulary objects, and the tokenizer.is_multi_voc property will be True.

class miditok.Vocabulary(pad: bool = True, mask: bool = False, sos_eos: bool = False, sep: bool = False, events: Optional[List[Union[str, Event]]] = None)

Vocabulary class. Get an element of the vocabulary from its index, such as:

token = vocab['Pitch_80']  # gets the token of this event
event = vocab[140]  # gets the event corresponding to token 140

Use miditok.Vocabulary.add_event() or the += operator to add an event to the vocab.

Parameters:

pad – will include a PAD token, used when training a model with batches of sequences of unequal lengths, and usually at index 0 of the vocabulary. If this argument is set to True, the PAD token will be at index 0. (default: True)
sos_eos – will include Start Of Sequence (SOS) and End Of Sequence (EOS) tokens. (default: False)
mask – will add a MASK token to the vocabulary. (default: False)
sep – will add a SEP token to the vocabulary. (default: False)
events – a list of events to add to the vocabulary when creating it. (default: None)

add_event(event: Union[Event, str, Generator])

Adds one or multiple entries to the vocabulary. This method accepts generators. The index of the newly added event will be equal to the length of the vocabulary at the moment of adding it.

Parameters:: event – event to add, either as an Event object or string of the form “Type_Value”, e.g. Pitch_80
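
The lookup-table behaviour can be mimicked with two dictionaries. ToyVocabulary below is a hypothetical, stripped-down stand-in (not the miditok.Vocabulary class) that only reproduces what the text describes: bidirectional indexing, PAD at index 0, and new events taking the current vocabulary length as their index. The _token_to_event name is an assumption.

```python
from typing import Dict, List, Union


class ToyVocabulary:
    """Minimal two-way lookup between "Type_Value" events and integer tokens."""

    def __init__(self, pad: bool = True) -> None:
        self._event_to_token: Dict[str, int] = {}
        self._token_to_event: Dict[int, str] = {}
        if pad:
            self.add_event("PAD_None")  # added first, so it gets index 0

    def add_event(self, event: str) -> None:
        if event in self._event_to_token:
            return  # already known, nothing to do
        token = len(self._event_to_token)  # index = current vocabulary length
        self._event_to_token[event] = token
        self._token_to_event[token] = event

    def __getitem__(self, item: Union[int, str]) -> Union[int, str]:
        if isinstance(item, str):
            return self._event_to_token[item]
        return self._token_to_event[item]

    def __len__(self) -> int:
        return len(self._event_to_token)

    def tokens_of_type(self, token_type: str) -> List[int]:
        return [t for e, t in self._event_to_token.items() if e.split("_")[0] == token_type]


vocab = ToyVocabulary()
for pitch in range(21, 24):
    vocab.add_event(f"Pitch_{pitch}")
print(vocab["Pitch_21"], vocab[3], len(vocab))  # 1 Pitch_23 4
```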

token_type(token: int) → str

Returns the type of the given token.

Parameters:: token – token to get type from
Returns:: the type of the token, as a string

tokens_of_type(token_type: str) → List[int]

Returns the list of tokens of the given type.

Parameters:: token_type – token type to get the associated tokens
Returns:: list of tokens

update_token_types_indexes(): Updates the _token_types_indexes attribute according to _event_to_token.

Magic methods

Magic methods allow intuitive access to a tokenizer’s attributes and methods. We list them here with some examples.

miditok.MIDITokenizer.__call__(self, obj: Any, *args, **kwargs)

Automatically tokenize a MIDI file, or detokenize a sequence of tokens. This will call the miditok.MIDITokenizer.midi_to_tokens() method if you provide a MIDI object, or the miditok.MIDITokenizer.tokens_to_midi() method otherwise.

Parameters:: obj – a MIDI object or sequence of tokens.
Returns:: the converted object.

tokens = tokenizer(midi)
midi2 = tokenizer(tokens)

miditok.MIDITokenizer.__getitem__(self, item: Union[int, str, Tuple[int, Union[int, str]]]) → Union[str, int]

Convert a token (int) to an event (str), or vice-versa.

Parameters:: item – a token (int) or an event (str). For embedding pooling, you must provide a tuple where the first element is the index of the vocabulary.
Returns:: the converted object.

pad_token = tokenizer["PAD_None"]

miditok.MIDITokenizer.__len__(self) → int

Returns the length of the vocabulary. If the tokenizer uses embedding pooling / has multiple vocabularies, it will return the sum of their lengths. Use the miditok.MIDITokenizer.len property (tokenizer.len) to get the list of lengths.

Returns:: length of the vocabulary.

nb_classes = len(tokenizer)
nb_classes_per_vocab = tokenizer.len  # applicable to tokenizer with embedding pooling, e.g. CPWord or Octuple

miditok.MIDITokenizer.__eq__(self, other) → bool

Checks if two tokenizers are identical. This is done by comparing their vocabularies, as they are built depending on most of their attributes.

Parameters:: other – tokenizer to compare.
Returns:: True if the vocabulary(ies) are identical, False otherwise.

if tokenizer1 == tokenizer2:
    print("The tokenizers have the same vocabulary!")

Save / Load tokenizer

You can save and load a tokenizer’s parameters and vocabulary. This is especially useful to track tokenized datasets, and to save tokenizers with vocabularies learned with Byte Pair Encoding (BPE).

miditok.MIDITokenizer.save_params(self, out_path: Union[str, Path], additional_attributes: Optional[Dict] = None)

Saves the config / parameters of the tokenizer in a json encoded file. This can be useful to keep track of how a dataset has been tokenized. Note: if you override this method, you should probably call it (super()) at the end and use the additional_attributes argument.

Parameters:

out_path – output path to save the file.
additional_attributes – any additional information to store in the config file. It can be used to override the default attributes saved in the parent method. (default: None)

miditok.MIDITokenizer.load_params(self, config_file_path: Union[str, Path])

Loads the parameters of the tokenizer from a config file.

Parameters:: config_file_path – path to the tokenizer config file (encoded as json).
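
One practical detail when writing a tokenizer config to JSON is that tuple keys (as in beat_res) and range objects are not JSON-serializable as such. The round trip below is only a sketch of one way to handle this: the "start_end" key format and the helper names are assumptions, not the file layout that save_params actually writes.

```python
import json
import tempfile
from pathlib import Path


def save_toy_params(params: dict, out_path: Path) -> None:
    """Write params as JSON, converting tuple keys and ranges to JSON-friendly forms."""
    data = dict(params)
    # Assumed encoding: (0, 4) -> "0_4"
    data["beat_res"] = {f"{a}_{b}": res for (a, b), res in params["beat_res"].items()}
    data["pitch_range"] = [params["pitch_range"].start, params["pitch_range"].stop]
    out_path.write_text(json.dumps(data, indent=2))


def load_toy_params(path: Path) -> dict:
    """Read the JSON file back and restore the original Python types."""
    data = json.loads(path.read_text())
    data["beat_res"] = {
        tuple(int(x) for x in key.split("_")): res for key, res in data["beat_res"].items()
    }
    data["pitch_range"] = range(*data["pitch_range"])
    return data


params = {
    "pitch_range": range(21, 109),
    "beat_res": {(0, 4): 8, (4, 12): 4},
    "nb_velocities": 32,
}
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "config.txt"
    save_toy_params(params, config_path)
    restored = load_toy_params(config_path)
print(restored == params)  # True
```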

Limitations

Tokenizations using Bar tokens (REMI, CPWord and MuMIDI) only consider a 4/x time signature for now, meaning that each bar is treated as covering 4 beats. Octuple is the exception, as it supports time signature tokens.