Tokenizations

This page details the tokenizations featured by MidiTok. They all inherit from miditok.MIDITokenizer; see its documentation to learn how to use the common methods. For each of them, the token equivalent of the lead sheet below is shown.

REMI

class miditok.REMI(pitch_range: range = range(21, 109), beat_res: Dict[Tuple[int, int], int] = {(0, 4): 8, (4, 12): 4}, nb_velocities: int = 32, additional_tokens: Dict[str, Union[bool, int]] = {'Chord': False, 'Program': False, 'Rest': False, 'Tempo': False, 'TimeSignature': False, 'nb_tempos': 32, 'rest_range': (2, 8), 'tempo_range': (40, 250), 'time_signature_range': (8, 2)}, pad: bool = True, sos_eos: bool = False, mask: bool = False, sep: bool = False, params: Optional[Union[Path, str]] = None)

Bases: MIDITokenizer

REMI, standing for Revamped MIDI and introduced with the Pop Music Transformer (Huang and Yang), is a tokenization that represents notes as successions of Pitch, Velocity and Duration tokens, and time with Bar and Position tokens. A Bar token indicates that a new bar is beginning, and a Position token gives the current position within the current bar. The number of positions is determined by the beat_res argument: its maximum value is used as the resolution. NOTE: in the original paper, the tempo information is represented as a succession of two token types: a TempoClass indicating whether the tempo is fast or slow, and a TempoValue indicating its value. MidiTok only uses one Tempo token for its value (see Additional tokens).
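
As a rough illustration of the Bar / Position / Pitch / Velocity / Duration layout described above, here is a small sketch that does not depend on MidiTok. The Note tuple, the remi_like_tokens helper and the token string format are assumptions made for this example, and it assumes 4/4 bars with 8 positions per beat.

```python
from typing import List, NamedTuple


class Note(NamedTuple):
    start: int     # onset, counted in positions from the start of the track
    pitch: int     # MIDI pitch
    velocity: int  # MIDI velocity
    duration: int  # length, counted in positions


POSITIONS_PER_BAR = 4 * 8  # assumption: 4 beats per bar, 8 positions per beat


def remi_like_tokens(notes: List[Note]) -> List[str]:
    """Return Bar / Position / Pitch / Velocity / Duration tokens for the notes."""
    tokens: List[str] = []
    current_bar = -1
    current_pos = -1
    for note in sorted(notes, key=lambda n: (n.start, n.pitch)):
        bar, pos = divmod(note.start, POSITIONS_PER_BAR)
        # Emit one Bar token per bar boundary crossed, including empty bars.
        while current_bar < bar:
            tokens.append("Bar")
            current_bar += 1
            current_pos = -1
        # Emit a Position token only when the position changes.
        if pos != current_pos:
            tokens.append(f"Position_{pos}")
            current_pos = pos
        tokens += [
            f"Pitch_{note.pitch}",
            f"Velocity_{note.velocity}",
            f"Duration_{note.duration}",
        ]
    return tokens


remi_example = remi_like_tokens(
    [Note(0, 60, 100, 8), Note(0, 64, 100, 8), Note(40, 67, 90, 4)]
)
print(remi_example)
```

The two simultaneous notes share a single Position token, and the third note, which starts in the second bar, is preceded by a new Bar token.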

Parameters:

pitch_range – range of MIDI pitches to use
beat_res – beat resolutions, as a dictionary: {(beat_x1, beat_x2): beat_res_1, (beat_x2, beat_x3): beat_res_2, …} The keys are tuples indicating a range of beats, ex 0 to 3 for the first bar, and the values are the resolution to apply to the ranges, in samples per beat, ex 8
nb_velocities – number of velocity bins
additional_tokens – additional tokens (chords, time signature, rests, tempo…) to use, to be given as a dictionary. (default: None is used)
pad – will add a special PAD token to the vocabulary, to use to pad sequences when training a model with batches of different sequence lengths. (default: True)
sos_eos – adds special Start Of Sequence (SOS) and End Of Sequence (EOS) tokens to the vocabulary. (default: False)
mask – will add a special MASK token to the vocabulary (default: False)
sep – will add a special SEP token to the vocabulary (default: False)
params – path to a tokenizer config file. This will override other arguments and load the tokenizer based on the config file. This is particularly useful if the tokenizer learned Byte Pair Encoding. (default: None)
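
To make the beat_res format more concrete, the following sketch inspects the default dictionary from the signature above. It only performs arithmetic on the dictionary and does not reproduce how MidiTok consumes it internally; the 4/4 time signature used to count positions per bar is an assumption.

```python
from typing import Dict, Tuple

# Default value taken from the class signature above.
beat_res: Dict[Tuple[int, int], int] = {(0, 4): 8, (4, 12): 4}

# The maximum resolution determines the number of positions.
max_res = max(beat_res.values())
positions_per_bar = 4 * max_res  # assumption: 4 beats per bar

# Number of samples covered by each beat range: (end - start) * resolution.
samples_per_range = {
    beat_range: (beat_range[1] - beat_range[0]) * res
    for beat_range, res in beat_res.items()
}

print(max_res, positions_per_bar, samples_per_range)
```

With the defaults, beats 0 to 4 are sampled 8 times per beat and beats 4 to 12 are sampled 4 times per beat, so both ranges contain the same number of samples.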

MIDI-Like

class miditok.MIDILike(pitch_range: range = range(21, 109), beat_res: Dict[Tuple[int, int], int] = {(0, 4): 8, (4, 12): 4}, nb_velocities: int = 32, additional_tokens: Dict[str, bool] = {'Chord': False, 'Program': False, 'Rest': False, 'Tempo': False, 'TimeSignature': False, 'nb_tempos': 32, 'rest_range': (2, 8), 'tempo_range': (40, 250), 'time_signature_range': (8, 2)}, pad: bool = True, sos_eos: bool = False, mask: bool = False, sep: bool = False, params: Optional[Union[Path, str]] = None)

Bases: MIDITokenizer

Introduced in This time with feeling (Oore et al.) and later used with Music Transformer (Huang et al.) and MT3 (Gardner et al.), this tokenization simply converts MIDI messages (NoteOn, NoteOff, TimeShift…) into tokens, hence the name “MIDI-Like”. Note: as MIDI-Like uses TimeShift events to move the time from note to note, it may be unsuited for tracks with long pauses. In such cases, the maximum TimeShift value will be used.
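
The idea can be sketched without MidiTok as follows. The midi_like_tokens helper, the tuple layout and the token strings are made up for the example. MAX_TIME_SHIFT is a hypothetical value, and clipping a long gap to it is one reading of the note above, not a confirmed description of the library's behaviour.

```python
from typing import List, Tuple

MAX_TIME_SHIFT = 32  # hypothetical largest TimeShift value, in ticks


def midi_like_tokens(notes: List[Tuple[int, int, int, int]]) -> List[str]:
    """notes: (start, end, pitch, velocity) tuples with times in ticks."""
    # Flatten notes into timestamped events. Kind 0 (NoteOff) sorts before
    # kind 1 (NoteOn) when both happen at the same time.
    events = []
    for start, end, pitch, velocity in notes:
        events.append((start, 1, pitch, velocity))
        events.append((end, 0, pitch, velocity))
    events.sort()

    tokens: List[str] = []
    now = 0
    for time, kind, pitch, velocity in events:
        if time > now:
            # Assumption: a gap longer than the maximum is clipped to it.
            tokens.append(f"TimeShift_{min(time - now, MAX_TIME_SHIFT)}")
            now = time
        if kind == 1:
            tokens += [f"NoteOn_{pitch}", f"Velocity_{velocity}"]
        else:
            tokens.append(f"NoteOff_{pitch}")
    return tokens


midi_like_example = midi_like_tokens(
    [(0, 8, 60, 100), (8, 16, 62, 100), (100, 108, 64, 90)]
)
print(midi_like_example)
```

The 84-tick pause before the last note exceeds the assumed maximum, so it is emitted as a single TimeShift_32 token and the original gap is not preserved.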

Parameters:

pitch_range – range of MIDI pitches to use
beat_res – beat resolutions, as a dictionary: {(beat_x1, beat_x2): beat_res_1, (beat_x2, beat_x3): beat_res_2, …} The keys are tuples indicating a range of beats, ex 0 to 3 for the first bar, and the values are the resolution to apply to the ranges, in samples per beat, ex 8
nb_velocities – number of velocity bins
additional_tokens – additional tokens (chords, time signature, rests, tempo…) to use, to be given as a dictionary. (default: None is used)
pad – will add a special PAD token to the vocabulary, to use to pad sequences when training a model with batches of different sequence lengths. (default: True)
sos_eos – adds special Start Of Sequence (SOS) and End Of Sequence (EOS) tokens to the vocabulary. (default: False)
mask – will add a special MASK token to the vocabulary (default: False)
sep – will add a special SEP token to the vocabulary (default: False)
params – path to a tokenizer config file. This will override other arguments and load the tokenizer based on the config file. This is particularly useful if the tokenizer learned Byte Pair Encoding. (default: None)

TSD

TSD sequence, like MIDI-Like with Duration tokens

class miditok.TSD(pitch_range: range = range(21, 109), beat_res: Dict[Tuple[int, int], int] = {(0, 4): 8, (4, 12): 4}, nb_velocities: int = 32, additional_tokens: Dict[str, bool] = {'Chord': False, 'Program': False, 'Rest': False, 'Tempo': False, 'TimeSignature': False, 'nb_tempos': 32, 'rest_range': (2, 8), 'tempo_range': (40, 250), 'time_signature_range': (8, 2)}, pad: bool = True, sos_eos: bool = False, mask: bool = False, sep: bool = False, params: Optional[Union[Path, str]] = None)

Bases: MIDITokenizer

TSD, for Time Shift Duration, is similar to MIDI-Like but uses explicit Duration tokens to represent note durations, which have shown better results than NoteOff tokens. Note: as TSD uses TimeShift events to move the time from note to note, it may be unsuited for tracks with long pauses. In such cases, the maximum TimeShift value will be used.
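
A minimal, MidiTok-independent sketch of this layout is given below. The tsd_like_tokens helper and the token strings are assumptions made for the example; the point is only that each note carries its own Duration token, so no NoteOff token is needed.

```python
from typing import List, Tuple


def tsd_like_tokens(notes: List[Tuple[int, int, int, int]]) -> List[str]:
    """notes: (start, duration, pitch, velocity) tuples with times in ticks."""
    tokens: List[str] = []
    now = 0
    for start, duration, pitch, velocity in sorted(notes):
        if start > now:
            # Move the time forward to the onset of the next note.
            tokens.append(f"TimeShift_{start - now}")
            now = start
        tokens += [f"Pitch_{pitch}", f"Velocity_{velocity}", f"Duration_{duration}"]
    return tokens


tsd_example = tsd_like_tokens([(0, 8, 60, 100), (8, 8, 62, 100)])
print(tsd_example)
```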

Parameters:

pitch_range – range of MIDI pitches to use
beat_res – beat resolutions, as a dictionary: {(beat_x1, beat_x2): beat_res_1, (beat_x2, beat_x3): beat_res_2, …} The keys are tuples indicating a range of beats, ex 0 to 3 for the first bar, and the values are the resolution to apply to the ranges, in samples per beat, ex 8
nb_velocities – number of velocity bins
additional_tokens – additional tokens (chords, time signature, rests, tempo…) to use, to be given as a dictionary. (default: None is used)
pad – will add a special PAD token to the vocabulary, to use to pad sequences when training a model with batches of different sequence lengths. (default: True)
sos_eos – adds special Start Of Sequence (SOS) and End Of Sequence (EOS) tokens to the vocabulary. (default: False)
mask – will add a special MASK token to the vocabulary (default: False)
sep – will add a special SEP token to the vocabulary (default: False)
params – path to a tokenizer config file. This will override other arguments and load the tokenizer based on the config file. This is particularly useful if the tokenizer learned Byte Pair Encoding. (default: None)

Structured

class miditok.Structured(pitch_range: range = range(21, 109), beat_res: Dict[Tuple[int, int], int] = {(0, 4): 8, (4, 12): 4}, nb_velocities: int = 32, additional_tokens: Dict[str, Union[bool, int]] = {'Chord': False, 'Program': False, 'Rest': False, 'Tempo': False, 'TimeSignature': False, 'nb_tempos': 32, 'rest_range': (2, 8), 'tempo_range': (40, 250), 'time_signature_range': (8, 2)}, pad: bool = True, sos_eos: bool = False, mask: bool = False, sep: bool = False, params: Optional[Union[Path, str]] = None)

Bases: MIDITokenizer

Introduced with the Piano Inpainting Application, it is similar to TSD but is based on a consistent succession of token types. Token types always follow the same pattern: Pitch -> Velocity -> Duration -> TimeShift. The latter is set to 0 for simultaneous notes. To keep this property, no additional token can be inserted in MidiTok’s implementation, except Program tokens, which can be added to the vocabulary and are up to you to use. Note: as Structured uses TimeShift events to move the time from note to note, it may be unsuited for tracks with long pauses. In such cases, the maximum TimeShift value will be used.
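
The fixed pattern can be sketched as follows, independently of MidiTok. The structured_like_tokens helper and the token strings are assumptions. Emitting TimeShift_0 after the final note is a choice made in this sketch to keep the four-token pattern intact, not something the text above specifies.

```python
from typing import List, Tuple


def structured_like_tokens(notes: List[Tuple[int, int, int, int]]) -> List[str]:
    """notes: (start, duration, pitch, velocity) tuples with times in ticks."""
    ordered = sorted(notes)
    tokens: List[str] = []
    for i, (start, duration, pitch, velocity) in enumerate(ordered):
        # Shift to the next note: 0 for simultaneous notes.
        # Assumption of this sketch: 0 is also used after the last note.
        next_start = ordered[i + 1][0] if i + 1 < len(ordered) else start
        tokens += [
            f"Pitch_{pitch}",
            f"Velocity_{velocity}",
            f"Duration_{duration}",
            f"TimeShift_{next_start - start}",
        ]
    return tokens


structured_example = structured_like_tokens(
    [(0, 8, 60, 100), (0, 8, 64, 100), (8, 4, 67, 90)]
)
print(structured_example)
```

Because every note yields exactly four tokens in the same order, the type of any token can be derived from its index in the sequence.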

Parameters:

pitch_range – range of MIDI pitches to use
beat_res – beat resolutions, as a dictionary: {(beat_x1, beat_x2): beat_res_1, (beat_x2, beat_x3): beat_res_2, …} The keys are tuples indicating a range of beats, ex 0 to 3 for the first bar, and the values are the resolution to apply to the ranges, in samples per beat, ex 8
nb_velocities – number of velocity bins
additional_tokens – additional tokens (chords, time signature, rests, tempo…) to use, to be given as a dictionary. (default: None is used)
pad – will add a special PAD token to the vocabulary, to use to pad sequences when training a model with batches of different sequence lengths. (default: True)
sos_eos – adds special Start Of Sequence (SOS) and End Of Sequence (EOS) tokens to the vocabulary. (default: False)
mask – will add a special MASK token to the vocabulary (default: False)
sep – will add a special SEP token to the vocabulary (default: False)
params – path to a tokenizer config file. This will override other arguments and load the tokenizer based on the config file. This is particularly useful if the tokenizer learned Byte Pair Encoding. (default: None)

CPWord

class miditok.CPWord(pitch_range: range = range(21, 109), beat_res: Dict[Tuple[int, int], int] = {(0, 4): 8, (4, 12): 4}, nb_velocities: int = 32, additional_tokens: Dict[str, bool] = {'Chord': False, 'Program': False, 'Rest': False, 'Tempo': False, 'TimeSignature': False, 'nb_tempos': 32, 'rest_range': (2, 8), 'tempo_range': (40, 250), 'time_signature_range': (8, 2)}, pad: bool = True, sos_eos: bool = False, mask: bool = False, sep: bool = False, params: Optional[Union[Path, str]] = None)

Bases: MIDITokenizer

Introduced with the Compound Word Transformer (Hsiao et al.), this tokenization is similar to REMI but uses embedding pooling operations to reduce the overall sequence length: note tokens (Pitch, Velocity and Duration) are first independently converted to embeddings which are then merged (pooled) into a single one. Each compound token will be a list of the form (index: Token type):

0: Family
1: Bar/Position
2: Pitch
3: Velocity
4: Duration
(+ Optional) Program: associated with notes (pitch/velocity/duration) or chords
(+ Optional) Chord: chords occurring with position tokens
(+ Optional) Rest: rest acting as a TimeShift token
(+ Optional) Tempo: occurring with position tokens
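
Embedding pooling can be sketched in plain Python as below. This is not MidiTok code (pooling happens in the model, not in the tokenizer). The vocabulary sizes, dimensions and random weights are made up, and concatenation followed by a linear projection is only one common way to merge the per-type embeddings.

```python
import random
from typing import Dict, List

random.seed(0)

EMB_DIM = 4    # hypothetical size of each per-type embedding
MODEL_DIM = 6  # hypothetical size of the pooled embedding
# Made-up vocabulary sizes, one per token type of the compound token.
VOCAB_SIZES = {"Family": 2, "Position": 33, "Pitch": 88, "Velocity": 32, "Duration": 64}


def random_matrix(rows: int, cols: int) -> List[List[float]]:
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# One embedding table per token type.
tables = {name: random_matrix(size, EMB_DIM) for name, size in VOCAB_SIZES.items()}
# Linear projection from the concatenated embeddings to the model dimension.
projection = random_matrix(EMB_DIM * len(VOCAB_SIZES), MODEL_DIM)


def pool(compound_token: Dict[str, int]) -> List[float]:
    """Embed each field separately, concatenate, then project to one vector."""
    concatenated: List[float] = []
    for name in VOCAB_SIZES:
        concatenated += tables[name][compound_token[name]]
    return [
        sum(x * projection[i][j] for i, x in enumerate(concatenated))
        for j in range(MODEL_DIM)
    ]


pooled = pool({"Family": 1, "Position": 0, "Pitch": 39, "Velocity": 20, "Duration": 7})
print(len(pooled))
```

One compound token made of five indices becomes a single vector, so the model sees one sequence step where REMI would need several.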

The output hidden states of the model are then fed to several output layers (one per token type). Training therefore requires combining multiple losses. For generation, decoding involves sampling from several distributions, which can be very delicate. Hence, we do not recommend this tokenization for generation with small models.
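
The multiple-loss setup can be illustrated with a small pure-Python sketch. The head names, logits and targets are invented, and summing the per-head cross-entropies with equal weight is an assumption: a real training loop may weight or mask some heads.

```python
import math
from typing import Dict, List


def cross_entropy(logits: List[float], target: int) -> float:
    """Negative log-likelihood of `target` under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def compound_loss(head_logits: Dict[str, List[float]], targets: Dict[str, int]) -> float:
    """Sum the per-head losses for one compound token (equal weights assumed)."""
    return sum(cross_entropy(head_logits[name], targets[name]) for name in targets)


# One output layer per token type, each with its own (made-up) logits.
head_logits = {
    "Family": [2.0, 0.0],
    "Pitch": [0.0, 0.0, 0.0, 0.0],
    "Velocity": [0.0, 0.0],
    "Duration": [0.0, 0.0],
}
targets = {"Family": 0, "Pitch": 2, "Velocity": 1, "Duration": 0}

loss = compound_loss(head_logits, targets)
print(round(loss, 4))
```

At generation time the same heads each produce a distribution, and one value has to be sampled from every head to form the next compound token.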

Parameters:

pitch_range – range of MIDI pitches to use
beat_res – beat resolutions, as a dictionary: {(beat_x1, beat_x2): beat_res_1, (beat_x2, beat_x3): beat_res_2, …} The keys are tuples indicating a range of beats, ex 0 to 3 for the first bar, and the values are the resolution to apply to the ranges, in samples per beat, ex 8
nb_velocities – number of velocity bins
additional_tokens – additional tokens (chords, time signature, rests, tempo…) to use, to be given as a dictionary. (default: None is used)
pad – will add a special PAD token to the vocabulary, to use to pad sequences when training a model with batches of different sequence lengths. (default: True)
sos_eos – adds special Start Of Sequence (SOS) and End Of Sequence (EOS) tokens to the vocabulary. (default: False)
mask – will add a special MASK token to the vocabulary (default: False)
sep – will add a special SEP token to the vocabulary (default: False)
params – path to a tokenizer config file. This will override other arguments and load the tokenizer based on the config file. This is particularly useful if the tokenizer learned Byte Pair Encoding. (default: None)

Octuple

class miditok.Octuple(pitch_range: range = range(21, 109), beat_res: Dict[Tuple[int, int], int] = {(0, 4): 8, (4, 12): 4}, nb_velocities: int = 32, additional_tokens: Dict[str, bool] = {'Chord': False, 'Program': False, 'Rest': False, 'Tempo': False, 'TimeSignature': False, 'nb_tempos': 32, 'rest_range': (2, 8), 'tempo_range': (40, 250), 'time_signature_range': (8, 2)}, programs: Optional[List[int]] = None, pad: bool = True, sos_eos: bool = False, mask: bool = False, sep: bool = False, params: Optional[Union[Path, str]] = None)

Bases: MIDITokenizer

Introduced with MusicBert (Zeng et al.), the idea of Octuple is to use embedding pooling so that each pooled embedding represents a single note. Tokens (Pitch, Velocity…) are first independently converted to embeddings which are then merged (pooled) into a single one. Each pooled token will be a list of the form (index: Token type):

0: Pitch
1: Velocity
2: Duration
3: Program (track)
4: Position
5: Bar
(+ Optional) Tempo
(+ Optional) TimeSignature

It considerably reduces the sequence lengths, while handling multitrack. The output hidden states of the model are then fed to several output layers (one per token type). Training therefore requires combining multiple losses. For generation, decoding involves sampling from several distributions, which can be very delicate. Hence, we do not recommend this tokenization for generation with small models.

Notes:

Tokens are first sorted by time, then track, then pitch values.
Tracks with the same Program will be merged.
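
The layout and ordering described above can be sketched without MidiTok. The octuple_like_tokens helper, the input format and the token strings are assumptions, as is the choice of 32 positions per bar; the optional Tempo and TimeSignature fields are left out.

```python
from typing import List, Tuple

POSITIONS_PER_BAR = 32  # assumption: 4/4 bars with 8 positions per beat

# A track is a (program, notes) pair; a note is (start, pitch, velocity, duration).
Track = Tuple[int, List[Tuple[int, int, int, int]]]


def octuple_like_tokens(tracks: List[Track]) -> List[List[str]]:
    """Return one pooled token, as a list of six fields, per note."""
    rows = []
    for program, notes in tracks:
        for start, pitch, velocity, duration in notes:
            # Notes are identified by their program only, so tracks that
            # share a program end up merged.
            rows.append((start, program, pitch, velocity, duration))
    rows.sort()  # by time, then track (program), then pitch

    tokens: List[List[str]] = []
    for start, program, pitch, velocity, duration in rows:
        bar, pos = divmod(start, POSITIONS_PER_BAR)
        tokens.append([
            f"Pitch_{pitch}",
            f"Velocity_{velocity}",
            f"Duration_{duration}",
            f"Program_{program}",
            f"Position_{pos}",
            f"Bar_{bar}",
        ])
    return tokens


octuple_example = octuple_like_tokens([
    (0, [(0, 60, 100, 8), (32, 64, 100, 8)]),
    (33, [(0, 36, 90, 16)]),
    (0, [(0, 72, 80, 4)]),  # same program as the first track
])
for token in octuple_example:
    print(token)
```

Each note carries its own Bar and Position fields, so no separate time tokens are needed and the sequence length equals the number of notes.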

Parameters:

pitch_range – range of MIDI pitches to use
beat_res – beat resolutions, as a dictionary: {(beat_x1, beat_x2): beat_res_1, (beat_x2, beat_x3): beat_res_2, …} The keys are tuples indicating a range of beats, ex 0 to 3 for the first bar, and the values are the resolution to apply to the ranges, in samples per beat, ex 8
nb_velocities – number of velocity bins
additional_tokens – additional tokens (chords, time signature, rests, tempo…) to use, to be given as a dictionary. (default: None is used)
pad – will add a special PAD token to the vocabulary, to use to pad sequences when training a model with batches of different sequence lengths. (default: True)
sos_eos – adds special Start Of Sequence (SOS) and End Of Sequence (EOS) tokens to the vocabulary. (default: False)
mask – will add a special MASK token to the vocabulary (default: False)
sep – will add a special SEP token to the vocabulary (default: False)
params – path to a tokenizer config file. This will override other arguments and load the tokenizer based on the config file. This is particularly useful if the tokenizer learned Byte Pair Encoding. (default: None)

Octuple Mono

class miditok.OctupleMono(pitch_range: range = range(21, 109), beat_res: Dict[Tuple[int, int], int] = {(0, 4): 8, (4, 12): 4}, nb_velocities: int = 32, additional_tokens: Dict[str, bool] = {'Chord': False, 'Program': False, 'Rest': False, 'Tempo': False, 'TimeSignature': False, 'nb_tempos': 32, 'rest_range': (2, 8), 'tempo_range': (40, 250), 'time_signature_range': (8, 2)}, pad: bool = True, sos_eos: bool = False, mask: bool = False, sep: bool = False, params: Optional[Union[Path, str]] = None)

Bases: MIDITokenizer

OctupleMono is similar to Octuple (MusicBert (Zeng et al.)) but without the Program token. OctupleMono is hence better suited for tasks with one track. Each pooled token will be a list of the form (index: Token type):

0: Pitch
1: Velocity
2: Duration
3: Position
4: Bar
(+ Optional) Tempo
(+ Optional) TimeSignature

Parameters:

pitch_range – range of MIDI pitches to use
beat_res – beat resolutions, as a dictionary: {(beat_x1, beat_x2): beat_res_1, (beat_x2, beat_x3): beat_res_2, …} The keys are tuples indicating a range of beats, ex 0 to 3 for the first bar, and the values are the resolution to apply to the ranges, in samples per beat, ex 8
nb_velocities – number of velocity bins
additional_tokens – additional tokens (chords, time signature, rests, tempo…) to use, to be given as a dictionary. (default: None is used)
pad – will add a special PAD token to the vocabulary, to use to pad sequences when training a model with batches of different sequence lengths. (default: True)
sos_eos – adds special Start Of Sequence (SOS) and End Of Sequence (EOS) tokens to the vocabulary. (default: False)
mask – will add a special MASK token to the vocabulary (default: False)
sep – will add a special SEP token to the vocabulary (default: False)
params – path to a tokenizer config file. This will override other arguments and load the tokenizer based on the config file. This is particularly useful if the tokenizer learned Byte Pair Encoding. (default: None)

MuMIDI

class miditok.MuMIDI(pitch_range: range = range(21, 109), beat_res: Dict[Tuple[int, int], int] = {(0, 4): 8, (4, 12): 4}, nb_velocities: int = 32, additional_tokens: Dict[str, bool] = {'Chord': False, 'Program': False, 'Rest': False, 'Tempo': False, 'TimeSignature': False, 'nb_tempos': 32, 'rest_range': (2, 8), 'tempo_range': (40, 250), 'time_signature_range': (8, 2)}, programs: Optional[List[int]] = None, pad: bool = True, sos_eos: bool = False, mask: bool = False, sep: bool = False, params: Optional[Union[Path, str]] = None, drum_pitch_range: range = range(27, 88))

Bases: MIDITokenizer

Introduced with PopMAG (Ren et al.), this tokenization is made for multitrack tasks and uses embedding pooling. Time is represented with Bar and Position tokens. The key idea of MuMIDI is to represent all tracks in a single token sequence. At each time step, Track tokens preceding note tokens indicate their track. MuMIDI also includes a “built-in” and learned positional encoding. As in the original paper, the pitches of drums are distinct from those of all other instruments. Each pooled token will be a list of the form (index: Token type):

0: Pitch / DrumPitch / Position / Bar / Program / (Chord) / (Rest)
1: BarPosEnc
2: PositionPosEnc
(-3 / 3: Tempo)
-2: Velocity
-1: Duration

The output hidden states of the model are then fed to several output layers (one per token type). Training therefore requires combining multiple losses. For generation, decoding involves sampling from several distributions, which can be very delicate. Hence, we do not recommend this tokenization for generation with small models.

Notes:

Tokens are first sorted by time, then track, then pitch values.
Tracks with the same Program will be merged.

Parameters:

pitch_range – range of MIDI pitches to use
beat_res – beat resolutions, as a dictionary: {(beat_x1, beat_x2): beat_res_1, (beat_x2, beat_x3): beat_res_2, …} The keys are tuples indicating a range of beats, ex 0 to 3 for the first bar, and the values are the resolution to apply to the ranges, in samples per beat, ex 8
nb_velocities – number of velocity bins
additional_tokens – additional tokens (chords, time signature, rests, tempo…) to use, to be given as a dictionary. (default: None is used)
pad – will add a special PAD token to the vocabulary, to use to pad sequences when training a model with batches of different sequence lengths. (default: True)
sos_eos – adds special Start Of Sequence (SOS) and End Of Sequence (EOS) tokens to the vocabulary. (default: False)
mask – will add a special MASK token to the vocabulary (default: False)
sep – will add a special SEP token to the vocabulary (default: False)
params – path to a tokenizer config file. This will override other arguments and load the tokenizer based on the config file. This is particularly useful if the tokenizer learned Byte Pair Encoding. (default: None)
drum_pitch_range – range of used MIDI pitches for drums exclusively

Create yours

You can easily create your own tokenization and benefit from the MidiTok framework. Just create a class inheriting from miditok.MIDITokenizer, and override the miditok.MIDITokenizer.track_to_tokens(), miditok.MIDITokenizer.tokens_to_track(), miditok.MIDITokenizer._create_vocabulary() and miditok.MIDITokenizer._create_token_types_graph() methods with your tokenization strategy.
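
To give an idea of what each of the four methods is responsible for, here is a stand-alone toy that reuses their names. It deliberately does not import MidiTok: a real tokenizer would inherit from miditok.MIDITokenizer, and the argument types, return types and note representation used here are assumptions made for the sketch, not the library's actual signatures.

```python
from typing import Dict, List, Tuple

# Stand-in for a real note object: (start, duration, pitch, velocity).
NoteTuple = Tuple[int, int, int, int]


class ToyTokenizer:
    """Stand-alone toy; a real tokenizer would subclass miditok.MIDITokenizer."""

    def __init__(self) -> None:
        self.vocab = self._create_vocabulary()
        self.token_to_id = {tok: i for i, tok in enumerate(self.vocab)}
        self.types_graph = self._create_token_types_graph()

    def _create_vocabulary(self) -> List[str]:
        # Every token the tokenizer can emit (a plain list in this sketch).
        vocab = [f"Pitch_{p}" for p in range(21, 109)]
        vocab += [f"Velocity_{v}" for v in range(0, 128, 4)]
        vocab += [f"Duration_{d}" for d in range(1, 33)]
        vocab += [f"TimeShift_{t}" for t in range(1, 33)]
        return vocab

    def _create_token_types_graph(self) -> Dict[str, List[str]]:
        # Which token types may follow each token type.
        return {
            "Pitch": ["Velocity"],
            "Velocity": ["Duration"],
            "Duration": ["Pitch", "TimeShift"],
            "TimeShift": ["Pitch"],
        }

    def track_to_tokens(self, notes: List[NoteTuple]) -> List[int]:
        tokens: List[int] = []
        now = 0
        for start, duration, pitch, velocity in sorted(notes):
            if start > now:
                tokens.append(self.token_to_id[f"TimeShift_{start - now}"])
                now = start
            for tok in (f"Pitch_{pitch}", f"Velocity_{velocity}", f"Duration_{duration}"):
                tokens.append(self.token_to_id[tok])
        return tokens

    def tokens_to_track(self, tokens: List[int]) -> List[NoteTuple]:
        notes: List[NoteTuple] = []
        now, i = 0, 0
        while i < len(tokens):
            kind, value = self.vocab[tokens[i]].split("_")
            if kind == "TimeShift":
                now += int(value)
                i += 1
            else:  # a Pitch token, followed by Velocity and Duration
                velocity = int(self.vocab[tokens[i + 1]].split("_")[1])
                duration = int(self.vocab[tokens[i + 2]].split("_")[1])
                notes.append((now, duration, int(value), velocity))
                i += 3
        return notes


toy = ToyTokenizer()
toy_notes = [(0, 8, 60, 100), (8, 4, 64, 80)]
round_trip = toy.tokens_to_track(toy.track_to_tokens(toy_notes))
print(round_trip == toy_notes)
```

The toy raises a KeyError for values outside its small vocabulary (for example a velocity that is not a multiple of 4); a real implementation would quantize values before looking them up.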

We encourage you to read the documentation of the Vocabulary class to learn how to use it for your tokenization. If you think people can benefit from it, feel free to send a pull request on GitHub.