Tokenizations

This page details the tokenizations featured by MidiTok. They all inherit from miditok.MIDITokenizer; see its documentation to learn how to use the common methods. For each of them, the token equivalent of the lead sheet below is shown.

REMI

class miditok.REMI(tokenizer_config: TokenizerConfig = None, max_bar_embedding: int | None = None, params: str | Path = None)

Bases: MIDITokenizer

REMI, standing for Revamped MIDI and introduced with the Pop Music Transformer (Huang and Yang), is a tokenization that represents notes as successions of Pitch, Velocity and Duration tokens, and time with Bar and Position tokens. A Bar token indicates that a new bar is beginning, and a Position token indicates the current position within the current bar. The number of positions is determined by the beat_res argument: its maximum value is used as the resolution. With the Program and TimeSignature additional tokens enabled, this class is equivalent to REMI+. REMI+ is an extended version of REMI (Huang and Yang) for general multi-track, multi-signature symbolic music sequences, introduced in FIGARO (Rütte et al.) <https://arxiv.org/abs/2201.10936>, which handles multiple instruments by adding Program tokens before the Pitch ones.

Note: in the original paper, the tempo information is represented as the succession of two token types: a TempoClass indicating if the tempo is fast or slow, and a TempoValue indicating its value. MidiTok only uses one Tempo token for its value (see Additional tokens).

Note: when decoding multiple token sequences (of multiple tracks), i.e. when config.use_programs is False, only the tempos and time signatures of the first sequence will be decoded for the whole MIDI.

Parameters:

tokenizer_config – the tokenizer’s configuration, as a miditok.classes.TokenizerConfig object.
max_bar_embedding – maximum number of bars (“Bar_0”, “Bar_1”, …, “Bar_{num_bars-1}”). If None is passed, the vocabulary contains a single “Bar_None” token for the Bar token type.
params – path to a tokenizer config file. This will override other arguments and load the tokenizer based on the config file. This is particularly useful if the tokenizer learned Byte Pair Encoding. (default: None)
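The Bar and Position scheme described for REMI can be illustrated with a minimal, library-independent sketch. It is not MidiTok's implementation: the `Note` type, the `remi_like_tokens` function, the fixed 4/4 time signature and the simplified `Duration_<positions>` naming are all assumptions made for this example.

```python
from typing import NamedTuple


class Note(NamedTuple):
    onset: int     # start time, in ticks
    duration: int  # length, in ticks
    pitch: int     # MIDI pitch, 0-127
    velocity: int  # MIDI velocity, 1-127


def remi_like_tokens(notes, ticks_per_beat=480, positions_per_beat=8, beats_per_bar=4):
    """Turn notes into a REMI-style token list (illustrative only, assumes 4/4)."""
    ticks_per_pos = ticks_per_beat // positions_per_beat
    positions_per_bar = positions_per_beat * beats_per_bar
    tokens = []
    current_bar = -1
    current_pos = None
    for note in sorted(notes, key=lambda n: (n.onset, n.pitch)):
        abs_pos = note.onset // ticks_per_pos  # position index since the start
        bar, pos = divmod(abs_pos, positions_per_bar)
        # Emit one Bar token per bar reached, including empty bars.
        while current_bar < bar:
            current_bar += 1
            current_pos = None
            tokens.append("Bar_None")
        # Emit a Position token only when the position changes within the bar.
        if pos != current_pos:
            tokens.append(f"Position_{pos}")
            current_pos = pos
        dur = max(1, note.duration // ticks_per_pos)  # duration, in positions
        tokens += [f"Pitch_{note.pitch}", f"Velocity_{note.velocity}", f"Duration_{dur}"]
    return tokens


notes = [
    Note(onset=0, duration=480, pitch=60, velocity=90),
    Note(onset=0, duration=480, pitch=64, velocity=90),
    Note(onset=2400, duration=240, pitch=67, velocity=80),  # second bar, second beat
]
print(remi_like_tokens(notes))
```

The two simultaneous notes share a single Position token, and the third note triggers a new Bar token before its own Position token.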

REMIPlus

class miditok.REMIPlus(tokenizer_config: TokenizerConfig = None, max_bar_embedding: int | None = None, params: str | Path = None)

Bases: REMI

REMI+ is an extended version of REMI (Huang and Yang) for general multi-track, multi-signature symbolic music sequences, introduced in FIGARO (Rütte et al.) <https://arxiv.org/abs/2201.10936>, which handles multiple instruments by adding Program tokens before the Pitch ones.

This class is identical to REMI with Program and TimeSignature tokens enabled.

MIDI-Like

class miditok.MIDILike(tokenizer_config: TokenizerConfig = None, one_token_stream: bool = False, params: str | Path = None)

Bases: MIDITokenizer

Introduced in This time with feeling (Oore et al.) and later used with Music Transformer (Huang et al.) and MT3 (Gardner et al.), this tokenization simply converts MIDI messages (NoteOn, NoteOff, TimeShift…) into tokens, hence the name “MIDI-Like”. If you set use_programs to True in the config file, the tokenizer will add a Program token before each Pitch token to specify its instrument, and will treat all tracks as a single stream of tokens.

Note: as MIDILike uses TimeShift events to move the time from note to note, it can be unsuited for tracks with long pauses. In such cases, the maximum TimeShift value will be used. Also, the MIDILike tokenizer might alter the durations of overlapping notes. If two notes of the same instrument with the same pitch overlap, i.e. the first one is still being played when the second one starts, the offset time of the first will be set to the onset time of the second. This is done to prevent the unwanted duration alterations that could otherwise happen, as the NoteOff token associated with the first note would also end the second one.

Note: when decoding multiple token sequences (of multiple tracks), i.e. when config.use_programs is False, only the tempos and time signatures of the first sequence will be decoded for the whole MIDI.
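The overlap handling described in the note above can be sketched as follows. This is only an illustration of the rule, not MidiTok's code: the `clip_overlapping_notes` function and the dict-based note format are hypothetical, and all notes are assumed to belong to the same instrument.

```python
def clip_overlapping_notes(notes):
    """Shorten a note that is still sounding when a same-pitch note starts.

    `notes` is a list of dicts with "pitch", "start" and "end" keys (times in
    ticks). A new list of copies is returned, the input is left untouched.
    """
    fixed = []
    last_by_pitch = {}  # most recent note seen for each pitch
    for note in sorted(notes, key=lambda n: (n["start"], n["pitch"])):
        note = dict(note)  # work on a copy
        previous = last_by_pitch.get(note["pitch"])
        if previous is not None and previous["end"] > note["start"]:
            # The first note's offset becomes the second note's onset, so a
            # single NoteOff cannot end both notes at once.
            previous["end"] = note["start"]
        last_by_pitch[note["pitch"]] = note
        fixed.append(note)
    return fixed


notes = [
    {"pitch": 60, "start": 0, "end": 100},
    {"pitch": 60, "start": 50, "end": 120},  # overlaps the first C4
    {"pitch": 64, "start": 0, "end": 100},   # other pitch, left as is
]
fixed = clip_overlapping_notes(notes)
print(fixed)
```

Only the first note is altered: it now ends at tick 50, where the second note of the same pitch begins.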

TSD

TSD sequence, like MIDI-Like with Duration tokens

class miditok.TSD(tokenizer_config: TokenizerConfig = None, one_token_stream: bool = False, params: str | Path = None)

Bases: MIDITokenizer

TSD, for Time Shift Duration, is similar to MIDI-Like but uses explicit Duration tokens to represent note durations, which have shown better results than NoteOff tokens. If you set use_programs to True in the config file, the tokenizer will add a Program token before each Pitch token to specify its instrument, and will treat all tracks as a single stream of tokens.

Note: as TSD uses TimeShift events to move the time from note to note, it can be unsuited for tracks with pauses longer than the maximum TimeShift value. In such cases, the maximum TimeShift value will be used.

Note: when decoding multiple token sequences (of multiple tracks), i.e. when config.use_programs is False, only the tempos and time signatures of the first sequence will be decoded for the whole MIDI.
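To make the TimeShift limitation concrete, here is a small sketch of a TSD-style encoding. It is a simplification under stated assumptions, not the library's implementation: `tsd_like_tokens`, the tuple note format, the integer time unit and the `max_time_shift` value of 8 are all made up for the example, and long gaps are clipped as the note above describes.

```python
def tsd_like_tokens(notes, max_time_shift=8):
    """Encode (onset, duration, pitch, velocity) tuples as TSD-style tokens."""
    tokens = []
    previous_onset = 0
    for onset, duration, pitch, velocity in sorted(notes):
        gap = onset - previous_onset
        if gap > 0:
            # Assumption taken from the note above: a gap longer than the
            # largest TimeShift is represented by that largest value.
            tokens.append(f"TimeShift_{min(gap, max_time_shift)}")
        tokens += [f"Pitch_{pitch}", f"Velocity_{velocity}", f"Duration_{duration}"]
        previous_onset = onset
    return tokens


notes = [(0, 4, 60, 90), (4, 4, 62, 90), (20, 2, 64, 80)]
tokens = tsd_like_tokens(notes)
print(tokens)
```

In this example the gap of 16 time units before the last note exceeds the maximum and is encoded as `TimeShift_8`, so decoding these tokens would place that note at time 12 instead of 20.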

Structured

class miditok.Structured(tokenizer_config: TokenizerConfig = None, one_token_stream: bool = False, params: str | Path = None)

Bases: MIDITokenizer

Introduced with the Piano Inpainting Application, it is similar to TSD but is based on a consistent succession of token types. Token types always follow the same pattern: Pitch -> Velocity -> Duration -> TimeShift. The latter is set to 0 for simultaneous notes. To keep this property, no additional token can be inserted in MidiTok’s implementation, except Program tokens, which can optionally be added before Pitch tokens. If you set use_programs to True in the config file, the tokenizer will add a Program token before each Pitch token to specify its instrument, and will treat all tracks as a single stream of tokens.

Note: as Structured uses TimeShift events to move the time from note to note, it can be unsuited for tracks with pauses longer than the maximum TimeShift value. In such cases, the maximum TimeShift value will be used.
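The fixed Pitch -> Velocity -> Duration -> TimeShift pattern can be illustrated with a short sketch. The names `structured_like_tokens` and `follows_pattern` are hypothetical, the note format is invented for the example, and ending the sequence with a `TimeShift_0` token is a choice made here rather than documented MidiTok behaviour.

```python
PATTERN = ("Pitch", "Velocity", "Duration", "TimeShift")


def structured_like_tokens(notes):
    """Encode (onset, duration, pitch, velocity) tuples with a fixed pattern."""
    notes = sorted(notes)
    tokens = []
    for i, (onset, duration, pitch, velocity) in enumerate(notes):
        # Time to the next note: 0 for simultaneous notes. The last note also
        # gets 0 (an assumption of this sketch).
        next_onset = notes[i + 1][0] if i + 1 < len(notes) else onset
        tokens += [
            f"Pitch_{pitch}",
            f"Velocity_{velocity}",
            f"Duration_{duration}",
            f"TimeShift_{next_onset - onset}",
        ]
    return tokens


def follows_pattern(tokens):
    """Check that token types cycle through PATTERN without exception."""
    return all(tok.split("_")[0] == PATTERN[i % 4] for i, tok in enumerate(tokens))


notes = [(0, 2, 60, 90), (0, 2, 64, 90), (4, 1, 67, 80)]
tokens = structured_like_tokens(notes)
print(tokens)
print(follows_pattern(tokens))
```

Inserting any other token type, for instance a Tempo token, breaks the cycle, which is why additional tokens are not supported by this tokenization.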

CPWord

class miditok.CPWord(tokenizer_config: TokenizerConfig = None, one_token_stream: bool = False, params: str | Path = None)

Bases: MIDITokenizer

Introduced with the Compound Word Transformer (Hsiao et al.), this tokenization is similar to REMI but uses embedding pooling operations to reduce the overall sequence length: note tokens (Pitch, Velocity and Duration) are first independently converted to embeddings which are then merged (pooled) into a single one. Each compound token will be a list of the form (index: Token type):

* 0: Family
* 1: Bar/Position
* 2: Pitch
* 3: Velocity
* 4: Duration
* (+ Optional) Program: associated with notes (pitch/velocity/duration) or chords
* (+ Optional) Chord: chords occurring with position tokens
* (+ Optional) Rest: rest acting as a TimeShift token
* (+ Optional) Tempo: occurring with position tokens

The output hidden states of the model will then be fed to several output layers (one per token type). This means that training requires adding multiple losses. For generation, decoding implies sampling from several distributions, which can be very delicate. Hence, we do not recommend this tokenization for generation with small models.

Note: when decoding multiple token sequences (of multiple tracks), i.e. when config.use_programs is False, only the tempos and time signatures of the first sequence will be decoded for the whole MIDI.
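As a rough picture of what "one output layer per token type" implies for training, the sketch below sums one cross-entropy term per token type for a single compound token. It is written in plain Python for readability and is independent of MidiTok and of any deep learning framework. The function names, the tiny vocabularies and the unweighted sum are assumptions of this example; a weighted sum is an equally plausible design.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of a softmax over `logits` against a target index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def compound_loss(logits_per_type, targets):
    """Sum one cross-entropy term per token type (one output layer each)."""
    return sum(cross_entropy(l, t) for l, t in zip(logits_per_type, targets))


# One compound token with three heads, e.g. Pitch, Velocity and Duration,
# each with its own (tiny, made-up) vocabulary size.
logits_per_type = [[2.0, 0.5, 0.1], [0.0, 0.0], [1.0, 3.0, 0.2, 0.2]]
targets = [0, 1, 1]
loss = compound_loss(logits_per_type, targets)
print(loss)
```

At generation time, the same structure means one sampling step per head for every compound token, which is where the difficulty mentioned above comes from.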

Octuple

class miditok.Octuple(tokenizer_config: TokenizerConfig = None, one_token_stream: bool = False, params: str | Path = None)

Bases: MIDITokenizer

Introduced with MusicBert (Zeng et al.), the idea of Octuple is to use embedding pooling so that each pooled embedding represents a single note. Tokens (Pitch, Velocity…) are first independently converted to embeddings which are then merged (pooled) into a single one. Each pooled token will be a list of the form (index: Token type):

* 0: Pitch
* 1: Velocity
* 2: Duration
* 3: Position
* 4: Bar
* (+ Optional) Program
* (+ Optional) Tempo
* (+ Optional) TimeSignature

It considerably reduces the sequence lengths, while handling multitrack. The output hidden states of the model will then be fed to several output layers (one per token type). This means that training requires adding multiple losses. For generation, decoding implies sampling from several distributions, which can be very delicate. Hence, we do not recommend this tokenization for generation with small models.

Notes:

Tokens are first sorted by time, then track, then pitch values.
Tracks with the same Program will be merged.
When decoding multiple token sequences (of multiple tracks), i.e. when config.use_programs is False, only the tempos and time signatures of the first sequence will be decoded for the whole MIDI.
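The per-note layout and the ordering rules listed in the notes can be sketched in a few lines. This is an illustration only: `octuple_like_tokens`, the input format and the `positions_per_bar` value are hypothetical, and the optional Tempo and TimeSignature entries are left out.

```python
def octuple_like_tokens(tracks, positions_per_bar=16):
    """Build one token list per note, Octuple-style.

    `tracks` is a list of (program, notes) pairs, where each note is an
    (onset, duration, pitch, velocity) tuple with times in positions.
    """
    # Tracks sharing the same program are merged.
    merged = {}
    for program, notes in tracks:
        merged.setdefault(program, []).extend(notes)

    events = []
    for track_idx, (program, notes) in enumerate(sorted(merged.items())):
        for onset, duration, pitch, velocity in notes:
            events.append((onset, track_idx, pitch, velocity, duration, program))
    events.sort()  # by time, then track, then pitch

    tokens = []
    for onset, _, pitch, velocity, duration, program in events:
        bar, position = divmod(onset, positions_per_bar)
        tokens.append([
            f"Pitch_{pitch}",
            f"Velocity_{velocity}",
            f"Duration_{duration}",
            f"Position_{position}",
            f"Bar_{bar}",
            f"Program_{program}",
        ])
    return tokens


tracks = [
    (0, [(0, 4, 60, 90)]),    # piano
    (32, [(0, 8, 36, 100)]),  # bass
    (0, [(18, 2, 64, 80)]),   # second piano track, merged with the first
]
tokens = octuple_like_tokens(tracks)
for token in tokens:
    print(token)
```

Each note yields exactly one list, so the sequence length equals the number of notes regardless of how many token types describe each note.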

MuMIDI

class miditok.MuMIDI(tokenizer_config: TokenizerConfig = None, one_token_stream: bool = False, params: str | Path = None)

Bases: MIDITokenizer

Introduced with PopMAG (Ren et al.), this tokenization is made for multitrack tasks and uses embedding pooling. Time is represented with Bar and Position tokens. The key idea of MuMIDI is to represent all tracks in a single token sequence. At each time step, Track tokens preceding note tokens indicate their track. MuMIDI also includes a “built-in” and learned positional encoding. As in the original paper, the pitches of drums are distinct from those of all other instruments. Each pooled token will be a list of the form (index: Token type):

* 0: Pitch / DrumPitch / Position / Bar / Program / (Chord) / (Rest)
* 1: BarPosEnc
* 2: PositionPosEnc
* (-3 / 3: Tempo)
* -2: Velocity
* -1: Duration

The output hidden states of the model will then be fed to several output layers (one per token type). This means that training requires adding multiple losses. For generation, decoding implies sampling from several distributions, which can be very delicate. Hence, we do not recommend this tokenization for generation with small models.

Add a `drum_pitch_range` entry in the config, mapping to a tuple of values to restrict the range of drum pitches to use.

Notes:

Tokens are first sorted by time, then track, then pitch values.
Tracks with the same Program will be merged.

MMM

class miditok.MMM(tokenizer_config: TokenizerConfig = None, one_token_stream: bool = False, params: str | Path = None)

Bases: MIDITokenizer

MMM, standing for [Multi-Track Music Machine](https://arxiv.org/abs/2008.06048), is a multitrack tokenization primarily designed for music inpainting and infilling. Tracks are tokenized independently and concatenated into a single token sequence. Bar_Fill tokens are used to specify the bars to fill (or inpaint, or rewrite); the new tokens are then autoregressively generated. Note that this implementation represents note durations with Duration tokens instead of the NoteOff strategy of the [original paper](https://arxiv.org/abs/2008.06048), because NoteOff tokens perform worse for generation with causal models.

Add a `density_bins_max` entry in the config, mapping to a tuple specifying the number of density bins, and the maximum density in notes per beat to consider. (default: (10, 20))

Note: When decoding tokens with tempos, only the tempos of the first track will be decoded.
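To give an intuition for the `density_bins_max` entry described above, the following sketch maps a bar's note density to a bin index. The even spacing of the bins, the clipping at the maximum density and the `density_bin` function itself are assumptions of this example. MidiTok may compute its bins differently.

```python
def density_bin(num_notes, num_beats, density_bins_max=(10, 20)):
    """Map a note density (notes per beat) to a bin index.

    `density_bins_max` is (number of bins, maximum density in notes per beat).
    Bins are assumed to be evenly spaced between 0 and the maximum density.
    """
    num_bins, max_density = density_bins_max
    density = min(num_notes / num_beats, max_density)  # clip to the maximum
    bin_width = max_density / num_bins
    # The top edge (density == max_density) falls into the last bin.
    return min(int(density / bin_width), num_bins - 1)


print(density_bin(4, 4))    # sparse bar: 1 note per beat
print(density_bin(24, 4))   # 6 notes per beat
print(density_bin(200, 4))  # denser than the maximum, clipped
```

With the default (10, 20), each bin in this sketch covers 2 notes per beat, and any bar denser than 20 notes per beat lands in the last bin.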

Create yours

You can easily create your own tokenization and benefit from the MidiTok framework. Just create a class inheriting from miditok.MIDITokenizer, and override the miditok.MIDITokenizer._add_time_events(), miditok.MIDITokenizer.tokens_to_midi() / miditok.MIDITokenizer.tokens_to_track(), miditok.MIDITokenizer._create_vocabulary() and miditok.MIDITokenizer._create_token_types_graph() (and optionally miditok.MIDITokenizer._midi_to_tokens(), miditok.MIDITokenizer._create_track_events() and miditok.MIDITokenizer._create_midi_events()) methods with your tokenization strategy.

We encourage you to read the documentation of the Vocabulary class to learn how to use it for your tokenization. If you think people can benefit from it, feel free to send a pull request on GitHub.