Tokenizations

This page details the tokenizations featured in MidiTok. They all inherit from miditok.MIDITokenizer; see its documentation to learn how to use the common methods. For each of them, the token equivalent of the lead sheet below is shown.

Music sheet example

REMI

REMI sequence, where time is tracked with Bar and Position tokens
class miditok.REMI(tokenizer_config: TokenizerConfig = None, max_bar_embedding: int | None = None, params: str | Path | None = None)

Bases: MIDITokenizer

REMI (Revamped MIDI) tokenizer.

Introduced with the Pop Music Transformer (Huang and Yang), REMI represents notes as successions of Pitch, Velocity and Duration tokens, and time with Bar and Position tokens. A Bar token indicates that a new bar is beginning, and a Position token gives the current position within the current bar. The number of positions is determined by the beat_res argument: its maximum value will be used as the resolution. With the Program and TimeSignature additional tokens enabled, this class is equivalent to REMI+. REMI+ is an extended version of REMI (Huang and Yang) for general multi-track, multi-signature symbolic music sequences, introduced in FIGARO (Rütte et al.) <https://arxiv.org/abs/2201.10936>, which handles multiple instruments by adding Program tokens before the Pitch ones.

Note: in the original paper, the tempo information is represented as a succession of two token types: a TempoClass indicating whether the tempo is fast or slow, and a TempoValue indicating its value. MidiTok only uses one Tempo token for the value (see Additional tokens).

Note: when decoding multiple token sequences (of multiple tracks), i.e. when config.use_programs is False, only the tempos and time signatures of the first sequence will be decoded for the whole MIDI.

Parameters:
  • tokenizer_config – the tokenizer’s configuration, as a miditok.classes.TokenizerConfig object.

  • max_bar_embedding – Maximum number of bars (“Bar_0”, “Bar_1”, …, “Bar_{num_bars-1}”). If None is passed, only a “Bar_None” token will be created in the vocabulary for Bar tokens.

  • params – path to a tokenizer config file. This will override other arguments and load the tokenizer based on the config file. This is particularly useful if the tokenizer learned Byte Pair Encoding. (default: None)
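
A minimal usage sketch (assuming MidiTok v2 with miditoolkit for MIDI I/O; the file path is hypothetical):

```python
from miditok import REMI, TokenizerConfig
from miditoolkit import MidiFile

# Create a REMI tokenizer with the default configuration
tokenizer = REMI(tokenizer_config=TokenizerConfig())

# Tokenize a MIDI file (hypothetical path)
midi = MidiFile("path/to/file.mid")
tokens = tokenizer.midi_to_tokens(midi)  # one token sequence per track by default

# Decode the tokens back into a MIDI object
decoded_midi = tokenizer.tokens_to_midi(tokens)
```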

REMIPlus

REMI+ is an extended version of REMI (Huang and Yang) for general multi-track, multi-signature symbolic music sequences, introduced in FIGARO (Rütte et al.) <https://arxiv.org/abs/2201.10936>, which handles multiple instruments by adding Program tokens before the Pitch ones.

In previous versions of MidiTok, REMI+ was implemented as a dedicated class. Now that all tokenizers support additional tokens in a more flexible way, you can get the REMI+ tokenization by using the REMI tokenizer with config.use_programs, config.one_token_stream_for_programs and config.use_time_signatures set to True, as shown below.
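
For example (a sketch; the option names follow MidiTok's TokenizerConfig):

```python
from miditok import REMI, TokenizerConfig

# REMI+ behaviour: Program tokens, a single token stream for all tracks,
# and TimeSignature tokens
config = TokenizerConfig(
    use_programs=True,
    one_token_stream_for_programs=True,
    use_time_signatures=True,
)
tokenizer = REMI(tokenizer_config=config)
```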

MIDI-Like

MIDI-Like token sequence, with TimeShift and NoteOff tokens
class miditok.MIDILike(tokenizer_config: TokenizerConfig = None, params: str | Path | None = None)

Bases: MIDITokenizer

MIDI-Like tokenizer.

Introduced in This Time with Feeling (Oore et al.) and later used with the Music Transformer (Huang et al.) and MT3 (Gardner et al.), this tokenization simply converts MIDI messages (NoteOn, NoteOff, TimeShift…) to tokens, hence the name “MIDI-Like”. MIDILike decodes tokens following a FIFO (First In, First Out) logic: a NoteOff token ends the oldest note currently being played with the matching pitch. When decoding tokens, you can limit the duration of the created notes by setting a max_duration entry in the tokenizer’s config (config.additional_params["max_duration"]), given as a tuple of three integers (num_beats, num_frames, res_frames), the resolution being in frames per beat. If you set use_programs to True in the config, the tokenizer will add a Program token before each Pitch token to specify its instrument, and will treat all tracks as a single stream of tokens.

Note: as MIDILike uses TimeShift events to move the time from note to note, it can be unsuited for tracks with long pauses; in such cases, the maximum TimeShift value will be used. Also, the MIDILike tokenizer might alter the durations of overlapping notes: if two notes of the same instrument and pitch overlap, i.e. the first is still being played when the second starts, the offset time of the first will be set to the onset time of the second. This is done to prevent unwanted duration alterations, as the NoteOff token associated with the first note would otherwise also end the second one.

Note: when decoding multiple token sequences (of multiple tracks), i.e. when config.use_programs is False, only the tempos and time signatures of the first sequence will be decoded for the whole MIDI.
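
A configuration sketch for the max_duration entry described above (the values are illustrative):

```python
from miditok import MIDILike, TokenizerConfig

# Cap decoded note durations at 2 beats, with 8 frames per beat:
# (num_beats, num_frames, res_frames)
config = TokenizerConfig(
    additional_params={"max_duration": (2, 0, 8)},
)
tokenizer = MIDILike(tokenizer_config=config)
```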

TSD

TSD sequence, like MIDI-Like with Duration tokens
class miditok.TSD(tokenizer_config: TokenizerConfig = None, params: str | Path | None = None)

Bases: MIDITokenizer

TSD (Time Shift Duration) tokenizer.

It is similar to MIDI-Like but uses explicit Duration tokens to represent note durations, which have shown better results than NoteOff tokens. If you set use_programs to True in the config, the tokenizer will add a Program token before each Pitch token to specify its instrument, and will treat all tracks as a single stream of tokens.

Note: as TSD uses TimeShift events to move the time from note to note, it can be unsuited for tracks with pauses longer than the maximum TimeShift value; in such cases, the maximum TimeShift value will be used.

Note: when decoding multiple token sequences (of multiple tracks), i.e. when config.use_programs is False, only the tempos and time signatures of the first sequence will be decoded for the whole MIDI.
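
A sketch of a multi-track TSD setup:

```python
from miditok import TSD, TokenizerConfig

# With use_programs, Program tokens precede each Pitch token and all
# tracks are tokenized into a single stream of tokens
tokenizer = TSD(tokenizer_config=TokenizerConfig(use_programs=True))
```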

Structured

Structured tokenization, where the token types always follow the same succession pattern
class miditok.Structured(tokenizer_config: TokenizerConfig = None, params: str | Path | None = None)

Bases: MIDITokenizer

Structured tokenizer, with a recurrent token type succession.

Introduced with the Piano Inpainting Application, it is similar to TSD but enforces a consistent token type succession: token types always follow the same pattern, Pitch -> Velocity -> Duration -> TimeShift, with the TimeShift value set to 0 for simultaneous notes. To keep this property, no additional token can be inserted in MidiTok’s implementation, except Program tokens, which can optionally precede Pitch tokens. If you set use_programs to True in the config, the tokenizer will add a Program token before each Pitch token to specify its instrument, and will treat all tracks as a single stream of tokens.

Note: as Structured uses TimeShift events to move the time from note to note, it can be unsuited for tracks with pauses longer than the maximum TimeShift value. In such cases, the maximum TimeShift value will be used.
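
To illustrate the fixed succession pattern, here is what a Structured token sequence could look like (the token strings are illustrative; the duration and time-shift values assume MidiTok's beats.frames.resolution encoding):

```python
# Two notes: the TimeShift_0.0.8 after the first note means the second
# note starts at the same time; TimeShift_1.0.8 then moves time by one beat
example_tokens = [
    "Pitch_60", "Velocity_95", "Duration_1.0.8", "TimeShift_0.0.8",
    "Pitch_64", "Velocity_95", "Duration_1.0.8", "TimeShift_1.0.8",
]
```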

CPWord

CP Word sequence, where tokens of the same family are grouped together
class miditok.CPWord(tokenizer_config: TokenizerConfig = None, params: str | Path | None = None)

Bases: MIDITokenizer

Compound Word tokenizer.

Introduced with the Compound Word Transformer (Hsiao et al.), this tokenization is similar to REMI but uses embedding pooling operations to reduce the overall sequence length: note tokens (Pitch, Velocity and Duration) are first independently converted to embeddings which are then merged (pooled) into a single one. Each compound token will be a list of the form (index: Token type):

  • 0: Family;

  • 1: Bar/Position;

  • 2: Pitch;

  • 3: Velocity;

  • 4: Duration;

  • (+ Optional) Program: associated with notes (pitch/velocity/duration) or chords;

  • (+ Optional) Chord: chords occurring with position tokens;

  • (+ Optional) Rest: rest acting as a TimeShift token;

  • (+ Optional) Tempo: occurring with position tokens;

  • (+ Optional) TimeSig: occurring with bar tokens.

The output hidden states of the model will then be fed to several output layers (one per token type). This means that training requires multiple losses, and that generation requires sampling from several distributions at each step, which can be very delicate. Hence, we do not recommend this tokenization for generation with small models. A training sketch is given after the note below.

Note: when decoding multiple token sequences (of multiple tracks), i.e. when config.use_programs is False, only the tempos and time signatures of the first sequence will be decoded for the whole MIDI.
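
As an illustration of the multiple-losses point, here is a minimal training sketch (not MidiTok API; PyTorch, with hypothetical vocabulary sizes) with one output layer per token type and the per-type losses summed:

```python
import torch
from torch import nn

hidden_dim = 256
vocab_sizes = [5, 64, 88, 32, 64]  # hypothetical sizes, one per token type

# One output layer (head) per token type
heads = nn.ModuleList(nn.Linear(hidden_dim, size) for size in vocab_sizes)
criterion = nn.CrossEntropyLoss()

# Hidden states from the model backbone: (batch, seq_len, hidden_dim)
hidden = torch.randn(8, 128, hidden_dim, requires_grad=True)
# One target tensor per token type: (batch, seq_len)
targets = [torch.randint(size, (8, 128)) for size in vocab_sizes]

# Sum the cross-entropy losses of all heads (how to weight them is a design choice)
loss = sum(
    criterion(head(hidden).transpose(1, 2), target)  # logits: (batch, vocab, seq_len)
    for head, target in zip(heads, targets)
)
loss.backward()
```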

Octuple

Octuple sequence, with bar and position embeddings
class miditok.Octuple(tokenizer_config: TokenizerConfig = None, params: str | Path | None = None)

Bases: MIDITokenizer

Octuple tokenizer.

Introduced with MusicBERT (Zeng et al.), the idea of Octuple is to use embedding pooling so that each pooled embedding represents a single note. Tokens (Pitch, Velocity…) are first independently converted to embeddings which are then merged (pooled) into a single one. Each pooled token will be a list of the form (index: Token type):

  • 0: Pitch/PitchDrum;

  • 1: Velocity;

  • 2: Duration;

  • 3: Position;

  • 4: Bar;

  • (+ Optional) Program;

  • (+ Optional) Tempo;

  • (+ Optional) TimeSignature.

It considerably reduces the sequence lengths while handling multitrack music. The output hidden states of the model will then be fed to several output layers (one per token type). This means that training requires multiple losses, and that generation requires sampling from several distributions at each step, which can be very delicate. Hence, we do not recommend this tokenization for generation with small models.
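
A short sketch inspecting the per-type vocabularies (assuming MidiTok's is_multi_voc attribute and the list-of-vocabularies layout used for pooled tokenizers):

```python
from miditok import Octuple, TokenizerConfig

tokenizer = Octuple(tokenizer_config=TokenizerConfig())

# Pooled tokenizers hold one vocabulary per token type
assert tokenizer.is_multi_voc
print([len(voc) for voc in tokenizer.vocab])  # one size per token type
```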

Notes:
  • As the time signature is carried simultaneously with the note tokens, if a time signature change occurs and the following bars do not contain any note, the time will be shifted by one or more bars depending on the previous time signature numerator and the time gap between the last and current notes. Octuple cannot represent time signatures accurately, hence some unavoidable conversion errors can happen. For this reason, Octuple is implemented with Time Signature but tested without.

  • Tokens are first sorted by time, then track, then pitch values.

  • Tracks with the same Program will be merged.

  • When decoding multiple token sequences (of multiple tracks), i.e. when config.use_programs is False, only the tempos and time signatures of the first sequence will be decoded for the whole MIDI.

MuMIDI

MuMIDI sequence, with bar and position embeddings
class miditok.MuMIDI(tokenizer_config: TokenizerConfig = None, params: str | Path | None = None)

Bases: MIDITokenizer

MuMIDI tokenizer.

Introduced with PopMAG (Ren et al.), this tokenization is made for multitrack tasks and uses embedding pooling. Time is represented with Bar and Position tokens. The key idea of MuMIDI is to represent all tracks in a single token sequence: at each time step, Track tokens preceding note tokens indicate their track. MuMIDI also includes a “built-in” and learned positional encoding. As in the original paper, the pitches of drums are distinct from those of all other instruments. Each pooled token will be a list of the form (index: Token type):

  • 0: Pitch / PitchDrum / Position / Bar / Program / (Chord) / (Rest);

  • 1: BarPosEnc;

  • 2: PositionPosEnc;

  • (-3 / 3: Tempo);

  • -2: Velocity;

  • -1: Duration.

The output hidden states of the model will then be fed to several output layers (one per token type). This means that training requires multiple losses, and that generation requires sampling from several distributions at each step, which can be very delicate. Hence, we do not recommend this tokenization for generation with small models.

Notes:
  • Tokens are first sorted by time, then track, then pitch values.

  • Tracks with the same Program will be merged.

MMM

class miditok.MMM(tokenizer_config: TokenizerConfig = None, params: str | Path | None = None)

Bases: MIDITokenizer

MMM tokenizer.

Standing for Multi-Track Music Machine, MMM is a multitrack tokenization primarily designed for music inpainting and infilling. Tracks are tokenized independently and concatenated into a single token sequence. Bar_Fill tokens are used to specify the bars to fill (or inpaint, or rewrite), and the new tokens are then autoregressively generated. Note that this implementation represents note durations with Duration tokens instead of the NoteOff strategy of the original paper, as NoteOff tokens perform worse for generation with causal models.

Add a density_bins_max entry in the config, mapping to a tuple specifying the number of density bins and the maximum density, in notes per beat, to consider (default: (10, 20)), as in the example below.
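
A configuration sketch (assuming the entry is passed through additional_params, as for other tokenizer-specific options):

```python
from miditok import MMM, TokenizerConfig

# 10 density bins, maximum density of 20 notes per beat (the default)
config = TokenizerConfig(
    additional_params={"density_bins_max": (10, 20)},
)
tokenizer = MMM(tokenizer_config=config)
```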

Note: When decoding tokens with tempos, only the tempos of the first track will be decoded.

Create yours

You can easily create your own tokenizer and benefit from the MidiTok framework. Just create a class inheriting from miditok.MIDITokenizer, and override the miditok.MIDITokenizer._add_time_events(), miditok.MIDITokenizer._tokens_to_midi(), miditok.MIDITokenizer._create_vocabulary() and miditok.MIDITokenizer._create_token_types_graph() methods (and, if needed, miditok.MIDITokenizer._midi_to_tokens(), miditok.MIDITokenizer._create_track_events() and miditok.MIDITokenizer._create_midi_events()) with your tokenization strategy. A skeleton is sketched below.
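
A skeleton sketch (the method signatures are indicative; match them to the base class of your MidiTok version):

```python
from miditok import MIDITokenizer

class MyTokenizer(MIDITokenizer):
    """A custom tokenization strategy (bodies left to implement)."""

    def _add_time_events(self, events):
        # Insert time tokens (e.g. TimeShift or Bar/Position events)
        # between the note events, and return the complete event list
        raise NotImplementedError

    def _tokens_to_midi(self, tokens, *args, **kwargs):
        # Decode token sequences back into a MIDI object
        raise NotImplementedError

    def _create_vocabulary(self):
        # Return the list of all the tokens of the vocabulary
        raise NotImplementedError

    def _create_token_types_graph(self):
        # Return a dict mapping each token type to the types
        # that can follow it, used for error checking
        raise NotImplementedError
```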

If you think people can benefit from it, feel free to send a pull request on GitHub.