Byte Pair Encoding (BPE)

What is BPE

BPE is a compression technique that replaces the most recurrent byte (tokens in our case) successions of a corpus, by newly created ones. For instance, in the character sequence aabaabaacaa, the sub-sequence aa occurs three times and is the most recurrent one. Learning and applying BPE on this sequence would replace aa with a new symbol, e.g., d, resulting in a compressed sequence dbdbdcd. The latter can be reduced again by replacing the db subsequence, giving eedcd. The vocabulary, which initially contained three characters (a, b and c) now also contains d and e. In practice BPE is learned on a corpus until the vocabulary reaches a target size.

Today in the NLP field, BPE is used with almost all tokenizations to build their vocabulary, as it allows to encode rare words and segmenting unknown or composed words as sequences of sub-word units. The base initial vocabulary is the set of all the unique characters present in the data, which compose the words that are automatically learned as tokens by the BPE algorithm.

BPE for symbolic music

In the case of symbolic music we will consider the tokens of a tokenizer without BPE as the base vocabulary, which can be seen as the equivalent of the characters (or bytes) for text. To compute BPE, MidiTok is backed by the Hugging Face 🤗tokenizers Rust library allowing super-fast training and encoding. Thus, internally we represent base tokens (from base vocab) as characters (bytes). Essentially, a token will have three unique (non-shared by others) forms:

The text describing the token itself, e.g. Pitch_58, Position_4 …;
An id as an integer, that will be fed to the model, e.g. 65;
A byte form as a character or succession of characters, e.g. a or any unicode character starting from the 33rd one (0x21).

A token learned with BPE will be represented by the succession of the unique characters of the base tokens it represent. You can access to several vocabularies to get the equivalents forms of tokens:

vocab: the base vocabulary, binding token descriptions to their ids;
vocab_bpe: the vocabulary with BPE applied, binding byte forms to their integer id;
_vocab_base: a copy of the initial base vocabulary, this attribute is used in case the initial base vocab is overriden by miditok.MIDITokenizer.learn_bpe() with the start_from_empty_voc option;
_vocab_base:
_vocab_base_byte_to_token: biding the base token byte forms to their string forms;
_vocab_base_id_to_byte: biding the base token ids (integers) to their byte forms;
_vocab_bpe_bytes_to_tokens: biding the byte forms of the complete vocab to their string forms, as a list of string;

For symbolic music, BPE has been showned to improve the performances of Transformers models while helping them to learn more isotropic embedding representations. BPE can be applied on top of any tokenization, as long as it is not based on embedding pooling! (is_multi_voc)

BPE example

from miditok import REMI, TokSequence
from copy import deepcopy

tokenizer = REMI()  # using defaults parameters (constants.py)
paths_midis = list(Path("path", "to", "midis").glob('**/*.mid'))

# Learns the vocabulary with BPE
tokenizer.learn_bpe(
    vocab_size=500,
    files_paths=paths_midis,
)

# Opens tokens, apply BPE on them, and decode BPE back
tokens = tokenizer.load_tokens(tokens_no_bpe_paths[0])
tokens = TokSequence(ids=tokens)
tokens_with_bpe = tokenizer.apply_bpe(deepcopy(tokens))  # copy as the method is inplace
tokens_no_bpe = tokenizer.decode_bpe(deepcopy(tokens_with_bpe))

Methods

To use BPE, you must first train your tokenizer from data (miditok.MIDITokenizer.learn_bpe()), then BPE will be automatically applied when tokenizing any MIDI file.

Tokenizers can be saved and loaded (Save / Load tokenizer).

miditok.MIDITokenizer.learn_bpe(self, vocab_size: int, iterator: Iterable | None = None, files_paths: list[Path | str] | None = None, start_from_empty_voc: bool = False, **kwargs) → None

Construct the vocabulary from BPE backed by the 🤗tokenizers library.

The data used for training can either be given through the iterator argument as an iterable object yielding strings, or by files_paths as a list of paths to MIDI files that will be tokenized. You can read the Hugging Face 🤗tokenizers documentation, 🤗tokenizers API documentation and 🤗tokenizers course for more details about the iterator and input type.

The training progress bar will not appear with non-proper terminals. (cf GitHub issue )

Parameters:

vocab_size – size of the vocabulary to learn / build.
iterator – an iterable object yielding the training data, as lists of string. It can be a list or a Generator. This iterator will be passed to the BPE model for training. It musts implement the __len__ method. If None is given, you must use the tokens_paths argument. (default: None)
files_paths – paths of the files to load and use. They can be either MIDI or tokens (json) files. (default: None)
start_from_empty_voc – the training will start from an empty base vocabulary. The tokenizer will then have a base vocabulary only based on the unique bytes present in the training data. If you set this argument to True, you should use the tokenizer only with the training data, as new data might contain “unknown” tokens missing from the vocabulary. Comparing this to text, setting this argument to True would create a tokenizer that will only know the characters present in the training data, and would not be compatible/know other characters. This argument can allow to optimize the vocabulary size. If you are unsure about this, leave it to False. (default: False)
kwargs – any additional argument to pass to the trainer. See the tokenizers docs for more details.

miditok.MIDITokenizer.apply_bpe(self, seq: TokSequence | list[TokSequence]) → None

Apply Byte Pair Encoding (BPE) to a TokSequence, or list of TokSequences.

If a list is given, BPE will be applied by batch on all sequences at the time.

Parameters:: seq – Sequence(s) to apply BPE.

miditok.MIDITokenizer.decode_bpe(self, seq: TokSequence | list[TokSequence]) → None

Decode (inplace) the BPE of the ids of a miditok.TokSequence.

This method only modifies the .ids attribute of the input sequence(s) only and does not complete it. This method can be used recursively on lists of miditok.TokSequence.

Parameters:: seq – token sequence to decompose.