Byte Pair Encoding (BPE)

What is BPE

BPE is a compression technique that replaces the most recurrent successions of bytes (tokens, in our case) in a corpus with newly created symbols. Each merge replaces a frequent token succession with a single new token, which shortens the sequences and increases the vocabulary size. In NLP today, BPE is used with almost all tokenizations to build their vocabulary, as it makes it possible to encode rare words and to segment unknown or compound words into sequences of sub-word units. For symbolic music, it has been shown to improve the performance of Transformer models while helping them learn more isotropic embedding representations.
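To make the merge loop concrete, here is a minimal sketch of greedy BPE learning over integer token sequences. It is an illustration of the general technique, not MidiTok's implementation: the names learn_bpe_sketch and replace_pair, the merges dictionary layout, and the stopping rule (stop when no pair occurs at least twice) are all assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def replace_pair(seq: List[int], pair: Tuple[int, int], new_id: int) -> List[int]:
    """Return a copy of seq where each occurrence of pair becomes new_id."""
    out: List[int] = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2  # skip both tokens of the merged pair
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe_sketch(
    sequences: List[List[int]], base_vocab_size: int, target_vocab_size: int
) -> Tuple[List[List[int]], Dict[int, Tuple[int, int]]]:
    """Greedy BPE: repeatedly merge the most frequent adjacent token pair."""
    seqs = [list(seq) for seq in sequences]  # work on copies
    merges: Dict[int, Tuple[int, int]] = {}  # new token id -> merged pair
    next_id = base_vocab_size  # assumes base ids are 0..base_vocab_size-1

    while next_id < target_vocab_size:
        # Count every pair of adjacent tokens across the corpus.
        pair_counts: Counter = Counter()
        for seq in seqs:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best_pair, count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no recurrent succession left to merge

        merges[next_id] = best_pair
        seqs = [replace_pair(seq, best_pair, next_id) for seq in seqs]
        next_id += 1

    return seqs, merges


corpus = [[0, 1, 2, 0, 1, 2, 3], [0, 1, 3]]
new_corpus, merges = learn_bpe_sketch(corpus, base_vocab_size=4, target_vocab_size=6)
print(merges)      # {4: (0, 1), 5: (4, 2)}
print(new_corpus)  # [[5, 5, 3], [4, 3]]
```

In this toy corpus the vocabulary grows from 4 to 6 tokens while the first sequence shrinks from 7 tokens to 3, which is the trade-off described above.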

You can apply it to symbolic music with MidiTok by first learning the vocabulary (tokenizer.learn_bpe()) and then converting a dataset with BPE (tokenizer.apply_bpe_to_dataset()). All tokenizations not based on embedding pooling are compatible!
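The two-step workflow can be sketched as a small helper. The helper name learn_then_apply_bpe and its argument names are hypothetical; it only assumes that the tokenizer object passed in exposes learn_bpe() and apply_bpe_to_dataset() with the signatures documented on this page. How the tokenizer is created and how the dataset was tokenized beforehand are left out on purpose.

```python
from pathlib import Path
from typing import Any, List, Optional, Tuple, Union


def learn_then_apply_bpe(
    tokenizer: Any,
    tokens_path: Union[Path, str],
    vocab_size: int,
    tokenizer_out_dir: Union[Path, str],
    bpe_dataset_out: Union[Path, str],
    files_lim: Optional[int] = None,
) -> Tuple[List[float], List[int], List[float]]:
    """Learn BPE on an already tokenized dataset, then convert that dataset.

    `tokenizer` is assumed to be a MidiTok tokenizer (any tokenization not
    based on embedding pooling); this helper itself is hypothetical.
    """
    # Step 1: learn the BPE vocabulary from the token files.
    metrics = tokenizer.learn_bpe(
        tokens_path=Path(tokens_path),
        vocab_size=vocab_size,
        out_dir=Path(tokenizer_out_dir),
        files_lim=files_lim,  # cap the number of files if learning is slow
    )
    # Step 2: apply the learned BPE to the (non-BPE) tokenized dataset.
    tokenizer.apply_bpe_to_dataset(Path(tokens_path), Path(bpe_dataset_out))
    return metrics
```

A call might look like learn_then_apply_bpe(tokenizer, "tokens_noBPE", 500, "tokenizer_bpe", "tokens_BPE"), where the directory names and the vocabulary size are placeholders.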

Methods

miditok.MIDITokenizer.learn_bpe(self, tokens_path: Union[Path, str], vocab_size: int, out_dir: Union[Path, str], files_lim: Optional[int] = None, save_converted_samples: bool = False, print_seq_len_variation: bool = True) → Tuple[List[float], List[int], List[float]]

Byte Pair Encoding (BPE) method to build the vocabulary. This method will build (modify) the vocabulary by analyzing an already tokenized dataset to find the most recurrent token successions. Note that this implementation is in pure Python and will be slow if you use a large number of token files. You can use the files_lim argument to limit how many files are analyzed.

Parameters:

tokens_path – path to token files to learn the BPE combinations from.
vocab_size – the new vocabulary size.
out_dir – directory to save the tokenizer’s parameters and vocabulary after BPE learning is finished.
files_lim – limit of token files to use. (default: None)
save_converted_samples – will save in out_dir the samples that have been used to create the BPE vocab. Files will keep the same name and relative path. (default: False)
print_seq_len_variation – prints the mean sequence length before and after BPE, and the variation in %. (default: True)

Returns:

learning metrics, as lists of:
- the average number of token combinations covered by the newly created BPE tokens
- the maximum number of token combinations
- the average sequence length

Each index in the lists corresponds to a learning step.

miditok.MIDITokenizer.apply_bpe(self, tokens: List[int]) → List[int]

Converts a sequence of tokens into tokens with BPE.

Parameters:

tokens – tokens to convert.

Returns:

the tokens with BPE applied.
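As an illustration of what applying BPE to one sequence involves, the sketch below replays a table of learned merges in the order they were learned. The function name apply_merges_sketch and the merges layout (new token id mapped to the pair it replaces, with ids assigned in learning order) are assumptions for this example and do not describe MidiTok's internal data structures.

```python
from typing import Dict, List, Tuple


def apply_merges_sketch(
    tokens: List[int], merges: Dict[int, Tuple[int, int]]
) -> List[int]:
    """Apply learned BPE merges to a sequence of base tokens."""
    seq = list(tokens)  # leave the input untouched
    # Replay merges in learning order: later merges may depend on earlier ones.
    for new_id in sorted(merges):
        pair = merges[new_id]
        out: List[int] = []
        i = 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq


merges = {4: (0, 1), 5: (4, 2)}  # hypothetical merge table
print(apply_merges_sketch([0, 1, 2, 3, 0, 1], merges))  # [5, 3, 4]
```

The order matters here: token 5 is defined in terms of token 4, so the pair (0, 1) has to be merged before (4, 2) can match.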

miditok.MIDITokenizer.apply_bpe_to_dataset(self, dataset_path: Union[Path, str], out_path: Union[Path, str])

Applies BPE to an already tokenized dataset (with no BPE).

Parameters:

dataset_path – path to token files to load.
out_path – output directory to save.

miditok.MIDITokenizer.decompose_bpe(self, tokens: List[int]) → List[int]

Decomposes a sequence of tokens containing BPE tokens into “prime” tokens. It is an in-place operation.

Parameters:

tokens – token sequence to decompose.

Returns:

decomposed token sequence.
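Decomposition is the inverse of merging: every BPE token is expanded back into the pair it replaced until only prime tokens remain. The sketch below shows the idea with a hypothetical merges table (new token id mapped to its pair); the name decompose_sketch is made up for this example, and unlike the method documented above it returns a new list rather than working in place.

```python
from typing import Dict, List, Tuple


def decompose_sketch(
    tokens: List[int], merges: Dict[int, Tuple[int, int]]
) -> List[int]:
    """Expand BPE tokens back into the prime tokens they stand for."""
    out: List[int] = []
    stack = list(reversed(tokens))  # pop() then yields tokens left to right
    while stack:
        tok = stack.pop()
        if tok in merges:
            left, right = merges[tok]
            # Push right first so that left is expanded first.
            stack.append(right)
            stack.append(left)
        else:
            out.append(tok)  # already a prime token
    return out


merges = {4: (0, 1), 5: (4, 2)}  # hypothetical merge table
print(decompose_sketch([5, 5, 3], merges))  # [0, 1, 2, 0, 1, 2, 3]
```

Using an explicit stack instead of recursion keeps the sketch safe for deeply nested merges, where one BPE token is built from many earlier ones.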

Tokenizers can be saved and loaded (see Save / Load tokenizer). After learning BPE with miditok.MIDITokenizer.learn_bpe(), the tokenizer is saved automatically.