PyTorch data loaders

MidiTok features PyTorch Dataset objects to load MIDI or token files during training. You can use them with a PyTorch DataLoader or your preferred libraries. When indexed, the Dataset will output dictionaries with values corresponding to the inputs and labels.

MidiTok also provides an “all-in-one” data collator, miditok.pytorch_data.DataCollator, to be used with a PyTorch DataLoader to pad batches, add BOS and EOS tokens, and create attention masks.

Note: This module is imported only if torch is installed in your Python environment.

Dataset classes and data collators to be used with PyTorch when training a model.

DatasetTok is a general I/O class that loads and tokenizes MIDI files and keeps them in memory from its initialization. It can chunk whole token sequences into smaller sections bounded by a minimum and a maximum size. DatasetJsonIO loads JSON token files on the fly as it is iterated during batch creation.

class miditok.pytorch_data.DataCollator(pad_token_id: int, bos_token_id: int | None = None, eos_token_id: int | None = None, pad_on_left: bool = False, copy_inputs_as_labels: bool = False, shift_labels: bool = False, labels_pad_idx: int = -100, inputs_kwarg_name: str = 'input_ids', labels_kwarg_name: str = 'labels')

All-in-one data collator for PyTorch DataLoader.

It applies padding (on the right or left side of sequences) and can prepend a BOS token and append an EOS token. It also adds an "attention_mask" entry to the batch that matches the padding applied.

Parameters:

pad_token_id – padding token id.
bos_token_id – BOS token id. (default: None)
eos_token_id – EOS token id. (default: None)
pad_on_left – if True, sequences are padded on the left. Some libraries expect left padding, for example Hugging Face Transformers when generating. (default: False)
copy_inputs_as_labels – adds a labels entry (labels_kwarg_name) to the batch, or replaces the existing one, with a copy of the inputs entry (inputs_kwarg_name). (default: False)
shift_labels – will shift inputs and labels for autoregressive training/teacher forcing. (default: False)
labels_pad_idx – padding id for labels. (default: -100)
inputs_kwarg_name – name of dict / kwarg key for inputs. (default: "input_ids")
labels_kwarg_name – name of dict / kwarg key for labels. (default: "labels")
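To make the padding and attention mask behaviour concrete, here is a minimal pure-Python sketch of the idea. It is not the DataCollator implementation: collate_sketch is a hypothetical helper that works on plain lists rather than tensors, and it only covers BOS/EOS insertion, padding and the mask.

```python
def collate_sketch(batch, pad_token_id, bos_token_id=None, eos_token_id=None,
                   pad_on_left=False):
    """Pad a batch of token id lists and build the matching attention mask."""
    # Optionally wrap each sequence with BOS / EOS tokens.
    seqs = []
    for ids in batch:
        ids = list(ids)
        if bos_token_id is not None:
            ids = [bos_token_id] + ids
        if eos_token_id is not None:
            ids = ids + [eos_token_id]
        seqs.append(ids)

    # Pad every sequence to the length of the longest one in the batch.
    max_len = max(len(ids) for ids in seqs)
    input_ids, attention_mask = [], []
    for ids in seqs:
        padding = [pad_token_id] * (max_len - len(ids))
        ones = [1] * len(ids)       # real tokens
        zeros = [0] * len(padding)  # padded positions
        if pad_on_left:
            input_ids.append(padding + ids)
            attention_mask.append(zeros + ones)
        else:
            input_ids.append(ids + padding)
            attention_mask.append(ones + zeros)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


out = collate_sketch([[5, 6, 7], [8]], pad_token_id=0, bos_token_id=1, eos_token_id=2)
print(out["input_ids"])       # [[1, 5, 6, 7, 2], [1, 8, 2, 0, 0]]
print(out["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

The mask is built alongside the padding so that the two always agree, whichever side is padded.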

class miditok.pytorch_data.DatasetJsonIO(files_paths: Sequence[Path], max_seq_len: int | None = None)

Basic Dataset loading Json files of tokenized MIDIs on the fly.

When indexing it (dataset[idx]), this class will load the files_paths[idx] JSON file and return the token ids, which can be used to train generative models. This class is only compatible with tokens saved as a single stream of tokens (tokenizer.one_token_stream). If you plan to use it with token files containing multiple token streams, you should first split them with miditok.pytorch_data.split_dataset_to_subsequences().

It can truncate sequences to a max_seq_len limit, but it will not split them into subsequences. If the sequence lengths in your dataset vary widely, you might want to first split it into subsequences with the miditok.pytorch_data.split_dataset_to_subsequences() method before loading it, to avoid losing data.

This Dataset class is well suited if you are using a large dataset, or have access to limited RAM resources.

Parameters:

files_paths – list of paths to files to load.
max_seq_len – maximum sequence length (in num of tokens). (default: None)
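The on-the-fly loading pattern can be sketched as below. This is an illustration under assumptions, not the DatasetJsonIO source: LazyJsonTokens is a hypothetical class, the token files are assumed to store their ids under an "ids" entry, and the returned dictionary key is chosen for the example.

```python
import json
import tempfile
from pathlib import Path


class LazyJsonTokens:
    """Hypothetical stand-in: reads one JSON token file per index."""

    def __init__(self, files_paths, max_seq_len=None):
        self.files_paths = list(files_paths)
        self.max_seq_len = max_seq_len

    def __len__(self):
        return len(self.files_paths)

    def __getitem__(self, idx):
        # The file is opened only when indexed, so memory use stays low.
        with Path(self.files_paths[idx]).open() as f:
            ids = json.load(f)["ids"]
        if self.max_seq_len is not None:
            ids = ids[: self.max_seq_len]  # truncation: later tokens are lost
        return {"input_ids": ids}


# Write two small token files to a temporary directory for the demo.
tmp_dir = Path(tempfile.mkdtemp())
paths = []
for i, ids in enumerate([[1, 2, 3, 4, 5, 6], [7, 8]]):
    path = tmp_dir / f"tokens_{i}.json"
    path.write_text(json.dumps({"ids": ids}))
    paths.append(path)

dataset = LazyJsonTokens(paths, max_seq_len=4)
print(dataset[0])  # {'input_ids': [1, 2, 3, 4]}
print(dataset[1])  # {'input_ids': [7, 8]}
```

The first file loses its last two tokens, which is why splitting long files into subsequences beforehand is worth considering.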

class miditok.pytorch_data.DatasetTok(files_paths: Sequence[Path], min_seq_len: int, max_seq_len: int, tokenizer: MIDITokenizer = None, one_token_stream: bool = True, func_to_get_labels: Callable[[Score | Sequence, Path], int] | None = None, sample_key_name: str = 'input_ids', labels_key_name: str = 'labels')

Basic Dataset loading and tokenizing MIDIs or JSON token files.

The token ids will be stored in RAM. It outputs token sequences that can be used to train models.

The token sequences being loaded will then be split into subsequences, with lengths between min_seq_len and max_seq_len. For example, with min_seq_len = 50 and max_seq_len = 100:
* a sequence of 650 tokens will be split into 6 subsequences of 100 tokens plus one subsequence of 50 tokens;
* a sequence of 620 tokens will be split into 6 subsequences of 100 tokens, the last 20 tokens will be discarded;
* a sequence of 670 tokens will be split into 6 subsequences of 100 tokens plus one subsequence of 50 tokens, and the last 20 tokens will be discarded.

This Dataset class is well suited if you have enough RAM to store all the data, as it does not require you to split the dataset beforehand into subsequences of the length you desire. Note that if you load MIDI files directly, the loading can take some time as they will need to be tokenized. You might want to tokenize them once beforehand with the tokenizer.tokenize_midi_dataset() method.

Additionally, you can use the func_to_get_labels argument to provide a function that returns the label of each file (one label per file).

Parameters:

files_paths – list of paths to files to load.
min_seq_len – minimum sequence length (in num of tokens)
max_seq_len – maximum sequence length (in num of tokens)
tokenizer – tokenizer object, to use to load MIDIs instead of tokens. (default: None)
one_token_stream – give False if the token files contain multiple tracks, i.e. the first dimension of the value of the “ids” entry corresponds to several tracks. Otherwise, leave it as True. (default: True)
func_to_get_labels – a function to retrieve the label of a file. The function must take two positional arguments: the first is either the loaded Score or the tokens loaded from the JSON file, the second is the path to the file just loaded. The function must return an integer which corresponds to the label id (and not the absolute value, e.g. if you are classifying 10 musicians, return the id from 0 to 9 included corresponding to the musician). (default: None)
sample_key_name – name of the dictionary key containing the sample data when iterating the dataset. (default: "input_ids")
labels_key_name – name of the dictionary key containing the labels data when iterating the dataset. (default: "labels")
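As an illustration of the func_to_get_labels contract (two positional arguments, an integer label id returned), here is a small sketch. The composer list, the directory layout with one sub-directory per composer, and the function name are all assumptions made for the example.

```python
from pathlib import Path

# Hypothetical label set: one sub-directory per composer.
COMPOSERS = ["bach", "chopin", "mozart"]


def get_composer_label(score_or_tokens, file_path: Path) -> int:
    """Return the label id of a file from its parent directory name."""
    # The first argument (loaded score or tokens) is unused here:
    # the label only depends on where the file is stored.
    return COMPOSERS.index(file_path.parent.name)


label = get_composer_label(None, Path("dataset/chopin/nocturne.json"))
print(label)  # 1
```

Returning the index in a fixed list keeps the ids contiguous from 0, as the parameter description requires.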

miditok.pytorch_data.split_dataset_to_subsequences(files_paths: Sequence[Path | str], out_dir: Path | str, min_seq_len: int, max_seq_len: int, one_token_stream: bool = True) → None

Split a dataset of tokens files into subsequences.

This method is particularly useful if you plan to use a miditok.pytorch_data.DatasetJsonIO, as it splits token sequences into subsequences of the desired lengths before they are loaded for training.

Parameters:

files_paths – list of files of tokens to split.
out_dir – output directory to save the subsequences.
min_seq_len – minimum sequence length.
max_seq_len – maximum sequence length.
one_token_stream – give False if the token files contain multiple tracks, i.e. the first dimension of the value of the “ids” entry corresponds to several tracks. Otherwise, leave it as True. (default: True)

miditok.pytorch_data.split_seq_in_subsequences(seq: Sequence[any], min_seq_len: int, max_seq_len: int) → list[Sequence[Any]]

Split a sequence of tokens into subsequences.

The subsequences will have lengths between min_seq_len and max_seq_len, both included: min_seq_len <= len(sub_seq) <= max_seq_len.

Parameters:

seq – sequence to split.
min_seq_len – minimum sequence length.
max_seq_len – maximum sequence length.

Returns:

list of subsequences.
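The length constraint can be illustrated with a short sketch. It is not the library's implementation: split_sketch is a hypothetical helper that cuts chunks of max_seq_len tokens and keeps the final remainder whole whenever it is at least min_seq_len long, which is an assumption of this sketch; the library's handling of such remainders may differ.

```python
def split_sketch(seq, min_seq_len, max_seq_len):
    """Cut seq into chunks of max_seq_len, dropping a too-short remainder."""
    subsequences = []
    for start in range(0, len(seq), max_seq_len):
        chunk = seq[start : start + max_seq_len]
        # Only the final chunk can be shorter than max_seq_len.
        if len(chunk) >= min_seq_len:
            subsequences.append(chunk)
    return subsequences


lengths_650 = [len(s) for s in split_sketch(list(range(650)), 50, 100)]
lengths_620 = [len(s) for s in split_sketch(list(range(620)), 50, 100)]
print(lengths_650)  # [100, 100, 100, 100, 100, 100, 50]
print(lengths_620)  # [100, 100, 100, 100, 100, 100]
```

Every returned subsequence satisfies min_seq_len <= len(sub_seq) <= max_seq_len, and the 20 trailing tokens of the 620-token sequence are dropped.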