Welcome to MidiTok’s documentation!
MidiTok is a Python package for MIDI file tokenization, introduced at the ISMIR 2021 LBDs (paper). It tokenizes symbolic music files (MIDI, abc), i.e. converts them into sequences of tokens ready to be fed to models such as Transformers, for any generation, transcription or MIR task. MidiTok features most known MIDI tokenizations, and is built around the idea that they all share common methods. Tokenizers can be trained with BPE, Unigram or WordPiece (Training a tokenizer) and be pushed to and pulled from the Hugging Face hub! GitHub repository
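To make "sequences of tokens" concrete, here is a toy sketch in plain Python. It is not MidiTok's implementation and does not use its API: `Note`, `notes_to_tokens` and `build_vocab` are hypothetical names, and the token scheme (time shifts followed by pitch, velocity and duration) is only loosely modelled on MIDI-like tokenizations, with an assumed resolution of 120 ticks per time step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    """Hypothetical minimal note type, for illustration only."""
    pitch: int      # MIDI pitch, 0-127
    velocity: int   # MIDI velocity, 1-127
    start: int      # onset time in ticks
    duration: int   # length in ticks


def notes_to_tokens(notes, ticks_per_step=120):
    """Convert notes into a flat list of string tokens (toy scheme)."""
    tokens = []
    current_step = 0
    # Process notes in time order, breaking ties by pitch.
    for note in sorted(notes, key=lambda n: (n.start, n.pitch)):
        step = note.start // ticks_per_step
        if step > current_step:
            # Encode elapsed time as a single relative shift token.
            tokens.append(f"TimeShift_{step - current_step}")
            current_step = step
        tokens.append(f"Pitch_{note.pitch}")
        tokens.append(f"Velocity_{note.velocity}")
        # Durations are quantized to whole steps, with a minimum of 1.
        tokens.append(f"Duration_{max(1, note.duration // ticks_per_step)}")
    return tokens


def build_vocab(tokens):
    """Map each distinct token to an integer id, in sorted order."""
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}


# A C major third played together, then a G one beat later (480 ticks).
notes = [Note(60, 90, 0, 480), Note(64, 90, 0, 480), Note(67, 80, 480, 240)]
tokens = notes_to_tokens(notes)
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]

print(tokens)
print(ids)
```

The integer ids are what a model would actually consume. A real tokenizer additionally handles tracks, tempo, time signatures and special tokens, which this sketch leaves out.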
Installation
pip install miditok
MidiTok uses symusic to read and write MIDI files, and tokenizer training is backed by the Hugging Face 🤗 tokenizers library for fast encoding.
Citation
If you use MidiTok for your research, a citation in your manuscript would be greatly appreciated. ❤️
BibTeX citations of the individual tokenizations are also available (Citations).
@inproceedings{miditok2021,
title={{MidiTok}: A Python package for {MIDI} file tokenization},
author={Fradet, Nathan and Briot, Jean-Pierre and Chhel, Fabien and El Fallah Seghrouchni, Amal and Gutowski, Nicolas},
booktitle={Extended Abstracts for the Late-Breaking Demo Session of the 22nd International Society for Music Information Retrieval Conference},
year={2021},
url={https://archives.ismir.net/ismir2021/latebreaking/000005.pdf},
}
Contents
- Bases
- Tokens and vocabulary
- Vocabulary
- TokSequence
- The MusicTokenizer class
MusicTokenizer
MusicTokenizer.add_to_vocab()
MusicTokenizer.complete_sequence()
MusicTokenizer.decode()
MusicTokenizer.decode_token_ids()
MusicTokenizer.encode()
MusicTokenizer.encode_token_ids()
MusicTokenizer.io_format
MusicTokenizer.is_multi_voc
MusicTokenizer.is_trained
MusicTokenizer.len
MusicTokenizer.load_tokens()
MusicTokenizer.pad_token_id
MusicTokenizer.preprocess_score()
MusicTokenizer.save_params()
MusicTokenizer.save_pretrained()
MusicTokenizer.save_tokens()
MusicTokenizer.score_has_time_signatures_not_in_vocab()
MusicTokenizer.special_tokens
MusicTokenizer.special_tokens_ids
MusicTokenizer.token_id_type()
MusicTokenizer.token_ids_of_type()
MusicTokenizer.tokenize_dataset()
MusicTokenizer.tokens_errors()
MusicTokenizer.train()
MusicTokenizer.vocab
MusicTokenizer.vocab_model
MusicTokenizer.vocab_size
- Tokenizer config
- Additional tokens
- Special tokens
- Tokens & TokSequence input / output format
- Magic methods
- Save / Load tokenizer
- Examples
- Tokenizations
- Training a tokenizer
- Hugging Face hub
- PyTorch Training
- Data augmentation
- Utils methods
compute_ticks_per_bar()
compute_ticks_per_beat()
concat_scores()
convert_ids_tensors_to_list()
detect_chords()
filter_dataset()
fix_offsets_overlapping_notes()
get_bars_ticks()
get_beats_ticks()
get_num_notes_per_bar()
get_score_programs()
get_score_ticks_per_beat()
merge_same_program_tracks()
merge_scores()
merge_tracks()
merge_tracks_per_class()
num_bar_pos()
remove_duplicated_notes()
split_score_per_beats()
split_score_per_ticks()
split_score_per_tracks()
- Citations