xlstm_jax.dataset.lmeval_dataset#
Classes#
- HFTokenizeLogLikelihoodRolling: Dataset that tokenizes (HuggingFace) and splits documents according to the structure of loglikelihood_rolling.
- HFTokenizeLogLikelihood: Dataset mapper modeling a simplified lm_eval dataset.
Module Contents#
- class xlstm_jax.dataset.lmeval_dataset.HFTokenizeLogLikelihoodRolling(tokenizer_path, max_length, batch_size=1, hf_access_token='', tokenizer_cache_dir=None, add_bos_token=True, add_eos_token=False, bos_token_id=None, eos_token_id=None)#
Dataset that tokenizes (HuggingFace) and splits documents according to the structure of loglikelihood_rolling.
Targets are shifted for next-token prediction. This class does not operate at the level of individual test instances, as is common with e.g. grain, because documents are split into multiple sequences to match the maximal sequence length. Instead, it follows the .map() paradigm, converting a list of lm_eval Instances into training instances (dict). See: EleutherAI/lm-evaluation-harness
Prefix tokens are handled here as well and masked out in the targets_segmentation.
- Parameters:
tokenizer_path (str) – HuggingFace tokenizer name
max_length (int) – Maximal sequence length / context_length
batch_size (int) – Batch size to be used for filling up the last batch.
hf_access_token (str) – HuggingFace access token for other tokenizers.
tokenizer_cache_dir (str | None) – HuggingFace tokenizer cache dir
add_bos_token (bool) – Whether to add a beginning of sequence / document token.
add_eos_token (bool) – Whether to add an end of sequence / document token.
bos_token_id (int | None) – BOS token id if not taken from tokenizer.
eos_token_id (int | None) – EOS token id if not taken from tokenizer.
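The splitting and target shifting described above can be illustrated with a minimal sketch. It is not the class's implementation: the helper name `split_rolling` is hypothetical, a single prefix token is assumed, and the windows are left unpadded.

```python
def split_rolling(tokens, prefix_token, max_length):
    """Split one tokenized document into (inputs, targets) windows.

    Inputs are the targets shifted right by one position, with the
    prefix token (an assumption: a single id, e.g. BOS) in the first slot.
    """
    if max_length < 1:
        raise ValueError("max_length must be positive")
    inputs = [prefix_token] + tokens[:-1]
    windows = []
    # Each token appears as a target in exactly one window.
    for start in range(0, len(tokens), max_length):
        end = start + max_length
        windows.append((inputs[start:end], tokens[start:end]))
    return windows


windows = split_rolling([11, 12, 13, 14, 15], prefix_token=0, max_length=2)
```

Here a five-token document with max_length=2 yields three windows, the last one shorter than the others and therefore in need of padding.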
- tokenizer#
- batch_size = 1#
- max_length#
- add_bos_token = True#
- add_eos_token = False#
- _mapped_data = None#
- bos_token_id = None#
- eos_token_id = None#
- _tokenize(example)#
Tokenize a string with the tokenizer.
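The source does not state whether _tokenize itself applies the add_bos_token / add_eos_token flags. The sketch below only shows how such flags are commonly applied to a list of token ids. The helper names `add_special_tokens` and `toy_tokenize` are hypothetical, and the toy tokenizer stands in for a real HuggingFace tokenizer.

```python
def toy_tokenize(text):
    # Toy stand-in for a real tokenizer: one id per whitespace-separated
    # word, equal to the word length.
    return [len(word) for word in text.split()]


def add_special_tokens(token_ids, *, add_bos_token=True, add_eos_token=False,
                       bos_token_id=None, eos_token_id=None):
    """Optionally prepend a BOS id and append an EOS id (illustrative only)."""
    out = list(token_ids)
    if add_bos_token:
        if bos_token_id is None:
            raise ValueError("bos_token_id is required when add_bos_token=True")
        out.insert(0, bos_token_id)
    if add_eos_token:
        if eos_token_id is None:
            raise ValueError("eos_token_id is required when add_eos_token=True")
        out.append(eos_token_id)
    return out


ids = add_special_tokens(toy_tokenize("a bb ccc"), bos_token_id=0)
```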
- simple_array(*, prefix_tokens, all_tokens, doc_idx, seq_idx)#
Creates a simple document instance with “standard” padding and masks. It is used for documents that do not exceed max_length, and for all sequences except the last one of a longer document.
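A minimal sketch of what “standard” padding and masks could look like. This is an illustration under assumptions, not the method's actual output: the function name `pad_instance`, every dictionary key other than targets_segmentation, the pad id 0, and the convention that all_tokens begins with the prefix tokens are all hypothetical.

```python
def pad_instance(prefix_tokens, all_tokens, doc_idx, seq_idx, max_length, pad_id=0):
    """Build one padded instance with position and segmentation info."""
    inputs = all_tokens[:-1]
    targets = all_tokens[1:]
    n = len(inputs)
    if n > max_length:
        raise ValueError("sequence longer than max_length")
    pad = max_length - n
    n_prefix = len(prefix_tokens)
    # The target at position i is all_tokens[i + 1]; it is scored only
    # if it lies past the prefix.
    target_mask = [1 if i + 1 >= n_prefix else 0 for i in range(n)]
    return {
        "inputs": inputs + [pad_id] * pad,
        "targets": targets + [pad_id] * pad,
        "inputs_position": list(range(n)) + [0] * pad,
        "inputs_segmentation": [1] * n + [0] * pad,
        "targets_segmentation": target_mask + [0] * pad,
        "document_idx": doc_idx,
        "sequence_idx": seq_idx,
    }


example = pad_instance(prefix_tokens=[7, 8], all_tokens=[7, 8, 11, 12],
                       doc_idx=3, seq_idx=0, max_length=5)
```

In this example the first target (8) is itself a prefix token, so its entry in targets_segmentation is 0, as are the two padding slots.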
- map(requests)#
Maps a list of lm_eval Instances to a (potentially longer) list of sequences for a language model evaluation. Generated instances are padded to max_length and contain position and segmentation information as well as document and sequence indices.
- Parameters:
requests (list[lm_eval.api.instance.Instance]) – List of lm_eval Instances / Requests.
- Returns:
List of converted instances for lm processing.
- Return type:
grain.python.MapDataset
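The way a list of documents can become a longer list of sequences, each tagged with document and sequence indices, can be sketched as follows. This is a simplified illustration, not the method itself: it returns a plain list rather than a grain.python.MapDataset, the name `flatten_documents` is hypothetical, and marking filler records with index -1 is an assumption about how the last batch might be filled up.

```python
def flatten_documents(docs, max_length, batch_size):
    """Turn tokenized documents into per-sequence records.

    Each record carries the document and sequence index so that
    per-sequence results can later be regrouped per document.
    """
    records = []
    for doc_idx, tokens in enumerate(docs):
        starts = range(0, len(tokens), max_length)
        for seq_idx, start in enumerate(starts):
            records.append({
                "document_idx": doc_idx,
                "sequence_idx": seq_idx,
                "tokens": tokens[start:start + max_length],
            })
    # Fill the last batch with dummy records (assumption: index -1 marks
    # a filler that should be ignored downstream).
    remainder = len(records) % batch_size
    if remainder:
        for _ in range(batch_size - remainder):
            records.append({"document_idx": -1, "sequence_idx": -1, "tokens": []})
    return records


records = flatten_documents([[1, 2, 3, 4, 5], [6, 7]], max_length=2, batch_size=3)
```

Two documents produce four real sequences here, and two filler records bring the total to a multiple of the batch size.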
- class xlstm_jax.dataset.lmeval_dataset.HFTokenizeLogLikelihood#
Dataset mapper modeling a simplified lm_eval dataset. Post-processing here could be done using the grain pipeline. However, unlike in HFTokenizeLogLikelihoodRolling, instances that exceed the maximal sequence length are not split. See: EleutherAI/lm-evaluation-harness
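In lm_eval, a loglikelihood request pairs a context with a continuation, and only the continuation is scored. The sketch below shows one way to build a single unsplit instance with a matching target mask. It is an assumption-laden illustration: the function name `loglikelihood_instance`, the dictionary keys, and the pad id are hypothetical, and the source does not specify how over-length instances are handled, so here they are simply left unsplit and unpadded.

```python
def loglikelihood_instance(context_tokens, continuation_tokens, max_length, pad_id=0):
    """Build one (unsplit) instance that scores only the continuation."""
    all_tokens = context_tokens + continuation_tokens
    inputs = all_tokens[:-1]
    targets = all_tokens[1:]
    n = len(inputs)
    n_context = len(context_tokens)
    # The target at position i is all_tokens[i + 1]; score it only if it
    # belongs to the continuation.
    mask = [1 if i + 1 >= n_context else 0 for i in range(n)]
    pad = max(0, max_length - n)  # over-length instances are not split
    return {
        "inputs": inputs + [pad_id] * pad,
        "targets": targets + [pad_id] * pad,
        "targets_segmentation": mask + [0] * pad,
    }


inst = loglikelihood_instance([1, 2, 3], [4, 5], max_length=6)
long_inst = loglikelihood_instance(list(range(6)), [6, 7], max_length=4)
```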