xlstm_jax.dataset.lmeval_dataset#

Classes#

`HFTokenizeLogLikelihoodRolling`	Dataset that tokenizes (HuggingFace) and splits documents according to the structure of loglikelihood_rolling.
`HFTokenizeLogLikelihood`	Dataset mapper modeling a simplified lm_eval dataset. Post-processing here could be done using the grain pipeline.

Module Contents#

class xlstm_jax.dataset.lmeval_dataset.HFTokenizeLogLikelihoodRolling(tokenizer_path, max_length, batch_size=1, hf_access_token='', tokenizer_cache_dir=None, add_bos_token=True, add_eos_token=False, bos_token_id=None, eos_token_id=None)#

Dataset that tokenizes (HuggingFace) and splits documents according to the structure of loglikelihood_rolling.

Targets are shifted for next-token prediction. This class does not operate at the level of individual test instances, as is done in e.g. grain, because documents are split into multiple sequences to fit the maximal sequence length. Instead, it follows the .map() paradigm, converting a list of lm_eval Instances into training instances (dict). See: EleutherAI/lm-evaluation-harness

Prefix tokens are handled here as well and masked out in the targets_segmentation.
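The splitting and target shifting described above can be illustrated with a minimal sketch. This is not the library's implementation: `split_rolling` and its arguments are hypothetical names, windows are assumed to be non-overlapping, and padding and `targets_segmentation` are left out for brevity.

```python
def split_rolling(tokens, prefix_token, max_length):
    """Split one tokenized document into (inputs, targets) windows.

    Hypothetical helper: every document token appears exactly once as a
    target, and each window holds at most max_length targets.
    """
    # Prepend the prefix so the first real token has something to be predicted from.
    full = [prefix_token] + list(tokens)
    windows = []
    for start in range(0, len(tokens), max_length):
        # Targets are the inputs shifted by one position (next-token prediction).
        targets = full[start + 1 : start + 1 + max_length]
        inputs = full[start : start + len(targets)]
        windows.append((inputs, targets))
    return windows


windows = split_rolling([10, 11, 12, 13, 14], prefix_token=0, max_length=3)
for seq_idx, (inputs, targets) in enumerate(windows):
    print(seq_idx, inputs, targets)
```

A five-token document with `max_length=3` yields two sequences here; the second one is shorter and would still need padding before batching.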

Parameters:

tokenizer_path (str) – HuggingFace tokenizer name
max_length (int) – Maximal sequence length / context_length
batch_size (int) – Batch size to be used for filling up the last batch.
hf_access_token (str) – HuggingFace access token, needed for tokenizers that require authentication.
tokenizer_cache_dir (str | None) – HuggingFace tokenizer cache dir
add_bos_token (bool) – Whether to add a beginning of sequence / document token.
add_eos_token (bool) – Whether to add an end of sequence / document token.
bos_token_id (int | None) – BOS token id if not taken from tokenizer.
eos_token_id (int | None) – EOS token id if not taken from tokenizer.

tokenizer#

batch_size = 1#

max_length#

add_bos_token = True#

add_eos_token = False#

_mapped_data = None#

bos_token_id = None#

eos_token_id = None#

_tokenize(example)#

Tokenize a string with the tokenizer.

Parameters: example (str) – String to tokenize
Returns: BatchEncoding in HF format with tokens.
Return type: transformers.tokenization_utils_base.BatchEncoding[str, list[int]]

simple_array(*, prefix_tokens, all_tokens, doc_idx, seq_idx)#

Creates a simple document instance with “standard” padding and masks. This is used for documents that do not exceed max_length, and for all sequences except the last one of a longer document.

Parameters:

prefix_tokens (list[int]) – List of prefix tokens
all_tokens (list[int]) – List of all tokens
doc_idx (int) – Document index
seq_idx (int) – Sequence index (in document)

Returns:

Data instance dictionary.

Return type:

grain.python.MapDataset
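As a rough illustration of “standard” padding and masks, the sketch below builds one padded instance as a plain dictionary. It is an assumption-laden sketch, not the library's code: `simple_array_sketch`, `pad_id`, and all field names other than `targets_segmentation` are hypothetical, and `all_tokens` is assumed to start with `prefix_tokens`.

```python
def simple_array_sketch(*, prefix_tokens, all_tokens, doc_idx, seq_idx,
                        max_length, pad_id=0):
    """Build one padded instance (hypothetical field names)."""
    inputs = list(all_tokens[:-1])   # the model sees every token but the last
    targets = list(all_tokens[1:])   # and predicts the token that follows
    n = len(inputs)
    if n > max_length:
        raise ValueError("sequence longer than max_length")
    pad = max_length - n

    # Targets that are themselves prefix tokens are masked out (not scored).
    n_masked = min(max(len(prefix_tokens) - 1, 0), n)

    return {
        "inputs": inputs + [pad_id] * pad,
        "targets": targets + [pad_id] * pad,
        "inputs_position": list(range(n)) + [0] * pad,
        "inputs_segmentation": [1] * n + [0] * pad,
        "targets_segmentation": [0] * n_masked + [1] * (n - n_masked) + [0] * pad,
        "document_idx": doc_idx,
        "sequence_idx": seq_idx,
    }


instance = simple_array_sketch(
    prefix_tokens=[1, 2], all_tokens=[1, 2, 10, 11],
    doc_idx=3, seq_idx=0, max_length=5,
)
print(instance["targets_segmentation"])
```

With a two-token prefix, the first target is itself a prefix token, so its segmentation entry is zero alongside the padding positions.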

map(requests)#

Maps a list of lm_eval Instances to a (potentially longer) list of sequences for a language model evaluation. Generated instances are padded to max_length and contain position and segmentation information as well as document and sequence indices.

Parameters: requests (list[lm_eval.api.instance.Instance]) – List of lm_eval Instances / Requests.
Returns: List of converted instances for lm processing.
Return type: grain.python.MapDataset
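The batch_size parameter is documented as being used for filling up the last batch. One plausible way to do this is sketched below. It is only an assumption about the mechanism: `fill_last_batch` and `make_padding_instance` are hypothetical names, and the filler instances are assumed to be fully masked so they do not affect the scores.

```python
def fill_last_batch(instances, batch_size, make_padding_instance):
    """Append filler instances until len(instances) is a multiple of batch_size."""
    remainder = len(instances) % batch_size
    if remainder == 0:
        return list(instances)
    n_missing = batch_size - remainder
    # Each filler is created fresh so that no objects are shared between entries.
    return list(instances) + [make_padding_instance() for _ in range(n_missing)]


def empty_instance():
    # Hypothetical all-masked instance: nothing in it contributes to the loss.
    return {"inputs": [0, 0], "targets": [0, 0], "targets_segmentation": [0, 0]}


real = [{"inputs": [1, 2], "targets": [2, 3], "targets_segmentation": [1, 1]}] * 5
padded = fill_last_batch(real, batch_size=4, make_padding_instance=empty_instance)
print(len(padded))
```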

class xlstm_jax.dataset.lmeval_dataset.HFTokenizeLogLikelihood#

Dataset mapper modeling a simplified lm_eval dataset. Post-processing here could be done using the grain pipeline. However, unlike in HFTokenizeLogLikelihoodRolling, instances are not split if they exceed the maximal sequence length. See: EleutherAI/lm-evaluation-harness

static map(requests)#

Maps a list of lm_eval Instances to a dictionary usable in grain transforms.

Parameters: requests (list[lm_eval.api.instance.Instance]) – List of lm_eval Instances / Requests.
Returns: List of converted instances for lm processing.
Return type: grain.python.MapDataset