xlstm_jax.dataset.lmeval_dataset#

Classes#

HFTokenizeLogLikelihoodRolling

Dataset that tokenizes (HuggingFace) and splits documents according to the structure of loglikelihood_rolling.

HFTokenizeLogLikelihood

Dataset mapper modeling a simplified lm_eval dataset. Post-processing here could be done using the grain pipeline.

Module Contents#

class xlstm_jax.dataset.lmeval_dataset.HFTokenizeLogLikelihoodRolling(tokenizer_path, max_length, batch_size=1, hf_access_token='', tokenizer_cache_dir=None, add_bos_token=True, add_eos_token=False, bos_token_id=None, eos_token_id=None)#

Dataset that tokenizes (HuggingFace) and splits documents according to the structure of loglikelihood_rolling.

Targets are shifted for next-token prediction. The class does not operate on individual test instances, as a grain transform would, because documents are split into multiple sequences to fit the maximal sequence length. Instead, it follows the .map() paradigm, converting a list of lm_eval Instances into a list of training instances (dicts). See: EleutherAI/lm-evaluation-harness

Prefix tokens are handled here as well and masked out in the targets_segmentation.
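The splitting and target shifting described above can be illustrated with a minimal, self-contained sketch. It is not the class's implementation: the function name `rolling_windows` and the single-token prefix are assumptions made for illustration, and the special handling of the last sequence of a long document (hinted at in the simple_array description) is not modeled.

```python
def rolling_windows(tokens, prefix_token, max_length):
    """Split a tokenized document into (inputs, targets) windows.

    Hypothetical helper: each window holds at most `max_length` targets,
    and every document token appears as a target exactly once.
    """
    # Prepending the prefix shifts the stream by one position, so that
    # stream[i] is the input used to predict tokens[i].
    stream = [prefix_token] + list(tokens)
    windows = []
    for start in range(0, len(tokens), max_length):
        targets = list(tokens[start:start + max_length])
        inputs = stream[start:start + len(targets)]
        windows.append((inputs, targets))
    return windows


example_windows = rolling_windows([10, 11, 12, 13, 14], prefix_token=1, max_length=2)
```

In this sketch the first input of every later window is the last target of the previous window, so no token is scored twice.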

Parameters:
  • tokenizer_path (str) – HuggingFace tokenizer name

  • max_length (int) – Maximal sequence length / context_length

  • batch_size (int) – Batch size to be used for filling up the last batch.

  • hf_access_token (str) – HuggingFace access token for other tokenizers.

  • tokenizer_cache_dir (str | None) – HuggingFace tokenizer cache dir

  • add_bos_token (bool) – Whether to add a beginning of sequence / document token.

  • add_eos_token (bool) – Whether to add an end of sequence / document token.

  • bos_token_id (int | None) – BOS token id if not taken from tokenizer.

  • eos_token_id (int | None) – EOS token id if not taken from tokenizer.
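One plausible reading of how the four BOS/EOS parameters interact is sketched below. The helper `add_special_tokens` and the `tokenizer_bos_id` / `tokenizer_eos_id` arguments are hypothetical names, and the precedence rule (an explicit id wins over the tokenizer's id) is an assumption, not confirmed behavior of the class.

```python
def add_special_tokens(
    token_ids,
    *,
    add_bos_token=True,
    add_eos_token=False,
    bos_token_id=None,
    eos_token_id=None,
    tokenizer_bos_id=None,
    tokenizer_eos_id=None,
):
    """Optionally wrap a token list in BOS / EOS ids (illustrative only)."""
    # Assumption: an explicitly passed id overrides the tokenizer's own id.
    bos = bos_token_id if bos_token_id is not None else tokenizer_bos_id
    eos = eos_token_id if eos_token_id is not None else tokenizer_eos_id
    out = list(token_ids)
    if add_bos_token:
        if bos is None:
            raise ValueError("add_bos_token=True but no BOS id is available")
        out = [bos] + out
    if add_eos_token:
        if eos is None:
            raise ValueError("add_eos_token=True but no EOS id is available")
        out = out + [eos]
    return out


with_bos = add_special_tokens([10, 11], tokenizer_bos_id=1)
with_both = add_special_tokens([10, 11], add_eos_token=True, bos_token_id=5, eos_token_id=6)
```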

tokenizer#
batch_size = 1#
max_length#
add_bos_token = True#
add_eos_token = False#
_mapped_data = None#
bos_token_id = None#
eos_token_id = None#
_tokenize(example)#

Tokenize a string with the tokenizer.

Parameters:

example (str) – String to tokenize

Returns:

BatchEncoding in HF format with tokens.

Return type:

transformers.tokenization_utils_base.BatchEncoding[str, list[int]]

simple_array(*, prefix_tokens, all_tokens, doc_idx, seq_idx)#

Creates a simple document instance with “standard” padding and masks. This covers documents that do not exceed max_length, as well as every sequence except the last one of a longer document.

Parameters:
  • prefix_tokens (list[int]) – List of prefix tokens

  • all_tokens (list[int]) – List of all tokens

  • doc_idx (int) – Document index

  • seq_idx (int) – Sequence index (in document)

Returns:

Data instance dictionary.

Return type:

grain.python.MapDataset
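As a rough illustration of "standard" padding and masks, the following sketch builds one padded instance with NumPy. The dictionary keys other than targets_segmentation, the `pad_id` argument, and the convention that `all_tokens` follows the prefix tokens are assumptions; the real method may lay out its fields differently.

```python
import numpy as np


def padded_instance(*, prefix_tokens, all_tokens, doc_idx, seq_idx, max_length, pad_id=0):
    """Build one padded next-token-prediction instance (illustrative only)."""
    # Assumption: the model sees the prefix tokens followed by all_tokens.
    sequence = list(prefix_tokens) + list(all_tokens)
    inputs, targets = sequence[:-1], sequence[1:]
    n = len(inputs)
    if n > max_length:
        raise ValueError("sequence does not fit into max_length")
    # Targets that are themselves prefix tokens are masked out.
    n_masked = max(len(prefix_tokens) - 1, 0)
    pad = max_length - n
    return {
        "inputs": np.array(inputs + [pad_id] * pad, dtype=np.int32),
        "targets": np.array(targets + [pad_id] * pad, dtype=np.int32),
        "inputs_position": np.array(list(range(n)) + [0] * pad, dtype=np.int32),
        "targets_segmentation": np.array(
            [0] * n_masked + [1] * (n - n_masked) + [0] * pad, dtype=np.int32
        ),
        "document_idx": np.int32(doc_idx),
        "sequence_idx": np.int32(seq_idx),
    }


instance = padded_instance(
    prefix_tokens=[1, 2], all_tokens=[10, 11], doc_idx=0, seq_idx=0, max_length=5
)
```

A loss computed only where targets_segmentation is 1 would then ignore both the prefix targets and the padding.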

map(requests)#

Maps a list of lm_eval Instances to a (potentially longer) list of sequences for a language model evaluation. Generated instances are padded to max_length and contain position and segmentation information as well as document and sequence indices.

Parameters:

requests (list[lm_eval.api.instance.Instance]) – List of lm_eval Instances / Requests.

Returns:

List of converted instances for lm processing.

Return type:

grain.python.MapDataset
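Since batch_size is described as being used for filling up the last batch, one way this could work is to append fully masked dummy instances until the number of sequences is a multiple of the batch size. The sketch below is an assumption about that step, with hypothetical names `fill_last_batch` and `make_padding`.

```python
def fill_last_batch(instances, batch_size, make_padding):
    """Append dummy instances so len(result) is a multiple of batch_size.

    `make_padding` is a hypothetical zero-argument factory returning one
    fully masked instance that contributes nothing to the evaluation.
    """
    result = list(instances)
    remainder = len(result) % batch_size
    if remainder:
        result.extend(make_padding() for _ in range(batch_size - remainder))
    return result


filled = fill_last_batch(
    [{"sequence_idx": i} for i in range(5)],
    batch_size=4,
    make_padding=lambda: {"sequence_idx": -1},
)
```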

class xlstm_jax.dataset.lmeval_dataset.HFTokenizeLogLikelihood#

Dataset mapper modeling a simplified lm_eval dataset. Post-processing here could be done using the grain pipeline. However, unlike in LoglikelihoodRolling, instances are not split if they exceed the maximal sequence length. See: EleutherAI/lm-evaluation-harness

static map(requests)#

Maps a list of lm_eval Instances to a dictionary usable in grain transforms.

Parameters:

requests (list[lm_eval.api.instance.Instance]) – List of lm_eval Instances / Requests.

Returns:

List of converted instances for lm processing.

Return type:

grain.python.MapDataset
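The conversion can be pictured with a small stand-in for lm_eval's Instance. In lm_eval, a loglikelihood request carries a (context, continuation) pair in its arguments; the `StandInInstance` class and the output keys `prefix`, `text`, and `idx` below are hypothetical and only illustrate the shape of the mapping.

```python
from dataclasses import dataclass


@dataclass
class StandInInstance:
    """Minimal stand-in for lm_eval.api.instance.Instance (illustrative only)."""

    args: tuple  # (context, continuation) for a loglikelihood request
    idx: int


def to_records(requests):
    """Convert requests into plain dicts that later transforms can consume."""
    return [
        {"prefix": request.args[0], "text": request.args[1], "idx": request.idx}
        for request in requests
    ]


records = to_records(
    [
        StandInInstance(args=("The capital of France is", " Paris"), idx=0),
        StandInInstance(args=("2 + 2 =", " 4"), idx=1),
    ]
)
```

Tokenization, padding, and batching would then be applied per record downstream, which is what makes a per-instance pipeline suitable here in contrast to the rolling variant.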