xlstm_jax.dataset.configs#
Attributes#
Classes#
- DataConfig: Base data configuration.
- HFHubDataConfig: HuggingFace dataset configuration for datasets on HuggingFace.
- SyntheticDataConfig: Synthetic dataset configuration.
- GrainArrayRecordsDataConfig: Grain dataset configuration for ArrayRecords datasets.
Module Contents#
- xlstm_jax.dataset.configs.T#
- class xlstm_jax.dataset.configs.DataConfig#
Base data configuration.
- shuffle_data: bool = False#
Whether to shuffle the data. Usually True for training and False for validation.
- data_config_type: str | None = None#
Type of data configuration. Used for initialization via Hydra. Can be ‘synthetic’, ‘huggingface_hub’, ‘huggingface_local’, or ‘grain_arrayrecord’.
- classmethod create_train_eval_configs(train_kwargs=None, eval_kwargs=None, **kwargs)#
Create training and evaluation configurations.
- Parameters:
- Returns:
Training and evaluation configurations.
- Return type:
Tuple[DataConfig, DataConfig]
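The parameter descriptions for create_train_eval_configs are not shown here, so the following is only a minimal sketch of one plausible behavior: shared keyword arguments go to both configurations, and train_kwargs / eval_kwargs override them per split. SketchDataConfig is a hypothetical stand-in that carries only the two documented base fields; the merge order is an assumption, not the library's confirmed implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchDataConfig:
    """Hypothetical stand-in for DataConfig with its two documented fields."""

    shuffle_data: bool = False
    data_config_type: Optional[str] = None

    @classmethod
    def create_train_eval_configs(cls, train_kwargs=None, eval_kwargs=None, **kwargs):
        # Assumption: **kwargs are shared defaults, and the split-specific
        # dictionaries take precedence when a key appears in both.
        train_kwargs = train_kwargs or {}
        eval_kwargs = eval_kwargs or {}
        train_config = cls(**{**kwargs, **train_kwargs})
        eval_config = cls(**{**kwargs, **eval_kwargs})
        return train_config, eval_config


# Shuffle only the training split; both splits share the config type.
train_cfg, eval_cfg = SketchDataConfig.create_train_eval_configs(
    train_kwargs={"shuffle_data": True},
    eval_kwargs={"shuffle_data": False},
    data_config_type="synthetic",
)
```

Under these assumptions, a field that differs between training and evaluation goes into the split-specific dictionaries, and everything else is passed once as a shared keyword argument.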
- class xlstm_jax.dataset.configs.HFHubDataConfig#
Bases: DataConfig
HuggingFace dataset configuration for datasets on HuggingFace.
- hf_path: pathlib.Path | str#
Path to the dataset on HuggingFace.
- hf_cache_dir: pathlib.Path | str | None = None#
Directory to cache the dataset.
- hf_data_dir: pathlib.Path | str | None = None#
Directory for additional data files.
- max_steps_per_epoch: int | None = None#
Maximum number of steps per epoch (for training or evaluation).
- grain_packing_bin_count: int | None = None#
Number of bins for grain packing. If None, use the local batch size. Higher values may improve packing efficiency but may also increase memory usage and pre-processing times.
- drop_remainder: bool = False#
Whether to drop the remainder of the dataset when it does not divide evenly by the global batch size.
- batch_rampup_factors: dict[int, float] | None = None#
Ramp up the batch size if provided. The dictionary maps the step count to the scaling factor. See the boundaries_and_scales documentation in grain_batch_rampup.create_batch_rampup_schedule for more details.
- shuffle_data: bool = False#
Whether to shuffle the data. Usually True for training and False for validation.
- data_config_type: str | None = None#
Type of data configuration. Used for initialization via Hydra. Can be ‘synthetic’, ‘huggingface_hub’, ‘huggingface_local’, or ‘grain_arrayrecord’.
- classmethod create_train_eval_configs(train_kwargs=None, eval_kwargs=None, **kwargs)#
Create training and evaluation configurations.
- Parameters:
- Returns:
Training and evaluation configurations.
- Return type:
Tuple[DataConfig, DataConfig]
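The effect of drop_remainder on the number of batches per pass can be illustrated with plain arithmetic. The helper num_batches below is hypothetical and not part of xlstm_jax; how the library treats the final partial batch when drop_remainder is False (for example, by padding) is not stated here, so the sketch only counts batches.

```python
import math


def num_batches(num_examples: int, global_batch_size: int, drop_remainder: bool) -> int:
    """Count batches in one pass over the data (hypothetical helper)."""
    if drop_remainder:
        # Only full batches are kept; leftover examples are discarded.
        return num_examples // global_batch_size
    # The leftover examples form one additional, smaller batch.
    return math.ceil(num_examples / global_batch_size)


full_only = num_batches(1000, 64, drop_remainder=True)
with_partial = num_batches(1000, 64, drop_remainder=False)
print(full_only, with_partial)  # prints "15 16"
```

With 1000 examples and a global batch size of 64, dropping the remainder discards the last 40 examples.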
- class xlstm_jax.dataset.configs.SyntheticDataConfig#
Bases: DataConfig
Synthetic dataset configuration.
- shuffle_data: bool = False#
Whether to shuffle the data. Usually True for training and False for validation.
- data_config_type: str | None = None#
Type of data configuration. Used for initialization via Hydra. Can be ‘synthetic’, ‘huggingface_hub’, ‘huggingface_local’, or ‘grain_arrayrecord’.
- classmethod create_train_eval_configs(train_kwargs=None, eval_kwargs=None, **kwargs)#
Create training and evaluation configurations.
- Parameters:
- Returns:
Training and evaluation configurations.
- Return type:
Tuple[DataConfig, DataConfig]
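The data_config_type field is described as a string used for initialization via Hydra. How xlstm_jax wires this into Hydra is not shown here, so the following is a plain-Python analogy rather than the library's mechanism: a registry maps the type string to a configuration class. All class names, the REGISTRY mapping, and build_config are hypothetical, and only two of the four documented type strings are registered to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchBase:
    """Hypothetical stand-in for the documented base fields."""

    shuffle_data: bool = False
    data_config_type: Optional[str] = None


@dataclass
class SketchSynthetic(SketchBase):
    pass


@dataclass
class SketchGrain(SketchBase):
    split: str = "train"


# Assumed mapping from type string to class; illustrative only.
REGISTRY = {
    "synthetic": SketchSynthetic,
    "grain_arrayrecord": SketchGrain,
}


def build_config(**fields):
    """Pick the config class from data_config_type and instantiate it."""
    type_name = fields.get("data_config_type")
    if type_name not in REGISTRY:
        raise ValueError(f"Unknown data_config_type: {type_name!r}")
    return REGISTRY[type_name](**fields)


cfg = build_config(data_config_type="grain_arrayrecord", split="validation")
```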
- class xlstm_jax.dataset.configs.GrainArrayRecordsDataConfig#
Bases: DataConfig
Grain dataset configuration for ArrayRecords datasets.
- data_path: pathlib.Path#
Path to the dataset directory.
- split: str = 'train'#
Dataset split to use, e.g. ‘train’ or ‘validation’. Should be a subdirectory of data_path.
- drop_remainder: bool = False#
Whether to drop the remainder of the dataset when it does not divide evenly by the global batch size.
- tokenize_data: bool = True#
Whether to tokenize the data. If False, the data is assumed to be already tokenized.
- grain_packing_bin_count: int | None = None#
Number of bins for grain packing. If None, use the local batch size. Higher values may improve packing efficiency but may also increase memory usage and pre-processing times.
- hf_cache_dir: pathlib.Path | None = None#
Directory to cache the dataset. Used to get the HF tokenizer.
- batch_rampup_factors: dict[int, float] | None = None#
Ramp up the batch size if provided. The dictionary maps the step count to the scaling factor. See the boundaries_and_scales documentation in grain_batch_rampup.create_batch_rampup_schedule for more details.
- shuffle_data: bool = False#
Whether to shuffle the data. Usually True for training and False for validation.
- data_config_type: str | None = None#
Type of data configuration. Used for initialization via Hydra. Can be ‘synthetic’, ‘huggingface_hub’, ‘huggingface_local’, or ‘grain_arrayrecord’.
- classmethod create_train_eval_configs(train_kwargs=None, eval_kwargs=None, **kwargs)#
Create training and evaluation configurations.
- Parameters:
- Returns:
Training and evaluation configurations.
- Return type:
Tuple[DataConfig, DataConfig]
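The documented fallback for grain_packing_bin_count (use the local batch size when the value is None) can be written as a small helper. The function resolve_packing_bin_count and the local_batch_size argument are hypothetical names used for illustration; where the library actually resolves this value is not shown here.

```python
from typing import Optional


def resolve_packing_bin_count(
    grain_packing_bin_count: Optional[int], local_batch_size: int
) -> int:
    """Return the number of packing bins, falling back to the local batch size."""
    if grain_packing_bin_count is None:
        return local_batch_size
    return grain_packing_bin_count


default_bins = resolve_packing_bin_count(None, local_batch_size=8)
explicit_bins = resolve_packing_bin_count(32, local_batch_size=8)
```

Per the field description, a larger explicit value may improve packing efficiency at the cost of memory usage and pre-processing time.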