hfutils.operate.warmup

Huggingface File Management Module

This module provides utilities for managing and downloading files from Huggingface repositories. It includes functions for warming up (pre-downloading) individual files and entire directories, with support for concurrent downloads, retries, and progress tracking.
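
A typical warmup flow, sketched below with placeholder repository names, pre-downloads a single file and then a whole directory so that later reads hit the local cache:

>>> from hfutils.operate.warmup import hf_warmup_file, hf_warmup_directory
>>> # Pre-download one file; the returned path points into the local cache
>>> path = hf_warmup_file('username/dataset', 'data/train.csv')
>>> # Pre-download every file under 'data/' in the same repository
>>> hf_warmup_directory('username/dataset', 'data')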

hf_warmup_file

hfutils.operate.warmup.hf_warmup_file(repo_id: str, filename: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', hf_token: str | None = None, cache_dir: str | None = None)[source]

Download and cache a single file from a Huggingface repository.

This function downloads a specific file from a Huggingface repository and caches it locally. It’s useful for pre-downloading files that will be accessed later, ensuring they’re available in the local cache for faster subsequent access.

Parameters:
  • repo_id (str) – ID of the huggingface repository (e.g., ‘username/repository’)

  • filename (str) – Name of the file to download including path within repository

  • repo_type (RepoTypeTyping) – Type of repository (‘dataset’, ‘model’, or ‘space’), defaults to ‘dataset’

  • revision (str) – Git revision to use, defaults to ‘main’

  • hf_token (Optional[str]) – Huggingface authentication token for private repositories

  • cache_dir (Optional[str]) – Directory to cache the downloaded file, if None uses default cache

Returns:

Local path to the downloaded file

Return type:

str

Example:
>>> # Download a model configuration file
>>> local_path = hf_warmup_file('bert-base-uncased', 'config.json', repo_type='model')
>>> # Download a dataset file with specific revision
>>> local_path = hf_warmup_file('username/dataset', 'data/train.csv', revision='v1.0')
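
Since the return value is an ordinary filesystem path, the cached file can be read directly once the warmup completes. A minimal sketch, assuming the repository and file above are reachable:

>>> import json
>>> local_path = hf_warmup_file('bert-base-uncased', 'config.json', repo_type='model')
>>> with open(local_path, 'r', encoding='utf-8') as f:
...     config = json.load(f)  # parse the cached configuration file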

hf_warmup_directory

hfutils.operate.warmup.hf_warmup_directory(repo_id: str, dir_in_repo: str = '.', pattern: List[str] | str = '**/*', repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', silent: bool = False, max_workers: int = 8, max_retries: int = 5, hf_token: str | None = None, cache_dir: str | None = None)[source]

Download and cache an entire directory from a Huggingface repository with concurrent processing.

This function efficiently downloads multiple files from a directory in a Huggingface repository using concurrent workers. It includes retry mechanisms for failed downloads and progress tracking. This is particularly useful for pre-downloading large datasets or model repositories.

Parameters:
  • repo_id (str) – ID of the huggingface repository (e.g., ‘username/repository’)

  • dir_in_repo (str) – Directory path within the repository to download, defaults to ‘.’ (root)

  • pattern (Union[List[str], str]) – Glob pattern for filtering files (e.g., ‘*.txt’ for text files only), defaults to ‘**/*’

  • repo_type (RepoTypeTyping) – Type of repository (‘dataset’, ‘model’, or ‘space’), defaults to ‘dataset’

  • revision (str) – Git revision to use, defaults to ‘main’

  • silent (bool) – Whether to hide progress bar, defaults to False

  • max_workers (int) – Maximum number of concurrent download workers, defaults to 8

  • max_retries (int) – Maximum number of retry attempts for failed downloads, defaults to 5

  • hf_token (Optional[str]) – Huggingface authentication token for private repositories

  • cache_dir (Optional[str]) – Directory to cache the downloaded files

Example:
>>> # Download all files from the root directory
>>> hf_warmup_directory('username/dataset')
>>> # Download only .txt files from the 'data' directory using 4 workers
>>> hf_warmup_directory('username/repo', 'data', '*.txt', max_workers=4)
>>> # Download with custom retry settings and silent mode
>>> hf_warmup_directory('username/repo', 'models', max_retries=3, silent=True)
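
As the signature above indicates, pattern also accepts a list of glob patterns. The sketch below assumes that a file is downloaded when it matches any pattern in the list, and combines this with a custom cache directory:

>>> # Warm up only CSV and JSON files under 'data', caching into a local folder
>>> hf_warmup_directory(
...     'username/dataset',
...     'data',
...     pattern=['**/*.csv', '**/*.json'],  # assumed any-match semantics for list patterns
...     cache_dir='./hf_cache',
... )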