hfutils.operate.warmup
Huggingface File Management Module
This module provides utilities for managing and downloading files from Huggingface repositories. It includes functions for warming up (pre-downloading) individual files and entire directories, with support for concurrent downloads, retries, and progress tracking.
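A minimal quick-start sketch (the repository names below are illustrative placeholders, not real repositories):

from hfutils.operate.warmup import hf_warmup_file, hf_warmup_directory

# Warm up a single file; the returned path points into the local cache.
config_path = hf_warmup_file('username/model', 'config.json', repo_type='model')

# Warm up a whole directory so later reads hit the cache instead of the network.
hf_warmup_directory('username/dataset', dir_in_repo='data', pattern='**/*.csv')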
hf_warmup_file
- hfutils.operate.warmup.hf_warmup_file(repo_id: str, filename: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', hf_token: str | None = None, cache_dir: str | None = None)[source]
Download and cache a single file from Huggingface repository.
This function downloads a specific file from a Huggingface repository and caches it locally. It’s useful for pre-downloading files that will be accessed later, ensuring they’re available in the local cache for faster subsequent access.
- Parameters:
repo_id (str) – ID of the huggingface repository (e.g., ‘username/repository’)
filename (str) – Name of the file to download including path within repository
repo_type (RepoTypeTyping) – Type of repository (‘dataset’, ‘model’, or ‘space’), defaults to ‘dataset’
revision (str) – Git revision to use, defaults to ‘main’
hf_token (Optional[str]) – Huggingface authentication token for private repositories
cache_dir (Optional[str]) – Directory to cache the downloaded file, if None uses default cache
- Returns:
Local path to the downloaded file
- Return type:
str
- Example::
>>> # Download a model configuration file
>>> local_path = hf_warmup_file('bert-base-uncased', 'config.json', repo_type='model')
>>> # Download a dataset file with a specific revision
>>> local_path = hf_warmup_file('username/dataset', 'data/train.csv', revision='v1.0')
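Because hf_warmup_file returns the cached local path, the result can be fed straight to ordinary file APIs. A short sketch, assuming the config.json being fetched is valid JSON:

import json

from hfutils.operate.warmup import hf_warmup_file

# The first call downloads into the cache; repeated calls reuse the cached copy.
local_path = hf_warmup_file('bert-base-uncased', 'config.json', repo_type='model')

with open(local_path, 'r', encoding='utf-8') as f:
    config = json.load(f)
print(config.get('model_type'))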
hf_warmup_directory
- hfutils.operate.warmup.hf_warmup_directory(repo_id: str, dir_in_repo: str = '.', pattern: List[str] | str = '**/*', repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', silent: bool = False, max_workers: int = 8, max_retries: int = 5, hf_token: str | None = None, cache_dir: str | None = None)[source]
Download and cache an entire directory from Huggingface repository with concurrent processing.
This function efficiently downloads multiple files from a directory in a Huggingface repository using concurrent workers. It includes retry mechanisms for failed downloads and progress tracking. This is particularly useful for pre-downloading large datasets or model repositories.
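Conceptually, the behavior described above amounts to a pool of workers each retrying failed files; it can be pictured with the minimal sketch below (this is not the library's actual implementation; the helper name and retry loop are assumptions for illustration):

from concurrent.futures import ThreadPoolExecutor, as_completed

from hfutils.operate.warmup import hf_warmup_file

def _warmup_with_retries(repo_id, filename, max_retries=5, **kwargs):
    # Retry a single file download up to max_retries times before giving up.
    last_error = None
    for _ in range(max_retries):
        try:
            return hf_warmup_file(repo_id, filename, **kwargs)
        except Exception as error:
            last_error = error
    raise last_error

files = ['data/train.csv', 'data/test.csv']  # illustrative file list
with ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(_warmup_with_retries, 'username/dataset', f) for f in files]
    for future in as_completed(futures):
        future.result()  # re-raise any download error that exhausted its retries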
- Parameters:
repo_id (str) – ID of the huggingface repository (e.g., ‘username/repository’)
dir_in_repo (str) – Directory path within the repository to download, defaults to ‘.’ (root)
pattern (Union[List[str], str]) – Glob pattern or list of patterns for filtering files (e.g., ‘*.txt’ for text files only), defaults to ‘**/*’ (all files)
repo_type (RepoTypeTyping) – Type of repository (‘dataset’, ‘model’, or ‘space’), defaults to ‘dataset’
revision (str) – Git revision to use, defaults to ‘main’
silent (bool) – Whether to hide progress bar, defaults to False
max_workers (int) – Maximum number of concurrent download workers, defaults to 8
max_retries (int) – Maximum number of retry attempts for failed downloads, defaults to 5
hf_token (Optional[str]) – Huggingface authentication token for private repositories
cache_dir (Optional[str]) – Directory to cache the downloaded files
- Example::
>>> # Download all files from the root directory
>>> hf_warmup_directory('username/dataset')
>>> # Download only .txt files from the 'data' directory using 4 workers
>>> hf_warmup_directory('username/repo', 'data', '*.txt', max_workers=4)
>>> # Download with custom retry settings and silent mode
>>> hf_warmup_directory('username/repo', 'models', max_retries=3, silent=True)
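Since pattern accepts either a single glob or a list of globs (Union[List[str], str]), several patterns can be combined in one call. An illustrative sketch (the repository name is a placeholder):

from hfutils.operate.warmup import hf_warmup_directory

# Warm up both CSV and JSON files anywhere under the 'data' directory.
hf_warmup_directory('username/dataset', 'data', pattern=['**/*.csv', '**/*.json'])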