hfutils.operate.warmup
Huggingface File Management Module
This module provides utilities for warming up (pre-downloading and caching) files from Huggingface repositories. It covers both individual files and entire directories, with support for concurrent downloads, retries, and progress tracking.
hf_warmup_file
- hfutils.operate.warmup.hf_warmup_file(repo_id: str, filename: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', hf_token: str | None = None, cache_dir: str | None = None)
Download and cache a single file from a Huggingface repository.
- Parameters:
repo_id (str) – ID of the huggingface repository (e.g., ‘username/repository’)
filename (str) – Name of the file to download including path within repository
repo_type (RepoTypeTyping) – Type of repository (‘dataset’, ‘model’, or ‘space’), defaults to ‘dataset’
revision (str) – Git revision to use, defaults to ‘main’
hf_token (Optional[str]) – Huggingface authentication token for private repositories
cache_dir (Optional[str]) – Directory to cache the downloaded file, if None uses default cache
- Returns:
Local path to the downloaded file
- Return type:
str
- Example:
>>> local_path = hf_warmup_file('bert-base-uncased', 'config.json', repo_type='model')
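Because hf_warmup_file returns a local path, the cached file can be opened like any other file. The following is a minimal sketch of that pattern, not part of hfutils: the helper name load_json_via_warmup is hypothetical, and it takes the warm-up function as an argument so that it can be exercised without network access.

```python
import json
from typing import Any, Callable, Dict


def load_json_via_warmup(
    warmup: Callable[..., str],
    repo_id: str,
    filename: str,
    **kwargs: Any,
) -> Dict[str, Any]:
    """Warm up one file, then parse the cached copy as JSON.

    `warmup` is assumed to behave like hf_warmup_file: it accepts
    (repo_id, filename, **kwargs) and returns a local file path.
    This helper is illustrative only and is not provided by hfutils.
    """
    local_path = warmup(repo_id, filename, **kwargs)
    with open(local_path, "r", encoding="utf-8") as f:
        return json.load(f)
```

Assuming hfutils is installed and the repository is reachable, it could be used as:

>>> from hfutils.operate.warmup import hf_warmup_file
>>> config = load_json_via_warmup(hf_warmup_file, 'bert-base-uncased', 'config.json', repo_type='model')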
hf_warmup_directory
- hfutils.operate.warmup.hf_warmup_directory(repo_id: str, dir_in_repo: str = '.', pattern: str = '**/*', repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', silent: bool = False, ignore_patterns: List[str] = <object object>, max_workers: int = 8, max_retries: int = 5, hf_token: str | None = None, cache_dir: str | None = None)
Download and cache an entire directory from a Huggingface repository, fetching files concurrently.
- Parameters:
repo_id (str) – ID of the huggingface repository (e.g., ‘username/repository’)
dir_in_repo (str) – Directory path within the repository to download, defaults to ‘.’ (the repository root)
pattern (str) – Glob pattern for filtering files (e.g., ‘*.txt’ for text files only), defaults to ‘**/*’
repo_type (RepoTypeTyping) – Type of repository (‘dataset’, ‘model’, or ‘space’), defaults to ‘dataset’
revision (str) – Git revision to use, defaults to ‘main’
silent (bool) – Whether to hide the progress bar, defaults to False
ignore_patterns (List[str]) – List of patterns to ignore during download
max_workers (int) – Maximum number of concurrent download workers, defaults to 8
max_retries (int) – Maximum number of retry attempts for failed downloads, defaults to 5
hf_token (Optional[str]) – Huggingface authentication token for private repositories
cache_dir (Optional[str]) – Directory to cache the downloaded files, if None uses default cache
- Example:
>>> # Downloads all .txt files from the 'data' directory using 4 workers
>>> hf_warmup_directory('username/repo', 'data', '*.txt', max_workers=4)
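To illustrate how max_workers and max_retries might interact, here is a minimal sketch of concurrent downloads with per-file retries. It is not the hfutils implementation: the names warmup_many and download_one are hypothetical, the sketch treats max_retries as the total number of attempts per file (an assumption), and it omits progress reporting and pattern filtering.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def warmup_many(
    download_one: Callable[[str], str],
    filenames: Iterable[str],
    max_workers: int = 8,
    max_retries: int = 5,
) -> Dict[str, str]:
    """Run download_one over filenames concurrently, retrying failures.

    Illustrative only; assumes max_retries means total attempts per file.
    Returns a mapping from filename to the value download_one returned.
    """

    def with_retries(filename: str) -> str:
        last_error = None
        for _ in range(max_retries):
            try:
                return download_one(filename)
            except Exception as err:  # a real implementation would likely catch narrower errors
                last_error = err
        raise RuntimeError(
            f"gave up on {filename!r} after {max_retries} attempts"
        ) from last_error

    results: Dict[str, str] = {}
    # At most max_workers downloads run at the same time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(with_retries, name): name for name in filenames}
        for future in as_completed(futures):
            # future.result() re-raises if every attempt for that file failed.
            results[futures[future]] = future.result()
    return results
```

In this sketch, download_one could be a small wrapper such as lambda name: hf_warmup_file('username/repo', name), so that each worker warms up a single file.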