hfutils.operate.warmup

Huggingface File Management Module

This module provides utilities for managing and downloading files from Huggingface repositories. It includes functions for warming up (pre-downloading) individual files and entire directories from Huggingface repositories, with support for concurrent downloads, retries, and progress tracking.

hf_warmup_file

hfutils.operate.warmup.hf_warmup_file(repo_id: str, filename: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', hf_token: str | None = None, cache_dir: str | None = None)

Download and cache a single file from a Huggingface repository.

Parameters:
  • repo_id (str) – ID of the huggingface repository (e.g., ‘username/repository’)

  • filename (str) – Name of the file to download including path within repository

  • repo_type (RepoTypeTyping) – Type of repository (‘dataset’, ‘model’, or ‘space’); defaults to ‘dataset’

  • revision (str) – Git revision to use, defaults to ‘main’

  • hf_token (Optional[str]) – Huggingface authentication token for private repositories

  • cache_dir (Optional[str]) – Directory in which to cache the downloaded file; if None, the default cache is used

Returns:

Local path to the downloaded file

Return type:

str

Example:
>>> local_path = hf_warmup_file('bert-base-uncased', 'config.json', repo_type='model')
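Because the function returns a local path, the result can be handed straight to ordinary file APIs. The following is a minimal sketch rather than part of hfutils: the helper names warmup_and_load_json and load_json are hypothetical, and reading the token from an HF_TOKEN environment variable is an assumption made for illustration.

```python
import json
import os
from typing import Optional


def load_json(local_path: str) -> dict:
    """Parse a JSON file that already exists on local disk."""
    with open(local_path, "r", encoding="utf-8") as f:
        return json.load(f)


def warmup_and_load_json(
    repo_id: str,
    filename: str,
    repo_type: str = "model",
    cache_dir: Optional[str] = None,
) -> dict:
    """Hypothetical helper: warm up one JSON file, then parse the cached copy."""
    # Imported lazily so the sketch can be read without hfutils installed.
    from hfutils.operate.warmup import hf_warmup_file

    local_path = hf_warmup_file(
        repo_id,
        filename,
        repo_type=repo_type,
        # Assumption: the access token, if any, lives in HF_TOKEN.
        hf_token=os.environ.get("HF_TOKEN"),
        cache_dir=cache_dir,
    )
    return load_json(local_path)
```

Calling warmup_and_load_json('bert-base-uncased', 'config.json') would then download the file on first use and read it from the cache afterwards.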

hf_warmup_directory

hfutils.operate.warmup.hf_warmup_directory(repo_id: str, dir_in_repo: str = '.', pattern: str = '**/*', repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', silent: bool = False, ignore_patterns: List[str] = <object object>, max_workers: int = 8, max_retries: int = 5, hf_token: str | None = None, cache_dir: str | None = None)

Download and cache an entire directory from a Huggingface repository, fetching files concurrently.

Parameters:
  • repo_id (str) – ID of the huggingface repository (e.g., ‘username/repository’)

  • dir_in_repo (str) – Directory path within the repository to download; defaults to ‘.’ (the repository root)

  • pattern (str) – Glob pattern for filtering files (e.g., ‘*.txt’ for text files only); defaults to ‘**/*’

  • repo_type (RepoTypeTyping) – Type of repository (‘dataset’, ‘model’, or ‘space’); defaults to ‘dataset’

  • revision (str) – Git revision to use; defaults to ‘main’

  • silent (bool) – Whether to hide the progress bar; defaults to False

  • ignore_patterns (List[str]) – List of patterns to ignore during download

  • max_workers (int) – Maximum number of concurrent download workers; defaults to 8

  • max_retries (int) – Maximum number of retry attempts for failed downloads; defaults to 5

  • hf_token (Optional[str]) – Huggingface authentication token for private repositories

  • cache_dir (Optional[str]) – Directory to cache the downloaded files

Example:
>>> # Downloads all .txt files from the 'data' directory using 4 workers
>>> hf_warmup_directory('username/repo', 'data', '*.txt', max_workers=4)
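To make the roles of max_workers and max_retries concrete, here is a rough sketch of one way concurrent downloads with retries can be arranged using the standard library. It is not the hfutils implementation: warmup_many, download_one, and retry_delay are hypothetical names, and the sketch assumes that max_retries counts retries after one initial attempt.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def warmup_many(
    paths: Iterable[str],
    download_one: Callable[[str], str],
    max_workers: int = 8,
    max_retries: int = 5,
    retry_delay: float = 0.0,
) -> Dict[str, str]:
    """Download every path concurrently, retrying each failed download.

    download_one is a caller-supplied function (hypothetical here) that
    fetches one file and returns its local path.
    """

    def _with_retries(path: str) -> str:
        last_error = None
        # Assumption: one initial attempt plus up to max_retries retries.
        for _ in range(max_retries + 1):
            try:
                return download_one(path)
            except Exception as err:
                last_error = err
                time.sleep(retry_delay)
        raise last_error

    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the path it is downloading.
        futures = {pool.submit(_with_retries, p): p for p in paths}
        for future in as_completed(futures):
            # result() re-raises if every attempt for that path failed.
            results[futures[future]] = future.result()
    return results
```

A thread pool suits this workload because downloads are I/O-bound; raising max_workers increases parallelism at the cost of more simultaneous connections.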