hfutils.operate.download

This module provides functions for downloading files and directories from Hugging Face repositories.

It includes utilities for downloading individual files, archives, and entire directories, with support for concurrent downloads, retries, and progress tracking.

The module interacts with the Hugging Face Hub API to fetch repository contents and download files, handling various repository types and revisions.

Key features:

  • Download individual files from Hugging Face repositories

  • Download and extract archive files

  • Download entire directories with pattern matching and ignore rules

  • Concurrent downloads with configurable worker count

  • Retry mechanism for failed downloads

  • Progress tracking with tqdm

  • Support for different repository types (dataset, model, space)

  • Token-based authentication for accessing private repositories

This module is particularly useful for managing and synchronizing local copies of Hugging Face repository contents, especially when dealing with large datasets or models.
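The retry behaviour listed above can be pictured as a simple wrapper with exponential backoff. This is an illustrative sketch only, not the module's actual implementation, and the helper name `with_retries` is made up for the example:

```python
import time

def with_retries(fn, max_retries=5, base_delay=0.01):
    """Call fn(), retrying on failure up to max_retries attempts,
    with a simple exponential backoff between attempts.
    Illustrative sketch; hfutils' internal retry logic may differ."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: re-raise the last error
            time.sleep(base_delay * (2 ** attempt))

# A flaky operation that fails twice before succeeding:
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("simulated network error")
    return "ok"

result = with_retries(flaky)  # succeeds on the third attempt
```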

download_file_to_file

hfutils.operate.download.download_file_to_file(local_file: str, repo_id: str, file_in_repo: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', soft_mode_when_check: bool = False, hf_token: str | None = None)[source]

Download a file from a Hugging Face repository and save it to a local file.

This function downloads a single file from a Hugging Face repository to a specified local path. Before downloading, it checks whether the local file already exists and is up to date; if so, the download is skipped.

Parameters:
  • local_file (str) – The local file path where the downloaded file will be saved.

  • repo_id (str) – The identifier of the repository (e.g., ‘username/repo-name’).

  • file_in_repo (str) – The file path within the repository relative to the repository root.

  • repo_type (RepoTypeTyping) – The type of the repository (‘dataset’, ‘model’, ‘space’).

  • revision (str) – The revision of the repository (e.g., branch name, tag, commit hash).

  • soft_mode_when_check (bool) – If True, only check the file size for validation instead of full integrity check.

  • hf_token (str, optional) – Hugging Face authentication token for accessing private repositories.

Example::
>>> download_file_to_file(
...     local_file="./local_model.bin",
...     repo_id="username/my-model",
...     file_in_repo="pytorch_model.bin",
...     repo_type="model"
... )
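The soft-mode check can be illustrated with a size-only comparison. `needs_download` and `expected_size` are hypothetical names for this sketch; the actual validation in hfutils may consult additional repository metadata:

```python
import os
import tempfile

def needs_download(local_file, expected_size):
    """Size-only ('soft mode') validation sketch: treat the local
    copy as valid when it exists and its size matches the expected
    size, skipping a full integrity check. Hypothetical helper."""
    if not os.path.exists(local_file):
        return True
    return os.path.getsize(local_file) != expected_size

# Demo with a 5-byte temporary file:
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"12345")
    path = f.name

skip_ok = not needs_download(path, 5)  # size matches -> skip download
redo = needs_download(path, 9)         # size differs -> re-download
os.unlink(path)
```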

download_archive_as_directory

hfutils.operate.download.download_archive_as_directory(local_directory: str, repo_id: str, file_in_repo: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', password: str | None = None, hf_token: str | None = None)[source]

Download an archive file from a Hugging Face repository and extract it to a local directory.

This function downloads an archive file (such as ZIP, TAR, etc.) from a Hugging Face repository and automatically extracts its contents to the specified local directory. It supports password-protected archives and handles the extraction process transparently.

Parameters:
  • local_directory (str) – The local directory path where the archive contents will be extracted.

  • repo_id (str) – The identifier of the repository (e.g., ‘username/repo-name’).

  • file_in_repo (str) – The archive file path within the repository relative to the repository root.

  • repo_type (RepoTypeTyping) – The type of the repository (‘dataset’, ‘model’, ‘space’).

  • revision (str) – The revision of the repository (e.g., branch name, tag, commit hash).

  • password (str, optional) – The password for extracting password-protected archives.

  • hf_token (str, optional) – Hugging Face authentication token for accessing private repositories.

Example::
>>> download_archive_as_directory(
...     local_directory="./extracted_data",
...     repo_id="username/dataset",
...     file_in_repo="data.zip",
...     repo_type="dataset"
... )
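Locally, the extraction step is conceptually equivalent to unpacking the downloaded archive into the target directory. A minimal sketch using the standard-library zipfile module (the real extraction in hfutils supports multiple archive formats and password protection):

```python
import io
import os
import tempfile
import zipfile

# Build a small in-memory ZIP standing in for the downloaded archive:
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data/sample.txt", "hello")

# Extract it into a local directory, as the download helper would:
out_dir = tempfile.mkdtemp()
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    zf.extractall(out_dir)

extracted = os.path.join(out_dir, "data", "sample.txt")
with open(extracted) as f:
    content = f.read()
```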

download_directory_as_directory

hfutils.operate.download.download_directory_as_directory(local_directory: str, repo_id: str, dir_in_repo: str = '.', pattern: List[str] | str = '**/*', repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', silent: bool = False, max_workers: int = 8, max_retries: int = 5, soft_mode_when_check: bool = False, hf_token: str | None = None)[source]

Download all files in a directory from a Hugging Face repository to a local directory.

This function recursively downloads all files from a specified directory in a Hugging Face repository to a local directory. It supports concurrent downloads with configurable worker threads, retry mechanisms for failed downloads, and pattern-based file filtering. The function maintains the directory structure from the repository in the local destination.

Parameters:
  • local_directory (str) – The local directory path where downloaded files will be saved.

  • repo_id (str) – The identifier of the repository (e.g., ‘username/repo-name’).

  • dir_in_repo (str) – The directory path within the repository to download. Use ‘.’ for repository root.

  • pattern (Union[List[str], str]) – File patterns for filtering which files to download. Can be a single pattern string or list of patterns.

  • repo_type (RepoTypeTyping) – The type of the repository (‘dataset’, ‘model’, ‘space’).

  • revision (str) – The revision of the repository (e.g., branch name, tag, commit hash).

  • silent (bool) – If True, suppress progress bar output during download.

  • max_workers (int) – Maximum number of concurrent download threads.

  • max_retries (int) – Maximum number of retry attempts for failed downloads.

  • soft_mode_when_check (bool) – If True, only check file size for validation instead of full integrity check.

  • hf_token (str, optional) – Hugging Face authentication token for accessing private repositories.

Example::
>>> download_directory_as_directory(
...     local_directory="./local_dataset",
...     repo_id="username/my-dataset",
...     dir_in_repo="data",
...     pattern="*.json",
...     repo_type="dataset",
...     max_workers=4
... )
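The pattern parameter's filtering can be pictured with glob-style matching. This sketch assumes fnmatch-like semantics; the exact matching rules used by hfutils may differ (for example, in how patterns are resolved relative to dir_in_repo):

```python
from fnmatch import fnmatch

repo_files = [
    "data/train.json",
    "data/test.json",
    "images/cat.png",
    "README.md",
]

def matches(path, patterns):
    """Return True if path matches any pattern (a single string or a
    list of strings). Illustrative sketch of pattern-based filtering;
    hfutils' actual matching may differ."""
    if isinstance(patterns, str):
        patterns = [patterns]
    return any(fnmatch(path, p) for p in patterns)

# Keep only the JSON files, mirroring pattern="*.json" above:
selected = [f for f in repo_files if matches(f, "*.json")]
```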