hfutils.operate.base

This module provides utilities for interacting with the Hugging Face Hub API and filesystem. It includes functions for retrieving API clients, listing files in repositories, and handling file patterns and ignore rules.

The module offers the following main functionalities:

  1. Retrieving Hugging Face API tokens and clients

  2. Accessing the Hugging Face filesystem

  3. Listing files in Hugging Face repositories with pattern matching and ignore rules

  4. Parsing and normalizing Hugging Face filesystem paths

These utilities are designed to simplify working with Hugging Face repositories, especially when dealing with datasets, models, and spaces.

get_hf_client

hfutils.operate.base.get_hf_client(hf_token: str | None = None) → HfApi

Get the Hugging Face API client.

This function returns an instance of the Hugging Face API client. If no token is passed, it falls back to the token stored in the ‘HF_TOKEN’ environment variable.

Parameters:

hf_token (Optional[str]) – Hugging Face token for API client. If not provided, uses the ‘HF_TOKEN’ environment variable.

Returns:

An instance of the Hugging Face API client.

Return type:

HfApi

Example:
>>> client = get_hf_client()
>>> # Use client to interact with Hugging Face API
>>> client.list_models(author="huggingface")
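
The token fallback described above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: resolve_hf_token is a hypothetical helper name, and treating an empty string the same as a missing token is an assumption.

```python
import os
from typing import Optional


def resolve_hf_token(hf_token: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: pick the token a client would be built with."""
    # An explicitly passed token takes priority over the environment.
    # Assumption: an empty string counts as "not provided".
    if hf_token:
        return hf_token
    # Falls back to HF_TOKEN; returns None when the variable is unset,
    # which corresponds to anonymous access.
    return os.environ.get("HF_TOKEN")
```

The resolved value could then be handed to the client constructor, for instance HfApi(token=resolve_hf_token()).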

get_hf_fs

hfutils.operate.base.get_hf_fs(hf_token: str | None = None) → HfFileSystem

Get the Hugging Face file system.

This function returns an instance of the Hugging Face file system. If no token is passed, it falls back to the token stored in the ‘HF_TOKEN’ environment variable. The file system is created with the listings cache disabled, so every listing reflects the current state of the repository.

Parameters:

hf_token (Optional[str]) – Hugging Face token for the file system. If not provided, uses the ‘HF_TOKEN’ environment variable.

Returns:

An instance of the Hugging Face file system.

Return type:

HfFileSystem

Example:
>>> fs = get_hf_fs()
>>> # Use fs to interact with Hugging Face file system
>>> fs.ls("datasets/username/repo")
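
Paths given to the file system follow the HfFileSystem layout: dataset repositories are prefixed with datasets/, spaces with spaces/, model repositories have no prefix, and an optional revision is appended with @. The sketch below builds such paths; build_fs_path is a hypothetical helper written for illustration, and it does not handle special revision forms such as pull request refs.

```python
def build_fs_path(repo_id: str, filename: str = "",
                  repo_type: str = "dataset", revision: str = "") -> str:
    """Hypothetical helper: compose a path in the HfFileSystem layout."""
    # Models live at the root; datasets and spaces get a plural prefix.
    prefixes = {"dataset": "datasets/", "space": "spaces/", "model": ""}
    if repo_type not in prefixes:
        raise ValueError(f"Unknown repo_type: {repo_type!r}")

    path = prefixes[repo_type] + repo_id
    if revision:
        # Simple branch, tag, or commit names only (assumption of this sketch).
        path += "@" + revision
    if filename:
        path += "/" + filename.lstrip("/")
    return path


print(build_fs_path("username/repo", "data/train.csv"))
# datasets/username/repo/data/train.csv
print(build_fs_path("username/repo", repo_type="model", revision="main"))
# username/repo@main
```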

list_all_with_pattern

hfutils.operate.base.list_all_with_pattern(repo_id: str, pattern: str = '**/*', repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', startup_batch: int = 500, batch_factor: float = 0.8, hf_token: str | None = None, silent: bool = False) → Iterator[RepoFile | RepoFolder]

List all files and folders in a Hugging Face repository matching a given pattern.

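This function retrieves information about the files and folders in a repository that match the specified pattern. Path information is requested in batches, starting with startup_batch paths per request; when a request fails because the batch is too large, the batch size is reduced by batch_factor and the request is retried, which keeps large repositories manageable.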

Parameters:
  • repo_id (str) – The identifier of the repository.

  • pattern (str) – Wildcard pattern to match files and folders. Default is **/* (all files and folders).

  • repo_type (RepoTypeTyping) – The type of the repository (‘dataset’, ‘model’, ‘space’). Default is ‘dataset’.

  • revision (str) – The revision of the repository (e.g., branch, tag, commit hash). Default is ‘main’.

  • startup_batch (int) – Initial batch size for retrieving path information. Default is 500.

  • batch_factor (float) – Factor to reduce batch size if a request fails. Default is 0.8.

  • hf_token (Optional[str]) – Hugging Face token for API client. If not provided, uses the ‘HF_TOKEN’ environment variable.

  • silent (bool) – If True, suppresses progress bar. Default is False.

Returns:

An iterator of RepoFile and RepoFolder objects matching the pattern.

Return type:

Iterator[Union[RepoFile, RepoFolder]]

Raises:

HfHubHTTPError – If there’s an error in the API request that’s not related to batch size.

Example:
>>> for item in list_all_with_pattern("username/repo", pattern="*.txt"):
...     print(item.path)
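
The roles of startup_batch and batch_factor can be illustrated with a small, self-contained retry loop. This is a sketch under stated assumptions rather than the library's code: fetch_in_batches, BatchTooLarge, and the fetch callback are hypothetical names, with BatchTooLarge standing in for the batch-size-related request failure.

```python
from typing import Callable, Iterator, List, Sequence


class BatchTooLarge(Exception):
    """Stand-in for a request rejected because the batch was too big."""


def fetch_in_batches(
    paths: Sequence[str],
    fetch: Callable[[Sequence[str]], List[str]],
    startup_batch: int = 500,
    batch_factor: float = 0.8,
) -> Iterator[str]:
    """Yield results for all paths, shrinking the batch after each failure."""
    batch, offset = startup_batch, 0
    while offset < len(paths):
        chunk = paths[offset:offset + batch]
        try:
            items = fetch(chunk)
        except BatchTooLarge:
            if batch <= 1:
                raise  # a single path still fails, so give up
            # Shrink by the factor, but always by at least one path.
            batch = min(batch - 1, max(1, int(batch * batch_factor)))
            continue  # retry the same offset with the smaller batch
        yield from items
        offset += len(chunk)
```

Any other exception raised by the callback propagates unchanged, which mirrors the documented behaviour of raising errors that are unrelated to batch size.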

list_files_in_repository

hfutils.operate.base.list_files_in_repository(repo_id: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', subdir: str = '', pattern: str = '**/*', revision: str = 'main', ignore_patterns: List[str] = <object object>, hf_token: str | None = None, silent: bool = False) → List[str]

List files in a Hugging Face repository based on the given parameters.

This function retrieves a list of file paths in a specified repository that match the given pattern and are not ignored by the ignore patterns.

Parameters:
  • repo_id (str) – The identifier of the repository.

  • repo_type (RepoTypeTyping) – The type of the repository (‘dataset’, ‘model’, ‘space’). Default is ‘dataset’.

  • subdir (str) – The subdirectory to list files from. Default is an empty string (root directory).

  • pattern (str) – Wildcard pattern of the target files. Default is **/* (all files).

  • revision (str) – The revision of the repository (e.g., branch, tag, commit hash). Default is ‘main’.

  • ignore_patterns (List[str]) – List of file patterns to ignore. If not set, uses default ignore patterns (the <object object> default in the signature is a placeholder meaning “not set”).

  • hf_token (Optional[str]) – Hugging Face token for API client. If not provided, uses the ‘HF_TOKEN’ environment variable.

  • silent (bool) – If True, suppresses progress bar. Default is False.

Returns:

A list of file paths that match the criteria.

Return type:

List[str]

Example:
>>> files = list_files_in_repository("username/repo", pattern="*.txt", ignore_patterns=[".git*", "*.log"])
>>> print(files)
['file1.txt', 'folder/file2.txt']
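
How ignore patterns might prune a listing can be sketched with the standard fnmatch module. The helper below is hypothetical, and matching each pattern against every path segment (rather than against the full path) is an assumption about the matching rule, not a statement of how the library does it.

```python
import fnmatch
from typing import Iterable, List


def filter_ignored(paths: Iterable[str], ignore_patterns: List[str]) -> List[str]:
    """Hypothetical helper: drop paths with any segment matching a pattern."""
    kept = []
    for path in paths:
        segments = path.split("/")
        # fnmatchcase keeps matching case-sensitive on every platform.
        ignored = any(
            fnmatch.fnmatchcase(segment, pattern)
            for segment in segments
            for pattern in ignore_patterns
        )
        if not ignored:
            kept.append(path)
    return kept


listing = ["file1.txt", ".gitattributes", "logs/run.log", "folder/file2.txt"]
print(filter_ignored(listing, [".git*", "*.log"]))
# ['file1.txt', 'folder/file2.txt']
```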