hfutils.operate.base

This module provides utilities for interacting with the Hugging Face Hub API and filesystem. It includes functions for retrieving API clients, listing files in repositories, and handling file patterns and ignore rules.

The module offers the following main functionalities:

  1. Retrieving Hugging Face API tokens and clients

  2. Accessing the Hugging Face filesystem

  3. Listing files in Hugging Face repositories with pattern matching and ignore rules

  4. Parsing and normalizing Hugging Face filesystem paths

These utilities are designed to simplify working with Hugging Face repositories, especially when dealing with datasets, models, and spaces.

get_hf_client

hfutils.operate.base.get_hf_client(hf_token: str | None = None) → HfApi

Get the Hugging Face API client.

This function returns an instance of the Hugging Face API client. If no token is provided, it falls back to the token stored in the HF_TOKEN environment variable.

Parameters:

hf_token (Optional[str]) – Hugging Face token for API client. If not provided, uses the ‘HF_TOKEN’ environment variable.

Returns:

An instance of the Hugging Face API client.

Return type:

HfApi

Example:

>>> from hfutils.operate.base import get_hf_client
>>> client = get_hf_client()
>>> # Use client to interact with Hugging Face API
>>> client.list_models(author="huggingface")
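The token fallback described for hf_token can be pictured with a minimal sketch. The helper name resolve_hf_token is hypothetical and not part of hfutils; it only illustrates the "explicit argument first, then HF_TOKEN" order stated in the parameter description, and the actual function may resolve tokens differently.

```python
import os
from typing import Optional


def resolve_hf_token(hf_token: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: prefer the explicit token, else fall back to HF_TOKEN."""
    if hf_token:
        return hf_token
    # Returns None when the environment variable is unset.
    return os.environ.get("HF_TOKEN")
```

A None result would correspond to an anonymous client, which is typically enough for public repositories.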

get_hf_fs

hfutils.operate.base.get_hf_fs(hf_token: str | None = None) → HfFileSystem

Get the Hugging Face file system.

This function returns an instance of the Hugging Face file system. If no token is provided, it falls back to the token stored in the HF_TOKEN environment variable. The file system is configured with its listings cache disabled, so directory listings always reflect the current state of the repository.

Parameters:

hf_token (Optional[str]) – Hugging Face token for the file system. If not provided, uses the ‘HF_TOKEN’ environment variable.

Returns:

An instance of the Hugging Face file system.

Return type:

HfFileSystem

Example:

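>>> from hfutils.operate.base import get_hf_fs
>>> fs = get_hf_fs()
>>> # Use fs to interact with Hugging Face file system
>>> # Dataset repositories are addressed with the "datasets/" prefix
>>> fs.ls("datasets/username/repo")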

hf_repo_glob

hfutils.operate.base.hf_repo_glob(repo_id: str, pattern: List[str] | str = '**/*', repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', include_files: bool = True, include_directories: bool = False, raise_when_base_not_exist: bool = False, hf_token: str | None = None) → List[RepoFile | RepoFolder]

Glob files and directories in a Hugging Face repository using pattern matching.

This function performs pattern matching on files and directories in a Hugging Face repository, similar to filesystem globbing. It supports wildcard patterns and negation patterns for flexible file selection.

Note

Pattern matching syntax supports:

  • *: Matches everything except slashes (single level)

  • **: Matches zero or more directories recursively (requires GLOBSTAR flag, which is enabled)

  • ?: Matches any single character

  • [seq]: Matches any character in sequence

  • [!seq]: Matches any character not in sequence

  • !pattern: Negation pattern when used at start (requires NEGATE flag, which is enabled)

  • Multiple patterns can be provided as a list

  • Negation patterns filter out matches from inclusion patterns

  • Dot files are matched by default (DOTMATCH flag is enabled)

Note that * matches within a single directory level only, whereas **/* matches recursively, covering both the top level and all subdirectories.
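The difference between *, **/* and negation patterns can be made concrete with a small standard-library sketch. This is not the matcher hfutils uses; it is a simplified illustration that covers only *, ?, ** and a leading !, and the names glob_to_regex and select_paths are hypothetical.

```python
import re
from typing import List, Union


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a small glob subset (*, ?, **) into a regex (illustrative only)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append(r"(?:[^/]+/)*")  # zero or more directory levels
            i += 3
        elif pattern.startswith("**", i):
            parts.append(r".*")  # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            parts.append(r"[^/]*")  # anything except a slash
            i += 1
        elif pattern[i] == "?":
            parts.append(r"[^/]")  # exactly one non-slash character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts) + r"\Z")


def select_paths(paths: List[str], patterns: Union[str, List[str]]) -> List[str]:
    """Keep paths matching any inclusion pattern and no negation ('!') pattern."""
    if isinstance(patterns, str):
        patterns = [patterns]
    includes = [glob_to_regex(p) for p in patterns if not p.startswith("!")]
    excludes = [glob_to_regex(p[1:]) for p in patterns if p.startswith("!")]
    return [
        path
        for path in paths
        if any(r.match(path) for r in includes)
        and not any(r.match(path) for r in excludes)
    ]


paths = ["a.py", "src/b.py", "src/pkg/c.py", ".gitattributes", "README.md"]
top_level_py = select_paths(paths, "*.py")
all_py = select_paths(paths, "**/*.py")
no_top_level_dotfiles = select_paths(paths, ["**/*", "!.*"])
```

Under these simplified rules, "*.py" selects only a.py, "**/*.py" selects all three Python files, and the negation pattern "!.*" removes only the top-level .gitattributes entry.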

Parameters:
  • repo_id (str) – The identifier of the repository.

  • pattern (Union[List[str], str]) – Wildcard pattern(s) to match files and folders. Default is ‘**/*’ (all items).

  • repo_type (RepoTypeTyping) – The type of the repository (‘dataset’, ‘model’, ‘space’). Default is ‘dataset’.

  • revision (str) – The revision of the repository (e.g., branch, tag, commit hash). Default is ‘main’.

  • include_files (bool) – Whether to include files in the results. Default is True.

  • include_directories (bool) – Whether to include directories in the results. Default is False.

  • raise_when_base_not_exist (bool) – Whether to raise an exception when the repository or revision cannot be accessed (see Raises below). Default is False.

  • hf_token (Optional[str]) – Hugging Face token for API client. If not provided, uses the ‘HF_TOKEN’ environment variable.

Returns:

A list of RepoFile and/or RepoFolder objects matching the pattern.

Return type:

List[Union[RepoFile, RepoFolder]]

Raises:
  • RepositoryNotFoundError – If the repository is not found and raise_when_base_not_exist is True.

  • GatedRepoError – If the repository is gated and raise_when_base_not_exist is True.

  • DisabledRepoError – If the repository is disabled and raise_when_base_not_exist is True.

  • RevisionNotFoundError – If the revision is not found and raise_when_base_not_exist is True.

Example:

>>> from hfutils.operate.base import hf_repo_glob
>>> # Get all Python files in a repository, at any depth
>>> files = hf_repo_glob("username/repo", pattern="**/*.py")
>>> # Get all files except top-level dot files
>>> files = hf_repo_glob("username/repo", pattern=["**/*", "!.*"])
>>> # Get only directories
>>> dirs = hf_repo_glob("username/repo", pattern="**/*", include_files=False, include_directories=True)
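The effect of include_files and include_directories can be sketched as a type-based filter over the matched entries. The classes FileEntry and FolderEntry below are stand-ins for RepoFile and RepoFolder, and filter_entries is a hypothetical helper; how hfutils applies these flags internally is an assumption here.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class FileEntry:
    """Stand-in for RepoFile (assumption: only the path matters here)."""
    path: str


@dataclass
class FolderEntry:
    """Stand-in for RepoFolder."""
    path: str


Entry = Union[FileEntry, FolderEntry]


def filter_entries(
    entries: List[Entry],
    include_files: bool = True,
    include_directories: bool = False,
) -> List[Entry]:
    """Keep files and/or folders according to the two flags."""
    kept: List[Entry] = []
    for entry in entries:
        if isinstance(entry, FileEntry) and include_files:
            kept.append(entry)
        elif isinstance(entry, FolderEntry) and include_directories:
            kept.append(entry)
    return kept


entries = [FolderEntry("data"), FileEntry("data/a.json"), FileEntry("README.md")]
only_files = filter_entries(entries)
only_dirs = filter_entries(entries, include_files=False, include_directories=True)
```

With the default flags only files survive, which mirrors the documented defaults of include_files=True and include_directories=False.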

list_all_with_pattern

hfutils.operate.base.list_all_with_pattern(repo_id: str, pattern: List[str] | str = '**/*', repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', raise_when_base_not_exist: bool = False, hf_token: str | None = None) → List[RepoFile | RepoFolder]

List all files and folders in a Hugging Face repository matching a given pattern.

This function retrieves information about files and folders in a repository that match the specified pattern. It includes both files and directories in the results.

Parameters:
  • repo_id (str) – The identifier of the repository.

  • pattern (Union[List[str], str]) – Wildcard pattern(s) to match files and folders. Default is ‘**/*’ (all files and folders).

  • repo_type (RepoTypeTyping) – The type of the repository (‘dataset’, ‘model’, ‘space’). Default is ‘dataset’.

  • revision (str) – The revision of the repository (e.g., branch, tag, commit hash). Default is ‘main’.

  • raise_when_base_not_exist (bool) – Whether to raise an exception when the repository doesn’t exist. Default is False.

  • hf_token (Optional[str]) – Hugging Face token for API client. If not provided, uses the ‘HF_TOKEN’ environment variable.

Returns:

A list of RepoFile and RepoFolder objects matching the pattern.

Return type:

List[Union[RepoFile, RepoFolder]]

Example:

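>>> from huggingface_hub.hf_api import RepoFile
>>> from hfutils.operate.base import list_all_with_pattern
>>> # List all items matching a pattern
>>> items = list_all_with_pattern("username/repo", pattern="*.txt")
>>> for item in items:
...     print(f"{'File' if isinstance(item, RepoFile) else 'Folder'}: {item.path}")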

list_repo_files_in_repository

hfutils.operate.base.list_repo_files_in_repository(repo_id: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', subdir: str = '', pattern: List[str] | str = <object object>, revision: str = 'main', raise_when_base_not_exist: bool = False, hf_token: str | None = None) → List[Tuple[RepoFile, str]]

List repository files with their paths in a Hugging Face repository.

This function returns a list of tuples containing RepoFile objects and their corresponding relative paths that match the given pattern. By default, it excludes git-related files.

Parameters:
  • repo_id (str) – The identifier of the repository.

  • repo_type (RepoTypeTyping) – The type of the repository (‘dataset’, ‘model’, ‘space’). Default is ‘dataset’.

  • subdir (str) – The subdirectory to list files from. Default is an empty string (root directory).

  • pattern (Union[List[str], str]) – Wildcard pattern(s) of the target files. Default includes all files except git files.

  • revision (str) – The revision of the repository (e.g., branch, tag, commit hash). Default is ‘main’.

  • raise_when_base_not_exist (bool) – Whether to raise an exception when the repository doesn’t exist. Default is False.

  • hf_token (Optional[str]) – Hugging Face token for API client. If not provided, uses the ‘HF_TOKEN’ environment variable.

Returns:

A list of tuples containing RepoFile objects and their corresponding relative paths.

Return type:

List[Tuple[RepoFile, str]]

Example:

>>> from hfutils.operate.base import list_repo_files_in_repository
>>> files = list_repo_files_in_repository("username/repo", pattern="*.txt")
>>> for repo_file, path in files:
...     print(f"File: {path}, Size: {repo_file.size}")
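Because each result pairs a RepoFile with its relative path, the list lends itself to simple aggregation. The sketch below totals file sizes per extension; StubRepoFile is a stand-in carrying only the path and size attributes used in the example above, and size_by_extension is a hypothetical helper, not part of hfutils.

```python
import os
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class StubRepoFile:
    """Stand-in for RepoFile with only the attributes used here."""
    path: str
    size: int


def size_by_extension(pairs: Iterable[Tuple[StubRepoFile, str]]) -> Dict[str, int]:
    """Sum file sizes grouped by the extension of the relative path."""
    totals: Dict[str, int] = defaultdict(int)
    for repo_file, rel_path in pairs:
        ext = os.path.splitext(rel_path)[1] or "<none>"
        totals[ext] += repo_file.size
    return dict(totals)


pairs = [
    (StubRepoFile("data/a.txt", 10), "a.txt"),
    (StubRepoFile("data/b.txt", 5), "b.txt"),
    (StubRepoFile("data/README", 3), "README"),
]
totals = size_by_extension(pairs)
```

The same loop should work on the real return value as long as the size attribute is populated for each file.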

list_files_in_repository

hfutils.operate.base.list_files_in_repository(repo_id: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', subdir: str = '', pattern: List[str] | str = <object object>, revision: str = 'main', raise_when_base_not_exist: bool = False, hf_token: str | None = None) → List[str]

List files in a Hugging Face repository based on the given parameters.

This function retrieves a list of file paths in a specified repository that match the given pattern. By default, it excludes git-related files and returns only the relative paths as strings.

Parameters:
  • repo_id (str) – The identifier of the repository.

  • repo_type (RepoTypeTyping) – The type of the repository (‘dataset’, ‘model’, ‘space’). Default is ‘dataset’.

  • subdir (str) – The subdirectory to list files from. Default is an empty string (root directory).

  • pattern (Union[List[str], str]) – Wildcard pattern(s) of the target files. Default includes all files except git files.

  • revision (str) – The revision of the repository (e.g., branch, tag, commit hash). Default is ‘main’.

  • raise_when_base_not_exist (bool) – Whether to raise an exception when the repository doesn’t exist. Default is False.

  • hf_token (Optional[str]) – Hugging Face token for API client. If not provided, uses the ‘HF_TOKEN’ environment variable.

Returns:

A list of file paths that match the criteria.

Return type:

List[str]

Example:

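>>> from hfutils.operate.base import list_files_in_repository
>>> # Use **/ to match .txt files at any depth, not only the top level
>>> files = list_files_in_repository("username/repo", pattern="**/*.txt")
>>> print(files)
['file1.txt', 'folder/file2.txt']
>>> # List files in a specific subdirectory
>>> files = list_files_in_repository("username/repo", subdir="data", pattern="*.json")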
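The <object object> default shown for pattern in the signatures of list_repo_files_in_repository and list_files_in_repository is how a sentinel default is rendered: a unique object that lets the function tell "argument omitted" apart from any real value. A minimal sketch of the idiom follows. The name effective_pattern is hypothetical, and the fallback patterns are an assumption chosen to match the documented behaviour "all files except git files"; the actual default used by hfutils may differ.

```python
from typing import List, Union

# Unique sentinel; its repr looks like "<object object at 0x...>".
_DEFAULT_PATTERN = object()


def effective_pattern(
    pattern: Union[List[str], str, object] = _DEFAULT_PATTERN,
) -> List[str]:
    """Normalize the pattern argument into a list of patterns."""
    if pattern is _DEFAULT_PATTERN:
        # ASSUMPTION: illustrative default, not confirmed by the source.
        return ["**/*", "!.git*"]
    if isinstance(pattern, str):
        return [pattern]
    return list(pattern)
```

Passing any explicit pattern, including "**/*", replaces the default entirely, so git-related files would then need their own negation pattern if they should stay excluded.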