hfutils.index.make

This module provides functions for indexing TAR archive files, primarily for use with the Hugging Face ecosystem. It can create and retrieve index information for TAR archives, which allows individual files in large datasets to be located and read efficiently. The module also integrates with Hugging Face's repository system, supporting operations such as uploading and downloading TAR files and their indices.

Key functionalities include:

  • Extracting index information from TAR files.

  • Creating index files for TAR archives locally or in a directory.

  • Integrating with Hugging Face repositories to manage TAR archives and their indices.

The module uses cryptographic hash functions for data integrity checks and works with both local archives and archives stored in Hugging Face repositories.
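To illustrate why an index helps with retrieval, the following is a minimal sketch using only the Python standard library, not this module's implementation. It records where each member's data starts in an uncompressed TAR file and then reads one member by seeking straight to it. The helper names `build_offsets` and `read_member` are hypothetical.

```python
import io
import os
import tarfile
import tempfile


def build_offsets(tar_path):
    """Map each regular file in the archive to (data offset, size)."""
    offsets = {}
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if member.isfile():
                # offset_data is the byte position where the member's content begins.
                offsets[member.name] = (member.offset_data, member.size)
    return offsets


def read_member(tar_path, offset, size):
    """Read one member's bytes directly, without scanning the archive."""
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)


# Build a small uncompressed archive to demonstrate.
workdir = tempfile.mkdtemp()
tar_path = os.path.join(workdir, "demo.tar")
with tarfile.open(tar_path, "w") as tar:
    for name, payload in [("a.txt", b"hello"), ("b.txt", b"world!")]:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

offsets = build_offsets(tar_path)
offset, size = offsets["b.txt"]
data = read_member(tar_path, offset, size)
print(data)
```

This direct seek only works for uncompressed archives, since compression changes the byte positions of the stored data.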

tar_create_index

hfutils.index.make.tar_create_index(src_tar_file, dst_index_file: str | None = None, chunk_for_hash: int = 1048576, with_hash: bool = True, silent: bool = False)[source]

Create an index file for a tar archive file.

Parameters:
  • src_tar_file (str) – The path to the source tar archive file.

  • dst_index_file (str, optional) – The path to save the index file, defaults to None.

  • chunk_for_hash (int, optional) – The chunk size, in bytes, used when hashing, defaults to 1 << 20 (1 MiB).

  • with_hash (bool, optional) – Whether to include file hashes in the index, defaults to True.

  • silent (bool, optional) – Whether to suppress progress bars and logging messages, defaults to False.

Returns:

The path to the created index file.

Return type:

str
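The chunk_for_hash parameter suggests that file contents are hashed in fixed-size chunks rather than loaded into memory at once. The sketch below shows that general technique with the standard library. The choice of SHA-256 and the helper name `sha256_in_chunks` are assumptions made for illustration, not details confirmed by this page.

```python
import hashlib
import os
import tempfile


def sha256_in_chunks(path, chunk_size=1 << 20):
    """Hash a file by feeding it to the digest chunk_size bytes at a time."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


# Write a small sample file, then hash it with a deliberately small chunk size.
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"x" * 3000)

small_chunks = sha256_in_chunks(sample_path, chunk_size=1024)
one_shot = hashlib.sha256(b"x" * 3000).hexdigest()
print(small_chunks == one_shot)
```

The chunk size affects memory use and read granularity only; the resulting digest is the same for any chunk size.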

tar_create_index_for_directory

hfutils.index.make.tar_create_index_for_directory(src_tar_directory: str, dst_index_directory: str | None = None, chunk_for_hash: int = 1048576, with_hash: bool = True, silent: bool = False)[source]

Create index files for all tar archives in a specified directory.

This function scans through the given directory to find all tar files, generates an index for each, and saves these indices to the specified destination directory. If no destination directory is provided, indices are saved in the same directory as the tar files.

Parameters:
  • src_tar_directory (str) – The path to the directory containing tar files.

  • dst_index_directory (str, optional) – The path to the directory where index files will be saved, defaults to None, in which case src_tar_directory is used.

  • chunk_for_hash (int, optional) – The chunk size, in bytes, used when hashing, defaults to 1 << 20 (1 MiB).

  • with_hash (bool, optional) – Whether to include file hashes in the index, defaults to True.

  • silent (bool, optional) – Whether to suppress progress bars and logging messages, defaults to False.

Returns:

The path to the directory where index files are saved.

Return type:

str
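The destination rule described above can be sketched as a small planning step that pairs each tar file with the path of its index. This is an illustration only: the recursive search, the `.json` suffix, and the helper name `plan_index_paths` are assumptions, not behavior confirmed by this page.

```python
import tempfile
from pathlib import Path


def plan_index_paths(src_tar_directory, dst_index_directory=None):
    """Pair every .tar file under the source with a planned index path."""
    src = Path(src_tar_directory)
    # Fall back to the source directory when no destination is given.
    dst = Path(dst_index_directory) if dst_index_directory else src
    plan = {}
    for tar_file in sorted(src.rglob("*.tar")):
        relative = tar_file.relative_to(src)
        plan[str(tar_file)] = str(dst / relative.with_suffix(".json"))
    return plan


# Demonstrate on a temporary directory with two archives and one unrelated file.
root = Path(tempfile.mkdtemp())
(root / "shards").mkdir()
for name in ["b.tar", "shards/a.tar", "notes.txt"]:
    (root / name).touch()

plan = plan_index_paths(root)
for tar_file, index_file in plan.items():
    print(tar_file, "->", index_file)
```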

hf_tar_create_index

hfutils.index.make.hf_tar_create_index(repo_id: str, archive_in_repo: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', idx_repo_id: str | None = None, idx_file_in_repo: str | None = None, idx_repo_type: Literal['dataset', 'model', 'space'] | None = None, idx_revision: str | None = None, chunk_for_hash: int = 1048576, with_hash: bool = True, skip_when_synced: bool = True, hf_token: str | None = None)[source]

Create an index file for a tar archive file in a Hugging Face repository.

Parameters:
  • repo_id (str) – The identifier of the repository.

  • archive_in_repo (str) – The path to the tar archive file in the repository.

  • repo_type (RepoTypeTyping, optional) – The type of the Hugging Face repository, defaults to ‘dataset’.

  • revision (str, optional) – The revision of the repository, defaults to ‘main’.

  • idx_repo_id (str, optional) – The identifier of the index repository, defaults to None.

  • idx_file_in_repo (str, optional) – The path to save the index file in the index repository, defaults to None.

  • idx_repo_type (RepoTypeTyping, optional) – The type of the index repository, defaults to None.

  • idx_revision (str, optional) – The revision of the index repository, defaults to None.

  • chunk_for_hash (int, optional) – The chunk size, in bytes, used when hashing, defaults to 1 << 20 (1 MiB).

  • with_hash (bool, optional) – Whether to include file hashes in the index, defaults to True.

  • skip_when_synced (bool, optional) – Whether to skip the operation when the index is already in sync with the archive, defaults to True.

  • hf_token (str, optional) – The Hugging Face access token, defaults to None.

tar_get_index_info

hfutils.index.make.tar_get_index_info(src_tar_file, chunk_for_hash: int = 1048576, with_hash: bool = True, silent: bool = False)[source]

Get the index information of a tar archive file.

Note

The return value of this function is used directly as the content of the index JSON file.

Parameters:
  • src_tar_file (str) – The path to the source tar archive file.

  • chunk_for_hash (int, optional) – The chunk size, in bytes, used when hashing, defaults to 1 << 20 (1 MiB).

  • with_hash (bool, optional) – Whether to include file hashes in the index, defaults to True.

  • silent (bool, optional) – Whether to suppress progress bars and logging messages, defaults to False.

Returns:

The index information of the tar archive file.

Return type:

dict
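As a rough picture of what such a dictionary might contain, the sketch below builds a JSON-serializable index for a small archive using only the standard library. The layout and key names (`filesize`, `files`, `offset`, `size`, `sha256`) and the function name `sketch_index_info` are assumptions for illustration; the actual structure returned by tar_get_index_info may differ.

```python
import hashlib
import io
import json
import os
import tarfile
import tempfile


def sketch_index_info(tar_path, with_hash=True):
    """Collect per-file offset, size and (optionally) hash into a plain dict."""
    files = {}
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            entry = {"offset": member.offset_data, "size": member.size}
            if with_hash:
                content = tar.extractfile(member).read()
                entry["sha256"] = hashlib.sha256(content).hexdigest()
            files[member.name] = entry
    return {"filesize": os.path.getsize(tar_path), "files": files}


# Build a one-file archive and serialize its index as JSON.
workdir = tempfile.mkdtemp()
tar_path = os.path.join(workdir, "demo.tar")
with tarfile.open(tar_path, "w") as tar:
    payload = b"hello"
    info = tarfile.TarInfo(name="a.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

index_info = sketch_index_info(tar_path)
index_text = json.dumps(index_info, indent=2, sort_keys=True)
print(index_text)
```

Because the dictionary holds only strings, integers and nested dictionaries, it can be written to a JSON file without any conversion step.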

hf_tar_create_from_directory

hfutils.index.make.hf_tar_create_from_directory(repo_id: str, archive_in_repo: str, local_directory: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', chunk_for_hash: int = 1048576, with_hash: bool = True, silent: bool = False, hf_token: str | None = None)[source]

Create a tar archive file from a local directory and upload it to a Hugging Face repository.

Parameters:
  • repo_id (str) – The identifier of the repository.

  • archive_in_repo (str) – The path to save the tar archive file in the repository.

  • local_directory (str) – The path to the local directory to be archived.

  • repo_type (RepoTypeTyping, optional) – The type of the Hugging Face repository, defaults to ‘dataset’.

  • revision (str, optional) – The revision of the repository, defaults to ‘main’.

  • chunk_for_hash (int, optional) – The chunk size, in bytes, used when hashing, defaults to 1 << 20 (1 MiB).

  • with_hash (bool, optional) – Whether to include file hashes in the index, defaults to True.

  • silent (bool, optional) – Whether to suppress progress bars and logging messages, defaults to False.

  • hf_token (str, optional) – The Hugging Face access token, defaults to None.
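The local half of this operation, packing a directory into a tar archive, can be sketched with the standard library as follows. This is not the function's actual implementation, and the upload step is omitted. The helper name `pack_directory` and the choice to store paths relative to the directory root are assumptions.

```python
import os
import tarfile
import tempfile


def pack_directory(local_directory, tar_path):
    """Write every file under local_directory into an uncompressed tar."""
    with tarfile.open(tar_path, "w") as tar:
        for root, dirs, files in os.walk(local_directory):
            dirs.sort()  # visit subdirectories in a stable order
            for name in sorted(files):
                full_path = os.path.join(root, name)
                # Store paths relative to the directory root, with "/" separators.
                relative = os.path.relpath(full_path, local_directory)
                tar.add(full_path, arcname=relative.replace(os.sep, "/"))
    return tar_path


# Create a small directory tree, pack it, and list the archive contents.
source_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(source_dir, "sub"))
with open(os.path.join(source_dir, "a.txt"), "wb") as f:
    f.write(b"hello")
with open(os.path.join(source_dir, "sub", "b.txt"), "wb") as f:
    f.write(b"world")

# Keep the output archive outside the directory being packed.
output_dir = tempfile.mkdtemp()
archive_path = pack_directory(source_dir, os.path.join(output_dir, "data.tar"))

with tarfile.open(archive_path, "r") as tar:
    names = tar.getnames()
print(names)
```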