hfutils.index.fetch

ArchiveStandaloneFileIncompleteDownload

class hfutils.index.fetch.ArchiveStandaloneFileIncompleteDownload[source]

Exception raised when a standalone file in an archive is incompletely downloaded.

ArchiveStandaloneFileHashNotMatch

class hfutils.index.fetch.ArchiveStandaloneFileHashNotMatch[source]

Exception raised when the hash of a standalone file in an archive does not match.

hf_tar_get_index

hfutils.index.fetch.hf_tar_get_index(repo_id: str, archive_in_repo: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', idx_repo_id: str | None = None, idx_file_in_repo: str | None = None, idx_repo_type: Literal['dataset', 'model', 'space'] | None = None, idx_revision: str | None = None, hf_token: str | None = None)[source]

Get the index of a tar archive file in a Hugging Face repository.

Parameters:
  • repo_id (str) – The identifier of the repository.

  • archive_in_repo (str) – The path to the archive file in the repository.

  • repo_type (RepoTypeTyping, optional) – The type of the Hugging Face repository.

  • revision (str, optional) – The revision of the repository.

  • idx_repo_id (str, optional) – The identifier of the index repository.

  • idx_file_in_repo (str, optional) – The path to the index file in the index repository.

  • idx_repo_type (RepoTypeTyping, optional) – The type of the index repository.

  • idx_revision (str, optional) – The revision of the index repository.

  • hf_token (str, optional) – The Hugging Face access token.

Returns:

The index of the tar archive file.

Return type:

Dict

Examples:
>>> from hfutils.index import hf_tar_get_index
>>>
>>> idx = hf_tar_get_index(
...     repo_id='deepghs/danbooru_newest',
...     archive_in_repo='images/0000.tar',
... )
>>> idx.keys()
dict_keys(['filesize', 'hash', 'hash_lfs', 'files'])
>>> idx['files'].keys()
dict_keys(['7507000.jpg', '7506000.jpg', '7505000.jpg', ...])

Note

If the tar archive and its index file are stored in different repositories, this function can still retrieve the index: pass the index repository explicitly through the idx_repo_id argument.

>>> from hfutils.index import hf_tar_get_index
>>>
>>> idx = hf_tar_get_index(
...     repo_id='nyanko7/danbooru2023',
...     idx_repo_id='deepghs/danbooru2023_index',
...     archive_in_repo='original/data-0000.tar',
... )
>>> idx.keys()
dict_keys(['filesize', 'hash', 'hash_lfs', 'files'])
>>> idx['files'].keys()
dict_keys(['./1000.png', './10000.jpg', './100000.jpg', ...])
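To make the shape of the returned index more concrete, here is a minimal sketch that builds a similar dictionary from a local tar archive using only the standard library. It is not how hfutils generates its index files. The helpers make_tar and build_tar_index are hypothetical, the per-file layout (offset, size, sha256) is assumed from the hf_tar_file_info output, and the 'hash' and 'hash_lfs' fields are left out because their exact definition is not stated here.

```python
import hashlib
import io
import tarfile


def make_tar(members: dict) -> bytes:
    """Pack {name: payload} into an uncompressed in-memory tar (demo helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode='w') as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def build_tar_index(tar_bytes: bytes) -> dict:
    """Build an index with 'filesize' and per-file offset/size/sha256 entries.

    Assumed layout; the 'hash' and 'hash_lfs' fields are intentionally omitted.
    """
    files = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode='r:') as tar:
        for member in tar:
            if not member.isfile():
                continue  # directories and links carry no payload
            start = member.offset_data  # where the payload begins in the archive
            payload = tar_bytes[start:start + member.size]
            files[member.name] = {
                'offset': start,
                'size': member.size,
                'sha256': hashlib.sha256(payload).hexdigest(),
            }
    return {'filesize': len(tar_bytes), 'files': files}


tar_bytes = make_tar({'a.txt': b'hello', 'b.bin': b'\x00' * 700})
index = build_tar_index(tar_bytes)
print(sorted(index['files']))
```

Because each entry records where a member's payload starts and how long it is, a reader of the index never needs to scan the tar headers again.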

hf_tar_list_files

hfutils.index.fetch.hf_tar_list_files(repo_id: str, archive_in_repo: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', idx_repo_id: str | None = None, idx_file_in_repo: str | None = None, idx_repo_type: Literal['dataset', 'model', 'space'] | None = None, idx_revision: str | None = None, hf_token: str | None = None) → List[str][source]

List files inside a tar archive file in a Hugging Face repository.

Parameters:
  • repo_id (str) – The identifier of the repository.

  • archive_in_repo (str) – The path to the archive file in the repository.

  • repo_type (RepoTypeTyping, optional) – The type of the Hugging Face repository.

  • revision (str, optional) – The revision of the repository.

  • idx_repo_id (str, optional) – The identifier of the index repository.

  • idx_file_in_repo (str, optional) – The path to the index file in the index repository.

  • idx_repo_type (RepoTypeTyping, optional) – The type of the index repository.

  • idx_revision (str, optional) – The revision of the index repository.

  • hf_token (str, optional) – The Hugging Face access token.

Returns:

The list of files inside the tar archive.

Return type:

List[str]

Examples:
>>> from hfutils.index import hf_tar_list_files
>>>
>>> hf_tar_list_files(
...     repo_id='deepghs/danbooru_newest',
...     archive_in_repo='images/0000.tar',
... )
['7507000.jpg', '7506000.jpg', '7505000.jpg', ...]

Note

If the tar archive and its index file are stored in different repositories, this function can still list the files: pass the index repository explicitly through the idx_repo_id argument.

>>> from hfutils.index import hf_tar_list_files
>>>
>>> hf_tar_list_files(
...     repo_id='nyanko7/danbooru2023',
...     idx_repo_id='deepghs/danbooru2023_index',
...     archive_in_repo='original/data-0000.tar',
... )
['./1000.png', './10000.jpg', './100000.jpg', ...]

hf_tar_file_exists

hfutils.index.fetch.hf_tar_file_exists(repo_id: str, archive_in_repo: str, file_in_archive: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', idx_repo_id: str | None = None, idx_file_in_repo: str | None = None, idx_repo_type: Literal['dataset', 'model', 'space'] | None = None, idx_revision: str | None = None, hf_token: str | None = None)[source]

Check if a file exists inside a tar archive file in a Hugging Face repository.

Parameters:
  • repo_id (str) – The identifier of the repository.

  • archive_in_repo (str) – The path to the archive file in the repository.

  • file_in_archive (str) – The path to the file inside the archive.

  • repo_type (RepoTypeTyping, optional) – The type of the Hugging Face repository.

  • revision (str, optional) – The revision of the repository.

  • idx_repo_id (str, optional) – The identifier of the index repository.

  • idx_file_in_repo (str, optional) – The path to the index file in the index repository.

  • idx_repo_type (RepoTypeTyping, optional) – The type of the index repository.

  • idx_revision (str, optional) – The revision of the index repository.

  • hf_token (str, optional) – The Hugging Face access token.

Returns:

True if the file exists, False otherwise.

Return type:

bool

Examples:
>>> from hfutils.index import hf_tar_file_exists
>>>
>>> hf_tar_file_exists(
...     repo_id='deepghs/danbooru_newest',
...     archive_in_repo='images/0000.tar',
...     file_in_archive='7506000.jpg',
... )
True
>>> hf_tar_file_exists(
...     repo_id='deepghs/danbooru_newest',
...     archive_in_repo='images/0000.tar',
...     file_in_archive='17506000.jpg',
... )
False

Note

If the tar archive and its index file are stored in different repositories, this function can still check whether a file exists: pass the index repository explicitly through the idx_repo_id argument.

>>> from hfutils.index import hf_tar_file_exists
>>>
>>> hf_tar_file_exists(
...     repo_id='nyanko7/danbooru2023',
...     idx_repo_id='deepghs/danbooru2023_index',
...     archive_in_repo='original/data-0000.tar',
...     file_in_archive='1000.png'
... )
True
>>> hf_tar_file_exists(
...     repo_id='nyanko7/danbooru2023',
...     idx_repo_id='deepghs/danbooru2023_index',
...     archive_in_repo='original/data-0000.tar',
...     file_in_archive='10000000001000.png'
... )
False
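In the examples above, the index keys carry a './' prefix ('./1000.png') while the lookup uses '1000.png', which suggests that member paths are normalized before comparison. The exact rule hfutils applies is not stated here. The sketch below shows one plausible approach using posixpath.normpath. Both normalize_member_path and file_exists_in_index are hypothetical helpers, not part of the library.

```python
import posixpath


def normalize_member_path(path: str) -> str:
    """Collapse './' prefixes and redundant separators (assumed rule)."""
    return posixpath.normpath(path)


def file_exists_in_index(index: dict, file_in_archive: str) -> bool:
    """Check membership against index['files'] after normalizing both sides."""
    wanted = normalize_member_path(file_in_archive)
    return any(normalize_member_path(name) == wanted for name in index['files'])


# Index shaped like the hf_tar_get_index output; entries trimmed for brevity.
index = {'files': {'./1000.png': {}, './10000.jpg': {}}}

print(file_exists_in_index(index, '1000.png'))
print(file_exists_in_index(index, '10000000001000.png'))
```

Normalizing both the query and the stored keys means callers do not need to know which prefix style the archive was packed with.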

hf_tar_file_size

hfutils.index.fetch.hf_tar_file_size(repo_id: str, archive_in_repo: str, file_in_archive: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', idx_repo_id: str | None = None, idx_file_in_repo: str | None = None, idx_repo_type: Literal['dataset', 'model', 'space'] | None = None, idx_revision: str | None = None, hf_token: str | None = None) → int[source]

Get the size of a file inside an indexed tar archive.

Parameters:
  • repo_id (str) – The identifier of the repository.

  • archive_in_repo (str) – The path to the archive file in the repository.

  • file_in_archive (str) – The path to the file inside the archive.

  • repo_type (RepoTypeTyping, optional) – The type of the Hugging Face repository.

  • revision (str, optional) – The revision of the repository.

  • idx_repo_id (str, optional) – The identifier of the index repository.

  • idx_file_in_repo (str, optional) – The path to the index file in the index repository.

  • idx_repo_type (RepoTypeTyping, optional) – The type of the index repository.

  • idx_revision (str, optional) – The revision of the index repository.

  • hf_token (str, optional) – The Hugging Face access token.

Returns:

The size of the file, in bytes.

Return type:

int

Raises:

FileNotFoundError – Raised when the file does not exist in the tar archive.

Examples:
>>> from hfutils.index import hf_tar_file_size
>>>
>>> hf_tar_file_size(
...     repo_id='deepghs/danbooru_newest',
...     archive_in_repo='images/0000.tar',
...     file_in_archive='7506000.jpg',
... )
435671

Note

If the tar archive and its index file are stored in different repositories, this function can still return the file size: pass the index repository explicitly through the idx_repo_id argument.

>>> from hfutils.index import hf_tar_file_size
>>>
>>> hf_tar_file_size(
...     repo_id='nyanko7/danbooru2023',
...     idx_repo_id='deepghs/danbooru2023_index',
...     archive_in_repo='original/data-0000.tar',
...     file_in_archive='1000.png'
... )
11966

hf_tar_file_info

hfutils.index.fetch.hf_tar_file_info(repo_id: str, archive_in_repo: str, file_in_archive: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', idx_repo_id: str | None = None, idx_file_in_repo: str | None = None, idx_repo_type: Literal['dataset', 'model', 'space'] | None = None, idx_revision: str | None = None, hf_token: str | None = None) → dict[source]

Get detailed information about a file inside an indexed tar archive, including its offset, size, and SHA-256 hash.

Parameters:
  • repo_id (str) – The identifier of the repository.

  • archive_in_repo (str) – The path to the archive file in the repository.

  • file_in_archive (str) – The path to the file inside the archive.

  • repo_type (RepoTypeTyping, optional) – The type of the Hugging Face repository.

  • revision (str, optional) – The revision of the repository.

  • idx_repo_id (str, optional) – The identifier of the index repository.

  • idx_file_in_repo (str, optional) – The path to the index file in the index repository.

  • idx_repo_type (RepoTypeTyping, optional) – The type of the index repository.

  • idx_revision (str, optional) – The revision of the index repository.

  • hf_token (str, optional) – The Hugging Face access token.

Returns:

A dictionary with the file’s metadata: offset, size, and sha256.

Return type:

dict

Raises:

FileNotFoundError – Raised when the file does not exist in the tar archive.

Examples:
>>> from hfutils.index import hf_tar_file_info
>>>
>>> hf_tar_file_info(
...     repo_id='deepghs/danbooru_newest',
...     archive_in_repo='images/0000.tar',
...     file_in_archive='7506000.jpg',
... )
{'offset': 265728, 'size': 435671, 'sha256': 'ef6a4e031fdffb705c8ce2c64e8cb8d993f431a887d7c1c0b1e6fa56e6107fcd'}

Note

If the tar archive and its index file are stored in different repositories, this function can still return the file information: pass the index repository explicitly through the idx_repo_id argument.

>>> from hfutils.index import hf_tar_file_info
>>>
>>> hf_tar_file_info(
...     repo_id='nyanko7/danbooru2023',
...     idx_repo_id='deepghs/danbooru2023_index',
...     archive_in_repo='original/data-0000.tar',
...     file_in_archive='1000.png'
... )
{'offset': 1024, 'size': 11966, 'sha256': '478d3313860519372f6a75ede287d4a7c18a2d851bbc79b3dd65caff4c716858'}
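The offset, size, and sha256 fields are enough to pull one member out of an archive without parsing it. The following sketch reads a member from a local, seekable copy of a tar file and verifies it. It assumes that offset points at the start of the member's payload rather than its tar header. read_member is a hypothetical helper, and the tiny in-memory archive exists only to make the example runnable.

```python
import hashlib
import io
import tarfile


def read_member(archive, info: dict) -> bytes:
    """Read one payload from a seekable archive using offset and size, then verify it."""
    archive.seek(info['offset'])
    data = archive.read(info['size'])
    if len(data) != info['size']:
        raise IOError(f"expected {info['size']} bytes, got {len(data)}")
    if hashlib.sha256(data).hexdigest() != info['sha256']:
        raise ValueError('sha256 mismatch')
    return data


# Build a small tar in memory to stand in for a real archive.
payload = b'example payload'
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w') as tar:
    member = tarfile.TarInfo(name='1000.png')
    member.size = len(payload)
    tar.addfile(member, io.BytesIO(payload))

# Derive an info dict shaped like the hf_tar_file_info output.
buf.seek(0)
with tarfile.open(fileobj=buf, mode='r:') as tar:
    m = tar.getmember('1000.png')
    info = {
        'offset': m.offset_data,  # start of the payload, after the tar header
        'size': m.size,
        'sha256': hashlib.sha256(payload).hexdigest(),
    }

data = read_member(buf, info)
print(len(data))
```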

hf_tar_file_download

hfutils.index.fetch.hf_tar_file_download(repo_id: str, archive_in_repo: str, file_in_archive: str, local_file: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', idx_repo_id: str | None = None, idx_file_in_repo: str | None = None, idx_repo_type: Literal['dataset', 'model', 'space'] | None = None, idx_revision: str | None = None, proxies: Dict | None = None, user_agent: Dict | str | None = None, headers: Dict[str, str] | None = None, endpoint: str | None = None, force_download: bool = False, silent: bool = False, hf_token: str | None = None)[source]

Download a specific file from a tar archive stored in a Hugging Face repository.

This function allows you to extract and download a single file from a tar archive that is hosted in a Hugging Face repository. It handles authentication, supports different repository types, and can work with separate index repositories.

Parameters:
  • repo_id (str) – The identifier of the repository containing the tar archive.

  • archive_in_repo (str) – The path to the tar archive file within the repository.

  • file_in_archive (str) – The path to the desired file inside the tar archive.

  • local_file (str) – The local path where the downloaded file will be saved.

  • repo_type (RepoTypeTyping, optional) – The type of the Hugging Face repository (e.g., ‘dataset’, ‘model’, ‘space’).

  • revision (str, optional) – The specific revision of the repository to use.

  • idx_repo_id (str, optional) – The identifier of a separate index repository, if applicable.

  • idx_file_in_repo (str, optional) – The path to the index file in the index repository.

  • idx_repo_type (RepoTypeTyping, optional) – The type of the index repository.

  • idx_revision (str, optional) – The revision of the index repository.

  • proxies (Dict, optional) – Proxy settings for the HTTP request.

  • user_agent (Union[Dict, str, None], optional) – Custom user agent for the HTTP request.

  • headers (Dict[str, str], optional) – Additional headers for the HTTP request.

  • endpoint (str, optional) – Custom Hugging Face API endpoint.

  • force_download (bool) – If True, force re-download even if the file exists locally.

  • silent (bool) – If True, suppress progress bar output.

  • hf_token (str, optional) – Hugging Face authentication token.

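Raises:
  • FileNotFoundError – If the specified file is not found in the tar archive.

  • ArchiveStandaloneFileIncompleteDownload – If the download is incomplete.

  • ArchiveStandaloneFileHashNotMatch – If the downloaded file’s hash does not match the expected hash.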

This function performs several steps:

  1. Retrieves the index of the tar archive.

  2. Checks if the desired file exists in the archive.

  3. Constructs the download URL and headers.

  4. Checks if the file already exists locally and matches the expected size and hash.

  5. Downloads the file if necessary, using byte range requests for efficiency.

  6. Verifies the downloaded file’s size and hash.
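Steps 3 to 6 can be sketched with the standard library alone. This is an illustration under stated assumptions, not the hfutils implementation: fetch is a hypothetical callable standing in for the HTTP request, the two exception classes are local stand-ins for ArchiveStandaloneFileIncompleteDownload and ArchiveStandaloneFileHashNotMatch, and info is assumed to be the offset/size/sha256 dictionary already obtained in steps 1 and 2.

```python
import hashlib
import os
from typing import Callable, Dict


class IncompleteDownload(Exception):
    """Local stand-in for ArchiveStandaloneFileIncompleteDownload."""


class HashNotMatch(Exception):
    """Local stand-in for ArchiveStandaloneFileHashNotMatch."""


def range_header(offset: int, size: int) -> Dict[str, str]:
    # HTTP byte ranges are inclusive on both ends.
    return {'Range': f'bytes={offset}-{offset + size - 1}'}


def sha256_of_file(path: str) -> str:
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            digest.update(chunk)
    return digest.hexdigest()


def local_file_matches(path: str, info: dict) -> bool:
    """Step 4: does a local copy with the expected size and hash already exist?"""
    return (
        os.path.isfile(path)
        and os.path.getsize(path) == info['size']
        and sha256_of_file(path) == info['sha256']
    )


def download_member(fetch: Callable[[Dict[str, str]], bytes], info: dict,
                    local_file: str, force_download: bool = False) -> bool:
    """Return True if bytes were fetched, False if the local copy was reused."""
    if not force_download and local_file_matches(local_file, info):
        return False
    # Step 5: request only the member's byte range (nothing to fetch if empty).
    data = fetch(range_header(info['offset'], info['size'])) if info['size'] else b''
    # Step 6: verify size, then hash, before writing anything to disk.
    if len(data) != info['size']:
        raise IncompleteDownload(f"expected {info['size']} bytes, got {len(data)}")
    if hashlib.sha256(data).hexdigest() != info['sha256']:
        raise HashNotMatch('sha256 of downloaded data does not match the index')
    os.makedirs(os.path.dirname(os.path.abspath(local_file)), exist_ok=True)
    with open(local_file, 'wb') as f:
        f.write(data)
    return True
```

Verifying before writing keeps a truncated or corrupted response from replacing a good local copy. A real implementation may prefer to stream to a temporary file instead of holding the member in memory.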

Usage examples:
  1. Basic usage:
    >>> from hfutils.index import hf_tar_file_download
    >>>
    >>> hf_tar_file_download(
    ...     repo_id='deepghs/danbooru_newest',
    ...     archive_in_repo='images/0000.tar',
    ...     file_in_archive='7506000.jpg',
    ...     local_file='test_example.jpg'  # download destination
    ... )

  2. Using a separate index repository:
    >>> from hfutils.index import hf_tar_file_download
    >>>
    >>> hf_tar_file_download(
    ...     repo_id='nyanko7/danbooru2023',
    ...     idx_repo_id='deepghs/danbooru2023_index',
    ...     archive_in_repo='original/data-0000.tar',
    ...     file_in_archive='1000.png',
    ...     local_file='test_example.png'  # download destination
    ... )

Note

  • This function is particularly useful for efficiently downloading single files from large tar archives without having to download the entire archive.

  • It supports authentication via the hf_token parameter, which is crucial for accessing private repositories.

  • The function includes checks to avoid unnecessary downloads and to ensure the integrity of the downloaded file.