hfutils.index.fetch

ArchiveStandaloneFileIncompleteDownload

class hfutils.index.fetch.ArchiveStandaloneFileIncompleteDownload

Exception raised when a standalone file in an archive is incompletely downloaded.

ArchiveStandaloneFileHashNotMatch

class hfutils.index.fetch.ArchiveStandaloneFileHashNotMatch

Exception raised when the hash of a standalone file in an archive does not match.
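
The two exceptions above correspond to two different integrity failures: too few bytes, or the right number of bytes with the wrong content. The following is a minimal sketch of such a check, not the library's implementation. IncompleteDownload, HashNotMatch and verify_payload are hypothetical local stand-ins, and the use of SHA-256 is an assumption based on the 'sha256' field shown in the index examples.

```python
import hashlib


class IncompleteDownload(Exception):
    """Local stand-in for ArchiveStandaloneFileIncompleteDownload."""


class HashNotMatch(Exception):
    """Local stand-in for ArchiveStandaloneFileHashNotMatch."""


def verify_payload(data: bytes, expected_size: int, expected_sha256: str) -> None:
    # Check the length first: a short payload means the transfer stopped early.
    if len(data) != expected_size:
        raise IncompleteDownload(
            f'expected {expected_size} bytes, got {len(data)}'
        )
    # Hash only once the length is right.
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256:
        raise HashNotMatch(f'expected {expected_sha256}, got {actual}')


payload = b'hello'
digest = hashlib.sha256(payload).hexdigest()
verify_payload(payload, len(payload), digest)  # passes silently
```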

hf_tar_get_index

hfutils.index.fetch.hf_tar_get_index(repo_id: str, archive_in_repo: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', idx_repo_id: str | None = None, idx_file_in_repo: str | None = None, idx_repo_type: Literal['dataset', 'model', 'space'] | None = None, idx_revision: str | None = None, hf_token: str | None = None)

Get the index of a tar archive file in a Hugging Face repository.

Parameters:
  • repo_id (str) – The identifier of the repository.

  • archive_in_repo (str) – The path to the archive file in the repository.

  • repo_type (RepoTypeTyping, optional) – The type of the Hugging Face repository.

  • revision (str, optional) – The revision of the repository.

  • idx_repo_id (str, optional) – The identifier of the index repository.

  • idx_file_in_repo (str, optional) – The path to the index file in the index repository.

  • idx_repo_type (RepoTypeTyping, optional) – The type of the index repository.

  • idx_revision (str, optional) – The revision of the index repository.

  • hf_token (str, optional) – The Hugging Face access token.

Returns:

The index of the tar archive file.

Return type:

Dict

Examples::
>>> from hfutils.index import hf_tar_get_index
>>>
>>> idx = hf_tar_get_index(
...     repo_id='deepghs/danbooru_newest',
...     archive_in_repo='images/0000.tar',
... )
>>> idx.keys()
dict_keys(['filesize', 'hash', 'hash_lfs', 'files'])
>>> idx['files'].keys()
dict_keys(['7507000.jpg', '7506000.jpg', '7505000.jpg', ...])

Note

If the tar archive and its index file are stored in different repositories, pass the idx_repo_id argument explicitly to read the index from the index repository.

>>> from hfutils.index import hf_tar_get_index
>>>
>>> idx = hf_tar_get_index(
...     repo_id='nyanko7/danbooru2023',
...     idx_repo_id='deepghs/danbooru2023_index',
...     archive_in_repo='original/data-0000.tar',
... )
>>> idx.keys()
dict_keys(['filesize', 'hash', 'hash_lfs', 'files'])
>>> idx['files'].keys()
dict_keys(['./1000.png', './10000.jpg', './100000.jpg', ...])
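
To give a feel for the index layout shown above, the following is a minimal sketch that builds a comparable dictionary from an in-memory tar using only the standard library. It is not how hfutils produces its index files. make_tar and build_index are hypothetical helpers, the reading of 'offset' as the start of a member's data is an assumption, and the 'hash' and 'hash_lfs' entries are left out because their definitions are not given here.

```python
import hashlib
import io
import tarfile


def make_tar(entries: dict) -> bytes:
    """Pack {name: bytes} into an uncompressed tar held in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode='w') as tar:
        for name, data in entries.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def build_index(tar_bytes: bytes) -> dict:
    """Record data offset, size and SHA-256 for every regular file."""
    files = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode='r:') as tar:
        for member in tar:
            if not member.isfile():
                continue
            data = tar.extractfile(member).read()
            files[member.name] = {
                'offset': member.offset_data,  # assumed meaning of 'offset'
                'size': member.size,
                'sha256': hashlib.sha256(data).hexdigest(),
            }
    return {'filesize': len(tar_bytes), 'files': files}


tar_bytes = make_tar({'a.txt': b'first file', 'b.txt': b'second file'})
index = build_index(tar_bytes)
print(sorted(index['files']))
```

With such an index, any member can be recovered by slicing the archive bytes from offset to offset + size, without scanning the tar.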

hf_tar_list_files

hfutils.index.fetch.hf_tar_list_files(repo_id: str, archive_in_repo: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', idx_repo_id: str | None = None, idx_file_in_repo: str | None = None, idx_repo_type: Literal['dataset', 'model', 'space'] | None = None, idx_revision: str | None = None, hf_token: str | None = None) → List[str]

List files inside a tar archive file in a Hugging Face repository.

Parameters:
  • repo_id (str) – The identifier of the repository.

  • archive_in_repo (str) – The path to the archive file in the repository.

  • repo_type (RepoTypeTyping, optional) – The type of the Hugging Face repository.

  • revision (str, optional) – The revision of the repository.

  • idx_repo_id (str, optional) – The identifier of the index repository.

  • idx_file_in_repo (str, optional) – The path to the index file in the index repository.

  • idx_repo_type (RepoTypeTyping, optional) – The type of the index repository.

  • idx_revision (str, optional) – The revision of the index repository.

  • hf_token (str, optional) – The Hugging Face access token.

Returns:

The list of files inside the tar archive.

Return type:

List[str]

Examples::
>>> from hfutils.index import hf_tar_list_files
>>>
>>> hf_tar_list_files(
...     repo_id='deepghs/danbooru_newest',
...     archive_in_repo='images/0000.tar',
... )
['7507000.jpg', '7506000.jpg', '7505000.jpg', ...]

Note

If the tar archive and its index file are stored in different repositories, pass the idx_repo_id argument explicitly to list the files using the index repository.

>>> from hfutils.index import hf_tar_list_files
>>>
>>> hf_tar_list_files(
...     repo_id='nyanko7/danbooru2023',
...     idx_repo_id='deepghs/danbooru2023_index',
...     archive_in_repo='original/data-0000.tar',
... )
['./1000.png', './10000.jpg', './100000.jpg', ...]

hf_tar_file_exists

hfutils.index.fetch.hf_tar_file_exists(repo_id: str, archive_in_repo: str, file_in_archive: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', idx_repo_id: str | None = None, idx_file_in_repo: str | None = None, idx_repo_type: Literal['dataset', 'model', 'space'] | None = None, idx_revision: str | None = None, hf_token: str | None = None)

Check if a file exists inside a tar archive file in a Hugging Face repository.

Parameters:
  • repo_id (str) – The identifier of the repository.

  • archive_in_repo (str) – The path to the archive file in the repository.

  • file_in_archive (str) – The path to the file inside the archive.

  • repo_type (RepoTypeTyping, optional) – The type of the Hugging Face repository.

  • revision (str, optional) – The revision of the repository.

  • idx_repo_id (str, optional) – The identifier of the index repository.

  • idx_file_in_repo (str, optional) – The path to the index file in the index repository.

  • idx_repo_type (RepoTypeTyping, optional) – The type of the index repository.

  • idx_revision (str, optional) – The revision of the index repository.

  • hf_token (str, optional) – The Hugging Face access token.

Returns:

True if the file exists, False otherwise.

Return type:

bool

Examples::
>>> from hfutils.index import hf_tar_file_exists
>>>
>>> hf_tar_file_exists(
...     repo_id='deepghs/danbooru_newest',
...     archive_in_repo='images/0000.tar',
...     file_in_archive='7506000.jpg',
... )
True
>>> hf_tar_file_exists(
...     repo_id='deepghs/danbooru_newest',
...     archive_in_repo='images/0000.tar',
...     file_in_archive='17506000.jpg',
... )
False

Note

If the tar archive and its index file are stored in different repositories, pass the idx_repo_id argument explicitly to check for the file using the index repository.

>>> from hfutils.index import hf_tar_file_exists
>>>
>>> hf_tar_file_exists(
...     repo_id='nyanko7/danbooru2023',
...     idx_repo_id='deepghs/danbooru2023_index',
...     archive_in_repo='original/data-0000.tar',
...     file_in_archive='1000.png'
... )
True
>>> hf_tar_file_exists(
...     repo_id='nyanko7/danbooru2023',
...     idx_repo_id='deepghs/danbooru2023_index',
...     archive_in_repo='original/data-0000.tar',
...     file_in_archive='10000000001000.png'
... )
False
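
In the examples above, the index of 'original/data-0000.tar' lists its members as './1000.png', yet a query for '1000.png' returns True, which suggests that paths are normalized before comparison. The following is a minimal sketch of one way to do that against an already loaded index. normalize and file_exists are hypothetical helpers, and the exact normalization rule used by the library is an assumption.

```python
import posixpath


def normalize(path: str) -> str:
    # Anchor the path at '/' so './1000.png' and '1000.png' compare equal.
    return posixpath.normpath(posixpath.join('/', path))


def file_exists(index: dict, file_in_archive: str) -> bool:
    wanted = normalize(file_in_archive)
    return any(normalize(name) == wanted for name in index['files'])


# A one-entry index using the values shown in this document.
index = {
    'files': {
        './1000.png': {
            'offset': 1024,
            'size': 11966,
            'sha256': '478d3313860519372f6a75ede287d4a7c18a2d851bbc79b3dd65caff4c716858',
        },
    },
}

print(file_exists(index, '1000.png'))            # True
print(file_exists(index, '10000000001000.png'))  # False
```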

hf_tar_file_size

hfutils.index.fetch.hf_tar_file_size(repo_id: str, archive_in_repo: str, file_in_archive: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', idx_repo_id: str | None = None, idx_file_in_repo: str | None = None, idx_repo_type: Literal['dataset', 'model', 'space'] | None = None, idx_revision: str | None = None, hf_token: str | None = None) → int

Get the size of a file inside an indexed tar archive.

Parameters:
  • repo_id (str) – The identifier of the repository.

  • archive_in_repo (str) – The path to the archive file in the repository.

  • file_in_archive (str) – The path to the file inside the archive.

  • repo_type (RepoTypeTyping, optional) – The type of the Hugging Face repository.

  • revision (str, optional) – The revision of the repository.

  • idx_repo_id (str, optional) – The identifier of the index repository.

  • idx_file_in_repo (str, optional) – The path to the index file in the index repository.

  • idx_repo_type (RepoTypeTyping, optional) – The type of the index repository.

  • idx_revision (str, optional) – The revision of the index repository.

  • hf_token (str, optional) – The Hugging Face access token.

Returns:

The size of the file, in bytes.

Return type:

int

Raises:

FileNotFoundError – Raised when the file does not exist in the tar archive.

Examples::
>>> from hfutils.index import hf_tar_file_size
>>>
>>> hf_tar_file_size(
...     repo_id='deepghs/danbooru_newest',
...     archive_in_repo='images/0000.tar',
...     file_in_archive='7506000.jpg',
... )
435671

Note

If the tar archive and its index file are stored in different repositories, pass the idx_repo_id argument explicitly to get the file size using the index repository.

>>> from hfutils.index import hf_tar_file_size
>>>
>>> hf_tar_file_size(
...     repo_id='nyanko7/danbooru2023',
...     idx_repo_id='deepghs/danbooru2023_index',
...     archive_in_repo='original/data-0000.tar',
...     file_in_archive='1000.png'
... )
11966

hf_tar_file_info

hfutils.index.fetch.hf_tar_file_info(repo_id: str, archive_in_repo: str, file_in_archive: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', idx_repo_id: str | None = None, idx_file_in_repo: str | None = None, idx_repo_type: Literal['dataset', 'model', 'space'] | None = None, idx_revision: str | None = None, hf_token: str | None = None) → dict

Get detailed information about a file inside an indexed tar archive, including its offset, size, and SHA-256 hash.

Parameters:
  • repo_id (str) – The identifier of the repository.

  • archive_in_repo (str) – The path to the archive file in the repository.

  • file_in_archive (str) – The path to the file inside the archive.

  • repo_type (RepoTypeTyping, optional) – The type of the Hugging Face repository.

  • revision (str, optional) – The revision of the repository.

  • idx_repo_id (str, optional) – The identifier of the index repository.

  • idx_file_in_repo (str, optional) – The path to the index file in the index repository.

  • idx_repo_type (RepoTypeTyping, optional) – The type of the index repository.

  • idx_revision (str, optional) – The revision of the index repository.

  • hf_token (str, optional) – The Hugging Face access token.

Returns:

A dictionary containing the metadata of the file.

Return type:

dict

Raises:

FileNotFoundError – Raised when the file does not exist in the tar archive.

Examples::
>>> from hfutils.index import hf_tar_file_info
>>>
>>> hf_tar_file_info(
...     repo_id='deepghs/danbooru_newest',
...     archive_in_repo='images/0000.tar',
...     file_in_archive='7506000.jpg',
... )
{'offset': 265728, 'size': 435671, 'sha256': 'ef6a4e031fdffb705c8ce2c64e8cb8d993f431a887d7c1c0b1e6fa56e6107fcd'}

Note

If the tar archive and its index file are stored in different repositories, pass the idx_repo_id argument explicitly to get the file information using the index repository.

>>> from hfutils.index import hf_tar_file_info
>>>
>>> hf_tar_file_info(
...     repo_id='nyanko7/danbooru2023',
...     idx_repo_id='deepghs/danbooru2023_index',
...     archive_in_repo='original/data-0000.tar',
...     file_in_archive='1000.png'
... )
{'offset': 1024, 'size': 11966, 'sha256': '478d3313860519372f6a75ede287d4a7c18a2d851bbc79b3dd65caff4c716858'}
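
The returned 'offset', 'size' and 'sha256' values are enough to pull one member out of an archive without unpacking it. The following is a minimal local sketch under the assumption that 'offset' marks the first byte of the member's data. read_member is a hypothetical helper, and the small tar is built in memory purely for illustration.

```python
import hashlib
import io
import tarfile


def read_member(archive, info: dict) -> bytes:
    """Read one member from a seekable archive using its index entry."""
    archive.seek(info['offset'])
    data = archive.read(info['size'])
    if hashlib.sha256(data).hexdigest() != info['sha256']:
        raise ValueError('sha256 mismatch')
    return data


# Build a one-file tar in memory.
payload = b'example payload'
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w') as tar:
    member = tarfile.TarInfo(name='hello.txt')
    member.size = len(payload)
    tar.addfile(member, io.BytesIO(payload))

# Derive an index-style entry for it with the standard library.
with tarfile.open(fileobj=io.BytesIO(buf.getvalue()), mode='r:') as tar:
    m = tar.getmember('hello.txt')
    info = {
        'offset': m.offset_data,  # assumed meaning of 'offset'
        'size': m.size,
        'sha256': hashlib.sha256(payload).hexdigest(),
    }

recovered = read_member(buf, info)
print(recovered == payload)
```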

hf_tar_file_download

hfutils.index.fetch.hf_tar_file_download(repo_id: str, archive_in_repo: str, file_in_archive: str, local_file: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', idx_repo_id: str | None = None, idx_file_in_repo: str | None = None, idx_repo_type: Literal['dataset', 'model', 'space'] | None = None, idx_revision: str | None = None, proxies: Dict | None = None, user_agent: Dict | str | None = None, headers: Dict[str, str] | None = None, endpoint: str | None = None, force_download: bool = False, hf_token: str | None = None)

Download a file from a tar archive file in a Hugging Face repository.

Parameters:
  • repo_id (str) – The identifier of the repository.

  • archive_in_repo (str) – The path to the archive file in the repository.

  • file_in_archive (str) – The path to the file inside the archive.

  • local_file (str) – The path to save the downloaded file locally.

  • repo_type (RepoTypeTyping, optional) – The type of the Hugging Face repository.

  • revision (str, optional) – The revision of the repository.

  • idx_repo_id (str, optional) – The identifier of the index repository.

  • idx_file_in_repo (str, optional) – The path to the index file in the index repository.

  • idx_repo_type (RepoTypeTyping, optional) – The type of the index repository.

  • idx_revision (str, optional) – The revision of the index repository.

  • proxies (Dict, optional) – The proxies to be used for the HTTP request.

  • user_agent (Union[Dict, str, None], optional) – The user agent for the HTTP request.

  • headers (Dict[str, str], optional) – The additional headers for the HTTP request.

  • endpoint (str, optional) – The Hugging Face API endpoint.

  • force_download (bool) – Force the file to be downloaded to the destination path. Defaults to False, in which case the download is skipped if the local file already fully matches the expected file.

  • hf_token (str, optional) – The Hugging Face access token.

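Raises:
  • ArchiveStandaloneFileIncompleteDownload – If the standalone file in the archive is incompletely downloaded.

  • ArchiveStandaloneFileHashNotMatch – If the hash of the standalone file does not match.
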
Examples::
>>> from hfutils.index import hf_tar_file_download
>>>
>>> hf_tar_file_download(
...     repo_id='deepghs/danbooru_newest',
...     archive_in_repo='images/0000.tar',
...     file_in_archive='7506000.jpg',
...     local_file='test_example.jpg'  # download destination
... )

Note

If the tar archive and its index file are stored in different repositories, pass the idx_repo_id argument explicitly to download the file using the index repository.

>>> from hfutils.index import hf_tar_file_download
>>>
>>> hf_tar_file_download(
...     repo_id='nyanko7/danbooru2023',
...     idx_repo_id='deepghs/danbooru2023_index',
...     archive_in_repo='original/data-0000.tar',
...     file_in_archive='1000.png',
...     local_file='test_example.png'  # download destination
... )
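
The force_download description says that the download is skipped when the local file already matches the expected file. The following is a minimal sketch of how such a decision could be made from an index entry, together with the byte range a partial request would ask for. range_header, local_file_matches and should_download are hypothetical helpers. The use of an HTTP Range request and the comparison by size and SHA-256 are both assumptions, not a description of the library's internals.

```python
import hashlib
import os


def range_header(offset: int, size: int) -> dict:
    # HTTP byte ranges are inclusive at both ends; assumes size > 0.
    return {'Range': f'bytes={offset}-{offset + size - 1}'}


def local_file_matches(path: str, size: int, sha256: str) -> bool:
    # Cheap checks first: the file must exist and have the expected length.
    if not os.path.isfile(path) or os.path.getsize(path) != size:
        return False
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            digest.update(chunk)
    return digest.hexdigest() == sha256


def should_download(path: str, size: int, sha256: str,
                    force_download: bool = False) -> bool:
    return force_download or not local_file_matches(path, size, sha256)


# Offset and size taken from the hf_tar_file_info example in this document.
print(range_header(1024, 11966))
```

Checking the size before hashing keeps the common mismatch case cheap, since a truncated file is rejected without reading its contents.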