hfutils.index.fetch
ArchiveStandaloneFileIncompleteDownload
ArchiveStandaloneFileHashNotMatch
hf_tar_get_index
- hfutils.index.fetch.hf_tar_get_index(repo_id: str, archive_in_repo: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', idx_repo_id: str | None = None, idx_file_in_repo: str | None = None, idx_repo_type: Literal['dataset', 'model', 'space'] | None = None, idx_revision: str | None = None, hf_token: str | None = None)[source]
Get the index of a tar archive file in a Hugging Face repository.
- Parameters:
repo_id (str) – The identifier of the repository.
archive_in_repo (str) – The path to the archive file in the repository.
repo_type (RepoTypeTyping, optional) – The type of the Hugging Face repository.
revision (str, optional) – The revision of the repository.
idx_repo_id (str, optional) – The identifier of the index repository.
idx_file_in_repo (str, optional) – The path to the index file in the index repository.
idx_repo_type (RepoTypeTyping, optional) – The type of the index repository.
idx_revision (str, optional) – The revision of the index repository.
hf_token (str, optional) – The Hugging Face access token.
- Returns:
The index of the tar archive file.
- Return type:
Dict
- Examples::
>>> from hfutils.index import hf_tar_get_index
>>>
>>> idx = hf_tar_get_index(
...     repo_id='deepghs/danbooru_newest',
...     archive_in_repo='images/0000.tar',
... )
>>> idx.keys()
dict_keys(['filesize', 'hash', 'hash_lfs', 'files'])
>>> idx['files'].keys()
dict_keys(['7507000.jpg', '7506000.jpg', '7505000.jpg', ...])
Note
Besides, if the tar and index files are in different repositories, you can also use this function to get the index information by explicitly assigning the idx_repo_id argument:

>>> from hfutils.index import hf_tar_get_index
>>>
>>> idx = hf_tar_get_index(
...     repo_id='nyanko7/danbooru2023',
...     idx_repo_id='deepghs/danbooru2023_index',
...     archive_in_repo='original/data-0000.tar',
... )
>>> idx.keys()
dict_keys(['filesize', 'hash', 'hash_lfs', 'files'])
>>> idx['files'].keys()
dict_keys(['./1000.png', './10000.jpg', './100000.jpg', ...])
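The returned dictionary can then be consumed with ordinary Python code. The sketch below aggregates per-file sizes; it assumes that each value in idx['files'] is itself a dictionary carrying a 'size' entry, mirroring the keys shown by hf_tar_file_info further down this page:

from hfutils.index import hf_tar_get_index

idx = hf_tar_get_index(
    repo_id='deepghs/danbooru_newest',
    archive_in_repo='images/0000.tar',
)

# Assumption: each value in idx['files'] is a dict with a 'size' entry,
# matching the fields reported by hf_tar_file_info.
total_size = sum(entry['size'] for entry in idx['files'].values())
print(f"{len(idx['files'])} files, {total_size} bytes in total")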
hf_tar_list_files
- hfutils.index.fetch.hf_tar_list_files(repo_id: str, archive_in_repo: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', idx_repo_id: str | None = None, idx_file_in_repo: str | None = None, idx_repo_type: Literal['dataset', 'model', 'space'] | None = None, idx_revision: str | None = None, hf_token: str | None = None) → List[str] [source]
List files inside a tar archive file in a Hugging Face repository.
- Parameters:
repo_id (str) – The identifier of the repository.
archive_in_repo (str) – The path to the archive file in the repository.
repo_type (RepoTypeTyping, optional) – The type of the Hugging Face repository.
revision (str, optional) – The revision of the repository.
idx_repo_id (str, optional) – The identifier of the index repository.
idx_file_in_repo (str, optional) – The path to the index file in the index repository.
idx_repo_type (RepoTypeTyping, optional) – The type of the index repository.
idx_revision (str, optional) – The revision of the index repository.
hf_token (str, optional) – The Hugging Face access token.
- Returns:
The list of files inside the tar archive.
- Return type:
List[str]
- Examples::
>>> from hfutils.index import hf_tar_list_files
>>>
>>> hf_tar_list_files(
...     repo_id='deepghs/danbooru_newest',
...     archive_in_repo='images/0000.tar',
... )
['7507000.jpg', '7506000.jpg', '7505000.jpg', ...]
Note
Besides, if the tar and index files are in different repositories, you can also use this function to list all the files by explicitly assigning the idx_repo_id argument:

>>> from hfutils.index import hf_tar_list_files
>>>
>>> hf_tar_list_files(
...     repo_id='nyanko7/danbooru2023',
...     idx_repo_id='deepghs/danbooru2023_index',
...     archive_in_repo='original/data-0000.tar',
... )
['./1000.png', './10000.jpg', './100000.jpg', ...]
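Because the function simply returns a list of path strings, it composes naturally with standard Python filtering. A minimal sketch, reusing the repository and archive names from the examples above:

from hfutils.index import hf_tar_list_files

files = hf_tar_list_files(
    repo_id='deepghs/danbooru_newest',
    archive_in_repo='images/0000.tar',
)

# Keep only JPEG entries from the listing.
jpegs = [name for name in files if name.lower().endswith('.jpg')]
print(f'{len(jpegs)} of {len(files)} files are JPEGs')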
hf_tar_file_exists
- hfutils.index.fetch.hf_tar_file_exists(repo_id: str, archive_in_repo: str, file_in_archive: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', idx_repo_id: str | None = None, idx_file_in_repo: str | None = None, idx_repo_type: Literal['dataset', 'model', 'space'] | None = None, idx_revision: str | None = None, hf_token: str | None = None)[source]
Check if a file exists inside a tar archive file in a Hugging Face repository.
- Parameters:
repo_id (str) – The identifier of the repository.
archive_in_repo (str) – The path to the archive file in the repository.
file_in_archive (str) – The path to the file inside the archive.
repo_type (RepoTypeTyping, optional) – The type of the Hugging Face repository.
revision (str, optional) – The revision of the repository.
idx_repo_id (str, optional) – The identifier of the index repository.
idx_file_in_repo (str, optional) – The path to the index file in the index repository.
idx_repo_type (RepoTypeTyping, optional) – The type of the index repository.
idx_revision (str, optional) – The revision of the index repository.
hf_token (str, optional) – The Hugging Face access token.
- Returns:
True if the file exists, False otherwise.
- Return type:
bool
- Examples::
>>> from hfutils.index import hf_tar_file_exists
>>>
>>> hf_tar_file_exists(
...     repo_id='deepghs/danbooru_newest',
...     archive_in_repo='images/0000.tar',
...     file_in_archive='7506000.jpg',
... )
True
>>> hf_tar_file_exists(
...     repo_id='deepghs/danbooru_newest',
...     archive_in_repo='images/0000.tar',
...     file_in_archive='17506000.jpg',
... )
False
Note
Besides, if the tar and index files are in different repositories, you can also use this function to check the file existence by explicitly assigning the idx_repo_id argument:

>>> from hfutils.index import hf_tar_file_exists
>>>
>>> hf_tar_file_exists(
...     repo_id='nyanko7/danbooru2023',
...     idx_repo_id='deepghs/danbooru2023_index',
...     archive_in_repo='original/data-0000.tar',
...     file_in_archive='1000.png'
... )
True
>>> hf_tar_file_exists(
...     repo_id='nyanko7/danbooru2023',
...     idx_repo_id='deepghs/danbooru2023_index',
...     archive_in_repo='original/data-0000.tar',
...     file_in_archive='10000000001000.png'
... )
False
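A common pattern is to use the existence check as a guard before attempting a download. A minimal sketch combining it with hf_tar_file_download (documented further below), using the same repository and archive as the examples above:

from hfutils.index import hf_tar_file_exists, hf_tar_file_download

if hf_tar_file_exists(
        repo_id='deepghs/danbooru_newest',
        archive_in_repo='images/0000.tar',
        file_in_archive='7506000.jpg',
):
    # Only download when the archive's index actually lists the file.
    hf_tar_file_download(
        repo_id='deepghs/danbooru_newest',
        archive_in_repo='images/0000.tar',
        file_in_archive='7506000.jpg',
        local_file='7506000.jpg',
    )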
hf_tar_file_size
- hfutils.index.fetch.hf_tar_file_size(repo_id: str, archive_in_repo: str, file_in_archive: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', idx_repo_id: str | None = None, idx_file_in_repo: str | None = None, idx_repo_type: Literal['dataset', 'model', 'space'] | None = None, idx_revision: str | None = None, hf_token: str | None = None) → int [source]
Get the size of a file inside a tar archive, as recorded in the archive's index.
- Parameters:
repo_id (str) – The identifier of the repository.
archive_in_repo (str) – The path to the archive file in the repository.
file_in_archive (str) – The path to the file inside the archive.
repo_type (RepoTypeTyping, optional) – The type of the Hugging Face repository.
revision (str, optional) – The revision of the repository.
idx_repo_id (str, optional) – The identifier of the index repository.
idx_file_in_repo (str, optional) – The path to the index file in the index repository.
idx_repo_type (RepoTypeTyping, optional) – The type of the index repository.
idx_revision (str, optional) – The revision of the index repository.
hf_token (str, optional) – The Hugging Face access token.
- Returns:
The size of the file in bytes.
- Return type:
int
- Raises:
FileNotFoundError – Raised when the file does not exist in the tar archive.
- Examples::
>>> from hfutils.index import hf_tar_file_size
>>>
>>> hf_tar_file_size(
...     repo_id='deepghs/danbooru_newest',
...     archive_in_repo='images/0000.tar',
...     file_in_archive='7506000.jpg',
... )
435671
Note
Besides, if the tar and index files are in different repositories, you can also use this function to get the file size by explicitly assigning the idx_repo_id argument:

>>> from hfutils.index import hf_tar_file_size
>>>
>>> hf_tar_file_size(
...     repo_id='nyanko7/danbooru2023',
...     idx_repo_id='deepghs/danbooru2023_index',
...     archive_in_repo='original/data-0000.tar',
...     file_in_archive='1000.png'
... )
11966
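Since the size comes from the index alone, it can be queried without touching the archive itself. One possible use is skipping files that exceed a size budget before downloading; the threshold below is chosen purely for illustration:

from hfutils.index import hf_tar_file_size, hf_tar_file_download

MAX_BYTES = 1 << 20  # 1 MiB budget, illustrative only

size = hf_tar_file_size(
    repo_id='deepghs/danbooru_newest',
    archive_in_repo='images/0000.tar',
    file_in_archive='7506000.jpg',
)
if size <= MAX_BYTES:
    hf_tar_file_download(
        repo_id='deepghs/danbooru_newest',
        archive_in_repo='images/0000.tar',
        file_in_archive='7506000.jpg',
        local_file='7506000.jpg',
    )
else:
    print(f'Skipping download: {size} bytes exceeds the budget')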
hf_tar_file_info
- hfutils.index.fetch.hf_tar_file_info(repo_id: str, archive_in_repo: str, file_in_archive: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', idx_repo_id: str | None = None, idx_file_in_repo: str | None = None, idx_repo_type: Literal['dataset', 'model', 'space'] | None = None, idx_revision: str | None = None, hf_token: str | None = None) → dict [source]
Get detailed information about a file inside a tar archive from the archive's index, including its offset, sha256 hash, and size.
- Parameters:
repo_id (str) – The identifier of the repository.
archive_in_repo (str) – The path to the archive file in the repository.
file_in_archive (str) – The path to the file inside the archive.
repo_type (RepoTypeTyping, optional) – The type of the Hugging Face repository.
revision (str, optional) – The revision of the repository.
idx_repo_id (str, optional) – The identifier of the index repository.
idx_file_in_repo (str, optional) – The path to the index file in the index repository.
idx_repo_type (RepoTypeTyping, optional) – The type of the index repository.
idx_revision (str, optional) – The revision of the index repository.
hf_token (str, optional) – The Hugging Face access token.
- Returns:
A dictionary containing the file's metadata (offset, size, and sha256).
- Return type:
dict
- Raises:
FileNotFoundError – Raised when the file does not exist in the tar archive.
- Examples::
>>> from hfutils.index import hf_tar_file_info
>>>
>>> hf_tar_file_info(
...     repo_id='deepghs/danbooru_newest',
...     archive_in_repo='images/0000.tar',
...     file_in_archive='7506000.jpg',
... )
{'offset': 265728, 'size': 435671, 'sha256': 'ef6a4e031fdffb705c8ce2c64e8cb8d993f431a887d7c1c0b1e6fa56e6107fcd'}
Note
Besides, if the tar and index files are in different repositories, you can also use this function to get the file information by explicitly assigning the idx_repo_id argument:

>>> from hfutils.index import hf_tar_file_info
>>>
>>> hf_tar_file_info(
...     repo_id='nyanko7/danbooru2023',
...     idx_repo_id='deepghs/danbooru2023_index',
...     archive_in_repo='original/data-0000.tar',
...     file_in_archive='1000.png'
... )
{'offset': 1024, 'size': 11966, 'sha256': '478d3313860519372f6a75ede287d4a7c18a2d851bbc79b3dd65caff4c716858'}
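Because the returned dictionary includes the file's sha256, one possible use is verifying a previously downloaded copy against the index. A minimal sketch; the local path '7506000.jpg' is a hypothetical copy of the archived file:

import hashlib

from hfutils.index import hf_tar_file_info

info = hf_tar_file_info(
    repo_id='deepghs/danbooru_newest',
    archive_in_repo='images/0000.tar',
    file_in_archive='7506000.jpg',
)

# '7506000.jpg' is assumed to be a local copy of the archived file.
with open('7506000.jpg', 'rb') as f:
    local_sha256 = hashlib.sha256(f.read()).hexdigest()

print('match' if local_sha256 == info['sha256'] else 'mismatch')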
hf_tar_file_download
- hfutils.index.fetch.hf_tar_file_download(repo_id: str, archive_in_repo: str, file_in_archive: str, local_file: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', idx_repo_id: str | None = None, idx_file_in_repo: str | None = None, idx_repo_type: Literal['dataset', 'model', 'space'] | None = None, idx_revision: str | None = None, proxies: Dict | None = None, user_agent: Dict | str | None = None, headers: Dict[str, str] | None = None, endpoint: str | None = None, force_download: bool = False, silent: bool = False, hf_token: str | None = None)[source]
Download a specific file from a tar archive stored in a Hugging Face repository.
This function allows you to extract and download a single file from a tar archive that is hosted in a Hugging Face repository. It handles authentication, supports different repository types, and can work with separate index repositories.
- Parameters:
repo_id (str) – The identifier of the repository containing the tar archive.
archive_in_repo (str) – The path to the tar archive file within the repository.
file_in_archive (str) – The path to the desired file inside the tar archive.
local_file (str) – The local path where the downloaded file will be saved.
repo_type (RepoTypeTyping, optional) – The type of the Hugging Face repository (e.g., ‘dataset’, ‘model’, ‘space’).
revision (str, optional) – The specific revision of the repository to use.
idx_repo_id (str, optional) – The identifier of a separate index repository, if applicable.
idx_file_in_repo (str, optional) – The path to the index file in the index repository.
idx_repo_type (RepoTypeTyping, optional) – The type of the index repository.
idx_revision (str, optional) – The revision of the index repository.
proxies (Dict, optional) – Proxy settings for the HTTP request.
user_agent (Union[Dict, str, None], optional) – Custom user agent for the HTTP request.
headers (Dict[str, str], optional) – Additional headers for the HTTP request.
endpoint (str, optional) – Custom Hugging Face API endpoint.
force_download (bool) – If True, force re-download even if the file exists locally.
silent (bool) – If True, suppress progress bar output.
hf_token (str, optional) – Hugging Face authentication token.
- Raises:
FileNotFoundError – If the specified file is not found in the tar archive.
ArchiveStandaloneFileIncompleteDownload – If the download is incomplete.
ArchiveStandaloneFileHashNotMatch – If the downloaded file’s hash doesn’t match the expected hash.
This function performs several steps:
1. Retrieves the index of the tar archive.
2. Checks if the desired file exists in the archive.
3. Constructs the download URL and headers.
4. Checks if the file already exists locally and matches the expected size and hash.
5. Downloads the file if necessary, using byte range requests for efficiency.
6. Verifies the downloaded file's size and hash.
- Usage examples:
- Basic usage:
>>> hf_tar_file_download(
...     repo_id='deepghs/danbooru_newest',
...     archive_in_repo='images/0000.tar',
...     file_in_archive='7506000.jpg',
...     local_file='test_example.jpg'  # download destination
... )
- Using a separate index repository:
>>> hf_tar_file_download(
...     repo_id='nyanko7/danbooru2023',
...     idx_repo_id='deepghs/danbooru2023_index',
...     archive_in_repo='original/data-0000.tar',
...     file_in_archive='1000.png',
...     local_file='test_example.png'  # download destination
... )
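Given the exceptions documented above, a defensive caller might wrap the download as in the following sketch. The import path for the two archive-specific exceptions is assumed to be hfutils.index.fetch, matching the classes listed at the top of this page; adjust it if they are re-exported elsewhere:

from hfutils.index import hf_tar_file_download
# Assumption: the exception classes live in the module documented on this page.
from hfutils.index.fetch import (
    ArchiveStandaloneFileIncompleteDownload,
    ArchiveStandaloneFileHashNotMatch,
)

try:
    hf_tar_file_download(
        repo_id='deepghs/danbooru_newest',
        archive_in_repo='images/0000.tar',
        file_in_archive='7506000.jpg',
        local_file='test_example.jpg',
    )
except FileNotFoundError:
    print('The requested file is not listed in the tar index.')
except (ArchiveStandaloneFileIncompleteDownload, ArchiveStandaloneFileHashNotMatch):
    # Truncated or corrupted downloads can be retried, possibly with force_download=True.
    print('Download was incomplete or failed verification; consider retrying.')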
Note
This function is particularly useful for efficiently downloading single files from large tar archives without having to download the entire archive.
It supports authentication via the hf_token parameter, which is crucial for accessing private repositories.
The function includes checks to avoid unnecessary downloads and to ensure the integrity of the downloaded file.