hfutils.index.fetch
ArchiveStandaloneFileIncompleteDownload
ArchiveStandaloneFileHashNotMatch
hf_tar_get_index
- hfutils.index.fetch.hf_tar_get_index(repo_id: str, archive_in_repo: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', idx_repo_id: str | None = None, idx_file_in_repo: str | None = None, idx_repo_type: Literal['dataset', 'model', 'space'] | None = None, idx_revision: str | None = None, hf_token: str | None = None)
Get the index of a tar archive file in a Hugging Face repository.
- Parameters:
repo_id (str) – The identifier of the repository.
archive_in_repo (str) – The path to the archive file in the repository.
repo_type (RepoTypeTyping, optional) – The type of the Hugging Face repository.
revision (str, optional) – The revision of the repository.
idx_repo_id (str, optional) – The identifier of the index repository.
idx_file_in_repo (str, optional) – The path to the index file in the index repository.
idx_repo_type (RepoTypeTyping, optional) – The type of the index repository.
idx_revision (str, optional) – The revision of the index repository.
hf_token (str, optional) – The Hugging Face access token.
- Returns:
The index of the tar archive file.
- Return type:
Dict
- Examples::
>>> from hfutils.index import hf_tar_get_index
>>>
>>> idx = hf_tar_get_index(
...     repo_id='deepghs/danbooru_newest',
...     archive_in_repo='images/0000.tar',
... )
>>> idx.keys()
dict_keys(['filesize', 'hash', 'hash_lfs', 'files'])
>>> idx['files'].keys()
dict_keys(['7507000.jpg', '7506000.jpg', '7505000.jpg', ...])
Note
If the tar and index files are in different repositories, you can still use this function to get the index information by explicitly passing the idx_repo_id argument.
>>> from hfutils.index import hf_tar_get_index
>>>
>>> idx = hf_tar_get_index(
...     repo_id='nyanko7/danbooru2023',
...     idx_repo_id='deepghs/danbooru2023_index',
...     archive_in_repo='original/data-0000.tar',
... )
>>> idx.keys()
dict_keys(['filesize', 'hash', 'hash_lfs', 'files'])
>>> idx['files'].keys()
dict_keys(['./1000.png', './10000.jpg', './100000.jpg', ...])
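To make the shape of the returned index more concrete, the following is a minimal sketch that builds a similar mapping for a small in-memory tar using only the standard library. It is not the code hfutils uses to produce index files. The helper names `build_tar_index` and `make_demo_tar` are hypothetical, the `hash` and `hash_lfs` keys are left out because this page does not describe how they are computed, and treating `offset` as the start of the member's data (rather than its header) is an assumption.

```python
import hashlib
import io
import tarfile


def build_tar_index(tar_bytes: bytes) -> dict:
    """Build a {'filesize', 'files'} mapping for an in-memory tar (illustrative only)."""
    files = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode='r:') as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, etc.
            start = member.offset_data  # assumed: offset of the file data, not the header
            data = tar_bytes[start:start + member.size]
            files[member.name] = {
                'offset': start,
                'size': member.size,
                'sha256': hashlib.sha256(data).hexdigest(),
            }
    return {'filesize': len(tar_bytes), 'files': files}


def make_demo_tar() -> bytes:
    """Create a tiny uncompressed tar with two members, entirely in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode='w') as tar:
        for name, payload in [('a.txt', b'hello'), ('b.txt', b'world!')]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


demo_tar = make_demo_tar()
demo_index = build_tar_index(demo_tar)
```

With such a mapping, slicing `demo_tar[offset:offset + size]` recovers a member's bytes without scanning the whole archive, which is the property that makes a per-file index useful.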
hf_tar_list_files
- hfutils.index.fetch.hf_tar_list_files(repo_id: str, archive_in_repo: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', idx_repo_id: str | None = None, idx_file_in_repo: str | None = None, idx_repo_type: Literal['dataset', 'model', 'space'] | None = None, idx_revision: str | None = None, hf_token: str | None = None) → List[str]
List files inside a tar archive file in a Hugging Face repository.
- Parameters:
repo_id (str) – The identifier of the repository.
archive_in_repo (str) – The path to the archive file in the repository.
repo_type (RepoTypeTyping, optional) – The type of the Hugging Face repository.
revision (str, optional) – The revision of the repository.
idx_repo_id (str, optional) – The identifier of the index repository.
idx_file_in_repo (str, optional) – The path to the index file in the index repository.
idx_repo_type (RepoTypeTyping, optional) – The type of the index repository.
idx_revision (str, optional) – The revision of the index repository.
hf_token (str, optional) – The Hugging Face access token.
- Returns:
The list of files inside the tar archive.
- Return type:
List[str]
- Examples::
>>> from hfutils.index import hf_tar_list_files
>>>
>>> hf_tar_list_files(
...     repo_id='deepghs/danbooru_newest',
...     archive_in_repo='images/0000.tar',
... )
['7507000.jpg', '7506000.jpg', '7505000.jpg', ...]
Note
If the tar and index files are in different repositories, you can still use this function to list all the files by explicitly passing the idx_repo_id argument.
>>> from hfutils.index import hf_tar_list_files
>>>
>>> hf_tar_list_files(
...     repo_id='nyanko7/danbooru2023',
...     idx_repo_id='deepghs/danbooru2023_index',
...     archive_in_repo='original/data-0000.tar',
... )
['./1000.png', './10000.jpg', './100000.jpg', ...]
hf_tar_file_exists
- hfutils.index.fetch.hf_tar_file_exists(repo_id: str, archive_in_repo: str, file_in_archive: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', idx_repo_id: str | None = None, idx_file_in_repo: str | None = None, idx_repo_type: Literal['dataset', 'model', 'space'] | None = None, idx_revision: str | None = None, hf_token: str | None = None)
Check if a file exists inside a tar archive file in a Hugging Face repository.
- Parameters:
repo_id (str) – The identifier of the repository.
archive_in_repo (str) – The path to the archive file in the repository.
file_in_archive (str) – The path to the file inside the archive.
repo_type (RepoTypeTyping, optional) – The type of the Hugging Face repository.
revision (str, optional) – The revision of the repository.
idx_repo_id (str, optional) – The identifier of the index repository.
idx_file_in_repo (str, optional) – The path to the index file in the index repository.
idx_repo_type (RepoTypeTyping, optional) – The type of the index repository.
idx_revision (str, optional) – The revision of the index repository.
hf_token (str, optional) – The Hugging Face access token.
- Returns:
True if the file exists, False otherwise.
- Return type:
bool
- Examples::
>>> from hfutils.index import hf_tar_file_exists
>>>
>>> hf_tar_file_exists(
...     repo_id='deepghs/danbooru_newest',
...     archive_in_repo='images/0000.tar',
...     file_in_archive='7506000.jpg',
... )
True
>>> hf_tar_file_exists(
...     repo_id='deepghs/danbooru_newest',
...     archive_in_repo='images/0000.tar',
...     file_in_archive='17506000.jpg',
... )
False
Note
If the tar and index files are in different repositories, you can still use this function to check whether a file exists by explicitly passing the idx_repo_id argument.
>>> from hfutils.index import hf_tar_file_exists
>>>
>>> hf_tar_file_exists(
...     repo_id='nyanko7/danbooru2023',
...     idx_repo_id='deepghs/danbooru2023_index',
...     archive_in_repo='original/data-0000.tar',
...     file_in_archive='1000.png'
... )
True
>>> hf_tar_file_exists(
...     repo_id='nyanko7/danbooru2023',
...     idx_repo_id='deepghs/danbooru2023_index',
...     archive_in_repo='original/data-0000.tar',
...     file_in_archive='10000000001000.png'
... )
False
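In the examples on this page, the index of the second repository lists names such as './1000.png', while the lookup with file_in_archive='1000.png' still returns True, so some form of path normalization appears to take place. How hfutils normalizes paths is not stated here. The sketch below shows one plausible approach based on `posixpath.normpath`; both helper names are hypothetical.

```python
import posixpath


def normalize_member_path(path: str) -> str:
    """Collapse a leading './' and redundant separators (assumed normalization)."""
    return posixpath.normpath(path)


def file_exists_in_index(index: dict, file_in_archive: str) -> bool:
    """Check membership after normalizing both the query and the index keys."""
    wanted = normalize_member_path(file_in_archive)
    return any(normalize_member_path(name) == wanted for name in index['files'])


# Hand-written stand-in for an index; the key and entry come from the examples on this page.
sample_index = {
    'files': {
        './1000.png': {
            'offset': 1024,
            'size': 11966,
            'sha256': '478d3313860519372f6a75ede287d4a7c18a2d851bbc79b3dd65caff4c716858',
        },
    },
}

found = file_exists_in_index(sample_index, '1000.png')
missing = file_exists_in_index(sample_index, '10000000001000.png')
```

The linear scan keeps the sketch short; for a large index, normalizing the keys once into a dictionary would avoid repeating that work on every lookup.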
hf_tar_file_size
- hfutils.index.fetch.hf_tar_file_size(repo_id: str, archive_in_repo: str, file_in_archive: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', idx_repo_id: str | None = None, idx_file_in_repo: str | None = None, idx_repo_type: Literal['dataset', 'model', 'space'] | None = None, idx_revision: str | None = None, hf_token: str | None = None) → int
Get the size of a file inside an indexed tar archive.
- Parameters:
repo_id (str) – The identifier of the repository.
archive_in_repo (str) – The path to the archive file in the repository.
file_in_archive (str) – The path to the file inside the archive.
repo_type (RepoTypeTyping, optional) – The type of the Hugging Face repository.
revision (str, optional) – The revision of the repository.
idx_repo_id (str, optional) – The identifier of the index repository.
idx_file_in_repo (str, optional) – The path to the index file in the index repository.
idx_repo_type (RepoTypeTyping, optional) – The type of the index repository.
idx_revision (str, optional) – The revision of the index repository.
hf_token (str, optional) – The Hugging Face access token.
- Returns:
The size of the file in bytes.
- Return type:
int
- Raises:
FileNotFoundError – Raised when the file does not exist in the tar archive.
- Examples::
>>> from hfutils.index import hf_tar_file_size
>>>
>>> hf_tar_file_size(
...     repo_id='deepghs/danbooru_newest',
...     archive_in_repo='images/0000.tar',
...     file_in_archive='7506000.jpg',
... )
435671
Note
If the tar and index files are in different repositories, you can still use this function to get the file size by explicitly passing the idx_repo_id argument.
>>> from hfutils.index import hf_tar_file_size
>>>
>>> hf_tar_file_size(
...     repo_id='nyanko7/danbooru2023',
...     idx_repo_id='deepghs/danbooru2023_index',
...     archive_in_repo='original/data-0000.tar',
...     file_in_archive='1000.png'
... )
11966
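Because a missing file raises FileNotFoundError rather than returning a sentinel value, callers that probe for optional files need to catch it. The sketch below illustrates that contract against a hand-written index dictionary instead of a live repository. `file_size_from_index` and `size_or_default` are hypothetical stand-ins, not part of hfutils; the sample entry reuses the values shown in the examples on this page.

```python
def file_size_from_index(index: dict, file_in_archive: str) -> int:
    """Return the recorded size, raising FileNotFoundError for unknown members."""
    try:
        return index['files'][file_in_archive]['size']
    except KeyError:
        raise FileNotFoundError(
            f'File {file_in_archive!r} not found in the archive index.'
        ) from None


def size_or_default(index: dict, file_in_archive: str, default: int = -1) -> int:
    """Caller-side pattern: turn the exception into a default value."""
    try:
        return file_size_from_index(index, file_in_archive)
    except FileNotFoundError:
        return default


sample_index = {
    'files': {
        '7506000.jpg': {
            'offset': 265728,
            'size': 435671,
            'sha256': 'ef6a4e031fdffb705c8ce2c64e8cb8d993f431a887d7c1c0b1e6fa56e6107fcd',
        },
    },
}

known_size = size_or_default(sample_index, '7506000.jpg')
unknown_size = size_or_default(sample_index, '17506000.jpg')
```

The same try/except pattern should apply when calling hf_tar_file_size itself, since its documented failure mode is the same exception.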
hf_tar_file_info
- hfutils.index.fetch.hf_tar_file_info(repo_id: str, archive_in_repo: str, file_in_archive: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', idx_repo_id: str | None = None, idx_file_in_repo: str | None = None, idx_repo_type: Literal['dataset', 'model', 'space'] | None = None, idx_revision: str | None = None, hf_token: str | None = None) → dict
Get detailed information about a file inside an indexed tar archive, including its offset, size and SHA-256 hash.
- Parameters:
repo_id (str) – The identifier of the repository.
archive_in_repo (str) – The path to the archive file in the repository.
file_in_archive (str) – The path to the file inside the archive.
repo_type (RepoTypeTyping, optional) – The type of the Hugging Face repository.
revision (str, optional) – The revision of the repository.
idx_repo_id (str, optional) – The identifier of the index repository.
idx_file_in_repo (str, optional) – The path to the index file in the index repository.
idx_repo_type (RepoTypeTyping, optional) – The type of the index repository.
idx_revision (str, optional) – The revision of the index repository.
hf_token (str, optional) – The Hugging Face access token.
- Returns:
A dictionary containing the meta information of the file.
- Return type:
dict
- Raises:
FileNotFoundError – Raised when the file does not exist in the tar archive.
- Examples::
>>> from hfutils.index import hf_tar_file_info
>>>
>>> hf_tar_file_info(
...     repo_id='deepghs/danbooru_newest',
...     archive_in_repo='images/0000.tar',
...     file_in_archive='7506000.jpg',
... )
{'offset': 265728, 'size': 435671, 'sha256': 'ef6a4e031fdffb705c8ce2c64e8cb8d993f431a887d7c1c0b1e6fa56e6107fcd'}
Note
If the tar and index files are in different repositories, you can still use this function to get the file information by explicitly passing the idx_repo_id argument.
>>> from hfutils.index import hf_tar_file_info
>>>
>>> hf_tar_file_info(
...     repo_id='nyanko7/danbooru2023',
...     idx_repo_id='deepghs/danbooru2023_index',
...     archive_in_repo='original/data-0000.tar',
...     file_in_archive='1000.png'
... )
{'offset': 1024, 'size': 11966, 'sha256': '478d3313860519372f6a75ede287d4a7c18a2d851bbc79b3dd65caff4c716858'}
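An offset and size pair is exactly what is needed to address one member of a remote archive with an HTTP byte-range request. Whether hf_tar_file_download issues such requests internally is not stated on this page, so the following is only a sketch of the arithmetic, assuming `offset` points at the first byte of the file data. The helper name `http_range_header` is hypothetical.

```python
def http_range_header(info: dict) -> dict:
    """Translate an {'offset', 'size'} entry into an HTTP Range header.

    HTTP byte ranges are inclusive on both ends, hence the '- 1'.
    """
    if info['size'] <= 0:
        raise ValueError('An empty file has no byte range to request.')
    first = info['offset']
    last = info['offset'] + info['size'] - 1
    return {'Range': f'bytes={first}-{last}'}


# Entry copied from the hf_tar_file_info example on this page.
sample_info = {
    'offset': 265728,
    'size': 435671,
    'sha256': 'ef6a4e031fdffb705c8ce2c64e8cb8d993f431a887d7c1c0b1e6fa56e6107fcd',
}

range_header = http_range_header(sample_info)
# range_header == {'Range': 'bytes=265728-701398'}
```

The resulting dictionary could be merged into the headers of any HTTP client request for the archive URL; the response body would then contain only the requested member.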
hf_tar_file_download
- hfutils.index.fetch.hf_tar_file_download(repo_id: str, archive_in_repo: str, file_in_archive: str, local_file: str, repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', idx_repo_id: str | None = None, idx_file_in_repo: str | None = None, idx_repo_type: Literal['dataset', 'model', 'space'] | None = None, idx_revision: str | None = None, proxies: Dict | None = None, user_agent: Dict | str | None = None, headers: Dict[str, str] | None = None, endpoint: str | None = None, force_download: bool = False, hf_token: str | None = None)
Download a file from a tar archive file in a Hugging Face repository.
- Parameters:
repo_id (str) – The identifier of the repository.
archive_in_repo (str) – The path to the archive file in the repository.
file_in_archive (str) – The path to the file inside the archive.
local_file (str) – The path to save the downloaded file locally.
repo_type (RepoTypeTyping, optional) – The type of the Hugging Face repository.
revision (str, optional) – The revision of the repository.
idx_repo_id (str, optional) – The identifier of the index repository.
idx_file_in_repo (str, optional) – The path to the index file in the index repository.
idx_repo_type (RepoTypeTyping, optional) – The type of the index repository.
idx_revision (str, optional) – The revision of the index repository.
proxies (Dict, optional) – The proxies to be used for the HTTP request.
user_agent (Union[Dict, str, None], optional) – The user agent for the HTTP request.
headers (Dict[str, str], optional) – The additional headers for the HTTP request.
endpoint (str, optional) – The Hugging Face API endpoint.
force_download (bool) – Force the download to the destination path. Defaults to False, in which case the download is skipped if the local file fully matches the expected file.
hf_token (str, optional) – The Hugging Face access token.
- Raises:
FileNotFoundError – Raised when the file does not exist in the tar archive.
ArchiveStandaloneFileIncompleteDownload – Raised when the download is incomplete.
ArchiveStandaloneFileHashNotMatch – Raised when the hash of the downloaded file does not match the expected hash.
- Examples::
>>> from hfutils.index import hf_tar_file_download
>>>
>>> hf_tar_file_download(
...     repo_id='deepghs/danbooru_newest',
...     archive_in_repo='images/0000.tar',
...     file_in_archive='7506000.jpg',
...     local_file='test_example.jpg'  # download destination
... )
Note
If the tar and index files are in different repositories, you can still use this function to download the given file by explicitly passing the idx_repo_id argument.
>>> from hfutils.index import hf_tar_file_download
>>>
>>> hf_tar_file_download(
...     repo_id='nyanko7/danbooru2023',
...     idx_repo_id='deepghs/danbooru2023_index',
...     archive_in_repo='original/data-0000.tar',
...     file_in_archive='1000.png',
...     local_file='test_example.png'  # download destination
... )
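The force_download description and the two download exceptions both rest on comparing a local file with the size and hash recorded in the index. The sketch below shows what such a comparison could look like with the standard library. It is an illustration under that assumption, not the check hfutils performs, and `local_file_matches` is a hypothetical helper.

```python
import hashlib
import os
import tempfile


def local_file_matches(path: str, expected_size: int, expected_sha256: str) -> bool:
    """Return True if the file exists and matches the expected size and SHA-256."""
    if not os.path.isfile(path):
        return False
    if os.path.getsize(path) != expected_size:
        return False  # cheap check first: a size mismatch suggests an incomplete file
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        # Hash in 1 MiB chunks so large files are never loaded into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b''):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256


# Demonstration with a temporary file standing in for a downloaded member.
payload = b'hello'
expected_hash = hashlib.sha256(payload).hexdigest()
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'test_example.bin')
    with open(demo_path, 'wb') as f:
        f.write(payload)
    matches = local_file_matches(demo_path, len(payload), expected_hash)
    wrong_size = local_file_matches(demo_path, len(payload) + 1, expected_hash)
    wrong_hash = local_file_matches(demo_path, len(payload), '0' * 64)
```

In this reading, a size mismatch would correspond to an incomplete download and a hash mismatch to corrupted content, which lines up with the two exception names listed above.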