hfutils.repository.size

This module provides functionality for analyzing and managing Hugging Face repository files.

It includes classes and functions for representing repository files, creating file lists, and analyzing repository contents. The module is designed to work with the Hugging Face Hub API and provides utilities for handling file paths, sizes, and other metadata.

Key components:
  • RepoFileItem: Represents a single file in a repository.

  • RepoFileList: A sequence of RepoFileItems with additional metadata and utility methods.

  • hf_hub_repo_analysis: Function to analyze repository contents based on given criteria.

This module is particularly useful for developers who work with Hugging Face repositories and need to analyze or manage file structures and metadata.

RepoFileItem

class hfutils.repository.size.RepoFileItem(path: str, size: int, is_lfs: bool, lfs_sha256: str | None, blob_id: str)[source]

Represents a file item in a Hugging Face repository.

This class encapsulates metadata about a single file, including its path, size, LFS status, and blob ID. It provides methods for creating instances from RepoFile objects and accessing file information.

Parameters:
  • path (str) – The file path relative to the repository root.

  • size (int) – The size of the file in bytes.

  • is_lfs (bool) – Whether the file is stored using Git LFS.

  • lfs_sha256 (Optional[str]) – The SHA256 hash of the LFS file, if applicable.

  • blob_id (str) – The Git blob ID of the file.

Example::
>>> from huggingface_hub.hf_api import RepoFile
>>> repo_file = RepoFile(path="model.bin", size=1024, blob_id="abc123")
>>> item = RepoFileItem.from_repo_file(repo_file)
>>> print(item.path)
model.bin
__repr__()[source]

Return a string representation of the RepoFileItem.

The representation includes the file path, size in human-readable format, and LFS status if applicable.

Returns:

A formatted string representation.

Return type:

str

Example::
>>> item = RepoFileItem("model.bin", 1048576, True, "sha256hash", "blob123")
>>> print(repr(item))
<RepoFileItem model.bin, size: 1.05 MB (LFS)>
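
The human-readable size in the representation can be approximated with a small helper. The sketch below is an assumption, not the library's formatter: `format_size` is a hypothetical name, and it uses decimal units (1 KB = 1000 bytes), which is consistent with 1048576 bytes being shown as 1.05 MB in the example above.

```python
def format_size(num_bytes: int) -> str:
    """Format a byte count with decimal units (hypothetical helper)."""
    value = float(num_bytes)
    unit = "B"
    # Step up one unit at a time while the value still has four or more digits.
    for next_unit in ("KB", "MB", "GB", "TB"):
        if value < 1000:
            break
        value /= 1000
        unit = next_unit
    # Plain byte counts are printed as integers, larger units with two decimals.
    return f"{int(value)} B" if unit == "B" else f"{value:.2f} {unit}"


print(format_size(1048576))  # 1.05 MB
print(format_size(100))      # 100 B
```

Binary units (1 KiB = 1024 bytes) would give 1.00 MiB for the same input, so the unit convention is worth checking before comparing sizes across tools.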
classmethod from_repo_file(repo_file: RepoFile, subdir: str = '') RepoFileItem[source]

Create a RepoFileItem from a RepoFile object.

This method converts a RepoFile object from the Hugging Face Hub API into a RepoFileItem instance, handling LFS metadata and path normalization.

Parameters:
  • repo_file (RepoFile) – The RepoFile object to convert.

  • subdir (str) – The subdirectory to use as the base path; the resulting path is expressed relative to it (default: ‘’).

Returns:

A new RepoFileItem instance.

Return type:

RepoFileItem

Example::
>>> repo_file = RepoFile(path="data/file.txt", size=512, blob_id="def456")
>>> item = RepoFileItem.from_repo_file(repo_file, subdir="data")
>>> print(item.path)
file.txt
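
The path normalization in this example can be illustrated with the standard library alone. This is a sketch under assumptions, not the library's implementation: `relative_to_subdir` is a hypothetical helper, and raising `ValueError` for paths outside the subdirectory is simply what `PurePosixPath.relative_to` does, not documented behavior of `from_repo_file`.

```python
from pathlib import PurePosixPath


def relative_to_subdir(path: str, subdir: str = "") -> str:
    """Return ``path`` relative to ``subdir`` (hypothetical helper)."""
    if not subdir:
        # No base directory: keep the repository-relative path unchanged.
        return PurePosixPath(path).as_posix()
    # relative_to raises ValueError when path is not located inside subdir.
    return PurePosixPath(path).relative_to(PurePosixPath(subdir)).as_posix()


print(relative_to_subdir("data/file.txt", subdir="data"))  # file.txt
print(relative_to_subdir("data/file.txt"))                 # data/file.txt
```

Repository paths always use forward slashes, which is why the sketch uses `PurePosixPath` rather than the platform-dependent `Path`.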
property path_segments: Tuple[str, ...]

Get the path segments of the file.

This property splits the file path into individual segments, which is useful for path-based sorting and hierarchical operations.

Returns:

A tuple of path segments.

Return type:

Tuple[str, …]

Example::
>>> item = RepoFileItem("folder/subfolder/file.txt", 100, False, None, "abc123")
>>> print(item.path_segments)
('folder', 'subfolder', 'file.txt')
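
Sorting by segments rather than by the raw path string keeps files of one directory together. The sketch below shows the difference using a hypothetical standalone `path_segments` function that mirrors the property; it is an illustration, not the library's code.

```python
from typing import Tuple


def path_segments(path: str) -> Tuple[str, ...]:
    """Split a forward-slash path into its segments (hypothetical stand-in)."""
    return tuple(segment for segment in path.split("/") if segment)


paths = ["a/c.txt", "a-b/c.txt", "a/b.txt"]

# Plain string comparison puts "a-b/..." first because "-" sorts before "/".
by_string = sorted(paths)
# Segment comparison treats "a" and "a-b" as whole directory names.
by_segments = sorted(paths, key=path_segments)

print(by_string)    # ['a-b/c.txt', 'a/b.txt', 'a/c.txt']
print(by_segments)  # ['a/b.txt', 'a/c.txt', 'a-b/c.txt']
```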

RepoFileList

class hfutils.repository.size.RepoFileList(repo_id: str, items: List[RepoFileItem], repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', subdir: str | None = '')[source]

Represents a list of RepoFileItems with additional metadata and utility methods.

This class provides a way to manage and analyze a collection of files from a Hugging Face repository, including information about the repository itself. It implements the Sequence interface and provides methods for visualization and analysis of the file collection.

Parameters:
  • repo_id (str) – The ID of the repository.

  • items (List[RepoFileItem]) – A list of RepoFileItem objects.

  • repo_type (RepoTypeTyping) – The type of the repository (default: ‘dataset’).

  • revision (str) – The revision of the repository (default: ‘main’).

  • subdir (Optional[str]) – The subdirectory within the repository (default: ‘’).

Example::
>>> items = [RepoFileItem("file1.txt", 100, False, None, "abc"),
...          RepoFileItem("file2.txt", 200, False, None, "def")]
>>> file_list = RepoFileList("username/repo", items)
>>> print(len(file_list))
2
__getitem__(index)[source]

Get a RepoFileItem by index.

This method allows the RepoFileList to be accessed like a regular list or sequence, supporting both integer indices and slicing.

Parameters:

index (int or slice) – The index of the item to retrieve, or a slice.

Returns:

The RepoFileItem at the specified index.

Return type:

RepoFileItem

Example::
>>> file_list = RepoFileList("repo", [item1, item2, item3])
>>> first_item = file_list[0]
>>> print(first_item.path)

__init__(repo_id: str, items: List[RepoFileItem], repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', subdir: str | None = '')[source]

__len__() int[source]

Get the number of items in the list.

Returns:

The number of RepoFileItems in the list.

Return type:

int

Example::
>>> file_list = RepoFileList("repo", [item1, item2])
>>> print(len(file_list))
2
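
Defining `__getitem__` and `__len__` is enough for a class to satisfy Python's Sequence interface. The sketch below uses a hypothetical `FileList` stand-in (plain strings instead of RepoFileItem objects) to show what `collections.abc.Sequence` then supplies for free. It does not reproduce the actual RepoFileList implementation.

```python
from collections.abc import Sequence
from typing import List


class FileList(Sequence):
    """Hypothetical stand-in for a sequence of repository files."""

    def __init__(self, paths: List[str]):
        self._paths = list(paths)

    def __getitem__(self, index):
        # Delegating to the underlying list handles integers and slices alike.
        return self._paths[index]

    def __len__(self) -> int:
        return len(self._paths)


files = FileList(["a.txt", "b.txt", "c.txt"])
print(len(files))              # 3
print(files[0])                # a.txt
print(files[1:])               # ['b.txt', 'c.txt']
# Iteration, membership tests, reversed() and index() come from the mixin.
print("b.txt" in files)        # True
print(list(reversed(files)))   # ['c.txt', 'b.txt', 'a.txt']
print(files.index("c.txt"))    # 2
```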
__repr__()[source]

Return a string representation of the RepoFileList.

This method provides a tree-like visualization of the file list, showing the repository information and contained files in a hierarchical format.

Returns:

A formatted string representation of the file list.

Return type:

str

Example::
>>> file_list = RepoFileList("username/repo", items)
>>> print(file_list)
username/repo (2 files, 1.05 MB)
├── <RepoFileItem file1.txt, size: 512 KB>
└── <RepoFileItem file2.txt, size: 512 KB>
repr(max_items: int | None = 10)[source]

Generate a custom string representation of the RepoFileList.

This method allows customization of the number of items displayed in the tree representation, which is useful for large file lists.

Parameters:

max_items (Optional[int]) – The maximum number of items to include in the representation (default: 10).

Returns:

A formatted string representation of the file list.

Return type:

str

Example::
>>> file_list = RepoFileList("username/repo", many_items)
>>> print(file_list.repr(max_items=5))
username/repo (100 files, 10.5 MB)
├── <RepoFileItem file1.txt, size: 512 KB>
├── <RepoFileItem file2.txt, size: 512 KB>
├── ...
└── ... (100 files) in total ...
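
Truncating a tree listing to a fixed number of entries can be sketched as follows. Everything here is an assumption made for illustration: `render_tree` is a hypothetical function, the summary line format differs from the output shown above, and treating `max_items=None` as "show everything" is a guess about the parameter's meaning.

```python
from typing import List, Optional


def render_tree(title: str, entries: List[str], max_items: Optional[int] = 10) -> str:
    """Render entries as a one-level tree, showing at most max_items of them."""
    # Assumption: None means no limit.
    shown = entries if max_items is None else entries[:max_items]
    hidden = len(entries) - len(shown)
    lines = [title]
    for i, entry in enumerate(shown):
        # The closing connector is only used when nothing follows this entry.
        is_last = i == len(shown) - 1 and hidden == 0
        lines.append(("└── " if is_last else "├── ") + entry)
    if hidden:
        lines.append(f"└── ... ({hidden} more)")
    return "\n".join(lines)


print(render_tree("username/repo", ["file1.txt", "file2.txt", "file3.txt"], max_items=2))
# username/repo
# ├── file1.txt
# ├── file2.txt
# └── ... (1 more)
```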
property total_size: int

Get the total size of all files in the list.

This property calculates and returns the sum of all file sizes in the collection, which is useful for understanding the overall storage requirements of the repository or subdirectory.

Returns:

The total size in bytes.

Return type:

int

Example::
>>> file_list = RepoFileList("repo", items)
>>> print(f"Total size: {file_list.total_size} bytes")
Total size: 1048576 bytes
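
The computation behind this property is a plain sum over the file sizes. The sketch below uses a hypothetical `FileEntry` dataclass in place of RepoFileItem so that it runs without the library installed.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical stand-in carrying only a path and a size in bytes."""
    path: str
    size: int


def total_size(entries: List[FileEntry]) -> int:
    # Sum the per-file sizes; an empty list yields 0.
    return sum(entry.size for entry in entries)


entries = [FileEntry("file1.bin", 524288), FileEntry("file2.bin", 524288)]
print(f"Total size: {total_size(entries)} bytes")  # Total size: 1048576 bytes
```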

hf_hub_repo_analysis

hfutils.repository.size.hf_hub_repo_analysis(repo_id: str, pattern: List[str] | str = '**/*', repo_type: Literal['dataset', 'model', 'space'] = 'dataset', revision: str = 'main', hf_token: str | None = None, subdir: str = '', sort_by: Literal['none', 'path', 'size'] = 'path', **kwargs) RepoFileList[source]

Analyze the contents of a Hugging Face repository.

This function retrieves file information from a specified repository and creates a RepoFileList object containing detailed information about each file. It supports pattern matching, subdirectory filtering, and various sorting options.

Parameters:
  • repo_id (str) – The ID of the repository to analyze.

  • pattern (Union[List[str], str]) – A glob pattern or list of patterns to filter files (default: ‘**/*’).

  • repo_type (RepoTypeTyping) – The type of the repository (default: ‘dataset’).

  • revision (str) – The revision of the repository to analyze (default: ‘main’).

  • hf_token (Optional[str]) – The Hugging Face API token (default: None).

  • subdir (str) – The subdirectory within the repository to analyze (default: ‘’).

  • sort_by (SortByTyping) – How to sort the file list (‘none’, ‘path’, or ‘size’) (default: ‘path’).

  • kwargs (dict) – Additional keyword arguments to pass to list_all_with_pattern.

Returns:

A RepoFileList object containing the analysis results.

Return type:

RepoFileList

Raises:

May raise exceptions related to API access or file operations.

Example::
>>> # Analyze all files in a model repository
>>> result = hf_hub_repo_analysis('bert-base-uncased', repo_type='model')
>>> print(f"Found {len(result)} files")
>>>
>>> # Analyze only text files in a specific subdirectory
>>> result = hf_hub_repo_analysis('username/dataset',
...                              pattern='*.txt',
...                              subdir='data',
...                              sort_by='size')
>>> print(result)
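
The filter-then-sort flow described for this function can be tried offline on an in-memory file list. The sketch below is an approximation under stated assumptions, not the library's logic: `analyze` is a hypothetical function, matching uses `fnmatch` (whose `*` also crosses `/`, unlike most glob implementations), the default pattern is `*` rather than `**/*`, and `sort_by='size'` is assumed to sort in ascending order.

```python
from fnmatch import fnmatchcase
from typing import List, Tuple

FileInfo = Tuple[str, int]  # (path, size in bytes)


def analyze(files: List[FileInfo], pattern: str = "*",
            subdir: str = "", sort_by: str = "path") -> List[FileInfo]:
    """Filter files under subdir by pattern, then sort (hypothetical sketch)."""
    base = subdir.strip("/")
    prefix = f"{base}/" if base else ""
    selected = []
    for path, size in files:
        if not path.startswith(prefix):
            continue  # outside the requested subdirectory
        relative = path[len(prefix):]
        # Note: fnmatch lets "*" match "/" as well, so nested files can match.
        if fnmatchcase(relative, pattern):
            selected.append((relative, size))
    if sort_by == "path":
        selected.sort(key=lambda item: tuple(item[0].split("/")))
    elif sort_by == "size":
        selected.sort(key=lambda item: item[1])  # assumption: ascending
    elif sort_by != "none":
        raise ValueError(f"unknown sort_by: {sort_by!r}")
    return selected


files = [("data/b.txt", 30), ("data/a.txt", 50), ("data/notes.md", 10),
         ("README.md", 5), ("data/sub/c.txt", 1)]
print(analyze(files, pattern="*.txt", subdir="data", sort_by="size"))
# [('sub/c.txt', 1), ('b.txt', 30), ('a.txt', 50)]
```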