Strategies for Tagging ModelKits

Gorkem Ercan - Apr 19 - Dev Community

ModelKits, much like other OCI artifacts, can be identified using tags that are comprehensible to humans. This blog explores various strategies for effectively tagging your ModelKits.

Multiple tags

A ModelKit can carry multiple tags, and for good reason. When you create a ModelKit, you typically want a tag that identifies its contents. For instance, in llama-2:7b-chat-q8_0 the tag 7b-chat-q8_0 gives the parameter size, variant, and quantization of the Llama 2 model. A ModelKit that contains only data might be tagged categorization:sales-data-2023. These tags describe the artifacts a user expects to receive. If a ModelKit includes both a model and significant datasets, it can carry several tags that together delineate its contents. By convention, such tags are treated as immutable.

Besides identifying content, tags can also signify stages, environments, and other characteristics of a ModelKit, for example latest, production, or challenger. By convention, such tags are expected to be mutable.
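To make the two kinds of tags concrete, here is a minimal sketch that models a repository as a plain mapping from tag to digest. The `TagTable` class and the placeholder digests are hypothetical and exist only for illustration; a real registry stores this mapping for you.

```python
class TagTable:
    """Hypothetical in-memory stand-in for a repository: tag -> digest."""

    def __init__(self):
        self._tags = {}

    def tag(self, name, digest):
        # Pointing a tag at a digest replaces any earlier target.
        self._tags[name] = digest

    def resolve(self, name):
        return self._tags[name]

    def tags_for(self, digest):
        # Every tag that currently points at the given digest.
        return sorted(t for t, d in self._tags.items() if d == digest)


# Placeholder digests, not real hashes.
q8 = "sha256:" + "a" * 64
q4 = "sha256:" + "b" * 64

table = TagTable()
table.tag("7b-chat-q8_0", q8)  # content tag, immutable by convention
table.tag("latest", q8)        # stage tag, expected to move
table.tag("7b-chat-q4_0", q4)
table.tag("latest", q4)        # the stage tag moves to the newer ModelKit

print(table.tags_for(q8))
print(table.tags_for(q4))
```

After the last call, the content tags still point where they started, while latest has moved to the second ModelKit.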

Mutable vs Immutable

Tags are inherently mutable. Some registries, like ECR, allow tag mutability to be configured at the repository level. However, because a single repository often needs to hold both mutable and immutable tags, a repository-wide setting is impractical to enforce. Therefore, tags should not be relied upon when immutable references are required.

Instead, each ModelKit has a content-addressable identifier called a digest, which is immutable. A ModelKit's digest is a 64-character hex-encoded SHA-256 hash of its contents. It covers every artifact: code, models, data, and configuration. Changing any bit of these would result in a new digest.
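The properties of a digest can be sketched with Python's standard hashlib module. In an OCI registry the hashed bytes are the artifact's manifest, which in turn lists the digest of every layer; here an arbitrary byte string stands in for that content, and the helper name `sha256_digest` is an assumption made for this example.

```python
import hashlib


def sha256_digest(content: bytes) -> str:
    """Return an OCI-style digest string for the given bytes."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


# Stand-in content; a single changed byte is enough to change the digest.
original = b"weights-v1"
changed = b"weights-v2"

d1 = sha256_digest(original)
d2 = sha256_digest(changed)

hex_part = d1.split(":", 1)[1]
print(len(hex_part))  # 64 hex characters
print(d1 == d2)       # False: different content, different digest
```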

You can pull a ModelKit by digest and verify that the contents match the given digest. The digest is the canonical, content-derived ID of a ModelKit.
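Verification amounts to recomputing the hash and comparing it with the digest you asked for. The sketch below assumes the `sha256:<hex>` digest format and a hypothetical helper named `matches_digest`; registry clients normally perform this check for you.

```python
import hashlib
import hmac


def matches_digest(content: bytes, digest: str) -> bool:
    """Check that content hashes to the given 'sha256:<hex>' digest."""
    algorithm, _, expected_hex = digest.partition(":")
    if algorithm != "sha256" or len(expected_hex) != 64:
        raise ValueError(f"unsupported or malformed digest: {digest!r}")
    actual_hex = hashlib.sha256(content).hexdigest()
    # compare_digest runs in constant time for equal-length inputs.
    return hmac.compare_digest(actual_hex, expected_hex.lower())


payload = b"example payload"  # stand-in for the pulled bytes
expected = "sha256:" + hashlib.sha256(payload).hexdigest()

print(matches_digest(payload, expected))
print(matches_digest(payload + b"!", expected))
```

The first call reports a match; the second does not, because the bytes differ from what the digest describes.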

Comparison with Git Tags

If you're familiar with Git, you may be thinking at this point that a digest is like a commit SHA. That is because a Git commit SHA is a 40-character hex-encoded SHA-1 hash of a given state of the source code, together with the commit's metadata. Changing a file's contents or the metadata would result in a new commit SHA.
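Git's content addressing can be sketched in a few lines. Git hashes a short header (object type, a space, the body length, and a NUL byte) followed by the object body. The example hashes blob objects because their body is just the file content; a commit object is hashed the same way, with a body that names the tree, the parent commits, the author, the committer, and the message. The function name `git_object_id` is an assumption made for this example.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    """Compute the SHA-1 object ID Git assigns to an object of this kind."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


blob_id = git_object_id("blob", b"hello world\n")
print(blob_id)       # matches `echo 'hello world' | git hash-object --stdin`
print(len(blob_id))  # 40 hex characters
```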

Git also has a concept of tags. When you want to share a particular state of the source code with others, you can apply a tag like 1.0.0, and Git maps that tag to a commit SHA. ModelKit tags are similar: instead of dealing with a long hex-encoded digest, you can use a tag like :7b-q4_0.

Both Git tags and ModelKit tags are mutable. In both cases, anyone with permission to push to the repo can also update and push a tag. This could be malicious, but usually it's a well-meaning mistake. If you need to know exactly what you are getting and immutability really matters, a digest is the only way.
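One way to guard against a tag that moves unexpectedly is to record the digest a tag resolves to and compare it later. The sketch below uses a plain dictionary as a stand-in for a repository's tag-to-digest mapping; the `pin` and `moved_tags` helpers and the placeholder digests are hypothetical.

```python
def pin(tags, names):
    """Record the digest each named tag resolves to right now."""
    return {name: tags[name] for name in names}


def moved_tags(tags, pinned):
    """Return the pinned tags that no longer resolve to the recorded digest."""
    return sorted(name for name, digest in pinned.items() if tags.get(name) != digest)


# Placeholder digests, not real hashes.
repo_tags = {
    "7b-q4_0": "sha256:" + "a" * 64,
    "latest": "sha256:" + "a" * 64,
}
pinned = pin(repo_tags, ["7b-q4_0", "latest"])

# Someone re-pushes the content tag by mistake.
repo_tags["7b-q4_0"] = "sha256:" + "c" * 64

print(moved_tags(repo_tags, pinned))
```

Only the re-pushed tag is reported, because latest still resolves to the digest that was recorded.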

Some of you may be wondering about Git branches. Git branches also point to commits: when you commit on a branch, the branch moves to point to the new commit. But this is really just a convention supported by Git tooling. ModelKits do not have a concept of branches, but you can think of tags that are expected to move, like latest, as a similar convention.

Conclusion

Effective tagging of ModelKits not only facilitates ease of identification and organization but also enhances the manageability of different versions and configurations of these artifacts. Whether leveraging mutable tags for operational flexibility or immutable digests for ensuring integrity, the thoughtful application of tagging strategies ensures that ModelKits can be seamlessly integrated and reliably referenced within any type of workflow. Remember, while tags offer convenience and adaptability, digests provide the cornerstone of trust and verification in the lifecycle of a ModelKit. Adopting these practices will empower your teams to maintain a robust, efficient, and secure AI/ML project management system.
