The Vertex AI SDK for Python now offers local tokenization. This feature allows you to calculate the number of tokens in your text input before sending requests to Gemini. Let’s see how this feature works…
🧩 What is a token?
Large language models (LLMs) use a fundamental unit: the token. In other words, they process input tokens and generate output tokens.
A text token can represent characters, words, or phrases. On average, one token represents approximately four characters in English text. When you send a query to Gemini, your text input is transformed into tokens. This step is called tokenization. Gemini generates output tokens that are then converted back into text using the reverse operation.
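As a back-of-the-envelope illustration of this heuristic (the sentence below is an arbitrary example; actual token counts depend on the model and the text):

# Rough heuristic only: ~4 characters per token in English text
text = "To be, or not to be, that is the question."
print(f"{len(text)} characters ≈ {len(text) / 4:.0f} tokens (estimate)")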
📦️ Setting up the Vertex AI SDK
To use Gemini tokenizers, you need the latest google-cloud-aiplatform package with the tokenization extra:
pip install --upgrade google-cloud-aiplatform[tokenization]
Ensure you have version 1.57.0 or later installed:
pip show google-cloud-aiplatform
Name: google-cloud-aiplatform
Version: 1.57.0
…
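If you prefer checking from Python, here is a minimal sketch using only the standard library (it assumes the package is installed):

import importlib.metadata

# Print the installed version of the Vertex AI SDK
print(importlib.metadata.version("google-cloud-aiplatform"))  # expect 1.57.0 or later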
🧮 Counting tokens locally
Create a tokenizer for the Gemini model you’re using and call the count_tokens() method on your text input:
from vertexai.preview import tokenization

# Local tokenizer for the target Gemini model
model_name = "gemini-1.5-flash-001"
tokenizer = tokenization.get_tokenizer_for_model(model_name)

# Count the tokens of a text input locally (no API call)
contents = "Hello World!"
result = tokenizer.count_tokens(contents)
print(f"{result.total_tokens = :,}")
result.total_tokens = 3
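To get a feel for how counts grow with input length, you can reuse the same tokenizer on a few arbitrary strings (exact counts vary by model):

# Compare character counts and token counts for inputs of increasing length
for text in ["Hi!", "Hello World!", "The quick brown fox jumps over the lazy dog."]:
    print(f"{len(text):>3} characters -> {tokenizer.count_tokens(text).total_tokens} tokens")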
Now, let’s try with a larger document:
import http.client
import typing
import urllib.request

def download_text_from_url(url: str) -> str:
    with urllib.request.urlopen(url) as response:
        response = typing.cast(http.client.HTTPResponse, response)
        encoding = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(encoding)
url = "https://storage.googleapis.com/dataflow-samples/shakespeare/hamlet.txt"
contents = download_text_from_url(url)
result = tokenizer.count_tokens(contents)
print(f"{result.total_tokens = :,}")
result.total_tokens = 53,824
🚀 Perfect! The number of tokens is computed in a fraction of a second.
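You can time it yourself with a quick sketch, reusing the tokenizer and contents from above (timings will vary with your machine):

import time

# Measure how long a local token count takes
start = time.perf_counter()
result = tokenizer.count_tokens(contents)
print(f"Counted {result.total_tokens:,} tokens in {time.perf_counter() - start:.3f} s")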
🖼️ Multimodal inputs
In the tested version (1.57.0), local token counting is only supported for text inputs.
For multimodal inputs (image, video, audio, documents), check out the documentation for details on how different media types account for different token counts.
In all cases, you can send a request using the Vertex AI API as usual:
import vertexai
from vertexai.generative_models import GenerativeModel
# project_id = "PROJECT_ID"
# location = "us-central1"
# vertexai.init(project=project_id, location=location)
model = GenerativeModel(model_name)
response = model.count_tokens(contents)
print(f"{response.total_tokens = :,}")
print(f"{response.total_billable_characters = :,}")
response.total_tokens = 53,824
response.total_billable_characters = 144,443
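Since the local tokenizer matches the model’s vocabulary, you can expect both counts to agree for text inputs. A quick sanity check, reusing the tokenizer, model, and contents from above:

# Local count (no API call) vs. API count for the same text input
local_count = tokenizer.count_tokens(contents).total_tokens
api_count = model.count_tokens(contents).total_tokens
print(f"{local_count = :,} | {api_count = :,}")  # expected to match for text inputs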
🔍 Under the hood
Text tokenizers use a fixed data file, also called the LLM vocabulary. This vocabulary determines how to encode a text string into a sequence of tokens and, conversely, how to decode a sequence of tokens back into a text string.
Here is what happens under the hood:
- When you request a tokenizer for your model for the first time, the LLM vocabulary is downloaded and locally cached. On subsequent calls, the cached data is used (see the timing sketch after this list).
- When you call tokenizer.count_tokens(contents), the text is tokenized (becomes a sequence of tokens) and the corresponding number of tokens is returned.
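You can observe the caching by timing two consecutive retrievals; a minimal sketch (the download only happens if the vocabulary is not cached yet, so your first timing may differ):

import time

# First retrieval may download the vocabulary; later ones reuse the cache
for attempt in ("first", "second"):
    start = time.perf_counter()
    tokenization.get_tokenizer_for_model("gemini-1.5-flash-001")
    print(f"{attempt} call: {time.perf_counter() - start:.3f} s")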
Key points to remember:
- 💡 For Gemini 1.0 and 1.5, the LLM vocabulary size is only about 4 MB.
- 💡 Different Gemini models can use the same tokenizer, but this is not guaranteed. Make sure to get the tokenizer for your model (see the sketch after this list).
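For example, here is a minimal sketch requesting one tokenizer per model (the model names are examples; use the ones you actually call):

# One tokenizer per model: never assume two models share a vocabulary
flash_tokenizer = tokenization.get_tokenizer_for_model("gemini-1.5-flash-001")
pro_tokenizer = tokenization.get_tokenizer_for_model("gemini-1.5-pro-001")

text = "Same text, possibly different token counts."
print(flash_tokenizer.count_tokens(text).total_tokens)
print(pro_tokenizer.count_tokens(text).total_tokens)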
⚡ Benefits of knowing token counts
Knowing how many tokens your text represents enables the following:
- ✅ Determine whether your text input will be processed by a model (see the sketch after this list). For example, Gemini 1.0 Pro supports up to 32,760 input tokens, while Gemini 1.5 Pro currently accepts up to 2,097,152 input tokens.
- ✅ Estimate or consolidate costs (note that Gemini pricing is per 1,000 characters, not per token).
- ✅ Estimate how much time will be needed to process your text input before you start receiving a response (processing times are roughly proportional to the number of processed tokens).
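As an example of the first point, here is a minimal sketch gating a request on a model’s input-token limit (the limit is the Gemini 1.0 Pro figure quoted above; the model name and threshold are assumptions to double-check against the documentation):

from vertexai.preview import tokenization

# Assumed limit for Gemini 1.0 Pro, as quoted above; verify for your model
GEMINI_1_0_PRO_INPUT_TOKEN_LIMIT = 32_760

def fits_context_window(text: str) -> bool:
    tokenizer = tokenization.get_tokenizer_for_model("gemini-1.0-pro-002")
    return tokenizer.count_tokens(text).total_tokens <= GEMINI_1_0_PRO_INPUT_TOKEN_LIMIT

print(fits_context_window("Hello World!"))  # True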
You now have a new local tool to manage your inputs before sending requests to Gemini!
🖖 Follow me on Twitter/X or LinkedIn for more cloud explorations