Time to try out some coding with generative AI and an LLM (Large Language Model).
I had one requirement for this: it should run locally on my machine and not use a web API, like the one provided by OpenAI.
While looking into this I stumbled upon llama.cpp by Georgi Gerganov, a project that also has bindings for a couple of other programming languages.
I went with the Python binding, llama-cpp-python, since my goal is just to get a small project up and running locally.
Set up the project.
# Create a folder for the project
mkdir test-llama-cpp-python && cd $_
# Create a virtual environment
pyenv virtualenv 3.9.0 llm
pyenv activate llm
# Install the llama-cpp-python package
python -m pip install llama-cpp-python
python -m pip freeze > requirements.txt
# Create an empty main.py
touch main.py
# Open up the project in VS Code
code .
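Before writing any real code, you can do a quick sanity check that the binding imports correctly. This step is optional and not something the rest of the post depends on:
# Optional sanity check that the binding installed correctly
from llama_cpp import Llama

print("llama-cpp-python is installed:", Llama)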
In the main file, add a simple skeleton prompt loop.
import os


def get_reply(prompt):
    """Local inference with llama-cpp-python"""
    return ""


def clear():
    """Clears the terminal screen."""
    os.system('cls' if os.name == 'nt' else 'clear')


def main():
    """The prompt loop."""
    clear()
    while True:
        cli_prompt = input("You: ")
        if cli_prompt == "exit":
            break
        else:
            answer = get_reply(cli_prompt)
            print(f"""Llama: {answer}""")


if __name__ == '__main__':
    main()
From the examples on GitHub we can see that we need to import the Llama class into our main file, and we will also need a model.
Hugging Face is a popular AI community where we can find models to use. There is one requirement on the model file: it has to be in the GGML file format. There is a converter available in the llama.cpp GitHub project to perform that conversion.
However, I searched for a model that was already in that format and took the first one I found, TheBloke/Llama-2-7B-Chat-GGML. In the end I downloaded the following model file: llama-2-7b-chat.ggmlv3.q4_1.bin.
Picking the correct/best model is a topic of its own and out of scope for this post.
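As a side note, if you would rather fetch the model file from code than through the browser, the huggingface_hub package can do it. This is not something the steps above depend on, just a small sketch assuming you have installed huggingface_hub separately:
# Optional: download the model file from Hugging Face in code instead of
# via the browser (assumes: python -m pip install huggingface-hub).
from huggingface_hub import hf_hub_download

model_file = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGML",
    filename="llama-2-7b-chat.ggmlv3.q4_1.bin",
)
# The file ends up in the local Hugging Face cache; copy it into the
# project folder, or pass this path as model_path later on.
print(model_file)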
When the model has been downloaded to the project folder, we can update our main.py file to start using the Llama class and the model.
from llama_cpp import Llama

llama = Llama(model_path="llama-2-7b-chat.ggmlv3.q4_1.bin", verbose=False)


def get_reply(prompt):
    """Local inference with llama-cpp-python"""
    response = llama(
        f"""Q: {prompt} A:""", max_tokens=64, stop=["Q:", "\n"], echo=False
    )
    return response["choices"].pop()["text"].strip()
We first import the Llama class and initialize a Llama object. The constructor needs the path to our model file, which is given with model_path. I'm also setting the verbose flag to False to suppress noisy log messages from the llama-cpp-python package.
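The constructor takes more options than the two used here. For example, n_ctx sets the size of the context window and n_threads the number of CPU threads used for inference; the defaults are fine for this toy project, and the exact set of options depends on the llama-cpp-python version you have installed. A variant with those knobs could look like this:
# Same initialization as above, but with a couple of optional knobs.
# n_ctx and n_threads are constructor parameters of Llama; check the
# documentation for your installed version before relying on them.
llama = Llama(
    model_path="llama-2-7b-chat.ggmlv3.q4_1.bin",
    n_ctx=2048,      # size of the context window, in tokens
    n_threads=4,     # number of CPU threads used for inference
    verbose=False,   # suppress log output from llama.cpp
)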
The get_reply function does all the local inference with the llama-cpp-python package. The prompt to generate text from needs to be formatted in a specific way, which is why the Q: and A: markers are added.
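For reference, the object returned by llama(...) follows an OpenAI-style completion layout, which is why the text is picked out of choices. Roughly like this (the values below are made up for illustration, and the metadata fields vary between versions):
# Illustrative shape of the dict returned by llama(...); only "choices"
# is used by get_reply, the rest is metadata.
response = {
    "choices": [
        {"text": " Paris is the capital of France.", "index": 0, "finish_reason": "stop"}
    ],
    # ...plus id, model and token usage fields
}

# Equivalent to response["choices"].pop()["text"].strip() when there is
# only one choice:
answer = response["choices"][0]["text"].strip()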
Here is the final version of the code.
import os

from llama_cpp import Llama

llama = Llama(model_path="llama-2-7b-chat.ggmlv3.q4_1.bin", verbose=False)


def get_reply(prompt):
    """Local inference with llama-cpp-python"""
    response = llama(
        f"""Q: {prompt} A:""", max_tokens=64, stop=["Q:", "\n"], echo=False
    )
    return response["choices"].pop()["text"].strip()


def clear():
    """Clears the terminal screen."""
    os.system("cls" if os.name == "nt" else "clear")


def main():
    """The prompt loop."""
    clear()
    while True:
        cli_prompt = input("You: ")
        if cli_prompt == "exit":
            break
        else:
            answer = get_reply(cli_prompt)
            print(f"""Llama: {answer}""")


if __name__ == "__main__":
    main()
Test run it by executing the following in your CLI.
python main.py
And ask a question; remember that typing exit will close the prompt.
You: What are the names of the planets in the solar system?
Llama: The planets in our solar system, in order from closest to farthest from the Sun, are: Mercury Venus Earth Mars Jupiter Saturn Uranus Neptune
You: exit
Until next time!