TL;DR
I was recently assigned a challenging project that involved working with an existing Django repository, and I immediately realized I knew about as much about Django as my pet goldfish.
You must be asking: why not use ChatGPT? The problem was that the entire codebase could not fit in ChatGPT's prompt, and even if it could, the answers would have been highly unreliable.
So, I built an AI bot that lets you chat with any codebase with high accuracy using a code indexing tool.
Here's how I did it:
- I used a code indexing tool to analyze and index the entire codebase.
- Built an AI bot that accepts questions, understands the context, and retrieves relevant code chunks.
- Then, the bot analyzes the code and answers accordingly.
The crux of this workflow is the code indexing tool, which intelligently parses an entire codebase and indexes the code in a vector database.
What is RAG?
RAG stands for Retrieval-Augmented Generation. As the name suggests, RAG involves retrieving data from various knowledge bases, such as vector DBs, web pages, the internet, etc., and generating an answer using an LLM.
The key components of a typical RAG system are:
- Embedding Model: A deep learning model used to create embeddings of data (text, images, etc.).
- Vector DB: A specialized database for managing vector embeddings.
- LLM: A deep learning model for generating text responses.
Here is a diagram of a typical RAG workflow.
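In code, the same workflow boils down to three steps. Here is a minimal sketch; embed(), vector_db.search(), and llm.generate() are hypothetical stand-ins for whatever embedding model, vector database, and LLM you use, not a real library API.

# A minimal sketch of the retrieve-then-generate loop.
# embed(), vector_db.search(), and llm.generate() are hypothetical stand-ins.
def answer(question, vector_db, llm, k=5):
    query_vec = embed(question)                 # 1. embed the question
    chunks = vector_db.search(query_vec, k=k)   # 2. retrieve the top-k similar chunks
    context = "\n\n".join(chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm.generate(prompt)                 # 3. generate a grounded answer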
Embeddings and Vector Databases
Before moving ahead, let's quickly get acquainted with embeddings and vector databases.
Embeddings
Embeddings, or vectors, represent data (text, images, etc.) numerically in a multi-dimensional space. Deep learning models trained on millions of data points learn the relationships, or proximity, between different data points.
For example, the embedding for "Donald Trump" will be closer to "US" than to "China", and the words "Cat" and "Kitten" will be close together.
Embeddings are used to calculate the semantic similarity between sentences, and we can extend the same concept to code.
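To make this concrete, here is a minimal sketch using OpenAI's embeddings API. The model name text-embedding-3-small and the word pairs are just illustrative choices:

# A minimal sketch: compare semantic similarity with embeddings.
# Assumes OPENAI_API_KEY is set in the environment.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [np.array(d.embedding) for d in resp.data]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cat, kitten, car = embed(["cat", "kitten", "car"])
print(cosine(cat, kitten))  # high: semantically close
print(cosine(cat, car))     # lower: less related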
Vector Databases
Traditional DBs are not suitable for managing embeddings; we need specialized databases and algorithms to store and retrieve them. These are called vector databases.
Indexing techniques are the methods used to organize data for storage and search in a vector database.
Vector databases use methods like HNSW and IVF for indexing and similarity search, and techniques like BM25 and hybrid search for querying.
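As an illustration (this is not what Composio uses internally, just a plain FAISS example), here is what HNSW indexing and similarity search look like with the faiss library:

# A minimal sketch of HNSW indexing and search with FAISS.
import faiss
import numpy as np

dim = 128
vectors = np.random.random((10_000, dim)).astype("float32")  # stand-in embeddings

index = faiss.IndexHNSWFlat(dim, 32)  # 32 = number of HNSW graph neighbors
index.add(vectors)                    # index the vectors

query = vectors[:1]                        # query with the first vector
distances, ids = index.search(query, k=5)  # 5 nearest neighbors
print(ids, distances)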
The best thing is that you do not have to worry about any of this. The Code Analysis tool from Composio abstracts away all the complexity.
Composio - Open-source platform for AI tools & Integrations
Here's a quick introduction about us.
Composio is an open-source tooling infrastructure for building robust and reliable AI applications. We provide 100+ tools and integrations across industry verticals, from CRM, HRM, and Sales to Productivity, Dev, and Social Media.
We also provide local tools such as Code Analysis, RAG, SQL, etc.
This article discusses using the Code Analysis tool to index a codebase for question answering.
Please help us with a star. It would help us create more articles like this.
Star the Composio repository
How does it work?
This project is an AI tool that lets you conveniently chat with any codebase. Here is how it works:
- Input Repository Path: Provide the path to a local codebase.
- Code Analysis and Indexing: The tool analyzes the code and indexes it into a vector database.
- Query with Prompts: After indexing, you can submit prompts or questions related to the codebase.
- Retrieve and Respond: The tool fetches relevant code snippets from the database and generates responses based on the code content.
Here is an overall workflow of the project.
Technical Description
Under the hood, the AI bot receives the path string to the codebase and performs the following actions:
- Generates a Fully Qualified Domain Name (FQDN) cache for code entities.
- Creates an index of Python files.
- Builds a vector database from chunked code for efficient searching (see the sketch after this list).
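To give a feel for what chunking means here, below is a hypothetical sketch of function-level chunking with Python's built-in ast module. Composio's actual indexer is more sophisticated (it also builds the FQDN cache and file index), but the idea is similar:

# A hypothetical sketch of function/class-level code chunking using ast.
# Each chunk would then be embedded and stored in the vector database.
import ast

def chunk_python_file(path):
    with open(path) as f:
        source = f.read()
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(ast.get_source_segment(source, node))
    return chunks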
Tech Stack
- CrewAI: For building the agent.
- Composio: For the code indexing and code analysis tools.
Let's get started!
Begin by creating a Python virtual environment.
python -m venv code-search
cd code-search
source bin/activate
Now, install the following dependencies.
pip install composio-core
pip install crewai
pip install composio-crewai
- `composio-core`: The core Composio library, used to access the tools.
- `crewai`: Agentic framework for building agents.
- `composio-crewai`: CrewAI plugin for Composio.
Set up Composio
Next, set up Composio.
Log in or sign up to Composio using the following command.
composio login
You will be directed to the login page.
Log in using GitHub, Google, or your email, whichever is convenient.
Once you log in, an authentication key pops up. Copy it and paste it into your terminal.
You will also need an OpenAI API key. Create a `.env` file and add an environment variable for it:
OPENAI_API_KEY=<your-api-key>
To create an OpenAI API key, go to the official site and create an API key in the dashboard.
Importing Libraries
Let's import the required libraries and modules and load the environment variables.
import os
from dotenv import load_dotenv
from crewai import Agent, Task, Crew
from langchain_openai import ChatOpenAI
from composio_crewai import ComposioToolSet, Action, App
# Load environment variables
load_dotenv()
This imports the libraries and loads the environment variables.
Defining helper functions
In this section, we will define three helper functions.
- `get_repo_path()`: Prompts the user for a valid repository path.
- `create_composio_toolset(repo_path)`: Creates a ComposioToolSet instance for accessing tools.
- `create_agent(tools, llm)`: Creates a code analysis agent using CrewAI.
So, let's take a look at the code.
get_repo_path()
def get_repo_path():
"""
Prompt the user for a valid repository path.
Returns:
str: A valid directory path.
"""
while True:
path = input("Enter the path to the repo: ").strip()
if os.path.isdir(path):
return path
print("Invalid path. Please enter a valid directory path.")
The function simply prompts the user in the terminal until a valid directory path is entered.
create_composio_toolset(repo_path)
def create_composio_toolset(repo_path):
"""
Create a ComposioToolSet instance using the given repository path.
Args:
repo_path (str): Path to the repository to analyze.
Returns:
ComposioToolSet: Configured ComposioToolSet instance.
"""
return ComposioToolSet(
metadata={
App.CODE_ANALYSIS_TOOL: {
"dir_to_index_path": repo_path,
}
}
)
The above function returns an instance of `ComposioToolSet` configured with the `CODE_ANALYSIS_TOOL`. The tool accepts the codebase path and is responsible for creating indexes of the code files.
create_agent(tools, llm)
def create_agent(tools, llm):
"""
Create a Code Analysis Agent with the given tools and language model.
Args:
tools (list): List of tools for the agent to use.
llm (ChatOpenAI): Language model instance.
Returns:
Agent: Configured Code Analysis Agent.
"""
return Agent(
role="Code Analysis Agent",
goal="Analyze codebase and provide insights using Code Analysis Tool",
backstory=(
"You are an AI agent specialized in code analysis. "
"Your task is to use the Code Analysis Tool to extract "
"valuable information from the given codebase and provide "
"insightful answers to user queries."
),
verbose=True,
tools=tools,
llm=llm,
)
This function returns a CrewAI agent. The agent is defined with
- role: Role assigned to the agent.
- goal: Final goal of the agent.
- backstory: Provides additional context to the LLM for answer generation.
- tools: The CODE_ANALYSIS_TOOL.
- llm: The ChatOpenAI instance passed in as an argument.
Defining the main() function
Finally, let's define the `main()` function.
def main():
# Get repository path
repo_path = get_repo_path()
# Initialize ComposioToolSet
composio_toolset = create_composio_toolset(repo_path)
# create a code index for the repo.
print("Generating FQDN for codebase, Indexing the codebase, this might take a while...")
resp = composio_toolset.execute_action(
action=Action.CODE_ANALYSIS_TOOL_CREATE_CODE_MAP,
params={},
)
print("Indexing Result:")
print(resp)
print("Codebase indexed successfully.")
# Get tools for Code Analysis
tools = composio_toolset.get_tools(apps=[App.CODE_ANALYSIS_TOOL])
# Initialize language model
llm = ChatOpenAI(model="gpt-4o", temperature=0.7)
# Create agent
agent = create_agent(tools, llm)
# Get user question
question = input("Enter your question about the codebase: ")
# Create task
task = Task(
description=f"Analyze the codebase and answer the following question:\n{question}",
agent=agent,
expected_output="Provide a clear, concise, and informative answer to the user's question.",
)
# Create and execute crew
crew = Crew(agents=[agent], tasks=[task])
result = crew.kickoff()
# Display analysis result
print("\nAnalysis Result:")
print(result)
if __name__ == "__main__":
main()
Here is what is happening in the above code:
- We start by calling `get_repo_path()`, which asks for the repository directory, and then create the `ComposioToolSet` instance.
- Next, using `CODE_ANALYSIS_TOOL_CREATE_CODE_MAP`, we create the vector index of the code files. This is the longest and most compute-intensive phase, so it may take a while. The tool crawls through the repository, intelligently chunks the code, and creates vector indexes in a vector database.
- Then we create instances of the OpenAI model and the Code Analysis tools, and finally the AI agent.
- In the next step, we ask the user a question about the codebase from the terminal.
- Now, define the task; this gives the Agent a purpose. The CrewAI Task is defined with
- description: A clear description.
- agent: The agent we defined earlier.
- expected_output: The expected outcome from the AI agent.
- Finally, kick off the Crew and log the result.
Once you are done, run the Python script.
python code_qa.py
This will initiate the entire flow.
The first run will take a while, as the agent crawls and indexes the code files.
So, here is the tool in action.
You can find the complete code here: Code Indexing AI tool
Thank you for reading the article.
Next Steps
In this article, you built a complete AI tool that lets you ask questions about your codebase and get accurate answers.
If you liked the article, explore and star the Composio repository for more AI use cases.
Star the Composio repository