These are the best large language models for coding

Emma Alder - Feb 24 - Dev Community

As software developers, we dream of effortless coding, where we can transform complex problems into elegant and performant solutions. However, software development is a complicated process, and writing multiple lines of error-free code is challenging, even for the most experienced developer. Hence, pair programming, where two programmers work together simultaneously and provide feedback to each other, has been popular in software development.

Traditionally, one programmer writes the code while the other reviews each line, providing real-time feedback and suggestions. With the rapid adoption of artificial intelligence (AI), pair programming with AI has enabled a single developer to write code quickly, enhancing their efficiency, improving code quality, facilitating rapid learning, and boosting overall productivity.

Today, developers can access an impressive range of tools built on top of large language models (LLMs) that go beyond basic code autocompletion and provide a powerful AI-assisted coding experience. Even though OpenAI’s GPT-4o is leading most of the coding benchmarks, Anthropic’s Claude and Google’s Gemini are not far behind. In this article, we’ll go through the best LLMs available for software development and use most of these LLMs interchangeably with Sourcegraph’s Cody.

Overview of popular AI Coding Assistants

Various tools are available to improve the experience of coding in high-level languages. The integrated development environment (IDE) streamlined the coding process by providing a comprehensive suite of tools, including a code editor, debugger, and compiler, all within a single interface. Developers can easily switch environments according to their preferences for various programming languages.

The integration of artificial intelligence into assisted coding started with basic autocompletion features that predict and complete code snippets based on context. Later, more advanced code completion tools, such as Microsoft IntelliSense, Kite, and Tabnine, were built on more capable machine learning models. Some of the most popular AI-powered coding tools available today are:

  • Sourcegraph Cody is a popular tool that uses multiple LLMs, along with advanced code search and analysis capabilities, to enhance developers' understanding and generation of code.
  • GitHub Copilot grew out of the success of OpenAI's GPT-3 as a code completion tool, with the first technical preview made available in June 2021 within the Visual Studio Code development environment. GitHub Copilot is a proprietary tool available as a subscription-based service for developers. Since 2023, GitHub Copilot has used GPT models, such as GPT-4.
  • Tabnine is a proprietary tool that leverages AI to provide code completions for various programming languages, helping speed up the development process and increase productivity.

There are also a few open-source models, such as Code Llama. However, most of these tools use the same underlying LLMs, such as OpenAI's GPT-4, Anthropic's Claude, Mistral AI's models, and Google's Gemini, for code and chat suggestions. Before diving into the most suitable LLMs for your development workflow, let's understand how these large transformer-based models work.

How LLMs work

A language model is usually a generative model used to understand and generate natural language in machine learning. Large language models have huge numbers of parameters (millions, billions, or even trillions) and use a deep learning architecture called the transformer. These models are trained on very large datasets to predict the probability of the next word in a sequence while paying attention to the input sequence. To do so, they use a self-attention mechanism that captures dependencies and relationships between tokens in the input sequence to process and generate a sequence of data.
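
To make the idea concrete, here is a minimal sketch of scaled dot-product self-attention in plain NumPy. The toy dimensions and random projection matrices are made up for illustration; real transformers use learned weights, multiple heads, and masking.

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token embeddings X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # project tokens into queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # how strongly each token attends to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                       # each output row is a context-aware mix of values

# Toy example: 4 tokens with 8-dimensional embeddings (illustrative sizes only)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8)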

A wide variety of large language models and multimodal LLMs are available; some of the most popular are as follows:

  1. OpenAI's GPT-4 is reportedly a 1-trillion-parameter model, a significant step up from GPT-3's 175 billion parameters. GPT-4 is still the leading model in many benchmarks. It comes in two versions, with context windows of 8,192 and 32,768 tokens.
  2. Google's Gemini is the successor to LaMDA and PaLM 2. Gemini 1.5 can handle a context window of up to 2 million tokens.
  3. Anthropic's Claude 2 and Claude 3.5 Sonnet focus on ethical considerations and safety.
  4. Meta's Llama 3 models are open source and available in 8B, 70B, and 405B parameter versions.
  5. Mistral AI's Mixtral 8x7B and 8x22B models, along with other variants, are available for download for general and specialized purposes. Mixtral models are also open source.

All of these powerful models are trained on code-related data and can generate code and perform other coding tasks.

How code is generated with LLMs

A large language model takes a natural language description or prompt with enough context and performs a natural language processing task. Developers can fine-tune such models for coding-related tasks such as code generation, debugging, and more. LLMs demonstrate the ability to understand syntax, coding styles, and programming practices across multiple programming languages.

During fine-tuning, the model's weights are adjusted on these datasets so that, given the previous tokens and a task-relevant prompt, the model predicts the next token in the sequence. This fine-tuning helps the transformer-based model generate coherent, context-relevant content.
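
As a rough illustration of next-token prediction in practice, here is a minimal sketch using the Hugging Face transformers library. The small gpt2 checkpoint is only a stand-in so the example stays self-contained; a code-tuned model would be used for real coding tasks.

from transformers import AutoModelForCausalLM, AutoTokenizer

# gpt2 is a small stand-in; in practice you'd load a code-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")

# The model repeatedly predicts the most probable next token given all previous tokens
output_ids = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))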

Generating sound output from an LLM depends on the training data, the model's parameters, and good prompting. Vast and diverse code-related datasets from various sources, such as open-source repositories, documentation, online resources, coding books, tutorials, and forums such as Stack Overflow, are used to train models for code generation.

Coding assistants like Cody use different models for various code-related tasks, such as real-time code generation, code completion, and code analysis, to help with debugging, refactoring, writing test cases, and code optimization tasks. They also offer chat interfaces to ask code-related questions, discuss and troubleshoot problems, and inspire developers with solutions.

Comparison of LLMs

Large language models and foundational models such as GPT-4, Gemini, Claude, and Mistral are versatile and unpredictable, which makes evaluating their performance complex and challenging. However, many organizations design specific benchmarks to provide a standardized, rigorous framework for comparing the capabilities of LLMs across core language-related tasks.

Most benchmarks consist of carefully designed tasks such as question answering, logical reasoning, numerical reasoning, code generation, and other natural language processing tasks. For example, Massive Multitask Language Understanding (MMLU) is a comprehensive benchmark designed to test an LLM's ability to understand and answer questions across various subjects. Similarly, GLUE and SuperGLUE are other popular benchmarks for testing general language understanding.

Benchmarks also require appropriate metrics for comparison; accuracy, BLEU score, and perplexity are the most common, while some benchmarks also rely on human judgment.
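
For instance, perplexity is simply the exponential of the average negative log-probability a model assigns to each token of a reference text. Here is a minimal sketch, with hypothetical log-probability values made up for the example:

import math

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-probability per token; lower is better."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Hypothetical log-probabilities a model assigned to four tokens of a reference text
print(round(perplexity([-0.105, -2.303, -0.223, -1.609]), 2))  # ~2.89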

Different challenges and benchmarks evaluate models for different purposes. Some, such as ARC (AI2 Reasoning Challenge), evaluate LLMs’ ability to reason through complex, multi-step science problems, while others, such as GSM8K (Grade School Math 8K), assess an LLM's math problem-solving skills. These benchmarks have driven competition and the development of models with knowledge on par with school or college graduates.

Most benchmarks focus on language and academic capability; however, a few evaluate programming-related tasks. The BigCodeBench leaderboard, shown in the image below, evaluates LLMs on practical and challenging programming tasks, including code generation and code completion based on natural language descriptions.

Leaderboard for BigCodeBench-Complete

This leaderboard shows GPT-4 Turbo topping the chart, followed by Claude 3.5 Sonnet and other models. Gemini 1.5, Mistral, and GPT-4o are also among the top 10. A diverse set of benchmarks and leaderboards is available to test LLM coding ability comprehensively. Here are some selected coding benchmarks testing different LLMs:

  1. The EvalPlus leaderboard evaluates AI coders with rigorous tests. GPT-4 Turbo also tops the chart here.
  2. The Chatbot Arena leaderboard has collected over 1,000,000 human pairwise comparisons, ranks LLMs with the Bradley-Terry model, and displays the model ratings on an Elo scale (a sketch of this style of pairwise rating update follows this list). ChatGPT-4o ranks first, followed by Gemini 1.5 Pro, Claude 3.5 Sonnet, Meta Llama, and Mistral AI Large among the top models.
  3. EvoEval: Due to the popularity and age of other benchmarks, many are prone to data leakage, where example solutions for code can readily be found online. EvoEval transforms existing coding benchmarks (for example, HumanEval) into new problems along dimensions such as difficulty, creativity, and subtlety. The accompanying leaderboard image shows how the models rank, with GPT-4 Turbo and Claude leading.
  4. Spider and SQL-Eval are large-scale, complex, cross-domain benchmarks for text-to-SQL tasks. Spider measures how well generated SQL retrieves the correct data from a database. SQL-Eval measures SQL code complexity and semantic diversity and ensures the generated queries are syntactically correct and meaningful. GPT-4 powers most of the leading entries on their leaderboards.
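
To make the pairwise ranking from item 2 concrete, here is a minimal Elo-style update, a simple online approximation of the idea; Chatbot Arena itself fits a Bradley-Terry model over all comparisons at once, and the starting ratings and K-factor below are purely illustrative:

def elo_update(r_a, r_b, a_wins, k=32):
    """One Elo-style rating update from a single pairwise comparison."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))  # predicted chance that A wins
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Hypothetical vote: both models start at 1000 and a human prefers model A's answer
print(elo_update(1000, 1000, a_wins=True))  # (1016.0, 984.0)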

Strengths and weaknesses of LLM models

Each LLM has its strengths and weaknesses, and many benchmarks test code generation in different scenarios to reveal them. The following provides a detailed overview of the strengths and weaknesses of the most-used LLMs:

Strengths and weaknesses of LLM models
Based on the above results, here is our ranking of these LLMs:

  1. OpenAI’s GPT-4 and GPT-4o: Perform the best across multiple benchmarks
  2. Anthropic Claude 3.5 Sonnet: Second to GPT-4 and GPT-4o in most benchmarks, and highly accurate across different scenarios
  3. Google’s Gemini 1.5: More factual and less imaginative than GPT-4; it might fail in scenarios requiring a more creative coding approach
  4. Mistral AI: Fast, but with a comparatively short context window

Over time, as these models’ architectures improve and they are trained on more data, this ranking is likely to change. Hence, we need a coding assistant that lets us easily switch between these models. Sourcegraph’s AI assistant, Cody, integrates with our favorite IDEs and lets us swap LLMs interchangeably.

Integrating LLMs with Cody

There is a large number of LLMs and variations of them, each with its own strengths and weaknesses. Cody, Sourcegraph’s AI assistant, can use different LLMs to help developers understand, write, and fix code more efficiently. Using multiple advanced LLMs allows Cody to provide fast single-line and multi-line code completion. If you haven't used Cody yet, you can sign up for free to get started.

Installing Cody
Cody provides seamless integration with the IDE or code editor of your choice; it can be integrated into Microsoft Visual Studio Code, JetBrains IntelliJ IDEA, PhpStorm, PyCharm, WebStorm, RubyMine, GoLand, Google Android Studio, and Neovim.
For Visual Studio Code integration, you can get the Cody extension from the marketplace. Alternatively, you can open the Extensions page by clicking View > Extensions in VS Code and searching for Cody AI.

Installing Cody from the VS Code Marketplace
When you install Cody in Visual Studio Code for the first time, its icon usually appears as the last item in the sidebar; check for it as shown in the image below. You’ll need to sign in using the same login method and ID you used while creating your account.

Sign in

Click the Cody icon again to open the Cody chat sidebar, which you can use to chat with Cody to generate code, document your code, explain parts of code, write unit tests, find code smells, and more.

Cody Chat bar

Switching LLMs within Cody
After installing Cody in VS Code, you can switch between LLMs through the drop-down menu below the prompt box in the Cody: Chat bar. Each model has its strengths and weaknesses, as previously discussed, and Cody groups the models by whether they favor accuracy, speed, or a balance of both.

Available LLMs in Cody Chat bar

Here’s the list of LLMs supported by Cody:

  • Anthropic Claude: Claude 3.5 Sonnet, Claude 3 Sonnet, Claude 3 Opus, Claude 3 Haiku
  • OpenAI GPT: GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo
  • Google Gemini: Gemini 1.5 Pro, Gemini 1.5 Flash
  • Mistral AI: Mixtral 8x22B, Mixtral 8x7B

The integration of these models provides several benefits to the programmers for AI-assisted software development:

  • Cody recommends options for better code accuracy, faster generation speed, or a balance of the two, so users can easily choose their preference without researching each LLM.
  • LLMs enable Cody to provide more accurate and context-aware code completions and debugging.
  • LLMs empower Cody to understand code semantics better. As a result, it can explain complex code constructs, algorithms, and design patterns.

Using LLMs in Cody for code generation tasks

The following examples illustrate how Cody's integrations allow experimenting with different LLMs to solve coding problems:

Generating an algorithm to solve a LeetCode problem:

Let’s experiment with the problem: Longest Substring Without Repeating Characters

LeetCode problem: generating the longest substring without repeating characters

Copy the problem description and paste it into the chat prompt in Cody, using Claude 3.5 Sonnet. Cody will generate a solution that passes all the test cases. In this case, we are writing Python code. Here’s the code output:

Anthropic Claude 3.5 Sonnet generates verbose output with explanation and code complexity
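
The generated code itself appears only in the screenshot above; for reference, a typical sliding-window solution to this problem looks like the sketch below (representative of what these models produce, not Claude's verbatim output):

def length_of_longest_substring(s: str) -> int:
    """Sliding window: track the most recent index of each character seen."""
    last_seen = {}   # character -> most recent index
    start = 0        # left edge of the current duplicate-free window
    longest = 0
    for i, ch in enumerate(s):
        if ch in last_seen and last_seen[ch] >= start:
            start = last_seen[ch] + 1  # shrink window past the repeated character
        last_seen[ch] = i
        longest = max(longest, i - start + 1)
    return longest

print(length_of_longest_substring("abcabcbb"))  # 3 ("abc")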

The code passes all the test cases. Claude is quite detailed in its generation: alongside the code, it explains in the chat how the solution works and gives its space and time complexity. Different LLMs produce different output formats. Using the same prompt with Gemini 1.5 Pro generates the following code.

Code Generated with Gemini

Using GPT-4o, we get the following output with an explanation of the code and example usage.

GPT-4o generates verbose output with explanation and example usage

The runtime for all the generations is in milliseconds, and none of the generated code contains syntax errors.

Generate code to train a deep-learning model
Training AI models is a common coding task these days, and AI writing the code that trains other AI models is a further goal for automated coding systems. Using the following prompt, we can generate functioning Python code to train an object detection model.

Generate Python code to create and train a deep learning model for object detection using PyTorch. The code should include the following steps:

  1. Import necessary libraries and modules.

  2. Define the dataset and data loaders. Here, images are available in the dataset folder, while the annotations are provided in a JSON file, as described on the COCO dataset website.

  3. Use a pre-trained neural network model suitable for object detection.

  4. Define the loss function and optimizer.

  5. Implement the training loop with validation.

  6. Save the trained model.

Ensure the code is well-commented and follows best practices for readability and efficiency.

Here's the code generated by the Claude 3.5 Sonnet model; sections are elided to keep the lengthy listing short.

import torch
import torchvision
...

# Step 1: Import necessary libraries and modules
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from pycocotools.coco import COCO
from tqdm import tqdm

# Step 2: Define the dataset and data loaders
class CocoDataset(Dataset):
    def __init__(self, root_dir, annotation_file, transform=None):
        self.root_dir = root_dir
        self.coco = COCO(annotation_file)
        ...

# Create dataset and data loaders
train_dataset = CocoDataset(root_dir='dataset/train', annotation_file='annotations/instances_train2017.json')
val_dataset = CocoDataset(root_dir='dataset/val', annotation_file='annotations/instances_val2017.json')

train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True, num_workers=4, collate_fn=lambda x: tuple(zip(*x)))
val_loader = DataLoader(val_dataset, batch_size=2, shuffle=False, num_workers=4, collate_fn=lambda x: tuple(zip(*x)))

# Step 3: Use a pretrained neural network model suitable for object detection
def get_model(num_classes):
    model = fasterrcnn_resnet50_fpn(pretrained=True)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    ...

# Step 4: Define the loss function and optimizer
params = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)

# Step 5: Implement the training loop with validation
num_epochs = 10
for epoch in range(num_epochs):
    model.train()
    train_loss = 0
    for images, targets in tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs}"):
        ...

# Step 6: Save the trained model
torch.save(model.state_dict(), 'object_detection_model.pth')
print("Model saved successfully!")

The code works as expected, with slight adjustments to match the project directory layout and some minor error fixing. As before, the model generates additional explanations, covering how the code works and the assumptions the LLM has made. GPT-4o, Gemini, and Mixtral also give very similar results, though each requires some changes to the code.
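
As a quick follow-up, reloading the saved weights for inference looks roughly like this; the get_model helper and the class count are assumptions carried over from the generated training script above:

import torch

# Rebuild the architecture, then load the saved weights
model = get_model(num_classes=91)  # assumes the same helper; 91 = COCO category slots
model.load_state_dict(torch.load('object_detection_model.pth'))
model.eval()  # switch to inference mode before running predictions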

Wrapping up

Integrating large language models into IDEs has transformed software development by enhancing efficiency, improving code quality, and facilitating rapid learning and prototyping. These advanced tools use LLMs to provide real-time code completion, code generation, explanation, debugging, and code-understanding features.

Among all the popular models, GPT-4o leads many coding benchmarks, and Claude 3.5 Sonnet is among the top five models in these benchmarks. Claude 3.5 Sonnet and Mixtral 8x7B are extremely fast, while GPT-4o and Gemini work well with multimodal inputs. Each model has its strengths and weaknesses, and different scenarios may require different models.

Cody allows you to find the best-fitting LLM for the coding scenario and your organization’s coding practices. Whether you are looking for speed, accuracy, or balance, Cody’s adaptable integrations have you covered. You can start experimenting today with multiple LLMs, from GPT-4 to Claude 3.5, directly available from Sourcegraph Cody, and select the best model for your coding problem.
