Generating training data with OpenAI function calling

Krisztián Maurer - Jun 22 - - Dev Community

As I delve into machine learning and AI, it's clear that the quality of training data is crucial. Creating it by hand, for example labeling 10,000 texts or images, is tedious. OpenAI models can automate much of this work by generating training or fine-tuning data for our own models. In this blog post, I will discuss how this works.

(btw did you know that GPT can generate memes?)

Why Use Function Calling for This?

One of the most useful features of the OpenAI API is function calling. The model returns arguments that follow a schema we define, and our code calls the matching function with them. When generating training data, this consistency is crucial: most label values must come from a predefined set of options, and the schema can express that restriction. You can also add logic to these functions to handle the clean, consistent data, such as saving it to a database or a CSV file.
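
To make the consistency point concrete, here is a minimal sketch of a guard that re-checks the model's arguments against the same list of options the schema enum would use. The names `LABELS` and `parseLabelArgs` are hypothetical and not part of the project; the assumption is that the tool-call arguments arrive as a JSON string, as they do in the Chat Completions API.

```typescript
// Hypothetical label set; the schema enum would be built from the same list.
const LABELS = ['positive', 'negative', 'neutral'] as const;
type Label = (typeof LABELS)[number];

// Parse the raw JSON arguments and re-check the label against the allowed set.
// Returns null instead of throwing so the caller can skip or retry the item.
function parseLabelArgs(rawArgs: string): Label | null {
    let parsed: unknown;
    try {
        parsed = JSON.parse(rawArgs);
    } catch {
        return null; // the arguments were not valid JSON
    }
    if (typeof parsed !== 'object' || parsed === null) return null;

    const label = (parsed as { label?: unknown }).label;
    if (typeof label !== 'string') return null;

    const index = (LABELS as readonly string[]).indexOf(label);
    return index === -1 ? null : LABELS[index];
}

console.log(parseLabelArgs('{"label":"positive"}')); // positive
console.log(parseLabelArgs('{"label":"angry"}'));    // null
```

A guard like this is cheap, and it keeps a stray value out of the training file if the model ever returns something outside the enum.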

My Motivation

In my latest side project, I created an RSS reader with AI features.
One of the features is to categorize post content as "positive," "negative," or "neutral." This allows users to filter out negative posts if they prefer. While I found many models that do a good job, I plan to fine-tune one with RSS feed data to improve accuracy. However, if I want to create a more advanced sentiment classifier with custom labels, I need to create my own training dataset and train my model. Whether I use an existing model or create one, I need high-quality training data. This is why I brainstormed and found the following method.

Labeling Data with OpenAI

First, gather some data. From this data, you can create a fine-tuning dataset by labeling it or adding new machine learning features. Here's a simple guide:

  1. Provide Proper Context for OpenAI:

    • Add a clear system prompt, e.g., "Your task is to label the provided data."
    • Include the data context in the prompt, e.g., "Label this blog post with the label_tool: 'blog post content...'."
  2. Create the Function Schema for OpenAI:

    • Provide a detailed description of the tool.
    • Clearly define the parameters, using enums and other schema elements to restrict responses.
  3. Create the Function Defined by the Schema:

    • This function can process the data, save it, or perform other tasks. In my case, it can add a new row to a training data CSV file, creating a new training element.

By following these steps, you can label data consistently and in a fixed format, making it ready for training your models.
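
The three steps above can be sketched without any network call. The helper names `buildLabelRequest` and `handleSetLabel` are hypothetical; the payload follows the `messages`/`tools` shape of the Chat Completions API (the caller would still add a model name), and the in-memory `rows` array stands in for the CSV file.

```typescript
type Row = { label: string; text: string };

// Step 2: the function schema, with an enum restricting the answer.
const setLabelTool = {
    type: 'function' as const,
    function: {
        name: 'set_label',
        description: 'Set label to text',
        parameters: {
            type: 'object',
            properties: {
                label: { type: 'string', enum: ['positive', 'negative', 'neutral'] },
            },
            required: ['label'],
        },
    },
};

// Step 1: a clear system prompt, plus the data to label in the user message.
function buildLabelRequest(text: string) {
    return {
        messages: [
            { role: 'system' as const, content: 'Your task is to label the provided data.' },
            { role: 'user' as const, content: `Label this text: ${text}` },
        ],
        tools: [setLabelTool],
    };
}

// Step 3: the function behind the schema. Here it appends to an array;
// in a real run it could append a row to the training CSV instead.
const rows: Row[] = [];
function handleSetLabel(args: { label: string }, text: string): string {
    rows.push({ label: args.label, text });
    return `Added label successfully: ${args.label}`;
}

const request = buildLabelRequest('I love this product!');
// Simulated tool call: the kind of arguments the model is expected to send back.
handleSetLabel({ label: 'positive' }, 'I love this product!');
console.log(request.messages.length, rows);
```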

Let's Look at a Simple Code Example

Here is a simple example of a text labeling tool. Keep in mind you can do much more complex things than this, such as creating complex ML features or utilizing image recognition or text-to-speech features. But to keep it clear, I chose this example:

In this example, I add a label to any text, which can be one of ['positive', 'negative', 'neutral'], and write the result to a CSV file so it can later be used to train or fine-tune a model.




import {ITool, ToolSchema} from './interfaces/tool.interface';
import {ToolUtils} from "../utils/tool-utils";
import * as path from 'path';
import {createObjectCsvWriter as createCsvWriter} from 'csv-writer';

export class LabelTool implements ITool<string[], { inputText: string }> {
    private csvWriter;

    constructor(private readonly labels: string[] = ['positive', 'negative', 'neutral'], private readonly csvFilePath: string = path.join('labeled_text.csv')) {
        this.csvWriter = createCsvWriter({
            path: this.csvFilePath,
            header: [
                {id: 'label', title: 'Label'},
                {id: 'text', title: 'Text'},
            ],
            append: true
        });
    }

    // The OpenAI model triggers this function with "options" matching the schema; ctx is our own optional additional context.
    async callback(
        options: { label: string },
        ctx: { inputText: string },
    ): Promise<any> {

        // write the new labeled data row to a csv
        await this.csvWriter.writeRecords([{
            label: options.label,
            text: ctx.inputText
        }]);

        console.log(`Add CSV row: ${options.label} | ${ctx.inputText}`);

        return `Added label successfully: ${options.label}`;
    }

    // learn more about json schemas here https://json-schema.org/learn/getting-started-step-by-step
    async getSchema(ctx: { inputText: string }): Promise<ToolSchema> {

        // this is the provided schema for the LLM
        return {
            type: 'function',
            function: {
                name: 'set_label',
                description: 'Set label to text',
                function: ToolUtils.getToolFn(this, ctx),
                parse: JSON.parse,
                parameters: {
                    type: 'object',
                    properties: { // these properties will be in the callback "options" param
                        label: {
                            type: 'string',
                            description: 'label of the input text',
                            enum: this.labels // restrict the possible strings
                        },
                    },
                    required: ['label'], // the model must always supply a label
                },
            },
        };
    }
}





With this tool, you can iterate over the data that needs labeling and send one request to OpenAI per item.




import OpenAI from "openai";
import {LabelTool} from "./tools/label.tool";
import 'dotenv/config'; // loads OPENAI_API_KEY from a local .env file

const client = new OpenAI({
    apiKey: process.env.OPENAI_API_KEY,
});

(async () => {
    const inputTexts = [ // the OpenAI model will label these with ['positive', 'negative', 'neutral']
        "I love this product!",
        "This is the worst thing I have ever bought.",
        "It's okay, not great but not bad either.",
        "Not worth the money.",
        "Best purchase ever!",
    ];

    for (const inputText of inputTexts) {
        console.debug(`Prompt: Label this text: ${inputText}`);

        const tool = new LabelTool(['positive', 'negative', 'neutral']);
        const context = { inputText: inputText };
        const prompt = `Label this text: ${inputText}`;
        const system = 'You are a helpful assistant generating training data';

        const runner = client.beta.chat.completions.runTools({
            model: 'gpt-3.5-turbo',
            messages: [
                {
                    role: 'system',
                    content: system,
                },
                {
                    role: 'user',
                    content: prompt,
                },
            ],
            tools: [await tool.getSchema(context)],
            tool_choice: 'auto', // If you pass tool_choice: {function: {name: …}} instead of auto, it returns immediately after calling that function
        });

        const finalContent = await runner.finalContent();
        console.log(`AI response: ${finalContent}
        `);
    }
})();


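
Since every input should end up with a label, it can make sense to name the function explicitly rather than leave the decision to `'auto'`. Below is a minimal sketch of the `tool_choice` value for that. `forceTool` is a hypothetical helper; the object shape is the one the Chat Completions API uses to select a specific function. As the comment in the code above notes, the runner then returns right after the function call, so a final text reply should not be relied on in that mode.

```typescript
// Shape of a tool_choice value that names one specific function.
type ForcedToolChoice = {
    type: 'function';
    function: { name: string };
};

// Hypothetical helper: build the tool_choice value for a given tool name.
function forceTool(name: string): ForcedToolChoice {
    return { type: 'function', function: { name } };
}

// Would take the place of `tool_choice: 'auto'` in the request above.
const toolChoice = forceTool('set_label');
console.log(JSON.stringify(toolChoice));
// {"type":"function","function":{"name":"set_label"}}
```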

Let's run it:



npx ts-node index.ts



Log result:



Prompt: Label this text: I love this product!
Add CSV row: positive | I love this product!
AI response: The text "I love this product!" has been labeled as positive.

Prompt: Label this text: This is the worst thing I have ever bought.
Add CSV row: negative | This is the worst thing I have ever bought.
AI response: The text "This is the worst thing I have ever bought." has been labeled as negative.

Prompt: Label this text: It's okay, not great but not bad either.
Add CSV row: neutral | It's okay, not great but not bad either.
AI response: The text "It's okay, not great but not bad either." has been labeled as neutral.

Prompt: Label this text: Not worth the money.
Add CSV row: negative | Not worth the money.
AI response: The text "Not worth the money." has been labeled as "negative".

Prompt: Label this text: Best purchase ever!
Add CSV row: positive | Best purchase ever!
AI response: The text "Best purchase ever!" has been labeled as positive.





CSV file:



Label,Text
positive,I love this product!
negative,This is the worst thing I have ever bought.
neutral,"It's okay, not great but not bad either."
negative,Not worth the money.
positive,Best purchase ever!




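
Before training on the file, it is worth checking how the labels are distributed. The sketch below counts labels in CSV text like the listing above. `parseCsvLine` and `countLabels` are hypothetical helpers; the parser only handles the simple quoting seen here (double-quoted fields, `""` as an escaped quote, no line breaks inside fields) and assumes the first line is a header row. For real files a CSV library is the safer choice.

```typescript
// Split one CSV line into fields, honouring double-quoted fields.
function parseCsvLine(line: string): string[] {
    const fields: string[] = [];
    let current = '';
    let inQuotes = false;
    for (let i = 0; i < line.length; i++) {
        const ch = line[i];
        if (inQuotes) {
            if (ch === '"' && line[i + 1] === '"') {
                current += '"'; // escaped quote inside a quoted field
                i++;
            } else if (ch === '"') {
                inQuotes = false; // closing quote
            } else {
                current += ch;
            }
        } else if (ch === '"') {
            inQuotes = true; // opening quote
        } else if (ch === ',') {
            fields.push(current);
            current = '';
        } else {
            current += ch;
        }
    }
    fields.push(current);
    return fields;
}

// Count how often each label occurs, skipping the header row.
function countLabels(csv: string): Record<string, number> {
    const counts: Record<string, number> = {};
    const lines = csv.split('\n').filter((l) => l.trim() !== '');
    for (const line of lines.slice(1)) {
        const [label] = parseCsvLine(line);
        counts[label] = (counts[label] ?? 0) + 1;
    }
    return counts;
}

const csv = [
    'Label,Text',
    'positive,I love this product!',
    'negative,This is the worst thing I have ever bought.',
    'neutral,"It\'s okay, not great but not bad either."',
    'negative,Not worth the money.',
    'positive,Best purchase ever!',
].join('\n');

console.log(countLabels(csv)); // { positive: 2, negative: 2, neutral: 1 }
```

A heavily skewed count is a hint to gather more examples of the rare labels before fine-tuning.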

Of course, there are many ways to simplify and extend this method, but I chose this example to give you an idea. You can try out the code in this GitHub repository: https://github.com/MaurerKrisztian/training_data_genration_with_openai

Using OpenAI's function calling can make it much easier to create high-quality training data. Whether you're labeling text, images, audio, or other data, the schema keeps the labels consistent, while their accuracy still depends on the model and is worth spot-checking on a sample. This can save a lot of time and effort when training or fine-tuning your machine learning models.

Thank you for reading this blog post! I'm still experimenting with this idea, so if you have any thoughts on how this method can be used or expanded, please leave a comment.
