How to Integrate OpenAI for Text Generation, Text-to-Speech, and Speech-to-Text in .NET

PeterMilovcik - Nov 7 - Dev Community

With the release of OpenAI's official .NET NuGet package (version 2.0.0), developers can easily integrate AI-driven text generation, text-to-speech (TTS), and speech-to-text (STT) functionality into their .NET applications. This guide walks through creating an OpenAI service in .NET that generates text responses, converts text to audio, and transcribes audio files back to text.

This implementation will use the minimum configuration necessary. For Windows, we’ll also leverage the NAudio package for handling audio playback, as it offers a straightforward solution for recording and playing audio files.


Prerequisites

Before you start integrating OpenAI’s capabilities into your .NET project, make sure you have the following set up:

  1. Install the OpenAI NuGet Package (Version 2.0.0): Add the latest version of the OpenAI NuGet package to your .NET project:
dotnet add package OpenAI --version 2.0.0
  2. Install NAudio (for Windows audio handling): If you're working on a Windows machine and need to handle audio recording or playback, add the NAudio NuGet package:
dotnet add package NAudio
  3. Set the OpenAI API Key:
    • For Windows users, you can set the OPENAI_API_KEY environment variable using the Command Prompt:
setx OPENAI_API_KEY your_openai_api_key
  • Note: setx sets the variable for your current user account. For a system-wide setting, run the command with the /M flag in a Command Prompt with administrative privileges (setx /M OPENAI_API_KEY your_openai_api_key).
  • Restart any open Command Prompt or PowerShell windows after running this command to ensure the new variable is recognized.
  • For other platforms (macOS, Linux), you can set the environment variable using:
export OPENAI_API_KEY=your_openai_api_key
  4. Ensure .NET SDK is Installed: Make sure you have the latest version of the .NET SDK installed. You can check your version using:
dotnet --version

With these prerequisites in place, you are ready to start building AI-enhanced features into your .NET applications!
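Since every service below reads the key from the environment, it can help to fail fast if it isn't visible to your process (a common gotcha after setx is a terminal that wasn't restarted). A minimal sketch — the exception message is just an example:

```csharp
// Read the key the same way the services below do, and fail early
// with a clear message if it hasn't been set for this process.
var apiKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY");
if (string.IsNullOrWhiteSpace(apiKey))
{
    throw new InvalidOperationException(
        "OPENAI_API_KEY is not set. Set it and restart your terminal or IDE.");
}
Console.WriteLine($"OPENAI_API_KEY found ({apiKey.Length} characters).");
```

Dropping this at the top of Program.cs turns a confusing 401 from the API into an immediate, obvious startup error.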


Step 1: Generating Text Responses

The following OpenAiService class uses OpenAI’s chat completions API with the gpt-4o-mini model to generate responses based on a given prompt.

using OpenAI.Chat;

public class OpenAiService
{
    private readonly ChatClient _chatClient;

    public OpenAiService()
    {
        var apiKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY");
        _chatClient = new ChatClient("gpt-4o-mini", apiKey);
    }

    public async Task<string> GenerateResponseAsync(string prompt)
    {
        var messages = new List<ChatMessage>
        {
            new SystemChatMessage("You are a knowledgeable assistant."),
            new UserChatMessage($"Generate a response based on the prompt:\n\n{prompt}")
        };
        // ClientResult<ChatCompletion> converts implicitly to ChatCompletion,
        // so a typed declaration gives us direct access to Content.
        ChatCompletion completion = await _chatClient.CompleteChatAsync(messages);
        return completion.Content[0].Text;
    }
}

In this class:

  • The GenerateResponseAsync method takes a prompt and generates a response.
  • We initiate a conversation by sending a system message, setting the tone as a "knowledgeable assistant."
  • Finally, we pass the prompt to the model and return the generated response.
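If you want more control over the generation, CompleteChatAsync also accepts a ChatCompletionOptions instance. A sketch — the property names here reflect the 2.0.0 SDK, so verify them against the version you install:

```csharp
using OpenAI.Chat;

var apiKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY");
var chatClient = new ChatClient("gpt-4o-mini", apiKey);

// Tune the generation: lower temperature for more deterministic
// output, and a token cap to bound the length of the reply.
var options = new ChatCompletionOptions
{
    Temperature = 0.7f,
    MaxOutputTokenCount = 300
};

ChatCompletion completion = await chatClient.CompleteChatAsync(
    new ChatMessage[] { new UserChatMessage("Summarize why async/await matters in .NET.") },
    options);

Console.WriteLine(completion.Content[0].Text);
```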

Step 2: Converting Text to Speech

To convert text to speech, we’ll use OpenAI’s TTS functionality. This TextToSpeechService class converts a given text to an audio file and plays it.

using NAudio.Wave;
using OpenAI.Audio;

public class TextToSpeechService
{
    private readonly AudioClient _audioClient;

    public TextToSpeechService()
    {
        var apiKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY");
        _audioClient = new AudioClient("tts-1", apiKey);
    }

    public async Task ConvertTextToSpeechAsync(string text)
    {
        // ClientResult<BinaryData> converts implicitly to BinaryData.
        BinaryData speech = await _audioClient.GenerateSpeechAsync(text, GeneratedSpeechVoice.Onyx);
        using (var stream = File.OpenWrite("output.mp3"))
        {
            speech.ToStream().CopyTo(stream);
        }
        PlayAudio("output.mp3");
    }

    private void PlayAudio(string filePath)
    {
        using (var audioFile = new AudioFileReader(filePath))
        using (var outputDevice = new WaveOutEvent())
        {
            outputDevice.Init(audioFile);
            outputDevice.Play();
            // Play() is non-blocking; wait until playback finishes so the
            // using blocks don't dispose the device mid-playback.
            while (outputDevice.PlaybackState == PlaybackState.Playing)
            {
                Thread.Sleep(100);
            }
        }
    }
}

Key points:

  • ConvertTextToSpeechAsync accepts a text string, converts it into speech, and saves it as an MP3 file.
  • The PlayAudio method leverages NAudio for playback. It reads the MP3 file and plays it back on your system.
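Onyx is only one of the voices exposed by the GeneratedSpeechVoice enum. A sketch that picks a different voice and writes the bytes directly instead of copying a stream — the voice name and file name are just examples:

```csharp
using OpenAI.Audio;

var apiKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY");
var audioClient = new AudioClient("tts-1", apiKey);

// Any GeneratedSpeechVoice value works here; Alloy is just an example.
BinaryData speech = await audioClient.GenerateSpeechAsync(
    "Hello from the .NET text-to-speech example.", GeneratedSpeechVoice.Alloy);

// BinaryData exposes the raw MP3 bytes, so File.WriteAllBytes is a
// one-line alternative to opening and copying a stream.
File.WriteAllBytes("greeting.mp3", speech.ToArray());
```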

Step 3: Transcribing Audio to Text

The following SpeechToTextService class uses OpenAI’s Whisper model to transcribe audio files into text. This can be incredibly useful for processing voice input.

using OpenAI.Audio;

public class SpeechToTextService
{
    private readonly AudioClient _audioClient;

    public SpeechToTextService()
    {
        var apiKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY");
        _audioClient = new AudioClient("whisper-1", apiKey);
    }

    public async Task<string> TranscribeAudioAsync(string audioFilePath)
    {
        // ClientResult<AudioTranscription> converts implicitly to AudioTranscription.
        AudioTranscription transcription = await _audioClient.TranscribeAudioAsync(audioFilePath);
        return transcription.Text;
    }
}

This class:

  • Accepts an audio file path and transcribes the audio content into text.
  • The transcription result is returned as a plain text string.
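TranscribeAudioAsync also accepts an AudioTranscriptionOptions instance, which lets you hint the expected language and vocabulary to improve accuracy. A sketch — property names per the 2.0.0 SDK, so double-check them against your installed version:

```csharp
using OpenAI.Audio;

var apiKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY");
var audioClient = new AudioClient("whisper-1", apiKey);

// Hinting the language and likely vocabulary helps Whisper with
// domain-specific terms (model names, acronyms, etc.).
var options = new AudioTranscriptionOptions
{
    Language = "en",
    Prompt = "The audio discusses .NET, NuGet, and NAudio."
};

AudioTranscription transcription = await audioClient.TranscribeAudioAsync(
    "user_recording.wav", options);
Console.WriteLine(transcription.Text);
```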

Step 4: Recording Audio with NAudio

For applications that need to capture audio from the user, such as for speech-to-text input, you can use the NAudio library to record audio and save it as a .wav file. This is especially useful for Windows-based applications, where NAudio provides a straightforward API for handling audio input.

The StartRecordingAsync method below demonstrates how to record audio from the default microphone, saving it to a specified output file path.

using NAudio.Wave;

public class AudioRecordingService
{
    public async Task StartRecordingAsync(string outputFilePath, CancellationToken cancellationToken)
    {
        var waveFormat = new WaveFormat(44100, 16, 1); // 44.1 kHz, 16-bit, mono
        using (var waveIn = new WaveInEvent { WaveFormat = waveFormat })
        using (var writer = new WaveFileWriter(outputFilePath, waveFormat))
        {
            waveIn.DataAvailable += (sender, e) =>
            {
                writer.Write(e.Buffer, 0, e.BytesRecorded);
            };

            waveIn.StartRecording();

            try
            {
                await Task.Delay(Timeout.Infinite, cancellationToken); // Keeps recording until cancellation
            }
            catch (TaskCanceledException)
            {
                waveIn.StopRecording();
            }
        }
    }
}

In this code:

  1. Initialize Audio Format: We set up the audio format to 44.1 kHz, 16-bit, mono. These settings provide good quality for most voice recordings.
  2. Create Audio Input and Writer: We use WaveInEvent for capturing audio from the default microphone and WaveFileWriter to write the audio data to a file.
  3. Handle Data Available Event: As audio data becomes available (captured in chunks), it is written to the file through the writer.
  4. Start and Stop Recording: Recording starts with StartRecording() and will continue until the provided CancellationToken is canceled, at which point StopRecording() is called to end the recording.

Usage Example

To start recording audio, you can call this method and provide a file path and cancellation token:

var recordingService = new AudioRecordingService();
var cancellationTokenSource = new CancellationTokenSource();

Console.WriteLine("Recording audio. Press any key to stop...");

_ = recordingService.StartRecordingAsync("recording.wav", cancellationTokenSource.Token);

// Wait for a key press to stop recording
Console.ReadKey();
cancellationTokenSource.Cancel();

This example will begin recording audio and save it to recording.wav until a key is pressed, triggering the cancellation of the recording.

With the addition of audio recording using NAudio, you now have a full toolkit for handling text generation, text-to-speech, speech-to-text, and audio recording within your .NET application. This setup provides a complete pipeline for interactive and conversational applications in .NET, enabling voice-based input, audio output, and seamless integration with OpenAI’s powerful language models.

Putting It All Together

With these services implemented, you have the foundation for a fully interactive .NET application that can generate text, convert text to speech, transcribe spoken input, and record audio. Here’s an example of how to use all four services in a cohesive application.

var openAiService = new OpenAiService();
var ttsService = new TextToSpeechService();
var sttService = new SpeechToTextService();
var recordingService = new AudioRecordingService();
var cancellationTokenSource = new CancellationTokenSource();

// Step 1: Generate a Text Response
string prompt = "Tell me something interesting about AI.";
string generatedText = await openAiService.GenerateResponseAsync(prompt);
Console.WriteLine("Generated Text: " + generatedText);

// Step 2: Convert Generated Text to Speech
await ttsService.ConvertTextToSpeechAsync(generatedText);

// Step 3: Record Audio Input
Console.WriteLine("Recording audio input. Press any key to stop recording...");
_ = recordingService.StartRecordingAsync("user_recording.wav", cancellationTokenSource.Token);
Console.ReadKey();
cancellationTokenSource.Cancel();

// Step 4: Transcribe Recorded Audio
string transcribedText = await sttService.TranscribeAudioAsync("user_recording.wav");
Console.WriteLine("Transcribed Text: " + transcribedText);

Conclusion

Using OpenAI’s .NET SDK alongside NAudio, you can bring powerful AI capabilities into your .NET applications. This integration covers:

  • Text Generation: Generate contextually relevant responses.
  • Text-to-Speech: Convert generated text to audio for a more interactive experience.
  • Speech-to-Text: Capture and transcribe user input.
  • Audio Recording: Enable seamless audio capture for user interactions.

This setup provides a complete, interactive pipeline that can power chatbots, virtual assistants, or any voice-enabled application. By following this guide, you’ll have a solid foundation for enhancing your .NET applications with AI-powered, voice-driven features.

Before You Go...

Did this guide help you level up your .NET skills with OpenAI integration? If so, let’s spread the knowledge! Give it a like, share with fellow devs, or drop a comment below. Every interaction helps boost this content, bringing these tips to more developers. And hey—if it didn’t deliver, no hard feelings; your silence speaks louder than clicks! 😉
