DEMO - Voice to PDF - Complete PDF documents with voice commands using the Claude 3 Opus API

Juan Stoppa - Apr 27 - Dev Community

I spent some time last weekend exploring the Claude 3 Opus API from Anthropic, since I had heard many comments about its potential to surpass ChatGPT, especially when tasked with solving complex problems such as writing code.

As I was looking into its capabilities, I decided to build an app that lets you complete a PDF form with voice commands, and it ended up working much better than I expected.

The idea of the app was to:

  1. Upload a PDF form.
  2. Record voice commands to fill in the form.
  3. Download the completed PDF form.

You can see the demo below:

You can find the app on GitHub at https://github.com/jstoppa/voice_to_pdf

How did I build it?

I spent more time getting the functionality working than building the prompt :-). I used plain JavaScript and Node.js, as I like to keep these demos as plain as possible so I can later pick up the code and use it in other frameworks without relying on framework-specific nuances.

The Frontend app:

It runs using Parcel, which is very simple and easy to set up. The app has four files:

  • app.js: the main module that brings the entire solution together.
  • dragDrop.js: basic primitives to handle the drag-and-drop file functionality.
  • pdfHandler.js: this file contains the logic for two fundamental actions:
    • readPdf: reads the dropped file and displays it on the screen. It uses PDF.js to load the file, extract all the form fields and render them in the browser.
    • writePdf: writes the final completed PDF after receiving the response from the Claude 3 Opus API. It uses the PDF-lib library to manipulate and modify the final file.
  • speechRecognition.js: this uses the lesser-known Web Speech API that ships with the browser and works impressively well. The file handles listening to the voice and displaying the transcribed text on the screen (see the sketch after this list).
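As an aside, here's a minimal sketch of how speechRecognition.js might wire up the Web Speech API; the browser calls are standard, but the onTranscript callback is just an illustration, not the post's exact code:

// Minimal sketch of the Web Speech API wiring; onTranscript is a hypothetical callback
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;

export function startListening(onTranscript) {
    const recognition = new SpeechRecognition();
    recognition.continuous = true;     // keep listening until explicitly stopped
    recognition.interimResults = true; // surface partial results while the user speaks

    recognition.onresult = (event) => {
        // Join all results received so far into a single running transcript
        const transcript = Array.from(event.results)
            .map((result) => result[0].transcript)
            .join(' ');
        onTranscript(transcript);
    };

    recognition.start();
    return recognition; // the caller can invoke recognition.stop() when done
}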

The Backend app:

It's nothing impressive, just a single Node.js endpoint (server.js) that proxies the API call to the Claude Opus API. It's mainly there to keep the API key on the server side and to handle the CORS constraints enforced by the browser.
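To give an idea, here is a minimal sketch of such a proxy, assuming Express, Node 18+ (for the built-in fetch) and the standard Anthropic Messages endpoint; the /api/complete route name is just an illustration:

// server.js - proxy sketch (Express assumed; /api/complete is an illustrative route name)
const express = require('express');
const app = express();
app.use(express.json());

app.post('/api/complete', async (req, res) => {
    const response = await fetch('https://api.anthropic.com/v1/messages', {
        method: 'POST',
        headers: {
            'x-api-key': process.env.ANTHROPIC_API_KEY, // the key never reaches the browser
            'anthropic-version': '2023-06-01',
            'content-type': 'application/json',
        },
        body: JSON.stringify({
            model: 'claude-3-opus-20240229',
            max_tokens: 1024,
            system: req.body.system,     // the task definition described below
            messages: req.body.messages, // the user message carrying the voice transcript
        }),
    });
    res.json(await response.json());
});

app.listen(3000);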

The most important bit is really the prompt, which consists of three parts:

  • The task definition: this goes inside the system parameter when calling the Claude Opus API (read more here). The prompt is the following:
You are tasked with assisting in the completion of a PDF questionnaire using a provided JSON dataset.

The JSON data includes the following fields for each question in the form:
- id
- question
- isValidQuestion
- answer

Your specific duties include:

1. Question validation: From the data provided for processing, you need to
    a. Analyse the "question" field
    b. Determine if the question is valid based on the "isValidQuestion" field
    c. If the question is valid, incorporate the corresponding answer provided in the answer field using the data provided by the user role

2. Strict Adherence to Data: Under no circumstances should you alter, rephrase, or modify the question or id fields; your main task is to populate the isValidQuestion and answer fields

3. Format Requirement: Return the results strictly in JSON format. Ensure that the output contains only the required information, maintaining the integrity and structure of the original JSON, including the id fields.

4. Valid Questions: Only return the questions that are valid based on the "isValidQuestion" field, while still preserving the original id

Important Note: Do not add extraneous text or information outside of the specified JSON structure.
  • List of fields in the PDF: this is a JSON structure with the list of fields in the PDF form, generated dynamically from the uploaded document (a sketch of how that could work follows the example below). As the prompt above describes, the task for Claude is to:
    • Check whether the question is valid by setting isValidQuestion to true or false.
    • Answer the question using the context given in the user role.
[
    {
        "id": 0,
        "question": "First name",
        "isValidQuestion": true,
        "answer": "John"
    },
    {
        "id": 1,
        "question": "Last name",
        "isValidQuestion": true,
        "answer": "Doe"
    }
]
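For context, a sketch of how that structure could be generated from the uploaded form. The post reads fields with PDF.js, but pdf-lib (already used for writing) exposes them just as easily, so treat this as an assumed variant rather than the post's exact code:

// Sketch: building the question list from the form fields (pdf-lib variant, assumed)
import { PDFDocument } from 'pdf-lib';

async function buildQuestionList(pdfBytes) {
    const pdfDoc = await PDFDocument.load(pdfBytes);
    const fields = pdfDoc.getForm().getFields();
    return fields.map((field, id) => ({
        id,
        question: field.getName(), // the field name doubles as the question text
        isValidQuestion: false,    // populated by Claude
        answer: '',                // populated by Claude from the voice transcript
    }));
}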
  • User Role: this describes the context, which is the text generated by the speech-to-text API. The message is defined as below, where the contextualText variable holds that text.
{
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "This is the contextual text that you need to use to complete the questionnaire\n\n ${contextualText}"
                }
            ]
        }
    ]
}
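Once Claude returns the completed JSON, writePdf can map the answers back onto the form. Here's a sketch with pdf-lib, assuming every field is a text field (the real code presumably handles other field types too):

// Sketch: writing Claude's answers back into the PDF with pdf-lib (text fields assumed)
import { PDFDocument } from 'pdf-lib';

async function writePdf(originalBytes, answers) {
    const pdfDoc = await PDFDocument.load(originalBytes);
    const form = pdfDoc.getForm();

    for (const { question, isValidQuestion, answer } of answers) {
        if (!isValidQuestion) continue; // skip fields Claude marked as invalid
        form.getTextField(question).setText(answer);
    }

    const filledBytes = await pdfDoc.save(); // Uint8Array of the completed PDF
    // Trigger a download in the browser
    const blob = new Blob([filledBytes], { type: 'application/pdf' });
    const link = document.createElement('a');
    link.href = URL.createObjectURL(blob);
    link.download = 'completed-form.pdf';
    link.click();
}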

And that's pretty much it. You can see how, with a very simple app, you can perform a reasonably advanced task that would have been unimaginable just a few years ago.

If you liked this post, you might also like Exploring the GPT-4 with Vision API using Images and Videos. And if you are completely new to working with AI, I suggest you check out Getting started with OpenAI using Python in Windows and Getting started with Azure OpenAI.
