Combining GPT and Wolfram Alpha

JoeStrout - Jan 16 '23 - Dev Community

An interesting essay appeared on Wolfram|Alpha's blog today:
Wolfram|Alpha as the Way to Bring Computational Knowledge Superpowers to ChatGPT. In it, the author (Stephen Wolfram?) argues that ChatGPT and Wolfram|Alpha complement each other — the latter being particularly good at numerical, mathematical, and computational tasks where ChatGPT is weak.

So I decided to try this for myself! Free API keys are available for both GPT-3.5 and Wolfram|Alpha, which makes them accessible from environments like Mini Micro via http.get and http.post. My basic idea: make a chat room with both of these powerful AIs on hand, and let either one jump in as it can.

My first thought was to use Wolfram|Alpha's Fast Query Recognizer API, which sounds like it's intended for exactly this purpose. But I couldn't get it to work; no matter what I tried, it always replied "invalid appid" (using the same App ID that works just fine with other endpoints).

So in the end, I just pass every query to Wolfram|Alpha first; if it can't deal with the query, it returns an error, and I hand the query to GPT instead. As a refinement, the user can send a query straight to either AI by prefacing the input with "Wolfram" or "GPT" (or any of several synonyms). And if the input is just one of these names plus a question mark, the program repeats the last input, directed at that AI.
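
The addressing rule can be tried out away from Mini Micro. Here is a minimal Python sketch of it (not the MiniScript the program actually uses, which appears in full further down). The function names `split_address` and `resolve`, the shortened name lists, and the word-boundary check are illustrative assumptions, not part of the original program.

```python
import string

# Shortened, illustrative name lists; the real program accepts more synonyms.
WOLFRAM_NAMES = ["wolframalpha", "wolfram alpha", "wolfram"]
GPT_NAMES = ["chatgpt", "gpt"]


def split_address(text, names):
    """Return text with a leading name (and optional comma) removed,
    or None if text does not start with one of the names."""
    lowered = text.lower()
    for name in names:
        if not lowered.startswith(name):
            continue
        rest = text[len(name):]
        # Require a word boundary so "GPTs are fun" is not an address.
        if rest and rest[0] in string.ascii_letters:
            continue
        return rest.lstrip(",").strip()
    return None


def resolve(text, last_input):
    """Decide who is addressed and what the query is.

    Returns (target, query), where target is "wolfram", "gpt", or None
    (None means nobody was addressed directly)."""
    for target, names in (("wolfram", WOLFRAM_NAMES), ("gpt", GPT_NAMES)):
        rest = split_address(text, names)
        if rest is not None:
            # A bare "Wolfram?" or "GPT?" repeats the previous input.
            return target, (last_input if rest == "?" else rest)
    return None, text


print(resolve("Wolfram, how many cattle are there in Peru?", ""))
print(resolve("GPT?", "How many cows do they have?"))
```

A target of None is the case where Wolfram|Alpha gets the first try and GPT serves as the fallback.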

Example

You didn't come here to hear me talk — you came to see what the AIs have to say! So let's dive right in with an example session.

  User: What's the third largest country in South America?
+ WolframAlpha: The 3rd largest country in South America by 
+ total area is Peru. The total area of Peru is about 496225 
+ square miles.
  User: How many cows do they have?
- GPT: According to the Food and Agriculture Organization of 
- the United Nations, Peru had an estimated 11.7 million 
- cattle as of 2019.
  User: Wolfram, how many cattle are there in Peru?
+ WolframAlpha: In 2020, Peru had about 5.64 million cattle 
+ as livestock.
  User: GPT?
- GPT: According to the Food and Agriculture Organization of 
- the United Nations, Peru had an estimated 11.7 million 
- cattle as of 2019.

This example illustrates the key benefits of this team-AI system. The first question, a factual geography question, is the sort of thing Wolfram|Alpha is good at. It speaks up and answers my question directly, and you can trust its answer. But my follow-up question, "How many cows do they have?", was not understood by Wolfram|Alpha, whose natural language processing (NLP) is much more limited than GPT's. So it declined to answer, and GPT jumped in, giving me a plausible-sounding answer of 11.7 million cows.

Plausible, but is it right? I rephrased my question and directed it to Wolfram|Alpha. This time it understood me, and gave me an accurate answer of 5.64 million cows. GPT was off by more than a factor of two. So I asked GPT to reconsider (saying simply "GPT?"). But GPT stuck to its guns, repeating its previous incorrect answer.

Wolfram|Alpha Knows Its Facts

As the example above demonstrates, when asked a factual question, GPT just makes up an answer. Sometimes it's in the right ballpark, and other times it's wildly off.

Having them both in the room is helpful because if your question is one that Wolfram|Alpha can answer, it will answer first (and correctly). If not, GPT will make something up, but then you can put follow-up questions to Wolfram|Alpha to check GPT's facts.

Wolfram|Alpha Is Really Stupid

Despite being a mathematical genius, Wolfram|Alpha often falls down and fails to understand even simple queries. (This happens so often, one might even suspect that the examples in the essay were cherry-picked!) For example, the first thing I tried today seemed like a question that would be right up Wolfram|Alpha's alley: how heavy are 30 elephants?

Check out this exchange:

  User: How much would 30 elephants weigh?
- GPT: According to WolframAlpha, a group of 30 elephants 
- would weigh approximately 1,800,000 pounds.
  User: Wolfram?
+ WolframAlpha: Error: Wolfram|Alpha did not understand your 
+ input
  User: Wolfram, how much does an elephant weigh?
+ WolframAlpha: The typical weight of an African bush 
+ elephant is 4400 to 13000 pounds.
  User: Wolfram, what is 30 times that?
+ WolframAlpha: 30 times Interval of the list Rational of
+ 200000000000 and 45359237, Rational of 610000000000 and 
+ 45359237 pounds is equivalent to a mass of 60 million to 
+ 183 million grams.
  User: What is that in pounds?
+ WolframAlpha: 6000000000000 over 45359237 to 18300000000000 
+ over 45359237 pounds is equivalent to 160754 to 490299 troy 
+ pounds.
  User: GPT, what does one elephant weigh?
- GPT: According to WolframAlpha, the typical weight of an 
- African bush elephant is 4400 to 13000 pounds.
  User: What is 30 times 13000 pounds?
+ WolframAlpha: 30 times 13000 troy pounds is equivalent to a 
+ mass of about 320914 pounds.
  User: GPT?
- GPT: According to WolframAlpha, 30 times 13000 pounds is 
- equivalent to about 320914 pounds.

Though I expected Wolfram|Alpha to take this one, it did not, so GPT confidently responded that "According to WolframAlpha", the answer was 1.8 million pounds. This is wildly wrong; as previously noted, GPT is just making stuff up. And when I put the same question to Wolfram|Alpha directly, I got "Error: Wolfram|Alpha did not understand your input."

So I backed up and asked it "how much does an elephant weigh?" It gave me a good answer. Then I asked it "what is 30 times that?" (While Wolfram|Alpha does not take the entire chat history as context, as GPT does, it does have some limited internal context tracking.) It gave me an answer but in terms of "Interval of the list Rational" and grams, making it very hard to understand. And asking it to convert that to pounds made it hardly any more comprehensible.

Incidentally, the correct answer (which I finally computed myself with a calculator) is 132,000 to 390,000 pounds, so GPT's answer was off by almost a factor of 5. Wolfram|Alpha's last answer was correct, I guess, except that it read my "pounds" as "troy pounds" for some reason. A troy pound is lighter than a normal pound: about 373 grams versus about 454 grams, or roughly 82% (not simply 12/16, because troy ounces are heavier than ordinary ounces). I learned about troy pounds from this later exchange:

  User: What is a "troy pound"?
+ WolframAlpha: The definition of the word "troy pound" is: 
+ an apothecary weight equal to 12 ounces or 373.242 grams.
  User: How many ounces in a normal pound?
- GPT: According to WolframAlpha, a normal pound is equal to 
- 16 ounces.
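
The elephant arithmetic is easy to re-check. Here is a small Python sketch, assuming the standard definitions of both pounds in grains (5760 for the troy pound, 7000 for the ordinary avoirdupois pound). The helper name `troy_to_avoirdupois` is made up for illustration.

```python
# Both pounds are defined in terms of the grain.
GRAINS_PER_TROY_POUND = 5760
GRAINS_PER_AVOIRDUPOIS_POUND = 7000


def troy_to_avoirdupois(troy_pounds):
    """Convert troy pounds to ordinary (avoirdupois) pounds."""
    return troy_pounds * GRAINS_PER_TROY_POUND / GRAINS_PER_AVOIRDUPOIS_POUND


# Weight range Wolfram|Alpha quoted for one elephant, in ordinary pounds.
low, high = 4400, 13000
herd_low, herd_high = 30 * low, 30 * high

# What Wolfram|Alpha computed instead: 30 * 13000 *troy* pounds,
# expressed in ordinary pounds.
misread = troy_to_avoirdupois(30 * 13000)

print(herd_low, herd_high)   # the range for 30 elephants
print(round(misread))        # matches the 320914 in the transcript
```

So the 320914 figure is what you get by treating 390,000 as troy pounds and converting to ordinary ones.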

Notice that GPT often says "According to WolframAlpha". This was an unexpected side-effect of my initial context prompt, which includes "Also present is WolframAlpha, which is very good at math and numerical facts." I think GPT is trying to play along by pretending to consult WolframAlpha even when that bot has not actually chimed in.

GPT Pays Attention

GPT does get the entire chat history as context, and so any fact that Wolfram|Alpha says first, it will accept and use.

  User: How far is it to Ceres right now?
+ WolframAlpha: The distance from Earth to 1 Ceres at 10:51 
+ A.M. Mountain Standard Time, Monday, January 16, 2023 is 
+ about 2.06 astronomical units.
  User: How far is that in km?
+ WolframAlpha: The distance from Earth to 1 Ceres at 10:52 
+ A.M. Mountain Standard Time, Monday, January 16, 2023 is 
+ about 308 million kilometers.
  User: How long would it take to get there at 100 kph?
- GPT: At a speed of 100 kilometers per hour, it would take 
- approximately 3,080,000 hours or 128,333.33 days to reach 1 
- Ceres from Earth.

This is a good example of the two working together; Wolfram|Alpha understood my first two queries, and gave me a reliable result, but failed to understand my third one. Stepping in, GPT did the math, using the numbers Wolfram|Alpha had already given us. I checked this result by hand, and it was correct. And in other experiments, it seems like GPT is usually reliable at this sort of thing, if the facts it needs are already in the conversation.

If it doesn't have the facts, of course, it just makes something up. In a fresh session:

  User: GPT, how long would it take to get to Ceres at 100 
  kph?
- GPT: It would take approximately 7.5 years to reach the 
- dwarf planet Ceres at a speed of 100 kilometers per hour.

7.5 years? No. The correct answer (128,333 days) is over 351 years.
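
For anyone who wants to verify the numbers, here is the calculation as a short Python sketch. It uses the 308 million km figure from Wolfram|Alpha above and assumes a year of 365.25 days.

```python
distance_km = 308e6   # Earth-to-Ceres distance reported by Wolfram|Alpha
speed_kph = 100

hours = distance_km / speed_kph
days = hours / 24
years = days / 365.25   # assumption: 365.25 days per year

print(f"{hours:,.0f} hours, {days:,.2f} days, {years:.1f} years")
```

The hours and days agree with GPT's in-context answer, and the year count confirms that 7.5 years is not remotely close.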

GPT Won't Back Down

Once GPT gives a wrong answer, it sticks to its guns. Even if you try to draw its attention to its error! Watch GPT steadfastly arguing that Ceres (radius 292 miles) is bigger than Mercury (radius 1516 miles).

  User: What moons in the solar system are bigger than 
  Mercury?
+ WolframAlpha: The planetary moons with average radius 
+ bigger than that of Mercury are Ganymede and Titan.
  User: Are there any dwarf planets that big?
- GPT: Yes, there are five known dwarf planets in the Solar 
- System that are larger than Mercury, including Ceres, 
- Haumea, Makemake, Eris, and Pluto.
  User: what are the five biggest dwarf planets?
+ WolframAlpha: The five biggest dwarf planets by average 
+ radius are Pluto, 136199 Eris (2003 UB313), and 1 Ceres. 
+ The average radius of Pluto is about 739.37 miles. The 
+ average radius of 136199 Eris (2003 UB313) is about 722.7 
+ miles. The average radius of 1 Ceres is about 292 miles.
  User: What is the radius of Mercury?
+ WolframAlpha: The atomic radius of mercury is 150 
+ picometers.
  User: No I mean the planet Mercury.
- GPT: The average radius of the planet Mercury is about 1516 
- miles.
  User: Wolfram, what is the radius of the planet Mercury?
+ WolframAlpha: The average radius of Mercury is about 1520 
+ miles.
  User: Is Ceres bigger than Mercury?
- GPT: Yes, Ceres is larger than Mercury. The average radius 
- of Ceres is about 292 miles, whereas the average radius of  
- Mercury is about 1516 miles.
  User: 292 is not larger than 1516.
- GPT: You are correct, 292 is not larger than 1516. However, 
- the average radius of Ceres is about 292 miles, whereas the  
- average radius of Mercury is about 1516 miles, making Ceres 
- larger than Mercury.

Now we know what it feels like to be gaslit by an AI!

How it works

I described in a previous post how to get set up with Mini Micro and interface with GPT-3. For this project, I just added a bit more code to interface with Wolfram|Alpha.

The key bit is a new wolframResponse function:

wolframResponse = function(query)
    url = "http://api.wolframalpha.com/v1/conversation.jsp" +
    "?appid=" + wolframAppId +
    "&i=" + urlEncode(query)

    if conversationID then
        url = url + "&conversationid=" + conversationID
    end if
    if conversationS then
        url = url + "&s=" + conversationS
    end if

    rawResult = http.get(url)
    result = json.parse(rawResult)
    if result.hasIndex("error") and result.error then
        return "Error: " + result.error
    end if
    globals.conversationID = result.conversationID
    if result.hasIndex("s") then
        globals.conversationS = result.s
    end if
    return result.result
end function

In addition to returning the response, this updates a couple of global variables, conversationID and conversationS, to store small tokens that the API uses to keep track of context for follow-up questions.
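
To make the token handling concrete without touching the network, here is a minimal Python sketch of the same idea. The endpoint and the field names (`result`, `error`, `conversationID`, `s`) are taken from the MiniScript above. The function names, the `state` dictionary, and the sample reply are assumptions made up for illustration, not real API output.

```python
import json
from urllib.parse import quote, urlencode

# Endpoint taken from the MiniScript code above.
BASE_URL = "http://api.wolframalpha.com/v1/conversation.jsp"


def build_url(app_id, query, state):
    """Build the request URL, adding any context tokens from earlier replies."""
    params = {"appid": app_id, "i": query}
    if state.get("conversationID"):
        params["conversationid"] = state["conversationID"]
    if state.get("s"):
        params["s"] = state["s"]
    # quote_via=quote percent-encodes spaces as %20, like urlEncode above.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


def handle_reply(raw_json, state):
    """Parse a reply, remember its context tokens, and return the text."""
    reply = json.loads(raw_json)
    if reply.get("error"):
        return "Error: " + reply["error"]
    state["conversationID"] = reply["conversationID"]
    if "s" in reply:
        state["s"] = reply["s"]
    return reply["result"]


# Hypothetical reply, invented for this example.
state = {}
first = '{"result": "4400 to 13000 pounds.", "conversationID": "abc123", "s": "7"}'
print(handle_reply(first, state))
print(build_url("YOUR-APP-ID", "what is 30 times that?", state))
```

The second URL carries both tokens, which is what lets a follow-up like "what is 30 times that?" refer back to the previous answer.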

The main program is then elaborated with a bit of logic to (1) check whether the input directly addresses either AI, and (2) if not, give WolframAlpha a first crack at it, falling back on GPT if the WolframAlpha API returns an error.
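
The fallback step can be summarized in a few lines. This is a hedged Python sketch rather than the actual MiniScript: `answer`, `fake_wolfram`, and `fake_gpt` are hypothetical names, and the two responders are stand-ins that make no network calls.

```python
def answer(query, target, ask_wolfram, ask_gpt):
    """Return (responder, reply) for a query.

    target is "wolfram", "gpt", or None when nobody was addressed directly."""
    if target == "gpt":
        return "GPT", ask_gpt(query)
    reply = ask_wolfram(query)
    if target is None and reply.startswith("Error:"):
        # Wolfram|Alpha declined, so GPT gets the question instead.
        return "GPT", ask_gpt(query)
    # Either Wolfram|Alpha answered, or it was addressed directly
    # (in which case its error message is shown as-is).
    return "WolframAlpha", reply


# Stand-in responders, purely for illustration.
def fake_wolfram(query):
    if "cows" in query:
        return "Error: Wolfram|Alpha did not understand your input"
    return "The 3rd largest country in South America by total area is Peru."


def fake_gpt(query):
    return "(GPT's best guess)"


print(answer("What's the third largest country in South America?", None,
             fake_wolfram, fake_gpt))
print(answer("How many cows do they have?", None, fake_wolfram, fake_gpt))
```

Note that an error is only a cue to fall back when nobody was addressed by name; a query sent directly to Wolfram|Alpha shows its error message, as in the "Wolfram?" exchange earlier.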

Here's the complete program, in case you're curious or want to try it yourself.

import "json"
import "stringUtil"
import "listUtil"
import "dateTime"

// Wolfram|Alpha App ID (substitute your own):
wolframAppId = "V4Q9R5-UEKX2X6XAV"

urlSafeChars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" +
  "abcdefghijklmnopqrstuvwxyz0123456789-_.~"

urlEncode = function(s)
    bytes = new RawData
    bytes.resize s.len * 3
    len = bytes.setUtf8(0, s)
    result = []
    for i in range(0, len-1)
        b = bytes.byte(i)
        c = char(b)
        if urlSafeChars.contains(c) then
            result.push c
        else
            result.push "%" + hex2(b)
        end if
    end for
    return result.join("")
end function

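// Fast Query Recognizer check. Not called by the main loop below, because
// this endpoint always replied "invalid appid" (as described earlier).
wolframCheckQuery = function(query)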
    url = "http://www.wolframalpha.com/queryrecognizer/query.jsp" +
    "?appid=" + wolframAppId +
    "&mode=Default" +
    "&output=json" +
    "&i=" + urlEncode(query)

    globals.rawResult = http.get(url)
    globals.result = json.parse(rawResult)
    if result.hasIndex("error") and result.error then
        print "Error: " + result.msg
        return false
    end if
    return result.query.accepted
end function

// tokens Alpha uses to maintain conversational context:
conversationID = ""
conversationS = ""

wolframResponse = function(query)
    url = "http://api.wolframalpha.com/v1/conversation.jsp" +
    "?appid=" + wolframAppId +
    "&i=" + urlEncode(query)

    if conversationID then
        url = url + "&conversationid=" + conversationID
    end if
    if conversationS then
        url = url + "&s=" + conversationS
    end if

    globals.rawResult = http.get(url)
    globals.result = json.parse(rawResult)
    if result.hasIndex("error") and result.error then
        return "Error: " + result.error
    end if
    globals.conversationID = result.conversationID
    if result.hasIndex("s") then
        globals.conversationS = result.s
    end if
    return result.result
end function

context = []
context.push "Assistant is a large language model capable of "
context.push "helping the user in many ways. Also present is "
context.push "WolframAlpha, which is very good at math and "
context.push "numerical facts."
context.push "Knowledge cutoff: 2022-09"
context.push "Current date: " + dateTime.now("yyyy-MM-dd")

apiKey = function
    if outer.hasIndex("_apiKey") then return _apiKey
    data = file.readLines("/usr/API-key.txt")
    if data == null then
        print "API-key.txt file not found."
        exit
    end if
    outer._apiKey = data[0]
    return _apiKey
end function

gptResponse = function(prompt, temperature=0.5)
    url = "https://api.openai.com/v1/completions"
    headers = {}
    headers["Content-Type"] = "application/json"
    headers["Authorization"] = "Bearer " + apiKey
    data = {}
    data.model = "text-davinci-003"
    data.prompt = prompt
    data.temperature = temperature
    data.max_tokens = 2048

    globals.rawResult = http.post(url, json.toJSON(data), headers)
    globals.result = json.parse(rawResult)
    if result == null or not result.hasIndex("choices") then
        return rawResult
    end if
    return result.choices[0].text.trim
end function

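// If s starts with one of possibleNames (ignoring case), return
// [name, rest of s]; otherwise return false.
splitAddress = function(s, possibleNames)
    letters = "abcdefghijklmnopqrstuvwxyz"
    slower = s.lower
    for name in possibleNames
        if not slower.startsWith(name) then continue
        rest = s[name.len:]
        // Require a word boundary after the name, so that an input
        // like "Was that right?" is not read as addressing "wa".
        if rest != "" then
            if letters.contains(rest[0].lower) then continue
        end if
        // Strip an optional comma. (Testing startsWith rather than
        // rest[0] avoids an index error when rest is empty.)
        if rest.startsWith(",") then rest = rest[1:]
        return [name, rest.trim]
    end for
    return false
end function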

wolframNames = ["wolframalpha", "wolfram alpha", "wolfram", "alpha", "wa"]
gptNames = ["gpt", "chatgpt", "openai", "assistant"]

clear
print "AI Wonder Twins, unite!"
_printMark "(Enter `quit` to exit.)"
lastInput = ""
while true
    inp = input(">")
    if inp.lower == "quit" or inp.lower == "exit" then break

    // figure out who should take this query (and strip off any direct address)
    resp = ""
    wolframQuery = splitAddress(inp, wolframNames)
    if wolframQuery then
        inp = wolframQuery[1]
        responder = "WolframAlpha"
        if inp == "?" then inp = lastInput
    end if
    gptQuery = splitAddress(inp, gptNames)
    if gptQuery then
        inp = gptQuery[1]
        responder = "Assistant"
        if inp == "?" then inp = lastInput
    end if
    if not wolframQuery and not gptQuery then
        resp = wolframResponse(inp)
        if resp.startsWith("Error:") then
            resp = ""
            responder = "Assistant"
        else
            responder = "WolframAlpha"
        end if
    end if
    lastInput = inp

    context.push "User: " + inp
    oldColor = text.color
    if responder == "WolframAlpha" then
        if not resp then resp = wolframResponse(inp)
        context.push responder + ": " + resp
        text.color = "#66CC66"
    else
        context.push responder + ": "
        resp = gptResponse(context.join(char(13)))
        context[-1] = context[-1] + resp
        text.color = color.aqua
        responder = "GPT"
    end if

    for line in (responder + ": " + resp).wrap
        print line
    end for
    text.color = oldColor
end while

Conclusions

This was a fun and interesting exercise, and is actually useful — it's nice not to have to think "which AI should I go to for this answer," but instead to just ask your question, and let your team tackle it. It's also sometimes helpful that GPT can see the whole conversation, and often fill in where Wolfram|Alpha fails to understand your question.

Wolfram|Alpha is absolutely brilliant at actually understanding numerical and computational concepts, and manipulating these in accurate ways. However, it is still very dumb when it comes to comprehending English. I hope that their R&D team is feverishly working to bolt a large language model like GPT to their Wolfram Language back-end, because right now Wolfram|Alpha's natural-language front end fails more often than it succeeds.

One thing I haven't tried yet is conjoining these bots in a different way: asking GPT to write Wolfram Language code that represents my query, and then passing that code on to Wolfram's API. That would essentially do what I hope their R&D team is up to, but without waiting for them to do it.

I'll be sure to post again if I try that, so if this sort of thing interests you, be sure to follow! And if you have any other ideas or feedback, please leave it in the comments below.
