🪄 Enhance/fix data quality with OpenAI's API 🦾

adriens - Mar 30 '23 - - Dev Community

❔ About

🤔 Sometimes a lack of data, or data quality issues, prevents you from producing insights.

💡 What if you could call AI to the rescue to fix or enhance some of that data?

I first started with some prompt engineering on ChatGPT.

☝️ Notice

Guessing gender from first names can seem useless, a bit dumb, or nerdy. Yes, but...

  • 🗺️ This work relies on OpenAI... which acts as a universal first-name parser across languages
  • 💡 This work is just an illustration of how prompt engineering and OpenAI's API can help review and fix many kinds of data quality issues... and a concrete example of how you could enrich your enterprise data pipeline

🎯 Target

The purpose of this article is to see how OpenAI's API can help on a very specific, testable dataset.

📝 Kaggle Notebook

In this short notebook I will:

  1. 📥 Download data
  2. 🐼 Load data in pandas
  3. 🦾 Call OpenAI's API to guess the gender for each first name
  4. ⚖️ Compare guessed vs. real data
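
The four steps above can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual code: the column names `prenom` and `genre`, the helper names, and the `guess` callable (standing in for the API call) are all assumptions made for illustration.

```python
from typing import Callable

import pandas as pd


def add_guessed_gender(
    df: pd.DataFrame,
    guess: Callable[[str], str],
    name_col: str = "prenom",  # hypothetical column name
) -> pd.DataFrame:
    """Return a copy of df with an extra 'guessed_gender' column.

    Each distinct first name is passed to `guess` only once, which
    keeps the number of (paid) API calls down.
    """
    unique_names = df[name_col].dropna().unique()
    guesses = {name: guess(name) for name in unique_names}
    out = df.copy()
    out["guessed_gender"] = out[name_col].map(guesses)
    return out


def accuracy(
    df: pd.DataFrame,
    real_col: str = "genre",  # hypothetical column name
    guessed_col: str = "guessed_gender",
) -> float:
    """Share of rows where the guessed value equals the recorded one."""
    return float((df[real_col] == df[guessed_col]).mean())
```

With a DataFrame loaded through `pd.read_csv` and a `guess` function wrapping the API call, `accuracy(add_guessed_gender(df, guess))` gives the share of rows where the guess matches the recorded value.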

🍿 Demo

🗃️ Input Dataset

I have used the top-10-prenoms-a-noumea-depuis-1860 open dataset from data.gouv.nc:

Top 10 des Prénoms à Nouméa depuis 1860 — Open Data NC

This dataset lists the ten most frequently given first names in Nouméa since 1860, according to the civil registry. Update frequency: annual

🤖 The text-davinci-003 model

I have used text-davinci-003, one of the GPT-3.5 models, as these models can:

"understand and generate natural language or code."
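
To make the idea concrete, here is a hedged sketch of how a prompt for a completion model could be built and how its free-text answer could be normalised. The prompt wording, the `U` (unknown) fallback, and the function names are assumptions for illustration and are not the prompt used in the notebook.

```python
def build_prompt(first_name: str) -> str:
    """Build a prompt that asks for a single-letter answer."""
    return (
        "Guess the gender usually associated with this first name.\n"
        "Answer with exactly one letter: M, F, or U if unsure.\n"
        f"First name: {first_name}\n"
        "Answer:"
    )


def parse_answer(completion_text: str) -> str:
    """Normalise the raw completion text to 'M', 'F' or 'U' (unknown)."""
    tokens = completion_text.strip().upper().split()
    if not tokens:
        return "U"
    first = tokens[0].strip(".,:;")
    if first in {"M", "MALE"}:
        return "M"
    if first in {"F", "FEMALE"}:
        return "F"
    return "U"


# Assumption: with the pre-1.0 `openai` Python package, the request would
# look roughly like
#   openai.Completion.create(model="text-davinci-003",
#                            prompt=build_prompt(name),
#                            max_tokens=2, temperature=0)
# and the text to parse would be response["choices"][0]["text"].
```

Asking for a constrained, one-letter answer and mapping anything unexpected to `U` keeps the result easy to compare against the real data.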

📊 Results 👏

☝️ Notice

I have put the guessed value in a dedicated structure... so it can easily be flagged as AI-generated when reporting its metadata.
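
One possible shape for such a structure is sketched below. The field names (`value`, `ai_generated`, `model`) and the `to_record` helper are hypothetical and are not taken from the notebook; the point is only that the guess travels together with its provenance.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class GuessedValue:
    """A value produced by a model, kept apart from the source data."""

    value: Optional[str]             # e.g. "M" or "F", None if no guess
    ai_generated: bool = True        # flag used when reporting metadata
    model: str = "text-davinci-003"  # model that produced the guess


def to_record(first_name: str, guess: GuessedValue) -> dict:
    """Attach the guess to a first name without overwriting real data."""
    return {"first_name": first_name, "gender_guess": asdict(guess)}
```

Keeping the guess in its own nested field means downstream reports can filter on `ai_generated` instead of silently mixing guessed and recorded values.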

💰 Gains

  • 📈 Data quality
  • 💡 Better decisions & opportunities
  • 💸 Makes the cost of poor data quality visible (API calls are not free)
  • 🧠 Create more intelligence

👨‍🔬 Further optimizations

  • Benchmark models to spend as little money as possible while getting the best possible results
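
Such a benchmark could end with a simple selection rule, sketched here under assumptions: each model has already been run on the labelled dataset, and its accuracy and total cost were recorded. `BenchmarkResult` and `cheapest_good_enough` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class BenchmarkResult:
    model: str
    accuracy: float  # share of correct guesses on the labelled dataset
    cost: float      # total spend for the run, in one common currency


def cheapest_good_enough(
    results: Iterable[BenchmarkResult], min_accuracy: float
) -> Optional[BenchmarkResult]:
    """Pick the cheapest model whose accuracy meets the threshold.

    Ties on cost are broken in favour of the more accurate model.
    Returns None when no model is accurate enough.
    """
    eligible = [r for r in results if r.accuracy >= min_accuracy]
    if not eligible:
        return None
    return min(eligible, key=lambda r: (r.cost, -r.accuracy))
```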

🔭 News & perspectives
