❔ About
🤔 Sometimes a lack of data, or poor data quality, prevents you from producing insights.
💡 What if you could call AI to the rescue to fix or enhance some of that data?
I first tried some prompt engineering on ChatGPT.
☝️ Notice
Guessing gender from first names may seem useless, a bit dumb, or nerdy. Yes, but...
- 🗺️ This work relies on OpenAI... which acts as a universal first-name parser, whatever the language
- 💡 This work is just an illustration of how prompt engineering and OpenAI's API can help review or fix many kinds of data quality issues... and a concrete example of how you might enrich your enterprise data pipeline
🎯 Target
The purpose of this article is to see how OpenAI's API can help on a very specific, testable dataset.
📝 Kaggle Notebook
In this short notebook I will:
- 📥 Download data
- 🐼 Load data in pandas
- 🦾 Call OpenAI's API to guess the gender of each first name
- ⚖️ Compare guessed vs. real data
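The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual notebook code: the sample rows, the column names (`prenom`, `genre`) and the `fake_guess` stand-in for the API call are all assumptions, and the standard `csv` module replaces pandas so the sketch runs without dependencies.

```python
import csv
import io

# Hypothetical sample; the column names ("prenom", "genre") are assumptions,
# not the real schema of the data.gouv.nc dataset.
SAMPLE_CSV = """prenom,genre
Marie,F
Jean,M
Louise,F
"""


def load_rows(csv_text):
    """Parse CSV text into a list of dicts (one per first name)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def compare(rows, guess_gender):
    """Attach a guessed gender to each row and compute accuracy vs. the real value."""
    results = []
    for row in rows:
        guessed = guess_gender(row["prenom"])
        results.append({**row, "guessed": guessed, "match": guessed == row["genre"]})
    accuracy = sum(r["match"] for r in results) / len(results) if results else 0.0
    return results, accuracy


def fake_guess(firstname):
    """Offline stand-in for the API call (assumption): unknown names default to 'M'."""
    return {"Marie": "F", "Jean": "M"}.get(firstname, "M")


rows = load_rows(SAMPLE_CSV)
results, accuracy = compare(rows, fake_guess)
```

Keeping the guesser as a plain function argument makes it easy to swap the stand-in for a real API call later, and the `match` column points directly at the rows worth reviewing by hand.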
🍿 Demo
🗃️ Input Dataset
I have used the top-10-prenoms-a-noumea-depuis-1860 open dataset from data.gouv.nc.
🤖 The text-davinci-003 model
I have used text-davinci-003, one of the GPT-3.5 models, as they can "understand and generate natural language or code."
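To give an idea of what calling a completion model for this task might look like, here is a hedged sketch. The prompt wording, the `M` / `F` / `U` answer convention and the helper names are assumptions, not the prompt used in the notebook. The `complete` argument is any callable that sends a prompt to the model and returns its text, so the sketch stays independent of a specific client library version.

```python
def build_prompt(firstname):
    """Build a short, constrained prompt so the answer is easy to parse."""
    return (
        "Guess the gender usually associated with the following first name.\n"
        "Answer with a single letter: M, F, or U if unsure.\n"
        f"First name: {firstname}\n"
        "Answer:"
    )


def parse_answer(text):
    """Normalize the raw completion to 'M' or 'F'; anything else becomes 'U'."""
    first = text.strip().upper()[:1]
    return first if first in {"M", "F"} else "U"


def guess_gender(firstname, complete):
    """Send the prompt through `complete` (hypothetical callable) and parse the reply."""
    return parse_answer(complete(build_prompt(firstname)))
```

Constraining the answer to a single letter keeps the response short, which matters when every token is billed, and the `U` fallback avoids silently treating an unexpected reply as a valid guess.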
📊 Results 👏
☝️ Notice
Notice that I have put the guessed value in a dedicated structure... so we can easily flag it as AI-generated when reporting its metadata.
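One possible shape for such a structure is sketched below. The field names (`value`, `source`, `model`) and the `guessed_gender` key are hypothetical, chosen only to illustrate the idea of keeping the guess apart from the real data.

```python
from dataclasses import dataclass, asdict


@dataclass
class GuessedValue:
    """A guessed value plus the metadata needed to flag it as AI-generated."""
    value: str
    source: str = "ai"                # assumption: marker for AI-generated values
    model: str = "text-davinci-003"   # model named in this article


def enrich(record, guessed):
    """Return a copy of the record with the guess stored apart from the real fields."""
    return {**record, "guessed_gender": asdict(GuessedValue(value=guessed))}


def is_ai_generated(field):
    """True when a field carries the AI marker."""
    return isinstance(field, dict) and field.get("source") == "ai"


enriched = enrich({"prenom": "Marie"}, "F")
```

Because the original fields are never overwritten, a report can always tell observed data from guessed data.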
💰 Gains
- 📈 Data quality
- 💡 Better decisions & opportunities
- 💸 Makes the cost of poor data quality visible (API calls are not free)
- 🧠 Create more intelligence
👨🔬 Further optimizations
- Benchmark models to spend as little money as possible while getting the best possible results
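Such a benchmark could end with a simple selection rule: keep the cheapest model that still reaches a target accuracy. The sketch below assumes you have already measured accuracy and cost per model. The model names and figures are placeholders, not real measurements.

```python
def pick_model(benchmarks, min_accuracy):
    """Return the cheapest model whose accuracy reaches min_accuracy, or None."""
    eligible = [b for b in benchmarks if b["accuracy"] >= min_accuracy]
    if not eligible:
        return None
    return min(eligible, key=lambda b: b["cost"])["model"]


# Placeholder figures for illustration only: these are NOT measurements,
# and "model-a" / "model-b" / "model-c" are hypothetical names.
benchmarks = [
    {"model": "model-a", "accuracy": 0.98, "cost": 2.00},
    {"model": "model-b", "accuracy": 0.95, "cost": 0.20},
    {"model": "model-c", "accuracy": 0.80, "cost": 0.05},
]
best = pick_model(benchmarks, min_accuracy=0.90)
```

With these placeholder values, `model-b` is selected: it clears the 0.90 threshold at a lower cost than `model-a`.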
🔭 News & perspectives