With the fall of Stack Overflow, AI Coding Assistants like GitHub Copilot will have a data problem!

Robin-Manuel Thiel - Sep 1 - Dev Community

Stack Overflow shares website data with some of its most active members (how cool is that?), and that data currently shows a dramatic decrease in traffic, a trend ML engineer Ayhan Fuat Çelik named "The Fall of Stack Overflow". As one of the biggest data sources for LLM training (especially for models focusing on assisting with coding), Stack Overflow plays a critical role in the success of AI Coding Assistants like GitHub Copilot. Even though Gergely Orosz pointed out that the reports of Stack Overflow's downfall might be exaggerated, the number of questions being asked on the platform is indeed decreasing.

With fewer questions about current programming problems being asked on the public internet, the training data for the coding assistants of tomorrow shrinks. Ironically, the coding assistants of today are one of the main reasons for the fall of Stack Overflow, as people now ask their questions in private to an AI instead.

Why could this become problematic?

AI Coding Assistants are based on Generative AI Models, which can only be as good as their training data. With less public knowledge on coding, how do we train models to solve the coding challenges of the programming languages, frameworks, and tools of the future?

Developers don't have fewer questions; they just ask them to AI Assistants instead. Even if these AI Assistants published anonymized questions and answers as a public data set, it would be unwise to train new AI models on them, as training AI on AI-generated content risks poisoning the training data for future models. This can lead to a phenomenon called "model collapse", where errors accumulate over generations, resulting in nonsensical outputs. Experts warn that as AI-generated content becomes more prevalent, it could bias models and degrade their quality.
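
To make "model collapse" a bit more tangible, here is a minimal toy simulation (plain Python, no real model involved, and the tokens and counts are made up): each "generation" learns only the token frequencies of the previous generation's output and samples new text from them. Once a rare token is missed by chance, it is gone for good, so diversity shrinks over the generations.

```python
import random
from collections import Counter

random.seed(0)

# Generation 0 trains on human-created data: a vocabulary
# with common tokens and a few rare ones.
corpus = ["the"] * 50 + ["code"] * 30 + ["bug"] * 15 + ["segfault"] * 4 + ["heisenbug"] * 1

for generation in range(10):
    # "Training": learn the empirical token distribution of the corpus.
    counts = Counter(corpus)
    print(f"generation {generation}: {len(counts)} distinct tokens -> {dict(counts)}")
    # "Inference": sample a synthetic corpus from the learned distribution.
    # The next generation only ever sees this synthetic output.
    tokens, weights = zip(*counts.items())
    corpus = random.choices(tokens, weights=weights, k=100)
```

Run it a few times and rare tokens like "heisenbug" typically vanish within a handful of generations. Real model collapse is far more complex, of course, but this loss of the rare tails is the same basic mechanism.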

Will AI Assistants have a Netflix Problem?

With human-created data sets dramatically increasing in value for training Large Language Models like OpenAI's GPT family, owners of such data, like Reddit, have already started guarding it and blocking automatic access.

In my opinion, the worst outcome of all this from a user experience perspective would be if companies like Reddit or Stack Overflow came up with their own AI Assistants with exclusive access to their data. This is what I call the Netflix problem: Netflix once had nearly every show and movie in its portfolio, and now we have ended up with Amazon Prime Video, Disney+, Paramount and many more separate services. The same happened to car sharing and e-scooters.

People don't want to ask five different AI Coding Assistants for help, hoping that one of them was trained on data that could lead to an answer. They want one. There could be different ones on the market, but once they decide on one, they want to use it for everything.

Or maybe it is not a big problem at all?

From my own experience, AI Coding Assistants work fantastically for narrowly scoped and straightforward questions. Whenever an issue becomes complex or spans multiple layers, like programming language, framework, database driver, and database type, they still struggle, and I turn to Stack Overflow.

We should probably take a look at which kinds of questions are being asked less on Stack Overflow. If I had to guess, I would say it's the straightforward and less complex ones. Maybe these could also be answered by AI models by taking the official and public documentation into their training sets.
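
As a purely hypothetical sketch of that idea, here is the simplest possible version of answering a question from documentation: naive keyword-overlap retrieval over a few invented doc snippets. (The names and snippets are made up for illustration; a real assistant would train on or embed the actual official docs rather than match keywords.)

```python
import re
from collections import Counter

# Made-up documentation snippets; a real system would index the official docs.
DOCS = {
    "str.split": "Return a list of the words in the string, using sep as the delimiter.",
    "list.sort": "Sort the items of the list in place, using only < comparisons.",
    "dict.get": "Return the value for key if key is in the dictionary, else default.",
}

def tokenize(text: str) -> Counter:
    # Lowercased word counts as a crude bag-of-words representation.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def best_doc(question: str) -> str:
    q = tokenize(question)
    # Score each snippet by how many words it shares with the question.
    scores = {name: sum((q & tokenize(doc)).values()) for name, doc in DOCS.items()}
    return max(scores, key=scores.get)

print(best_doc("How do I sort a list in place?"))  # -> list.sort
```

For the straightforward questions I suspect are disappearing from Stack Overflow, even grounding answers in documentation this crudely points in the right direction; the hard, multi-layer questions are a different story.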

I'm sure all of this also applies to other industries that already use AI Assistants and Copilots. I am very curious to see how this affects the quality of future AI applications.
