Two analogies briefly describe the capabilities and limits of generative pre-trained transformers (ChatGPT):
- a) An amnesic(1) hallucinating(2) schizophrenic(3) with split personality disorder(4) who likes trivialities(5)
- b) An LLM is a lossy compression of the internet(6)
(6) LLMs are trained on large bodies of text scraped from the internet. GPT-3.5, with its 175B parameters, was (presumably) trained on 45 terabytes of text data. With each parameter (or weight) being 4 bytes, the total size of GPT-3.5 is roughly 700GB. Those weights encode the training set. When the base model receives a text snippet, it completes it with the most probable tokens (words), trying to unpack some of the texts it has seen before. The compression ratio is 0.7/45 ≈ 1.6%.
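A back-of-the-envelope check of that arithmetic (assuming 4-byte fp32 weights, as above):

```python
# Back-of-the-envelope compression ratio, using the numbers above.
params = 175e9                # GPT-3.5 parameter count (reported)
bytes_per_param = 4           # assuming 4-byte (fp32) weights
model_size_tb = params * bytes_per_param / 1e12   # ~0.7 TB of weights
training_text_tb = 45         # ~45 TB of scraped text (reported)

ratio = model_size_tb / training_text_tb
print(f"{model_size_tb:.2f} TB / {training_text_tb} TB = {ratio:.1%}")  # ~1.6%
```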
(5) Since the base model is essentially a generator of the most probable tokens, it tends to rely on the texts it has seen most often. Hence some interesting side effects, such as the memory trap (the LLM can't resist strong memories) or syntax winning over reason. That can also explain declining LLM performance on tasks requiring rare knowledge or creativity. E.g., it can perfectly solve the task of creating boilerplate code for an Express.js server (because it has seen many tutorials) and fail to fix a bug in your production code.
(4) You can easily define the personality and ask the model to be anyone: a novel writer, an expert programmer, or even a T-SQL interpreter. OpenAI's chat API has a special system message that helps define the personality. You can even ask it to be many personas simultaneously (tree-of-thought prompting). Priming the model that way greatly influences the output.
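A minimal sketch of such priming, assuming the official `openai` Python client (v1+) with an `OPENAI_API_KEY` set in the environment:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        # The system message primes the "personality" before any user input.
        {"role": "system", "content": "You are a T-SQL interpreter. Reply only with the result set."},
        {"role": "user", "content": "SELECT 21 * 2 AS answer;"},
    ],
)
print(response.choices[0].message.content)
```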
(3) "Jesus! You turned ChatGPT into a lawyer. Those are experts on defending bs" - a comment from YouTube discussing how LLMs demonstrate disordered thinking. There's many example with LLM pressing on complete nonsense and contradicting with itself, insisting that war is peace, slavery is freedom.
(2) Making things up and not knowing the limits of its own knowledge is a real problem. Hallucinations (confidently invented facts) have no known solution so far, though their frequency can be reduced. A general heuristic is that 80% of what LLMs produce is OK, and 20% can be random (yet coherent) text.
(1) LLMs are stateless. Chat models receive the complete conversation as input. New messages are appended to the end of the message list (to keep track of long chats, older messages can be summarized and inserted at the top). LLMs have no memories besides those acquired during training and whatever is present in the input. If you don't retrain a model, it can learn new information only via the prompt (in-prompt learning, embeddings, RAG). This in-prompt knowledge is lost the moment the model completes the request and returns the text. Secondly, the prompt can't exceed the context-size limit without seriously degrading quality. GPT-4 has context sizes of 8k and 32k tokens. The largest context size for a publicly available model is 100k tokens, with Anthropic's Claude. That is still very little for practical use and holds back many use cases (home-brew expert systems). E.g., the PMBoK is roughly 300k tokens.
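A minimal sketch of what statelessness means in practice, again assuming the `openai` Python client: the whole `history` list is re-sent on every call, and that list is the model's only memory of the chat.

```python
from openai import OpenAI

client = OpenAI()

# The model itself keeps no state between calls: every request must
# carry the entire conversation, so the `history` list *is* the memory.
history = [{"role": "system", "content": "You are a helpful assistant."}]

def ask(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    response = client.chat.completions.create(model="gpt-4", messages=history)
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    # Once `history` approaches the context-size limit, older messages
    # have to be dropped or summarized, or quality degrades.
    return answer

print(ask("Remember the number 42."))
print(ask("What number did I ask you to remember?"))  # works only because 42 is still in `history`
```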
P.S.: the top cover image is a reference to the movie Memento by Christopher Nolan. It revolves around a day in the life of an amnesiac who finds himself in the middle of god knows what, using notes and tattoos on his body to recover pieces of knowledge and move on.