The Dark Side
I guess by now everyone has used ChatGPT in some form. At some point I found myself using it to generate boilerplate code, write tests and document stuff. All the boring things. I do like some challenge, but writing the same thing over and over again gets quite boring. Well, good thing that AI excels at boring!
There is, however, some concern about the not-so-boring stuff, which is privacy. Pasting proprietary code into ChatGPT is like committing environment variables to a git repository. Yet for obvious reasons people keep using ChatGPT and GitHub Copilot on a daily basis - for convenience.
You have probably also heard that Reddit and StackOverflow sold their data to OpenAI for training (most likely long before it was publicly announced!). We can be almost sure that chat history is fed back into training the models as well.
The Alternative
Enough of my ranting about OpenAI. The alternative is obvious - run it yourself. If you are well versed in running your own models, you probably won't learn anything new here. But if you are new to this, here's the gist: if you have a PC good enough to run a modern game on medium settings, then you are able to run a model yourself. You might be surprised how active the open-source community is in this area. What's more, Meta (yes, Facebook's Meta) is at the forefront of open-source LLMs.
But there are downsides. Self-hosted models will never be quite up to par with closed ones, and speed and quality rely heavily on your GPU's capabilities. However, for many use cases any RTX card, and even some GTX 10 Series cards, will be enough. Personally I'm using an RTX 3080 Ti with 12GB of VRAM.
Oh - one more thing. While I recommend using NVIDIA cards for inference (running a local model), I don't have experience with AMD cards, so I cannot recommend them or even be 100% sure my approach works on AMD. From what I've heard, working with AMD drivers is hell not only for ML researchers but also for game developers. It breaks my heart, because I like their processors.
The Tutorial
There are several solutions out there, but I'll go with one that is seamless and runs in the background, which makes it almost invisible.
However, there are some requirements:
- Docker
- CUDA-enabled GPU
Yep, that's it. This solution should work on Windows, Linux and macOS. However, you might already be aware that Docker on Windows requires WSL2 to be enabled. You can quickly verify both requirements with the commands below.
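A quick sanity check could look like this (nvidia-smi ships with NVIDIA's drivers and should list your GPU; if either command fails, fix that before moving on):
docker --version
nvidia-smi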
Step 1. Install Ollama
Ollama will be our inference backend. If you are familiar with Docker, Ollama will feel like home. It downloads models from its own repository and serves them through its own API, which also follows the OpenAI format. You can also chat with a model directly in the CLI.
To install Ollama, go to Ollama's downloads page and grab the installer for your OS.
For bash users (non-Windows), here's the quick install script:
curl -fsSL https://ollama.com/install.sh | sh
Important! I don't recommend running Ollama inside Docker unless you really know what you are doing and know how to set up GPU access from a container. In my opinion, Ollama works best on the host system.
Now let's test our Ollama installation (yes, after installing the Windows version, it should be available in cmd and git-bash as well):
ollama pull llama3
ollama run llama3
After the second command you need to wait a little while for the model to load into VRAM. Then you can chat with it in the CLI!
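The CLI is not the only way in - Ollama also listens on port 11434 by default, so you can hit it over HTTP. A minimal sketch, assuming a reasonably recent Ollama version that exposes the OpenAI-compatible endpoint:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "Write a haiku about GPUs"}]}'
This is the same server that OpenWebUI and the editor plugin from the later steps will connect to.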
There are also other models worth checking, like:
- llava - a multimodal model - you can drop in media to prompt about
- codellama, codegemma, deepseek-coder - three models dedicated to coding tasks
- qwen2 - a multilingual competitor to llama3 that also performs twice as well on coding tasks
More: Ollama library
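Pulling and running any of them works exactly like before - for example, to try one of the coding models with a one-off prompt instead of the interactive chat (model names and tags may change, so check the library page):
ollama pull codellama
ollama run codellama "Write a Python function that reverses a string"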
Step 2. Install OpenWebUI
If you are familiar with the ChatGPT UI and feel at home with it, you might like OpenWebUI, which is heavily inspired by it. It actually might be more intuitive and powerful than ChatGPT, because it supports not only full chat mode, but also multi-modality (dropping files into the chat) and a RAG workflow (Retrieval Augmented Generation - it takes the context of your files/documents into account). I have checked multiple solutions so far, and this one is by far my favorite. To have it running locally we will use Docker.
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main
All the other installation options can be found here: https://github.com/open-webui/open-webui
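Once the command finishes, you can confirm the container is up and watch its startup logs (open-webui is the container name we set with --name above):
docker ps --filter name=open-webui
docker logs -f open-webui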
Now, let's head over to http://localhost:3000/
We should be greeted with the following screen:
In general you should be able to start chatting right away, but in case Ollama is not set up as the backend, you can go to Profile -> Settings -> Connections and set the connection to http://host.docker.internal:11434
This means the Docker container will connect to port 11434 on your host machine - exactly where your local Ollama API sits.
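If that connection fails, it's worth checking from the host that Ollama is actually listening on that port - this should return a JSON list of the models you pulled earlier:
curl http://localhost:11434/api/tags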
Step 3. Local Copilot
As of writing this article, I have found only one solid replacement for GitHub Copilot: continue.dev.
The best thing is - it works with both IntelliJ and VSCode!
It has all the good stuff: contextual autocomplete, a chat panel, and shortcuts for snippets!
The setup process is pretty straightforward. During installation it might ask you to log in via GitHub, but you can skip that, and you won't be asked again. Next up, choose your engine: a public API or Ollama. Then install starcoder:3b for completion and voila!
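The completion model is served by Ollama just like the chat ones, so if the plugin doesn't download it for you, you can pull it manually (the exact tag may differ in the Ollama library - starcoder2:3b is a newer alternative worth trying):
ollama pull starcoder:3b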
For me, the most important things were an overall developer experience similar to Copilot and the lack of an extra authentication layer. There are several other solutions and extensions out there, but this setup works completely on your machine.
Conclusion
What I described here is the workflow I have found to work best for me. There are multiple other ways to run open-source models locally that are worth mentioning, like Oobabooga WebUI or LM Studio; however, I didn't find them as seamless or as good a fit for my workflow.
For VSCode there are also many plugins that attempt to replicate the Copilot experience, like Llama Coder, CodeGPT, Ollama Autocoder and many more. I tested a lot of them, but only Continue actually comes close to - or even slightly surpasses - the actual Copilot.
Also worth mentioning is Pieces OS, which is basically a locally running RAG app that keeps your whole codebase in context. I did use it for a while and it's pretty good; however, the current setup works better with my coding habits.
Overall, this setup saves you some subscription fees and works just as well as the original thing!