Phi-2, an open-source model by Microsoft, promises to match or even beat Llama-2 70B - all while being super small (just 2.7B parameters). The secret sauce is the low-quantity, high-quality 'textbook' synthetic data.
What interests me is that the model is relatively easy to run locally on almost any consumer hardware. You can grab it and chat with it through LM Studio - I downloaded and used the 'TheBloke/phi-2-GGUF/phi-2.Q4_K_M.gguf' version of the model.
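If you'd rather script the download than click through the LM Studio UI, the same file can be fetched from the Hugging Face Hub. A minimal sketch, assuming the `huggingface_hub` package is installed:

```python
# Download the same quantized GGUF file that LM Studio uses.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/phi-2-GGUF",
    filename="phi-2.Q4_K_M.gguf",
)
print(model_path)  # local path to the quantized model file
```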
With Apple Metal acceleration enabled I managed to get an impressive 46 tokens per second of inference speed on a MacBook M1 Pro! Subjectively, that feels faster than the free ChatGPT. With Metal disabled the performance is ~20 tokens/second. Memory consumption is ~2GB in both cases.
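For anyone who wants to skip the GUI entirely, here is a minimal sketch of the same setup using llama-cpp-python (LM Studio runs GGUF files via llama.cpp under the hood, so any llama.cpp-based runner should behave similarly). Metal offload is controlled by `n_gpu_layers`:

```python
# A sketch, assuming llama-cpp-python built with Metal support, e.g.:
#   CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="phi-2.Q4_K_M.gguf",  # file downloaded above
    n_gpu_layers=-1,  # offload all layers to Metal; set to 0 for CPU-only
    n_ctx=2048,       # Phi-2's context window
)

# With the default verbose=True, llama.cpp prints timing stats
# (including tokens per second) to stderr after each call.
out = llm("The quick way to run a 2.7B model locally is", max_tokens=48)
print(out["choices"][0]["text"])
```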
Speaking of quality... First of all, the model does impress with its ability to generate coherent text and answer generic questions, and it can do basic math:
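Here's the kind of prompt I mean (a sketch reusing the llama-cpp-python setup above; 'Instruct:/Output:' is the QA prompt format suggested on the Phi-2 model card):

```python
# Ask a simple arithmetic question using Phi-2's QA prompt format.
prompt = "Instruct: What is 128 * 4 + 16?\nOutput:"
out = llm(prompt, max_tokens=32, stop=["Instruct:"])
print(out["choices"][0]["text"].strip())  # correct answer: 528
```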
Though when asked to do some coding, it doesn't take long to see that it is inferior to GPT-3.5. One of the funnier cases: the model generated something that looks like TypeScript when asked to produce Dart code:
* Note: there are no Promises in Dart, those are called Futures :)
Yet the model's training data is known to include a lot of Python code...
And it is under 3B params, not the 175B that GPT-3.5 has! A remarkable result that makes foundation models very accessible - bringing LLMs to smaller devices, at no cost and at high speed!
UPD: the 8-bit version of the model takes ~3GB of RAM and gives ~22 tokens/second with Metal acceleration turned on (vs. 46 with the 4-bit model)
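Those memory numbers roughly match back-of-the-envelope math. A sketch, assuming ~4.8 bits per weight for Q4_K_M and ~8.5 for the 8-bit quant (approximate figures - exact bits-per-weight depends on the quantization scheme), plus runtime overhead for the KV cache and buffers:

```python
# Rough estimate of weight memory for 2.7B parameters at different quant levels.
params = 2.7e9

for name, bits_per_weight in [("Q4_K_M", 4.8), ("8-bit", 8.5)]:
    gb = params * bits_per_weight / 8 / 1e9
    print(f"{name}: ~{gb:.1f} GB of weights")

# Q4_K_M: ~1.6 GB of weights -> ~2GB observed once runtime overhead is added
# 8-bit:  ~2.9 GB of weights -> ~3GB observed
```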
UPD2: I gave the model a try on a 5-year-old MacBook Pro 13 with an Intel Core i5-8257U CPU @ 1.40GHz and 16GB of RAM (a rough script for reproducing this kind of comparison is sketched after the list):
- 4-bit (Q4_K_M), CPU only - ~5 tokens/second
- 4-bit (Q4_K_M), Integrated graphics acceleration enabled (OpenCL) - ~8 tokens/second
- 8-bit, CPU only - ~5 tokens/second
- 8-bit, Integrated graphics acceleration enabled (OpenCL) - ~5 tokens/second
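A rough way to reproduce this kind of CPU vs GPU comparison (a sketch, assuming llama-cpp-python; on the Intel machine GPU offload needs a build with OpenCL/CLBlast support, on Apple Silicon it goes through Metal):

```python
# Compare CPU-only vs GPU-offloaded generation speed for the same GGUF file.
import time
from llama_cpp import Llama

PROMPT = "Instruct: Write a short paragraph about quantized language models.\nOutput:"

for label, gpu_layers in [("CPU only", 0), ("GPU offload", -1)]:
    llm = Llama(
        model_path="phi-2.Q4_K_M.gguf",
        n_gpu_layers=gpu_layers,
        n_ctx=2048,
        verbose=False,
    )
    start = time.time()
    out = llm(PROMPT, max_tokens=128)
    elapsed = time.time() - start
    tokens = out["usage"]["completion_tokens"]
    # Elapsed time includes prompt processing, so this slightly understates
    # pure generation speed.
    print(f"{label}: {tokens / elapsed:.1f} tokens/second")
```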