Paddler is an open-source load balancer and reverse proxy designed to optimize servers running llama.cpp.
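Typical load-balancing strategies such as round robin or least connections are not effective for llama.cpp servers, which use slots for continuous batching and concurrent requests: a balancer that does not track slots can keep sending requests to an instance whose slots are all busy.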
Paddler overcomes this by maintaining a stateful load balancer that is aware of each server's available slots, ensuring efficient request distribution. Additionally, Paddler uses agents to monitor the health of individual llama.cpp instances, providing feedback to the load balancer for optimal performance. Paddler also supports the dynamic addition or removal of llama.cpp servers, enabling integration with autoscaling tools.
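To make the slot-aware idea concrete, here is a minimal sketch of how a balancer could choose a target based on idle slots. It is not Paddler's actual code; the `Upstream` class, its fields, and `pick_upstream` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Upstream:
    """One llama.cpp instance as the balancer sees it (hypothetical shape)."""
    name: str
    slots_idle: int
    slots_processing: int
    healthy: bool = True


def pick_upstream(upstreams: list[Upstream]) -> Optional[Upstream]:
    """Return the healthy instance with the most idle slots, or None if none is free."""
    candidates = [u for u in upstreams if u.healthy and u.slots_idle > 0]
    if not candidates:
        return None
    best = max(candidates, key=lambda u: u.slots_idle)
    # Reserve the slot immediately so the next decision sees the updated state.
    best.slots_idle -= 1
    best.slots_processing += 1
    return best
```

Unlike round robin, this choice depends on state reported by each instance, which is why the balancer has to be stateful.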
Feature Highlights
Aggregated Health Status
Paddler overrides the /health endpoint of llama.cpp and reports the total number of available and processing slots.
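As a rough illustration of what aggregation means here, the sketch below sums per-instance slot counts into one cluster-wide status. The report format and field names (`status`, `slots_idle`, `slots_processing`) are assumptions made for this example and may not match what Paddler or a given llama.cpp version actually returns.

```python
def aggregate_health(agent_reports):
    """Sum per-instance slot counts into a single cluster-wide status."""
    # Only instances that report themselves as healthy contribute slots.
    healthy = [r for r in agent_reports if r.get("status") == "ok"]
    return {
        "status": "ok" if healthy else "no available upstreams",
        "slots_idle": sum(r["slots_idle"] for r in healthy),
        "slots_processing": sum(r["slots_processing"] for r in healthy),
    }
```

A client or autoscaler polling the aggregated endpoint then sees one total instead of having to query every instance.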
Buffered Requests (Scaling from Zero Hosts)
The load balancer's buffered requests allow your infrastructure to scale from zero hosts by providing an additional metric: the number of requests waiting to be handled.
Buffering also gives your infrastructure extra time to add hosts. For example, if your autoscaler is setting up a new server, holding an incoming request for 60 seconds might give it a chance to be handled even though no llama.cpp instance was available when the request was issued.
Scaling from zero hosts is especially suitable for low-traffic projects because it lets you cut infrastructure costs: you won't be paying your cloud provider anything while your service is not in use.
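The buffering behavior described above can be sketched with a condition variable: a request waits until a slot appears or a timeout expires, and the number of waiting requests is exposed as a metric. This is only an illustration under assumed names (`SlotBuffer`, `add_slots`, `acquire`, `buffered_requests`), not Paddler's implementation.

```python
import threading


class SlotBuffer:
    """Holds requests until a slot is free or a timeout expires (illustrative only)."""

    def __init__(self, max_wait_seconds: float = 60.0):
        self.max_wait_seconds = max_wait_seconds
        self._idle_slots = 0
        self._waiting = 0
        self._cond = threading.Condition()

    @property
    def buffered_requests(self) -> int:
        # The extra metric an autoscaler could watch while no hosts exist.
        with self._cond:
            return self._waiting

    def add_slots(self, count: int) -> None:
        """Called when a new instance registers or a running request finishes."""
        with self._cond:
            self._idle_slots += count
            self._cond.notify_all()

    def acquire(self) -> bool:
        """Wait for a slot; return False if none appeared within the timeout."""
        with self._cond:
            self._waiting += 1
            try:
                got_slot = self._cond.wait_for(
                    lambda: self._idle_slots > 0, timeout=self.max_wait_seconds
                )
                if got_slot:
                    self._idle_slots -= 1
                return got_slot
            finally:
                self._waiting -= 1
```

In a proxy built this way, a `False` result would typically be turned into an error response for the client, while a nonzero `buffered_requests` value is the signal that tells the autoscaler to start a host.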
State Dashboard
Paddler integrates with the StatsD protocol for metrics, but you can also preview the cluster's state using a built-in dashboard.
Final Thoughts
The project is gaining some traction. Let me know if you also use it in prod, and I will highlight your project in the repo. Thank you all for giving me feedback on it so far. I always appreciate it. :)