In this post, we are going to have a look at
- What is connection churn
- Common causes for connection churn in Go
- Benefits of utilizing persistent connections
- Methods for identifying connection churn and finally
- Demo that will help us better understand how to fix it connection churn in Go.
What is connection churn
While searching for a definition of what connection churn is I came across with
A system is said to have a high connection churn when its rate of newly opened connections is consistently high and its rate of closed connections is consistently high
or in other words when a service frequently opens and closes network connections instead of efficiently utilizing persistent connections. Persistent connections enable multiple HTTP requests and responses to be sent and received over a single TCP connection.
Common causes for connection churn in Go
Not Reading and Closing HTTP body
One of the most common causes and easy to miss, is not reading Body to completion and not closing after done with it.
It may be counterintuitive to read the body especially when you don't care about the contents of it.
There is a linter that can at least catch the bodyclose.
Using the default HTTP client
Go HTTP defaults are not great for heavy use
Internally Go HTTP client is using an LRU cache for persistent connection pooling. Instead of closing a socket connection after an HTTP request, if the connection is marked as reusable, it will be put back into the pool and later if you try to make another request before the idle connection timeout (90 seconds by default) it will re-use the existing connection from the pull rather than creating a new one. This will keep the number of total socket connections low as long as the pool doesn't fill up. If the pool is full the client will create a new connection and use it.
What's the size of the connection pool?
Looking at the DeafaultTransport in HTTP Client MaxIdleConns
sets the size of the pool to 100 connections, BUT there is a caveat. DefaultMaxIdleConnsPerHost
is set to 2, what that means is even if the pool has a size of 100 connections, there is a cap of two open sockets per host. We will see this later in action with an example.
Benefits of utilizing persistent connections
From Wikipedia for HTTP persistent connection#Advantages.
- Reduced latency in subsequent requests (no handshaking and no slow start).
An example of a production service, after fixing an issue that allowed reusing connections.
- Reduced CPU usage and round-trips because of fewer new connections and TLS handshakes.
TLS handshakes can be CPU-intensive by nature, so reducing the newly established connections and inherently spend less time in TLS handshakes reduces your service's CPU usage.
- Enables HTTP pipelining of requests and responses.
- Reduced network congestion (fewer TCP connections).
- Errors can be reported without the penalty of closing the TCP connection.
Methods for identifying connection churn
Let's see some methods that can help us identify that a service is having a high connection churn.
High CPU usage on TLS
A good indicator that your service might be suffering from connection churn is when a high percentage of CPU is spent on TLS handshake.
The below CPU profile is an example of this phenomenon. ~38% of CPU time is spent on establishing TLS connections,a size comparable to the time spent running our service code.
Monitoring
Monitoring platforms like Datadog provide tools for Network Monitoring.
An example below is a view from Network Performance Monitoring where a service is having high number of connections established and closed per second. This clearly shows a service suffering from connection churn.
HTTP Trace
Another method is using httptrace which provides mechanisms to trace events within HTTP client requests.
An example usage where we print connection info, which includes info on whether the connection was Reused
or not.
Netstat
Finally, we can use netstat which is a useful tool for inspecting connection statuses. In the command below we are running netstat every second and counting the number of all TCP connections at port 8080
that are in status time_wait
.
The high number of time_wait
statuses indicates the creation of lots of short-lived TCP connections.
From man netstat
TIME_WAIT
The socket is waiting after close to handle packets still in the network.
Demo
It's time for the demo with 2 examples in Go that can cause connection churn.
The setup is simple, we are going to run a server that will accept requests and a client sending requests from a specified number of goroutines (you can find the code below).
To observe the state of our connections, in two different shells, we are going to run netstat
watch -n 1 'netstat -n -p tcp | grep -i 8080 | grep -i time_wait | wc -l'
and
watch -n 1 'netstat -n -p tcp | grep -i 8080'
to see the list of connections in more detail.
Example-1
In the first example, we run the code as is, without reading or closing the body and with the default Go HTTP client configuration.
After around ~16k requests we are getting errors
Get "http://localhost:8080/message": dial tcp [::1]:8080: connect: connection refused
we can see the shell running netstat also having a high number of sockets at time_wait
state.
Every 1.0s: netstat -n -p tcp | grep -i 8080
tcp4 0 0 127.0.0.1.49478 127.0.0.1.8080 TIME_WAIT
tcp4 0 0 127.0.0.1.49477 127.0.0.1.8080 TIME_WAIT
tcp4 0 0 127.0.0.1.49481 127.0.0.1.8080 TIME_WAIT
tcp4 0 0 127.0.0.1.49482 127.0.0.1.8080 TIME_WAIT
tcp4 0 0 127.0.0.1.49486 127.0.0.1.8080 TIME_WAIT
tcp4 0 0 127.0.0.1.49485 127.0.0.1.8080 TIME_WAIT
tcp4 0 0 127.0.0.1.49489 127.0.0.1.8080 TIME_WAIT
tcp4 0 0 127.0.0.1.49490 127.0.0.1.8080 TIME_WAIT
...
What's happening is, because we are not reading and closing any of the requests, eventually we run out of available ephemeral ports and we are not able to establish new connections.
We can fix this by closing and reading the request. After rerunning the code with the fix we can see on the netstat list with TPC connections only two connections as ESTABLISHED
and the sockets with status time_wait
remaining 0.
All our requests are handled by those two persistent connections successfully.
Every 1.0s: netstat -n -p tcp | grep -i 8080
tcp4 0 0 127.0.0.1.8080 127.0.0.1.61273 ESTABLISHED
tcp4 0 0 127.0.0.1.8080 127.0.0.1.61272 ESTABLISHED
tcp4 0 0 127.0.0.1.61273 127.0.0.1.8080 ESTABLISHED
tcp4 0 0 127.0.0.1.61272 127.0.0.1.8080 ESTABLISHED
Example-2
For the second example, we are increasing the number of goroutines from 2 to 3, with this change essentially doing 3 concurrent requests.
Running the code now with the new update, we are seeing a steady increase of sockets in time_wait
status and two established connections as before.
Every 1.0s: netstat -n -p tcp | grep -i 8080
tcp4 0 0 127.0.0.1.8080 127.0.0.1.49213 ESTABLISHED
tcp4 0 102 127.0.0.1.49213 127.0.0.1.8080 ESTABLISHED
tcp4 0 128 127.0.0.1.8080 127.0.0.1.49198 ESTABLISHED
tcp4 0 102 127.0.0.1.49198 127.0.0.1.8080 ESTABLISHED
tcp4 0 0 127.0.0.1.8080 127.0.0.1.61787 TIME_WAIT
tcp4 0 0 127.0.0.1.8080 127.0.0.1.61751 TIME_WAIT
tcp4 0 0 127.0.0.1.61787 127.0.0.1.8080
...
We are successfully reusing two of the connections but still creating new connections to close the short after.
The question of why is this happening can be answered by DefaultMaxIdleConnsPerHost
. By increasing the concurrency to 3 requests, on every iteration we are able to reuse two of the connections (DefaultMaxIdleConnsPerHost
is 2) but the 3rd connection can not be put into the idle pool due to http: putIdleConn: too many idle connections for host
.
So the fix for this example would be adjusting MaxIdleConnsPerHost
to something > 2. Rerunning the code after the fix we can see now 3 established connections and zero sockets at time_wait
state.
Every 1.0s: netstat -n -p tcp | grep -i 8080
tcp4 0 128 127.0.0.1.8080 127.0.0.1.52440 ESTABLISHED
tcp4 0 0 127.0.0.1.8080 127.0.0.1.52439 ESTABLISHED
tcp4 0 0 127.0.0.1.8080 127.0.0.1.52438 ESTABLISHED
tcp4 0 102 127.0.0.1.52440 127.0.0.1.8080 ESTABLISHED
tcp4 0 102 127.0.0.1.52439 127.0.0.1.8080 ESTABLISHED
tcp4 0 102 127.0.0.1.52438 127.0.0.1.8080 ESTABLISHED
Appendix
In MaxConnsPerHost and MaxIdleConnsPerHost in GO HTTP package, is "Host" a domain(i.e., yahoo.com) or an IP address? The actual meaning of "Host" will affect my settings for the connection pool.
The graphs are from Datadog.
The demo was inspired by Tuning the Go HTTP Client Settings for Load Testing.
The gopher art is from MariaLetta.