HTTP Connection churn in GO

Georgios Kampitakis - Oct 21 '23 - - Dev Community

In this post, we are going to have a look at

What is connection churn

While searching for a definition of what connection churn is I came across with

A system is said to have a high connection churn when its rate of newly opened connections is consistently high and its rate of closed connections is consistently high

or in other words when a service frequently opens and closes network connections instead of efficiently utilizing persistent connections. Persistent connections enable multiple HTTP requests and responses to be sent and received over a single TCP connection.

Common causes for connection churn in Go

Not Reading and Closing HTTP body

One of the most common causes and easy to miss, is not reading Body to completion and not closing after done with it.

read and close body in go

It may be counterintuitive to read the body especially when you don't care about the contents of it.

There is a linter that can at least catch the bodyclose.

Using the default HTTP client

Go HTTP defaults are not great for heavy use

Internally Go HTTP client is using an LRU cache for persistent connection pooling. Instead of closing a socket connection after an HTTP request, if the connection is marked as reusable, it will be put back into the pool and later if you try to make another request before the idle connection timeout (90 seconds by default) it will re-use the existing connection from the pull rather than creating a new one. This will keep the number of total socket connections low as long as the pool doesn't fill up. If the pool is full the client will create a new connection and use it.

What's the size of the connection pool?

Default http client configuration

Looking at the DeafaultTransport in HTTP Client MaxIdleConns sets the size of the pool to 100 connections, BUT there is a caveat. DefaultMaxIdleConnsPerHost is set to 2, what that means is even if the pool has a size of 100 connections, there is a cap of two open sockets per host. We will see this later in action with an example.

Benefits of utilizing persistent connections

From Wikipedia for HTTP persistent connection#Advantages.

  • Reduced latency in subsequent requests (no handshaking and no slow start).

An example of a production service, after fixing an issue that allowed reusing connections.


A drop in latency from ~1 sec to 400 ms

  • Reduced CPU usage and round-trips because of fewer new connections and TLS handshakes.


same production service as before.

TLS handshakes can be CPU-intensive by nature, so reducing the newly established connections and inherently spend less time in TLS handshakes reduces your service's CPU usage.

  • Enables HTTP pipelining of requests and responses.
  • Reduced network congestion (fewer TCP connections).
  • Errors can be reported without the penalty of closing the TCP connection.

Methods for identifying connection churn

Let's see some methods that can help us identify that a service is having a high connection churn.

High CPU usage on TLS

A good indicator that your service might be suffering from connection churn is when a high percentage of CPU is spent on TLS handshake.

The below CPU profile is an example of this phenomenon. ~38% of CPU time is spent on establishing TLS connections,a size comparable to the time spent running our service code.


same production service as before.

Monitoring

Monitoring platforms like Datadog provide tools for Network Monitoring.

An example below is a view from Network Performance Monitoring where a service is having high number of connections established and closed per second. This clearly shows a service suffering from connection churn.


same production service as before.

HTTP Trace

Another method is using httptrace which provides mechanisms to trace events within HTTP client requests.

An example usage where we print connection info, which includes info on whether the connection was Reused or not.

HTTP trace example usage

Netstat

Finally, we can use netstat which is a useful tool for inspecting connection statuses. In the command below we are running netstat every second and counting the number of all TCP connections at port 8080 that are in status time_wait.

The high number of time_wait statuses indicates the creation of lots of short-lived TCP connections.

From man netstat

TIME_WAIT
The socket is waiting after close to handle packets still in the network.

Netstat cli example usage

Demo

It's time for the demo with 2 examples in Go that can cause connection churn.

The setup is simple, we are going to run a server that will accept requests and a client sending requests from a specified number of goroutines (you can find the code below).

To observe the state of our connections, in two different shells, we are going to run netstat

watch -n 1 'netstat -n -p tcp | grep -i 8080 | grep -i time_wait | wc -l'
Enter fullscreen mode Exit fullscreen mode

and

watch -n 1 'netstat -n -p tcp | grep -i 8080'
Enter fullscreen mode Exit fullscreen mode

to see the list of connections in more detail.

Example-1

In the first example, we run the code as is, without reading or closing the body and with the default Go HTTP client configuration.

After around ~16k requests we are getting errors

Get "http://localhost:8080/message": dial tcp [::1]:8080: connect: connection refused
Enter fullscreen mode Exit fullscreen mode

we can see the shell running netstat also having a high number of sockets at time_wait state.

Every 1.0s: netstat -n -p tcp | grep -i 8080

tcp4       0      0  127.0.0.1.49478        127.0.0.1.8080         TIME_WAIT
tcp4       0      0  127.0.0.1.49477        127.0.0.1.8080         TIME_WAIT
tcp4       0      0  127.0.0.1.49481        127.0.0.1.8080         TIME_WAIT
tcp4       0      0  127.0.0.1.49482        127.0.0.1.8080         TIME_WAIT
tcp4       0      0  127.0.0.1.49486        127.0.0.1.8080         TIME_WAIT
tcp4       0      0  127.0.0.1.49485        127.0.0.1.8080         TIME_WAIT
tcp4       0      0  127.0.0.1.49489        127.0.0.1.8080         TIME_WAIT
tcp4       0      0  127.0.0.1.49490        127.0.0.1.8080         TIME_WAIT
...
Enter fullscreen mode Exit fullscreen mode

What's happening is, because we are not reading and closing any of the requests, eventually we run out of available ephemeral ports and we are not able to establish new connections.

We can fix this by closing and reading the request. After rerunning the code with the fix we can see on the netstat list with TPC connections only two connections as ESTABLISHED and the sockets with status time_wait remaining 0.

All our requests are handled by those two persistent connections successfully.

Every 1.0s: netstat -n -p tcp | grep -i 8080

tcp4       0      0  127.0.0.1.8080         127.0.0.1.61273        ESTABLISHED
tcp4       0      0  127.0.0.1.8080         127.0.0.1.61272        ESTABLISHED
tcp4       0      0  127.0.0.1.61273        127.0.0.1.8080         ESTABLISHED
tcp4       0      0  127.0.0.1.61272        127.0.0.1.8080         ESTABLISHED
Enter fullscreen mode Exit fullscreen mode

Example-2

For the second example, we are increasing the number of goroutines from 2 to 3, with this change essentially doing 3 concurrent requests.

Running the code now with the new update, we are seeing a steady increase of sockets in time_wait status and two established connections as before.

Every 1.0s: netstat -n -p tcp | grep -i 8080

tcp4       0      0  127.0.0.1.8080         127.0.0.1.49213        ESTABLISHED
tcp4       0    102  127.0.0.1.49213        127.0.0.1.8080         ESTABLISHED
tcp4       0    128  127.0.0.1.8080         127.0.0.1.49198        ESTABLISHED
tcp4       0    102  127.0.0.1.49198        127.0.0.1.8080         ESTABLISHED
tcp4       0      0  127.0.0.1.8080         127.0.0.1.61787        TIME_WAIT
tcp4       0      0  127.0.0.1.8080         127.0.0.1.61751        TIME_WAIT
tcp4       0      0  127.0.0.1.61787        127.0.0.1.8080         
...
Enter fullscreen mode Exit fullscreen mode

We are successfully reusing two of the connections but still creating new connections to close the short after.

The question of why is this happening can be answered by DefaultMaxIdleConnsPerHost. By increasing the concurrency to 3 requests, on every iteration we are able to reuse two of the connections (DefaultMaxIdleConnsPerHost is 2) but the 3rd connection can not be put into the idle pool due to http: putIdleConn: too many idle connections for host.

So the fix for this example would be adjusting MaxIdleConnsPerHost to something > 2. Rerunning the code after the fix we can see now 3 established connections and zero sockets at time_wait state.

Every 1.0s: netstat -n -p tcp | grep -i 8080

tcp4       0    128  127.0.0.1.8080         127.0.0.1.52440        ESTABLISHED
tcp4       0      0  127.0.0.1.8080         127.0.0.1.52439        ESTABLISHED
tcp4       0      0  127.0.0.1.8080         127.0.0.1.52438        ESTABLISHED
tcp4       0    102  127.0.0.1.52440        127.0.0.1.8080         ESTABLISHED
tcp4       0    102  127.0.0.1.52439        127.0.0.1.8080         ESTABLISHED
tcp4       0    102  127.0.0.1.52438        127.0.0.1.8080         ESTABLISHED
Enter fullscreen mode Exit fullscreen mode

Appendix

In MaxConnsPerHost and MaxIdleConnsPerHost in GO HTTP package, is "Host" a domain(i.e., yahoo.com) or an IP address? The actual meaning of "Host" will affect my settings for the connection pool.


The graphs are from Datadog.

The demo was inspired by Tuning the Go HTTP Client Settings for Load Testing.

The gopher art is from MariaLetta.

. . . . . . . . . . . . . . .