Hello everyone, in this article I will explain how we troubleshot a network error between our microservices.
Content
🚀 Service Mesh in Trendyol
❌ Error Case
📝 Determining Troubleshooting Paths
🚨 Reproducing the Error
🕵️ Sniffing the Pod Network
🔍 Inspecting Code
✅ Problem and Solution
📚 Conclusion
Service Mesh in Trendyol 🚀
At Trendyol, our microservice applications run on Kubernetes together with the Istio service mesh. We have thousands of clusters, with almost 200K pods running on them.
We empower our microservices with service mesh features such as service authorization, load balancing, weighted request routing, traffic mirroring, interrupting incoming and outgoing requests, manipulating request flows, and so on.
Beyond that, we are extending our service mesh functionality by leveraging Envoy's extensibility features, as described in our Envoy Wasm Filters article.
If you are curious about Trendyol's infrastructure, you can take a look at our publicly available infra metrics here.
Error Case ❌
A couple of days ago, we (the Platform — Developer Productivity Engineering team) got a support request from one of our domain teams for a microservice that they developed. They described the issue as a connectivity problem between two services. The strange part was that the application was failing while sending a request for an existing resource (/resources/1) and working correctly for a non-existing resource (/resources/133).
To reproduce the error, we asked for example requests and the application URLs. In the rest of this article, I will explain the situation and the steps we followed.
Determining Troubleshooting Paths 📝
While troubleshooting an error, visualizing the process as a big picture helps a lot to reveal overlooked points, so we started by drawing the request flow.
What we knew about the process:
- We have a reverse proxy in front of Kubernetes
- We have Istio Envoy Proxy in pods
- We have two services that run in Kubernetes and call each other
- The gateway service handles the request and makes a call to an API
Let’s see the big picture 👇
When a user makes a request to a URL, it is handled by Tengine, which forwards the request to the cluster, where it is first handled by the Istio Ingress Gateway. Inside the pod, the request is handled by the istio-proxy (Envoy) container and then forwarded to the application container.
Now we are ready to dig in.
Reproducing the Error 🚨
Based on this information, we started to eliminate components one by one. First, we sent requests bypassing the load balancer (Tengine), since we knew the load balancer could reach Kubernetes: some requests to the same URL were succeeding.
So now we have removed Tengine from the big picture 👇
We sent requests directly to the Kubernetes cluster's IP address and the Istio Ingress Gateway port. To rule out an Istio configuration issue, we also sent requests directly to the NodePort of our service, which removes the Istio Ingress Gateway from the request flow.
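A minimal sketch of those direct requests could look like the following; the node IP, ports, and path are placeholders for illustration, not our real addresses:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ReproduceError {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Placeholder values for illustration only
        String nodeIp = "10.0.0.1";
        int ingressGatewayPort = 31380; // assumed NodePort of the Istio Ingress Gateway
        int serviceNodePort = 30080;    // assumed NodePort of the gateway service itself

        for (int port : new int[]{ingressGatewayPort, serviceNodePort}) {
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://" + nodeIp + ":" + port + "/resources/1"))
                    .GET()
                    .build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(port + " -> " + response.statusCode());
        }
    }
}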
But either way, we faced the same result 👇
The error says “upstream connect error or disconnect/reset before headers. reset reason: connection termination”.
We wanted to be sure that our upstream API worked correctly, so we removed the gateway from the picture and sent requests directly to the API 👇
We got successful responses for both existing and non-existing resources when sending requests directly to the API.
After a bunch of request combinations against both the gateway and the API, we began to suspect that the error was occurring within the communication inside the gateway pod.
Sniffing the Pod Network 🕵️
We could not find any reason for this error from the outside just by sending requests, so we decided to sniff the gateway pod's network and see what was going on.
Thanks to the wonderful kubectl plugin named "ksniff", we can easily sniff our pod network and follow it in Wireshark.
To sniff the pod network, let's run the kubectl command below (the capture filter keeps only packets with the SYN, FIN, RST, or PUSH flags set, which reduces the noise):
kubectl sniff pod-name -f 'tcp[tcpflags] & (tcp-syn|tcp-fin|tcp-rst|tcp-push)!=0'
When we start sniffing, it opens up Wireshark with a bunch of packet traces, like in the following screenshot 👇
As you can see, it contains a lot of unnecessary information; we need to track a specific request flow.
We started sending a couple of different requests to the application to see what was happening on the network, so that we could filter them easily with a display filter like http.request.full_uri contains “/eligibility”
which shows us the filtered requests 👇
And now we can select a request and follow its HTTP or TCP stream by right-clicking on it 👇
Now, when we follow this process for our invalid (non-existing resource) request, we can easily see that its TCP stream does not contain any error 👇
When we do the same for the failed responses, we face the following screen 👇
As you can see, there is a big difference between the two TCP streams: the failed stream contains an [RST] flag. This means that somehow the application is causing the connection with istio-proxy (Envoy) to be closed.
We can actually see that our gateway application communicates with the Eligibility API.
So we can narrow the error scope down to the gateway application pod alone. There must be something going on between the gateway container and the istio-proxy (Envoy) container.
Inspecting Code 🔍
Finally, we ended up with the idea that there must be a difference between the response flows of this application.
There is one thing left to look at: the code difference between the error and success responses.
When we took a look at the source code of the gateway application, we noticed that the application has a global error handler which returns a fresh, newly created response object.
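Conceptually, such a handler looks something like the sketch below; the class, exception, and payload types are assumptions for illustration, not the actual gateway code:
import java.util.Map;

import feign.FeignException;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.RestControllerAdvice;

// Illustrative sketch of a global error handler that builds a brand-new response
@RestControllerAdvice
public class GlobalErrorHandler {

    @ExceptionHandler(FeignException.NotFound.class)
    public ResponseEntity<Map<String, String>> handleNotFound(FeignException.NotFound ex) {
        // A fresh ResponseEntity is created here, so the framework generates
        // clean response headers instead of reusing the upstream ones
        return ResponseEntity.status(HttpStatus.NOT_FOUND)
                .body(Map.of("message", "Resource not found"));
    }
}
This is also why the requests for the non-existing resource worked: their responses were rebuilt from scratch by the handler.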
BUT…
In the controller, requests are sent via a Feign Client to the API application, and the response is returned directly to the caller ⚠️
We found that the reason for the error is this behavior of the controller: it forwards the upstream response headers as-is, corrupting the response headers that istio-proxy expects.
Problem and Solution ✅
Here is the problematic code that we saw in the controller 👇
@GetMapping("/some/path")
public ResponseEntity<Response> getResponse(
@RequestParam(name = "id", required = false) Integer id)
{
return externalApiClient.get(id); //directly returning the ResponseEntity<Response> from Feign client
}
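For completeness, externalApiClient here is a Feign client; its interface presumably looked roughly like the sketch below (the client name, URL property, and upstream path are assumptions, not the real definition):
import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;

// Assumed shape of the Feign client used by the controller above
@FeignClient(name = "external-api", url = "${external.api.url}")
public interface ExternalApiClient {

    // Returns the upstream ResponseEntity as-is, headers included
    @GetMapping("/resources")
    ResponseEntity<Response> get(@RequestParam(name = "id", required = false) Integer id);
}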
We made a tiny change to fix this issue: we simply wrapped the body of the API response in a newly created ResponseEntity 👇
@GetMapping("/some/path")
public ResponseEntity<Response> getResponse(
@RequestParam(name = "id", required = false) Integer id)
{
return ResponseEntity.ok(externalApiClient.get(id).getBody()); //creating a fresh new ResponseEntity<Response> from Feign client response body
}
It fixed all of our issues.
Just because we hit this issue in Java does not mean you will not encounter it in another language. This problem is language agnostic; you may encounter it in Go, C#, or any other language as well.
Remember, the root cause of this issue is directly returning the response headers of external requests along with their bodies. You need to create a custom response 👍
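For example, in the Java code above, if you also need to preserve the upstream status code, a variation along these lines (a sketch under the same assumptions, not the exact change we shipped) still builds a fresh response:
@GetMapping("/some/path")
public ResponseEntity<Response> getResponse(
        @RequestParam(name = "id", required = false) Integer id) {
    // keep the upstream status and body, but let the framework build fresh
    // response headers instead of forwarding the upstream ones
    ResponseEntity<Response> upstream = externalApiClient.get(id);
    return ResponseEntity.status(upstream.getStatusCode()).body(upstream.getBody());
}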
Conclusion 📚
We drew a couple of conclusions after resolving this problem.
- Do not proxy the responses of your external requests directly to the caller; wrap them in a custom response
- Code implementation may vary across different flows, so do not make assumptions
- Always try to make your application observable
We learned a lot while investigating this issue. Many thanks to my teammates Onur Yilmaz and Gokhan Karadas.
I hope you found this article entertaining and informative. Thanks for your attention, and happy coding ⌨️
Ready to take your career to the next level? Join our dynamic team and make a difference at Trendyol.