Knowledge Hub^TM

Debugging Latency – Redux

Date Posted: 10.11.2022
Read time: 11 min read
Written by: Exa Networks

Not so long ago, we posted an article that detailed an issue impacting SurfProtect service stability. We described our investigation into the cause of the problem and celebrated its resolution.

This week, it came back.

While our previous investigation required significant effort, co-ordination between our teams, and time, we’re happy to say that the lessons learned from our previous difficulties meant that we were able to diagnose the issue and apply a fix within hours. The problem again proved to be service-affecting, however, and this follow-up describes once again what happened, providing more insight into how we run our services and hopefully explaining why we don’t expect another repeat.

The issue impacted two servers almost simultaneously, just after 08:00 in the morning on Thursday 2022-11-10. We’ve had weeks of monitoring to confirm that cpu usage was back to normal levels so we were confident in the findings from our previous investigation but we’re now unsure whether we fixed a similar issue or simply relieved pressure on the current one. We will likely continue this series of articles with a discussion of the work required to confirm our current belief.

Traffic was migrated away from the affected servers to an available machine and after some time we started to see the problem starting to occur on that server too. Fortunately, a potential fix was available before it was necessary to migrate traffic away and we were able to observe an immediate change in behaviour with the fix applied.

Restarting the investigation

Just like in the previous incidents, the first we knew that anything was wrong was when our monitoring system started alerting us to high service latency on two servers.

This time, our team immediately started up our data collection script on the affected machines then looked up performance metrics while we waited for the more detailed traces to populate.

Running top showed just over 50% cpu usage, which matched the overview of the proxy process, showing it maxing out each of the cores it was assigned to.

The graphical data shows a rapid but not quite immediate ramp up to those levels:

It might seem strange that we’re apparently hitting full capacity at only ~50% cpu usage but that’s plenty to handle the current traffic volume we’re allocating to each machine, and we’re avoiding running the proxy process across NUMA domains for now. Just like before, nearly all of our cpu time was spent in system calls and was concentrated onto the cores running the proxy.

We’ve included a visualisation of where the proxy was spending its time again, but the values are so close to the previous post that we may as well have just duplicated the image:

This time, though, let’s look at the actual impact of that high load. This image shows the latency that our health checker observed while monitoring one proxy service.

See how inconsistent those latencies are? That’s a measure of how irregular system call latencies were, which fits with the idea that we’re blocking while waiting for a (very) contended resource.

Even at the extreme lower end of the graph, though, those latencies look horrendous when compared to the values we expect to see.

The spikes here on metrics gathered under normal load measure increased latency on the services we’re accessing through the proxy rather than on the proxy itself, and it’s clear to see that the earlier samples were severely impacted by the issue. At up to 1 second to handle a web request, users are going to find the service unbearably slow.

Measuring contention

So far we’ve limited our discussion to directly measuring the impact of the problem to the overall service, but what about individual components? Do some parts of the proxy fare better than others under contention and can we use that knowledge to make the service more resilient to the problem?

One of the metrics we track shows the average time taken to successfully complete TLS handshakes. The amount of work done by both sides is dependent on a few factors so it’s normal for us to have quite a wide spread of latencies under normal load. Clearly, though, the increase in time taken by this subsystem will contribute to the overall drop in performance.

Other than highlighting that we need to add more bins to that particular histogram, this data shows that we’re starting to add a great deal of latency before we even receive a web request.

On the other end the proxy’s duties, we find that latency starts piling up further as response times from our backend decision logic appear to increase significantly.

We use two types of request messages for interacting with the backend service: options requests and decision requests. Decision requests determine the exact behaviour of the service in response to the web request made by a user and their latency exhibits a somewhat long tail as we sometimes need to consult services that sit elsewhere on the network. Options requests, on the other hand, control proxy behaviour during initial interception of web requests, and are usually expected to return a response within well under a millisecond.

We didn’t expect that it would be useful to measure more that 500x the expected response time of an options request but it’s safe to use the metrics for decision requests as a guide and assume that both sets of communication are adding on average 100-200ms or higher to our total time.

That sounds really bad but the deciders themselves paint a different picture. This is higher resolution breakdown of the time taken to process the same requests we measured above, this time from the point of view of one of the local decider processes

The deciders are assigned to completely different cores than the proxy instance but it’s clear that performance is still impacted by what’s happening on the rest of the machine. While the proxy measures hundreds of milliseconds of latency, however, here we see that requests are now most likely to be processed in 100us to 2ms, and that no request takes more that 10ms.

While there’s definitely evidence of degraded performance across the machine, we have once again found that the issue impacts us most greatly in communication between services.

Jumping to the root cause

During our earlier investigation, the team initially focused on debugging our own services and eventually we started building tools to investigate the possibility of an issue within the Linux kernel. With those tools now having been rolled into our data collection process, all the data we needed was ready by the time we’d performed our initial analysis and determined that the symptoms exactly matched the issues we’d only just fixed.

We knew exactly what the contention looked like before and went straight to our flame graph to see if we could see any sign of similar behaviour:

Unbelievably, the flame graph was almost indistinguishable from the version we published last time. With that in mind, we recalled from our previous analysis that we’d initially worried about potentially measuring behaviour of the tracer, based on seeing contended calls to same locking mechanism that appeared heavily in calls to apparmor_socket_sendmsg. We put those instances down to backpressure from the actual issue since they went away after disabling AppArmour but here they were again! This gave credibility to the idea that there were perhaps multiple sources of contention, and removing AppArmor had merely reduced demand on our contented resource enough that the issue was no longer visible. Turning to the stack traces we’d just gathered, we quickly recognised that the calls we’d erroneously attributed to our use of ftrace appear in fact to have been caused by actively registered kernel probes. The trampoline handler shown below is responsible for calling a user-specified return handler that we can spending a lot of time in a spin lock:

				
					   7)               |        trampoline_handler() {
   7)   0.201 us    |          kprobe_busy_begin();
   7)               |          kretprobe_hash_lock() {
   7)   0.201 us    |            _raw_spin_lock_irqsave();
   7)   0.561 us    |          }
   7)   0.200 us    |          percpu_array_map_lookup_elem();
   7)   0.200 us    |          percpu_array_map_lookup_elem();
   7)   0.201 us    |          bpf_get_current_pid_tgid();
   7)               |          __htab_map_lookup_elem() {
   7)   0.211 us    |            lookup_nulls_elem_raw();
   7)   0.761 us    |          }
   7)               |          htab_map_update_elem() {
   7)               |            _raw_spin_lock_irqsave() {
   7) # 1059.700 us |              native_queued_spin_lock_slowpath();
   7) # 1060.732 us |            }
   7)   0.421 us    |            lookup_elem_raw();
   7)               |            alloc_htab_elem() {
   7)               |              __pcpu_freelist_pop() {
   7)   0.201 us    |                _raw_spin_lock();
   7)   0.190 us    |                _raw_spin_lock();
   7)   0.201 us    |                _raw_spin_lock();
   7)   0.191 us    |                _raw_spin_lock();
   7)   0.200 us    |                _raw_spin_lock();
   7)   0.190 us    |                _raw_spin_lock();
   7)   0.190 us    |                _raw_spin_lock();
   7)   0.191 us    |                _raw_spin_lock();
   7)   0.190 us    |                _raw_spin_lock();
   7)   0.190 us    |                _raw_spin_lock();
   7)   0.190 us    |                _raw_spin_lock();
   7)   0.190 us    |                _raw_spin_lock();
...

Now we’re getting somewhere. If we have a service on the servers that’s registering a handler for events that we’re generating frequently then it’s quite plausible it could be inadvertently causing the problem.

Hunting for Probes

Attempting to identify the functions that were being probed, we discovered an intesting set of tracepoints enabled on all of our proxy servers:

				
					# cat /sys/kernel/debug/tracing/set_event
irq_vectors:thermal_apic_exit
irq_vectors:thermal_apic_entry
irq_vectors:deferred_error_apic_exit
irq_vectors:deferred_error_apic_entry
irq_vectors:threshold_apic_exit
irq_vectors:threshold_apic_entry
irq_vectors:call_function_single_exit
irq_vectors:call_function_single_entry
irq_vectors:call_function_exit
irq_vectors:call_function_entry
irq_vectors:reschedule_exit
irq_vectors:reschedule_entry
irq_vectors:irq_work_exit
irq_vectors:irq_work_entry
irq_vectors:x86_platform_ipi_exit
irq_vectors:x86_platform_ipi_entry
irq_vectors:error_apic_exit
irq_vectors:error_apic_entry
irq_vectors:spurious_apic_exit
irq_vectors:spurious_apic_entry
irq_vectors:local_timer_exit
irq_vectors:local_timer_entry
irq:softirq_exit
irq:softirq_entry
irq:irq_handler_exit
irq:irq_handler_entry
sched:sched_process_exec
sched:sched_process_fork
sched:sched_process_exit
oom:mark_victim

This is far from an exhaustive list of the tracepoints exported by the kernel but many of the events listed occur very often.

Assuming from patterns observed in our stack trace that there were also dynamically created probes in play, we quickly came up with a plan to disable them on the machine exhibiting signs of the issue to see if we could alter its behaviour. Once again, the interface for achieving this could barely have been simpler:

				
					echo 0 > /sys/kernel/debug/kprobes/enabled
echo > /sys/kernel/debug/tracing/set_event

As the issue had only just started to appear on the affected machine, there wasn’t yet a great impact on cpu so the immediate response to this action was almost underwhelming. The data gathering script was run and we waited a minute until we had flame graph showing no sign of the problem.

We could almost have believed that the problem had simply dwindled away on its own but for one small detail: all references to kretprobe_ functions have now disappeared from the output.

Finding the root problem

Some time sat reasoning about what was going on served to highlight the likely cause of our kernel probes, as our engineers realised that tracepoints we’d identified were related to metrics that we gather on every machine with the excellent netdata tool. The default configuration makes use of EBPF programs to collect data and a little digging found evidence of probes to other functions that appeared to be particular hotspots in the traces we generated:

netdata_ebpf_targets_t socket_targets[] = { {.name = "inet_csk_accept", .mode = EBPF_LOAD_TRAMPOLINE}, {.name = "tcp_retransmit_skb", .mode = EBPF_LOAD_TRAMPOLINE}, {.name = "tcp_cleanup_rbuf", .mode = EBPF_LOAD_TRAMPOLINE}, {.name = "tcp_close", .mode = EBPF_LOAD_TRAMPOLINE}, {.name = "udp_recvmsg", .mode = EBPF_LOAD_TRAMPOLINE}, {.name = "tcp_sendmsg", .mode = EBPF_LOAD_TRAMPOLINE}, {.name = "udp_sendmsg", .mode = EBPF_LOAD_TRAMPOLINE}, {.name = "tcp_v4_connect", .mode = EBPF_LOAD_TRAMPOLINE}, {.name = "tcp_v6_connect", .mode = EBPF_LOAD_TRAMPOLINE}, {.name = NULL, .mode = EBPF_LOAD_TRAMPOLINE}};

We’ve since verified on a test machine that we’re able to disable the use of EBPF in netdata and there’s no further sign of probes being hit.

No work has yet been carried out to profile the handlers we were using but our expectation is that at least one must be aquiring a lock that’s shared among the 48 cores we allocate to the proxy process, and that’s causing contention issues when faced with large volumes of network events.

We plan to report the issue and hope that we’ll be able to return to recording the missing metrics as soon as possible.

Going forward

We now have a convincing explanation for the issue and a solution that’s been tested to work in the wild, but we’ve been here before. How do we guarantee both that we’ve really fixed the issue and that there are no similar issues waiting to take over?

Our engineers have constructed a plan to continuously gather profiling data from each proxy server and generate alert events whenever we encounter the cpu spending large amounts of time acquiring locks, or instances of any kretprobe_ function. The initial plan was to build a service to do this ourselves but we’re currently evaluating the Grafana Phlare project to see if it can help us to achieve that goal.

While this solution doesn’t guarantee against the problem reoccurring, it’ll give us far earlier warning than our current monitoring is able to achieve and provide us with easy access to the data we need to diagnose what’s happening.

netdata response

Update (2022-11-15): upon reporting this information to the netdata team via discord, we learned that they were already working on a solution for the next stable release (we were using netdata v1.36.1).

The latest netdata tests showed that they had improvements after they merged:

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Customer Support

Knowledge Hub^TM

Debugging Latency – Redux

Not so long ago, we posted an article that detailed an issue impacting SurfProtect service stability. We described our investigation into the cause of the problem and celebrated its resolution.

This week, it came back.

Restarting the investigation

Measuring contention

Jumping to the root cause

Hunting for Probes

Finding the root problem

Going forward

netdata response

Related Knowledge Hub™ Articles

Ensuring Robust Security Through Regular Firewall Updates

Google’s .zip & .mov TLDs – Blocking phishing URLs

Kerberos from RC4 to AES

Debugging Latency on 40 SurfProtect servers

Draytek Vulnerability

ISPA Testing

The Exa Foundation

Contact us

Sales

Sales

Office hours

Technical Support

Contact us

Office hours

Customer Support

Knowledge HubTM

Debugging Latency – Redux

Not so long ago, we posted an article that detailed an issue impacting SurfProtect service stability. We described our investigation into the cause of the problem and celebrated its resolution.

This week, it came back.

Restarting the investigation

Measuring contention

Jumping to the root cause

Hunting for Probes

Finding the root problem

Going forward

netdata response

Related Knowledge Hub™ Articles

Ensuring Robust Security Through Regular Firewall Updates

Google’s .zip & .mov TLDs – Blocking phishing URLs

Kerberos from RC4 to AES

Debugging Latency on 40 SurfProtect servers

Draytek Vulnerability

ISPA Testing

The Exa Foundation

Contact us

Sales

Sales

Office hours

Technical Support

Contact us

Office hours

Knowledge Hub^TM