Speak to an expert : Live Chat exa online chat

Knowledge HubTMeducation

Debugging Latency on 40 SurfProtect servers

For the past couple of weeks, our active monitoring detected several occasions where the latency of traffic going through our SurfProtect service increased significantly enough to be worthy of investigation.

At the outset of the investigation, when our monitoring tools first identified the problem, the effect on our customers was not pronounced enough to be noticeable. But our team discovered that the issue was persistent and that the problem never seemed to be accompanied or preceded by any change in traffic patterns or other services that might explain them.

Unfortunately, the problem became service-affecting before we were able to resolve it. This article explains what happened, why and how we were able to restore services to their normal service level.

Investigating the reason for the slowdown

Investigating our code

SurfProtect is a complex application provided by over 40 servers, 16 of which are proxies taking customers’ HTTP and HTTPS requests, accepting or blocking them.

Our service monitoring includes exporting information about the response time of different components. Looking at this data, we could see that we were making filtering decisions promptly and that the cause of the latency was within our proxy code, which was surprising as this code had been stable and well-behaved for a few years.

Our proxies are written in Go, a language designed by Google to develop high-concurrency applications such as our proxy. This language comes with excellent real-time debugging tools, which we have come to master.

The team used them to seek to understand the reason for the slowdown. Unfortunately, following our initial research, we could only establish that the SurfProtect proxy process was taking longer and longer as time went on, waiting for the OS to respond to network reads and writes.

golang syscall usage

The image on the left is an excerpt from a visualisation we use to understand where the application spends most of its time.

This particular data comes from one of the incidents where latency on a machine rose by about 100ms. Average samples were showing CPU usage of around 40% performing the same tasks, compared to our expectation of 25% during standard operation.

89.58% of our available CPU resources were spent waiting for the OS to respond to read and write operations to the network.

This meant that the remaining 75% of the work usually performed by the proxy, was being performed with the remaining 10.42% of resources available on the machine.

Our team started to look into what could have caused this behaviour change.

Investigating CPU usage

cpu system before
Looking at the CPU usage graph illustrates the issue. The CPU time allocated to the proxy service (in yellow) became negligible compared to the time spent within the Operating System (in blue). We also expect this class of machine to show far lower CPU usage at such relatively low bandwidth usage.

As the proxy processes are allowed to use (up to) all of the cores on one of the two processors present, it’s clear from the flat line at 60% overall usage that we had hit a resource contention which led to the latency issue we experienced.

The search for the root cause

Our team then realised that we needed to change our perspective and identified two independent courses of action to push us towards a resolution.

Right from the start of the incident, our tools functioned as we hoped, alerting us to high latency and abnormal resource usage, recording when they occurred, and highlighting the reason for the latency with the proxy’s code. Still, we had no visibility of what was happening inside the kernel. We could not explain why we were experiencing high CPU usage.

By and large, our monitoring focuses on the service we provide and allows us to inspect the behaviour of our software. We’re usually happy to rely on the Linux kernel to do its job after we’ve been through the process of validating during testing and through progressive acceptance in service that it can support the features we use.

Based on our deployment experience, we have a good idea of how a machine will respond to certain types of workload and the states of its various subsystems. We have a broad understanding of how Linux subsystems are implemented, but in this occurrence we were initially frustrated. None of the information generated seemed to suggest a reason for the issues. Nothing had changed since the end of the previous school year, which had seen much higher traffic volume per service. We deployed more machines during the summer, so the traffic and CPU load per machine should be lower, not higher. Nothing related to networking on the servers also looked out of place.

Facilitating the gathering of data

We needed to extend our monitoring to the Linux kernel during outages. We realised our team already had a documented way to gather this data during investigations, but we were missing a quick and easy mechanism to standardise the data collected. We decided to put together new tools to gather the information we thought would help our engineers.

But we had to be careful: some of the mechanisms required to gather this information were already running at high CPU levels, and our actions could impact normal operations. We could not leave this running permanently but had to wait for an incident to collect the data.

Investigating Linux

As the problem was not within our code, our first course of action was to ensure that we were not potentially affected by recently released and misbehaving drivers or firmware.

We then debated between rolling back an older version or updating it to a new version of the kernel just released a few days earlier. We decided to update first instead of downgrading: even if downgrading could resolve the issue, it would prevent us from diagnosing the problem. We were unsure we could replicate the issue in our test environment, and holding back the machines’ kernel version was not a long-term solution. However, it happened that we found the reason for this behaviour before we could roll back.

We also saw to determine if we could correlate at least the most severe incidents to any particular hardware vendor. During our investigation, it became clear that our newly encountered OS problems were happening across the different generations of machine and networking cards. Upon investigation, we could not trace the faulty behaviour to any specific software, firmware or hardware combination.

Diving deeper

While our investigations and new tool development took place, we also had to investigate the possibility of a non-hardware-related issue with the Linux kernel.

As pointed out above, right at the start of the investigation, we identified an issue with Linux system calls while reading and writing to the network. We decided that we needed to understand what exactly Linux was doing while handling our IO requests. If we could see pauses while writing to or reading from network cards, it would confirm a possible driver issue; otherwise, it should give us food for thought and allow us to move forward.

So the team turned to the ‘perf’ tool to investigate why system calls took so long to return.

Due to initial unfamiliarity with the ‘perf’ tool, we went straight to the interface provided by debugfs and used by ‘perf’ for this purpose. Once understood, acquiring the data we wanted could not have been simpler:

				
					echo "function_graph" >  /sys/kernel/debug/tracing/current_tracer
echo 1 >  /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace > trace

				
			

While our worries eventually proved unwarranted, the data initially made us wonder if we were monitoring contention within the tracer code. So, we limited the monitoring to only one core to limit the likelihood of going down a false track:

				
					echo "00000000,00000000,01000000" > /sys/kernel/debug/tracing/tracing_cpumask

				
			

This command provided us with a very detailed trace of precisely what was going on, along with high-resolution timing information showing how long everything was taking. It was perfect, or almost perfect, since enabling the tracer saw CPU usage jump, and we could not accurately collect data from a machine exhibiting the problem. Still, the data gathered was enough to identify a specific issue conclusively.

Our analysis focused on the kernel function do_syscall_64, used by the IO calls. The data indicated that most calls returned within a perfectly reasonable time but that some calls inexplicably took 10s to 100s of milliseconds, matching what we’d already observed with the output of ‘perf’.

Further analysis showed an interesting pattern that occurred in ‘slow’ calls:

				
					kprobe_ftrace_handler() {
 93)   0.140 us    |          get_kprobe();
 93)               |          pre_handler_kretprobe() {
 93)   0.150 us    |            _raw_spin_lock_irqsave();
 93)   0.150 us    |            _raw_spin_unlock_irqrestore();
 93)   0.150 us    |            arch_prepare_kretprobe();
 93)               |            _raw_spin_lock_irqsave() {
 93) # 1459.327 us |              native_queued_spin_lock_slowpath();
 93) # 1460.008 us |            }
 93)               |            _raw_spin_unlock_irqrestore() {```

				
			

and

				
					 79)               |        trampoline_handler() {
 79)   0.140 us    |          kprobe_busy_begin();
 79)               |          kretprobe_hash_lock() {
 79)   0.150 us    |            _raw_spin_lock_irqsave();
 79)   0.411 us    |          }
 79)   0.140 us    |          percpu_array_map_lookup_elem();
 79)   0.140 us    |          percpu_array_map_lookup_elem();
 79)   0.140 us    |          bpf_get_current_pid_tgid();
 79)               |          __htab_map_lookup_elem() {
 79)   0.150 us    |            lookup_nulls_elem_raw();
 79)   0.652 us    |          }
 79)               |          htab_map_update_elem() {
 79)               |            _raw_spin_lock_irqsave() {
 79) # 1433.058 us |              native_queued_spin_lock_slowpath();
 79) # 1434.020 us |            }

				
			

This information allowed us to establish where the CPU spent its time. Rather than just checking in periodically with the network hardware, the kernel attempted to acquire a lock rather than doing valuable work.

We were now looking in the right place, but most calls to do_syscall_64 were reasonably fast, so we used the data to generate a flame graph to better understand the scale of the problem. It provided rather convincing evidence to support our belief that there was an issue with our newer kernel:

kernel flame before

The server was performing minimum work; the CPU core was busy acquiring a spinlock for about 1/3 of the available time. As we had established that the proportion of ‘wasted’ time grew as the problem worsened, we could now explain this behaviour as increased contention as threads become more likely to vie for the shared resource.

We now knew that we were putting effort into the right place and went back to the analysis of the kernel trace. We were relieved when we tied the problem behaviour ultimately to calls to security_socket_sendmsg, which in turn calls apparmor_socket_sendmsg.

Resolution

Finding the change

AppArmor is a Mandatory Access Control (MAC) system that provides fine-grained access rights to resources within the kernel. Like sandboxing, it can limit what an application can and can not do on the server.

This Linux feature is handy when running untrusted programs and is used on our customer web servers. However, it does not achieve much on our proxies that only run our proprietary software, where we are in a position to use coarser mechanisms to provide privilege separation and security. It left us with the straightforward option of removing AppArmor from the affected machines as it consists of kernel features which are enabled from some userland programs.

The removal was first performed on one server at approx 12:50 on Thursday, 14th October. We removed the Linux AppArmor package from the machine and rebooted the server. We then migrated some customer traffic to the machine and noted no sign of increased latency. Another maintenance was therefore scheduled for the evening to roll this change out across all servers in the cluster.

While we have not yet conclusively linked the periods of severe latency to the persistent issues that were the focus of our investigation, CPU usage has remained stable since this maintenance, and we have detected no further anomalous latency. The below CPU-usage graph shows that the previously struggling server is now able to handle the same level of traffic with much-reduced CPU usage:

cpu system after

(NB: the periodic spikes in CPU usage are expected behaviour caused by memory garbage collection within our proxy process and do not contribute to latency.)

We also see in our flame graph that there’s no longer any sign of the CPU wasting cycles in a spin lock:

kernel flame after

Detection

The instant we confirmed that disabling AppArmor appeared to fix the issue, team members were on hand to tell us that the package provided the affected code had last been updated by our Linux vendor as part of their security updates on the 31st of August 2022.

It was initially confusing since there was no sign of the issue for weeks when the school year started. As the behaviour changes on servers after a reboot, and since the contention seems to only occur once the level of traffic saturates the CPU, it wasn’t clear from our monitoring that affected machines had recently been rebooted. We only reboot the machine to apply kernel when the update fixes security issues and not generic problems, as our proxies have to run for long periods with no downtime.

Looking Forward

We have monitoring in place to track the running version of our own software, but this doesn’t yet extend to tracking the running kernel version or the versions of any modules that may impact us.

We will update our operating procedure to track our active cluster members’ kernel version and machine uptime. From now on, we should be able to identify better when changes have delayed consequences more easily.

And following this work, our new tool will continue to evolve and help us gather more data from our cluster of SurfProtect servers.

Suggested Next Read

Related Knowledge Hub™ Articles

The Exa Foundation

Contact us

Other

Contact us

Is DarkLight connectivity best suited to you?

Dark fibre is perfect if you are looking for a potentially limitless, ultrafast connection with complete flexibility and control.

If you fully rely on the internet, a dark fibre connection could be the best option for you.

Is Leased Line connectivity best suited to you?

Leased Lines are best suited to you if you have high bandwidth requirements and need a reliable, uncontended service.

It is ideal for you if you regularly carry out large uploads and downloads, use cloud based services and a VoIP telephone system as well as video conferencing, for everyday communication.

Is GPON connectivity best suited to you?

GPON is a great choice for you if you need gigabit speeds but don’t need them to be symmetrical. It is becoming more widely available across the UK but may not be immediately available to you yet.

Is Rural Fibre connectivity best suited to you?

If you want to make the move to full fibre, but are based in a rural area, this option is for you.

Is FTTP connectivity best suited to you?

If you have a number of users who use cloud-based applications to upload and download data on a daily basis, but don’t transfer large amounts of data, FTTP might be your best option.

Is Gfast connectivity best suited to you?

If your line cannot support a minimum of 100Mbps, this connection is not for you. Gfast must meet the speed as a minimum. 

If your line meets this need, and you’re looking for an ultrafast, consistent and reliable connection without the hassle and upheaval of construction work – this could be a good fit.

It’s worth noting that Gfast is a stop gap to FTTP, and is not a technology that is likely to be around for a long time.

Is FTTC connectivity best suited to you?

If you need more bandwidth but don’t really need a guaranteed speed, FTTC could be for you. It is widely available throughout the UK, making it suitable as a main connection. As this connection provides higher speeds than ADSL, it is also a good option for a back up to a leased line.

As with ADSL, once the PSTN is turned off in 2025/26, FTTC will become virtually obsolete and at the very least you will require FTTP to remain connected.

Sales

Sales

Office hours

Monday: 8:30am – 5pm
Tuesday: 8:30am – 5pm
Wednesday: 8:30am – 5pm
Thursday: 8:30am – 5pm
Friday: 8:30am – 5pm
Saturday: Closed
Sunday: Closed

Finance

Contact us

Office hours

Monday: 8am – 4pm
Tuesday: 8am – 4pm
Wednesday: 8am – 4pm
Thursday: 8am – 4pm
Friday: 8am – 4pm
Saturday: Closed
Sunday: Closed

Provisioning

Contact us

Office hours

Monday: 8am – 5pm
Tuesday: 8am – 5pm
Wednesday: 8am – 5pm
Thursday: 8am – 5pm
Friday: 8am – 5pm
Saturday: Closed
Sunday: Closed

Is DSL connectivity best suited to you?

DSL connections offer very limited bandwidth so it might be right for you if you typically use the internet for less data-intensive tasks. If you’re sending emails, browsing the web, downloading very small files and working with small amounts of data – you should be fine with DSL.

It is worth noting connections based on copper wire, like DSL, will be switched off in the UK by Openreach, with a phased approach due to begin at the end of 2025. If you don’t have a fibre connection at the moment, you’ll need to upgrade this as well as move to a VoIP telephone system.

Technical Support

Contact us

Office hours

Monday: 8am – 6pm
Tuesday: 8am – 6pm
Wednesday: 8am – 6pm
Thursday: 8am – 6pm
Friday: 8am – 6pm
Saturday: 10am – 4pm
Sunday: 10am – 4pm