{"id":16006,"date":"2022-11-10T10:00:00","date_gmt":"2022-11-10T10:00:00","guid":{"rendered":"https:\/\/exa.net.uk\/?p=16006"},"modified":"2024-05-15T17:03:09","modified_gmt":"2024-05-15T16:03:09","slug":"debugging-latency-redux","status":"publish","type":"post","link":"https:\/\/exa.net.uk\/knowledge-hub\/technical\/debugging-latency-redux\/","title":{"rendered":"Debugging Latency &#8211; Redux"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"16006\" data-post-id=\"16006\" data-obj-id=\"16006\" class=\"elementor elementor-16006 dce-elementor-post-16006\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-2e642dc0 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-dce-background-image-url=\"https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/blog-post-top.png\" data-id=\"2e642dc0\" data-element_type=\"section\" 
data-settings=\"{&quot;background_background&quot;:&quot;classic&quot;,&quot;jet_parallax_layout_list&quot;:[{&quot;_id&quot;:&quot;6d8e4b3&quot;,&quot;jet_parallax_layout_image&quot;:{&quot;url&quot;:&quot;&quot;,&quot;id&quot;:&quot;&quot;,&quot;size&quot;:&quot;&quot;},&quot;jet_parallax_layout_image_tablet&quot;:{&quot;url&quot;:&quot;&quot;,&quot;id&quot;:&quot;&quot;,&quot;size&quot;:&quot;&quot;},&quot;jet_parallax_layout_image_mobile&quot;:{&quot;url&quot;:&quot;&quot;,&quot;id&quot;:&quot;&quot;,&quot;size&quot;:&quot;&quot;},&quot;jet_parallax_layout_speed&quot;:{&quot;unit&quot;:&quot;%&quot;,&quot;size&quot;:50,&quot;sizes&quot;:[]},&quot;jet_parallax_layout_type&quot;:&quot;scroll&quot;,&quot;jet_parallax_layout_direction&quot;:null,&quot;jet_parallax_layout_fx_direction&quot;:null,&quot;jet_parallax_layout_z_index&quot;:&quot;&quot;,&quot;jet_parallax_layout_bg_x&quot;:50,&quot;jet_parallax_layout_bg_x_tablet&quot;:&quot;&quot;,&quot;jet_parallax_layout_bg_x_mobile&quot;:&quot;&quot;,&quot;jet_parallax_layout_bg_y&quot;:50,&quot;jet_parallax_layout_bg_y_tablet&quot;:&quot;&quot;,&quot;jet_parallax_layout_bg_y_mobile&quot;:&quot;&quot;,&quot;jet_parallax_layout_bg_size&quot;:&quot;auto&quot;,&quot;jet_parallax_layout_bg_size_tablet&quot;:&quot;&quot;,&quot;jet_parallax_layout_bg_size_mobile&quot;:&quot;&quot;,&quot;jet_parallax_layout_animation_prop&quot;:&quot;transform&quot;,&quot;jet_parallax_layout_on&quot;:[&quot;desktop&quot;,&quot;tablet&quot;]}]}\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-56af8b04\" data-id=\"56af8b04\" data-element_type=\"column\" data-settings=\"{&quot;background_background&quot;:&quot;classic&quot;}\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div data-dce-background-color=\"#FFFFFF\" class=\"elementor-element 
elementor-element-3a1d4750 master-btn-width elementor-mobile-align-center elementor-widget__width-auto elementor-widget-mobile__width-inherit elementor-absolute elementor-widget elementor-widget-button\" data-id=\"3a1d4750\" data-element_type=\"widget\" data-settings=\"{&quot;_position&quot;:&quot;absolute&quot;}\" data-widget_type=\"button.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<div class=\"elementor-button-wrapper\">\n\t\t\t<a class=\"elementor-button elementor-button-link elementor-size-sm\" href=\"https:\/\/exa.net.uk\/knowledge-hub\/\" target=\"_blank\" rel=\"noopener\">\n\t\t\t\t\t\t<span class=\"elementor-button-content-wrapper\">\n\t\t\t\t\t\t<span class=\"elementor-button-icon\">\n\t\t\t\t<i aria-hidden=\"true\" class=\"fas fa-caret-left\"><\/i>\t\t\t<\/span>\n\t\t\t\t\t\t\t\t\t<span class=\"elementor-button-text\">Knowledge Hub<\/span>\n\t\t\t\t\t<\/span>\n\t\t\t\t\t<\/a>\n\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7553618a elementor-widget__width-auto elementor-widget elementor-widget-heading\" data-id=\"7553618a\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Knowledge <span class=\"lightblue\">Hub<\/span><sup class=\"blog-tm\">TM<\/sup><span class=\"hub-image\"><img decoding=\"async\" src=\"https:\/\/exa.net.uk\/wp-content\/uploads\/2021\/02\/education.svg\" alt=\"\" title=\"\"><\/span><\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2b0479e9 elementor-widget__width-initial elementor-widget elementor-widget-theme-post-title elementor-page-title elementor-widget-heading\" data-id=\"2b0479e9\" data-element_type=\"widget\" data-widget_type=\"theme-post-title.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h1 class=\"elementor-heading-title 
elementor-size-default\">Debugging Latency &#8211; Redux<\/h1>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-103f2b65 elementor-widget__width-initial elementor-hidden-phone elementor-widget elementor-widget-spacer\" data-id=\"103f2b65\" data-element_type=\"widget\" data-widget_type=\"spacer.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<div class=\"elementor-spacer\">\n\t\t\t<div class=\"elementor-spacer-inner\"><\/div>\n\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-545bb243 elementor-mobile-align-center elementor-align-center elementor-widget__width-initial post-info elementor-widget elementor-widget-post-info\" data-id=\"545bb243\" data-element_type=\"widget\" data-widget_type=\"post-info.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<ul class=\"elementor-inline-items elementor-icon-list-items elementor-post-info\">\n\t\t\t\t\t\t\t\t<li class=\"elementor-icon-list-item elementor-repeater-item-ba9c996 elementor-inline-item\" itemprop=\"datePublished\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<span class=\"elementor-icon-list-text elementor-post-info__item elementor-post-info__item--type-date\">\n\t\t\t\t\t\t\t<span class=\"elementor-post-info__item-prefix\">Date Posted:<\/span>\n\t\t\t\t\t\t\t\t\t\t<time>10.11.2022<\/time>\t\t\t\t\t<\/span>\n\t\t\t\t\t\t\t\t<\/li>\n\t\t\t\t<li class=\"elementor-icon-list-item elementor-repeater-item-7e61590 elementor-inline-item\" itemprop=\"about\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<span class=\"elementor-icon-list-text elementor-post-info__item elementor-post-info__item--type-terms\">\n\t\t\t\t\t\t\t<span class=\"elementor-post-info__item-prefix\">Read time:<\/span>\n\t\t\t\t\t\t\t\t\t\t<span class=\"elementor-post-info__terms-list\">\n\t\t\t\t<span class=\"elementor-post-info__terms-list-item\">11 min read<\/span>\t\t\t\t<\/span>\n\t\t\t\t\t<\/span>\n\t\t\t\t\t\t\t\t<\/li>\n\t\t\t\t<li 
class=\"elementor-icon-list-item elementor-repeater-item-a339c7b elementor-inline-item\" itemprop=\"author\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<span class=\"elementor-icon-list-text elementor-post-info__item elementor-post-info__item--type-author\">\n\t\t\t\t\t\t\t<span class=\"elementor-post-info__item-prefix\">Written by:<\/span>\n\t\t\t\t\t\t\t\t\t\tExa Networks\t\t\t\t\t<\/span>\n\t\t\t\t\t\t\t\t<\/li>\n\t\t\t\t<\/ul>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-33be5e0b elementor-widget__width-initial elementor-hidden-phone elementor-widget elementor-widget-spacer\" data-id=\"33be5e0b\" data-element_type=\"widget\" data-widget_type=\"spacer.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<div class=\"elementor-spacer\">\n\t\t\t<div class=\"elementor-spacer-inner\"><\/div>\n\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-8587caa elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"8587caa\" data-element_type=\"section\" id=\"main-article\" 
data-settings=\"{&quot;jet_parallax_layout_list&quot;:[{&quot;_id&quot;:&quot;5fc546a&quot;,&quot;jet_parallax_layout_image&quot;:{&quot;url&quot;:&quot;&quot;,&quot;id&quot;:&quot;&quot;,&quot;size&quot;:&quot;&quot;},&quot;jet_parallax_layout_image_tablet&quot;:{&quot;url&quot;:&quot;&quot;,&quot;id&quot;:&quot;&quot;,&quot;size&quot;:&quot;&quot;},&quot;jet_parallax_layout_image_mobile&quot;:{&quot;url&quot;:&quot;&quot;,&quot;id&quot;:&quot;&quot;,&quot;size&quot;:&quot;&quot;},&quot;jet_parallax_layout_speed&quot;:{&quot;unit&quot;:&quot;%&quot;,&quot;size&quot;:50,&quot;sizes&quot;:[]},&quot;jet_parallax_layout_type&quot;:&quot;scroll&quot;,&quot;jet_parallax_layout_direction&quot;:null,&quot;jet_parallax_layout_fx_direction&quot;:null,&quot;jet_parallax_layout_z_index&quot;:&quot;&quot;,&quot;jet_parallax_layout_bg_x&quot;:50,&quot;jet_parallax_layout_bg_x_tablet&quot;:&quot;&quot;,&quot;jet_parallax_layout_bg_x_mobile&quot;:&quot;&quot;,&quot;jet_parallax_layout_bg_y&quot;:50,&quot;jet_parallax_layout_bg_y_tablet&quot;:&quot;&quot;,&quot;jet_parallax_layout_bg_y_mobile&quot;:&quot;&quot;,&quot;jet_parallax_layout_bg_size&quot;:&quot;auto&quot;,&quot;jet_parallax_layout_bg_size_tablet&quot;:&quot;&quot;,&quot;jet_parallax_layout_bg_size_mobile&quot;:&quot;&quot;,&quot;jet_parallax_layout_animation_prop&quot;:&quot;transform&quot;,&quot;jet_parallax_layout_on&quot;:[&quot;desktop&quot;,&quot;tablet&quot;]}]}\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-50 elementor-top-column elementor-element elementor-element-3bf7379a\" data-id=\"3bf7379a\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-7caea28c custom-posts elementor-widget elementor-widget-text-editor\" data-id=\"7caea28c\" data-element_type=\"widget\" 
data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<h4>Not so long ago, we posted an <a href=\"https:\/\/exa.net.uk\/latency-debugging\/\" target=\"_blank\" rel=\"noopener\">article<\/a> that detailed an issue impacting SurfProtect service stability. We described our investigation into the cause of the problem and celebrated its resolution.<\/h4><h4>This week, it came back.<\/h4><p>While our previous investigation required significant effort, co-ordination between our teams, and time, we&#8217;re happy to say that the lessons learned from our previous difficulties meant that we were able to diagnose the issue and apply a fix within hours. The problem again proved to be service-affecting, however, and this follow-up describes what happened this time, providing more insight into how we run our services and hopefully explaining why we don&#8217;t expect another repeat.<\/p><p>The issue impacted two servers almost simultaneously, just after 08:00 on Thursday 2022-11-10. We&#8217;d had weeks of monitoring confirming that cpu usage was back to normal levels, so we were confident in the findings from our previous investigation, but we&#8217;re now unsure whether we fixed a similar issue or simply relieved pressure on the current one. We will likely continue this series of articles with a discussion of the work required to confirm our current belief.<\/p><p>Traffic was migrated away from the affected servers to an available machine, and after some time we began to see the problem occur on that server too. 
Fortunately, a potential fix was available before it was necessary to migrate traffic away and we were able to observe an immediate change in behaviour with the fix applied.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-39e59686 elementor-widget elementor-widget-heading\" data-id=\"39e59686\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Restarting the investigation<\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3b550f92 custom-posts elementor-widget elementor-widget-text-editor\" data-id=\"3b550f92\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>Just like in the previous incidents, the first we knew that anything was wrong was when our monitoring system started alerting us to high service latency on two servers.<\/p><p>This time, our team immediately started up our data collection script on the affected machines then looked up performance metrics while we waited for the more detailed traces to populate.<\/p><p>Running\u00a0<code>top<\/code>\u00a0showed just over 50% cpu usage, which matched the overview of the proxy process, showing it maxing out each of the cores it was assigned to.<\/p><p>The graphical data shows a rapid but not quite immediate ramp up to those levels:<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-14e76ce dce_masking-none elementor-widget elementor-widget-image\" data-id=\"14e76ce\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"275\" 
src=\"https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-qbit-cpu-1024x275.png\" class=\"attachment-large size-large wp-image-16014\" alt=\"\" srcset=\"https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-qbit-cpu-1024x275.png 1024w, https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-qbit-cpu-300x80.png 300w, https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-qbit-cpu-768x206.png 768w, https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-qbit-cpu.png 1424w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" title=\"\">\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-inner-section elementor-element elementor-element-35f04388 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"35f04388\" data-element_type=\"section\" data-settings=\"{&quot;jet_parallax_layout_list&quot;:[]}\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-50 elementor-inner-column elementor-element elementor-element-41f1dcb4\" data-id=\"41f1dcb4\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-4471a754 custom-posts elementor-widget elementor-widget-text-editor\" data-id=\"4471a754\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>It might seem strange that we&#8217;re apparently hitting full capacity at only ~50% cpu usage but that&#8217;s plenty to handle the current traffic volume we&#8217;re allocating to each machine, and we&#8217;re avoiding running the proxy process across NUMA domains for now. 
Just like before, nearly all of our cpu time was spent in system calls and was concentrated onto the cores running the proxy.<\/p><p>We&#8217;ve included a visualisation of where the proxy was spending its time again, but the values are so close to the previous post that we may as well have just duplicated the image:<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-50 elementor-inner-column elementor-element elementor-element-0b86cd5\" data-id=\"0b86cd5\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-536dd9b dce_masking-none elementor-widget elementor-widget-image\" data-id=\"536dd9b\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" width=\"507\" height=\"601\" src=\"https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-golang-syscall-usage.png\" class=\"attachment-large size-large wp-image-16016\" alt=\"\" srcset=\"https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-golang-syscall-usage.png 507w, https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-golang-syscall-usage-253x300.png 253w\" sizes=\"(max-width: 507px) 100vw, 507px\" title=\"\">\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<div class=\"elementor-element elementor-element-68ae8fa7 custom-posts elementor-widget elementor-widget-text-editor\" data-id=\"68ae8fa7\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>This time, though, let&#8217;s look at the actual impact of that high load. 
This image shows the latency that our health checker observed while monitoring one proxy service.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-67fdf43 dce_masking-none elementor-widget elementor-widget-image\" data-id=\"67fdf43\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" width=\"944\" height=\"301\" src=\"https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-http-latency-wild.png\" class=\"attachment-large size-large wp-image-16018\" alt=\"debugging latency\" srcset=\"https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-http-latency-wild.png 944w, https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-http-latency-wild-300x96.png 300w, https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-http-latency-wild-768x245.png 768w\" sizes=\"(max-width: 944px) 100vw, 944px\" title=\"\">\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-be6b260 custom-posts elementor-widget elementor-widget-text-editor\" data-id=\"be6b260\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>See how inconsistent those latencies are? 
That&#8217;s a measure of how irregular system call latencies were, which fits with the idea that we&#8217;re blocking while waiting for a (very) contended resource.<\/p><p>Even at the extreme lower end of the graph, though, those latencies look horrendous when compared to the values we expect to see.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a43fa7b dce_masking-none elementor-widget elementor-widget-image\" data-id=\"a43fa7b\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"949\" height=\"302\" src=\"https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-http-latency-normal.png\" class=\"attachment-large size-large wp-image-16019\" alt=\"\" srcset=\"https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-http-latency-normal.png 949w, https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-http-latency-normal-300x95.png 300w, https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-http-latency-normal-768x244.png 768w\" sizes=\"(max-width: 949px) 100vw, 949px\" title=\"\">\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-02f7427 elementor-widget elementor-widget-text-editor\" data-id=\"02f7427\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>The spikes here on metrics gathered under normal load measure increased latency on the services we&#8217;re accessing through the proxy rather than on the proxy itself, and it&#8217;s clear to see that the earlier samples were severely impacted by the issue. 
With web requests taking up to 1 second to handle, users are going to find the service unbearably slow.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-790b73d7 custom-posts elementor-widget elementor-widget-text-editor\" data-id=\"790b73d7\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<h3 id=\"investigating-cpu-usage\">Measuring contention<\/h3>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0fa0e4a elementor-widget elementor-widget-text-editor\" data-id=\"0fa0e4a\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>So far we&#8217;ve limited our discussion to directly measuring the impact of the problem on the overall service, but what about individual components? Do some parts of the proxy fare better than others under contention, and can we use that knowledge to make the service more resilient to the problem?<\/p><p>One of the metrics we track shows the average time taken to successfully complete TLS handshakes. The amount of work done by both sides depends on a few factors, so it&#8217;s normal for us to see quite a wide spread of latencies under normal load. 
Clearly, though, the increase in time taken by this subsystem will contribute to the overall drop in performance.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6592b082 dce_masking-none elementor-widget elementor-widget-image\" data-id=\"6592b082\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"628\" height=\"262\" src=\"https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-proxy-tls-latency.png\" class=\"attachment-large size-large wp-image-16015\" alt=\"\" srcset=\"https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-proxy-tls-latency.png 628w, https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-proxy-tls-latency-300x125.png 300w\" sizes=\"(max-width: 628px) 100vw, 628px\" title=\"\">\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f65cd2d custom-posts elementor-widget elementor-widget-text-editor\" data-id=\"f65cd2d\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>Other than highlighting that we need to add more bins to that particular histogram, this data shows that we&#8217;re starting to add a great deal of latency before we even receive a web request.<\/p><p>At the other end of the proxy&#8217;s duties, we find that latency piles up further as response times from our backend decision logic appear to increase significantly.<\/p><p>We use two types of request messages for interacting with the backend service: options requests and decision requests. 
Decision requests determine the exact behaviour of the service in response to the web request made by a user and their latency exhibits a somewhat long tail as we sometimes need to consult services that sit elsewhere on the network. Options requests, on the other hand, control proxy behaviour during initial interception of web requests, and are usually expected to return a response within well under a millisecond.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-87f54e5 dce_masking-none elementor-widget elementor-widget-image\" data-id=\"87f54e5\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"629\" height=\"263\" src=\"https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-proxy-decision-latency.png\" class=\"attachment-large size-large wp-image-16022\" alt=\"\" srcset=\"https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-proxy-decision-latency.png 629w, https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-proxy-decision-latency-300x125.png 300w\" sizes=\"(max-width: 629px) 100vw, 629px\" title=\"\">\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a9f775b dce_masking-none elementor-widget elementor-widget-image\" data-id=\"a9f775b\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"628\" height=\"264\" src=\"https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-proxy-options-latency.png\" class=\"attachment-large size-large wp-image-16027\" alt=\"\" srcset=\"https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-proxy-options-latency.png 628w, 
https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-proxy-options-latency-300x126.png 300w\" sizes=\"(max-width: 628px) 100vw, 628px\" title=\"\">\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9646eff elementor-widget elementor-widget-text-editor\" data-id=\"9646eff\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>We didn&#8217;t expect that it would be useful to measure more than 500x the expected response time of an options request, but it&#8217;s safe to use the metrics for decision requests as a guide and assume that both sets of communication are adding, on average, 100-200ms or more to our total time.<\/p><p>That sounds really bad, but the deciders themselves paint a different picture. This is a higher-resolution breakdown of the time taken to process the same requests we measured above, this time from the point of view of one of the local decider processes:<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1b26c5d dce_masking-none elementor-widget elementor-widget-image\" data-id=\"1b26c5d\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"630\" height=\"302\" src=\"https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-decider-options-latency.png\" class=\"attachment-large size-large wp-image-16023\" alt=\"\" srcset=\"https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-decider-options-latency.png 630w, https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-decider-options-latency-300x144.png 300w\" sizes=\"(max-width: 630px) 100vw, 630px\" title=\"\">\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-704cd93 
elementor-widget elementor-widget-text-editor\" data-id=\"704cd93\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>The deciders are assigned to completely different cores than the proxy instance, but it&#8217;s clear that performance is still impacted by what&#8217;s happening on the rest of the machine. While the proxy measures hundreds of milliseconds of latency, however, here we see that requests are now most likely to be processed in 100us to 2ms, and that no request takes more than 10ms.<\/p><p>While there&#8217;s definitely evidence of degraded performance across the machine, we have once again found that the issue impacts us most in communication between services.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-13537475 elementor-widget elementor-widget-heading\" data-id=\"13537475\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Jumping to the root cause<\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-31c4da47 custom-posts elementor-widget elementor-widget-text-editor\" data-id=\"31c4da47\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>During our earlier investigation, the team initially focused on debugging our own services and eventually we started building tools to investigate the possibility of an issue within the Linux kernel. 
With those tools now having been rolled into our data collection process, all the data we needed was ready by the time we&#8217;d performed our initial analysis and determined that the symptoms exactly matched the issues we&#8217;d only just fixed.<\/p><p>We knew exactly what the contention looked like before and went straight to our flame graph to see if we could see any sign of similar behaviour:<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8473cf3 dce_masking-none elementor-widget elementor-widget-image\" data-id=\"8473cf3\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"803\" src=\"https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-flame-bad-1024x803.png\" class=\"attachment-large size-large wp-image-16028\" alt=\"\" srcset=\"https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-flame-bad-1024x803.png 1024w, https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-flame-bad-300x235.png 300w, https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-flame-bad-768x602.png 768w, https:\/\/exa.net.uk\/wp-content\/uploads\/2022\/11\/sp-latency2-flame-bad.png 1202w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" title=\"\">\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fe5ac70 elementor-widget elementor-widget-text-editor\" data-id=\"fe5ac70\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\tUnbelievably, the flame graph was almost indistinguishable from the version we published last time.\n\nWith that in mind, we recalled from our previous analysis that we&#8217;d initially worried about potentially measuring behaviour of the tracer, based on seeing contended calls to 
the same locking mechanism that appeared heavily in calls to\u00a0<code style=\"background: lightgrey;\">apparmor_socket_sendmsg<\/code>. We put those instances down to backpressure from the actual issue since they went away after disabling AppArmor, but here they were again!\n\nThis lent credibility to the idea that there were perhaps multiple sources of contention, and removing AppArmor had merely reduced demand on our contended resource enough that the issue was no longer visible.\n\nTurning to the stack traces we&#8217;d just gathered, we quickly recognised that the calls we&#8217;d erroneously attributed to our use of ftrace appear in fact to have been caused by actively registered\u00a0<a href=\"https:\/\/www.kernel.org\/doc\/html\/latest\/trace\/kprobes.html\" style=\"color:#009fe3;\" target=\"_blank\" rel=\"noopener\">kernel probes<\/a>.\n\nThe trampoline handler shown below is responsible for calling a user-specified return handler that we can see spending a lot of time in a spin lock:\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-58310d9 elementor-widget elementor-widget-code-highlight\" data-id=\"58310d9\" data-element_type=\"widget\" data-widget_type=\"code-highlight.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<div class=\"prismjs-default copy-to-clipboard \">\n\t\t\t<pre data-line=\"\" class=\"highlight-height language-javascript line-numbers\">\n\t\t\t\t<code readonly=\"true\" class=\"language-javascript\">\n\t\t\t\t\t<xmp>   7)               |        trampoline_handler() {\n   7)   0.201 us    |          kprobe_busy_begin();\n   7)               |          kretprobe_hash_lock() {\n   7)   0.201 us    |            _raw_spin_lock_irqsave();\n   7)   0.561 us    |          }\n   7)   0.200 us    |          percpu_array_map_lookup_elem();\n   7)   0.200 us    |          percpu_array_map_lookup_elem();\n   7)   0.201 us    |          bpf_get_current_pid_tgid();\n   7)               |    
      __htab_map_lookup_elem() {\n   7)   0.211 us    |            lookup_nulls_elem_raw();\n   7)   0.761 us    |          }\n   7)               |          htab_map_update_elem() {\n   7)               |            _raw_spin_lock_irqsave() {\n   7) # 1059.700 us |              native_queued_spin_lock_slowpath();\n   7) # 1060.732 us |            }\n   7)   0.421 us    |            lookup_elem_raw();\n   7)               |            alloc_htab_elem() {\n   7)               |              __pcpu_freelist_pop() {\n   7)   0.201 us    |                _raw_spin_lock();\n   7)   0.190 us    |                _raw_spin_lock();\n   7)   0.201 us    |                _raw_spin_lock();\n   7)   0.191 us    |                _raw_spin_lock();\n   7)   0.200 us    |                _raw_spin_lock();\n   7)   0.190 us    |                _raw_spin_lock();\n   7)   0.190 us    |                _raw_spin_lock();\n   7)   0.191 us    |                _raw_spin_lock();\n   7)   0.190 us    |                _raw_spin_lock();\n   7)   0.190 us    |                _raw_spin_lock();\n   7)   0.190 us    |                _raw_spin_lock();\n   7)   0.190 us    |                _raw_spin_lock();\n...\n<\/xmp>\n\t\t\t\t<\/code>\n\t\t\t<\/pre>\n\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-28d0959 elementor-widget elementor-widget-text-editor\" data-id=\"28d0959\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>Now we&#8217;re getting somewhere. 
If we have a service on the servers that&#8217;s registering a handler for events that we&#8217;re generating frequently, then it&#8217;s quite plausible it could be inadvertently causing the problem.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2705284 elementor-widget elementor-widget-heading\" data-id=\"2705284\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Hunting for Probes<\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5492f3d0 custom-posts elementor-widget elementor-widget-text-editor\" data-id=\"5492f3d0\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>Attempting to identify the functions that were being probed, we discovered an interesting set of tracepoints enabled on all of our proxy servers:<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fdfd27b elementor-widget elementor-widget-code-highlight\" data-id=\"fdfd27b\" data-element_type=\"widget\" data-widget_type=\"code-highlight.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<div class=\"prismjs-default copy-to-clipboard \">\n\t\t\t<pre data-line=\"\" class=\"highlight-height language-javascript line-numbers\">\n\t\t\t\t<code readonly=\"true\" class=\"language-javascript\">\n\t\t\t\t\t<xmp># cat 
\/sys\/kernel\/debug\/tracing\/set_event\nirq_vectors:thermal_apic_exit\nirq_vectors:thermal_apic_entry\nirq_vectors:deferred_error_apic_exit\nirq_vectors:deferred_error_apic_entry\nirq_vectors:threshold_apic_exit\nirq_vectors:threshold_apic_entry\nirq_vectors:call_function_single_exit\nirq_vectors:call_function_single_entry\nirq_vectors:call_function_exit\nirq_vectors:call_function_entry\nirq_vectors:reschedule_exit\nirq_vectors:reschedule_entry\nirq_vectors:irq_work_exit\nirq_vectors:irq_work_entry\nirq_vectors:x86_platform_ipi_exit\nirq_vectors:x86_platform_ipi_entry\nirq_vectors:error_apic_exit\nirq_vectors:error_apic_entry\nirq_vectors:spurious_apic_exit\nirq_vectors:spurious_apic_entry\nirq_vectors:local_timer_exit\nirq_vectors:local_timer_entry\nirq:softirq_exit\nirq:softirq_entry\nirq:irq_handler_exit\nirq:irq_handler_entry\nsched:sched_process_exec\nsched:sched_process_fork\nsched:sched_process_exit\noom:mark_victim\n<\/xmp>\n\t\t\t\t<\/code>\n\t\t\t<\/pre>\n\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-746be891 custom-posts elementor-widget elementor-widget-text-editor\" data-id=\"746be891\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>This is far from an exhaustive list of the tracepoints exported by the kernel but many of the events listed occur very often.<\/p><p>Assuming from patterns observed in our stack trace that there were also dynamically created probes in play, we quickly came up with a plan to disable them on the machine exhibiting signs of the issue to see if we could alter its behaviour. 
Once again, the interface for achieving this could barely have been simpler:<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-63bd1ed elementor-widget elementor-widget-code-highlight\" data-id=\"63bd1ed\" data-element_type=\"widget\" data-widget_type=\"code-highlight.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<div class=\"prismjs-default copy-to-clipboard \">\n\t\t\t<pre data-line=\"\" class=\"highlight-height language-javascript line-numbers\">\n\t\t\t\t<code readonly=\"true\" class=\"language-javascript\">\n\t\t\t\t\t<xmp>echo 0 > \/sys\/kernel\/debug\/kprobes\/enabled\necho > \/sys\/kernel\/debug\/tracing\/set_event<\/xmp>\n\t\t\t\t<\/code>\n\t\t\t<\/pre>\n\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-234845f1 elementor-widget elementor-widget-text-editor\" data-id=\"234845f1\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>As the issue had only just started to appear on the affected machine, there wasn&#8217;t yet a great impact on CPU, so the immediate response to this action was almost underwhelming. 
The data gathering script was run and we waited a minute until we had a flame graph showing no sign of the problem.<\/p>\nWe could almost have believed that the problem had simply dwindled away on its own but for one small detail: all references to\u00a0<code style=\"background-color: lightgrey;\">kretprobe_<\/code>\u00a0functions had now disappeared from the output.\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5209ba5 elementor-widget elementor-widget-heading\" data-id=\"5209ba5\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Finding the root problem<\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-47c06d7b elementor-widget elementor-widget-text-editor\" data-id=\"47c06d7b\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>Some time sat reasoning about what was going on served to highlight the likely cause of our kernel probes, as our engineers realised that the tracepoints we&#8217;d identified were related to metrics that we gather on every machine with the excellent\u00a0<a style=\"color: #009fe3;\" href=\"https:\/\/www.netdata.cloud\/\" target=\"_blank\" rel=\"noopener\">netdata<\/a>\u00a0tool. 
The default configuration makes use of eBPF programs to collect data, and a little digging found evidence of probes on other functions that appeared to be particular hotspots in the traces we generated:<\/p><p><code style=\"background-color: lightgrey;\">netdata_ebpf_targets_t socket_targets[] = { {.name = \"inet_csk_accept\", .mode = EBPF_LOAD_TRAMPOLINE}, {.name = \"tcp_retransmit_skb\", .mode = EBPF_LOAD_TRAMPOLINE}, {.name = \"tcp_cleanup_rbuf\", .mode = EBPF_LOAD_TRAMPOLINE}, {.name = \"tcp_close\", .mode = EBPF_LOAD_TRAMPOLINE}, {.name = \"udp_recvmsg\", .mode = EBPF_LOAD_TRAMPOLINE}, {.name = \"tcp_sendmsg\", .mode = EBPF_LOAD_TRAMPOLINE}, {.name = \"udp_sendmsg\", .mode = EBPF_LOAD_TRAMPOLINE}, {.name = \"tcp_v4_connect\", .mode = EBPF_LOAD_TRAMPOLINE}, {.name = \"tcp_v6_connect\", .mode = EBPF_LOAD_TRAMPOLINE}, {.name = NULL, .mode = EBPF_LOAD_TRAMPOLINE}};<\/code><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3a5dd022 elementor-widget elementor-widget-text-editor\" data-id=\"3a5dd022\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>We&#8217;ve since verified on a test machine that we&#8217;re able to disable the use of eBPF in netdata and there&#8217;s no further sign of probes being hit.<\/p>\n<p>No work has yet been carried out to profile the handlers we were using but our expectation is that at least one must be acquiring a lock that&#8217;s shared among the 48 cores we allocate to the proxy process, and that&#8217;s causing contention issues when faced with large volumes of network events.<\/p>\n<p>We plan to report the issue and hope that we&#8217;ll be able to return to recording the missing metrics as soon as possible.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9c2224a elementor-widget elementor-widget-heading\" data-id=\"9c2224a\" 
data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Going forward<\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6ae7ceb2 elementor-widget elementor-widget-text-editor\" data-id=\"6ae7ceb2\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>We now have a convincing explanation for the issue and a solution that&#8217;s been tested to work in the wild, but we&#8217;ve been here before. How do we guarantee both that we&#8217;ve really fixed the issue and that there are no similar issues waiting to take over?<\/p><p>Our engineers have constructed a plan to continuously gather profiling data from each proxy server and generate alert events whenever we encounter the CPU spending large amounts of time acquiring locks, or instances of any\u00a0<code style=\"background-color: lightgrey;\">kretprobe_<\/code>\u00a0function. 
The initial plan was to build a service to do this ourselves but we&#8217;re currently evaluating the\u00a0<a style=\"color: #009fe3;\" href=\"https:\/\/github.com\/grafana\/phlare\" target=\"_blank\" rel=\"noopener\">Grafana Phlare<\/a>\u00a0project to see if it can help us to achieve that goal.<\/p><p>While this solution doesn&#8217;t guarantee against the problem reoccurring, it&#8217;ll give us far earlier warning than our current monitoring is able to achieve and provide us with easy access to the data we need to diagnose what&#8217;s happening.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ded0578 elementor-widget elementor-widget-heading\" data-id=\"ded0578\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">netdata response<\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5e1370cc elementor-widget elementor-widget-text-editor\" data-id=\"5e1370cc\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<strong>Update (2022-11-15):<\/strong> upon\u00a0<a style=\"color: #009fe3;\" href=\"https:\/\/discord.com\/channels\/847502280503590932\/1042075478221148270\" target=\"_blank\" rel=\"noopener\">reporting this information<\/a>\u00a0to the netdata team via Discord, we learned that they were already working on a solution for the next stable release (we were using netdata v1.36.1).\n<div class=\"p-rich_text_section\">The netdata team&#8217;s latest tests showed improvements after they merged the following pull requests:<\/div>\n<ul class=\"p-rich_text_list p-rich_text_list__bullet\" data-stringify-type=\"unordered-list\" data-indent=\"0\" data-border=\"0\">\n \t<li data-stringify-indent=\"0\" data-stringify-border=\"0\"><a style=\"color: #009fe3;\" 
href=\"https:\/\/github.com\/netdata\/netdata\/pull\/13397\" target=\"_blank\" rel=\"noopener\">https:\/\/github.com\/netdata\/netdata\/pull\/13397<\/a><\/li>\n \t<li data-stringify-indent=\"0\" data-stringify-border=\"0\"><a style=\"color: #009fe3;\" href=\"https:\/\/github.com\/netdata\/netdata\/pull\/13530\" target=\"_blank\" rel=\"noopener\">https:\/\/github.com\/netdata\/netdata\/pull\/13530<\/a><\/li>\n \t<li data-stringify-indent=\"0\" data-stringify-border=\"0\"><a style=\"color: #009fe3;\" href=\"https:\/\/github.com\/netdata\/netdata\/pull\/13624\" target=\"_blank\" rel=\"noopener\">https:\/\/github.com\/netdata\/netdata\/pull\/13624<\/a><\/li>\n<\/ul>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-inner-section elementor-element elementor-element-306a1079 elementor-section-content-middle elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"306a1079\" data-element_type=\"section\" 
data-settings=\"{&quot;jet_parallax_layout_list&quot;:[{&quot;_id&quot;:&quot;62e95af&quot;,&quot;jet_parallax_layout_image&quot;:{&quot;url&quot;:&quot;&quot;,&quot;id&quot;:&quot;&quot;,&quot;size&quot;:&quot;&quot;},&quot;jet_parallax_layout_image_tablet&quot;:{&quot;url&quot;:&quot;&quot;,&quot;id&quot;:&quot;&quot;,&quot;size&quot;:&quot;&quot;},&quot;jet_parallax_layout_image_mobile&quot;:{&quot;url&quot;:&quot;&quot;,&quot;id&quot;:&quot;&quot;,&quot;size&quot;:&quot;&quot;},&quot;jet_parallax_layout_speed&quot;:{&quot;unit&quot;:&quot;%&quot;,&quot;size&quot;:50,&quot;sizes&quot;:[]},&quot;jet_parallax_layout_type&quot;:&quot;scroll&quot;,&quot;jet_parallax_layout_direction&quot;:null,&quot;jet_parallax_layout_fx_direction&quot;:null,&quot;jet_parallax_layout_z_index&quot;:&quot;&quot;,&quot;jet_parallax_layout_bg_x&quot;:50,&quot;jet_parallax_layout_bg_x_tablet&quot;:&quot;&quot;,&quot;jet_parallax_layout_bg_x_mobile&quot;:&quot;&quot;,&quot;jet_parallax_layout_bg_y&quot;:50,&quot;jet_parallax_layout_bg_y_tablet&quot;:&quot;&quot;,&quot;jet_parallax_layout_bg_y_mobile&quot;:&quot;&quot;,&quot;jet_parallax_layout_bg_size&quot;:&quot;auto&quot;,&quot;jet_parallax_layout_bg_size_tablet&quot;:&quot;&quot;,&quot;jet_parallax_layout_bg_size_mobile&quot;:&quot;&quot;,&quot;jet_parallax_layout_animation_prop&quot;:&quot;transform&quot;,&quot;jet_parallax_layout_on&quot;:[&quot;desktop&quot;,&quot;tablet&quot;]}]}\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-50 elementor-inner-column elementor-element elementor-element-2f96de0b\" data-id=\"2f96de0b\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-57c995e0 next-read-section elementor-widget elementor-widget-heading\" data-id=\"57c995e0\" data-element_type=\"widget\" 
data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<p class=\"elementor-heading-title elementor-size-default\"><span class=\"lightblue\">Suggested<\/span> Next Read <span class=\"lightblue\"><i class=\"fas fa-caret-right\"><\/i><\/span><\/p>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-50 elementor-inner-column elementor-element elementor-element-4ae5de96\" data-id=\"4ae5de96\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-d8dcf6f custom-slider blog-slide custom-slider-green elementor-widget__width-initial elementor-widget elementor-widget-ucaddon_uc_card_post_carousel_blog\" data-id=\"d8dcf6f\" data-element_type=\"widget\" data-widget_type=\"ucaddon_uc_card_post_carousel_blog.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\n<!-- start Card Post Carousel - Blog -->\n\t\t<link id='font-awesome-css' href='https:\/\/exa.net.uk\/wp-content\/plugins\/unlimited-elements-for-elementor\/assets_libraries\/font-awesome6\/fontawesome-all.min.css' type='text\/css' rel='stylesheet' >\n\t\t<link id='font-awesome-4-shim-css' href='https:\/\/exa.net.uk\/wp-content\/plugins\/unlimited-elements-for-elementor\/assets_libraries\/font-awesome6\/fontawesome-v4-shims.min.css' type='text\/css' rel='stylesheet' >\n\t\t<link id='owl-carousel-css' href='https:\/\/exa.net.uk\/wp-content\/plugins\/unlimited-elements-for-elementor\/assets_libraries\/owl-carousel\/assets\/owl.carousel.css' type='text\/css' rel='stylesheet' >\n\n<style>\/* widget: Card Post Carousel - Blog *\/\n\n#uc_uc_card_post_carousel_blog_elementor_d8dcf6f *{\n  box-sizing:border-box;\n}\n#uc_uc_card_post_carousel_blog_elementor_d8dcf6f{\n  position:relative;\n}\n#uc_uc_card_post_carousel_blog_elementor_d8dcf6f 
.uc_image_carousel_content{\n\ttext-align:left;\n    display: flex;\n     flex-flow: column nowrap;\n}\n#uc_uc_card_post_carousel_blog_elementor_d8dcf6f .ue_post_carousel_item\n{\n  overflow:hidden;\n  \n}\n\n#uc_uc_card_post_carousel_blog_elementor_d8dcf6f .ue_post_btn_holder\n{\n  margin-top:auto;\n}\n\n#uc_uc_card_post_carousel_blog_elementor_d8dcf6f .uc_more_btn{\n\n  display:inline-block;\n  text-align:center;\n  text-decoration:none;\n} \n\n.uc_overlay_image_carousel .uc_more_btn{\n  text-decoration:none;\n  display:inline-block;\n}\n\n.uc_overlay_image_carousel .uc_post_title{\n  font-size:21px;\n  text-decoration:none;\n}\n\n#uc_uc_card_post_carousel_blog_elementor_d8dcf6f .owl-nav .owl-prev{\n    position:absolute;\n    left:-40px;\n    display:inline-block;\n    text-align:center;\n}\n#uc_uc_card_post_carousel_blog_elementor_d8dcf6f .owl-nav .owl-next{\n  position:absolute;\n    right:-40px;\n  display:inline-block;\n  text-align:center;\n}\n\n\n#uc_uc_card_post_carousel_blog_elementor_d8dcf6f .owl-dots {\noverflow:hidden;\ndisplay:false !important;\ntext-align:center;\n}\n\n#uc_uc_card_post_carousel_blog_elementor_d8dcf6f .owl-dot {\nborder-radius:50%;\ndisplay:inline-block;\n}\n\n<\/style>\n\n<div class=\"uc_overlay_image_carousel\" id=\"uc_uc_card_post_carousel_blog_elementor_d8dcf6f\" style=\"direction:ltr;\">\n   <div class=\"uc_carousel owl-carousel owl-theme\">\n   \t\t<div class=\"uc_image_carousel_container_holder ue_post_carousel_item\">\n  <div class=\"uc_image_carousel_placeholder\">\n    <a href=\"https:\/\/exa.net.uk\/knowledge-hub\/technical\/ensuring-robust-security-through-regular-firewall-updates\/\">\n\t\t\n      <div class=\"custom-slider-bg\" style=\"background-image:url(https:\/\/exa.net.uk\/wp-content\/uploads\/2021\/03\/fortinet-logo.svg); background-size:cover; background-position:center; height:115px;\">\n\t\t<div class=\"color-bg\">\n\t\t\t<div class=\"content-info\">\n\t\t\t\t<a 
href=\"https:\/\/exa.net.uk\/knowledge-hub\/technical\/ensuring-robust-security-through-regular-firewall-updates\/\" style=\"text-decoration:none;\"><div class=\"uc_post_title\" style=\"text-decoration:none;\">Ensuring Robust Security Through Regular Firewall Updates<\/div><\/a>\t\t\t<\/div>\n\t\t\n\t\t<\/div>\n      <\/div>\n\t<\/a>\n  <\/div>\n\n  <div class=\"uc_image_carousel_content\" >\n    <a class=\"post-link\" href=\"https:\/\/exa.net.uk\/knowledge-hub\/technical\/ensuring-robust-security-through-regular-firewall-updates\/\">Read Next<\/a>  <\/div>\n<\/div>\n\n   <\/div>\t\n<\/div>\n<!-- end Card Post Carousel - Blog -->\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<div class=\"elementor-element elementor-element-56c79c98 elementor-widget__width-initial elementor-hidden-phone elementor-widget elementor-widget-spacer\" data-id=\"56c79c98\" data-element_type=\"widget\" data-widget_type=\"spacer.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<div class=\"elementor-spacer\">\n\t\t\t<div class=\"elementor-spacer-inner\"><\/div>\n\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t<div class=\"has_eae_slider jet-sticky-column elementor-column elementor-col-50 elementor-top-column elementor-element elementor-element-52f48463\" data-jet-settings=\"{&quot;id&quot;:&quot;52f48463&quot;,&quot;sticky&quot;:true,&quot;topSpacing&quot;:155,&quot;bottomSpacing&quot;:50,&quot;stickyOn&quot;:[&quot;desktop&quot;,&quot;tablet&quot;]}\" data-id=\"52f48463\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-5b42788c elementor-widget elementor-widget-heading\" data-id=\"5b42788c\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title 
elementor-size-default\"><span class=\"lightblue\">Related<\/span> Knowledge <span class=\"lightblue\">Hub<\/span>&trade; Articles<\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-613dd224 elementor-grid-1 elementor-grid-tablet-1 elementor-posts--thumbnail-top elementor-grid-mobile-1 elementor-widget elementor-widget-posts\" data-id=\"613dd224\" data-element_type=\"widget\" data-settings=\"{&quot;custom_columns&quot;:&quot;1&quot;,&quot;custom_row_gap&quot;:{&quot;unit&quot;:&quot;px&quot;,&quot;size&quot;:25,&quot;sizes&quot;:[]},&quot;custom_columns_tablet&quot;:&quot;1&quot;,&quot;custom_columns_mobile&quot;:&quot;1&quot;,&quot;custom_row_gap_tablet&quot;:{&quot;unit&quot;:&quot;px&quot;,&quot;size&quot;:&quot;&quot;,&quot;sizes&quot;:[]},&quot;custom_row_gap_mobile&quot;:{&quot;unit&quot;:&quot;px&quot;,&quot;size&quot;:&quot;&quot;,&quot;sizes&quot;:[]}}\" data-widget_type=\"posts.custom\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t      <div class=\"ecs-posts elementor-posts-container elementor-posts   elementor-grid elementor-posts--skin-custom\" data-settings=\"{&quot;current_page&quot;:1,&quot;max_num_pages&quot;:0,&quot;load_method&quot;:&quot;&quot;,&quot;widget_id&quot;:&quot;613dd224&quot;,&quot;post_id&quot;:16006,&quot;theme_id&quot;:16006,&quot;change_url&quot;:false,&quot;reinit_js&quot;:false}\">\n      <div class=\"elementor-posts-nothing-found\"><\/div>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6bf5c137 next-read-section elementor-widget elementor-widget-heading\" data-id=\"6bf5c137\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<p class=\"elementor-heading-title elementor-size-default\"><a href=\"https:\/\/exa.net.uk\/knowledge-hub\/\" target=\"_blank\" rel=\"noopener\">Knowledge <span class=\"lightblue\">Hub<\/span>&trade; <span 
class=\"lightblue\">Home<\/span> <i class=\"fas fa-chevron-right\"><\/i><\/a><\/p>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Knowledge Hub Knowledge HubTM Not so long ago, we posted an article that detailed an issue impacting SurfProtect service stability. We described our investigation into the cause of the problem and celebrated its resolution. This week, it came back. While our previous investigation required significant effort, co-ordination between our teams, and time, we&#8217;re happy to &#8230; <a title=\"Debugging Latency &#8211; Redux\" class=\"read-more\" href=\"https:\/\/exa.net.uk\/knowledge-hub\/technical\/debugging-latency-redux\/\" aria-label=\"Read more about Debugging Latency &#8211; Redux\">Read more<\/a><\/p>\n","protected":false},"author":4,"featured_media":22942,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"elementor_header_footer","format":"standard","meta":{"rank_math_lock_modified_date":false,"footnotes":""},"categories":[73],"tags":[29],"class_list":["post-16006","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technical","tag-11-min-read"],"_links":{"self":[{"href":"https:\/\/exa.net.uk\/wp-json\/wp\/v2\/posts\/16006","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/exa.net.uk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/exa.net.uk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/exa.net.uk\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/exa.net.uk\/wp-json\/wp\/v2\/comments?post=16006"}],"version-history":[{"count":1,"href":"https:\/\/exa.net.uk\/wp-json\/wp\/v2\/posts\/16006\/revisions"}],"predecessor-version":[{"id":22943,"href":"https:\/\/exa.net.uk\/wp-json\/wp\/v2\/posts\/16006\/revisions\/22943"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\
/\/exa.net.uk\/wp-json\/wp\/v2\/media\/22942"}],"wp:attachment":[{"href":"https:\/\/exa.net.uk\/wp-json\/wp\/v2\/media?parent=16006"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/exa.net.uk\/wp-json\/wp\/v2\/categories?post=16006"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/exa.net.uk\/wp-json\/wp\/v2\/tags?post=16006"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}