While benchmarking some tweaks to Hasura we noticed some strange behavior. We were measuring latencies for a simple GraphQL request served by Hasura. Surprisingly, Hasura performed better when there was more load on the system.
At 100 requests per second the processor load was low enough that the processor going to sleep was negatively impacting the latency measurements. However, when a browser was open on the same machine, latency measurements improved!
The explanation for this is quite intriguing and is documented in detail in this post by Brandon. This blog post explains that post from a layman’s perspective.
Modern Intel processors have extremely sophisticated power management that modifies the clock frequency and powers up and down subsystems dynamically and constantly. This can happen several times a second.
P-states and C-states
Power management happens both when the processor is idle and when it is not. The idle states of a processor are called C-states and are denoted C0, C1, C2, and so on. C0 is the operating state, whereas in C1, C2, C3, etc. some or all of the processor's subsystems are shut down. Generally, the higher the C-state, the lower the power consumption, but also the longer it takes for the processor to move back to the operating state C0.
We can look at the idle states available on my laptop using the cpupower command:
```
$ cpupower idle-info
CPUidle driver: intel_idle
CPUidle governor: menu
analyzing CPU 0:

Number of idle states: 9
Available idle states: POLL C1 C1E C3 C6 C7s C8 C9 C10
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 5845
Duration: 3482887
C1:
Flags/Description: MWAIT 0x00
Latency: 2
Usage: 155904
Duration: 40263250
C1E:
Flags/Description: MWAIT 0x01
Latency: 10
Usage: 505874
Duration: 163384145
C3:
Flags/Description: MWAIT 0x10
Latency: 70
Usage: 405130
Duration: 115592556
C6:
Flags/Description: MWAIT 0x20
Latency: 85
Usage: 2407019
Duration: 1503034108
C7s:
Flags/Description: MWAIT 0x33
Latency: 124
Usage: 657
Duration: 674710
C8:
Flags/Description: MWAIT 0x40
Latency: 200
Usage: 1694254
Duration: 2067037470
C9:
Flags/Description: MWAIT 0x50
Latency: 480
Usage: 38
Duration: 65699
C10:
Flags/Description: MWAIT 0x60
Latency: 890
Usage: 44032
Duration: 79069478
```
We can see that my processor has these idle states available: POLL, C1, C1E, C3, C6, C7s, C8, C9, and C10. We can also see a Latency field that tells us the maximum time (in µs) the processor will take to go from that state back to the operating state C0. It is possible to enable/disable these states using the cpupower idle-set command.
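For instance, here is a hypothetical round trip with that command. The state numbers follow the order reported by cpupower idle-info above (POLL is state 0, so C10 is state 8 on this machine); the flags are from the cpupower-idle-set man page:

```
# Disable only the deepest idle state (C10 = state 8 on this machine)
$ sudo cpupower idle-set -d 8

# Re-enable that one state
$ sudo cpupower idle-set -e 8

# Re-enable every idle state
$ sudo cpupower idle-set -E
```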
Similarly, when a processor is not idle it will be in one of several P-states, denoted P0, P1, P2, P3, and so on. The higher the P-state, the lower the voltage and operating frequency of the processor. This means CPU-bound programs will in general execute faster in a lower P-state than in a higher one. We cannot explicitly set the processor to a particular P-state, but we can influence how the P-state is chosen by setting a CPUFreq governor using the cpupower frequency-set -g <governor> command.
On my laptop I can see the available governors with:
```
$ cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: Cannot determine or is not supported.
  hardware limits: 400 MHz - 4.00 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 400 MHz and 4.00 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 1.60 GHz (asserted by call to kernel)
  boost state support:
    Supported: yes
    Active: yes
```
We can see that I can set the governor to performance or powersave, and that the current governor is powersave.
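If you prefer to skip cpupower, the kernel exposes the same knobs through sysfs. A minimal sketch:

```
# Show the current scaling governor for every CPU
$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# List the governors the driver supports
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors

# Switch all CPUs to the performance governor (equivalent to the
# `cpupower frequency-set -g performance` command used later in this post)
$ echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```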
This article explains the various P-states and C-states that Intel processors implement in greater detail.
We can also measure the time spent by the processor in various states using the turbostat command:
```
$ sudo turbostat --quiet sleep 5
5.004098 sec
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI C1 C1E C3 C6 C7s C8 C9 C10 C1% C1E% C3% C6% C7s% C8% C9% C10% CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp GFX%rc6 GFXMHz Totl%C0 Any%C0 GFX%C0 CPUGFX% Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 Pkg%pc8 Pkg%pc9 Pk%pc10 PkgWatt CorWatt GFXWatt RAMWatt PKG_% RAM_%
- - 6 0.35 1640 1991 2009 0 1 54 11 78 0 1555 0 523 0.00 0.01 0.00 0.18 0.00 27.94 0.00 71.50 0.95 0.00 0.30 98.39 54 55 99.96 300 2.76 1.83 0.00 0.00 20.55 0.12 0.09 0.15 75.31 0.00 0.00 0.76 0.09 0.00 0.32 0.00 0.00
0 0 12 0.77 1625 1991 730 0 0 2 5 48 0 603 0 90 0.00 0.00 0.01 0.98 0.00 47.68 0.00 50.57 1.25 0.00 0.90 97.07 54 55 99.96 300 2.76 1.83 0.00 0.00 20.56 0.12 0.09 0.15 75.31 0.00 0.00 0.76 0.09 0.00 0.32 0.00 0.00
0 4 4 0.25 1650 1991 154 0 0 0 0 4 0 125 0 70 0.00 0.00 0.00 0.07 0.00 22.48 0.00 77.17 1.77
1 1 2 0.15 1689 1991 95 0 0 8 0 3 0 66 0 32 0.00 0.01 0.00 0.05 0.00 9.76 0.00 90.01 0.86 0.00 0.08 98.91 54
1 5 6 0.35 1626 1991 222 0 0 0 0 4 0 129 0 115 0.00 0.00 0.00 0.06 0.00 18.13 0.00 81.44 0.67
2 2 5 0.31 1672 1991 187 0 0 1 0 6 0 163 0 53 0.00 0.00 0.00 0.10 0.00 39.79 0.00 59.78 1.02 0.01 0.13 98.53 54
2 6 8 0.50 1623 1991 299 0 0 1 6 5 0 246 0 61 0.00 0.00 0.01 0.08 0.00 46.01 0.00 53.39 0.84
3 3 4 0.27 1642 1991 180 0 0 5 0 5 0 139 0 43 0.00 0.01 0.00 0.07 0.00 29.59 0.00 70.05 0.58 0.00 0.10 99.06 54
3 7 3 0.19 1664 1992 142 0 1 37 0 3 0 84 0 59 0.00 0.05 0.00 0.05 0.00 10.11 0.00 89.58 0.66
```
In the above output we can see that the processor spends less than 0.5% of its time in the C0 state (the Busy% column), because we are only executing the sleep command and there is not much else happening on the machine. The bulk of the time is spent in the C7 state (CPU%c7). This man page explains the various fields that turbostat outputs.
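The full turbostat output is quite wide; if you only care about the handful of counters discussed here, turbostat lets you select columns with --show (documented in the same man page). For example:

```
# Report only busy residency, average busy frequency, and deep-idle
# residency while `sleep 5` runs
$ sudo turbostat --quiet --show Busy%,Bzy_MHz,CPU%c7 sleep 5
```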
How does all of this affect Hasura’s performance, you ask? Let us look at the power management states of the machine while Hasura executes a simple GraphQL query. The measurements below were made on my laptop running an Intel i7-8550U (1.8 GHz, turbo boost to 4 GHz). Both Hasura and Postgres were running on the same machine, and a simple GraphQL query fetching about 5 rows was used. I’ve used wrk2, launched under turbostat, to run a constant load of 100 RPS.
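The post doesn’t show the exact invocation, so here is a sketch of what it may have looked like. The GraphQL query and the post.lua script are illustrative assumptions, not the originals; the wrk2 flags (-t threads, -c connections, -d duration, -R constant request rate) match the numbers in the output below:

```
# post.lua (illustrative): make wrk2 send the GraphQL query as a JSON POST
$ cat > post.lua <<'EOF'
wrk.method = "POST"
wrk.headers["Content-Type"] = "application/json"
-- hypothetical query fetching ~5 rows
wrk.body = '{"query":"{ authors(limit: 5) { id name } }"}'
EOF

# Run wrk2 (its binary is named `wrk`) under turbostat so the power-state
# numbers cover exactly the same 60-second window as the benchmark
$ sudo turbostat --quiet wrk -t1 -c10 -d60s -R100 --latency \
    -s post.lua http://localhost:8181/v1/graphql
```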
Here’s the output from the wrk2 command:
```
Running 1m test @ http://localhost:8181/v1/graphql
  1 threads and 10 connections
  Thread calibration: mean lat.: 2.489ms, rate sampling interval: 10ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.49ms  521.70us  12.94ms   75.29%
    Req/Sec   105.63    102.66   363.00     37.27%
  Latency Distribution (HdrHistogram - Recorded Latency)
 50.000%    2.52ms
 75.000%    2.80ms
 90.000%    3.03ms
 99.000%    3.53ms
 99.900%    5.90ms
 99.990%   12.94ms
 99.999%   12.94ms
100.000%   12.94ms
```
Latency measurement for Hasura serving a simple GraphQL request
Here’s the output from turbostat:
```
60.011521 sec
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI C1 C1E C3 C6 C7s C8 C9 C10 C1% C1E% C3% C6% C7s% C8% C9% C10% CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp GFX%rc6 GFXMHz Totl%C0 Any%C0 GFX%C0 CPUGFX% Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 Pkg%pc8 Pkg%pc9 Pk%pc10 PkgWatt CorWatt GFXWatt RAMWatt PKG_% RAM_%
- - 50 3.00 1669 1992 70009 0 261 1971 1526 11017 34 71007 4 17698 0.01 0.04 0.06 1.62 0.02 49.45 0.00 45.85 5.20 0.07 2.63 89.10 39 40 100.00 300 24.38 20.60 0.00 0.00 25.92 1.42 0.42 0.11 47.07 0.00 0.00 1.15 0.46 0.00 0.43 0.00 0.00
0 0 50 3.08 1629 1992 8822 0 32 210 185 1225 8 9399 2 2005 0.00 0.03 0.05 1.39 0.03 51.97 0.00 43.49 5.23 0.07 2.56 89.07 39 40 100.00 300 24.38 20.60 0.00 0.00 25.92 1.42 0.42 0.11 47.07 0.00 0.00 1.15 0.46 0.00 0.43 0.00 0.00
0 4 50 3.07 1615 1992 9119 0 49 259 216 1523 5 9173 0 2120 0.00 0.04 0.06 1.81 0.03 48.68 0.00 46.35 5.24
1 1 60 3.29 1807 1992 8405 0 20 171 191 1201 0 8406 0 2056 0.01 0.02 0.06 1.35 0.00 49.63 0.00 45.67 4.86 0.07 2.42 89.36 39
1 5 45 2.75 1653 1992 8734 0 39 385 191 1384 8 8161 0 2498 0.00 0.07 0.05 1.64 0.02 44.75 0.00 50.75 5.40
2 2 52 3.15 1644 1992 9181 0 29 269 184 1436 3 9256 0 2008 0.00 0.07 0.06 1.70 0.02 50.97 0.00 44.07 5.19 0.07 2.49 89.10 39
2 6 53 2.99 1756 1992 7914 0 25 265 183 1138 2 8432 0 2248 0.01 0.03 0.05 1.34 0.01 49.80 0.00 45.80 5.35
3 3 45 2.80 1616 1992 8336 0 25 209 150 1401 6 8112 0 2663 0.00 0.05 0.05 1.66 0.02 44.79 0.00 50.68 5.20 0.07 3.06 88.88 39
3 7 46 2.85 1617 1992 9498 0 42 203 226 1709 2 10068 2 2100 0.01 0.03 0.06 2.12 0.00 55.00 0.02 39.95 5.14
```
We can see the processor was busy (i.e., in the C0 state) less than 3.5% of the time, while it spent about 90% of the time in the deep C7 state, which is slow to wake from to do useful work.
Essentially, at 100 RPS Hasura takes up so little of the processor’s time that the processor spends most of its time in deep sleep 🤓 That in turn increases the latency of requests, because it takes time for the processor to wake up and serve each request.
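You can observe this wake-up penalty independently of Hasura. For example, cyclictest from the rt-tests package (an assumption here, not a tool used in the original benchmark) measures how late a thread wakes from a timed sleep, which on an otherwise idle machine is dominated by C-state exit latency:

```
# One thread waking every 1000 µs for 10000 iterations at real-time priority 80;
# compare the reported Max latency against the C-state exit latencies above
$ sudo cyclictest -t1 -p80 -i1000 -l10000
```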
The “performance” P-state governor
How does using the performance governor improve the situation? Switching to performance mode with...

```
$ sudo cpupower frequency-set -g performance
```

…reduces latency, as the CPU is more ready to ramp up and do work when it arrives:
wrk2 output:
```
Running 1m test @ http://localhost:8181/v1/graphql
  1 threads and 10 connections
  Thread calibration: mean lat.: 2.002ms, rate sampling interval: 10ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.94ms  579.10us   9.26ms   68.44%
    Req/Sec   105.33    100.88   333.00     43.93%
  Latency Distribution (HdrHistogram - Recorded Latency)
 50.000%    1.90ms
 75.000%    2.34ms
 90.000%    2.62ms
 99.000%    3.21ms
 99.900%    5.98ms
 99.990%    9.27ms
 99.999%    9.27ms
100.000%    9.27ms
```
Latency measurement with performance governor enabled
We see better latencies at pretty much every percentile.
turbostat output:
```
60.006307 sec
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI C1 C1E C3 C6 C7s C8 C9 C10 C1% C1E% C3% C6% C7s% C8% C9% C10% CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp GFX%rc6 GFXMHz Totl%C0 Any%C0 GFX%C0 CPUGFX% Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 Pkg%pc8 Pkg%pc9 Pk%pc10 PkgWatt CorWatt GFXWatt RAMWatt PKG_% RAM_%
- - 67 1.72 3930 1992 67347 0 227 1648 1301 9135 32 72585 2 17162 0.01 0.04 0.08 1.62 0.01 50.84 0.00 45.66 8.01 0.10 2.23 87.95 49 49 100.01 300 13.73 11.58 0.00 0.00 19.86 1.41 0.35 0.04 52.64 0.00 0.00 2.49 1.69 0.00 0.64 0.00 0.00
0 0 62 1.58 3923 1992 9298 0 34 167 175 1277 3 9688 0 1967 0.02 0.05 0.08 1.92 0.01 54.27 0.00 42.06 7.76 0.07 2.17 88.42 49 49 100.01 300 13.73 11.58 0.00 0.00 19.86 1.41 0.35 0.04 52.64 0.00 0.00 2.49 1.69 0.00 0.64 0.00 0.00
0 4 64 1.64 3924 1992 7662 0 36 196 144 917 2 8251 0 2133 0.00 0.02 0.06 1.19 0.01 47.09 0.00 49.97 7.69
1 1 71 1.81 3932 1992 8996 0 26 232 196 1253 4 9553 1 2035 0.00 0.05 0.09 1.73 0.01 53.37 0.00 42.92 8.42 0.08 2.25 87.44 46
1 5 68 1.74 3933 1992 8754 0 35 224 136 1132 5 9662 0 2293 0.02 0.04 0.06 1.59 0.01 53.98 0.00 42.54 8.48
2 2 70 1.78 3937 1992 7611 0 20 200 149 1063 3 8247 0 2260 0.00 0.02 0.08 1.53 0.00 47.71 0.00 48.85 7.35 0.08 1.78 89.01 47
2 6 60 1.54 3918 1992 7920 0 25 210 134 780 2 8559 1 1993 0.00 0.04 0.06 1.05 0.01 51.94 0.00 45.33 7.58
3 3 74 1.88 3937 1992 8414 0 25 259 164 1262 3 9352 0 2064 0.00 0.10 0.07 1.78 0.02 49.07 0.00 47.05 8.32 0.16 2.72 86.91 48
3 7 69 1.76 3930 1992 8692 0 26 160 203 1451 10 9273 0 2417 0.00 0.03 0.17 2.16 0.01 49.29 0.00 46.56 8.45
```
Notice that Bzy_MHz is now close to the maximum advertised frequency of 4 GHz (with turbo boost).
Prohibit idle states
We can keep the processor from entering deep sleep states with:

```
$ sudo cpupower idle-set -D2
```

This disables any idle state with a latency >= 2 µs, which on my laptop is all of the idle states.
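To confirm what actually got disabled, each idle state exposes a disable flag per CPU in sysfs:

```
# One `disable` file per idle state; a value of 1 means the state is
# disabled for cpu0
$ grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/disable
```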
Here are the results from wrk2:
```
Running 1m test @ http://localhost:8181/v1/graphql
  1 threads and 10 connections
  Thread calibration: mean lat.: 0.995ms, rate sampling interval: 10ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.13ms  353.31us   8.95ms   79.16%
    Req/Sec   108.45    104.68   333.00     19.19%
  Latency Distribution (HdrHistogram - Recorded Latency)
 50.000%    1.20ms
 75.000%    1.32ms
 90.000%    1.41ms
 99.000%    1.79ms
 99.900%    4.33ms
 99.990%    8.96ms
 99.999%    8.96ms
100.000%    8.96ms
```
Latency measurement with performance governor enabled and idle states disabled
Again we can see an improvement at almost every percentile.
Turbostat output:
```
60.005077 sec
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI C1 C1E C3 C6 C7s C8 C9 C10 C1% C1E% C3% C6% C7s% C8% C9% C10% CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp GFX%rc6 GFXMHz Totl%C0 Any%C0 GFX%C0 CPUGFX% Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 Pkg%pc8 Pkg%pc9 Pk%pc10 PkgWatt CorWatt GFXWatt RAMWatt PKG_% RAM_%
- - 3685 99.76 3693 1992 68091 16 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 85 84 99.92 300 398.38 99.59 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 19.73 18.80 0.00 0.35 0.00 0.00
0 0 3685 99.76 3693 1992 9292 2 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 81 84 99.92 300 398.38 99.59 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 19.73 18.80 0.00 0.35 0.00 0.00
0 4 3685 99.76 3693 1992 7130 2 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.24
1 1 3685 99.76 3693 1992 9178 2 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 80
1 5 3685 99.76 3693 1992 9541 2 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.24
2 2 3685 99.76 3693 1992 7925 2 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 85
2 6 3685 99.76 3693 1992 8344 2 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.24
3 3 3685 99.76 3693 1992 9789 2 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 80
3 7 3685 99.76 3693 1992 6892 2 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.24
```
As expected, Busy% is now close to 100%.
In this article we have seen that processor power management has a significant effect on response times with Hasura. Using the performance governor and disabling idle states improved the average latency by roughly half (from 2.49 ms to 1.13 ms in this test). A couple of caveats to be aware of, though:
- wrk2’s measured latencies are only accurate to a +/- ~1 ms granularity, due to OS sleep time behavior.
- Disabling idle states and using the performance governor also consumes more power. In particular, disabling all idle states means the CPU runs at its peak frequency continuously. You might want to experiment with disabling only the deeper C-states and leaving the shallower ones enabled; reverting these settings is shown below.
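These settings do not persist across reboots, but to revert them right away, a sketch using the same cpupower subcommands as above:

```
# Re-enable every idle state
$ sudo cpupower idle-set -E

# Switch back to the powersave governor
$ sudo cpupower frequency-set -g powersave
```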
We have also not tried these settings in production, and these are not official recommendations. However, if you do try these techniques out, let us know about your experience on Discord or tweet to us at Hasura.
Originally published at https://hasura.io on April 14, 2020.