Reliability of benchmarking scripts

I was doing some benchmarking when I noticed some weird numbers being reported. I decided to investigate by running the script three times. The reported download speeds of the 100MB test file from Softlayer SG to my server are:

5.92 MiB/s
5.67 MiB/s
4.16 MiB/s

When I downloaded the same file using curl three consecutive times, the speeds are:

8.422 MB/s
7.890 MB/s
8.403 MB/s

Finally, I tried the same file three consecutive times using wget:

8.57 MB/s
9.19 MB/s
8.60 MB/s

There seems to be a clear, consistent pattern of the script under-reporting bandwidth, in the range of 40-80% relative to plain curl or wget, and I have seen even wilder swings in other instances. This made me curious about the cause. Nench uses the curl command in its script, but it doesn't seem to be curl's fault, because when I execute the curl command manually, the speeds are consistent with wget.

I think the download speeds of benchmark scripts need to be taken with quite a big grain of salt. I am not sure why, but manual downloads seem to be a much better reflection of downloading speed.
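One caveat worth flagging: the script reports binary MiB/s while the curl and wget figures are (I believe) decimal MB/s, so a unit conversion closes part of the gap, though nowhere near all of it. A quick check, assuming decimal megabytes for curl/wget:

```shell
# Convert the script's 5.92 MiB/s (binary mebibytes) to decimal MB/s
# for an apples-to-apples comparison with the curl/wget figures.
echo 5.92 | awk '{ printf "%.2f MB/s\n", $0 * 1048576 / 1000000 }'
# → 6.21 MB/s, still well short of curl's ~8.4 MB/s
```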

Now I have to think about how to tweak my benchmarking procedure.

Deals and Reviews: LowEndBoxes Review | Avoid dodgy providers with The LEBRE Whitelist | Free hosting (with conditions): Evolution-Host, NanoKVM, FreeMach, ServedEZ | Follow latest deals on Twitter or Telegram

Comments

  • @poisson - (just curious) are you using the same flags for curl as it is invoked in nench.sh?

    curl --max-time 10 -so /dev/null -w '%{speed_download}\n' -4 http://whatever.url
    

    and I guess you might also check whether there's any effect from piping to that awk command (the guts of the Bps_to_MiBps() function)

    awk '{ printf "%.2f MiB/s\n", $0 / 1024 / 1024 } END { if (NR == 0) { print "error" } }'
    

    (Just as a methodical first pass, to rule out low-hanging fruit)
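    As a quick sanity check on the converter in isolation, feed it a known byte rate (curl's %{speed_download} is in bytes per second) and confirm the MiB/s math:

```shell
# 8,830,000 B/s through the Bps_to_MiBps() awk one-liner from nench.sh
echo 8830000 | awk '{ printf "%.2f MiB/s\n", $0 / 1024 / 1024 } END { if (NR == 0) { print "error" } }'
# → 8.42 MiB/s; empty input would print "error" instead
```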

  • AnthonySmith (Administrator, Hosting Provider)

    I think any single benchmark on a shared-resource environment that is used to make an absolute judgement about anything says more about the level of knowledge of the person using it than the results say about the service.

    A generic benchmark probably tests nothing that resembles real-world sources or scenarios; you need to build your own based on your own use case to get anything even remotely accurate.

    Inception Hosting - we surveyed 100 people and asked them what a fat husband may hide from his wife in his belly button, the 3rd most popular answer was: "Jewelry"

  • @uptime said:
    @poisson - (just curious) are you using the same flags for curl as it is invoked in nench.sh?

    curl --max-time 10 -so /dev/null -w '%{speed_download}\n' -4 http://whatever.url
    

    and I guess you might also check whether there's any effect from piping to that awk command (the guts of the Bps_to_MiBps() function)

    awk '{ printf "%.2f MiB/s\n", $0 / 1024 / 1024 } END { if (NR == 0) { print "error" } }'
    

    (Just as a methodical first pass, to rule out low-hanging fruit)

    Nope, I didn't use the same flags, but they should not affect the result because there is no timeout (--max-time accounted for) and -s simply silences curl. I did use -o because I had to output the file to disk. I cannot see how the flags could reasonably explain the difference.

    I am not sure about the piping, but that looks like it should not impact the download speeds.


  • @AnthonySmith said:
    I think any single benchmark on a shared-resource environment that is used to make an absolute judgement about anything says more about the level of knowledge of the person using it than the results say about the service.

    A generic benchmark probably tests nothing that resembles real-world sources or scenarios; you need to build your own based on your own use case to get anything even remotely accurate.

    I actually have lots of data points, not just a single bench. This is why I became curious and started to dig. It seems like scripts tend to under-report for reasons I don't understand, even when everything else is pretty much controlled for (same file, same servers, tests within the same few minutes).


  • uptime (OG)
    edited December 2019

    @poisson - well ... I'm inclined to agree with regard to "no obvious reason it should run slower" ...

    And yet - it does! So presumably there is a reason, and presumably that reason is not obvious (to me) when looking at the entire nench.sh script as a whole.

    And so, I would simply try to decompose the system methodically (without thinking too much about how things "should" work, given that my delusional capacity for pure reason and perfect knowledge has failed me yet again - while perhaps some simple experimentation and observation would suffice to enlighten instead). Yea, verily I should endeavor to test each possible component and combination of components as I essentially put the script back together.

    So - along those lines - the next questions might be "does running curl inside a bash function take longer?" ... "does running bash functions from inside a script in a file take longer?" ... And so forth.

    The rest is left as an exercise for the interested reader. Q.E.D., etc, etc, etc. :)

  • My guess is the script could be doing some CPU/memory-intensive bench before starting the download, so that could be affecting it.


    relentless collector of highest clocked, highest performing KVM/NVMe/Gbit VPSes at the most competitive rates. just to hard idle them. zero knowledge on coding/programming; a mere hobbyist.

  • AnthonySmith (Administrator, Hosting Provider)
    edited December 2019

    Edit the benchmark script.

    Add this before the network portion of the benchmark:

    echo 1 > /proc/sys/vm/drop_caches
    echo 2 > /proc/sys/vm/drop_caches
    echo 3 > /proc/sys/vm/drop_caches
    sleep 10
    

    See if that makes a difference; alternatively, move the networking part of the benchmark to run first, before everything else.

    If it does change things, I can probably explain what's going on; if not, then I am puzzled without having a look myself.


  • edited December 2019

    I like the new batch of benchmark scripts that use a handful of 10G iperf3 servers. Is it YABS?

    Disk I/O: fio with 4k/8k block sizes if you are evaluating IOPS headroom.

    CPU: CPU steal tells you a lot about how the node is being managed by the operator.

  • @vimalware said:
    I like the new batch of benchmark scripts that use a handful of 10G iperf3 servers. Is it YABS?

    Disk I/O: fio with 4k/8k block sizes if you are evaluating IOPS headroom.

    CPU: CPU steal tells you a lot about how the node is being managed by the operator.

    YABS uses iperf, but the problem is there are few Asian public iperf servers. The one in YABS often doesn't work.


  • @poisson said:

    @vimalware said:
    I like the new batch of benchmark scripts that use a handful of 10G iperf3 servers. Is it YABS?

    Disk I/O: fio with 4k/8k block sizes if you are evaluating IOPS headroom.

    CPU: CPU steal tells you a lot about how the node is being managed by the operator.

    YABS uses iperf, but the problem is there are few Asian public iperf servers. The one in YABS often doesn't work.

    Indeed. I tried to find all the public iperf3 servers out there, but Asia is pretty dark on that front. Mark from DirectAdmin did say he'd sponsor a few iperf3 POPs for the bench, so there's a possibility a couple more locations might be added. Problem will be finding a location in Asia that has good connectivity and a ton of bandwidth (as iperf servers will naturally push a ton of bw every month).

    What in tarnation?

  • @Mason said:

    @poisson said:

    @vimalware said:
    I like the new batch of benchmark scripts that use a handful of 10G iperf3 servers. Is it YABS?

    Disk I/O: fio with 4k/8k block sizes if you are evaluating IOPS headroom.

    CPU: CPU steal tells you a lot about how the node is being managed by the operator.

    YABS uses iperf, but the problem is there are few Asian public iperf servers. The one in YABS often doesn't work.

    Indeed. I tried to find all the public iperf3 servers out there, but Asia is pretty dark on that front. Mark from DirectAdmin did say he'd sponsor a few iperf3 POPs for the bench, so there's a possibility a couple more locations might be added. Problem will be finding a location in Asia that has good connectivity and a ton of bandwidth (as iperf servers will naturally push a ton of bw every month).

    That's why for my own benchmarking purposes, I might set up a private iperf on my mikho SG box. Can't make it public.


  • seriesn (Hosting Provider, OG)

    My 2c: if you are downloading from any public server, the chances of getting a consistent number are slim because, odds are, other people and scripts are already using them.

  • @seriesn said:
    My 2c: if you are downloading from any public server, the chances of getting a consistent number are slim because, odds are, other people and scripts are already using them.

    Yes and no. Many of the tests use big providers like Linode and Softlayer, which is OK, and results are fairly consistent. I am contemplating changing to 1GB download tests instead of the 100MB most scripts use (or to iperf), because curl often takes time to reach maximum speed, so the download frequently completes at below-peak speeds within 100MB. At least 500MB is needed to average out the fluctuations.


  • Thanks everyone for the input, especially @uptime. I thought there was no curl timeout issue, but when I investigated more, it seems like timeout was indeed an issue for certain locations. I modified the timeout parameter and now the speeds are similar to a direct command-line curl or wget.

    But I noticed that only particular locations were exhibiting problems, all of them APAC locations. Then I decided to leave the timeout untouched and added another location to the speed test and, voila, I got my answer. The problem is that Softlayer's networks need a higher timeout.
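    A plausible mechanism (my reading, with assumed numbers, not something I measured): with --max-time 10 the transfer is cut off after 10 seconds, so %{speed_download} averages over a window dominated by TCP ramp-up. A toy calculation, assuming the link ramps linearly to 8 MB/s over 6 seconds and holds 8 MB/s after that:

```shell
# Average reported speed if the measurement window is cut at 10 s vs 60 s.
awk 'BEGIN {
  ramp_mb = 6 * 8 / 2                 # 24 MB moved during the 6 s linear ramp
  for (t = 10; t <= 60; t += 50) {
    mb = ramp_mb + (t - 6) * 8        # ramp traffic plus steady-state traffic
    printf "cut off at %2d s: %.2f MB/s average\n", t, mb / t
  }
}'
# → cut off at 10 s: 5.60 MB/s average
# → cut off at 60 s: 7.60 MB/s average
```

    Those assumed numbers happen to land near the observed 5-6 versus 8 split, which is at least consistent with the timeout explanation.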

    Here's a sample output from Europe without modifying the original curl timeout. Note all the Softlayer locations and the corresponding alternative locations (marked by asterisks for your easy reference):

        Cachefly CDN:         48.85 MiB/s
        Leaseweb (NL):        8.16 MiB/s
        Softlayer DAL (US):   9.01 MiB/s *
        Vultr DAL (US):       14.48 MiB/s *
        Online.net (FR):      73.59 MiB/s
        Softlayer (UK):       35.93 MiB/s *
        DigitalOcean (UK):    68.97 MiB/s *
        OVH BHS (CA):         21.17 MiB/s
        Softlayer (AU):       0.96 MiB/s **
        Vultr (AU):           5.09 MiB/s *
        Linode (JP):          6.29 MiB/s
        Vultr (JP):           7.21 MiB/s
        Softlayer (SG):       1.05 MiB/s **
        Vultr (SG):           6.59 MiB/s *
        Softlayer (HK):       0.91 MiB/s **
        Leaseweb (HK):        6.65 MiB/s *
    

    Here's a sample output from Europe increasing the curl timeout from 10s to 60s (note all the Softlayer locations again):

        Cachefly CDN:         48.83 MiB/s
        Leaseweb (NL):        80.14 MiB/s
        Softlayer DAL (US):   8.75 MiB/s *
        Vultr DAL (US):       15.02 MiB/s *
        Online.net (FR):      73.08 MiB/s
        Softlayer (UK):       41.31 MiB/s *
        DigitalOcean (UK):    71.17 MiB/s * 
        OVH BHS (CA):         21.15 MiB/s
        Softlayer (AU):       4.23 MiB/s **
        Vultr (AU):           6.15 MiB/s *
        Linode (JP):          7.67 MiB/s
        Vultr (JP):           7.98 MiB/s
        Softlayer (SG):       4.63 MiB/s **
        Vultr (SG):           7.97 MiB/s *
        Softlayer (HK):       4.99 MiB/s **
        Leaseweb (HK):        7.19 MiB/s * 
    

    Again, Softlayer generally performed worse than the alternative locations, but note how the double-asterisk APAC locations improved with the 60s timeout compared to the previous test with only a 10s timeout.

    Still somewhat skeptical, I decided to run the same thing on a VPS located in America this time. The asterisk nomenclature is the same as above. First up is the original curl timeout of 10s:

        Cachefly CDN:         96.25 MiB/s
        Leaseweb (NL):        15.12 MiB/s
        Softlayer DAL (US):   29.56 MiB/s *
        Vultr DAL (US):       61.09 MiB/s *
        Online.net (FR):      17.66 MiB/s
        Softlayer (UK):       9.69 MiB/s *
        DigitalOcean (UK):    18.36 MiB/s *
        OVH BHS (CA):         22.28 MiB/s
        Softlayer (AU):       4.65 MiB/s **
        Vultr (AU):           12.86 MiB/s *
        Linode (JP):          1.28 MiB/s
        Vultr (JP):           19.17 MiB/s
        Softlayer (SG):       5.43 MiB/s **
        Vultr (SG):           11.81 MiB/s *
        Softlayer (HK):       8.29 MiB/s **
        Leaseweb (HK):        14.66 MiB/s *
    

    Now, the results when the curl timeout increased to 60s:

        Cachefly CDN:         97.56 MiB/s
        Leaseweb (NL):        15.09 MiB/s
        Softlayer DAL (US):   28.63 MiB/s *
        Vultr DAL (US):       60.06 MiB/s *
        Online.net (FR):      17.62 MiB/s
        Softlayer (UK):       5.85 MiB/s *
        DigitalOcean (UK):    17.68 MiB/s *
        OVH BHS (CA):         22.57 MiB/s
        Softlayer (AU):       4.95 MiB/s **
        Vultr (AU):           11.70 MiB/s *
        Linode (JP):          17.62 MiB/s 
        Vultr (JP):           21.10 MiB/s
        Softlayer (SG):       7.26 MiB/s **
        Vultr (SG):           11.53 MiB/s *
        Softlayer (HK):       8.55 MiB/s **
        Leaseweb (HK):        12.98 MiB/s *
    

    The difference from the timeout is not as obvious for the US location. However, one thing seems clear: Softlayer locations probably should not be used for benchmarking, because they are generally slower for unknown reasons. This seems to be unique to Softlayer's networks, as the other networks I used do not exhibit such symptoms.

    It's taken quite a while to figure this shit out but I think I am pretty convinced that the problem is with Softlayer. :)


  • It's good you fixed it.

  • I've observed that most downloads tend to start off slower and then ramp up to their potential speed, but it really seems to depend on the network or the time of day. I've never really been able to make much sense of it without actually researching the phenomenon.

    It'd be nice if the community could collaborate on a benchmark script that we could all agree to use, to keep all the results as fair as possible (or at least consistent)

    Most of the time, I put network problems down to "You're in Australia, Dave, your connectivity is fucked sometimes, that's just how shit is in the land down under, ya silly cunt."

  • @poisson - awesome analysis! and thank you for actually doing the needful to figure this one out.

    one question/suggestion for future reference - do you ever iperf, bro?

  • @uptime said:
    @poisson - awesome analysis! and thank you for actually doing the needful to figure this one out.

    one question/suggestion for future reference - do you ever iperf, bro?

    I prefer iperf, but the problem is there are few public hosts in APAC. I may run my own APAC iperf servers using my bundle from @mikho (never done it before, but it should not be too difficult, I guess), but there is no way I can make those public.


  • @dahartigan said:
    I've observed that most downloads tend to start off slower and then ramp up to their potential speed, but it really seems to depend on the network or the time of day. I've never really been able to make much sense of it without actually researching the phenomenon.

    It'd be nice if the community could collaborate on a benchmark script that we could all agree to use, to keep all the results as fair as possible (or at least consistent)

    Most of the time, I put network problems down to "You're in Australia, Dave, your connectivity is fucked sometimes, that's just how shit is in the land down under, ya silly cunt."

    This is why 500MB to 1GB files are usually better: the ramp-up effect is greatly diluted over a longer transfer. If I have time, I will modify an existing script to use 1GB files instead (with a warning about high bandwidth use).
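    As a rough sketch of the dilution (assumed numbers: 5 s ramping at 5 MB/s, then 10 MB/s steady):

```shell
# Average speed over a 100 MB vs a 1 GB file under the assumed ramp.
awk 'BEGIN {
  ramp_mb = 5 * 5                     # 25 MB transferred during the 5 s ramp
  for (size = 100; size <= 1000; size *= 10) {
    t = 5 + (size - ramp_mb) / 10     # ramp time plus steady time at 10 MB/s
    printf "%4d MB file: %.2f MB/s average\n", size, size / t
  }
}'
# →  100 MB file: 8.00 MB/s average
# → 1000 MB file: 9.76 MB/s average
```

    The bigger file lands much closer to the steady-state 10 MB/s.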


    dahartigan said: I've observed that most downloads tend to start off slower and then ramp up to their potential speed, but it really seems to depend on the network or the time of day. I've never really been able to make much sense of it without actually researching the phenomenon.

    That's normal TCP behavior. In a perfect world all data streams would eventually ramp up to the full advertised capacity of their pipes albeit slowly. However, a lot of factors (network load, latency, laws of physics) contribute to the actual observed speed. Cerf, Kahn and their team tried to think of all these possibilities in the 1970s and the best they could come up with was TCP.
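    The slow-start ramp can be sketched with a toy model (assumed values: 1460-byte MSS, initial window of 10 segments, 200 ms RTT, and a 100 Mbit/s pipe to fill):

```shell
# Classic slow start doubles the congestion window every RTT until the
# bytes in flight match the bandwidth-delay product of the path.
awk 'BEGIN {
  mss = 1460; cwnd = 10; rtt = 0.2
  target = 12.5e6 * rtt               # bandwidth-delay product in bytes
  for (r = 1; cwnd * mss < target; r++) cwnd *= 2
  printf "~%d round trips (~%.1f s) to fill the pipe\n", r, r * rtt
}'
# → ~9 round trips (~1.8 s) to fill the pipe
```

    On a high-latency intercontinental path, that warm-up alone eats a meaningful slice of a 10 s measurement window.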

  • @saibal said:

    dahartigan said: I've observed that most downloads tend to start off slower and then ramp up to their potential speed, but it really seems to depend on the network or the time of day. I've never really been able to make much sense of it without actually researching the phenomenon.

    That's normal TCP behavior. In a perfect world all data streams would eventually ramp up to the full advertised capacity of their pipes albeit slowly. However, a lot of factors (network load, latency, laws of physics) contribute to the actual observed speed. Cerf, Kahn and their team tried to think of all these possibilities in the 1970s and the best they could come up with was TCP.

    Yea, iperf is a much better test because we can ignore the first few seconds of the bandwidth test to get a better sense of actual throughput at the time of testing.
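    For what it's worth, iperf3 has a flag for exactly this: -O/--omit skips the first N seconds of the test so slow start doesn't drag down the reported average. The hostname below is a placeholder:

```shell
# Run a 10-second TCP test, discarding the first 3 seconds from the stats.
iperf3 -c iperf.example.net -O 3 -t 10
```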


  • How about BBR to improve testing?


  • @cybertech said:
    How about BBR to improve testing?

    Stop reading random Chinese blogs.


    2019 was the year of Amitz.

  • @WSS said:

    @cybertech said:
    How about BBR to improve testing?

    Stop reading random Chinese blogs.

