Andromeda 2.1 reduces GCP’s intra-zone latency by 40% (googleblog.com)
121 points by samaysharma on Nov 7, 2017 | 33 comments


BTW, since the blog post doesn't make it explicit: we're down to ~40 microseconds round trip between VMs in the same zone. No placement groups or infiniband required :).

Disclosure: I work on Google Cloud.


How would I test this? If I just create two (preemptible) n1-standard-1 VMs with the latest Debian image and use "ping" against the internal IP address, I get 0.138 ms on average.

I do get impressive bandwidth (1.95 Gbits/sec) using iperf.


ping is sadly not a great test program. You should do a TCP_RR run with netperf instead. For best results, try one of our newer regions/zones (with newer fabric and NICs) like us-east4-b or europe-west2-c. An n1-std-1 is also a single hyperthread, so you may want a larger instance as well.
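
If you want to reproduce this, a minimal netperf run looks something like the sketch below (the internal IP is a placeholder, and the package name may differ per distro; netperf ships a netserver binary for the receiving side):

  # On the receiving VM (assume its internal IP is 10.132.0.3):
  sudo apt-get install -y netperf   # package name on Debian; may differ elsewhere
  netserver                         # starts the netperf server side

  # On the sending VM: single outstanding transaction, 30-second run.
  netperf -H 10.132.0.3 -t TCP_RR -l 30
  # Mean RTT in microseconds is roughly 1,000,000 / the reported "Trans. Rate per sec".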


Thanks, got 34 µs in europe-west2-c with n1-standard-4 machines (also have about 8 Gbits/sec bandwidth between machines!).


:)


Thanks — I was disappointed not to see any absolute values in the post


Is this available on GKE? Any action required (e.g. node pool flags) on our side to get this working?


Yep! GKE just runs atop GCE. No work required on your side.

That reminds me though: you have to be communicating over the internal IP addresses, not the external ones (I hope the Services stuff does the right thing automatically). Firewall rules and such are different for internal versus external IPs (people often want a simple "deny all external traffic"), so that's sadly still a performance cliff.
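
If it helps, a hedged sketch of that "deny all external traffic" pattern with gcloud (rule names and priorities are made up; 10.128.0.0/9 is the auto-mode internal range, so substitute your own subnets; lower priority number wins):

  # Hypothetical rule names; adjust ranges/priorities for your VPC.
  gcloud compute firewall-rules create allow-internal-only \
      --action=ALLOW --rules=all --source-ranges=10.128.0.0/9 --priority=1000
  gcloud compute firewall-rules create deny-everything-else \
      --action=DENY --rules=all --source-ranges=0.0.0.0/0 --priority=2000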


Yes, this is enabled for all VMs running on Compute Engine (which includes GKE VMs); however, the in-guest iptables &c. bits add non-trivial overhead (I don't have numbers handy, apologies).


Thanks, good to know.

Regarding in-guest iptables -- there's not much we can do about that on GKE/Kubernetes. I serve about 1.2M requests per second on my GKE cluster through GLB and see this overhead quite clearly.
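
For a rough sense of how much iptables machinery is in play, counting kube-proxy's NAT rules on a node is a quick check (the chain prefix is the standard kube-proxy one):

  # Every Service adds KUBE-SVC-*/KUBE-SEP-* chains; in the iptables proxy mode
  # packets may have to walk a large chunk of these rules.
  sudo iptables-save -t nat | grep -c '^-A KUBE-'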


The new Alias IP stuff should help with that (though I imagine there will still be some iptables shenanigans left, so I'm not sure how much it will help).
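
If I understand the Alias IP bit correctly, it's opted into at cluster creation time; a minimal sketch (cluster name is a placeholder):

  # VPC-native cluster: Pod IPs become alias ranges on the node's interface
  # instead of advertised routes; some NAT/SNAT steps can be avoided this way.
  gcloud container clusters create my-cluster --enable-ip-alias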


AFAIK for GLB to work with GKE (via GLBC) I would still need a NodePort service, which routes exclusively via iptables?

FWIW I'm also looking forward to IPVS in k8s 1.9, which should improve this slightly.
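
For reference, the IPVS mode is chosen via kube-proxy's proxy mode and, as of 1.8/1.9, a feature gate; a sketch of the relevant flags:

  # IPVS does hash-based lookups instead of walking iptables chains per packet.
  kube-proxy --proxy-mode=ipvs --feature-gates=SupportIPVSProxyMode=true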


Why don't AWS people actively contribute to HN discussions about their achievements?

There are 8 comments on this thread and 3 of them are from GCP people giving insights that aren't just marketing. Thanks @jsolson and @boulos (it's always interesting to read your comments).


Glad you appreciate it! Jon and I both just like explaining and clarifying. In this case he did the actual work and, along with Jake and others, deserves the credit (I rarely contribute directly to Compute Engine these days, but I still pretend I can explain what Jon and others do).

I don't begrudge the folks who prefer to work silently. I'm not adding the names of the people who worked on this (and there are many!) who perhaps don't want to be publicly visible. I assume that's why there are fewer AWS folks here for their launches, and it's a purely personal decision. You also might be biased since Jon and I are particularly loud and slack off a lot at work :).

Fwiw, please call us out if you think we're straying into Sales/Marketing. That's not the intent, and part of why I make sure to put the disclosure on my posts. Clearly, Cloud is a business, but I'm (still) an engineer. My goal is that we should build an excellent product, and hopefully that convinces you or others to use it. If it's not excellent, keep complaining until we improve!

Disclosure: I work on Google Cloud.


How does this compare/relate to the AWS "enhanced networking"?


They're different approaches to solving the same problem (improved throughput with lower latency and jitter). The major thing they have in common is that they both dedicate hardware to the problem.

With respect to AWS, in the historical "enhanced networking" case Amazon dedicated hardware by offering SR-IOV-capable NICs. SR-IOV is a well understood and effective technique for approaching bare-metal performance in virtualized environments, but it tends to lock you into a particular vendor, if not a specific model, of hardware. I gather ENA does something a bit different, but I don't know the details.

In Google's case, we dedicate hardware to the Andromeda switch in the form of processor cores (the "SDN" block in the linked post). This allows us to be flexible in terms of NIC hardware while presenting a uniform virtual device to guests, in addition to simplifying universal rollout of new networking features to all zones/instance types.

Both approaches have tradeoffs, although I think even with ENA AWS hits ~70µs typical round-trip-times while GCE gets down to ~40µs. Amazon's largest VMs in some families do advertise higher bandwidth than GCE does currently.

(I was the tech lead for the hypervisor side of this launch — Jake, the post's author, leads the fast-path team for the Andromeda software switch)


Hmmm...

  [ec2-user@ip-10-0-1-56 ~]$ sudo ping -f 10.0.1.111
  PING 10.0.1.111 (10.0.1.111) 56(84) bytes of data.
  .^C
  --- 10.0.1.111 ping statistics ---
  115480 packets transmitted, 115479 received, 0% packet loss, time 5385ms
  rtt min/avg/max/mdev = 0.037/0.039/0.226/0.008 ms, ipg/ewma 0.046/0.040 ms


Different c5.18xlarge instances with netperf TCP_RR, no significant tuning:

  [ec2-user@ip-10-0-2-191 ~]$ netperf -v 2 -H 10.0.2.52 -t TCP_RR -l 30
  MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.2.52 () port 0 AF_INET : first burst 0
  Local /Remote
  Socket Size   Request  Resp.   Elapsed  Trans.
  Send   Recv   Size     Size    Time     Rate         
  bytes  Bytes  bytes    bytes   secs.    per sec   

  20480  87380  1        1       30.00    21178.69   
  20480  87380 
  Alignment      Offset         RoundTrip  Trans    Throughput
  Local  Remote  Local  Remote  Latency    Rate     10^6bits/s
  Send   Recv    Send   Recv    usec/Tran  per sec  Outbound   Inbound
      8      0       0      0   47.217   21178.689 0.169     0.169


c5 wasn't available to me when I made that comment, or at least c5 numbers weren't. We have them now, although in our tests we're seeing ~10 µs worse than your one-off.

It's certainly a nice improvement over what we see on the c4s. Is that using a placement group to ensure proximity (I believe our tests do, but I'd have to double check)? Our benchmarking philosophy is generally to aim for "default" numbers for GCP and "best" numbers for others -- keeps us honest about our "fresh out of the box" behavior.
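
(For anyone reproducing the AWS side, a cluster placement group is quick to set up; the names, AMI and subnet below are placeholders.)

  # Launch both instances into one cluster placement group so they land close
  # together on the network fabric.
  aws ec2 create-placement-group --group-name rr-test --strategy cluster
  aws ec2 run-instances --image-id ami-12345678 --instance-type c5.18xlarge \
      --count 2 --placement GroupName=rr-test --subnet-id subnet-abcdef01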

Also, if we should be seeing better on earlier instance types, I'd love to know what we're potentially doing wrong.


AFAIK AWS is faster (although the difference may not be noticeable outside microbenchmarks) and they are fully offloading networking (and EBS) onto the NIC.


Replied to a sibling, but I believe our latency is currently coming in under theirs. Their largest/newest VMs advertise higher peak bandwidth than we do. The latency difference would certainly be most directly visible in microbenchmarks, although HPC applications and those relying on in-memory databases are also likely to see practical benefit.


What prevents the hardware offloading from being used on public interfaces? Even if microsecond-level jitter reduction on public networks is negligible this should reduce CPU load in high PPS deployments, right?


edit: It's late here and I think I misread this originally :)

The differential between public IPs and internal IPs is tied into the path packets take after leaving the host. The path out of the guest is identical for both, but using VM public IPs (rather than internal) can result in passing through additional hops versus being routed straight to the target VM. Common firewall configurations can also impact perf here.

Original comment:

With respect to guest CPU, the approach used by Andromeda 2.1 eliminates VM exits both on transmit and for interrupt delivery (where supported by Intel). In that regard it's essentially identical to PCIe passthrough. There are customers running DPDK to further reduce variance (and eliminate the cost of interrupt handling entirely).

The choice to not pass through host hardware comes down to a few factors, but high on the list are supporting live migration and NIC vendor flexibility.

(I worked on this effort; see other comments for specifics)


The Andromeda description link explains that there are special OS drivers for it. Anyone know what part of the magic is guest side?


Which bit are you referring to? I think the blog posts make this a little too abstract. It's virtio-net, and as mentioned elsewhere, we like (and contribute to) the latest kernels for things like better multiqueue support.

For example, here's the script we install in our guest images and encourage people to run to make sure interrupts are paired to queues: https://github.com/GoogleCloudPlatform/compute-image-package...
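
Roughly what that amounts to in-guest, as a sketch (not the actual script's contents; eth0 and the IRQ number are placeholders):

  # How many virtio-net queues did the guest negotiate?
  ethtool -l eth0
  # Find the queue IRQs, then pin each one to its own vCPU
  # (47 and the CPU mask below are examples).
  grep virtio /proc/interrupts
  echo 2 | sudo tee /proc/irq/47/smp_affinity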

Disclosure: I work on Google Cloud but don't know much about networking.


This bit: "Some of the most valuable enhancements enable VMs built on supporting Linux kernels to exploit offload/multi-queue capabilities"

I was wondering what is offloaded. I guess virtio-net is a good keyword, thanks.


At the time Andromeda was originally introduced it took a fairly recent Linux kernel to get support for multi-queue networking and offloads with virtio-net. Today anything even moderately recent has support baked in -- specifically Linux 3.8 and above include multi-queue support (as well as the offloads we support).

In terms of specific offloads, the big ones are TCP segmentation offload (TSO) and TCP large receive offload (LRO). These substantially reduce the compute burden on the guest. Less impactful (although still important) are checksum calculation and verification offload.
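
If you want to see which of these your guest actually negotiated, ethtool will tell you (device name assumed to be eth0):

  # TSO lets the guest hand down large TCP buffers to be segmented for it;
  # LRO/GRO coalesce received segments so the guest handles fewer packets.
  ethtool -k eth0 | egrep 'tcp-segmentation-offload|large-receive-offload|generic-receive-offload|checksum'
  # Offloads can be toggled for experiments, e.g.:
  sudo ethtool -K eth0 tso off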

(I was the tech lead for the hypervisor side of this launch — Jake, the post's author, leads the fast-path team for the Andromeda software switch)


Oh! I think the acronyms you're looking for are GRO, LRO and whatnot.


Is this improvement also available in standard Linux? If not, can it be ported to benefit all Linux VMs?


Yes, any kernel >= 3.8 includes the relevant offload features. The improvements here primarily come from reduced overhead getting packets out of the VM (not mentioned in the post is that Andromeda 2.1 also eliminates VM exits when packets are sent and, where supported by Intel, when interrupts are delivered).

edit: Realized you might have meant Linux VMs running outside of GCE -- the improvements here are fairly GCE-specific, although as wmf points out, vhost is a similar technology in the open source world. Performance specifics down at this level (tens of microseconds and below) tend to be hardware dependent.


It sounds like this feature is akin to vhost which has been available in Linux for a few years. Using vhost-net and OVS or vhost-user and VPP you could build something similar to Andromeda.
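
For illustration, the open-source equivalent looks roughly like this with plain vhost-net and a tap device (a sketch; the tap/bridge setup, queue count and disk image are assumptions):

  # vhost-net moves virtio-net packet processing into a kernel worker thread,
  # avoiding a userspace round trip per packet; queues=/mq= enable multiqueue.
  qemu-system-x86_64 -enable-kvm -m 4G -smp 4 \
      -netdev tap,id=net0,ifname=tap0,script=no,downscript=no,vhost=on,queues=4 \
      -device virtio-net-pci,netdev=net0,mq=on,vectors=10 \
      disk.img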


It's similar, although distinct. By building on a common foundation of Google networking dataplane bits, Jake's team (and peer teams) get easier integration with Google's other networking infrastructure for features like DoS protection, encryption, etc. The core bits underlying Andromeda 2.1 are related to those used for Espresso (https://www.blog.google/topics/google-cloud/making-google-cl... — HN discussion: https://news.ycombinator.com/item?id=14037830).


Congratulations, y'all!



