Exploring the PCIe Bus Routes
As is evident from the Cirrascale web page, we do a lot of business with customers making use of GPU Compute (or other computational accelerators, such as the Intel® Xeon Phi™ Coprocessor). While a few of our customers have workloads that are embarrassingly parallel, most applications require some sort of intercommunication between GPUs, so understanding the data flow between GPUs is important for getting optimal performance out of the system. Even for the embarrassingly parallel jobs, getting data to the GPUs to keep them busy requires understanding the flow of data between local and remote systems.
Many of the people I talk to about GPU Compute are familiar with the typical 8 GPU systems on the market today, such as the Tyan FT77AB7059 or Supermicro 4027GR, which make use of 4 PCIe switches and all PCIe lanes of a dual-socket Intel® Xeon® processor E5-2600 series based server.
GPUs are divided between the two CPUs, and each pair of GPUs is behind a PCIe switch. This configuration is popular with system vendors because it’s relatively easy to build: keeping PCIe integrity between 3 devices (the PCIe root in the CPU, and the two GPUs on any particular PCIe switch) is relatively easy given the physically short trace lengths that are attainable with this configuration. Unfortunately for users, there are a number of downsides to this configuration.
As illustrated by the green arrows in the picture above, communication between two GPUs on the same switch is straightforward: data from a GPU (GPU 2 in the picture above) goes to the switch, then right back down to the adjacent GPU (GPU 3 in our illustration). Full PCIe x16 bandwidth and minimal latency for that transaction. Going between two non-adjacent GPUs on the same CPU, illustrated by the purple arrows, induces a bit of extra latency due to the two switches and root complex that packets must traverse, but overall it's "not that bad". Using the NVIDIA CUDA 6.0 "p2pBandwidthLatency" sample as a trivial benchmark, there is roughly a 7% bandwidth (9.8 GB/s vs 10.6 GB/s) and 2% latency (7.9μs vs 7.7μs) penalty for the extra complexity in the data path [1]. On the software front, this is all transparent, with all 4 GPUs able to peer with each other (to use CUDA parlance), making the writing of multi-GPU software pretty straightforward.
This typical 8 GPU system configuration, where some of the GPUs are on different sockets than the others, also means that there is yet another data path for getting between GPUs: that of getting from one half of the GPUs (e.g., GPU 0-3) to the other (e.g., GPU 4-7). Here is where things get really ugly. Looking again at the illustration above, consider the case where GPU 0 (on CPU 0) wants to share data with GPU 4 (on CPU 1). The traffic leaving GPU 0 starts like any other transaction, heading first to the local PCIe switch, and then on to the PCIe root (CPU 0). But where does it go next? The logical place would be to make use of the high-bandwidth, low-latency QPI bus between the two CPUs, then to the PCIe switch, and finally to GPU 4. Unfortunately, that's a really bad idea, and the reason for the universal "No" symbol around QPI in the illustration above. In the current generation (Ivy Bridge) of Intel Xeon processor E5-2600 series CPUs, the generally very useful QPI link introduces significant latency when forwarding PCIe packets between processors due to the way the packets are buffered and forwarded. How big is "significant"? Going back to our NVIDIA® CUDA® 6.0 "p2pBandwidthLatency" test, the 10.6 GB/s bandwidth observed between two GPUs on the same switch drops to a dismal 4.0 GB/s (just 38% of its former self!), while latency skyrockets from 7.7μs to an astronomical 32.1μs (a 316% change in the wrong direction) [2]. If high-bandwidth, low-latency inter-GPU communication is what you're after, the QPI link should not be your preferred mode of transportation. There's also the fact that since half of the GPUs in the system now reside on a different PCIe root ("CPU 1" vs "CPU 0"), they can't see the address space of all GPUs, and therefore extra care needs to be taken when writing multi-GPU enabled software.
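These penalties can be sanity-checked directly from the rounded figures quoted above; here's a quick back-of-envelope sketch, nothing more:

```python
# Rounded figures quoted above, from the p2pBandwidthLatency sample runs.
same_switch_bw_gbs, same_switch_lat_us = 10.6, 7.7    # GPUs on one PCIe switch
cross_switch_bw_gbs, cross_switch_lat_us = 9.8, 7.9   # two switches, same CPU
qpi_bw_gbs, qpi_lat_us = 4.0, 32.1                    # crossing the QPI link

# Penalty for traversing two switches plus the root complex (same CPU).
bw_penalty = 1 - cross_switch_bw_gbs / same_switch_bw_gbs       # ~7.5%
lat_penalty = cross_switch_lat_us / same_switch_lat_us - 1      # ~2.6%

# Penalty for forwarding PCIe packets across QPI.
qpi_bw_fraction = qpi_bw_gbs / same_switch_bw_gbs               # ~38%
qpi_lat_increase = qpi_lat_us / same_switch_lat_us - 1          # ~317%
```

The exact percentages wobble slightly depending on how the raw measurements were rounded, which is why the recomputed latency increase lands at ~317% rather than exactly the 316% quoted, but the story is the same: QPI forwarding of PCIe packets is brutal.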
New APIs made available in CUDA 6.0 have gone a long way toward removing much of this complexity, allowing for buffers to be managed by CUDA, but just because CUDA can now take care of moving data between different groups of GPUs doesn’t make the performance impact any less dramatic.
Since QPI is out as a method to get PCIe packets from one set of GPUs to another, a potential solution is to use a low-latency interconnect such as Infiniband. While I'm not personally a fan of Infiniband, it does have its place, and that place is when you need really low latencies. Infiniband gives per-packet latencies in the single-digit microseconds [3], which ends up matching nicely with what is needed to extend inter-GPU communication. The bonus of using a transport like Infiniband is that it makes no difference if the "other" GPUs are on the socket next door, or on another system elsewhere in the network. Given suitable GPU Direct RDMA capable drivers, this seems like a pretty decent solution, unless you are actually developing applications. In that case, there's some extra overhead from the user's perspective…okay, a lot of extra overhead, as now any application that wants to use a non-trivial number of GPUs needs to get involved in the job scheduling business, and know that some traffic needs to take a different transport (Infiniband) than other traffic (PCIe). Going back to the typical 8 GPU system shown above, there's also the fact that the Infiniband connectivity to the GPUs is only PCIe x8. That's more than sufficient for a single FDR port (56 Gbps), but significantly less than what GPUs on the same PCIe root complex can achieve when talking amongst themselves.
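For context on that x8 limitation, here's a back-of-envelope comparison assuming the standard line encodings (64b/66b for FDR Infiniband, 128b/130b for PCIe Gen3); real-world throughput is lower still once protocol overhead is accounted for:

```python
# 4x FDR Infiniband: 56 Gb/s raw signaling, 64b/66b encoded payload.
fdr_payload_gbs = 56.0 * 64 / 66 / 8         # ~6.8 GB/s per port

# PCIe Gen3: 8 GT/s per lane, 128b/130b encoding.
pcie_x8_gbs = 8 * 8.0 * 128 / 130 / 8        # ~7.9 GB/s
pcie_x16_gbs = 16 * 8.0 * 128 / 130 / 8      # ~15.8 GB/s

# A Gen3 x8 slot comfortably feeds one FDR port...
assert pcie_x8_gbs > fdr_payload_gbs
```

...but GPUs peering over a shared x16 root complex have roughly double that bandwidth available to each other, which is exactly the gap the text describes.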
Unfortunately, this is where many people I talk to have stopped. They have the notion that “Yes, you can put 8 GPUs in a server,” accompanied by a huge caveat of “but it won’t perform like you’d want”. Fortunately, “typical” isn’t what we do at Cirrascale, so I get to talk about changing expectations of how a GPU heavy system behaves.
The Cirrascale GB5470 is a flexible platform allowing us to support a number of different PCIe topologies. The most logical evolution from the typical 8 GPU system is what is shown in the diagram above. This configuration improves on two of the problem areas discussed above. First, instead of placing 2 GPUs behind each switch, 4 GPUs are now placed on the same switch (using the Cirrascale PCIe Gen3 Riser), removing the additional PCIe latency induced by the prior configuration in some scenarios. The green arrows showing communication between GPU 2 and GPU 3 could apply to any pair of GPUs on that same CPU socket. This allows more freedom for users, as the traffic pattern between GPUs can be handled at run time, not at system-build time. Applications can decide to pair up any of GPUs 0 through 3 with any other GPU in that group, as the bandwidth and latency between any two is identical. Similarly, if an application requires communicating with more than one GPU, there are 3 others to choose from with no penalty. The second significant improvement is for communicating between sets of GPUs on different sockets, shown by the blue arrows. The Infiniband card (or whatever the interconnect of choice is…our GB5470 doesn't really care what you use, and we've seen some crazy things!) now has a full PCIe Gen3 x16 connection, allowing for something like a dual-port Infiniband FDR card (such as the Mellanox MCB194A-FCAT) to be used for inter-GPU communication. If a GPU needs (almost) full PCIe Gen3 x16 bandwidth to another GPU, it is now possible.
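The same back-of-envelope arithmetic (again assuming 64b/66b encoding for FDR and 128b/130b for PCIe Gen3) shows why the full x16 slot matters for a dual-port card:

```python
# Dual-port 4x FDR: 2 ports x 56 Gb/s raw, 64b/66b encoded payload.
dual_fdr_gbs = 2 * 56.0 * 64 / 66 / 8      # ~13.6 GB/s aggregate
pcie_x8_gbs = 8 * 8.0 * 128 / 130 / 8      # ~7.9 GB/s -- would bottleneck both ports
pcie_x16_gbs = 16 * 8.0 * 128 / 130 / 8    # ~15.8 GB/s -- headroom for both ports

# Two FDR ports overwhelm a Gen3 x8 link but fit within a Gen3 x16 link.
assert pcie_x8_gbs < dual_fdr_gbs < pcie_x16_gbs
```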
For many applications, the configuration above is a significant boost in capability over what users are accustomed to. Some applications can make better use of a different PCIe topology in the GB5470, which solves a different set of problems.
Moving all 8 GPUs to the same PCIe root ("CPU 0" in the diagram above) is the obvious big win of the configuration shown above. 8 GPUs on the same PCIe root (and hence able to be peers, in CUDA terms) makes for a really fast and flexible system. Of course, there are still some PCIe topology considerations to keep in mind, like scheduling transfers to GPUs on the "other" PCIe switch so as to not saturate the PCIe Gen3 x16 link to and from the CPU, but for users accustomed to trying to manage two pairs of GPUs in a typical 8 GPU system, the constraints of this configuration are trivial to deal with. The huge benefit here, of course, is that all 8 GPUs share the same address space, and data can be shared between them trivially and transparently using CUDA APIs. The blue arrows showing GPU communication no longer have to go over an Infiniband link, but remain entirely on the same PCIe root: low latency and high bandwidth for 8 GPUs. A notable downside is that to get data on or off the system, the QPI bus must be used. Since communication between the GPUs and the Infiniband card isn't direct in many applications (there's oftentimes a bounce buffer used, if not some higher-level protocol handling with some pointer shuffling, since GPU Direct RDMA isn't possible here), this doesn't necessarily hit the QPI performance problems that afflict the configurations mentioned previously (and the Infiniband card is still on a full PCIe Gen3 x16 bus, just on the "other" CPU), but it is clearly not optimal. However, for GPU compute jobs which can "fit" within 8 GPUs, this configuration offers unparalleled bandwidth and latency between GPUs. A large number of applications really excel in this configuration due to the relatively flat PCIe topology.
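As a hypothetical illustration of that scheduling consideration (the GPU-to-switch mapping and the helper below are my own sketch, not a Cirrascale or CUDA API), host transfers can be ordered so that back-to-back copies land on different switch uplinks rather than queuing up behind one shared x16 link:

```python
# Assumed mapping for the 8-GPUs-on-one-root configuration described above:
# GPUs 0-3 behind one PCIe switch, GPUs 4-7 behind the other.
SWITCH_OF = {gpu: 0 if gpu < 4 else 1 for gpu in range(8)}

def interleave_by_switch(transfers):
    """Order (gpu_id, payload) transfers so consecutive host copies
    alternate between the two switch uplinks where possible."""
    queues = {0: [], 1: []}
    for t in transfers:
        queues[SWITCH_OF[t[0]]].append(t)
    ordered = []
    while queues[0] or queues[1]:
        for s in (0, 1):
            if queues[s]:
                ordered.append(queues[s].pop(0))
    return ordered
```

For example, `interleave_by_switch([(0, "a"), (1, "b"), (4, "c"), (5, "d")])` yields `[(0, "a"), (4, "c"), (1, "b"), (5, "d")]`, spreading the host-side traffic across both switches.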
It’s probably obvious that there are a number of other possible configurations for the GB5470, such as replacing one of the GPUs in the previous configuration with the Infiniband card, sacrificing a GPU’s worth of compute capability in return for exceptionally high-bandwidth and low-latency inter-system communication. For other applications, where the inter-system communication can be supported with a PCIe Gen3 x8 connection (read: usually a single-port Infiniband FDR card) and x86 CPU resources aren’t in high demand, single-socket solutions make more sense.
A PCIe topology as shown above (another Cirrascale GB5400-series system, making use of Intel® Xeon® processor E5-1600 series CPUs) yields low-latency inter-system communication, while preserving the benefits of 8 GPUs on the same PCIe root complex. There are obviously other tradeoffs involved with this solution, related to the single-socket nature of this product (higher CPU clock speed, but lower overall core count and RAM capacity than a dual-socket system), but that’s why I enjoy talking with people about our products.
Understanding what users are trying to achieve with their GPU intensive applications gives me reason and opportunity to explore what bottlenecks are holding them back. In turn, I can arm them with knowledge that lets us work together to think beyond the typical system, and instead toward what’s possible.
[1] Benchmark run on CentOS 6.5 x64 (kernel 2.6.32-431.11.2, NVIDIA driver 331.44, CUDA 6.0RC) using NVIDIA K40 cards and the p2pBandwidthLatency CUDA sample. Measurements are the mean of 5 test runs with P2P-enabled unidirectional traffic.
[2] Benchmark run on the same system as previously described.
Intel, the Intel logo, Xeon, and Intel Inside are trademarks of Intel Corporation in the U.S. and other countries. NVIDIA, Tesla, Quadro, GRID, and GeForce are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. All other names or marks are property of their respective owners.