Getting to Good Enough with Homebrew Tools
Being a small company, we’re constantly looking for ways to save money on capital expenditures (sure, people cost money too, but the culture we’ve created here at Cirrascale means people work hard and go above and beyond…so labor costs for internal projects are often absorbed pretty easily). For immediate, high-priority issues, we obviously do whatever it takes (including throwing money at the problem), but long-term our goal is to be lean – no large or recurring expenses unless they’re necessary.
In the past few weeks, I’ve been trying to recreate an issue one of our customers reported: intermittent network problems when using 10GBaseT. After the typical first steps of verifying that the components, firmware, and associated networking gear were all at the latest versions, and ruling out software and user errors, it was clear that the symptoms the customer was reporting were genuine, but that the source was non-obvious. This sent me down the path of trying to recreate the issue in our lab, which is difficult given the complexity and scale of the environment at the customer site – there was really no practical way to simulate their environment on a small scale and then narrow things down from there. In unfortunate situations like this, the testing starts out ad hoc so that we can get maximum coverage at various test points just to see which elements the problem is sensitive to.
Since this is an issue that manifests itself as network-related problems, the starting point for our test setup was of course to run streams of packets through the system and try to spot errors. The trusty iperf is a quick way to generate traffic, with ethtool(8) and ifconfig(8) providing the “are there errors?” side of things. After firing up traffic and abusing our systems in a variety of different ways, we were unable to generate errors here in our lab. Seriously…zero errors. Nada. Zilch. Yet the customer was reporting errors in (what I think is) a far less demanding environment than the one we had tortured our systems with in our lab. Even using suspect parts from the customer equipment (“Maybe this board is bad…try that in the test setup!”), no errors became evident. Something else was needed to try to figure this problem out.
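For the curious, the “are there errors?” check can be scripted rather than eyeballed. What follows is only a rough sketch of the idea, not the tooling we actually used: it snapshots the counters printed by `ethtool -S` before and after a traffic run and reports any error-looking counter that moved. The function names and the list of name hints (`err`, `drop`, `crc`, `miss`) are assumptions made for illustration; counter names vary from driver to driver.

```python
import subprocess

# Substrings that usually mark an error counter. This list is an assumption;
# the names printed by `ethtool -S` are driver-specific.
ERROR_HINTS = ("err", "drop", "crc", "miss")


def parse_ethtool_stats(text):
    """Parse `ethtool -S <iface>` output into a {counter_name: int} dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip()
        # Skip the "NIC statistics:" header and anything non-numeric.
        if sep and value.lstrip("-").isdigit():
            stats[name.strip()] = int(value)
    return stats


def error_deltas(before, after, hints=ERROR_HINTS):
    """Return the error-looking counters that changed between two snapshots."""
    deltas = {}
    for name, new in after.items():
        if any(hint in name.lower() for hint in hints):
            change = new - before.get(name, 0)
            if change:
                deltas[name] = change
    return deltas


def read_stats(iface):
    """Take a live snapshot (needs ethtool installed and a real interface)."""
    result = subprocess.run(
        ["ethtool", "-S", iface], capture_output=True, text=True, check=True
    )
    return parse_ethtool_stats(result.stdout)
```

In use, you would call `read_stats()` on the interface under test, run the iperf traffic, call it again, and pass both snapshots to `error_deltas()`. An empty result is exactly the “zero errors” outcome described above.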
Thanks to our really helpful friends at Ixia, we were able to set up a number of blades in a BladeRack 2 Series Platform with continual 10Gbps traffic flowing through them, and monitor every frame for errors. Because of the way the Ixia product works (FPGAs making raw frames and sending them to the PHY), re-running the same types of tests that hadn’t caused errors previously now let us see errors. Yay! There was now a repeatable way to cause errors to happen, and the nature of the errors aligned well with the symptoms the customer was seeing. From there, it was a relatively quick path to understanding the problem and coming up with solutions.
What does this have to do with trying to be wise with company money?
With the immediate customer problem taken care of, the obvious question became “How did we miss this in the design phase?” quickly followed by “How can we catch this (and similar things) next time?” The first question has some pretty interesting answers, but is a tale for another time (over a stiff drink…). The second question is at least partially answered by having a suitable test harness available so that we can verify the problem doesn’t crop up again in newly built and newly designed products.
While the Ixia product was fantastic for getting the problem solved quickly, it’s quite an investment to make given that it’s overkill for our usage; for this particular issue we don’t need the advanced capabilities of the product, such as replicating customer workloads, or injecting errors at various levels. What we need is a way to generate high-speed traffic, and monitor for errors. Astute readers may notice that that’s exactly what we tried initially using iperf, ethtool(8), and ifconfig(8) – only to not see any errors. It turns out that while errors can sometimes be so egregious that the NIC driver and OS have visibility into them, what happens more frequently (as a precursor to those errors) is errors between the MAC and PHY. Specifically, the MAC detects that the PHY sent some bogus data and filters that erroneous data out…meaning OS- and driver-level tools (i.e., ethtool(8) and ifconfig(8)) never see the errors. With the Ixia product, this didn’t happen because the FPGA sees, and reports, everything that comes from the PHY: no errors get ignored. Of course, testing for a longer period (now that we know the conditions which can cause the problem) would undoubtedly show the customer symptoms as well, but a shorter test cycle means higher design and production throughput.
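To make that visibility gap concrete, here is a deliberately simplified toy model of the receive path. It is not how the X540 or the Ixia hardware is implemented; the frame kinds (`ok`, `phy_bad`, `egregious`) and the counter names are made up purely to illustrate how one class of error can be counted at the MAC/PHY boundary while driver-level counters stay at zero.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    mac_phy_errors: int = 0  # bad data flagged at the MAC/PHY boundary
    os_rx_errors: int = 0    # errors the driver and OS tools would report
    delivered: int = 0       # frames handed up to the network stack


def receive(frames):
    """Toy receive path. Each frame is 'ok', 'phy_bad', or 'egregious'."""
    counters = Counters()
    for kind in frames:
        if kind == "phy_bad":
            # The MAC discards the bogus data; nothing propagates upward.
            counters.mac_phy_errors += 1
        elif kind == "egregious":
            # Bad enough to surface in driver-level statistics as well.
            counters.mac_phy_errors += 1
            counters.os_rx_errors += 1
        else:
            counters.delivered += 1
    return counters


if __name__ == "__main__":
    traffic = ["ok"] * 1000 + ["phy_bad"] * 5
    print(receive(traffic))
```

In this model, a run containing only `phy_bad` events leaves `os_rx_errors` at zero, which mirrors what we saw in the lab: a tool watching only OS-level counters reports a clean run, while something watching the MAC/PHY boundary does not.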
Nearly all of our 10GBaseT solutions involve the use of the Intel X540, because it is an extremely high-performance and low-cost part. Fortunately for us, Cirrascale and Intel also have a great working partnership, so a quick description of where we were trying to see errors had Intel providing us with a tool that can show exactly that (in addition to generating lots of network traffic while using relatively little CPU resources!). This enables us to take these relatively low-cost NICs, combined with other COTS parts, and build a test harness that can look for errors exactly where we want.
In the end, we have a homebrew solution which required negligible capital outlay (we had most of the components already lying around our lab), and is easily repeatable…we could make another one just like it (that’s what we do here after all!) as many times as we need to satisfy our testing demands. There are obviously downsides, like not having as user-friendly an interface as the commercial tools and not having nearly as many options and knobs. For us, though, the solution is optimal – it is tailored to our needs. It’s by no means perfect, but it’s good enough.