While I have been using MQTT so far for rtndf, I always had in mind using my own infrastructure. I have been developing the concepts on and off since about 2003 and there’s a direct line from the early versions (intended for clusters of robots to form ad-hoc meshes), through SyntroNet and SNC to the latest incarnation called Manifold. It has some nice features such as auto-discovery, optimized distributed multicast, easy resilience and a distributed directory system that makes node discovery really easy.
The Manifold is made up of nodes. The most important node is ManifoldNexus which forms the hyper-connected fabric of the Manifold. The plan is for rtndf apps to become Manifold nodes to take advantage of the capabilities of Manifold. Manifold has APIs for C++ and Python.
Even though it is very new, Manifold is working quite well. Using Python source and sink scripts, it’s possible to get throughput of around 2 GB per second for both end to end (E2E) and multicast traffic. This figure was obtained using 5000 packets per second of 400,000 bytes each on an i7-5820K machine. Between machines, rates are obviously limited by link speeds for large packets. Round-trip E2E latency is around 50µs for small packets, which could probably be improved. Maximum E2E message rate is about 100,000 per second between two nodes.
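As a quick sanity check on those figures, the arithmetic can be written out in a few lines of Python. This is only a back-of-envelope sketch: the function and variable names are made up for illustration and are not part of Manifold, and the per-message time is simply the inverse of the quoted maximum message rate.

```python
def throughput_bytes_per_sec(packets_per_sec: int, packet_bytes: int) -> int:
    """Payload throughput for a steady stream of fixed-size packets."""
    return packets_per_sec * packet_bytes


# Figures quoted above: 5000 packets per second, 400,000 bytes each.
rate = throughput_bytes_per_sec(5000, 400_000)
rate_gbytes = rate / 1e9       # gigabytes per second
rate_gbits = rate * 8 / 1e9    # equivalent gigabits per second

# Time available per message at the quoted maximum rate of 100,000 per second.
usec_per_message = 1e6 / 100_000

print(f"{rate_gbytes} GB/s, {rate_gbits} Gbit/s, {usec_per_message} us per message")
```

So 2 GB per second is the same as 16 Gbit per second of payload, which is why link speed becomes the limit between machines for large packets.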
Manifold potentially lends itself to being used with poll mode Ethernet links and shared memory links. Poll mode shared memory links are especially effective as latency is minimized and data predominantly bounces off the CPU’s caches. DPDK links are another option, for inter-machine connectivity. Plenty of work left to do…
Back in the dark ages, when CPUs only had one, two or maybe four cores, the idea of dedicating an entire core to a single thread was ridiculous. Then it became apparent that the only way to scale CPU performance was to integrate more cores onto a single CPU chip. Now there are monsters like the Xeon E5-2699 v4 with 22 physical cores – not to mention Knights Landing with 72! People started wondering how to use all these cores in a meaningful way without getting bogged down in delays from cache coherency, locks and other synchronization issues.
Turns out the answer may well be to hard-allocate threads to cores – just one thread locked into each core. This means that almost all of an application can be free of kernel interaction. This is how DPDK gets its speed for example. It uses user space polling to minimize latency and maximize performance.
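To make the idea concrete, here is a minimal sketch of one thread locked to one core and busy-polling, using Python's `os.sched_setaffinity` (Linux only). It is not how DPDK does it: DPDK does this in C, and Python's GIL means this gives none of the performance benefit. The function name `run_pinned_poller` and the counter standing in for "check the queue" are assumptions made purely for illustration.

```python
import os
import threading
import time


def run_pinned_poller(core: int, duration_s: float = 0.05) -> dict:
    """Run a busy-polling worker thread pinned to a single core (Linux only)."""
    result = {}

    def worker():
        # pid 0 means "the calling thread", so only this worker is pinned.
        os.sched_setaffinity(0, {core})
        result["affinity"] = os.sched_getaffinity(0)

        # Busy-poll instead of blocking: the core sits at 100% even when idle.
        polls = 0
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            polls += 1  # stand-in for checking a queue or ring for new data
        result["polls"] = polls

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return result


# Pick a core this process is actually allowed to run on.
allowed = sorted(os.sched_getaffinity(0))
info = run_pinned_poller(allowed[0])
print(f"worker pinned to {info['affinity']}, polled {info['polls']} times")
```

The worker never sleeps or waits on a lock, which is the trade being made: a core is burned continuously in exchange for never paying for a wake-up.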
I have been running some tests using one thread per core with DPDK and lock-free shared memory links. So far, on my old i7-2700K dev machine (with another machine generating test data over a 40Gbps link), I have been seeing over 16Gbps of throughput through DPDK into the shared memory link using a single core without even trying to optimize the code. It’s kind of weird seeing certain cores holding at 100% continuously, even if they are doing nothing, but this is the new reality.
An interesting question comes out of this though: is this the end of Java/Scala and the JVM for high performance applications? Can the JVM support this model of operation fully? I don’t know. Fashions change of course, even when it comes to programming styles. In the old days, every CPU cycle mattered. More recently, people were happy to waste cycles to get things like automatic garbage collection. Maybe now the tide is turning back to every CPU cycle mattering again. And they are some really powerful cycles now!
Following on from the earlier post, it seemed like a good idea to get DPDK’s Pktgen program running to work with the l2fwd example that was previously built. More details after the jump…
I want to set up some point to point links using DPDK to give the advantages of kernel bypass and had lots of “fun” setting things up for this. Probably because I am incapable of following instructions properly, I need to document the process so I don’t waste so much time next time. One major issue I discovered (at the time of writing) is that Ubuntu 16.04 isn’t yet supported – 15.10 is the highest version that seems to be supported by the Mellanox OFED. It might be possible to get it working with 16.04 but it seemed easier just to go back to 15.10. Instructions after the jump…