Back in the dark ages, when CPUs only had one, two or maybe four cores, the idea of dedicating an entire core to a single thread was ridiculous. Then it became apparent that the only way to keep scaling CPU performance was to integrate more cores onto a single chip. Now there are monsters like the Xeon E5-2699 v4 with 22 physical cores, not to mention Knights Landing with up to 72! People started wondering how to use all these cores in a meaningful way without getting bogged down in delays from cache coherency, locks and other synchronization issues.
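Turns out the answer may well be to hard-allocate threads to cores: just one thread locked onto each core. With nothing else competing for the core, the thread rarely gets switched out, and almost all of the application can run free of kernel interaction. This is how DPDK gets its speed, for example. Its poll mode drivers run in user space and keep polling the NIC for packets instead of waiting for interrupts, which minimizes latency and maximizes throughput.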
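To make the idea concrete, here is a minimal sketch of locking the calling thread onto one core and then spinning in a polling loop. It assumes Linux with glibc and uses the plain `sched_setaffinity` call rather than anything from DPDK. The names `first_allowed_core`, `pin_self_to_core`, `poll_fn` and `poll_until_stopped` are made up for illustration.

```c
#define _GNU_SOURCE   /* for cpu_set_t, the CPU_* macros and sched_setaffinity */
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Return the lowest-numbered core this thread is allowed to run on, or -1. */
int first_allowed_core(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int core = 0; core < CPU_SETSIZE; core++)
        if (CPU_ISSET(core, &set))
            return core;
    return -1;
}

/* Restrict the calling thread to a single core.
 * Returns 0 on success, -1 with errno set on failure. */
int pin_self_to_core(int core_id)
{
    if (core_id < 0 || core_id >= CPU_SETSIZE) {
        errno = EINVAL;
        return -1;
    }
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    return sched_setaffinity(0, sizeof(set), &set);   /* pid 0 = calling thread */
}

/* Hypothetical work callback: returns how many items it handled (0 if idle). */
typedef unsigned (*poll_fn)(void *ctx);

/* Spin until *stop becomes true, calling poll() on every pass.
 * The loop never sleeps or blocks, so the core stays busy even when idle. */
unsigned long poll_until_stopped(poll_fn poll, void *ctx, atomic_bool *stop)
{
    assert(poll != NULL && stop != NULL);
    unsigned long handled = 0;
    while (!atomic_load_explicit(stop, memory_order_acquire))
        handled += poll(ctx);
    return handled;
}
```

A worker thread would call `pin_self_to_core` once at startup and then sit in `poll_until_stopped` for the rest of its life. Pinning only stops the thread from migrating; keeping other tasks off that core takes extra system configuration (CPU isolation), which this sketch leaves out.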
I have been running some tests using one thread per core with DPDK and lock-free shared memory links. So far, on my old i7-2700K dev machine (with another machine generating test data over a 40 Gbps link), I have been seeing over 16 Gbps of throughput through DPDK into the shared memory link using a single core, without even trying to optimize the code. It’s kind of weird seeing certain cores holding at 100% continuously, even when they have nothing to process, since a polling loop never sleeps, but this is the new reality.
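For a feel of what a lock-free link between two pinned threads can look like, here is a sketch of a single-producer/single-consumer ring built on C11 atomics. This is not the code from my tests, just one common way to build such a link. The names (`spsc_ring`, `ring_push`, `ring_pop`) and the capacity `RING_SLOTS` are assumptions for illustration. The struct holds no pointers, so in principle it could sit in a shared memory mapping, but setting that mapping up is left out here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS 1024u   /* assumed capacity; must be a power of two */
static_assert((RING_SLOTS & (RING_SLOTS - 1u)) == 0,
              "RING_SLOTS must be a power of two");

/* Single-producer/single-consumer ring.
 * head and tail sit on separate cache lines to avoid false sharing. */
struct spsc_ring {
    _Alignas(64) _Atomic uint32_t head;   /* next slot to write; producer only */
    _Alignas(64) _Atomic uint32_t tail;   /* next slot to read; consumer only */
    _Alignas(64) uint64_t slots[RING_SLOTS];
};

void ring_init(struct spsc_ring *r)
{
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side: returns false (without blocking) if the ring is full. */
bool ring_push(struct spsc_ring *r, uint64_t value)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS)
        return false;
    r->slots[head & (RING_SLOTS - 1u)] = value;
    /* Release: the slot write must be visible before the new head is. */
    atomic_store_explicit(&r->head, head + 1u, memory_order_release);
    return true;
}

/* Consumer side: returns false (without blocking) if the ring is empty. */
bool ring_pop(struct spsc_ring *r, uint64_t *out)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->slots[tail & (RING_SLOTS - 1u)];
    /* Release: the slot read must finish before the slot is handed back. */
    atomic_store_explicit(&r->tail, tail + 1u, memory_order_release);
    return true;
}
```

Because exactly one thread writes `head` and exactly one writes `tail`, neither side ever needs a lock or a compare-and-swap. A full or empty ring just returns `false`, and the caller's polling loop tries again on its next pass.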
An interesting question comes out of this though: is this the end of Java/Scala and the JVM for high performance applications? Can the JVM support this model of operation fully? I don’t know. Fashions change of course, even when it comes to programming styles. In the old days, every CPU cycle mattered. More recently, people were happy to waste cycles to get things like automatic garbage collection. Maybe now the tide is turning back to every CPU cycle mattering again. And they are some really powerful cycles now!