This one looks quite a bit nicer than my previous attempt at this design! The functionality is the same, but now a lot of the heavier processing has been moved into a new infrastructure developed to integrate artificial intelligence and machine learning functions into data flows efficiently. Now I am able to leverage Apache NiFi’s extensive range of processors to interface with all kinds of systems, while also escaping the JVM environment to get bare-metal performance for the higher-level functions, including access to GPUs. In this design I am just using NiFi’s MQTT and Elasticsearch processors, but it could just as easily fire processed data into HDFS, Kafka, etc.
There are lots of guides out there describing how to set up simple Apache Kafka configurations, but they generally stop short of describing how to use Kafka with a three-node Apache ZooKeeper quorum so that ZooKeeper isn’t a single point of failure. The machines I am working with are running these components:
- Server1 (static 192.168.10.11) – ZooKeeper
- Server2 (static 192.168.10.12) – ZooKeeper
- Server3 (static 192.168.10.13) – ZooKeeper, Kafka broker
- Desktop (static 192.168.10.14) – Kafka producer and Kafka consumer
This setup doesn’t use multiple Kafka brokers but that’s a relatively simple extension.
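For reference, a three-node quorum like the one above can be configured with an identical zoo.cfg on each of the three servers. The paths and ports shown here are common ZooKeeper defaults, not necessarily what I used, so adjust them to your layout:

```
# zoo.cfg -- identical on Server1, Server2 and Server3
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
# One entry per quorum member: server.N=host:peerPort:electionPort
server.1=192.168.10.11:2888:3888
server.2=192.168.10.12:2888:3888
server.3=192.168.10.13:2888:3888
```

Each server also needs a myid file in dataDir containing just its own number (so `echo 1 > /var/lib/zookeeper/myid` on Server1, and so on). The Kafka broker on Server3 is then pointed at the whole quorum rather than a single node, so it can survive the loss of any one ZooKeeper server:

```
# server.properties on Server3 (broker) -- the relevant line
zookeeper.connect=192.168.10.11:2181,192.168.10.12:2181,192.168.10.13:2181
```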
Apache NiFi is a great way of capturing and processing streams, while Apache Kafka is a great way of storing stream data. There’s an excellent description here of how to configure NiFi to pass data to Kafka using MovieLens data as its source. Since I am not running HDFS, I modified the example to put just the movies and tags data into Kafka and save the ratings data to a local file. Trying to stash the ratings data into Kafka doesn’t work: there is just too much of it arriving too fast, and buffers overflow. It’s pretty easy to use the Kafka console consumer to check that the data is being stored for the movies and tags topics, and the local ratings.dat file will also be generated.
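The console consumer check looks something like the following, run from the Kafka install directory on the Desktop machine. The broker address comes from the setup above; the topic names (movies, tags) are whatever you configured in the NiFi flow, so treat them as placeholders:

```shell
# Dump everything NiFi has landed in the movies topic so far
bin/kafka-console-consumer.sh --bootstrap-server 192.168.10.13:9092 \
    --topic movies --from-beginning

# Same check for the tags topic
bin/kafka-console-consumer.sh --bootstrap-server 192.168.10.13:9092 \
    --topic tags --from-beginning
```

If records scroll past, the NiFi-to-Kafka leg of the flow is working; Ctrl-C stops the consumer.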