There are lots of guides out there describing how to set up a simple Apache Kafka configuration, but they generally stop short of describing how to use it with a three-node Apache ZooKeeper quorum so that ZooKeeper isn’t a single point of failure. The machines I am working with are running these components:
- Server1 (static 192.168.10.11) – ZooKeeper
- Server2 (static 192.168.10.12) – ZooKeeper
- Server3 (static 192.168.10.13) – ZooKeeper, Kafka broker
- Desktop (static 192.168.10.14) – Kafka producer and Kafka consumer
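As a sketch, the quorum is declared in each ZooKeeper server’s zoo.cfg; the `server.N` lines match the static addresses above, while the timing values and `dataDir` shown here are common defaults and may need adjusting for your install:

```
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
# quorum members: peer port 2888, leader-election port 3888
server.1=192.168.10.11:2888:3888
server.2=192.168.10.12:2888:3888
server.3=192.168.10.13:2888:3888
```

Each server additionally needs a file named `myid` inside `dataDir` containing just its number (1, 2, or 3) so it knows which `server.N` entry it is.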
This setup doesn’t use multiple Kafka brokers but that’s a relatively simple extension.
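The piece that ties the broker to the quorum is the `zookeeper.connect` setting in the broker’s server.properties: listing all three ZooKeeper nodes means the broker can keep working if any single one goes down. A sketch for the broker on Server3 (the `broker.id`, listener address, and log directory here are illustrative values):

```
broker.id=0
listeners=PLAINTEXT://192.168.10.13:9092
log.dirs=/var/lib/kafka-logs
# all three quorum members, so no single ZooKeeper node is a point of failure
zookeeper.connect=192.168.10.11:2181,192.168.10.12:2181,192.168.10.13:2181
```

Adding more brokers is then mostly a matter of repeating this file with a unique `broker.id` and listener address on each extra machine.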
Apache NiFi is a great way of capturing and processing streams, while Apache Kafka is a great way of storing stream data. There’s an excellent description here of how to configure NiFi to pass data to Kafka using MovieLens data as its source. Since I am not running HDFS, I modified the example to put only the movies and tags data into Kafka and to save the ratings data to a local file. Trying to stash the ratings data into Kafka doesn’t work: there is simply too much of it arriving too fast, and buffers overflow. It’s easy to use the Kafka console consumer to check that the data is being stored in the movies and tags topics, and the local ratings.dat file will be generated as well.
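Assuming the topics are named movies and tags (the actual names depend on how the NiFi Kafka processors were configured), the check from the desktop machine might look like this, run from the Kafka install directory:

```
bin/kafka-console-consumer.sh \
    --bootstrap-server 192.168.10.13:9092 \
    --topic movies --from-beginning
```

Repeat with `--topic tags` for the second topic. Older Kafka releases use a `--zookeeper` option pointing at the quorum (e.g. `--zookeeper 192.168.10.11:2181`) instead of `--bootstrap-server`.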