Kafka: Getting Started

Posted on Thursday, December 22, 2016



In this article I will give a quick explanation of how Kafka works, then install it on an Ubuntu 16.04 server and run a few basic commands to make sure it's working.





What does Kafka do?


First what is Kafka and why would I want it?

From Wikipedia [1]:

Apache Kafka is an open-source stream processing platform developed by the Apache Software Foundation written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a "massively scalable pub/sub message queue architected as a distributed transaction log," making it highly valuable for enterprise infrastructures to process streaming data.



OK, fantastic. What does that mean?

A good place to start educating yourself on Kafka is the project's own documentation: http://kafka.apache.org/documentation/ [2]
But before you dive into that, let me go over a simple "How Kafka Works" example.




Records, Topics and Partitions


In Kafka information is stored as a Record.  A Record contains three pieces of information.

1. Value: the stored message (messages are typically small, ~10KB)
2. Key: an optional key that can be associated with a record
3. Timestamp: as of version 0.10.0, records also include a timestamp


Records are written sequentially to a Partition of a Topic.



This image shows the anatomy of a Topic that contains a single Partition.  As records come in they are appended to the end of the partition, forming an immutable sequence of records.  If I add one more record, it will be appended to the end.




This shows the newly added record appended to the end.  When a record is added to a partition it is assigned a sequential id number called the offset.  In this case the newly added record has an offset of 6.
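
Once Kafka is installed (we will do that below), you can peek at these offsets yourself; Kafka ships a small GetOffsetShell tool. This is just a sketch, assuming the install path and topic name used later in this article:


  > /opt/kafka/kafka_2.11-0.10.1.0/bin/kafka-run-class.sh \
kafka.tools.GetOffsetShell --broker-list localhost:9092 \
--topic topic-one --time -1


It prints topic:partition:offset, where the offset is the log-end offset (one past the last record written), so something like topic-one:0:7 for the seven records above.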



Lifecycle of a Record?


How long does a Record stay around?  That depends.

In the server.properties file there are a few settings that determine how long a record stays in a partition. 

Name                  Description                                                      Type   Default
log.retention.hours   The number of hours to keep a log file before deleting it;      int    168
                      tertiary to the log.retention.ms property
log.retention.bytes   The maximum size of the log before deleting it                  long   -1 (no size limit)

These are the basic properties that set the retention rules for a record.  The default setting is to remove a record once it is older than 168 hours (7 days).  You can also set a size limit; if you do, once the partition exceeds that size, records are removed from the front of the partition until the total is back under the limit.





For example…



In this Topic with one partition, 6 records have been written.  The first two records were written on day 1 and the rest on day 5.




Eight days later, if we look at the partition we will see that the first two records have been removed.  They were removed based on the log.retention.hours server setting: everything 7 days or older is removed.
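
These broker-wide defaults can also be overridden per topic.  As a rough sketch (run this once Kafka is installed, as covered below; the topic name and the one-day value are just examples), the kafka-configs tool can set retention.ms on a single topic:


  > /opt/kafka/kafka_2.11-0.10.1.0/bin/kafka-configs.sh \
--zookeeper localhost:2181 --alter \
--entity-type topics --entity-name topic-one \
--add-config retention.ms=86400000


Here 86400000 ms is 24 hours; a per-topic retention.ms takes precedence over the hour-based broker default.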



Reading from a Topic


When a consumer starts up it subscribes to a Topic.



Although a new consumer can read every message in a topic, it is more typical to subscribe to a topic and wait for new records to be sent to it.





When a new record is added to the Topic, it is sent to all Consumers attached to that Topic/Partition.  In this example the Consumer subscribed to this Topic after Record '4' was added.  No records were sent to the Consumer until the next record, Record '5', was added to the topic.

As long as this consumer is attached all records added to this Topic/Partition will be sent to it.





Multiple Consumers can be attached to the same Topic/Partition.

OK, that covers the basics.  With that in mind I am going to install Kafka on Ubuntu 16.04 and run a few tests.





Installing Kafka on Ubuntu 16.04

 I have a basic Ubuntu 16.04 server installed.

Install Oracle Java 1.8


You need Java installed on the machine, and I prefer installing Oracle's Java over OpenJDK.

Run the following commands to install it.


  > echo oracle-java8-installer \
shared/accepted-oracle-license-v1-1 select true | \
sudo /usr/bin/debconf-set-selections
  > echo \
"deb http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main" | \
sudo tee /etc/apt/sources.list.d/webupd8team-java.list
  > echo \
"deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main" | \
sudo tee -a /etc/apt/sources.list.d/webupd8team-java.list
  > sudo apt-key adv --keyserver \
hkp://keyserver.ubuntu.com:80 --recv-keys EEA14886
  > sudo apt-get update
  > sudo apt-get -y install oracle-java8-installer



Now check the Java version.


  > java -version










Install Zookeeper


Now we need to install ZooKeeper.  I am not a ZooKeeper guy… yet, but it's required for a Kafka install.


  > sudo apt-get install zookeeperd



Test to make sure it's up


  > netstat -ant | grep :2181




This is the result you want.
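
Another quick check, if you have netcat handy, is ZooKeeper's four-letter-word health command; ruok should come back with imok:


  > echo ruok | nc localhost 2181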





Install Kafka


Here is Kafka's download page: https://kafka.apache.org/downloads.html [3]
This is where I found the URL to download.


  > wget \
http://apache.cs.utah.edu/kafka/0.10.1.0/kafka_2.11-0.10.1.0.tgz


Make a directory for Kafka and untar the download into it.



  > sudo mkdir /opt/kafka
  > sudo tar -xvf kafka_2.11-0.10.1.0.tgz -C /opt/kafka





Try it out real quick to make sure it runs.


  > sudo /opt/kafka/kafka_2.11-0.10.1.0/bin/kafka-server-start.sh \
/opt/kafka/kafka_2.11-0.10.1.0/config/server.properties




Looks good.

Leave it running and use the Kafka console tools to talk to it.
These tools are located in
/opt/kafka/kafka_2.11-0.10.1.0/bin/
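
As a side note: if you would rather not dedicate a terminal to the broker, kafka-server-start.sh also accepts a -daemon flag that sends it to the background.


  > sudo /opt/kafka/kafka_2.11-0.10.1.0/bin/kafka-server-start.sh -daemon \
/opt/kafka/kafka_2.11-0.10.1.0/config/server.properties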






I am going to set up some simple wrapper scripts to make these commands easier to run.



  > sudo vi /bin/kafka-topics


And place the following in it


#!/bin/bash
exec "/opt/kafka/kafka_2.11-0.10.1.0/bin/kafka-topics.sh" "$@"


Make it executable


  > sudo chmod 755 /bin/kafka-topics




Let me do the same thing for kafka-console-consumer


  > sudo vi /bin/kafka-console-consumer


And place the following in it


#!/bin/bash
exec "/opt/kafka/kafka_2.11-0.10.1.0/bin/kafka-console-consumer.sh" "$@"


Make it executable


  > sudo chmod 755 /bin/kafka-console-consumer




Let me do the same thing for kafka-console-producer


  > sudo vi /bin/kafka-console-producer


And place the following in it


#!/bin/bash
exec "/opt/kafka/kafka_2.11-0.10.1.0/bin/kafka-console-producer.sh" "$@"


Make it executable


  > sudo chmod 755 /bin/kafka-console-producer
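
If typing three nearly identical wrapper scripts feels tedious, the same result can be had with one loop.  This is just an alternative sketch, assuming the same install path:


  > for t in kafka-topics kafka-console-consumer kafka-console-producer; do \
      printf '#!/bin/bash\nexec "/opt/kafka/kafka_2.11-0.10.1.0/bin/%s.sh" "$@"\n' "$t" \
        | sudo tee /bin/$t > /dev/null; \
      sudo chmod 755 /bin/$t; \
    done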






Creating a Topic


In Kafka you post messages to topics.  Currently you have no topics set up.  To prove this, run this command.


  > kafka-topics --zookeeper localhost:2181 --list


You should get nothing returned


 


Now create a topic


  > kafka-topics --create \
--zookeeper localhost:2181 \
--replication-factor 1 \
--partitions 1 \
--topic "topic-one"





For this simple example I will not go into multiple partitions or the replication factor.
And now list all topics again.


  > kafka-topics --zookeeper localhost:2181 --list


 

There it is…
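
While we are here, --describe gives a bit more detail on the topic, showing the partition count, replication factor, and which broker leads each partition:


  > kafka-topics --zookeeper localhost:2181 \
--describe --topic topic-one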






Send Message to Topic


In another terminal start a producer


  > kafka-console-producer --broker-list \
localhost:9092 --topic topic-one


Then in yet another terminal start a consumer and listen to the topic


  > kafka-console-consumer --bootstrap-server \
localhost:9092 --topic topic-one


Now on the producer side type in some messages.  Each time you hit return it will send the line you typed.




Messages produced are consumed on the other side




While I am at it let me add another consumer


  > kafka-console-consumer --bootstrap-server \
localhost:9092 --topic topic-one
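
Note that these consumers only see records produced after they attach.  As mentioned in the overview, a new consumer can also read every message already in the topic; the console consumer does this with the --from-beginning flag:


  > kafka-console-consumer --bootstrap-server \
localhost:9092 --topic topic-one --from-beginning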





Also you can feed it an entire file.

Let me create a test file


  > vi /tmp/test.txt


And place the following in it


Line 01 This is line 1
Line 02 each line becomes a record
Line 03 That is how the console-producer works
Line 04 just to show you


Now run this command to feed the test.txt file into the Kafka Topic


  > kafka-console-producer --broker-list \
localhost:9092 --topic topic-one < /tmp/test.txt




Each line of the file becomes a record.  That is the way the console-producer works.




You could just pipe the info


  > cat /tmp/test.txt | kafka-console-producer --broker-list \
localhost:9092 --topic topic-one


Or you can use the producer to tail a file.


  > tail -f -n +1 /tmp/test.txt | kafka-console-producer \
--broker-list localhost:9092 --topic topic-one


Now just append to the /tmp/test.txt file and watch the new line get sent as a message.


  > echo "APPEND ME" >> /tmp/test.txt






There you go: a very basic overview of a very basic Kafka Topic with one partition.

(More to come as I do more research)



References


[1]        Kafka Wikipedia page
Accessed 12/2016
[2]        Kafka documentation page
Accessed 12/2016
[3]        Kafka download page
Accessed 12/2016

