November 28, 2022

Distributed Application Processing with Apache Kafka & HarperDB

I recently came across Apache Kafka and wanted to learn about it in my favorite manner: by building a project with it! I’ve used similar products in the past and explored their pros and cons, so it seemed like a great opportunity to explore Kafka as well. My plan is to emulate a storefront that already uses Apache Kafka and wants to add a new order status/tracking system.

In this article, we’ll set up a Kafka cluster and a HarperDB Instance using Docker, then upload a Custom Functions package that lets us persist topic messages, such as new orders and status changes. HarperDB will allow us to capture, store, and query the information sent to this topic, and we can also upload a static frontend to HarperDB so the full solution is served from a single container.

[Sample architecture diagram]

What is Apache Kafka?

Kafka is an open-source platform that allows for distributed processing and ingestion of events in real time in an efficient and scalable manner. These events could be anything: errors, logs, orders, sensor readings, and much more. Kafka is very popular since it has been built with speed in mind and it combines two messaging models: queueing and publish/subscribe.

We’ll be using the publish/subscribe model today, as we would typically have many consumers of a topic, each with its own action to trigger when an event arrives. This default approach is a bit different from another commonly used message broker, RabbitMQ, which by default delivers each message to a single consumer and removes it from the queue once it has been consumed, rather than broadcasting it to every subscriber.

Common Use Cases

The original use case for Kafka, as required by LinkedIn, was to track user activity such as clicks, likes, time spent on a page, and so on. Other common use cases include log aggregation, service messaging, eCommerce, and more. If you are only sending a small number of messages, Kafka may be overkill and more complex than your setup requires.

What is HarperDB?

I’ve described HarperDB as one of the best tools in a full-stack engineer’s arsenal. It takes a lot of complexity out of deployments, as our database, API, and frontend are all handled by HarperDB, making our job much easier. It allows you to store data in a dynamic schema and query that data using either SQL or NoSQL operations. HarperDB has a hosted offering, or you can run an instance locally, as we will today.

I’ve gone over signing up for HarperDB in a previous article if you are interested. Another great thing about HarperDB is its free tier, which is enough for most small projects or to get started on a larger one. I’ve been using it for a few months now and it has worked great for every project I’ve thrown at it.

Setting up a Docker Cluster

Since Kafka requires both a Broker and a Zookeeper container, we have a total of three containers to manage (including HarperDB), which makes this a perfect candidate for Docker Compose. Docker Compose allows us to define our project in a YAML file, and Docker handles spinning all of the required containers up and down. After installing Docker and the Compose plugin, we can clone the repository like so:

$ git clone https://github.com/makvoid/guide-harperdb-kafka-ingestion
$ cd guide-harperdb-kafka-ingestion

Inside of docker-compose.yml, add a password for HarperDB’s HDB_ADMIN_PASSWORD variable and save the file. Afterward, you can launch the cluster by using the Compose plugin:

$ docker compose up

Cluster Configuration

It will take around a minute for all three containers to initialize. Afterward, we’ll want to create a new topic to submit our messages to:

$ docker exec broker kafka-topics --bootstrap-server broker:9092 \
  --create --topic ingestion

Finally, we can grab the HarperDB container’s IP address so we can start managing it through HarperDB Studio. To do this, you can run the inspect command with some filtering to lower the amount of noise returned:

# Return just the container's IP Address:
$ docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' harperdb
172.31.0.3
# Or, see all the container's juicy details:
$ docker inspect harperdb

With this information, let’s head over to HarperDB Studio and add this Instance to our Organization. Click ‘Register User-Installed Instance’, name the Instance ‘kafka-node’, and enter the credentials from the docker-compose.yml file. For the Host, enter the IP address from the previous step, use port 9925, and enable SSL. Proceed with a free tier instance and accept the terms to finish adding the Instance.

Custom Functions Consumer

In order for HarperDB to consume the topic, the easiest approach is to set up a Consumer within a Custom Function. Since HarperDB and the Kafka Broker are hosted on the same network in our Docker stack, we can simply pass the broker’s container name instead of having to worry about keeping IP addresses up to date.

Taking a page from HarperDB’s sample ingestion project, I set up a simple event handler that automatically creates the requested schema and table whenever a message is received by the Consumer. Alongside the consumer setup, I added a route that accepts an Order ID and returns the information about that order stored in the database.
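
For reference, here is a minimal sketch of what that consumer and route could look like, assuming the kafkajs client and HarperDB’s hdbCore.requestWithoutAuthentication helper; the actual project in the repository also creates the schema and table on demand and handles errors:

'use strict'

// Assumption: the project uses the kafkajs client to talk to the broker
const { Kafka } = require('kafkajs')

module.exports = async (server, { hdbCore, logger }) => {
  // The broker is reachable by its container name inside the Compose network
  const kafka = new Kafka({ clientId: 'harperdb-ingestion', brokers: ['broker:9092'] })
  const consumer = kafka.consumer({ groupId: 'harperdb' })

  await consumer.connect()
  await consumer.subscribe({ topic: 'ingestion', fromBeginning: true })
  await consumer.run({
    eachMessage: async ({ message }) => {
      // Each event carries the target schema/table plus the records to persist
      const { schema, table, records } = JSON.parse(message.value.toString())
      // Upsert so repeated events for the same id update the existing record
      await hdbCore.requestWithoutAuthentication({
        body: { operation: 'upsert', schema, table, records }
      })
    }
  })

  // Route used by the frontend: look up a single order by its ID
  server.route({
    url: '/order/:id',
    method: 'GET',
    handler: async (request) => {
      request.body = {
        operation: 'search_by_hash',
        schema: 'acme',
        table: 'orders',
        hash_values: [request.params.id],
        get_attributes: ['*']
      }
      // search_by_hash returns an array; return the single matching order
      const records = await hdbCore.requestWithoutAuthentication(request)
      return records[0]
    }
  })
}

Because the project is deployed under the name ‘orders’, this route ends up served at /orders/order/<id>, which is the endpoint the frontend (and the curl example later on) calls.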

Deploying the Project

To deploy a Custom Functions project, HarperDB gives us a few options: we can transfer a project from one Instance to another, deploy via the API, or copy it locally on disk. Since you may not have another Instance already created, I’ve created a Node script that transfers the project to the Instance using the API.

Before deploying the project, we’ll need to update a few configuration values in the script scripts/deploy-custom-functions.js:

// Required configuration
const HDB_INSTANCE_IP = '172.31.0.3'
const HDB_INSTANCE_PORT = 9925
const HDB_USERNAME = 'clusteradm'
const HDB_PASSWORD = '...'
const HDB_PROJECT_NAME = 'orders'

Then, we can add the URL to the frontend’s environment files in frontend/src/environments:

export const environment = {
  apiUrl: 'https://172.31.0.3:9926/orders/',
  ...
}

After editing the files, we can run the script like so:

$ yarn # Install dependencies
$ node scripts/deploy-custom-functions.js
Found a total of 5499 files to add to the archive.
Deploying the project to HarperDB, please wait...
Deployment has finished - Successfully deployed project: orders

The script will automatically upload the Custom Functions project to the Instance using the project name you specified. This will also package the Angular sample frontend I prepared and upload it alongside the Custom Functions.
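
Under the hood, the deployment looks roughly like the following. This is a simplified sketch that assumes HarperDB’s deploy_custom_function_project operation and the axios HTTP client, uses a placeholder project directory, and omits the frontend packaging, file walking, and error handling the real script performs:

// Rough sketch of the deployment flow (not the repository's exact script)
const { execSync } = require('child_process')
const fs = require('fs')
const https = require('https')
const axios = require('axios') // assumption: an HTTP client such as axios is used

const HDB_INSTANCE_IP = '172.31.0.3'
const HDB_INSTANCE_PORT = 9925
const HDB_USERNAME = 'clusteradm'
const HDB_PASSWORD = '...'
const HDB_PROJECT_NAME = 'orders'

const main = async () => {
  // Archive the Custom Functions project and base64-encode it for the API
  // ('custom-functions' is a placeholder for wherever the project lives on disk)
  execSync(`tar -cf /tmp/${HDB_PROJECT_NAME}.tar custom-functions`)
  const payload = fs.readFileSync(`/tmp/${HDB_PROJECT_NAME}.tar`).toString('base64')

  // Ask the instance to deploy the archive as a Custom Functions project
  await axios.post(
    `https://${HDB_INSTANCE_IP}:${HDB_INSTANCE_PORT}`,
    { operation: 'deploy_custom_function_project', project: HDB_PROJECT_NAME, payload },
    {
      auth: { username: HDB_USERNAME, password: HDB_PASSWORD },
      // The local instance uses a self-signed certificate
      httpsAgent: new https.Agent({ rejectUnauthorized: false })
    }
  )
  console.log(`Deployment has finished - Successfully deployed project: ${HDB_PROJECT_NAME}`)
}

main()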

Sending Events

Normally, we would send the event using the Kafka client for our programming language of choice. However, for this simple test, we can use an interactive session on the broker to send a synthetic event like so:

$ docker exec --interactive --tty broker kafka-console-producer \ 
  --bootstrap-server broker:9092 --topic ingestion

Once it has connected, we can copy/paste in a sample event and disconnect afterward:

>{"schema":"acme","table":"orders","records":[{"id": "01cb7e40-d01b-457e-8cdd-976051de4c2b","name":"John Doe","email":"john.doe@example.com","number":"(555) 555 - 1234","total":"$12.34","orderDate":"Nov 20","deliveryDate":"Nov 22","rewardPoints":"900","tracking":{"carrier":"USPS","number":"9400 1234 5678 9999 0000 00"}}]}

Back in HarperDB Studio, navigate to the ‘Browse’ section and select the ‘acme’ schema and the ‘orders’ table. Within this table, we should see the sample record sent above in the interactive session. Since we are using the default retention settings, you’ll always want to pass an id with every record; otherwise, you may run into duplicated records.

As the Order gets updated, the system would ideally send new events with the fresh details attached so that we can update it. On the HarperDB side, we’re using upsert logic so that existing records are updated with the latest information rather than new records being created.
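
For example, once the carrier updates the delivery estimate, a follow-up message reusing the same id could be pasted into the same console-producer session (an illustrative event, not one from the repository):

>{"schema":"acme","table":"orders","records":[{"id": "01cb7e40-d01b-457e-8cdd-976051de4c2b","deliveryDate":"Nov 23","tracking":{"carrier":"USPS","number":"9400 1234 5678 9999 0000 00"}}]}

Because the id matches, the upsert updates the existing row instead of inserting a new one.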

Testing

Finally, we can fully test the project by navigating to the static frontend hosted within HarperDB:

https://172.31.0.3:9926/orders/static

We can enter our sample order number (01cb7e40-d01b-457e-8cdd-976051de4c2b) and get the record’s information such as the order status, shipping carrier, and tracking number.

Alternatively, you can also do the same via the Custom Functions API and skip the frontend:

$ curl -sk https://172.31.0.3:9926/orders/order/01cb7e40-d01b-457e-8cdd-976051de4c2b | jq
{
  "id": "01cb7e40-d01b-457e-8cdd-976051de4c2b",
  "name": "John Doe",
  "tracking": {
    "carrier": "USPS",
    "number": "9400 1234 5678 9999 0000 00"
  },
  "number": "(555) 555 - 1234",
  "__updatedtime__": 1669088937814,
  "rewardPoints": 900,
  "email": "john.doe@example.com",
  "total": "$12.34",
  "deliveryDate": "Nov 22",
  "__createdtime__": 1669088846686,
  "orderDate": "Nov 20"
}

Conclusion

As this was my first project with Kafka, it was super interesting to see how it differs from other products I’ve used. It was also a breeze to implement on HarperDB thanks to the flexibility of Custom Functions. While this is a minor implementation, hopefully it gives you some ideas on how you could take it further.

Some great extra functionality to add would be setting up data persistence for HarperDB or clustering it with other Instances to decrease potential latency. We could also configure Kafka much further than we currently do, as we are only scratching the surface configuration-wise.

Thank you for reading and leave a comment if you have any feedback or questions about the article!

Shutting down the Cluster

After you have finished sending events, don’t forget to spin down the entire cluster:

$ docker compose down
