July 25, 2023

Multi-Region Deployment of HarperDB via Replication: Part I


One of the interesting design choices in HarperDB is first-class support for data replication via its clustering feature (now known as replication). HarperDB lets users create a mesh network using a bidirectional pub/sub model on a per-table basis. Data is replicated asynchronously to all subscribers under an eventual consistency model.

In this series, we will go over how to enable this feature and walk through creating our own multi-region deployment of HarperDB with a custom data replication scheme. In this first part, we will stand up HarperDB in multiple AWS regions and demonstrate the replication feature. In part two, we will integrate with Cloudflare to handle load balancing of data requests. In part three, we will see how to set up geo load balancing and demonstrate it via custom functions.

AWS Setup

We will create two EC2 instances in two different regions for this demo. I’ll be using US-East-1 and US-West-1, but you can use any region. Launch an EC2 machine with Ubuntu 22.04 in the first region:

Do the same for the other machine. I used `m5.large` machines, but for this small demo, you can use free-tier eligible machines as well. 

Once the VM is ready, SSH into the machine and install Docker by following the official Docker docs.

Next, we will run HarperDB via Docker. Note that we are simply writing to the root volume (the current directory is mounted into the container) for demo purposes, but you can attach an EBS volume with sufficient storage for persistence as well:


docker run -d \
-v $(pwd):/home/harperdb/hdb \
-e HDB_ADMIN_USERNAME=HDB_ADMIN \
-e HDB_ADMIN_PASSWORD=password \
-e CLUSTERING_ENABLED=true \
-e CLUSTERING_USER=cluster_user \
-e CLUSTERING_PASSWORD=password \
-e CLUSTERING_NODENAME=hdb1 \
-p 9925:9925 \
-p 9926:9926 \
-p 9932:9932 \
harperdb/harperdb 

This runs an instance of HarperDB with clustering enabled, giving this node the name `hdb1` and exposing ports 9925-9926 as well as 9932. 

Do the same for the other EC2 machine, but give it the nodename `hdb2` instead. 
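
One way to avoid editing the command by hand on each machine is to wrap it in a small shell function that takes the node name as its only argument. This is just a sketch: the function name `run_harperdb` is made up for illustration, and the flags are the same ones used in the command above.

```shell
# Sketch: start a clustering-enabled HarperDB container under a given node name.
# `run_harperdb` is an illustrative name, not part of HarperDB or Docker.
run_harperdb() {
  local node_name="$1"   # hdb1 on the first machine, hdb2 on the second
  docker run -d \
    -v "$(pwd)":/home/harperdb/hdb \
    -e HDB_ADMIN_USERNAME=HDB_ADMIN \
    -e HDB_ADMIN_PASSWORD=password \
    -e CLUSTERING_ENABLED=true \
    -e CLUSTERING_USER=cluster_user \
    -e CLUSTERING_PASSWORD=password \
    -e CLUSTERING_NODENAME="$node_name" \
    -p 9925:9925 \
    -p 9926:9926 \
    -p 9932:9932 \
    harperdb/harperdb
}

# On the second EC2 machine you would then run:
#   run_harperdb hdb2
```

Only the node name differs between the two machines; the credentials and ports stay identical so the nodes can authenticate with each other.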

Finally, we need to attach security groups to allow inbound traffic to our ports. Attach a new security group to allow TCP traffic to those ports:

We are opening up these ports to any IP for now, but in production, we would lock these down to our internal VPC. 
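
If you prefer the command line to the AWS console, the same rules could be added with the AWS CLI. The sketch below assumes the CLI is installed and configured, and that a security group is already attached to the instance. The function name, the region, and the group ID in the example call are placeholders, not values from this setup.

```shell
# Sketch: open the HarperDB ports on an existing security group with the AWS CLI.
# `open_harperdb_ports` and its arguments are illustrative; supply your own values.
open_harperdb_ports() {
  local region="$1"              # e.g. us-east-1
  local group_id="$2"            # ID of the security group attached to the instance
  local cidr="${3:-0.0.0.0/0}"   # any IP for the demo; use your VPC range in production
  local ports
  for ports in 9925-9926 9932; do
    aws ec2 authorize-security-group-ingress \
      --region "$region" \
      --group-id "$group_id" \
      --protocol tcp \
      --port "$ports" \
      --cidr "$cidr"
  done
}

# Example call (the group ID is a placeholder):
#   open_harperdb_ports us-east-1 sg-0123456789abcdef0
```

Security groups are regional, so the function would need to be called once per region with that region's group ID. Passing a narrower CIDR as the third argument is the natural way to lock the ports down later.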

HarperDB Setup

Next, we will create our schema and table. We’ll use our favorite example: a `dev` schema with a `dog` table.

curl --location 'http://<instance-ip>:9925' \
--header 'Authorization: Basic SERCX0FETUlOOnBhc3N3b3Jk' \
--header 'Content-Type: application/json' \
--data '{
"operation": "create_schema",
"schema": "dev"
}'

curl --location 'http://<instance-ip>:9925' \
--header 'Authorization: Basic SERCX0FETUlOOnBhc3N3b3Jk' \
--header 'Content-Type: application/json' \
--data '{
"operation": "create_table",
"schema": "dev",
"table": "dog",
"hash_attribute": "id"
}'


Do this for both machines. 
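
To repeat the two requests on both machines without retyping them, a short loop can help. The sketch below makes a few assumptions: the host names in `HDB_HOSTS` are placeholders for your own instance addresses, and the function names are invented for this example. It also shows where the `Authorization` value comes from: it is the Base64 encoding of `HDB_ADMIN:password`, the admin credentials passed to `docker run`.

```shell
# Placeholders: replace with the public IPs or DNS names of your two instances.
HDB_HOSTS="${HDB_HOSTS:-hdb1.example.com hdb2.example.com}"

# Build the Basic auth header value from a username and password.
basic_auth() {
  printf 'Basic %s' "$(printf '%s' "$1:$2" | base64)"
}

# POST one JSON operation to an instance's operations API on port 9925.
hdb_op() {
  local host="$1" payload="$2"
  curl --silent --location "http://${host}:9925" \
    --header "Authorization: $(basic_auth HDB_ADMIN password)" \
    --header 'Content-Type: application/json' \
    --data "$payload"
}

# Create the dev schema and the dog table on every host in HDB_HOSTS.
setup_schema() {
  local host
  for host in $HDB_HOSTS; do
    hdb_op "$host" '{"operation": "create_schema", "schema": "dev"}'
    hdb_op "$host" '{"operation": "create_table", "schema": "dev", "table": "dog", "hash_attribute": "id"}'
  done
}

# Run it once the hosts are set:
#   HDB_HOSTS="<first-address> <second-address>" setup_schema
```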

Before connecting the databases via clustering, let’s seed each one with some sample data. 

For the one running in US-East-1, let’s insert a record with dog_name “Penny” and age “7”. 

curl --location 'http://54.234.98.106:9925' \
--header 'Authorization: Basic SERCX0FETUlOOnBhc3N3b3Jk' \
--header 'Content-Type: application/json' \
--data '{
"operation": "insert",
"schema": "dev",
"table": "dog",
"records": [
{
"dog_name": "Penny",
"age": 7
}
]
}'

Enabling Clustering

To enable clustering, let’s first connect these instances to HarperDB Studio. Add our self-managed DBs:

Once we connect to the instance, we should see our dog record on the first instance but not the other:
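
The same check can be done without Studio by querying each instance through the operations API. This is a minimal sketch that uses the `sql` operation; `list_dogs` is a made-up helper name, and the host argument stands in for your own instance address.

```shell
# Sketch: list every record in dev.dog on one instance via the operations API.
# `list_dogs` is an illustrative name; the host argument is your instance's address.
list_dogs() {
  local host="$1"
  curl --silent --location "http://${host}:9925" \
    --header 'Authorization: Basic SERCX0FETUlOOnBhc3N3b3Jk' \
    --header 'Content-Type: application/json' \
    --data '{"operation": "sql", "sql": "SELECT * FROM dev.dog"}'
}

# list_dogs <first-instance-address>    # should include the Penny record
# list_dogs <second-instance-address>   # should return no records yet
```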

Now navigate to the `replication` panel and click on your other instance (in my case `harperdb-2`) to connect them:

For now, I’ll give it only the publish rules. Then we need to connect the other instance as well; it will automatically have the subscribe rules turned on:
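
Studio is doing this through the operations API, so the subscription could in principle be scripted too. The sketch below builds a request body for an `add_node` operation with a per-table subscription. Treat the field names as my understanding of the HarperDB 4.x clustering operations rather than a verified reference, and check them against the documentation for your version. The helper name is invented, and the sketch does not cover how the nodes discover each other on the clustering port (9932 in this setup).

```shell
# Sketch: build an add_node request body for the dev.dog table.
# Field names follow my understanding of the HarperDB 4.x clustering operations;
# verify them against the docs for your version before relying on this.
add_node_payload() {
  local node_name="$1"   # clustering node name of the remote instance, e.g. hdb2
  local publish="$2"     # true or false
  local subscribe="$3"   # true or false
  printf '{"operation": "add_node", "node_name": "%s", "subscriptions": [{"schema": "dev", "table": "dog", "publish": %s, "subscribe": %s}]}' \
    "$node_name" "$publish" "$subscribe"
}

# Publish-only from hdb1 to hdb2, sent to the first instance (address is a placeholder):
#   curl --location 'http://<first-instance-address>:9925' \
#     --header 'Authorization: Basic SERCX0FETUlOOnBhc3N3b3Jk' \
#     --header 'Content-Type: application/json' \
#     --data "$(add_node_payload hdb2 true false)"
```

Note that the node name here is the clustering name given to the container (`hdb2`), which may differ from the instance name shown in Studio.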

Testing out Replication

Now let’s add another record to the `harperdb-1` instance with dog_name `Coco` and age `7`:

You can see that the `harperdb-1` instance has both records listed. Now check `harperdb-2` and you should see the same:

Now let’s insert a record (`Max`, 3) into the `harperdb-2` instance, for which replication back to `harperdb-1` is not turned on:

As expected, the new record appears in `harperdb-2` but not in `harperdb-1`:
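
The two tests above can also be reproduced from the command line. The following is a rough sketch under a few assumptions: the function names are invented for this example, the two host arguments stand in for your own instance addresses, and the two-second pause is an arbitrary allowance for asynchronous replication.

```shell
# Sketch: reproduce the two replication tests from the command line.
# Function names and the host arguments are illustrative placeholders.
hdb_op() {
  local host="$1" payload="$2"
  curl --silent --location "http://${host}:9925" \
    --header 'Authorization: Basic SERCX0FETUlOOnBhc3N3b3Jk' \
    --header 'Content-Type: application/json' \
    --data "$payload"
}

# Build an insert request for one dog record.
insert_dog_payload() {
  local name="$1" age="$2"
  printf '{"operation": "insert", "schema": "dev", "table": "dog", "records": [{"dog_name": "%s", "age": %s}]}' \
    "$name" "$age"
}

replication_test() {
  local hdb1_host="$1" hdb2_host="$2"
  local query='{"operation": "sql", "sql": "SELECT * FROM dev.dog"}'
  hdb_op "$hdb1_host" "$(insert_dog_payload Coco 7)"   # hdb1 publishes this to hdb2
  hdb_op "$hdb2_host" "$(insert_dog_payload Max 3)"    # hdb2 does not publish, so this stays local
  sleep 2                                              # arbitrary wait for async replication
  hdb_op "$hdb1_host" "$query"   # per the walkthrough: Penny and Coco, no Max
  hdb_op "$hdb2_host" "$query"   # per the walkthrough: the same records plus Max
}

# replication_test <first-instance-address> <second-instance-address>
```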

Wrapping Up

In this tutorial, we saw how easy it is to configure replication in HarperDB using the clustering feature. We set up two HarperDB instances in two different AWS regions and configured our replication scheme (i.e. `harperdb-1` → `harperdb-2`) via HarperDB Studio.

In Part II of this article, we will configure Cloudflare to load balance requests based on latency or geo-location. So before shutting the machines down, go back to our replication tab and enable publish/subscribe rules for both instances.