We’ve all been working hard here at HarperDB to bring you the new and improved HarperDB 2.0. One of the main features that received a makeover is our solution for distributed computing, and more specifically, clustering. The new version brings improvements across the board in performance, reliability and scalability. In this blog I’m going to introduce the concept of clustering and how it works with HarperDB.
What is clustering?
The term clustering has multiple meanings. In the context of this blog and HarperDB it refers to a group of nodes that are connected through hardware, networks and software, and that behave as if they were a single system. A node is a device or data point that can send, receive and/or forward data. A personal computer is the most common node; other examples include modems, servers, gateways, cloud services and edge devices.
Why do we need it?
What’s better than one computer? A cluster of computers! Compared to a single computer, a cluster of computers can provide improved scalability, resource consolidation, centralized management, high availability, and failover.
How do we do it?
The clustering paradigm HarperDB utilizes is the publish-subscribe model. A publisher (i.e. any source of data) pushes messages out to subscribers (i.e. receivers of data) via data streams known as channels. All subscribers to a specific publisher channel are immediately notified when new messages have been published on that channel, and the message data is received together with the notification.
Publishers do not need to know anything about their subscribers, and subscribers only have to know the name of the channel they are subscribed to. Publishers simply define what data goes in which channel and transmit the channel data once.
Any publisher may also be a subscriber and data streams can be multiplexed, enabling the creation of interlinked systems that mesh together in an elegant, distributed, and internally-consistent manner.
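To make the model concrete, here is a minimal in-memory sketch of publish-subscribe. It is not HarperDB's implementation: the `Broker` class, its method names and the message fields are hypothetical and exist only to show how a publisher sends once and every subscriber on the channel receives the message.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Message = dict
Handler = Callable[[Message], None]


class Broker:
    """Toy in-memory broker (hypothetical): routes messages to channel subscribers."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Handler) -> None:
        # A subscriber only needs to know the name of the channel.
        self._subscribers[channel].append(handler)

    def publish(self, channel: str, message: Message) -> int:
        # The publisher transmits once; the broker fans the message out.
        handlers = list(self._subscribers.get(channel, []))
        for handler in handlers:
            handler(message)
        return len(handlers)  # number of subscribers that were notified


broker = Broker()
received_a: List[Message] = []
received_b: List[Message] = []
broker.subscribe("motor:rpm", received_a.append)
broker.subscribe("motor:rpm", received_b.append)

delivered = broker.publish("motor:rpm", {"id": 1, "rpm": 3200})
ignored = broker.publish("pump:pressure", {"id": 7, "psi": 42})
print(delivered, ignored)
```

Note that the publisher never references `received_a` or `received_b` directly; it only names the channel, which is what keeps publishers and subscribers decoupled.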
When designing our clustering model we have maintained the HarperDB ethos “simplicity without sacrifice”. Our model is intuitive and easy to use, while continuing to maintain a low footprint and high performance.
A single instance/installation of HarperDB constitutes a node. A node of HarperDB can operate independently with clustering on or off. Each HarperDB node encapsulates the core HarperDB server as well as a cluster server which facilitates the publish-subscribe model between HarperDB nodes.
To transport data between nodes we leverage WebSockets, a technology that allows real-time bidirectional communication. To secure the communications we enforce Secure Sockets Layer (SSL) across the cluster, and to authenticate the connections we use JSON Web Tokens (JWT).
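For readers new to JWTs, the sketch below shows how a token can be signed and verified using only the Python standard library. It is illustrative only: this post does not specify which signing algorithm or claims HarperDB uses, so the HS256 algorithm, the claim names and the shared secret here are all assumptions.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def _b64url(data: bytes) -> str:
    # JWTs use URL-safe base64 without '=' padding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # Restore the padding that was stripped when encoding.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_token(claims: dict, secret: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + _b64url(signature)


def verify_token(token: str, secret: bytes) -> Optional[dict]:
    """Return the claims if the signature matches, otherwise None."""
    try:
        header_b64, claims_b64, signature_b64 = token.split(".")
        supplied = _b64url_decode(signature_b64)
    except ValueError:
        return None
    signing_input = (header_b64 + "." + claims_b64).encode("ascii")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(expected, supplied):
        return None
    return json.loads(_b64url_decode(claims_b64))


secret = b"shared-cluster-secret"             # placeholder value
token = sign_token({"node": "node_2"}, secret)  # hypothetical claim name
print(verify_token(token, secret))
print(verify_token(token, b"wrong-secret"))
```

A production verifier would also check the `alg` header and an expiry claim; a maintained JWT library is the better choice outside of a demonstration.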
Subscriptions are defined when you add a node (we’ll cover that soon), and they determine what data moves where. Subscriptions are exclusively at table level, but operate independently of referenced tables. Channel, publish and subscribe are all settings within a subscription, which I will cover next.
The following definitions reference “transactions”. HarperDB (and this blog) recognizes transactions as: insert, update and delete. A single transaction consists of one of these, but can include one or more records to insert, update or delete.
Publish
A unidirectional data flow that pushes insert, update and delete transactions from one HarperDB node to another across a clustering connection.
Subscribe
A unidirectional data flow that listens for table transactions on another HarperDB node. When a transaction completes on the other node, it is sent to the subscriber node, where it is executed upon receipt.
Publish and Subscribe
A bidirectional data flow that both pushes and listens for table transactions on another HarperDB node.
Channels are unique namespaces used exclusively for designating data paths between nodes. Nodes publish or subscribe to specific channels in order to pass data. Channels use the naming convention schema:table, so each channel represents a single table within a schema. On the diagrams above the channel is motor:rpm. A schema and table do not have to exist for a channel to be created between nodes; however, they must exist for data to be propagated between nodes.
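The channel naming convention and the three subscription modes can be sketched in a few lines. The helper names and the `Subscription` class below are hypothetical, written only to illustrate how the publish and subscribe flags combine; they are not part of HarperDB.

```python
from dataclasses import dataclass
from typing import Tuple


def channel_name(schema: str, table: str) -> str:
    # Channels follow the schema:table convention.
    return f"{schema}:{table}"


def parse_channel(channel: str) -> Tuple[str, str]:
    schema, sep, table = channel.partition(":")
    if not sep or not schema or not table or ":" in table:
        raise ValueError(f"expected 'schema:table', got {channel!r}")
    return schema, table


@dataclass(frozen=True)
class Subscription:
    channel: str
    publish: bool
    subscribe: bool

    def direction(self) -> str:
        if self.publish and self.subscribe:
            return "bidirectional"  # publish and subscribe
        if self.publish:
            return "outbound"       # local transactions are pushed to the remote node
        if self.subscribe:
            return "inbound"        # remote transactions are executed locally
        return "none"


sub = Subscription(channel_name("motor", "rpm"), publish=True, subscribe=False)
print(sub.channel, parse_channel(sub.channel), sub.direction())
```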
All schema, table and attribute metadata is automatically propagated throughout the cluster regardless of subscriptions. It is important to note that metadata deletion operations, such as dropping schemas, tables and/or attributes, will not propagate. This ensures these operations are only run where necessary and helps prevent data loss.
Adding a node
In order for HarperDB clustering to work, each node must be aware of at least one other node. To add a node, simply execute the add_node operation. The name and host values should be the name and host of the remote node you are connecting to; the port is the remote node's clustering port. Cluster name and cluster port are both initially set during install, but can be updated via the settings.js file in the hdb config directory. An important point to note is that two nodes must have the same cluster user and cluster user password to establish a connection.
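As a rough sketch, the body of such a request could be assembled as shown here. The field names follow the terms used in this post (name, host, port, channel, publish, subscribe), but the exact JSON layout, the `build_add_node_request` helper, and the host and port values are assumptions for illustration; check the HarperDB documentation for the authoritative format.

```python
import json


def build_add_node_request(name, host, port, channel, publish, subscribe):
    """Assemble a hypothetical add_node request body as a plain dict."""
    return {
        "operation": "add_node",
        "name": name,   # name of the remote node
        "host": host,   # host of the remote node
        "port": port,   # the remote node's clustering port
        "subscriptions": [
            {"channel": channel, "publish": publish, "subscribe": subscribe}
        ],
    }


request_body = build_add_node_request(
    name="node_1",
    host="192.0.2.10",   # placeholder address from the documentation range
    port=12345,          # placeholder clustering port
    channel="pump:pressure",
    publish=True,
    subscribe=False,
)
print(json.dumps(request_body, indent=2))
```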
This operation adds the remote node ‘node_1’ to the node the request is executed on (for the sake of this example let's call it node_2). Node_2 will be linked to node_1 via the pump:pressure channel. Publish is set to ‘true’, which means that any data inserted into the table ‘pressure’ within the schema ‘pump’ will be propagated to node_1. Conversely, subscribe is set to ‘false’, which means data flow is unidirectional: changes to pump.pressure on node_1 will not be propagated to node_2.
HarperDB topology refers to the arrangement of nodes and connections within a cluster. HarperDB topologies are highly flexible and are defined node-to-node through the add_node operation described above.
Below is a simple topology where data from equipment sensors is inserted on edge nodes and then published to a gateway node. The gateway node then forwards that data to an instance of HarperDB in the cloud. The gateway and the cloud node have both a publish and a subscribe relationship, which means that they can both read and write to each other on the respective channels.
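One lightweight way to model a topology like this before building it is as a directed graph of publish links. The sketch below is not a HarperDB feature: the node names, the `publishes_to` mapping and the `nodes_reached` helper are hypothetical, and it assumes that data keeps flowing along every publish link it encounters.

```python
from collections import deque
from typing import Dict, List, Set

# Directed links: node -> nodes it publishes to (names are illustrative).
publishes_to: Dict[str, List[str]] = {
    "edge_1": ["gateway"],
    "edge_2": ["gateway"],
    "gateway": ["cloud"],   # gateway and cloud publish to each other,
    "cloud": ["gateway"],   # i.e. a publish and subscribe relationship
}


def nodes_reached(origin: str, links: Dict[str, List[str]]) -> Set[str]:
    """Return every node that data written on `origin` would flow to."""
    seen: Set[str] = {origin}
    queue = deque([origin])
    while queue:
        current = queue.popleft()
        for target in links.get(current, []):
            if target not in seen:  # visit each node once, even with loops
                seen.add(target)
                queue.append(target)
    seen.discard(origin)
    return seen


for node in publishes_to:
    print(node, "->", sorted(nodes_reached(node, publishes_to)))
```

Running a check like this on a planned topology makes it easy to spot data reaching nodes you did not intend.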
The clustering functionality is very powerful and could cause a network storm if not designed with caution. I strongly recommend creating a model for your topology before you start building anything, and follow the ‘keep it simple’ design principle.
Clustering with HarperDB offers a simple, lightweight and flexible solution to distributed edge computing and unifying solutions. Deploy HarperDB across a network of devices to gain real-time insights and the ability to provide intelligent edge analytics.
Build a data fabric architecture that spans across all your data points and create a unified, scalable system. A data fabric provides a simplified way of consolidating and managing your data across all platforms.
Another feature that HarperDB offers is observers. Observers are third-party applications that can listen to and receive real-time updates from a cluster channel. Stay tuned for a more in-depth discussion on observers in a future blog.