The HarperDB team built the first and only database written in Node.js, which implements SocketCluster for distributed computing in a unique way. Kyle Bernhardy, HarperDB CTO and Co-Founder, recently gave a talk on the inner-workings of SocketCluster, including a code review to highlight SocketCluster concepts within a database framework. I highly recommend checking out his talk at the link so you can see the full code review, but I'll also summarize the highlights from the talk here.
HarperDB is a net new database, essentially a structured object store with SQL capabilities. We have a lot of components in our architecture, and our WebSocket interface is a communication protocol that we implemented that is specifically used for different nodes of HarperDB to share data and schema metadata across nodes. Forward looking it will also be expanded into distributed operations like SQL & NoSQL, spreading out the querying capabilities. Rather than just distributing and deterministically sharing the data, it will also be able to execute queries across your cluster.
- Each node handles transactions & storage ACIDically, locally & independent of other nodes
- Each node can connect (or not) to any other node & send and/or receive transactions for any table
- Real time transmission of schema metadata & transactions in a deterministic manner based on customer defined topology
- All nodes can “catchup” from network / server outages, no “dead on the floor” transactions
Distributed computing can have super complex topologies, so we needed something lean and flexible to be able to handle this. Our assumption is that at any point in time a node can be offline, and to always allow catchup once the nodes are back online. We looked at options that were too heavyweight or didn’t have the option for nodes to be able to talk to that message broker - but we wanted to be able to accommodate our users and make it easy for them to use the technology instead of vice versa.
Some topology examples here - the left is a bit more simple with other nodes pushing to the man in the middle, this is a typical edge computing topology. You can also have chains, lines, circles, etc. We wanted to ensure that we accomodate any and all topology options. Using something like WebSockets where it's a duplex connection really helps us to overcome limitations, because if we’re required to have two-way communication but the server can’t push down to those nodes behind a firewall, then you’ve lost.
- Embedded Socket.io logic in our parent process
- Data duplication for every connected node
- Distributed logic tightly coupled in core logic
We tried embedding Socket.io in our parent process - using the cluster library to have parallel processes run “embarrassingly parallel” so we could scale out, but at the time we thought everyone would communicate up to the parent which would distribute data out across the cluster. We also had issues with the way we were storing data, and the distributed logic was tightly coupled in our core logic.
- Socket.io is hard to scale
- Need better transaction storage
- Need Pub/Sub
- Enable 3rd party observers to receive real time data stream and to publish to the stream
- Secure connections between nodes
We learned that Socket.io is hard to scale. To get Socket.io to scale we had to insert something like Redis or use other libraries to get done what we needed to achieve, which is a dependency nightmare. We were also doing direct pushes and emitting between nodes, but realized a pub/sub model on a per table basis made a lot more sense. We also wanted to enable 3rd party observers to receive real time data streams and publish to the streams (similar to Kafka), and we wanted better security.
What is SocketCluster
- Fast, lightweight, highly scalable realtime server engine
- Flexible framework
- Native JWT Authentication
- Built in connection/broker/channel/messaging handling
After several bake-offs we ultimately landed on SocketCluster. Our team wasn't too familiar at first, but after researching they enjoyed how lightweight, scalable, and flexible it is, as well as the ability to do those deterministic connections between nodes (where the admins of the system choose how that all works out). It also has built-in handling that you don’t need to worry about building yourself, it will do that for you. It also manages if you spawned up multiple instances of a SocketCluster server, there’s an underlying broker that handles making sure every subscriber receives the data they expect.
SocketCluster use cases: obvious one is Chat, an intriguing one is blockchain (they're actually funded by a blockchain company), as well as gaming, and us as a distributed database.
- Speed, performance & scalability
- Built in JWT authentication
- Broker/Connection/Channel/Message Management
- Messages are delivered in the order they were sent
- Fully Promise based
- Easily add custom logic
- Easily mutate/append message data
John Gros-Dubois who created and manages SocketCluster is always updating and refining this project. He has made huge leaps in the last year in that technology where going from an old callback approach he’s made everything promised-based, and on top of that he made all the listeners these async iterators that are event based that enable you to have all your messages delivered in the order it was sent - so you have transactional integrity.
How do we use SocketCluster
- Distributed Data Replication
- Every node is a message broker
- HarperDB uses a simple pub-sub model, so we replicate data by publishing data to different channels which different nodes subscribe to and are able to be distributed horizontally
- Maintain security between nodes
- In the future extend this to distribute all Core HaperDB operations
We use this as a distributed data replication framework. The Socket.io logic was tightly coupled into our core database logic, so we wanted to run this as a sidecar which was really easy with SocketCluster. That allows us to have every HarperDB node be its own message broker. SocketCluster has JWT authentication built-in providing credentialed security, and it also supports SSL between nodes so we can verify that no external connectors are coming in that shouldn't be part of the network.
This sample code will help you understand what we were trying to achieve and how we got there. Again I recommend checking out the code review portion of Kyle’s talk, but I’ll include a few highlights. This project demonstrates how to create a SocketCluster server with an integrated REST API, a SocketCluster client to connect to an instance of a SocketCluster server.
We have a classes directory where our primary logic lives, also included a Postman directory, etc. The meat of the project is creating a SocketCluster server: import library, attach to SocketCluster server, very basic to get up and running. It's interesting when we get to handling listeners and handling middleware. Here you can see the async iterator functionality:
Then we create a connection listener. We can listen for remote procedure calls and this is how we invoke authentication between server and client. The connection will establish and on connect we can invoke this login listener. All we need to do is in the SocketCluster client, listen, and invoke promises. On the server it’s listening for anyone trying to invoke that login. Basic validation. If we authenticate we can set an auth token and mark it as success and continue - since it's an iterator we have to tell it to continue so we don't get stuck.
One more thing inside the server is creating middleware - inbound, outbound, handshake and inbound raw. In this case, we have a middleware stream, each type has its own data assigned to the action. Authenticate, add custom if statement. When data is published across the cluster we’re calling a function to write that data to disk. We stop it from hitting the exchange to make sure data transacts on the server, and to stop subscribers from receiving double messages.
We have listeners, middleware, and we’re also creating a REST server. Pass in reference to the server, also a HTTP server we are reusing here, using the same port in the REST server as the Websocket server.
Now we run it, we are connected and authenticated.
So we can write to the database, and specify what channel we want to read against, and we’ll see on client it received that data. We can add another server, and now we can connect them and do full data replication and determine publish / subscribe. Since we have ties between the REST server and SocketCluster server we can also reference class functions in both which is handy. We’re also tracking outbound connections, iterating that subscriptions array that is defined in the body. If we’re publishing we need to do a bit of work and watch local exchange because the socket client needs to observe that channel and push that data to the other node.
So we have our connection, doing full data replication between node 1 and node 2. Making sure we have deterministic data sharing - deciding what data we want to go where. This use case is common for our clients especially in an edge computing scenario: say you have devices in a manufacturing plant collecting temperature data, you really only care when that data goes out of range, so command control wants to know what device is going out of range and what it looks like - then push that data to a separate table and push up to command control - but raw data is sitting only on the edge node and is tailing with time to live - that way we only share the data that really is important to customers.
We can add one more server to show one more fun thing here. We can create a procedure call between all nodes that are connected to the node that we’re on. So we can do a read all on the person channel, so everyone that's connected to node 1, we can send out a remote procedure call to every single node to look at each file that we have in the data directory. A use case for this could be that you’ve been offline for a bit, and you want to see what you’ve missed before you start transacting again.
The awesome thing here is I’m calling out to node 1 to get its own data, and nodes 2 and 3 are executing in parallel, which executes in 7 milliseconds (whereas just calling against node 3 is also another 7 milliseconds), so you can see the scale of parallelization of getting that data. You can also see that fragmenting your data across multiple nodes can help you use commodity hardware to increase performance and not have these giant monolithic servers. There’s a lot more you can do with SocketCluster but these are some of the main reasons we love utilizing it for distributed computing within our product.