Insights and Updates

Read the HarperDB team’s thoughts on database technology topics such as SQL, NoSQL, Edge Computing, IoT, Industry 4.0, and more

Scalability and Stability: The HarperDB Way

Posted by Stephen Goldberg on October 10, 2018
Recently, I was meeting with one of our partners, and we were speaking in depth about HarperDB’s architecture and roadmap. What struck me as really interesting during the conversation was his focus on HarperDB’s stability and scalability. This was what excited him most about our product.

I was really surprised by this. Don’t get me wrong: I think those are two really important parts of our product, but in our messaging and marketing they often take a back seat to ease of use, performance, and the simplification of your data pipeline.
That said, one of the most enjoyable things about talking to customers and partners is learning how they see and use HarperDB. Kyle, Zach, and I may have created HarperDB (now with a lot of help from others), but what it has become is far beyond our original idea.
In this post, I thought I would explore the different elements that make HarperDB incredibly stable and scalable.

Node.js Database

While it might seem counterintuitive to folks who have not drunk the Node.js Kool-Aid, one of the primary reasons HarperDB is so stable is our use of Node.js as its primary framework. I will touch on this more in the sections below, but our ability to run HarperDB as a stateless database, utilize a multi-threaded architecture, leverage horizontal scale, and much more all comes as a benefit of adopting Node.js.


Stateless Database

Because we chose Node.js as our framework, HarperDB is stateless. What this means is that there is really nothing to crash. Think about that for a second. There are no background processes. There is no in-memory component. HarperDB is doing nothing unless you are interacting with it; it uses zero CPU and very little RAM at rest.
Most databases have some form of background process running constantly, whether to update index files, to better utilize in-memory caching, or something else. The problem is that if these central processes crash, the entire database can lock up and become unresponsive. With HarperDB, because of our data model and architecture, this is not required. As a result, there is no single point of failure, and one bad operation will not crash your database. We are so confident of this that we actually run a contest to see if developers can crash HarperDB, which no one has won yet.


Parallel Computing 

It is a common misconception that Node.js is single threaded. While this is true out of the box, it’s pretty easy to run Node.js across multiple processes using the cluster library.
HarperDB utilizes a multi-threaded, multi-core architecture, which means that behind the scenes multiple instances of HarperDB are running independently. This enables HarperDB to handle vast numbers of concurrent clients, as the interactions are handled separately by the different threads. It also allows our database to scale to fit your hardware.
Our Community edition typically runs 4 threads, depending on your hardware, whereas our Enterprise edition runs 16 threads. HarperDB will spin up threads based on the number of cores it can access.
It is true that a request can fail, but because we are using parallel computing, your next request will be routed to a different thread, which can handle that transaction entirely independently. Also, because Node.js is so lightweight, a failed thread can come back online in less than 1 ms.

Horizontal Scale

Many databases, especially RDBMS, utilize a master-slave paradigm for clustering and replication. These models can be deployed with horizontal scale, but doing so is pretty challenging, and they typically have a single point of failure. This often leads to vertical scale, to avoid failure points and reduce configuration complexity, which is expensive.
Furthermore, while most RDBMS systems are capable of horizontal scale, they are more performant when deployed in a vertical scale model, as sharding makes horizontal scale challenging. It is difficult to evenly distribute your data on RDBMS systems, and the configuration can be very complex.
HarperDB is designed for horizontal scale on commodity hardware leveraging a peer-to-peer clustering model.  Every node in the cluster is aware of every other node and can act as a “master” for a particular transaction.  This allows for true distributed computing on the edge or in the cloud.  If a single node fails, the other nodes can continue to handle transactions without any downtime and catch that node up when it returns to a ready state.   
Because hardware costs increase exponentially rather than linearly with machine size, the peer-to-peer clustering model allows organizations to scale HarperDB on cheaper hardware while achieving highly scalable, highly available architectures that don't break the bank.
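As a toy sketch of the peer-to-peer model described above (the class and method names here are hypothetical, for illustration only — not HarperDB’s API): any node can accept a write, live peers replicate it, and a failed node catches up when it rejoins.

```javascript
// Each node holds a full copy of the data and a list of peers.
class Node {
  constructor(id) {
    this.id = id;
    this.up = true;
    this.data = new Map();
    this.peers = [];
  }
  // Any node can act as "master" for a given write.
  write(key, value) {
    if (!this.up) throw new Error(`${this.id} is down`);
    this.data.set(key, value);
    for (const peer of this.peers) {
      if (peer.up) peer.data.set(key, value); // replicate to live peers
    }
  }
  // A returning node catches up from any live peer.
  rejoin(fromPeer) {
    this.up = true;
    for (const [k, v] of fromPeer.data) this.data.set(k, v);
  }
}

const [a, b, c] = [new Node('a'), new Node('b'), new Node('c')];
a.peers = [b, c]; b.peers = [a, c]; c.peers = [a, b];

a.write('dog:1', 'Harper'); // any node can take the write
c.up = false;               // c fails
b.write('dog:2', 'Penny');  // the cluster keeps serving writes
c.rejoin(a);                // c returns and catches up
console.log(c.data.get('dog:2')); // Penny
```

The point of the sketch is the absence of a distinguished master: losing any one node leaves the remaining peers fully able to serve reads and writes.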


Exploded Data Model

HarperDB’s data model is one of the primary reasons that HarperDB is able to achieve high scale and function as an HTAP database. When building and designing HarperDB, we put an enormous amount of thought into ensuring that every transaction was as discrete and atomic as possible.
One of the issues we had encountered previously was that at very high scale, data ingestion updates to the same row or object would often experience locking or throughput issues.
You can read more about our exploded data model in Jacob’s blog post, HarperDB’s Exploded Data Model.
The basic idea behind our model is that every element of data, attribute or column, is stored atomically on disk.  This means that two clients attempting to make updates to the same row have non-locking transactions as each operation is entirely independent.  
By storing the data independently and in a file pattern that allows for quick access, we do not need index files.  This avoids the need for background processes, but it also avoids the need to update multiple sources of data for a single record.  This allows HarperDB to achieve very high scale as we are not duplicating work.   
In most RDBMS systems, index files are created in addition to your actual data. These are essentially pointers to your actual data. What this means is that when you update a row, not only does the row need to be updated, but a background process also needs to update the index file. This is problematic because it means at least two updates occur; if you have multiple indexes, it means multiple updates. Additionally, if the database crashes before these updates complete, you can end up with corruption.
In most NoSQL or object stores, in order to index data, the entirety of the record is duplicated on disk. What this means is that when you update a single object, the database then needs to update all of its duplicated copies. This causes many updates, and in the event of a crash, it can cause corruption.
In HarperDB, a single operation on disk is performed for each attribute in the operation.  Once these operations are complete, no other updates need to occur.  This reduces overhead and allows HarperDB to handle massive amounts of concurrent requests. 


In summary, a lot of things make HarperDB stable and scalable; these are just a few that I have touched on briefly. As we continue to drive innovation within HarperDB, we will continue to add features and functionality around the scalability of HarperDB. Some of the things we are looking at in the near future are serverless architectures, new sharding paradigms, clustering groups, and much more. At the beginning of the post, I mentioned that we have learned an enormous amount from the HarperDB community. We would love to hear your thoughts on this post. Do you agree with our design decisions? Do you think we are crazy? Do you have a suggestion? Let us know at

Topics: Node.js, Scalability