Is HarperDB a Document Store?

Posted by Stephen Goldberg on May 01, 2018

The short answer is no.  Why does this matter?  When talking to the community, I often find that people are trying to understand where HarperDB fits in the database ecosystem.  Most people are primarily familiar with NoSQL databases, RDBMSs, graph databases, and time-series databases, and many use the terms Document Store and NoSQL database interchangeably.  In technology it is common practice to relate new and emerging technology to existing technology that is already understood.  While HarperDB shares many similarities with traditional NoSQL databases, it is not in fact a Document Store.  I thought it might be helpful to explain this further for folks trying to understand HarperDB at a deeper level.

 
 

What is a Document Store?

 
According to DBEngines, “Document stores, also called document-oriented database systems, are characterized by their schema-free organization of data.” 
 
According to Wikipedia, “A document-oriented database, or document store, is a computer program designed for storing, retrieving and managing document-oriented information, also known as semi-structured data.”
 
Essentially, what both of these are saying is that “document stores” are databases that store JSON or XML files, also known as “documents”.
 
Next, let's break down why this doesn’t apply to HarperDB.
 

Dynamic Schema

 
In both of the definitions above you will see a common theme that document stores are “schema-free” or have “semi-structured data”.    
 
Document Stores do not have the concept of schemas or tables.  They normally have “collections” which allow you to organize your data in a different fashion than a traditional RDBMS.  
 
What this means is that while you must define a hash and a range, you can send in any data you want in your document.  It also means that while you can add secondary indexes, you are limited in what you can search, and you cannot inspect the schema of your entire database.
 
This gives developers a lot of flexibility when they get started, but over time as their organizations mature, it becomes a major pain point as it is difficult for them to do data modeling and have organized enterprise architecture.
 
HarperDB is not “schema-free”.  HarperDB has a dynamic schema.  What this means is that if you perform a JSON insert like so: 
 
{
  "operation":"insert",
  "schema":"dev",
  "table":"dog",
  "records": [
    {
      "name":"Harper",
      "breed":"Mutt",
      "id":"1",
      "age":5
    }
  ]
}
 
 
or if you perform a SQL insert statement like so:
 
{
  "operation":"sql",
  "sql": "INSERT INTO dev.dog (name, breed, id, age) VALUES('Harper', 'Mutt', 1, 5)"
}
 
Both of these insert statements will dynamically create a schema on the fly.  In a “Document Store” you cannot inspect your schema.  However, in HarperDB you can call “describe_all” to inspect it.  The sample output below comes from a dev schema that holds more tables and attributes than the single insert above:
 
{
  "dev": {
    "dog": {
      "hash_attribute": "id",
      "id": "125ab5c9-f8ac-4197-a348-dd716b0f11ed",
      "name": "dog",
      "schema": "dev",
      "attributes": [
        {
          "attribute": "adorable"
        },
        {
          "attribute": "weight_lbs"
        },
        {
          "attribute": "owner_name"
        },
        {
          "attribute": "doc"
        },
        {
          "attribute": "id"
        },
        {
          "attribute": "dog_name"
        },
        {
          "attribute": "age"
        },
        {
          "attribute": "breed_id"
        }
      ]
    },
    "breed": {
      "hash_attribute": "id",
      "id": "4d560884-0f57-4756-a564-eea76f64d0af",
      "name": "breed",
      "schema": "dev",
      "attributes": [
        {
          "attribute": "section"
        },
        {
          "attribute": "country"
        },
        {
          "attribute": "name"
        },
        {
          "attribute": "image"
        },
        {
          "attribute": "id"
        }
      ]
    }
  }
}  
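 
As a rough illustration of how a client might use this output, the Python sketch below walks a describe_all-style response and lists the attributes of each table.  The summarize_schema helper is hypothetical, and the sample response is a trimmed copy of the output above.  Sending the request itself (presumably a JSON body of {"operation":"describe_all"}, following the pattern of the other operations) is left out, since connection details depend on your installation.

```python
import json

# Trimmed copy of the describe_all output shown above.
SAMPLE_RESPONSE = json.loads("""
{
  "dev": {
    "dog": {
      "hash_attribute": "id",
      "name": "dog",
      "schema": "dev",
      "attributes": [
        {"attribute": "id"},
        {"attribute": "dog_name"},
        {"attribute": "age"},
        {"attribute": "breed_id"}
      ]
    },
    "breed": {
      "hash_attribute": "id",
      "name": "breed",
      "schema": "dev",
      "attributes": [
        {"attribute": "name"},
        {"attribute": "country"},
        {"attribute": "id"}
      ]
    }
  }
}
""")


def summarize_schema(describe_all_response):
    """Map "schema.table" to its hash attribute and sorted attribute names.

    Hypothetical helper: it only assumes the response shape shown in the post.
    """
    summary = {}
    for schema_name, tables in describe_all_response.items():
        for table_name, table in tables.items():
            summary[f"{schema_name}.{table_name}"] = {
                "hash_attribute": table["hash_attribute"],
                "attributes": sorted(a["attribute"] for a in table["attributes"]),
            }
    return summary


if __name__ == "__main__":
    for table, info in summarize_schema(SAMPLE_RESPONSE).items():
        print(table, info["hash_attribute"], info["attributes"])
```

Because the schema is reported per table, a data modeler can see every attribute that has ever been written without scanning individual records.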
 
 
 
This also gives you the ability to search on any column without creating secondary indexes.  Every column/attribute is stored separately on write and becomes an individual index.  This means that HarperDB is a fully indexed database.  In “document stores”, secondary indexes are what make the database eventually consistent, whereas HarperDB is ACID.  Those secondary indexes also require data duplication, increased memory utilization, and administrative overhead.
 
 
 

HarperDB Data Storage: Exploded Model 

 
The way this is accomplished in HarperDB is based on how we store the data.  In a document store, as the name implies, data is stored as full JSON documents.
 
Inside a Document Store, a JSON object like the one we inserted above is stored somewhere as a whole.  This could be in one giant file, shared across multiple files, or as individual files, but the net result is that the JSON document lives wholly on disk.
 
This is why different indexes need to be created.  Those whole objects are then referenced based on the hash and the range.  When you want to search by something other than the hash and range the document needs to be duplicated, stored somewhere else on disk and then referenced by that new index value.  
 
With HarperDB, as mentioned above, we create a different location on disk for each column/attribute.
 
[Image: HarperDB file storage]
In the example above I have a schema called dev and a table called load_test.  That table currently has two attributes, “count” and “timestamp”.  You can see them in human readable form directly on my hard disk above.  Below each column is a value folder.  In this case we are looking in the folder for value 2.  We can see there are several files.  Each one of these files represents an attribute of a particular record. 
 
[Image: HarperDB data]
When I open that file we can see that all that it contains is the value for that attribute, in this case the value is 2.  We are not storing the entire JSON object, but rather each individual attribute value atomically on disk.  This allows for each operation to be atomic, and our file structure allows for each column to be an index without increasing data storage.  
 
As a result we are not storing JSON documents or XML documents, but rather their individual parts.  This alone makes HarperDB not a “document store”.  
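 
To make the idea concrete, here is a toy Python sketch of an exploded layout.  It is not HarperDB's actual implementation: the write_record and search_by_value helpers, the .txt extension, and the exact folder nesting (schema/table/attribute/value/record file) are assumptions based only on the description above.  It also assumes values are safe to use as folder names.

```python
import tempfile
from pathlib import Path


def write_record(root, schema, table, hash_attribute, record):
    """Write each attribute of a record to its own small file.

    Hypothetical layout: root/schema/table/attribute/value/<record id>.txt
    """
    record_id = str(record[hash_attribute])
    for attribute, value in record.items():
        value_dir = Path(root) / schema / table / attribute / str(value)
        value_dir.mkdir(parents=True, exist_ok=True)
        # The file holds only this one attribute value, not the whole record.
        (value_dir / f"{record_id}.txt").write_text(str(value))


def search_by_value(root, schema, table, attribute, value):
    """Find record ids by listing the folder for one attribute value."""
    value_dir = Path(root) / schema / table / attribute / str(value)
    if not value_dir.is_dir():
        return []
    return sorted(path.stem for path in value_dir.iterdir())


def demo():
    """Write three records to a temporary directory and search two attributes."""
    records = [
        {"id": "a1", "count": 2, "timestamp": 100},
        {"id": "b2", "count": 2, "timestamp": 200},
        {"id": "c3", "count": 7, "timestamp": 300},
    ]
    with tempfile.TemporaryDirectory() as root:
        for record in records:
            write_record(root, "dev", "load_test", "id", record)
        return {
            "count=2": search_by_value(root, "dev", "load_test", "count", 2),
            "timestamp=300": search_by_value(root, "dev", "load_test", "timestamp", 300),
        }


if __name__ == "__main__":
    print(demo())
```

In this sketch, searching on any attribute is just a directory listing, which is why no separate secondary index has to be declared.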
 
 
 

Interfaces & Multi-Model

 
Multi-Model databases typically were first built as NoSQL databases or “document stores”.  These companies later realized that they needed to accommodate complex analytics, which NoSQL was not designed or built for.  To accommodate this need, NoSQL databases adopted the concept of “multi-model”.  What this means is that, under the hood, those products are essentially running a document store for unstructured data and then transforming that data into a column/row model either on disk or in memory.  This is expensive and slow.  You can read more about our thoughts on multi-model in my blog “Multimodel Databases - A Mistake”.
HarperDB is not a multi-model database, but rather a single model with a dynamic schema.  This allows HarperDB to consume both unstructured and structured data in a single database model without the need for transformation on disk or in-memory.  This reduces cost, reduces overhead, and reduces complexity compared to multi-model databases.  
 
Because of HarperDB’s unique data model, users can use HarperDB without ever interacting with JSON or unstructured data.  It can be used as an RDBMS replacement, and we have users that are interacting with HarperDB via JDBC drivers in this fashion.  Alternatively, users can use HarperDB purely as a NoSQL database through JSON interfaces, or take a hybrid approach of both.  The hybrid approach is the most common, typically using NoSQL interfaces for ingestion and SQL interfaces for query/read capability.  This gives users choice and flexibility that would not be present in a document store.
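 
A minimal Python sketch of that hybrid pattern follows.  It only builds the two request bodies, reusing the insert and sql operations shown earlier in this post.  The build_insert and build_sql helpers are hypothetical, and the SELECT statement is an assumed example rather than one taken from the HarperDB documentation.

```python
import json


def build_insert(schema, table, records):
    """NoSQL-style ingestion: the JSON insert operation shown earlier."""
    return {
        "operation": "insert",
        "schema": schema,
        "table": table,
        "records": records,
    }


def build_sql(statement):
    """SQL-style access: wrap a SQL statement in the sql operation."""
    return {"operation": "sql", "sql": statement}


if __name__ == "__main__":
    # Ingest a record as JSON...
    ingest = build_insert(
        "dev", "dog",
        [{"id": "1", "name": "Harper", "breed": "Mutt", "age": 5}],
    )
    # ...then read it back with SQL (assumed example query).
    query = build_sql("SELECT name, age FROM dev.dog WHERE breed = 'Mutt'")
    print(json.dumps(ingest, indent=2))
    print(json.dumps(query, indent=2))
```

Both bodies target the same table, so no transformation step sits between the data that is ingested as JSON and the data that is queried with SQL.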
 

Then What is HarperDB? NewSQL? HTAP? 

 
Where does this leave HarperDB?  To be transparent, we are still figuring that out ourselves.  How do we categorize ourselves in the market?  Most recently, 451 Research called HarperDB “NewSQL” in their article “New dog, new tricks: HarperDB debuts hybrid SQL/NoSQL database, targets IoT workloads”.  This seems like a solid fit.  We also feel strongly that HarperDB fits in the hybrid transactional/analytical processing (HTAP) category.  Both of these terms have a ways to go before gaining mainstream adoption.  At the end of the day, we see a strong fit for databases that can handle both NoSQL and SQL use cases and that provide analytical and transactional capability with low overhead and little complexity.  Over time the categorizations will change, but our mission will stay the same: to provide technology building blocks that empower customers, partners, and employees to innovate quickly and easily while maintaining simplicity and fostering confidence.
 
 

Topics: Data Value Chain, Dynamic Schema

