
Building an Edge Database for IoT with Linux File System

Posted by Sam Johnson on August 28, 2019
 
Hello world!  My name is Sam and I am the newest member of the HarperDB Engineering Team. 
 
While I may be relatively new, I have been following the company since they launched last year (I’ve known Stephen since college) and jumped at the opportunity to join this awesome team.  I was and still am excited by the opportunity to help develop a database solution that is fully-indexed with no data duplication while also supporting full SQL and NoSQL within a single model.  In Node.js, no less!
 
As a new engineer, the first thing I had to do was figure out how we were able to accomplish all of this - that meant learning more about the exploded data model and how we use the Linux file system to enable it.  The goal of this blog post is to give you more insight into the guts of HarperDB, and to hopefully help you better understand how different and transformative our edge database solution really is.
 

HarperDB and the File System

“When HarperDB ingests a record it immediately splits that record up into individual attributes, storing the attributes and their values discretely on disk. We use the required hash value to link the attributes together. This is what we mean when we say exploded.” 
From HarperDB’s Exploded Data Model post from June 20, 2018
 
When I started in my role, the above statement made complete sense to me but my mind immediately went to trying to understand how this exploded data was managed, updated, and extracted in a performant way once it was written to disk.  To figure that out, I had to learn more about how we use the Node FS module since the current release of HarperDB uses the Linux file system as its datastore.  In other words, the file system is where our software implements a structured directory and regular file architecture for managing our unique data model.  I should also note that the file system implementation is what allows us to be stateless, which is important for IoT.
 
The best way to dig deeper into how HarperDB works is to run through an example of a basic insert and some search operation examples.
 

Basic Insert

In this example, we are doing a very basic insert to our dogs table, which is currently managing an id attribute (i.e. the required hash value) and a name attribute.  
Diagram: ‘dogs’ table before insert
Basic insert JSON
{
  "operation":"insert",
  "schema":"pets",
  "table":"dogs",
  "records": [
    {
      "id":"2",
      "breed": "Lab",
      "name":"Zelda"
    }
  ]
}
As you probably know by now, when our API receives an insert request, we break apart the record/s being inserted by their individual attribute values and insert them into our schema as a hash/value pair using the following steps:
 
  1. Check our cached system schema to see if the insert attributes exist in the table
    In this scenario, we have a new attribute (i.e. breed) being inserted into the table, which means we add it to our schema by:

      1. Adding /breed directories under /pets/dogs AND /pets/dogs/__hdb_hash - i.e. /pets/dogs/breed AND /pets/dogs/__hdb_hash/breed

      2. Adding the new table attribute to our system schema 

  2. Create the hash/attribute value pair files within the appropriate directories
    This is done in two distinct steps:

      1. Insert the individual attribute values into their respective directories in the /pets/dogs/__hdb_hash directory - e.g. a file with name “2.hdb” and contents “Zelda” is created in the /pets/dogs/__hdb_hash/name directory

      2. Hard links to the files created above are added to the attribute name directories in the /pets/dogs directory - e.g. a hard link to the file created above is added to the /pets/dogs/name/Zelda directory
Note: Because “Zelda” is a new attribute value for name, we also create a new directory for the value before creating the hard link
 
  3. Create a journal entry for the operation in the hash attribute value directory
    A file with name based on the time of the operation - “<timestamp>.hdb” - and contents equal to the data that was part of the transaction - { "id":"2", "breed": "Lab", "name":"Zelda" } - is added to the /pets/dogs/id/2 directory
 
Diagram: ‘dogs’ table after insert
 

Basic Searches

Understanding the data structure that is created in the file system when a record is inserted is critical to understanding how HarperDB is able to search and return that data after it’s been exploded and written to disk. This is because the two places we link to the hash/attribute value file represent how we manage a primary (i.e. hash) and secondary (i.e. value) index for that attribute value.  Let’s look at an example of how each of these indexing strategies work.

Search By Hash
{
  "operation":"search_by_hash",
  "schema": "pets",
  "table": "dogs",
  "hash_values":[2],
  "get_attributes": ["*"]
}
HarperDB searches a table by hash value - the primary key for data in a table - and returns attribute data using the following steps:
 
  1. Check the system schema to confirm that requested attributes exist on the table OR to collect the list of table attributes when all are requested via the wildcard (i.e. “*”)
    e.g. In the example above, the system schema would return “breed”, “id”, and “name” as the “dogs” table attributes to collect values for
 
  2. Retrieve attribute values for each of the get_attributes for the requested hash_values using fs.readFile()

    • Because we know the schema, table, attributes, and hash values from the request, we can build out file paths for each hash/attribute value file in the pets/dogs/__hdb_hash directory and use those paths to read the contents (i.e. the attribute value for the specific hash) of each file
    • e.g. to retrieve the name attribute value for id - i.e. the hash value - equal to "2", the process creates the file path [hdb_install_path]/schema/pets/dogs/__hdb_hash/name/2.hdb and then performs a fs.readFile() operation which returns “Zelda” as its contents
Diagram: Search By Hash in HarperDB
 
  3. Consolidate and return the final search results for the hash and attribute values requested
    Using the attribute values retrieved from the hash directory, the final step in this process is to consolidate the search result objects by hash
 
To put it another way, when we know the hash values we need to look up in a table, we (1) validate and/or collect the table attributes to look up in our cached system schema, (2) using the necessary data (i.e. schema, table, attribute names, and hashes), iterate through the table’s hash folder to collect the hash/value pairs for each attribute, and then (3) consolidate the data into result objects by hash value and return it in the response.  
 
This operation is extremely performant, and we are constantly looking for ways to make it even faster!

Search By Value
{
  "operation":"search_by_value",
  "schema": "pets",
  "table": "dogs",
  "search_attribute": "name",
  "search_value": "Zelda",
  "get_attributes": ["*"]
}
 
HarperDB searches a table by attribute value and returns matching records with the other requested attribute data using the following steps:
 
  1. Same as the first example, check the system schema to confirm that requested attributes exist on the table OR to collect the list of table attributes when all are requested via the wildcard (i.e. “*”)
    e.g. In the example above, the system schema would return “breed”, “id”, and “name” as the “dogs” table attributes to collect values for

  2. Check the table attribute value directory for a child directory with name equal to the search_value and, if it exists, collect matching hash values using fs.readDir()

    • Because we know the schema, table, and attribute name and value to search for, we can build out a path for the attribute value directory - i.e.  pets/dogs/name/Zelda - where the hash value files are stored.
    • e.g. in this example, we call fs.readDir() on the path and get an array of file names in the directory - in this case, it would return ["2.hdb"] giving us the hash value after stripping off the hdb file extension
Diagram: Search By Value in HarperDB
 
  3. Retrieve values for the other requested attributes for the matching hash values returned in the last step using fs.readFile()

    • Similar to the first example, now that we have the matching hash values, we can use those hashes to return attribute values via their primary key in the same way we did in the previous example
    • e.g. to retrieve the breed attribute value for id - i.e. the hash value - equal to “2”, the process creates the file path pets/dogs/__hdb_hash/breed/2.hdb and then performs a fs.readFile() operation which returns “Lab” as its contents.  Since we already have the id attribute value (i.e. the hash), we don’t need to do anything else for this request!

  4. Consolidate and return the final search results for the matching hash values and requested attribute values
    As with the previous example, the final step in this process is to consolidate the search result objects by hash
 
To put it another way, when we know the attribute name and value to search for in a table, we (1) quickly validate and/or collect the other table attributes to look up in our cached system schema, (2) using the necessary data (i.e. schema, table, attributes, and search attribute name and value), collect the names of the files within the attribute value directory (i.e. the matching hash values), then, with the hash values now identified, (3) iterate through the table’s hash folder to collect the remaining hash/value pairs for the other attributes requested, and, finally, (4) consolidate the data into result objects by hash value and return it in the response.  
 
While it does require an additional step to collect the matching hash values for the search value, this operation is also extremely performant.

Wrapping Up

There is obviously a lot more going on behind the scenes to make HarperDB an effective and performant enterprise-class database solution for IoT and edge projects. That said, understanding how data is stored and how that structure enables us to search for any attribute as a primary or secondary index was critical for me and my understanding of our solution and what makes us different. I hope it helped you connect a few more dots as well!
 
The mission statement of HarperDB is “Simplicity without Sacrifice.”  What I enjoy most about working as an engineer at HarperDB is that, while our technical implementation is complicated, the overall design concept behind our patented data model and how it is enabled in the file system is actually really straightforward.  It just took me a few months to figure that out!
 
Right now, my team spends a lot of our time working to figure out the most performant ways to search and extract data using Node's non-blocking I/O.  I’ve been amazed at how quickly these operations can run through big data sets and I’m excited to see continuous improvements moving forward. 
 
Speaking of performance improvements, my team is currently in the middle of a big refactor to enable HarperDB to work with Levyx’s Helium™ embeddable data engine and other datastore providers.  I’ll be sharing more about that in my next post.
 
 
 

Topics: Node.js, HTAP, Multimodel, Single Model, Distributed Systems, Data Storage, Developer

