
The Future of Big Data is Small Data

September 24, 2019
Jaxon Repp
Field CTO at HarperDB


The Looming Crisis of Big Data Databases at Thing Scale

As a partially reformed software engineer, I’m acutely aware of the balance we all attempt to maintain between what we can achieve today and what we believe we’re going to be able to achieve tomorrow.

In theory, our increasingly complex attempts to understand the universe of data around us should keep pace with Moore's Law (the observation that transistor counts, and with them computing power, double roughly every two years), since our insights are always going to be limited by the power we have to extract them from our observations. In practice, we've so completely bought into the idea that Big Data is the cure to all that ails us that we've started adding sensors to absolutely everything… and the architectures that served us so well for the first 50 years of industrial computing are failing to keep up.

The Promise (and Terror) of Big Data

Storing and analyzing every sensor reading, account transaction, user interaction, or threshold event feels like a good idea. Without perfect information, it stands to reason, one will be unable to make a perfect decision. Machine learning models are, much like us, only as good as the universe of information we use to construct them. Data from the previous hour does not feel like the future. Real-time data from 10,000 sensors firing every 10 ms feels very much like the future… but also very much like one is going to saturate one's network and personally fund Blue Origin's next five launches.
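To put rough numbers on that, here's a quick back-of-envelope sketch in Python. The 200-byte payload per reading is my assumption (a small JSON document plus protocol overhead), not a figure from any particular deployment:

```python
# Back-of-envelope: what "10,000 sensors firing every 10 ms" means on the wire.
SENSORS = 10_000
READINGS_PER_SEC = 100        # one reading every 10 ms
PAYLOAD_BYTES = 200           # assumed size of one reading plus protocol overhead

messages_per_sec = SENSORS * READINGS_PER_SEC        # 1,000,000 messages/s
bytes_per_sec = messages_per_sec * PAYLOAD_BYTES     # 200 MB/s
gigabits_per_sec = bytes_per_sec * 8 / 1e9           # ~1.6 Gbps, sustained
terabytes_per_day = bytes_per_sec * 86_400 / 1e12    # ~17 TB/day

print(f"{messages_per_sec:,} messages/s")
print(f"{gigabits_per_sec:.1f} Gbps sustained")
print(f"{terabytes_per_day:.1f} TB/day headed upstream")
```

Even with those conservative assumptions, that's a sustained multi-gigabit stream and tens of terabytes a day on their way to the cloud bill.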

The promise of Big Data was supposed to be that future. But a lot of Industry 4.0 initiatives seem to stall, usually for one of two reasons: we either 1) try to measure everything and become overwhelmed with data we aren't prepared to process, let alone draw insights from, or 2) determined to avoid number 1, spend the entire R&D budget guessing which pieces of data are most important.

The first approach is what I call Rumsfeldian: in a world of unknown unknowns, let’s carpet-bomb the cloud with every one and zero we can capture. Historically speaking, carpet-bombing (of any type) tends to age exceptionally poorly, so I’d venture a guess that it won’t turn out to be a great data management strategy, either. 

The second approach is even worse, for obvious reasons. To paraphrase the movie Animal House, “Overanalyzed, overwhelmed, and over budget is no way to go through life, son.” 

Distributed Computing to the Rescue

The logical compromise has increasingly been to deploy smaller compute nodes closer to the sensors, store their data in an edge database, and forward the important bits up to the cloud. There, we expect, insights are extracted, the promise of Big Data is finally realized, and we all get to move to the beach, spending our days enjoying the highly refined and optimized existence we'd been planning all along.

Except they aren’t, it isn’t, and we don’t get to.  
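For concreteness, here's roughly what that filter-and-forward compromise looks like in practice. This is a minimal sketch, not anyone's production code: the local list and print() stand in for an edge database write and an upstream publish, and the threshold is purely illustrative.

```python
# A minimal sketch of the edge "filter-and-forward" compromise.
# The local list and print() stand in for an edge database write and an
# upstream publish (MQTT, HTTP, etc.); the threshold is purely illustrative.

THRESHOLD = 80.0              # assumed alert threshold for this metric
local_store: list[dict] = []  # stand-in for the edge database

def forward_to_cloud(event: dict) -> None:
    # In practice: publish to a broker or POST to the cloud tier.
    print("forwarding upstream:", event)

def handle_reading(reading: dict) -> None:
    local_store.append(reading)          # every reading is retained at the edge
    if reading["value"] > THRESHOLD:     # only the "important" bits leave the site
        forward_to_cloud(reading)

# Two readings arrive; only the threshold-crossing one is forwarded.
handle_reading({"sensor_id": "pump-7", "ts": 1727179200.00, "value": 72.4})
handle_reading({"sensor_id": "pump-7", "ts": 1727179200.01, "value": 93.1})
```

Everything lands locally; only the readings we've decided matter ever leave the site, and that decision is exactly where the trouble starts.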

Moving all the data to the cloud is too expensive, but shipping only the “important” bits strips away the context. In other words, the data we cast aside in search of efficiency is the very same data that enables us to identify the correlations and relationships that govern our system in the first place. The subtlety and nuance our brains leverage to make decisions are based entirely on context, and the process itself is exceptionally tiered. Each signal we take in is evaluated multiple times en route to a conscious decision, but thousands of subconscious evaluations and decisions are executed by specialized subsystems long before we’re even aware that we’re about to make a choice at all. We rely on that context, even if we aren’t consciously aware of it. 

Optimizing a business process using data that’s been stripped of its context is akin to handing a toddler a good apple and a bad apple and then demanding they improve the efficiency of Aldi’s global produce supply chain.  

Even if the cost of moving an interconnected, graph-like, distributed system's data to the cloud were reduced to zero, the resources required to generate comprehensive machine learning models from that data grow combinatorially as the system expands: in a graph-like system, every new node can add relationships to every existing one. Trying to do so all in one place will always remain too computationally, and in turn financially, expensive. Pre-quantum computing infrastructure simply isn't capable of calculating all the logical tiers across even moderately complex systems; we must push these calculations out to the edge.

This Is My Datum. There Are Many Like It, But This One Is Mine.

If building and applying machine learning models based on the entirety of data in a system is too expensive, then perhaps we should manage this “machine knowledge” in the same way we manage human knowledge.  

We don’t rely on a single source of truth when it comes to the whole of human intelligence. Instead, we distribute responsibility amongst those of us who interact with specific subsets of it, trusting subject matter experts to validate, maintain, augment, and summarize it for the rest of us. It seems reasonable to suggest that the global data fabric layer be architected with a similar focus on bottom-up autonomy. 

Generate and apply models at the edge, where context can play a key role. Ship the metrics associated with their application upstream for analysis (a fraction of the data you'd otherwise send across your network), tweak their inputs and weights, and ship those configuration variables back down to the edge, where they are used to rebuild newer, better, more optimized models. Doing so delivers better results than models generated higher up the data chain, with less data transfer, less computing power, and less contextual noise.
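As a rough Python sketch of that loop (the EdgeModel class, its trivial linear scoring, and the threshold are all illustrative assumptions, not a HarperDB API): the edge node scores readings locally, reports only summary metrics upstream, and rebuilds its model when tuned weights come back down.

```python
from dataclasses import dataclass, field

@dataclass
class EdgeModel:
    """Illustrative edge-side model: score locally, report summaries, accept tuned weights."""
    weights: dict[str, float]
    metrics: dict[str, int] = field(default_factory=lambda: {"readings": 0, "anomalies": 0})

    def apply(self, reading: dict[str, float], threshold: float = 1.0) -> bool:
        # Trivial linear model: a weighted sum of whatever features we have context for.
        score = sum(self.weights.get(name, 0.0) * value for name, value in reading.items())
        anomalous = score > threshold
        self.metrics["readings"] += 1
        self.metrics["anomalies"] += int(anomalous)
        return anomalous

    def metrics_for_upstream(self) -> dict[str, int]:
        # A handful of counters ships upstream instead of a million raw readings.
        return dict(self.metrics)

    def rebuild(self, new_weights: dict[str, float]) -> None:
        # Tuned configuration comes back down; the edge rebuilds its local model.
        self.weights = new_weights
        self.metrics = {"readings": 0, "anomalies": 0}

# One round trip: score locally, report summary metrics, accept retuned weights.
model = EdgeModel(weights={"vibration": 0.8, "temperature": 0.2})
model.apply({"vibration": 1.4, "temperature": 0.3})     # evaluated at the edge
print(model.metrics_for_upstream())                     # shipped upstream for analysis
model.rebuild({"vibration": 0.9, "temperature": 0.1})   # pushed back from the cloud
```

The point isn't the toy model; it's that the only traffic crossing the network is a handful of counters going up and a handful of weights coming down.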

To extend the “human knowledge” vs “machine knowledge” analogy to corporate or industrial knowledge: even when management teams have a firm grasp on the depth and breadth of a company's levers of success, the creation of front-line subject matter experts, and the empowerment of those people to drive change based on that expertise, routinely delivers outsize returns on investment in both financial and human capital.

Fostering subject matter expertise amongst compute nodes in a distributed system at Thing Scale is no different. Front-line edge nodes analyze and act on the ever-increasing volume of data that would very quickly overwhelm cloud-based solutions, and the role of those of us seeking to extract opportunity from that data becomes that of parent to an army of toddlers: each node innately capable of learning, provided with a complete picture of its immediate surroundings, and dependent on us to help it understand the importance of, and proper reaction to, each signal it encounters. Over time, those nodes reach more and more complex conclusions based on a wider variety of information, eventually requiring little or no assistance from us at all.

In short, the future of Big Data is Small, Toddler-Sized Data. 
