A Brief History of Open Source Big Data Distributions

This blog post is based on a lecture at Berlin Buzzwords by Lars Francke and Sönke Liebau on June 15th, 2021. You can find the full version of the lecture on YouTube.

If you want to store, process and visualize large amounts of data with open source tools, a single tool is not enough. It often takes a dozen or more different open source projects to build a complete data storage and processing system. Ideally, these are integrated and bundled in advance to form a coherent platform, much as Linux distributions do for operating systems. Stackable provides such a platform, and it is worth looking at the journey the open source data world has taken to understand the rationale behind its creation.

Openness and diversity

The first open source big data distributions based on Apache Hadoop appeared around 2008, and each combined its own selection of available open source projects. The primary goal of each was to simplify the deployment and management of a scalable data platform, typically with a management framework that was often proprietary.

In the beginning there were five or more companies that offered so-called big data distributions. Over time, most of them disappeared from the market, merged, or focused on investing in the remaining competitors. Many stories are still told about these events. But once the dust of these “distribution wars” had settled, only Cloudera remained as a provider of an on-premises open source big data distribution, following its merger with its main competitor, Hortonworks.

Unsurprisingly, as a result of market dominance by a single company, prices rose, and sharply: from an estimated US$2,000 per node per year in the days of fierce competition to a staggering US$10,000. With the end of the free version of Cloudera Manager and the establishment of a paywall that only allowed paying customers to download the formerly free software, the market experienced another turning point: the market for free open source data platforms came to a sudden and hard halt. Instead, it has become a pay-to-play market in which a dominant player, currently valued at $5.3 billion, occupies that niche.

Enter Stackable

Stackable was founded in 2020 after its founders had built the successful big data consulting firm Open Core. The changes that moved the market in 2019 were disruptive. Customers felt these waves too and began to ask Open Core for advice. For paying Cloudera and Hortonworks customers, license costs skyrocketed, and those using the free version were now locked out of further updates. They were left with two options: either stay on an unsupported and unmaintained platform, or pay a hefty subscription fee. Because upgrading to the latest version of Cloudera’s on-premises offering felt more like a platform migration than an upgrade, switching to a different solution suddenly became a viable alternative.

This disruptive market change is now the opportunity for Stackable. Despite the hype, many customers cannot or do not want to move to the cloud and are instead investing in their on-premises platform. They appreciate the predictable costs associated with operating on-premises and the sovereign control over where their data resides. The move to the cloud also brings with it a significant need for additional expertise for customers unaccustomed to working with cloud providers. On-premises customers can leverage their existing teams and their skills to focus on their business needs rather than learning a new platform. And don’t forget the significant investment in employee knowledge of Hadoop from an operational, developer, and end-user perspective. Replatforming will always be a challenge, especially given the lack of an adequate competitor in this space.

The cloud is here to stay. So, in the age of cloud-native services, is there still room for building on-premises data platforms? At Stackable, we believe there is, and we are building a new big data distribution based entirely on open source software. Cloud providers offer similar services but have very little penetration of the on-premises market. Hybrid platforms that not only bridge the cloud and on-premises markets, but also leverage the best of both worlds, will become commonplace. Stackable has set out to offer the open source alternative for such modern data platforms.