Building a New Big Data Distribution Based on Kubernetes - With a Twist!

This blog post is based on the presentation given at Berlin Buzzwords by Lars Francke and Sönke Liebau on 2021-06-15. You can watch the full version of the talk on YouTube.

A brief history of open source big data distributions

If you want to store, process and visualise lots of data with Open Source tools, one tool is not enough; often you’ll need a dozen or more disparate Open Source projects to form a data storage and processing system. Ideally these would be packaged together to form a coherent platform, akin to the way Linux distributions work. Stackable provides such a platform, and it’s useful to understand the journey the open source data world has taken that led to our formation.

Around 2008 the first Open Source Big Data distributions based on Apache Hadoop began to appear, each packaging its own selection of the available Open Source projects. Vendors made different choices about which projects to include, but their main goal was the same: to make it easier to deploy and manage a scalable data platform, typically by means of a management framework that was often proprietary.

An unsurprising consequence of a single company dominating an area of the market was that prices rose. A lot. They skyrocketed from an estimated $2,000 per node per year, when competition was fierce, to a staggering $10,000. The same period also saw the end of Cloudera’s free-as-in-beer version of Cloudera Manager and the erection of a paywall that prevents all but paying customers from downloading the software. And so the free, open-source market for data-at-scale platforms was brought to a sudden, grinding halt, leaving a pay-to-play market with one dominant player, currently valued at $5.3 billion, occupying the niche.

Stackable

Stackable began life in 2020, the founders having previously built the successful Big Data consultancy Open Core. The changes that 2019 brought to the market were disruptive, and the ripples were felt by Open Core’s customers, who began asking for advice. Paying Cloudera and Hortonworks customers saw their license costs rise sharply. Those using the free version were locked out of further updates and had to choose between remaining on an unsupported, unmaintained platform and starting to pay a hefty subscription fee. Upgrading to the latest version of Cloudera’s on-premise offering felt more like a platform migration in itself, which made moving to something else a viable option.

This market disruption presents an opportunity for Stackable. Customers are invested in their on-premise platforms and, despite the hype, many are unable or unwilling to move to the cloud. They like the predictable costs that come with running on premise and the ability to control where their data is. Moving to the cloud also entails a significant skills uplift for customers unaccustomed to working with cloud providers. On-premise customers can use their existing teams and skills, focusing on their business needs rather than on learning yet another platform. There is a significant investment in Hadoop skills from an operational, developer and end-user perspective. Replatforming is always going to be a challenge, especially given the absence of any viable competitor in the same space.

Cloud is here and it’s not going anywhere. But in the age of cloud-native services, is there still room for building on-premise data platforms? At Stackable we believe there is, and we’re building a new Big Data distribution based entirely on open source software. Cloud providers offer similar services but have little to no penetration into the on-premise market.
