The Stackable Docathon: building a data pipeline

Last month we ran our first-ever Documentation-Hackathon – or “Docathon” – at Stackable. The result is a guide showing how to build a simple data pipeline. The guide can be found here:

https://docs.stackable.tech/home/tutorials/end-to-end_data_pipeline_example.html

As anyone who has been involved in software engineering for any length of time will readily confirm, it is almost impossible to keep your documentation *consistently* up-to-date. There are partial remedies and strategies, of course, such as: including documentation in a Scrum-Team’s “Definition of Done” (which often degenerates into “Documentation-less but Done”), a documentation “spring-clean” before a release, keeping documentation as close to the code as possible (e.g. javadocs or similar; database metadata), even writing tests into your documentation, as Rust allows. But the fact remains that it is usually seen as a bit of a chore.
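For readers who have not come across tests embedded in documentation, here is a minimal sketch of the idea. Rust runs the examples in its doc comments as tests; the sketch below swaps in Python's `doctest` module, which does a comparable job, rather than showing Rust itself. The function `trip_duration_minutes` and its values are made up for illustration and do not come from our code base.

```python
import doctest


def trip_duration_minutes(pickup_epoch_s: int, dropoff_epoch_s: int) -> float:
    """Return the length of a taxi trip in minutes.

    The examples below are documentation and tests at the same time:
    doctest runs each line starting with >>> and compares the result
    with the line that follows it.

    >>> trip_duration_minutes(0, 900)
    15.0
    >>> trip_duration_minutes(100, 100)
    0.0
    """
    return (dropoff_epoch_s - pickup_epoch_s) / 60


if __name__ == "__main__":
    # Silent when every documented example still matches the code.
    doctest.testmod()
```

If someone changes the function without updating the docstring, the examples fail, so the documentation cannot quietly drift away from the code.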

In a recent team retrospective we decided to run an internal Docathon to fill in some of the gaps before our upcoming release. I volunteered to set things up. As someone who had never been part of a hackathon – let alone a Docathon – I had three overriding concerns: how does this work, how does this work *remotely* and, most importantly, how do I make this *non-boring*?

This post will give a summary of what we did, how we did it, and what we learned along the way.

What we wanted to achieve

Our motivation was not only to provide content for our users but also to kick the tires of our own product in order to identify things we may have missed in the course of regular software development. Were there issues hidden to an integration tester but which would be all too obvious to a new user? Would our operators play well together, not just technically, but also intuitively? How good a grasp did we have on the capabilities of the data products that are managed by our operators?

With this in mind, and following an initial brainstorming session, we decided to document a user journey that would meet the following goals:

Cover the usage of several of our operators in combination, rather than simply fill in the gaps in API call descriptions and the like.
Provide a detailed “HOW-TO” guide: from the setting up of an initial Kubernetes environment, to the processing and visualizing of a medium-sized dataset.
Use a publicly-available dataset for the pipeline – specifically part of the New York City taxi data from 2020.
Provide a series of detailed, repeatable steps that result in a dashboard displaying the result of an aggregation query.

The final pipeline looks like this:

S3/NiFi: ingest data from a public external S3 bucket into NiFi
NiFi/Kafka: perform some light processing in NiFi and write out the result to a Kafka topic
Kafka/Druid: define an ingestion job in Druid that reads this data from the same Kafka topic as the previous step and creates a Druid data source
Druid/Superset: reference this data source in Superset and use it to create a chart on a dashboard

The pipeline involves five operators in all: the NiFi, Kafka, Druid and Superset operators in combination for the data processing, plus the ZooKeeper operator (ZooKeeper is used by Kafka and NiFi as backing configuration storage). Further details are given in the guide linked above.

What we did

As a first step, we specified some ground rules.

What to document?

We agreed to err on the side of having too much documentation, i.e. even documenting things we hadn’t yet implemented! This was because we wanted to describe the user journey in its entirety, even if it meant being a little too “optimistic”. We quickly adopted a pragmatic approach to this, though: since we wanted our final artifact to be something that our users could take away and try out, rather than just read through as theory, we opened issues for any missing or incomplete items that we could implement in a day or two, as and when we encountered them. This meant that we were not held up by things that had been overlooked, as long as they did not constitute new features.

How to document?

Following a brief discussion of all the wonderful formatting tools out there in the internet world, we decided to prioritize content over format: we would rather get a lot of material down on (virtual) paper and fine-tune it later than spend any significant effort on formatting it in-flight. A single shared online doc, with screenshots pasted in, was sufficient for our needs.

Duration, groups and break-outs

Again, we wanted to keep things lightweight and pragmatic: we are a team distributed over three countries in two time-zones, so we spread the exercise out over two days with a 0900-1500 window on each day. Each day began with a short (half-hour) kickoff/retrospective, we had a lunchtime round-up on day one, and we finished with a short wrap-up at the end of business on day two. At all other times we worked in 3 groups of about 3-4 people, with ad hoc Slack chats between teams to iron out overlaps, interfaces and other questions.

Follow-up work

The formatting and consolidation of the contributions from each team took a little effort, but this was partly because we had lots of content to work with! We assigned this to a subset of the team, as it was more efficient to format and check the document as a whole rather than to try and achieve this by working on it in parallel. A consolidated document was ready a day or two later, and went online following reviews and small improvements about 7-10 days after that.

What we didn’t do

We intentionally didn’t include authentication and authorization everywhere – this was purely to keep things as simple as possible. The target data set was chosen to have a manageable size to keep wait times during validation tests to a minimum.

Lessons Learned

Here is a short summary of our “take-aways” and things that we will note for next time:

Settle on exact interface definitions – e.g. data set scope, topic name – before we start. It wasn’t clear until the consolidation stage that we had actually used two different months from the NYC taxi data that didn’t overlap, meaning that the final dashboard was initially empty!
Keep documentation of not-yet-implemented features to a manageable level (e.g. things that can be implemented in the course of the Docathon, or shortly after its completion).
Recognize that we will probably overestimate what can be achieved in the first 4 hours, yet underestimate what can be achieved in 2 days.
Keep documentation guidelines and conventions to an absolute minimum – content over presentation – as pulling everything together and formatting it is more efficient if one is working with plain text.
Carry out a release of all components before the Docathon starts, so that we are documenting components that are in a stable state.
Appreciate that eating our own dogfood is an efficient and effective way of learning about our project. This exercise helped us understand the products themselves better, as well as our own operators, and has yielded a user-journey template that can be adapted for similar scenarios in the future. It also reveals aspects of interoperability that are different to those covered by integration tests.
Aim for a manageable team size. Keeping the groups small – e.g. 3-5 people – enables us to stay nimble, pragmatic and engaged.
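The first lesson in this list, settling interface definitions up front, could be as lightweight as one small shared definition that every group checks its work against. The following is only a sketch of that idea, not something we used during the Docathon: the class `PipelineContract`, the function `mismatches`, the topic and data source names, and the April/May date ranges are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PipelineContract:
    """Values that every group must agree on (all names are hypothetical)."""

    dataset_start: date  # first day of the data slice, inclusive
    dataset_end: date  # last day of the data slice, inclusive
    kafka_topic: str
    druid_datasource: str


def mismatches(a: PipelineContract, b: PipelineContract) -> list[str]:
    """List the points on which two groups' assumptions disagree."""
    problems = []
    # Two inclusive date ranges overlap only if each one starts
    # on or before the day the other one ends.
    if a.dataset_start > b.dataset_end or b.dataset_start > a.dataset_end:
        problems.append("dataset date ranges do not overlap")
    if a.kafka_topic != b.kafka_topic:
        problems.append("Kafka topic names differ")
    if a.druid_datasource != b.druid_datasource:
        problems.append("Druid data source names differ")
    return problems


# Invented example: one group ingests April, the other charts May.
ingest_group = PipelineContract(
    date(2020, 4, 1), date(2020, 4, 30), "nyc-taxi", "nyc-taxi-data"
)
dashboard_group = PipelineContract(
    date(2020, 5, 1), date(2020, 5, 31), "nyc-taxi", "nyc-taxi-data"
)

print(mismatches(ingest_group, dashboard_group))
```

Running a check like this at the kickoff would have flagged the non-overlapping months on day one rather than at the consolidation stage.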

Conclusion

I am sure we will repeat this exercise at Stackable: yes, documentation can actually be *fun* (!), especially when approaching it from the perspective of the user and with the goal of creating a specific tutorial as the end product.