Stackable

Apache NiFi 2: Key Updates for the Stackable Data Platform

Apache NiFi has always been a powerful tool for automating the flow of data between systems. NiFi 2 is a complete revamp that delivers many exciting new features and improvements, making it an even more essential part of any data engineering toolkit. Coupled with first-class support from Stackable as part of our Stackable Data Platform, we’re making it easier than ever to get up and running with NiFi.

Let’s dive into some of the key updates in NiFi 2 and its specific extensions in the Stackable Apache NiFi image.

Framework and architecture updates

The extensive changes and new features in NiFi 2 which we present in this article were only possible thanks to a comprehensive update of the underlying frameworks to modern versions, including Java 21, Spring 6, Jetty 12, Servlet 6, Angular 18, and OpenAPI 3.

The Apache NiFi API has been moved into a separate, independent library. Decoupling the public API in this way draws a clearer line between changes to extensions and fundamental changes to the framework itself.

Many more details on structural improvements in NiFi 2 are provided in a comprehensive blog post by David Handermann.

NiFi goes Kubernetes

While NiFi can continue to be deployed as a bare-metal solution, NiFi 2 now allows for native Kubernetes integration. Running NiFi on Kubernetes is nothing new for users of the Stackable Data Platform, but the native integration makes it possible to dispense with ZooKeeper as a quorum manager and rely on Kubernetes leases instead. This feature will be offered as an option within the Stackable Data Platform in our upcoming 25.7 release.
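To illustrate the idea behind lease-based coordination, here is a minimal conceptual sketch. It is not NiFi's implementation and does not talk to a cluster: the `Lease` class is an in-memory stand-in that borrows three field names from the Kubernetes Lease spec (holder identity, lease duration, renew time), and the function name and node names are our own assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    """In-memory stand-in mirroring a few fields of a Kubernetes Lease spec."""
    holder_identity: Optional[str] = None
    lease_duration_seconds: int = 15
    renew_time: float = 0.0


def try_acquire_or_renew(lease: Lease, candidate: str, now: float) -> bool:
    """Return True if `candidate` holds the lease after this attempt."""
    expired = now >= lease.renew_time + lease.lease_duration_seconds
    # A candidate wins if the lease is free, already its own, or has expired.
    if lease.holder_identity in (None, candidate) or expired:
        lease.holder_identity = candidate
        lease.renew_time = now
        return True
    return False
```

The current leader keeps its role by renewing before the duration runs out; if it stops renewing, another node can take over. In a real cluster the update would go through the Kubernetes API server, which rejects conflicting concurrent writes, so no separate quorum service is needed for this step.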

Stateless flows

As a replacement for the previous ExecuteStateless Processor, stateless NiFi is now fully integrated, enabling nodes to execute flows without maintaining state.

Developers can define the transactional boundary for processing data. If a failure occurs, the entire transaction can be rolled back. This is particularly helpful for Function-as-a-Service (FaaS), short-lived container jobs, and edge or event-driven systems. Stateless execution engines can scale elastically and operate in parallel, without the central orchestration bottlenecks caused by ZooKeeper dependencies.
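The all-or-nothing behaviour of a transactional boundary can be pictured with a small conceptual sketch. This is plain Python rather than NiFi code, and `run_stateless`, `steps` and `sink` are hypothetical names chosen for illustration: every step runs in memory, and the result is handed to the destination only if all steps succeed.

```python
from typing import Callable, List

Step = Callable[[bytes], bytes]


def run_stateless(steps: List[Step], payload: bytes, sink: List[bytes]) -> bool:
    """Run every step in memory; write to the sink only if all steps succeed."""
    current = payload
    try:
        for step in steps:
            current = step(current)
    except Exception:
        # Nothing reached the sink, so the source is free to redeliver the payload.
        return False
    sink.append(current)
    return True
```

Because no intermediate state is persisted, a failed run leaves nothing behind to clean up, which is what makes this model a good fit for short-lived jobs.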

Python support

The Python API for Processors makes Python a first-class citizen for extensions. In its initial releases, it allows Python programmers to build custom Processors for NiFi and MiNiFi.

NiFi 2 fully supports CPython with access to a broad range of Python packages and their functionality. This opens up a wide range of new applications, in particular support for AI and ML workflows. Widely used libraries such as pandas for data analysis and scikit-learn for machine learning can be easily integrated.
In a separate repository, NiFi offers example extensions for RAG pipelines, such as processors for multiple vector databases, document chunking and OpenAI.
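To give a feel for what a custom Python processor looks like, here is a minimal sketch based on the `FlowFileTransform` style of the NiFi Python API. The processor name `AddRecordCount`, the attribute name `record.count` and the version string are hypothetical. The fallback classes in the `except` branch are stand-ins we added so the sketch can also be exercised outside NiFi; they are not part of the NiFi API.

```python
import json

try:
    # Available inside NiFi's Python extension runtime.
    from nifiapi.flowfiletransform import FlowFileTransform, FlowFileTransformResult
except ImportError:
    # Minimal stand-ins (assumption) so the sketch can run outside NiFi.
    class FlowFileTransform:
        def __init__(self, **kwargs):
            pass

    class FlowFileTransformResult:
        def __init__(self, relationship, attributes=None, contents=None):
            self.relationship = relationship
            self.attributes = attributes
            self.contents = contents


class AddRecordCount(FlowFileTransform):
    """Hypothetical processor: counts the items of a JSON array."""

    class Java:
        implements = ["org.apache.nifi.python.processor.FlowFileTransform"]

    class ProcessorDetails:
        version = "0.0.1-SNAPSHOT"
        description = "Stores the number of items in a JSON array as an attribute."

    def __init__(self, **kwargs):
        pass

    def transform(self, context, flowfile):
        try:
            records = json.loads(flowfile.getContentsAsBytes().decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError):
            return FlowFileTransformResult(relationship="failure")
        if not isinstance(records, list):
            return FlowFileTransformResult(relationship="failure")
        # Attribute values are strings; the content is left unchanged.
        return FlowFileTransformResult(
            relationship="success",
            attributes={"record.count": str(len(records))},
        )
```

The same pattern extends to heavier work: the body of `transform` is ordinary Python, so it could just as well call into pandas or a scikit-learn model.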

Python processors work particularly well with the stateless mode of NiFi 2.0: They can be executed as a function without persistent state, making them ideal for on-demand processes, data enrichment, or inline ML inference.

Custom Python processors can already be used within the Stackable Data Platform today, and the Stackable Operator for Apache NiFi will make setting them up much simpler in the upcoming SDP release 25.7.

Flow versioning with Git

In NiFi 2, flows can use a Flow Registry client where the ‘registry’ is simply a Git repository, as an alternative to the traditional NiFi Registry. Native Git integration improves NiFi lifecycle processes for software deployment and is well aligned with the “Everything as Code” pattern that the Stackable Data Platform is implementing. Detailed documentation will be available on our docs page. However, direct editing of data flows within Git is not supported.

Processor enhancements and housekeeping

NiFi 2 offers many new and enhanced processors to support the requirements for data management on a modern tool stack. Highlights include new S3 processors for retrieving metadata and copying files across buckets, along with support for Kafka 3 for both consuming and publishing data.

On the other hand, the NiFi committers have taken the opportunity to thoroughly clean up the existing components and features, as shown in this housekeeping list. For example, support for Kafka 2 has been removed, as well as processors and controller services for HBase and Hive 3, due to their EOL status.

As a result, the PutIceberg processor was removed as well, and with it support for the prominent Iceberg table format, a key component for implementing data lakehouses. We have decided to include a bundle that contains an Iceberg processor in the Stackable NiFi image again to ensure backwards compatibility and continuity.

Hello Dark mode my old friend

The user interface has been completely reimplemented in NiFi 2 on a far more modern software stack as mentioned above. It brings with it many improvements to details, but retains much of the proven user experience of NiFi 1.x. Data engineers will appreciate the new dark mode.

Last but not least: Security

The updates to the frameworks and the removal of legacy components alone ensure that many security gaps have been closed in NiFi 2. In addition, they enable new security-relevant features such as PEM certificate support with ECDSA, Ed25519 and RSA key algorithms (see this blog post by David Handermann for details).

Historically insecure options have been removed. The Stackable Operator for Apache NiFi will ensure that only valid methods can be used when NiFi 2 is deployed.

Single Sign-On to the UI via OIDC has been supported since the early days of NiFi 1.x. NiFi 2 adds support for the Client Credentials Flow with OIDC, allowing access to NiFi using tokens obtained from the configured identity provider. The Client Credentials Flow makes it possible to authenticate not only human users, but also machines.
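For readers unfamiliar with the Client Credentials Flow, the following sketch shows the two requests a machine client would typically make: a standard OAuth 2.0 token request to the identity provider, followed by a call to NiFi that presents the token as a Bearer credential. The helper names are hypothetical, the URLs in the usage note are placeholders, and identity providers differ in detail (some expect the client secret in a Basic auth header instead of the form body), so treat this as an illustration rather than a reference client. The requests are only built here, not sent.

```python
import urllib.parse
import urllib.request


def build_token_request(token_url: str, client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) an OAuth 2.0 client credentials token request."""
    form = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    return urllib.request.Request(
        token_url,
        data=form,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def build_nifi_request(nifi_url: str, access_token: str) -> urllib.request.Request:
    """Build a NiFi REST API request that presents the token as a Bearer credential."""
    return urllib.request.Request(
        nifi_url,
        headers={"Authorization": f"Bearer {access_token}"},
    )
```

A client would pass the first request to `urllib.request.urlopen`, read the `access_token` field from the JSON response, and then use it for calls such as `build_nifi_request("https://nifi.example.com:8443/nifi-api/flow/about", token)`.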

In addition to these innovations, Stackable is enhancing authorisation in NiFi within its Data Platform: We are aiming for fine-grained policy-based authorisation with Rego rules and the Open Policy Agent, consistent with other platform tools.

To achieve this, the Stackable NiFi product image will include the OPA Authoriser for Apache NiFi. This plugin was developed by David Gitter in partnership with Ordix and Stackable. Thank you very much for supporting us!

Summary

New functionality, up-to-date frameworks, robust Python integration, native Kubernetes support and much more: the developers’ hard work on the next generation of Apache NiFi has really paid off! These advancements establish a strong base for ongoing innovation and future improvements.

Apache NiFi is poised to be an extremely valuable tool for data and machine learning engineers and remains a key component of the Stackable Data Platform.
