+

Data Feast Weekly

Get weekly insights on modern data delivered to your inbox, straight from our hand-picked curations!

Data Orchestration for Data Products - Hugo Lu
Data Orchestration for Data Products - Hugo Lu

Data Orchestration for Data Products - Hugo Lu

7 min
|
Open Source Tools, All-in-One Control Planes, & Impact on Data Product Velocity
Apr 11, 2024
Hugo Lu
,
 and
Vivek Dubey
,
  and

Originally published on

Modern Data 101 Newsletter

,

the following is a revised edition.




This piece is a community contribution from Hugo Lu, founder of Orchestra. He extensively talks about the future of data and how data engineers (and data teams in general) can overcome the most common but also the most challenging problems facing many industries today. We highly appreciate his contribution & readiness to share his knowledge with MD101.

We actively collaborate with data experts to bring the best resources to a 4000+ strong community of data practitioners. If you have something to say on Modern Data practices & innovations, feel free to reach out! Note: All submissions are vetted for quality & relevance.


Data teams everywhere are currently battling an uphill struggle. On the one hand, the demands for Data Products have never been greater. AI has increased demand (and CEO appetite) for data-driven features, products and capabilities - these are all facilitated by Data Pipelines and effective management of data and data collection.

On the other hand, Data teams’ budgets are being cut, and headcount is reduced. As a result, their ability to organize data is reduced, and as a result, there continue to be many disparate data sources (structured and unstructured data) that organisations cannot use.

One of the biggest catalysts to velocity and success is having a unified control plane that eliminates the pain of using manual data and consolidates data orchestration, data observability, lineage, metadata gathering, and alerting. In this article, we’ll examine the role the data orchestration process plays in accelerating Data and AI Product Velocity and what you can do to adopt it.

📝 Note from Editor: Learn more on unified control planes and experiences

- Control planes in the community-driven Data Developer Platform standard

- Data Industry’s convergence to Unified Experiences

- Fragmented tools to Unified Interfaces

Open Source Workflow Orchestration Tools vs. All-in-One Control Planes


Open Source Workflow Orchestration Tools like Airflow and Prefect offer powerful capabilities for managing and automating data workflows, but they differ in their approach and features compared to All-in-One Control Planes.

Airflow, an open-source tool developed by Airbnb, focuses on workflow scheduling, monitoring, and management. It uses Directed Acyclic Graphs (DAGs) to define workflows and supports a wide range of integrations with external systems. Airflow's strength lies in its flexibility and scalability, allowing users to create complex workflows and handle large volumes of data.

On the other hand, Prefect is another open-source workflow orchestration tool that emphasizes ease of use, reliability, and data versioning. Prefect offers a more intuitive interface for defining workflows, with features like automatic retry logic, task dependencies, and a central dashboard for monitoring and managing workflows.

When comparing Open Source Workflow Orchestration Tools like Airflow and Prefect with All-in-One Control Planes, such as platforms like Orchestra, Apache NiFi, DataOS, or proprietary solutions like AWS Step Functions, the major difference is that All-in-One Control Planes often provide a more streamlined user experience and have lower learning curves compared to open-source tools. They are able to offer standard interfaces in preferred forms like graphical presentation, drag-and-drop functionality, and pre-built templates for common workflows, making them accessible to users with less technical expertise.

However, the true power of the control plane is its ability to interface with any and all entities in the data ecosystem, which completely cuts down needless integration overheads. More interestingly, a central control plane has downright access to global metadata as well as governance protocols, enabling orchestration to have complete context around data as well as operate securely.

📝 Note from Editor:

Learn more about the
High Internal Quality of Unified Architectures that support a central or “all-in-one” control plane for orchetration, metadata management, and governance.

OSS Orchestration Systems Promote an Anti-pattern


Existing OSS Workflows promote a monolithic approach to data architecture. This means that there is a single repository and a single application that are capable of handling all the components of a “Modern Data Stack”.

Many people simply don’t have this architecture. Building Data Services is hard. Building Data Pipelines often requires services specialised at different parts of the Pipeline—indeed, it’s often necessitated by having different teams.

For example, the Data Engineering team may be responsible for managing Kafka or another event-based streaming application. The Analytics team may be responsible for a dbt repository. These need to be separate.

Therefore, leveraging a Modern Data Stack with an OSS Workflow Orchestration tool is somewhat of an anti-pattern and necessitates an additional (and expensive) observability platform. The phrase “Quis custodiet ipsos custodes?” or “Who will Guard the Guardians?” springs to mind.

The image illustrates the process to enable effective microservices  architecture
How to Enable an Effective Microservices Architecture | Image Courtesy: Author

How Unified Data Orchestration Impacts Data Product Velocity


Having a unified control plane for automating the Data Orchestration Process brings a number of advantages.

Remove boilerplate work

By leveraging existing infrastructure for all heavy computation and letting the Data Orchestration tool only coordinate existing processes, engineers do not need to learn a new skill (e.g., Airflow), deploy any infrastructure (e.g., Kubernetes), or grapple with tricky UIs.

Engineers and Data Teams focus on Data Ingestion, Data cleaning, Analytics and Machine Learning and can rely on Orchestration to pick up the rest

Stop worrying about cost and metadata

Using a non-unified Data or Workflow Orchestration tool means valuable metadata sits ununified within source systems and may even disappear due to data lifecycle policies.

Unified Orchestration and Observability mean this data is efficiently gathered in real-time and made available to Data Teams. This makes managing metadata across disparate data sources trivial.

Calculate Business Value

A unified Orchestration platform allows Data Engineers to quantify the business value of Data and AI Products. For example, fetching cost data per query and AWS Compute cost, alongside Dashboard and Machine Learning model usage statistics.

Having this business-critical, operational data at a Data Product level means Data Teams can not only prioritise more efficiently but also finally have a window to the business to showcase their value and their worth

A unified data orchestration and data observability platform can rapidly accelerate product velocity by removing all boilerplate work, gathering helpful metadata, and presenting it at a Data Product Level. In addition to saving engineers time surmounting technical challenges, there are numerous intangible benefits.

Proactively monitoring data quality, pre-empting data issues and informing business stakeholders (thereby gaining trust), and ruthlessly efficiently prioritising data and AI initiatives are all possible with a unified Data Orchestration platform or Data Control Plane. Control Planes like Orchestra are paving the way for businesses to take advantage of their data at an unprecedented rate.

Where Do Existing Data Orchestration Solutions Fall Short


Existing orchestration solutions often fall short in several key areas. Firstly, they often force users to write Python code in a large monorepo, making it difficult to scale and manage as the codebase grows. This can lead to increased complexity and maintenance challenges.

Secondly, there's a lack of visibility and end-to-end metadata tracking in many existing tools. This makes it hard for users to understand the flow of data and processes, leading to potential issues with debugging and monitoring.

Lastly, there's a notable absence of user-friendly interfaces that provide comprehensive insights into how data and AI products are performing. This lack of a UI hampers users' ability to easily monitor and optimize their workflows for better performance and efficiency.

📝 Note from Editor: Learn more here

Lack of Multi-Tenancy and Governance

In an environment with multiple teams using multiple different pieces of software, it is impossible to get a multi-tenanted view of the entire data operation. This makes debugging lineage graphs across teams difficult, challenges cross-collaboration, and makes data pipelines difficult to manage

Compliance with Security

In a monolithic repository, both individuals and clusters have access to more secrets than they should. This increases the risk of failure as multiple people have access to parts of the code base they don’t really need access to

Alerting and Ownership

Alerting and Data Asset ownership becomes challenging if there are multiple repositories used by different teams.

Streamlined Continuous Integration

An oft-overlooked factor, Continuous Integration for Data pipeline repositories is relatively straightforward for service-based microservices such as data ingestion scripts.

However, for Analytical repos leveraging dbt, something like dbt Slim CI is required or GIt-Control for Data a la Project Nessie. Ensuring the bare minimum CI runs in this environment is challenging enough, and it gets even more complicated if the code for data ingestion, streaming, transformation, and operational analytics all exists in the same place.

Summary


In essence, it is important to come down to the fundamentals - the product approach. What does it mean to become product-driven? Complete focus on user experience and upliftment. Reduction of effort to do XYZ job.

When we build data products, we don’t just cater to the customer as a “user”, but even the data engineer and analytics engineer are “users” of the data product approach. Their lives are constantly impacted by the way organisations choose to build data, which in turn always ends up impacting the end-users.

The product approach isn’t one giant strategy - it’s, in fact, an atomic strategy that comes alive at every little corner and every last turn of the data workflow. Orchestration is no exception in this narrative and, in fact, drives it to a large degree. In the context of data products, orchestration also adopts the user-first approach, enabling data and analytics engineers to breathe a little easy.

📝Author Connect


Find me on LinkedIn | Explore Orchestra

Note from Editor: Orchestra curates a blog authored by the Orchestra team & guest writers. You’ll also find some cool whitepapers for more in-depth info; feel free to dig in!
// Text truncation functionality const elements = document.querySelectorAll('[ms-code-truncate]'); elements.forEach((element) => { const charLimit = parseInt(element.getAttribute('ms-code-truncate')); // Helper function to recursively traverse the DOM and truncate text nodes const traverseNodes = (node, count) => { for (let child of node.childNodes) { if (child.nodeType === Node.TEXT_NODE) { if (count + child.textContent.length > charLimit) { child.textContent = child.textContent.slice(0, charLimit - count) + '...'; return count + child.textContent.length; } count += child.textContent.length; } else if (child.nodeType === Node.ELEMENT_NODE) { count = traverseNodes(child, count); } } return count; } // Create a clone to work on without modifying the original element const clone = element.cloneNode(true); traverseNodes(clone, 0); // Replace the original element with the truncated version element.parentNode.replaceChild(clone, element); }); });