
Module #4

1. Introduction to Data Products

Get introduced to the concept of Data Products, their characteristics, and their importance in the data ecosystem. The chapter also discusses the significance of tying data initiatives to the overall business model.

What is a Data Product?

In Data Mesh speak, a Data Product is an architectural quantum, which is the “smallest unit of architecture that can be independently deployed with high functional cohesion and includes all the structural elements required for its function.”

The term “Data Product” is quite new, but the idea it describes has persisted for as long as businesses have worked with data, that is, since a caveman exchanged, say, three bags of berries for one piece of steak! Data Products exhibit all the qualities that we have always desired in data:

Discoverable: through appropriate titles, descriptions, tags, filters, and categories.
Addressable: across the data ecosystem through unique data links.
Understandable: through rich semantic data - descriptions, tags, data profiles, usage metrics, and more.
Natively Accessible: through output ports with connectivity to the user’s specific domain. Users can connect directly and consume data by selecting the required port.
Trustworthy: through approval marks, quality scores, importance scores, ranks, ratings, filters, and more.
Interoperable: by enabling data products, and the data within them, to communicate with each other and with external entities.
Independent: with complete self-service and no dependencies on external disruptive agents. Every dependency is contained within the realm of the data product.
Secure: by applying the necessary access and masking policies.


But qualities alone do not make up a data product. There are several definitions flying around, as is natural whenever a new term pops up. So instead of chasing a definitive definition, it is often more helpful to understand its meta-definition, which is characteristically more stable and unchanging. A meta-definition is a description of the context around the definition.

We can see the meta-definition of data products as a function of purpose and execution. While the data product attributes we saw above are the desired features of the data (purpose), a data product must also bring with it a way to produce that outcome (execution).

In essence, you cannot talk about data products without establishing how they meet their purpose.


Practically, there are three cogs to this:

Data & Metadata: The Protagonists
Code: The Writer
Infrastructure: The Stage

The Maker has complete freedom to work with any data, code as necessary, and set up any infrastructure of choice. But to produce the play, the three must come together to tell the story of purpose and adventure in unison.

Why Data Products Now?

Code and data have never been coupled so thoughtfully before. It was barely even a notion a few years back. Data passed through different hands unchecked, and even though we admitted it was an asset, we failed to treat it like one.

The Realisation

Contrastingly, product assets such as features and capabilities have always been closely tied to code for ease of management, governance, and exactness. Once enterprises became savvier at understanding and working with their own data, the need to couple it with code resonated widely. This is one of the reasons why Data Mesh and Data Products exploded in conference rooms and data strategy meetings.

The Details

A data product is a collection of several structural components, one of them being code. This component includes:

1. The logic, algorithms, and processing pipelines that drive the functionality behind the business purpose.
2. APIs (among other interfaces) for accessing semantics, schema, metrics, and metadata.
3. The implementation of policies, compliance, lineage, provenance, and more.

The original article on martinfowler.com also describes it very succinctly, alongside the other components.

The Core Philosophy

What we like to stress consistently is the ability to manage the disparate limbs of a data lifecycle with “all code in one place”: a unified experience. Everything, from reading sources, defining scanners, and monitoring sources and pipelines, to writing workflows and SLOs, is consolidated under one repo, with every function, interface, and resource for a purpose deployed from a single code file, as sketched after the list below.

1. Containerisation of the entire process

2. Unified experience instead of fragmented experience

3. A Data Product DAG with end-to-end management & observability
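
To make the idea concrete, here is a hedged sketch of what such a single consolidated definition could look like. It is written as a plain Python structure purely for illustration; the keys, resource names, and check syntax are assumptions, not the DDP standard's actual grammar:

```python
# Hypothetical single-file definition of a data product: sources, scanners,
# workflows, SLOs, ports, and policies consolidated in one place.
customer_360_product = {
    "name": "customer-360",
    "owner": "growth-domain",
    "inputs": [   # sources the product reads from
        {"name": "crm_accounts", "type": "postgres", "address": "crm.accounts"},
        {"name": "web_events", "type": "kafka", "address": "events.web_clicks"},
    ],
    "scanners": [  # monitoring and profiling on sources and pipelines
        {"target": "crm_accounts", "checks": ["not_null(account_id)", "freshness < 24h"]},
    ],
    "workflows": [  # transformation and serving logic
        {"name": "join_accounts_events", "engine": "sql", "script": "sql/join.sql"},
    ],
    "slos": {"freshness_hours": 24, "availability": 0.99},
    "output_ports": [
        {"name": "analytics", "format": "parquet", "protocol": "object-store"},
        {"name": "activation", "format": "json", "protocol": "rest"},
    ],
    "policies": ["mask(email)", "row_filter(region == user.region)"],
}
```

Everything the product needs, from ingestion to serving, lives in one repo and one deployable definition, which is what makes end-to-end management and observability of the Data Product DAG possible.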

The Implementation

Is there a way to achieve this realistically? Yes. In a thousand different ways.

But there are a few core capabilities necessary for bringing these disparate activities under one common umbrella, as outlined by the Data Developer Platform (DDP) Standard.


The 3 Data Product Components


To become an independent architectural quantum, a Data Product needs to wrap three structural components:

Code

The code required for data management is far from simple or straightforward. It encompasses data pipelines (ingestion, transformation, and serving), APIs (access, metadata, and metrics), and enforcement (policies, compliance, lineage, and provenance).

The reason why the data ecosystem is not as robust as the software ecosystem is that code for data management was never approached from a software development angle. In prevalent data stacks, all these code components are isolated and are unable to talk to each other.

Note: We do not just suffer from data silos, but there’s a much deeper problem. We also suffer tremendously from data code silos.

A Data Developer Platform approaches data as software and wraps code to enable object-oriented capabilities such as abstraction, encapsulation, modularity, inheritance, and polymorphism across all the components of the data stack. As an architectural quantum, the code becomes a part of the independent unit that is served as a Data Product.
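
As a hedged, minimal sketch of what "approaching data as software" could look like, the snippet below wraps a pipeline, policies, and output ports into one deployable unit. The class and method names are illustrative assumptions, not a real platform API:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Minimal illustration: one unit encapsulating pipeline code, policy checks,
# and output ports, so the whole product can be deployed and served together.
@dataclass
class DataProduct:
    name: str
    pipeline: Callable[[], Dict]                              # ingestion + transformation logic
    policies: List[Callable[[Dict], bool]] = field(default_factory=list)
    output_ports: Dict[str, Callable[[Dict], object]] = field(default_factory=dict)

    def build(self) -> Dict:
        data = self.pipeline()                                # run the bundled pipeline
        for policy in self.policies:                          # enforce the bundled policies
            if not policy(data):
                raise ValueError(f"{self.name}: policy check failed, build aborted")
        return data

    def serve(self, port: str) -> object:
        return self.output_ports[port](self.build())          # expose data through a named port


# Usage: the product builds, governs, and serves itself as one quantum.
product = DataProduct(
    name="customer-360",
    pipeline=lambda: {"rows": [{"id": 1, "email": "a@b.com"}]},
    policies=[lambda d: len(d["rows"]) > 0],
    output_ports={"analytics": lambda d: d["rows"]},
)
print(product.serve("analytics"))
```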

Becoming data-first within weeks becomes possible through the high internal architectural quality of a DFS, which is composed of primitive components that talk to each other to build finer and more complex data solutions: Unification through Modularisation.

The spiel on code composability is, in fact, endless. More so if we start talking about how data contracts fit into the enforcement narrative, or how federated governance plays out, or how metadata is centrally managed and utilized. Perhaps, I’ll reserve it all for a dedicated piece on the code aspect of data products.

Data & Metadata

Data is undeniably powerless without metadata. The Data Product construct understands this and augments heterogeneous sources to tap into rich metadata. While data can be served in multiple formats, as is usually the need, metadata allows the data product to maintain universal semantics.

Note: Universal semantics allows the data product to become addressable, discoverable, interoperable, and natively accessible.

With access to universal metadata, it is possible to identify associations between data assets and to surface lineage, provenance, observability metrics, and key relations. It also enables the discovery of latent information, and with open APIs, users can augment and leverage this data programmatically.
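
As a rough sketch of that programmatic access (the endpoint paths and response shapes below are assumptions, not a specific platform's metadata API):

```python
import json
from urllib.request import urlopen

# Hypothetical open metadata API; URL and routes are illustrative only.
METADATA_API = "https://metadata.example.com/api/v1"

def get_lineage(asset: str) -> dict:
    """Fetch upstream and downstream lineage recorded for a data asset."""
    with urlopen(f"{METADATA_API}/assets/{asset}/lineage") as resp:
        return json.load(resp)

def find_related(asset: str) -> list:
    """Surface associated assets (shared columns, co-usage) to aid discovery."""
    with urlopen(f"{METADATA_API}/assets/{asset}/related") as resp:
        return json.load(resp)["related_assets"]
```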

With such rich information, data users can throw more light on the data and resurrect it from the limitations of ‘dark data’, which is rich, yet dormant information limited by subpar semantics.

A Data Developer Platform enables high-definition metadata through the dual-plane principle. The central control plane has complete visibility across the data ecosystem and is able to oversee metadata flow from heterogeneous, polyglot sources. The data planes, on the other hand, are isolated instances deployed for niche domains, specific data product initiatives, or even individual use cases.

Having contracts as a key resource of DFS adds more colour. If you think about it fundamentally, a data contract is nothing but metadata + enforcement. Establishing a contract inherently means establishing declarative enforcement. Contracts can either pull metadata from the central plane or be explicitly defined, and once established, they act as a guarantee on the metadata specifications and make change management (one of the biggest challenges) practically seamless.
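
A hedged sketch of that "metadata + enforcement" idea, with field names and checks chosen purely for illustration:

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative data contract: metadata (schema, owner, SLO) plus declarative
# enforcement that can be checked on every new batch of data.
@dataclass
class DataContract:
    dataset: str
    owner: str
    schema: Dict[str, str]                         # column -> expected type
    required_columns: List[str] = field(default_factory=list)
    freshness_hours: int = 24                      # guarantee the producer signs up for

    def enforce(self, batch_schema: Dict[str, str], age_hours: float) -> List[str]:
        """Return a list of violations; an empty list means the batch honours the contract."""
        violations = []
        for col in self.required_columns:
            if col not in batch_schema:
                violations.append(f"missing column: {col}")
        for col, expected in self.schema.items():
            actual = batch_schema.get(col)
            if actual is not None and actual != expected:
                violations.append(f"type drift on {col}: {actual} != {expected}")
        if age_hours > self.freshness_hours:
            violations.append(f"stale data: {age_hours}h old, limit {self.freshness_hours}h")
        return violations
```

Because the contract is declared once and checked mechanically, a change that would break downstream consumers surfaces as a violation before it propagates, which is what makes change management far less painful.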

Infrastructure

The infra is the supporting foundation that enables the code for data products and ensures that the whole nine yards of governance, metadata, and orchestration are declaratively taken care of.

The Data Developer Platform has a unified architecture ideology laced with contractual handshakes. This unique combination allows data accountability to be reversed and shifted closer to the data source, as is the requirement for a true data product. You’ll see how just a few scrolls down.

While dedicated data product owners are typically essential to produce and maintain data products, the infrastructure powering the DFS frees the organisation and its data teams from the additional friction of onboarding, training, and maintaining dedicated data product developers or data product developer teams.

Note: Any data developer is a data product developer when armed with a Data Developer Platform.

Approach to Establishing Data Products

A typical data lifecycle assigns the highest weight to data analysis and processing: understanding the data. If you remember the ML hype, when hordes of data professionals seeped into the ML stack, you'll remember how one stat continuously surfaced to ground it: almost 80% of the time and effort is spent on understanding and processing the data, irrespective of fancy models and optimisation algorithms.

What’s the Product parallel?

When building products, the parallel of the analysis stage or the most weighted stage is the design aspect. This involves:

Strategising
Designing
Finalising business and operational metrics
Validating the impact and ties with business KPIs

Essentially, this answers the most important question: Is it worth spending time, effort, and the customer's money on this initiative, or could we do better with another model?

Sidenote: Given the resource angle, products cannot be developed in vast numbers. In a product approach, efforts and designs need high optimisation. The same is true for data. You should ideally develop only a handful of core data products that form the backbone of your business, just like you have a few core data models.

The business model is the pinnacle of the business value proposition, and unfortunately, so far we have not been able to closely tie our data-related efforts back to it. This is primarily due to a bottom-up approach in which data teams become responsible for the business model (via data models), leading to heavy debt, high maintenance, and little weight given to iterating on business feedback.

With the advent of the data product construct, we now have the opportunity to tie the data experience and product strategies together. In essence, this demands heavy weight on the product design aspect, a superset of the analysis segment: you also collaboratively design the product wireframe, run validations, test against the defined metrics, and iterate on the design, all before diving into the build and apply stages.

Coming from a product and engineering background, I find this neat sync between the two cycles very harmonious in terms of the value proposition of an org's data. It helps us build a case for data initiatives, and it helps us repeatedly ask and precisely answer, "Does this initiative directly impact the consumer's experience, or is this initiative even necessary?"

Focusing on the tiny shifts instead of big spectacular changes

Data Products may be a new phrase doing the rounds, but we have been trying to achieve the state of data products since long before we knew the term. Data Products are nothing but data that is easily discoverable, usable, of high quality, and well-governed; in short, trustworthy and useful data. If we look at prevalent systems, data teams are far from achieving a state of reliable data, which is a big red flag for businesses.

Data producers are often unaware of how their data is used; in fact, most of the time they do not even know that downstream applications are using it. Data consumers cannot expect data to match their requirements if it comes from folks who are not even privy to the fact that someone is using that data. Changing this situation does not require rocket science; it simply requires adding a few enablers to already existing processes.

Pushing ownership of data closer to the source (to the data producer)

  • Make the producer aware of how their data is used in downstream use cases
  • Incentivise them with a view of the ROI impact of their data
  • Give them the right tools to define specifications on their data (aka contracts)

Providing the foundation for quality and policy enforcement

  • Specifications defined by producers should be machine-readable
  • They should be enforceable across multiple feedback loops, tied to progressive processes that combat gradual quality derailment
  • Progressive quality checks should run before the data is made available for consumption


While the first stage is established through contracts, the second is enabled through data developer platforms with product- and software-like approaches to data.
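
As a hedged illustration of that second stage, the sketch below gates publication on the producer's machine-readable specification; the check names and thresholds are assumptions, not a prescribed mechanism:

```python
# Illustrative pre-publication quality gate: data reaches consumers only after
# the producer-defined, machine-readable checks pass; failures feed back to the producer.
def check_completeness(rows: list, required: list) -> bool:
    return all(all(r.get(col) is not None for col in required) for r in rows)

def check_volume(rows: list, minimum: int) -> bool:
    return len(rows) >= minimum

def publish_if_healthy(rows: list, spec: dict, publish) -> bool:
    gates = [
        ("completeness", check_completeness(rows, spec["required_columns"])),
        ("volume", check_volume(rows, spec["min_rows"])),
    ]
    failed = [name for name, ok in gates if not ok]
    if failed:
        print(f"quarantined: failed gates {failed}")   # feedback loop to the producer
        return False
    publish(rows)                                      # only healthy data is made available
    return True
```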