Lean Six Sigma and Data-as-a-Product: A Winning Combination (Part 1 of 4)

October 2, 2024 - 4 minutes reading time
Article by Natan Van Der Knaap

In this new series of articles, you'll discover how data and process optimization are closely interconnected. In this first part, we dive into the concept of Data-as-a-Product, laying the foundation on which we’ll apply Lean Six Sigma in the upcoming articles. Lean Six Sigma is typically applied to improve processes with the help of data products. In this series, we focus instead on how the methodology can be applied to developing the data products themselves.

Much has been said and written about Lean Six Sigma. Increasingly, this is also true for data-driven work. However, a combination of these two worlds is still uncommon. This is surprising, considering that data and processes are fundamentally inseparable. As Van Gils (author of Data Management: A Gentle Introduction) puts it: "If processes are the value-creation engines of an organization, then data is the fuel."

The better you execute your core processes, the more direct value you create for your customers. To improve these core processes, you can use data. On the one hand, data helps make processes more efficient. On the other, it allows for better decision-making, increasing the effectiveness of a process. This is the essence of data-driven work.

Data as a Product

When viewed this way, data becomes a product for the processes. The concept of data as a product is not new. According to Zhamak Dehghani, the creator of the data mesh (a distributed architecture for data management), data as a product is defined as "an autonomous, optimized, standardized data unit containing at least one dataset, designed to meet user needs." When we talk about a data product, we’re referring to a dataset that is optimized for use: a clearly defined set that can be found in a catalog and is carefully prepared so customers can use it safely. In this sense, you can compare the raw data from a source system to crude oil, and the data products to refined fuel for the value-creation engine that Van Gils mentions (see the introduction).

Currently, data potential often remains hidden within silos (domains within an organization that work in isolation, resulting in disconnected databases), rendering the data inaccessible and underutilized. By treating data as a product, you can keep data collection and decision-making localized in areas with the most knowledge about the data, while the outcome (the data product) can be used across domains or even organizations. This way, even within public organizations that have a silo structure, the potential of data can be unlocked for internal and external customers.

'Seeing data as an asset in its own right creates the need to manage it as if it were a product.'

This approach shifts the perception of data from a byproduct to an independent asset capable of creating value. By treating data as an asset in its own right, you introduce the need to manage it as if it were a product. This means you can apply product management principles to data products. Product management is defined as "the strategic process of an organization to manage every step of the product lifecycle, taking into account both business and consumer needs."

Data Product Attributes

When applying product management to data products, you're essentially focusing on data management and governance at every stage of the product lifecycle, with the ultimate goal of meeting the needs of data users.

Dehghani outlines the following attributes for data products to ensure optimal use:

  • Findable: The data product must be easy to find by both humans and machines via a well-structured catalog or metadata register.
  • Addressable: The data product must be accessible to the user with clear instructions.
  • Reliable: The data product must be accurate and consistent, with mechanisms in place to ensure quality and integrity.
  • Self-describing: The data product should include clear metadata and documentation that explains its content and use, making it reusable.
  • Interoperable: The data product must use standardized formats and interfaces to work seamlessly with other systems and data.
  • Secure: The data product must be protected through measures like encryption or access control to prevent unauthorized access.

The ownership of data products and the responsibility to meet the above attributes should rest with the areas where they are created. That’s where the most domain knowledge exists to make the most informed decisions. This aligns well with a federated decision-making strategy. Formalizing this allows for better responses to user needs and maximizes the intrinsic value of the data product.
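As a minimal sketch of how a domain team might make the six attributes above checkable, the Python example below models a data product as a small descriptor and reports which attributes are not yet covered. The class `DataProduct`, its field names, and the function `missing_attributes` are hypothetical and chosen for illustration only; they do not come from Dehghani or from any specific catalog tool, and a non-empty field is only a crude stand-in for actually meeting an attribute.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataProduct:
    """Hypothetical descriptor of a data product (illustrative field names)."""
    name: str
    owner_domain: str                                         # domain responsible for the product
    catalog_entry: str = ""                                   # findable: where it is registered
    address: str = ""                                         # addressable: where users reach it
    quality_checks: List[str] = field(default_factory=list)   # reliable
    description: str = ""                                     # self-describing
    data_format: str = ""                                     # interoperable
    access_roles: List[str] = field(default_factory=list)     # secure


def missing_attributes(product: DataProduct) -> List[str]:
    """Return the attributes for which the descriptor has no information yet."""
    checks = {
        "findable": bool(product.catalog_entry),
        "addressable": bool(product.address),
        "reliable": bool(product.quality_checks),
        "self-describing": bool(product.description),
        "interoperable": bool(product.data_format),
        "secure": bool(product.access_roles),
    }
    return [attribute for attribute, covered in checks.items() if not covered]


# Example with made-up values: documentation and access control are still missing.
permits = DataProduct(
    name="building_permits",
    owner_domain="licensing",
    catalog_entry="catalog/licensing/building_permits",
    address="warehouse.licensing.building_permits",
    quality_checks=["no duplicate permit ids"],
    data_format="parquet",
)
print(missing_attributes(permits))
```

A check like this fits the federated approach described above: the owning domain fills in the descriptor, while the list of required attributes is agreed centrally.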

Using a Data Product

The use of a data product can vary greatly and, like any other product, depends on the needs of different customers. This makes the "optimized for use" aspect (see Dehghani's definition) particularly complex. For example, an internal user may use the dataset for a dashboard, while an external user may want the same dataset to benchmark organizations. Both users, and possibly many others, may have different requirements and expectations for the optimal use of the dataset.

To manage this complexity, you can compare your data product to a manufactured car. Do you produce a standard car that meets general requirements and expectations, or do you offer a customized solution? Both approaches have their pros and cons: a standard solution is more efficient, while customization requires more effort for each customer, justifying a higher price. This is a matter of cost calculation, which can then be translated into a price for external customers. For internal customers, the data product can be treated as a "service," with agreements made about what is realistic and achievable.

The Data Process of a Data Product

To see data as a product, it's crucial to make the process of creating the data product transparent. A process is defined as "a series of actions or steps taken to achieve a specific goal" (Tempelman & Schildmeijer, authors of Lean & Six Sigma in Practice). The data product process involves the various steps needed to achieve the ultimate goal: an optimized-for-use dataset. We previously established that this is achieved when a data product is findable, addressable, reliable, self-describing, interoperable, and secure.

To realize this, you follow either an operational process or a development process. If the product already exists in the catalog (a central storage point with metadata about the datasets), similar to an existing product in a store, you're dealing with operational process steps such as generating, editing, and making datasets available.

If it’s a new product, you go through different steps, such as defining and modeling datasets to meet new needs. This falls under product development, where you must carefully inquire about requirements and make agreements on product attributes like latency, security, privacy, and quality. These agreements can be recorded in a Service Level Agreement (SLA), a contract outlining the terms between provider and customer. The result is a new product in the catalog, which is then made available to the customer through the operational process.
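To illustrate how agreements on product attributes could be made verifiable, here is a minimal Python sketch that records two SLA terms and compares them with a measured delivery. The names `ServiceLevelAgreement`, `DeliveryMeasurement`, and `sla_violations`, as well as the choice of latency in hours and completeness as a fraction, are assumptions for this example. Security and privacy terms are left out because they are harder to reduce to a single number.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ServiceLevelAgreement:
    """Hypothetical SLA terms agreed between provider and customer."""
    max_latency_hours: float   # data may be at most this old on delivery
    min_completeness: float    # required share of filled mandatory fields (0 to 1)


@dataclass(frozen=True)
class DeliveryMeasurement:
    """What was actually measured for one delivery of the data product."""
    latency_hours: float
    completeness: float


def sla_violations(sla: ServiceLevelAgreement,
                   measured: DeliveryMeasurement) -> List[str]:
    """Return the names of the SLA terms that this delivery does not meet."""
    violations = []
    if measured.latency_hours > sla.max_latency_hours:
        violations.append("latency")
    if measured.completeness < sla.min_completeness:
        violations.append("quality")
    return violations


# Example with made-up numbers: the delivery is on time but not complete enough.
sla = ServiceLevelAgreement(max_latency_hours=24, min_completeness=0.98)
delivery = DeliveryMeasurement(latency_hours=6, completeness=0.95)
print(sla_violations(sla, delivery))
```

Recording the terms as data rather than only in a document means the operational process can check every delivery against them automatically.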

'It is important to understand the process steps of the data product to ensure quality and security.'

Process Optimization

Making the data product’s process steps transparent is important for ensuring quality and security. It helps establish accountability and ensures that the process is consistent and repeatable. Once the steps are transparent, you can optimize the process. This can be done in two ways: by improving the output (the data product) to better meet customer requirements (effectiveness), or by improving the process itself by eliminating waste (efficiency). For the attentive reader: ensuring data meets user requirements is the definition of data quality and is a prerequisite for effective data-driven work (see my previous article).

Two popular approaches to process improvement are Lean and Six Sigma. Lean focuses on maximizing customer value by reducing waste and streamlining processes. Six Sigma focuses on reducing variability and defects through statistical analysis. Both methodologies, like product management, take customer value as a starting point.
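To give a feel for the kind of statistical measure Six Sigma works with, the sketch below computes defects per million opportunities (DPMO), a standard Six Sigma metric. Treating each checked field of a record as one "opportunity" for a defect is an assumption made for this illustration, not something this article prescribes, and the numbers are invented.

```python
def dpmo(defects: int, units: int, opportunities_per_unit: int) -> float:
    """Defects per million opportunities: defects / (units * opportunities) * 1,000,000."""
    total_opportunities = units * opportunities_per_unit
    if total_opportunities <= 0:
        raise ValueError("units and opportunities_per_unit must be positive")
    return defects * 1_000_000 / total_opportunities


# Made-up example: 10,000 records, 5 checked fields per record,
# and 150 field values that fail a quality rule.
print(dpmo(defects=150, units=10_000, opportunities_per_unit=5))
```

A metric like this makes the quality of a data product comparable over time, which is a precondition for showing that a process change actually reduced defects.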

In the next articles, we will explore the improvement possibilities of applying Six Sigma and Lean to the data product process to better meet user needs. It is crucial to increase the intrinsic value of data because the better the data product process, the higher the quality of the data product itself. And the higher that quality is, the more effectively it can be used in follow-up analyses, ultimately leading to more value creation in (primary) processes. After all, an engine needs fuel to keep moving forward, and the better the fuel, the more powerful the engine!
