Lean Six Sigma and Data-as-a-Product: a winning combination (part 2 of 4)

November 4, 2024 - 5 minutes reading time
Article by Natan Van Der Knaap

In the previous article, you learned how to approach data as a product to create datasets that are optimized for specific use cases. This is an important prerequisite for data-driven operations and enables organizations to obtain reliable insights at acceptable costs that directly contribute to value creation. Understanding the process behind data products is essential to ensure quality and security. In this article, we focus on how the Six Sigma methodology can help monitor and optimize the operational process steps of a data product, such as generating, processing, or making datasets available.

Every step an organization takes to create a data product affects the quality of the end result. Therefore, it's essential to define what constitutes a suitable outcome for each step, ensuring the final product meets the expectations of its consumers. Six Sigma focuses on achieving consistent and predictable results in processes, aligned with customer needs (Tempelman & Schildmeijer, 2023). Deviations from these predictable results are called variations. Variations can occur in the end result of the process, the data product, but also in the individual steps of the operational process. Reducing these variations is crucial for increasing the effectiveness and efficiency of the process steps, and thus for the overall quality of the data product.

Application of Six Sigma in relation to data products: a case study

To clarify the application of Six Sigma in relation to data products, I will discuss a case study that shows both the customer's perspective and that of an internal department. This case illustrates how responsibilities are divided between the customer, who sets specific quality requirements, and the organization, which must ensure the data product meets these requirements, provided they are realistic, safe, and feasible.

From a customer perspective: An analyst at a consultancy organization is developing a predictive model for peaks in passport applications. This model uses various data products, such as data on passport expiration, appointment registration, and employee availability. The analyst has set specific quality requirements for these data products before they can be used in the model. This is essential to ensure that the insights from the predictive model are reliable and usable. Looking back at article 1 in this series, this is comparable to an engine that sets certain requirements for fuel quality.

From an HR perspective: The HR department manager, responsible for the employee availability data product, knows exactly how the data is collected and who has rights to view it. When an external customer asked to use this data product in a prediction model, it was decided after consultation that the original dataset could not be shared. However, an anonymized version could be provided as a basis for access to the final insights. The customer had specific quality requirements for the dataset. To meet these requirements, the HR department wanted to use Six Sigma to gain structured insight into the quality of its operational data processes and to demonstrate that it could meet the set requirements. This enables the department to deliver consistent and reliable data products.

The first step within Six Sigma is establishing the specific quality requirements and needs of the customer, also known as the Voice of the Customer (VoC). This input is then translated into measurable process requirements, or Critical to Quality (CTQ) characteristics. The CTQ characteristics are the measurable performance standards that the operational process must achieve to meet the customer expectations captured in the VoC. The result is a data product that is aligned with specific use cases, as discussed in article 1. By consistently monitoring and optimizing these CTQ characteristics, variation in the process can be reduced, which increases the reliability and quality of the data product.

Example VoC: "I need a weekly dataset with (anonymized) employee data, including their availability for the next two weeks and the number of hours they actually worked."

Translation into CTQ characteristics:

  1. Timeliness: A new dataset version must be available weekly on Friday.
  2. Completeness: 100% of the (anonymized) employees and their planned hours for the next two weeks must be in the dataset.
  3. Accuracy: At least 90% of the 'actually worked' column must be correct.
  4. Consistency: For the same employee, availability and hours worked must correspond.
  5. Validity: The number of hours worked must be displayed in whole numbers.
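
Some of these CTQ characteristics translate almost directly into a measurement. The sketch below is one possible way to measure CTQ 3 (accuracy) and CTQ 5 (validity) in Python. The column names `employee_id` and `hours_worked`, the sample rows, and the trusted `reference` source (for example, a time registration system) are assumptions made for illustration; they do not come from the case.

```python
def validity_rate(rows, column="hours_worked"):
    """Share of rows whose value in `column` is a whole number (CTQ 5)."""
    def is_whole(value):
        # bool is a subclass of int in Python, so rule it out explicitly.
        return isinstance(value, int) and not isinstance(value, bool)

    return sum(is_whole(row.get(column)) for row in rows) / len(rows)


def accuracy_rate(rows, reference, key="employee_id", column="hours_worked"):
    """Share of rows whose value matches a trusted reference source (CTQ 3)."""
    matches = sum(1 for row in rows if reference.get(row[key]) == row[column])
    return matches / len(rows)


# Hypothetical anonymized records and a hypothetical reference source.
rows = [
    {"employee_id": "a1", "hours_worked": 32},
    {"employee_id": "b2", "hours_worked": 36.5},  # not a whole number
    {"employee_id": "c3", "hours_worked": 40},
    {"employee_id": "d4", "hours_worked": 24},
]
reference = {"a1": 32, "b2": 36, "c3": 40, "d4": 24}

print(validity_rate(rows))             # 0.75
print(accuracy_rate(rows, reference))  # 0.75
```

In this sample, accuracy comes out at 75%, which would fall short of the 90% requirement in CTQ 3.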

You can see the theory of the DAMA DMBoK (Data Management Body of Knowledge) and Six Sigma coming together here. The CTQ characteristics are measurable properties that you can easily link to the quality dimensions within data management. Think of dimensions like timeliness, completeness, accuracy, consistency, and validity, extensively described in DAMA DMBoK and summarized in my previous article about data quality. By actively monitoring CTQ characteristics, the HR department can not only meet customer expectations but also establish internal quality standards. This makes it possible to deliver data products that are more consistent, relevant, and valuable. This way, Six Sigma can help the organization identify and address potential defects and variations internally, before the product goes to the customer.

Monitoring customer requirements of the data product

Monitoring CTQ characteristics can be done using Statistical Process Control (SPC). SPC is a method to manage processes by measuring variations in input or output, or analyzing process steps (DAMA International, 2017). SPC is based on the assumption that if a process with consistent input is executed consistently, it will produce consistent results. If the input or execution changes, the results will change too.

For each defined CTQ characteristic, the next step is to determine a lower limit, an upper limit, or both. The product owner can set these limits alone or in consultation with the customer. By taking measurements on the CTQ characteristics, you gain insight into how well the product meets customer specifications. Automating these measurements, for example by reading logs automatically or by using scripts to monitor properties such as completeness and accuracy, can increase efficiency and consistency. This saves time and helps detect deviations more quickly.
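
A minimal sketch of such an automated check, assuming each CTQ measurement has already been reduced to a single number. The `SpecLimit` class, the measurement values, and the `delivery_delay_hours` characteristic are hypothetical; the 100% and 90% lower limits mirror CTQ 2 and CTQ 3 from the example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpecLimit:
    """Specification limits for one CTQ characteristic (either bound may be absent)."""
    ctq: str
    lower: Optional[float] = None
    upper: Optional[float] = None

    def is_defect(self, measurement: float) -> bool:
        # A measurement outside the agreed limits counts as a defect.
        below = self.lower is not None and measurement < self.lower
        above = self.upper is not None and measurement > self.upper
        return below or above


limits = [
    SpecLimit("completeness", lower=1.00),          # CTQ 2: 100% complete
    SpecLimit("accuracy", lower=0.90),              # CTQ 3: at least 90% correct
    SpecLimit("delivery_delay_hours", upper=0.0),   # hypothetical upper limit
]

# Hypothetical measurements for one weekly delivery.
measurements = {"completeness": 1.00, "accuracy": 0.87, "delivery_delay_hours": 0.0}

defects = [spec.ctq for spec in limits if spec.is_defect(measurements[spec.ctq])]
print(defects)  # ['accuracy']
```

Keeping the limits in data rather than in code makes it easier to adjust them after consultation with the customer.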

However, it is important to realize that measuring all CTQ characteristics can be complex and time-consuming. Therefore, it is wise to focus on the most critical characteristics, rather than trying to measure everything. This requires a risk assessment by the product owner: which data and process steps must always be of high quality and intensively monitored, and which are less critical for the customer if an error occurs?

Not all process steps have an equal impact on the final quality of the data product. Some steps, such as generating and processing data, have a greater influence on the end results and must be more strictly controlled. Other steps can be followed with less intensive SPC measures. Compare it to manufacturing a car: you want to test the engine very thoroughly, while the parcel shelf is less important. A layered approach to monitoring helps to increase process efficiency without compromising the end quality of the data product.

Example of SPC in the data process

When the outcome of a measurement falls outside the defined specifications, we call it a defect. In that case, the product is not approved and must be adjusted or rejected. Variation in the process leads to variation in the data product and can result in poor end results. With SPC, you can see at a glance when the quality of a process step in your data product doesn't meet customer requirements. After all, the data product is the end result of the process. The number of defects in the process steps, such as generating or processing data, is an indicator of the process's stability and quality.
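
One common SPC tool for tracking a defect proportion over time is the p-chart. The article does not prescribe a specific chart type, so the sketch below is an illustration under assumptions: the weekly defect counts and the sample size of 200 rows per delivery are made up. Note that the control limits computed here describe the stability of the process itself; they are separate from the specification limits agreed with the customer.

```python
import math


def p_chart_limits(defects, sample_sizes):
    """Return the mean defect proportion and 3-sigma control limits per sample."""
    p_bar = sum(defects) / sum(sample_sizes)
    limits = []
    for n in sample_sizes:
        sigma = math.sqrt(p_bar * (1 - p_bar) / n)
        # Proportions cannot leave the interval [0, 1], so clip the limits.
        limits.append((max(0.0, p_bar - 3 * sigma), min(1.0, p_bar + 3 * sigma)))
    return p_bar, limits


def out_of_control(defects, sample_sizes):
    """Indices of samples whose defect proportion falls outside the control limits."""
    _, limits = p_chart_limits(defects, sample_sizes)
    flagged = []
    for index, (d, n, (lcl, ucl)) in enumerate(zip(defects, sample_sizes, limits)):
        proportion = d / n
        if proportion < lcl or proportion > ucl:
            flagged.append(index)
    return flagged


# Hypothetical data: defective rows per weekly delivery of 200 rows.
weekly_defects = [4, 6, 5, 3, 5, 4, 30, 5]
weekly_rows = [200] * 8

print(out_of_control(weekly_defects, weekly_rows))  # [6]
```

Here the seventh delivery (index 6) stands out: 15% of its rows are defective, well above the upper control limit, which signals that something changed in the input or the execution of the process that week.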

Below is an elaboration of SPC for CTQ 1 and CTQ 2 of the employee availability dataset. This elaboration focuses on measuring timeliness by looking at timestamps and completeness by measuring the number of null values.
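
A minimal sketch of how these two measurements could be automated. It assumes that a delivery counts as on time when its timestamp falls no later than Friday within its week, and that completeness is measured as the share of empty cells in the planned-hours columns. The column names, timestamps, and rows are hypothetical.

```python
from datetime import datetime

FRIDAY = 4  # datetime.weekday() counts Monday as 0


def delivered_by_friday(delivered_at):
    """CTQ 1: True if the delivery timestamp falls on Monday through Friday."""
    return delivered_at.weekday() <= FRIDAY


def null_rate(rows, columns):
    """CTQ 2: share of cells in `columns` that are missing (None)."""
    cells = [row.get(column) for row in rows for column in columns]
    return sum(value is None for value in cells) / len(cells)


# Hypothetical delivery timestamps taken from a transfer log.
deliveries = [datetime(2024, 10, 25, 16, 0), datetime(2024, 11, 2, 9, 30)]
late = [d.isoformat() for d in deliveries if not delivered_by_friday(d)]
print(late)  # ['2024-11-02T09:30:00'] (a Saturday)

# Hypothetical anonymized rows with planned hours for the next two weeks.
rows = [
    {"employee": "a1", "planned_week_1": 32, "planned_week_2": 32},
    {"employee": "b2", "planned_week_1": 36, "planned_week_2": None},
    {"employee": "c3", "planned_week_1": 40, "planned_week_2": 40},
]
rate = null_rate(rows, ["planned_week_1", "planned_week_2"])
print(f"completeness: {1 - rate:.1%}")  # completeness: 83.3%
```

This check only judges the deliveries that actually appear in the log. A week without any delivery would need a separate check against the expected delivery calendar.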

In addition to SPC, Process Mining can add an extra dimension to process analysis by not only measuring variations but also providing insight into how the process actually unfolds. Process Mining analyzes and visualizes processes based on digital log data, such as input, processing, and transmission logs. SPC and Process Mining thus complement each other: SPC helps to detect deviations in process steps early, while Process Mining provides insight into variations between different executions of the same process. This helps identify inefficiencies or errors and implement targeted improvements.
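
At its simplest, finding variations between executions comes down to grouping log events per execution and counting the distinct sequences of steps, often called variants. The sketch below uses only the Python standard library rather than a dedicated Process Mining tool. The event log, case identifiers, and activity names are hypothetical.

```python
from collections import Counter, defaultdict


def trace_variants(events):
    """Count distinct activity sequences per case.

    `events` is an iterable of (case_id, timestamp, activity) tuples.
    """
    traces = defaultdict(list)
    # Order events per case by timestamp so each trace reflects the real sequence.
    for case_id, _, activity in sorted(events, key=lambda e: (e[0], e[1])):
        traces[case_id].append(activity)
    return Counter(tuple(trace) for trace in traces.values())


# Hypothetical event log: one case per weekly run of the data product.
events = [
    ("week-43", 1, "collect"), ("week-43", 2, "anonymize"), ("week-43", 3, "publish"),
    ("week-44", 1, "collect"), ("week-44", 2, "anonymize"), ("week-44", 3, "publish"),
    ("week-45", 1, "collect"), ("week-45", 2, "publish"),  # anonymize step missing
]

variants = trace_variants(events)
for variant, count in variants.most_common():
    print(count, " -> ".join(variant))
```

In this sample, two runs follow the expected path and one run skips the anonymization step. A deviation like this is visible in the sequence of steps even when every individual measurement stays within its limits.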

By monitoring the operational process in a data-driven way, defects can be detected early and traced to a specific process step. Process Mining can help visualize deviations in process steps such as data processing, while SPC makes changes in timeliness or accuracy directly measurable. This not only provides insight into possible improvements but also leads to cost savings by reducing repair work at the end of the process. Moreover, you demonstrate that your data product meets all quality requirements, so customers can work in a data-driven way with confidence.

In this article, you have read about the process and meeting customer quality requirements. In part three of this series, we will delve deeper into how the Lean methodology can help us not only minimize variations but also reduce waste, further optimizing the efficiency of data products.
