Data science

High data quality is a prerequisite for data-driven work

May 10, 2024 - 5-minute read
Article by Natan Van Der Knaap

High-quality data is essential for data-driven decision-making. But high-quality data is not an end in itself; it is a means to organizational success. Reliable data, for example, reduces risks and costs and increases the efficiency of data-driven work. In this article, we discuss why good data matters and what good data means.

With accurate and reliable data, organizations can make more informed decisions at strategic, tactical and operational levels. That makes sense: data is a reflection of facts in the real world, so high-quality data is a faithful reflection of that world. With accurate customer data, for example, you get a better picture of your customers' actual needs and preferences. This has a positive effect on data-driven work.

On the other hand, poor-quality data carries risks. It can result in inaccurate decision-making and can damage an organization's reputation, leading to fines, loss of sales and customers, and negative media coverage (DAMA International, 2017). Not to mention the costs associated with restoring data quality!

In short, improving data quality is important because it means the data better meets the needs of data users. The result? Improved efficiency, greater customer satisfaction and better regulatory compliance, which can ultimately lead to better business results.

What is high data quality?

Data is of high quality when it accurately reflects the real world. However, maintaining this quality can be challenging and costly, because the real world is constantly changing and sometimes difficult to capture. We therefore prefer to assess data quality by how well the data meets users' expectations. In other words, data quality depends on the context and needs of the data user (DAMA International, 2017). The data user is a broad term and can range from an employee who wants to see his or her schedule to a manager who wants to spot trends in revenue. High-quality data in one context is therefore not necessarily high quality in another. The average number of cars in the parking garage can give a good indication of how many lunches the caterer has to prepare that day. However, it won't tell you how many employees you have to pay at the end of the month.

So data quality is not intrinsic to the data itself, but to the relationship you have with the data. It is therefore important to make clear agreements about the minimum quality required for the needs of different users. On top of that, laws and regulations impose external requirements on the data. Balancing these interests and determining the required data quality calls for agreements on roles and responsibilities.

High-quality data therefore meets the needs of different users and is fundamental to the smooth operation and success of any organization. However, many organizations still struggle to work on data quality in a structured way and as a result fail to reap the full benefits of data-driven work.

PDCA cycle

A structured approach to working on data quality can be aligned with Deming's PDCA (Plan-Do-Check-Act) cycle. In essence, you start with an initial audit of the current level of your data and the level expected by different users. This shows which data needs the greatest improvement to meet those needs or external requirements. After a root cause analysis, you implement improvements and verify that they have been effective. Finally, you secure these improvements so you don't fall back into old habits. Following these steps is an ongoing process that helps you get as much value out of your data as possible.
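
To illustrate the initial audit, the sketch below (in Python) compares measured quality scores with the levels agreed with data users and ranks the gaps. The field names and scores are made up for the example; the point is only that the audit tells you where improvement effort pays off most.

    measured = {"customer_email": 0.82, "customer_address": 0.95, "order_amount": 0.99}
    agreed = {"customer_email": 0.98, "customer_address": 0.90, "order_amount": 0.99}

    # Keep only the fields that score below the level agreed with their users.
    gaps = {
        field: agreed[field] - score
        for field, score in measured.items()
        if score < agreed[field]
    }

    # Largest gap first: these fields are the first candidates for root cause analysis.
    for field, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
        print(f"{field}: {gap:.0%} below the agreed level")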

Critical data

One more important step precedes the start of the PDCA cycle: identifying your critical data. Most organizations have a lot of data, but not all data is equally important. A core principle of data quality management is to focus improvement efforts on the data that matters most to the organization and its customers. This can be assessed from the processes in which the data is used and the nature of the reports in which it appears. Another way to assess it is to look at the risk the organization runs if something goes wrong with the data.

Master data is a typical example of critical data. Identifying critical data gives the program scope and focus and can have a direct, measurable impact on business needs (DAMA International, 2017). Critical data, like data quality, is relative and can vary by department, domain and organization. Determining critical data across departments or domains can therefore be tricky and again requires good agreements and collaboration. It is best to start bottom-up, because that is where the knowledge lies about which data is most important for the functioning of the organization. If that fails, a more top-down approach is needed.

Quality check of critical data

To measure data quality objectively, you can use data quality dimensions: measurable properties of data. Many dimensions have been described; DAMA NL (2020), for example, lists sixty of them. Here again, it is important to start from the goals you want to achieve with your critical data and choose dimensions from there.

In general, the following eight dimensions are the best known and most widely used (a small illustrative check of several of them follows after the list):

  • Accuracy: checks the extent to which data correctly represents real entities. Here you look at the data values. An example is whether a person's last name or residential address is still correct.
  • Completeness: checks whether all required data is present. You can check this for data values, elements from a dataset, rows in a dataset, data files in a dataset, or the required metadata. An example at the metadata level is a data element whose definition is missing.
  • Consistency: ensures that data values are represented consistently within a dataset and between datasets. It also ensures that data values are consistently associated across datasets at different points in time. For example: are place names written in different ways?
  • Integrity: is about storing data accurately, consistently and reliably to ensure it is correct and complies with rules and standards. An example is a hacker lowering integrity by deleting or altering data.
  • Reasonability: checks that the data is logical and makes sense. This is about how well a data pattern meets expectations. For example, whether the monthly revenue for March 2024 looks as expected.
  • Uniqueness: means that an entity is recorded only once in the data based on its identification. This means that an employee, for example, does not appear twice under the same or a different number. This helps prevent duplicates in records and ensures accuracy and efficiency in processes.
  • Timeliness: verifies that data is available when needed. This deals with the difference between creation and availability. For example, an annual report that is not available until October is not timely.
  • Validity: checks that data values are consistent with a defined domain of values. For example, an email address without an @ sign is invalid.
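
To make these dimensions concrete, here is a minimal sketch in Python (using pandas) that measures completeness, uniqueness and validity on a small, made-up customer table. The column names, the email pattern and the data are illustrative assumptions; what matters is that each dimension translates into a measurable check.

    import pandas as pd

    # Made-up customer data; in practice this comes from a source system.
    customers = pd.DataFrame({
        "employee_id": [101, 102, 102, 104],
        "last_name": ["Jansen", "de Vries", "de Vries", None],
        "email": ["a.jansen@example.com", "devries.example.com",
                  "devries.example.com", "p.bakker@example.com"],
    })

    # Completeness: share of rows in which the required value is present.
    completeness = customers["last_name"].notna().mean()

    # Uniqueness: an employee may appear only once under the same identifier.
    uniqueness = 1 - customers["employee_id"].duplicated().mean()

    # Validity: values must fall within a defined domain, here a simple email pattern.
    validity = customers["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$").mean()

    print(f"completeness {completeness:.0%}, uniqueness {uniqueness:.0%}, validity {validity:.0%}")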

Business rules

At the beginning of your data quality program, it is important to use these quality dimensions to establish business rules for your critical data. Business rules describe how the company must operate internally in order to be successful and meet the demands of the outside world. They are a useful tool for determining when data is of sufficient quality for organizational goals. They also help in setting out the quality level you ultimately want to grow toward.
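
As an illustration of how quality dimensions and business rules come together, the sketch below states a few rules with a minimum required level and evaluates them against measured scores. The rules, thresholds and scores are hypothetical; in practice they follow from the agreements made with data users and from external requirements.

    # Hypothetical business rules: each names a dimension and the minimum level
    # agreed with data users or imposed by external requirements.
    business_rules = [
        {"rule": "Every active customer has a valid email address", "dimension": "validity", "minimum": 0.98},
        {"rule": "Employee numbers are unique", "dimension": "uniqueness", "minimum": 1.00},
        {"rule": "Order records contain a delivery address", "dimension": "completeness", "minimum": 0.95},
    ]

    # Measured scores, for example produced by checks like those in the previous sketch.
    measured_scores = {"validity": 0.91, "uniqueness": 1.00, "completeness": 0.97}

    for rule in business_rules:
        score = measured_scores[rule["dimension"]]
        status = "OK" if score >= rule["minimum"] else "IMPROVE"
        print(f'{status:7} {rule["rule"]} (measured {score:.0%}, required {rule["minimum"]:.0%})')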

In short, high data quality is essential for getting the most value out of data-driven work. Therefore, identify your critical data and make sure it meets the minimum needs of its users as well as external requirements.
