Leading Insights Blog

Keep Datasets Clean Using Data Quality Assessments

By Danny Grant, Daniella Dorio, and Ryan Han

Data can be complex and difficult to keep organized. Popular business advisor Michael Hyatt said, “You can’t improve what you don’t measure.” While this principle is usually applied to operations, it applies equally to production datasets (a practice known as metadata management).(1) The Information Age has created a universe of data requiring rapid organizational change. The volume, velocity, and variety of that data have resulted in disjointed systems, siloed repositories, and a lack of communication around common challenges. As data degrades over time, the organizational effort required to correct it grows exponentially.

These challenges remain even as concerted efforts have been made in the Federal space to unify a department’s or agency’s digital ecosystem. They are especially vexing because many of them are preventable. In many cases, difficulties can be overcome by preemptively understanding known flaws, exploring unknown risks, and identifying guardrails that can be put in place to protect progress.

What actions can be taken to identify degraded data, or to prevent degradation before it occurs? A Data Quality Assessment (DQA) is often the driving force behind a successful data project. A DQA is a full review of the metadata surrounding a dataset, conducted in three common phases:

  1. Reconstructing the Existing State
  2. Assessing and Measuring Quality Dimensions
  3. Recommending Areas for Improvement.(2)

A DQA should leverage both quantitative and qualitative evaluation methods to ensure a comprehensive review.(3) When the assessment is performed properly, it enables several preventative capabilities that contribute to successful data projects:
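
To make the “assess and measure” phase concrete, the sketch below (written in Python with pandas; the table, key column, and status standard are all hypothetical) scores three common quality dimensions as simple ratios:

    import pandas as pd

    # Hypothetical dataset standing in for a production table.
    df = pd.DataFrame({
        "case_id": [101, 102, 102, 104],
        "status": ["OPEN", "CLOSED", "closed", None],
        "amount": [250.0, -40.0, 310.5, 125.0],
    })

    VALID_STATUSES = {"OPEN", "CLOSED"}  # assumed data standard

    def measure_quality(frame: pd.DataFrame) -> dict:
        """Score a few common quality dimensions as 0-1 ratios."""
        return {
            # Completeness: share of non-null cells across the table.
            "completeness": frame.notna().mean().mean(),
            # Uniqueness: share of rows with a distinct primary key.
            "uniqueness": frame["case_id"].nunique() / len(frame),
            # Validity: share of status values matching the standard.
            "validity": frame["status"].isin(VALID_STATUSES).mean(),
        }

    for dimension, score in measure_quality(df).items():
        print(f"{dimension}: {score:.0%}")

Quantitative scores like these complement the qualitative side of the review, such as interviews with data stewards and documentation walkthroughs.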

Tag and Track Critical Data Issues

A good DQA will highlight specific flaws and assess their impact. Project analyses can reach erroneous conclusions if datasets are out of date or if columns and their context are misinterpreted. Tracking these concerns allows for escalation to the appropriate decision-makers and facilitates the design of monitoring checks, ensuring critical flaws do not threaten project success.
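
One possible implementation, sketched below with illustrative issue IDs and checks, pairs each tracked flaw with an automated test so that escalation and ongoing monitoring share a single registry:

    import pandas as pd

    # Hypothetical issue registry: each tracked flaw is paired with an
    # automated check that returns True when the issue has recurred.
    ISSUE_REGISTRY = [
        {
            "id": "DQ-001",
            "description": "Records older than the 90-day refresh window",
            "severity": "critical",
            "check": lambda df: (pd.Timestamp.now() - df["updated_at"]).dt.days.gt(90).any(),
        },
        {
            "id": "DQ-002",
            "description": "Negative amounts in a field documented as non-negative",
            "severity": "high",
            "check": lambda df: (df["amount"] < 0).any(),
        },
    ]

    def run_checks(df: pd.DataFrame) -> list[dict]:
        """Re-run every registered check; return the issues that fired."""
        return [issue for issue in ISSUE_REGISTRY if issue["check"](df)]

    df = pd.DataFrame({
        "updated_at": pd.to_datetime(["2021-01-15", "2024-06-01"]),
        "amount": [100.0, -25.0],
    })
    for issue in run_checks(df):
        print(f"[{issue['severity'].upper()}] {issue['id']}: {issue['description']}")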

Identify Areas for Deeper Analysis

Too much unmanaged data can slow momentum toward high-quality analytic output, putting a data project at risk. For instance, identifying free-text fields that are candidates for categorical transformation can allow for deeper, more detailed analysis. A DQA highlights the tradeoffs between proposed solutions, allowing the data team to navigate obstacles and deliver on time and under budget.
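
For instance, the profiling pass sketched below (the 5% distinct-value threshold is an assumption) flags text columns whose small set of repeated values suggests they are categories rather than true free text:

    import pandas as pd

    def categorical_candidates(df: pd.DataFrame, max_ratio: float = 0.05) -> list[str]:
        """Flag text columns with a low enough distinct-value ratio
        that they likely encode categories, not true free text."""
        candidates = []
        for col in df.select_dtypes(include="object"):
            # Normalize case and whitespace so "Closed " and "closed" count once.
            values = df[col].dropna().str.strip().str.lower()
            if len(values) and values.nunique() / len(values) <= max_ratio:
                candidates.append(col)
        return candidates

    df = pd.DataFrame({
        "status_note": ["Open", "open ", "Closed", "OPEN"] * 25,
        "narrative": [f"unique comment {i}" for i in range(100)],
    })
    print(categorical_candidates(df))  # ['status_note']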

Monitor Shifting Data Standards

Key to any structured data system are the standards set to hold the schema in place. Setting standards helps define acceptable data inputs and maintains data usability, but those standards can drift. Automating DQAs allows the data team to persistently monitor datasets in real time, so that shifting standards are identified quickly and remediations are made to stop degradation as early as possible.
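
A minimal automated check, sketched below against a hypothetical declared schema and status standard, compares each incoming batch to the standard and reports any drift:

    import pandas as pd

    # Hypothetical declared standard for an incoming feed.
    EXPECTED_SCHEMA = {"case_id": "int64", "status": "object", "amount": "float64"}
    EXPECTED_STATUS = {"OPEN", "CLOSED"}

    def detect_drift(batch: pd.DataFrame) -> list[str]:
        """Compare a batch against the declared standard; list findings."""
        findings = []
        for col, dtype in EXPECTED_SCHEMA.items():
            if col not in batch.columns:
                findings.append(f"missing column: {col}")
            elif str(batch[col].dtype) != dtype:
                findings.append(f"{col}: expected {dtype}, got {batch[col].dtype}")
        for extra in set(batch.columns) - EXPECTED_SCHEMA.keys():
            findings.append(f"unexpected column: {extra}")
        if "status" in batch.columns:
            for value in set(batch["status"].dropna()) - EXPECTED_STATUS:
                findings.append(f"status value outside standard: {value!r}")
        return findings

    batch = pd.DataFrame({"case_id": [1], "status": ["PENDING"], "amount": ["12.50"]})
    print(detect_drift(batch))

Wired into a scheduler, a check like this turns a one-time DQA into a persistent monitor.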

Find Useful Integration Points

Data becomes more valuable when aligned with complementary datasets. Finding the right linking points at the start helps the data team build the architecture correctly the first time and allows for modularity and extensibility as additional datasets are brought into the data ecosystem. Stabilizing the architecture also allows for consistent documentation; constant changes can make it difficult for users to find the most up-to-date information.
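
One lightweight way to surface those linking points, sketched below with an assumed 80% overlap threshold, is to measure how much of one dataset's column values also appear in another's:

    import pandas as pd
    from itertools import product

    def candidate_join_keys(left: pd.DataFrame, right: pd.DataFrame,
                            min_overlap: float = 0.8) -> list[tuple[str, str, float]]:
        """Rank column pairs by how many of the left column's values
        also appear in the right column: likely linking points."""
        results = []
        for lcol, rcol in product(left.columns, right.columns):
            lvals = set(left[lcol].dropna())
            rvals = set(right[rcol].dropna())
            if lvals:
                overlap = len(lvals & rvals) / len(lvals)
                if overlap >= min_overlap:
                    results.append((lcol, rcol, overlap))
        return sorted(results, key=lambda r: -r[2])

High-overlap pairs are only candidates; the data team still confirms that matching values actually identify the same real-world entities.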

Continuously Screen for New or Returning Threats

Data must be reviewed rigorously to ensure it is usable, and performing this analysis frequently keeps the data ecosystem healthy. However, concerns are not limited to what we see in the data itself. Reviewing business rules and data fields for usefulness helps curate datasets and removes outdated, inapplicable, or misunderstood data from the data catalog. This lowers the risk that misused data and flawed analysis reach decision-makers.
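
A recurring screening job, sketched below with assumed staleness heuristics, can flag fields that have gone empty or constant so data stewards can decide whether to retire them from the catalog:

    import pandas as pd

    def screen_fields(df: pd.DataFrame, max_null_ratio: float = 0.95) -> dict[str, str]:
        """Flag fields that look outdated or no longer useful."""
        findings = {}
        for col in df.columns:
            null_ratio = df[col].isna().mean()
            if null_ratio >= max_null_ratio:
                findings[col] = f"mostly empty ({null_ratio:.0%} null)"
            elif df[col].dropna().nunique() == 1:
                findings[col] = "constant value, carries no information"
        return findings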

A data ecosystem is more likely to stay trim and healthy, with reduced risk of degrading standards or poor analysis, when the organization identifies and monitors threats and risks and protects progress as it is made. Done properly, a DQA helps keep critical data from becoming unusable.

Connect with Us

This publication is for informational purposes only and does not constitute professional advice or services, or an endorsement of any kind.

Kearney is a Certified Public Accounting (CPA) firm focused on providing accounting and consulting services to the Federal Government. For more information about Kearney, please visit us at www.kearneyco.com or contact us at (703) 931-5600.

(1) Earley, S., Henderson, D., & Data Management Association. (2017). DAMA-DMBOK: Data Management Body of Knowledge.
(2) Batini, C., Cappiello, C., Francalanci, C., & Maurino, A. (2009). Methodologies for data quality assessment and improvement. ACM Computing Surveys, 41(3). doi:10.1145/1541880.1541883.
(3) Chen, H., Hailey, D., Wang, N., & Yu, P. (2014). A review of data quality assessment methods for public health information systems. International Journal of Environmental Research and Public Health, 11(5), 5170–5207. doi:10.3390/ijerph110505170.