The state of data quality: Too much, too wild and too skewed
We live in the age of data. We are constantly producing it, analyzing it, figuring out how to store and protect it, and, hopefully, using it to refine business practices and better understand the markets and customers we work with. However, this is all easier said than done, and one of the biggest concerns businesses have about their data is its quality – a fact confirmed by the 1,900 people surveyed at the end of last year on the state of data quality. Despite being aware of data quality issues, many are uncertain about how best to address those concerns.
Data quality issues
At the top of the concern chart is the sheer number of data sources available to businesses today. Over 60% of respondents indicated that too many data sources and inconsistent data were their top data quality worry. This was followed closely by disorganised data stores and lack of metadata (50%), and poor data quality controls at data entry (47%).
Despite this being a top concern, it will be hard for any organization to reduce the number of data sources it has – if anything, that number is only likely to increase over time. The problem was first tackled when we were still maintaining data in spreadsheets, with data management practitioners coining the term “spreadmart hell” as they tried to impose data governance on multiple spreadsheets maintained by individuals or groups spread across an organization.
This problem was, unfortunately, not solved by the adoption of self-service data analysis tools, as they failed to include much-needed features such as metadata creation and management, and data synchronization.
So, instead of looking at the number of data sources as a problem, we should look at it as a feature and be thankful that technology has progressed to match organizations’ needs. Front-end tools generate metadata and capture provenance and lineage, and data cataloguing software then manages it – so technology has our back. We do, however, have to continue to push a cultural change around data, encouraging people throughout the organization to take responsibility for data quality, governance and general data literacy.
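To make that concrete, here is a minimal, hypothetical sketch of the kind of provenance record a front-end tool might emit and a data catalogue might then index; the field names and dataset names are illustrative, not any particular product’s schema.

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageRecord:
    # Minimal provenance entry a front-end tool might emit for a derived dataset
    dataset: str            # name of the output dataset
    derived_from: list      # upstream sources it was built from
    transformation: str     # human-readable description of the step
    produced_by: str        # tool or pipeline that ran the step
    produced_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

# A record like this is what a catalogue can index and expose for lineage queries
record = LineageRecord(
    dataset="sales_by_region",
    derived_from=["crm.opportunities", "finance.invoices"],
    transformation="join on account_id, aggregate revenue by region",
    produced_by="nightly_sales_pipeline",
)
print(record)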
Data governance 101
Some of the common data quality issues the survey revealed point to larger, institutional problems. Disorganised data stores and a lack of metadata are fundamentally governance issues, and with only 20% of respondents saying their organizations publish information on data provenance and lineage, we can conclude that very few organizations are up to snuff on governance.
Like the sheer amount of data being ingested by organizations, data governance isn’t necessarily an easy problem to solve, and it is only likely to grow. Poor data quality control at data entry is fundamentally where the problem originates: as any good data scientist knows, entry issues are persistent, widespread and stubborn. Adding to this, practitioners may have little or no control over providers of third-party data, so missing data will always be with us.
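As a simple illustration of what entry-level quality controls can catch, here is a hedged sketch using pandas; the column names, allowed country codes and sample rows are made up for the example.

import pandas as pd

def failing_entries(df):
    # Flag rows that fail basic entry checks: required fields, value ranges, allowed codes
    issues = pd.DataFrame(index=df.index)
    issues["missing_customer_id"] = df["customer_id"].isna()
    issues["negative_amount"] = df["amount"] < 0
    issues["unknown_country"] = ~df["country"].isin(["US", "UK", "DE", "FR"])
    return df[issues.any(axis=1)]

orders = pd.DataFrame({
    "customer_id": [101, None, 103],
    "amount": [250.0, 80.0, -15.0],
    "country": ["US", "XX", "DE"],
})
print(failing_entries(orders))  # the second and third rows each fail a check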
However, there is reason for hope: machine learning and artificial intelligence tools could provide a reprieve from these worries. Almost half of respondents (48%) are already using machine learning or AI tools to address data quality issues, automating some of the tasks involved in discovering, profiling and indexing data.
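The survey doesn’t detail which tools respondents use, but the profiling step they automate can be approximated in a few lines of pandas; this is a simplified, rule-based stand-in, with made-up column names, for what ML-assisted tools do at much larger scale.

import pandas as pd

def profile(df):
    # Per-column profile: inferred type, missing rate, distinct count and an example value
    rows = []
    for col in df.columns:
        s = df[col]
        rows.append({
            "column": col,
            "dtype": str(s.dtype),
            "missing_pct": round(s.isna().mean() * 100, 1),
            "distinct": s.nunique(dropna=True),
            "example": s.dropna().iloc[0] if s.notna().any() else None,
        })
    return pd.DataFrame(rows)

customers = pd.DataFrame({
    "email": ["a@example.com", None, "c@example.com"],
    "signup_year": [2018, 2020, 2020],
})
print(profile(customers))  # surfaces the 33.3% missing rate on email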
Data governance, like data quality, is fundamentally a socio-technical problem and as much as ML and AI can help, the right people and processes need to be in place to truly make it happen. Ultimately, people and processes are almost always implicated in both the creation and the perpetuation of data quality issues, so we need to start there.
People and data: Bridging the biases
People have long been known to transfer their own biases onto data when analyzing it, yet only 20% of respondents cited this as a primary data quality issue. Despite being a much-discussed problem, respondents see it as less of an issue than many other quality concerns.
These responses shouldn’t lead us to rule out bias as a data quality issue; rather, they should underscore the importance of acknowledging that data contains bias. We should assume, not rule out, the existence of unknown biases. That assumption should push us to develop formal diversity standards for data and to create processes to detect, acknowledge and address those biases.
Missing data also plays a part here: it isn’t just that we lack the data we believe we need; sometimes we don’t know, or can’t imagine, what data we need. This is why it is important to have a diverse group of people working with data, bringing different ideas and insights to the process.
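One concrete process along these lines is a routine representation check. The sketch below, with an assumed “region” column and an arbitrary 10% threshold, flags groups that are thin or absent in a dataset so someone can ask whether that reflects reality or a collection gap.

import pandas as pd

def representation_report(df, column, threshold=0.10):
    # Share of each group in the column, flagging groups below the chosen threshold
    shares = df[column].value_counts(normalize=True, dropna=False)
    report = shares.rename("share").to_frame()
    report["under_represented"] = report["share"] < threshold
    return report

survey = pd.DataFrame({"region": ["AMER"] * 70 + ["EU"] * 25 + ["APAC"] * 5})
print(representation_report(survey, "region"))  # APAC flagged at a 5% share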
Creating high quality data practices
Most respondents indicated many data quality issues – they don’t seem to travel alone. At the same time, over 70% of respondents don’t have dedicated data quality teams. What is to be done?
Organizations should take formal steps to condition and improve their data. This will be an ongoing process and C-suite buy-in – although difficult to obtain – will be key to creating this long-term strategy. The C-suite, like many others in the organization, will need education on and understanding of the importance of this project and the business benefits it will enable.
C-suite buy-in is also vital because data conditioning is not easy or cheap. Committing to formal processes, implementing technology and creating a dedicated team takes time and money. An ROI-based approach should help determine which data conditioning is a priority and which is not worth addressing.
AI can unearth quality issues hiding in plain sight, and investing in AI can be a catalyst for data quality remediation. But while AI-enriched tools help, they are not the complete answer: you still need a team to foster the use of the tools and master them fully in order to garner the benefits.
Data quality can feel like an overwhelming problem. Start with the basics and encourage good data hygiene practices throughout your organization. Remember: technology and tools are great, but it all starts with people and culture.