Have you ever read a book on data quality and felt your life sucked away?
Data quality isn’t really a problem; you can find peace with your data. Sure, it holds the odd position of being both the most hated, rote task and the most essential step of any project or process. It permeates our lives, from the moment we wake up (Is what I read in the news true?), to work (Do I have the right documents?), to home (Is this kid lying to me about his homework?). Assessing data quality is a way to separate the essential from the non-essential. There’s a lot of discussion on how to identify data quality issues but very little on enlightening solutions.
The Inherent Problem
Data quality management is any initiative that includes techniques for ensuring reliability in data used for decision making. It includes the business processes that ensure the integrity of an organization’s data during collection, aggregation, warehousing, and analysis.
Frequently, data managers and analysts alike will argue that their data simply isn’t organized and that a tool is needed to reveal the issues. But tools have always been present and the issue still exists. Additionally, most data quality tools just stop at the identification and offer little promise on how to remediate the issues they find. The tools don’t provide solutions.
A Shift from Tools to ‘Solutions’
Data quality is not a solved science. Data quality tools require the human variable to capture relevancy and make decisions. And so, they are a flawed vehicle.
A data quality solution in Spotfire, then, needs to be both adaptive to the needs of the domain and able to overcome the human variable. We’ve found that achieving this data quality nirvana is possible.
Step 1: Find the Issues
One of the first things people say once they visualize their data is, “Something is not right.” This is good! Seeing your data visualized for the first time will force you to reconcile what you know about your business with what the data actually shows (or it can be a wakeup call).
Usually this means identifying outliers. Outliers can enter your dataset in a number of ways: bad inputs, conflicting methods for resolving missing data, different units of measure, et cetera. Outliers can easily be highlighted in visual representations of your data as well as through outlier detection methods via TERR. The objective of discovering these data points is either to highlight them for further investigation or to remove them.
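As one illustration of an outlier detection method, here is a minimal sketch of the interquartile-range (Tukey fence) rule. It is written in plain Python rather than TERR so it can run anywhere; the function name `iqr_outliers`, the multiplier `k=1.5`, and the sample readings are assumptions made for illustration, not part of any Spotfire workflow.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return (index, value) pairs that fall outside the Tukey fences."""
    # Quartiles of the data; "inclusive" treats the values as the full population.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical readings; one entry looks like a unit-of-measure mix-up.
readings = [102.0, 98.5, 101.2, 99.8, 100.4, 9950.0, 97.9, 103.1]

# Flag the suspicious points for review instead of silently dropping them.
flagged = iqr_outliers(readings)
for index, value in flagged:
    print(f"row {index}: {value} is outside the expected range")
```

Returning the flagged rows, rather than deleting them, keeps both options from the paragraph above open: investigate first, remove only if the value turns out to be wrong.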
Step 2: Remove the Barriers
When we approach clients about using a new technique in data science we usually hear, “I don’t think our data is good enough for that.” This doesn’t have to stop you. Machine learning is chock-full of methods to get around this very issue, namely imputation.
Imputation is a common workflow for a machine learning expert and can be for you too! With the TERR engine built into Spotfire, we can take packages built for R and fill in missing values in essential data sets. This is not how you would correct your existing data; rather, it is a method for taking a messy data set and plugging it into a machine learning method that requires complete data. Imputation can be a great alternative to discarding incomplete records, which can introduce bias into the results.
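To make the idea concrete, here is a minimal sketch of one of the simplest imputation strategies, median imputation, written in plain Python instead of an R package running on TERR. The function `impute_median`, the `porosity` column, and the sample rows are all hypothetical. The sketch returns a completed copy and marks which values were filled in, so the original data stays untouched.

```python
from statistics import median


def impute_median(rows, column):
    """Return a copy of rows with missing values in `column` set to the median."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill_value = median(observed)
    completed = []
    for row in rows:
        was_missing = row[column] is None
        new_row = dict(row)  # copy, so the source records are not modified
        new_row[column] = fill_value if was_missing else row[column]
        new_row[column + "_imputed"] = was_missing  # keep an audit trail
        completed.append(new_row)
    return completed


# Hypothetical measurements with gaps (None marks a missing value).
samples = [
    {"well": "A", "porosity": 0.12},
    {"well": "B", "porosity": None},
    {"well": "C", "porosity": 0.18},
    {"well": "D", "porosity": 0.15},
    {"well": "E", "porosity": None},
]

completed = impute_median(samples, "porosity")
```

The extra `porosity_imputed` flag is a design choice: a downstream model gets complete inputs, while anyone reviewing the data can still tell measured values from filled ones. More sophisticated approaches (model-based or multiple imputation) follow the same pattern of filling a copy, not the source.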
Step 3: Share your Solution
Spotfire power users may have their own techniques for each domain. This is important for perceiving data quality from a solution perspective. The more solutions that are available, the more likely it is that we can reuse these techniques. Take, for instance, an app that identifies data quality issues in oil and gas data sets. A pre-built template for Spotfire allows data managers to review such a data set for completeness and validity. Apps give users a variety of techniques to choose from, and templates allow those techniques to keep evolving and to be disseminated throughout the Spotfire community.
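The kind of completeness and validity review such a template performs can be sketched in a few lines. This is only an illustration in plain Python, not the template itself: the function `quality_report`, the column names, and the validity rules are assumptions chosen for the example.

```python
def quality_report(rows, rules):
    """Score each column for completeness and validity.

    `rules` maps a column name to a function that returns True for valid values.
    """
    total = len(rows)
    report = {}
    for column, is_valid in rules.items():
        present = [row.get(column) for row in rows if row.get(column) is not None]
        valid = [value for value in present if is_valid(value)]
        report[column] = {
            # Share of rows where the column has a value at all.
            "completeness": len(present) / total if total else 0.0,
            # Share of the present values that pass the rule.
            "validity": len(valid) / len(present) if present else 0.0,
        }
    return report


# Hypothetical well records with one missing ID, one missing depth, one bad depth.
wells = [
    {"well_id": "W-1", "depth_ft": 8500.0},
    {"well_id": "W-2", "depth_ft": -40.0},
    {"well_id": "W-3", "depth_ft": None},
    {"well_id": None, "depth_ft": 9100.0},
]

# Hypothetical domain rules: IDs follow a prefix convention, depths are positive.
rules = {
    "well_id": lambda value: value.startswith("W-"),
    "depth_ft": lambda value: value > 0,
}

report = quality_report(wells, rules)
```

Keeping the rules in a plain dictionary is what makes a check like this shareable: a different domain swaps in its own rules without touching the scoring logic.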
What we are building with Exchange.ai is a forum for communicating these ideas, allowing for techniques to propagate through industries. In this way, data managers can be aware of best practices within their organizations and seek guidance from those who have successfully mitigated these problems.
And there’s the journey! We hope you can come to peace with your data quality and that Spotfire can provide you with a solution for the ever-present struggle of data quality management. If you want an example of this, head to the Ruths.ai blog, DataShopTalk, to see how to put this into motion.