What is data cleansing?
Data cleansing, also known as data scrubbing or data cleaning, is the process of finding and correcting errors, inconsistencies, duplications, and missing entries in data to improve its consistency and quality.
While organizations can be proactive about data quality at the collection stage, data can still arrive noisy or dirty for a range of reasons:
- Duplications due to multiple unmatched data sources
- Data entry errors with misspellings and inconsistencies
- Incomplete data or missing fields
- Punctuation errors or non-compliant symbols
- Outdated data
Data cleansing addresses these problems using a variety of methods and ensures the data conforms to the organization's business rules.
Why is data cleaning important in the business environment?
Data comes from a wide variety of sources and in a huge range of formats. It could be entered manually by the team, collected automatically from website forms filled out by users, streamed from Internet of Things (IoT) devices, or drawn from many other sources. Unless careful thought goes into data collection at the planning stage, it can quickly become a mess.
A data lake can quickly become a data swamp, filled with duplicates, errors, and problems. Imagine a user fills out a form online but enters their email address incorrectly. Realizing the mistake, they fill it out again with the correct address. The system records this as two separate people, despite all the other details being identical. On a micro scale, this means a random person receives an email they never signed up for; on a macro scale, it becomes a real problem. Consider the storage required to house all these extra records, the work required to remove the flawed entries, and the poor predictions or diagnostics produced when running analytics on them. Poor data quality can make information completely unusable.
Data scientists can spend the majority of their time simply cleansing data; they are an expensive resource that could be put to better use.
The need for data cleaning increases as the number of data sources grows. Organizations can have data spread across data warehouses, applications, and even individuals' computers. Consolidating all the datasets, eliminating duplicates, and removing flawed data is vitally important.
However, this cannot be done manually, especially if the organization has huge amounts of data. So, how can data cleansing be performed thoroughly in a cost-effective, replicable, and timely way? Relying on employees alone is not efficient and can introduce more errors.
How to perform data cleansing
Ideally, the organization should use business rules to define precise constraints on the data at the entry phase. These rules might include (see the sketch after this list):
- Validation fields, such as asking the user to enter their email address twice and rejecting the submission if the two entries differ
- Stringent rules around data entry, such as consistent formatting of phone numbers
- Transforming data at the loading stage so it is consistent
- Setting expectations, guidelines, and processes for data collection and storage
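As a rough illustration, the sketch below shows what such entry-time validation might look like in Python. The field names (email, email_confirm, phone) and the specific phone-number rule are assumptions chosen for the example, not a prescribed standard.

```python
import re

# Illustrative entry-time validation rules; the field names are hypothetical.
PHONE_PATTERN = re.compile(r"^\+?\d{10,15}$")            # e.g. +14155550123
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_submission(form: dict) -> list:
    """Return a list of rule violations for a single form submission."""
    errors = []
    if form.get("email") != form.get("email_confirm"):
        errors.append("Email addresses do not match")
    if not EMAIL_PATTERN.match(form.get("email", "")):
        errors.append("Email address is not well formed")
    # Normalize the phone number before applying the formatting rule.
    phone = re.sub(r"[ \-().]", "", form.get("phone", ""))
    if not PHONE_PATTERN.match(phone):
        errors.append("Phone number does not match the expected format")
    return errors

# A submission with mismatched email addresses is rejected at entry time.
print(validate_submission({
    "email": "donald@example.com",
    "email_confirm": "donald@exmple.com",
    "phone": "+1 (415) 555-0123",
}))
```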
However, if an organization is setting up a new system, merging datasets, or even incorporating a new batch of data, it needs to apply strategies to clean dirty data. Data cleansing can be performed in a range of ways, and the approach an organization chooses depends on its current data structures and the outcomes it wants to achieve.
By far the most efficient approach to data cleansing is to use a range of artificial intelligence (AI) techniques that automatically identify and resolve problems.
Types of data cleansing
There are two main types of data cleansing: traditional and big data cleansing.
Traditional data cleansing
Traditional data cleansing methods are not suited to handling huge amounts of data. Historically, when organizations were not producing thousands or millions of records, these methods were adequate. There are two main strategies for cleansing small data sources.
The first is an interactive system that integrates error detection and data transformation through an interface that resembles a spreadsheet. The user defines custom domains and then specifies algorithms that enforce these domain rules. This requires detailed and precise manual effort.
The second group of methods focuses mostly on duplicate elimination, following a systematic approach of (a simple sketch follows this list):
- Pre-processing: Data anomalies are detected and removed
- Processing: Rules are applied to the data, such as duplicate identification
- Validation: A human checks the processing results
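A minimal Python sketch of this pre-processing, processing, and validation flow might look like the following; the records, fields, and the exact-match duplicate rule are purely illustrative.

```python
# Hypothetical records; the fields and rules are illustrative only.
records = [
    {"name": "Donald Duck",  "email": "donald@example.com"},
    {"name": "donald duck ", "email": "donald@example.com"},
    {"name": "",             "email": "not-an-email"},       # anomalous record
]

# Pre-processing: detect and remove obvious anomalies (missing name, malformed email).
pre_processed = [r for r in records if r["name"].strip() and "@" in r["email"]]

# Processing: apply a duplicate-identification rule (case- and whitespace-insensitive match).
seen, kept, flagged = set(), [], []
for r in pre_processed:
    key = (r["name"].strip().lower(), r["email"].strip().lower())
    if key in seen:
        flagged.append(r)      # candidate duplicate
    else:
        seen.add(key)
        kept.append(r)

# Validation: in this framework, a human reviews the processing results.
print("kept:", kept)
print("flagged for review:", flagged)
```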
While these strategies may have worked in the past, they are time-consuming and do not guarantee good data quality in the way modern data cleansing tools do.
Data cleansing for big data
Cleaning big data is the biggest challenge many industries face. The volume of data is already gargantuan, and unless systems are put in place now, the problem will only continue to grow. There are a number of ways to potentially manage this problem; to be effective and efficient, they must be fully automated, with no human input.
Specialized cleaning tools: These typically deal with a particular domain, mostly name and address data, or concentrate on duplicate elimination. A number of commercial tools focus on cleaning this kind of data. These tools extract data, break it down into individual elements (such as phone number, address, and name), validate the address information and zip codes, and then match the data. Once records are matched, they are merged and presented as one.
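That parse, validate, match, and merge flow can be sketched roughly in Python as below. The record layout, the five-digit ZIP rule, and the policy of keeping the fullest name are assumptions made for the example, not how any particular commercial tool works.

```python
import re

# Two hypothetical raw contact records from different sources.
raw = [
    "Donald Duck, 1313 Webfoot Walk, Duckburg, 90210",
    "D. Duck; 1313 Webfoot Walk; Duckburg; 90210",
]

def parse(record: str) -> dict:
    """Break a raw record into its individual elements (name, street, city, zip)."""
    name, street, city, zip_code = [part.strip() for part in re.split(r"[,;]", record)]
    return {"name": name, "street": street, "city": city, "zip": zip_code}

def is_valid_zip(zip_code: str) -> bool:
    """Validate the postal code against a simple five-digit rule."""
    return bool(re.fullmatch(r"\d{5}", zip_code))

parsed = [r for r in map(parse, raw) if is_valid_zip(r["zip"])]

# Match on the standardized address, then merge matching records into one.
by_address = {}
for r in parsed:
    key = (r["street"].lower(), r["city"].lower(), r["zip"])
    by_address.setdefault(key, []).append(r)

merged = []
for group in by_address.values():
    fullest_name = max([r["name"] for r in group], key=len)   # keep the fullest name
    merged.append({**group[0], "name": fullest_name})

print(merged)   # one consolidated record for Donald Duck
```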
Extract, Transform, and Load (ETL) tools: A large number of organizational tools support the ETL process for data warehouses. This process extracts data from one source, transforms it into another form, and then loads it into the target dataset. The "transform" step is where the cleansing occurs: it removes inconsistencies and errors and detects missing information. Depending on the software, there can be a huge number of cleansing tools within the transform step.
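A minimal sketch of that transform step, assuming pandas and an inline table standing in for data extracted from a source system:

```python
import pandas as pd

# Extract: in a real pipeline this data would come from a source system.
df = pd.DataFrame({
    "name": ["Donald Duck", "Donald Duck", "Scrooge McDuck"],
    "email": [" Donald@Example.com ", "donald@example.com", "scrooge@example.com"],
    "signup_date": ["2023-01-15", "2023-01-15", "not a date"],
})

# Transform: this is where the cleansing occurs.
df["email"] = df["email"].str.strip().str.lower()                       # normalize formatting
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")  # invalid dates become NaT
df = df.dropna(subset=["signup_date"])                                  # drop records with missing or bad dates
df = df.drop_duplicates(subset=["email"])                               # eliminate duplicates

# Load: write the cleansed data into the target dataset (a CSV file here).
df.to_csv("customers_clean.csv", index=False)
print(df)
```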
Within these approaches, there are also different ways that errors can be detected.
Statistical methods for error detection
Statistical methods identify outlier records using measures such as the mean, standard deviation, and range. When records sit outside expected norms or do not conform to patterns identified in the existing data, they are likely to be outliers. These records are usually removed to avoid confusion. While this approach may generate false positives, it is simple and fast, and it can be used in conjunction with other methods.
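For example, a simple z-score check flags any value that sits more than three standard deviations from the mean. The order values below are illustrative:

```python
import pandas as pd

# Twenty plausible order values plus one obviously bad entry.
orders = pd.Series([50 + i * 0.5 for i in range(20)] + [9999.0])

# Compute each value's distance from the mean in standard deviations.
z_scores = (orders - orders.mean()) / orders.std()

# Records more than three standard deviations from the mean are treated as outliers.
outliers = orders[z_scores.abs() > 3]
cleaned = orders[z_scores.abs() <= 3]

print("flagged as outliers:", outliers.tolist())   # [9999.0]
```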
Pattern-based
This involves identifying outlier fields that do not conform to already established patterns in the data. Techniques such as partitioning, clustering, and classifying are used to identify patterns that apply to the majority of records.
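One way to sketch a pattern-based check is with clustering: records that do not fall into any dense cluster are treated as outliers. The example below assumes scikit-learn's DBSCAN and illustrative (age, annual spend) records; the eps and min_samples values are arbitrary choices for this toy data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical (age, annual spend) records; the last one breaks the usual pattern.
records = np.array([
    [25, 1200], [27, 1350], [26, 1280], [30, 1500],
    [45, 3000], [47, 3100], [46, 2950],
    [29, 95000],   # does not conform to either established cluster
])

# DBSCAN labels points that fall outside every dense cluster as noise (-1).
labels = DBSCAN(eps=500, min_samples=2).fit_predict(records)

outliers = records[labels == -1]
print("pattern-breaking records:", outliers.tolist())   # [[29, 95000]]
```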
Association rules
Association rules (if-then statements) can show the likelihood of a relationship between records. Records that do not fit within these association rules are considered outliers.
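A minimal sketch of checking records against simple if-then rules follows; the rules and records are invented for illustration rather than mined from real data.

```python
# Hypothetical if-then rules: when the condition holds, the expectation should too.
rules = [
    ("if country is US then currency is USD",
     lambda r: r["country"] != "US" or r["currency"] == "USD"),
    ("if age < 18 then account_type is minor",
     lambda r: r["age"] >= 18 or r["account_type"] == "minor"),
]

records = [
    {"country": "US", "currency": "USD", "age": 34, "account_type": "standard"},
    {"country": "US", "currency": "EUR", "age": 22, "account_type": "standard"},  # breaks rule 1
    {"country": "DE", "currency": "EUR", "age": 16, "account_type": "standard"},  # breaks rule 2
]

for record in records:
    violated = [name for name, holds in rules if not holds(record)]
    if violated:
        print("outlier:", record, "->", violated)
```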
Benefits of data cleansing
Data cleaning ensures data quality, which is vital for any data-driven business. There are many benefits to data cleansing and quality data.
Data-driven decision-making
A higher level of data quality means that organizations can make far better decisions. Quality data reduces the risk of poor predictions or inconsistent decisions. When data is accurate, complete, and up to date, so are the decisions derived from it.
Better customer targeting
Marketing to the entire world is like throwing spaghetti at a wall and hoping something sticks: it is a waste of resources. Good data allows an organization to drill down, identify its ideal customer avatar, and target them. Seeing who buys a product or service, and then finding potential customers with the same attributes, is far more efficient.
However, if the amount of data is too small or the information is incorrect, this is simply not possible.
More effective marketing campaigns
Good quality data can also improve an organization’s content and marketing campaigns. The more that marketing departments know about their audience, the more they can target them with tailored content and advertisements.
Improved relationships with customers
Who is the organization’s dream customer? If the data exists in an accessible and consistent manner, then it is easy to gather information about the organization’s avatars. What are their interests, needs, and preferences? Knowing all of this helps strengthen relationships.
Having a single source of customer data also makes customers far easier to manage: there is no risk of duplicate emails being sent out and no poor targeting. This keeps customers happy.
Easier data implementation
High-quality data is the holy grail for organizations. It is much easier to use, it results in fewer errors or problematic outcomes, and it is consistent and complete. Once all the data is in the same form and has been cleansed, almost any process can be applied to it within seconds.
Competitive advantage
Having higher data quality than competitors or using that data in effective and creative ways creates huge advantages over the competition. Good data helps to predict events, explain previous ones, and target customers far more effectively.
Increased profitability
While the process to get high quality data can be expensive, the return on investment can be huge. Good data helps organizations target customers better and create more effective and cost-efficient marketing campaigns. The data can help with slowing customer churn and increasing market share. It shows an organization where money is going, where it is coming from, and where to utilize resources.
Challenges in data cleansing
There are several challenges intrinsic to data cleansing.
It is likely that much of the data collected by organizations is dirty in some way. Cleansing the data manually would be all but impossible: laborious, time-consuming, and hugely prone to error. This is why powerful, automated tools are important; they have become the only cost-effective and practical way to get a consistent, quality dataset.
But even within this, there are huge problems to resolve. With data fast becoming an organization’s biggest asset, there is considerable reason to put money and time into resolving common challenges.
No guarantees of accuracy
While artificial intelligence (AI) is smart, it is not infallible. It learns only from the rules and information provided to it by humans, and it cannot apply human logic or common-sense judgment. For instance, to remove duplicates, a system breaks the data down into parts. It sees that D. Duck and Donald D. live at the same address, decides this person is Donald Duck, and merges the records.
However, it may also see that H. McDuck and Dewey M. live at the same address and try to merge them, even though these are two separate people who live together and should remain separate records.
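A rough sketch of how a naive address-based merge rule produces exactly this kind of over-merge (the records and the rule are illustrative):

```python
# Hypothetical contact records at shared addresses.
records = [
    {"name": "D. Duck",   "address": "1313 Webfoot Walk"},
    {"name": "Donald D.", "address": "1313 Webfoot Walk"},   # same person: merging is correct
    {"name": "H. McDuck", "address": "1 Money Bin Hill"},
    {"name": "Dewey M.",  "address": "1 Money Bin Hill"},    # different people: merging is wrong
]

# Naive rule: any records sharing an address are assumed to describe the same person.
by_address = {}
for r in records:
    by_address.setdefault(r["address"], []).append(r["name"])

for address, names in by_address.items():
    if len(names) > 1:
        print(f"merged into one record at {address}: {names}")
```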
Consider outliers in the data: a system may treat an entry as an outlier when it is actually a feature. By removing this supposed outlier, the data now lacks an important piece of information.
Also, if data needs to be repaired, AI will make the fixes it deems necessary. However, there is no way to ensure these fixes are correct, and the act of repairing dirty data may simply introduce new errors.
Distributed data
If an organization has not created a thorough system that ensures all data is in one place, or accessible from one place, then data cleansing cannot be carried out. Current data cleansing tools cannot scale out to connect with thousands of different machines or applications.
Data variety
An organization could have a huge variety of data:
- Videos
- Images
- Information derived from social media
- Excel documents
- Multiple applications in a variety of formats
How can a data cleansing tool cleanse all these different formats? Unless they are all made consistent by use of another tool, it is simply not possible.
There are solutions to these problems. A data integrity analysis can uncover many possible errors in a data set, but it cannot address more complex errors. Errors that involve relationships between fields can be very difficult to find. These types of errors require more in-depth analysis.
Knowing what data and trends are supposed to look like allows dirty data to be uncovered. However, real-world data is very diverse and rarely conforms consistently to a standard Gaussian distribution. This problem can be offset by using a variety of methods to detect problems and try to resolve them.
Is data cleansing the biggest challenge to contemporary organizations?
A range of techniques has been developed to address the problem of data cleansing. While many tools have been created to automate the process, it remains largely an interactive approach that requires human intervention. Good quality data is vitally important for organizations: it frees up data scientists' valuable time, provides more accurate insights and predictions, and reduces the risk of poor decision-making. The problem is how to cleanse data in a cost-effective and timely way that results in consistent and accurate data, and the answer will be different for every organization.