5 Sure-Fire Steps to Ensure Data Cleansing During ETL

Reading Time: 2 minutes

Inappropriate, incorrect, duplicate, and missing data are prime examples of dirty data.

Dirty data contributes to inaccurate and unreliable results. If dirty data is used as the primary source for decision making, unforeseen critical errors can occur, predictive models become undependable, and calculations are less precise.

Once dirty data is detected, it has to be corrected. But while that’s taking place, managerial decisions are delayed, processes require re-evaluation and the work that’s contributed to generating the dirty data has to be reworked.

All this leads to wasted employee time, incorrect strategic decisions, and a decrease in the organization’s return on investment.

In the International Journal of Engineering Research and Applications (IJERA), author Sweety Patel identifies multiple ways data becomes dirty. Examples include:

  • Data’s been entered erroneously or data entry personnel are poorly trained.
  • System limitations or system configuration rules are applied inaccurately.
  • Scheduled data updates are neglected.
  • Duplicate records are not removed.
  • Lack of validation rules or rules are applied inconsistently.
  • Source to target mapping definitions are inaccurate.

Additionally, the IJERA article notes that when populating a data warehouse, the extraction, transformation and loading cycle (ETL) is the most important process to ensure that dirty data becomes clean.

During an interview, Milan Thakkar, a senior business intelligence engineer at Mindspark Interactive Inc., says he agrees with that sentiment. He reasons that all data is inherently prone to errors and suggests that during ETL data should be:

  1. Subjected to general statistical analysis. Evaluate new data against historical data for outliers. Mean, median, mode, standard deviation, range and other statistical methods can be applied. Confidence intervals should also be part of this analysis.
  2. Evaluated against a clustering algorithm. A clustering algorithm will also identify outliers and is usually significantly more complete then the general statistical analysis. Clustering can be used to evaluate an entire data set against itself by considering the Euclidean distance between records.
  3. Validated. Data integrity tests should be applied and then the data should be vetted against business rules. Check the data type to ensure that the data is appropriate for the column.
  4. Standardized. Data transformation rules should be used to ensure that the data format is consistent and the business logic is dependable and based on user requirements.
  5. Tracked. A metadata repository should be established to track the entire process including the data transformation, the process of vetting, and every method that’s used to analyze the data. Calculation formulas, data transformation algorithms, and business logic reason should be readily available.

As with any computer process, an ETL process has to be “told what to do” or programmed correctly. To further protect your organization against dirty data, Drew Rockwell recommends:

  • Dedicating resources to maintaining data integrity.
  • Embedding your analytics.
  • Not forcing an overarching schema.
  • Providing visibility into the origin and history of the data.
  • Thinking beyond Excel.

In general, in order to truly be protected against dirty data you must first be proactive by building automated processes to cleanse data during ETL and then applying the steps suggested by Rockwell.

Next Steps:

  • Turn data into insight with Spotfire. Start your free trial today.
  • Subscribe to our blog to stay up to date on the latest insights and trends in big data and big data analytics.

Dennis Earl Hardy
Spotfire Blogging Team