Imperfect data – A historical perspective
The world of computing in 1969 was very different from today. That year, Dr. E.F. (Ted) Codd published his first internal IBM paper, “Derivability, Redundancy and Consistency of Relations Stored in Large Data Banks”, followed in 1970 by the ACM publication “A Relational Model of Data for Large Shared Data Banks” – the birth of relational databases as we know them today.
Organizations used to have complete control of their data. With just a few systems (usually automating back-office functions), there was no concept of customer self-service, integrated supply chains, third-party data feeds, or much else we take for granted today.
Data was generated by professional data-entry staff who took pride in getting it right, with very low error rates. Data was processed sequentially, tapes spinning and lights flashing; often you could tell which job was running by the noises in the computer room.
What’s changed?
What’s changed over 40 years? Today the typical organization runs hundreds, if not thousands, of systems spread across large data centers. Many of these applications share data with supply-chain partners and external data feeds, and, of course, we constantly push customers to do as much for themselves as possible. When you add up 40-plus years of growth and change, it is easy to see how organizations have ended up with such volumes of “imperfect” data – data full of errors, inaccuracies and inconsistencies.
SQL has little ability to deal with imperfect data
In 1969, data was simply assumed to be perfect, so as the RDBMS and SQL were being defined, very little allowance was made for errors in data. LIKE and “contains” clauses and wildcard characters let you find data with known errors, and very little else. When SQL can’t find the data people and systems need, someone has to search for it by hand, so significant human effort is spent trawling through databases for the right records. Some organizations have tried to solve the problem by building monolithic dynamic-SQL search systems, which typically prove very resource-intensive; they take a lot of effort to design, build and maintain, and still fail to find the data.
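As a rough illustration (using Python’s built-in sqlite3 module and made-up customer names, not data from any real system), the sketch below shows how a wildcard search only catches the errors you already know to look for:

```python
import sqlite3

# In-memory table with one clean record and two "imperfect" variants
# of the same customer name (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Katherine Smith"), (2, "Kathrine Smith"), (3, "Catherine Smyth")],
)

# A wildcard pattern only covers misspellings you anticipated:
# '%ath%rine Smith' matches rows 1 and 2, but the phonetic variant
# "Catherine Smyth" (row 3) is invisible to it.
rows = conn.execute(
    "SELECT id, name FROM customers WHERE name LIKE '%ath%rine Smith'"
).fetchall()
print(rows)  # [(1, 'Katherine Smith'), (2, 'Kathrine Smith')]
```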
The route forward
If only we could take everything we now know about data, go back in time, and build the RDBMS and SQL with the ability to handle all sorts of data effectively and efficiently. More realistically, of course, we need a different way to find the data people and systems require without having to know the multitude of ways data can be “imperfect.” We should also remember that people are very good at finding data because they can see through errors and differences – the only problem is that they work at their own, much slower pace. Giving systems the ability to work as accurately as humans, yet at the speed of machines, is long overdue.
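To make that idea concrete, here is a minimal sketch of similarity-based matching using Python’s standard-library difflib; the sample names, the 0.8 threshold and the technique itself are illustrative assumptions, not the specific approach any particular product takes:

```python
from difflib import SequenceMatcher

# Illustrative records only; the names and threshold are assumptions
# chosen for this sketch.
customers = ["Katherine Smith", "Kathrine Smith", "Catherine Smyth", "Robert Jones"]

def similar(query: str, candidate: str, threshold: float = 0.8) -> bool:
    """Return True when two strings are 'close enough' despite typos."""
    return SequenceMatcher(None, query.lower(), candidate.lower()).ratio() >= threshold

matches = [name for name in customers if similar("Katherine Smith", name)]
print(matches)  # ['Katherine Smith', 'Kathrine Smith', 'Catherine Smyth']
```

Even this naive approach finds the phonetic variant that the wildcard search above missed, though it does so by comparing the query against every record, which is exactly the kind of trade-off a purpose-built solution has to address.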
Stay tuned for five things to consider when evaluating solutions to deal with imperfect data!