Data Quality Rules: An Introduction
The main tool of a data quality assessment professional is a data quality rule: a constraint, implementable in a computer program, that validates a data element or a relationship between several data elements. I use the term data relationship here in the broadest sense, ranging from simple entity relationships found in data models to complex business rule dependencies.
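To make the definition concrete, here is a minimal sketch in Python of what "a constraint implemented in a computer program" might look like. Everything in it is an assumption made for illustration: the employee records, the field names (employee_id, hire_date, termination_date), the Rule class and the find_violations helper are all hypothetical and do not come from any particular database or tool. The first rule validates a single data element; the second validates a relationship between two data elements.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Rule:
    """A data quality rule: a named constraint that a record either passes or fails."""
    name: str
    check: Callable[[dict], bool]  # returns True when the record satisfies the rule


# Hypothetical rule on a single data element: the identifier must be present.
employee_id_present = Rule(
    name="employee_id_present",
    check=lambda r: bool(r.get("employee_id")),
)

# Hypothetical rule on a relationship between two data elements:
# a termination date, when present, must not precede the hire date.
termination_not_before_hire = Rule(
    name="termination_not_before_hire",
    check=lambda r: r.get("termination_date") is None
    or r["termination_date"] >= r["hire_date"],
)


def find_violations(records: List[dict], rules: List[Rule]) -> List[Tuple[int, str]]:
    """Return (record index, rule name) for every rule a record fails."""
    violations = []
    for index, record in enumerate(records):
        for rule in rules:
            if not rule.check(record):
                violations.append((index, rule.name))
    return violations


# Made-up sample data, used only to exercise the rules.
records = [
    {"employee_id": "E1", "hire_date": date(2001, 5, 1), "termination_date": None},
    {"employee_id": "", "hire_date": date(2003, 2, 10), "termination_date": date(2002, 1, 1)},
    {"employee_id": "E3", "hire_date": date(2010, 1, 1), "termination_date": date(2015, 6, 30)},
]

violations = find_violations(records, [employee_id_present, termination_not_before_hire])
for index, rule_name in violations:
    print(f"record {index} violates {rule_name}")
```

Keeping each rule as a named, independent check is one possible design choice; it lets every violation be attributed to a specific rule, which is what makes classification of data problems possible later on.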
When properly designed, data quality rules allow identification and precise classification of the majority of data problems in a fraction of the time that manual validation would take. And while the task of designing a comprehensive set of data quality rules is daunting, with the number of rules often reaching hundreds or even thousands, it is absolutely crucial to discover as many data quality rules as possible.
Imagine that you are appointed to be a home-plate umpire in a major league baseball game. Obviously, you cannot perform this role without knowing the rules. The official rulebook of major league baseball contains 124 rules, many with numerous sub-rules. If you miss just a couple of rules, you may inadvertently influence the outcome of the game — the one play that you call erroneously could be decisive. If you do not know 10% of the rules, you can easily cause a riot. Complicated rules are no less important than easy ones, so learning all but 10% of the most complex rules still leaves you 10% short of the target.
Data quality rules play the same role in data quality assessment as the rules of baseball play in umpiring a major league game: they determine the outcome! Unfortunately, identifying data quality rules is more difficult than learning the rules of baseball because there is no official rulebook that is the same for all databases. In every project, we have to discover the rules anew. Also, some rules are easy to find, while others require lots of digging; some rules are easy to understand and implement, while others necessitate writing rather complex programs. But, as with baseball, all rules are equally important. Omitting a few complex and obscure data quality rules can (and most of the time will!) jeopardize the entire effort.