Connecting Entity Resolution and Information Quality
The term Entity Resolution (ER) has been in use for only a few years, but the concept is as old as information systems themselves. Sometimes called record de-duplication, record matching, record linking, merge-purge, or the co-reference problem, ER is the process of determining whether two references to real-world objects refer to the same object or to different objects.
ER is an important tool for achieving Entity Identity Integrity, a fundamental data quality rule that should hold in any information system. In his book Data Quality Assessment, Arkady Maydanchik describes Entity Identity Integrity in the context of a database system as strict adherence to the following conditions:
- Each row (entity reference) in an entity table corresponds to one, and only one, real-world entity; and
- Distinct rows in the table correspond to distinct real-world entities.
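When the true entity behind each row is known, for example in a hand-labeled sample, the two conditions above can be checked mechanically. The following is only a minimal sketch under that assumption: the function name `integrity_violations`, the `row_entities` mapping, and the sample data are hypothetical and do not come from Maydanchik's book.

```python
from collections import defaultdict

def integrity_violations(row_entities):
    """Check both Entity Identity Integrity conditions.

    row_entities maps a row id to the set of real-world entity ids
    that the row refers to (hypothetical ground-truth labels).
    """
    # Condition 1: each row corresponds to one, and only one, entity.
    ambiguous_rows = sorted(
        row for row, entities in row_entities.items() if len(entities) != 1
    )

    # Condition 2: distinct rows correspond to distinct entities.
    rows_by_entity = defaultdict(list)
    for row, entities in row_entities.items():
        for entity in entities:
            rows_by_entity[entity].append(row)
    duplicated = {
        entity: sorted(rows)
        for entity, rows in rows_by_entity.items()
        if len(rows) > 1
    }
    return ambiguous_rows, duplicated

# Made-up sample table for illustration only.
table = {
    1: {"alice"},
    2: {"bob"},
    3: {"alice"},          # second row for the same person as row 1
    4: {"carol", "dave"},  # one row that conflates two people
}
ambiguous, duplicated = integrity_violations(table)
```

Here row 4 violates the first condition and rows 1 and 3 together violate the second. In practice the ground truth is rarely known for the whole table, which is why ER is needed in the first place.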
Entity Identity Integrity is another way of stating the Fundamental Law of Entity Resolution: two entity references should be linked (merged or integrated) if, and only if, they are equivalent (i.e., both refer to the same real-world entity).
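The "if, and only if" in the Fundamental Law implies two kinds of error: linking references that are not equivalent, and failing to link references that are. A rough sketch of how both might be counted pair by pair is shown below. It assumes labeled truth is available, and the names `link_errors`, `true_entity`, and `cluster`, along with the sample data, are illustrative only.

```python
from itertools import combinations

def link_errors(true_entity, cluster):
    """Compare link decisions with true equivalence for every pair.

    true_entity maps reference id -> real-world entity id (assumed known).
    cluster maps reference id -> cluster id assigned by some ER process.
    """
    false_links = []   # linked, but not equivalent
    missed_links = []  # equivalent, but not linked
    for a, b in combinations(sorted(true_entity), 2):
        equivalent = true_entity[a] == true_entity[b]
        linked = cluster[a] == cluster[b]
        if linked and not equivalent:
            false_links.append((a, b))
        elif equivalent and not linked:
            missed_links.append((a, b))
    return false_links, missed_links

# Made-up references: r1 and r2 are one entity, r3 and r4 another.
true_entity = {"r1": "E1", "r2": "E1", "r3": "E2", "r4": "E2"}
cluster = {"r1": "c1", "r2": "c2", "r3": "c2", "r4": "c3"}

false_links, missed_links = link_errors(true_entity, cluster)
```

The Fundamental Law holds for a set of references exactly when both lists are empty. Comparing all pairs is quadratic in the number of references, so this form is suited to small labeled samples rather than full tables.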
A more complete discussion of the Fundamental Law of ER and other ER principles can be found in my book Entity Resolution and Information Quality (2011, Morgan Kaufmann Publishers).