This process uses advanced fuzzy logic matching to identify profiles that may be duplicates (potential merges) but were not automatically merged in the auto dedupe process due to fields not having a 100% similarity.
It recognizes similarity between records regardless of spelling mistakes, inconsistent formatting, abbreviations, typos and other issues. Once identified, duplicates can be merged using the very simple-to-use front end dedupe application which shows the duplicates side-by-side to facilitate a quick look by the user.
To ensure false-positive duplicates (records that appear to be duplicates but are actually distinct and should be kept separate) are not automatically merged, all duplicates below a configurable threshold (see default rules below) require user validation prior to merging. The Dedupe/Reject screen presents these records side-by-side using color coding to indicate which fields contain differing data and highlight what data would be lost if the records are merged.
Merged and rejected records
- Records with a level similarity of 90%-99% are flagged as “Highly Likely Dupes” and queued for client review.
- Records with a level similarity of 60%-89% are flagged as “Likely Dupes” and queued for client review.
The fields included to determine record level similarity are: First Name, Last Name, Email, Phone, Address1 and Zip Code.
Interpreting the Results
How are duplicates identified?
The fuzzy-matching engine compares data within certain fields (FirstName, LastName, Email, PhoneNumber, CellPhoneNumber, HomePhoneNumber, Address1, ZipCode) using a complex algorithm. The algorithm does not rely on dictionary matching or phonics rules – so all misspellings, abbreviations and foreign-language names (which are outside of any dictionary or phonics rule set) are examined with the same high degree of accuracy as any other data.
What are false-positive duplicates?
False-positive duplicates occur when the application finds a high score of similarity between two records which are not actually duplicates. This can happen for the following reasons:
- Similar names or name spellings between two records: The more alike the names and spellings of names, the more similar the records are to each other, so a higher ‘Likeness Percent’ will be calculated by eInsight.
- Different people working for the same company: Since the business name (or, in some cases, the phone number) might be the same across multiple customer records, they will be identified as having a high degree of similarity between them.
How long does it take to run the dedupe process?
A report in the front end indicates to users how long each merge process will take.
The number of records in the database plays a crucial role in how long the deduping process takes since every customer record in the database is compared with every other customer record in the PMS database (but not client-uploaded databases or web sources). This means the initial processing required by the fuzzy-matching engine increases exponentially as the dataset also increases in size.
For example: If we have 100,000 customer records – and add 1 more record, we need to do 100,000 more comparisons (that record is compared with every other record in the database). And if we add 5 more records, we need to do 500,000 more comparisons!
We have implemented dedicated SQL/SSIS servers (dedicated multi-processor and multi-gigabyte database servers) to do most of this work. We have selected this technology for its built-in, highly efficient query optimization technologies and because the SQL Server query engine makes full use of parallelism across multiple processors and processor cores while managing memory as efficiently as possible. So, despite the high number of calculations being executed in some cases, our technology performs as quickly and efficiently as possible.
The ETL (Extract, Transform and Load) Process
Multi-source Data Sources: In a data warehousing scenario, the customer dimension typically contains the latest and greatest information for each customer, no matter how many databases the customer's data are stored in. Where customer information is held in two mutually exclusive databases, the data CRM ETL process must merge sources of customer data together at the parent-company level. This means that initially two records will exist in the staging area (partitioned by a source identifier).
Each time customers are loaded from the transactional source systems, a Fuzzy Lookup is used in tandem with a regular Exact Match Lookup transformation to ensure that data is merged for customers who already exist within the CRM system, and that only new customer records are inserted into the single record customer object. Since we consider the data CRM conformed view of customers across corporate systems, this serves as the reference table for the Fuzzy Lookup operation.
Step 1: PMS Source Data CenRes into eInsight CRM (ETL Runs).
Step 2: Move Profile data into Customer data.
Step 3: Customer records are selected.
Step 4: Configurations are loaded.
Step 5: Fuzzy-matching engine process (First Name, Last Name, Email, Phone Number, Cell Phone Number, Home Phone Number, Address1, and Zip Code).
Step 6: Rules run against the data for duplicates and perform automatic merging, if enabled.
Step 7: Records are available in the front end for user approval or rejection.