
Grab the low-hanging fruits with open-source

Master Data Management, or MDM, is commercial vendors’ buzzword for an entity resolution framework. I talked to several vendors, most of them offering SaaS priced by the total number of records ingested from sources. For larger enterprises, that adds up to a six- to seven-figure dollar amount per year.

The target audience for this article

Are you planning to implement MDM soon? Have you asked vendors for a quote? Or did your company already invest in an MDM SaaS? For sure, it is not a small investment.

What if you could reduce annual subscription costs significantly with a few days of engineering work? The idea in one sentence:

Grab the low-hanging fruits with open-source and let MDM do the hard work.

This move can easily translate into savings of a double-digit percentage, or five to seven figures in dollars, per year.

Why entity resolution matters

The typical business of decent size uses several data sources: for its operations (ERPs), customer relationship management (CRM), analytics (lakes, warehouses), and more (file systems, external sources).

Redundant records exist within practically every system. Some transactions are formally linked to record AB InBev, others to AB INBEV NV. We lose the complete picture of that single customer entity if duplicates stay undetected.

Examples of customer duplicates borrowed from an article I wrote. Image by the author.

Records of the same real-world customer entity also hide across the different sources. Not all of them are linked by foreign keys or have their attributes in sync, and this comes on top of the duplicates within each source. That’s a significant data quality issue. And not just for customer entities but also suppliers, products, people, and other entity types.

I know a company that grew through many mergers and acquisitions. The business integrated many new product lines and geographic regions over time. But IT integration quickly fell behind: the company ended up operating 100+ ERPs, and teams continued working in silos. This translated into missed synergy opportunities. To name a few:

- Missed cross-selling opportunities across product lines and regions.
- Sub-optimal utilization of teams working in the field because of the legacy boundaries by region and product line.
- Sub-optimal negotiation with suppliers because teams purchase the same products independently.
- An order backlog that is out of control because of a lack of transparency between manufacturing/procurement and sales.

Larger enterprises maintain many IT systems across functions, regions, and lines of business. Arrows represent implemented processes, the flow of data, or just manual paperwork. Can we trace a product from closed opportunity to manufacturing, distribution, installation/sale, and service? Image by the author.

But it does not need to stay this way forever. I have outlined a simplified architecture below. The MDM platform takes care of the entity resolution end-to-end process. The outcome is a set of cross-references for customers, products, people, and suppliers — a lookup table of join keys across all sources. We combine these with a consolidated view of the rest (orders, quotes, transactions, …) to overcome the abovementioned challenges.

We extract+load (EL) data from sources to our raw layer in the lake. Some data flows into our Master Data Management (MDM) platform, which handles entity resolution end-to-end. Transformations (T) complement the MDM work to a full-blown data mart. Analytics and Reverse ETL deliver actionable insights. Image by the author.

How entity resolution works end-to-end

The article End-to-End Entity Resolution for Big Data: A Survey by Christophides and co-authors gives an in-depth overview — a great writeup of entity resolution methodology. Don’t miss out on the many topics we will not cover here.

The next figure represents one of many ways to implement entity resolution.

Entity resolution can be an iterative process. We ingest and preprocess records, engineer similarity features, select (and fit) a classification model, and cluster matches. We can set rules (similarity thresholds) under which highly similar pairs are considered true matches automatically (red) and distribute a batch of likely but unsure cases across humans for review (green). Resolved examples help us to learn and refine. Image by the author.

On a high level, these are the steps to follow in a typical end-to-end process:

1. Preprocess/normalize preserving just the semantics.
2. Build blocks of records limiting the number of comparisons.
3. Engineer features to measure the similarity of attributes.
4. Select (and fit) a model to predict pairwise matching likelihood.
5. Transform pairwise matches to entity clusters.
6. Review (a batch of) probable but unsure examples with humans.
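The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a recommendation of a specific framework. The legal-form list, the three-character blocking key, and both thresholds are assumptions chosen for the example, and a plain string-similarity threshold stands in for a fitted classification model.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

LEGAL_FORMS = {"nv", "sa", "inc", "ltd", "gmbh", "llc"}  # illustrative, not exhaustive

def normalize(name):
    """Step 1: lowercase, replace non-ASCII-alphanumeric characters, drop legal forms."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_FORMS)

def block_key(normalized):
    """Step 2: only records sharing the first three characters get compared."""
    return normalized[:3]

def similarity(a, b):
    """Step 3: a single similarity feature in the range 0..1."""
    return SequenceMatcher(None, a, b).ratio()

def resolve(records, auto_match=0.9, review=0.7):
    """records: dict of record id -> name. Returns (clusters, pairs_for_review)."""
    norm = {rid: normalize(name) for rid, name in records.items()}
    blocks = defaultdict(list)
    for rid, value in norm.items():
        blocks[block_key(value)].append(rid)

    parent = {rid: rid for rid in records}  # union-find forest for step 5

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    to_review = []
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            score = similarity(norm[a], norm[b])
            if score >= auto_match:        # step 4: threshold rule instead of a model
                parent[find(a)] = find(b)  # step 5: merge into one cluster
            elif score >= review:          # step 6: unsure, hand over to humans
                to_review.append((a, b, round(score, 2)))

    clusters = defaultdict(set)
    for rid in records:
        clusters[find(rid)].add(rid)
    return list(clusters.values()), to_review

clusters, to_review = resolve({1: "AB InBev", 2: "AB INBEV NV", 3: "Acme Corp."})
```

In this toy run, records 1 and 2 normalize to the same string and end up in one cluster, while record 3 lands in a different block and is never compared with them.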

Typically, you distribute just a batch of unsure cases to humans for review. The outcome of this labeling can be used to refit your classification model or even prompt you to rethink any of the first steps (preprocessing, blocking, feature engineering). A stronger model can detect even more interesting cases worth reviewing, making this process iterative.

Why not build your entity resolution framework in-house?

The typically high computational cost plus human involvement in the review process adds another dimension to this problem: budgeting. You want neither your cloud bill nor your labor costs to go through the roof.

So, a quick and cheap solution might turn out too expensive in operation, and you decide to go for the sophisticated version. That version includes more components than what we discussed in the previous section. E.g., weak supervision to label pairs programmatically from heuristics shared by subject matter experts. Or active machine learning to prioritize samples for manual review based on estimated uncertainty.

Every component in isolation sounds like a manageable task. The big challenge lies in the diversity of skills required to build and manage everything: take care of infrastructure and security, build the backend, the classification model, and the frontend for reviewers.

You can also build the components your team is more confident with and let vendors do the heavy lifting for the rest. I talked to two vendors offering a strong matching engine as a product, i.e., software you must install on self-managed infrastructure. And I talked to vendors offering SaaS for annotation to manage the review tasks.

It sounds like a lot of talking. But it is also an opportunity to learn fast. I also recommend experimenting first with open-source frameworks before talking to vendors. Some benefits from personal experience:

- Avoid marketing bullshit calls because you already know what you want.
- Challenge vendors with edge cases you identified while experimenting with open-source. Let them find a solution.
- Identify the weak spots of each vendor. They all have some!
- Negotiate more confidently. Tell them you know about their weak spots and that you are not a low-hanging fruit. Without a doubt, this will strongly affect the pricing of their products.

How you can reduce your MDM costs significantly

Most MDM vendors I talked to base their pricing on the total number of records ingested into their platform. But that’s not all. They will also try to sell you integration with external APIs, e.g., for address validation.

The figure below takes a closer look at the data preparation step. I highlight money-saving opportunities in green.

Every green box is a money-saving opportunity. Preprocessing (e.g., SQL) helps us subset to just relevant records. Open-source entity resolution takes care of the simple cases, reducing again the number of records fed into MDM. Finally, expensive third-party APIs are called only where not replaceable by cheap alternatives. Image by the author.

You must invest to grab each of the money-saving opportunities. Let’s start with the ones on the lower investment end:

Save money with simple preprocessing

Not all customer records in your source systems are equally important. Likely, many bring zero value to the business or don’t fit into your MDM business cases.

We extract+load relevant data to our lake. Some records won’t be beneficial for MDM. We can identify those with rules translated into SQL and executed on our data lake. The rules can change anytime. Image by the author.

- Zombie records are not linked to a single order, transaction, contract, open opportunity, or other operations-related entity. You will therefore likely not benefit from resolving these dead ends.
- How likely will you benefit from resolving your B2C customers? The MDM selling point is to deliver 360-degree views of a customer across regions, product lines, and more. If that’s rarely beneficial for B2C in your business, why invest in resolving those entities?

The general idea is to collect business cases with significant value. Then, challenge the business with questions like “Do we need customer records without any revenue to address your needs?”. All answers combined will identify the subset worth ingesting into MDM.

Excluding records is not a permanent decision. Does a new business case justify the ingestion of previously excluded records? Submit the change to your code; the data will be included in the next MDM batch.
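As an illustration, such a rule could look like the sketch below. It uses Python’s built-in sqlite3 module as a stand-in for the data lake’s SQL engine. The table and column names (customers, orders, segment) are hypothetical, and the zombie check looks at orders only, whereas a real rule would also consider transactions, contracts, and open opportunities.

```python
import sqlite3

# Rule: keep B2B customers that are linked to at least one order.
# Table and column names are hypothetical placeholders.
RELEVANT_CUSTOMERS_SQL = """
SELECT c.customer_id
FROM customers AS c
WHERE c.segment = 'B2B'
  AND EXISTS (SELECT 1 FROM orders AS o WHERE o.customer_id = c.customer_id)
ORDER BY c.customer_id
"""

def relevant_customer_ids(conn):
    """Return the ids of the customer records worth ingesting into MDM."""
    return [row[0] for row in conn.execute(RELEVANT_CUSTOMERS_SQL)]

def demo_connection():
    """Build a tiny in-memory database to try the rule on."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, segment TEXT);
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
        INSERT INTO customers VALUES
            (1, 'AB InBev', 'B2B'),
            (2, 'Dormant Ltd', 'B2B'),  -- zombie: no orders
            (3, 'Jane Doe', 'B2C');     -- excluded by the B2C rule
        INSERT INTO orders VALUES (10, 1), (11, 3);
    """)
    return conn

ids_for_mdm = relevant_customer_ids(demo_connection())
```

Because the rule lives in code, including a previously excluded segment is a one-line change to the WHERE clause.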

Save money with cheaper alternatives for 3rd party APIs

You will not unleash the full potential of MDM if you don’t integrate it with 3rd party APIs. Two prominent examples are:

- Geocode and validate addresses.
- Enrich B2B customers with industry classifications, hierarchies (parents, subsidiaries), and other KPIs (annual revenue, headcount).

The typical MDM vendor will try to sell you an in-house solution or the market leader to play it safe: nobody gets fired for buying IBM. But is this the best value for the money you can get for your business?

Use a friendly service to geocode your addresses. Some services respond not just with search results but also a confidence score. Call a 2nd, more expensive service for scores below a threshold if needed. Image by the author.

Let’s take the geocoding service as an example. Google Maps and Mapbox are two prominent, market-leading examples, and many more vendors offer closed-source proprietary solutions. On the other hand, vendors like Geoapify and OpenCage rely on open-source and open data, particularly the OpenStreetMap ecosystem. These open alternatives offer prices far below their closed competition. But more importantly, they come with a friendly license, allowing you to store and share their data with far fewer restrictions.

Would you argue that Google Maps is more accurate than OpenStreetMap on your data? No problem. You can use the more expensive service as a fallback whenever the preferred service responds with low confidence.
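The fallback logic can be sketched as follows. The two geocoders are passed in as plain functions, so nothing here reflects a real vendor API. The result type, the 0..1 confidence scale, and the 0.8 threshold are assumptions for illustration. Real services report confidence on their own scales, which you would need to normalize first.

```python
from typing import Callable, NamedTuple, Optional

class GeocodeResult(NamedTuple):
    lat: float
    lon: float
    confidence: float  # assumed to be normalized to 0..1
    provider: str

# A geocoder is any function mapping an address to a result (or None if not found).
Geocoder = Callable[[str], Optional[GeocodeResult]]

def geocode_with_fallback(
    address: str,
    primary: Geocoder,
    fallback: Geocoder,
    min_confidence: float = 0.8,
) -> Optional[GeocodeResult]:
    """Call the cheap service first; pay for the second one only when unsure."""
    first = primary(address)
    if first is not None and first.confidence >= min_confidence:
        return first
    second = fallback(address)
    # Keep whichever answer is more confident; None if both services found nothing.
    candidates = [r for r in (first, second) if r is not None]
    return max(candidates, key=lambda r: r.confidence, default=None)
```

Logging how often the fallback is triggered gives a direct estimate of what the more expensive service costs you per batch.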

Save money with open-source entity resolution

Many MDM vendors offer features you will rarely find in popular open-source alternatives: proprietary phonetic algorithms, collective matching, entity-centric matching, and more. These will help you catch edge cases you likely would have missed otherwise.

Use open source to measure the similarity of pairs of entity records. Let sophisticated proprietary MDM solutions do the heavy lifting for you. Increase value for your money. Image by the author.

But what about the bulk of cases? From my experience, most detected duplicates are low-hanging fruits, at a ratio of roughly 80 to 20 if you ask me. We can quickly grab the 80% using a simple open-source entity resolution step. Pick a relatively conservative threshold on matching similarity and resolve automatically. Assuming that your data consists of 20% redundant records (estimated from personal experience), we can reduce the total number of records by 16% (80% of 20%) before ingesting them into MDM.
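The arithmetic behind that estimate, as a small sketch. The 20% and 80% shares are the rough personal estimates from the paragraph above, and the one million records are a made-up example volume.

```python
def records_after_prematching(total_records, redundant_share=0.20, auto_resolved_share=0.80):
    """Records left to ingest into MDM after the open-source step removed easy duplicates."""
    redundant = round(total_records * redundant_share)      # records that are duplicates
    auto_resolved = round(redundant * auto_resolved_share)  # the low-hanging share of those
    return total_records - auto_resolved

# 1,000,000 source records -> 840,000 records to ingest, a 16% reduction.
remaining = records_after_prematching(1_000_000)
```

With record-based pricing, the reduction in ingested volume is a first approximation of the reduction in subscription cost, although actual vendor price tiers may not scale linearly.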

Architecturally, we can run such a step as a script deployed on our data lake and executed after extracting and loading the source data. We can keep the orchestration overhead at the bare minimum. Likely, a one-off job will clean up the bulk, and running it every once in a while will do the rest.

We can store the output, the detected low-hanging pairs of duplicates, in a cross-reference table and use those in combination with the MDM’s results for the complete picture.
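One way to combine the two outputs is sketched below. It assumes the open-source step writes a mapping from each removed duplicate to the surviving record that was ingested, and that MDM returns a mapping from ingested records to its entity ids. The key format ("crm:1") and the entity ids ("E1") are hypothetical.

```python
def complete_xref(prematch, mdm_xref):
    """Merge both cross-reference tables into one lookup: record key -> entity id.

    prematch: removed duplicate -> surviving record (output of the open-source step)
    mdm_xref: ingested record   -> MDM entity id    (output of the MDM platform)
    """
    full = dict(mdm_xref)
    for record_key, survivor_key in prematch.items():
        entity_id = mdm_xref.get(survivor_key)
        if entity_id is not None:  # skip survivors that were never ingested into MDM
            full[record_key] = entity_id
    return full

xref = complete_xref(
    prematch={"crm:2": "crm:1"},
    mdm_xref={"crm:1": "E1", "erp:9": "E1", "erp:5": "E2"},
)
```

Here the duplicate crm:2, which MDM never saw, still resolves to entity E1 through its survivor crm:1.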

Prove the concept and negotiate with confidence

MDM is a costly long-term investment. A sensible way to justify it is through a few days/weeks of work — a proof of concept (POC) on the company’s internal data.

How many redundant customer/product/supplier records are in our most critical data sources? How big are the reference gaps across sources? How do these translate into inefficiencies? Stakeholders need to have some rough estimates before they invest in a costly solution.

You can run a POC on a critical subset of your data within days. Check one of the open-source entity resolution frameworks. But don’t just report the number of duplicates you can detect with high confidence. Investigate the likely but unsure cases with random sampling and manual efforts.

You detected many duplicates confidently (green) with open-source and a few lines of code. Where does your algorithm need to catch up? Get an idea by random stratified sampling and some good old manual investigative work. Image by the Author.
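A minimal sketch of such a stratified sample, assuming your entity resolution step yields candidate pairs as (record_a, record_b, score) tuples with scores between 0 and 1. The band width, the review range of 50% to 90%, and the sample size per band are illustrative defaults.

```python
import random
from collections import defaultdict

def stratified_sample(scored_pairs, low_pct=50, high_pct=90, band_pct=10, per_band=20, seed=42):
    """Draw up to per_band random pairs from each similarity band in the unsure range.

    Returns a dict: lower bound of the band in percent -> list of sampled pairs.
    """
    rng = random.Random(seed)  # fixed seed keeps the review batch reproducible
    bands = defaultdict(list)
    for pair in scored_pairs:
        pct = round(pair[2] * 100)  # work in whole percent to avoid float edge cases
        if low_pct <= pct < high_pct:
            bands[(pct // band_pct) * band_pct].append(pair)
    return {
        band: rng.sample(pairs, min(per_band, len(pairs)))
        for band, pairs in sorted(bands.items())
    }

pairs = [("a", "b", 0.55), ("a", "c", 0.58), ("c", "d", 0.72), ("e", "f", 0.95), ("g", "h", 0.30)]
review_batch = stratified_sample(pairs, per_band=1)
```

Sampling per band instead of over all pairs keeps the rare high-similarity cases from being drowned out by the many low-similarity ones, and the share of true matches found in each band hints at where your threshold should sit.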

Where does your in-house solution need to catch up? Is it sensitive to misspellings? Unaware of synonyms or acronyms? Performing poorly on non-Latin scripts? Challenge MDM vendors and see if they can catch those cases more confidently. If your favorite vendor is behind in any aspect, negotiate prices downward using your evidence. That is another way to reduce your MDM bill.


MDM platforms are expensive in absolute terms. Vendors justify their price tags by the value these platforms generate. I agree. Yet, I see potential in significantly increasing our return on investment.

But why not build the whole thing in-house? You can keep the complexity low and the architecture simple. E.g., a simple script with a conservative threshold will be better than nothing. The real question is, will you benefit from entity resolution beyond that? Some considerations:

- Treat entity resolution as a business problem, not an IT problem. Collect business cases with a significant estimated value. Show the business what you can do and when with in-house vs. bought solutions.
- Do you have the expertise in your team to build an in-house solution? You don’t want to hire a team of engineers for entity resolution alone.
- Lastly, there is significant variation among MDM prices. If budget is a concern, avoid the market leaders. Many vendors compete in this field. Some will be at a surprisingly low end of the price spectrum, much lower than the salary of a team of in-house engineers.

How to Reduce Your Master Data Management Bill was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
