
help with data matching


Looking for some helpful suggestions regarding a task I'm currently trying to complete at work.

I have two sets of data that I'm trying to match. The common fields are address details, but they're all free text and there are no common conventions in use. To compound the issue, we can also toss in random abbreviations (still no common conventions), a limited field length (some entries don't have spaces), random pieces of extraneous info, and poor data quality (HA!). I've been bashing my head against this for a while now without success and have come to the conclusion that I'm going to have to tackle it line by line, though I'm still holding out hope that there might be a better way.

The data is potentially sensitive, so I can't really supply much in the way of examples, but any ideas?

US only?

The data is AU (what's the 'only' part?)

There are different considerations when the addresses are all from one localized area vs. a mix of international and local, so I was clarifying whether the information was international or not.

You could think of the process in terms of two stages:
1. Assign a score to every given pair of items (one from each datafile), where the score represents the likelihood that they are matching entries (refer to the same address).
2. Now for each item in dataset A you can identify its most likely (or top few) potential matches, and let a human make the final judgement.
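A minimal sketch of those two stages in Python, using difflib's ratio as a stand-in similarity score (the function and parameter names here are made up for illustration, and you'd swap in whatever scorer suits the data):

```python
from difflib import SequenceMatcher

def score(a, b):
    # Stand-in similarity score in [0, 1]; case-insensitive character-level
    # comparison. Replace with something address-aware if needed.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def top_candidates(item, dataset_b, k=3):
    # Stage 1: score this item against every entry in the other file.
    scored = [(score(item, b), b) for b in dataset_b]
    # Stage 2: keep the top few, to be handed to a human for the final call.
    return sorted(scored, reverse=True)[:k]
```

Running every item in dataset A through `top_candidates` gives you a short review queue per item rather than the full cross-product.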

Doesn't seem like it's ever going to be possible to do this without human intervention at some point, but you could probably do a pretty good job of identifying good candidates and keeping false positives to a small number.

Now as for the scoring algorithm, I'm thinking the best approach would probably be to treat each entry as a single string, and use an existing algorithm that finds the longest substrings in common. Something like that.
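That longest-common-substring idea can be sketched with difflib's `find_longest_match` (the normalisation step is an assumption to cope with the missing-spaces problem mentioned above):

```python
from difflib import SequenceMatcher

def lcs_score(a, b):
    # Normalise: lowercase and drop spaces so "12 MainSt" and "12 Main St"
    # compare fairly despite the inconsistent formatting.
    a = a.lower().replace(" ", "")
    b = b.lower().replace(" ", "")
    # Find the longest substring the two entries have in common.
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    # Scale by the longer string so the score lands in [0, 1]
    # (the max(1, ...) just guards against empty entries).
    return m.size / max(1, len(a), len(b))
```

A plain longest-common-substring score is crude (it ignores everything outside the one matching run), but it's cheap and gives a usable ranking for the human-review stage.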

