topbanner_forum
  *

avatar image

Welcome, Guest. Please login or register.
Did you miss your activation email?

Login with username, password and session length
  • Thursday March 28, 2024, 2:05 pm
  • Proudly celebrating 15+ years online.
  • Donate now to become a lifetime supporting member of the site and get a non-expiring license key for all of our programs.
  • donate

Author Topic: help with data matching  (Read 6705 times)

Target

  • Honorary Member
  • Joined in 2006
  • **
  • Posts: 1,832
    • View Profile
    • Donate to Member
help with data matching
« on: December 15, 2015, 04:49 PM »
looking for some helpful suggestions regarding a task I'm currently trying to complete at work

I have 2 sets of data that I'm trying to match.  The common fields are address details, however they are all free text and there are no common conventions in use.  To compound the issue we can also toss in random abbreviations (still no common conventions), a limited field length (some entries don't have spaces), random pieces of extraneous info, and poor data quality(HA!).  I've been bashing my head against this for a while now without success and have come to the conclusion that I'm going to have to tackle it line by line, though I'm still holding out hope that there might be a better way

the data is clearly potentially sensitive so I can't really supply much in the way of examples, but any idea's?

wraith808

  • Supporting Member
  • Joined in 2006
  • **
  • default avatar
  • Posts: 11,186
    • View Profile
    • Donate to Member
Re: help with data matching
« Reply #1 on: December 15, 2015, 06:27 PM »
US only?

Target

  • Honorary Member
  • Joined in 2006
  • **
  • Posts: 1,832
    • View Profile
    • Donate to Member
Re: help with data matching
« Reply #2 on: December 15, 2015, 06:33 PM »
the data is AU (whats the 'only' part?)
« Last Edit: December 15, 2015, 06:39 PM by Target »

wraith808

  • Supporting Member
  • Joined in 2006
  • **
  • default avatar
  • Posts: 11,186
    • View Profile
    • Donate to Member
Re: help with data matching
« Reply #3 on: December 15, 2015, 09:01 PM »
there are different considerations when you have a localized area for the addresses vs. international and local, so I was clarifying if the information was international or not.

mouser

  • First Author
  • Administrator
  • Joined in 2005
  • *****
  • Posts: 40,896
    • View Profile
    • Mouser's Software Zone on DonationCoder.com
    • Read more about this member.
    • Donate to Member
Re: help with data matching
« Reply #4 on: December 15, 2015, 09:32 PM »
You could think of the process in terms of two stages
1. Assign a score to every given pair of items (one from each datafile), where the score represents the likelyhood that they are matching entries (refer to same user).
2. Now for each item in dataset A you can identify its most likely (or top few) potential matches, and let a human make the final judgement.

Doesn't seem like its ever going to be possible to do this without human intervention at some point, but you could probably do a pretty good job of identifying good candidates and keeping false positives to a small number.

Now as for scoring algorithm,  i'm thinking the best approach would probably be to treat each entry as a single string, and use an existing algorithm that finds longest substrings in common.   something like that.

Target

  • Honorary Member
  • Joined in 2006
  • **
  • Posts: 1,832
    • View Profile
    • Donate to Member
Re: help with data matching
« Reply #5 on: December 15, 2015, 09:37 PM »
so basically I'm just going to have to slog my way through it - luckily it's only small (4.5K records)

4wd

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 5,641
    • View Profile
    • Donate to Member
Re: help with data matching
« Reply #6 on: December 15, 2015, 11:46 PM »
You could possibly make it a little quicker, depends on how localised the addresses are, by doing a RegEx match for postcode.  Put all addresses of the same postcode into separate smaller files.

wraith808

  • Supporting Member
  • Joined in 2006
  • **
  • default avatar
  • Posts: 11,186
    • View Profile
    • Donate to Member
Re: help with data matching
« Reply #7 on: December 16, 2015, 07:28 AM »
so basically I'm just going to have to slog my way through it - luckily it's only small (4.5K records)


RegEx could really help with this- other than the street address.  Also, remember it doesn't have to be either/or.  You can use a programmatic method to narrow it down to possible matches that you have to look at, and do that part manually.

And I think that's exactly what 4wd said above, but I'm going to post this anyway because maybe I'm wrong.  :-[

Target

  • Honorary Member
  • Joined in 2006
  • **
  • Posts: 1,832
    • View Profile
    • Donate to Member
Re: help with data matching
« Reply #8 on: December 16, 2015, 03:35 PM »
I can pull the postcodes easily, and likewise derive corresponding suburbs.  Trouble is their data doesn't necessarily match my data (and in many cases is just plain wrong)

which brings me back to the street address.  I'd be interested in how I might go about building some sort of regex tool, but given the variables I'm not sure how I would even go about it

wraith808

  • Supporting Member
  • Joined in 2006
  • **
  • default avatar
  • Posts: 11,186
    • View Profile
    • Donate to Member
Re: help with data matching
« Reply #9 on: December 16, 2015, 04:40 PM »
I can pull the postcodes easily, and likewise derive corresponding suburbs.  Trouble is their data doesn't necessarily match my data (and in many cases is just plain wrong)

which brings me back to the street address.  I'd be interested in how I might go about building some sort of regex tool, but given the variables I'm not sure how I would even go about it

Well, I don't know much about AU street addresses.  Which is one of the reasons I quieted down once you clarified.  Can you expound?

Until then, some general resources:

As you said this was for a company, there are several services if you have a budget for this that would probably cost less than paying you to reinvent the wheel.

A good example: https://smartystreets.com/features

I use RegExBuddy to help me build.  There's also a good library: http://www.regxlib.c...etectCookieSupport=1

A good start (but this is US Based)

Code: Text [Select]
  1. \d{1,5}\s\w.\s(\b\w*\b\s){1,2}\w*\.

(need to get mouser to add regex to the code highlighting)

This bit of RegEx allows 1-5 digits for the house number, a space, a character followed by a period (for N. or S.), 1-2 words for the street name, finished with an abbreviation (like st. or rd.).  This by no means solves your problem... it's just meant as a start.

You can also use text matching to brute force the text matches... and use the DICE coefficient to set a threshold for human intervention.  For more on that, visit: https://en.wikibooks...Dice%27s_coefficient.  There are implementation examples in several languages.

A better and more practical explanation: http://www.tsjensen....In+C+Extensions.aspx
« Last Edit: December 16, 2015, 04:52 PM by wraith808 »

4wd

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 5,641
    • View Profile
    • Donate to Member
Re: help with data matching
« Reply #10 on: December 16, 2015, 05:39 PM »
And I think that's exactly what 4wd said above, but I'm going to post this anyway because maybe I'm wrong.  :-[

Not wrong  :Thmbsup:

I think some pseudo-examples would help with us working out possible RegEx.

Target

  • Honorary Member
  • Joined in 2006
  • **
  • Posts: 1,832
    • View Profile
    • Donate to Member
Re: help with data matching
« Reply #11 on: December 16, 2015, 08:52 PM »
this should be reasonably safe (what could possibly go wrong?)

the attached lists a bunch of entries that I've already matched - it's by no means complete, but it's a start

note that it's actually in CSV format so you can just change the xtn

4wd

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 5,641
    • View Profile
    • Donate to Member
Re: help with data matching
« Reply #12 on: December 17, 2015, 12:55 AM »
See what you mean, the postcode is about the only thing that matches reasonably consistently and even that fails in two of the examples you've got.

Target

  • Honorary Member
  • Joined in 2006
  • **
  • Posts: 1,832
    • View Profile
    • Donate to Member
Re: help with data matching
« Reply #13 on: December 17, 2015, 03:46 PM »
gotta love the value we place on data integrity :Thmbsup:

Stoic Joker

  • Honorary Member
  • Joined in 2008
  • **
  • Posts: 6,646
    • View Profile
    • Donate to Member
Re: help with data matching
« Reply #14 on: December 18, 2015, 06:21 AM »
If it's something you'll need to do frequently, or just multiple times...automate.

If it's something you need to do once, it's probably safer and easier to just suffer through it.

When we switch business applications a few years ago, it was my lot in life to sort out the mess that was our data for export/import. We had a list of 5,000 clients that was in in much the same state as you describe. Some people used address field 1, some people used address field 2, some people abbreviated, some did not ... It was a complete GD train wreck.

But in retrospect, it wasn't that hard to sort out in Excel once I started playing with the sort order and using replace X in column only. Because if you sort by the column you're working on all of the errors of a given type tend to get grouped by the sorting process, so they're easier to address it one shot.

Just a thought.