help with data matching

Target:
so basically I'm just going to have to slog my way through it - luckily it's only small (4.5K records)

4wd:
Depending on how localised the addresses are, you could make it a little quicker by doing a RegEx match on the postcode and putting all addresses with the same postcode into separate, smaller files.
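A rough sketch of that postcode split in Python, in case it helps. It assumes each record is a single string ending in a four-digit Australian postcode; the sample addresses and the `group_by_postcode` helper are made up for illustration, and here the groups are kept in memory rather than written out to files.

```python
import re
from collections import defaultdict

# Assumption: the postcode is four digits at the very end of the address string.
POSTCODE_RE = re.compile(r"\b(\d{4})\s*$")

def group_by_postcode(addresses):
    """Bucket address strings by trailing postcode; unmatched ones go under None."""
    groups = defaultdict(list)
    for address in addresses:
        match = POSTCODE_RE.search(address)
        key = match.group(1) if match else None
        groups[key].append(address)
    return dict(groups)

# Hypothetical sample data, not taken from the real record set.
sample = [
    "12 High St, Richmond VIC 3121",
    "4/7 Station Rd, Richmond VIC 3121",
    "99 Beach Rd, Bondi NSW 2026",
    "Lot 5 Unknown Lane",
]
groups = group_by_postcode(sample)
for postcode, members in groups.items():
    print(postcode, len(members))
```

Records that land under `None` have no recognisable postcode and would need to be looked at by hand.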

wraith808:
so basically I'm just going to have to slog my way through it - luckily it's only small (4.5K records)
-Target (December 15, 2015, 09:37 PM)
--- End quote ---


RegEx could really help with this, other than for the street address.  Also, remember it doesn't have to be either/or.  You can use a programmatic method to narrow it down to the possible matches, then review those manually.

And I think that's exactly what 4wd said above, but I'm going to post this anyway because maybe I'm wrong.  :-[
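The narrow-then-review idea could look something like the sketch below. It uses `difflib.get_close_matches` from the Python standard library to shortlist candidates for a person to check; the `candidate_matches` helper, the cutoff of 0.6, and the sample addresses are all assumptions for illustration, not tuned values.

```python
import difflib

def candidate_matches(address, reference, limit=3, cutoff=0.6):
    """Return up to `limit` reference addresses that look similar to `address`."""
    # Compare case-insensitively, but hand back the original spelling.
    normalized = {ref.lower(): ref for ref in reference}
    hits = difflib.get_close_matches(
        address.lower(), list(normalized), n=limit, cutoff=cutoff
    )
    return [normalized[hit] for hit in hits]

# Hypothetical reference list standing in for "my data".
reference = ["12 High St", "99 Beach Rd"]
shortlist = candidate_matches("12 High Street", reference)
print(shortlist)
```

An empty shortlist means nothing cleared the cutoff, so that record goes straight to the manual pile.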

Target:
I can pull the postcodes easily, and likewise derive corresponding suburbs.  Trouble is their data doesn't necessarily match my data (and in many cases is just plain wrong)

which brings me back to the street address.  I'd be interested in how I might go about building some sort of regex tool, but given the variables I'm not sure how I would even go about it

wraith808:
I can pull the postcodes easily, and likewise derive corresponding suburbs.  Trouble is their data doesn't necessarily match my data (and in many cases is just plain wrong)

which brings me back to the street address.  I'd be interested in how I might go about building some sort of regex tool, but given the variables I'm not sure how I would even go about it
-Target (December 16, 2015, 03:35 PM)
--- End quote ---

Well, I don't know much about AU street addresses.  Which is one of the reasons I quieted down once you clarified.  Can you expound?

Until then, some general resources:

Since you said this is for a company: if there's a budget, there are several services that would probably cost less than paying you to reinvent the wheel.

A good example: https://smartystreets.com/features

I use RegexBuddy to help me build patterns.  There's also a good pattern library: http://www.regxlib.com/DisplayPatterns.aspx?cattabindex=6&categoryId=7&AspxAutoDetectCookieSupport=1

A good start (but this is US-based):


--- Code: Text ---
\d{1,5}\s\w\.\s(\b\w*\b\s){1,2}\w*\.
(need to get mouser to add regex to the code highlighting)

This bit of RegEx matches 1-5 digits for the house number, a space, a single character followed by a period (for N. or S.), 1-2 words for the street name, and a final abbreviation ending in a period (like St. or Rd.).  This by no means solves your problem; it's just meant as a start.
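To see what the pattern does and doesn't catch, here is a quick Python check. The dot after the direction letter is escaped in this sketch so it only matches a literal period, and the test addresses are invented.

```python
import re

# House number, direction initial with period, 1-2 street words, abbreviation with period.
STREET_RE = re.compile(r"\d{1,5}\s\w\.\s(\b\w*\b\s){1,2}\w*\.")

# Hypothetical test addresses.
tests = ["123 N. Main St.", "4500 S. Old Mill Rd.", "12 High St"]
for text in tests:
    match = STREET_RE.search(text)
    print(text, "->", match.group(0) if match else "no match")
```

The last address has no direction initial and no trailing period, so it falls through, which shows how US-specific the pattern is.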

You can also brute-force the comparison with fuzzy text matching, and use the Dice coefficient to set a threshold for human intervention.  For more on that, visit: https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Dice%27s_coefficient (there are implementation examples in several languages).

A better and more practical explanation: http://www.tsjensen.com/blog/post/2011/05/27/Four+Functions+For+Finding+Fuzzy+String+Matches+In+C+Extensions.aspx
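For what it's worth, a minimal Dice coefficient over character bigrams might look like this in Python. This is only a sketch: the `classify` helper and its 0.9 / 0.6 thresholds are hypothetical and would need tuning against real records.

```python
from collections import Counter

def normalize(text):
    """Lower-case and collapse runs of whitespace."""
    return " ".join(text.lower().split())

def bigrams(text):
    """Count the overlapping two-character slices of the normalized text."""
    s = normalize(text)
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def dice_coefficient(a, b):
    """Return 2 * |shared bigrams| / (|bigrams of a| + |bigrams of b|)."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Both strings are too short to have bigrams; fall back to equality.
        return 1.0 if normalize(a) == normalize(b) else 0.0
    overlap = sum((x & y).values())
    return 2.0 * overlap / total

def classify(a, b, auto=0.9, review=0.6):
    """Hypothetical thresholds: accept, send to a human, or reject."""
    score = dice_coefficient(a, b)
    if score >= auto:
        return "match"
    if score >= review:
        return "review"
    return "no match"

print(classify("12 High St", "12 High Street"))
print(classify("12 High St", "99 Beach Rd"))
```

Anything that comes back as "review" is the part you'd still eyeball manually.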
