Dirty data is no use to you

Human error. Typos. Faulty scanners. Mistakes tend to creep in more often than not. The question is, what can you do about it? Spend weeks sorting through thousands of address records to identify errors, verify the correct information and clean up the data? Repeat the exercise every few months when fresh errors have messed up your data again?

It’s a real problem, because without good data you can’t understand and improve your logistics performance.

How messy data gets in the way of good decisions

Imagine this scenario. Your business ships to and from more than 10 countries and you need to see what’s happening at key destinations. You want to understand:

  • in- and out-flow, so you can manage staffing bottlenecks better

  • site-by-site spend efficiency, so you can understand which sites are cost-effective

  • staff efficiency, so you can see who is shipping what and with which 3PL, and whether staff are adhering to standard operating practices

You’re looking for the answer to some pretty important questions, such as – are there opportunities for consolidation? Which providers are best? And so on.

But now you come up against a few snags. One is that no courier’s tracking website shows you all the shipments moving into or out of a location. And if you’re using more than one service provider to pick up and deliver at a given location, you’ll need to pull data down from multiple tracking websites.

Added to this, to get the answers to your important questions, you’ll need both historical invoice data and current tracking data. But when site addresses are recorded with such unreliable quality and consistency, it’s difficult or impossible to take even the first steps toward a meaningful analysis.

And any other question, analysis or decision you want to make in your logistics planning depends equally on the quality of your data. With poor data, you are in the dark.

Machine learning. Just the thing for dirty data.

Fortunately, there is now a solution to this problem. Advanced machine learning can be applied to dirty data to dramatically improve the quality of address records, stripping out incorrect duplicates and leaving you with one – correct – address per location.

And once you have that, you can get going with serious analysis of your operations, secure in the knowledge that your results are reliable and accurate.

Without going into too much detail (get some here if you really want it), this is how it works.

Traditional computing techniques will only map addresses to each other if the records are exactly the same, which won’t work when the same site is recorded in many slightly different ways. One solution might be a traditional data mining approach called Entity Resolution. Whilst this is a reasonable approach, it copes badly with sparsely populated or messy data, which, unfortunately, is exactly what this logistics data looks like.
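To make the exact-match problem concrete, here is a minimal sketch in Python. The addresses are invented for illustration, and the similarity measure (the standard library's difflib) is only a stand-in to show the idea of "close enough", not the method described in this article.

```python
from difflib import SequenceMatcher


def normalise(address: str) -> str:
    """Lowercase, replace punctuation with spaces and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in address.lower())
    return " ".join(kept.split())


def similarity(a: str, b: str) -> float:
    """Return a score between 0 and 1; higher means the strings are more alike."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


# Hypothetical records: two spellings of one site, plus an unrelated site.
clean = "12 Harbour Road, Rotterdam, Netherlands"
dirty = "12 Harbor Rd., Roterdam, NETHERLANDS"
unrelated = "500 Market Street, San Francisco, United States"

print(clean == dirty)  # False: exact matching treats these as two different sites
print(similarity(clean, dirty) > similarity(clean, unrelated))
```

Exact comparison sees two separate sites, while even a crude similarity score ranks the misspelt record far closer to the clean one than to an unrelated address.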

Luckily, however, logistics address data has two useful properties which mean we can use a much more powerful method: 1) for a given business, addresses or sites don’t change frequently and 2) address data is messy in a consistent way depending on where it came from.

Property 1) lets us configure an entity resolution algorithm to learn which records are clean and uniquely define the specific sites of a given business, so we have a common pattern to match dirty data against. Property 2) lets us put more or less weight on the parts of the data we know are more or less indicative of the clean address. 1) is important because it’s a lot easier to map a dirty record to a known clean one than to pick out a clean record from a set of dirty ones when you don’t know what a clean one looks like. 2) is important because data that comes from the same source is likely to be messy in a similar way.

For example, depending on how the client fills out AWBs, DHL invoices can have address data that was typed in, so the quality of the data is good and all of it is equally important in picking out clean records. FedEx invoices, on the other hand, will have address data scanned in from AWBs to determine the sender, company and street name, so the city and country data is more important during the cleaning process.
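The two properties can be sketched together in a few lines of Python. This is an illustration under stated assumptions, not the actual algorithm: the site list, the field names, the "typed" and "scanned" source labels and the weight values are all hypothetical, and difflib stands in for whatever similarity measure a real system would learn.

```python
from difflib import SequenceMatcher

# Property 1: a small, stable list of known clean sites (hypothetical data).
CLEAN_SITES = [
    {"company": "Acme Ltd", "street": "12 Harbour Road", "city": "Rotterdam", "country": "NL"},
    {"company": "Acme Ltd", "street": "8 Dock Street", "city": "Antwerp", "country": "BE"},
]

# Property 2: per-source field weights (assumed values, for illustration only).
SOURCE_WEIGHTS = {
    "typed": {"company": 1.0, "street": 1.0, "city": 1.0, "country": 1.0},
    "scanned": {"company": 0.5, "street": 0.5, "city": 2.0, "country": 2.0},
}


def field_similarity(a: str, b: str) -> float:
    """Similarity of two field values, between 0 and 1."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def score(record: dict, site: dict, weights: dict) -> float:
    """Weighted average of the per-field similarities."""
    total = sum(weights.values())
    weighted = sum(
        w * field_similarity(record.get(field, ""), site.get(field, ""))
        for field, w in weights.items()
    )
    return weighted / total


def best_site(record: dict, source: str) -> dict:
    """Map a dirty record to the most similar known clean site."""
    weights = SOURCE_WEIGHTS[source]
    return max(CLEAN_SITES, key=lambda site: score(record, site, weights))


# A scanned record with garbled company and street but a reliable city and country.
scanned = {"company": "Acrne Lid", "street": "I2 Harb0ur R0ad", "city": "Rotterdam", "country": "NL"}
print(best_site(scanned, "scanned")["street"])
```

Because the clean sites are known in advance, the job reduces to picking the nearest one, and the source-specific weights decide which fields get the most say in that choice.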