Those days are gone. Now we have point-of-sale (PoS) data, retailer depletion reports, Google Analytics data from our website, Facebook information, Twitter info, and that doesn't even count our membership lists. Our customers are no longer just our neighbors or people from across town; they are spread across the world. Catalogs have given way to anonymous users on our website, and email newsletter clicks have replaced our flyers. In effect, the data itself hasn't changed. There is just more of it, and we have little control over the form in which we receive it, as evidenced by this lovely artifact, which shows what the original winery data looked like from one of the data sources.
What is an executive, a manager, or even a small-winery winemaker supposed to do when the information they need to make a decision about their products and sales arrives in the condition above? How does one get from a data set that looks like the picture above to one that looks like the picture below?
Figure C, like Figure A, is an example of what makes government data inaccessible to most small-to-medium business owners. It is an easier-to-read version of what "dirty data" looks like. In this case we can see that several rules are broken, which gets in the way of integrating the data set into a useful format. These are the broken rules, and so this is the information that has to be fixed.
It turns out there are industry standards for how data should be formatted so that it can be joined, manipulated, and evaluated for quality. For example, if you're dealing with financial data at a bank in Switzerland and someone hands you a piece of paper with the numbers:
written on it, you would have no idea of context or value. This is an example of "dirty data." You know there's information in there somewhere, but it is disconnected from meaning.
Add a few characters and commas, limit the digits after the decimal point, and you've got a happy moment in the bank.
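The clean-up described above can be sketched in a few lines of Python. This is only an illustration, not a banking standard: the function name `format_amount`, the sample value, and the choice of "CHF" as the currency label are all assumptions made for the example, since the original figures are not reproduced here.

```python
from decimal import Decimal, ROUND_HALF_UP


def format_amount(raw: str, currency: str = "CHF") -> str:
    """Turn a bare digit string into a labelled, readable amount.

    `raw` and `currency` are hypothetical inputs chosen for this sketch.
    """
    # Limit the information after the decimal point to two places.
    amount = Decimal(raw).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    # Add thousands separators and a currency label for context.
    return f"{currency} {amount:,.2f}"


# A made-up raw value, standing in for the numbers on the piece of paper.
print(format_amount("1234567.891"))
```

The same digits now carry a unit and a scale, which is the difference between a string of numbers and a figure someone can act on.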
47.5442342 , -122.3776679