Those days are gone. Now we have Point of Sale (PoS) data, Retailer Depletion Reports, Google Analytics data from our website, Facebook information, Twitter info, and that's before we even mention our membership lists. Our customers are no longer just our neighbors or people from across town; they are spread across the world. Anonymous users on our website have replaced catalogs, and email newsletter clicks have replaced our flyers. In effect, the data itself hasn't changed. There's just more of it, and we don't have much control over the form in which we receive it - as evidenced by this lovely artifact, which shows what the original winery data looked like from one of the data sources.
The intention of this blog is to communicate with people who don't speak "Data." Unfortunately, I'm a data geek, so I struggle to speak "human." I get a deer-in-the-headlights look when I use phrases like "Big Data" or "Open Data," and my friends say, "huh?" Communicating everything that has changed with the internet and the digitization of information is challenging, because language develops around experience and technical language is precise. What's really interesting is being in the midst - literally - of the development of a new global language driven by technical change. However, that makes translating back to non-technical language even more challenging.
Volume, Velocity, Variety
This blog entry is not going to have much by me; it is mostly quotes I typed up from this lecture Hadley Wickham gave at the Chicago Chapter of the ACM on March 7th, 2018. The lecture opens with who Hadley is, but basically he's an important tool builder / developer whose work has driven a lot of the programmatic breakthroughs that make getting to insights faster and easier today.
One can't say he's a pioneer of the "Data Revolution," but he is one of the most prolific current contributors to the changes taking place in the data world. This blog is really just a list of quotes I've tried to capture from the video, because he does a really good job of explaining what "data science" is and how it's done. The opening 8 minutes of this hour-long program, in particular, cover what data science is and its process for getting to insights.
Data comes in a large variety of "types," but let's keep this simple. You can have text (like someone's name), dates, numbers, and booleans (true/false values). Those are the simple categories. There are more these days, but they are beyond this blog post. The big problem for those who are not immersed in the data world is that even within those categories there is variation - and a lot of it. Data really isn't that simple any longer.
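For the data geeks reading along, here is a small Python sketch of what that variation can look like inside a single category, dates. The three sample strings and the list of formats are my own made-up assumptions, not taken from the winery data; a real source may well use formats that aren't listed here.

```python
from datetime import date, datetime

# Hypothetical samples: the same day, written the way three
# different data sources might send it to us.
raw_dates = ["2018-03-07", "03/07/2018", "March 7, 2018"]

# Formats we ASSUME the sources use; real feeds may need more.
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y"]


def parse_date(text: str) -> date:
    """Try each known format in turn until one fits the text."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue  # this format didn't match, try the next one
    raise ValueError(f"Unrecognized date format: {text!r}")


# Convert every raw string into a real date value.
parsed = [parse_date(d) for d in raw_dates]

# True only if all three strings turned out to be the same day.
all_same_day = len(set(parsed)) == 1
```

To a person, all three strings are obviously the same day. To a computer they are three unrelated pieces of text until something like `parse_date` turns them into one common form.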
For example, if you have a sales spreadsheet with someone's first name and last name in one field (cell), that really isn't considered "clean" data, even if you can read the first and last name, such as John Smith. Tidy data is a structure that makes working with the data easy. For example, say you wanted to join your sales data with your newsletter list. You would have to have an exact pattern match of "John Smith" in