Data comes in a large variety of "types," but let's keep this simple. You can have text like someone's name, dates, numbers, and booleans. Those are the simple categories. There are more these days, but we'll stick to those for this blog post. The big problem for those who are not immersed in the data world is that even within those categories there is variation - and a lot of it. Data really isn't that simple any longer.
For example, if a sales spreadsheet has someone's first name and last name together in one field (cell), that really isn't considered "clean" data, even if you can read it perfectly well, such as "John Smith." Tidy data is a structure that makes working with the data easy. Say you wanted to join your sales data with your newsletter list. You would need an exact match of "John Smith" in the newsletter, or you would not be able to make a match. However, if you match on "Smith" first and then "John," there's a good chance you'll find at least some "Smith" entries in the newsletter list. Getting the first name into a "First Name" field and the last name into a "Last Name" field is part of a process called "data cleaning" (or tidying the data). Once the first name is in the First Name field and the last name is in the Last Name field, the data is considered "tidied."
This matters because tidy data extends the data's utility. Going back to the desire to match the newsletter email list against the sales records: if your first names are in the First Name field and your last names are in the Last Name field, you have a higher probability of matching on that Last Name than on the whole string of text. What breaks pattern matches is that different systems and different people do things differently. One person might type "John Smith", another system might deliver "Smith, John", and yet a third person or system might have "John & Jane Smith." Manipulating the data is now a hassle. And don't even get me started on how timestamps, date stamps, and dates can make manipulating data a misery. Everyone has to have their own flavor of ice cream.
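Here is a minimal sketch of that idea, using hypothetical name strings and a made-up `parse_name` helper. It shows why matching on a parsed Last Name succeeds where a whole-string comparison fails:

```python
def parse_name(raw: str) -> tuple[str, str]:
    """Split a raw name string into (first, last).

    Handles both the "First Last" and the "Last, First" variants
    that different systems might deliver.
    """
    raw = raw.strip()
    if "," in raw:
        last, first = [part.strip() for part in raw.split(",", 1)]
    else:
        parts = raw.split()
        first, last = " ".join(parts[:-1]), parts[-1]
    return first, last

# Three variants of the same person, as different systems might store them
# (hypothetical data):
variants = ["John Smith", "Smith, John", "John  Smith"]
parsed = [parse_name(v) for v in variants]

# A whole-string comparison would treat these as three different people,
# but every variant yields the same last name, so a match on "Smith" works.
```

This is only a sketch; a real system would also need to handle cases like "John & Jane Smith" (two people in one field), which no simple split can fix.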
But there is a standard definition of "Tidy Data," and it is what we practice at Vizzy Solutions. Tidy Data follows four rules, but before we get to them, let's cover two other standard terms: variable and observation (also called a record by many). Below are the definitions we will be using.
The first column above is a perfect example of what we call "dirty data." It combines a count, or record number, for the observation with the first and last name of the observed individual. This would make joining records, or even sorting them in a logical order, nearly impossible. That brings us to the first rule of Tidy Data.
In the example above, the wrong version makes it difficult to catch things like someone entering a City of Seattle for a White Center zip code. Or what if someone misspells "Seattle" as "Seatle"? Then there are all the variations in how the Zip+4 could be filled in. If you want to manipulate your data, having one "cell" (or "field" or "column" - whatever you call it) for each discrete piece of information will speed up your work and pay you back with fewer duplicates and an easier time linking your records.
By the way, splitting out data from a single field into multiple fields is called "parsing" data. When you join it back together, it's called "concatenation" or "concatenating" data.
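Both operations are one-liners in most tools. A tiny sketch (hypothetical name value):

```python
# "Parsing": splitting one Name field into First Name / Last Name fields.
full_name = "John Smith"
first_name, last_name = full_name.split(" ", 1)

# "Concatenating": joining the pieces back into a single field.
rebuilt = first_name + " " + last_name
```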
The Third Rule of Tidy Data
each observation forms a row
In the example above, Alice Walker's 5-digit zip code and 9-digit zip code sit on two rows. It "looks" like there are two observations of the same Alice Walker. Instead, an additional column should be added to the record to hold the 9-digit Zip Code, so that she occupies a single row. Joining records together like this is sometimes called "record linkage."
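A minimal sketch of that linkage, using hypothetical records and made-up zip values: the same person appears on two rows, and we collapse them into one row with a separate column for each zip format.

```python
# Two rows that "look" like two observations but describe one person
# (all values hypothetical):
rows = [
    {"first": "Alice", "last": "Walker", "zip5": "98106", "zip9": None},
    {"first": "Alice", "last": "Walker", "zip5": None, "zip9": "98106-1234"},
]

merged: dict[tuple[str, str], dict] = {}
for row in rows:
    key = (row["first"], row["last"])  # the variables that link the records
    record = merged.setdefault(key, {"first": row["first"], "last": row["last"]})
    for field in ("zip5", "zip9"):
        if row[field] is not None:
            record[field] = row[field]

# merged now holds a single observation per person, with both zip columns
# filled in - one row per observation, as the third rule requires.
```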
The Fourth Rule of Tidy Data
each type of observational unit forms a table
What this simply means is that you don't try to keep all your data in one table (or spreadsheet, or even workbook). You have a table for address information, a table for sales information, and a table for whatever else you mean to collect.
This speeds up processing for analysis and reduces the "weight" of joining disparate items, which can produce bad math. For example, if you want to track an order number and the overall order total so that you can analyze discounting, sales tax distributions, and so on, don't combine that information with the per-product line item sales data. Make two tables with common variables between them so they can be joined if and when necessary.
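A minimal sketch of that structure, with hypothetical tables and values: an order-level table and a line-item table share an `order_id` column, and we join them only when an analysis actually needs both.

```python
# One table per type of observational unit (hypothetical data):
orders = [
    {"order_id": 1001, "order_total": 59.98, "sales_tax": 5.40},
]
line_items = [
    {"order_id": 1001, "product": "widget", "qty": 2, "unit_price": 29.99},
]

# Join on the common variable only when needed, e.g. to compare each
# line item against its order's total for a discounting analysis:
order_lookup = {o["order_id"]: o for o in orders}
joined = [
    {**item, "order_total": order_lookup[item["order_id"]]["order_total"]}
    for item in line_items
]
```

Keeping the order total out of the line-item table means it is stored once per order, so a sum over line items never double-counts it.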
And there you have it...
This is a basic summary of what has become standard in the "data science" world thanks to people like Hadley Wickham, and to Jeff Leek's work on the Johns Hopkins Data Science track on Coursera.