From the course: Descriptive Healthcare Analytics in R

Uses of a data dictionary - R Tutorial

Uses of a data dictionary

- [Instructor] Welcome to the first section of Chapter Two. Chapter Two will cover designing your metadata. So what is metadata? Let's start by talking about one particular type of metadata, a data dictionary. Most analysts have heard of a data dictionary, but many don't know its technical definition. So first, we'll go over that. In the next section, we will talk about how you actually make a data dictionary as you go through the project. By the end of this course, you will have built out your complete data dictionary for the project.

So first, please don't ask me for a data dictionary form or template, because this is more of a concept than an official format. The exact structure of every data dictionary is slightly different, depending on the project. But all should meet two main objectives: documenting clearly what each variable in the data set means, and giving the exact definition of each level of each categorical variable included. Sometimes a data dictionary includes more information, such as variable types and widths, but it has to meet those main objectives to be a useful data dictionary.

Some data sets don't have a data dictionary. If that happens, it basically means you cannot analyze the data set unless you find someone who knows what all those variables mean. Then you can make your own data dictionary, which I admit I've done on occasion. But if a data set does have a data dictionary, sometimes it comes in Microsoft Word or PDF format, and I find those hard to use. I prefer to have mine in Microsoft Excel, so in this course we will make ours in Excel. I'm not the only one who likes a data dictionary in Excel. Check out this online data dictionary in Excel for the U.S. Military Data Repository, or MDR. This is a beautiful data dictionary, which is a joy to use when doing MDR analysis. Really nice metadata. So like MDR, I prefer my data dictionaries in Excel, and I use the tabs.
I document our main analytic data set in the first tab, and then all the levels of the different categorical variables (I call these picklists, a term from informatics and databases) in the tabs across the rest of the workbook. In the main tab, I just refer to the picklists.

So what we are doing is documenting our analytic data set, and this is an activity that falls under the larger category of data curation. I explain the word curation by asking: have you ever been in a museum and sat down on a broken-down old chair, only to be yelled at by the museum staff that it was a historic chair on display? Right. You thought it was a crappy old chair because you did not see the curation, the little plaque that explained it was George Washington's chair or something. The point is, the information about the chair made the chair valuable, not the chair itself. And that is also true of data. If you do not have sufficient information about the data, then the data are not useful. But once you read the documentation, or data curation, you can use the data. And the main piece of documentation you need among those curation files is the data dictionary. So we need to get going on that now.

I want to point out some vocabulary words here that I'm going to use as we make our data dictionary. Luckily, all we are documenting is an analytic data set, so we only have two kinds of variables: native variables, which come directly from the BRFSS data set, and calculated variables, which we make in R. When we read in the BRFSS data set, you'll see there are tons of variables. Our data dictionary will guide us to keep only the native variables we need for the analysis. But those native variables are typically not in the exact format we need to analyze them. Therefore, we usually need to make many calculated variables using those native variables. And the data dictionary is good for planning those.
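To make the workbook layout and the native/calculated distinction concrete, here is a minimal R sketch. It is an illustration under assumptions, not the course's actual files: the variable names (`sex_native`, `age_native`, `age_group`), the picklist codes, and the toy `raw` data frame are all hypothetical stand-ins, and plain data frames stand in for the Excel tabs.

```r
# Main "tab": one row per variable in the analytic data set.
# All names below are hypothetical, not actual BRFSS variable names.
dictionary <- data.frame(
  variable    = c("sex_native", "age_native", "age_group"),
  source      = c("native", "native", "calculated"),
  description = c("Respondent sex",
                  "Age in years",
                  "Age group made from age_native"),
  picklist    = c("sex_picklist", NA, "age_group_picklist"),
  stringsAsFactors = FALSE
)

# One picklist "tab": the exact definition of each level of a
# categorical variable (codes and labels are assumed for illustration).
sex_picklist <- data.frame(
  code  = c(1, 2),
  label = c("Male", "Female"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the raw file, which has more columns than we need.
raw <- data.frame(
  sex_native = c(1, 2, 2),
  age_native = c(34, 61, 47),
  unused_a   = c(9, 9, 9),
  unused_b   = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

# Let the dictionary decide which native variables to keep.
native_vars <- dictionary$variable[dictionary$source == "native"]
analytic <- raw[, native_vars, drop = FALSE]

# Look up level definitions in the picklist instead of memorizing codes.
sex_labels <- sex_picklist$label[match(analytic$sex_native, sex_picklist$code)]
```

Keeping the list of variables in the dictionary, rather than typing it into the subsetting code, means the documentation and the data set are less likely to drift apart.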
Sometimes you make a few calculated variables that you do not end up using, but if they are in the data set, it is good to document them. As you go through the project, you will first document your native variables. Then, as you make calculated variables, you will continue to update your data dictionary. The way I do my analyses is with two monitors, one with the statistical software on it and one with the data dictionary on it. That way I can read my map, the data dictionary, and update it on one screen as I go through the analysis on the other.

Perhaps you had heard of a data dictionary before, but now you know exactly what one is. This movie introduced you to what goes in a data dictionary, different formats of data dictionaries, and what native and calculated variables are. In the next section, we will start making our own.
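The habit of updating the dictionary every time a calculated variable is made can be sketched in R as follows. This is only one possible approach, under assumptions: the variable names and the age cut points are hypothetical, and a data frame stands in for the main tab of the Excel workbook.

```r
# Toy analytic data set with one native variable (hypothetical name).
analytic <- data.frame(age_native = c(34, 61, 47))

# Dictionary so far: only the native variable is documented.
dictionary <- data.frame(
  variable    = "age_native",
  source      = "native",
  description = "Age in years",
  stringsAsFactors = FALSE
)

# Make a calculated variable. The cut points are assumed for illustration.
# right = FALSE gives intervals [18, 45), [45, 65), [65, Inf).
analytic$age_group <- cut(
  analytic$age_native,
  breaks = c(18, 45, 65, Inf),
  labels = c("18-44", "45-64", "65+"),
  right  = FALSE
)

# Document the new variable right away, even if it may go unused later.
dictionary <- rbind(
  dictionary,
  data.frame(
    variable    = "age_group",
    source      = "calculated",
    description = "Age group made from age_native",
    stringsAsFactors = FALSE
  )
)

# Check: any variable in the data set that the dictionary does not cover.
undocumented <- setdiff(names(analytic), dictionary$variable)
```

If `undocumented` is ever non-empty, a variable has been added to the data set without a matching row in the dictionary.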
