Explore best practices for code naming conventions and arrangement in developing and analyzing the analytic dataset.
- Now we will discuss code arrangement. This video explains how to split up your code and name the pieces in a certain way, so you can stay organized while developing the code that makes the analytic data set and, later, the code that analyzes it. If you are wondering, "What is she talking about? I just make one big long code file," then please understand that many people would be exasperated with you for doing that. That is called spaghetti code, and it's really hard to troubleshoot. So this video will cover the opposite of spaghetti code, modular code, and how to arrange it so you don't get confused while developing your analytic file and later analyzing it.
Modular code files are often very short because they just do one thing. Sometimes that thing is big, but if it is small, then there is just a little code in the file. Either way, each file plays an important role and has to be run in a certain order. That is actually why spaghetti code develops: programmers want to make sure that certain pieces of code are run before others. In modular code, that's handled instead with naming conventions that make the files sort in the order that they run. It follows that transformation code, the code that makes the analytic data set, should come first, before any analysis code, because the analysis code will use the transformed variables.
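The sorting point above can be sketched in a few lines of Python. The file names here are invented examples in the spirit of the ones described later in this video: 100s for transformation code, 200s for analysis code.

```python
# Invented file names: 100s = transformation code, 200s = analysis code.
files = [
    "200_Descriptive stats",
    "105_Keep vars",
    "100_Read in data",
    "110_Apply exclusions",
]

# Because each name starts with a fixed-width number, an ordinary
# alphabetical sort is also the run order: all 100s before any 200s.
for name in sorted(files):
    print(name)
```

Any file browser that lists files alphabetically gives you the same effect for free, which is the whole appeal of the convention.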
Since this course is purely conceptual, I don't have any actual code to show you, so I thought I'd just give you a typical example of code arrangement on the slide. The first step in most data analysis projects is reading in data, so just about all my projects start with a file I call 100_Read in data. If I need to read in many data sets, then I have multiple of these files, one for each data set. I name them in order and put the name of the data set in the name of the file, such as Read in encounter data.
Then my next step is usually to trim off the unneeded variables. Remember our data dictionary from the last course will guide us on that. So typically my next file is called 105_Keep vars. This will sort in order after the 100_Read in data code because it makes no sense to keep vars without reading in the data. And, as you can imagine, these code files can be very short. It might be just a few commands to keep vars.
That's why having a lot of comments is not so problematic in files like these. My next step is usually to then apply exclusions. Remember our data reduction spreadsheet and diagram? We can keep track of the numbers that are excluded at each step on those. But the point here is that I keep that code in a separate file after the keep vars code. So I'll call mine 110_Apply exclusions. Notice how I am incrementing the names of the code files by 5, like 105 to 110.
This is so I can sneak in code between 105 and 110 if I find the need to later. I can name it starting with a number in between. It just gives you a little wiggle room. As you know, we designed a lot of variables in the last course in our data dictionary. After applying exclusions, we can add those variables so the next code will be 115_Add vars. To be honest, I usually have several code files for adding variables. I might have 115, 120, 125, and 130.
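As a concrete sketch of that wiggle room, here is a small Python example that builds a throwaway folder of numbered script files and lists them in run order. All the file names are hypothetical, including a 107 file squeezed in between 105 and 110 after the fact; the .R extension is just an assumption for illustration.

```python
import tempfile
from pathlib import Path

# Build a throwaway folder of hypothetical script files, including
# a "107" file squeezed in between 105 and 110 later in development.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    names = [
        "100_Read in data.R",
        "105_Keep vars.R",
        "110_Apply exclusions.R",
        "115_Add vars.R",
        "107_Fix dates.R",  # added after the fact; still sorts into place
    ]
    for name in names:
        (folder / name).touch()

    # Discover every three-digit-prefixed file; sorting the names
    # yields the intended run order, with 107 between 105 and 110.
    run_order = sorted(p.name for p in folder.glob("[0-9][0-9][0-9]_*"))
    for name in run_order:
        print(name)
```

Because the prefixes are a fixed three digits, the late-arriving 107 file slots in exactly where it belongs without renaming anything else.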
It depends on how many variables you are adding and how involved the coding is. There usually is more coding under the 100s, as I call it. More transformation code. One thing I like to do is write out the final analytic data set and call it a name like analytic. And then use that name of the data set throughout the rest of the analysis. That's usually the last code I name under the 100s. Then I start my analysis code in the 200s. I call the example on the slide Descriptive stats.
That's because that's often the first analysis you do with your analytic data set, a descriptive analysis. I usually develop these code files in the order in which the analysis will appear in the report I'm writing or the presentation I'm working on. That way I can easily find where I did the analysis for each table or figure. So the take-home message is that even though modular code may be more work to organize, it is much easier to troubleshoot and share code under a modular framework. So I'm encouraging you to say yes to modular and no to spaghetti code.
It's a healthier choice for you in more ways than one.
- Planning for the analysis
- Descriptive analysis
- Stepwise regression analysis
- Interpreting the final model
- Redefining and defending the final model