Explore whether to combine four large CSV files or read them independently. Learn that the data in the CSV files is inconsistent, with erratic timestamps and missing records. Learn that calculating the coefficient of the slope requires performing a linear regression on the data.
- [Instructor] The first thing I thought of when presented with this problem was, "What the heck is a coefficient of a slope?" I had no idea, not being a math genius. After all, the computer does the math, right? I just need to know what alchemy lies behind a slope coefficient. Knowing that calculating the coefficient of a slope would happen later, I concentrated on things I could do right now in the code. I could look at the data and see what I'm up against. I could figure out how to extract the relevant pieces. Eventually, once I had all that barometric pressure data, I could figure out what the coefficient of the slope thing actually is. And then I could do a crude chart, which I figured would be the most fun part of the program, so I saved it for last.
The data: the four sample files are huge. I thought about sticking them all together or splitting them up into months, but that wasn't part of the assignment, so I left them as is. The good news is the files are plain text in the CSV format, which is perfect for the C language. But upon examining the files, I noticed that the timestamps were terrible. Here's a screen cap of the first clutch of readings, the rows in the CSV file.
The timestamps are not taken at regular intervals. Immediately I knew that I'd have to do some time arithmetic to grab data within a specific date range. I didn't look forward to that. Worse, the data has major gaps in it. I didn't discover this flaw until I actually started running my code. Here in May 2015, you see a 10-day span where no data was recorded between the ninth and the nineteenth. So if you search for a date that has no data, the program must deal with empty results.
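The date-range search and its empty-result case can be sketched as follows. This is a minimal sketch, not the course's code: the line layout `YYYY-MM-DD hh:mm:ss,pressure`, the `struct reading` type, and the functions `parse_reading` and `count_in_range` are all assumptions made for illustration, and the real files may be laid out differently.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical record: one timestamp and one pressure reading. */
struct reading {
    time_t when;
    double pressure;
};

/* Parse a line in the ASSUMED layout "YYYY-MM-DD hh:mm:ss,pressure".
   Returns 1 on success, 0 if the line does not match. */
int parse_reading(const char *line, struct reading *out)
{
    struct tm tm = {0};
    double p;

    if (sscanf(line, "%d-%d-%d %d:%d:%d,%lf",
               &tm.tm_year, &tm.tm_mon, &tm.tm_mday,
               &tm.tm_hour, &tm.tm_min, &tm.tm_sec, &p) != 7)
        return 0;

    tm.tm_year -= 1900;   /* struct tm counts years from 1900 */
    tm.tm_mon -= 1;       /* struct tm counts months from 0 */
    tm.tm_isdst = -1;     /* let the library decide about DST */

    out->when = mktime(&tm);
    out->pressure = p;
    return out->when != (time_t)-1;
}

/* Count the readings with start <= when < end.
   A result of 0 means the range falls in a gap in the data. */
size_t count_in_range(const struct reading *r, size_t n,
                      time_t start, time_t end)
{
    size_t i, count = 0;

    for (i = 0; i < n; i++) {
        if (r[i].when >= start && r[i].when < end)
            count++;
    }
    return count;
}
```

Converting each timestamp to a `time_t` turns the date comparison into a plain numeric comparison, so irregular intervals need no special handling. A return value of zero from `count_in_range` is the empty result the program has to handle, for instance by reporting that no data was recorded in the requested range.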
As I was coding various routines, I began to research the coefficient of the slope thing. The internet is a wonderful resource, but not everything it says is true. For example, one page said to just get the difference between the first and last data points and I'd be good. Uh-huh. After more study and a visit to Wikipedia and YouTube, I gleaned more knowledge about the coefficient of the slope and how it's related to another crazy thing called linear regression. To properly perform a linear regression, I needed a second data set: not just the barometric pressure readings that my code already collected, but also the date and timestamps.
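For reference, the standard least-squares formula for the slope shows why both data sets are needed. Here the symbols are assumptions for illustration: each x is a timestamp, each y is the pressure reading taken at that time, n is the number of readings, and the barred values are the averages. This is the textbook formula, not necessarily the exact form used in the course's code.

```latex
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sum_{i=1}^{n} (x_i - \bar{x})^2}
  = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}
         {n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}
```

Every reading contributes to the result, which is why taking only the difference between the first and last data points is not the same calculation.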
Yes, even when the timestamps aren't regular, you need them to perform a proper linear regression. I could've obtained a statistical library for C which is always an option, but I didn't really want to learn another library. And before I resorted to that solution, I found a C coder online, Chris Webb, who runs the site code-in-c.com. Chris has a lesson from July 2017 that explores calculating a linear regression.
Chris's code inspired me to write a different version, one that doesn't calculate every part of a linear regression, but obtains only the coefficient of the slope. When I had the proper formula, I added it to my code to present the coefficient of the slope data. And I want to express my appreciation to Chris for posting this information online, which saved me tons of time and the struggle of having to work it out myself.
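A slope-only calculation along those lines might look like the sketch below. It is not the code from the course or from Chris Webb's lesson. The function name `slope_coefficient` and its signature are hypothetical, and it uses the textbook least-squares slope in its mean-centered form, chosen here on the assumption that the x values are large timestamps whose squares would otherwise lose precision.

```c
#include <assert.h>
#include <stddef.h>

/* Least-squares slope of y against x (hypothetical helper).
   x: timestamps as doubles, y: pressure readings, n: number of pairs.
   Returns 0 and stores the slope on success.
   Returns -1 when the slope is undefined: fewer than two points,
   or every x value identical. */
int slope_coefficient(const double *x, const double *y, size_t n,
                      double *slope)
{
    double mean_x = 0.0, mean_y = 0.0;
    double sxy = 0.0, sxx = 0.0;
    size_t i;

    assert(x != NULL && y != NULL && slope != NULL);

    if (n < 2)
        return -1;

    /* First pass: the averages of both data sets. */
    for (i = 0; i < n; i++) {
        mean_x += x[i];
        mean_y += y[i];
    }
    mean_x /= (double)n;
    mean_y /= (double)n;

    /* Second pass: sums of products of the deviations. */
    for (i = 0; i < n; i++) {
        double dx = x[i] - mean_x;
        sxy += dx * (y[i] - mean_y);
        sxx += dx * dx;
    }

    if (sxx == 0.0)
        return -1;

    *slope = sxy / sxx;
    return 0;
}
```

Because the x values are passed in explicitly, irregular timestamps need no special treatment. The error return covers the case where a date search comes back with too little data to draw a line through.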