Join Barton Poulson for an in-depth discussion in this video, Reproducible research, part of Data Science Foundations: Fundamentals.
- [Voiceover] The last thing we want to discuss before we wrap up our introductory course on data science is reproducible research. I like to think of this as leaving a digital trail of your work. There are a few reasons you would want to be able to do this. First, it allows you to check your work and verify your conclusions. Second, it allows your client, future researchers, you yourself, and anyone else who comes back to the project to see how it happened. Third, it may be required by certain policies.
And fourth, documenting your process ensures a form of intellectual honesty. Now, there are a few things that you actually need to include in reproducible research. Number one is your sources. You want to include the raw data before it's processed at all. You want to include a list of the goals and the rationale for the project, and the resources, and that can include the software, the hardware, even the people who worked on it, the funding sources, whatever is in the background. Next, you want to talk about process.
You want to give the actual code that you used to analyze your data. You want to explain your explorations at each step. You'll probably want to keep a lab journal or notebook. And then there's the output. You're going to want to provide the clean final data set that you worked with. You'll want to provide presentation-ready graphics and all the reports that are written, so people have the full package from step one to the conclusion. In terms of how much detail you need, think of it like a recipe. What's required? You need enough detail that a person can duplicate each step of the analysis and all the decisions and come up with the same final product.
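One way to capture the sources and resources just described is a small manifest stored next to the raw data. The following is only a minimal sketch in Python, not a prescribed method: the function names (`file_sha256`, `build_manifest`, `write_manifest`) and the manifest fields are assumptions made for illustration. It records a checksum of each raw file, so a later reader can confirm the data is unchanged, along with the project goal and basic software and hardware details.

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(raw_files, goal):
    """Collect raw-data checksums plus basic software and hardware details.

    The field names here are hypothetical; use whatever your project needs.
    """
    return {
        "goal": goal,
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "raw_files": {str(p): file_sha256(p) for p in raw_files},
    }


def write_manifest(manifest, out_path):
    """Save the manifest as plain-text JSON so it needs no special software to read."""
    text = json.dumps(manifest, indent=2, sort_keys=True)
    Path(out_path).write_text(text, encoding="utf-8")
```

Re-running `file_sha256` on the raw files later and comparing the result against the stored manifest is a quick check that the analysis starts from the same data.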
You also need to give them enough that they can follow your rationale, your decision-making process, as you go through it. Now, records are a big part of this. You may want to keep a lab notebook with a narrative. I'm not suggesting a composition notebook. Usually, a digital form is going to be more durable, but you want to keep the process documented. And for any code that you produce, it should be commented well enough that people can tell what's in there, what it does, and why you included it. And then to state something obvious, make sure that your files and folders are appropriately titled so that people can tell what they are.
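A digital lab notebook can be as simple as a plain-text file that gains a timestamped entry at each step. Here is a minimal sketch in Python under that assumption; the function names (`format_entry`, `log_entry`) and the entry layout are hypothetical, chosen only to show what was done and why being recorded together.

```python
from datetime import datetime, timezone


def format_entry(step, rationale, when=None):
    """Format one notebook entry: a UTC timestamp, what was done, and why."""
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = when.strftime("%Y-%m-%d %H:%M UTC")
    return f"[{stamp}] {step}\n    Why: {rationale}\n"


def log_entry(notebook_path, step, rationale, when=None):
    """Append an entry to the notebook file, creating the file if needed."""
    entry = format_entry(step, rationale, when)
    # Append mode keeps earlier entries intact, so the file reads as a narrative.
    with open(notebook_path, "a", encoding="utf-8") as handle:
        handle.write(entry)
    return entry
```

For example, `log_entry("notebook.txt", "Dropped rows with missing ids", "they cannot be matched to a customer")` would add one entry to a file named `notebook.txt` (an example name only).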
And finally, it's a very good idea to create the narrative and keep the notebook as you go so you don't have to try to recreate it all in your mind afterwards. There are two other principles you might want to keep in mind as you do this. Number one is portability. Try to avoid proprietary formats unless necessary. Use universal formats when you can. That includes things like CSV files and text files, PDF documents, or Markdown documents. And number two, prepare yourself for obsolescence. Plan for software to change, plan for hardware to change, be prepared for web links to die and sources to disappear, and the people who worked on the project to leave.
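To make the portability point concrete, here is a minimal sketch in Python that writes the same small table as CSV text and as a Markdown table, using only the standard library. The function names (`rows_to_csv`, `rows_to_markdown`) and the sample columns are made up for illustration; the point is only that both outputs are plain text that can be opened without any particular program.

```python
import csv
import io


def rows_to_csv(fieldnames, rows):
    """Serialize a list of dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def rows_to_markdown(fieldnames, rows):
    """Render the same rows as a Markdown table for a plain-text report."""
    lines = [
        "| " + " | ".join(fieldnames) + " |",
        "| " + " | ".join("---" for _ in fieldnames) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[name]) for name in fieldnames) + " |")
    return "\n".join(lines) + "\n"


# Hypothetical cleaned data set with two columns.
clean_rows = [{"id": 1, "score": 2.5}, {"id": 2, "score": 4.0}]
csv_text = rows_to_csv(["id", "score"], clean_rows)
markdown_text = rows_to_markdown(["id", "score"], clean_rows)
```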
Provide enough detail that others can reconstitute the work even when those things happen. In conclusion, no project stands alone. Every data science project is in a context and relates to other goals and other projects. Make sure to provide enough resources and details on the process that people can recreate your project as needed. And finally, plan for change, prepare for the inevitable, and make sure you give a durable and useful product to your client.