Join Doug Rose for an in-depth discussion in this video, Break Down Your Work, part of Learning Data Science: Using Agile Methodology.
- By now you've seen the difference between the software development and data science lifecycles. You've also seen how the data science lifecycle is best delivered in two-week sprints. These sprints allow you to break down the work and quickly deliver valuable insights. When you're on a data science team, there are always large data sets that need scrubbing. There are also new data sources to explore. A lot of what you're doing is preparing your data. When you work in sprints, you're forcing the team to do the minimum amount of data preparation.
Doing the minimum amount of preparation might sound like a bad thing. Most people want to do higher-quality work, and organizations place a lot of emphasis on preparation. In reality, when you do the minimum amount of prep, you force the team to focus on insights, and not just on capability. You don't want your team spending weeks or even months just setting up the data. Instead, you want the team to start exploring the data almost immediately. You also have to look at it from the organization's perspective.
Most organizations aren't really interested in the data. They're interested in the knowledge and insights you get from the reports. It's not the data that's valuable, it's the insights. From an organization's perspective, managing data is part of the cost and not part of the benefit, so a quick insight is much more valuable to an organization than long stretches of data scrubbing. That means the team will be pressured to extract value from the data as quickly as possible. It will be difficult for a data science team to justify spending months prepping data and only delivering reports at the end.
In many ways, it's similar to how organizations started viewing software. In the beginning, most organizations saw software development as a mystery. They left most of the details to highly skilled developers. These developers would spend most of their time planning for big releases. Now most software developers are forced to deliver valuable software in much smaller chunks. They'll spend less time preparing, and more time delivering. That way the organization can get a look at the value before the team gets too far along.
This is where data science is today. At most organizations, it's still a bit of a mystery. The team still gets a lot of leeway in how they want to do their work. But it won't take long for managers to start asking tough questions. The team won't always have the luxury of spending much time preparing large data sets. Instead they'll have to focus on the minimum viable data prep. I once worked for an organization that was focused on automating the process of scrubbing a very large data set.
They wanted to plug it into an even larger set that they already had housed on their cluster. For months, the team was solely focused on this task. They downloaded open source software tools, and purchased some commercial products to help them prep the data. After several months, they had created a number of ways to automate the process of moving these large data sets into their cluster. After they moved it over, they had a meeting with the vice president of data services. They showed a PowerPoint presentation of how much data they had moved.
They went through several slides of how difficult it was to scrub and import this new large data set. Near the end of the meeting, the vice president asked an interesting question. They simply asked, "What do we now know that we didn't know before?" The question landed in the room with a thud. It was clear from the silence that no one had thought that way for months. Everyone in the room was completely focused on capability. They had forgotten the real value to the organization. If they had delivered in two-week sprints, they would have been much better able to focus on the value.
Instead of building out the entire data set, they could have worked with smaller subsets of the data. That way they could immediately start creating reports. They would quickly start gaining insights and asking more interesting questions. It's almost like hosting a dinner party. You don't want to spend all of your time setting the table. That leaves too little time to prepare a great meal for your guests. When you're exploring the data, you get a much better sense of the value. When you focus only on scrubbing and importing, the work is in danger of becoming routine. If you focus on the minimum viable data prep, your team will keep the work small and manageable, and you'll add the maximum value to your organization.
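To make the idea of minimum viable data prep concrete, here is a minimal sketch in Python with pandas. It assumes a hypothetical orders.csv file with order_date, region, and revenue columns (none of these names come from the course): instead of scrubbing and importing the whole data set, the team pulls a small sample, does just enough cleanup, and produces a first report that can be discussed within the same sprint.

```python
# A minimal sketch of "minimum viable data prep": rather than scrubbing and
# importing an entire data set, pull a small sample, do just enough cleanup,
# and produce a first report the team can discuss this sprint.
# The file name and column names (orders.csv, order_date, region, revenue)
# are hypothetical placeholders, not from the course.
import pandas as pd

# Read only a small slice of the larger data set.
sample = pd.read_csv("orders.csv", nrows=10_000, parse_dates=["order_date"])

# Just enough cleanup to make the sample usable.
sample = sample.dropna(subset=["region", "revenue"])

# A first, discussable insight: revenue by region per month.
report = (
    sample
    .groupby([sample["order_date"].dt.to_period("M"), "region"])["revenue"]
    .sum()
    .unstack("region")
)
print(report)
```

A report like this is rough, but it immediately invites the kind of question the vice president asked: what do we now know that we didn't know before?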
This course shows how to structure your work within a two-week sprint. See how to work within a data science life cycle (DSLC)—a methodology for cycling through questions, research, and reporting every two weeks. Explore key practices to help your team break down the work so it fits within a two-week sprint. Learn how to use tools like question boards to encourage discussion and find essential questions. And most importantly, learn how to grow your team's shared knowledge and avoid common pitfalls.
- Defining data science success
- Determining project challenges and criteria for success
- Using a DSLC
- Iterating through DSLC sprints
- Creating a question board
- Breaking down your work
- Adding to organizational knowledge
- Avoiding pitfalls