From the course: DJ Patil: Ask Me Anything

What should be in a data scientist's toolbox?

(soft techno music)

- [Interviewer] Alright DJ, tell us about the data scientist's toolbox. And I don't mean just hardware and software, but also the soft skills that every data scientist should have to do a good job.

- Well, the first thing that you need in your toolbox is curiosity: deep, profound curiosity, an exploration of how to think about the data. The second thing in the toolbox is a team. As a data scientist, if you're working alone and isolated, it's incredibly tough. Not only is it lonely, but you can't get a different perspective on the data. And so, if you don't have those, all the other tools that you might have aren't going to do anything for you. So you've got to start with that.

Then, once you have that, it's a question of how you're actually able to get the data, move it around, access it, clean it, and then process it to start looking at something. So what do you need there? Well, it depends on the type of problem. Some data comes in at a high frequency, and so then you need a technology like Kafka or something else to look at it, maybe Spark or one of these other types of streaming processors. But other data sets may come in at large, periodic intervals, like on an annual basis or maybe a decadal basis, like the census. So it depends on the problem type.

Then you need to be able to clean it, and that's still one of the areas that needs massive investment, because we're still in the early days. There are technologies from companies like Trifacta, or the Data Wrangler project, and these other types of things where we're seeing really great innovation, but it's still not sufficient.

Collaboration: you've got to be able to collaborate with that data, and there are different platforms for that, but it's also still tough. In code, you use GitHub or some other similar technology, and that allows an unbelievable ability to actually collaborate. We don't have that yet in data science. It's getting better. There are Jupyter Notebooks and other things like that, but it's still early.

And then there's the question of the presentation layer. And the presentation layer is, do you showcase this in a visualization suite, like one of the classic technologies that people are using these days? It could be MATLAB, it could be some open source technology, it could be Tableau; there are all these things out there. But it also depends on your environment. In some environments you can use open source, in some you can't; in some you can use cloud, in some you can't. I think those things are going to become easier. Some of the stuff that I'm most bullish on is the open source toolkits, 'cause they're just so good right now, and many of the companies that support them offer customized versions to allow them to be even more effective.

But the number one thing that I would tell everyone about that is: don't think that you have to stay with one tool. You can have many tools to approach a problem, and there may be different attempts to try something in one way or another.

(soft techno music)
