Join Barton Poulson for an in-depth discussion in this video, Python, part of Data Science Foundations: Fundamentals.
- [Voiceover] When people ask me what tools they need to learn to do data science, I generally tell them two things: the statistical programming language R, and the general-purpose programming language Python. Now, Python is an indispensable, central, critical tool for data science. Let's take a look again at the KDnuggets Software Poll to see Python's place. Once more, this is a survey of data-mining professionals about the tools they use most often. Python is fourth on this list, and it's also the only general-purpose programming language that's included on this list.
Also, it's growing rapidly. Python's been around for a while, but Python for working with data is new, and it's expanding very quickly. There are a few reasons for this. First off, Python is general purpose, which means you can potentially do anything in Python, and a lot of people do. You can create applications, you can access data sources, you can control a number of things. Also, Python is built in. It's already included on a Mac or Linux computer.
And it's easy to add on a Windows PC. Like R, Python has a great community behind it. And it has thousands of packages; it actually has tens of thousands of packages, but only a portion of those apply to data science in particular. Now, there is a decision you have to make with Python, and that is about versions, because there are two versions, 2.x and 3.x, that are both in wide circulation. The problem is that version three, 3.x of Python, is not fully backwards compatible.
Meaning the code that you write for three doesn't always run on two, and vice versa. And, also, many of the packages used in data science still rely on version two. Now, I'm sure that will change over time. But for right now, it means that many people doing data science actually use the older version, version two, although things are being brought up to date on version three. Let's take a quick look at interfaces. When you're working with Python, you have a choice of Python's own integrated development environment; they call it IDLE.
You can also run it in the terminal or at the command line, or truthfully, any IDE of your choice. You can run it through an interface called Jupyter, which is what I have here on the left. Originally called IPython, Jupyter's the name of the umbrella project that works with over 50 different languages. And for me right now, Jupyter is my preferred Python interface, especially for demonstration and sharing. Now, if you want to get Jupyter, installation can be a little tricky, unless you get sort of a pre-assembled bundle.
There are two that are really good. There's a company called Continuum that releases a version called Anaconda, and it's a huge download. It includes Jupyter and hundreds of the important packages and all their dependencies that you would need for data science. And Continuum Anaconda is free. Enthought, another company, makes a version called Canopy. They have a free version of it, but they also have a commercial product. Either one will work, and either one will get you started on Python and Jupyter and make your life a lot easier.
Now, again, like R, the important thing to remember is it's always command line. You're going to be typing lines of code no matter what interface you use. So, let's take a look at some of those lines of code. The commands are a text interface, and this is a command I used right here to produce one of the figures on another page. And the advantage is that Python is familiar to millions and millions of coders. People know how to use Python. And Python is a very clean, simple, easy to work with language.
And there are a lot of really simple adaptations for working with data. And let's take a look at some of the output. Now, this is what the output looks like in Jupyter/IPython, my preferred way of working. It's text output, which we have in some of the grey cells here. And we have inline graphics; you don't see any here, but you did on the other page. And it's really easy to organize and present all of the output if you're using Jupyter.
That's one of its big advantages. So, our conclusions here. First, Python is a very popular language and it's familiar to millions and millions of people. It has the strength of being a general purpose programming language, which means it can do a lot more than just data. And like R, Python gets its power from the thousands of packages for data that are freely available from an online community.
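As a small illustration of the 2.x versus 3.x point made earlier, here is a minimal sketch, written for Python 3, of one well-known difference between the two lines: how the `/` operator treats two integers. The function name `describe_division` is just a hypothetical helper for this example and is not something from the video.

```python
import sys


def describe_division(a, b):
    """Return (a / b, a // b) so the two division operators can be compared."""
    # In Python 3, / on two integers performs true division and returns a float.
    # In Python 2, / on two integers floors the result, so 7 / 2 gives 3 there.
    # The // operator floors the result in both versions.
    return a / b, a // b


true_quotient, floor_quotient = describe_division(7, 2)

# print is a function in Python 3; in Python 2 it is a statement, which is
# another common reason code written for one version fails on the other.
print("Running Python major version:", sys.version_info.major)
print("7 / 2  =", true_quotient)
print("7 // 2 =", floor_quotient)
```

Run under Python 3, the first quotient is a float and the second is an integer. The same lines behave differently under Python 2, which is the kind of mismatch that makes the version choice matter.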
- The demand for data science
- Roles and careers
- Ethical issues in data science
- Sourcing data
- Exploring data through graphs and statistics
- Programming with R, Python, and SQL
- Data science in math and statistics
- Data science and machine learning
- Communicating with data