Join Barton Poulson for an in-depth discussion in this video Existing data, part of Data Science Foundations: Fundamentals.
- [Voiceover] When we're looking at data sources, the easiest place to start is existing data sets. Conceptually, the easiest way to do this is with in-house data. Also in terms of existing data, there's open data and there's third-party data. I'll talk about the pros and cons of each of these briefly. In-house data can be fast and easy because it's already there. It might be in the proper format and, if you're lucky, it's documented properly. On the other hand, in-house data may not be subject to the kind of quality control that data that's intended for other people might have.
Also, there might be restrictions or policies on the data use even within a particular organization. So, let's look at some of the pros and cons of in-house data. The first pro is that it's potentially very quick, easy, and should be free because you're within the same organization. If you're lucky, there's standardized formatting that the organization has instituted, rules or policies that say "this is the way we do it." Also, if you're very lucky, the original team that gathered the data and processed it should still be available.
That makes your life very easy when you want to answer questions and get insights. Also, identifiers might be available, and what that means is you might actually know the identity of each respondent or case in it. There are, however, some cons. First off, the data you need simply may not exist. It may be that the organization never gathered that data. Also, the documentation may be inadequate. Often when organizations are gathering their own data they get it together, they use it for what they need to, and then they kind of store it off to the side, but it's hard to tell what's what.
And also, the quality of the data may be uncertain. You may not know about response biases or errors in coding, and that makes it a little harder to work with it reliably. An alternative to in-house data is open data. This is prepared data that's freely available. Usually, it's government data, but there's also some open corporate data. And there's a move towards more and more open scientific data. Now the pros here are that you can get enormous data sets.
You could get terabytes, possibly even petabytes, that are worth millions of dollars. You can get a very wide range of topics. You can get them covering various historical times and trends, and the data from open data is often well formatted and well documented. The cons of open data are that you potentially have biased samples, in that the samples may be restricted to a particular geographic area or may be restricted only to people that have internet access, and those can create problems with interpretation.
Also, the meaning of the variables may not be clear. You don't have the opportunity to talk with the people who gathered the data and it's sometimes hard to understand why or how they defined a particular variable the way they did. Also, in some situations open data may stipulate that you need to share your analyses. That open data is there for open analytics. That might be an issue and something you need to look at carefully before you dive in. Finally, openness and data can conflict with the needs of privacy and confidentiality in data.
Finally, there's third-party data, or what we call data as a service, DaaS. These people are also called data brokers, and you can get a huge amount of data on a large range of topics. Many of these data brokers will also process the data to make certain kinds of inferences for you, which can save you a lot of time. So the pros of third-party data are that it can save you a lot of time and effort. They often give you individual-level data. And you can get summaries and inferences.
On the other hand, third-party data can be very, very expensive. Also, it may still require validation. You still need to double check that it means what you think it means. And probably most significantly, third-party data is often very distasteful to many people. So, our conclusion. Exercise care in interpreting data because there's always the possibility of implicit biases, or simply because you weren't there when it got gathered, it may not mean exactly what you think it means.
That being said, you can get enormous amounts of information and you can save yourself a lot of time in your data science projects by using existing data sets.
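As a rough illustration of the "double check" step mentioned above, here is a minimal Python sketch of a few basic quality checks on an existing data set: duplicate identifiers, missing values, out-of-range values, and codes that aren't in the codebook. The sample rows, the column names (`id`, `age`, `region`), the allowed region codes, and the age range are all assumptions made up for this example, not something from the video.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for an existing data set; the column
# names and values are assumptions for illustration only.
SAMPLE_CSV = """id,age,region
1,34,north
2,,south
3,29,nroth
3,29,nroth
4,210,east
"""

ALLOWED_REGIONS = {"north", "south", "east", "west"}  # assumed codebook
AGE_RANGE = (0, 120)  # assumed plausible range for age


def audit_rows(rows):
    """Count basic quality problems in a list of dict rows."""
    problems = Counter()
    seen_ids = set()
    for row in rows:
        # An identifier that appears more than once suggests a duplicate case.
        if row["id"] in seen_ids:
            problems["duplicate_id"] += 1
        seen_ids.add(row["id"])
        # Empty strings are treated as missing values here.
        if row["age"] == "":
            problems["missing_age"] += 1
        elif not AGE_RANGE[0] <= int(row["age"]) <= AGE_RANGE[1]:
            problems["age_out_of_range"] += 1
        # Codes outside the codebook may be coding errors (e.g. typos).
        if row["region"] not in ALLOWED_REGIONS:
            problems["unknown_region_code"] += 1
    return problems


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = audit_rows(rows)
for name, count in sorted(report.items()):
    print(f"{name}: {count}")
```

Checks like these won't tell you whether a variable means what you think it means, but they can flag the more mechanical problems before you start interpreting anything.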
- The demand for data science
- Roles and careers
- Ethical issues in data science
- Sourcing data
- Exploring data through graphs and statistics
- Programming with R, Python, and SQL
- Data science in math and statistics
- Data science and machine learning
- Communicating with data