From the course: Learning Data Analytics: 1 Foundations

Understanding source data

- [Instructor] It is so important to understand source data, and it's also important to understand the source of the data you're working with. Where did it come from and how did you receive it? Systems that people use at work are often powered by databases or capture data and place it into databases. In a perfect world, we would always be working with the source of the data. Sometimes working with the source is a challenge due to data governance, or even how the data is structured in a system. Source data is where data is first created, and the hope is that it is the most accurate, compared with, say, second-hand data.

In today's environment, where you need data from multiple sources, you may find you're working with data that's stored in a data warehouse or some other data system that ensures you have the datasets you might need. The data may come from another person who has access to the original source, or you may be able to export it out of a system. But in the end, it all originates from the source data.

I think one of the top skills of a data analyst or data worker of almost any type is the ability to work with different sources of data and connect them for further analysis, reporting, or visualization. It's important to always keep note of where your source data comes from, and whether it's coming directly from the original source or it's a pass-through to you, meaning someone sent it to you from the source. It's also important to know that the same data can live in different sources.

Let me give you an example. We have a company that we write reports for, and they have a human resources system and a separate payroll system. When an employee is hired into the organization, the very first place they are captured is the HR system. This is where the employee ID is created, and this employee ID is used in other systems like payroll and benefits and even technology systems.
The HR system keeps up with the necessary HR information, but it isn't specific enough for payroll. It just supplies information to the payroll system, which makes sure that employees get their paychecks. Because the paycheck data is in a dedicated system, you will not find paychecks in HR. You'll find employee information from HR in the payroll system, but the primary source for paycheck data is the payroll system.

Let me take it a step further. In our example, human resources is one department, and they're responsible for all the data in HR and for providing necessary information to accounting, which is responsible in part for the payroll data. If you work in neither of those departments and you've been tasked with reporting on hiring information and performance bonuses, then you need information from both source systems. HR data and payroll data are highly protected, and for what you need, they may not give you direct access to both systems. They will likely provide you just the data you need in the form of CSV files. When you report on the data, your sources are the CSV files provided by the two departments, and the sources of those CSV files are the two source systems.

Knowing your sources will help you explain how you got your data and where it originates. This is important for multiple reasons. You may detect an error and need to know who to tell. Sometimes systems get upgraded or changed, and that impacts your data. If something is off in your data, you need a path to follow to determine the issue. It's always important to keep note of where your source data comes from, and whether it comes directly from the source or it's a pass-through to you.
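
The HR and payroll example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names (`employee_id`, `name`, `hire_date`, `bonus`), the sample rows, the source labels, and the helper functions `read_rows` and `combine` are all hypothetical and are not taken from the course. The sketch tags each row with the system it came from, then connects the two extracts on the shared employee ID so the report keeps a note of where each value originated.

```python
import csv
import io

# Hypothetical extracts standing in for the CSV files each department would send.
HR_CSV = """employee_id,name,hire_date
101,Ana Ruiz,2023-02-01
102,Ben Cole,2023-03-15
103,Chi Park,2023-04-10
"""

PAYROLL_CSV = """employee_id,bonus
101,1500
102,0
"""


def read_rows(text, source_label):
    """Parse CSV text and tag every row with the system it originated from."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["_source"] = source_label
        rows.append(row)
    return rows


def combine(hr_rows, payroll_rows):
    """Keep every HR row and attach the matching payroll row, if one exists."""
    payroll_by_id = {r["employee_id"]: r for r in payroll_rows}
    report = []
    for hr in hr_rows:
        pay = payroll_by_id.get(hr["employee_id"])
        report.append({
            "employee_id": hr["employee_id"],
            "name": hr["name"],
            "hire_date": hr["hire_date"],
            "hire_source": hr["_source"],
            # A missing payroll row stays visibly missing instead of being guessed.
            "bonus": pay["bonus"] if pay else None,
            "bonus_source": pay["_source"] if pay else None,
        })
    return report


hr_rows = read_rows(HR_CSV, "HR system (CSV export)")
payroll_rows = read_rows(PAYROLL_CSV, "payroll system (CSV export)")
report = combine(hr_rows, payroll_rows)

for line in report:
    print(line)
```

Because each value carries its source label, an employee who appears in the HR extract but has no payroll row shows up with an empty bonus and no bonus source, which gives you a path to follow back to the right department.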
