Learn more about Kafka and the tools used with streaming analytics to create value from vast amounts of data, and how data engineering fits in.
(upbeat music)

- Get the data, move it around, access it, clean it, and then process it so you can start looking at something. So what do you need there? Well, it depends on the type of problem. Some data comes in at a high frequency, so you need a technology like Kafka, or Spark, or one of these other types of streaming processors to look at it. But other data sets may come in at large periodic intervals, on an annual or even decadal basis, like the census. So, it depends on the problem type.

- Kafka has become practically the default for streaming analytics, especially for high-tech companies or companies dealing with large volumes of data. In the Kafka world, we can think of our data providers as producers, the things that write data to our cluster. They send data in, and on the other side, where we have the use cases, we have our consumers, the things that are actually using the data. Now, one interesting note about how Kafka works is that these consumers can also then write data back to another part of Kafka, so the consumers can become producers as well.

- Data science is the process of making data useful, and it's not something that you can do with just one skill set or another; you need a whole host of skill sets to actually put data to work. And data engineering is one of the most essential skills you need to really get value from your vast amounts of data.

- Where does data engineering really fit in? Well, if we take another look here at our ideal world view, we have all of our components, right? We have our data coming into the hub, and we have all the people using the hub. So what parts is the data engineering team responsible for? Essentially, everything in the hub and all of the orange lines, the inputs and outputs to the hub.

- The algorithms for mining text vary in their emphasis on meaning. Some place a lot of emphasis on it and try to model it with great care; others ignore it completely.
Interestingly, the simple methods, like the plain old bag of words, which simply indicates whether a word occurs or not, can be sufficient for certain tasks. The more complex methods are reserved for natural language processing, where the computer is, for instance, trying to understand what you're saying, infer your meaning, and answer your questions. Either way, you want to choose an algorithm that fits your goals and your task, and helps you get the insight you need for your particular data science project.

- The business is expanding to provide mobile couponing for customers. What are the goals for this use case? The goal is to architect a mobile data processing framework that will push coupons to customers based on their buying preferences and their current location. It has to work in real time, within a couple of seconds of round-trip delay. It is location specific, so coupons are filtered by location. It uses the customer's past history to find preferred services. And of course, it needs to be massively scalable to handle thousands of active users.
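The couponing requirements above boil down to a filtering step: match offers against the customer's current location and their preferred services derived from purchase history. Here is a toy in-memory sketch of that logic; the `Coupon` and `Customer` structures and the `coupons_to_push` function are hypothetical illustrations, not the actual framework discussed in the course, and a real deployment would run this behind a streaming pipeline for scale.

```python
from dataclasses import dataclass, field

@dataclass
class Coupon:
    service: str   # e.g. "coffee", "car wash" (hypothetical fields)
    location: str  # the store's location

@dataclass
class Customer:
    location: str                      # current location
    purchase_history: list = field(default_factory=list)

def preferred_services(customer, top_n=3):
    """Derive preferred services from past purchases (most frequent first)."""
    counts = {}
    for service in customer.purchase_history:
        counts[service] = counts.get(service, 0) + 1
    ranked = sorted(counts, key=counts.get, reverse=True)
    return set(ranked[:top_n])

def coupons_to_push(customer, coupons):
    """Keep only coupons at the customer's location for services they prefer."""
    prefs = preferred_services(customer)
    return [c for c in coupons
            if c.location == customer.location and c.service in prefs]

alice = Customer(location="downtown",
                 purchase_history=["coffee", "coffee", "books", "coffee"])
offers = [Coupon("coffee", "downtown"), Coupon("coffee", "uptown"),
          Coupon("pizza", "downtown")]
pushed = [c.service for c in coupons_to_push(alice, offers)]
print(pushed)  # ['coffee']
```

The location check handles the "location specific" requirement, and `preferred_services` stands in for the "past history" requirement; real-time delivery and scalability would come from the surrounding infrastructure, not this filter.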
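The Kafka segment earlier described producers writing to the cluster, consumers reading from it, and consumers re-publishing derived data so they become producers too. This toy in-memory class is a stand-in for that pattern only; it is not real Kafka (a real setup would use a broker and a client library such as kafka-python), and the `MiniBroker` name and topics are invented for illustration.

```python
from collections import defaultdict, deque

class MiniBroker:
    """Toy in-memory stand-in for a Kafka cluster: named topics of ordered messages."""
    def __init__(self):
        self.topics = defaultdict(deque)

    def produce(self, topic, message):
        """A producer writes a message onto a topic."""
        self.topics[topic].append(message)

    def consume(self, topic):
        """A consumer yields (and removes) all currently queued messages."""
        while self.topics[topic]:
            yield self.topics[topic].popleft()

broker = MiniBroker()

# Producers send raw events into the cluster.
broker.produce("page-views", {"user": "alice", "page": "/home"})
broker.produce("page-views", {"user": "bob", "page": "/cart"})

# A consumer reads events -- and, acting as a producer as well,
# writes derived records back to another topic.
for event in broker.consume("page-views"):
    broker.produce("view-counts", {"user": event["user"], "views": 1})

counts = list(broker.consume("view-counts"))
print(counts)  # [{'user': 'alice', 'views': 1}, {'user': 'bob', 'views': 1}]
```

The second loop is the point of the sketch: the same process is a consumer of `page-views` and a producer to `view-counts`, mirroring how Kafka pipelines chain processing stages.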
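The plain bag-of-words method mentioned above, which only records whether a word occurs or not, fits in a few lines. This is a minimal sketch of the general technique, with a simple lowercase-and-split tokenizer assumed for illustration:

```python
def bag_of_words(documents):
    """Binary bag of words: for each document, mark whether each vocabulary word occurs."""
    vocab = sorted({word for doc in documents for word in doc.lower().split()})
    vectors = []
    for doc in documents:
        present = set(doc.lower().split())
        vectors.append([1 if word in present else 0 for word in vocab])
    return vocab, vectors

docs = ["the cat sat", "the dog sat down"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['cat', 'dog', 'down', 'sat', 'the']
print(vectors)  # [[1, 0, 0, 1, 1], [0, 1, 1, 1, 1]]
```

Note that the vectors discard word order and frequency entirely, which is exactly the trade-off the transcript describes: cheap and often sufficient, but nowhere near enough for tasks that require inferring meaning.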