From the course: Deploying Scalable Machine Learning for Data Science

Building and running ML models for data scientists


- [Narrator] Data scientists don't just have to build models; we have to be able to run them in production environments. Let's take a look at what that involves. Machine learning models are built for fairly specific purposes, like predicting the optimal price for a product, detecting fraud, or categorizing documents. The first thing a data scientist needs to do when starting a new project is to understand the specific business requirements. We have to ask: what is the problem we're trying to solve? The answer to this question should be as precise as possible. Vague descriptions like "reduce the number of customers that leave for a competitor" or "show an analyst interesting research documents" are too imprecise. They don't describe a problem that can be modeled. These kinds of statements are more like general descriptions of what we want, but they don't necessarily map to a modeling problem.

When we build models, we often build either regression models or classification models. Regression models make numeric value predictions, like the estimated value of a stock at some point in the future, the best price to charge for a hotel room during peak tourist season, or the number of customers we can attract with sales pricing. Linear and logistic regression are commonly used algorithms for these kinds of problems. Classification models make choices among categorical values, such as legitimate versus fraudulent credit card charges, identifying an object in an image, or identifying the customers most likely to respond to a particular kind of advertisement. There are many algorithms for classifying structured data, including support vector machines, known as SVMs, random forests, and extreme gradient boosting, or XGBoost. Deep learning networks are widely used to classify objects in unstructured data, such as images, videos, and audio. (The first sketch below contrasts the two kinds of models.)

The model building process begins with collecting data. This can take a significant amount of time. Data scientists have to consider what the data describes and how it relates to the business question the model is addressing. Modelers also have to review data for quality control. Data sets often have mistakes, missing data, and inconsistent coding conventions. These problems need to be addressed before we can start the more interesting work of building a model. (The second sketch below shows typical fixes.)

It also helps to understand the distribution of data in a data set, that is, to understand descriptive statistics like the minimum and maximum values of a variable, as well as its mean and standard deviation. Here are some examples of bell-shaped curves. These show how many times a variable takes on a particular value. A variable with this kind of bell-shaped distribution is called normally distributed. That's important because some statistics work well with normally distributed data, but not with other kinds of distributions. Data scientists should know the overall distribution of the variables that may be used in their models. This process is called exploratory data analysis. The goal at this stage is not to create models or make predictions, but to understand what kind of data you are working with. (The third sketch below computes these statistics.) If you're interested in techniques for exploring data, I suggest you search the course catalog for SQL and exploratory data analysis.

Once you have your data prepared and you understand the overall properties of your data set, you can start experimenting with building models. This is a highly iterative process. Many data scientists and machine learning developers work with Python and/or R.
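To make the regression-versus-classification distinction concrete, here is a minimal Python sketch using scikit-learn. The data is synthetic and the feature is hypothetical; the point is only the contrast between predicting a numeric value and choosing between categories.

```python
# A minimal sketch contrasting regression and classification with
# scikit-learn. The data is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(42)

# Regression: predict a numeric value (e.g., a price) from one feature.
X = rng.uniform(0, 10, size=(100, 1))                    # hypothetical feature
y_price = 50 + 7.5 * X.ravel() + rng.normal(0, 2, 100)   # numeric target

reg = LinearRegression().fit(X, y_price)
print("Predicted price at x=5:", reg.predict([[5.0]])[0])

# Classification: choose a category (e.g., fraudulent vs. legitimate).
y_fraud = (X.ravel() + rng.normal(0, 1, 100) > 6).astype(int)  # 0/1 labels

clf = LogisticRegression().fit(X, y_fraud)
print("P(fraud) at x=5:", clf.predict_proba([[5.0]])[0, 1])
```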
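The quality-control step can be illustrated with a short pandas sketch. The data set and column names here are hypothetical; it shows only the two fixes mentioned in the narration, normalizing inconsistent coding conventions and handling missing data.

```python
# A sketch of basic data quality control with pandas.
# The "region" and "price" columns are hypothetical examples.
import pandas as pd

df = pd.DataFrame({
    "region": ["East", "east ", "WEST", None, "West"],
    "price":  [10.0, None, 12.5, 11.0, 9.9],
})

# Normalize inconsistent text coding ("east ", "WEST" -> "east", "west").
df["region"] = df["region"].str.strip().str.lower()

# Handle missing data: drop rows missing the category, impute the price.
df = df.dropna(subset=["region"])
df["price"] = df["price"].fillna(df["price"].median())

print(df)
```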
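For the descriptive-statistics part of exploratory data analysis, a few lines of pandas are enough to compute the minimum, maximum, mean, and standard deviation, and to get a rough view of a variable's distribution. The data below is synthetic, drawn from a normal distribution so the bell shape is visible.

```python
# A sketch of descriptive statistics for exploratory data analysis.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.normal(loc=100, scale=15, size=1000)})

# Count, mean, standard deviation, min, quartiles, and max in one call.
print(df["value"].describe())

# A quick text histogram: a normally distributed variable is bell shaped.
counts, edges = np.histogram(df["value"], bins=10)
for count, left in zip(counts, edges):
    print(f"{left:7.1f} | {'#' * (count // 10)}")
```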
Jupyter Notebooks are popular because they make it easy to develop machine learning code iteratively. Notebooks are also easy to share and include support for documenting models. R can be used with Jupyter Notebooks as well. RStudio, however, is a well-established and popular tool for building models in R, or for performing statistical analysis with R. RStudio makes it easy to work with a large number of R packages. It also provides tools for specialized analysis, and it makes it easy to build interactive web applications for visualizing data with R. Working with Jupyter Notebooks and RStudio is a standard practice for developing machine learning models. During development, it is important to be able to manipulate and reformat data, split data into training and test sets, and experiment with different algorithms (the sketch below shows one such experiment). These tools are also well suited for collaborating and sharing your models with other data scientists and machine learning developers. When we move our models into production, we need to work with additional tools and techniques to ensure our models will function within the complex ecosystem of software that is typically found in today's businesses and organizations.
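As a sketch of that iterative workflow, the following snippet, assuming scikit-learn is available, splits one of its bundled sample data sets into training and test sets and compares two of the algorithms mentioned earlier, an SVM and a random forest.

```python
# A minimal sketch of iterative experimentation: split the data into
# training and test sets, then compare two classification algorithms.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

for model in (SVC(), RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)                 # train on the training set
    score = model.score(X_test, y_test)         # evaluate on held-out data
    print(f"{type(model).__name__}: test accuracy = {score:.3f}")
```

In practice this loop grows to include data reformatting, feature engineering, and hyperparameter tuning, but the train/test separation stays the same so that evaluation always happens on data the model has not seen.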
