From the course: Data Visualizations with Plotly

Statistical charts with Plotly - Plotly Tutorial

- To really get a sense of the relationships underlying your data, I recommend you visualize your data early and often. Plotly has a host of statistical charts that are especially useful for understanding how your data is distributed. We'll cover a few of the most useful in this lesson.

After importing Plotly Express as px, we'll generate a simple histogram from the tips dataset. Tips is a record of meals at a restaurant, including the total bill amount and tip size. When creating a histogram, you need only specify the variable whose distribution you want to see, and Plotly will create the buckets and record the counts. Looks great. We see total bills tended to be within the $10 to $20 range, with a few as high as $50.

Now, if we add color to our histogram, it will be segmented by that variable. Let's color by whether the meal was lunch or dinner. Looks good. Note that this histogram is stacking the observations for lunch and dinner.

One other useful aspect of the histogram is that you can take the bins on your x-axis and calculate aggregate functions on other variables in your data. So for instance, if we wanted to see the average tip amount within each bucket, we'd add tip to the y-axis and make use of histfunc, or histogram function. Here, we'll use average. So let's see how this looks. Very cool. So as expected, the average tip size seems to grow as the bill size grows.

Our next visual is especially handy for understanding how different segments of your data are distributed, especially when the sample sizes are dissimilar. This is called a distplot. To use the distplot, we need to import another Plotly package called figure_factory. This package is basically an add-on in Plotly that allows for more niche and complicated figures that aren't currently supported by Plotly.js or Plotly Express. To create this distplot, we'll create separate series for tips during lunch versus dinner.
Next, we define our labels as dinner and lunch, then we package our data into a list called hist_data. The figure factory function create_distplot then generates the following. What you see here is a histogram of sorts showing the distribution of tip amount for lunch and dinner. The important difference is that lunch and dinner are each normalized separately, so the bars for each sum up to one, meaning differences in the number of observations are irrelevant here. This visualization is telling us that lunch had a higher proportion of low tip amounts than dinner did. The curve is a smoothed trend fitted to the data. The rug, or bottom part of the visual, shows each individual observation, un-bucketed, for added detail. Each of these three components can be removed from the visual fairly easily. Let's remove the curve. Now we can also modify the bin size for added granularity.

Next, to get a quick visual showing the relationship between all of your variables, I recommend the scatter matrix. This will create scatterplots for every pairwise combination of the dimensions you specify. Let's import the iris dataset. We'll use px.scatter_matrix and include all the measurement data from the iris dataset as dimensions. Lastly, we'll color by species. Wow, look at that. The output is pretty cool. You can readily see some interesting relationships between measurements, and you can also get a sense of how the species differ. Note that the diagonal isn't that useful for us, so we can update our traces to hide the diagonal. Great.

The last statistical chart we'll cover is the correlation matrix. Correlation matrices are foundational to developing machine learning and statistical models, as they give a preliminary view into how each variable in your dataset correlates with the others. We'll use the pandas corr function to generate a numerical correlation matrix for the measurements in our iris data. Here, you can see varying correlations exist within our iris data.
For example, we see a strong positive relationship between sepal length and petal length, whereas sepal length is weakly negatively correlated with sepal width. To visualize this correlation matrix, we use imshow from plotly.express. Imshow will interpret each value in our correlation matrix as a pixel in a heat map. The output is a snazzy heat map representation of our correlation matrix.

Now is a good time to introduce the built-in color scales in Plotly, as they are highly customizable. Follow this link to see the sequences available. Clearly there are a lot of them. Let's alter our color scale to see how this works. Great. There are loads of different options to choose from.

With that, we've just gone through some of the most influential visualizations that Plotly has to offer. When presenting statistical data to non-technical audiences, I recommend you forgo code and numerical data and stick to these impactful plots to tell your story.
