Join Ron Davis for an in-depth discussion in this video The Data Preparation tab, part of Learning Excel Data-Mining.
- Alright let's start examining the Data Mining User Interface. But before we do that, let's do a few housekeeping chores. I'm going to show you, FILE, where we turn on Data Mining. So it's under Options, and under the Options if you go to Add-Ins, and once again this is a COM Add-in, so I'm going to select that, click the GO button, and here they are, SQL Server, Data Mining Client Add-in. So these two here need to be enabled. I have some other Add-ins also, but these are the ones that are the focus of this course.
So that's turned on, we say OK, and that gives us this Data Mining tab. Now the second housekeeping thing we need to do, we need to switch over to SQL Server Management Studio. In SQL Server Management Studio, we need to change a property on the server. So I'm going to go in to Analysis Services. That's this little cube, right here, where as a standard database, has this cylinder look to it. But if we just right-click here, and we go into Properties, we're going into Properties of the server itself, not an individual database.
And we go over to General. In General, when you first open yours, this Allow Session Mining Model is going to be set, to "false", because that's the default. You're going to want to come back through, and set it to "true". You're also going to, at least I suggest, set the AdHoc, for the purpose of, Open Rowset Queries to "true" also and just say OK. What that will allow is for, Excel to come in, create a session into the Data Mining algorithms and use the power of Analysis Services where the Analysis Services algorithms actually run.
So we'll say OK, and we're good to go. As a matter of fact, I'm going to go ahead and close SQL Server Management Studio. And we're back over into Excel. So the first thing we need to take a look at, is this Data Preparation grouping right in here. To do that, we're going to grab some Sample Data, Cell A1, because that's where I want to do the Insert. So I'll go Sample Data, and I get my little splash page, and then I just go into Next. Now I can get down here and say I want to bring in data from an External source, and then I write a Query.
Frankly this is not as friendly as you might be used to, so I tend not to use this, instead I'm going to just go ahead and grab data through standard Excel methods, and then do it that way. So we're going to cancel out of that. So now we're just going to go over to the HOME tab, and how about if I go to the DATA tab, and I want to bring in my data From Other Sources. Click here, from SQL, and we're going to tie into an instance of SQL Server, and I happen to have that running locally so if I just enter the "." that will connect into local host.
And I can use Windows Authentication, just say Next. Pulling down the databases that I have, the only one I'm going to use is this AdventureWorksDW, which is the sample you download from Microsoft. And I'm going to bring in TargetMail. This is a view and it was really cleaned ahead of time, and they use it for Target Mailing. So we'll go Next, and down here, you're with standard stuff, you would put a description in here so others would understand what you're trying to do. We're not going to do that because you're watching this, so I'll call Finish.
I do want to dump into a Table, right here, so we'll say OK, and it's brought all of the information in. So now that I have data in, I'm going to click the, under the Data Preparations, Sample Data and ignore the splash page, and now it's going to come down here and it's going to select the entire data, and that's fine. Just go over here Next, Random Sampling or an Oversample, and we're going to go with Random Sampling, and Next. Now down here it specifies a percentage. We only have 12,000 rows, so it's really not that big of a thing.
But if you had many, many rows, maybe 3, 4, 500,000, then you want to take a look at that. Just to show you, I'm just going to change this down here to 25 and we'll go Next. It'll create two worksheets, one called Selected Data, and one Unselected Data, and that's fine for me. And I'll call Finish and let it do it's work. So here's my Selected Data, right in here, and if I look at that, comes down and you'll see I have about 29,000 rows down here. If I go over to Unselected Data, and I select here in the top, control + down arrow, I have about 29,000 rows and, okay I was reading the customer identifier here instead of off the rows.
So I have 13,000 rows here, on the Unselected and over on the Selected, I have about 4600, so that makes sense. So now we have that Sample Data. Why would you want to do a Sample Data where you had Selected Data and Unselected Data? Well, if you were coming through and applying an algorithm to it, again you might want to use a subset of what you had, so rather than using everything when you applied an algorithm, you would just apply it to your Selected Data. Another reason for splitting this on out, is you could apply an algorithm, see how it works, and then go over to your Unselected Data and apply it and see if it works the same.
So it's a way of verifying it. Now if we go in and look under the Data Preparation and say we want to Explore the data, I get my splash page, Next down in here, and now I can come through and select the column. Now the CustomerKey is really very boring, so let's come through and select something like TotalChildren, and we'll go Next. And this gives us a distribution. So it tells us by buckets, which is just a form of grouping in here and we can vary that number, so we can see how this works.
And by the way, if you have an existing data set and you see the way that's distributed and then you want to see if this new data set you're bringing in is distributed in the same manner, then this is a handy little tool. Let's look at another one real quick. I can Explore the data again, and this time let's just take a look at, if I can find something that's clever here, YearlyIncome, that's always interesting. Of course we, really probably don't have that. But here we have the YearlyIncome, we can come through and take a look at it, you know that's only going to work, of course, if they told us what their yearly income was.
And in this contrived data set, they did. So cancel out of that, and the last thing we can look at is, Clean the data, and we'll look for Outliers, things that are abnormal. And if we come through there, CustomerKey is not going to help us at all, but if we take MiddleName, let's just take MiddleName and go Next. And it comes through again and the minimum is one and we can change that on out. So the people that have one middle name is far greater than those that don't. Which kind of makes sense in there.
Or if we come through and look at, let's go back again to YearlyIncome, and you see we're all over the chart 10,000-17,000 etcetera. And you can go through and you can remove Outliers. However sometimes, the information you're really trying to look at is, in the Outliers. So it depends on what you're trying to do. So that's your Data Preparation tab, Of how you can sample the data, you can bring the data in, you can sample the data, and then you can go through and look at it and you can also come in and clean the data.
- Recognize the factors involved in building an environment.
- Define models, induction, and prediction.
- Determine key concepts in Excel data-mining.
- Analyze key influencers.
- Break down how to utilize the detect categories tool.
- Recognize where and when to use the Shopping Basket Analysis.