Data mining is the process of discovering previously unknown, actionable and profitable information from large consolidated databases and using it to support tactical and strategic business decisions.
The statistical techniques of data mining are familiar. They include linear and logistic regression, multivariate analysis, principal components analysis, decision trees and neural networks. Traditional approaches to statistical inference fail with large databases, however, because with thousands or millions of cases and hundreds or thousands of variables there will be a high level of redundancy among the variables, there will be spurious relationships, and even the weakest relationships will be highly significant by any statistical test. The objective is to build a model with significant predictive power. It is not enough just to find which relationships are statistically significant.
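To see why significance alone is not enough, here is a minimal simulation sketch in plain Python (not part of the case data; all numbers are illustrative): with a sample of this size, even a weak relationship is usually declared highly significant, yet the fitted scores barely improve on random mailing.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)

# A weak true effect on the log-odds of response, with a baseline response rate near 1%.
p = 1 / (1 + np.exp(-(-4.6 + 0.1 * x)))
y = rng.binomial(1, p)

fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(fit.pvalues)    # the slope is usually "significant" at this sample size

# ...yet the scores add little predictive power:
score = fit.predict(sm.add_constant(x))
top = np.argsort(score)[-n // 10:]    # top 10% of customers by predicted probability
print(y[top].sum() / y.sum())         # typically only a little above the 0.10 expected at random
```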
Consider a campaign offering a product or service for sale, directed at a given customer base. Typically, about 1% of the customer base will be "responders," customers who will purchase the product or service if it is offered to them. A mailing to 100,000 randomly-chosen customers will therefore generate about 1000 sales. Data mining techniques enable customer relationship marketing by identifying which customers are most likely to respond to the campaign. If the response rate can be raised from 1% to, say, 1.5% of the customers contacted (a "lift" of 1.5), then 1000 sales could be achieved with only 66,666 mailings, reducing the cost of mailing by one-third.
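As a check on the arithmetic, a small sketch in plain Python (the figures are simply those quoted above):

```python
def mailings_needed(target_sales, response_rate):
    """Expected number of customers that must be contacted to generate `target_sales` sales."""
    return target_sales / response_rate

baseline = mailings_needed(1000, 0.010)   # 100,000 mailings at the 1% base response rate
targeted = mailings_needed(1000, 0.015)   # about 66,667 mailings at a 1.5% response rate
print(baseline, targeted, 1 - targeted / baseline)   # the saving is roughly one-third
```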
This example was provided by Gary Saarenvirta, formerly of The Loyalty Group, now with IBM Canada.
Each case is one account. The account numbers have been removed.
The objective variable is a response variable indicating whether or not a consumer responded to a direct mail campaign for a specific product. "True" or "response" is 1, "False" or "non-response" is 0.
The data were extracted from a much larger set with a response rate of about 1%. All 1079 responders were used, together with 1079 randomly-chosen non-responders, for a total of 2158 cases.
There are 200 explanatory variables in the file: v137, v141 and v200 are indicators for gender "male," "female," or "unknown," respectively; v1-v24, v138-v140 and v142-v144 are recency, frequency and monetary (RFM) data for the specific accounts; v25-v136 are census variables; and v145-v199 are demographic "taxfiler" variables. Most of the variables have been normalized.
A table with some variable descriptions is attached. Some of the product-specific variables have been blinded. "p##" means product, "rcy" means recency (number of months since the most recent transaction), "trans" means number of transactions, and "spend" means dollars spent. For example, p01rcy means product 1 recency. Note that zero recency means that the account was active for that product in the most recent month. "Never active" is indicated by the largest possible value for recency, as determined by the first month in which the business collected data.
The census and taxfiler variables are summary statistics for the enumeration area in which the account holder's address is located. They generally give total or average numbers of individuals or families or dollars in the categories indicated. A table of taxfiler variable descriptions is attached. You may be able to guess the census variables from their names, but tables with longer descriptions of the census variables are attached: Group "a" and Group "b" are listed separately. You are welcome to contact us if you aren't sure of any.
You can get the data as an Excel 97/98 Workbook gary.xls (5.9 Mb), as an Excel 97/98 Workbook compressed into a ZIP archive gary_xls.zip (2.4 Mb), or as text files in a ZIP archive gary.zip (1.3 Mb).
There are two text files in gary.zip. The data are in a fixed-width ASCII file Sasvar.txt and the data file description is in imtrans.txt. If you choose to work with the text files you MUST use the column positions in imtrans.txt to import the data into SAS or Splus, because some columns are contiguous. Be careful with the line endings; for example, if you unzip the text files in UNIX there will be a line feed character at the end of each line that will have to be included when computing the record length.
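The same caution applies if you read the file in another language. A minimal Python/pandas sketch is below; the column positions shown are placeholders only and MUST be replaced with the ones listed in imtrans.txt, and the column names are assumed for illustration.

```python
import pandas as pd

# Placeholder positions: pandas counts from 0 and treats each (start, stop) as half-open.
# Replace these with the actual column positions given in imtrans.txt.
colspecs = [(0, 1), (1, 9), (9, 17)]
names = ["response", "v1", "v2"]

df = pd.read_fwf("Sasvar.txt", colspecs=colspecs, names=names)
print(df.shape)   # should show 2158 rows once all 201 columns are specified
```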
We expect that you will be using Splus or SAS for the analysis; however, not all of the methods suggested here are readily available in Splus. If you have a SAS licence, the Enterprise Miner module will conveniently automate many of the analyses, and you may be able to get an evaluation copy inexpensively from SAS. IBM's Intelligent Miner is also recommended, but it is less likely to be available to you.
For all the analyses below, you should create a training set and a validation set. As the data were stratified to 50/50, you should create an unstratified validation set with the original proportion of 1% "True" for the objective variable. You would, of course, get better validation sets if you had the complete sample of around 100,000 accounts, 99% of them non-responders, but that file is too large for us to distribute conveniently. Validation sets constructed from the 50/50 stratified sample should be adequate for the purposes of this exercise.
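One way to carry this out, sketched in Python/pandas and continuing from the hypothetical `df` above (the column name "response" and the 50/50 split fraction are assumptions, not part of the case description):

```python
import pandas as pd

# `df` is the full 2158-case file; the 0/1 objective column is assumed to be named "response".
train = df.sample(frac=0.5, random_state=1)   # keep the 50/50 mix for model fitting
holdout = df.drop(train.index)

# Un-stratify the holdout: keep every non-responder and only enough responders
# to restore the original ~1% response rate.
nonresp = holdout[holdout["response"] == 0]
resp = holdout[holdout["response"] == 1]
n_resp = max(1, round(0.01 * len(nonresp) / 0.99))
valid = pd.concat([nonresp, resp.sample(n=n_resp, random_state=1)])
print(len(train), len(valid), valid["response"].mean())
```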
Your results should be presented as a gains chart, in either tabular or graphical form. A gains chart is a plot of the % of the responders reached (ordinate) against the % of the customer base contacted (abscissa). If the campaign is directed at randomly-chosen individuals, the plot will be a straight line with unit slope through the origin. If the campaign preferentially targets responders, the gains curve will lie above the diagonal except, of course, at 0% and 100%, where it necessarily touches.
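A gains chart can be drawn from any model's scores on the validation set. A sketch using Python/matplotlib, where `y` (0/1 responses) and `scores` (predicted response probabilities) are assumed to come from your fitted model:

```python
import numpy as np
import matplotlib.pyplot as plt

def gains_curve(y, scores):
    """Cumulative fraction of responders captured vs fraction of customers contacted, best scores first."""
    order = np.argsort(scores)[::-1]
    y_sorted = np.asarray(y)[order]
    pct_contacted = np.arange(1, len(y_sorted) + 1) / len(y_sorted)
    pct_captured = np.cumsum(y_sorted) / y_sorted.sum()
    return pct_contacted, pct_captured

x_pct, y_pct = gains_curve(y, scores)
plt.plot(x_pct * 100, y_pct * 100, label="model")
plt.plot([0, 100], [0, 100], linestyle="--", label="random mailing")
plt.xlabel("% of customer base contacted")
plt.ylabel("% of responders reached")
plt.legend()
plt.show()
```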
The performance of a predictive model is measured by looking at the % of responders reached at 10%, 20% or 30% of customers mailed. A good model will reach 1.5 to 3.5 times as many responders as random selection over this range; for example, mailing to 10% of the customer base will reach 15% to 35% of the responders. Less than this means the data are not very predictive; more than this likely means that you have overfitted or that there is a strong bias in the data.
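The tabular measures at 10%, 20% and 30% of customers mailed follow from the same ordering of scores; a self-contained sketch, again assuming `y` and `scores` for the validation set:

```python
import numpy as np

# `y` (0/1 responses) and `scores` (predicted probabilities) on the validation set are assumed.
order = np.argsort(scores)[::-1]
y_sorted = np.asarray(y)[order]
total = y_sorted.sum()
for pct in (0.10, 0.20, 0.30):
    k = int(round(pct * len(y_sorted)))
    captured = y_sorted[:k].sum() / total
    print(f"mail {pct:.0%} of base -> reach {captured:.1%} of responders (lift {captured / pct:.2f}x)")
```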
Some things you could try with these data include:
All model results should be analyzed for gains chart performance with the following measures:
You can get copies of some of Gary Saarenvirta's work online at www.db2mag.com. He has also written The Intelligent Miner for Data Applications Guide, found at www.redbooks.ibm.com.
An Internet search on "data mining" will find a number of commercial products similar to Intelligent Miner and Enterprise Miner.