Demystifying Data Mining
Thanks to our guest contributor Michael Lieberman.
A little knowledge can be very powerful in the hands of marketers. Data mining is a set of methodologies that, when successfully applied, can increase business revenue, cut costs, or prompt other actions that improve the bottom line. Data mining exercises are used to make decisions about marketing strategy and new product promotion, and to compare a company with its competitors.
Data mining know-how is not new; only the buzzword “data mining” is relatively new. Companies have long used statistics and database systems to extract valuable information from data sets. Today, innovations in artificial intelligence, machine learning, computer processing power, storage, data warehousing, and statistical software are improving the accuracy of analysis while driving down costs.
The intent of data mining is to identify correlations or patterns among multiple fields in large relational databases. In simple terms, data mining means seeking patterns in large-scale data sets, with the aim of finding something new in data from all parts of a business, from production to management. Data mining software provides the analytical tools that let users analyze data from different angles, categorize it, and summarize the relationships identified to support business and marketing decisions.
Typical Data Mining Methodologies
Some of the more frequently employed data mining methodologies include classification, regression, and clustering.
Classification is a widely used data mining application in which the variable of interest (the variable we would like to predict) is categorical. Categorical data distinguishes between groups, such as gender or age group, in contrast to continuous data such as weight or price. Examples include:
• Credit scoring, determining whether a person is a good or bad credit risk
• Medicine, determining if a given patient is at a high or low risk for diabetes
• Insurance, predicting an individual's fraud risk category
Classification data mining techniques often have both descriptive and predictive aspects. For example, we might want to find new categories of behavior that are strongly related to the main variable of interest, such as the brand purchased most frequently. In another example, we might ask which indicators, such as driving record, raise red flags for insurance fraud.
The goal of a classification data mining project is to develop predictive models that companies can use in real time to make or adjust marketing and business decisions.
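To make the idea concrete, here is a minimal sketch of one classification technique, k-nearest neighbors, applied to the credit scoring example above. The applicant data, the two attributes, and the function name are hypothetical and chosen only for illustration; this is one of many possible classifiers, not a recommended scoring method.

```python
import math
from collections import Counter

# Hypothetical training data:
# (debt-to-income ratio, late payments in the past year) -> known credit risk
TRAINING = [
    ((0.10, 0), "good"),
    ((0.15, 1), "good"),
    ((0.20, 0), "good"),
    ((0.25, 1), "good"),
    ((0.45, 3), "bad"),
    ((0.50, 4), "bad"),
    ((0.55, 2), "bad"),
    ((0.60, 5), "bad"),
]


def classify(applicant, training=TRAINING, k=3):
    """Return the majority label among the k nearest training examples."""
    nearest = sorted(training, key=lambda row: math.dist(row[0], applicant))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Predict a categorical outcome for two new (hypothetical) applicants.
print(classify((0.18, 0)))
print(classify((0.52, 3)))
```

The variable being predicted is a category ("good" or "bad"), which is what makes this a classification problem. In a real project the attributes would normally be rescaled first so that no single one dominates the distance calculation.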
Regression analysis is one of the most frequently used statistical methods in marketing research. It is used to understand the effect of one variable on another and to gauge the relative strengths of those effects. In a regression data mining project, the variable of interest is continuous. Examples include:
• Determining the effect on sales if prices were increased by 5%
• Understanding if sales are affected more by price or advertising
• Determining which promotion outperforms other promotions
For example, we might be interested in how much money donors are willing to give to a foundation. A university foundation might want to determine which variables most motivate non-alumni donors to give. These individuals did not attend the university, so why would they be interested in donating? We can test a number of different regression techniques, and perhaps we find, after running a multivariate linear regression model on survey responses, that three variables stand out as possible explanations of donor behavior: visits to the university hospital, a desire to honor a current colleague, and a business connection.
The goals of a regression project are similar to those of a classification project. We would like to find the best predictors related to the variable of interest, and we would like to develop a predictive model to find, say, the lifetime value of a donor.
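As an illustration of the first example in the list above, the sketch below fits a simple one-predictor least-squares line to price and sales figures and then estimates the effect of a 5% price increase. The numbers are made up, and a real project would typically involve several predictors and a statistical package rather than hand-written arithmetic.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical history: unit price and units sold at that price.
prices = [10, 12, 14, 16, 18]
units = [200, 180, 160, 140, 120]

slope, intercept = fit_line(prices, units)


def predict(price):
    """Predicted units sold at a given price, using the fitted line."""
    return intercept + slope * price


current_price = 14
raised_price = current_price * 1.05  # a 5% increase
change = predict(raised_price) - predict(current_price)
print(f"Predicted change in units sold: {change:.1f}")
```

Here the variable of interest (units sold) is continuous, and the fitted slope expresses how strongly price affects it.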
Clustering is our third data mining application, and its goal is quite different from that of the first two. Here, no variable of interest exists. Instead, we attempt to sort the data into clusters. Typical clustering projects include:
• Clustering individuals for a marketing campaign
• Clustering symptoms in medical research to find relationships
• Finding clusters of products purchased based on customer survey responses
Perhaps the best-known example of a clustering data mining project is the ubiquitous Netflix model, which clusters customers into movie categories and makes recommendations based on their movie-watching history.
The goals of a clustering data mining project are also descriptive, in that we are looking for the variables around which clusters form. We might also want to compare the clusters across variables of interest. Often, the most important part of cluster analysis is assigning new cases to clusters and measuring the strength of cluster membership.
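The sketch below shows one common clustering technique, k-means, followed by the step of assigning a new case to a cluster. The customer data, the starting centroids, and the function names are hypothetical; other clustering methods would serve equally well as an illustration.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def k_means(points, centroids, iterations=10):
    """Plain k-means: alternate between assigning points and updating centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


# Hypothetical customers: (purchases per month, average basket in dollars)
customers = [(1, 20), (2, 25), (1, 30), (8, 90), (9, 100), (10, 95)]

# Start from two of the observed customers so the run is repeatable.
centroids = k_means(customers, [customers[0], customers[-1]])
print(centroids)

# Assign a new case to the closest cluster.
new_customer = (7, 80)
print(nearest(new_customer, centroids))
```

Note that no outcome variable appears anywhere in the data: the clusters emerge from the attributes alone. The distance from a new case to its centroid can serve as a rough measure of how strongly it belongs to that cluster.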
The CRISP-DM Model
The rapid growth of both computing power and data availability over the past two decades has encouraged a standardized approach to data mining called CRISP-DM (Cross-Industry Standard Process for Data Mining). It is a process model that describes the approaches expert data miners commonly use to tackle problems. The CRISP-DM model has six major phases, which are not strictly sequential: projects often move back and forth between phases.
Business understanding: The project objectives and requirements are outlined from a business perspective and then converted into a data mining problem definition. Using the problem definition, a preliminary plan to achieve the business objectives is developed.
Data understanding: Before working with any data set, the data mining expert must become familiar with the data by identifying data quality problems, noting first insights, and looking for interesting subsets in which useful information may be hidden.
Data preparation: A final data set is developed that will be used during modeling. Raw data may be manipulated multiple times to reach the final data set. Tasks in this phase include selecting tables, records, and attributes, as well as transforming and cleaning the data.
Modeling: Modeling techniques are dictated by the nature of the data. Typically, several modeling techniques will be employed. Because some techniques have specific requirements for the form of the data, returning to the data preparation phase is often necessary.
Evaluation: Before deploying a model, it is critical to evaluate it more thoroughly and to review the steps executed to construct it, to be certain it addresses the business objectives.
Deployment: A data mining project does not end with the creation of the model. The knowledge gained must be organized and presented in a way that management can understand. Deployment can be as simple as generating a report or as complex as implementing a repeatable data mining process.
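Of the phases above, data preparation is the easiest to show in a few lines. The following sketch, which uses made-up records and a hypothetical `prepare` function, walks through attribute selection, cleaning, and transformation in turn. It is only an illustration of the kind of work involved, not a prescribed CRISP-DM procedure.

```python
# Hypothetical raw survey records; None marks a missing value.
RAW = [
    {"id": 1, "age": 34, "income": 52000, "notes": "call back"},
    {"id": 2, "age": None, "income": 61000, "notes": ""},
    {"id": 3, "age": 45, "income": 88000, "notes": "vip"},
    {"id": 4, "age": 29, "income": 40000, "notes": ""},
]


def prepare(records, attributes):
    """Select attributes, drop incomplete records, and rescale values to 0..1."""
    # Attribute selection: keep only the columns needed for modeling.
    selected = [{a: r[a] for a in attributes} for r in records]
    # Cleaning: drop any record with a missing value.
    clean = [r for r in selected if all(v is not None for v in r.values())]
    if not clean:
        return []
    # Transformation: min-max scale each attribute so all share one range.
    for a in attributes:
        lo = min(r[a] for r in clean)
        hi = max(r[a] for r in clean)
        span = (hi - lo) or 1  # avoid dividing by zero for constant columns
        for r in clean:
            r[a] = (r[a] - lo) / span
    return clean


final = prepare(RAW, ["age", "income"])
for row in final:
    print(row)
```

If a modeling technique later turns out to need the data in a different form, this is the step that would be revisited.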
Data mining and the knowledge management that follows from it are powerful tools in the marketer’s arsenal. The thought of manipulating large data sets can be daunting to some; however, for those with vision, turning data into usable, actionable knowledge that informs and drives marketing strategy can have a significant impact on business profitability.
Editor's note: Michael Lieberman is founder and president of Multivariate Solutions, a New York-based data science and research strategy firm. He can be reached at 646-257-3794 or at email@example.com.