Read the case study in your supplementary text titled "A Typical Decision Tree" on pages 238-244, and answer the following questions (over the course of the week, NOT all in one post):

· How would you summarize the business scenario being used here? What is the target variable, its type, and permissible values?

· In Figure 7-1 on p. 239, there are three boxes with a letter I, A, or V connected to the box via an arrow. What is distinctive about each of these three boxes, and why do you think they are highlighted in the diagram?

· On page 243, the authors state: "After six months, 89.3% of subscribers are still active, 4.39% have left involuntarily, and 6.32% have left voluntarily." What were the corresponding distributions for the training set? Why do you think they were different? What are the implications for this difference?

· This organization is in a mature market, which means there are relatively few entities that do not already have a vendor supplying this product. The book says that in this type of market, organizations are concerned about churn. Why is that?

· In your opinion, what did the organization learn about churn from this activity that it can put to use?

 

A Typical Decision Tree

The decision tree in Figure 7-1 was created from a model set describing post-paid phone subscribers; these are subscribers who talk first and pay later. The model set is set up for a predictive model: the input variables are recorded for all active customers on a given date, and the target is assigned based on the customer's status 100 days later. The model set is balanced, containing equal numbers of customers who are active 100 days later, who stopped involuntarily (by not paying), and who stopped voluntarily. These three possibilities are represented by the target variable, which takes on one of three values: A, V, or I.

Figure 7-1: A decision tree.
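As a rough illustration of how such a balanced model set might be assembled, here is a minimal Python sketch that downsamples each class to the size of the rarest one. The DataFrame and the status_100d column name are assumptions for the example, not the authors' actual code:

```python
import pandas as pd

def build_balanced_model_set(df: pd.DataFrame, target: str = "status_100d",
                             seed: int = 0) -> pd.DataFrame:
    """Downsample so that the A, V, and I classes are equally represented."""
    n = df[target].value_counts().min()            # size of the rarest class
    return (df.groupby(target, group_keys=False)
              .apply(lambda g: g.sample(n=n, random_state=seed)))
```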

The box at the top of the diagram is the root node, which contains all the training data used to grow the tree. In this node, all three classes are represented equally. The root node has two children, and a rule that specifies which records go to which child. The rule at the top of the tree is based on credit class: Credit class “C” goes to the left child and credit classes “A,” “B,” and “D” go to the right child. The point of the tree is to split these records into nodes dominated by a single class. The nodes that ultimately get used are at the ends of their branches, with no children. These are the leaves of the tree.
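The sketch below, which uses scikit-learn on random placeholder data rather than the book's subscriber data, shows the mechanics being described: growing a tree and printing the splitting rules that lead to each leaf:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Random stand-in for the balanced model set; a real run would use the
# subscribers' customer signatures.
rng = np.random.default_rng(0)
n = 900
model_set = pd.DataFrame({
    "credit_class": rng.choice(list("ABCD"), size=n),
    "tenure": rng.integers(1, 800, size=n),
    "deposit": rng.choice([0, 100, 600], size=n),
    "status_100d": rng.choice(list("AVI"), size=n),   # balanced target
})
X = pd.get_dummies(model_set[["credit_class", "deposit", "tenure"]],
                   columns=["credit_class"])
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, model_set["status_100d"])
print(export_text(tree, feature_names=list(X.columns)))  # one rule per path
```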

The path from the root node to a leaf describes a rule for the records in that leaf. In Figure 7-1, nodes with distributions similar to the training data are lightly shaded; nodes with distributions quite different from the training data are darker. The arrows point to three of the darkest leaves. Each of these leaves has a clear majority class.

Decision trees assign scores to new records simply by letting each record flow through the tree to arrive at its appropriate leaf. For instance, the tree in Figure 7-1 can be used to assign an A score, V score, and I score to any currently active subscriber. Each leaf has a rule, which is based on the path through the tree. The rules are used to assign subscribers in need of scoring to the appropriate leaf. The proportion of records in each class provides the scores.
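In scikit-learn terms this corresponds to predict_proba, as in the toy sketch below; the feature encoding and labels are illustrative, not taken from the book:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy features [is_credit_class_C, tenure_days] and labels that echo the
# story in the chapter; a pure leaf yields a score of 1.0 for its class.
X = np.array([[1, 30], [1, 400], [0, 30], [0, 400]] * 50)
y = np.array(["I", "V", "A", "A"] * 50)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

new_subscriber = [[1, 120]]          # credit class C, 120 days of tenure
scores = tree.predict_proba(new_subscriber)[0]
print(dict(zip(tree.classes_, scores)))    # A score, I score, V score
```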

Using the Tree to Learn About Churn

In mature markets, nearly all mobile service providers are concerned about churn, the industry term for subscribers switching from one provider to another. In markets where telephone penetration is already high, the easiest way to acquire new subscribers is to lure them away from a competitor. The decision tree in Figure 7-1 describes who is doing the churning and which of two variants of churn is more common in particular segments. Voluntary churn is when the customer decides to leave. Involuntary churn is when the company tells customers to leave, usually because they have not been paying their bills. To create the model set, subscribers active on a particular date were observed, and various attributes of each were captured in a customer signature.

The first split in the tree is on credit class. Subscribers with credit class C take one path whereas those with any other credit class take another. The credit class is “A,” “B,” “C,” or “D,” with “A” meaning excellent credit and “D” the lowest credit rating. The fact that this variable is chosen first means that credit class is the most important variable for splitting the data.

This split drastically changes the distribution of the target in each of the children. Sixty percent of subscribers with credit class “C” experience involuntary churn compared to only 10 percent for all other credit classes. Subsequent splits continue to concentrate the classes. Notice that different variables are used in different parts of the tree. However, any variable can be used anywhere in the tree, and a variable can be used more than once.

In the full tree, most leaves are dominated by a single class. Each node is annotated with the percent of subscribers in each of the three classes.

Look first at the leaf marked I. These subscribers are credit class “C” and have tenure of 264 days or less. Seventy-four percent of them were cancelled for non-payment. The rate of voluntary cancellations is quite low because most subscribers are on a one-year contract that includes a hefty cancellation fee. Rather than paying the fee, dissatisfied subscribers whose attitude toward debt repayment has earned them credit class “C” simply walk away.

Now consider the node marked V. These subscribers pay no deposit (the smallest deposit is $100) and have been around for at least 265 days. Although they were on contract at the time the inputs were recorded, they were known to be going off contract before the date when the target was recorded. The split on deposit>=50 is exactly equivalent to a split on credit class='D' because everyone with credit class D pays a deposit ranging from $100 to $600, whereas people with credit class “A,” “B,” or “C” pay no deposit.
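A quick pandas check, with hypothetical column names and toy values, makes the equivalence concrete:

```python
import pandas as pd

# Everyone with credit class D pays a deposit of $100-$600; nobody else
# pays one, so the two conditions pick out exactly the same subscribers.
df = pd.DataFrame({"credit_class": ["A", "B", "C", "D", "D"],
                   "deposit":      [  0,   0,   0, 100, 600]})
print(((df["deposit"] >= 50) == (df["credit_class"] == "D")).all())  # True
```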

Finally, look at the leaf marked A. Like those in the node marked V, these subscribers have no deposit and have been around for at least 265 days. But they are still on contract and not about to go off contract. Perhaps they signed two-year contracts to start with, or perhaps they were enticed to renew a contract after the first year. In any case, 80% are still active.

Judging by this tree, contracts do a good job of retaining subscribers who are careful with their credit scores, and large deposits do a good job of retaining customers who are not. Both of these groups wait until they can leave voluntarily and without punishment. The worst attrition is among subscribers with credit class “C.” They are not forced to pay a deposit, but unlike others who have no deposit, customers with credit class “C” are willing to walk away from a contract. Perhaps these customers should be asked to pay a deposit.

Using the Tree to Learn About Data and Select Variables

The decision tree in Figure 7-1 uses five variables from among the many available in the model set. The decision tree algorithm picked these five because, together, they do a good job of explaining voluntary and involuntary churn. The very first split uses credit class, because credit class does a better job of separating the target variable classes than any other available field. When faced with dozens or hundreds of unfamiliar variables, you can use a decision tree to direct your attention to a useful subset. In fact, decision trees are often used as a tool for selecting variables for use with another modeling technique. In general, decision trees do a reasonable job of selecting a small number of fairly independent variables, but because each splitting decision is made independently, it is possible for different nodes to choose correlated or even synonymous variables. An example is the inclusion of both credit class and deposit seen here.
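As a sketch of this variable-screening use, one can fit a tree on many candidate inputs and keep only the variables it actually splits on. The data and variable names below are synthetic placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(1000, 50)),
                 columns=[f"var_{i:02d}" for i in range(50)])
y = (X["var_07"] + 0.5 * X["var_23"] > 0).astype(int)  # depends on 2 inputs

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)
importances = pd.Series(tree.feature_importances_, index=X.columns)
print(importances[importances > 0].sort_values(ascending=False))
```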

Different choices of target variable create different decision trees containing different variables. For example, using the same data as for the tree in Figure 7-1, but changing the target variable to the binary choice of active or not active (by combining V and I), changes the tree. The new tree no longer has credit class at the top. Instead, handset churn rate, a variable not even in the first tree, rises to the top. This variable is consistent with domain knowledge: Customers who are dissatisfied with their mobile phone (handset) are more likely to leave. One measure of dissatisfaction is the historical rate of churn for handsets. This rate can (and should) be recalculated often because handset preferences change with the speed of fashion. People who have handsets associated with high rates of attrition in the recent past are more likely to leave.
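A minimal sketch of this re-targeting step, again on synthetic stand-in data (handset_churn_rate here is modeled on the variable the authors describe):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
model_set = pd.DataFrame({
    "handset_churn_rate": rng.uniform(0, 0.3, 600),
    "tenure": rng.integers(1, 800, 600),
    "status": rng.choice(list("AVI"), 600),
})
model_set["churned"] = (model_set["status"] != "A").astype(int)  # V and I merged

X = model_set[["handset_churn_rate", "tenure"]]
binary_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
binary_tree.fit(X, model_set["churned"])
print(binary_tree.feature_importances_)   # which inputs the new tree favors
```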

Picking Variables for a Household Penetration Model at the Boston Globe

During the data exploration phase of a directed data mining project, decision trees are a useful tool for choosing variables that are likely to be important for predicting particular targets. One of the authors' newspaper clients, the Boston Globe, was interested in estimating a town's expected home delivery circulation level based on various demographic and geographic characteristics. Armed with such estimates, it would be possible to spot towns with untapped potential where the actual circulation was lower than the expected circulation. The final model would be a regression equation based on a handful of variables. But which variables? The U.S. Census Bureau makes hundreds of variables available. Before building the regression model, we used decision trees to explore the possibilities.

Although the newspaper was ultimately interested in predicting the actual number of subscribing households in a given city or town, that number does not make a good target for a regression model because towns and cities vary so much in size. Wasting modeling power on discovering that there are more subscribers in large towns than in small ones is not useful. A better target is penetration—the proportion of households that subscribe to the paper. This number yields an estimate of the total number of subscribing households simply by multiplying it by the number of households in a town. Factoring out town size yields a target variable with values that range from 0 to somewhat less than 1.
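A worked toy example (the household and subscriber counts are invented, not the Globe's figures) shows how the penetration target factors out town size and how multiplying back recovers an expected subscriber count:

```python
import pandas as pd

towns = pd.DataFrame({"town": ["Newton", "Dedham"],
                      "households":  [31000, 9300],
                      "subscribers": [ 9600, 1500]})
towns["penetration"] = towns["subscribers"] / towns["households"]  # 0..1 target
# a model's *predicted* penetration would be converted back to an
# expected subscriber count the same way:
towns["expected_subs"] = towns["penetration"] * towns["households"]
print(towns)
```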

The next step was to figure out which factors, from among the hundreds in the town signature, separate towns with high penetration (the “good” towns) from those with low penetration (the “bad” towns). Our approach was to build a decision tree with a binary good/bad target variable. This involved sorting the towns by home delivery penetration and labeling the top one-third “good” and the bottom one-third “bad.” Towns in the middle third—those that are not clearly good or bad—were left out of the training set.
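One way to implement this labeling, assuming a penetration column, is with terciles, as in this sketch:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
towns = pd.DataFrame({"penetration": rng.uniform(0, 0.6, 300)})

# top third -> "good", bottom third -> "bad", middle third dropped
towns["tercile"] = pd.qcut(towns["penetration"], 3,
                           labels=["bad", "drop", "good"])
training_set = towns[towns["tercile"] != "drop"].copy()
training_set["target"] = training_set["tercile"].astype(str)
print(training_set["target"].value_counts())   # 100 "bad", 100 "good"
```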

TIP

When trying to model the difference between two groups, removing examples that are not clearly in one group or the other can be helpful.

The resulting tree used median home value as the first split. In a region with some of the most expensive housing in the country, towns where the median home value is less than $226,000 are poor prospects for this paper (all census variables are from the 2000 Census). The next split was on one of a family of derived variables comparing the subscriber base in the town to the town population as a whole. Towns where the subscribers are similar to the general population are better, in terms of home delivery penetration, than towns where the subscribers are further from average. Other variables that were important for distinguishing good from bad towns included the average years of school completed, the percentage of the population in blue-collar occupations, and the percentage of the population in high-status occupations.

Some variables picked by the decision tree were less suitable for the regression model. One example is distance from Boston. The problem is that at first, as one drives out into the suburbs, penetration goes up with distance from Boston. After a while, however, distance from Boston becomes negatively correlated with penetration, as people far from Boston do not care as much about what goes on there. A decision tree easily finds the right distance to split on, but a regression model expects the relationship between distance and penetration to be the same for all distances. Home price is a better predictor because, as a function of distance, it follows the same pattern as the target variable, increasing in the first few miles and then declining. The decision tree provides guidance about which variables to think about as well as which variables to use.

Using the Tree to Produce Rankings

Decision trees score new records by looking at the input variables in each new record and following the appropriate path to the leaf. For many applications, the ordering of the scores is more important than the actual scores themselves. That is, knowing that Customer A has a higher or lower churn risk than Customer B is more important than having an actual estimate of the churn risk for each customer. Such applications include selecting a fixed number of customers for a specific marketing campaign, such as a retention campaign. If the campaign is being designed for 10,000 customers, the purpose of the model is to find the 10,000 customers most likely to churn; determining the actual churn rate is not important.
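A minimal sketch of the ranking use, with random numbers standing in for model scores; because only the ordering matters, picking the campaign audience is just a sort:

```python
import numpy as np

rng = np.random.default_rng(0)
churn_scores = rng.uniform(size=1_000_000)     # stand-in for model scores

top_10k = np.argsort(-churn_scores)[:10_000]   # indices of the riskiest 10,000
print(churn_scores[top_10k].min())             # the score cutoff this implies
```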

Using the Tree to Estimate Class Probabilities

For many purposes, rankings are not sufficient, and probabilities of class membership are needed. The class probabilities are obtained from the leaves. For example, the distribution of classes in the node labeled I in Figure 7-1 comes from applying the rule credit class='C' and tenure<264.5 to the balanced data at the root node. Saying that any record arriving at node I has probability 0.6 of churning involuntarily in the next 100 days might seem reasonable; however, the distribution of values in the original data is quite different from the distribution in the model set used to build the tree. After six months, 89.30 percent of subscribers are still active, 4.39 percent have left involuntarily, and 6.32 percent have left voluntarily.

Chapter 5 explains one way to convert scores to probability estimates. Another way to estimate the true probabilities is to apply the decision tree rules to the original, unbalanced preclassified data and observe the resulting distribution. For this particular dataset, selecting all subscribers with credit class='C' and tenure<264.5 yields a sample in which 84.14% are still active, 14.44% have left involuntarily, and 1.42% have left voluntarily. So the correct probability estimate for involuntary churn at this leaf is 14 percent rather than 60 percent. The percentage of involuntary churn in this leaf is well over three times the level in the subscriber population, but even here, “active” is still the most probable outcome by far.
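A sketch of this second approach in pandas, using a toy stand-in for the original unbalanced data (column names are assumptions):

```python
import pandas as pd

# Toy unbalanced data; the real dataset would have many thousands of rows.
df = pd.DataFrame({
    "credit_class": ["C", "C", "C", "A", "B", "C", "D", "C"],
    "tenure":       [100, 200,  50, 300, 120, 400, 150,  90],
    "status":       ["A", "I", "A", "A", "A", "V", "A", "A"],
})
leaf = df.query("credit_class == 'C' and tenure < 264.5")  # the leaf's rule
print(leaf["status"].value_counts(normalize=True))         # observed class mix
```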

Using the Tree to Classify Records

To use the tree as a classifier, all that is required is to estimate the class probabilities as described earlier and label each leaf with its most probable class. This is a use of decision trees that is often presented as primary in the academic literature. In the marketing world, the class probability estimates are usually more useful than the classification because classifiers quite commonly produce only one outcome. Classifying everyone as nonresponders is not helpful because the point of creating models is to differentiate among records.
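As a tiny worked example using the corrected probabilities quoted earlier for the leaf marked I, labeling the leaf with its most probable class yields A, which is exactly the everyone-in-one-class behavior at issue:

```python
# Corrected class probabilities for the leaf marked I (from the text above).
leaf_probs = {"A": 0.8414, "I": 0.1444, "V": 0.0142}
label = max(leaf_probs, key=leaf_probs.get)   # most probable class
print(label)   # 'A' -- even this high-churn leaf is classified as active
```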

A model that puts everyone in the same class is neither surprising nor uncommon in marketing applications where the behaviors of interest (response, fraud, attrition, and so on) tend to be rare. No matter how the segments for a marketing campaign are defined, the most likely outcome in any segment is no response. Fortunately, some segments are more likely to respond than others and that is enough to be useful. A charity does not send you an appeal for donations because they think you will respond; they reach out to you because they think the chance of your responding, while low, is high enough to justify the postage.
