Computational Statistics (2008) 23:407–427
DOI 10.1007/s00180-007-0077-5

ORIGINAL PAPER

Classification trees for ordinal variables

Raffaella Piccarreta
Istituto di Metodi Quantitativi, Università "L. Bocconi", viale Isonzo, 25, 20136 Milan, Italy
e-mail: raffaella.piccarreta@uni-bocconi.it

Accepted: 16 June 2007 / Published online: 25 July 2007
© Springer-Verlag 2007

Abstract  We introduce new criteria to obtain classification trees for ordinal response variables. To this aim, Breiman et al. (Classification and regression trees. Wadsworth, Belmont, 1984) extended their twoing criterion to the ordinal case. Following the CART procedure, we extend the well-known Gini–Simpson criterion to the ordinal case. Referring to the exclusivity preference property (introduced by Taylor and Silverman in Stat Comput 3:147–161, 1993, for the nominal case), suitably modified for the ordinal case, a second criterion is introduced. The proposed methods are compared with the ordered twoing criterion via simulations.

Keywords  CART · Gini–Simpson criterion · Gini index of heterogeneity · Ordered categorical variables · Twoing criterion

1 Introduction

Suppose we are interested in predicting a categorical response variable Y via tree-structured classification. In order to predict Y, we generate a tree on the basis of measurements on Q predictors, X_1, X_2, ..., X_Q, available for a training set of N cases. There are different ways to grow a tree. One of the most famous is the CART procedure, introduced by Breiman et al. (1984) (from now onwards B & A). CART consists of two phases: growing and pruning. In the growing phase, the training set is recursively partitioned into subsets, called nodes, through a sequence of binary splits. At each step, a node is split into two sub-nodes according to the values of one of the explanatory variables. A sequence of nested trees is thus obtained, having an increasing number of terminal nodes, i.e., nodes which are no longer split. The growing phase ends when the maximal tree, T_Max, is obtained, having the maximum number of terminal nodes. In the pruning phase, terminal nodes are merged to prevent over-fitting of the classification tree to the training set it is based upon. The tree selected after the pruning is the one with the lowest (properly measured) risk.

In Sect. 2, we describe the Gini–Simpson and the twoing criteria, at the basis of the CART methodology (B & A). The aim of the paper is twofold. Our first objective is to introduce new criteria to grow trees when the response is an ordered categorical variable (from now onwards, ordinal variable). The rationale for this is to enrich the class of ordinal splitting criteria: we think that the availability of many criteria, and the inspection of the different resulting trees, can also prove very useful from an exploratory point of view, allowing insights into the data to be analyzed. In Sect. 3, we describe the ordered twoing criterion, an extension of the twoing method proposed by B & A to deal with the ordinal case. Following the same reasoning, we provide an extension of the Gini–Simpson criterion. The ordinal criteria are compared and their properties discussed. We first introduce the distinction between "global" and "non-global" criteria, and show that the Gini–Simpson criterion is global while the twoing criterion is not. We then compare the two criteria by referring to the "exclusivity preference property", studied by Taylor and Silverman (1993) for the nominal case and extended in this paper to the ordinal case. We provide some insights about this property, show that the ordinally extended Gini–Simpson criterion does not possess it (like its nominal counterpart), and introduce a new (global) criterion that does possess it.
Still following Taylor and Silverman (1993), we discuss the presence of the so-called anti-end-cut factor in all the considered ordinal criteria. In Sect. 3 the criteria are compared by referring to a real data set, while in Sect. 4 their performance is evaluated on the basis of simulations. As is evident, the second objective of the paper is to discuss in detail the properties and characteristics both of the newly introduced criteria and of the "standard" one (the ordered twoing criterion). We think this is an important point, especially because ordinal criteria have so far received less attention both from the substantive and from the methodological side. As a final consideration, in Sect. 5 we highlight a further property of the ordinally extended Gini–Simpson criterion, which speeds up the growing phase. This property, particularly relevant when large data sets are analyzed, extends a similar property already illustrated for the nominal Gini–Simpson criterion by Mola and Siciliano (1997, 1998). Section 6 concludes and draws directions for future research.

2 The Gini–Simpson and the twoing criteria

Let X be a vector of Q explanatory variables, X_1, ..., X_Q. On the basis of X, Y is to be predicted, Y being a nominal categorical variable assuming G levels, y_1, ..., y_G. Suppose that for a training set consisting of N cases, measurements are available on (X, Y). A classification tree makes it possible to predict the (unknown) level of Y for a new individual characterized by a vector x(*). The first phase in building a tree is the growing phase. At the first step, all cases constitute a single node. At subsequent steps, each node splits into two subgroups (nodes) on the basis of the values of one of the explanatory variables, called the splitting variable.
In CART, the best split of a node t is selected by referring to the concept of impurity, a measure of the heterogeneity of cases in t, defined as:

    I_N(t) = Σ_{g=1}^{G} π_t(g) · [1 − π_t(g)] = 1 − Σ_{g=1}^{G} π_t²(g),   (1)

π_t(g) being the proportion of cases in t characterized by the g-th level of Y. Suppose now t splits into the two subgroups t_L and t_R according to a split s, and let p_L and p_R be the proportions of individuals placed in t_L and t_R (p_L + p_R = 1). Denote by π_t(g|L) and π_t(g|R) the proportions of cases characterized by the g-th level of Y in t_L and t_R, respectively. The Gini–Simpson criterion evaluates the split s through the decrease in impurity achieved when passing from t to t_L and t_R:

    C_GN(s, t) = ΔI_N(s, t) = I_N(t) − p_L I_N(t_L) − p_R I_N(t_R) = p_L p_R Σ_{g=1}^{G} [π_t(g|L) − π_t(g|R)]².   (2)

The best split of node t, s*(t), is selected so as to maximize the decrease in impurity, s*(t) = arg max_s C_GN(s, t).

B & A introduce the twoing criterion as an alternative way to evaluate a split. Let C(t) = {y_1, ..., y_G} be the set of categories of Y in a node t. Consider two subsets of C(t), say C_1(t) = {y_{g1}, y_{g2}, ..., y_{gh}} and C̄_1(t) = C(t) \ C_1(t). A binary response variable can thus be defined: Y* = 1 if Y ∈ C_1, 0 otherwise. The impurity within node t, depending of course on the choice of C_1(t), is measured by applying (1) to Y*. We have I_N(t, C_1) = 2 π_t(C_1) π_t(C̄_1), where π_t(C_1) = Σ_{g: y_g ∈ C_1} π_t(g). Given C_1, the decrease in impurity provided by split s is, by (2):

    C_TN(s, t|C_1) = I_N(t, C_1) − p_L I_N(t_L, C_1) − p_R I_N(t_R, C_1) = 2 p_L p_R [π_t(C_1|L) − π_t(C_1|R)]².

Now let C*_{1|s} = arg max_{C_1} C_TN(s, t|C_1). Split s is evaluated through

    C_TN(s, t) = 2 p_L p_R [π_t(C*_{1|s}|L) − π_t(C*_{1|s}|R)]².   (3)

The best split is then s*(t) = arg max_s C_TN(s, t). C_GN and C_TN both consist of two components.
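As an illustration (ours, not code from the paper), the nominal impurity (1) and the Gini–Simpson criterion (2) can be sketched in Python, with per-class counts standing in for the node compositions:

```python
def gini_impurity(counts):
    """Nominal Gini-Simpson impurity, Eq. (1): I_N(t) = 1 - sum_g pi_t(g)^2."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def c_gn(left, right):
    """Gini-Simpson criterion, Eq. (2): the decrease in impurity
    I_N(t) - pL*I_N(tL) - pR*I_N(tR) achieved by the split."""
    parent = [l + r for l, r in zip(left, right)]
    n = sum(parent)
    p_l, p_r = sum(left) / n, sum(right) / n
    return gini_impurity(parent) - p_l * gini_impurity(left) - p_r * gini_impurity(right)
```

For instance, for a split sending {25, 0, 25, 0} left and {0, 25, 0, 25} right, both forms of Eq. (2) give p_L p_R Σ_g [π_t(g|L) − π_t(g|R)]² = 0.25.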
The term p_L p_R, called the anti-end-cut factor, forces the criteria to favor splits producing sub-nodes of similar sizes. The second term synthesizes the difference between the two conditional distributions of Y in the sub-nodes induced by the split.

Whatever the criterion selected to grow the tree, the growing phase ends when the maximal tree, T_Max, is obtained. The growing phase is followed by the pruning phase, whose aim is to prevent over-fitting of the classification tree to the training set it is based upon. At each step of the pruning phase two terminal nodes are merged. A sequence of nested sub-trees with a decreasing number of nodes is thus determined, T_Max ⊃ T_1 ⊃ T_2 ⊃ ··· ⊃ {t_1}, where {t_1} indicates the root node of T_Max. The "best" tree is then chosen as the one with the lowest risk [measured by the misclassification rate (mr) or by the misclassification cost (mc)], evaluated by referring to a validation set or by using v-fold cross-validation (see B & A for details). Throughout the paper, the optimum-size tree, selected after the pruning, is the one with the minimum tenfold cross-validation risk. In the next section, we consider the problem of growing a maximal tree when the response variable is ordinal.

3 The ordinal problem: old and new solutions

If the response, Y, is ordinal, it can be of interest to grow a classification tree taking into account the ordering of the levels of Y. Before proceeding, it is worthwhile to clarify which kind of split should be considered a "good" one in an "ordinal sense". For a nominal response, a good split should lead to "pure" sub-nodes, possibly characterized by a strong mode (hence the modal category should have a high frequency). When the response is ordinal, the "purity" of a node is not necessarily the primary criterion to evaluate the goodness of a split. Consider for example a node t with the following composition: n_t = {25, 25, 25, 25}.
Let s1 and s2 be two splits of t, both with p_L = p_R = 0.5. Suppose that split s1 leads to sub-nodes with compositions n_tL(s1) = {25, 0, 25, 0} and n_tR(s1) = {0, 25, 0, 25}, while for split s2 they are n_tL(s2) = {25, 20, 5, 0} and n_tR(s2) = {0, 5, 20, 25}. Split s1 is purer than split s2: it sends all cases of classes 1 and 3 into t_L and all cases of classes 2 and 4 into t_R, thus assuring a substantial simplification of node t. Nevertheless, in an ordinal sense split s1 does not appear preferable to s2, since it leads to sub-nodes containing cases with non-adjacent levels of Y. Hence, an ordinal criterion could select the less pure split s2 as the best. Since nominal and ordinal criteria may select different splits, the inspection of trees obtained with ordinal criteria may be of substantial interest even from an exploratory point of view: it allows insight into the data and may reveal relations between the explanatory structure and the response variable which are not revealed by "nominal trees". In this section, we illustrate some old and new criteria for growing the maximal tree taking into account the ordering of the levels of Y.

The twoing criterion (3) can easily be extended to this case. While in the nominal case any subset of the categories of Y can be considered, in the ordinal case attention is limited to partitions such as C_g = {y_1, y_2, ..., y_g} and C̄_g = {y_{g+1}, y_{g+2}, ..., y_G}. It is clear that π_t(C_g) = Σ_{j=1}^{g} π_t(j) = F_t(g), the cumulative distribution function (cdf) of Y evaluated at y_g. Recalling (3), simple calculations yield that for given C_g and s:

    C_TO(s, t|C_g) = p_L p_R [π_t(C_g|L) − π_t(C_g|R)]² = p_L p_R [F_t(g|L) − F_t(g|R)]².   (4)

Heterogeneity between the two sub-nodes is now measured by referring to the dissimilarity between the two conditional cdf's.
It can easily be shown that, for a given split s, the class maximizing (4) is C_{g*|s} with

    g* = arg max_g [F_t(g|L) − F_t(g|R)]².

Hence the ordered twoing criterion to evaluate a split is

    C_TO(s, t) = p_L p_R max_g [F_t(g|L) − F_t(g|R)]².   (5)

The dissimilarity between the two cdf's is thus measured by their maximal distance. In this sense, the ordered twoing criterion is not based upon a "global" measure of dissimilarity between the two cdf's; we will discuss this issue in more detail later. The twoing criterion is extended to the ordinal case by substituting the conditional cdf's in (4) for the conditional distributions in (3). Using a similar substitution in (2), the Gini–Simpson criterion can also be extended to the ordinal case:

    C_GO(s, t) = p_L p_R Σ_{g=1}^{G−1} [F_t(g|L) − F_t(g|R)]².   (6)

It is simple to show that C_GO represents the decrease in impurity when passing from t to the sub-nodes induced by split s. The impurity measure considered in this case is obtained by substituting F_t(g) for π_t(g) in (1):

    I_O(t) = Σ_{g=1}^{G} F_t(g) · [1 − F_t(g)].   (7)

I_O(t) is a measure of the heterogeneity of an ordinal variable originally proposed by Gini (1954). The main difference between C_GO and C_TO is the measure of dissimilarity between the two conditional (cumulative) distributions. In C_GO the "global" distance between the two conditional cdf's is taken into account, while in C_TO only their maximal distance is considered. To appreciate the difference between the two criteria, consider the following example.

[Fig. 1: Conditional cdf's for the two splits s1 and s2.]

Let t be a node characterized by n_t = {10, 20, 20, 30, 20}. Let s1 and s2 be two splits of t, both with p_L = p_R = 0.5.
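To make the contrast concrete, a small sketch (our illustration, not the paper's code) computing C_TO (5) and C_GO (6) from the conditional cdf's reproduces the example discussed here:

```python
def cdf(counts):
    """Cumulative distribution F_t(g) of a node from its class counts."""
    n, acc, out = sum(counts), 0, []
    for c in counts:
        acc += c
        out.append(acc / n)
    return out

def c_to_and_c_go(left, right):
    """Ordered twoing C_TO (Eq. 5) and ordinal Gini-Simpson C_GO (Eq. 6)."""
    n = sum(left) + sum(right)
    p_l, p_r = sum(left) / n, sum(right) / n
    # Squared cdf differences for g = 1, ..., G-1 (F_t(G) = 1 on both sides).
    d2 = [(a - b) ** 2 for a, b in zip(cdf(left), cdf(right))][:-1]
    return p_l * p_r * max(d2), p_l * p_r * sum(d2)
```

On the node n_t = {10, 20, 20, 30, 20} with the two splits of the example, C_TO equals 0.25 · 0.16 = 0.04 for both, while C_GO prefers s2 (0.0828 vs. 0.04).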
The sub-nodes generated by s1 have compositions n_tL(s1) = {5, 10, 10, 25, 0} and n_tR(s1) = {5, 10, 10, 5, 20}, while for split s2 they are n_tL(s2) = {8, 14, 10, 18, 0} and n_tR(s2) = {2, 6, 10, 12, 20}. The conditional cdf's corresponding to the two splits are reported in Fig. 1. In this case C_TO(s1) = C_TO(s2), since the maximal distance between the cdf's is equal to 0.4 for both splits. Hence, criterion C_TO does not take into account the fact that the two cdf's defined by s1 coincide for Y < 4. On the contrary, criterion C_GO prefers split s2 to split s1 since, as is evident from Fig. 1, the overall distance between the two cdf's is greater in that case. In the next section, C_GO and C_TO are compared by referring to a property first considered by Taylor and Silverman (1993) (from now onwards T & S).

3.1 The exclusivity preference property

For criteria used to build classification trees when the response is nominal, T & S introduced the concept of the "exclusivity preference property", and showed that the Gini–Simpson criterion does not have this property. In this section, we introduce an extension of the exclusivity preference property to the ordinal case, show that C_GO does not possess this property, and try to explain the meaning of "not possessing" it. A splitting criterion has the (nominal) exclusivity preference property if:

i) given p_L and p_R, it assigns the largest value to any split that is exclusive, i.e., such that Σ_{g=1}^{G} π_t(g|L) · π_t(g|R) = 0;

ii) regardless of p_L and p_R, it takes the smallest possible value if the two conditional distributions induced by a split are identical, i.e., when π_t(g|L) = π_t(g|R) for all g = 1, ..., G.

To extend the exclusivity preference property to the ordinal case, we introduce the concept of ordinal exclusive splits.
A split will be said to be ordinal exclusive if a level g* of the response variable exists such that

    [Σ_{g=1}^{g*} π_t(g|L)] · [Σ_{g=1}^{g*} π_t(g|R)] = 0   and   [Σ_{g=g*+1}^{G} π_t(g|L)] · [Σ_{g=g*+1}^{G} π_t(g|R)] = 0,

that is, below g* at most one of the two sub-nodes has positive mass, and the same holds above g*. Hence a criterion has the "monotone exclusivity preference property" if, given p_L and p_R, it favors ordinal exclusive splits, assigning the largest value to them, and if it satisfies condition ii). A split is ordinal exclusive if and only if one of the conditional distributions is entirely above or entirely below the other (see for example splits s2 and s3 in Table 1 below).

Table 1  Monotone exclusivity preference property

    Split s1:  n_tL(s1) = {10, 30, 39, 1, 0}    n_tR(s1) = {0, 0, 1, 19, 20}
    Split s2:  n_tL(s2) = {0, 0, 40, 20, 20}    n_tR(s2) = {10, 30, 0, 0, 0}
    Split s3:  n_tL(s3) = {10, 30, 40, 0, 0}    n_tR(s3) = {0, 0, 0, 20, 20}

It can easily be shown that C_GO always satisfies property ii), but (for given p_L and p_R) it does not always favor ordinal exclusive splits (as T & S showed for the nominal Gini–Simpson criterion). To see this, consider splits s1 and s2 in Table 1. Even though s2 is ordinal exclusive, C_GO selects s1 as the best split (since C_GO(s1, t) = 0.3205 > C_GO(s2, t) = 0.3056). It is instead simple to show that the ordered twoing criterion does have the monotone exclusivity preference property.

Before proceeding, we want to inspect the meaning of this property more deeply. In their work, T & S provided an example similar to the one in Table 1 for the nominal case, and argued: "Split A is an exclusive split and its Gini–Simpson score is ..., however the Gini–Simpson criterion for the non exclusive split B is the higher value ... It is hard to imagine that split B would be preferred from any intuitive point of view" (T & S, p. 155). Actually, there is a reason why this happens (both in the nominal and in the ordinal case).
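As an illustrative sketch (our own, under the support-separation reading of ordinal exclusivity given above), the splits of Table 1 can be checked numerically:

```python
def c_go(left, right):
    """Ordinally extended Gini-Simpson criterion, Eq. (6)."""
    n_l, n_r = sum(left), sum(right)
    p_l, p_r = n_l / (n_l + n_r), n_r / (n_l + n_r)
    f_l = f_r = total = 0.0
    for l, r in zip(left[:-1], right[:-1]):          # g = 1, ..., G-1
        f_l, f_r = f_l + l / n_l, f_r + r / n_r      # running conditional cdf's
        total += (f_l - f_r) ** 2
    return p_l * p_r * total

def is_ordinal_exclusive(left, right):
    """True if some level g* separates the supports: below g* at most one
    sub-node has mass, and likewise above g*."""
    for g_star in range(1, len(left)):
        low = sum(left[:g_star]) * sum(right[:g_star])
        high = sum(left[g_star:]) * sum(right[g_star:])
        if low == 0 and high == 0:
            return True
    return False
```

This reproduces C_GO(s1, t) ≈ 0.3205 > C_GO(s2, t) ≈ 0.3056 even though only s2 (not s1) is ordinal exclusive, and C_GO(s3, t) ≈ 0.3368 largest of the three.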
A criterion having the exclusivity preference property considers, for given p_L and p_R, all the exclusive splits as equivalent. On the contrary, C_GO discriminates among ordinal exclusive splits. To verify this, consider split s3 in Table 1: C_GO(s3, t) = 0.3368 > C_GO(s1, t) > C_GO(s2, t). So s1 is preferred to the ordinal exclusive split s2 because it is "more similar" to the ordinal exclusive split s3, which in turn is preferred to the alternative ordinal exclusive split s2 (a similar argument also holds for the nominal Gini–Simpson criterion). In conclusion, the point we wish to make is that not having the monotone exclusivity preference property should not necessarily be considered a drawback of a given criterion (as in T & S).

Nevertheless, we can suppose that in applications we will rarely need to compare exclusive splits, and we may then be interested in using a criterion characterized by the exclusivity preference property. This is why we now introduce another criterion that has the monotone exclusivity preference property and is based on a "global" measure of the dissimilarity between the two conditional distributions induced by a split (we are in fact interested in evaluating the impact of both factors on the performance of the classifier: the kind of dissimilarity measure and the exclusivity preference property). To do this, we refer to an index introduced by Agresti (1981) to evaluate nominal–ordinal association. Consider a node t and a split s; s defines a nominal variable (which we will still denote by s) assuming the two categories L and R. A possible measure of the association between s and Y is (Agresti 1981):

    Δ_A(s, t) = Σ_{g=1}^{G} Σ_{j>g} π_t(g|L) π_t(j|R) − Σ_{g=1}^{G} Σ_{j>g} π_t(g|R) π_t(j|L)
              = Σ_{g=1}^{G} π_t(g|L) [1 − F_t(g|R)] − Σ_{g=1}^{G} π_t(g|R) [1 − F_t(g|L)].

Our idea is to use Δ_A(·, t) as a criterion for selecting the split. Notice that −1 ≤ Δ_A(s, t) ≤ 1, with |Δ_A(s, t)| = 1 if and only if s induces an ordinal exclusive split. It is then clear that the maximum value is reached by all the ordinal exclusive splits, irrespective of p_L and p_R. As T & S point out, "under certain circumstances, we might be prepared to sacrifice exclusivity of the offspring if it can only be attained at the cost of creating one very small offspring node". This is the reason why exclusivity should be rewarded only conditionally on (p_L · p_R), which reaches its maximum value when p_L = p_R = 0.5. So, the criterion we will refer to is:

    C_A(s, t) = p_L p_R |Δ_A(s, t)|,   (8)

containing the anti-end-cut factor, like the other criteria described above.

3.2 Adaptive anti-end-cut factors

As pointed out above, all the considered ordinal criteria contain the so-called anti-end-cut factor (from now onwards aecf), forcing a criterion to favor splits producing sub-nodes of similar sizes. T & S illustrate that in some situations this characteristic may lead to unsatisfactory trees. To see the effects of the aecf, consider splits s1 and s2 in Table 2. The (pure) split s1, putting together adjacent classes of Y, should intuitively be preferred to s2. Nevertheless, all of the ordinal criteria prefer s2 to s1, due to the aecf: aecf_1 < aecf_2, while the dissimilarity measures between the conditional distributions all favor split s1. Hence, the preference for equally-sized sub-nodes leads the criteria to ignore the simplifying split s1 in favor of s2. To emphasize the different impact of the aecf on C_GO, C_TO and C_A, compare s1 and s3 in Table 2: s1, which appears preferable to s3, is selected as the best by C_GO and C_TO; on the contrary, C_A selects s3 as the best, hence appearing more influenced by the aecf.
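A sketch of Δ_A and of criterion (8) (our illustration; the index is Agresti's, the criterion is the one introduced above):

```python
def delta_a(left, right):
    """Agresti's nominal-ordinal association index Delta_A(s, t)."""
    n_l, n_r = sum(left), sum(right)
    pi_l = [c / n_l for c in left]
    pi_r = [c / n_r for c in right]
    g_levels = len(left)
    # Concordant minus discordant pairs across the two sub-nodes.
    return sum(pi_l[g] * pi_r[j] - pi_r[g] * pi_l[j]
               for g in range(g_levels) for j in range(g + 1, g_levels))

def c_a(left, right):
    """C_A(s, t) = pL * pR * |Delta_A(s, t)|, Eq. (8)."""
    n = sum(left) + sum(right)
    return (sum(left) / n) * (sum(right) / n) * abs(delta_a(left, right))
```

|Δ_A| = 1 for every ordinal exclusive split (e.g., {10, 30, 40, 0, 0} vs. {0, 0, 0, 20, 20} from Table 1), so among such splits only the anti-end-cut factor discriminates.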
Table 2  The effects of the aec factor (node t with n_t = {22, 21, 26, 31}, N = 100)

    Split s1:  n_tL(s1) = {22, 0, 0, 0}   (n = 22)    n_tR(s1) = {0, 21, 26, 31}  (n = 78)
               C_GO = 0.1716 · 1.692     C_TO = 0.1716 · 1         C_A = 0.1716 · 1
    Split s2:  n_tL(s2) = {22, 18, 0, 0}  (n = 40)    n_tR(s2) = {0, 3, 26, 31}   (n = 60)
               C_GO = 0.2400 · 1.4719    C_TO = 0.2400 · 0.9025    C_A = 0.2400 · 0.9775
    Split s3:  n_tL(s3) = {22, 3, 0, 0}   (n = 25)    n_tR(s3) = {0, 18, 26, 31}  (n = 75)
               C_GO = 0.1875 · 1.5228    C_TO = 0.1875 · 0.7744    C_A = 0.1875 · 0.9712

These considerations show that in some cases it may be sensible to remove the tendency of splitting criteria to favor splits producing sub-nodes of similar sizes, while still avoiding the opposite tendency to obtain very small offspring. To attenuate the effects of the aecf without totally eliminating it, T & S suggest adapting the aecf according to the complexity of a node. The simplest measure of the complexity of a node t is the class number index, m(t), the number of levels of Y present in t. An alternative measure, taking into account not only the number of levels assumed by Y but also their relevance within t, is the so-called reciprocal entropy index:

    m*(t) = [Σ_{g=1}^{G} π_t²(g)]^{−1} = [1 − I_N(t)]^{−1}.

The complexity of a node now depends upon its impurity. It is simple to see that when Y is uniformly distributed within t, m(t) = m*(t). On the basis of the complexity of a node a threshold p_LOW ≤ 0.5 is obtained, and the same aecf is associated with all the splits for which p_L and p_R are both greater than p_LOW. Thus, the penalization for asymmetrical splits is applied only to splits leading to a sub-node with a proportion of cases lower than p_LOW. According to this criterion, an adaptive aecf for a given p_LOW is defined as:

    AEC(p_L) = min[p_L(1 − p_L), p_LOW(1 − p_LOW)].

T & S introduce the basic adaptive aecf (ba-aecf), based upon m(t), and the enhanced adaptive aecf (ea-aecf), based upon m*(t).
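The adaptive factor can be sketched as follows (our illustration, using the enhanced threshold p_LOW = min(1/2, 1/m*(t)) stated in the text):

```python
def adaptive_aecf(p_l, counts):
    """Enhanced adaptive anti-end-cut factor:
    AEC(pL) = min[pL(1 - pL), p_LOW(1 - p_LOW)],
    with p_LOW = min(1/2, 1/m*(t)) and m*(t) = [sum_g pi_t(g)^2]^(-1)."""
    n = sum(counts)
    m_star = 1.0 / sum((c / n) ** 2 for c in counts)   # reciprocal entropy index
    p_low = min(0.5, 1.0 / m_star)
    return min(p_l * (1 - p_l), p_low * (1 - p_low))
```

For a uniform four-level node, m*(t) = 4 and p_LOW = 0.25, so the penalization is capped at 0.25 · 0.75 = 0.1875: splits more balanced than p_L = 0.25 are no longer rewarded further.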
The thresholds are respectively p_LOW = 1/m(t) and p*_LOW = min[1/2, 1/m*(t)].

3.3 An application to real data

To compare the considered criteria we now refer to a data set consisting of N = 1,002 observations on Italian investors, customers of an Italian firm. The response variable is an ordinal qualitative variable with six levels, measuring the "importance" of an investor for the firm (as stated by the firm itself on the basis of the amount of capital invested in the years subsequent to entry, portfolio diversification, and so on). The response has to be predicted on the basis of information about the first investment choices (the amount of invested capital, the first form of investment) and of socio-demographic characteristics.

[Fig. 2: Cross-validated risk vs. number of terminal nodes, for P = mc and P = mr, comparing Ordinal Gini, Agresti, Ordinal twoing, and Nominal Gini (all with aec = s).]

The trees were grown using the different criteria, each combined with the standard and the (two) adaptive aecf's. The obtained trees were then pruned and cross-validated (tenfold cross-validation was used). The first measure of risk considered to evaluate and to prune a tree was the misclassification rate, mr. To also take the ordinality of Y into account, a cost was then assigned to each misclassification error. The cost of misclassifying a class-j object as a class-g object was set equal to c_{g|j} = |g − j|: it depends only on the "distance" between j and g. We deliberately do not attach costs reflecting "aversion" to particular errors (so as to avoid the comparison being affected by subjective choices), and the matrix of misclassification costs is symmetric.
Of course, in applications different cost structures may be considered. Thus, each tree was pruned by referring both to mr and to mc (in the following, P = mr, mc indicates the criterion used to measure the risk). In Fig. 2, we report the values of the cross-validated risk as a function of the number of terminal nodes, for trees pruned by referring either to mr or to mc. Notice that the ordinal criteria perform better than C_GN, even when the tree is mc-pruned. This is an interesting point, since one might believe that ordinality can be taken into account in the pruning phase only (i.e., the tree is grown using a nominal criterion and then pruned using an ordinal criterion). In this example, instead, we observe that when the relationship between the ordinal response and the explanatory variables is monotone, a fully ordinal approach is preferable both w.r.t. mr and w.r.t. mc. Nevertheless, in other applications, in particular when the relationship between the predictors and the response variable is not so strongly monotone, we observed a good performance of the nominal Gini–Simpson criterion, at least w.r.t. mr. In this sense, considering both ordinal and nominal criteria not only permits building different trees (which is in itself relevant from an exploratory point of view), but also permits inspecting the kind of relationship (monotone or not) between the response variable and the explanatory ones. Moreover, in many applications we observed that ordinal criteria often lead to trees with intermediate sub-nodes characterized by adjacent levels of the ordinal response. A better performance (as compared to the nominal criterion) was noticed both in terms of mc and in terms of ordinal–ordinal measures of association (Spearman coefficient, Kendall tau-b coefficient) between the response variable and its prediction.
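The two risk measures used for pruning can be sketched as follows (our illustration; classes are coded 1, ..., G, with c_{g|j} = |g − j| as in the text):

```python
def mr(y_true, y_pred):
    """Misclassification rate: the fraction of wrongly classified cases."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def mc(y_true, y_pred):
    """Mean misclassification cost with symmetric costs c_{g|j} = |g - j|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

Predicting class 4 for a class-2 case costs 2 under mc, while under mr it counts the same as any other error.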
[Fig. 3: First splits of the trees grown using C_GO, C_A, C_TO, and C_GN (aec = s). In each node the distribution of the response variable is reported; in parentheses, for each child node, the proportion of cases of the parent node in a given class which is "absorbed" by the child node.]

Among the ordinal criteria, we observe that the best performance is attained by C_A, and that C_GO performs slightly better than C_TO. Our aim is not to state conclusions about the performance of the criteria on the basis of these results. In fact, we are mainly interested in showing that different criteria lead to different trees, and that our criteria enrich the class of ordinal criteria. Indeed, the performance of a classifier is not the only relevant aspect when comparing trees. A very important issue concerns the comparison of the "structure" of the intermediate nodes characterizing the trees. To emphasize the "qualitative" differences in the splits characterizing the trees, in Fig. 3 we report the first splits of the trees grown using C_GO, C_A, C_TO, and C_GN.

[Fig. 4: Cross-validated risk (P = mc) vs. number of terminal nodes for each criterion (Ordinal Gini, Agresti, Ordinal twoing, Nominal Gini), each combined with aec = s, ba, and ea.]

For each node in Fig. 3 the conditional distribution of the response within the node itself is reported.
To emphasize the degree of separation between two sub-nodes we also report, for a given class of the response, the proportions of cases sent to each sub-node (i.e., for class j we report p(t_L|j) and p(t_R|j) = 1 − p(t_L|j)). We immediately observe the monotone exclusivity preference property "at work" for C_A and C_TO. Both criteria attempt to single out pure nodes, characterized by the absence of low classes. A comparison with C_GO shows that this criterion "sacrifices" purity to obtain nodes "absorbing" a high proportion of the high-level classes. Moreover, notice that C_A is more influenced than the other criteria by the aecf, since we observe sub-nodes of similar size. As concerns C_GN, we observe a first split identical to that of the C_GO-tree. To appreciate the difference between the two criteria, compare the splits of the left node at the first level. C_GN leads to a left sub-node characterized by a strong presence of class-6 cases; class-5 cases are also in the node, even if the highest proportion of class-5 cases is sent to the right sub-node. Notice that the proportion of class-5 and class-6 cases in the right sub-node is higher than its counterpart in the C_GO-tree, which appears more able to absorb a higher proportion of higher-level cases.

In Fig. 4, the cross-validated risk as a function of the number of terminal nodes is reported for the (mc-pruned) trees obtained by combining the adaptive aecf's with the considered criteria. Notice that, at least in this application, the ordinal criteria do not suffer much from the presence of the aecf: trees obtained by adapting the aecf are not better than the "standard" ones. Different conclusions can be drawn for C_GN: trees grown with an adaptive aecf perform better. To understand why, we recall that in their work T & S pointed out that the use of an adaptive aecf in conjunction with C_GN leads to trees having similar performances in terms of mr.
They moreover emphasized that, when considering C_GN-trees for an ordinal response, one of the main consequences of adapting the aec_f was that intermediate sub-nodes were constituted by adjacent levels of the response (so that a decrease in the mc should be observed). We can observe a stronger tendency (as compared with the "standard" tree) to separate cases characterized by the lowest levels of the response from the others. In this sense, the presence of the adaptive aec_f helps in identifying asymmetrical sub-nodes, isolating classes also in the first splits. Similar results can be observed (at least when considering the first splits) when the aec_f = ea is used. This difference between ordinal and nominal criteria is maybe due to the fact that our criteria still lead to sub-nodes possibly constituted by adjacent levels of the response, so that adapting the aec_f has less significant consequences in this context. Nevertheless, as T & S point out, the adaptive aec_f can prove useful in identifying small and well identified splits from the first phases of the segmentation process. To verify this, in Fig. 5 we report the first splits of the trees obtained with aec = ba.

Fig. 5 First splits for all criteria (aec = ba)

We tested the impact of the adaptive aec_f on ordinal trees generated for different data sets.
The ordinal criterion that appears most influenced by the aec_f is C_A. As for C_GO and C_TO, we often observed similar trees and similar performances (both in terms of mr and of mc) for the standard and the adapted algorithms. We also considered measures of ordinal–ordinal association between the response and the predicted variable, and the conclusions were similar. As evidenced in this application, the criterion most influenced by the aec_f is C_GN.

4 Performance evaluation

In this section the performance of the splitting criteria is compared via simulations. As evidenced in the previous section, ordinal criteria perform better than the nominal one (possibly combined with an mc-pruning phase) in cases when the relationship between the response variable and the explanatory variables is ordinal. For this reason we here consider simulations where the relationship between the response and the explanatory variables is structured but not strongly ordinal. Thus, we are interested in evaluating and comparing the capability of the ordinal criteria to ordinally predict the response variable in a not too favorable setting.

Fig. 6 Waveforms referred to in simulations

To this aim, we adapted the waveform recognition model, described by B & A (p. 49), to our problem. This model is based upon the waveforms h_1(t), h_2(t), . . . graphed in Fig. 6. The i-th function is centered on a value, indicated by c_i. The vertices of the i-th triangle will be denoted by l_i and u_i: l_i = c_i − 6, u_i = c_i + 6. Let G denote the number of levels of the response variable Y. N = (50 · G) explanatory vectors were generated using prior probabilities (1/G, . . . , 1/G). Hence, there are approximately 50 cases for each class of Y. For each case, the vector of explanatory variables is obtained as a random combination of two waveforms, with noise added.
The number of waveforms taken into account and the number of explanatory variables generated, n_x, depend upon G. More precisely, if g* is the number of waveforms necessary to obtain G classes, the number of explanatory variables generated is n_x = u_{g*}. To create an explanatory vector, x = [x_1, . . . , x_{n_x}], a random number u is sampled from a uniform distribution, and n_x random numbers ε_1, . . . , ε_{n_x} are independently sampled from a standardized normal distribution (u is independent of ε_m for each m = 1, . . . , n_x). Different explanatory vectors are then generated for each class:

Class j:       x_m = u h_{(j+1)/2}(m) + (1 − u) h_{(j+1)/2 + 1}(m) + ε_m,
Class (j + 1): x_m = u h_{(j+1)/2}(m) + (1 − u) h_{(j+1)/2 + 2}(m) + ε_m,

with j = 1, 3, 5, 7 and m = 1, . . . , n_x. In what follows, a data set created as described above will be named a waveform-set. For each G, we generated K = 40 training waveform-sets, L_train,G,k, k = 1, . . . , 40. For each L_train,G,k, the maximal tree was built using C_GO, C_A, C_TO and C_GN, each one combined with the standard aec_f (aec = s), the basic adaptive aec_f (aec = ba), and the enhanced adaptive aec_f (aec = ea). Summing up, we have:

G: number of levels of the response variable, G = 3, 4, 5, 6, 7;
K: number of simulations, K = 40;
cr: criterion to obtain the maximal tree, cr = GO, A, TO, GN;
aec: anti-end-cut factor, aec = s, ba, ea.

Hence, we have 2,400 maximal trees, T_Max|G,k,cr,aec. Each maximal tree was pruned taking into account two measures of risk, namely mr and mc(1): thus, two sequences of nested subtrees, {T_j}_P, were obtained, P = mr, mc. The tree characterized by the minimum risk (evaluated via tenfold cross-validation) is finally selected as the optimum-size tree, T*_{G,k,cr,aec,P}: thus we have 2,400 × 2 = 4,800 classifiers. The trees are evaluated on the basis of a test waveform-set, L_test,G,k, of N = (50 · G) observations.

(1) Again, we set c_{g|j} = |g − j|.
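The data-generating scheme above can be sketched in Python. This is a minimal sketch under stated assumptions: we take the triangular form h_i(t) = max(6 − |t − c_i|, 0), which is consistent with the stated vertices l_i = c_i − 6 and u_i = c_i + 6, and the centers c_i are supplied by the caller (the paper does not fix them); the function name is illustrative.

```python
import numpy as np

def simulate_waveform_set(G, centers, n_per_class=50, rng=None):
    """Generate a waveform-set: about n_per_class cases per class of Y.

    centers: c_1, ..., c_{g*}; waveform i is the triangle
    h_i(t) = max(6 - |t - c_i|, 0) with vertices c_i - 6 and c_i + 6.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_x = int(centers[-1] + 6)                 # n_x = u_{g*} = c_{g*} + 6
    m = np.arange(1, n_x + 1)                  # coordinates m = 1, ..., n_x
    h = lambda i: np.maximum(6.0 - np.abs(m - centers[i - 1]), 0.0)
    X, y = [], []
    for cls in range(1, G + 1):
        j = cls if cls % 2 == 1 else cls - 1   # odd "anchor" class j
        k = (j + 1) // 2                       # index of the first waveform
        k2 = k + 1 if cls % 2 == 1 else k + 2  # second waveform: class j vs j+1
        for _ in range(n_per_class):
            u = rng.uniform()                  # mixing weight, independent of the noise
            eps = rng.standard_normal(n_x)     # noise eps_1, ..., eps_{n_x}
            X.append(u * h(k) + (1 - u) * h(k2) + eps)
            y.append(cls)
    return np.asarray(X), np.asarray(y)
```

For example, with G = 6 the classes 1, . . . , 6 use the waveform pairs (h_1, h_2), (h_1, h_3), (h_2, h_3), (h_2, h_4), (h_3, h_4), (h_3, h_5), matching the formulas above.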
Each case in L_test,G,k is assigned a prediction.(2) The true category of Y for the L_test,G,k-cases is known. Hence, for each T*_{G,k,cr,aec,P} it is possible to evaluate the L_test-mr and the L_test-mc, indicated by M^cr_{G,k,aec,P} and C^cr_{G,k,aec,P}, respectively. As concerns the "absolute" performance of the algorithms, the following measures of error are considered:

M_cr,G,k,aec,P = M^cr_{G,k,aec,P} / M^L_{G,k},    C_cr,G,k,aec,P = C^cr_{G,k,aec,P} / C^L_{G,k},

M^L_{G,k} and C^L_{G,k} being the mr and the mc of the root node of the test set (representing the risk incurred when predicting Y ignoring the information on the explanatory variables). To further emphasize the differences between the algorithms with respect to mr and mc, we obtained the standardized measures of error,(3) M^s_cr,G,k,aec,P and C^s_cr,G,k,aec,P. The syntheses of the (standardized) errors for each (G, aec, P), M^s_cr,G,aec,P and C^s_cr,G,aec,P, are obtained by averaging. We also considered Min^{cr,M}_{G,aec,P} and Min^{cr,C}_{G,aec,P}, the number of times each criterion is a strong minimizer of the error (in terms of M or C), i.e., alone reaches the minimum value of the error. Besides these "marginal" measures of error, the number of times each criterion simultaneously minimizes (M and C), Min^{cr,MC}_{G,aec,P}, was also calculated. Similarly, Max^{cr,M}_{G,aec,P}, Max^{cr,C}_{G,aec,P}, and Max^{cr,MC}_{G,aec,P} will denote the number of times a criterion is a maximizer of the errors. For paired comparisons, the following quantities were taken into account:

ΔM_{G,k,aec,P}(cr1, cr2) = M_cr1,G,k,aec,P − M_cr2,G,k,aec,P
ΔC_{G,k,aec,P}(cr1, cr2) = C_cr1,G,k,aec,P − C_cr2,G,k,aec,P.

All the measures used to compare the criteria were also adapted to evaluate the impact of the different kinds of aec_f on the performance of the algorithms.
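With the linear costs c_{g|j} = |g − j| of footnote 1, the two risks and their relative versions M and C can be computed as follows. A minimal sketch with illustrative function names; taking the root-node risk as that of the best constant prediction is our reading of "predicting Y ignoring the explanatory variables":

```python
import numpy as np

def mr(y_true, y_pred):
    """Misclassification rate."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))

def mc(y_true, y_pred):
    """Misclassification cost with c_{g|j} = |g - j| (footnote 1)."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def relative_errors(y_test, y_pred):
    """Relative errors M and C: test-set risks divided by the corresponding
    root-node risks, i.e., the risks of the best constant prediction."""
    y_test = np.asarray(y_test)
    classes = np.unique(y_test)
    mr_root = min(mr(y_test, np.full_like(y_test, g)) for g in classes)
    mc_root = min(mc(y_test, np.full_like(y_test, g)) for g in classes)
    return mr(y_test, y_pred) / mr_root, mc(y_test, y_pred) / mc_root
```

Values of M or C below 1 thus indicate an improvement over ignoring the predictors, separately for the nominal (mr) and ordinal (mc) notions of risk.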
4.1 Comparison between splitting criteria

The performance of the criteria can be analyzed (1) conditionally on the aec_f (for a given aec_f, we compare the relative performances of the criteria) and (2) conditionally on the criterion (for a given criterion, we analyze the impact of adapting the aec_f).

(2) The prediction associated with each terminal node is the category of the response minimizing the risk in the node itself. The predictions only depend upon the observations in the training set.
(3) Of course, the standardization was carried out conditionally upon G, aec, P.

Fig. 7 Scatter plots of M^s and C^s (aec = s; panels for P = mr and P = mc)

Comparison conditional on the aec_f. We first consider the case when aec = s. In Fig. 7 the scatter plots of M^s and C^s are reported (for P = mr and P = mc). Since the standardization was carried out conditionally on G, the measures of error for different values of G are not comparable.(4) Inspection of the scatter plots shows that we have an mr-oriented criterion, C_GN, and an mc-oriented criterion, C_A.

• C_GN is often characterized by the lowest mr (for a given G): the ordinal criteria perform worse than C_GN w.r.t. mr, even if a significant difference in performance is observed only between C_A and C_GN (the former performing significantly worse than the latter). As for C_GO and C_TO, the difference with C_GN is significant (worse performance) only for the highest values of G. Moreover, in many simulations the C_GN-tree is characterized by the minimum mr, but it rarely minimizes mc, and often maximizes it. As for mc, we observed that the ordinal criteria perform significantly better than C_GN also for low values of G.
These results are somewhat expected: C_GN favors splits leading to pure nodes (which are not necessarily ordinally pure), and may thus perform better w.r.t. mr. When using ordinal criteria, we are prepared to accept a (possibly not too high) increase of mr in order to obtain a tree with terminal nodes characterized by (possibly) adjacent levels of the ordinal response, and hence by a lower mc. These considerations also hold when the tree is pruned according to mc. Pruning nominal trees taking into account the ordering of Y is not "enough" to improve the performance of C_GN w.r.t. mc (as compared to the ordinal criteria).

(4) Actually, as G increases a worse performance of all the considered algorithms can be noticed, both in terms of mr and in terms of mc.

• Criterion C_A is often characterized by a low mc (and in many simulations it minimizes the mc). On the other hand, it often performs worse than the other algorithms w.r.t. mr (it seldom minimizes mr and often maximizes it). By analyzing the differences ΔM and ΔC between C_A and the other criteria we noticed that C_A has a significantly better performance w.r.t. mc but also a worse performance w.r.t. mr. The two "oriented" criteria, C_A and C_GN, while having a better "marginal" performance (w.r.t. mc and mr, respectively), have a poor "joint" performance, showing low values of Min^MC and high values of Max^MC.

• As concerns C_GO and C_TO, they are more "balanced" w.r.t. the two kinds of error, showing a "medium" performance both w.r.t. mr and w.r.t. mc. Analyzing the number of times their trees reach the minimum and the maximum errors, we observe a better performance as compared to C_A and C_GN. Nevertheless, we have to point out that C_GO leads to classifiers often showing a (slightly) better mc-performance: for some values of G the ordinal twoing criterion performs (significantly) worse than the other ordinal criteria.
C_TO maximizes C and (M, C) more often than C_GO (C_GO also has a slightly better "joint" performance). The considerations above also hold when comparing the criteria conditionally on the adaptive aec_f, even if a worsening of the relative mc-performance of C_GN is observed. Moreover, when aec = ba and (especially) when aec = ea, the relative mr-performance of C_GN is not as good (with an exception for G = 7) as in the case when aec = s. As a final consideration, we point out that the differences between the errors are seldom significant (especially between the ordinal criteria). In our opinion this does not mean that the criteria are equivalent but, on the contrary, that they are "competitors". Actually, we observed even large differences in the performance in single simulations, meaning that the classifiers are different. This is also confirmed by the correlation coefficients between the errors characterizing the different classifiers: they are low and not significant. In this sense, we think that our criteria can be useful alternatives to C_TO for growing classification trees when the response is ordinal. As a further conclusion, we think that ordinal criteria should be preferred when growing trees for ordinal responses.

Comparison conditional on the criteria. An analysis similar to that described above was conducted to evaluate the impact of the aec_f (standard and adaptive) on the performances of the splitting criteria. We summarize very briefly the main results of the simulations. Among the ordinal criteria, C_A appears most influenced by the aec_f. In particular, when P = mr the combination with the ea-aec_f leads to classifiers showing a better performance both w.r.t. mc and w.r.t. mr (except when G = 7). The s-aec_f is instead characterized by the worst performance. This last consideration also holds when P = mc (even if the relative performance of the two adaptive aec_f depends upon G).
Concerning C_GO and C_TO, we cannot draw conclusions about the effects of the adaptation of the aec_f. Moreover, in some cases a worsening of the performances was observed when adapting the aec_f, both in terms of mr and in terms of mc. With regard to the nominal criterion, the best performance is observed when aec = s. The above mentioned results confirm the considerations of T & S: the adaptation of the aec_f is not necessarily expected to lead to trees characterized by a better performance in terms of mr and mc but, rather, by a better "characterization" of the intermediate sub-nodes.

5 A property of the ordinal Gini–Simpson criterion

At first glance, criterion C_GO may seem more computationally expensive than its nominal counterpart. Actually, when growing a tree using the nominal criterion, it is necessary to obtain the conditional distributions of the response variable in the two sub-nodes induced by a given split. When considering the ordinal criterion, the cumulative conditional distributions have to be determined. It is immediate to notice that when the number of categories of the response variable is small, this further calculation is not excessively costly from a computational point of view. Nevertheless, the evaluation of the conditional distributions for all the possible splits of a given node is computationally intensive in itself. On the basis of this consideration, Mola and Siciliano (1997, 1998) introduced an algorithm to speed up the growing phase in the case when the response variable is nominal and the Gini–Simpson criterion is used to evaluate the splits. We now extend the procedure to the ordinally extended Gini–Simpson criterion. We recall that the criterion to evaluate a split s of a node t is:

C_GO(s, t) = p_L p_R Σ_{g=1}^{G−1} [F_t(g|L) − F_t(g|R)]².  (9)

Now consider the following measure of the impurity within node t:

I_O(t) = Σ_{g=1}^{G} F_t(g) · [1 − F_t(g)].  (10)

As was mentioned in Sect.
3, I_O(t) is a measure of the heterogeneity of an ordinal variable originally proposed by Gini (1954). It can be easily shown that:

C_GO(s, t) = I_O(t) − p_L I_O(t_L) − p_R I_O(t_R).  (11)

Now recall that a split s is defined on the basis of the values assumed by one of the explanatory variables, say X_I, the subscript indicating the number of categories of X. In particular, a split s defined on the basis of X_I divides a node t into two sub-nodes: t_L, constituted by the cases having values of X_I in a given set, say A, and t_R, constituted by the cases having values of X_I not in A. Suppose now that a node t is split, according to the levels of X_I, into I sub-nodes. By extending (11) to this situation, we obtain:

C_GO(X_I, t) = I_O(t) − Σ_{i=1}^{I} p_i I_O(t_i),  (12)

t_i and p_i being, respectively, the node constituted by all cases in t characterized by the i-th value of X_I, and the proportion of cases in t placed in t_i. This quantity is the numerator of the index:

τ_O(t)(Y | X_I) = [I_O(t) − Σ_{i=1}^{I} I_O(i|t) p_t(i)] / I_O(t),  (13)

introduced by Piccarreta (2001) to measure the strength of the nominal–ordinal association between the (nominal) predictor X and the (ordinal) response variable Y (here this association is evaluated restricting attention to the cases in t). This consideration emphasizes that when evaluating a split as in (11) we are substantially evaluating the strength of the association between the response variable Y and the binary variable defined by grouping the categories of an explanatory variable. Hence, the ordinal Gini–Simpson criterion searches for the splitting variable (and the grouping of the categories of the splitting variable) predicting the response variable in the best possible way (according to the predictability index τ_O).
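The equivalence between the distance form (9) and the impurity-decrease form (11) is easy to check numerically. A minimal Python sketch; the conditional class distributions and the split proportion are illustrative values, not from the paper:

```python
import numpy as np

def i_o(p):
    """Ordinal Gini heterogeneity (Eq. 10): sum_g F(g)[1 - F(g)],
    where F is the cdf of the class distribution p."""
    F = np.cumsum(p)
    return float(np.sum(F * (1.0 - F)))

def c_go_distance(pL, p_left, p_right):
    """Eq. (9): p_L p_R sum_{g=1}^{G-1} [F_t(g|L) - F_t(g|R)]^2."""
    FL, FR = np.cumsum(p_left), np.cumsum(p_right)
    return float(pL * (1 - pL) * np.sum((FL[:-1] - FR[:-1]) ** 2))

def c_go_decrease(pL, p_left, p_right):
    """Eq. (11): I_O(t) - p_L I_O(t_L) - p_R I_O(t_R)."""
    p_parent = pL * p_left + (1 - pL) * p_right
    return i_o(p_parent) - pL * i_o(p_left) - (1 - pL) * i_o(p_right)

# illustrative conditional distributions over G = 4 ordered classes
p_left = np.array([0.6, 0.3, 0.1, 0.0])
p_right = np.array([0.1, 0.2, 0.3, 0.4])
print(np.isclose(c_go_distance(0.4, p_left, p_right),
                 c_go_decrease(0.4, p_left, p_right)))   # True
```

The two forms agree for any split, which is precisely what makes the impurity-decrease representation (11), and the fast search it supports, available for the ordinal criterion.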
Suppose now that two categories of X_I, i_1 and i_2, are grouped into a combined category i(12), and let X_{I−1} denote the resulting categorical variable. Of course, the number of sub-nodes defined on the basis of X_{I−1} will be (I − 1). It will be:

C_GO(X_{I−1}, t) = I_O(t) − Σ_{i ≠ i_1, i_2} p_i I_O(t_i) − p_{i(12)} I_O(t_{i(12)}),  (14)

and

C_GO(X_I, t) − C_GO(X_{I−1}, t) = p_{i(12)} I_O(t_{i(12)}) − p_{i_1} I_O(t_{i_1}) − p_{i_2} I_O(t_{i_2}).  (15)

Recalling (10) and observing that:

p_{i(12)} = p_{i_1} + p_{i_2},
F_{t_{i(12)}}(g) = [p_{i_1} F_{t_{i_1}}(g) + p_{i_2} F_{t_{i_2}}(g)] / p_{i(12)},

it can be easily shown that:

C_GO(X_I, t) − C_GO(X_{I−1}, t) = (p_{i_1} p_{i_2} / p_{i(12)}) Σ_{g=1}^{G} [F_{t_{i_1}}(g) − F_{t_{i_2}}(g)]² ≥ 0.  (16)

So, the main result here is that if two categories of X_I are collapsed, the heterogeneity of the new partition is greater than the heterogeneity of the partition induced by the I categories of X_I. This leads, by induction, to the following statement: if s_X is a split defined on the basis of the explanatory variable X, then:

C_GO(X, t) ≥ C_GO(s_X, t).  (17)

We now recall that to select the best split for a node t we have (1) to take into account all the explanatory variables, and (2) for each explanatory variable, to try all the possible splits defined on the basis of its categories. We then select the best split for each variable, and finally the best split among all these best splits. Property (17) of the ordinal Gini–Simpson criterion can be used to find the best split of a node without trying out all possible splits. Consider in fact a node t and two explanatory variables, say X and Z, and suppose that C_GO(X, t) ≥ C_GO(Z, t). Let s*_X and s*_Z be the best splits for node t which can be defined on the basis of X and Z, respectively. If C_GO(s*_X, t) ≥ C_GO(Z, t), then C_GO(s*_X, t) ≥ C_GO(s*_Z, t), so that it is not necessary to evaluate the splits which can be defined on the basis of Z. Hence the algorithm proceeds as follows.
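In code, the search that property (17) makes possible can be sketched as follows. This is a minimal sketch, not the Mola–Siciliano implementation: the (categories × classes) count matrices and function names are our illustrative choices, empty sub-nodes are assumed away, and complementary binary splits are enumerated twice for simplicity.

```python
import numpy as np
from itertools import combinations

def i_o(counts):
    """Ordinal Gini heterogeneity (Eq. 10) of a class-count vector."""
    F = np.cumsum(counts / counts.sum())
    return float(np.sum(F * (1.0 - F)))

def c_go_partition(counts):
    """Eq. (12): criterion for the multiway split into all I categories.
    counts: (I, G) array, row i = class counts in sub-node t_i."""
    n = counts.sum()
    return i_o(counts.sum(axis=0)) - sum(
        (row.sum() / n) * i_o(row) for row in counts if row.sum() > 0)

def c_go_binary(counts, left_idx):
    """Eq. (11) for the binary split grouping categories in left_idx."""
    left = counts[list(left_idx)].sum(axis=0)
    right = counts.sum(axis=0) - left
    pL = left.sum() / counts.sum()
    return i_o(counts.sum(axis=0)) - pL * i_o(left) - (1 - pL) * i_o(right)

def best_split_fast(predictor_counts):
    """Evaluate predictors in decreasing order of the multiway criterion,
    which by (17) bounds every binary split of that predictor."""
    order = sorted(predictor_counts,
                   key=lambda q: -c_go_partition(predictor_counts[q]))
    best_val, best = -np.inf, None
    for q in order:
        counts = predictor_counts[q]
        if c_go_partition(counts) <= best_val:
            break                  # bound (17): no remaining predictor can win
        I = counts.shape[0]
        for r in range(1, I):
            for left in combinations(range(I), r):
                v = c_go_binary(counts, left)
                if v > best_val:
                    best_val, best = v, (q, left)
    return best, best_val
```

The early break is where the saving comes from: once the multiway bound of the next predictor falls below the best binary split found so far, none of its splits need be evaluated.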
At the first step, C_GO(X_q, t) is calculated for all the explanatory variables. Let (X_(1), X_(2), . . . , X_(q)) be the explanatory variables sorted in decreasing order of C_GO(X, t), so that X_(1) is the variable maximizing C_GO(X, t). All the possible splits defined on the basis of X_(1) are evaluated, and the best one, s*_X(1), is selected. Now C_GO(s*_X(1), t) is compared with the criterion evaluated for the second explanatory variable, C_GO(X_(2), t). If C_GO(s*_X(1), t) > C_GO(X_(2), t), then the algorithm can be stopped, since s*_X(1) is the best split. Otherwise, all the splits of X_(2) are evaluated, and the best one, s*_X(2), is selected. The best splits for X_(1) and X_(2) are compared, the better one is retained, and the procedure goes on by comparing the best current split with X_(3). The algorithm iterates until no explanatory variable remains whose value of the criterion exceeds the value characterizing the best current split. Notice that in this way attention can be limited to the most promising explanatory variables, while explanatory variables with a low value of the criterion (in particular, lower than the value characterizing the best current split) can be disregarded. The computational gain, in terms of the reduction of the number of splits inspected and of CPU time, is illustrated in Mola and Siciliano (1997, 1998).

6 Conclusions and directions of future research

The problem of building a classification tree in the nominal case has been given great attention in the literature. The original proposal by B & A has been deeply analyzed and criticized, and many methods to improve it have been introduced. The case when the response is ordinal has not received the same attention. In this paper, we illustrate why ordinal criteria should be used in the ordinal case. We analyze the ordinal twoing criterion, evidencing its characteristics and properties.
We also introduce possible alternative ordinal criteria, with the aim of enriching the class of ordinal splitting criteria. Actually, we are convinced that the availability of more criteria and the inspection of different trees can also prove very useful from an exploratory point of view, allowing insights into the data to be analyzed. The criteria are first theoretically compared and analyzed, mainly referring to some drawbacks illustrated by T & S for the Gini–Simpson criterion, namely the exclusivity preference property and the possible (excessive) influence of the aec_f on the tree-classifiers. As for this point, applications and simulations emphasize that in the ordinal case the problem is not as relevant as it is in the nominal case. With regard to the comparison (via simulations) between ordinal criteria, we can conclude that none of them appears preferable to the others. C_A often performs better in terms of mc, but is characterized by a trade-off between mr and mc. As for the "global" performance, criteria C_GO and C_TO appear more promising: at least in the simulations they often lead to trees with low mc without sacrificing too much in terms of mr, as compared to the results obtained by referring to C_GN. With respect to this consideration, it is interesting to point out that both C_GO and C_TO are based upon a measure of distance between the two cdfs in the sub-nodes induced by a split. As a direction of further research, we think it would be sensible to enrich the class of ordinal criteria by considering different measures of distance between the two cdfs (see, e.g., Shih 1999, for this kind of analysis in the nominal case). From another point of view, C_GO is based upon a particular measure of the heterogeneity of an ordinal variable; it would be interesting to consider criteria based on different measures of heterogeneity.

References

Agresti A (1990) Categorical data analysis. Wiley, New York
Agresti A (1981) Measures of nominal–ordinal association.
J Am Stat Assoc 76:524–529
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth, Belmont
Gini C (1954) Variabilità e concentrazione. Veschi, Roma
Mola F, Siciliano R (1997) A fast splitting procedure for classification trees. Stat Comput 7:209–216
Mola F, Siciliano R (1998) A general splitting criterion for classification trees. Metron 56:156–171
Piccarreta R (2001) A new measure of nominal–ordinal association. J Appl Stat 28:107–120
Shih Y-S (1999) Families of splitting criteria for classification trees. Stat Comput 9:309–315
Taylor PC, Silverman BW (1993) Block diagrams and splitting criteria for classification trees. Stat Comput 3:147–161

Journal of Classification (2020) 37:4–17
https://doi.org/10.1007/s00357-018-9302-x

Ordinal Forests

Roman Hornung
Published online: 22 January 2019
© Classification Society of North America 2019

Abstract The ordinal forest method is a random-forest-based prediction method for ordinal response variables. Ordinal forests allow prediction using both low-dimensional and high-dimensional covariate data and can additionally be used to rank covariates with respect to their importance for prediction. An extensive comparison study reveals that ordinal forests tend to outperform competitors in terms of prediction performance. Moreover, it is seen that the covariate importance measure currently used by ordinal forests discriminates influential covariates from noise covariates at least as well as the measures used by competitors. Several further important properties of the ordinal forest algorithm are studied in additional investigations. The rationale underlying ordinal forests, of using optimized score values in place of the class values of the ordinal response variable, is in principle applicable to any regression method for continuous outcomes beyond the random forests considered in the ordinal forest method.
Keywords Prediction · Ordinal response variable · Covariate importance ranking · Random forest

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s00357-018-9302-x) contains supplementary material, which is available to authorized users.

Roman Hornung, hornung@ibe.med.uni-muenchen.de, Institute for Medical Information Processing, Biometry and Epidemiology, University of Munich, Munich, 81377, Germany

1 Introduction

In statistical applications, it is sometimes of interest to predict the values of an ordinal response variable. However, to date, there are relatively few prediction methods for ordinal response variables that make use of the ordinal nature of such response variables; in particular, few such methods exist that are applicable to high-dimensional covariate data. Ordinal response variables are often treated as nominal variables, applying prediction techniques for binary response variables to all paired combinations of classes of the ordinal response variable. In this paper, the ordinal forest (OF) method is introduced: an innovative prediction method for ordinal response variables, applicable to both low-dimensional and high-dimensional covariate data, that makes use of the ordinal nature of the response variable. Analogously to the general class of regression models for ordinal data developed by McCullagh (1980), which includes the well-known ordered probit regression, the OF method is based on the notion of a latent continuous response variable underlying the observed ordinal response variable. Roughly speaking, with OFs the ordinal response variable is treated as a continuous variable, where the differing extents of the individual classes of the ordinal response variable are implicitly taken into account (see below for details). OFs are closely related to conventional random forests (Breiman 2001) for continuous outcomes, termed regression forests in the following.
In contrast to the latter, OFs make use of the out-of-bag (OOB) error estimates during the construction of the forest. A straightforward forest-based prediction method for ordinal response variables would consist of simply considering a regression forest using the class values 1, . . . , J of the response variable for the corresponding classes. However, this procedure is suboptimal, as will be demonstrated in this paper. An important reason for the suboptimality of this approach is the fact that the extents of the classes of the ordinal response variable, from now on denoted as class widths, differ from class to class. In the latent variable model already mentioned above that underlies OF, these class widths are the widths of J adjacent intervals in the range of the underlying continuous response variable; these J intervals correspond to the J classes of the ordinal response variable. The underlying refined continuous variable y* determines the values of the ordinal variable y. The relationship between y* and y is such that the higher the value of y* for an observation, the higher the class of the ordinal response variable for that observation. More precisely, if y* falls into the j-th of the J adjacent intervals, y will take the value j. As an illustration, school grades are usually determined by the total number of points the pupils score in all exercises composing the respective test. Here, the grade and the corresponding number of points take the roles of the ordinal variable y and the underlying continuous variable y*, respectively. If the continuous variable y* is known, it can be used in regression techniques for continuous response variables. In the context of conditional inference trees, for situations in which y* is known, Hothorn et al. (2006) suggest using, as a continuous response variable, the midpoints of the intervals of y* that correspond to the classes of y.
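The interval-based relationship between y* and y can be sketched as follows; a minimal illustration in which the function name and the boundary values are ours, not part of the OF method:

```python
import numpy as np

def ordinalize(y_star, inner_boundaries):
    """Map the latent continuous y* to ordinal classes 1..J, where the J
    adjacent intervals are delimited by the J - 1 inner boundaries."""
    return np.searchsorted(inner_boundaries, y_star, side="right") + 1

# J = 3 classes with illustrative cut points 0 and 1,
# i.e., intervals (-inf, 0), [0, 1), [1, inf)
y_star = np.array([-0.5, 0.5, 2.0])
print(ordinalize(y_star, [0.0, 1.0]))   # [1 2 3]
```

In the school-grade illustration, `y_star` would be the number of points scored and the inner boundaries the grade thresholds; OF addresses the case where neither is observed.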
The OF method is, however, designed for the common situation in which the underlying continuous variable was not measured or might not even be known. In OF, interval boundaries in y* corresponding to the different classes of y are estimated, or rather optimized, by maximizing the OOB prediction performance of regression forests. Using score values that correspond to these optimized class intervals instead of the class values 1, . . . , J leads to an improvement in prediction performance, as will be seen in the analyses presented in this paper. Note, however, that apart from considering optimized score values for the class values, choosing arbitrary score values for the class values does not necessarily impact the prediction accuracy notably. This is also suggested by the results of a study performed by Janitza et al. (2016), who considered regression forests with conditional inference trees as base learners, using the class values 1, . . . , J for the classes of the ordinal response variable. To study the robustness of their results, they considered the score values 1, 2², . . . , J² in addition to the values 1, 2, . . . , J and found no differing prediction accuracies between these two choices. Note that conditional inference trees (Hothorn et al. 2006) are preferable over classical classification and regression trees (Breiman et al. 1984) in the presence of categorical covariates, because in this situation only the former allow unbiased variable selection for splitting. However, for high-dimensional data, using conditional inference trees in regression forests is computationally overly demanding due to the large number of permutation tests that need to be conducted with this approach. High-dimensional data has, however, become a very common field of application of random forests, in particular in the biomedical field.
Therefore, in this paper, classical regression forests are considered, which use regression trees (Breiman et al. 1984). Although the prediction performance is increased by using score values that correspond to class intervals obtained via maximizing the OOB prediction performance instead of using the values 1, . . . , J, the widths of the estimated intervals are not useful for interpretation purposes: they carry no useful information on the actual class widths, as will become evident in the analyses presented in this paper.

The paper is structured as follows. In Section 2, the OF algorithm is described, together with how OF can be used for prediction and for ranking the importances of the covariates. Subsequently, using five real datasets and simulated data, in Section 3 the performance of OF with respect to prediction accuracy and the quality of its variable importance measure is extensively compared to that of other (forest-based) approaches. Moreover, several important properties of the OF algorithm are investigated empirically using simulated data in this section. In the discussion (Section 4), the most important findings of the paper are summarized, further related points are discussed, and possibilities for future research are described.

2 Methods

2.1 Construction of OF Prediction Rules

In Section 2.1.1, first, to give an initial overview, the algorithm used for constructing an OF prediction rule is described in simplified form, and some aspects of specific steps of this algorithm are discussed. The algorithm is subsequently described in full detail in Section 2.1.2.

2.1.1 Sketch of the OF Algorithm

The OF algorithm consists of the following two main steps:

1. Optimization of the score set: As described in the Introduction, ordinal forests are regression forests in which the class values are replaced by score values that are optimized with the aim of maximizing the (OOB) prediction performance. The first step in the optimization of the score set {s1, . . .
, sJ } is performed as follows: First, repeatedly and randomly generate a candidate score set {sb,1 , . . . , sb,J }; second, construct an OF as a regression forest using {sb,1 , . . . , sb,J } for the class values of the target variable; and lastly, measure the OOB prediction performance according to a specific measure, called the performance function. Journal of Classi?cation (2020) 37:4–17 2. 7 In the second step, the final score set is calculated as a summary of the score sets that featured the highest OOB prediction performance in the first step. Construction of the OF as a regression forest: Using {s1 , . . . , sJ } for the class values of target variable, construct an ordinal forest as a regression forest. In the first step of the algorithm, it is important that the collection of tried score sets is large and as heterogeneous as possible. This ensures that the best of the tried score sets are close to the optimal score set that is associated with optimal OOB prediction performance. The heterogeneity of the collection of score sets is ensured by a specific algorithm described in Section 2.1.2. As described above, the final score set is obtained as a summary of those score sets considered in the optimization that featured the highest OOB prediction performance. In principle, it would be possible to use the score set considered in the optimization which was associated with the highest OOB prediction performance as the final score set. However, the OOB prediction performance of the corresponding OF constructed in the optimization may be high merely by chance. However, when summarizing several score sets with the highest OOB prediction performance estimates, it is unlikely that these estimates are consistently high merely due to chance. The choice of the form of the performance function (Section 2.1.2) used in the optimization depends on the kind of performance the ordinal forest should feature. 
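The two main steps above can be sketched as follows. This is a minimal illustration, not the ordinalForest implementation; `evaluate_oob` is a hypothetical callback standing in for growing a small regression forest on the scores derived from a candidate division and returning its OOB prediction performance:

```python
def optimize_score_division(candidate_divisions, evaluate_oob, n_best=10):
    """Sketch of step 1 of the OF algorithm: rank candidate divisions of
    [0, 1] by their OOB prediction performance and summarize the best ones.

    `evaluate_oob(d)` is a hypothetical stand-in for growing a regression
    forest on the scores derived from division d and returning its OOB
    performance (higher is better)."""
    ranked = sorted(candidate_divisions, key=evaluate_oob, reverse=True)
    best = ranked[:n_best]
    # final division: element-wise average over the best-performing divisions
    return [sum(d[k] for d in best) / len(best) for k in range(len(best[0]))]
```

Averaging the `n_best` top divisions, rather than taking the single best one, corresponds to the summarization step motivated above.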
For example, in many situations it is of interest to classify observations from each class with the same accuracy, independent of class size. In other situations, the main goal is to correctly classify as many observations as possible; the latter goal implies prioritizing larger classes at the expense of a lower classification accuracy for smaller classes. Sometimes, specific classes are of particular importance, which should then be prioritized by the OF algorithm.

2.1.2 Detailed Description of the OF Algorithm

Assume a sample {(x_1, y_1), ..., (x_n, y_n)}, where x_i (i ∈ {1, ..., n}) designates the vector of covariates of observation i and y_i ∈ {1, ..., J} the class value of the ordinal response variable for that observation. Then, an OF prediction rule is constructed as follows:

1. As described in Section 2.1, the collection of score sets tried during the optimization should be as heterogeneous as possible. As seen below, each tried score set consists of the J (transformed) midpoints of the J adjacent intervals that form a division of the interval [0, 1]. The heterogeneity of the collection of tried score sets is achieved by making the corresponding collection of divisions of [0, 1] into J adjacent intervals heterogeneous. The following algorithm is used for obtaining this heterogeneous collection of B_sets divisions {d_{b,1}, ..., d_{b,J+1}} (b ∈ {1, ..., B_sets}) of [0, 1]:

(a) In this step, the rankings of the class widths for each of the B_sets sets are generated. If J! < B_sets, this is performed as follows:

i. Generate all J! possible permutations of 1, ..., J and permute this set of permutations randomly. The result of the latter step is the rankings {r_{b,1}, ..., r_{b,J}}, b = 1, ..., J!, for the first J! sets.

ii. Copy each of the rankings {r_{b,1}, ..., r_{b,J}}, b = 1, ..., J!, a total of ⌊B_sets/J!⌋ − 1 times to produce the rankings {r_{b,1}, ..., r_{b,J}}, b = J! + 1, ..., J!⌊B_sets/J!⌋, for the next J!(⌊B_sets/J!⌋ − 1) sets.

iii. The last rankings {r_{b,1}, ..., r_{b,J}}, b = J!⌊B_sets/J!⌋ + 1, ..., B_sets, are produced through random sampling from the first J! rankings {r_{b,1}, ..., r_{b,J}}, b = 1, ..., J!.

If J! ≥ B_sets, the generation of the rankings of the class widths for each set is performed as follows:

i. Permute 1, ..., J randomly once to produce the ranking {r_{1,1}, ..., r_{1,J}} for the first set.

ii. For b = 2, ..., B_sets: Draw N_perm (e.g., N_perm = 500) random permutations of 1, ..., J, denoted as {r*_{l,1}, ..., r*_{l,J}}, l = 1, ..., N_perm. Determine the permutation among {r*_{l,1}, ..., r*_{l,J}}, l = 1, ..., N_perm, that features the greatest quadratic distance to {r_{b−1,1}, ..., r_{b−1,J}}, that is, the maximizer over l of Σ_{j=1}^{J} (r*_{l,j} − r_{b−1,j})², and use this permutation as the ranking for the b-th set (see Appendix A.8.1 in the Supplementary Online Materials for a discussion of the influence of the value of N_perm).

(b) In this step, the divisions {d_{b,1}, ..., d_{b,J+1}} for all iterations b = 1, ..., B_sets are generated. For b = 1, ..., B_sets:

i. Draw J − 1 instances of a U(0, 1) distributed random variable and sort the resulting values. The sorted values are designated as d*_{b,2}, ..., d*_{b,J}. Moreover, set d*_{b,1} := 0 and d*_{b,J+1} := 1.

ii. Re-order the intervals of the [0, 1] partition {d*_{b,1}, ..., d*_{b,J+1}} in such a way that the j-th interval, j = 1, ..., J, is that interval out of ]d*_{b,1}, d*_{b,2}], ..., ]d*_{b,J}, d*_{b,J+1}] that features the r_{b,j}-th smallest width, and use this re-ordered partition as the b-th division {d_{b,1}, ..., d_{b,J+1}}.

2. For b = 1, ..., B_sets (e.g., B_sets = 1000):

(a) Form a continuous response variable z_b := z_{b,1}, ..., z_{b,n} by replacing each class value j, j = 1, ..., J, in the ordinal response variable y := y_1, ..., y_n by the j-th value in the score set s_b := {s_{b,1}, ..., s_{b,J}}, where s_{b,j} := Φ⁻¹(c_{b,j}), c_{b,j} := (d_{b,j} + d_{b,j+1})/2 (j ∈ {1, ..., J}), and Φ⁻¹ denotes the quantile function of the standard normal distribution.

(b) Grow a regression forest f_{s_b} with B_ntreeprior trees (e.g., B_ntreeprior = 100) using z_b as response variable.

(c) Obtain OOB predictions ẑ_{b,1}, ..., ẑ_{b,n} of z_{b,1}, ..., z_{b,n}.

(d) Obtain OOB predictions of y_1, ..., y_n as follows: ŷ_{b,i} := j if ẑ_{b,i} ∈ ]Φ⁻¹(d_{b,j}), Φ⁻¹(d_{b,j+1})] (i ∈ {1, ..., n}).

(e) Assign a performance score sc_b := g(y, ŷ_b) to f_{s_b}, where ŷ_b := ŷ_{b,1}, ..., ŷ_{b,n} and g is the performance function (see further down for details).

3. Let S_best be the set of indices of the B_bestsets (e.g., B_bestsets = 10) regression forests constructed in step 2 that feature the largest sc_b values. Then, for each j ∈ {1, ..., J + 1}, take the average of those d_{b,j} values for which b ∈ S_best, resulting in a set of J + 1 values denoted as d_1, ..., d_{J+1}.

4. Form a new continuous response variable z := z_1, ..., z_n by replacing each class value j, j = 1, ..., J, in the ordinal response variable y = y_1, ..., y_n by the j-th value in the optimized score set {s_1, ..., s_J}, where s_j := Φ⁻¹(c_j) and c_j := (d_j + d_{j+1})/2 (j ∈ {1, ..., J}).

5. Grow a regression forest f_final with B_ntree trees (e.g., B_ntree = 5000) using z as response variable.

Three different variants of the performance function g (see below) are provided in the R package ordinalForest (see Section 4); the user chooses the most suitable variant depending on the particular kind of performance the OF should feature. The performance function g has the following general form:

g(y, ŷ) := Σ_{j=1}^{J} w_j Yind(y, ŷ, j),   (1)

where Σ_j w_j = 1, Yind(y, ŷ, j) := sens(y, ŷ, j) + spec(y, ŷ, j) − 1,

sens(y, ŷ, j) := #{y_i = j ∧ ŷ_i = j : i ∈ {1, ...
, n}} / #{y_i = j : i ∈ {1, ..., n}}, and

spec(y, ŷ, j) := #{y_i ≠ j ∧ ŷ_i ≠ j : i ∈ {1, ..., n}} / #{y_i ≠ j : i ∈ {1, ..., n}},

where '#' denotes the cardinality and ŷ := {ŷ_1, ..., ŷ_n} represents an estimate of y. As is apparent from the above definitions, Yind(y, ŷ, j) denotes Youden's index calculated with respect to class j. The higher the weight w_j assigned to class j, the stronger the performance of the OF with respect to distinguishing observations in class j from observations not in class j will tend to be. In the following, three important special cases of g are presented that result from specific choices of w_1, ..., w_J:

• If observations from each class should be classified with the same accuracy, g should be specified as follows:

g_clequal(y, ŷ) := (1/J) Σ_{j=1}^{J} Yind(y, ŷ, j)   (2)

• If as many observations as possible should be classified correctly, the weights of the classes should be proportional to their sizes, leading to the following choice for g:

g_clprop(y, ŷ) := Σ_{j=1}^{J} (#{y_i = j : i ∈ {1, ..., n}} / n) Yind(y, ŷ, j)   (3)

Note that the prioritization of larger classes resulting from the use of g_clprop leads to a decreased classification performance for smaller classes.

• If it is merely relevant that observations in class j can be distinguished as reliably as possible from observations not in class j, g should be specified as follows:

g_clj(y, ŷ) := Yind(y, ŷ, j)   (i.e., w_j = 1)   (4)

2.2 Prediction Using OF

A prediction of the value of the response variable of an independent observation i* based on its covariate vector x_{i*} is obtained as follows:

1. For b = 1, ..., B_ntree:

(a) Apply the b-th tree in f_final to observation i* and obtain a prediction ẑ_{i*,b}.

(b) Obtain a class prediction from the b-th tree: ŷ_{i*,b} := j if ẑ_{i*,b} ∈ ]Φ⁻¹(d_j), Φ⁻¹(d_{j+1})].

2. Obtain a final class prediction from ŷ_{i*,1}, ..., ŷ_{i*,B_ntree} by plurality voting (i.e., use the class value that is predicted most often by the trees).
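The numerical building blocks used above can be sketched in a few self-contained functions: generating a division of [0, 1] whose interval widths follow a given ranking, turning a division into score values via Φ⁻¹, and mapping a regression prediction back to a class. This is an illustrative sketch, not code from the ordinalForest package; the function names are hypothetical:

```python
import random
from bisect import bisect_left
from statistics import NormalDist

PHI_INV = NormalDist().inv_cdf  # quantile function of the standard normal

def division_for_ranking(ranking, rng):
    """Step 1(b): draw J-1 sorted U(0,1) cut points and re-order the J
    interval widths so that interval j has the ranking[j]-th smallest width."""
    J = len(ranking)
    cuts = [0.0] + sorted(rng.random() for _ in range(J - 1)) + [1.0]
    widths = sorted(cuts[k + 1] - cuts[k] for k in range(J))
    d = [0.0]
    for r in ranking:               # ranking is a permutation of 1..J
        d.append(d[-1] + widths[r - 1])
    d[-1] = 1.0                     # guard against floating-point drift
    return d

def scores_from_division(d):
    """Step 2(a): score s_j = PHI_INV((d_j + d_{j+1}) / 2)."""
    return [PHI_INV((d[j] + d[j + 1]) / 2) for j in range(len(d) - 1)]

def class_from_prediction(z_hat, d):
    """Step 2(d) / prediction: the class j with z_hat in
    ]PHI_INV(d_j), PHI_INV(d_{j+1})]."""
    eps = 1e-12                     # clamp 0 and 1, whose quantiles are +-inf
    bounds = [PHI_INV(min(max(p, eps), 1 - eps)) for p in d]
    return bisect_left(bounds, z_hat, 1, len(bounds) - 1)
```

For example, for the division {0, 0.25, 0.75, 1}, the midpoint of the middle interval is 0.5, so its score is Φ⁻¹(0.5) = 0, and any regression prediction between Φ⁻¹(0.25) and Φ⁻¹(0.75) is mapped to class 2.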
2.3 Variable Importance Measure of OF

The variable importance measure (VIM) of OF for covariate j is given as:

VI_j := (1/B_ntree) Σ_{b=1}^{B_ntree} [Err(y_{OOB,b,j}, ỹ_{OOB,b,j}) − Err(y_{OOB,b,j}, ŷ_{OOB,b,j})],   (5)

where

• y_{OOB,b,j} denotes the vector of class values of the OOB data of tree b from the OF f_final,
• ỹ_{OOB,b,j} denotes the predictions of the class values of the OOB data of tree b from f_final obtained after randomly permuting the values of covariate j in the OOB data of tree b,
• ŷ_{OOB,b,j} denotes the predictions of the class values of the OOB data of tree b from f_final without permuting the values of covariate j, and
• Err({a_1, ..., a_M}, {b_1, ..., b_M}) := (1/M) Σ_{m=1}^{M} I(a_m ≠ b_m), that is, the misclassification error is (currently) used as error function in this permutation variable importance measure.

3 Empirical Studies

A real data analysis using five datasets and an extensive simulation study were performed in order to compare the performance of OF to that of competing (tree-based) methods for ordinal regression. Moreover, further specific properties of the OF algorithm were studied using the simulated data, for example, the appropriateness of the default values of the hyperparameters or the influence of the class distribution on the performance. For the sake of completeness, it is pointed out that the OF algorithm was developed prior to seeing the five datasets used in the real data analysis and prior to setting up the simulation design. The comparison of OF with the alternative methods was performed with respect to the following two aspects: aspect 1, prediction performance; aspect 2, quality of the variable importance ranking. The following methods were compared: OF, multi-class random forest (RF), and regression forests in which the class values 1, ..., J of the ordinal response variable are used as score values. The latter are referred to as naive OFs in the following.
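The permutation importance of equation (5) in Section 2.3 can be sketched generically as follows. This is a simplified illustration, not the forest-internal OOB computation of ordinalForest; `predict` is a hypothetical callback mapping covariate rows to class predictions:

```python
import random

def permutation_vim(predict, X, y, j, n_rep=10, seed=0):
    """Sketch of the permutation variable importance for covariate j:
    mean increase in misclassification error after randomly permuting
    column j of the (OOB) data. `predict` is a hypothetical callback
    mapping a list of covariate rows to a list of class predictions."""
    rng = random.Random(seed)

    def err(y_hat):
        return sum(a != b for a, b in zip(y, y_hat)) / len(y)

    base = err(predict(X))
    increase = 0.0
    for _ in range(n_rep):
        Xp = [row[:] for row in X]          # copy before permuting column j
        col = [row[j] for row in Xp]
        rng.shuffle(col)
        for row, v in zip(Xp, col):
            row[j] = v
        increase += err(predict(Xp)) - base
    return increase / n_rep
```

Permuting a covariate the prediction rule never uses leaves the error unchanged, so its importance is exactly zero, while permuting an influential covariate tends to increase the error.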
In the real data analysis, ordered probit regression was included as a fourth method. This method was, however, not suitable in the case of the simulation study, because the simulation design featured high numbers of covariates, and ordered probit regression is only suitable in situations in which the ratio between the number of covariates and the number of observations is small. In all analyses, the performance function g_clequal was used in the OF algorithm. This choice was made because in most applications the classification performance should not depend on the class sizes in the data. Moreover, the following values for the hyperparameters of OF were used: B_sets = 1000, B_bestsets = 10, B_ntreeprior = 100, B_ntree = 5000 (see Section 2.1), and N_perm = 500. All R code written to perform and evaluate the analyses presented in this paper and in Appendix A in the Supplementary Online Materials, as well as the datasets used in the real data analysis, are made available in Appendix B in the Supplementary Online Materials.

3.1 Real Data Analysis

3.1.1 Data

In this analysis, five real datasets were used that were also recently considered in Janitza et al. (2016). That paper compared multi-class RF with naive OF, both, however, using conditional inference trees as base learners. Table 1 gives an overview of the five datasets. For details on the backgrounds of the datasets, see Janitza et al. (2016).

3.1.2 Study Design

Contrary to the case of simulated data, the effect sizes of the covariates are not known for real data. Therefore, in the real data analysis, the ordinal regression methods can only be compared with respect to their prediction performance, but not with respect to the quality of the variable importance ranking (obtainable only from the forest-based approaches).
In order to avoid the overoptimism that results from re-using for the assessment of prediction performance the same data that was already used for learning the corresponding prediction rule, 10-fold stratified cross-validation was used. First, the values of the ordinal response variable of the left-out fold were predicted in each iteration of the cross-validation. Second, after the predictions of the values of the ordinal response variable had been obtained for all left-out folds, that is, for all observations, the quality of these predictions was measured using a performance measure. This process was repeated 10 times to obtain more reliable results. Three performance measures were used to assess the quality of the predictions: weighted Kappa using quadratic weights, weighted Kappa using linear weights (Cohen 1968), and Cohen's Kappa (Cohen 1960). The weighted Kappa is given by:

κ_w := (Σ_{i=1}^{J} Σ_{j=1}^{J} w_ij p_oij − Σ_{i=1}^{J} Σ_{j=1}^{J} w_ij p_cij) / (1 − Σ_{i=1}^{J} Σ_{j=1}^{J} w_ij p_cij),   (6)

where p_oij is the observed proportion of cases for which class i is true and class j is predicted, p_cij is the proportion of cases for which class i is true and class j is predicted that is expected just by chance, and the w_ij are so-called agreement weights. For the linearly weighted Kappa, the agreement weights are given by w_ij = 1 − |i − j|/(J − 1), for the quadratically weighted Kappa by w_ij = 1 − |i − j|²/(J − 1)², and for Cohen's Kappa by w_ij = 1 if i = j and w_ij = 0 if i ≠ j.
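As an illustration of equation (6) (a sketch, not the implementation used in the paper's analyses), the weighted Kappa can be computed as follows, with classes coded 1..J:

```python
def weighted_kappa(y_true, y_pred, J, weights="linear"):
    """Weighted Kappa, eq. (6): `weights` selects 'linear' or 'quadratic'
    agreement weights, or 'cohen' for Cohen's Kappa."""
    n = len(y_true)
    # observed proportions p_o[i][j]: class i true, class j predicted
    po = [[0.0] * J for _ in range(J)]
    for t, p in zip(y_true, y_pred):
        po[t - 1][p - 1] += 1.0 / n
    row = [sum(po[i]) for i in range(J)]                       # true marginals
    col = [sum(po[i][j] for i in range(J)) for j in range(J)]  # predicted
    # chance-expected proportions p_c[i][j] from the marginals
    pc = [[row[i] * col[j] for j in range(J)] for i in range(J)]

    def w(i, j):
        if weights == "linear":
            return 1.0 - abs(i - j) / (J - 1)
        if weights == "quadratic":
            return 1.0 - (i - j) ** 2 / (J - 1) ** 2
        return 1.0 if i == j else 0.0                          # Cohen's Kappa

    num = sum(w(i, j) * (po[i][j] - pc[i][j]) for i in range(J) for j in range(J))
    den = 1.0 - sum(w(i, j) * pc[i][j] for i in range(J) for j in range(J))
    return num / den
```

Perfect agreement yields the value 1 for all three weighting schemes, while predictions that are off by only one class still receive partial credit under linear and quadratic weights but none under Cohen's Kappa.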
Table 1 Overview of the datasets used in the real data analysis

Dataset      | Sample size | No. of classes | No. of covariates | Response variable with class sizes
mammography  | 412         | 3              | 5                 | Last mammography visit: 1 – never (n = 234); 2 – within a year (n = 104); 3 – over a year (n = 74)
nhanes       | 1914        | 5              | 26                | Self-reported health status: 1 – excellent (n = 198); 2 – very good (n = 565); 3 – good (n = 722); 4 – fair (n = 346); 5 – poor (n = 83)
supportstudy | 798         | 5              | 15                | Functional disability: 1 – patient lived 2 months, and from an interview (taking place 2 months after study entry) there were no signs of moderate to severe functional disability (n = 310); 2 – patient was unable to do 4 or more activities of daily living 2 months after study entry; if the patient was not interviewed but the patient's surrogate was, the cutoff for disability was 5 or more activities (n = 104); 3 – Sickness Impact Profile total score is at least 30 2 months after study entry (n = 57); 4 – patient intubated or in coma 2 months after study entry (n = 7); 5 – patient died before 2 months after study entry (n = 320)
vlbw         | 218         | 9              | 10                | Apgar score: 1 (life-threatening) (n = 33); 2 (n = 16); 3 (n = 19); 4 (n = 15); 5 (n = 25); 6 (n = 27); 7 (n = 35); 8 (n = 36); 9 (optimal physical condition) (n = 12)
winequality  | 4893        | 6              | 11                | Wine quality score: 3 (moderate quality) (n = 20); 4 (n = 163); 5 (n = 1457); 6 (n = 2198); 7 (n = 880); 8 (high quality) (n = 175)

The weighted Kappa is a metric well suited for measuring the quality of predictions of ordinal data. This is because it allows one to also consider the benefit of predictions that are close to the true class values on the ordinal scale, instead of considering only predictions equal to the true values as valuable (Ben-David 2008).
Quadratic and linear weights are the most commonly used weighting schemes for the weighted Kappa in practice. Compared to the case of linear weights, when using quadratic weights more benefit is attributed to predictions that are further away from the true class values. At the same time, when using quadratic weights, less benefit is attributed to predictions very close to or equal to the true class values than in the case of linear weights. By contrast, in the case of Cohen's Kappa, benefit is attributed only to predictions that are equal to the true class values. Thus, when using Cohen's Kappa, predictions that are not equal to the true values are attributed no benefit, regardless of how close these predictions are to the true values. To summarize: when using the weighted Kappa with quadratic weights, similar benefit is attributed to predictions that are equal, similar, or only roughly similar to the true class values; when using Cohen's Kappa, benefit is attributed only to predictions that are equal to the true class values; and the weighted Kappa with linear weights poses a compromise between the former two metrics. The following is an illustration of the behavior of the different metrics: a prediction rule that in many cases returns accurate predictions, but in other cases results in predictions that are far from the true value, would be associated with a relatively high value of Cohen's Kappa but a relatively low value of the weighted Kappa. Note that Cohen's Kappa and the weighted Kappa depend on the class sizes and the number of classes (Jakobsson and Westergren 2005). This does, however, not pose a problem in the analyses performed in this paper, because the different methods are compared with each other for a given dataset (or a given simulation setting in the case of the simulation study, see Section 3.2). Weighted Kappa and Cohen's Kappa take values between 0 and 1, where higher values indicate a better performance in terms of the respective metric.
3.1.3 Results

Figure 1 shows the values of the linearly weighted Kappa obtained for each of the five datasets. For the sake of clarity, and because the linearly weighted Kappa poses a compromise between the quadratically weighted Kappa and Cohen's Kappa, the values obtained for the latter two metrics are presented in Appendix A in the Supplementary Online Materials (Supplementary Figs. 1 and 2). Correspondingly, the description of the results will first focus on the linearly weighted Kappa; subsequently, important differences in the results obtained for the quadratically weighted Kappa and Cohen's Kappa will be discussed briefly. For the sake of brevity, "linearly weighted Kappa" will occasionally be shortened to "Kappa" in cases where there is no chance of confusion.

For two of the five datasets, OF performs notably better than naive OF in terms of the linearly weighted Kappa. For the remaining three datasets, the values are similar between these two methods. Nevertheless, the means over the 10 cross-validation iterations are higher for OF than for naive OF for all five datasets. OF is also better than multi-class RF for those two datasets for which OF is clearly better than naive OF. For the other three datasets, the means over the cross-validation iterations are similar between OF and multi-class RF. For the dataset supportstudy, the means over the cross-validation iterations are virtually identical between OF and multi-class RF (OF, 0.40450; multi-class RF, 0.40454), and for the datasets vlbw and winequality, the means are slightly higher for OF.
Ordered probit regression performs very similarly to OF for two datasets, better than OF for one dataset, and much worse than OF and the other methods for the remaining two datasets.

[Fig. 1 Values of linearly weighted Kappa for each of the five datasets (mammography, nhanes, supportstudy, vlbw, winequality) and each of the four methods considered (OF, ordered probit regression, multi-class RF, naive OF); each boxplot shows the values obtained for the individual repetitions of the 10-fold stratified cross-validation]

For the latter two datasets, there were convergence problems in the case of ordered probit regression. Moreover, one cross-validation iteration resulted in an error for one of these datasets (winequality), which is why for this dataset, in the case of ordered probit regression, the results of only nine instead of 10 repetitions of the 10-fold stratified cross-validation are available.

The results are very similar for the quadratically weighted Kappa (Supplementary Fig. 1). While the Kappa values are generally higher for quadratic weights than for linear weights, the improvement of OF over multi-class RF tends to be stronger for the quadratically weighted Kappa. This observation might be interpreted as follows: in contrast to multi-class RF, OF takes the ordinal nature of the response variable into account. Therefore, OF can be expected to deliver predictions that are far from the true class value on the ordinal scale less often than multi-class RF does. In the case of Cohen's Kappa (Supplementary Fig.
2), for which, as mentioned above, benefit is attributed only to predictions that are equal to the true class values, OF performs less well in comparison to multi-class RF: for two datasets, OF performs slightly better than multi-class RF, and for three datasets, OF performs slightly worse. Thus, while OF can be expected to deliver predictions that are close to the true class values more often than multi-class RF, OF might at the same time be less performant with respect to predicting the exact values of the true classes in comparison to multi-class RF.

3.2 Simulation Study

The real data analysis had the aim of comparing OF to alternatives with respect to prediction performance. In terms of the linearly weighted Kappa, OF outperformed multi-class RF for some datasets, while for other datasets the two methods performed comparably well. Using simulated data, it was possible to study various further properties of the OF algorithm. The detailed aims and the design of the main simulation and of several additional simulation analyses are presented in the Supplementary Online Materials (Appendix A.2). Detailed results and discussions of the latter are also provided in the Supplementary Online Materials. In the following, the main findings of the simulation study are presented and important points related to these results are discussed. For details, see the corresponding sections of the Supplementary Online Materials.

OF outperformed both naive OF and multi-class RF in the majority of settings (Appendix A.3). The improvement of OF over naive OF was stronger for settings in which there were large classes around the center of the class value range and the low and the high classes tended to be small (Appendix A.6). The variable importance measures of both OF and naive OF outperformed that of multi-class RF with respect to their ability to discern influential covariates from non-influential noise covariates (Appendix A.4).
However, the variable importance measures of OF and naive OF performed comparably well. The variable importance measure of OF currently uses the misclassification error as an error measure (see Section 2.3). Considering an error measure that is based on the respective performance function used in each case might lead to an improved variable importance measure. In extensive analyses (Appendix A.5), it was revealed that the estimated class widths cannot be used in any way for making insightful inference on the magnitudes of the actual class widths (relative to each other). All three variants of the performance functions (Section 2.1.2) were indeed associated with the specific kinds of prediction performance they were intended for in an additional simulation analysis presented in Appendix A.7. Nevertheless, frequently, the differences between the results when applying the different performance functions were not large. In particular, gclprop was only slightly superior to gclequal in terms of the metric for which it should perform best from a theoretical point of view (for details see Appendix A.7). However, gclequal was clearly superior to gclprop in terms of the metric for which this performance function should perform best. Given that gclequal did not perform considerably worse than gclprop in any of the settings studied and was at the same time superior to gclprop in many of the settings, it is reasonable to recommend using gclequal as default performance function in situations in which it is not clear which specific kind of performance the OF should feature. The OF algorithm features several hyperparameters. Appendix A.8.1 provides a detailed heuristic discussion on the influences of these hyperparameters on the prediction performance of OF. 
In Appendix A.8.2, a simulation analysis is presented that assesses the appropriateness of the default hyperparameter values (used in the R package ordinalForest) and the robustness of the OF prediction performance with respect to the choices of the hyperparameter values. This analysis suggests that the chosen default values are indeed in a reasonable range and that OF is quite robust to the choices of the values of these hyperparameters, which is why it should not be necessary to optimize them in most cases. Similarly, in a work by Probst et al. (2018), it is seen that the random forest algorithm is quite robust to the choice of the values of its hyperparameters: using a large collection of open-source datasets, Probst et al. (2018) show that for random forests there is quite little improvement to be gained by optimizing the values of the hyperparameters, in particular when compared to other machine learning approaches. Nevertheless, for ultra-high dimensional data, it might be necessary to choose a higher value for B_ntreeprior than the default value 100.

4 Discussion

The simulation study revealed that OF performs particularly well in comparison to naive OF if the middle classes are larger than the low and high classes. This pattern of the distribution of the class sizes can be expected to be common in practice: the low and high classes lie on the margins of the class value range, that is, at the extreme ends of the ordinal scale, which is why they tend to be represented by fewer observations than the classes in the middle.

The main concept of OF is to use optimized score values in place of the class values of the ordinal response variable, in the vein of the latent variable model that also underlies classical ordered probit regression. This concept is in principle applicable to any regression method for continuous outcomes.
The corresponding algorithm could be performed analogously to the OF algorithm, with the following differences: (1) in step 2(b) (Section 2.1.2), the respective regression method would be repeatedly fitted to bootstrap samples using z_b as response variable, and in step 2(c) the OOB predictions of this bootstrapped prediction rule would be calculated; (2) in step 5, the regression method would be fitted to the data using z as response variable. Note that, in contrast to the model class defined by McCullagh (1980), this algorithm is not confined to parametric approaches to ordinal regression. However, in the presence of high-dimensional covariate data, this algorithm would (currently) be too computationally intensive in general: the regression method has to be fitted very often, and fitting a regression method to high-dimensional data is computationally intensive in most cases. Regression forests, by contrast, can be constructed very quickly even in the presence of high-dimensional data using the R package ranger (Wright and Ziegler 2017), which is why the computational expense of the OF algorithm is generally reasonable also for such data. Ultra-high dimensional data could, nevertheless, pose a problem. Another problematic setting is datasets with very large numbers of observations, for which applying the OF algorithm with its default hyperparameter values could be difficult or even infeasible, because for such data the construction of the regression forests takes considerably longer. For data with a very large number of observations that, at the same time, feature low-dimensional covariate data, an easy way to reduce the computational burden considerably without notably affecting the precision is to choose a small value for B_ntreeprior (e.g., B_ntreeprior = 10).
The reason why the precision of the OF is not considerably reduced by choosing a small B_ntreeprior in this situation is that, in the case of a large number of observations, the individual trees in the regression forests are more precise. Thus, a smaller number of trees in the regression forests f_{s_b}, b = 1, ..., B_sets, is necessary to obtain reliable sc_b values.

A concept related to using optimized score values in place of the class values of the ordinal response variable is considered by Casalicchio et al. in a work in progress. They consider the following approach: (1) fit a regression model for continuous outcome to the data using the score values 1, ..., J for the values of the ordinal response variable; (2) for obtaining predictions of the class values of new observations, assign an observation to class j (j ∈ {1, ..., J}) if its predicted value ŷ is contained in the interval ]a_j, a_{j+1}], where a_1 := −∞, a_{J+1} := +∞, and the values a_2, ..., a_J are chosen so that they minimize the cross-validation error of the predictions of the class values. Casalicchio et al. plan to compare the prediction performances obtained using this approach for various regression methods for continuous outcome.

As seen in this paper, OF is a well-performing prediction method for ordinal response variables. In addition to predicting the values of the ordinal response variable, OF can be used to rank the covariates according to their importance for prediction. The estimated class widths resulting as a by-product of the OF algorithm do not, however, contain any useful information on the actual class widths. The OF algorithm is implemented in the R package ordinalForest, which is available on CRAN in version 2.2 (Hornung 2018).

Acknowledgments The author thanks Giuseppe Casalicchio for proofreading and comments and Jenny Lee for language corrections.
This work was supported by the German Science Foundation (DFG Einzelförderung BO3139/6-1 to Anne-Laure Boulesteix).

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

Ben-David, A. (2008). Comparison of classification accuracy using Cohen's weighted Kappa. Expert Systems with Applications, 34(2), 825–832.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984). Classification and regression trees. Monterey: Wadsworth International Group.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Cohen, J. (1968). Weighted Kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213–220.
Hornung, R. (2018). ordinalForest: Ordinal Forests: Prediction and Variable Ranking with Ordinal Target Variables. R package version 2.2.
Hothorn, T.,...
