
In the article this week, we focus on deciding whether the results of two different data mining algorithms provide significantly different information


In the article this week, we focus on deciding whether the results of two different data mining algorithms provide significantly different information. Therefore, answer the following questions:

  1. When using different data mining algorithms, why is it fundamentally important to understand why they are being used?
  2. If there are significant differences in the data output, how can this happen and why is it important to note the differences?
  3. Who should determine which algorithm is “right” and the one to keep? Why?

Data Min Knowl Disc (2012) 25:173–207, DOI 10.1007/s10618-012-0275-9

Comparing apples and oranges: measuring differences between exploratory data mining results

Nikolaj Tatti · Jilles Vreeken
Received: 27 October 2011 / Accepted: 5 June 2012 / Published online: 22 June 2012. © The Author(s) 2012.
Responsible editors: Dimitrios Gunopulos, Donato Malerba, Michalis Vazirgiannis. The research described in this paper builds upon and extends the work appearing in ECML PKDD'11 as Tatti and Vreeken (2011).
N. Tatti · J. Vreeken, Advanced Database Research and Modelling, Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium. E-mail: nikolaj.tatti@ua.ac.be, jilles.vreeken@ua.ac.be

Abstract  Deciding whether the results of two different mining algorithms provide significantly different information is an important, yet understudied, open problem in exploratory data mining. Whether the goal is to select the most informative result for analysis, or to decide which mining approach will most likely provide the most novel insight, it is essential that we can tell how different the information is that different results by possibly different methods provide. In this paper we take a first step towards comparing exploratory data mining results on binary data. We propose to meaningfully convert results into sets of noisy tiles, and compare between these sets by maximum entropy modelling and Kullback–Leibler divergence, well-founded notions from Information Theory. We so construct a measure that is highly flexible, and allows us to naturally include background knowledge, such that differences in results can be measured from the perspective of what a user already knows. Furthermore, adding to its interpretability, it coincides with Jaccard dissimilarity when we only consider exact tiles. Our approach provides a means to study and tell differences between results of different exploratory data mining methods. As an application, we show that our measure can also be used to identify which parts of results best redescribe other results. Furthermore, we study its use for iterative data mining, where one iteratively wants to find that result that will provide maximal novel information. Experimental evaluation shows our measure gives meaningful results, correctly identifies methods that are similar in nature, automatically provides sound redescriptions of results, and is highly applicable for iterative data mining.

Keywords  Exploratory data mining · Distance · Maximum entropy model · Iterative data mining · Tiling · Binary data

1 Introduction

Deciding whether the results of different mining algorithms provide significantly different information is an important, yet understudied, open problem in exploratory data mining. Whether we want to select the most promising result for analysis by an expert, or decide which mining approach we should apply next in order to most likely gain novel insight, we need to be able to tell how different the information is that is provided by different results, possibly from different methods. However, while the comparison of results is a well-studied topic in statistics, it has received much less attention in the knowledge discovery community. Clearly, any dataset only contains a limited amount of knowledge—which is the most that we can hope to discover from it. To extract this information, we have at our disposal an ever growing number of data mining algorithms.
However, when analysing data we have to keep in mind that most data mining results are complex, and their analysis and validation hence often requires considerable effort and/or cost. So, simply applying ‘all’ methods and having an expert analyse ‘all’ results is not a feasible approach to extract ‘all’ knowledge. Moreover, many of these results will be redundant, i.e. they will convey roughly the same information, and hence analysing ‘all’ obtained results will mostly require effort that will not provide extra insight. Instead, we would ideally just select that result for analysis that will provide us the most new knowledge. In order to be able to do this, two basic requirements have to be met. First of all, we need to be able to measure how different two results are from an information-providing perspective; if they essentially provide the same information, we could just select one for processing. Second, we should be able to include our background knowledge, such that we can gauge the amount of information included in a result compared to what we already know. Although an important practical problem, it has been surprisingly understudied in data mining. The main focus in exploratory data mining research has mostly been on developing techniques to discover structure, and not so much on how to compare between results of different methods. As a result there currently exist no general methods or theory to this end in the data mining literature. For tasks where a formal objective is available, it makes sense to straightforwardly use that objective for comparing fairly between the results of different methods. In classification, for instance, we can use accuracy, or AUC scores—as well as that we can measure how differently classifiers behave, for instance in ensembles (Kuncheva and Whitaker 2003). In fact, in learning tasks in general, there typically are clearly defined goals, and hence different methods can be compared on their performance on that particular task. 123 Comparing apples and oranges 175 For exploratory data mining, however, there is no formal common goal: the main goal is the exploration of the data at hand, and hence basically poses the rather general question ‘what can you tell me about my data?’. In fact, any type of data mining result that provides the analyst novel insight in the data, and the process that generated it, is potentially useful; regardless of whether the provided result of that method is, for example, a clustering, some subspace clusters, a collection of itemsets or correlated patterns, a decision tree, or even the identification of some statistical properties; as long as the analyst can obtain insight from that result, any data mining method can provide a valid (partial) answer to this question. Clearly, however, these methods do not share a common formal goal, and hence their results that are not easily comparable. The core of the problem is that comparing between methods is thus like comparing apples to oranges: a clustering is a different result than a set of itemsets, which are, in turn, different from a classifier, set of subgroups, etc. So, in order to make a sensible comparison, we need to find a common language. In this regard, the comparison of complex objects, e.g. of datasets (Tatti 2007; Vreeken et al. 2007), is related. Our setting, however, is more general, as now we do not want to compare between one type of complex object, but want to consider a very rich class of objects—a class potentially consisting of any data mining result. 
Arguably, some of the most general complex objects to compare between are probability distributions. Statistics and information theory provide us tools for measuring differences between distributions, such as Kullback–Leibler divergence (Cover and Thomas 2006). Data mining results, however, rarely are probability distributions, and if they are, not necessarily for the same random variable; making these tools unsuited for direct application for our goal. A simple yet important observation is that any mining result essentially identifies some properties of the dataset at hand. In fact, in an abstract way, we can identify all datasets for which these properties hold. This is an important notion, as it provides us with a way to compare between results of different methods: if two results provide the same information, they identify the same subspace of possible datasets, and the more different the information two results give, the smaller the overlap between the sets of possible datasets will be. As such, a straightforward approach for measuring similarity between results would be to simply count the number of possible datasets per result, and regard the relative number in the overlap—the larger this number, the more alike the information provided. However, the most basic situation of rectangles full of ones aside, counting databases is far from trivial. Instead of counting, we can view our problem more generally, and regard a result as implicitly defining a probability distribution over datasets. And hence, if we can model these distributions, we can use standard tools from statistics to compare between data mining results fair and square. In this paper, we propose to translate data mining results into probability distributions over datasets using the maximum entropy principle (Csiszár 1975). This allows us to uniquely identify that probabilistic model that makes optimal use of the information provided by a result, but is least biased otherwise. By subsequently measuring the Kullback–Leibler divergence between the obtained distributions, we can tell how different two results are from an information perspective. Furthermore, besides directly measuring the information content between data mining results, we can also 123 176 N. Tatti, J. Vreeken incorporate background knowledge into these models, and so score and compare informativeness of results from specific points of view. Finding these probability distributions, and conditioning them using data mining results, however, is far from trivial in general. Here, we give a proof of concept of this approach for binary data, for which the basics of maximum entropy modelling are available (De Bie 2011b). We show that many types of results of exploratory data mining approaches on binary data can easily and meaningfully be translated into sets of noisy tiles: combinations of rows and columns, for which we know the density of 1s. We show we can efficiently acquire the maximum entropy distribution over datasets given such sets of noisy tiles, and construct a flexible measure that uses KL divergence by which we can compare between results. To show that our measure works in practice, we compare between the results of ten different exploratory data mining methods, including (bi-)clusters, subspace clusters, sets of tiles, and sets of (frequent) itemsets. We further give a theoretic framework to mine for redescriptions. 
That is, given a (sub)set of noisy tiles from one result, we can identify the set of tiles from another result that best approximates the same information. Moreover, we show that our measure is highly effective for iterative data mining: given a (large) collection of results, we can efficiently discover which result provides the most novel information, and which hence is a likely candidate to be analysed next. Experiments show our measure works well in practice: dissimilarity converges to 0 when results approximate each other, methodologically close methods are correctly grouped together, sensible redescriptions for tile-sets are obtained, and meaningful orders of analysis are provided. In other words, we give an approach by which we can meaningfully mix apples and oranges, and compare between them fairly.

In this paper we build upon and extend the work published as Tatti and Vreeken (2011). Here, we discuss the theory, properties, and choices of our measure in closer detail, as well as the limitations of this proof-of-concept measure. Most importantly, we investigate the practical application of our measure both for mining redescriptions of (partial) results and for iterative data mining, giving algorithms for both these settings, and providing experimental validation that they provide meaningful results.

The road map of this paper is as follows. Next, in Sect. 2 we give the notation and preliminaries we use throughout the paper, and discuss how to convert results on binary data into sets of tiles in Sect. 3. Section 4 details how we can build a global model from a set of tiles, which we use in Sect. 5 to define a measure to compare such sets. In Sect. 6 we subsequently use this measure for redescribing sets of tiles, as well as to iteratively identify the most informative result. We discuss related work in Sect. 7, and we evaluate our measure empirically in Sect. 8. We round up with discussion and conclusions in Sects. 9 and 10. We give the proofs, such as for NP-completeness, in Appendix A.

2 Preliminaries

In this section, we define the preliminaries we will use in subsequent sections. A binary dataset D is a binary matrix of size n-by-m, consisting of n rows, or transactions, and m columns, or attributes. A row is simply a binary vector of size m. We assume that both the rows and columns of D have unique integer identifiers such that we can easily refer to them. We denote the (i, j)th entry of D by D(i, j). We denote the space of all n-by-m binary datasets by \mathcal{D}.

Given two distributions, p and q, defined over \mathcal{D}, we define entropy as

H(p) = -\sum_{D \in \mathcal{D}} p(D) \log p(D)

and Kullback–Leibler divergence as

KL(p \| q) = \sum_{D \in \mathcal{D}} p(D) \log \frac{p(D)}{q(D)}.

All logarithms are natural logarithms, to base e. We employ the usual convention that 0 \log 0 = 0.

3 Results on binary data as noisy tiles

As discussed in Sect. 1, in this paper we give a proof of concept for measuring differences between data mining results by first converting these results into maximum entropy distributions over datasets, and subsequently measuring the difference in information content between these distributions. Slightly more formally, the first step entails inferring a maximum entropy distribution for the provided background knowledge, the background knowledge in this case being the data mining result at hand.
To do so, we need theory on how to do this for the particular type of background knowledge— which is far from trivial in general, and hence there exists no theory for converting data mining results in general into (maximum entropy) probability distributions. Recently, however, De Bie (2011b) formalised the theory for inferring the maximum entropy distribution for a binary dataset given so-called noisy tiles as background knowledge. We employ this result for our proof of concept, and hence focus on measuring differences between data mining results on binary data. While certainly not all data in the world is binary, surprisingly many (exploratory) data mining methods are aimed at extracting significant structure from 0 to 1 data. First of all, there are approaches that identify whether data exhibits particular structure. As an example for binary data, the most simple example is density: what is the relative number of 1s of the data. A more intricate type of structure we can identify in this type of data is bandedness (Garriga et al. 2011): the property that the rows and columns of the data can be permutated such that the non-zero entries exhibit a staircase pattern of overlapping rows. Nestedness (Mannila and Terzi 2007), is the property that for every pair of rows of a 0–1 dataset one row is either a superset or subset of the other. Perhaps the most well-known example for analysing binary data is frequent itemset mining (Agrawal and Srikant 1994). Given appropriate distance functions, binary data can be clustered (MacQueen 1967), or subspace clusters (clusters within specific subsets of dimensions) (Aggarwal et al. 1999; Müller et al. 2009) can be identified. 123 178 N. Tatti, J. Vreeken Both traditional pattern mining and sub-space clustering typically provide incredibly many results. In response, a recent development is pattern set mining, where the goal is not to find all patterns (e.g. itemsets) that satisfy some conditions (e.g. frequency), but to instead find small, non-redundant, high-quality sets of patterns that together generalise the data well (Siebes et al. 2006; Miettinen et al. 2008; Knobbe and Ho 2006). An early proponent of this approach is Tiling (Geerts et al. 2004), which has the goal of finding large tiles, i.e. itemsets that cover many 1s in the data. Kontonasios and De Bie (2010) proposed to mine for noisy tiles that are surprising given a background model of the data. We discuss mining sets of patterns or tiles, as well as mining binary data in general, in more detail in Sect. 7. Interestingly, as we will show below, many data mining results on binary data, including clustering and pattern mining results, can be translated into noisy tiles without much, or any loss of information. Before we discuss how different results can be translated, we have to formally introduce the concept of tiles. A tile T = (t(T ) , a(T )) is a tuple consisting of two lists. The first list, t(T ), is a set of integers between 1 and n representing the transaction ids of T . The second list, a(T ), is a set of integers between 1 and m representing the attribute ids. We define s(T ) to give us all entries of T , that is, the Cartesian product of t(T ) and a(T ), s(T ) = {(i, j) | i ∈ t(T ) , j ∈ a(T )}. The number of entries of s(T), |s(T )|, we refer to as the area of a tile T . Given a tile set T we also define s(T ) = T ∈T s(T ) to give us the list of all entries T covers. 
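To make the tile notation concrete, the following is a minimal sketch, not taken from the paper, of how a tile and the entries s(T) it covers could be represented in code. The class and helper names (Tile, cells, area, covered_cells) are our own illustration of t(T), a(T), s(T) and |s(T)|.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Tile:
    """A tile T = (t(T), a(T)): a tuple of transaction ids and attribute ids."""
    rows: tuple   # t(T), transaction ids
    cols: tuple   # a(T), attribute ids

    def cells(self):
        """s(T): the Cartesian product of t(T) and a(T)."""
        return set(product(self.rows, self.cols))

    @property
    def area(self):
        """|s(T)|: the number of entries of the tile."""
        return len(self.rows) * len(self.cols)

def covered_cells(tiles):
    """s(T) for a tile set T: the union of the cells covered by its tiles."""
    cells = set()
    for t in tiles:
        cells |= t.cells()
    return cells
```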
Given a tile T and a dataset D we define a frequency fr(T; D) to be the proportion of ones in D corresponding to the entries identified by T,

fr(T; D) = \frac{1}{|s(T)|} \sum_{i \in t(T)} \sum_{j \in a(T)} D(i, j).

Example 1 Assume that we are given a set of text documents, for example, a set of abstracts of accepted papers at a computer science conference. We can transform these documents into a binary dataset using a bag-of-words representation. That is, each item represents a particular word, and every document is represented by a binary vector t with entries ti = 1 if and only if the corresponding word i occurs in the abstract. Now, let T be a tile in this dataset. Then the transactions of T correspond to specific abstracts, while the columns, or the items, of T represent specific words. The frequency of 1s in T is simply the relative number of these words occurring in these documents. For example, when we run the Krimp algorithm on such a dataset (see Sect. 8 for more details on this algorithm), we see it finds (among others) tiles containing the words {large, database}, {algorithm, result, set, experiment}, and {synthetic, real}. In this case, for each of the discovered tiles the corresponding frequencies are 1.0, and hence we know that each of the documents identified by these tiles contains all of the corresponding words.

Let p be a distribution defined over \mathcal{D}, the space of all n-by-m datasets. We define the frequency of a tile to be the average frequency with respect to p,

fr(T; p) = \sum_{D \in \mathcal{D}} p(D) \, fr(T; D).

Fig. 1 Toy example of a dataset, and maximum entropy models for this data given different sets of tiles, as used in Examples 2–6: (a) toy dataset D and five tiles, T1, . . . , T5, identified therein; (b) the maximum entropy model for D given tile set T = {T2, T4}; (c) the maximum entropy model for D given tile set U = {T2, T3, T5}; (d) the maximum entropy model for D given T ∪ U; (e) the maximum entropy model for D given T ∪ B, with B = {T1}; (f) the maximum entropy model for D given tile set U ∪ B; and (g) the maximum entropy model for D given tile set T ∪ U ∪ B. The cell entries in (b–g) are given by Theorem 2.

We can also express the frequency directly by this distribution.

Lemma 1 Given a distribution p and a tile T, the frequency is equal to

fr(T; p) = \frac{1}{|s(T)|} \sum_{(i, j) \in s(T)} p((i, j) = 1),

where p((i, j) = 1) is the probability of a dataset having 1 as its (i, j)th entry.

We say a tile is exact if its frequency is 0 or 1, and otherwise say it is noisy.

Corollary 1 For an exact tile T, p((i, j) = 1) = fr(T; p), where (i, j) ∈ s(T).

Example 2 Consider a dataset D given in Fig. 1a. We consider five different tiles,
  – T1 = (2, . . . , 5) × (1, . . . , 5) of area 20 and covering 10 ones of D,
  – T2 = (1, 2) × (1, 2) of area 4 and covering 4 ones,
  – T3 = (3, 4, 5) × (1, 2) of area 6 and covering 0 ones,
  – T4 = (4, 5) × (3, 4, 5) of area 6 and covering 6 ones, and
  – T5 = (3, 4, 5) × (4, 5) of area 6 and also covering 6 ones.
Their frequencies are hence fr(T1; D) = 10/20 = 1/2 for T1, while fr(T2; D) = fr(T4; D) = fr(T5; D) = 1, and fr(T3; D) = 0. By definition, T1 is a noisy tile, while T2, . . . , T5 are exact tiles.
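As a small illustration, continuing the Tile sketch above, the frequencies reported in Example 2 can be recomputed on one dataset that is consistent with the five tiles. Note that Fig. 1a itself is not reproduced in this text, and the entries of row 1, columns 3 to 5 are not determined by the tiles, so we simply set them to zero; the reported frequencies do not depend on that choice.

```python
import numpy as np

# One 5-by-5 dataset consistent with Example 2 (0-based indices here,
# 1-based in the text). Row 1, columns 3-5 of Fig. 1a are not fixed by the
# five tiles; we use zeros, which does not affect the reported frequencies.
D = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
])

def fr(tile, data):
    """fr(T; D): the proportion of ones of D inside the entries s(T)."""
    return data[np.ix_(tile.rows, tile.cols)].mean()

T1 = Tile(rows=(1, 2, 3, 4), cols=(0, 1, 2, 3, 4))
T2 = Tile(rows=(0, 1), cols=(0, 1))
T3 = Tile(rows=(2, 3, 4), cols=(0, 1))
T4 = Tile(rows=(3, 4), cols=(2, 3, 4))
T5 = Tile(rows=(2, 3, 4), cols=(3, 4))

for name, t in [("T1", T1), ("T2", T2), ("T3", T3), ("T4", T4), ("T5", T5)]:
    print(name, t.area, fr(t, D))   # areas 20, 4, 6, 6, 6; frequencies 0.5, 1, 0, 1, 1
```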
There are numerous techniques for mining tiles and sets from a database, but we observe that a large number of other statistics and mining results obtained on binary data can also be naturally described using sets of (noisy) tiles: – itemsets: any itemset can be converted into a tile by taking the supporting transactions. For standard frequent itemsets this results in an exact tile, for fault-tolerant (or noisy) itemsets this results in a noisy tile. Thus, an itemset collection can be converted into a tile set. – clustering: Given a clustering, we can construct a tile set in two different ways. The first way is to represent each cluster by a single tile, consisting of the transactions in the cluster and all the items. The other way is to transform each cluster into m 123 180 N. Tatti, J. Vreeken tiles, where m is the number of items in the data. Each tile corresponds to an item and the transactions associated with the cluster. The frequency corresponding to a tile then simply is the relative occurrence of that item in that cluster, i.e. its mean value. This is particularly natural for k-means, since a centroid then corresponds to the column means of the corresponding transactions. – bi-clustering and subspace-clustering: both subspace clusters (Aggarwal et al. 1999) and bi-clusters (Pensa et al. 2005) are sets of transactions and columns. Hence, we can naturally represent these results by equivalent tiles. – data density: the density of the data is equal to the frequency of a tile containing the whole data. – column and row margins: the margin of a column i, i.e. its frequency, can be expressed with a single (noisy) tile containing the column i and all rows r ∈ D. Analogously, we can express row margins by creating a tile per row over all columns. As such, we can convert a wide range of results on binary data into sets of (noisy) tiles—and retain most, if not all, information provided by these results. Note, however, that this list is far from complete: we cannot convert every result on binary data into (sets of) (noisy) tiles. For instance, we can currently not model structures within a tile beyond its density, and hence in this proof-of-concept we disregard statistics that go beyond density, such as Lazarus counts (Fortelius et al. 2006), or nestedness (Mannila and Terzi 2007). We discuss converting data mining results into tiles, and the limitations of the current proof-of-concept, in more detail in Sect. 9. 4 Building global models from tiles To meet our goal, we have to construct a statistically sound technique for comparing two sets of tiles. In this section we construct a global model for datasets using the given tiles. We will use these models for comparing the tile sets. Consider that we are given a tile set T , and for each tile T ∈ T we are also given a frequency αT . Typically, the frequencies are obtained from the data at hand, αT = fr(T ; Din ), but this is not a necessary condition. The tiles convey local information about the data Din and our goal is to infer a distribution p over D, that is, how probable data set D ∈ D is given a tile set T . If the information at hand defines the data set uniquely, then p(D) = 1 if and only if D = Din . To derive the model, we use a well-founded notion from information theory, the maximum entropy principle (Csiszár 1975; Jaynes 1982). Roughly speaking, by maximum entropy, we incorporate the given information into a distribution, yet further making it as evenly spread as possible. 
To define the distribution, we first define the space of distribution candidates. That is, the space of those distributions that produce the same frequencies for the given tiles,

P = \{\, p \mid fr(T; p) = \alpha_T \text{ for all } T \in T \,\}.

In other words, P contains all distributions that explain the frequencies α_T. From this set, we select one distribution, which we denote by p*_T, such that p*_T maximises the entropy, H(p*_T) ≥ H(p) for any p ∈ P. We will abuse notation and write H(T) where we mean H(p*_T). Similarly we write KL(T \| U) to mean KL(p*_T \| p*_U), where both T and U are tile sets.

A classic theorem states that the maximum entropy distribution p* can be written in an exponential form—which ensures generality, and as we show below, provides us with a practical way for inferring the model.

Theorem 1 (Theorem 3.1 in Csiszár 1975) Given a tile set T, a distribution p* is the maximum entropy distribution if and only if it can be written as

p^*(D) \propto \begin{cases} \exp\bigl(\sum_{T \in T} \lambda_T \, |s(T)| \, fr(T; D)\bigr) & D \notin Z \\ 0 & D \in Z, \end{cases}

where λ_T is a certain weight for fr(T; D) and Z is a collection of datasets such that p(D) = 0 for each p ∈ P.

The next theorem allows us to factorise the distribution p* into a product of Bernoulli random variables, each variable representing a single entry in the dataset. Such a representation gives us a practical way for inferring the model.

Theorem 2 Let T be a tile set. Write T(i, j) = {T ∈ T | (i, j) ∈ s(T)} to be the subset of T containing the tiles that cover an entry (i, j). Then the maximum entropy distribution can be factorised as p^*(D) = \prod_{i, j} p^*((i, j) = D(i, j)), where

p^*((i, j) = 1) = \frac{\exp\bigl(\sum_{T \in T(i, j)} \lambda_T\bigr)}{\exp\bigl(\sum_{T \in T(i, j)} \lambda_T\bigr) + 1}, \quad \text{or} \quad p^*((i, j) = 1) = 0 \text{ or } 1.

Proof See Appendix A.

Theorem 2 allows us to represent p* as Bernoulli variables. We should stress that this is a different model than assuming independence between items in a random transaction. Our next step is to discover the correct frequencies for these variables. Here we use a variant of the well-known Iterative Scaling algorithm (Darroch and Ratcliff 1972). The algorithm is given in Algorithm 1. Informally said, given a tile T, the algorithm updates the probabilities in the model such that the expected frequency of 1s within the area defined by T becomes equal to α_T. We do this for each tile. However, updating the model to satisfy a single tile might change the frequency of another tile, and hence we need several passes. The update phase is done by modifying the weight λ_T (given in Theorem 1). For the proof of convergence see Theorem 3.2 in Csiszár (1975).

Instead of representing a distribution with the exponential form given in Theorem 1, we prefer working with the matrix form given in Theorem 2. We need to show that the update phase in Algorithm 1 actually corresponds to modifying the weight λ_T. In order to do that, let p be a distribution having the exponential form, let T be a tile, and let q be the distribution q(D) ∝ exp(λ fr(T; D)) p(D), that is, q is the distribution for which λ_T is shifted by λ. Let us define

u(y, x) = \frac{xy}{1 - y(1 - x)}.   (1)
Algorithm 1: Iterative scaling for solving the MaxEnt distribution
  input: tile set T, target frequencies {α_T}
  output: maximum entropy distribution p
  p ← an n-by-m matrix with values 1/2;
  foreach T ∈ T with α_T = 0 or 1 do
      p(i, j) ← α_T for all (i, j) ∈ s(T);
  while not converged do
      foreach T ∈ T with 0 < α_T < 1 do
          f ← fr(T; p);
          find x such that α_T = \frac{1}{|s(T)|} \sum_{(i, j) \in s(T)} u(p(i, j), x);   {see Equation 1}
          p(i, j) ← u(p(i, j), x) for all (i, j) ∈ s(T);

For the sake of clarity, let us write a = p((i, j) = 1) and b = q((i, j) = 1). We claim that b = u(a, e^λ) if (i, j) ∈ s(T), and a = b otherwise. The latter claim follows directly from Theorem 2, since if (i, j) ∉ s(T), then the fraction in Theorem 2 does not contain λ_T. Assume that (i, j) ∈ s(T). Then

b = \frac{e^{\lambda} c}{e^{\lambda} c + 1} \quad \text{and} \quad a = \frac{c}{c + 1}, \quad \text{where} \quad c = \exp\Bigl(\sum_{T' \in T(i, j)} \lambda_{T'}\Bigr).

Expressing b using a implies that b = u(a, e^λ). In order to find the correct e^λ we use interpolation search using a linear interpolation.

It turns out that, if the given tiles are exact, the maximum entropy distribution has a very simple form. The Bernoulli variable corresponding to the (i, j)th entry is always 1 (or 0), if that entry is covered by an exact tile, i.e. with frequency 1 (or 0). Otherwise, the variable is equal to a fair coin toss. This form will allow us to express distances between sets of exact tiles in the next section.

Theorem 3 Let T be a collection of exact tiles and let α_T be the desired frequency of a tile T ∈ T. Then

p^*_{T}((i, j) = 1) = \begin{cases} \alpha_T & \text{if there exists } T \in T \text{ such that } (i, j) \in s(T) \\ 1/2 & \text{otherwise.} \end{cases}

Proof See Appendix A.

In essence, this theorem states that in the maximum entropy model for a dataset given a set T of exact tiles, the probability p* of observing a 1 in a specified cell (i, j) is either 0 or 1 if this cell maps to one of the tiles in T—and otherwise the probability is 1/2, as we cannot infer anything about this cell from T.

Example 3 Let us continue Example 2. Consider the following three tile sets: T = {T2, T4}, U = {T2, T3, T5}, and B = {T1}. The corresponding maximum entropy models are given in Fig. 1, such that each entry represents the probability p*((i, j) = 1). As the sets T and U contain only exact tiles, Theorem 3 implies that the entries for the models p*_T, p*_U, and p*_{T∪U} are either 0, 1, or 1/2.

Consider p*_{T∪B}. Corollary 1 states that entries in s(T2) and s(T4) should be 1, since both tiles are exact. From T1, we know that there are 10 ones, yet T2 and T4 account only for 8. Hence, there should be 2 ones in the 12 entries outside of T, on average. We aim to be as fair as possible, hence we spread uniformly, giving us probabilities 2/12 = 1/6 for the unaccounted entries in T1. For p*_{U∪B}, there are 2 unaccounted 1s in 6 entries, giving us 2/6 = 1/3. Finally, all 1s are accounted for in p*_{T∪U∪B}. Outside of T ∪ U ∪ B we have no information, hence these probabilities default to 1/2.
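The factorised model of Theorem 2 together with the iterative-scaling updates of Algorithm 1 can be sketched as follows. This is a simplified illustration rather than the authors' implementation: convergence is handled by a fixed number of passes, and the multiplier x of Equation (1) is found by bisection instead of the interpolation search used in the paper. The function names (u, fit_maxent) are ours, and the (Tile, alpha) encoding reuses the earlier sketches.

```python
import numpy as np

def u(y, x):
    """Equation (1): the probability obtained by multiplying the odds of y by x."""
    return x * y / (1.0 - y * (1.0 - x))

def fit_maxent(n, m, tiles, n_passes=100):
    """Iterative scaling in the spirit of Algorithm 1.

    tiles is a list of (Tile, alpha) pairs (Tile from the earlier sketch);
    the result is an n-by-m matrix whose entry (i, j) is p*((i, j) = 1).
    """
    p = np.full((n, m), 0.5)
    for t, a in tiles:                       # exact tiles fix their entries outright
        if a in (0.0, 1.0):
            p[np.ix_(t.rows, t.cols)] = a
    for _ in range(n_passes):                # a fixed number of passes, for simplicity
        for t, a in tiles:
            if a in (0.0, 1.0):
                continue
            block = p[np.ix_(t.rows, t.cols)]
            lo, hi = 1e-12, 1e12             # bisect for the odds multiplier x
            for _ in range(80):
                x = np.sqrt(lo * hi)
                if u(block, x).mean() < a:
                    lo = x
                else:
                    hi = x
            p[np.ix_(t.rows, t.cols)] = u(block, x)
    return p

# Example 3 check: for T ∪ B = {T2, T4, T1}, the entries of T1 not accounted
# for by T2 or T4 should get probability 2/12 = 1/6.
p = fit_maxent(5, 5, [(T2, 1.0), (T4, 1.0), (T1, 0.5)])
print(round(float(p[2, 2]), 3))              # 0.167
```

Running this on T ∪ B from Example 3 should indeed put probability 1/6 on the entries of T1 that T2 and T4 leave unaccounted for.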
5 Comparing sets of tiles

Now that we have a technique for incorporating the information contained within a set of tiles into a probabilistic model, we can subsequently use these models to compare between the sets of tiles. We assume that we are given three tile sets T, U, and B. Let tile set B contain the background information. Typically, this information would be simple, like column margins, row margins, or just the proportions of ones in the whole dataset. B can also be empty, if we do not have or wish to use any background knowledge.

Our goal is now to compute the distance between T and U given B. We assume that the frequencies for the tile sets we are given are mutually consistent; this is automatically guaranteed if the frequencies for all three tile sets are computed from a single dataset. Now, let M = T ∪ U ∪ B be the collection containing all tiles. We define the distance between T and U, w.r.t. B, as

d(T, U; B) = \frac{KL(M \| U \cup B) + KL(M \| T \cup B)}{KL(M \| B)}.

If KL(M \| B) = 0, we define d(T, U; B) = 1.

A direct application of Theorem 2 leads to the following theorem, which in turn implies that we can compute the distance in O(nm) time.

Theorem 4 Let A and B be two sets of tiles such that B ⊆ A. Then

KL(A \| B) = \sum_{i, j} \sum_{v = 0, 1} p^*_{A}((i, j) = v) \log \frac{p^*_{A}((i, j) = v)}{p^*_{B}((i, j) = v)}.

If the given tiles are exact, the distance has a simple, interpretable form; namely, the distance can be expressed with Jaccard similarity.

Theorem 5 Assume three tile collections T, U, and B with exact frequencies {α_T}, {β_U}, and {γ_B}. Define X = s(T) \setminus s(B) and Y = s(U) \setminus s(B). Then d(T, U; B) = 1 − |X ∩ Y| / |X ∪ Y| for |X ∪ Y| > 0, and d(T, U; B) = 1 for |X ∪ Y| = 0.

Proof See Appendix A.

In other words, the theorem states that if we only consider sets of exact tiles, we can easily compute (and interpret!) the dissimilarity score d between them, as our measure coincides with Jaccard dissimilarity.

Example 4 Let us continue Example 3. To compute d(T, U; ∅) we first note that T and U only have exact tiles, and hence we can use Theorem 5. So, we have |s(T) \setminus s(U)| = 2, |s(U) \setminus s(T)| = 8, and |s(T) ∪ s(U)| = 18. Hence, the distance is d(T, U; ∅) = (2 + 8)/18 = 5/9.

Next, let us use M as a shorthand for T ∪ U ∪ B, i.e. M = T ∪ U ∪ B. To compute d(T, U; B), note that

KL(M \| T \cup B) = 2 \log 6 + 10 \log(6/5) \approx 5.4067,

where the first term represents the positive entries in M and the second term the negative entries in M. Similarly, KL(M \| B) ≈ 15.2, and KL(M \| U ∪ B) ≈ 3.8. Consequently, the distance between T and U in light of B is

d(T, U; B) = \frac{3.8 + 5.4}{15.2} \approx 0.6,

which is slightly larger than d(T, U; ∅) ≈ 0.56. This is due to the fact that adding B, which effectively states that the data is sparse, to T makes the probability of encountering a 1 at (3, 4) and (3, 5) in the model less likely. Hence, given that background knowledge, and regarding U, we are more surprised to find that the values of these entries are indeed ones.

Next, we discuss some further properties of our measure. We start by showing that the measure is upper bounded by 2 for noisy tiles, and by 1 if we only consider exact tiles.

Theorem 6 Assume three tile collections T, U, and B. Then d(T, U; B) ≤ 2. If T, U, and B consist only of exact tiles, then d(T, U; B) ≤ 1.

Proof See Appendix A.

Before, in Theorem 5, we showed that when all tiles are exact and s(T) ∩ s(U) ⊆ s(B), that is, T does not provide any new information about U, then the distance is equal to 1. We can extend this property to noisy tiles.

Theorem 7 Assume three tile collections T, U, and B. Assume that B consists of exact tiles and s(T) ∩ s(U) ⊆ s(B). Then d(T, U; B) = 1.

Proof See Appendix A.

Exploiting the fact that our measure coincides with Jaccard dissimilarity when considering exact tiles, our measure upholds the triangle inequality when all tile sets consist of exact tiles.

Theorem 8 Assume four tile collections S, T, U, and B consisting only of exact tiles. Then the distance satisfies the triangle inequality,

d(T, U; B) \le d(T, S; B) + d(S, U; B).
123 Comparing apples and oranges 185 Proof See Appendix A. Consequently, the distance is (almost) a metric when working only with exact tiles. The technical subtlety is that the distance can be 0 even if the tile sets are different. This will happen, if (s(T ) ∪ s(U)) \ (s(T ) ∩ s(U)) ⊆ s(B). This highlights the fact that we are measuring the information hidden in the tile sets, not the actual tiles. Next, we show that if we add an element from one tile set to the other, the distance we measure between them decreases. Theorem 9 Assume three tile collections T , U, and B. Let U ∈ U. Then d(T ∪ {U } , U; B) ≤ d(T , U; B) . Proof See Appendix A. In other words, if we add tiles from the right-hand tile set U to the left-hand tile set T , regardless of the background information B, we decrease the dissimilarity d(T , U; B), as T contains more information about U. Example 5 Consider the tile sets U and M = T ∪ U ∪ B given in Fig. 1. The distance between U and M is 6/22. The distance between U ∪ {T4 } = U ∪ T and M is 3/22. Since T4 ∈ M, Theorem 10 states that the distance d(U, M; ∅) should be as large as d(U ∪ {T4 } , M; ∅), which is the case. If U ⊆ T , then we can show that the smaller the distance, the larger the Kullback– Leibler is between U and the background knowledge. We will use this property for iterative data mining in Sect. 6.2. Theorem 10 Assume three tile collections T , U, and B. Let M = T ∪ U ∪ B. Assume that U ∪ B ⊆ T ∪ B. Then d(T , U; B) = KL(M U ∪ B) KL(U ∪ B B) =1− . KL(M B) KL(M B) Proof See Appendix A. Let us next consider an example of this property. Example 6 Consider the tile sets T and U given in Fig. 1. Let M = T ∪ U. We have KL(M U) = 2, KL(U ∅) = 16, KL(M ∅) = 18. The distance is then d(U, U ∪ T ; ∅) = KL(M U) 2 16 KL(U ∅) = =1− =1− . KL(M ∅) 18 18 KL(M ∅) 6 Applying the distance Above, we were only concerned in finding out how much information two tile sets share. With our measure, we can quantify the difference in information provided by different tile sets, and hence, measure differences between results—a very useful application on itself. In this section we consider two more elaborate problems to which our measure can be applied. 123 186 N. Tatti, J. Vreeken 6.1 Redescribing results by redescribing sets of tiles The first application we consider is redescription of results. Opposed to traditional redescription mining (Ramakrishnan et al. 2004), where given a pattern X we want to find a syntactically different pattern Y that identifies the same rows in the database as X , we here study how well we can describe the information provided by one result using (parts of) another result. Between these two approaches, the underlying question is the same: are there different representations of the same information. Whereas in standard redescription mining the row-set of a pattern is regarded as ‘information’, we consider the detail given on the data (distribution) as information. More formally, given two tile sets, say T and U, we want to find which tiles from U best describe the information provided by T . To this end, we can apply our distance measure as follows. Problem 1 (R EDESCRIBE) Given three sets of tiles T , U, and B with consistent frequencies, find a subset V ⊆ U such that d(V, T ; B) is minimised. As with many optimisation problems in data mining, it turns out that finding the best tile subset is computationally intractable. Theorem 11 The decision version of Redescribe is an NP-hard problem. Proof See Appendix A. 
Hence, we resort to a simple greedy heuristic to find good approximations for the redescription problem: we iteratively choose to add that tile to our selection that maximally reduces the difference to the target tile set. We give the pseudo-code for our redescription mining approach as Algorithm 2. As the number of tiles we can select is finite, so is the procedure. Moreover, we stop selecting new tiles when we cannot decrease the distance by adding more tiles. Algorithm 2: Fruits, finds redescription used in a tile set. 1 2 3 4 5 6 7 8 9 input : target tile set T , candidate tile set S, background knowledge B output : R a subset of S redescribing T R ← ∅; while changes do d ← d(R, T ; B); C ← ∅; foreach S ∈ S do if d(R ∪ {C} , T ; B) ≤ d then C ← S; d ← d(R ∪ {C} , T ; B); if C = ∅ then add C to R; remove C from S; 10 return R; Note that Algorithm 2 selects tiles in a particular order: we iteratively find the tile that minimises the distance to the information in the target tile set. As such, if needed, 123 Comparing apples and oranges 187 one can only inspect the top-k selected tiles of a redescription to see how result S captures the most important information in T . Furthermore, by iteratively minimising the distance, if the candidate tile set S contains elements of the target tile set T , these tiles are likely to be selected—as those parts of the target information can be ‘redescribed’ exactly. Also note that by minimising the distance between R and T we will not simply select all tiles in S that overlap with tiles in T —even though by overlapping these tiles will provide some similar information, their non-overlapping parts will provide different information, and hence increase the distance between the two tile sets. As such, we automatically prevent the selection of tiles in S that are much more specific (e.g. contain more items) than what T describes. 6.2 Iterative data mining Next, we apply our distance measure for application in an iterative data mining setting. The key idea of iterative data mining is that what one finds interesting depends on what one knows. As such, practitioners study results one at the time (Hanhijärvi et al. 2009; Mampaey et al. 2012). Once we have studied a result, we are no longer interested in other results that provide (approximately) the same information, and we rather want to be served the most interesting pattern with regard to what we now know. As such, the key difference between iterative data mining and traditional pattern ranking is that we take into account what a user has already seen when determining what might be interesting. Consequently, if a result has a peculiar statistic, yet it can be (approximately) explained by the previously inspected results, we consider this result to be redundant and choose not present it to the user. To be more specific, let us assume that we ran k different algorithms on a single dataset D, and so obtained T1 , . . . , Tk different tile sets. Now, if we merge all tiles into one tile set T = T1 ∪ . . . ∪ Tk , this set represents all information we have assembled about D. Note however that T is unordered, and will likely contain redundant tiles. We order the tiles by applying Fitamin, which orders the tiles in a given tile set T such that every tile provides maximal novel information with regard to the tiles ranked above it. 
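Below is a hedged sketch of how the distance of Sect. 5 and the two greedy procedures can be put together, reusing fit_maxent and the (Tile, alpha) encoding from the earlier sketches; the ordering procedure Fitamin, whose pseudocode is given as Algorithm 3 below, is included as well. The function names are ours, the loops are naive rather than optimised, and the redescription search stops when no candidate strictly decreases the distance, matching the description of Algorithm 2 (Fruits) above.

```python
import numpy as np

def bernoulli_kl(p, q):
    """Elementwise KL(Bern(p) || Bern(q)), with the convention 0 log 0 = 0."""
    out = np.zeros_like(p, dtype=float)
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        mask = a > 0
        out[mask] += a[mask] * np.log(a[mask] / b[mask])
    return out

def distance(T, U, B, n, m):
    """d(T, U; B) of Section 5, computed cell-wise via Theorem 4."""
    M = T + U + B
    kl = lambda A, C: bernoulli_kl(fit_maxent(n, m, A), fit_maxent(n, m, C)).sum()
    denom = kl(M, B)
    return 1.0 if denom == 0 else (kl(M, U + B) + kl(M, T + B)) / denom

def fruits(target, candidates, background, n, m):
    """Greedy redescription in the spirit of Algorithm 2 (Fruits)."""
    chosen, remaining = [], list(candidates)
    best = distance(chosen, target, background, n, m)
    while remaining:
        scored = [(distance(chosen + [c], target, background, n, m), i)
                  for i, c in enumerate(remaining)]
        d, i = min(scored)
        if d >= best:                        # no further decrease: stop
            break
        chosen.append(remaining.pop(i))
        best = d
    return chosen

def fitamin(tiles, background, n, m):
    """Greedy ordering in the spirit of Algorithm 3 (Fitamin), given below."""
    ranked, pool = [], list(tiles)
    while pool:
        scored = [(distance(ranked + [t], tiles, background, n, m), i)
                  for i, t in enumerate(pool)]
        ranked.append(pool.pop(min(scored)[1]))
    return ranked
```

As a sanity check, distance([(T2, 1.0), (T4, 1.0)], [(T2, 1.0), (T3, 0.0), (T5, 1.0)], [], 5, 5) on the tiles of Example 2 should reproduce d(T, U; ∅) = 5/9 from Example 4.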
Algorithm 3: Fitamin, Find iteratively those tiles that are most informative 1 2 3 4 5 6 input : tile set T , background knowledge B output : ordered list of tiles occurring in T L ← ∅; U ←T; while U = ∅ do T ← arg minU {d(L ∪ {U } , T ; B) | U ∈ U }; add T to the end of L; delete T from U ; 7 return L; The Fitamin algorithm, for which we present the pseudo-code as Algorithm 3, starts with an empty list L and selects a tile T ∈ T which minimises the distance d(L ∪ {U } , T ; B). We add T into L and repeat the process until all tiles are ordered. 123 188 N. Tatti, J. Vreeken In order to understand the connection between this approach and iterative data mining, let us consider Theorem 10 which implies that the tile that minimises the distance, also maximises KL(L ∪ B ∪ {T } B). Corollary 2 (given in Appendix) allows us to rewrite the divergence as a difference between two entropies KL(L ∪ B ∪ {T } B) = H (B) − H (L ∪ B ∪ {T }) . Note that H (B) does not depend on T and so if we replace it with H (L ∪ B), then we conclude that T must maximise H (B ∪ L) − H (L ∪ B ∪ {T }) = KL(L ∪ B ∪ {T } B ∪ L) , where the equality follows from Corollary 2. The divergence KL(L ∪ B ∪ {T } B ∪ L) = 0 if and only if the frequency of T obtained from the dataset is equal to the frequency obtained from the maximum entropy model constructed from the known tiles B ∪ L. The divergence increases as these two frequencies become more different. Consequently, a tile T is the one that is most surprising given the known tiles. 7 Related work To our knowledge, defining a distance between two general tile sets is a novel idea. However, there exist several techniques for comparing between datasets by comparing how the supports of patterns change between the datasets. Such proposals include a Mahalanobis distance between itemset collections (Tatti 2007) and a compressionbased distance between itemsets (Vreeken et al. 2007). In addition, Hollmén et al. (2003) suggested using L 1 distance between frequent itemset collections, where the missing frequencies were estimated by a support threshold. From technical point of view, comparing pattern sets given background knowledge is akin to defining an interestingness measure based on deviation from the background knowledge. In fact, our approach for building a global maximum entropy model from tiles was inspired by the work of De Bie (2011b), where he builds a similar maximum entropy model from row and column margins (i.e. a Rasch model (Rasch 1960)) and uses it as a static null hypothesis to rank tiles. Further related proposals include iterative mining of patterns by empirical p values and randomisation (Hanhijärvi et al. 2009), and maximum entropy models based on itemsets (Wang and Parthasarathy 2006; Mampaey et al. 2012). Several techniques have been proposed for mining sets of tiles. Geerts et al. (2004) suggested discovering tilings that cover as many ones as possible. Xiang et al. (2011) gave a method to mine (possibly noisy) tiles that cover ones while minimising a cost: the number of transactions and items needed to describe tiles. These methods focus on covering the ones in the data, alternatively, we can assess the quality of a tiling by statistical means. Gionis et al. (2004) suggested discovering hierarchical tiles by building a statistical model and optimising an MDL score. De Bie (2011b) gave a maximum entropy model based on column/row margins to rank tiles. Vreeken et al. 
(2011) propose that the best set of tiles (or itemsets) is the tile set that compresses the dataset best. An alternative approach for discovering tiles is to consider a Boolean matrix factorisation (Miettinen et al. 2008). That is, factorise the dataset into two low rank Boolean matrices, where the row vectors of the one matrix correspond to itemsets, while the column vectors of the other matrix correspond to tid-lists. The Boolean product of these matrices naturally defines a set of noisy tiles.

Compared to tiles, computing maximum entropy models based on itemsets is much more difficult. The reason for this is that there is no equivalent version of Lemma 1 for itemsets. In fact, computing an expected value of an itemset from a maximum entropy model is PP-hard (Tatti 2006). To avoid these problems, we can convert itemsets to exact tiles by considering their supporting transactions.

Redescribing tile sets is closely related to redescription mining, in which the idea is to find pairs of syntactically different patterns covering roughly the same transactions. Ramakrishnan et al. (2004) originally approached the problem by building decision trees. Other approaches include Boolean Formulae with no overlap (Gallo et al. 2008), and exact minimal redescriptions (Zaki and Ramakrishnan 2005). From a computational point of view, the difference between our problem and existing work is that whereas redescription mining aims to construct alternative patterns given a single target pattern, we on the other hand consider sets of target and candidate patterns, and aim to find the subset of patterns from the candidates that together describe the target best.

Iterative data mining, in which one iteratively finds the most interesting result given what one already knows, and subsequently dynamically updates the background model, is a relatively new approach to data mining (De Bie 2011a). Hanhijärvi et al. (2009) propose to use swap-randomisation for acquiring empirical p values for patterns, keeping the margins of already selected patterns (approximately) fixed. By using an analytical model of the data, instead of having to sample many randomised datasets per pattern, we gain much computational efficiency. Whereas we model complete databases, Mampaey et al. (2012) formalise a maximum entropy model for rows of the data by iteratively choosing that itemset for which the frequency prediction is most off. Both models have advantages: whereas in our model it is not obvious how itemset frequencies can be easily incorporated, i.e. without specifying in which rows of the dataset these occur, in the row-model approach co-occurrences of patterns are difficult to detect, as we do not know where patterns occur.

8 Experiments

In this section we empirically evaluate our measure, applying it for measuring distances between results, redescriptions of results, and for iterative data mining.

8.1 Set up

We evaluate our measure on four publicly available real world datasets. The Abstracts dataset contains the abstracts of the papers accepted at ICDM up to 2007, where words have been stemmed and stop words removed (De Bie 2011b).
The DNA amplification data is data on DNA copy number amplifications. Such copies activate oncogenes and are hallmarks of nearly all advanced tumours (Myllykangas et al. 2006). Amplified genes represent attractive targets for therapy, diagnostics and prognostics. The Mammals presence data consists of presence records of European mammals [1] within geographical areas of 50 × 50 km (Mitchell-Jones et al. 1999). Finally, Paleo contains information on fossil records [2] found at specific palaeontological sites in Europe (Fortelius et al. 2006).

We provide our prototype implementation for research purposes [3]. Computing a single distance typically takes a few seconds—up to maximally two minutes for the Abstracts dataset, when considering the most complex sets of tiles, and most detailed background knowledge.

[1] The full version of the mammal dataset is available for research purposes upon request from the Societas Europaea Mammalogica. http://www.european-mammals.org
[2] NOW public release 030717 available from (Fortelius et al. 2006).
[3] http://www.adrem.ua.ac.be/implementations/

8.2 Methods and mining results

We apply our measure to compare between the results of ten different exploratory data mining methods for binary data. Table 1 gives an overview; here we list the methods, stating the parameters we use, and giving how we refer to each of the methods between brackets.

Table 1  Number of tiles extracted by each of the considered methods

Dataset    n      m      clus  bicl  atcl  sscl  asso  tiling  hyper  itt  krimp  mtv
Abstracts  859    3,933  5m    25    753   100   100   38      100    100  100    25
DNA        4,590  392    5m    25    56    100   100   32      100    100  100    100
Mammals    2,183  124    5m    25    28    100   91    3       100    2    100    14
Paleo      501    139    5m    25    514   100   100   100     100    71   85     14

clus  We employ simple k-means clustering with k = 5 clusters, using L1 distance. We turn the clusters into tiles by computing column margins inside each cluster.

bicl  A bi-clustering simultaneously clusters the transactions and the items; a cluster is then a tile defined by the corresponding pair of item and transaction clusters (Pensa et al. 2005). We apply biclustering by separately clustering the columns and rows using again k-means clustering (k = 5) and combine the two clusterings into a grid. Puolamäki et al. (2008) showed that a good approximation bound can be achieved with this approach. Each cluster is represented by a single tile.

atcl  We use the parameter-free attribute clustering approach by Mampaey and Vreeken (2012, 2010) that groups together binary or categorical attributes for which the values show strong interaction. We convert each attribute cluster into a tile—thereby somewhat oversimplifying these results, as we disregard the identified structure in attribute-value combinations within the cluster.

sscl  Subspace clustering aims at finding groups of rows that exhibit strong similarity on a subset, i.e. subspace, of the attributes. For subspace clustering, we used the implementation of Müller et al. (2009) of the ProClus algorithm (Aggarwal et al. 1999), mined 100 clusters, each over maximally 32 dimensions, and converted each into a noisy tile.

asso  The Asso algorithm (Miettinen et al. 2008) approximates the optimal k-rank Boolean Matrix Factorisation of the data. We ran with a maximum of 100 factors, of which the non-empty ones were converted into tiles.

tiling  Using the Tiling algorithm (Geerts et al.
2004) we mine overlapping tilings, of up to 100 exact tiles, allowing the algorithm a maximum running time of 8 h. The Hyper algorithm is strongly related to Tiling, but can find noisy tiles by combining exact tiles such that error is minimised. Per dataset, we mined up to 100 hyper rectangles (Xiang et al. 2011), directly obtaining noisy tiles. We mined Information-theoretic exact tiles (De Bie 2011b), where the method automatically selects the number of tiles, and ranks them according to informativeness. Out of the returned tiles, we use the top-100. We applied the mtv algorithm for obtaining the most informative itemsets; those itemsets for which the predicted frequency is most off given the current model (Mampaey et al. 2012). We allow a maximum run time of 2 hours, and a maximum number of 100 returned itemsets. As this algorithm focuses on frequency, selecting both itemsets for which the frequency is surprisingly high as well as surprisingly low, we convert these itemsets into tiles by selecting all rows. We used Krimp (Vreeken et al. 2011) to mine itemsets that compress. From the resulting code tables, we identified the top-100 itemsets that are most used during compression, i.e. that aid compression best, and convert these into tiles by using their Krimp-usage as tid-sets. The last four of these methods select itemsets from a candidate collection. As candidate itemsets to select from, we used closed frequent itemsets mined at as low as feasible support thresholds, of resp. 5, 1, 800, and 1. For Krimp, however, by its more optimised implementation we can use lower support thresholds for the Abstract and Mammals datasets, of resp. 4 and 300. 8.3 Measuring distances First, we evaluate whether the measured distance converges to 0 when two sets of tiles approximate each other with regard to background knowledge. To this end, we take the tile sets of asso, krimp, and itt, as obtained on resp. the Abstracts and DNA datasets. In Fig. 2 we plot, per method, the measured distance between the top-k and top-100 tiles. We give the measurements for three different background knowledge settings, resp. no background knowledge, knowledge of the average density of the dataset, and 123 192 N. Tatti, J. Vreeken Fig. 2 Distance between top-k tile sets and top-100 tile sets as a function of k. Rows represent datasets while the columns represent the methods the column margins. The tiles are sorted according to their output order, for asso and itt, and ascending on code length for krimp. As Fig. 2 shows, measurements indeed converge to 0 for higher k, i.e. when the two tile sets become more identical. Adding background information typically increases the distance. This is due to two reasons. First, when density is used, then we can infer additional differences between the areas that are not covered by tiles, thus highlighting the differences. Second, when we are using column margins, we reduce KL(M B), the joint information w.r.t. the background knowledge, consequently increasing the distance. Interestingly, for Abstracts, the distances for asso in fact decrease when density is used as background knowledge. This is caused by the fact that asso produces many overlapping tiles and these overlaps are emphasised when density is used. 8.4 Distances between results Our main experiment is to investigate how well we can compare between results of different methods. To do so, for every dataset, and every method considered, we convert their results into tile sets as described above. 
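A sketch of the pair-wise comparison this sets up is given below: compute d(T, U; B) for every pair of converted tile sets, then embed the resulting distance matrix in the plane for inspection. The tile_sets dictionary is hypothetical, distance and the (Tile, alpha) encoding come from the earlier sketches, and scikit-learn's metric MDS is used here as a readily available stand-in for the Sammon projection used in the paper.

```python
import numpy as np
from sklearn.manifold import MDS

def pairwise_distance_matrix(tile_sets, background, n, m):
    """All pair-wise d(T, U; B) between the converted results.

    tile_sets: dict mapping a method name to its list of (Tile, alpha) pairs.
    """
    names = list(tile_sets)
    k = len(names)
    D = np.zeros((k, k))
    for a in range(k):
        for b in range(a + 1, k):
            d = distance(tile_sets[names[a]], tile_sets[names[b]], background, n, m)
            D[a, b] = D[b, a] = d            # the measure is symmetric in T and U
    return names, D

def project_2d(D, seed=0):
    """Embed the precomputed distance matrix in two dimensions with metric MDS."""
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=seed)
    return mds.fit_transform(D)
```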
We measure the pair-wise difference between each of these tile sets, using resp. the empty set, overall density, the column margins, and the combination of row and column margins, as background knowledge. For analysis, we present these numerical results, and the averages over the datasets, visually in Fig. 3 by plotting all pair-wise distances by Sammon projection (Sammon 1969). Sammon projection is a commonly used variant of Multi-dimensional scaling (MDS), in which one projects a high-dimensional space (here the pair-wise distance tables) onto a two-dimensional space, while preserving the structure of the individual 123 Comparing apples and oranges 193 Fig. 3 Sammon projections (a type of MDS, see Sammon 1969) of distances between tile sets. Each row represents a dataset and each column represents used background knowledge. Note that atcl and mtv are not included in the rightmost columns, as their tiles provide no information beyond column margins distances as well as possible. We colour tiling and mark and clustering methods differently. Note that by our conversion into noisy tiles, we here treat the results of mtv (itemsets and their frequencies) as attribute clusters. Considering the first column first, we see that without background knowledge three of the clustering approaches provide virtually the same result. We also see that, albeit not identical, the results of asso, itt, krimp, and tiling are relatively close to each other; which makes sense from a conceptual point of view, as these methods are methodologically relatively similar. For DNA, the measured dissimilarity between these methods lies between 0.28 and 0.38, whereas the dissimilarities to the other methods measure approximately 0.9. We observe that hyper, while conceptually similar, is measured to provide different results when no background knowledge is given. This is mostly due to it including a few very large tiles, that practically cover the whole data, whereas the other methods 123 194 N. Tatti, J. Vreeken Table 2 Redescribing the results of clustering (clus) by the results of resp. asso, tiling, hyper, itt, and krimp, using the Fruits algorithm Dataset Redescription/# of tiles/full tile set asso tiling hyper itt krimp Abstracts .76 (70) .76 .74 (38) .74 .50 (32) .57 .68 (100) .68 .85 (98) .85 DNA .65 (11) .83 .70 (9) .79 .67 (21) .84 .67 (11) .81 .67 (19) .81 Mammals .30 (51) .31 .68 (3) .68 .34 (24) .39 .78 (2) .78 .49 (91) .49 Paleo .68 (16) .83 .69 (28) .78 .68 (23) .81 .73 (21) .81 .74 (38) .79 Shown are, per combination, the dissimilarity between clus and the discovered redescription, the number of tiles of the redescription, and the dissimilarity measurement to the complete tile set only cover the data partially. For hyper we see that once background knowledge is included, these large tiles are explained away, and the method subsequently becomes part of the ‘tiling’ group. Clustering algorithms are close to each other when no background information is used because by covering all the data, they convey well that these datasets are sparse. When we use density as background knowledge, the differences between clusterings become visible. Interestingly enough, adding row margins to column margins as background information has small impact on the distances. 8.5 Redescribing results Next, we empirically evaluate how our measure can be employed with regard to redescribing results; both as validation as well as possible application. To this end, we first investigate redescribing results of completely different methods. 
For these experiments, we take clustering as the target and density as the background information, and redescribe its result using the tile sets of five of the pattern mining methods as candidate tile sets. Table 2 shows the results of these experiments: for four datasets, the measured divergence between the redescription and the target, the divergence of the complete tile set to the target, and the number of tiles selected for the redescription. First and foremost, we see that redescription decreases the measured divergence, which correctly shows that by filtering out, e.g. too specific, tiles that provide information not in the target, we obtain a better description of the target. We also see, with the exception of Mammals, that the measurements are quite high overall, suggesting the clustering provides information these results do not; this is not surprising, as these pattern mining methods focus on covering 1s, and do not necessarily cover the whole data. Indeed, we see that by providing large and noisy tiles, and so covering more of the data, asso and hyper lead to the best redescriptions of the clustering results. In particular for Mammals, the target can be approximated very well, by resp. only half and a quarter of the total tiles. Overall, we note that typically only fractions of the full tile sets are selected, yet the amount of shared information is larger than for the full tile set: the pattern mining methods provide detailed local information not captured by clustering.

Second, we take a closer look at individual redescriptions. In order to be able to interpret these, we use the Abstracts dataset. We use asso, krimp, and itt, as these provide sufficiently many tiles to choose from; we leave hyper out, as for this data it mostly gives only very general tiles, covering all 1s in only 100 tiles. By hand, we select 5 out of 100 krimp tiles, and we identify, for asso and itt, the sets of tiles that best approximate that partial result, and investigate how well the target concepts are approximated. In Fig. 4 we give an example. By the high distances, 0.77 and 0.83, we see the target is not approximated in detail. Overall, for itt we find only high-level translations that leave out detail, as its full tile set consists mostly of small itemsets. For asso, we see the redescription consists of target tiles combined with general concepts, which together give a reasonable approximation of the target. It is important to note that not simply all intersecting itemsets are given, but only those that provide sufficient information on the target; for both methods, overly large and overly general tiles (e.g. 'high') are not included in the redescription.

Fig. 4 Redescribing, using the Fruits algorithm, (left) 5 tiles selected from the krimp result on the Abstracts dataset, by resp. tiles from (centre) itt and (right) asso.

8.6 Iterative data mining

Next, we empirically evaluate how our measure can be employed for iterative data mining, that is, iteratively identifying the (partial) result that provides the most novel information with regard to the data. First, for the DNA dataset, we combine the results of hyper, tiling, krimp, itt, asso, and sscl into one target tile set. Next, we apply the Fitamin algorithm to rank these tiles by iteratively finding the tile that makes for the most informative addition to our selection. We do this for our four settings of background knowledge, and visually depict the results in Fig. 5.
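Before discussing these results, the following much-simplified sketch illustrates the kind of greedy, iterative selection involved. It is not the Fitamin algorithm itself: the `divergence` callable is a hypothetical stand-in for the maximum entropy based distance between the tiles selected so far and the full target tile set, and Fitamin additionally uses a heuristic estimate rather than fitting a model for every candidate.

```python
def greedy_rank(candidates, divergence):
    """Greedily order candidate tiles: at each step pick the tile whose
    addition brings the current selection closest to the full target.

    `divergence(selection)` is a hypothetical callable returning the
    distance between the tiles selected so far and the target tile set;
    smaller means more of the target's information is covered.
    """
    selected, remaining, order = [], list(candidates), []
    while remaining:
        # Tile that, when added, minimises the remaining divergence.
        best = min(remaining, key=lambda t: divergence(selected + [t]))
        # Gain in information contributed by this tile.
        gain = divergence(selected) - divergence(selected + [best])
        selected.append(best)
        remaining.remove(best)
        order.append((best, gain))
    return order
```

The (tile, gain) pairs produced by such a loop correspond, roughly, to the selection order and bar widths plotted in Fig. 5.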
Looking at Fig. 5, we see that if we know nothing about the data, the sscl tiles, which cover relatively large areas of 0s, provide much information, whereas the tiles of the other methods, depicting mainly structure of the 1s, are ranked much lower. Once the density of the data is included in B, we already know the data is relatively sparse, and subsequently tiles that identify relatively large areas with relatively many ones are initially most informative. For all non-empty background knowledge, we see the tiles of tiling, krimp, itt, and sscl being presented quite late in the process. This makes sense, as these methods are aimed at identifying interesting local structure, which in an iterative setting is only informative once we know the bigger picture.

Fig. 5 Selection order (from left to right), and gain in information (bar width) with regard to the target of the combined results of hyper, tiling, krimp, itt, asso, and sscl, for the DNA dataset. Top to bottom for background knowledge consisting resp. of the empty set, the data density, the column margins, and both column and row margins. The x-axis is the distance between the selected tiles so far and the full tile collection.

To reduce computation, Fitamin employs a heuristic to determine which tile to add next to its selection, as opposed to iteratively fitting maximum entropy models for every candidate and measuring differences between those. To investigate the quality of this heuristic, we ran a variant of Fitamin that does calculate exact distances for the Paleo dataset. While this variant is able to locally select the optimal addition, we only observe very slight differences in the final rankings; the main differences occur when we have the choice between two or more tiles providing approximately the same gain in information, and the tie is broken differently between the exact and the heuristic approach. When we take a look at the individual differences between estimates and real distances, we see the estimates are of very high quality; the largest absolute difference we recorded was only 0.007 (on a range of 0–1), which is close to numerical stability.

Next, for all datasets, using row and column margins as background knowledge, we show the selection order and gains in information in Fig. 6. As one would expect for datasets with different characteristics, different methods are best suited to extract the most information: there is no free lunch. Overall, and as expected, we observe that initially relatively large tiles not already explained away by the background knowledge are selected. As such, tiles from asso and hyper are often selected first, as these give a broad-strokes view of the data. Tiles that give much more local detail, or focus only on the 1s of the data, such as those mined by krimp and itt, are most informative later on. Do note that Fitamin does not give a qualitative ranking of results: our measure says nothing about interpretability, only about relative informativeness.

9 Discussion

The goal of this paper is to take a first step towards comparing the results of different data mining methods. The method we propose here is for results obtained on binary data. By developing maximum entropy modelling techniques further, however, the same basic idea can be applied to richer data types.
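To make this basic idea concrete for the binary case considered here: by Theorem 2 the maximum entropy model factorises over the individual entries of the data, and, as in the proof of Lemma 2, the Kullback–Leibler divergence between two such factorised models reduces to a sum of per-entry Bernoulli divergences. The following minimal sketch assumes the per-entry probabilities have already been computed; the values below are made up purely for illustration.

```python
import numpy as np

def bernoulli_kl(p, q, eps=1e-12):
    """Element-wise KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def model_kl(P, Q):
    """KL divergence between two entry-wise factorised models of a binary
    dataset, given as matrices of per-entry probabilities of seeing a 1."""
    return bernoulli_kl(P, Q).sum()

# Hypothetical per-entry probabilities for a tiny 4-by-5 dataset:
# P encodes a model in which one 2-by-3 tile is almost full of ones,
# Q encodes only the overall density of the data (same average as P).
P = np.full((4, 5), 0.3)
P[:2, :3] = 0.95
Q = np.full((4, 5), (14 * 0.3 + 6 * 0.95) / 20)

print(model_kl(P, Q))  # information in P that Q does not capture
```

The distances reported in this paper are built from such KL terms, normalised against the model induced by the background knowledge.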
A recent result by Kontonasios et al. (2011) formalises a maximum entropy model for real-valued data. If this model can be extended to handle real-valued tiles, our approach also becomes applicable to results on real-valued data. As for real-valued data much richer statistics than an average over a tile can be identified, future work will involve defining sufficiently rich theory for maximum entropy models that can take such statistics within tiles, such as correlations between attributes, into account as background information.

Besides measuring the divergence between results of different methods, using the Fruits algorithm we can identify high-quality redescriptions of results, and with Fitamin we can heuristically approximate the optimal ranking of partial results based on their iterative informativeness. While beyond the scope of this paper, the latter algorithm can also be used to rank complete results, i.e. tile sets. Moreover, our measure could be extended to choose the most informative result out of many runs of a randomised method, or to measure differences between results when varying the parameters of a method. Another, likely more challenging, extension would be to identify a 'centroid' or 'medoid' result that summarises a given large collection of results well.

Currently we rather straightforwardly convert results into tile sets. While we so capture much of the information they provide, we might not capture all information present in a result. Ideally, we would be able to encode structure beyond simple tile densities, such that the information captured in a result can be maintained more precisely when converted into sets of tiles. A desirable extension would be to be able to specify for a tile which attribute-value combinations occur how often within it, which would allow for better conversion of the results of attribute clustering (Mampaey and Vreeken 2010), as well as allow us to take simple itemset frequencies into account. A second refinement for future work would be to be able to specify complex structures in the data captured by statistics such as Lazarus counts (Fortelius et al. 2006) or nestedness (Mannila and Terzi 2007). Such refinements require further theory on maximum entropy modelling. More complex modelling aside, our general approach of comparing results by comparing how many possible datasets exhibit the identified structure remains the same.

Fig. 6 Selection order (from left to right), and gain in informativeness (width of the bars) with regard to the target tile set of the combined results of hyper, tiling, krimp, itt, asso, and sscl, for, from top to bottom, Abstracts, DNA, Mammals, and Paleo. The x-axis is the distance between the selected tiles so far and the full tile collection.

Our distance is based on a maximum entropy model, which uses the available information, that is, the given tiles, as efficiently as possible. For example, if the frequencies of the tiles in the second set can be derived from the first set, then the distance between these two sets will be 0. However, we do not take into account how complex these derivations are for a human analyst, and this might lead to some counterintuitive results. Another restriction of the maximum entropy model we currently employ is that it cannot recognise the informativeness of a data mining result identifying that the data is uniformly distributed.
While identifying that the data is uniformly distributed can be a highly informative and powerful result, our maximum entropy model essentially is the generalised uniform distribution under the given constraints; the uniform property hence follows naturally, constitutes no deviation, and is therefore not deemed informative. These restrictions are both due to our choice of modelling the space of all datasets by maximum entropy. While a well-founded choice, with many desirable properties, it is a choice nevertheless, and future work may identify other probabilistic models that circumvent the above-mentioned restrictions.

It is very important to note that our method is not a quality measure for data mining results. It solely measures the information shared between two sets of tiles; it does not measure the subjective quality of results, nor does it say anything about the ease of analysis of a result. Note that this is a good thing, and our explicit goal. Ease of analysis is highly subjective, being dependent on the expert and the available means. By our measure, the expert can check how informative different results are, and given this information decide which result to analyse in detail. If two results of different complexity of analysis are measured to be approximately equally informative, the logical decision would be to analyse the simpler one, and only invest the time and effort of considering the more complex result in detail if the measurement shows it will likely provide a lot of information not also available in the simpler result.

10 Conclusion

In this paper we discussed comparing the results of different exploratory data mining algorithms. We argued that any mining result identifies some properties of the data, and that, in an abstract way, we can identify all datasets for which these properties hold. By incorporating these properties into a model using the maximum entropy principle, we can measure the shared amount of information by Kullback–Leibler divergence. The measure we construct this way is flexible, and naturally allows including background knowledge, such that differences in results can be measured from the perspective of what a user already knows.

As a first step towards comparing results in general, we formalised our approach for binary data, showed that we can convert results into tiles, and discussed how to incorporate these into a maximum entropy model. Our approach provides a means to study and tell differences between results of different data mining methods. For applying the measure, we gave the Fruits algorithm for parameter-freely identifying which parts of a given set of results best redescribe a given (partial) result, and the Fitamin algorithm for iteratively finding the result that provides maximal novel information with regard to what we have learned so far. Experiments showed our measure gives meaningful results, correctly identifies methods that are similar in nature, automatically identifies sound redescriptions of results, and is highly applicable for iterative data mining.

Acknowledgments The authors wish to thank, in alphabetical order: Tijl De Bie for his information-theoretic noisy tile miner (De Bie 2011b); David Fuhry for his implementation of Hyper (Xiang et al. 2011); Matthijs van Leeuwen for major contributions to the implementation of Krimp (Vreeken et al. 2011); Stijn Ligot for running experiments with ProClus (Aggarwal et al. 1999; Müller et al. 2009); Michael Mampaey for the implementations of Attribute Clustering (Mampaey and Vreeken 2010, 2012) as well as mtv (Mampaey et al.
2012); Pauli Miettinen for his implementation of Asso (Miettinen 2008); Nikolaj Tatti and Jilles Vreeken are both supported by a Post-Doctoral Fellowship of the Research Foundation—Flanders (fwo). A Proofs for Theorems Proof of Theorem 2 Let p ∗ be the maximum entropy distribution. Define a distribution q p ∗ ((i, j) = D(i, j)). q(D) = i, j Lemma 1 now implies that q ∈ P so Theorem 1 implies that {D | p ∗ (D) = 0} ⊆ {D | q(D) = 0}. Assume that p ∗ (D) > 0. Then p ∗ ((i, j) = D(i, j)) > 0 and consequently q(D) > 0. This implies that {D | p ∗ (D) = 0} = {D | q(D) = 0}. Assume that p ∗ (D) > 0, then we can decompose the sums of fr(T ; D) in the exponentialform in Theorem 1 p ∗ (D) can be written as a product of potentials, p ∗ (D) ∝ i, j φi, j (D(i, j)). If we normalize each term by φi, j (1) + φi, j (0), we have that p ∗ (D) = i, j p ∗ ((i, j) = D(i, j)) for p ∗ (D) > 0. If on the other hand p ∗ ((a, b) = D(a, b)) = 0 for some p ∗ D = 0, then also q(D) = 0, so by definition ∗ ∗ (a, b). This implies that p (D) = 0 = i, j p ((i, j) = D(i, j)). Let us show that the individual terms have the stated form. Fix (i, j) such that 0 < p ∗ ((i, j) = 1) < 1. Let D be a dataset and let D be the dataset with flipped (i, j)th entry. Then q(D) > 0 if and only if q(D ) > 0 and hence p ∗ (D) > 0 if and only if p ∗ (D ) > 0. Divide the dataset space Dv = D ∈ D | D(i, j) = v, p ∗ (D) > 0 . The previous argument shows that for each D ∈ D1 , the dataset D with flipped entry D ∈ D0 . Theorem 1 implies that p ∗ (D)/ p ∗ (D ) = exp T ∈T (i, j) λT . We have now ∗ ∗ p ∗ ((i, j) = 1) D∈D1 p (D) D∈D1 p (D) λT . = = = exp ∗ ∗ ∗ p ((i, j) = 0) D∈D0 p (D) D∈D1 p (D ) T ∈T (i, j) The theorem follows by rearranging the terms. 123 Comparing apples and oranges 201 Proof of Theorem 3 We will prove the theorem by showing that pT∗ satisfies the conditions in Theorem 2. Let P be the distributions satisfying the frequencies. Let Z be the collection of datasets for which p(D) = 0 for every p ∈ P. Set Z = D | pT∗ (D) = 0 . Let q ∈ P and let (i, j) ∈ s(T ). Corollary 1 implies that q((i, j) = 1) = αT . This implies that Z ⊆ Z and since pT∗ ∈ P, we must have Z = Z . By setting λT = 0 for every tile we see that pT∗ has the exponential form given in Theorem 2. Hence, pT∗ is truly the maximum entropy distribution. Theorem 5 follows directly from the following lemma. Lemma 2 Let T and U be two tile sets such that s(U) ⊆ s(T ). Then KL(T U) = |s(T ) \ s(U)| log 2. Proof Let X = s(T ) \ s(U). Theorem 2 implies that KL(T U) = i, j v=0,1 p ∗ ((i, j) = v) pT∗ ((i, j) = v) log T∗ . pU ((i, j) = v) The only non-zero terms in the sum are the entries (i, j) ∈ X for which pT∗ ((i, j) = v) = 1. Hence we have KL(T U) = 1 × log (i, j)∈X 1 = |X | log 2. 1/2 This proves the lemma. To prove the next theorems we will need the following result. Lemma 3 Let T be a set of tiles. Let p ∗ be the corresponding maximum entropy model. Let q be a distribution such that fr(T ; q) = fr(T ; p ∗ ) for T ∈ T . Then q(D) log p ∗ (D) = λ0 + D∈D λT |s(T )|fr T ; p ∗ = −H p ∗ , T ∈T where λT are the weights as defined in Theorem 1, and λ0 is the normalisation constant of p ∗ . Proof Let Z = {D ∈ D | p ∗ (D) = 0}. Note that q(D) = 0 for any D ∈ Z. Applying Theorem 1 leads us to q(D) log p ∗ (D) = q(D) log p ∗ (D) D∈D D∈D \Z = q(D) λ0 + λT |s(T )|fr(T ; D) D∈D \Z = λ0 + T ∈T = λ0 + T ∈T λT |s(T )| q(D)fr(T ; D) D∈D \Z λT |s(T )|fr T ; p ∗ . T ∈T 123 202 N. Tatti, J. Vreeken This proves the left side of the equation. 
To prove the right side apply the same argu ment with q = p ∗ . Corollary 2 Let A and B be two tile collections such that B ⊆ A. Then KL(A B) = H (B) − H (A). ∗ and p ∗ be two maximum entropy distributions. Since fr T ; p ∗ = Proof Let pA B A ∗ fr T ; pB for any T ∈ B, we have KL(A B) = − D∈D ∗ ∗ pA (D) log pB (D) − H (A) = H (B) − H (A) , which proves the result. Proof of Theorem 6 The case for exact tiles follows directly from Theorem 5. The general case follows from Corollary 2 by first noticing that KL(M U ∪ B) = H (U ∪ B) − H (M) ≤ H (B) − H (M) = KL(M B) and similarly for KL(M T ∪ B). Thus we have d(T , U; B) = KL(M B) KL(M U ∪ B) + KL(M T ∪ B) ≤2 = 2, KL(M B) KL(M B) which proves the result. ∗ , p ∗ , p ∗ , and p ∗ Proof of Theorem 7 Define X = T ∪ B and Y = U ∪ B. Let pX Y B M be the corresponding maximum entropy distributions. Define a distribution q(D) ∝ ∗ (D) p ∗ (D). pX Y ∗ . To prove this we first show that q satisfies the tiles in M. We claim that q = pM Let e = (i, j) be an entry. Assume that e ∈ s(B). Since the tiles in B are exact, we have, due Corollary 1, ∗ ∗ ∗ (e = 1) = pY (e = 1) = pM (e = 1) = v, pX where v = 0, 1. ∗ (D) = 0, and consequently q(D) = 0, if Assume that v = 1. This means that pY D(e) = 0. This implies that q(e = 1) = 1. Similar argument holds for v = 0. Hence, q satisfies tiles from B. Assume now that e ∈ / s(B). Since q has an exponential form, q(e = 1) can be written in the form given in Theorem 2. Assume that e ∈ s(T ), then e ∈ / s(U). Assume that 0 < q(e = 1) < 1. This implies that the fraction given in Theorem 2 contains ∗ (e = 1) = q(e = 1). If q(e = 1) = 0, 1, only weights from the tiles in T . Hence, pX ∗ ∗ / s(Y), we must have then either pX (e = 1) = 0, 1 or pY (e = 1) = 0, 1. Since e ∈ ∗ (e = 1) = 1/2. Hence, p ∗ (e = 1) = q(e = 1). This implies that q satisfies tiles pY X in T . Similarly, q satisfies the tiles in U. Hence q satisfies the tiles in M. ∗ (D) = 0 ⊆ {D ∈ D | q(D) = 0}. Theorem 1 now implies that D ∈ D | pM ∗ (D) = 0 To prove the other direction, let D be such that q(D) = 0. Then pX 123 Comparing apples and oranges 203 ∗ (D) = 0. Assume that p ∗ (D) = 0. Theorem 1 implies that any distribuor pY X ∗ (D) = 0. Consequently, tion that satisfies X must have vanish at D. Hence, pM ∗ D ∈ D | pM (D) = 0 = {D ∈ D | q(D) = 0} Since q has the correct exponential form and satisfies the tiles in M, Theorem 1 ∗ . implies that q = pM We have the following properties ∗ ∗ ∗ (e = 1) = pX (e = 1) = pY (e = 1), e ∈ s(B) ⇒ q(e = 1) = pB ∗ ∗ ∗ e ∈ s(T ) ⇒ q(e = 1) = pX (e = 1) and pY (e = 1) = pB (e = 1), ∗ ∗ ∗ e ∈ s(U) ⇒ q(e = 1) = pY (e = 1) and pX (e = 1) = pB (e = 1), ∗ ∗ ∗ e∈ / s(M) ⇒ q(e = 1) = pB (e = 1) = pX (e = 1) = pY (e = 1). Theorem 4 now implies that KL(M B) = KL(X B) + KL(Y B) and KL(M X ) = KL(Y B) and KL(M Y) = KL(X B) , which proves the theorem. Proof of Theorem 8 Theorem 5 implies that we can prove the claim by showing that 1− |A ∩ B| |A ∩ C| |B ∩ C| ≤1− +1− , |A ∪ B| |A ∪ C| |B ∪ C| where A, B, and C are finite sets. We can rewrite the inequality as |A ∩ C| |B ∩ C| |A ∩ B| + − ≤ 1. |A ∪ C| |B ∪ C| |A ∪ B| (2) To prove the inequality we will modify the sets in such a way that it will only increase the left side. Assume that there exists x ∈ C \ (A ∪ B). Removing x from C will decrease the denominators of the first two terms by 1, and thus increase the left side of the inequality. Hence we can assume safely that C ⊆ A ∪ B. Assume that there exists x ∈ (A ∩ B) \ C. 
Adding x to C will increase the numerators of the first two terms by 1, and thus increase the left side of the inequality. Hence we can assume safely that A ∩ B ⊆ C. Now let us assume that we have three variables x ∈ A∩ B, y ∈ A\C, and z ∈ B \C. If we remove x from A and B and C, add z to C, and add y to C, then the numerators and the denominators of the first two terms will increase by 1 and the numerator and denominator of the third term will decrease by 1. Since x +1 x ≤ , y y+1 123 204 N. Tatti, J. Vreeken for any 0 ≤ x ≤ y = 0, it follows that the left side of Eq. 2 can only increase. We repeat this replacement until two cases happen (this is possible since A, B, and C are finite). Either A ∩ B = ∅ or A ⊆ C (or B ⊆ C). Assume that A ∩ B = ∅. Then Eq. 2 reduces to |A ∩ C| |B ∩ C| + ≤ 1. |A ∪ C| |B ∪ C| Assume that there exists x ∈ A \ C. Then we can remove x from A and only increase the first term. Hence we can assume that A ⊆ C and similarly B ⊆ C. This implies that C = A ∪ B, and we get |A| |B| |A| |B| |A ∩ C| |B ∩ C| + = + = + = 1, |A ∪ C| |B ∪ C| |C| |C| |A| + |B| |A| + |B| which proves the first case. To prove the second case, assume that A ⊆ C. This implies that A ∪ B ⊆ C ∪ B ⊂ A ∪ B ∪ B = A ∪ B, that is A ∪ B = B ∪ C. Assume that there exists x ∈ B \C. Removing x from B decreases the denominators of the second term and the third terms. We need to show that |A ∩ B| |B ∩ C| |A ∩ B| |B ∩ C| − ≥ − . |B ∪ C| − 1 |A ∪ B| − 1 |B ∪ C| |A ∪ B| Replacing B ∪ C with A ∪ B we have |A ∩ B| |B ∩ C| |A ∩ B| |B ∩ C| − ≥ − . |A ∪ B| − 1 |A ∪ B| − 1 |A ∪ B| |A ∪ B| Multiplying both sides with |A ∪ B|(|A ∪ B| − 1) leads us to |B ∩ C||A ∪ B| − |A ∩ B||A ∪ B| ≥ |B ∩ C|(|A ∪ B| − 1) − |A ∩ B|(|A ∪ B| − 1). Eliminating the terms leads us to inequality |B ∩ C| ≥ |A ∩ B| which is true because A ∩ B ⊆ C. Thus removing x from B can only increase the left side, hence we can safely assume that B ⊆ C, that is C = A∪B. This implies that A∪C = B∪C = A∪B, B ∩ C = C, and B ∩ C = B. But then |A| |B| |A ∩ B| |A ∩ C| |B ∩ C| |A ∩ B| + − = + − = 1, |A ∪ C| |B ∪ C| |A ∪ B| |A ∪ B| |A ∪ B| |A ∪ B| which proves the theorem. Proof of Theorem 9 Let S = T ∪ {U }. Corollary 2 implies that KL(M T ∪ B) = H (T ∪ B) − H (M) ≥ H (S ∪ B) − H (M) = KL(M S ∪ B) . 123 Comparing apples and oranges 205 Thus KL(M U ∪ B) + KL(M S ∪ B) KL(M B) KL(M U ∪ B) + KL(M T ∪ B) = d(T , U; B) , ≤ KL(M B) d(S, U; B) = which proves the result. Proof of Theorem 10 Since M = T ∪ B, the left side of the equation follows immediately. Write S = U ∪ B. To prove the right side apply Corollary 2 to obtain KL(M S) H (S) − H (M) H (S) − H (B) + H (B) − H (M) = = KL(M B) KL(M B) KL(M B) KL(S B) , = 1− KL(M B) which proves the result. Proof of Theorem 11 We prove the hardness by reducing ExactCover, testing whether a cover has a mutually disjoint cover, to the decision version of Redescribe. Assume that we are given a ground set S with N elements and a cover C = {C1 , . . . , C L } for S. To define the tiles and the target frequencies we will first define a particular dataset D. This dataset has N + 1 + L transactions of length N , an item representing an item in the ground set S. The first L transactions correspond to the cover sets, (i, j) entry is equal to 1 if and only if Ci contains the jth item. The last N + 1 transactions are full of ones. The tile sets are as follows. B = ∅. T contains one tile with the last N +1 transactions and all items. U = {U1 , . . . , U L } contains L tiles such that a(Ui ) = C i and t(Ui ) = {i, L + 1, . . . , L + N + 1}. 
Let V be a subset of U. Write f (V) = V ∈V |a(V )|. Theorem 5 now implies that d(V, T ; B) = (N + 1)(N − |a(V)|) + f (V) . (N + 1)N + f (V) If |a(V)| < N , then we can augment with V with a tile containing an item outside of a(V). This will reduce the first term in the numerator at least by N + 1 and increase the second term by N , at maximum. Thus such augmentation will decrease the distance and so we can safely assume that |a(V)| = N . This implies that d(V, T ; B) = f (V ) (N +1)N + f (V ) . This is a monotonic function of f (V). Since a(V) contains all items, f (V) ≥ N and is equal to N if and only if the tiles in V are disjoint. Hence ExactCover has a solution if and only if, there is a redescription with distance equal to N /(N (N + 1) + N ). 123 206 N. Tatti, J. Vreeken References Aggarwal CC, Wolf JL, Yu PS, Procopiuc C, Park JS (1999) Fast algorithms for projected clustering. In: Proceedings of the ACM international conference on management of data (SIGMOD), Philadelphia, PA, ACM, pp 61–72 Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th international conference on very large data bases (VLDB), Santiago de Chile, Chile, pp 487–499 Cover TM, Thomas JA (2006) Elements of information theory. Wiley, New York Csiszár I (1975) I-divergence geometry of probability distributions and minimization problems. Ann Probab 3(1):146–158 Darroch J, Ratcliff D (1972) Generalized iterative scaling for log-linear models. Ann Math Stat 43(5):1470– 1480 De Bie T (2011a) An information theoretic framework for data mining. In: Proceedings of the 17th ACM international conference on knowledge discovery and data mining (SIGKDD), San Diego, CA, ACM De Bie T (2011b) Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Discov 23:1–40 Fortelius M, Gionis A, Jernvall J, Mannila H (2006) Spectral ordering and biochronology of European fossil mammals. Paleobiology 32(2):206–214 Gallo A, Miettinen P, Mannila H (2008) Finding subgroups having several descriptions: algorithms for redescription mining. In: Proceedings of the 8th SIAM international conference on data mining (SDM), Atlanta, GA Garriga GC, Junttila E, Mannila H (2011) Banded structure in binary matrices. Knowl Inform Syst 28(1):197–226 Geerts F, Goethals B, Mielikäinen T (2004) Tiling databases. In: Proceedings of discovery science, pp 278–289 Gionis A, Mannila H, Seppänen JK (2004) Geometric and combinatorial tiles in 0–1 data. In: Proceedings of the 8th European conference on principles and practice of knowledge discovery in databases (PKDD), Pisa, Italy, pp 173–184 Hanhijärvi S, Ojala M, Vuokko N, Puolamäki K, Tatti N, Mannila H (2009) Tell me something I don’t know: randomization strategies for iterative data mining. In: Proceedings of the 15th ACM international conference on knowledge discovery and data mining (SIGKDD), Paris, France, ACM, pp 379–388 Hollmén J, Seppänen JK, Mannila H (2003) Mixture models and frequent sets: combining global and local methods for 0–1 data. In: Proceedings of the 3rd SIAM international conference on data mining (SDM), San Francisco, CA Jaynes E (1982) On the rationale of maximum-entropy methods. Proc IEEE 70(9):939–952 Knobbe A, Ho E (2006) Pattern teams. In: Proceedings of the 10th European conference on principles and practice of knowledge discovery in databases (PKDD), Berlin, Germany, vol 4213. 
Springer, pp 577–584 Kontonasios KN, De Bie T (2010) An information-theoretic approach to finding noisy tiles in binary databases. In: Proceedings of the 10th SIAM international conference on data mining (SDM), Columbus, OH, SIAM, pp 153–164 Kontonasios KN, Vreeken J, De Bie T (2011) Maximum entropy modelling for assessing results on realvalued data. In: Proceedings of the 11th IEEE international conference on data mining (ICDM), Vancouver, Canada, ICDM Kuncheva LI, Whitaker CJ (2003) Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Mach Learn 51(2):181–207 MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statististics and probability (Berkeley, Calif., 1965/66), vol I: Statistics. Univ. California Press, Berkeley, pp 281–297 Mampaey M, Vreeken J (2010) Summarising data by clustering items. In: Proceedings of the European conference on machine learning and principles and practice of knowledge discovery in databases (ECML PKDD), Barcelona, Spain. Springer, pp 321–336 Mampaey M, Vreeken J (2012) Summarising categorical data by clustering attributes. Data Min Knowl Discov (in press) Mampaey M, Tatti N, Vreeken J (2012) Succinctly summarizing data with itemsets. ACM Trans Knowl Discov Data (in press) 123 Comparing apples and oranges 207 Mannila H, Terzi E (2007) Nestedness and segmented nestedness. In: Proceedings of the 13th ACM international conference on knowledge discovery and data mining (SIGKDD), San Jose, CA, ACM, p 489 Miettinen P (2008) On the positive-negative partial set cover problem. Inform Process Lett 108(4):219–221 Miettinen P, Mielikäinen T, Gionis A, Das G, Mannila H (2008) The discrete basis problem. IEEE Trans Knowl Data Eng 20(10):1348–1362 Mitchell-Jones A, Amori G, Bogdanowicz W, Krystufek B, Reijnders PH, Spitzenberger F, Stubbe M, Thissen J, Vohralik V, Zima J (1999) The Atlas of European Mammals. Academic Press Müller E, Günnemann S, Assent I, Seidl T (2009) Evaluating clustering in subspace projections of high dimensional data. In: Proceedings of the 35th international conference on very large databases (VLDB), Lyon, France, pp 1270–1281 Myllykangas S, Himberg J, Böhling T, Nagy B, Hollmén J, Knuutila S (2006) DNA copy number amplification profiling of human neoplasms. Oncogene 25(55): 7324–7332 Pensa RG, Robardet C, Boulicaut JF (2005) A bi-clustering framework for categorical data. In: Proceedings of the 9th European conference on principles and practice of knowledge discovery in databases (PKDD), Porto, Portugal, pp 643–650 Puolamäki K, Hanhijärvi S, Garriga GC (2008) An approximation ratio for biclustering. Inform Process Lett 108(2):45–49 Ramakrishnan N, Kumar D, Mishra B, Potts M, Helm RF (2004) Turning cartwheels: an alternating algorithm for mining redescriptions. In: Proceedings of the 10th ACM international conference on knowledge discovery and data mining (SIGKDD), Seattle, WA, pp 266–275 Rasch G (1960) Probabilistic Models for Some Intelligence and Attainnment Tests. Danmarks paedagogiske Institut Sammon J (1969) A nonlinear mapping for data structure analysis. IEEE Trans Comput 18:401–409 Siebes A, Vreeken J, van Leeuwen M (2006) Item sets that compress. In: Proceedings of the 6th SIAM international conference on data mining (SDM), Bethesda, MD, SIAM, pp 393–404 Tatti N (2006) Computational complexity of queries based on itemsets. Inform Process Lett 98(5):183–187. 
doi:10.1016/j.ipl.2006.02.003 Tatti N (2007) Distances between data sets based on summary statistics. J Mach Learn Res 8:131–154 Tatti N, Vreeken J (2011) Comparing apples and oranges: measuring differences between data mining results. In: Proceedings of the European conference on machine learning and principles and practice of knowledge discovery in databases (ECML PKDD), Athens, Greece. Springer, pp 398–413 Vreeken J, van Leeuwen M, Siebes A (2007) Characterising the difference. In: Proceedings of the 13th ACM international conference on knowledge discovery and data mining (SIGKDD), San Jose, CA, pp 765–774 Vreeken J, van Leeuwen M, Siebes A (2011) Krimp: mining itemsets that compress. Data Min Knowl Discov 23(1):169–214 Wang C, Parthasarathy S (2006) Summarizing itemset patterns using probabilistic models. In: Proceedings of the 12th ACM international conference on knowledge discovery and data mining (SIGKDD), Philadelphia, PA, pp 730–735 Xiang Y, Jin R, Fuhry D, Dragan F (2011) Summarizing transactional databases with overlapped hyperrectangles. Data Min Knowl Discov 23(2):215–251 Zaki MJ, Ramakrishnan N (2005) Reasoning about sets using redescription mining. In: Proceedings of the 11th ACM international conference on knowledge discovery and data mining (SIGKDD), Chicago, IL, ACM, pp 364–373. doi:10.1145/1081870.1081912
