
What Is Data Mining? 1.2 the Data-mining Communities

88 Citations, 2023

This chapter focuses on the challenges that appear when the data is large and the computations are complex. In this view, data mining can be thought of as algorithms for executing very complex queries on non-main-memory data.

Abstract

Originally, "data mining" was a statistician's term for overusing data to draw invalid inferences. Bonferroni's theorem warns us that if there are too many possible conclusions to draw, some will be true for purely statistical reasons, with no physical validity.

A famous example: David Rhine, a "parapsychologist" at Duke in the 1950s, tested students for "extrasensory perception" by asking them to guess 10 cards, each red or black. He found that about 1/1000 of them guessed all 10 correctly and, instead of realizing that this is what you would expect from random guessing (the chance of 10 correct red/black guesses is 1/2^10 = 1/1024), declared them to have ESP. When he retested them, he found they did no better than average. His conclusion: telling people they have ESP causes them to lose it!

Our definition: "discovery of useful summaries of data."

1.1 Applications

Some examples of "successes":

1. Decision trees constructed from bank-loan histories to produce algorithms that decide whether to grant a loan.
2. Patterns of traveler behavior mined to manage the sale of discounted seats on planes, rooms in hotels, etc.
3. "Diapers and beer." The observation that customers who buy diapers are more likely than average to buy beer allowed supermarkets to place beer and diapers near each other, knowing many customers would walk between them. Placing potato chips between them increased sales of all three items.
4. Skycat and Sloan Sky Survey: clustering sky objects by their radiation levels in different bands allowed astronomers to distinguish between galaxies, nearby stars, and many other kinds of celestial objects.
5. Comparison of the genotypes of people with and without a condition allowed the discovery of a set of genes that together account for many cases of diabetes. This sort of mining will become much more important as the human genome is constructed.

1.2 The Data-Mining Communities

As data mining has become recognized as a powerful tool, several different communities have laid claim to the subject:

1. Statistics.
2. AI, where it is called "machine learning."
3. Researchers in clustering algorithms.
4. Visualization researchers.
5. Databases. We'll be taking this approach, of course, concentrating on the challenges that appear when the data is large and the computations complex.

In a sense, data mining can be thought of as algorithms for executing very complex queries on non-main-memory data.
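
The arithmetic behind the ESP example can be checked with a short sketch. It assumes each card guess is an independent fair coin flip. The notes do not say how many students were tested, so the cohort size of 1000 below is purely hypothetical, as are the function names.

```python
def p_all_correct(n_cards: int = 10) -> float:
    """Probability of guessing every red/black card correctly by pure chance."""
    return 0.5 ** n_cards


def expected_perfect_guessers(n_students: int, n_cards: int = 10) -> float:
    """Expected number of students who guess all cards correctly by chance."""
    return n_students * p_all_correct(n_cards)


def p_at_least_one_perfect(n_students: int, n_cards: int = 10) -> float:
    """Probability that at least one student guesses all cards by chance."""
    return 1.0 - (1.0 - p_all_correct(n_cards)) ** n_students


if __name__ == "__main__":
    n_students = 1000  # hypothetical cohort size, not taken from the notes
    print(p_all_correct())                        # 1/1024, roughly 1 in 1000
    print(expected_perfect_guessers(n_students))  # a bit under one student
    print(p_at_least_one_perfect(n_students))     # better than even odds
```

With that many students, a perfect score from someone is more likely than not, which is the point of the Bonferroni warning: test enough candidates and some will look remarkable for purely statistical reasons.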
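
The "diapers and beer" observation, that diaper buyers are more likely than average to buy beer, can be made concrete by comparing the fraction of diaper baskets that contain beer with the fraction of all baskets that contain beer. The following is a minimal sketch of that comparison (often called confidence and lift). The baskets are invented toy data, and the helper names are illustrative rather than taken from the notes.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets if items <= basket) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Among baskets containing `antecedent`, the fraction also containing `consequent`.

    Assumes `antecedent` appears in at least one basket.
    """
    return support(baskets, {antecedent, consequent}) / support(baskets, {antecedent})


def lift(baskets, antecedent, consequent):
    """Ratio of confidence to the baseline rate of `consequent`; above 1 means 'more than average'.

    Assumes `consequent` appears in at least one basket.
    """
    return confidence(baskets, antecedent, consequent) / support(baskets, {consequent})


# Hypothetical toy data: each set is one customer's basket.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"bread"},
    {"milk"},
    {"diapers", "beer", "bread"},
]

if __name__ == "__main__":
    print(support(baskets, {"beer"}))              # baseline beer rate: 0.5
    print(confidence(baskets, "diapers", "beer"))  # beer rate among diaper buyers: 0.75
    print(lift(baskets, "diapers", "beer"))        # 1.5
```

This version scans the whole basket list in memory for every query. For the large, non-main-memory data these notes are concerned with, the counts would have to be gathered in passes over data on disk.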