1 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6) 03/04/06
Data Mining
Practical Machine Learning Tools and Techniques
Slides for Chapter 6 of Data Mining by I. H. Witten and E. Frank
Implementation:
Real machine learning schemes
- Decision trees
♦ From ID3 to C4.5 (pruning, numeric attributes, ...)
- Classification rules
♦ From PRISM to RIPPER and PART (pruning, numeric data, ...)
- Extending linear models
♦ Support vector machines and neural networks
- Instance-based learning
♦ Pruning examples, generalized exemplars, distance functions
- Numeric prediction
♦ Regression/model trees, locally weighted regression
- Clustering
♦ Hierarchical, incremental, probabilistic
- Bayesian networks
♦ Learning and prediction, fast data structures for learning
Industrial-strength algorithms
- For an algorithm to be useful in a wide range of real-world applications it must:
♦ Permit numeric attributes
♦ Allow missing values
♦ Be robust in the presence of noise
♦ Be able to approximate arbitrary concept descriptions (at least in principle)
- Basic schemes need to be extended to fulfill these requirements
Decision trees
- Extending ID3:
♦ to permit numeric attributes: straightforward
♦ to deal sensibly with missing values: trickier
♦ stability for noisy data: requires a pruning mechanism
- End result: C4.5 (Quinlan)
♦ Best-known and (probably) most widely-used learning algorithm
♦ Commercial successor: C5.0
Numeric attributes
- Standard method: binary splits
♦ E.g. temp < 45
- Unlike nominal attributes, every numeric attribute has many possible split points
- Solution is a straightforward extension:
♦ Evaluate info gain (or other measure) for every possible split point of the attribute
♦ Choose “best” split point
♦ Info gain for best split point is info gain for the attribute
- Computationally more demanding
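The split-point search above can be sketched in a few lines of Python. This is a minimal illustration, not the C4.5 implementation; `entropy` and `best_split` are hypothetical helper names. It sorts the instances by attribute value and evaluates the info gain of the midpoint between each pair of adjacent distinct values:

```python
import math

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(p * math.log2(p)
                for c in set(labels)
                for p in [labels.count(c) / n])

def best_split(values, labels):
    """Find the binary split point on a numeric attribute that
    maximizes information gain over the candidate midpoints."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_point = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no split between equal attribute values
        point = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v < point]
        right = [l for v, l in pairs if v >= point]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best_gain:
            best_gain, best_point = gain, point
    return best_point, best_gain
```

For example, `best_split([1, 2, 3, 4], ["no", "no", "yes", "yes"])` finds the split at 2.5, which separates the classes perfectly. Note the source of the extra cost the slide mentions: nominal attributes are evaluated once, while a numeric attribute requires a sort plus one gain computation per candidate point.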
Weather data (again!)
Outlook    Temperature  Humidity  Windy  Play
Sunny      Hot          High      False  No
Sunny      Hot          High      True   No
Overcast   Hot          High      False  Yes
Rainy      Mild         Normal    False  Yes
…          …            …         …      …

If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
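The ordered rule set on this slide can be sketched directly as a chain of conditionals, with the last rule acting as the default. A minimal illustration (the function name `classify` is an assumption, not from the book):

```python
def classify(outlook, humidity, windy):
    """Apply the weather rules in order; temperature never
    appears in the rules, so it is not a parameter here."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # "if none of the above then play = yes"
```

Checking it against the table rows above: `classify("sunny", "high", False)` gives `"no"` and `classify("overcast", "high", False)` gives `"yes"`, matching the data.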