Inês Dutra, ines@dcc.fc.up.pt. Office: 1.31. Office hours: Mon, 10-12


SLIDE 1

Data Mining: Presentation

Inês Dutra
ines@dcc.fc.up.pt
Office: 1.31
Office hours: Mon, 10-12 am; Fri, 2-4 pm

SLIDE 2

Evaluation

• Assignments (2): 8 points
• 2 Tests:
  – Nov 6th
  – Dec 18th
• OR Exam: 12 points
• Best score between Test and Exam is considered
• Paper reading and discussion

SLIDE 3

Communication

• In person
• Email: ines@dcc.fc.up.pt
  (PLEASE, DO NOT SEND EMAIL TO dutra@fc.up.pt)
• Always use the subject prefix DM1 in your messages
• Sign your messages, so that I can identify you by more than a number :)
• Other means:
  – Moodle (warnings, news, and forum)
  – dm1-1516@dcc.fc.up.pt
• Course web page:
  http://www.dcc.fc.up.pt/~ines/aulas/1516/DM1/DM1.html

SLIDE 4

Syllabus

• What is data mining?
• Data versus knowledge
• Kinds of data
• Phases of data mining
• Data Preprocessing
• Descriptive Statistics
• Association rules
• Clustering
• Predictive Models
• Performance Metrics and model validation

SLIDE 5

Bibliography

• Data Mining: Concepts and Techniques (3rd ed.)
  Jiawei Han, Micheline Kamber and Jian Pei
• Introduction to Data Mining
  Pang-Ning Tan, Michael Steinbach and Vipin Kumar

SLIDE 6

Resources

• For programming and libraries
  – R and stats and machine learning packages
  – PyML
• For data visualization and machine learning
  – WEKA
  – KNIME
  – RapidMiner
• For relational learning
  – Aleph and YAP
  – GILPS

SLIDE 7

Useful links

• KDnuggets: http://www.kdnuggets.com
• Data Sets at UCI: http://archive.ics.uci.edu/ml/
• http://www.acm.org/sigs/sigkdd/explorations/
• https://www.kaggle.com/

SLIDE 8

The Homo Platypus :)

(excellent insight by Carlos Somohano, Founder of DataScience London)

Diagram labels: Hacking, Machine Learning, Math, Science, Programming, Visualization, Data Mining, Statistics

SLIDE 9


More commonly called: Data Scientist!

SLIDE 10

Requirements

• Willingness to learn
• Lots of patience
  – Interact with other areas
  – Data preprocessing
• Creativity
• Rigor and correctness

Let’s have fun!

SLIDE 11

Data versus knowledge

• Data:
  – refer to single and primitive instances (single objects, people, events, points in time, etc.)
  – describe individual properties
  – are often easy to collect or to obtain (e.g., checkout scanners, the internet, etc.)
  – do not allow us to make predictions or forecasts

SLIDE 12

Data versus knowledge

• Knowledge:
  – refers to classes of instances (sets of...)
  – describes general patterns, structures, laws, principles, etc.
  – consists of as few statements as possible
  – is often difficult and time-consuming to find or to obtain
  – allows us to make predictions and forecasts
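The contrast between data and knowledge can be made concrete with a small sketch. It is only an illustration: the measurements are invented, and the choice of a straight line through the origin fitted by least squares is an assumption, not something taken from the course material. The individual pairs are data; the single fitted statement is a (very modest) piece of knowledge that allows a forecast.

```python
# Hypothetical measurements: each (x, y) pair is one primitive instance (data).
observations = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]


def fit_slope(points):
    """Least-squares slope of a line through the origin, y ~ slope * x."""
    numerator = sum(x * y for x, y in points)
    denominator = sum(x * x for x, _ in points)
    return numerator / denominator


# One compact statement summarising all instances (knowledge).
slope = fit_slope(observations)


def predict(x):
    """Forecast y for an x that was never observed."""
    return slope * x


print(f"y ~ {slope:.2f} * x; forecast at x=10: {predict(10.0):.1f}")
```

The four pairs alone say nothing about x = 10; the fitted rule does, which is the sense in which knowledge "allows us to make predictions and forecasts".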

SLIDE 13

Criteria to assess Knowledge

• correctness (probability, success in tests)
• generality (domain and conditions of validity)
• usefulness (relevance, predictive power)
• comprehensibility (simplicity, clarity, parsimony)
• novelty (previously unknown, unexpected)
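Two of these criteria lend themselves to simple measurement. The sketch below is a hedged illustration, with invented records and a hypothetical rule ("hot days have high sales", with an assumed threshold of 30 degrees): coverage stands in for generality (how much of the domain the rule speaks about) and accuracy for correctness (how often it is right where it applies). Usefulness, comprehensibility and novelty usually need human judgement and are not covered here.

```python
# Hypothetical labelled records: (temperature_celsius, sales_were_high).
records = [(31, True), (35, True), (28, False), (33, False), (22, False), (30, True)]


def rule_applies(temperature):
    """Condition of the candidate rule; the threshold is an assumption."""
    return temperature >= 30


def assess(rule, data):
    """Return (coverage, accuracy) of the rule 'condition -> sales were high'."""
    covered = [label for value, label in data if rule(value)]
    # Generality: share of all cases the rule makes a statement about.
    coverage = len(covered) / len(data)
    # Correctness: share of covered cases where the rule's prediction holds.
    accuracy = sum(covered) / len(covered) if covered else 0.0
    return coverage, accuracy


coverage, accuracy = assess(rule_applies, records)
print(f"coverage={coverage:.2f}, accuracy={accuracy:.2f}")
```

Raising the threshold would typically trade generality for correctness, which is why the criteria are weighed together rather than optimised one at a time.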

SLIDE 14

• In the science domain, focus is on:
  – correctness, generality and simplicity
• In economy and industry, focus is on:
  – usefulness, comprehensibility and novelty

“We are drowning in information, but starving for knowledge” (John Naisbitt)