Introduction to Data Warehousing and Business Intelligence Slides - - PowerPoint PPT Presentation

introduction to data warehousing and business intelligence
SMART_READER_LITE
LIVE PREVIEW

Introduction to Data Warehousing and Business Intelligence Slides - - PowerPoint PPT Presentation

Introduction to Data Warehousing and Business Intelligence Slides adapted from Torben Bach Pedersen Course Structure Business intelligence Extract knowledge from large amounts of data collected in a modern enterprise Data


slide-1
SLIDE 1

Introduction to Data Warehousing and Business Intelligence

Slides adapted from Torben Bach Pedersen

slide-2
SLIDE 2

Aalborg University 2008 - DWML course 2

Course Structure

  • Business intelligence
  • Extract knowledge from large amounts of data collected in a

modern enterprise

  • Data warehousing, machine learning
  • Purpose
  • Acquire theoretical background in lectures and literature studies
  • Obtain practical experience on (industrial) tools in a mini-project

Data warehousing: construction

  • f a database with only data

analysis purpose Machine learning: find patterns automatically in databases Business Intelligence (BI)

slide-3
SLIDE 3

Aalborg University 2008 - DWML course 3

Contact Information

  • Data warehousing

Teacher: Ken

(Man Lung YIU)

Office: 3.2.48

Email: mly@cs.aau.dk

  • Machine learning

Teacher: Thomas D. Nielsen Office: 2.2.03

Email: tdn@cs.aau.dk

  • Course homepage:

http://www.cs.aau.dk/~tdn/Teaching/DWML08/

Lecture slides, mini-project, ……

slide-4
SLIDE 4

Aalborg University 2008 - DWML course 4

Literature for Data Warehousing

  • No textbook
  • Books (selected pages available in the class)

The Data Warehouse Lifecycle Toolkit, Kimball et. al.,

Wiley 1998

Fundamentals of Data Warehousing, Jarke et. al.,

Springer Verlag 2003

  • Additional references/articles:

To be posted at course homepage

slide-5
SLIDE 5

Aalborg University 2008 - DWML course 5

Mini-Project and Exam

  • Mini-project

Performed in groups of ~3 persons Documented in report of 20 pages Firm Deadline: April 20

◆ The homepage also shows the soft deadline of each task

  • Exam (information from last year)

Individual oral exam, for 20 minutes

◆ 8 minutes of DW questions ◆ 8 minutes of ML questions

Mini-project report as the basis for discussion Exam also covers theoretical background in lectures and

literature

More details at the end of the course

slide-6
SLIDE 6

Aalborg University 2008 - DWML course 6

Overview

  • Why Business Intelligence?
  • Data analysis problems
  • Data Warehouse (DW) introduction
  • DW Topics

Multidimensional modeling ETL Performance optimization

slide-7
SLIDE 7

Aalborg University 2008 - DWML course 7

What is Business Intelligence (BI)?

  • BI is different from Artificial Intelligence (AI)

AI systems make decisions for the users BI systems help the users make the right decisions,

based on available data

  • Combination of technologies

Data Warehousing (DW) On-Line Analytical Processing (OLAP) Data Mining (DM) ……

slide-8
SLIDE 8

Aalborg University 2008 - DWML course 8

Why BI Important?

  • Worldwide BI revenue in 2005 = US$ 5.7 billion

10% growth each year

  • The Web makes BI more necessary

Customers do not appear “physically” in the store Customers can change to other stores more easily

  • Thus:

Know your customers using data and BI! Utilize Web logs, analyze customer behavior in more detail

than before (e.g., what was not bought?)

Combine web data with traditional customer data

slide-9
SLIDE 9

Aalborg University 2008 - DWML course 9

Data Analysis Problems

  • The same data found in many different systems

Example: customer data across different departments The same concept is defined differently

  • Heterogeneous sources

Relational DBMS, On-Line Transaction Processing (OLTP) Unstructured data in files (e.g., MS Excel) and documents

(e.g., MS Word)

slide-10
SLIDE 10

Aalborg University 2008 - DWML course 10

Data Analysis Problems (cont’)

  • Data is suited for operational systems

Accounting, billing, etc. Do not support analysis across business functions

  • Data quality is bad

Missing data, imprecise data, different use of systems

  • Data are “volatile”

Data deleted in operational systems (6 months) Data change over time – no historical information

slide-11
SLIDE 11

Aalborg University 2008 - DWML course 11

Data Warehousing

  • Solution: new analysis environment (DW) where data are

Subject oriented (versus function oriented) Integrated (logically and physically) Time variant (data can always be related to time) Stable (data not deleted, several versions) Supporting management decisions (different organization)

  • A good DW is a prerequisite for successful BI
slide-12
SLIDE 12

Aalborg University 2008 - DWML course 12

DW: Purpose and Definition

  • DW is a store of information organized in a unified

data model

  • Data collected from a number of different sources

Finance, billing, web logs, personnel, …

  • Purpose of a data warehouse (DW): support

decision making

  • Easy to perform advanced analysis

Ad-hoc analysis and reports

◆ We will cover this soon ……

Data mining: discovery of hidden patterns and trends

slide-13
SLIDE 13

Aalborg University 2008 - DWML course 13

DW Architecture – Data as Materialized Views

DB DB DB DB DB

Appl. Appl. Appl. Trans.

DW DM DM DM

OLAP Visua- lization Appl. Appl. Data mining

(Local) Data Marts (Global) Data Warehouse Existing databases and systems (OLTP) New databases and systems (OLAP)

Analogy: (data) suppliers ↔ warehouse ↔ (data) consumers

slide-14
SLIDE 14

Aalborg University 2008 - DWML course 14

Function- vs. Subject Orientation

DB DB DB DB DB

Appl. Appl. Appl. Trans.

DW DM DM DM

D-Appl. D-Appl. Appl. Appl. D-Appl.

Function-oriented systems Selected subjects All subjects, integrated Subject-oriented systems

slide-15
SLIDE 15

Aalborg University 2008 - DWML course 15

Central DW Architecture

  • All data in one, central DW
  • All client queries directly on the

central DW

  • Pros
  • Simplicity
  • Easy to manage
  • Cons
  • Bad performance due to no

redundancy/ workload distribution

Central DW Source Source

Clients

slide-16
SLIDE 16

Aalborg University 2008 - DWML course 16

Federated DW Architecture

  • Data stored in separate data marts,

aimed at special departments

  • Logical DW (i.e., virtual)
  • Data marts contain detail data
  • Pros
  • Performance due to distribution
  • Cons
  • More complex

Logical DW Source Source

Clients

Finance mart Mrktng mart Distr. mart

slide-17
SLIDE 17

Aalborg University 2008 - DWML course 17

Tiered Architecture

  • Central DW is materialized
  • Data is distributed to data marts in
  • ne or more tiers
  • Only aggregated data in cube tiers
  • Data is aggregated/reduced as it

moves through tiers

  • Pros
  • Best performance due to

redundancy and distribution

  • Cons
  • Most complex
  • Hard to manage

2000 2001 Aalborg Copenhagen Milk Bread 123 127 57 45 56 67 211

2000 2001 Aalborg Copenhagen Milk Bread 123 127 57 45 56 67 211 2000 2001 Aalborg Copenhagen Milk Bread 123 127 57 45 56 67 211

Central DW

2000 2001 Aalborg Copenhagen Milk Bread 123 127 57 45 56 67 211

2000 2001 Aalborg Copenhagen Milk Bread 123 127 57 45 56 67 211

slide-18
SLIDE 18

Aalborg University 2008 - DWML course 18

Queries Hard/Infeasible for OLTP

  • Business analysis

In the past five years, which product is the most profitable? Which public holiday we have the largest sales? Which week we have the largest sales? Does the sales of dairy products increase over time?

  • Difficult to represent these queries by using SQL

3rd query: extract the “week” value using a function

◆ But the user has to learn many transformation functions …

4th query: use a “special” table to store IDs of all dairy products,

in advance

◆ We have many other product types as well …

  • The need of multidimensional modeling
slide-19
SLIDE 19

Aalborg University 2008 - DWML course 19

Multidimensional Modeling

  • Example: sales of supermarkets
  • Facts and measures

Each sales record is a fact, and its sales value is a

measure

  • Dimensions

Each sales record is associated with its values of

Product, Store, Time

Correlated attributes grouped into the same dimension

easier for analysis tasks

5.75 1997 Maj 25 Århus Århus Trøjborg Beverage Beer Top Sales Year Month Day County City Store Category Type Product

Product Store Time

slide-20
SLIDE 20

Aalborg University 2008 - DWML course 20

Multidimensional Modeling

  • How do we model the Time dimension?

A tree structure, with multiple levels Attributes, e.g., holiday, event

  • Advantage of this model?

Easy for query (more about this later)

  • Disadvantage?

Data redundancy (controlled redundancy is acceptable)

Day Week Month Year T

… Yes 2008 1 1 2 2 … … … … … … … … No 2008 1 1 1 1 … work day year month week day tid

slide-21
SLIDE 21

Aalborg University 2008 - DWML course 21

OLTP vs. OLAP

not necessary necessary Transactional recovery not unified SQL Query language query operations update operations Optimized for business analysis

  • perational needs

Target large, historical data small, operational data Data denormalized/ multidimensional normalized Model large small Queries infrequent and batch frequent and small Updates OLAP OLTP

slide-22
SLIDE 22

Aalborg University 2008 - DWML course 22

Quick Review: Normalized Database

05-02-2008 6.00 Beverage Beer 3302 02-02-2008 4.00 Cereal Rice 3301 02-02-2008 6.00 Beverage Beer 3301 07-02-2008 5.00 Cereal Wheat 3303 Date Price Category Product Customer ID

  • Normalized database avoids

Redundant data Modification anomalies

  • How to get the original table? (join them)
  • No redundancy in OLTP, controlled redundancy in OLAP

05-02-2008 013 3302 02-02-2008 052 3301 02-02-2008 013 3301 07-02-2008 067 3303 Date ProductID Customer ID 4.00 Cereal Rice 052 6.00 Beverage Beer 013 5.00 Cereal Wheat 067 Price Category Product ProductID

slide-23
SLIDE 23

Aalborg University 2008 - DWML course 23

OLAP Data Cube

  • Data cube

Useful data analysis tool in DW Generalized GROUP BY queries Aggregate facts based on chosen

dimensions

◆ Product, store, time dimensions ◆ Sales measure of sale facts

  • Why data cube?

Good for visualization (i.e., text

results hard to understand)

Multidimensional, intuitive Support interactive OLAP

  • perations
  • How is it different from a

spreadsheet?

slide-24
SLIDE 24

Aalborg University 2008 - DWML course 24

On-Line Analytical Processing (OLAP)

  • On-Line Analytical Processing

Interactive analysis Explorative discovery Fast response times required

  • OLAP operations/queries

Aggregation, e.g., SUM Starting level, (Year, City) Roll Up: Less detail Drill Down: More detail Slice/Dice: Selection, Year=2000

102 250 All Time

9 10 11 15

slide-25
SLIDE 25

Aalborg University 2008 - DWML course 25

Advanced Multidimensional Modeling

  • Changing dimensions

Some dimensions are not static. They can change over

  • time. E.g.,

◆ Time dimension ◆ A new store opens, or an existing store closes ◆ The price of a product changes

How do we handle these changes?

  • Large-scale dimensional modeling

How do we coordinate the dimensions in different data

cubes and data marts?

+ + + + Profit + + Costs + + + Sales Supplier Product Customer Time

Dimensions Data marts Bus architecture

slide-26
SLIDE 26

Aalborg University 2008 - DWML course 26

Top-down vs. Bottom-up

DB DB DB DB DB

Appl. Appl. Appl. Trans.

DW DM DM DM

D-Appl. D-Appl. Appl. Appl. D-Appl.

Top-down:

  • 1. Design of DW
  • 2. Design of DMs

Bottom-up:

  • 1. Design of DMs
  • 2. Maybe integration
  • f DMs in DW
  • 3. Maybe no DW

In-between:

  • 1. Design of DW for

DM1

  • 2. Design of DM2 and

integration with DW

  • 3. Design of DM3 and

integration with DW

  • 4. ...
slide-27
SLIDE 27

Aalborg University 2008 - DWML course 27

Extract, Transform, Load (ETL)

  • “Getting multidimensional data into the DW”
  • Problems

Data from different sources Data with different formats Handling of missing data and erroneous data

  • ETL

Extract Transformations / cleansing Load

  • The most time-consuming process in DW development

80% of development time spent on ETL

slide-28
SLIDE 28

Aalborg University 2008 - DWML course 28

Data’s Way To DW

  • Extraction

Extract from many heterogeneous systems

  • Staging area

Large, sequential bulk operations flat files best?

  • Cleansing

Data checked for missing parts and erroneous values Default values provided and out-of-range values marked

  • Transformation

Data transformed to decision-oriented format Data from several sources merged, optimize for querying

  • Aggregation?

Are individual business transactions needed in the DW?

  • Loading into DW

Large bulk loads rather than SQL INSERTs (Why?) Fast indexing (and pre-aggregation) required

slide-29
SLIDE 29

Aalborg University 2008 - DWML course 29

Performance Optimization

  • Performance optimization

Fine tune performance for important queries Aggregates, indexing, other optimizations (environment,

partitioning)

  • Using aggregates

How can aggregates improve performance?

  • Choosing aggregates

Which aggregates should we materialize?

slide-30
SLIDE 30

Aalborg University 2008 - DWML course 30

Materialization Example

  • Imagine 1 billion sales rows, 1000 products, 100

locations

  • CREATE VIEW TotalSales (pid,locid,total) AS

SELECT s.pid,s.locid,SUM(s.sales) FROM Sales s GROUP BY s.pid,s.locid

  • The materialized view has 100,000 rows
  • Rewrite the query to use the view
  • SELECT p.category,SUM(s.sales) FROM Products p,

Sales s WHERE p.pid=s.pid GROUP BY p.category

◆ can be rewritten to

  • SELECT p.category,SUM(t.total) FROM Products p,

TotalSales t WHERE p.pid=t.pid GROUP BY p.category

◆ Query becomes 10,000 times faster!

40 3 2 3 … … … … 1 1 locid 2 10 1 1 20 1 sales pid tid

Sales

40 3 2 … … … 1 locid 30 1 sales pid

VIEW TotalSales

1 billion rows 100,000 rows

slide-31
SLIDE 31

Aalborg University 2008 - DWML course 31

Common DW Issues

  • Metadata management

Need to understand data = metadata needed Greater need in OLAP than in OLTP as “raw” data is used Need to know about:

◆ Data definitions, dataflow, transformations, versions, usage, security

  • DW project management

DW projects are large and different from ordinary SW projects

◆ 12-36 months and US$ 1+ million per project ◆ Data marts are smaller and “safer” (bottom up approach)

Reasons for failure

◆ Lack of proper design methodologies ◆ High HW+SW cost ◆ Deployment problems (lack of training) ◆ Organizational change is hard… (new processes, data ownership,..) ◆ Ethical issues (security, privacy,…)

slide-32
SLIDE 32

Aalborg University 2008 - DWML course 32

Topics not to be Covered

  • Privacy/security of data during ETL

Encryption may not work During extraction/transformation, may need to know original

values in order to check whether ETL performs correctly

  • Data Visualization (VIS)
  • Decision Analysis (What-if)
  • Customer Relationship Management (CRM)
slide-33
SLIDE 33

Aalborg University 2008 - DWML course 33

Summary

  • Why Business Intelligence?
  • Data analysis problems
  • Data Warehouse (DW) introduction
  • DW Topics

Multidimensional modeling ETL Performance optimization

  • BI provide many advantages to your organization

A good DW is a prerequisite for BI

slide-34
SLIDE 34

Aalborg University 2008 - DWML course 34

DW Software

  • DW part of the mini-project
  • DW software

Obtain from MSDNAA, and install

◆ MS SQL Server 2005 RDBMS ◆ MS Analysis Services, Integration Services,

Reporting Services

Checking after installation

◆ Open “Component Services” and check whether all

four services above have been started

◆ Open “SQL Server Management Studio” and see

whether you can connect to “Database Engine”

Read the mini-project webpage for installation

details

slide-35
SLIDE 35

Aalborg University 2008 - DWML course 35

Demonstration Session

  • To get you familiar with DW software, we have a

demonstration session after the next few lectures

  • Details of demonstration

Location: 3.2.48 Time: just after each lecture (not this one) Three demonstration slots (30 minutes each):

◆ 14.30-15.00 ◆ 15.00-15.30 ◆ 15.30-16.00

At most 5 students can fit in the same slot (not necessarily

from the same group)

Signup for demonstration slots today!

slide-36
SLIDE 36

Aalborg University 2008 - DWML course 36

Mini-Project Group Formation

  • Form groups ~3 persons
  • Discuss with your classmates NOW
  • Groups to be formed today, before

we leave this room!