Introduction to Data Warehousing and Business Intelligence
Slides adapted from Torben Bach Pedersen
Introduction to Data Warehousing and Business Intelligence Slides - - PowerPoint PPT Presentation
Introduction to Data Warehousing and Business Intelligence Slides adapted from Torben Bach Pedersen Course Structure Business intelligence Extract knowledge from large amounts of data collected in a modern enterprise Data
Slides adapted from Torben Bach Pedersen
Aalborg University 2008 - DWML course 2
modern enterprise
Data warehousing: construction
analysis purpose Machine learning: find patterns automatically in databases Business Intelligence (BI)
Aalborg University 2008 - DWML course 3
Teacher: Ken
Office: 3.2.48
Teacher: Thomas D. Nielsen Office: 2.2.03
Lecture slides, mini-project, ……
Aalborg University 2008 - DWML course 4
The Data Warehouse Lifecycle Toolkit, Kimball et. al.,
Fundamentals of Data Warehousing, Jarke et. al.,
To be posted at course homepage
Aalborg University 2008 - DWML course 5
Performed in groups of ~3 persons Documented in report of 20 pages Firm Deadline: April 20
◆ The homepage also shows the soft deadline of each task
Individual oral exam, for 20 minutes
◆ 8 minutes of DW questions ◆ 8 minutes of ML questions
Mini-project report as the basis for discussion Exam also covers theoretical background in lectures and
More details at the end of the course
Aalborg University 2008 - DWML course 6
Multidimensional modeling ETL Performance optimization
Aalborg University 2008 - DWML course 7
AI systems make decisions for the users BI systems help the users make the right decisions,
Data Warehousing (DW) On-Line Analytical Processing (OLAP) Data Mining (DM) ……
Aalborg University 2008 - DWML course 8
10% growth each year
Customers do not appear “physically” in the store Customers can change to other stores more easily
Know your customers using data and BI! Utilize Web logs, analyze customer behavior in more detail
Combine web data with traditional customer data
Aalborg University 2008 - DWML course 9
Example: customer data across different departments The same concept is defined differently
Relational DBMS, On-Line Transaction Processing (OLTP) Unstructured data in files (e.g., MS Excel) and documents
Aalborg University 2008 - DWML course 10
Accounting, billing, etc. Do not support analysis across business functions
Missing data, imprecise data, different use of systems
Data deleted in operational systems (6 months) Data change over time – no historical information
Aalborg University 2008 - DWML course 11
Subject oriented (versus function oriented) Integrated (logically and physically) Time variant (data can always be related to time) Stable (data not deleted, several versions) Supporting management decisions (different organization)
Aalborg University 2008 - DWML course 12
Finance, billing, web logs, personnel, …
Ad-hoc analysis and reports
◆ We will cover this soon ……
Data mining: discovery of hidden patterns and trends
Aalborg University 2008 - DWML course 13
Appl. Appl. Appl. Trans.
OLAP Visua- lization Appl. Appl. Data mining
Aalborg University 2008 - DWML course 14
Appl. Appl. Appl. Trans.
D-Appl. D-Appl. Appl. Appl. D-Appl.
Aalborg University 2008 - DWML course 15
redundancy/ workload distribution
Clients
Aalborg University 2008 - DWML course 16
Clients
Aalborg University 2008 - DWML course 17
redundancy and distribution
2000 2001 Aalborg Copenhagen Milk Bread 123 127 57 45 56 67 211
2000 2001 Aalborg Copenhagen Milk Bread 123 127 57 45 56 67 211 2000 2001 Aalborg Copenhagen Milk Bread 123 127 57 45 56 67 2112000 2001 Aalborg Copenhagen Milk Bread 123 127 57 45 56 67 211
Aalborg University 2008 - DWML course 18
In the past five years, which product is the most profitable? Which public holiday we have the largest sales? Which week we have the largest sales? Does the sales of dairy products increase over time?
3rd query: extract the “week” value using a function
◆ But the user has to learn many transformation functions …
4th query: use a “special” table to store IDs of all dairy products,
◆ We have many other product types as well …
Aalborg University 2008 - DWML course 19
Each sales record is a fact, and its sales value is a
Each sales record is associated with its values of
Correlated attributes grouped into the same dimension
5.75 1997 Maj 25 Århus Århus Trøjborg Beverage Beer Top Sales Year Month Day County City Store Category Type Product
Aalborg University 2008 - DWML course 20
A tree structure, with multiple levels Attributes, e.g., holiday, event
Easy for query (more about this later)
Data redundancy (controlled redundancy is acceptable)
… Yes 2008 1 1 2 2 … … … … … … … … No 2008 1 1 1 1 … work day year month week day tid
Aalborg University 2008 - DWML course 21
Aalborg University 2008 - DWML course 22
05-02-2008 6.00 Beverage Beer 3302 02-02-2008 4.00 Cereal Rice 3301 02-02-2008 6.00 Beverage Beer 3301 07-02-2008 5.00 Cereal Wheat 3303 Date Price Category Product Customer ID
Redundant data Modification anomalies
05-02-2008 013 3302 02-02-2008 052 3301 02-02-2008 013 3301 07-02-2008 067 3303 Date ProductID Customer ID 4.00 Cereal Rice 052 6.00 Beverage Beer 013 5.00 Cereal Wheat 067 Price Category Product ProductID
Aalborg University 2008 - DWML course 23
Useful data analysis tool in DW Generalized GROUP BY queries Aggregate facts based on chosen
◆ Product, store, time dimensions ◆ Sales measure of sale facts
Good for visualization (i.e., text
Multidimensional, intuitive Support interactive OLAP
Aalborg University 2008 - DWML course 24
Interactive analysis Explorative discovery Fast response times required
Aggregation, e.g., SUM Starting level, (Year, City) Roll Up: Less detail Drill Down: More detail Slice/Dice: Selection, Year=2000
9 10 11 15
Aalborg University 2008 - DWML course 25
Some dimensions are not static. They can change over
◆ Time dimension ◆ A new store opens, or an existing store closes ◆ The price of a product changes
How do we handle these changes?
How do we coordinate the dimensions in different data
+ + + + Profit + + Costs + + + Sales Supplier Product Customer Time
Dimensions Data marts Bus architecture
Aalborg University 2008 - DWML course 26
Appl. Appl. Appl. Trans.
D-Appl. D-Appl. Appl. Appl. D-Appl.
Aalborg University 2008 - DWML course 27
Data from different sources Data with different formats Handling of missing data and erroneous data
Extract Transformations / cleansing Load
80% of development time spent on ETL
Aalborg University 2008 - DWML course 28
Extract from many heterogeneous systems
Large, sequential bulk operations flat files best?
Data checked for missing parts and erroneous values Default values provided and out-of-range values marked
Data transformed to decision-oriented format Data from several sources merged, optimize for querying
Are individual business transactions needed in the DW?
Large bulk loads rather than SQL INSERTs (Why?) Fast indexing (and pre-aggregation) required
Aalborg University 2008 - DWML course 29
Fine tune performance for important queries Aggregates, indexing, other optimizations (environment,
How can aggregates improve performance?
Which aggregates should we materialize?
Aalborg University 2008 - DWML course 30
Sales s WHERE p.pid=s.pid GROUP BY p.category
◆ can be rewritten to
TotalSales t WHERE p.pid=t.pid GROUP BY p.category
◆ Query becomes 10,000 times faster!
40 3 2 3 … … … … 1 1 locid 2 10 1 1 20 1 sales pid tid
Sales
40 3 2 … … … 1 locid 30 1 sales pid
VIEW TotalSales
1 billion rows 100,000 rows
Aalborg University 2008 - DWML course 31
Need to understand data = metadata needed Greater need in OLAP than in OLTP as “raw” data is used Need to know about:
◆ Data definitions, dataflow, transformations, versions, usage, security
DW projects are large and different from ordinary SW projects
◆ 12-36 months and US$ 1+ million per project ◆ Data marts are smaller and “safer” (bottom up approach)
Reasons for failure
◆ Lack of proper design methodologies ◆ High HW+SW cost ◆ Deployment problems (lack of training) ◆ Organizational change is hard… (new processes, data ownership,..) ◆ Ethical issues (security, privacy,…)
Aalborg University 2008 - DWML course 32
Encryption may not work During extraction/transformation, may need to know original
Aalborg University 2008 - DWML course 33
Multidimensional modeling ETL Performance optimization
A good DW is a prerequisite for BI
Aalborg University 2008 - DWML course 34
Obtain from MSDNAA, and install
◆ MS SQL Server 2005 RDBMS ◆ MS Analysis Services, Integration Services,
Checking after installation
◆ Open “Component Services” and check whether all
◆ Open “SQL Server Management Studio” and see
Read the mini-project webpage for installation
Aalborg University 2008 - DWML course 35
Location: 3.2.48 Time: just after each lecture (not this one) Three demonstration slots (30 minutes each):
◆ 14.30-15.00 ◆ 15.00-15.30 ◆ 15.30-16.00
At most 5 students can fit in the same slot (not necessarily
Signup for demonstration slots today!
Aalborg University 2008 - DWML course 36