Data Quality Blueprint for Pentaho: Better Data Leads to Better Results



slide-1
SLIDE 1

Data Quality Blueprint for Pentaho: Better Data Leads to Better Results

Charles Gaddy Director Global Sales & Alliances, Melissa Data

slide-2
SLIDE 2

Agenda

  • What is Data Quality, and What Role Does it Play?
  • 6 Concepts of Data Quality
  • Full Data Quality Lifecycle
slide-3
SLIDE 3

What is Data Quality?

  • Data quality is an assessment of data’s fitness to serve its stated purpose. Think “fitness for use.”
  • Data quality applies to any kind of data: scientific, transactional, customer, product, asset, location, financial.
  • Data cleansing is the action (the verb) used to achieve data quality.

slide-4
SLIDE 4

Information Industry

  • The data governance discipline, the data quality discipline, and the Master Data Management (MDM) discipline are closely related.
  • Data quality improvement is important within data governance and MDM. Furthermore, you seldom see an MDM implementation without a (master) data governance work stream.

slide-5
SLIDE 5

Information Industry

Survey Conducted by Melissa Global Intelligence

What is data used for?

  • Revenues: 63%
  • Service: 54%
  • Marketing: 38%
  • Risk Reduction: 37%
  • Channel Pipeline: 36%
  • New Projects: 34%
  • Regulatory: 32%

slide-6
SLIDE 6

Atomic Domains of Data Quality

  • Basic data domains represent data such as age, date of birth, and sales amount that are common to many businesses.
  • Advanced data domains span the range of data classifications to provide more specific cases for your use.
  • In addition, the rule conditions for these advanced data rule definitions can be more complex.

slide-7
SLIDE 7

Atomic Domains of Data Quality

Atomic or entity domains that need special handling and have available domain-based knowledge

  • Personal Identity: Age, Date of Birth, US SSN, CA SIN, Passport Number
  • Asset Identity Information: IP Address, Phone Number, Email Address, VIN Number
  • Financial: Policy, Portfolio, Bank Account Number
  • Orders and Sales: Order Amount, Sales Amount, Order ID
  • Location: Address, Name, Zip Code, Latitude, Longitude, State, City, Country

slide-8
SLIDE 8

Advanced Atomic Domains of Data Quality

Atomic or entity domains that need special handling and require custom domain-based knowledge

  • Product: Items, Assemblies, Parts, SKUs

slide-9
SLIDE 9

6 Concepts of Data Quality

  • Completeness
  • Conformity
  • Consistency
  • Accuracy
  • Integrity
  • Duplication

slide-10
SLIDE 10

1. Completeness

  • Is all the requisite information available?
  • Are data values missing, or in an unusable state?

In some cases, missing data is irrelevant, but when the missing information is critical to a specific business process, completeness becomes an issue.
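
The completeness questions above can be turned into a simple measurement. The following is a minimal sketch, not a Melissa Data or Pentaho API: the field names, the sample records, and the set of values treated as "missing" are all assumptions made for illustration.

```python
# Assumed for illustration: strings that should count as "missing".
MISSING_TOKENS = {"", "N/A", "NULL"}

def is_missing(value):
    """True when a value is absent or holds only a placeholder."""
    return value is None or str(value).strip().upper() in MISSING_TOKENS

def completeness(records, required_fields):
    """Return, per required field, the share of records with a usable value."""
    records = list(records)
    result = {}
    for field in required_fields:
        filled = sum(1 for r in records if not is_missing(r.get(field)))
        result[field] = filled / len(records) if records else 0.0
    return result

# Hypothetical customer records.
customers = [
    {"name": "Ann Lee", "email": "ann@example.com", "phone": "555-0100"},
    {"name": "Bob Roy", "email": "", "phone": None},
    {"name": "Cy Diaz", "email": "N/A", "phone": "555-0102"},
    {"name": "Di Wu", "email": "di@example.com", "phone": "555-0103"},
]
scores = completeness(customers, ["name", "email", "phone"])
```

Only fields that are critical to a business process need to be passed in as `required_fields`, which mirrors the point that some missing data is irrelevant.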

slide-11
SLIDE 11

2. Conformity

  • Are there expectations that data values conform to specified formats?
  • If so, do all the values conform to those formats?

Maintaining conformance to specific formats is important in data representation, presentation, aggregate reporting, search, and establishing key relationships.
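
A conformity check can be sketched as one pattern per field. The rules below (a US ZIP or ZIP+4 code, an ISO-style date) and the field names are assumptions chosen for illustration. Real format rules depend on the data set.

```python
import re

# Hypothetical format rules: field name -> expected pattern.
FORMATS = {
    "zip_code": re.compile(r"\d{5}(-\d{4})?"),
    "order_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}

def nonconforming(records, formats=FORMATS):
    """Return (row index, field, value) for each value that fails its format."""
    failures = []
    for i, record in enumerate(records):
        for field, pattern in formats.items():
            value = record.get(field)
            # Missing values are a completeness problem, so they are skipped here.
            if value is not None and not pattern.fullmatch(str(value)):
                failures.append((i, field, value))
    return failures

# Hypothetical order records.
orders = [
    {"zip_code": "92688", "order_date": "2015-03-01"},
    {"zip_code": "9268", "order_date": "03/01/2015"},
    {"zip_code": "92688-2112", "order_date": "2015-03-02"},
]
failures = nonconforming(orders)
```

Note that the date pattern only checks the shape of the value. A value such as `2015-13-40` would pass, so a conformity check like this does not establish accuracy.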

slide-12
SLIDE 12

3. Consistency

  • Do distinct data instances provide conflicting information about the same underlying data object?
  • Are values consistent across data sets?
  • Do interdependent attributes always appropriately reflect their expected consistency?

Inconsistency between data values plagues companies attempting to reconcile between systems and applications.
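
One way to look for conflicting information about the same object is to compare records that share a key across two systems. This is a sketch under assumptions: the two systems (`crm`, `billing`), the keys, and the field names are hypothetical.

```python
def conflicts(system_a, system_b, fields):
    """Return {key: {field: (a_value, b_value)}} for shared keys whose values disagree."""
    found = {}
    for key in system_a.keys() & system_b.keys():
        diffs = {
            f: (system_a[key].get(f), system_b[key].get(f))
            for f in fields
            if system_a[key].get(f) != system_b[key].get(f)
        }
        if diffs:
            found[key] = diffs
    return found

# Hypothetical extracts from two systems, keyed by customer ID.
crm = {
    "C1": {"state": "CA", "zip_code": "92688"},
    "C2": {"state": "NY", "zip_code": "10001"},
}
billing = {
    "C1": {"state": "CA", "zip_code": "92688"},
    "C2": {"state": "NJ", "zip_code": "10001"},
    "C3": {"state": "TX", "zip_code": "73301"},
}
mismatches = conflicts(crm, billing, ["state", "zip_code"])
```

Keys that exist in only one system (here `C3`) are ignored by this comparison. They point to an integrity or completeness question rather than a conflict.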

slide-13
SLIDE 13

4. Accuracy

  • Do data objects accurately represent the “real-world” values they are expected to model?

Incorrect spellings of product or person names, incorrect addresses, and even untimely or out-of-date data can impact operational and analytical applications.
slide-14
SLIDE 14

5. Duplication

  • Are there multiple, unnecessary representations of the same data objects within your data set?

The inability to maintain a single representation for each entity across your systems poses numerous vulnerabilities and risks.

slide-15
SLIDE 15

6. Integrity

  • What data is missing important relationship linkages?

The inability to link related records together may actually introduce duplication across your systems. Moreover, as more value is derived from analyzing connectivity and relationships, the inability to link related data instances together impedes this valuable analysis.
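
Missing relationship linkages can be found by looking for child records whose reference points at no parent. A minimal sketch, assuming hypothetical `customers` and `orders` records linked by a `customer_id` field:

```python
def orphans(children, parent_keys, foreign_key):
    """Return child records whose foreign key matches no parent key."""
    parent_keys = set(parent_keys)
    return [c for c in children if c.get(foreign_key) not in parent_keys]

# Hypothetical parent and child records.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": "A", "customer_id": 1},
    {"order_id": "B", "customer_id": 7},     # points at a customer that does not exist
    {"order_id": "C", "customer_id": None},  # has no link at all
]
orphaned = orphans(orders, (c["customer_id"] for c in customers), "customer_id")
```

In a relational database the same question is usually asked with an outer join. The list version above only illustrates the idea.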

slide-16
SLIDE 16

Why Data is Always in Flux

  • 40 million Americans (1 in 6) move annually
  • More than 100,000 changes (adds, deletes, or modifications) every month
  • Quality of stored U.S. addresses declines 17% per year

  • Phone Area Code Splits
  • Email Domain Changes
  • Disconnected Phone Numbers
slide-17
SLIDE 17

The Full Life Cycle of Data Quality
slide-18
SLIDE 18

Profiling

  • Gathering Metadata for Analysis

– Data about your data

  • Identify the Problems

– NULLs/Blanks, Unnecessary Spaces, Incorrect Patterns, Unstandardized Data, etc.

  • Overall status of the Quality of Data

– Statistical Analysis
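
The profiling ideas above (NULLs and blanks, unnecessary spaces, incorrect patterns) can be sketched for a single column of text values. This is an illustration only, not the profiling component of any particular tool. The `profile` function, its metric names, and the sample phone numbers are assumptions.

```python
import re
from collections import Counter

def pattern_of(value):
    """Map digits to '9' and letters to 'A' so differently formatted values stand out."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))

def profile(values):
    """Gather basic metadata about one column of string values (None means NULL)."""
    non_null = [v for v in values if v is not None]
    populated = [v for v in non_null if v.strip() != ""]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "blanks": len(non_null) - len(populated),
        "padded": sum(1 for v in populated if v != v.strip()),  # leading/trailing spaces
        "distinct": len({v.strip() for v in populated}),
        "patterns": Counter(pattern_of(v.strip()) for v in populated),
    }

# Hypothetical phone number column.
phones = ["949-555-0100", " 949-555-0101", None, "", "9495550102", "949-555-0100"]
report = profile(phones)
```

The pattern counts give a quick statistical view of the column: a dominant pattern suggests the expected format, and rare patterns are candidates for standardization.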

slide-19
SLIDE 19

Hygiene

  • Data Standardization/Normalization

– Proper Casing
– Proper Formatting
– Removal of Unnecessary Characters

  • Data Cleansing

– Misspellings
– Parsing
– Abbreviations
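
A few of the hygiene steps above (proper casing, removal of unnecessary characters, abbreviations) can be sketched for a street address line. The abbreviation table is a small hypothetical sample, and the rule is deliberately naive: for example, it would also expand "St" where it means "Saint".

```python
import re

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"st": "Street", "ave": "Avenue", "rd": "Road"}

def clean_address(raw):
    """Standardize one address line: strip punctuation, fix spacing and casing, expand abbreviations."""
    text = re.sub(r"[^\w\s]", " ", raw)  # replace punctuation with spaces
    words = text.split()                 # split() also collapses repeated whitespace
    cleaned = [ABBREVIATIONS.get(w.lower(), w.capitalize()) for w in words]
    return " ".join(cleaned)

first = clean_address("100 main st.")
second = clean_address("  22382  avenida EMPRESA, ")
```

Production-grade cleansing relies on reference data and parsing rules per domain and per country. This sketch shows only the shape of the transformation.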

slide-20
SLIDE 20

Data Verification

  • Verifying the actual content of data

– Do the addresses actually exist?
– Are the phone numbers callable?
– Are email addresses deliverable?
– Are the names actually people’s names?
– Do the address, name, phone, and email correspond to each other?

slide-21
SLIDE 21

Enrich and Update

  • Missing Information

– Appending: fill in missing data pieces, such as a missing phone number or email address

  • Enrichment of Data

– Property Information, Geographic Information, Firmographics, Demographics

  • Retrieve the latest information (e.g., move address and latest phone number)

– Data becomes outdated over time

slide-22
SLIDE 22

Matching

  • De-Duplication

– Duplicate data is bad data

  • Fuzzy Matching

– Application of fuzzy logic algorithms for inexact matches

  • Deep Domain Knowledge

– Handles matching problems in international data and in multiple domains
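
Fuzzy matching for de-duplication can be illustrated with Python's standard `difflib.SequenceMatcher`. This is a generic string-similarity technique swapped in for illustration, not the matching algorithm of any specific product. The 0.85 threshold and the sample names are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Similarity ratio between 0 and 1, ignoring case and surrounding spaces."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def likely_duplicates(names, threshold=0.85):
    """Return every pair of names whose similarity meets the (assumed) threshold."""
    return [(a, b) for a, b in combinations(names, 2) if similarity(a, b) >= threshold]

# Hypothetical customer names.
names = ["Jon Smith", "John Smith", "Jane Doe", "john smith "]
pairs = likely_duplicates(names)
```

Comparing every pair grows quadratically with the number of records, so larger data sets typically group candidate records first (for example by postal code) before comparing. Character similarity alone also knows nothing about nicknames or international name conventions, which is where domain knowledge comes in.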

slide-23
SLIDE 23

Merging

  • Golden Record Selection

– Selection of the best record

  • Consolidation and Survivorship

– Merging the best pieces of data according to intelligent rules
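
Golden record selection and survivorship can be sketched with two simple rules. Both rules are assumptions made for illustration: the most recently updated record is taken as the golden record, and fields it lacks are filled from the next most recent duplicates. The `updated` field and sample records are hypothetical.

```python
EMPTY = (None, "")

def merge(duplicates):
    """Build one surviving record from a group of duplicate records."""
    # Rule 1 (assumed): the most recently updated record is the golden record.
    # ISO-formatted date strings sort correctly as plain text.
    ranked = sorted(duplicates, key=lambda r: r["updated"], reverse=True)
    golden = dict(ranked[0])
    # Rule 2 (assumed): fill empty fields from the remaining records, newest first.
    for record in ranked[1:]:
        for field, value in record.items():
            if golden.get(field) in EMPTY and value not in EMPTY:
                golden[field] = value
    return golden

# Hypothetical group of records already identified as duplicates.
group = [
    {"name": "John Smith", "email": "", "phone": "949-555-0100", "updated": "2014-06-01"},
    {"name": "John Smith", "email": "john@example.com", "phone": None, "updated": "2015-01-15"},
]
survivor = merge(group)
```

Real survivorship rules are often set per field, for instance preferring a verified address over a newer unverified one.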

slide-24
SLIDE 24

Monitoring

  • Profiling Over Time

– Continuously gather metadata
– Allows for maintenance of data quality
– Data profiling with a good tool can also be employed as an active monitoring solution
– Active monitoring can be employed to safeguard collected data
– By using the same profiling techniques, it is possible to reassess the current state of data quality
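
Profiling over time amounts to recomputing the same metric on each new batch and comparing it with the previous value. A minimal sketch, assuming a single metric (the share of blank values) and a hypothetical tolerance of 5 percentage points:

```python
def blank_rate(values):
    """Share of values that are None or contain only whitespace."""
    if not values:
        return 0.0
    return sum(1 for v in values if v is None or not v.strip()) / len(values)

def check(history, values, tolerance=0.05):
    """Record the current blank rate and report whether it worsened beyond the tolerance."""
    current = blank_rate(values)
    worsened = bool(history) and current - history[-1] > tolerance
    history.append(current)
    return worsened

# Hypothetical batches of an email column, profiled on two occasions.
history = []
first_alert = check(history, ["a@example.com", "b@example.com", "c@example.com", ""])
second_alert = check(history, ["a@example.com", "", None, ""])
```

The first call only sets a baseline, so it never raises an alert. Later calls reuse the same profiling logic, which is the point of monitoring as repeated profiling.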

slide-25
SLIDE 25

Summary

  • What is Data Quality, and What Role Does it Play?
  • 6 concepts of Data Quality
  • Full Data Quality Lifecycle