Measurement and Data - PowerPoint PPT Presentation

Aug 22, 2022

SLIDE 1

Measurement and Data

SLIDE 2

Data describes the real world

Data maps entities in the domain of interest to symbolic representation by means of a measurement procedure

Numerical relationships between variables capture relationships between objects

Measurement process is crucial

SLIDE 3

Types of Measurement

Ordinal, e.g., excellent=5, very good=4, good=3…
Nominal, e.g., religion, profession

– Need non-metric methods

Ratio, e.g., weight

– has concatenation property, two weights add to balance a third: 2+3 = 5
– changing scale (multiplying values by a constant) does not change ratio

Interval, e.g., temperature, calendar time

– Unit of measurement is arbitrary, as well as origin

SLIDE 4

Distance Measures

Many data mining techniques (e.g., nn-classification, cluster analysis) are based on similarity measures between objects

s(i,j): similarity, d(i,j): dissimilarity
Possible transformations: d(i,j) = 1 – s(i,j) or d(i,j) = sqrt(2*(1 – s(i,j)))
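The two transformations can be sketched in Python as follows. This is only an illustration: it assumes the similarity s is already scaled so that 0 <= s <= 1, and the function names are hypothetical.

```python
import math

def dissim_linear(s):
    """d = 1 - s, assuming the similarity s lies in [0, 1]."""
    return 1.0 - s

def dissim_sqrt(s):
    """d = sqrt(2 * (1 - s)), assuming s <= 1."""
    return math.sqrt(2.0 * (1.0 - s))

# Identical objects (s = 1) get dissimilarity 0 under both transformations.
for s in (1.0, 0.5, 0.0):
    print(s, dissim_linear(s), round(dissim_sqrt(s), 4))
```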

SLIDE 5

Metric Properties

1. d(i,j) ≥ 0, with d(i,j) = 0 if and only if i = j: Positivity

2. d(i,j) = d(j,i): Commutativity (symmetry)

3. d(i,j) ≤ d(i,k) + d(k,j): Triangle Inequality
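The three properties can be checked mechanically on a small dissimilarity matrix. The sketch below is a hypothetical helper, not part of the slides; it assumes the objects are distinct, so off-diagonal entries must be strictly positive. It shows that absolute difference on a line is a metric while squared difference fails the triangle inequality.

```python
from itertools import product

def is_metric(D, tol=1e-12):
    """Check positivity, symmetry and the triangle inequality on a square matrix D."""
    n = len(D)
    for i, j in product(range(n), repeat=2):
        if i == j and abs(D[i][j]) > tol:
            return False              # d(i,i) must be 0
        if i != j and D[i][j] <= tol:
            return False              # distinct objects need d(i,j) > 0
        if abs(D[i][j] - D[j][i]) > tol:
            return False              # d(i,j) must equal d(j,i)
    for i, j, k in product(range(n), repeat=3):
        if D[i][j] > D[i][k] + D[k][j] + tol:
            return False              # triangle inequality violated
    return True

points = [0.0, 1.0, 2.0]
abs_dist = [[abs(a - b) for b in points] for a in points]
sq_dist = [[(a - b) ** 2 for b in points] for a in points]

print(is_metric(abs_dist))  # True
print(is_metric(sq_dist))   # False: 4 > 1 + 1 for the two end points
```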

SLIDE 6

Euclidean Distance between vectors

$$ d_E(x, y) = \left( \sum_{k=1}^{p} (x_k - y_k)^2 \right)^{1/2} $$
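A minimal Python sketch of the Euclidean distance between two p-dimensional vectors; the function name and the example vectors are illustrative only.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length sequences of numbers."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of variables")
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

print(euclidean([0, 0], [3, 4]))  # 5.0
```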

SLIDE 7

Commensurability

Euclidean distance assumes variables are commensurate

E.g., each variable a measure of length
If one were weight and the other length, there is no obvious choice of units

Altering units would change which variables are important
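The effect of altering units can be seen in a small sketch. The objects and their values below are made up for illustration: each is described by a weight in kilograms and a length, first in metres and then in centimetres. The nearest neighbour of A changes with the unit chosen for length.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nearest(query, others):
    """Return the name of the object in `others` closest to `query`."""
    return min(others, key=lambda name: euclidean(query, others[name]))

# Hypothetical objects: (weight in kg, length in metres).
a_m = (70.0, 1.80)
others_m = {"B": (71.0, 1.50), "C": (75.0, 1.79)}

# The same objects with length re-expressed in centimetres.
a_cm = (70.0, 180.0)
others_cm = {"B": (71.0, 150.0), "C": (75.0, 179.0)}

print(nearest(a_m, others_m))    # B: weight differences dominate
print(nearest(a_cm, others_cm))  # C: length differences dominate
```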

SLIDE 8

Standardizing the Data

Divide each variable by its standard deviation
Standard deviation for the kth variable is

$$ \sigma_k = \left( \frac{1}{n} \sum_{i=1}^{n} (x_k(i) - \mu_k)^2 \right)^{1/2} $$

where

$$ \mu_k = \frac{1}{n} \sum_{i=1}^{n} x_k(i) $$
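Standardizing one variable can be sketched as below, using the 1/n form of the standard deviation. The function name and the sample values are hypothetical. After dividing by the standard deviation, the result no longer depends on the unit in which the variable was recorded.

```python
import math

def standardize(column):
    """Divide every value of one variable by that variable's standard deviation."""
    n = len(column)
    mu = sum(column) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in column) / n)
    if sigma == 0:
        raise ValueError("a constant variable cannot be standardized")
    return [x / sigma for x in column]

# Hypothetical lengths recorded in centimetres and in metres.
length_cm = [150.0, 160.0, 170.0, 180.0]
length_m = [1.5, 1.6, 1.7, 1.8]

# Both lists print the same values up to floating-point rounding.
print([round(v, 4) for v in standardize(length_cm)])
print([round(v, 4) for v in standardize(length_m)])
```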

SLIDE 9

Weighted Euclidean Distance

If we know relative importance of variables

$$ d_{WE}(i,j) = \left( \sum_{k=1}^{p} w_k (x_k(i) - x_k(j))^2 \right)^{1/2} $$
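A short sketch of the weighted Euclidean distance, with illustrative weights that are not taken from the slides. With all weights equal to 1 it reduces to the ordinary Euclidean distance, and choosing w_k = 1/σ_k² gives the same result as Euclidean distance on standardized data.

```python
import math

def weighted_euclidean(xi, xj, w):
    """Weighted Euclidean distance; w[k] is the non-negative weight of variable k."""
    if not (len(xi) == len(xj) == len(w)):
        raise ValueError("vectors and weights must have the same length")
    return math.sqrt(sum(wk * (a - b) ** 2 for wk, a, b in zip(w, xi, xj)))

print(weighted_euclidean([0, 0], [3, 4], [1, 1]))  # 5.0, the unweighted distance
print(weighted_euclidean([0, 0], [3, 4], [4, 0]))  # 6.0, only the first variable counts
```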

SLIDE 10

Need for Covariance in distance measure

Suppose we measured a cup’s height 100 times and diameter only once

Clearly height will dominate although 99 of the height measurements are not contributing anything

They are very highly correlated
To eliminate redundancy we need a data-driven method

SLIDE 11

Sample Covariance between X and Y

Measure of how X and Y vary together
Large positive value if large values of X tend to be associated with large values of Y and small values of X with small values of Y
Large negative value if large values of X tend to be associated with small values of Y

$$ \mathrm{Cov}(X,Y) = \frac{1}{n} \sum_{i=1}^{n} (x(i) - \bar{x})(y(i) - \bar{y}) $$
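The sample covariance with the 1/n normalization can be sketched as follows; the data values are made up to show one positive and one negative case.

```python
def sample_cov(x, y):
    """Sample covariance of two equal-length sequences, dividing by n."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    xbar = sum(x) / n
    ybar = sum(y) / n
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / n

x = [1.0, 2.0, 3.0, 4.0]
print(sample_cov(x, [2.0, 4.0, 6.0, 8.0]))  # 2.5: large X goes with large Y
print(sample_cov(x, [8.0, 6.0, 4.0, 2.0]))  # -2.5: large X goes with small Y
```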

SLIDE 12

Correlation Coefficient

Value of covariance is dependent upon ranges of X and Y

This dependence is removed by dividing values of X by their standard deviation and values of Y by their standard deviation

$$ \rho(X,Y) = \frac{\frac{1}{n} \sum_{i=1}^{n} (x(i) - \bar{x})(y(i) - \bar{y})}{\sigma_x \sigma_y} $$
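A sketch of the correlation coefficient, assuming the same 1/n normalization is used for the covariance and for both standard deviations. The example shows that rescaling X (a change of units) leaves the correlation unchanged, unlike the covariance. Names and data are illustrative.

```python
import math

def correlation(x, y):
    """Correlation coefficient: covariance divided by the two standard deviations."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    cov = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - xbar) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - ybar) ** 2 for b in y) / n)
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (sx * sy)

x = [1.0, 2.0, 3.0, 5.0]
y = [2.0, 3.0, 5.0, 4.0]
x_rescaled = [100.0 * v for v in x]  # same variable in different units

print(round(correlation(x, y), 6))
print(round(correlation(x_rescaled, y), 6))  # same value as the line above
```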

SLIDE 13

Correlation Matrix

SLIDE 14

Mahalanobis Distance

$$ d_M(i,j) = \left[ (x(i) - x(j))^T \Sigma^{-1} (x(i) - x(j)) \right]^{1/2} $$

where Σ is the p × p sample covariance matrix
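A minimal sketch of this distance for the two-variable case, where the covariance matrix can be inverted by hand. The helper names and the data points are hypothetical, and the covariance uses the 1/n normalization. With an identity covariance matrix the result equals the Euclidean distance.

```python
import math

def covariance_matrix_2d(points):
    """2 x 2 sample covariance matrix (1/n normalization) of (x, y) pairs."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return [[sxx, sxy], [sxy, syy]]

def mahalanobis_2d(a, b, cov):
    """sqrt of (a - b)^T cov^{-1} (a - b) for two-dimensional points."""
    sxx, sxy = cov[0]
    syy = cov[1][1]
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a symmetric 2 x 2 matrix.
    ixx, ixy, iyy = syy / det, -sxy / det, sxx / det
    dx, dy = a[0] - b[0], a[1] - b[1]
    return math.sqrt(dx * (ixx * dx + ixy * dy) + dy * (ixy * dx + iyy * dy))

points = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0)]
cov = covariance_matrix_2d(points)
print(cov)
print(mahalanobis_2d(points[0], points[3], cov))
```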

SLIDE 15

Generalizing Euclidean Distance

Minkowski or Lλ metric
λ = 2 gives the Euclidean metric

$$ \left( \sum_{k=1}^{p} |x_k(i) - x_k(j)|^{\lambda} \right)^{1/\lambda} $$

SLIDE 16

Minkowski metric

λ = 1 is the Manhattan or city block metric

$$ \sum_{k=1}^{p} |x_k(i) - x_k(j)| $$

λ = infinity yields

$$ \max_k |x_k(i) - x_k(j)| $$
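The three special cases can be covered by one small function. This is a sketch with a hypothetical function name; it assumes λ ≥ 1 and treats `math.inf` as the limiting maximum-difference case.

```python
import math

def minkowski(xi, xj, lam):
    """Minkowski distance of order lam (lam >= 1, or math.inf for the max metric)."""
    diffs = [abs(a - b) for a, b in zip(xi, xj)]
    if lam == math.inf:
        return max(diffs)
    if lam < 1:
        raise ValueError("lam must be at least 1")
    return sum(d ** lam for d in diffs) ** (1.0 / lam)

xi, xj = [0, 0], [3, 4]
print(minkowski(xi, xj, 1))         # 7.0, Manhattan / city block
print(minkowski(xi, xj, 2))         # 5.0, Euclidean
print(minkowski(xi, xj, math.inf))  # 4, largest coordinate difference
```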

SLIDE 17

Multivariate Binary Data

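Most obvious measure is Hamming Distance normalized by number of bits; in similarity form this is the fraction of matching bits

$$ \frac{S_{11} + S_{00}}{S_{11} + S_{10} + S_{01} + S_{00}} $$

where S_ab is the number of positions at which the first vector has bit a and the second has bit b

If we don’t care about irrelevant properties had by neither object we have Jaccard Coefficient

$$ \frac{S_{11}}{S_{11} + S_{10} + S_{01}} $$

Dice Coefficient extends this argument. If 00 matches are irrelevant then 10 and 01 mismatches should have half relevance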
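The measures on this slide can be sketched from the four counts S11, S10, S01 and S00, where S_ab is the number of positions at which the first vector has bit a and the second has bit b. The function names and the example vectors are hypothetical. The Dice formula used here, 2·S11 / (2·S11 + S10 + S01), is the standard form and is an assumption, since the slide text describes it only in words.

```python
def match_counts(x, y):
    """Count positions by bit pattern: keys '11', '10', '01', '00'."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of bits")
    counts = {"11": 0, "10": 0, "01": 0, "00": 0}
    for a, b in zip(x, y):
        counts[f"{a}{b}"] += 1
    return counts

def simple_matching(x, y):
    """Fraction of matching bits; equals 1 minus the normalized Hamming distance."""
    s = match_counts(x, y)
    return (s["11"] + s["00"]) / len(x)

def jaccard(x, y):
    """Ignores 00 matches entirely."""
    s = match_counts(x, y)
    denom = s["11"] + s["10"] + s["01"]
    if denom == 0:
        raise ValueError("undefined when both vectors are all zeros")
    return s["11"] / denom

def dice(x, y):
    """Like Jaccard, but 10 and 01 mismatches carry half the weight of 11 matches."""
    s = match_counts(x, y)
    denom = 2 * s["11"] + s["10"] + s["01"]
    if denom == 0:
        raise ValueError("undefined when both vectors are all zeros")
    return 2 * s["11"] / denom

x = [1, 1, 0, 0, 0, 0]
y = [1, 0, 1, 0, 0, 0]
print(match_counts(x, y))
print(simple_matching(x, y), jaccard(x, y), dice(x, y))
```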

SLIDE 18

Some Similarity/Dissimilarity Measures for N-dim Binary Vectors

SLIDE 19

Some Similarity/Dissimilarity Measures for N-dim Binary Vectors

SLIDE 20

Weighted Dissimilarity Measures for Binary Vectors

Unequal importance to ‘0’ matches and ‘1’ matches

Multiply S00 with β ∈ [0,1]
Examples:

00 11

N S S (X,Y) Dsm ⋅ + − = β 2 ) ( 2 ) , (

00 11 00 11

S S N S S N Y X Drta ⋅ − − ⋅ − − = β β

SLIDE 21

Transforming the Data

SLIDE 22

V1 is non-linearly related to V2
V3 = 1/V2 is linearly related to V1
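The reciprocal transformation can be illustrated with made-up data. The sketch below assumes a hypothetical relationship V1 = 20 / V2, which is not taken from the slide, and compares the linear correlation of V1 with V2 against that of V1 with V3 = 1/V2.

```python
import math

def correlation(x, y):
    """Correlation coefficient with 1/n normalization."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    cov = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - xbar) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - ybar) ** 2 for b in y) / n)
    return cov / (sx * sy)

v2 = [1.0, 2.0, 4.0, 5.0, 10.0]
v1 = [20.0 / v for v in v2]  # assumed non-linear relationship, for illustration
v3 = [1.0 / v for v in v2]   # transformed variable

print(round(correlation(v1, v2), 3))  # noticeably short of -1: not a straight line
print(round(correlation(v1, v3), 3))  # 1.0: exactly linear after the transformation
```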

SLIDE 23

Variance increases (regression assumes variance is constant)
Square root transformation keeps the variance constant

SLIDE 24

Form of Data

SLIDE 25

Data Matrix

A set of p measurements on objects o(1)…o(n)
n rows and p columns
Also called standard data, data matrix or table

SLIDE 26

Multirelational Data

Payroll database has

– Employees table: name, department-name, age, salary
– Department table: department-name, budget, manager

The tables are connected to each other by the department-name field and the fields name and manager

Can be combined together, e.g., with fields name, department-name, age, salary, budget, manager

Or create as many rows as department-names
Flattening may require needless replication of values
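Flattening the two payroll tables into one data matrix can be sketched as a join on department-name. All records below are invented for illustration, and the field keys are Python-friendly spellings of the slide's field names. The output shows the replication the slide warns about: each department's budget and manager are repeated on every employee row.

```python
# Hypothetical contents of the two tables.
employees = [
    {"name": "Ana", "department_name": "Sales", "age": 34, "salary": 52000},
    {"name": "Bo", "department_name": "Sales", "age": 41, "salary": 58000},
    {"name": "Cy", "department_name": "IT", "age": 29, "salary": 61000},
]
departments = [
    {"department_name": "Sales", "budget": 900000, "manager": "Bo"},
    {"department_name": "IT", "budget": 500000, "manager": "Cy"},
]

def flatten(employees, departments):
    """Join each employee row with its department's budget and manager."""
    by_name = {d["department_name"]: d for d in departments}
    rows = []
    for e in employees:
        d = by_name[e["department_name"]]
        rows.append({**e, "budget": d["budget"], "manager": d["manager"]})
    return rows

flat = flatten(employees, departments)
for row in flat:
    print(row)
```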

SLIDE 27

Data Quality

SLIDE 28

Outlier