
SLIDE 1

Quantitative analysis with statistics (and ponies)

(Some slides and pony examples from Blase Ur)

SLIDE 2

Logistics and updates

New homework coming soon
Ethics reading for Thursday

– Ethical or not? Come prepared to vote

No office hours today

– By appointment this week instead

SLIDE 3

Statistics

The main idea: Hypothesis testing
Choosing the right test: Comparisons
Regressions
Other stuff

– Non-independence, directional tests, effect size

Tools

SLIDE 4

OVERVIEW

What’s the big idea, anyway?

SLIDE 5

Statistics

In general: analyzing and interpreting data
We often mean: Statistical hypothesis testing

– Is it unlikely the data would look like this unless there is actually a difference in real life?

SLIDE 6

The prototypical case

Q: Do ponies who drink more caffeine make better passwords?

Experiment: Recruit 30 ponies. Give 15 caffeine pills and 15 placebos. They all create passwords.

http://www.fanpop.com/clubs/my-little-pony-friendship-is-magic/images/33207334/title/little-pony-friendship-magic-photo

SLIDE 7

Hypotheses

Null hypothesis: There is no difference

– Caffeine does not affect pony password strength.

Alternative hypothesis: There is a difference

– Caffeine affects pony password strength.

Note what is not here (more on this later):

– Which direction is the effect?
– How strong is the effect?

SLIDE 8

Hypotheses, continued

Statistical test gives you one of two answers:

1. Reject the null: We have strong evidence the alternative is true.
2. Don’t reject the null: We don’t have strong evidence the alternative is true.

Again, note what isn’t here:

– We have strong evidence the null is true. (NOPE)

SLIDE 9

P values

What is the probability that the data would look like this if there’s no actual difference?

– i.e., if caffeine really does nothing, the probability we would still see a difference at least this large (and wrongly tell everyone about ponies and caffeine)

Most often, the significance threshold α = 0.05; some people choose 0.01

– If p < 0.05, reject null hypothesis; there is a “significant” difference between caffeine and placebo
– True or false ONLY: You don’t say that something is “more significant” because the p-value is lower
– A p-value is not magic, it’s just probability
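
One way to make the p-value question concrete is a permutation test: shuffle the caffeine/placebo labels many times and count how often a difference at least as large as the observed one shows up by chance. The sketch below is only an illustration; the scores, the function name, and the choice of a permutation test (rather than any test named on these slides) are assumptions.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value for a difference in means by shuffling labels."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # under the null, the labels are interchangeable
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical password-strength scores (0-100), invented for illustration.
caffeine = [72, 85, 78, 90, 66, 81, 88, 79]
placebo = [55, 61, 49, 70, 58, 63, 52, 60]
p = permutation_p_value(caffeine, placebo)
print(f"p = {p:.4f}; reject the null at alpha = 0.05? {p < 0.05}")
```

The decision at the end is binary: p is compared to α once, and a smaller p does not make the result "more significant".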

SLIDE 10

P values and correction

Type I error (false positive)

– You expect this to happen 5% of the time if α = 0.05

What happens if you conduct a lot of statistical tests in one experiment?

Many methods for “correcting” p values

– Bonferroni correction (multiply p values by the number of tests) is the easiest to calculate but most conservative
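
A minimal sketch of why many tests are a problem and what the Bonferroni correction does about it. The p-values are made up, and the false-positive formula assumes the tests are independent and every null is true.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests
    when every null hypothesis is actually true."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

raw = [0.004, 0.020, 0.030, 0.400]  # hypothetical p-values from four tests
corrected = bonferroni(raw)
print("corrected p-values:", corrected)
print("still significant at 0.05:", [p < 0.05 for p in corrected])
print("chance of a false positive in 20 tests:",
      round(familywise_error_rate(0.05, 20), 3))
```

Three of the raw p-values are below 0.05, but only the smallest survives the correction, which is what "most conservative" means in practice.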

SLIDE 11

Type II Error (False negative)

There is a difference, but you didn’t find evidence

– No one will know the power of caffeinated ponies

Hypothesis tests DO NOT BOUND this error
Instead, statistical power is the probability of rejecting the null hypothesis when it really is false

– Requires that you estimate the effect size (hard)
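
A rough sketch of a power calculation, assuming a two-group comparison of means and a normal approximation (a real t-test has slightly lower power at small sample sizes). The effect size d (difference in means divided by the standard deviation) is the part you have to guess in advance; the values used here are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(effect_size_d, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided, two-group comparison of means."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size_d * sqrt(n_per_group / 2)
    # Probability the test statistic lands in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

for n in (15, 64):
    print(f"d = 0.5, n = {n} per group: power is about {approx_power(0.5, n):.2f}")
```

With a medium effect and 15 ponies per group, the experiment would miss a real difference most of the time; this is the Type II error that the hypothesis test itself does not bound.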

SLIDE 12

Hypotheses, power, probability

After an experiment, one of four things has happened (total P=1).

Which box are you in? You don’t know.

PROBABILITY               You rejected the null           You didn’t
Reality: Difference       Estimated via power analysis    ?
Reality: No difference    Bounded by α                    ?

SLIDE 13

Correlation and causation

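Correlation: We observe that two things are related

– Do rural or urban ponies make stronger passwords?

Causation: We randomly assigned participants to groups and gave them different treatments

– If designed properly
– Do password meters help ponies?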

SLIDE 14

CHOOSING THE RIGHT TEST

SLIDE 15

What kind of data do you have?

For explanatory and outcome variables
Quantitative

– Discrete (Number of caffeine pills taken by each pony)
– Continuous (Weight of each pony)

Categorical

– Binary (Is it or isn’t it a pony?)
– Nominal: No order (Color of the pony)
– Ordinal: Ordered (Is the pony super cool, cool, a little cool, or uncool)

http://i196.photobucket.com/albums/aa92/karina408_album/Wallpaper-53.jpg

SLIDE 16

What kind of data do you have?

Does your outcome (dependent variable) follow a normal distribution? (You can test this!)

– If so, use parametric tests.
– If not, use non-parametric tests.

Are your data independent?

– If not, repeated-measures, mixed models, etc.

http://www.wikipedia.org

SLIDE 17

If both are categorical ….

Use (Pearson’s) χ² (chi-squared) test of independence.

– If there are fewer than 5 data points in any single cell, use Fisher’s Exact Test (also works with lots of data)

Do not use χ² if you are testing quantitative outcomes!

SLIDE 18

Contingency tables

Rows one variable, columns the other

Example:
χ² = 97.013, df = 14, p = 1.767e-14
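
A minimal sketch of how the chi-squared statistic and degrees of freedom come out of a contingency table. The counts are invented, no continuity correction is applied, and the closed-form p-value used here is only valid when df = 1 (a 2x2 table).

```python
from math import erfc, sqrt

def chi_squared(table):
    """Return (statistic, degrees of freedom) for Pearson's test of independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical counts: rows = caffeine / placebo, columns = cracked / not cracked.
table = [[3, 12], [10, 5]]
stat, df = chi_squared(table)
p = erfc(sqrt(stat / 2))  # tail probability of chi-squared with df = 1 only
print(f"chi-squared = {stat:.3f}, df = {df}, p = {p:.4f}")
```

Every expected count in this table is above 5; with sparser cells, Fisher's Exact Test would be the safer choice.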

SLIDE 19

Explanatory: categorical Outcome: continuous/ordinal ….

If you want to compare “Which is bigger?”
Normal, continuous outcome (compare mean):

– 2 conditions: T-test
– 3+ conditions: ANOVA

Non-normal data / ordinal data

– Does one group tend to have larger values?
– 2 conditions: Mann-Whitney U (AKA Wilcoxon rank-sum)
– 3+ conditions: Kruskal-Wallis
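
The "does one group tend to have larger values?" idea can be sketched by computing the Mann-Whitney U statistic directly: compare every value in one group with every value in the other. The ratings below are hypothetical, and the sketch stops at the statistic; turning U into a p-value is left to a stats package.

```python
def mann_whitney_u(xs, ys):
    """U statistic for xs: number of (x, y) pairs where x beats y; ties count half."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Hypothetical ordinal ratings (1 = uncool ... 4 = super cool).
caffeine = [4, 3, 4, 2, 3]
placebo = [2, 1, 3, 2, 1]
u_caffeine = mann_whitney_u(caffeine, placebo)
u_placebo = mann_whitney_u(placebo, caffeine)
print(u_caffeine, u_placebo, "out of", len(caffeine) * len(placebo), "pairs")
```

Only the ordering of values matters here, which is why the test suits ordinal and non-normal data.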

SLIDE 20

Continuous/ordinal data

SLIDE 21

What about Likert-scale data?

Respond to the statement: Ponies are magical.

– 7: Strongly agree
– 6: Agree
– 5: Mildly agree
– 4: Neutral
– 3: Mildly disagree
– 2: Disagree
– 1: Strongly disagree

SLIDE 22

What about Likert-scale data?

Some people treat it as continuous (not good)
Other people treat it as ordinal (better!)

– Difference 1-2 ≠ 2-3
– Use Mann-Whitney U / Kruskal-Wallis

Another OK option: binning (simpler)

– Transform into binary “agree” and “not agree”
– Use χ² or FET

SLIDE 23

[Figure: “Password meter annoying”, by condition. Conditions shown: baseline meter, three-segment, green, tiny, huge, no suggestions, text-only, bunny, half-score, one-third-score, nudge-16, nudge-comp8, text-only half-score, bold text-only half-score. Condition groups: Control, Visual, Scoring, Visual & Scoring.]

SLIDE 24

Contrasts

If you have more than two conditions, H1 = “the conditions are not all the same”

– “Omnibus test”

If you reject this null, you may compare conditions
Planned vs. unplanned contrasts

– N-1 free planned contrasts
– Actually, really planned. No peeking at the data.

Unplanned and post-hoc require p-correction

SLIDE 25

Contrasts in the meters paper

“We ran pairwise contrasts comparing each condition to our two control conditions, no meter and baseline meter. In addition, to investigate hypotheses about the ways in which conditions varied, we ran planned contrasts comparing tiny to huge, nudge-16 to nudge-comp8, half-score to one-third-score, text-only to text-only half-score, half-score to text-only half-score, and text-only half-score to bold text-only half-score.”

SLIDE 26

Continuous/ordinal data

SLIDE 27

REGRESSIONS

Finding a relationship among variables

SLIDE 28

Regressions

What is the relationship among variables?

– Generally one outcome (dependent variable)
– Often multiple factors (independent variables)

The type of regression you perform depends on the outcome

– Binary outcome: logistic regression
– Ordinal outcome: ordinal / ordered regression
– Continuous outcome: linear regression

SLIDE 29

Example regression

Outcome:

– Completed pony race (or not): Logistic
– Finish time in pony race: Linear

Independent variables:

– Age of pony
– Number of prior races
– Diet: hay or pop-tarts (code as eatsHay=true/false)
– (Indicator variables for color categories)
– Etc.

SLIDE 30

What you get

Linear: Outcome = ax1 + bx2 + c

– Finish time = 3 × age − 5 × eatsHay + 7

Logistic: Outcome is in log odds

– Intuition: probability of finishing decreases with age, increases if ate hay, etc.
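
A small sketch of how to read the two kinds of fitted model. The linear function uses the finish-time example from this slide; the logistic coefficients (`b_age`, `b_hay`, `intercept`) are hypothetical values chosen only to match the stated intuition. In a logistic model the linear combination is on the log-odds scale and has to be converted to get a probability.

```python
from math import exp

def finish_time(age, eats_hay):
    """The slide's linear example: finish time = 3*age - 5*eatsHay + 7."""
    return 3 * age - 5 * (1 if eats_hay else 0) + 7

def finish_probability(age, eats_hay, b_age=-0.2, b_hay=0.8, intercept=1.0):
    """Logistic model with made-up coefficients: log-odds in, probability out."""
    log_odds = b_age * age + b_hay * (1 if eats_hay else 0) + intercept
    return 1 / (1 + exp(-log_odds))

print(finish_time(4, True), finish_time(4, False))
print(round(finish_probability(4, True), 3), round(finish_probability(4, False), 3))
print(round(finish_probability(10, True), 3))
```

In the linear model each coefficient shifts the outcome by a fixed amount; in the logistic model the same shift in log-odds changes the probability by different amounts depending on where you start.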

SLIDE 31

Interactions in a regression

Normally, outcome = ax1 + bx2 + c + …
Interactions account for situations when two variables are not simply additive. Instead, their interaction impacts the outcome

– e.g., Maybe brown ponies, and only brown ponies, get a larger benefit from eating pop-tarts before a race

Outcome = ax1 + bx2 + c + d(x1x2) + …
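
A worked instance of the interaction formula, with x1 = is brown and x2 = ate pop-tarts coded as 0/1. All four coefficients are invented for illustration.

```python
def outcome(x1, x2, a, b, c, d=0.0):
    """outcome = a*x1 + b*x2 + c + d*(x1*x2); d = 0 gives the purely additive model."""
    return a * x1 + b * x2 + c + d * (x1 * x2)

# Hypothetical coefficients: a (brown), b (pop-tarts), c (intercept), d (interaction).
a, b, c, d = 1.0, 2.0, 10.0, 4.0

benefit_other = outcome(0, 1, a, b, c, d) - outcome(0, 0, a, b, c, d)
benefit_brown = outcome(1, 1, a, b, c, d) - outcome(1, 0, a, b, c, d)
print("pop-tart benefit, non-brown ponies:", benefit_other)
print("pop-tart benefit, brown ponies:", benefit_brown)
```

The interaction coefficient d is exactly the extra benefit that brown ponies get; with d = 0 both groups would gain the same amount b.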

SLIDE 32

Example logistic regression output

Factor                   Coef.      Exp(coef)  SE      p-value
number of digits         -0.343     0.709      0.009   <0.001
number of lowercase      -0.355     0.701      0.008   <0.001
number of uppercase      -0.783     0.457      0.028   <0.001
number of symbols        -0.582     0.559      0.037   <0.001
digits in middle         -0.714     0.490      0.040   <0.001
digits spread out        -1.624     0.197      0.051   <0.001
digits at beginning      -0.256     0.774      0.066   <0.001
uppercase in middle      -0.168     0.845      0.105   0.108†
uppercase spread out      0.055     1.057      0.114   0.629†
uppercase at beginning    0.631     1.879      0.105   <0.001
symbols in middle        -0.844     0.430      0.038   <0.001
symbols spread out       -1.217     0.296      0.085   <0.001
symbols at beginning     -0.287     0.751      0.070   <0.001
gender (male)            -4.4E-4    1.000      0.023   0.985†
birth year                0.005     1.005      0.001   <0.001
engineering              -0.140     0.870      0.042   <0.001
humanities               -0.078     0.925      0.049   0.108†
public policy             0.029     1.029      0.051   0.576†
science                  -0.161     0.851      0.055   0.003
other                    -0.066     0.936      0.046   0.154†
computer science         -0.195     0.823      0.047   <0.001
business                  0.167     1.182      0.049   <0.001
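
The Exp(coef) column is just e raised to the coefficient, which turns a log-scale coefficient into a multiplicative factor (above 1 means the factor pushes the outcome up, below 1 means down). A quick check against three rows of the table above:

```python
from math import exp

# (factor, coefficient, reported Exp(coef)) copied from the table above.
rows = [
    ("uppercase at beginning", 0.631, 1.879),
    ("business", 0.167, 1.182),
    ("birth year", 0.005, 1.005),
]
for factor, coef, reported in rows:
    print(f"{factor}: exp({coef}) = {exp(coef):.3f} (table reports {reported})")
```

The same relationship means any row whose Exp(coef) is below 1 must have a negative coefficient.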

SLIDE 33

What if you have lots of questions?

If we ask 40 privacy questions on a Likert scale, how do we analyze this survey?

One option: Add responses to get “privacy score”

– Make sure the scales are the same
– Reverse if needed (e.g., “personal privacy is important to me” vs. “I don’t care if companies sell my data”)
– Important: Verify that responses are correlated!
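
A minimal sketch of building such a score, assuming every item uses the same 1-7 scale. The function names, the three-item survey, and the choice of which item is reversed are all hypothetical.

```python
def reverse(response, scale_min=1, scale_max=7):
    """Flip a Likert response so that a high number always means the same thing."""
    return scale_max + scale_min - response

def privacy_score(responses, reversed_items):
    """Sum responses, reverse-coding the items whose index is in reversed_items."""
    return sum(
        reverse(r) if i in reversed_items else r
        for i, r in enumerate(responses)
    )

# Item 0: "personal privacy is important to me" (7 = strongly agree)
# Item 1: "I don't care if companies sell my data" (reversed: agreeing means LESS concern)
# Item 2: another privacy-concern item, scored in the same direction as item 0
responses = [7, 1, 6]
print(privacy_score(responses, reversed_items={1}))
```

Without the reversal, a very privacy-sensitive pony and an indifferent one could end up with similar totals.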

SLIDE 34

Verifying correlation

Usually preferred: Spearman’s rank correlation coefficient (Spearman’s ρ)

– Evaluates a relationship’s monotonicity
– e.g., all variables get larger with privacy sensitivity
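
Spearman's ρ can be sketched as Pearson's correlation applied to ranks instead of raw values, which is why it measures monotonicity rather than linearity. The helper names and data below are illustrative assumptions; tied values share the average of their ranks.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def spearman_rho(xs, ys):
    """Spearman's rho: Pearson's correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical answers to two privacy questions from five ponies.
q1 = [1, 2, 3, 4, 5]
q2 = [2, 4, 5, 7, 20]  # not linear in q1, but always increasing
print(spearman_rho(q1, q2))
```

Because only the ordering is used, the jump from 7 to 20 in the second question does not reduce the correlation.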

SLIDE 35

Another option: Factor analysis

Evaluate underlying factors you are detecting
You specify N, a number of factors
Algorithm groups related questions (N groups)

– Each group is a factor

Factor loadings measure goodness of correlation

– Questions loading primarily onto one factor are useful

SLIDE 36

In groups: Plan your analysis

Does caffeine impact pony password strength?

– When strength = cracked or not cracked
– When strength = 0-100 scoring
– Compare caffeine, NyQuil, placebo

Do gender, age, state of residence, and education level impact pony privacy concern?

– Concerned vs. unconcerned
– Privacy “score” by adding 30 questions

SLIDE 37

OTHER THINGS TO CONSIDER

Non-independence, directional testing, effect size

SLIDE 38

Independence

Why might your data not be independent?

– Non-independent sample (bad!)
– The inherent design of the experiment (ok!)

Example: Same ponies make passwords, before and after taking the caffeine pills

– Each pony cannot be independent of itself

SLIDE 39

Repeated measures

AKA within subjects

– Measure the same participant multiple times

Paired T-test

– Two samples per participant, two groups

Repeated measures ANOVA

– More general
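
A sketch of what makes a paired T-test "paired": it works on each pony's own before/after difference rather than on two separate groups. The scores are hypothetical, and the sketch computes only the t statistic and degrees of freedom, not the p-value.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(before, after):
    """Return (t, df) for a paired t-test: mean difference over its standard error."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Hypothetical password scores for the same five ponies.
before = [50, 62, 47, 70, 55]
after = [58, 65, 55, 71, 63]
t, df = paired_t_statistic(before, after)
print(f"t = {t:.3f}, df = {df}")
```

Differences between ponies (some are simply better at passwords) cancel out of the per-pony differences, which is the benefit of a within-subjects design.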

SLIDE 40

Hierarchy and mixed model

For regressions, use a “mixed model”
Intuition: Each pony’s result driven by combo of individual skills, group characteristics, treatment effects

Case 1: Many measurements of each pony
Case 2: The ponies have some other relationship.

– e.g., all ponies attended 1 of 5 security camps. (You want to control for this, but not evaluate it.)

SLIDE 41

Directional testing

If your hypothesis goes one way:

Caffeinated ponies make stronger passwords.

More power than more general tests

– BUT, must select direction BEFORE looking at data
– Won’t reject null if there’s a difference the other way

Example: One-tailed T-test
Use with caution!
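
A sketch of the trade-off, using a standard normal test statistic as a stand-in for the T-test (an assumption made to keep the example in the standard library). The z values are hypothetical; the one-tailed direction is "caffeinated ponies score higher".

```python
from statistics import NormalDist

def p_values_from_z(z):
    """Return (two-tailed p, one-tailed p for the pre-chosen 'greater' direction)."""
    std_normal = NormalDist()
    two_tailed = 2 * (1 - std_normal.cdf(abs(z)))
    one_tailed = 1 - std_normal.cdf(z)
    return two_tailed, one_tailed

for z in (1.8, -1.8):
    two, one = p_values_from_z(z)
    print(f"z = {z:+.1f}: two-tailed p = {two:.3f}, one-tailed p = {one:.3f}")
```

The same effect in the predicted direction passes the one-tailed test but not the two-tailed one; an equally large effect in the opposite direction gets a one-tailed p close to 1.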

SLIDE 42

Effect size

Hypothesis test: Is there a difference?
Also (more?) important: How big a difference?
Findings can be “significant” but unimportant

Factor               Coef.     Exp(coef)  SE       p-value
login count          <0.001    1.000      <0.001   <0.001
password fail rate   -0.543    0.581      0.116    <0.001
gender (male)        -0.078    0.925      0.027    0.005
engineering          -0.273    0.761      0.048    <0.001
humanities           -0.107    0.898      0.054    0.048
public policy         0.079    1.082      0.058    0.176†

SLIDE 43

TOOLS

SLIDE 44

So how do I DO these tests?

Excel: Very easy, but not very powerful
R: Most powerful, steepest learning curve

– Like Matlab but for stats
– Somewhat bizarre language/API/data representation
– Free and open-source (awesome add-on packages)

SPSS: Graphical, pretty powerful

– Expensive ($25 student license from Terpware)
– Somewhat scriptable, not as flexible as R

SLIDE 45

R tutorials

http://www.statmethods.net
http://cyclismo.org/tutorial/R/
