Introduction to Survey Statistics, Day 2: Sampling and Weighting



SLIDE 1

Introduction to Survey Statistics – Day 2 Sampling and Weighting

Federico Vegetti
Central European University
University of Heidelberg

1 / 32

SLIDE 2

Sources of error in surveys

Figure 1: From Groves et al. (2009)

2 / 32

SLIDE 3

Representation error

◮ The difference between the values that we observe in the sample and the true values in the population
◮ It has many sources
  ◮ Coverage, sampling, non-response
◮ Sampling is arguably the most relevant
◮ However, a similar logic applies to all of them

3 / 32

SLIDE 4

Two types of error

◮ Bias: the deviation from the true value systematically goes in a specific direction
  ◮ E.g. we want to know whether people liked the new Star Wars movie
  ◮ We interview people leaving the opera house after a Wagner opera
  ◮ Our sample will probably show lower appreciation of the movie than the average moviegoer
◮ Variability: the deviation from the true value is due to chance
  ◮ We sample 100 people from the Berlin phone list and ask them about their attitude towards EU integration
  ◮ The next day we draw another 100 people from the same list and ask the same question
  ◮ Most likely the figures won't be identical

4 / 32
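The Berlin phone-list example can be imitated with a small simulation. This is only a sketch: the frame size (100,000) and the true share in favour (55%) are invented for illustration and do not come from the slides.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical sample frame (an assumption): 100,000 people, 55% in favour
frame = [1] * 55_000 + [0] * 45_000

def sample_proportion(frame, n):
    """Draw a simple random sample of size n and return the share in favour."""
    return statistics.mean(random.sample(frame, n))

# Two independent draws of 100 people, as on two consecutive days
day_1 = sample_proportion(frame, 100)
day_2 = sample_proportion(frame, 100)
print(f"day 1: {day_1:.2f}, day 2: {day_2:.2f}")

# Repeating the draw many times shows how the estimates scatter around 0.55
estimates = [sample_proportion(frame, 100) for _ in range(2_000)]
print(f"mean of estimates: {statistics.mean(estimates):.3f}")
print(f"std. dev. of estimates: {statistics.stdev(estimates):.3f}")
```

The estimates scatter around the true value without leaning in one direction. That is variability, as opposed to bias.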

SLIDE 5

Sampling and variability

Figure 2: From Groves et al. (2009)

5 / 32

SLIDE 6

Standard error

◮ Variability between samples is reflected in the variability within the sample
◮ In fact, the standard error of an estimated parameter is interpreted as the standard deviation of that estimate across different independent samples
◮ It is calculated from the variance observed within the sample
◮ It adjusts for the number of observations
  ◮ The more observations we have, the more information we have, and the more precise our estimate is
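For the sample mean, the usual textbook formula is the sample standard deviation divided by the square root of the number of observations. The sketch below assumes a simple random sample and ignores any finite population correction; the data values are made up.

```python
import math
import statistics

def standard_error_of_mean(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    n = len(values)
    return statistics.stdev(values) / math.sqrt(n)

# Hypothetical measurements from 8 respondents
sample = [4, 7, 5, 6, 8, 5, 6, 7]
se_small = standard_error_of_mean(sample)

# Same values repeated 4 times: similar spread, but four times the observations
se_large = standard_error_of_mean(sample * 4)

print(f"n = 8:  SE = {se_small:.3f}")
print(f"n = 32: SE = {se_large:.3f}")
```

With roughly the same within-sample variability, the larger sample gives a smaller standard error.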

6 / 32

SLIDE 7

Two goals

  • 1. Reduce the bias of the parameter estimates
  • 2. Increase the precision of the parameter estimates

◮ We can do a lot to reach these goals when planning the data collection
◮ As a less optimal solution, we can also adjust the data after collection to make them more representative of the population

7 / 32

SLIDE 8

On inference, again

◮ We saw two inferences that we make when we work with survey data:
  • 1. From answers to questions to individual characteristics
  • 2. From samples to populations
◮ In statistics, there is a distinction between model-based and design-based inference
◮ To a certain extent, these two types mirror the two inferences we make with survey data

8 / 32

SLIDE 9

Model-based inference

◮ Inferences that require us to make assumptions regarding the process that generated the data
◮ Assumptions are theories
  ◮ We assume/theorize that a dichotomous variable (e.g. voting/not voting) has been generated by a Bernoulli distribution
  ◮ We assume/theorize that an outcome is a function of some predictors
◮ In fact we do not know what model generated the data, but we offer an approximation of reality with our theory
◮ As long as our assumptions are correct, our results can be generalized to other situations where the same process is at work

9 / 32

SLIDE 10

Model-based inference (2)

◮ Maximum Likelihood estimation is a classic example of model-based inference
◮ Our sample is assumed to be a realization of an infinite population that follows a given theoretical distribution
◮ Observations in the sample are linked to observations outside the sample by the assumption that they all come from the same distribution
◮ The parameters that we estimate from the sample are then our best guess about the values of the true parameters in the population given the data
◮ The sample does not need to be random, as long as we control for possible factors that make it different from the population

10 / 32

SLIDE 11

Model-based inference and measurement

◮ When we model a survey outcome (e.g. the response to a logic quiz) we assume that it has been produced by a random process that we theorize (e.g. intelligence)
◮ In this framework, interpreting the output of a regression and interpreting the parameters of the distribution of a survey variable both imply making a model-based inference
◮ The idea that measurement can be conceptualized as a statistical model where an observed outcome is a function of a hypothesized (latent) process is behind most psychometric methods

11 / 32

SLIDE 12

Design-based inference

◮ Example: a randomized experiment
  ◮ We want to see if a drug cures depression
  ◮ We take a pool of subjects with depression
  ◮ We assign them randomly to one of two groups
  ◮ To the subjects in one group we give the actual drug, to the others we give a placebo
  ◮ We keep them all in a clinic where they receive exactly the same treatment in all other respects

12 / 32

SLIDE 13

Design-based inference (2)

◮ In a randomized experiment:
  • 1. We know which subjects have been given the treatment
  • 2. We know that the only thing that differs between groups is the treatment itself
◮ What allows us to make a valid inference in experiments is random assignment
◮ To make sure that the only systematic difference between the two groups is the treatment, we must assign units randomly to one group or the other
◮ In other words, we know that each unit has an equal probability of ending up in either of the two groups
◮ This knowledge is the central point of design-based inference

13 / 32

SLIDE 14

Design-based inference in surveys

◮ Design-based inference allows us to draw conclusions about a variable in the target population by looking at a sample and without assuming an underlying generative model
◮ In other words, we can carry descriptive evidence directly from the sample over to the population
◮ To be able to do so, we need to know the design that has been used to produce the sample
◮ This implies:
  ◮ Knowing the sample frame (the finite population from which the sample is drawn)
  ◮ Knowing the selection process for the observations (what rules drive the random sampling procedure)

14 / 32

SLIDE 15

Random samples

A random sample is a sample with the following characteristics (see Lumley 2010):

  • 1. Every individual $i$ in the sample frame has a non-zero probability $\pi_i$ of ending up in the sample
  • 2. We can calculate this probability for every unit in the sample
  • 3. Every pair of individuals $i$ and $j$ in the sample frame has a non-zero probability $\pi_{ij}$ of ending up together in the sample
  • 4. We can calculate this probability for every pair of units in the sample

◮ Note that if individuals are sampled independently from each other, then $\pi_{ij} = \pi_i \pi_j$

15 / 32

SLIDE 16

Nonrandom samples

◮ When conditions 1 and 2 are not met, we have a nonrandom sample
◮ In nonrandom samples
  ◮ We might not know the sampling frame
    ◮ E.g. we take everyone who shows up in the lab
  ◮ We might not be able to calculate the probabilities of selection
    ◮ E.g. we use snowball sampling
◮ Nonrandom samples are very common in social science
◮ We can still use them to draw a model-based inference, under certain conditions (see Sterba 2009)

16 / 32

SLIDE 17

Simple random samples

◮ In a simple random sample we choose units at random from the entire population
◮ The probability of inclusion for all units is $\pi_i = n_i/N_i$
  ◮ where $n_i$ is the sample size and $N_i$ the size of the sample frame
◮ Such probabilities serve as the basis to calculate sampling weights
  ◮ Weights are then calculated as $1/\pi_i$ for each unit $i$
  ◮ They reflect how many units in the sample frame each observation in the sample represents

17 / 32

SLIDE 18

Sampling weights in simple random samples (2)

◮ Example: we take a random sample of 1,000 respondents from a sample frame of 100,000 individuals
  ◮ For each individual, $\pi = 1000/100000 = 0.01$
  ◮ Then $1/0.01 = 100$
  ◮ Every respondent represents 100 people in the sample frame
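The same arithmetic as a short sketch. The function name `srs_weight` is invented for illustration; the numbers are the ones from the example above.

```python
def srs_weight(sample_size, frame_size):
    """Return (inclusion probability, sampling weight) for a simple random sample."""
    pi = sample_size / frame_size
    weight = frame_size / sample_size  # algebraically the same as 1 / pi
    return pi, weight

pi, weight = srs_weight(1_000, 100_000)
print(f"inclusion probability: {pi}")
print(f"sampling weight: {weight}")

# Sanity check: the weights of all respondents add up to the frame size
total_represented = 1_000 * weight
print(f"people represented by the whole sample: {total_represented:.0f}")
```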

18 / 32

SLIDE 19

Stratified samples

◮ We divide the population into groups that are
  ◮ Internally homogeneous (with respect to specific characteristics)
  ◮ Mutually exclusive
  ◮ Collectively exhaustive
◮ We draw a random sample within each group
◮ This way we make sure that observations from each stratum end up in the sample
◮ Obviously, we need to know the stratum membership of each observation before we contact them

19 / 32

SLIDE 20

Stratified samples (2)

◮ Stratified samples increase the precision of the estimated parameters
  ◮ Estimates tend to have smaller standard errors than in simple random samples
  ◮ But only when the variables for which we estimate the parameter are predicted by the variables used to stratify
◮ Why?
  ◮ The precision of an estimate is always a function of the amount of information that we have
  ◮ In stratified samples, the mere presence of an observation in the sample conveys information about some characteristics of that observation
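A small simulation can illustrate the precision gain. Everything here is an assumption made for the sketch: two equally sized strata whose means differ strongly (20 vs. 40), a total sample of 100, and 1,000 repetitions.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: two equally sized strata with very different means
stratum_a = [random.gauss(20, 2) for _ in range(5_000)]
stratum_b = [random.gauss(40, 2) for _ in range(5_000)]
population = stratum_a + stratum_b

def srs_mean(n):
    """Mean of a simple random sample drawn from the whole population."""
    return statistics.mean(random.sample(population, n))

def stratified_mean(n):
    """Mean from a stratified sample with n/2 units per stratum.

    The strata are equally large, so the population mean is estimated
    by the plain average of the two stratum means.
    """
    half = n // 2
    mean_a = statistics.mean(random.sample(stratum_a, half))
    mean_b = statistics.mean(random.sample(stratum_b, half))
    return (mean_a + mean_b) / 2

srs_sd = statistics.stdev([srs_mean(100) for _ in range(1_000)])
strat_sd = statistics.stdev([stratified_mean(100) for _ in range(1_000)])
print(f"spread of SRS estimates:        {srs_sd:.3f}")
print(f"spread of stratified estimates: {strat_sd:.3f}")
```

The stratified estimates vary much less because the stratifying variable predicts the outcome. If the two strata had the same mean, the gain would largely disappear.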

20 / 32

SLIDE 21

Weights in stratified samples

◮ Stratified samples are simple random samples drawn within each stratum
◮ Hence, the probability of selection for an individual $i$ in a stratum $s$ is $\pi_{is} = n_{is}/N_{is}$
  ◮ where $n_{is}$ is the sample size and $N_{is}$ the population size within the stratum $s$
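As in the simple random sample, the weight is the inverse of the selection probability, so within a stratum it is the stratum population size divided by the stratum sample size. A minimal sketch, with stratum names and sizes invented for illustration:

```python
def stratum_weights(sample_sizes, population_sizes):
    """Sampling weight per stratum: population size / sample size (i.e. 1 / pi)."""
    return {s: population_sizes[s] / n_s for s, n_s in sample_sizes.items()}

# Hypothetical design: equal sample sizes drawn from unequal strata
population_sizes = {"urban": 60_000, "rural": 40_000}
sample_sizes = {"urban": 500, "rural": 500}

weights = stratum_weights(sample_sizes, population_sizes)
for stratum, w in weights.items():
    print(f"{stratum}: each respondent represents {w:.0f} people")

# Sanity check: weighted sample sizes add up to the total population
represented = sum(weights[s] * sample_sizes[s] for s in sample_sizes)
```

Respondents from the larger stratum carry a larger weight here because both strata contribute the same number of respondents.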

21 / 32

SLIDE 22

Cluster sampling

◮ Drawing a random sample from the entire population may be difficult when surveys are conducted face-to-face
◮ An alternative is to divide the population into clusters (e.g. districts) and take a random sample of clusters
◮ Then we can either:
  ◮ Take all units inside each sampled cluster (single-stage sampling)
  ◮ Sample further within each sampled cluster (multistage sampling)

22 / 32

SLIDE 23

Cluster sampling (2)

◮ Unlike stratified sampling, cluster sampling decreases the precision of the estimated parameters
◮ Why?
  ◮ People in the same cluster tend to be more similar to one another than people from different clusters
  ◮ Formally, values of respondents from the same cluster tend to be more correlated
  ◮ With a clustered sample, the correlation between units will be on average higher
  ◮ Hence, the information that we get from each respondent will be a bit less than with a random sample of the full population
◮ This is less of a problem the more similar the clusters are to one another

23 / 32

SLIDE 24

Weights in clustered samples

◮ In single-stage cluster sampling, the probability $\pi_i$ that an individual $i$ is sampled is equal to the probability $\pi_c$ that the cluster $c$ to which the individual belongs is sampled
  ◮ Where $\pi_c = n_c/N_c$
  ◮ $n_c$ is the number of sampled clusters
  ◮ $N_c$ is the total number of clusters in the sample frame
◮ In multistage sampling, $\pi_i$ is also a function of the probability $\pi_{ic}$ that $i$ is sampled within the cluster $c$, so that $\pi_i = \pi_c \pi_{ic}$
  ◮ Where $\pi_{ic} = n_{ic}/N_{ic}$
  ◮ $n_{ic}$ is the sample size within the cluster $c$
  ◮ $N_{ic}$ is the population size within the cluster $c$

24 / 32
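A sketch of the two-stage case from slide 24, with invented numbers (10 of 200 districts sampled, then 50 of 1,000 residents within a sampled district). The function and argument names are illustrative only.

```python
def multistage_inclusion_probability(clusters_sampled, clusters_total,
                                     units_sampled, units_in_cluster):
    """pi_i = pi_c * pi_ic for a two-stage cluster sample."""
    pi_c = clusters_sampled / clusters_total   # probability that the cluster is drawn
    pi_ic = units_sampled / units_in_cluster   # probability of selection within the cluster
    return pi_c * pi_ic

# Hypothetical design: 10 of 200 districts, then 50 of 1,000 residents
pi_i = multistage_inclusion_probability(10, 200, 50, 1_000)
weight = 1 / pi_i

print(f"inclusion probability: {pi_i:.4f}")
print(f"sampling weight: {weight:.0f}")
```

Setting `units_sampled` equal to `units_in_cluster` gives the single-stage case, where the probability within the cluster is 1 and the individual's probability equals the cluster's.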

SLIDE 25

What do we do with weights?

◮ We may need weights to calculate sample statistics, especially if we want to obtain descriptive statistics about the sample
◮ For instance, if we have a stratified sample, weights allow us to compute unbiased and efficient (i.e. with high precision) parameter estimates
◮ We can adjust the sample weights to correct for deviations of the sample from some (known) parameters of the population

25 / 32

SLIDE 26

Horvitz-Thompson estimator

◮ Estimates of the population total are the basis for most other more complex statistics
◮ The Horvitz-Thompson estimator is a method used to calculate the population total (and its standard error)

$$\hat{T}_X = \sum_{i=1}^{n} \frac{1}{\pi_i} X_i$$

◮ Where:
  ◮ $X_i$ is the measurement of variable $X$ for respondent $i$
  ◮ $\pi_i$ is the probability of inclusion for respondent $i$

26 / 32

SLIDE 27

Horvitz-Thompson estimator (2)

◮ From here we can obtain, for instance, the estimated population mean of $X$ by dividing $\hat{T}_X$ by the population size $N$

$$\hat{\mu}_X = \frac{1}{N} \sum_{i=1}^{n} \frac{1}{\pi_i} X_i$$

◮ Which, in a simple random sample, is equivalent to the sample average

$$\hat{\mu}_X = \frac{1}{n} \sum_{i=1}^{n} X_i$$

◮ In a stratified sample, the formula for $\hat{\mu}_X$ produces what is often called the weighted mean of $X$, which is an unbiased and efficient estimator of the population mean
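The total and the mean can be sketched in a few lines. The data are invented: a simple random sample of 4 units from a frame of 400, so every inclusion probability is 0.01. The standard error of the estimator is not covered here.

```python
def horvitz_thompson_total(values, inclusion_probs):
    """Estimated population total: sum of X_i / pi_i over the sampled units."""
    return sum(x / pi for x, pi in zip(values, inclusion_probs))

def horvitz_thompson_mean(values, inclusion_probs, population_size):
    """Estimated population mean: estimated total divided by N."""
    return horvitz_thompson_total(values, inclusion_probs) / population_size

# Hypothetical simple random sample: n = 4 from N = 400, so pi_i = 4/400
values = [2.0, 4.0, 6.0, 8.0]
inclusion_probs = [4 / 400] * 4

total = horvitz_thompson_total(values, inclusion_probs)
mean = horvitz_thompson_mean(values, inclusion_probs, 400)
sample_average = sum(values) / len(values)

print(f"estimated total: {total:.1f}")
print(f"estimated mean: {mean:.2f}, plain sample average: {sample_average:.2f}")
```

With equal inclusion probabilities the estimated mean coincides with the plain sample average, as stated on the slide. With unequal probabilities (e.g. a stratified design) the two would generally differ.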

27 / 32

SLIDE 28

Post-stratification

◮ Suppose we have a sample where females are 48% and males are 52%, but we know that in the population females are 52% and males are 48%
◮ If our sample was stratified on sex, this difference in proportion would be reflected in the weights
◮ However
  ◮ The sample cannot be stratified on everything
  ◮ Nonresponse patterns may be different between groups
  ◮ Group proportions in the sample may end up being different from the ones in the population by chance
◮ Even in these cases, we can adjust the weights so that groups have the same proportion that they would have in a stratified sample
◮ This adjustment is called post-stratification

28 / 32

SLIDE 29

Post-stratification (2)

◮ When we apply post-stratification, we substitute the sampling weights $1/\pi_i$ with $g_i/\pi_i$
  ◮ Where $g_i = N_k/\hat{N}_k$
  ◮ $N_k$ is the population size in the group (or stratum) $k$
  ◮ $\hat{N}_k$ is the Horvitz-Thompson estimator of the population size in the group $k$
◮ In other words, we change the values of the weights so that the group size in the sample matches the group size in the population
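A minimal sketch of this adjustment, using the 48%/52% example from the previous slide. The sample size (100), the population size (10,000) and the function name are assumptions made for illustration. The estimated group size $\hat{N}_k$ is computed as the sum of the design weights in group $k$.

```python
def poststratify(design_weights, groups, population_sizes):
    """Multiply each design weight 1/pi_i by g_i = N_k / N_k_hat."""
    # Horvitz-Thompson estimate of each group's size: sum of its design weights
    estimated_sizes = {}
    for w, k in zip(design_weights, groups):
        estimated_sizes[k] = estimated_sizes.get(k, 0.0) + w
    return [w * population_sizes[k] / estimated_sizes[k]
            for w, k in zip(design_weights, groups)]

# Hypothetical sample of 100 from a population of 10,000 (design weight 100 each):
# 48 females and 52 males, while the population has 5,200 females and 4,800 males
groups = ["F"] * 48 + ["M"] * 52
design_weights = [100.0] * 100
population_sizes = {"F": 5_200, "M": 4_800}

adjusted = poststratify(design_weights, groups, population_sizes)
female_total = sum(w for w, k in zip(adjusted, groups) if k == "F")
male_total = sum(w for w, k in zip(adjusted, groups) if k == "M")
print(f"weighted females: {female_total:.0f}, weighted males: {male_total:.0f}")
```

After the adjustment the weighted group sizes match the known population sizes: the weights of the under-represented females go up, those of the males go down.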

29 / 32

SLIDE 30

Raking

◮ We may need post-stratification to be performed for more than one variable
  ◮ This is more often the rule than the exception
◮ Ideally we would need a complete cross-classification of the variables
  ◮ E.g. males of age 18-24 and low education, males of age 18-24 and high education, etc.
  ◮ However, some resulting combinations may be so atypical that nobody ends up sampled in those categories
◮ Raking is an iterative procedure that allows us to post-stratify on multiple grouping factors without the need for a full cross-classification
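One common way to implement raking is iterative proportional fitting: post-stratify on the first factor, then on the second, and repeat until the weights stop changing. The sketch below assumes that approach; the function name, the stopping rule and the toy data are all invented for illustration.

```python
def rake(weights, factors, margins, max_iterations=50, tolerance=1e-8):
    """Adjust weights so that the weighted totals match each factor's known margins.

    factors: {factor name: list with one category label per respondent}
    margins: {factor name: {category: known population total}}
    """
    w = list(weights)
    for _ in range(max_iterations):
        largest_adjustment = 0.0
        for name, labels in factors.items():
            # Current weighted total of each category of this factor
            totals = {}
            for wi, k in zip(w, labels):
                totals[k] = totals.get(k, 0.0) + wi
            # Post-stratify on this factor alone
            for idx, k in enumerate(labels):
                g = margins[name][k] / totals[k]
                largest_adjustment = max(largest_adjustment, abs(g - 1.0))
                w[idx] *= g
        if largest_adjustment < tolerance:
            break  # all margins already match
    return w

# Hypothetical sample of four respondents, design weight 25 each (N = 100)
factors = {
    "sex": ["F", "F", "M", "M"],
    "age": ["young", "old", "young", "old"],
}
margins = {
    "sex": {"F": 52, "M": 48},
    "age": {"young": 40, "old": 60},
}
raked = rake([25.0] * 4, factors, margins)
print([round(x, 2) for x in raked])
```

Only the marginal totals of sex and age are needed, not the size of each sex-by-age cell, which is the point made on the slide.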

30 / 32

SLIDE 31

Final remarks

◮ Note that the use of weights and of post-stratification adjustments is necessary to have unbiased estimates of population parameters under a design-based inference paradigm
◮ When we make a model-based inference, what counts is that our model is correctly specified
◮ This usually implies
  ◮ Assuming the correct data generating process for the outcome variable
  ◮ Assuming a correct specification for the function predicting the outcome variable
◮ In regression models, we often include as predictors the variables that in design-based inference we use to post-stratify

31 / 32

SLIDE 32

References

Groves, Robert M., Floyd J. Fowler Jr., Mick P. Couper, James M. Lepkowski, Eleanor Singer, and Roger Tourangeau. 2009. Survey Methodology. 2nd edition. Hoboken, NJ: Wiley.

Lumley, Thomas. 2010. Complex Surveys: A Guide to Analysis Using R. 1st edition. Hoboken, NJ: Wiley.

Sterba, Sonya K. 2009. “Alternative Model-Based and Design-Based Frameworks for Inference from Samples to Populations: From Polarization to Integration.” Multivariate Behavioral Research 44 (6): 711–40.

32 / 32