Introduction to Survey Statistics – Day 2 Sampling and Weighting
Federico Vegetti, Central European University, University of Heidelberg
Sources of error in surveys
Figure 1: From Groves et al. (2009)
Representation error
◮ Coverage, Sampling, Non-response
◮ E.g. We want to know whether people liked the new Star Wars
◮ We interview people leaving the Opera house after a Wagner’s
◮ Our sample will probably show lower appreciation of the movie
◮ We sample 100 people from the phone list of Berlin, and ask
◮ The next day we draw another 100 people from the same list, and
◮ Most likely the figures won’t be identical
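The point about repeated draws can be illustrated with a small simulation. This is only a sketch: the frame of 10,000 entries, the 60% share of "yes" answers, and the names `population` and `draw_sample_share` are assumptions made for illustration, not figures from the slides.

```python
import random


def draw_sample_share(population, n, rng):
    """Draw a simple random sample of size n and return the share of 1s."""
    sample = rng.sample(population, n)
    return sum(sample) / n


rng = random.Random(42)
# Hypothetical sampling frame: 10,000 entries, 60% of which answer "yes" (1).
population = [1] * 6000 + [0] * 4000

day_one = draw_sample_share(population, 100, rng)
day_two = draw_sample_share(population, 100, rng)
print(day_one, day_two)
```

Both values estimate the same population share, yet they will generally differ from each other and from 0.6. That difference is sampling error.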
◮ The more observations we have, the more information we have,
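One common way to make this concrete is the standard error of a sample proportion under simple random sampling, $\sqrt{p(1-p)/n}$. The sketch below assumes a true share of p = 0.5 purely for illustration; the function name is hypothetical.

```python
import math


def standard_error_of_proportion(p, n):
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(p * (1 - p) / n)


# Each step quadruples the sample size.
for n in (100, 400, 1600):
    print(n, round(standard_error_of_proportion(0.5, n), 4))
```

Quadrupling the number of observations halves the standard error, so precision grows with the square root of the sample size rather than linearly.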
◮ We assume/theorize that a dichotomous variable (e.g. voting/not
◮ We assume/theorize that an outcome is a function of some
◮ We want to see if a drug cures depression
◮ We take a pool of subjects with depression
◮ We assign them randomly to either one of two groups
◮ To the subjects in one group we give the actual drug, to the
◮ We keep them all in a clinic where they have the exact same
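The random assignment step in this example can be sketched as follows. The subject labels, the pool of 20 subjects, and the names `assign_randomly`, `drug_group` and `other_group` are hypothetical and only serve the illustration.

```python
import random


def assign_randomly(subjects, rng):
    """Shuffle the subjects and split them into two groups of (near) equal size."""
    shuffled = list(subjects)  # copy, so the original order is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


rng = random.Random(0)
subjects = [f"subject_{i}" for i in range(20)]  # hypothetical subject pool
drug_group, other_group = assign_randomly(subjects, rng)
print(len(drug_group), len(other_group))
```

Because chance alone decides who ends up in which group, the two groups differ only randomly on every characteristic other than the treatment.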
◮ To make sure that the only systematic difference between the
◮ In other words, we can draw descriptive evidence directly from
◮ Knowing the sample frame (the finite population from which
◮ Knowing the selection process for the observations (what rules
◮ We might not know the sampling frame
◮ E.g. we take everyone who shows up in the lab
◮ We might not be able to calculate the probabilities of selection
◮ E.g. we use snowball sampling
◮ where $n_i$ is the sample size and $N_i$ the size of the sample frame
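The formula these symbols belong to did not survive in this text. Under simple random sampling without replacement, the inclusion probability is the sample size divided by the frame size, which matches the form of the cluster-level definition ($\pi_c = n_c/N_c$) given later in the slides. A minimal sketch under that assumption; the function name and the numbers are hypothetical:

```python
from fractions import Fraction


def inclusion_probability(sample_size, frame_size):
    """Inclusion probability under simple random sampling: n / N."""
    if not 0 < sample_size <= frame_size:
        raise ValueError("need 0 < sample_size <= frame_size")
    return Fraction(sample_size, frame_size)


# Hypothetical design: 100 people drawn from a frame of 5,000.
pi = inclusion_probability(100, 5000)
print(pi, float(pi))  # 1/50 0.02
```

Every unit in the frame has the same probability of ending up in the sample, which is what makes the design "simple".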
◮ Internally homogeneous (with respect to specific characteristics)
◮ Mutually exclusive
◮ Collectively exhaustive
◮ They tend to have smaller standard errors than in simple
◮ But only when the variables for which we estimate the
◮ The precision of an estimate is always a function of the amount
◮ In stratified samples, the mere presence of an observation in the
◮ where $n_{is}$ is the sample size and $N_{is}$ the population size within
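Assuming the same ratio form as in the simple random sampling case, the inclusion probability in a stratified design is computed separately within each stratum, as the stratum sample size over the stratum population size. The strata, sizes and function name below are hypothetical.

```python
def stratum_inclusion_probabilities(sample_sizes, population_sizes):
    """Return n_s / N_s for every stratum s."""
    return {
        stratum: sample_sizes[stratum] / population_sizes[stratum]
        for stratum in population_sizes
    }


# Hypothetical strata: equal sample sizes drawn from unequal populations.
population_sizes = {"urban": 8000, "rural": 2000}
sample_sizes = {"urban": 200, "rural": 200}

probabilities = stratum_inclusion_probabilities(sample_sizes, population_sizes)
print(probabilities)
```

Here rural residents are four times as likely to be selected as urban residents, so the raw sample over-represents them.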
◮ Take all units inside of the cluster (single-stage sampling)
◮ Sample further (multistage sampling)
◮ People in the same cluster tend to be more similar to one
◮ Formally, values of respondents from the same cluster tend to
◮ With a clustered sample, the correlation between units will be
◮ Hence, the information that we get from each respondent will
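One standard way to quantify this loss of information, which is not taken from the slides, is Kish's approximate design effect, $1 + (m - 1)\rho$, where $m$ is the average cluster size and $\rho$ the intraclass correlation. The sketch below uses hypothetical numbers and function names.

```python
def design_effect(cluster_size, intraclass_correlation):
    """Kish's approximation: deff = 1 + (m - 1) * rho."""
    return 1 + (cluster_size - 1) * intraclass_correlation


def effective_sample_size(n, cluster_size, intraclass_correlation):
    """Size of a simple random sample carrying roughly the same information."""
    return n / design_effect(cluster_size, intraclass_correlation)


# Hypothetical survey: 1,000 respondents, 20 per cluster, rho = 0.05.
deff = design_effect(20, 0.05)
n_eff = effective_sample_size(1000, 20, 0.05)
print(round(deff, 2), round(n_eff))  # 1.95 513
```

Even a modest within-cluster correlation nearly halves the effective sample size in this example.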
◮ Where $\pi_c = n_c / N_c$
◮ $n_c$ is the number of sampled clusters
◮ $N_c$ is the total number of clusters in the sample frame
◮ Where $\pi_{ic} = n_{ic} / N_{ic}$
◮ $n_{ic}$ is the sample size
◮ $N_{ic}$ is the population size within the cluster $c$
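In a two-stage design with simple random sampling at each stage, a respondent's overall inclusion probability is the product of the two stage-level probabilities defined above. The sketch below assumes that setup; the function name and the numbers are hypothetical.

```python
from fractions import Fraction


def two_stage_inclusion_probability(n_c, N_c, n_ic, N_ic):
    """Overall probability: (clusters sampled / clusters) * (units sampled / units)."""
    pi_c = Fraction(n_c, N_c)      # probability that the cluster is selected
    pi_ic = Fraction(n_ic, N_ic)   # probability of selection within the cluster
    return pi_c * pi_ic


# Hypothetical design: 10 of 100 clusters, then 20 of 500 people in a cluster.
pi_i = two_stage_inclusion_probability(10, 100, 20, 500)
print(pi_i)  # 1/250
```

With these numbers each respondent stands for 250 people in the frame, which is the inverse of the inclusion probability.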
◮ For instance, if we have a stratified sample, weights allow us to
◮ $X_i$ is the measurement of variable $X$ for respondent $i$
◮ $\pi_i$ is the probability of inclusion for respondent $i$
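The estimator these symbols belong to is missing from this text. A common choice, assumed here, is the weighted mean in which every respondent is weighted by the inverse of the inclusion probability. The data and the function name are hypothetical.

```python
def weighted_mean(values, inclusion_probabilities):
    """Weighted mean with design weights w_i = 1 / pi_i."""
    weights = [1 / p for p in inclusion_probabilities]
    weighted_total = sum(w * x for w, x in zip(weights, values))
    return weighted_total / sum(weights)


# Hypothetical data: the last two respondents were half as likely to be sampled.
x = [1, 0, 1, 1]
pi = [0.5, 0.5, 0.25, 0.25]

estimate = weighted_mean(x, pi)      # (2 + 0 + 4 + 4) / 12
unweighted = sum(x) / len(x)         # 3 / 4
print(round(estimate, 4), unweighted)
```

Respondents who were less likely to be selected count for more, which pulls the estimate away from the unweighted sample mean.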
◮ The sample cannot be stratified on everything
◮ Nonresponse patterns may be different between groups
◮ Group proportions in the sample may end up being different
◮ Where $g_i = N_k / \hat{N}_k$
◮ $N_k$ is the population size in the group (or stratum) $k$
◮ $\hat{N}_k$ is the estimate of that group size obtained from the sample
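The adjustment factor can be sketched as the ratio of the known group size to the group size estimated from the sample. The sketch assumes the estimated size is the sum of the design weights in the group, which is one common reading and not confirmed by the slides. Group labels, weights and function names are hypothetical.

```python
from collections import defaultdict


def poststratification_factors(groups, design_weights, population_sizes):
    """Return g_k = N_k / N_hat_k, with N_hat_k the sum of design weights in k."""
    estimated_sizes = defaultdict(float)
    for group, weight in zip(groups, design_weights):
        estimated_sizes[group] += weight
    return {g: population_sizes[g] / estimated_sizes[g] for g in estimated_sizes}


# Hypothetical sample of three respondents, each with design weight 100.
groups = ["men", "men", "women"]
design_weights = [100.0, 100.0, 100.0]
population_sizes = {"men": 150, "women": 150}  # known population totals

factors = poststratification_factors(groups, design_weights, population_sizes)
final_weights = [w * factors[g] for g, w in zip(groups, design_weights)]
print(factors, final_weights)
```

After the adjustment the weights in each group sum to the known population size of that group, so over-represented groups are scaled down and under-represented ones scaled up.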
◮ E.g. Males of age 18-24 and low education, males of age 18-24
◮ Assuming the correct data generating process for the outcome
◮ Assuming a correct specification for the function predicting the