

SLIDE 1

Bayesian Methods for Variable Selection with Applications to High-Dimensional Data

Part 5: Functional Data & Wavelets

Marina Vannucci

Rice University, USA

ABS13-Italy 06/17-21/2013

SLIDE 2

Part 5: Functional Data & Wavelets

Introduction to nonparametric regression
Brief intro to wavelets
Wavelet shrinkage
Functional data (multiple curves):
Curve regression models
Curve classification
Applications to near-infrared spectral data from chemometrics

SLIDE 3

Nonparametric Regression Models

The functional form of the regression function is not a simple function of a few parameters.
Technically: probability models with infinitely many parameters, or models on function spaces.
Typical setting: given data (X_i, Y_i), i = 1, ..., n, we wish to estimate Y = f(X) + ε.
X can be a covariate, a set of covariates, time, a spatial location, ...
Basic idea: avoid simple (parametric) forms.
Approaches are needed to penalize overfitting, since there are now many parameters.
Function estimation (avoiding linearity): splines, wavelets (basis functions).

SLIDE 4

Basis functions

A basis is a set of simple functions that can be added together in a weighted fashion to form more complicated functions.
Suppose X is univariate.
Simplest basis: the polynomial basis, f(X) = β_0 + β_1 X + ... + β_p X^p.
Set of power functions used to reconstruct f from the data: {x, x^2, x^3, ...}
With a sufficiently large number of basis functions, we have essentially a nonparametric function estimator.
Matrix notation: f(x) = Σ_{j=0}^{p} B_j(x) β_j, where B_j(x) = x^j and the β_j are the coefficients (weights); see the sketch below.
The polynomial basis is easy to fit but too simplistic for local features/variations.
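As a quick illustration (not from the slides), the polynomial-basis fit is just least squares on a Vandermonde design matrix; the toy signal and the degree below are assumptions for the sketch:

```python
import numpy as np

# Sketch: fit f(X) = sum_j beta_j X^j by least squares on the power
# basis B_j(x) = x^j described above (toy data, illustrative degree).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

p = 5                                          # polynomial degree
B = np.vander(x, p + 1, increasing=True)       # columns x^0, ..., x^p
beta, *_ = np.linalg.lstsq(B, y, rcond=None)   # weights beta_j
f_hat = B @ beta                               # fitted values of f
```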

SLIDE 5

Splines

Splines are one way to model f flexibly, by writing f(X) = B(X) β, where the B(·) are basis functions.
Basis functions: many choices are available - truncated power basis, B-splines, thin-plate splines, penalized splines, ... (a design-matrix sketch follows below).
Truncated power basis of order p:
f(X) = β_0 + β_1 X + ... + β_p X^p + Σ_{k=1}^{K} β_{k+p} (X − κ_k)_+^p
with the β's the regression coefficients, the κ_k the knots and K the number of knots.
Linear/polynomial regression for K = 0.
Construction of splines involves specifying knots: both their number and their locations.
Splines capture non-linear relationships between variables.
Kernel-based methods: Ruppert, Wand and Carroll (2003).
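A sketch of the truncated power basis as a design matrix, under the same toy setup as before (the knot grid and degree are illustrative assumptions):

```python
import numpy as np

def truncated_power_basis(x, p, knots):
    """Columns: 1, x, ..., x^p, then (x - kappa_k)_+^p for each knot kappa_k."""
    poly = np.vander(x, p + 1, increasing=True)
    trunc = np.maximum(x[:, None] - np.asarray(knots)[None, :], 0.0) ** p
    return np.hstack([poly, trunc])

x = np.linspace(0, 1, 200)
knots = np.linspace(0.1, 0.9, 8)                 # K = 8 equally spaced knots
B = truncated_power_basis(x, p=3, knots=knots)   # cubic spline design matrix
# A least-squares fit of y on B then proceeds exactly as in the polynomial case.
```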

SLIDE 6

Wavelets: Major Milestones

The basic idea behind wavelets is to represent a generic function with simpler functions (building blocks) at different scales and locations.

1807: Fourier orthogonal decomposition of periodic signals.
1946: Gabor windowed Fourier transform (STFT).
1984: A. Grossmann and J. Morlet introduce the continuous wavelet transform for the analysis of seismic signals.
1985: Y. Meyer defines discrete orthonormal wavelet bases.
1988: I. Daubechies constructs wavelets with compact support and a varying degree of smoothness.
1989: S. Mallat links wavelets to the theory of "multiresolution analysis" (MRA), a framework that allows one to construct orthonormal bases. A discrete wavelet transform is defined as a simple recursive algorithm that computes the wavelet decomposition of a signal from its approximation at a given scale.
1992: D. Donoho and I. Johnstone use wavelets to remove noise from data.

SLIDE 7

Wavelets as “Small Waves”

The basic idea behind wavelets is to represent a generic function with simpler functions (building blocks) at different scales and locations

[Figure: the Mexican hat wavelet ψ(x); its dilations √2 ψ(2x) (a = 1/2, b = 0) and (1/√2) ψ(x/2) (a = 2, b = 0); and the translation ψ(x + 4) (a = 1, b = 4).]

Mother wavelet ψ: an oscillatory function with zero mean, ∫ ψ(x) dx = 0.
Wavelet family: ψ_{j,k}(x) = 2^{j/2} ψ(2^j x − k), with j, k the scale and translation parameters.

SLIDE 8

Wavelets vs Fourier Representations

Given f and a basis {f_1, ..., f_n} → series expansion:
f(x) = Σ_i a_i f_i(x),   a_i = <f, f_i> = ∫ f(x) f_i(x) dx

Fourier transforms measure the frequency content of a signal. Wavelet transforms provide a time-scale analysis.

SLIDE 9

Examples of Wavelet Families

Haar wavelets: the simplest family of wavelets, already known before the formulation of wavelet theory (Haar, 1909).
ψ(x) = 1 for 0 ≤ x < 1/2; −1 for 1/2 ≤ x < 1; 0 otherwise.
Haar wavelets are constructed from ψ via dilations and translations.
Given a generic function f, Haar wavelets approximate f with piecewise constant (hence discontinuous) functions.

SLIDE 10

Daubechies Wavelets

[Figure: wavelets ψ(x) and scaling functions φ(x) for Daubechies 2, 4, 7 and 10.]

Compact support (good time localization).
Vanishing moments: ∫ x^l ψ(x) dx = 0, l = 0, 1, ..., N, ensure the decay
|<f, ψ_{j,k}>| ≤ C 2^{−jN}   (good for compression).
Various degrees of smoothness: for large N, φ ∈ C^{μN} with μ ≈ 0.2.

SLIDE 11

Properties of Wavelets

Wavelet series: f(x) = Σ_{j,k} <f, ψ_{j,k}> ψ_{j,k}(x), with wavelet coefficients
{W_f(j, k) = <f, ψ_{j,k}> = ∫ f(x) ψ_{j,k}(x) dx}_{j,k ∈ Z}
describing features of f at different locations and scales.
Properties:
Small waves with zero mean.
Time-frequency localization.
Good at describing non-stationarity and discontinuities.
Multi-scale decomposition of functions (MRA) - sparsity, shrinkage.
Recursive relationships among coefficients → DWT.

SLIDE 12

Wavelets in Practice (DWT)

Consider a vector Y of observations of f at n equispaced points, y_i = f(t_i), i = 1, ..., n, with n = 2^m.
Discrete wavelet transforms operate via recursive filters applied to Y:

data ≡ c^(m) −H→ c^(m−1) −H→ ··· −H→ c^(m−r)

with a detail branch G splitting off at each step, producing d^(m−1), ..., d^(m−r):

H: c_{m−1,k} = Σ_l h_{l−2k} c_{m,l},   G: d_{m−1,k} = Σ_l g_{l−2k} c_{m,l},   ...

Then, in practice,

Y −DWT→ d(Y) = (c^(m−r), d^(m−r), d^(m−r+1), ..., d^(m−1)),

a discrete approximation of the <f, ψ_{j,k}> at scales m − 1, ..., m − r.
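In code, the recursion is available off the shelf; a sketch with the PyWavelets package (an illustrative choice - the slides themselves point to Wavelab and WaveThresh):

```python
import numpy as np
import pywt

m = 10
t = np.linspace(0, 1, 2**m, endpoint=False)
y = np.sin(4 * np.pi * t) + (t > 0.5)   # n = 2^m samples: smooth part plus a jump

# r = 3 levels of recursive filtering: d(Y) = (c^(m-3), d^(m-3), d^(m-2), d^(m-1))
coeffs = pywt.wavedec(y, 'db4', mode='periodization', level=3)
print([len(c) for c in coeffs])          # [128, 128, 256, 512]

# The transform is orthogonal (W'W = I), so the inverse DWT reconstructs Y exactly
y_back = pywt.waverec(coeffs, 'db4', mode='periodization')
assert np.allclose(y, y_back)
```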

SLIDE 13

Example

[Figure: a noisy signal f + ε observed at 1024 points, and its DWT coefficients plotted against coefficient index.]

data ≡ Y −DWT→ d(Y) = (c^(3), d^(3), ..., d^(7), d^(8), d^(9))
In matrix notation, Y → d(Y) = WY, with W determined by ψ and such that W′W = I.

SLIDE 14

Summary on Wavelets

Choose the mother wavelet ψ and obtain the wavelet family {ψ_{j,k}}_{j,k ∈ Z}.
In theory, given f, calculate the wavelet coefficients <f, ψ_{j,k}> = ∫ f(x) ψ_{j,k}(x) dx and construct the wavelet series f(x) = Σ_{j,k} <f, ψ_{j,k}> ψ_{j,k}(x).
In practice, given data Y = (y_1, ..., y_n), use the DWT to calculate approximate wavelet coefficients.
The DWT uses fast recursive filtering. Use matrix notation for convenience: Z = WY.

SLIDE 15

Covariance matrix of wavelet coefficients

y_i = f(t_i), i = 1, ..., n, with n = 2^m.
DWT: Y −DWT→ d(Y) = WY, with W determined by ψ and W′W = I.
Variances and covariances of the DWT coefficients: Σ_{d(Y)} = W Σ_Y W′, for a given Σ_Y(i, j) = γ(|i − j|), with γ(τ) the autocovariance function of the process generating the data (see the sketch below).
Vannucci & Corradi (JRSS, Series B, 1999).
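A brute-force sketch of the computation Σ_{d(Y)} = W Σ_Y W′: form W explicitly by transforming the columns of the identity matrix (the 1999 paper derives efficient recursive algorithms that avoid forming W; the AR(1)-type autocovariance is an illustrative assumption):

```python
import numpy as np
import pywt

def dwt_matrix(n, wavelet='db4', level=3):
    """Assemble W column by column: column j is the DWT of the j-th unit vector."""
    W = np.empty((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = 1.0
        coeffs = pywt.wavedec(e, wavelet, mode='periodization', level=level)
        W[:, j] = np.concatenate(coeffs)
    return W

n = 256                                  # n = 2^m
W = dwt_matrix(n)
assert np.allclose(W.T @ W, np.eye(n))   # W'W = I

rho = 0.7                                # gamma(tau) = rho^|tau| for the data process
Sigma_Y = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
Sigma_d = W @ Sigma_Y @ W.T              # covariance of the DWT coefficients
```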

SLIDE 16

Fields of Application

Applied mathematics: solution of partial/ordinary differential equations; representation of basic operators, ...
Engineering: signal and image processing (compression, smoothing), ...
Statistics: smoothing of noisy data; nonparametric density and regression estimation; representation of stochastic processes; time series; functional data, ...
Physics, biology, genetics: turbulence, DNA sequence analysis, magnetic resonance imaging, ...
Music, human vision, computer graphics, ...

SLIDE 17

Wavelet Shrinkage

y_i = f(t_i) + ε_i, i = 1, ..., n = 2^m (equispaced design), ε_i ∼ N(0, σ^2).
Noise-removal technique proposed by Donoho and Johnstone (1992-1998):
compute empirical wavelet coefficients → apply shrinkage/thresholding → reconstruct f.
Implemented through the fast DWT.
Optimal for general function classes (Hölder and Sobolev) as well as for spaces of irregular functions, such as those of "bounded variation".
Bayesian wavelet shrinkage estimators have better MSE properties for finite samples.

SLIDE 18

3-step procedure

y_i = f(t_i) + ε_i,   ε_i i.i.d. N(0, σ^2)
Step 1: Y −DWT→ d(Y) = d(f) + ε′, with ε′ ∼ N(0, σ^2 I)
Step 2: estimate d(f), either by
removing ε′ by thresholding/shrinking d(Y), or by
using mixture priors on d(f)
Step 3: inverse transform to estimate f (see the sketch below)
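The three steps in a compact sketch; the MAD estimate of σ and the universal threshold λ = σ√(2 log n) are standard Donoho-Johnstone choices assumed here, not prescribed by the slide:

```python
import numpy as np
import pywt

# Noisy observations y_i = f(t_i) + eps_i at n = 2^m points (toy example)
n = 1024
t = np.linspace(0, 1, n, endpoint=False)
f = np.where(t < 0.4, 0.0, 2.0)
y = f + np.random.default_rng(1).normal(scale=0.3, size=n)

# Step 1: DWT of the data
coeffs = pywt.wavedec(y, 'db4', mode='periodization')

# Step 2: estimate sigma from the finest-level details (MAD / 0.6745),
# then soft-threshold all detail coefficients at lambda = sigma*sqrt(2 log n)
sigma = np.median(np.abs(coeffs[-1])) / 0.6745
lam = sigma * np.sqrt(2 * np.log(n))
coeffs[1:] = [pywt.threshold(d, lam, mode='soft') for d in coeffs[1:]]

# Step 3: inverse DWT gives the estimate of f
f_hat = pywt.waverec(coeffs, 'db4', mode='periodization')
```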

SLIDE 19

Hard and soft thresholding rules

[Figure: a noisy signal and its level-by-level DWT coefficients over time, before and after thresholding.]

d̂_{j,k} = d_{j,k} I(|d_{j,k}| > λ)   (hard thresholding)
d̂_{j,k} = (d_{j,k} − sign(d_{j,k}) λ) I(|d_{j,k}| > λ)   (soft thresholding)
with global or level-dependent thresholds λ.
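The two rules as plain NumPy functions (a sketch; λ is any fixed global or level-dependent threshold):

```python
import numpy as np

def hard_threshold(d, lam):
    """d_hat = d * 1(|d| > lam): keep-or-kill."""
    return d * (np.abs(d) > lam)

def soft_threshold(d, lam):
    """d_hat = sign(d) * (|d| - lam)_+ : kill, then shrink survivors by lam."""
    return np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)

d = np.array([-3.0, -0.5, 0.2, 1.4, 6.0])
print(hard_threshold(d, 1.0))   # [-3.  0.  0.  1.4  6.] (up to signed zeros)
print(soft_threshold(d, 1.0))   # [-2.  0.  0.  0.4  5.]
```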

SLIDE 20

Bayesian Wavelet Shrinkage

At step 2, write the model as (d = WY, θ = Wf)
d | θ, σ^2 ∼ N(θ, σ^2 I),
and impose priors on the unknown parameters θ (and possibly σ^2).
Use spike-and-slab priors: θ_{j,k} | s_{j,k} ∼ s_{j,k} N(0, ·) + (1 − s_{j,k}) I_0, with s_{j,k} ∼ Bernoulli(p_j) (Clyde et al. 1998, Clyde & George 2000).
Shrinkage is a by-product of these procedures, as Bayes rules naturally shrink posterior estimates towards prior quantities. Shrinkage is also adaptive when level-specific variances are specified.
Dependent priors can also be specified (Vannucci & Corradi, 1999).
Estimate θ via an appropriate Bayes rule (posterior mean or median) and estimate f by W^{−1} θ̂ (step 3).
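For the conjugate version with a normal slab, θ_{j,k} | s_{j,k} ∼ s_{j,k} N(0, τ^2) + (1 − s_{j,k}) I_0 and known σ^2, the posterior mean is available in closed form; a sketch (τ, the weight w and the data below are illustrative assumptions, not values from the slides):

```python
import numpy as np
from scipy.stats import norm

def spike_slab_posterior_mean(d, sigma, tau, w):
    """E[theta | d] for d ~ N(theta, sigma^2), theta ~ w*N(0, tau^2) + (1-w)*I_0."""
    m1 = norm.pdf(d, scale=np.sqrt(sigma**2 + tau**2))   # marginal of d under the slab
    m0 = norm.pdf(d, scale=sigma)                        # marginal of d under the spike
    post_incl = w * m1 / (w * m1 + (1 - w) * m0)         # P(s = 1 | d)
    shrink = tau**2 / (sigma**2 + tau**2)                # linear shrinkage toward 0
    return post_incl * shrink * d

d = np.array([0.1, 0.8, 3.0, -5.0])
print(spike_slab_posterior_mean(d, sigma=1.0, tau=3.0, w=0.2))
# small coefficients are shrunk nearly to zero, large ones barely touched
```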

SLIDE 21

Useful References

Daubechies, I. (1992). Ten Lectures on Wavelets. SIAM.
Mallat, S.G. (1989). Multiresolution approximations and wavelet orthonormal bases of L2(R). Transactions of the American Mathematical Society, 315(1), 69-87.
Nason, G. (2008). Wavelet Methods in Statistics with R. Springer: New York. (with R package)
Antoniadis, A., Bigot, J. and Sapatinas, T. (2001). Wavelet estimators in nonparametric regression: A comparative simulation study. Journal of Statistical Software, 6(6), 1-83. (with Matlab software)
Software: Wavelab (Matlab) and WaveThresh (R).

SLIDE 22

Functional Data (multiple curves)

Data observed over an ordered domain: curves, images, 3-D objects.
First step: characterize the unknown function via basis functions (projection onto a low-dimensional space). Multiple choices: splines, wavelets, functional principal components (empirical basis).
Second step: fit models in the "reduced" space. Functional analogues of standard models: functional regression, functional mixed models, generalized functional regression, etc.
A Bayesian framework allows a unified modeling perspective.

SLIDE 23

Wavelets for Functional Data

Overall objective: methods for prediction, classification or clustering based on functional data (multiple curves).
Regression: a continuous response is observed.
Classification: class membership is observed in a training set.
Clustering: group membership needs to be uncovered.
In all cases the data are curves.

SLIDE 24

Motivating Examples from NIR Calibration

Experiment: 40 biscuit doughs made with variations in the quantities of fat, flour, sugar and water in the recipe (fat 15-21%, sugar 10-23%, flour 44-54%, water 11-17%).
Data: composition (by weight) of the 4 constituents and spectral data at 700 wavelengths (from 1100 nm to 2498 nm in steps of 2 nm) for each dough piece.
Aim: use the spectral data to predict the composition.
Wavelets serve as a dimension-reduction tool that preserves local features.

SLIDE 25

[Figure: NIR spectra of the biscuit doughs, training set and validation set.]

SLIDE 26

Another Example

NIR spectra are used in the analysis of food, drink and pharmaceutical products, measured at hundreds of wavelengths.
3 varieties of wheat, 94 observations. NIR spectra with 100 wavelengths from 850 to 1048 nm in steps of 2 nm.
Aim: use the spectral data to predict the wheat variety.

SLIDE 27

[Figure: NIR absorbance spectra (850-1050 nm) for wheat varieties 1, 2 and 3.]

slide-28
SLIDE 28

Our Approach

Use wavelets as a dimension-reduction tool that preserves local features. Apply the DWT to the curves.
Wavelet coefficients summarise curve features in an efficient way; they are localized (unlike Fourier coefficients) and can capture noise in the data.
Develop Bayesian methods in the wavelet domain for simultaneous feature selection and prediction or classification.
We use mixture priors and Bayes methods to look for wavelet components that describe the variation in the responses well.
Wavelet coefficient selection is done by assigning a probability to every possible subset and then searching for subsets with high probability. Small coefficients can be important.

SLIDE 29

Curve Regression

The basic setup is a multivariate linear regression model, with n observations on a q-variate response and p explanatory variables:
Y = 1_n α′ + X B + E,   Y (n × q) responses,   X (n × p) predictors
Concern is with functional predictor data, i.e. the situation where each row of X is a vector of observations of a curve x(t) at p equally spaced points.
Interest is in situations where p is large → variable selection and/or dimension-reduction methods are needed.

SLIDE 30

Wavelet Transformation

A wavelet transform is applied to each row of X:
Y = 1_n α′ + X W′ W B + E,   W′W = I

[Figure: three NIR reflectance curves (left) and their DWT coefficients (right).]

Y = 1_n α′ + D B̃ + E,   with B̃ = W B and D = X W′ a matrix of wavelet coefficients.
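A sketch of D = X W′ in practice: apply the DWT to each row of X and stack the coefficient vectors (toy data, illustrative wavelet and depth):

```python
import numpy as np
import pywt

def rowwise_dwt(X, wavelet='db4', level=4):
    """Replace each curve (row of X) by its vector of wavelet coefficients."""
    rows = []
    for x in X:
        coeffs = pywt.wavedec(x, wavelet, mode='periodization', level=level)
        rows.append(np.concatenate(coeffs))
    return np.vstack(rows)

X = np.random.default_rng(2).normal(size=(40, 256))   # 40 curves at p = 256 points
D = rowwise_dwt(X)
print(D.shape)   # (40, 256): an orthogonal change of basis, same dimension
```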

SLIDE 31

Prior Model to Select Wavelet Coefficients

Mixture priors represent the negligible coefficients. A latent binary vector γ identifies the different models:
γ = (γ_1, ..., γ_p),   γ_j ∈ {0, 1},   γ_j = 1 ↔ D_j is in the model
Coefficients are drawn from a mixture distribution:
B − B_0 ∼ N(H_γ, Σ),   B_{:,j} ∼ (1 − γ_j) I_0 + γ_j N(0, h_{jj} Σ)
π(γ) as single Bernoulli's or Beta-Binomial (or priors for related predictors):
Prob(γ_j = 1) = w_j,   w_j = w, j = 1, ..., p

SLIDE 32

Prior Model

If H is the variance-covariance matrix of the prior on B, then H̃ = W H W′.
We compute transformed covariance priors on B̃ as W H W′ using the results of Vannucci and Corradi, JRSSB (1999).
Priors on α and Σ are unchanged. (In what follows, write B for B̃ and H for H̃.)

SLIDE 33

Metropolis Search

MCMC is used to "search" the posterior g(γ) ∝ π(γ | Y, D), looking for good models. At each step the algorithm generates γ_new from γ_old by one of two possible moves:
Add or delete a component, by choosing at random one component of γ_old and changing its value.
Swap two components, by choosing independently at random a 0 and a 1 in γ_old and changing both of them.
The new candidate γ_new is accepted with probability min{g(γ_new)/g(γ_old), 1} (see the sketch below).
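A minimal sketch of the sampler, assuming a user-supplied log_g(gamma) that returns the log of the unnormalized posterior g(γ); both moves are symmetric proposals, so the Metropolis ratio reduces to g(γ_new)/g(γ_old):

```python
import numpy as np

rng = np.random.default_rng(3)

def metropolis_step(gamma, log_g):
    """One add/delete or swap move on the binary inclusion vector gamma."""
    cand = gamma.copy()
    ones = np.flatnonzero(gamma == 1)
    zeros = np.flatnonzero(gamma == 0)
    if rng.random() < 0.5 or len(ones) == 0 or len(zeros) == 0:
        j = rng.integers(len(gamma))        # add/delete: flip one component
        cand[j] = 1 - cand[j]
    else:
        cand[rng.choice(ones)] = 0          # swap: exchange a 1 and a 0
        cand[rng.choice(zeros)] = 1
    # accept with probability min{g(new)/g(old), 1}, computed on the log scale
    if np.log(rng.random()) < log_g(cand) - log_g(gamma):
        return cand
    return gamma

log_g = lambda g: -0.5 * (g.sum() - 5) ** 2   # toy target favoring ~5 active terms
gamma = np.zeros(50, dtype=int)
for _ in range(2000):
    gamma = metropolis_step(gamma, log_g)
```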

SLIDE 34

Posterior Inference

The stochastic search results in a list of visited models (γ^(0), γ^(1), ...) and their corresponding relative posterior probabilities p(γ^(0) | D, Y), p(γ^(1) | D, Y), ...
Select variables:
in the "best" models, i.e. the γ's with highest p(γ | D, Y), or
with the largest marginal posterior probabilities of inclusion.
Prediction of future Y_f:
via Bayesian model averaging (BMA), as a posterior-weighted average of model predictions, or
via single-model predictions: LS/Bayes predictions on single best models, or LS/Bayes predictions with "threshold" models (e.g., the "median" model) obtained from the estimated marginal probabilities of inclusion (see the sketch after this list).
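A sketch of the post-processing: estimated marginal inclusion probabilities and the 0.5-threshold "median" model. The visited models and their weights below are random placeholders standing in for real MCMC output:

```python
import numpy as np

# T visited models stored as a (T, p) binary array, with relative
# posterior probabilities (placeholders for actual search output)
rng = np.random.default_rng(4)
T, p = 1000, 50
visits = rng.integers(0, 2, size=(T, p))
probs = rng.random(T)

w = probs / probs.sum()                          # normalized model weights
marginal = w @ visits                            # marginal P(gamma_j = 1 | D, Y)
median_model = np.flatnonzero(marginal > 0.5)    # "median" threshold model
best_model = np.flatnonzero(visits[np.argmax(probs)])   # highest-probability model
```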

SLIDE 35

Prior Specification

Priors on α and Σ: vague and largely uninformative,
α′ − α′_0 ∼ N(0, hΣ), h → ∞,   Σ ∼ IW(δ, Q)
Choices for H_γ:
H_γ = C [(D′D)^{−1}]_γ
H_γ = C [diag(D′D)^{−1}]_γ
H = C I
H = AR(σ, ρ)
with EB estimates of the hyperparameters, obtained by maximizing
L(·; D, Y) = |K|^{−q/2} |Q|^{−n/2} |I_n + K^{−1} Y Q^{−1} Y′|^{−(δ+q+n−1)/2},   K = I + D H D′
Choice of w:

SLIDE 36

Diagonal Elements of H̃ = WHW′

[Figure: prior variances of the regression coefficients (the diagonal of H̃) plotted against wavelet coefficient index.]

SLIDE 37

NIR Data on Biscuit Doughs

4 parallel MCMC chains with 100,000 iterations and about 40,000 accepted moves per chain.
H(i, j) = σ^2 ρ^{|i−j|}, with σ^2 = 254, ρ = .32
Modal model: 10 coefficients (0%, 0%, 0%, 37%, 6%, 6%, 1%, 2%, from coarsest to finest level); MSE = (.059, .466, .351, .047)
BMA: 500 models and 219 coefficients; MSE = (.063, .449, .348, .050)
PLS: MSE = (.151, .583, .375, .105)
PCR: MSE = (.160, .614, .388, .106)
Stepwise MLR: MSE = (.044, 1.188, .722, .221)

SLIDE 38

Diagnostic Plots

[Figure: diagnostic plots over the 100,000 iterations: (a) number of ones in γ; (b) log posterior probabilities.]

SLIDE 39

Marginal Plots

[Figure: marginal posterior probabilities of inclusion for the wavelet coefficients, panels (i)-(iv).]

SLIDE 40

IDWT of the Columns of the LS Estimate of B̃

[Figure: coefficient curves against wavelength, obtained by applying the IDWT to the columns of the LS estimate of B̃, for fat, sugar, flour and water.]

SLIDE 41

Probit Models for Curve Classification

Z → n × 1 categorical response vector (J categories). X → n × p predictor matrix. We have functional predictor data.
Probit models use data augmentation and latent data to write
Y_i = α′ + X′_i B + ε_i,   ε_i ∼ N(0, Σ), i = 1, ..., n
Relationship between the realization z_i and the unobserved Y_i:
z_i = 0 if y_{i,j} < 0 for every j;   z_i = j if y_{i,j} = max_{1≤k≤J−1} {y_{i,k}} ≥ 0

... now transform to wavelets, use selection prior on wavelet coefficients and MCMC for inference in a probit model ...

SLIDE 42

Results for Wheat Data

3 classes, 94 samples, NIR transmission spectra at p = 100 wavelengths from 850 to 1048 nm.

Variety      1    2    3   Total
Training    32   22   17      71
Validation  10    7    6      23

Fearn et al. (2001): a Bayesian decision approach that balances the cost of variables against the loss from misclassification.

SLIDE 43

Mis-classification Errors

Method              # Var. Sel. / # PCs   Prediction Error
"Best" model        5                     3/23
Marginal model      9                     3/23
BD (Fearn et al.)   6                     5/23
BD (Fearn et al.)   12                    3/23
LDA with PCA        14-18 PCs             4/23
QDA with PCA        8-10 PCs              7/23

SLIDE 44

[Figure: marginal posterior probabilities of inclusion for the wavelet coefficients, panels (a)-(d).]

SLIDE 45

[Figure: estimated coefficient curves β1 and β2 plotted against wavelength, panels (a)-(d).]

Plots (a) and (c) are obtained using 4 wavelet coefficients, and plots (b) and (d) using 9 coefficients.

SLIDE 46

Code from my Website

bvsme_wav: Bayesian variable selection applied to wavelet coefficients.
Metropolis search; non-diagonal selection priors.
Bernoulli priors or Beta-Binomial prior.
Predictions by LS and BMA.

http://stat.rice.edu/~marina

SLIDE 47

Discriminant Analysis for Classification

DA classifies observations into groups. Data from group g are modeled as
X_g − 1_{n_g} μ′_g ∼ N(I, Σ_g),   g = 1, ..., G,
where the vector μ_g and the matrix Σ_g are the mean and the covariance matrix of the g-th group, respectively.
Group assignments: y = (y_1, ..., y_n)′, where y_i = g if the i-th observation comes from group g.
Conjugate prior model:
μ_g ∼ N(m_g, h_g Σ_g),   Σ_g ∼ IW(δ_g, Ω_g),
where Ω_g is a scale matrix and δ_g a shape parameter.

SLIDE 48

Transformation to Wavelet Domain

We apply the DWT as D_g = X_g W′, with W′W = I. The model becomes
D_g − 1_{n_g} μ′_g ∼ N(I, Σ̃_g),   with Σ̃_g = W Σ_g W′,
while μ_g is unchanged. The prior model on Σ̃_g transforms accordingly.
The predictive distribution (a Student-t) is used to classify new samples:
π_g(y_f | D) = p_g(d_f) π̂_g / Σ_{i=1}^{G} p_i(d_f) π̂_i,
where p_g(d_f) denotes the predictive distribution and π̂_g = n_g / n is the prior probability that an observation comes from group g.
A new observation is then assigned to the group with the highest posterior probability (see the sketch below).
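The assignment step in a sketch, assuming the per-group log predictive densities log p_g(d_f) have already been computed (the multivariate-t details are omitted; the numbers below are made up for illustration):

```python
import numpy as np

def classify(log_pred, n_g):
    """Assign a new sample to the group with the highest posterior probability.

    log_pred : length-G array of log p_g(d_f)
    n_g      : length-G array of training group sizes, so pi_hat_g = n_g / n
    """
    log_post = log_pred + np.log(n_g / n_g.sum())     # log p_g(d_f) + log pi_hat_g
    log_post -= log_post.max()                        # stabilize before exponentiating
    post = np.exp(log_post) / np.exp(log_post).sum()  # normalized posterior probs
    return int(np.argmax(post)), post

group, post = classify(np.array([-10.2, -9.1, -12.7]), np.array([32.0, 22.0, 17.0]))
```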

SLIDE 49

Variable selection

We want to select discriminating wavelet coefficients. Introduce a latent binary p-vector γ:
γ_j = 1 if coefficient j defines a mixture distribution, γ_j = 0 otherwise.
The likelihood function is given by
L(D, y; ·) = ∏_{i=1}^{n} p(D_i(γ^c) | D_i(γ)) ∏_{g=1}^{G} w_g^{n_g} ∏_{i=1}^{n_g} p_g(D_i(γ))
Conjugate priors are used for both selected and non-selected coefficients.

SLIDE 50

Markov Random Tree Priors on γ

P(γ_j | γ_i, i ∈ N_j) = exp(γ_j F(γ_j)) / (1 + exp(F(γ_j)))     (1)
where F(γ_j) = μ + η Σ_{i ∈ N_j} (2γ_i − 1), and N_j is the set of children of coefficient j.

Here μ controls sparsity. Higher values of η induce neighboring coefficients to take on the same values.
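A sketch of the conditional prior (1), with illustrative values for μ and η:

```python
import numpy as np

def mrt_conditional(gamma_j, gamma_neighbors, mu=-2.0, eta=1.0):
    """P(gamma_j | neighbors) = exp(gamma_j * F) / (1 + exp(F)),
    with F = mu + eta * sum_i (2*gamma_i - 1) over the neighbors N_j."""
    F = mu + eta * np.sum(2 * np.asarray(gamma_neighbors) - 1)
    return np.exp(gamma_j * F) / (1 + np.exp(F))

# selection of a coefficient is more likely when its neighbors are selected
print(mrt_conditional(1, [1, 1]))   # F = 0  -> 0.5
print(mrt_conditional(1, [0, 0]))   # F = -4 -> ~0.018
```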

SLIDE 51

Model Fitting - MCMC

The model parameters can be marginalized out.
Update γ by a Metropolis algorithm.
Classify new samples based on the selected coefficients.

SLIDE 52

An Example

NIR spectra are used in the analysis of food, drink and pharmaceutical products, measured at hundreds of wavelengths.
5 varieties of meat (beef, chicken, lamb, pork and turkey), 231 observations. NIR spectra with 1024 reflectances over the range 400-2498 nm at intervals of 2 nm.
Aim: use the spectral data to predict the meat variety.

SLIDE 53

[Figure: NIR absorbance spectra (400-2500 nm) for beef, lamb, pork, turkey and chicken.]

SLIDE 54

Results

           Predicted class (%)
Truth      Beef     Lamb   Pork    Turkey   Chicken
Beef       100.0    0.0    0.0     0.0      0.0
Lamb       5.9 (1)  94.1   0.0     0.0      0.0
Pork       0.0      0.0    100.0   0.0      0.0
Turkey     0.0      0.0    0.0     92.6     7.4 (2)
Chicken    0.0      0.0    0.0     3.7 (1)  96.3

Table: validation-set classification results for the five species of meat, using a threshold of 0.4 on the posterior probability of inclusion (14 coefficients). The number of misclassified units is reported in parentheses (3.4% total misclassification rate).

LDA and QDA on selected (11-14) PCA components achieve misclassification rates of 5.3% and 6.1%, respectively.

SLIDE 55

[Figure: marginal posterior probabilities of inclusion for the 1024 wavelet coefficients.]

SLIDE 56

Main References

Brown, P.J., Fearn, T. and Vannucci, M. (2001). Bayesian wavelet regression on curves with application to a spectroscopic calibration problem. Journal of the American Statistical Association, 96, 398-408.
Vannucci, M., Sha, N. and Brown, P.J. (2005). NIR and mass spectra classification: Bayesian methods for wavelet-based feature selection. Chemometrics and Intelligent Laboratory Systems, 77, 139-148.
Stingo, F.C., Vannucci, M. and Downey, G. (2011). Bayesian wavelet-based curve classification via discriminant analysis with Markov random tree priors. Statistica Sinica, 22, 465-488.
