The landscape of empirical risk for non-convex losses


SLIDE 1

The landscape of empirical risk for non-convex losses

Song Mei

ICME, Stanford

December 3, 2016

Joint work with Yu Bai and Andrea Montanari

Song Mei (ICME, Stanford) The landscape of empirical risk December 3, 2016 1 / 17

SLIDE 2

Binary linear classification

The model

$Z_i = (X_i, Y_i)$, with $X_i \in \mathbb{R}^d$, $Y_i \in \{0, 1\}$, $i = 1, \ldots, n$.

SLIDE 3

Non-convex formulation of binary classification

The model

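$Z_i = (X_i, Y_i)$, with $X_i \in \mathbb{R}^d$, $Y_i \in \{0, 1\}$, $i = 1, \ldots, n$.

◮ Convex logit loss ($\ell_c$ is convex in $\theta$)

$\ell_c(\theta; Z) = -Y \langle X, \theta \rangle + \log\big(1 + \exp(\langle X, \theta \rangle)\big)$

◮ Non-convex loss ($\ell$ is not convex in $\theta$)

$\ell(\theta; Z) = \big(Y - \sigma(\langle X, \theta \rangle)\big)^2$, where $\sigma(t) = 1/(1 + \exp(-t))$.

◮ Empirical risk

$\widehat{R}_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta; Z_i)$.

◮ Empirical risk minimizer

$\widehat{\theta}_n = \arg\min_{\theta \in B^d(R)} \widehat{R}_n(\theta)$.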
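The definitions on this slide can be written down directly. The following is a minimal NumPy sketch, not code from the talk: the function names and the three-point dataset are illustrative assumptions, and the logistic link is taken to be $\sigma(t) = 1/(1 + \exp(-t))$.

```python
import numpy as np

def sigmoid(t):
    """Logistic link sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def logit_loss(theta, X, y):
    """Convex logistic loss per sample: -y <x, theta> + log(1 + exp(<x, theta>))."""
    z = X @ theta
    return -y * z + np.logaddexp(0.0, z)

def square_loss(theta, X, y):
    """Non-convex loss per sample: (y - sigma(<x, theta>))^2."""
    return (y - sigmoid(X @ theta)) ** 2

def empirical_risk(theta, X, y, loss=square_loss):
    """Average of the per-sample loss over the n data points."""
    return float(np.mean(loss(theta, X, y)))

# Tiny hand-made dataset (illustrative values only, not from the talk).
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, 0.0, 0.0])
theta_zero = np.zeros(2)

# At theta = 0 every prediction is sigma(0) = 0.5, so each squared loss is 0.25
# and each logistic loss is log(2).
risk_at_zero = empirical_risk(theta_zero, X, y)
logit_risk_at_zero = empirical_risk(theta_zero, X, y, loss=logit_loss)
```

Passing the loss as an argument keeps the empirical risk generic, so the convex and non-convex formulations can be compared on the same data.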

SLIDE 7

Why use non-convex loss?

◮ Compared to logistic regression, the non-convex formulation is robust to outliers.

◮ This model is the same as a neural network with a single layer and a single node.
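One way to read the robustness claim is that the squared loss on $\sigma$ is bounded by 1, whereas the logistic loss grows without bound as a mislabelled point moves further from the decision boundary. The sketch below illustrates this; the score values are arbitrary choices made here, not numbers from the talk.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logit_loss(z, y):
    """Logistic loss as a function of the score z = <x, theta>."""
    return -y * z + np.logaddexp(0.0, z)

def square_loss(z, y):
    """Non-convex loss as a function of the score z = <x, theta>."""
    return (y - sigmoid(z)) ** 2

# A point labelled y = 0 whose score is increasingly positive, i.e. an
# increasingly extreme mislabelled point (illustrative values).
scores = np.array([1.0, 10.0, 100.0])
logit_values = logit_loss(scores, 0.0)    # grows roughly linearly in the score
square_values = square_loss(scores, 0.0)  # stays bounded by 1
```

Because a single extreme point can contribute at most 1 to the sum of squared losses, it cannot dominate the empirical risk the way it can under the logistic loss.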

SLIDE 11

A negative theoretical result

Theorem (Auer et al. 1996 [AHW+96])

For the non-convex binary classification problem, for any $n$ and $d$, there exists a dataset $(x_i, y_i)_{i=1}^{n}$ such that the empirical risk $\widehat{R}_n(\theta)$ has $\lfloor n/d \rfloor^d$ distinct local minima.

This seems to imply that the landscape of the non-convex empirical risk $\widehat{R}_n(\theta)$ is very rough. Is this the end of the world for non-convex binary classification?
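To get a sense of scale for the worst-case count, here is a worked instance with illustrative values $n = 1000$ and $d = 10$ (chosen here, not taken from the talk):

```latex
\left\lfloor \frac{n}{d} \right\rfloor^{d}
  = \left\lfloor \frac{1000}{10} \right\rfloor^{10}
  = 100^{10}
  = 10^{20}.
```

Even for a modest dimension, an adversarial dataset can therefore produce far more local minima than any search procedure could enumerate.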

SLIDE 14

Non-convex formulation of binary classification

On real data, we "always" observe a unique minimum! Why? Data generated by nature is not against us!

SLIDE 15

A negative theoretical result

Theorem (Auer et al. 1996 [AHW+96])

For the non-convex binary classification problem, for all $n > 0$ there exists a dataset $(x_i, y_i)_{i=1}^{n}$ such that the empirical risk $\widehat{R}_n(\theta)$ has $\lfloor n/d \rfloor^d$ distinct local minima.

This seems to imply that the landscape of the non-convex empirical risk $\widehat{R}_n(\theta)$ is very rough.

SLIDE 16

Our main positive result

Theorem (Mei, Bai, Montanari 2016 [MBM16])

Assume the $X_i$ are i.i.d. sub-Gaussian random vectors, and the $Y_i$ are generated via $P(Y_i = 1 \mid X_i) = \sigma(\langle X_i, \theta_0 \rangle)$. Then there exists a constant $C$ depending on $\delta$ such that, as long as $n \ge C d \log d$, the following holds with probability at least $1 - \delta$:

(a) $\widehat{R}_n(\theta)$ has a unique local minimizer $\widehat{\theta}_n$ in $B^d(0, R)$.
(b) $\widehat{\theta}_n$ satisfies $\|\widehat{\theta}_n - \theta_0\|_2 \le C \sqrt{(d \log n)/n}$.
(c) Gradient descent converges exponentially fast to $\widehat{\theta}_n$.

The landscape of the non-convex empirical risk $\widehat{R}_n(\theta)$ is actually smooth!
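Parts (b) and (c) can be checked informally on simulated data. The sketch below is an illustration under assumptions made here, not the authors' code: a standard Gaussian design (one example of a sub-Gaussian distribution), $\theta_0 = e_1$, a fixed step size of 1, a projection radius of 5, and a start at the origin.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def risk_and_grad(theta, X, y):
    """Empirical risk (1/n) sum (y_i - sigma(<x_i, theta>))^2 and its gradient."""
    p = sigmoid(X @ theta)
    resid = p - y
    risk = np.mean(resid ** 2)
    # Chain rule: d/dtheta (y - p)^2 = 2 (p - y) p (1 - p) x
    grad = (2.0 * resid * p * (1.0 - p)) @ X / len(y)
    return risk, grad

def gradient_descent(X, y, step=1.0, iters=2000, radius=5.0):
    """Projected gradient descent on the ball of the given radius, started at 0."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        _, grad = risk_and_grad(theta, X, y)
        theta = theta - step * grad
        norm = np.linalg.norm(theta)
        if norm > radius:  # project back onto the ball if we leave it
            theta *= radius / norm
    return theta

# Simulated data from the assumed model P(Y = 1 | X) = sigma(<X, theta0>).
rng = np.random.default_rng(0)
n, d = 5000, 5
theta0 = np.zeros(d)
theta0[0] = 1.0
X = rng.standard_normal((n, d))
y = (rng.random(n) < sigmoid(X @ theta0)).astype(float)

theta_hat = gradient_descent(X, y)
error = np.linalg.norm(theta_hat - theta0)  # expected to be small when n >> d
```

The estimation error should shrink as $n$ grows relative to $d$, in line with the $\sqrt{(d \log n)/n}$ rate in part (b); the constant $C$ is not specified in the talk, so the sketch does not try to verify it.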

SLIDE 17

Why does assuming a statistical model make the landscape of empirical risk smooth?

1 Assuming a statistical model $Z_i \overset{\text{i.i.d.}}{\sim} P_Z$, $i = 1, \ldots, n$, we can define the population risk

$R(\theta) = E_Z\big[\widehat{R}_n(\theta)\big] = E_Z\Big[\frac{1}{n} \sum_{i=1}^{n} \ell(\theta; Z_i)\Big]$.

The population risk is usually very smooth.

2 We can transfer the good properties of the population risk to the empirical risk using a uniform convergence argument, so the empirical risk will also be smooth.

SLIDE 23

Recap: Non-convex binary linear classification

The model

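$Z_i = (X_i, Y_i)$, with $X_i \in \mathbb{R}^d$, $Y_i \in \{0, 1\}$, $i = 1, \ldots, n$.

◮ Non-convex square loss ($\ell$ is not convex in $\theta$)

$\ell(\theta; Z) = \big(Y - \sigma(\langle X, \theta \rangle)\big)^2$, where $\sigma(t) = 1/(1 + \exp(-t))$.

◮ Empirical risk

$\widehat{R}_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta; Z_i)$.

◮ Assume $X_i \overset{\text{i.i.d.}}{\sim} P_X$ ($P_X$ is sub-Gaussian), and $P(Y_i = 1 \mid X_i) = \sigma(\langle X_i, \theta_0 \rangle)$ with $\sigma(t) = 1/(1 + \exp(-t))$, $\theta_0 \in \mathbb{R}^d$.

◮ Population risk:

$R(\theta) = E_Z\big[(Y - \sigma(\langle X, \theta \rangle))^2\big] = E_Z\big[(\sigma(\langle X, \theta_0 \rangle) - \sigma(\langle X, \theta \rangle))^2\big] + c$.

◮ $R(\theta)$ has a unique minimum, which is $\theta_0$.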
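The second equality in the population risk is not derived on the slide. A short sketch of one standard argument, conditioning on $X$ and using $E[Y \mid X] = \sigma(\langle X, \theta_0 \rangle)$ under the assumed model, is:

```latex
\begin{aligned}
E\big[(Y - \sigma(\langle X, \theta \rangle))^2 \,\big|\, X\big]
  &= \operatorname{Var}(Y \mid X)
     + \big(E[Y \mid X] - \sigma(\langle X, \theta \rangle)\big)^2 \\
  &= \sigma(\langle X, \theta_0 \rangle)\big(1 - \sigma(\langle X, \theta_0 \rangle)\big)
     + \big(\sigma(\langle X, \theta_0 \rangle) - \sigma(\langle X, \theta \rangle)\big)^2 .
\end{aligned}
```

Taking the expectation over $X$ gives the stated form with $c = E_X\big[\sigma(\langle X, \theta_0 \rangle)(1 - \sigma(\langle X, \theta_0 \rangle))\big]$, which does not depend on $\theta$. The remaining term is non-negative and vanishes at $\theta = \theta_0$, which is consistent with $\theta_0$ being a minimizer.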

SLIDE 27

Population risk and empirical risk

The population risk has good properties under mild assumptions.

Figure: Population risk (axes $\theta_1$ and $\theta_2$, each from −3 to 3; the minimizer $\theta_0$ is marked).

Figure: An instance of empirical risk (same axes; $\theta_0 = [1, 0]$, $\widehat{\theta}_n = [0.816, −0.268]$).

How can we relate the properties of empirical risk to population risk? Uniform convergence!

SLIDE 28

Uniform convergence of gradients and Hessians.

Theorem (Uniform convergence. Informal)

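Under suitable assumptions, for any $\delta > 0$, there exists a positive constant $C$ depending on $(R, \delta)$ but independent of $n$ and $d$, such that as long as $n \ge C d \log d$, we have

1 $P\Big( \sup_{\theta \in B^d(0, R)} \big\| \nabla \widehat{R}_n(\theta) - \nabla R(\theta) \big\|_2 \le \sqrt{\frac{C d \log n}{n}} \Big) \ge 1 - \delta$.

2 $P\Big( \sup_{\theta \in B^d(0, R)} \big\| \nabla^2 \widehat{R}_n(\theta) - \nabla^2 R(\theta) \big\|_{\mathrm{op}} \le \sqrt{\frac{C d \log n}{n}} \Big) \ge 1 - \delta$.

The proof is based on concentration inequalities and covering numbers.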
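The gradient statement can be probed numerically. The sketch below is a rough illustration under assumptions made here rather than a check of the theorem itself: the population gradient is approximated by the gradient on a large independent sample, the supremum over the ball is replaced by a maximum over 50 random points, and the design is standard Gaussian with $\theta_0 = e_1$ and $R = 3$.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad_risk(theta, X, y):
    """Gradient of (1/n) sum (y_i - sigma(<x_i, theta>))^2."""
    p = sigmoid(X @ theta)
    return (2.0 * (p - y) * p * (1.0 - p)) @ X / len(y)

def sample(n, theta0, rng):
    """Draw n points from the assumed model with Gaussian covariates."""
    X = rng.standard_normal((n, len(theta0)))
    y = (rng.random(n) < sigmoid(X @ theta0)).astype(float)
    return X, y

def max_gradient_gap(n, thetas, theta0, X_pop, y_pop, rng):
    """Largest l2 gap between empirical and (proxy) population gradients."""
    X, y = sample(n, theta0, rng)
    gaps = [np.linalg.norm(grad_risk(t, X, y) - grad_risk(t, X_pop, y_pop))
            for t in thetas]
    return max(gaps)

rng = np.random.default_rng(1)
d, radius = 5, 3.0
theta0 = np.zeros(d)
theta0[0] = 1.0

# Large sample used as a stand-in for the population risk (an approximation).
X_pop, y_pop = sample(200_000, theta0, rng)

# Random test points drawn uniformly from the ball of the given radius.
directions = rng.standard_normal((50, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
thetas = directions * radius * rng.random((50, 1)) ** (1.0 / d)

gap_small_n = max_gradient_gap(200, thetas, theta0, X_pop, y_pop, rng)
gap_large_n = max_gradient_gap(20_000, thetas, theta0, X_pop, y_pop, rng)
```

With a hundredfold increase in $n$, the maximum gap should drop by roughly a factor of ten, which matches the $\sqrt{1/n}$ scaling in the bound up to the logarithmic factor.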

SLIDE 29

Uniform convergence implies unique minimum of empirical risk

Figure: Landscape of empirical risk ("The landscape of non-convex empirical risk"). Panels: 1. What we thought; 2. What hopefully is true; 3. What we will prove. Labels in the figure: risk, empirical risk, risk global min, ERM, barriers, many local mins, good local mins, smooth far from mins, uniform smooth surface.

SLIDE 30

Numerical experiment

Figure: Probability to find a unique local minimum (success rate, from 0.1 to 1) as a function of $n/(d \log d)$ (from 0.5 to 3.5), for $d = 20, 40, 80, 160, 320$.
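The talk does not say how uniqueness of the local minimum was tested. One simple proxy, sketched below under assumptions made here, is to run gradient descent from several random starting points and check whether all runs end at the same point. Agreement is only evidence of a unique minimizer, not a proof. The Gaussian design, $\theta_0 = e_1$, step size, number of starts, and tolerance are all illustrative choices.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad_risk(theta, X, y):
    """Gradient of (1/n) sum (y_i - sigma(<x_i, theta>))^2."""
    p = sigmoid(X @ theta)
    return (2.0 * (p - y) * p * (1.0 - p)) @ X / len(y)

def descend(theta, X, y, step=1.0, iters=3000):
    """Plain gradient descent with a fixed step size."""
    for _ in range(iters):
        theta = theta - step * grad_risk(theta, X, y)
    return theta

def runs_agree(n, d, rng, starts=5, radius=2.0, tol=1e-3):
    """True if gradient descent from random starts reaches one common point."""
    theta0 = np.zeros(d)
    theta0[0] = 1.0
    X = rng.standard_normal((n, d))
    y = (rng.random(n) < sigmoid(X @ theta0)).astype(float)
    limits = []
    for _ in range(starts):
        init = rng.standard_normal(d)
        init *= radius * rng.random() / np.linalg.norm(init)  # random start in the ball
        limits.append(descend(init, X, y))
    spread = max(np.linalg.norm(limit - limits[0]) for limit in limits)
    return bool(spread < tol)

rng = np.random.default_rng(0)
all_agree = runs_agree(n=2000, d=5, rng=rng)
```

Repeating `runs_agree` over many independent datasets for each ratio $n/(d \log d)$ and averaging the outcomes would give a success-rate curve of the kind described in the figure caption.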

SLIDE 31

Extension to other models

◮ Robust regression.

Linear regression with bounded loss. Robust to outliers.

◮ Gaussian mixture model with two equal-proportion Gaussians.

Two local minima connected by a saddle point.

◮ Very high-dimensional regime: $d \gg n$, sparse $\theta_0$.

Uniform convergence of the gradient in the sense of the $\ell_\infty$ norm.

SLIDE 32

Conclusion

1 For non-convex empirical risk minimization problems, in the worst case there can be exponentially many local minima.

2 If there is enough data generated by a statistical model, the landscape of empirical risk is smooth.

3 The uniform convergence of gradients and Hessians is a powerful tool and can supplement the classical empirical risk minimization theory.

SLIDE 33

Bibliography

[AHW+96] Peter Auer, Mark Herbster, and Manfred K. Warmuth, Exponentially many local minima for single neurons, Advances in Neural Information Processing Systems (1996), 316–322.

[MBM16] Song Mei, Yu Bai, and Andrea Montanari, The landscape of empirical risk for non-convex losses, arXiv preprint arXiv:1607.06534 (2016).
