Feb 08, 2024

SLIDE 1

On the Local Minima of the Empirical Risk

Chi Jin*¹, Lydia T. Liu*¹, Rong Ge², Michael I. Jordan¹

¹EECS, University of California, Berkeley. ²Duke University.

1 / 6 Chi Jin On the Local Minima of the Empirical Risk

SLIDE 4

Overview

Nonconvex Optimization.
◮ Gradient Descent (GD) → stationary points: local max, saddle points, local min.
◮ Perturbed GD [Jin et al. 2017] efficiently escapes local max and saddle points.
◮ How to deal with spurious local min?
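To illustrate the second bullet, here is a minimal sketch of the perturbation idea on a toy function. It is a simplification, not the algorithm of Jin et al. 2017: the test function, step size, gradient threshold, and box-shaped perturbation are all assumptions chosen for illustration. Plain GD started on the line y = 0 converges to the saddle at the origin, while the perturbed variant leaves it and reaches a local minimum near (0, ±1).

```python
import math
import random

# Assumed toy objective: F(x, y) = x^2/2 + y^4/4 - y^2/2.
# It has a saddle point at (0, 0) and local minima at (0, +1) and (0, -1).
def grad(p):
    x, y = p
    return (x, y**3 - y)

def gradient_descent(p, step=0.1, iters=500):
    for _ in range(iters):
        g = grad(p)
        p = (p[0] - step * g[0], p[1] - step * g[1])
    return p

def perturbed_gd(p, step=0.1, iters=500, g_thresh=1e-3, radius=1e-2, seed=0):
    # Hypothetical simplification: whenever the gradient is small,
    # add a small uniform perturbation (drawn from a box) before stepping.
    rng = random.Random(seed)
    for _ in range(iters):
        g = grad(p)
        if math.hypot(*g) <= g_thresh:
            p = (p[0] + rng.uniform(-radius, radius),
                 p[1] + rng.uniform(-radius, radius))
            g = grad(p)
        p = (p[0] - step * g[0], p[1] - step * g[1])
    return p

start = (1.0, 0.0)                 # on the line y = 0, which leads into the saddle
gd_end = gradient_descent(start)   # stays on y = 0 and ends at the saddle
pgd_end = perturbed_gd(start)      # ends close to one of the minima (0, +-1)
print("GD:", gd_end, "perturbed GD:", pgd_end)
```

Because the perturbation keeps firing near any stationary point in this simplified version, the final iterate only lands within roughly the perturbation radius of the minimum.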

SLIDE 6

Local Minima

In general, finding global minima is NP-hard.

Avoiding “shallow” local minima.

Goal: find approximate local minima of a smooth nonconvex function F, given access only to an erroneous version f, where $\sup_x |F(x) - f(x)| \le \nu$.
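A small numerical sketch may help show why this setting is hard. The functions below are assumptions chosen for illustration, not taken from the slides: F is a double-well function, and f adds a fast oscillation of amplitude ν. The two functions stay within ν of each other everywhere, yet f has many more local minima than F, all of them shallow.

```python
import math

NU = 0.05      # assumed error level (the bound on |F - f|)
OMEGA = 40.0   # assumed frequency of the error term

def F(x):
    # Smooth nonconvex target with local minima at x = -1 and x = +1.
    return 0.25 * x**4 - 0.5 * x**2

def f(x):
    # Erroneous version: within NU of F everywhere.
    return F(x) + NU * math.sin(OMEGA * x)

def count_local_minima(values):
    # Count strict interior local minima of a sampled function.
    return sum(1 for i in range(1, len(values) - 1)
               if values[i] < values[i - 1] and values[i] < values[i + 1])

grid = [-2.0 + 0.001 * i for i in range(4001)]   # grid on [-2, 2]
F_vals = [F(x) for x in grid]
f_vals = [f(x) for x in grid]

sup_error = max(abs(a - b) for a, b in zip(F_vals, f_vals))
minima_F = count_local_minima(F_vals)
minima_f = count_local_minima(f_vals)
print("sup error on grid:", sup_error)
print("local minima of F:", minima_F, "local minima of f:", minima_f)
```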

SLIDE 8

Application

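Statistical Learning. Minimize the population risk R while having access only to the empirical risk $\hat{R}_n$:

$$R(\theta) = \mathbb{E}_{z \sim D}[L(\theta; z)], \qquad \hat{R}_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} L(\theta; z_i).$$

Uniform convergence guarantees $\sup_\theta |R(\theta) - \hat{R}_n(\theta)| \le O(1/\sqrt{n})$.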
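The gap between the two risks can be checked numerically on a toy problem. The sketch below assumes a squared loss L(θ; z) = (θ − z)² with z drawn from a standard normal, so that R(θ) = θ² + 1 in closed form, and restricts θ to a grid on [−1, 1]. These choices are illustrative assumptions, not from the slides.

```python
import random

def loss(theta, z):
    return (theta - z) ** 2

def population_risk(theta):
    # For z ~ N(0, 1): E[(theta - z)^2] = theta^2 + 1.
    return theta**2 + 1.0

def empirical_risk(theta, sample):
    return sum(loss(theta, z) for z in sample) / len(sample)

def sup_gap(n, rng, grid):
    # Largest |R - R_n| over the grid for one sample of size n.
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return max(abs(population_risk(t) - empirical_risk(t, sample)) for t in grid)

rng = random.Random(0)
theta_grid = [-1.0 + 0.1 * i for i in range(21)]
gaps = {n: sup_gap(n, rng, theta_grid) for n in (100, 10000)}
for n, gap in gaps.items():
    print(f"n = {n:6d}  sup gap on grid = {gap:.4f}")
```

On typical runs the gap for the larger sample is around ten times smaller, in line with the 1/√n rate, although a single draw can deviate from that.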

SLIDE 11

Results

Goal: find ε-approximate local minima of F in polynomial time.

Central Questions:

1. What algorithm can achieve this?
2. How much error ν can be tolerated?

Zhang et al. [2017]: Stochastic Gradient Langevin Dynamics (SGLD) if $\nu \le \epsilon^2/d^8$.

This Work: Perturbed SGD on a “smoothed” version of f if $\nu \le \epsilon^{1.5}/d$.
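The slide does not spell out the smoothing, so the following is only a sketch under stated assumptions. It uses Gaussian smoothing, i.e. the average of f(x + z) over z ~ N(0, σ²), together with the standard zeroth-order estimator z (f(x + z) − f(x)) / σ², which is an unbiased estimate of the gradient of that smoothed function. The toy functions, σ, step sizes, batch size, and the Gaussian perturbation added to each step are all illustrative choices. Plain GD on f stalls in a shallow minimum near its starting point, while SGD on the smoothed function moves on to the neighborhood of the true minimum of F.

```python
import math
import random

NU, OMEGA = 0.05, 40.0   # assumed error level and oscillation frequency

def F(x):
    return 0.25 * x**4 - 0.5 * x**2          # local minima at x = -1, +1

def f(x):
    return F(x) + NU * math.sin(OMEGA * x)   # erroneous version of F

def f_prime(x):
    return x**3 - x + NU * OMEGA * math.cos(OMEGA * x)

def gd_on_f(x, step=0.005, iters=2000):
    # Plain gradient descent on f itself.
    for _ in range(iters):
        x -= step * f_prime(x)
    return x

def smoothed_grad_estimate(x, sigma, batch, rng):
    # Mini-batch average of z * (f(x + z) - f(x)) / sigma^2 with z ~ N(0, sigma^2).
    fx = f(x)
    total = 0.0
    for _ in range(batch):
        z = rng.gauss(0.0, sigma)
        total += z * (f(x + z) - fx) / sigma**2
    return total / batch

def perturbed_sgd_on_smoothed_f(x, sigma=0.2, step=0.05, iters=2000,
                                batch=50, perturb=1e-3, seed=0):
    # Hypothetical simplification: SGD on the smoothed f, with a small
    # Gaussian perturbation added to every stochastic gradient.
    rng = random.Random(seed)
    for _ in range(iters):
        g = smoothed_grad_estimate(x, sigma, batch, rng)
        x -= step * (g + rng.gauss(0.0, perturb))
    return x

x0 = 0.3
x_gd = gd_on_f(x0)                        # stalls in a shallow minimum near x0
x_sgd = perturbed_sgd_on_smoothed_f(x0)   # ends near the minimum of F at x = 1
print("GD on f:", x_gd, "perturbed SGD on smoothed f:", x_sgd)
```

Smoothing also shifts the minimizer slightly (here to about 0.94 rather than 1), which is one reason σ has to be balanced against the target accuracy. To see how different the two tolerances on the slide are, take ε = 0.01 and d = 100 and ignore constants: ε²/d⁸ is 10⁻²⁰, while ε^1.5/d is 10⁻⁵.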

SLIDE 14

Almost Sharp Guarantees

Are there better polynomial-time algorithms that tolerate larger error? No! Complete characterization of error ν vs accuracy ε and dimension d.

Poster: Wed 5-7 PM, #43. Thanks!
