An inverse problem perspective on machine learning
Lorenzo Rosasco
University of Genova, Massachusetts Institute of Technology, Istituto Italiano di Tecnologia
lcsl.mit.edu
Feb 9th, 2018 – Inverse Problems and Machine Learning Workshop, CM+X, Caltech
Today's selection
- Classics: "Learning as an inverse problem"
- Latest releases: "Kernel methods as a test bed for algorithm design"
Outline: Learning theory 2000; Learning as an inverse problem; Regularization; Recent advances
What's learning?
[Figure: labeled training points (x1, y1), ..., (x5, y5) together with new inputs x6, x7 whose labels are unknown.]
Learning is about inference, not interpolation.
Statistical Machine Learning (ML)
- $(X, Y)$ a pair of random variables in $\mathcal{X} \times \mathbb{R}$.
- $L : \mathbb{R} \times \mathbb{R} \to [0, \infty)$ a loss function.
- $\mathcal{H} \subset \mathbb{R}^{\mathcal{X}}$.
Problem: solve
$$\min_{f \in \mathcal{H}} \mathbb{E}[L(f(X), Y)]$$
given only $(x_1, y_1), \ldots, (x_n, y_n)$, a sample of $n$ i.i.d. copies of $(X, Y)$.
ML theory around 2000-2010
- All algorithms are ERM (empirical risk minimization); a minimal numerical sketch follows below:
$$\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i)$$
[Vapnik '96]
- Emphasis on empirical process theory...
$$\mathbb{P}\left( \sup_{f \in \mathcal{H}} \left| \frac{1}{n} \sum_{i=1}^{n} L(f(X_i), Y_i) - \mathbb{E}[L(f(X), Y)] \right| > \epsilon \right)$$
[Vapnik, Chervonenkis '71; Dudley, Giné, Zinn '94]
- ...and complexity measures, e.g. Gaussian/Rademacher complexities
$$C(\mathcal{H}) = \mathbb{E} \sup_{f \in \mathcal{H}} \sum_{i=1}^{n} \sigma_i f(X_i)$$
[Bartlett, Bousquet, Koltchinskii, Massart, Mendelson, ... '00s]
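As a complement (not from the slides), a minimal numerical sketch of ERM with the square loss over the class of linear functions; the synthetic data, the dimensions, and the hypothesis class are illustrative assumptions.

```python
import numpy as np

# Hypothetical data: n i.i.d. samples from an (unknown) distribution.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# ERM with the square loss over linear functions f(x) = <w, x>:
#   min_w (1/n) * sum_i (w^T x_i - y_i)^2
w_erm, *_ = np.linalg.lstsq(X, y, rcond=None)

empirical_risk = np.mean((X @ w_erm - y) ** 2)
print("empirical risk:", empirical_risk)
```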
Around the same time
- Cucker and Smale, On the mathematical foundations of learning, Bull. AMS
- Caponnetto, De Vito, R., Verri, Learning as an Inverse Problem, JMLR
- Smale, Zhou, Shannon sampling and function reconstruction from point values, Bull. AMS
Outline: Learning theory 2000; Learning as an inverse problem; Regularization; Recent advances
Inverse Problems (IP)
- $A : \mathcal{H} \to \mathcal{G}$ a bounded linear operator between Hilbert spaces
- $g \in \mathcal{G}$
Problem: find $f$ solving $Af = g$, assuming $A$ and $g^{\delta}$ are given, with $\|g^{\delta} - g\| \le \delta$.
[Engl, Hanke, Neubauer '96]
Ill-posedness
- Existence: $g \notin \mathrm{Range}(A)$
- Uniqueness: $\mathrm{Ker}(A) \neq \{0\}$
- Stability: $\|A^{\dagger}\| = \infty$ (large is also a mess); illustrated numerically below
[Figure: $f^{\dagger} \in \mathcal{H}$ mapped by $A$ into $\mathrm{Range}(A) \subset \mathcal{G}$, with noisy data $g^{\delta}$ near $g$.]
$$f^{\dagger} = A^{\dagger} g = \operatorname*{argmin}_{f \in \mathcal{O}} \|f\|_{\mathcal{H}}, \qquad \mathcal{O} = \operatorname*{argmin}_{f} \|Af - g\|^{2}$$
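A small illustration (not from the slides) of the stability issue: when $A$ has rapidly decaying singular values, the minimal-norm solution $A^{\dagger} g$ is extremely sensitive to a perturbation $g^{\delta}$ of the data. The operator, the noise level, and the sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# An ill-conditioned operator A: singular values 1, 1e-1, ..., 1e-9.
U, _ = np.linalg.qr(rng.standard_normal((10, 10)))
V, _ = np.linalg.qr(rng.standard_normal((10, 10)))
s = 10.0 ** (-np.arange(10))
A = U @ np.diag(s) @ V.T

f_true = rng.standard_normal(10)
g = A @ f_true
g_delta = g + 1e-6 * rng.standard_normal(10)   # small perturbation, ||g_delta - g|| ~ 1e-6

f_dag = np.linalg.pinv(A) @ g                  # minimal-norm solution from clean data
f_dag_noisy = np.linalg.pinv(A) @ g_delta      # ... and from perturbed data
print(np.linalg.norm(f_dag_noisy - f_dag))     # large: ||A^dagger|| amplifies the noise
```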
Is machine learning an inverse problem?
Inverse problem:
- $A : \mathcal{H} \to \mathcal{G}$, $g \in \mathcal{G}$
- Find $f$ solving $Af = g$, given $A$ and $g^{\delta}$ with $\|g^{\delta} - g\| \le \delta$.
Machine learning:
- $(X, Y)$, $L : \mathbb{R} \times \mathbb{R} \to [0, \infty)$, $\mathcal{H} \subset \mathbb{R}^{\mathcal{X}}$
- Solve $\min_{f \in \mathcal{H}} \mathbb{E}[L(f(X), Y)]$, given only $(x_1, y_1), \ldots, (x_n, y_n)$.
Actually yes, under some assumptions.
Key assumptions: least squares and RKHS
Assumption (least squares loss): $L(f(x), y) = (f(x) - y)^{2}$.
Assumption (RKHS):
- $(\mathcal{H}, \langle \cdot, \cdot \rangle)$ is a Hilbert space (real, separable)
- continuous evaluation functionals: for all $x \in \mathcal{X}$, let $e_x : \mathcal{H} \to \mathbb{R}$ with $e_x(f) = f(x)$; then $|e_x(f) - e_x(f')| \lesssim \|f - f'\|$
[Aronszajn '50]
Implications:
- $\|f\|_{\infty} \lesssim \|f\|$
- $\exists\, k_x \in \mathcal{H}$ such that $f(x) = \langle f, k_x \rangle$
Interpolation and sampling operator [Bertero, De Mol, Pike '85, '88]
Interpolation: find $f$ with $f(x_i) = \langle f, k_{x_i} \rangle = y_i$, $i = 1, \ldots, n$.
Sampling operator: $S_n : \mathcal{H} \to \mathbb{R}^n$, $(S_n f)_i = \langle f, k_{x_i} \rangle$ for all $i = 1, \ldots, n$, so the interpolation constraints read $S_n f = y$.
[Figure: a function on $\mathcal{X}$ sampled at the points $x_1, \ldots, x_5$.]
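A concrete sketch (not from the slides) of the sampling operator applied to a function in the span of kernel sections, $f = \sum_j c_j k_{z_j}$, so that $(S_n f)_i = \langle f, k_{x_i} \rangle = \sum_j c_j k(x_i, z_j)$. The Gaussian kernel, the bandwidth, and the points are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / sigma)
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / sigma)

rng = np.random.default_rng(0)
Z = rng.standard_normal((4, 2))   # centers z_j defining f = sum_j c_j k_{z_j}
c = rng.standard_normal(4)
X = rng.standard_normal((6, 2))   # sampling points x_1, ..., x_n

# Sampling operator applied to f: (S_n f)_i = sum_j c_j k(x_i, z_j)
Snf = gaussian_kernel(X, Z) @ c
print(Snf)                        # the vector of point evaluations f(x_1), ..., f(x_n)
```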
Learning and restriction operator [Caponnetto, De Vito, R. '05]
Restriction operator: $S_{\rho} : \mathcal{H} \to L^2(\mathcal{X}, \rho)$, $(S_{\rho} f)(x) = \langle f, k_x \rangle$, $\rho$-almost surely.
Learning: $S_{\rho} f = f_{\rho}$, where $f_{\rho}(x) = \int d\rho(y \mid x)\, y$ ($\rho$-almost surely), and
$$L^2(\mathcal{X}, \rho) = \Big\{ f \in \mathbb{R}^{\mathcal{X}} \ \Big|\ \|f\|_{\rho}^{2} = \int d\rho_{\mathcal{X}}\, |f(x)|^{2} < \infty \Big\}.$$
Learning as an inverse problem
Inverse problem: find $f$ solving $S_{\rho} f = f_{\rho}$, given $S_n$ and $y_n = (y_1, \ldots, y_n)$.
Least squares: for $f \in \mathcal{H}$,
$$\mathbb{E}(f(X) - Y)^{2} - \mathbb{E}(f_{\rho}(X) - Y)^{2} = \|S_{\rho} f - f_{\rho}\|_{\rho}^{2},$$
so the learning problem is $\min_{f \in \mathcal{H}} \|S_{\rho} f - f_{\rho}\|_{\rho}^{2}$.
Let's see what we got
- Noise model
- Integral operators & covariance operators
- Kernels
Noise model
Ideal: $S_{\rho} f = f_{\rho}$ and $S_{\rho}^{*} S_{\rho} f = S_{\rho}^{*} f_{\rho}$.
Empirical: $S_n f = y$ and $S_n^{*} S_n f = S_n^{*} y$.
Noise model:
$$\|S_n^{*} y - S_{\rho}^{*} f_{\rho}\| \le \delta_1, \qquad \|S_{\rho}^{*} S_{\rho} - S_n^{*} S_n\| \le \delta_2.$$
(Compare with inverse problem discretization, econometrics.)
Integral and covariance operators
- Extension operator $S_{\rho}^{*} : L^2(\mathcal{X}, \rho) \to \mathcal{H}$:
$$(S_{\rho}^{*} f)(x') = \int d\rho(x)\, k(x', x) f(x),$$
where $k(x, x') = \langle k_x, k_{x'} \rangle$ is positive definite.
- Covariance operator $S_{\rho}^{*} S_{\rho} : \mathcal{H} \to \mathcal{H}$:
$$S_{\rho}^{*} S_{\rho} = \int d\rho(x)\, k_x \otimes k_x.$$
Kernels
Choosing an RKHS implies choosing a representation.
Theorem (Moore-Aronszajn). Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be positive definite; then the completion of
$$\Big\{ f \in \mathbb{R}^{\mathcal{X}} \ \Big|\ f = \sum_{j=1}^{N} c_j k_{x_j},\ c_1, \ldots, c_N \in \mathbb{R},\ x_1, \ldots, x_N \in \mathcal{X},\ N \in \mathbb{N} \Big\}$$
w.r.t. the inner product defined by $\langle k_x, k_{x'} \rangle = k(x, x')$ is an RKHS.
Kernels
If $K(x, x') = x^{\top} x'$, then:
- $S_n$ is the $n \times D$ data matrix ($S_{\rho}$ the infinite data matrix)
- $S_n^{*} S_n$ and $S_{\rho}^{*} S_{\rho}$ are the empirical and true covariance operators
Other kernels (see the code sketch after this slide):
- $K(x, x') = (1 + x^{\top} x')^{p}$
- $K(x, x') = e^{-\|x - x'\|^{2} / \sigma}$
- $K(x, x') = e^{-\|x - x'\| / \sigma}$
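A short numerical sketch (not from the slides) of the objects on this slide for the linear kernel, plus the other kernel matrices; the data, dimensions, and bandwidth/degree choices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 8, 3
X = rng.standard_normal((n, D))          # S_n: the n-by-D data matrix

# Linear kernel K(x, x') = x^T x'
K_n = X @ X.T                            # S_n S_n^*: the n x n Gram matrix
C_n = X.T @ X                            # S_n^* S_n: D x D (empirical covariance, up to 1/n)

# Other kernels from the slide (degree/bandwidth chosen for illustration):
poly = (1.0 + X @ X.T) ** 3                                   # (1 + x^T x')^p
d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
gauss = np.exp(-d2 / 1.0)                                     # exp(-||x - x'||^2 / sigma)
laplace = np.exp(-np.sqrt(d2) / 1.0)                          # exp(-||x - x'|| / sigma)

# The nonzero eigenvalues of S_n S_n^* and S_n^* S_n coincide.
print(np.round(np.linalg.eigvalsh(K_n)[-3:], 6))
print(np.round(np.linalg.eigvalsh(C_n)[-3:], 6))
```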
What now? Steal
Outline: Learning theory 2000; Learning as an inverse problem; Regularization; Recent advances
Tikhonov aka ridge regression (a numerical sketch follows below)
$$f_n^{\lambda} = (S_n^{*} S_n + \lambda n I)^{-1} S_n^{*} y = S_n^{*} (\underbrace{S_n S_n^{*}}_{K_n} + \lambda n I)^{-1} y,$$
i.e. solve the linear system $(K_n + \lambda n I)\, c = y$ in the coefficients.
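A minimal sketch of kernel ridge regression in this dual form: solve $(K_n + \lambda n I) c = y$ and predict with $f(x) = \sum_i c_i k(x, x_i)$. The synthetic data, the Gaussian kernel, and the value of $\lambda$ are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / sigma)

rng = np.random.default_rng(0)
n = 100
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

lam = 1e-3                                        # regularization parameter lambda
K = gaussian_kernel(X, X)                         # K_n = S_n S_n^*
c = np.linalg.solve(K + lam * n * np.eye(n), y)   # (K_n + lambda*n*I) c = y

X_test = np.linspace(-3, 3, 5)[:, None]
f_test = gaussian_kernel(X_test, X) @ c           # f(x) = sum_i c_i k(x, x_i)
print(np.round(f_test, 3))
```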
Statistics
Theorem (Caponnetto, De Vito '05). Assume $K(X, X) \le 1$, $|Y| \le 1$ a.s., and $f^{\dagger} \in \mathrm{Range}\big((S_{\rho} S_{\rho}^{*})^{r}\big)$, $1/2 < r < 1$. If $\lambda_n = n^{-\frac{1}{2r+1}}$, then
$$\mathbb{E}\big[\|S_{\rho} f_n^{\lambda_n} - f^{\dagger}\|_{\rho}^{2}\big] \lesssim n^{-\frac{2r}{2r+1}}.$$
Proof sketch: for all $\lambda > 0$,
$$\mathbb{E}\big[\|S_{\rho} f_n^{\lambda} - f_{\rho}\|_{\rho}^{2}\big] \lesssim \frac{1}{\lambda}(\delta_1 + \delta_2) + \lambda^{2r}, \qquad \mathbb{E}[\delta_1],\ \mathbb{E}[\delta_2] \lesssim \frac{1}{\sqrt{n}}.$$
Iterative regularization (a numerical sketch follows below)
From the Neumann series...
$$f_n^{t} = \gamma \sum_{j=0}^{t-1} (I - \gamma S_n^{*} S_n)^{j} S_n^{*} y = \gamma S_n^{*} \sum_{j=0}^{t-1} (I - \gamma \underbrace{S_n S_n^{*}}_{K_n})^{j} y$$
...to gradient descent:
$$f_n^{t} = f_n^{t-1} - \gamma S_n^{*} (S_n f_n^{t-1} - y), \qquad c_n^{t} = c_n^{t-1} - \gamma (K_n c_n^{t-1} - y).$$
[Figure: training vs. test error as a function of the iteration $t$ (early stopping).]
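A sketch of this iteration on the coefficients, $c^{t} = c^{t-1} - \gamma (K_n c^{t-1} - y)$, where the number of iterations $t$ plays the role of the regularization parameter. The data, the Gaussian kernel, the step size, and the stopping times are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / sigma)

rng = np.random.default_rng(0)
n = 100
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

K = gaussian_kernel(X, X)
gamma = 1.0 / np.linalg.norm(K, 2)     # step size gamma <= 1/||K_n||
c = np.zeros(n)

for t in range(1, 201):                # t is the regularization parameter: stop early
    c = c - gamma * (K @ c - y)        # gradient step on ||S_n f - y||^2, in coefficients
    if t in (1, 10, 50, 200):
        print(t, np.mean((K @ c - y) ** 2))   # training error decreases with t
```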
Iterative regularization: statistics
Theorem (Bauer, Pereverzev, R. '07). Assume $K(X, X) \le 1$, $|Y| \le 1$ a.s., and $f^{\dagger} \in \mathrm{Range}\big((S_{\rho} S_{\rho}^{*})^{r}\big)$, $1/2 < r < 1$. If $t_n = n^{\frac{1}{2r+1}}$, then
$$\mathbb{E}\big[\|S_{\rho} f_n^{t_n} - f^{\dagger}\|_{\rho}^{2}\big] \lesssim n^{-\frac{2r}{2r+1}}.$$
Proof sketch: for all $t > 0$,
$$\mathbb{E}\big[\|S_{\rho} f_n^{t} - f_{\rho}\|_{\rho}^{2}\big] \lesssim t\,(\delta_1 + \delta_2) + \frac{1}{t^{2r}}, \qquad \mathbb{E}[\delta_1],\ \mathbb{E}[\delta_2] \lesssim \frac{1}{\sqrt{n}}.$$
Tikhonov vs iterative regularization
- Same statistical properties...
- ...but the time complexities differ: $O(n^{3})$ vs $O(n^{2}\, n^{\frac{1}{2r+1}})$.
- Iterative regularization provides a bridge between statistics and computations.
- Kernel methods become a test bed for algorithmic solutions.
Computational regularization
- Tikhonov: time $O(n^{3})$ + space $O(n^{2})$ for a $1/\sqrt{n}$ learning bound
- Iterative regularization: time $O(n^{2} \sqrt{n})$ + space $O(n^{2})$ for a $1/\sqrt{n}$ learning bound
Outline: Learning theory 2000; Learning as an inverse problem; Regularization; Recent advances
Steal from optimization
Acceleration:
- Conjugate gradient [Blanchard, Kramer '96]
- Chebyshev method [Bauer, Pereverzev, R. '07]
- Nesterov acceleration (Nesterov '83) [Salzo, R. '18] (a toy sketch follows below)
Stochastic gradient:
- Single-pass stochastic gradient [Tarres, Yao '05; Pontil, Ying '09; Bach, Dieuleveut, Flammarion '17]
- Multi-pass incremental gradient [Villa, R. '15]
- Multi-pass stochastic gradient with mini-batches [Lin, R. '16]
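As one example of the "steal from optimization" idea, a toy sketch of Nesterov-style momentum applied to the same kernel least squares iteration. This is an illustrative variant under assumed data, kernel, step size, and momentum schedule, not the exact scheme analyzed in the cited papers.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / sigma)

rng = np.random.default_rng(0)
n = 100
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

K = gaussian_kernel(X, X)
gamma = 1.0 / np.linalg.norm(K, 2)

c_prev = np.zeros(n)
v = np.zeros(n)
for t in range(1, 51):
    c = v - gamma * (K @ v - y)                # gradient step at the extrapolated point
    v = c + (t - 1) / (t + 2) * (c - c_prev)   # Nesterov momentum / extrapolation
    c_prev = c
print(np.mean((K @ c - y) ** 2))               # reaches a given training error in fewer iterations
```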