Neglected topics CS 446 Adversarial examples and deep networks 1 / - - PowerPoint PPT Presentation

▶

Aug 22, 2022 389 likes •674 views

Neglected topics CS 446 Adversarial examples and deep networks 1 / 23 Adversarial examples? Standard ML setup: We have training data; try to do well on withheld testing data. Adversarial/robust ML setup: We have training

SLIDE 1

Neglected topics

CS 446

SLIDE 2

Adversarial examples and deep networks

1 / 23

SLIDE 3

“Adversarial examples”?

◮ Standard ML setup:

◮ We have training data; try to do well on withheld testing data.

◮ Adversarial/robust ML setup:

◮ We have training data; try to do well on small perturbations of training and testing data.

2 / 23

SLIDE 4

“Adversarial examples”?

◮ Standard ML setup:

◮ We have training data; try to do well on withheld testing data.

◮ Adversarial/robust ML setup:

◮ We have training data; try to do well on small perturbations of training and testing data.

◮ This is an old problem (see for instance “robust statistics”). ◮ For deep networks, it has been rekindled for the following reasons:

◮ Deep networks have absurdly good performance on training and test error, comparable to humans. ◮ Unlike humans, deep networks completely choke on small perturbations.

◮ Some background reading:

◮ Original paper: “Intriguing properties of neural networks”; Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, Rob Fergus; https://arxiv.org/abs/1312.6199. ◮ Nice theory overview: video lecture by Sebastien Bubeck https://www.youtube.com/watch?v=9flSRJdnWek.

2 / 23

SLIDE 5

Adversarial examples in computer vision

(“Explaining and harnessing adversarial examples”; Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy.) ◮ Can make a small change humans can’t see, and fool an otherwise impressive deep network. ◮ There are versions that are “physical”, e.g., you wear special 3-d printed glasses and fool a deep-network-based security system.

◮ This is one reason self-driving cars are scary, but there are others. (The death caused by an Uber self-driving car was not due to a deep network.)

3 / 23

SLIDE 6

Formal statement of problem

◮ Training loss: ℓ(f(x), y). ◮ Adversarial training loss (ℓ∞ is popular): max

p∞≤δ ℓ(f(x + p), y)

◮ By making δ small and solving for p, we can find an imperceptible adversarial perturbation. ◮ There are many variations; e.g., forcing us to switch the label to a specific y′ (“targeted attacks”). ◮ Finding an adversarial example means solving this maximization

problem. There’s a lot of research into this, it seems to boil down to

gradient descent (ascent) variants.

4 / 23

SLIDE 7

Defenses

◮ Finding ways to make networks robust is a big research area (“defenses against these attacks”). ◮ Natural approach: do ERM on the adversarial loss: min

f∈F

1 n

max

pi∞≤δ ℓ

f(xi + pi), yi
.

(From “Towards deep learning models resistant to adversarial attacks” by Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, Adrian Vladu.)

5 / 23

SLIDE 8

Other comments

◮ Big research area (both attacks and defenses). ◮ Lots of hype (this is good and bad). ◮ Isn’t just an issue with deep networks; but it’s interesting to know what parts of the question are due to them. ◮ Might motivate changes in training algorithms.

6 / 23

SLIDE 9

Time series

7 / 23

SLIDE 10

Time series

◮ Rather than having IID ((x1, y1), . . . , (xn, yn)), we either have:

◮ ((x1, y1), . . . , (xn, yn)) where (xi+1, yi+1) uses (xi, yi) (“Markov assumption”), or even more history. ◮ Multiple “traces”: we collect m time series, with lengths (t1, . . . , tm):

(x(j)

i , y(j) i )

tj

i=1

m

j=1

.

◮ This is an extremely classical topic with many approaches inside and

utside ML.

◮ E.g., the signal processing community; look up “auto-regressive model” for a basic approach (linear). ◮ We skipped/rushed the Hidden Markov Model (HMM) slides, which give a graphical model formulation and approach. ◮ Recurrent neural networks (RNNs) are another approach. ◮ ML Theory community is behind on this topic (e.g., a “generalization bound” that doesn’t grow quickly with length tj). I’m not sure why.

8 / 23

SLIDE 11

SLIDE 12

Recurrent neural networks (RNNs)

◮ The model is based on a deep network f.

◮ At time i we get both an input xi, and a state vector si. ◮ We compute (yi, si+1) = f(xi, si), where yi is our output, and si+1 is the state vector consumed in the next round.

◮ Popular choice of f: “Long short-term memory (LSTM)”. ◮ Example: consume English words, output Japanese words.

◮ There are all sorts of issues with this; for instance, words should not be in 1-to-1 mapping! ◮ Most language stuff these days uses a “BiLSTM”, but I’ve also heard

f people using multi-layer 1-d conv equivalently well.

9 / 23

SLIDE 13

Reinforcement learning (RL)

10 / 23

SLIDE 14

RL setup

◮ We are again in the time series setup (x1, . . .), but:

◮ Our choices affect future x ! ◮ There are no clear losses/rewards; people talk about “rewards”, “feedback”, and “reinforcement”.

◮ There are many variants of the problem with many approaches.

◮ Some problems can be solved with a deterministic approach; dynamic programming was proposed by Bellman for reinforcement learning problems, and “Bellman equation” still fundamental in RL. ◮ For some other classical ideas, look up “MDP” and “POMDP”.

11 / 23

SLIDE 15

Chess example

◮ Very few moves actually result in feedback (checkmate or draw).

◮ We can work backwards from such moves to assign scores. ◮ This is computationally prohibitive, but some people have tried (look up “chess tablebase”). ◮ This is too conservative against weak opponents.

◮ Despite this, we can “guess” feedback, either deterministically, or “statistically” (as in course project).

◮ We can also “improve” such an estimate by playing against ourself and averaging the outcomes (“Monte-Carlo Tree Search (MCTS)”). ◮ If instead we form upper and lower bounds on the outcome and descend the game tree to refine them, it is called “alpha-beta search”.

12 / 23

SLIDE 16

Chess example (continued)

Here’s an alternating approach to chess RL (used by “alphago zero”):

1. Fix the evaluation/scoring function f, and play games against self

using MCTS (which improves the scoring function).

2. Go over the games, and fit f to the move choices with standard

supervised learning (course project suggests only this step). To train, alternate steps 1 and 2. To “test”/play, do step 1. (“AlphaGo Zero cheatsheet”, not by me.)

13 / 23

SLIDE 17

(“AlphaGo Zero cheatsheet”, not by me; larger version.)

14 / 23

SLIDE 18

Resources

◮ This is a huge field with not just many approaches, but many styles

f approaches from many different fields (not just CS or ML even).

◮ The Berkeley “Deep RL” course presents some cutting-edge material and also links to many other resources: http://rail.eecs.berkeley.edu/deeprlcourse/ . ◮ Theoretical simplification of the problem: bandit algorithms.

15 / 23

SLIDE 19

Natural language processing (NLP)

16 / 23

SLIDE 20

Natural language processing (NLP)

◮ This is a big application area concerned with human text. ◮ Most basic approach: rewrite task as supervised learning from Rd to Rk.

◮ Input encoding #1: “bag of words” (document becomes normalized vector of per-word counts); effective for easy problems. ◮ Input encoding #2: Word2Vec (a standard deep network that has become a very standard way to encode words).

◮ Cutting-edge approaches now use word-level and even character-level deep networks (or recurrent networks), with complicated outputs (tasks like a sequence of question-answer pairs). ◮ For more info, see this recent stanford NLP class: https://web.stanford.edu/class/cs224n/.

17 / 23

SLIDE 21

Dealing with data

18 / 23

SLIDE 22

Dealing with data

◮ Data cleaning/normalizing: a huge issue which could break everything we’ve discussed if it is ignored.

◮ Another issue is missing data/entries. People used to use EM for this, but I’m not sure what’s current practice?

◮ Data augmentation.

◮ For CIFAR, it’s standard to thrown in random crops and flips.

◮ pytorch provides tools for data cleaning and augmentation (look up torchvision.transforms).

19 / 23

SLIDE 23

Why are deep networks dominating?

20 / 23

SLIDE 24

Why are deep networks dominating?

◮ I don’t think anyone really knows. Certainly, no one has good predictive power (why didn’t we use ReLU, batch norm, and convnets when they were discovered in the 1970s?).

21 / 23

SLIDE 25

Why are deep networks dominating?

◮ I don’t think anyone really knows. Certainly, no one has good predictive power (why didn’t we use ReLU, batch norm, and convnets when they were discovered in the 1970s?). ◮ A few reasons:

◮ They seems to succinctly approximate many natural phenomena, perhaps due to some underlying compositional/hierarchical structure. ◮ They seem to work well with recent hardware coincidences (“GPU” was not designed for deep learning). ◮ They seem to work well with lots of data (at least, as they are trained now), and we now have lots of data. ◮ Gradient descent + deep networks = magic. ◮ The software infrastructure is amazing; “hacking” deep networks is somehow fun and accessible to basically every programmer. ◮ The momentum with the “social coding” ecosystem.

21 / 23

SLIDE 26

Neglected topics

Adversarial examples and deep networks

“Adversarial examples”?

◮ Standard ML setup:

◮ We have training data; try to do well on withheld testing data.

◮ Adversarial/robust ML setup:

◮ We have training data; try to do well on small perturbations of training and testing data.

“Adversarial examples”?

◮ Standard ML setup:

◮ We have training data; try to do well on withheld testing data.

◮ Adversarial/robust ML setup:

◮ We have training data; try to do well on small perturbations of training and testing data.

◮ This is an old problem (see for instance “robust statistics”). ◮ For deep networks, it has been rekindled for the following reasons:

◮ Deep networks have absurdly good performance on training and test error, comparable to humans. ◮ Unlike humans, deep networks completely choke on small perturbations.

◮ Some background reading:

Adversarial examples in computer vision

◮ This is one reason self-driving cars are scary, but there are others. (The death caused by an Uber self-driving car was not due to a deep network.)

Formal statement of problem

◮ Training loss: ℓ(f(x), y). ◮ Adversarial training loss (ℓ∞ is popular): max

◮ By making δ small and solving for p, we can find an imperceptible adversarial perturbation. ◮ There are many variations; e.g., forcing us to switch the label to a specific y′ (“targeted attacks”). ◮ Finding an adversarial example means solving this maximization

gradient descent (ascent) variants.

Defenses

◮ Finding ways to make networks robust is a big research area (“defenses against these attacks”). ◮ Natural approach: do ERM on the adversarial loss: min

1 n

max

(From “Towards deep learning models resistant to adversarial attacks” by Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, Adrian Vladu.)

Other comments

◮ Big research area (both attacks and defenses). ◮ Lots of hype (this is good and bad). ◮ Isn’t just an issue with deep networks; but it’s interesting to know what parts of the question are due to them. ◮ Might motivate changes in training algorithms.

Time series

Time series

◮ Rather than having IID ((x1, y1), . . . , (xn, yn)), we either have:

◮ ((x1, y1), . . . , (xn, yn)) where (xi+1, yi+1) uses (xi, yi) (“Markov assumption”), or even more history. ◮ Multiple “traces”: we collect m time series, with lengths (t1, . . . , tm):

tj

m

.

◮ This is an extremely classical topic with many approaches inside and

Recurrent neural networks (RNNs)

◮ The model is based on a deep network f.

◮ At time i we get both an input xi, and a state vector si. ◮ We compute (yi, si+1) = f(xi, si), where yi is our output, and si+1 is the state vector consumed in the next round.

◮ Popular choice of f: “Long short-term memory (LSTM)”. ◮ Example: consume English words, output Japanese words.

◮ There are all sorts of issues with this; for instance, words should not be in 1-to-1 mapping! ◮ Most language stuff these days uses a “BiLSTM”, but I’ve also heard

Reinforcement learning (RL)

RL setup

◮ We are again in the time series setup (x1, . . .), but:

◮ Our choices affect future x ! ◮ There are no clear losses/rewards; people talk about “rewards”, “feedback”, and “reinforcement”.

◮ There are many variants of the problem with many approaches.

◮ Some problems can be solved with a deterministic approach; dynamic programming was proposed by Bellman for reinforcement learning problems, and “Bellman equation” still fundamental in RL. ◮ For some other classical ideas, look up “MDP” and “POMDP”.

Chess example

◮ Very few moves actually result in feedback (checkmate or draw).

◮ We can work backwards from such moves to assign scores. ◮ This is computationally prohibitive, but some people have tried (look up “chess tablebase”). ◮ This is too conservative against weak opponents.

◮ Despite this, we can “guess” feedback, either deterministically, or “statistically” (as in course project).

◮ We can also “improve” such an estimate by playing against ourself and averaging the outcomes (“Monte-Carlo Tree Search (MCTS)”). ◮ If instead we form upper and lower bounds on the outcome and descend the game tree to refine them, it is called “alpha-beta search”.

Chess example (continued)

Here’s an alternating approach to chess RL (used by “alphago zero”):

using MCTS (which improves the scoring function).

supervised learning (course project suggests only this step). To train, alternate steps 1 and 2. To “test”/play, do step 1. (“AlphaGo Zero cheatsheet”, not by me.)

(“AlphaGo Zero cheatsheet”, not by me; larger version.)

Resources

◮ This is a huge field with not just many approaches, but many styles

◮ The Berkeley “Deep RL” course presents some cutting-edge material and also links to many other resources: http://rail.eecs.berkeley.edu/deeprlcourse/ . ◮ Theoretical simplification of the problem: bandit algorithms.

Natural language processing (NLP)

Natural language processing (NLP)

◮ This is a big application area concerned with human text. ◮ Most basic approach: rewrite task as supervised learning from Rd to Rk.

◮ Input encoding #1: “bag of words” (document becomes normalized vector of per-word counts); effective for easy problems. ◮ Input encoding #2: Word2Vec (a standard deep network that has become a very standard way to encode words).

◮ Cutting-edge approaches now use word-level and even character-level deep networks (or recurrent networks), with complicated outputs (tasks like a sequence of question-answer pairs). ◮ For more info, see this recent stanford NLP class: https://web.stanford.edu/class/cs224n/.

Dealing with data

Dealing with data

◮ Data cleaning/normalizing: a huge issue which could break everything we’ve discussed if it is ignored.

◮ Another issue is missing data/entries. People used to use EM for this, but I’m not sure what’s current practice?

◮ Data augmentation.

◮ For CIFAR, it’s standard to thrown in random crops and flips.

◮ pytorch provides tools for data cleaning and augmentation (look up torchvision.transforms).

Why are deep networks dominating?

Why are deep networks dominating?

◮ I don’t think anyone really knows. Certainly, no one has good predictive power (why didn’t we use ReLU, batch norm, and convnets when they were discovered in the 1970s?).

Why are deep networks dominating?

◮ I don’t think anyone really knows. Certainly, no one has good predictive power (why didn’t we use ReLU, batch norm, and convnets when they were discovered in the 1970s?). ◮ A few reasons:

Other big neglected topics

Other big neglected topics

◮ Interpretability (crucial for many applications, including medicine and law). ◮ Applications-specific issues (e.g., in audio, robotics, . . . ).