Graph-based, Self-Supervised Program Repair from Diagnostic Feedback
ICML 2020. Michihiro Yasunaga, Percy Liang. Stanford University
Why program repair? Programmers spend 75% of time fixing source code errors. Automatic program repair can
Our contributions: 2. Self-supervised learning
○ Collect unlabeled programs
○ Corrupt and get diagnostic feedback (e.g. run compiler)
⇒ Extra training data: <broken code, feedback, fixed code>

Working code                  Corrupted
 5 int i, n;                   5 int i, n;
 6 string A;                   6 char A;
 7 cin >> n;                   7 cin >> n .
 8 A.resize(n);                8 A.resize(n);
 9 for (i=0;i<n;i++){          9 for (i=0;i<n;i++){
10 cin >> A[i];               10 cin >> A[j];
11 cout << i; }               11 cout << i; }

Compiler output for the corrupted code: Error! line 7: expected ‘;’
Our results
Improved performance on two applications
● DeepFix: correct intro programming assignments in C
● SPoC: correct output of C++ program synthesis
[Figure: results on DeepFix Test and SPoC TestP]
Outline
● Innovations
  1. Reasoning via program-feedback graph
  2. Self-supervised learning
● Evaluations
  1. DeepFix
  2. SPoC
● Analysis & Examples
● Takeaways
1. Reasoning via program-feedback graph
1. Reasoning via program-feedback graph: Challenges
● How to connect two modalities: program and feedback?
● How to model the reasoning of repair (e.g. tracking symbols)?

Source code
 4 int main() {
 5 char tmp, a, b;
 6 map<string,int> mp;
 7 cin >> a >> b;
 8 int i, j;
 9 for (i = 0; i < a.size() ...
10 tmp.push_back(a[i]);
11 string tmp1 = tmp;

Compiler message
request for member ‘size’ in ‘a’, which is of non-class type ‘char’
1. Reasoning via program-feedback graph: Our solution: program-feedback graph
● Join program & feedback through symbols relevant to program repair → shared/abstracted semantic space
● Reason over this space using graph attention
[Figure: the source code and compiler message from the Challenges slide, with graph nodes on char, a, b and size in the code and on ‘size’, ‘a’ and ‘char’ in the message]
1. Reasoning via program-feedback graph: How to construct graph?
● Recognize token types: <data type>, <identifier>, <operator>, <diagnostic argument>
● Nodes: diagnostic arguments, their occurrences, and all identifiers
● Edges: connect identical tokens to capture semantic correspondence
[Figure: the source code and compiler message from the Challenges slide, with tokens labeled by type and the nodes char, a, b and size in the code connected to ‘size’, ‘a’ and ‘char’ in the message]
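The construction rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the tokenizer, the keyword list, and the names `build_graph`, `TOKEN_RE` and `KEYWORDS` are all assumptions made for the example, and diagnostic arguments are taken to be whatever the compiler message puts in quotes.

```python
import re
from itertools import combinations

# Hypothetical, simplified tokenizer: a word, or any single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\S")
# Hypothetical and deliberately incomplete keyword list used to spot identifiers.
KEYWORDS = {"int", "char", "string", "map", "for", "if", "while", "return", "cin", "cout"}


def build_graph(code_lines, message):
    """Return (nodes, edges) of a toy program-feedback graph.

    code_lines: list of (line_number, text) pairs.
    A node is (source, position, token); source is "msg" or a code line number.
    """
    # Diagnostic arguments: the quoted tokens in the compiler message.
    diag_args = [a.strip() for a in re.findall(r"[‘'](.+?)[’']", message)]
    nodes = [("msg", k, arg) for k, arg in enumerate(diag_args)]
    for line_no, text in code_lines:
        for pos, tok in enumerate(TOKEN_RE.findall(text)):
            is_identifier = bool(re.fullmatch(r"[A-Za-z_]\w*", tok)) and tok not in KEYWORDS
            # Keep occurrences of diagnostic arguments and all identifiers.
            if tok in diag_args or is_identifier:
                nodes.append((line_no, pos, tok))
    # Edges connect every pair of nodes that carry the identical token.
    edges = [(u, v) for u, v in combinations(nodes, 2) if u[2] == v[2]]
    return nodes, edges
```

Called on lines 5, 7 and 10 of the example program together with the ‘size’ message, the sketch links the ‘char’ and ‘a’ nodes of the message to their occurrences in the code, which is the correspondence the graph is meant to expose.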
1. Reasoning via program-feedback graph: Model
● Initial encoding
● Graph attention
● Recontextualization
● Decoding
1. Reasoning via program-feedback graph: Model (Initial encoding)

Source code
1 int main() {
2 char tmp, a, b;
3 map<string,int> mp;
...

Compiler message
9: request for member ‘size’ …

[Figure: the tokens x_11 x_12 x_13 ... of each source line (Line 1, Line 2, Line 3) pass through LSTM code (1); the feedback (compiler message), i.e. the line index i_err and the message content m_1 m_2 m_3, passes through LSTM msg (1); a position embedding is also shown]
1. Reasoning via program-feedback graph: Model (Graph attention)
● Message passing across tokens with long-range dependencies
[Figure: a Graph Attention layer sits on top of the initial encoders (LSTM code (1) per source line, LSTM msg (1) for the feedback). On the program-feedback graph, the token states of the code lines (h_x11 h_x12 h_x13 ..., h_x21 h_x22 h_x23 ..., h_x31 h_x32 h_x33 ...) and of the compiler message (h_m1 h_m2 h_m3 ...) are aggregated with multi-head attention]
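As a rough illustration of attention-based message passing over graph neighbors, the sketch below performs one aggregation round in plain Python. It is a simplified stand-in, not the layer used in the talk: it has a single head, no learned weights, and scores neighbors with a plain dot product. The function name `graph_attention` and the dictionary-based graph format are assumptions for the example.

```python
import math


def graph_attention(h, neighbors):
    """One round of attention-weighted aggregation over graph neighbors.

    h: dict mapping node -> list of floats (the node state).
    neighbors: dict mapping node -> list of adjacent nodes.
    Assumption: unparameterized dot-product scores and a single head.
    """
    updated = {}
    for node, state in h.items():
        nbrs = [node] + list(neighbors.get(node, []))  # include a self-loop
        scores = [sum(a * b for a, b in zip(state, h[n])) for n in nbrs]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # New state: softmax-weighted average of the neighbor states.
        updated[node] = [
            sum(w / total * h[n][d] for w, n in zip(weights, nbrs))
            for d in range(len(state))
        ]
    return updated
```

A node without neighbors keeps its state, while a node such as the ‘a’ in the message mixes in the states of the occurrences of a in the code, regardless of how far apart those tokens are in the text.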
1. Reasoning via program-feedback graph: Model (Recontextualization)

Source code
1 int main() {
2 char tmp, a, b;
3 map<string,int> mp;
...

Compiler message
9: request for member ‘size’ …

[Figure: on top of Graph Attention, LSTM code (2) re-encodes each source line (Line 1, Line 2, Line 3) and LSTM msg (2) re-encodes the feedback (line idx, msg content); LSTM code (3) is stacked above the LSTM code (2) outputs]
1. Reasoning via program-feedback graph: Model (Decoding)
● Localize = 2 (MLP + softmax)
● Repair = "string tmp,a,b;" (Pointer-Generator Decoder)

Source code
1 int main() {
2 char tmp, a, b;
3 map<string,int> mp;
...

Compiler message
9: request for member ‘size’ …

[Figure: the two decoders sit on top of the model stack (LSTM code (3), LSTM code (2), LSTM msg (2), Graph Attention)]
1. Reasoning via program-feedback graph: Model overview
[Figure: model overview]
2. Self-supervised learning
2. Self-supervised learning: Why?
● Labeled datasets of program repair are small (10-100K examples)
● Vast amount of unlabeled programs available online
● Can we leverage them to improve learning?
[Figure: online sources with >> 1M submissions and > 30M repos]
2. Self-supervised learning: Our idea (outline)
Step 1. Collect unlabeled, working programs y
Step 2. Design (randomized) program corruption procedure P
Step 3. Corrupt and get diagnostic feedback (e.g. run compiler)
        ⇒ Extra training data: <broken code x, feedback f, fixed code y>
Step 4. Use them for pre-training
2. Self-supervised learning: 1. Collect unlabeled programs
● Our target tasks (DeepFix & SPoC) are in C/C++
● Collect 300K working C++ programs from codeforces.com
2. Self-supervised learning: 2. How to design corruption procedure P?
● Look at common errors (know your enemy)

Statistics
Error type                Avg.  Experienced  Beginner  SPoC  Common compiler messages
Expected ...              30%                48%       35%   expected @@@ (e.g. expected ‘;’ before...); missing @@@ (e.g. missing ")
  ● operator/punctuator   23%   9%           37%       29%
  ● primary expression    7%                 11%       6%
Identifier type           11%   9%           5%        18%   redeclaration/conflicting declaration; invalid conversion from <type> to <type>
Identifier undeclared     42%   62%          33%       31%   @@@ was not declared
Others                    17%   20%          14%       16%   ‘else’ without a previous ‘if’; no matching function for call to...
[Gupta et al. 17] [Mesbah et al. 19] [Kulal et al. 19]
2. Self-supervised learning: 2. How to design corruption procedure P?
● Look at common errors
● Design perturbation modules M to cause those errors

Perturbation module                                        Example
Syntax (delete/insert/replace operators .;(){}'"+, etc.)   return 0; }  →  return 0; } }
ID-type (delete/insert/replace type)                       string tmp;  →  char tmp;
ID-typo (delete/insert/replace IDentifier)                 int a, b=0, m;  →  int a, m;
Keyword (delete/insert/replace keyword/call)               if (n >= 0)  →  while (n >= 0)
2. Self-supervised learning: 2. Our corruption procedure P
● Look at common errors
● Design perturbation modules M to cause those errors
● P: Sample 1-5 modules from M, and apply to program sequentially, e.g. ID-type, ID-typo, Syntax.

Working code
 5 int i, n;
 6 string A;
 7 cin >> n;
 8 A.resize(n);
 9 for (i = 0; i < n; i++){
10 cin >> A[i];
11 cout << i; }

Perturbed 1: line 6 becomes "char A;"
Perturbed 2: in addition, line 10 becomes "cin >> A[j];"
Perturbed 3: in addition, line 7 becomes "cin >> n ."
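The procedure P can be sketched as follows. This is a toy version under stated assumptions: each module below implements only the single edit shown on the slides (the slides describe delete/insert/replace variants), and the names `corrupt`, `MODULES` and `_replace_once` are hypothetical.

```python
import random


def _replace_once(lines, rng, old, new):
    """Replace the first `old` with `new` on one randomly chosen matching line."""
    candidates = [i for i, line in enumerate(lines) if old in line]
    if candidates:
        i = rng.choice(candidates)
        lines[i] = lines[i].replace(old, new, 1)
    return lines


# Toy perturbation modules: one hard-coded edit each (an assumption of this sketch).
def syntax(lines, rng):
    return _replace_once(lines, rng, ";", ".")

def id_type(lines, rng):
    return _replace_once(lines, rng, "string", "char")

def id_typo(lines, rng):
    return _replace_once(lines, rng, "[i]", "[j]")

def keyword(lines, rng):
    return _replace_once(lines, rng, "if (", "while (")

MODULES = [syntax, id_type, id_typo, keyword]


def corrupt(lines, rng, min_edits=1, max_edits=5):
    """P: sample 1-5 modules from M and apply them to the program sequentially."""
    lines = list(lines)  # work on a copy so the working program stays intact
    applied = []
    for _ in range(rng.randint(min_edits, max_edits)):
        module = rng.choice(MODULES)
        lines = module(lines, rng)
        applied.append(module.__name__)
    return lines, applied
```

Passing a seeded `random.Random` makes a corruption reproducible, which helps when the same working program is corrupted several times to produce different broken versions.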
2. Self-supervised learning: 3. Prepare pre-training data
● 300K working programs from codeforces.com
● For each program, create corrupted versions by applying P
⇒ New program repair examples: <broken code, feedback, fixed code>

Working code                  Corrupted (after P)
 5 int i, n;                   5 int i, n;
 6 string A;                   6 char A;
 7 cin >> n;                   7 cin >> n .
 8 A.resize(n);                8 A.resize(n);
 9 for (i=0;i<n;i++){          9 for (i=0;i<n;i++){
10 cin >> A[i];               10 cin >> A[j];
11 cout << i; }               11 cout << i; }

Compiler output for the corrupted code: Error! line 7: expected ‘;’
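One way to turn a working program into a training triple is sketched below. The use of g++ with `-fsyntax-only` reading from standard input is an assumption about tooling, as are the names `gpp_feedback` and `make_example`; discarding corruptions that still compile is likewise a choice made for this sketch rather than something the slides state.

```python
import subprocess


def gpp_feedback(code):
    """Return g++ diagnostics for `code`, or "" if it compiles.

    Assumption: g++ is installed and on PATH.
    """
    proc = subprocess.run(
        ["g++", "-fsyntax-only", "-x", "c++", "-"],
        input=code, capture_output=True, text=True,
    )
    return proc.stderr if proc.returncode != 0 else ""


def make_example(working, corrupt, get_feedback=gpp_feedback):
    """Build one <broken code, feedback, fixed code> triple, or None."""
    broken = corrupt(working)
    feedback = get_feedback(broken)
    if not feedback:
        # The corruption did not cause a compiler error, so there is nothing to learn from.
        return None
    return {"broken": broken, "feedback": feedback, "fixed": working}
```

Because the feedback function is a parameter, the same code works with any other diagnostic source, and it can be exercised without a compiler by passing a stub.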
2. Self-supervised learning: What’s interesting?
● Typically, pre-training task ≠ target task (e.g. masked LM vs. QA)
● Here, targeted pre-training (pre-training task = target task = program repair)
  ○ More direct pre-training structure
  ○ Data distributions can be different between pre-training & target
Evaluation 1: DeepFix
Evaluation 1: DeepFix: Task [Gupta et al., 17]
● Repair C programs
● May have multiple error lines
● Apply repair model iteratively (up to 5 times)
Evaluation 1: DeepFix: Our model outputs

Input code
 4 int main() {
 5 int n;
 6 int * m[2];
 7 m[0] = malloc(n*sizeof(int));
 8 m[1] = malloc(n*sizeof(int));
 9 for (i = 0; i < n; i++) {
10 m[0][i] = -1;
11 m[1][i] = -1; }
12 return 0 }

Error message: line 9: ‘i’ undeclared
DrRepair Attempt 1: line 5 becomes "int n, i;"
Error message: line 12: expected ‘;’ before ‘}’
DrRepair Attempt 2: line 12 becomes "return 0 ; }"
Compiled!!
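The iterative use of a single-line repair model can be sketched as a loop. Both callables are placeholders: `get_feedback` stands for running the compiler and `repair_model` for the trained model, and their signatures, like the name `iterative_repair`, are assumptions made for this example.

```python
def iterative_repair(lines, get_feedback, repair_model, max_attempts=5):
    """Apply a one-line repair model until the code compiles or attempts run out.

    get_feedback(lines) -> error message, or "" when the code compiles.
    repair_model(lines, feedback) -> (line_index, replacement_line).
    Returns (lines, compiled).
    """
    lines = list(lines)  # do not modify the caller's program
    for attempt in range(max_attempts + 1):
        feedback = get_feedback(lines)
        if not feedback:
            return lines, True
        if attempt == max_attempts:
            break  # budget used up and the code still does not compile
        idx, fixed_line = repair_model(lines, feedback)
        lines[idx] = fixed_line
    return lines, False
```

Each round repairs one line based on the current compiler message, so a program with two error lines, like the one above, needs two attempts.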
Evaluation 1: DeepFix: Results
[Figure: Test (full repair accuracy)]
● Prior works do not use compiler messages
● Use of compiler messages is important