Statistical Analysis of Computer Program Text, Charles Sutton
Statistical Analysis of Computer Program Text Charles Sutton University of Edinburgh
Source code is a means of human communication
Development “out in the open”
[Figure: count (x1000) per year, 2011 to 2014, of posts (Stack Overflow), pull requests (Github), and repositories (Sourceforge)]
Probabilistic modelling
[Diagram: data (source files), a model (family of distributions), and a problem (objective function) yield a distribution, which is then used to “do stuff”]
Supervised learning: data (x_1, y_1)…(x_n, y_n); predict y from x: p(y | x_test)
Unsupervised: data x_1…x_n; “explore” x: p(z | x_test); inspect distribution: p(z | x_1…x_n)
Learning Natural Coding Conventions [Allamanis, Barr, Bird, Sutton; FSE 2014]
junit/src/test/java/junit/tests/runner/TextRunnerTest.java
public class TextRunnerTest extends TestCase {
    void execTest(String testClass, boolean success) throws Exception {
        ...
        InputStream i = p.getInputStream();
        while ((i.read()) != -1);
        ...
    }
    ...
}
For the variable i in the TextRunnerTest.java snippet above:
Suggest alternate names: input, inputStream, is, stream
Score and threshold: input (81.93%)
Language Models for Source Code
Probability distribution over token sequences: [equation not preserved]
Consider naive estimator: [equation not preserved]
In Naturalize: choose the name other programmers use in similar contexts
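The idea of choosing the name that other programmers use in similar contexts can be sketched as scoring each candidate name with a language model and keeping the most probable one. The following is only a minimal illustration, assuming a bigram model with add-one smoothing; the n-gram order and smoothing used by the actual tool are not given on the slide. The class `BigramNameScorer`, the `<NAME>` placeholder, and the toy training corpus are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch (not the Naturalize implementation): rank candidate
// names by the smoothed bigram log-probability of the resulting token sequence.
class BigramNameScorer {
    static final String PLACEHOLDER = "<NAME>";

    private final Map<String, Integer> contextCounts = new HashMap<>();
    private final Map<String, Integer> bigramCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();

    // Count bigrams, and how often each token starts a bigram, in one snippet.
    void train(List<String> tokens) {
        vocabulary.addAll(tokens);
        for (int i = 0; i + 1 < tokens.size(); i++) {
            contextCounts.merge(tokens.get(i), 1, Integer::sum);
            bigramCounts.merge(tokens.get(i) + " " + tokens.get(i + 1), 1, Integer::sum);
        }
    }

    // Add-one smoothed log P(next | prev); the extra +1 reserves mass for unseen tokens.
    double logProb(String prev, String next) {
        int pair = bigramCounts.getOrDefault(prev + " " + next, 0);
        int context = contextCounts.getOrDefault(prev, 0);
        return Math.log((pair + 1.0) / (context + vocabulary.size() + 1.0));
    }

    // Log-probability of the template with every placeholder replaced by the given name.
    double score(List<String> template, String name) {
        double total = 0.0;
        String prev = null;
        for (String raw : template) {
            String token = raw.equals(PLACEHOLDER) ? name : raw;
            if (prev != null) {
                total += logProb(prev, token);
            }
            prev = token;
        }
        return total;
    }

    // Return the candidate whose substitution gives the highest score.
    String best(List<String> template, List<String> candidates) {
        String bestName = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String candidate : candidates) {
            double s = score(template, candidate);
            if (s > bestScore) {
                bestScore = s;
                bestName = candidate;
            }
        }
        return bestName;
    }

    // Toy training corpus, invented for illustration only.
    static BigramNameScorer toyModel() {
        BigramNameScorer model = new BigramNameScorer();
        List<String> stream = Arrays.asList(
            "InputStream", "input", "=", "p", ".", "getInputStream", "(", ")", ";");
        model.train(stream);
        model.train(stream);
        model.train(Arrays.asList("int", "i", "=", "0", ";"));
        return model;
    }

    public static void main(String[] args) {
        List<String> template = Arrays.asList(
            "InputStream", PLACEHOLDER, "=", "p", ".", "getInputStream", "(", ")", ";");
        System.out.println(toyModel().best(template, Arrays.asList("i", "input")));
    }
}
```

In this toy corpus the declaration context `InputStream <NAME> =` has only been seen with `input`, so that candidate scores higher than `i`. A real system would train on a large code base and apply a threshold before suggesting a rename.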
Naming Methods and Classes [Allamanis, Barr, Bird, Sutton; FSE 2015]
Name that Tune Java Method
private void createDefaultShader() {
    String vertexShader = "literal_1";
    String fragmentShader = "literal_2";
    shader = new ShaderProgram(vertexShader,
                               fragmentShader);
    if (shader.isCompiled() == false)
        throw new IllegalArgumentException(
            "literal_3" + shader.getLog());
}
Figure 1: This method is from libgdx’s CameraGroupStrategy
libgdx: “Desktop/Android/Blackberry/iOS/HTML5 Java game development framework” http://libgdx.badlogicgames.com
Embedding Identifiers: Log Bilinear Context Model
[Mnih, Hinton, 2007; Mnih, Teh, 2012; Mnih, Kavukcuoglu, 2013; Maddison, Tarlow, 2014]
$$P(t \mid c_{1:m}) = \frac{\exp\{s_\theta(t, c_{1:m})\}}{\sum_{t'} \exp\{s_\theta(t', c_{1:m})\}}$$
$$s_\theta(t, c_{1:m}) = q_t^\top \hat{r}_c + b_t$$
Example: c = (private, void, (, ), {, String, vertexShader, =, “literal_1”, ;, String, …), t = createDefaultShader
[Figure: embedding space showing $q_{\text{createDefaultShader}}$, $q_{\text{hashCode}}$, and $\hat{r}_c$]
$q_v \in \mathbb{R}^D$ are “embeddings”: model parameters
What about $\hat{r}_c$? More complex, we need to summarize many tokens
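The scoring rule on this slide (a dot product between a target embedding and a context vector, plus a bias, normalized with a softmax over candidate names) can be sketched in a few lines. This is a minimal sketch under stated assumptions: the slide says the context vector needs a more complex summary of many tokens, so the plain average of context-token embeddings used here is a stand-in, not the authors' method. The class `LogBilinearSketch` and all numeric values are hypothetical.

```java
// Hypothetical sketch of a log-bilinear scorer; the embeddings below are made-up numbers.
class LogBilinearSketch {
    // s(t, c) = q_t . r_c + b_t
    static double score(double[] targetEmbedding, double[] contextVector, double bias) {
        double dot = 0.0;
        for (int d = 0; d < targetEmbedding.length; d++) {
            dot += targetEmbedding[d] * contextVector[d];
        }
        return dot + bias;
    }

    // Assumption: summarize the context as the mean of its token embeddings.
    static double[] averageContext(double[][] contextEmbeddings) {
        int dim = contextEmbeddings[0].length;
        double[] r = new double[dim];
        for (double[] embedding : contextEmbeddings) {
            for (int d = 0; d < dim; d++) {
                r[d] += embedding[d] / contextEmbeddings.length;
            }
        }
        return r;
    }

    // P(t | c) for every candidate t: softmax of the scores, shifted by the max for stability.
    static double[] probabilities(double[][] targetEmbeddings, double[] biases, double[] contextVector) {
        int n = targetEmbeddings.length;
        double[] scores = new double[n];
        double max = Double.NEGATIVE_INFINITY;
        for (int t = 0; t < n; t++) {
            scores[t] = score(targetEmbeddings[t], contextVector, biases[t]);
            max = Math.max(max, scores[t]);
        }
        double normalizer = 0.0;
        for (int t = 0; t < n; t++) {
            scores[t] = Math.exp(scores[t] - max);
            normalizer += scores[t];
        }
        for (int t = 0; t < n; t++) {
            scores[t] /= normalizer;
        }
        return scores;
    }

    public static void main(String[] args) {
        double[][] context = {{1.0, 0.0}, {1.0, 0.5}};   // embeddings of two context tokens
        double[][] targets = {{2.0, 0.0}, {-1.0, 1.0}};  // two candidate method names
        double[] biases = {0.0, 0.0};
        double[] p = probabilities(targets, biases, averageContext(context));
        System.out.println("P(candidate 0) = " + p[0] + ", P(candidate 1) = " + p[1]);
    }
}
```

The candidate whose embedding points in the same direction as the context vector receives most of the probability mass. In a trained model the embeddings and biases are learned parameters rather than hand-set values.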
Mining Idioms from Code [Allamanis and Sutton; FSE 2014]
Mined Idioms (General Java)
• Iterate through the elements of an Iterator
• Creating a logger for a class
• Looping through lines from a BufferedReader
• Defining a String constant
Mined Idioms (Library-Specific)
• Database transaction in neo4j
• Get an HTML Document in jsoup
• Get the distance between two points in Android
• Show a small popup in Android
Model: Tree substitution grammars
Mining API Patterns [Fowkes and Sutton; NIPS WS 2014]
http://arxiv.org/abs/1510.04130
Example pattern: TwitterFactory.getInstance, TwitterFactory.<init>, Twitter.setOAuthConsumer, Twitter.setOAuthAccessToken
API patterns from code
[Table: top API call patterns mined by three methods, one column each: MAPO [Zhong et al, 2009], UP-Miner [Wang et al, 2013], and IIM (actually a slight extension). The patterns combine calls such as TwitterFactory.<init>, TwitterFactory.getInstance, Twitter.setOAuthConsumer, Twitter.setOAuthAccessToken, Twitter.updateStatus, ConfigurationBuilder.<init>, ConfigurationBuilder.setOAuthConsumerKey, ConfigurationBuilder.setOAuthConsumerSecret, ConfigurationBuilder.setDebugEnabled, ConfigurationBuilder.build, Status.getUser, Status.getText, Status.getId, User.getId, User.getScreenName, AccessToken.getToken, and AccessToken.getTokenSecret. The per-column layout was not preserved.]
Model
To sample a transaction $X^{(j)}$, $j \in 1, \dots, m$:
1. For each itemset $S \in \mathcal{I}$, sample $z_S^{(j)} \sim \mathrm{Bernoulli}(\pi_S)$.
2. Deterministically set $X^{(j)} = \bigcup_{S \,:\, z_S^{(j)} = 1} S$.
Parameters:
• $\mathcal{I}$: collection of “interesting” itemsets
• $\pi_S \in [0, 1]$ for each $S \in \mathcal{I}$: probability of occurrence
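The two sampling steps of this generative model (one Bernoulli draw per interesting itemset, then a union of the selected itemsets) can be sketched directly. The class `ItemsetSampler`, the two example itemsets, and the probabilities below are hypothetical and chosen only for illustration. Inference, which recovers the itemsets and probabilities from observed transactions, is the hard part and is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the generative process: one transaction is the union
// of the itemsets whose Bernoulli indicator came up 1.
class ItemsetSampler {
    static Set<String> sampleTransaction(List<Set<String>> itemsets, double[] pi, Random rng) {
        Set<String> transaction = new TreeSet<>();
        for (int s = 0; s < itemsets.size(); s++) {
            // Step 1: z_S ~ Bernoulli(pi_S)
            boolean included = rng.nextDouble() < pi[s];
            // Step 2: X is the union of all itemsets with z_S = 1
            if (included) {
                transaction.addAll(itemsets.get(s));
            }
        }
        return transaction;
    }

    // Two made-up "interesting" itemsets built from API calls seen on the slides.
    static List<Set<String>> toyItemsets() {
        List<Set<String>> itemsets = new ArrayList<>();
        itemsets.add(new TreeSet<>(Arrays.asList("TwitterFactory.<init>", "TwitterFactory.getInstance")));
        itemsets.add(new TreeSet<>(Arrays.asList("Status.getText", "Status.getUser")));
        return itemsets;
    }

    public static void main(String[] args) {
        double[] pi = {0.8, 0.3};  // hypothetical occurrence probabilities
        Random rng = new Random(0);
        for (int j = 0; j < 5; j++) {
            System.out.println(sampleTransaction(toyItemsets(), pi, rng));
        }
    }
}
```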
Stepping Back
• Local conventions (naming, formatting): ngram models
• Mining idioms: probabilistic grammars
• Method naming: word embeddings
• Itemset mining: latent-variable modelling (e.g. TwitterFactory.getInstance, TwitterFactory.<init>, Twitter.setOAuthConsumer, Twitter.setOAuthAccessToken)
Thanks!
• Miltiadis Allamanis
• Jaroslav Fowkes
• Hao Peng
• Chris Bird, MSR
• Earl Barr, UCL
Key concepts in probabilistic modelling
Sufficiency
• what statistics of the data am I memorizing?
Latent variables, e.g.,
• what tree macros were used to generate AST?
• what item sets were used in a transaction?
Why patterns in software?
Orthogonal interfaces: tools that “do one thing well” need to be combined well
Surface-semantic correspondence: semantics available from glancing rather than reading

void addOne(int[] arr) {
    for (int i = 0; i < arr.length; i++) {
        arr[i] += 1;
    }
}

void foo(int[] bar) {
    int baz = 0;
    while (true) {
        bar[baz] = bar[baz] + 1;
        baz = baz + 1;
        if (baz >= bar.length) break;
    }
}

Natural code: Code with good correspondence?
A new type of program analysis
Static analysis
• Construct program abstraction (loses information)
• Why abstract: exact decision is undecidable for Turing-complete languages
• Then logical inference
Statistical analysis
• Construct program abstraction (loses information)
• Why: data sparsity, inductive bias
• Then statistical inference
“Semantic retreat”
• NLP -> statistical NLP
• PL analysis -> statistical PL analysis