Mining Source Code Repositories at Massive Scale using Language Modeling
Miltos Allamanis (m.allamanis@ed.ac.uk), Charles Sutton (csutton@inf.ed.ac.uk)
University of Edinburgh
● Polyglot programmers
● Multitude of APIs & libraries
● Transfer knowledge from available code
Why Language Models?
● Statistical models
● Learn from data
● Abundance of code available online
● Method is not specific to any one programming language [Hindle et al., ICSE 2012]
n-gram Language Models

    public void execute(Runnable task) {
        if (task == null)
            throw new NullPointerException();
        ForkJoinTask<?> job;
        if (task instanceof ForkJoinTask<?>) // avoid re-wrap
            job = (ForkJoinTask<?>) task;
        else
            job = new ForkJoinTask.AdaptedRunnableAction(task);
        externalPush(job);
    }
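An n-gram model treats code such as the method above as a flat sequence of tokens and looks at short windows of n consecutive tokens. A minimal sketch of that first step is given here, assuming a deliberately crude regex tokenizer; `TOKEN_RE`, `tokenize` and `ngrams` are hypothetical names for illustration and not the tooling used for the talk.

```python
import re
from collections import Counter

# Hypothetical, simplified tokenizer: identifiers/keywords, integers,
# a few two-character operators, then any other single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*|\d+|==|!=|<=|>=|&&|\|\||\S")


def tokenize(code):
    """Split a code string into a flat list of tokens."""
    return TOKEN_RE.findall(code)


def ngrams(tokens, n):
    """Return every window of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


snippet = "if (task == null) throw new NullPointerException();"
tokens = tokenize(snippet)
trigram_counts = Counter(ngrams(tokens, 3))
```

Counting such windows over a large corpus is what lets the model estimate how likely each token is given the n-1 tokens before it.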
n-gram Language Models: Predictability Measures
● n-gram Log Probability (NGLP)
● Cross-Entropy (H)
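The formulas for the two measures did not survive in this transcript. The sketch below uses the standard definitions, which are assumed to match the slide: the NGLP of an n-gram is log2 P(token | previous n-1 tokens), and the cross-entropy H of a token sequence is the negated average NGLP, in bits per token. Add-one smoothing and the `<s>` / `<unk>` markers are simplifying assumptions made here for brevity, and `NGramModel` is a hypothetical name.

```python
import math
from collections import Counter


class NGramModel:
    """n-gram model with add-one smoothing (an assumption, chosen for brevity)."""

    def __init__(self, n, training_tokens):
        self.n = n
        self.vocab = set(training_tokens) | {"<s>", "<unk>"}
        padded = ["<s>"] * (n - 1) + list(training_tokens)
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            self.ngram_counts[context + (padded[i],)] += 1
            self.context_counts[context] += 1

    def log2_prob(self, context, token):
        """NGLP of one n-gram: log2 P(token | context)."""
        token = token if token in self.vocab else "<unk>"
        context = tuple(t if t in self.vocab else "<unk>" for t in context)
        numerator = self.ngram_counts[context + (token,)] + 1
        denominator = self.context_counts[context] + len(self.vocab)
        return math.log2(numerator / denominator)

    def nglps(self, tokens):
        """NGLP of every n-gram in a token sequence."""
        padded = ["<s>"] * (self.n - 1) + list(tokens)
        return [self.log2_prob(padded[i - self.n + 1:i], padded[i])
                for i in range(self.n - 1, len(padded))]

    def cross_entropy(self, tokens):
        """H in bits per token: the negated mean NGLP."""
        scores = self.nglps(tokens)
        return -sum(scores) / len(scores)


train = ["if", "(", "x", "==", "null", ")", "return", ";"] * 3
model = NGramModel(3, train)
seen = ["if", "(", "x", "==", "null", ")"]
rare = ["if", "(", "y", "!=", "null", ")"]
```

With this toy training data, `model.cross_entropy(seen)` comes out lower than `model.cross_entropy(rare)`, which is the sense in which lower H means more predictable code.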
The Java GitHub Corpus
● Java projects with >1 fork
● Deduplication through git commit SHAs
● URL: http://groups.inf.ed.ac.uk/cup/javaGithub/
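The two selection rules on this slide can be sketched as a single pass over project metadata. This is only an illustration under stated assumptions: the input format (name, fork count, set of commit SHAs), the function name `select_projects`, the reading of ">1 fork" as a strict threshold, and the rule "drop a project that shares any commit with one already kept" are all hypothetical and may differ from how the corpus was actually built.

```python
def select_projects(projects, min_forks=2):
    """Keep sufficiently forked projects and drop ones sharing commit history.

    projects: iterable of (name, fork_count, set_of_commit_shas) tuples,
    a hypothetical input format used only for this sketch.
    min_forks=2 reads ">1 fork" as "strictly more than one" (an assumption).
    """
    seen_shas = set()
    kept = []
    for name, forks, shas in projects:
        if forks < min_forks:
            continue  # not forked often enough
        if seen_shas & shas:
            continue  # shares at least one commit with a kept project
        kept.append(name)
        seen_shas |= shas
    return kept


# Placeholder commit ids, not real SHAs.
projects = [
    ("orig", 5, {"c1", "c2"}),
    ("fork", 3, {"c1", "c2", "c3"}),
    ("tiny", 0, {"c9"}),
    ("other", 2, {"c7"}),
]
kept = select_projects(projects)
```

Comparing commit SHAs rather than file contents catches forks and copies of the same repository even after they have diverged slightly.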
Language Models of Code
Learning about identifiers
● API calls are predictable
n-gram log probability (NGLP) as a complexity metric
● NGLP is data-driven
● The rarer an n-gram is, the more complex it is considered
Complexity trade-offs: comparison of two code snippets, both from elasticsearch
Identifier Information Metric (IIM)
● Evaluates the domain specificity of code
● The larger the IIM, the more domain-specific the identifiers
● Can be used to evaluate code reusability
IIM = H_full - H_collapsed

File                        IIM    File                        IIM
ContinuationPending.java    5.2    JSSetter.java               1.0
FastDtoa.java               5.0    GeneratedClassLoader.java   1.1
PrivateAccessClass.java     4.7    UintMap.java                1.2
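The difference H_full - H_collapsed can be illustrated on toy data. The sketch below assumes that "collapsed" means every identifier is replaced by one placeholder token before both training and scoring, and it uses a small add-one-smoothed bigram model with a closed vocabulary; these choices, and the names `collapse`, `bigram_cross_entropy` and `iim`, are assumptions for illustration rather than the models behind the numbers on the slide.

```python
import math
from collections import Counter

ID = "<ID>"  # placeholder that replaces every identifier in the collapsed view


def collapse(tokens, identifiers):
    """Replace each identifier token with the shared placeholder."""
    return [ID if t in identifiers else t for t in tokens]


def bigram_cross_entropy(train, test):
    """Bits per token of `test` under an add-one-smoothed bigram model of `train`.

    Simplification: the vocabulary is closed over train and test tokens.
    """
    vocab_size = len(set(train) | set(test) | {"<s>"})
    pair_counts = Counter(zip(["<s>"] + train, train))
    context_counts = Counter(["<s>"] + train[:-1])
    total = 0.0
    for prev, cur in zip(["<s>"] + test, test):
        p = (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)
        total += math.log2(p)
    return -total / len(test)


def iim(train, test, identifiers):
    """IIM = H_full - H_collapsed for one test file."""
    h_full = bigram_cross_entropy(train, test)
    h_collapsed = bigram_cross_entropy(collapse(train, identifiers),
                                       collapse(test, identifiers))
    return h_full - h_collapsed


identifiers = {"i", "n", "dtoaBuffer", "bigExponent"}  # made-up identifier set
train = "int i = 0 ; int n = i ; return n ;".split()
generic = "int n = i ;".split()
specific = "int dtoaBuffer = bigExponent ;".split()
```

Both test snippets collapse to the same token sequence, so they share the same H_collapsed; the snippet with unfamiliar identifiers has the higher H_full and therefore the larger IIM.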
Contributions
● GitHub Java Corpus
● New gigatoken language models
● API calls are predictable
● Data-driven code complexity metrics
● Metric of domain-specificity
Learning about identifiers
● Method and type identifiers are equally hard to predict, irrespective of the amount of data