SLIDE 1

Mining Source Code Repositories at Massive Scale using Language Modeling

Miltos Allamanis, Charles Sutton

m.allamanis@ed.ac.uk csutton@inf.ed.ac.uk

University of Edinburgh

Supported by:

SLIDE 2

Polyglot programmers
Multitude of APIs & libraries
Transfer knowledge from available code

SLIDE 3

Why Language Models?

Statistical models
Learn from data
Abundance of code available online
Non-language-specific method

[Hindle et al., ICSE 2012]

SLIDE 4

n-gram Language Models

public void execute(Runnable task) {
    if (task == null)
        throw new NullPointerException();
    ForkJoinTask<?> job;
    if (task instanceof ForkJoinTask<?>) // avoid re-wrap
        job = (ForkJoinTask<?>) task;
    else
        job = new ForkJoinTask.AdaptedRunnableAction(task);
    externalPush(job);
}
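As a rough illustration of what an n-gram model does with a method like this one, the sketch below counts trigrams over a hand-tokenised prefix of execute() and estimates P(token | two preceding tokens) from raw counts. The function names, the token list, and the use of unsmoothed maximum-likelihood estimates are assumptions made for illustration; the slides do not say which tokeniser or smoothing the authors used.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def conditional_probability(tokens, context, token):
    """Maximum-likelihood estimate of P(token | context) from raw counts.

    No smoothing is applied (an assumption made to keep the sketch short),
    so unseen contexts simply get probability 0.
    """
    context = tuple(context)
    counts = Counter(ngrams(tokens, len(context) + 1))
    context_count = sum(c for gram, c in counts.items() if gram[:-1] == context)
    if context_count == 0:
        return 0.0
    return counts[context + (token,)] / context_count


# Hand-tokenised prefix of the execute() method (hypothetical tokenisation).
tokens = ["if", "(", "task", "==", "null", ")", "throw", "new",
          "NullPointerException", "(", ")", ";", "if", "(", "task",
          "instanceof", "ForkJoinTask", "<", "?", ">", ")"]

print(conditional_probability(tokens, ("if", "("), "task"))   # 1.0
print(conditional_probability(tokens, ("(", "task"), "=="))   # 0.5
```

In this toy sample "task" always follows "if (", while "( task" is followed by "==" half of the time and by "instanceof" the other half.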

SLIDE 8

n-gram Language Models

Predictability Measures
n-gram Log Probability (NGLP)
Cross-Entropy (H)
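The slide names the two measures without defining them. Under the usual definitions for an n-gram model over a token sequence t_1 ... t_N, and assuming base-2 logarithms (the base is not stated here), they can be written as:

```latex
\mathrm{NGLP}(t_1 \dots t_N) = \sum_{i=1}^{N} \log_2 P\left(t_i \mid t_{i-n+1} \dots t_{i-1}\right),
\qquad
H(t_1 \dots t_N) = -\frac{1}{N}\,\mathrm{NGLP}(t_1 \dots t_N)
```

Cross-entropy is then the average number of bits the model needs per token, so lower values mean more predictable code.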

SLIDE 9

The Java GitHub Corpus

Java projects with >1 fork
Deduplication through git commit SHAs
URL: http://groups.inf.ed.ac.uk/cup/javaGithub/
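The two selection rules could be sketched roughly as below. The record layout (`name`, `forks`, `commit_shas`), the `min_forks` parameter, and the choice to keep the most-forked copy among projects that share history are all assumptions for illustration; the slide only states that projects with more than one fork were used and that duplicates were detected through commit SHAs.

```python
def select_projects(projects, min_forks=2):
    """Keep sufficiently forked projects and drop those sharing commit history.

    projects: list of dicts with hypothetical keys "name", "forks" and
    "commit_shas". Two projects are treated as duplicates if they have at
    least one commit SHA in common; the most-forked one is kept (assumption).
    """
    kept = []
    seen_shas = set()
    for project in sorted(projects, key=lambda p: p["forks"], reverse=True):
        if project["forks"] < min_forks:
            continue  # ">1 fork" read literally as at least two forks
        shas = set(project["commit_shas"])
        if shas & seen_shas:
            continue  # shares history with a project that was already kept
        seen_shas |= shas
        kept.append(project["name"])
    return kept


# Made-up example records.
sample = [
    {"name": "A", "forks": 5, "commit_shas": ["a1", "a2"]},
    {"name": "A-copy", "forks": 2, "commit_shas": ["a1", "a2", "a3"]},
    {"name": "B", "forks": 1, "commit_shas": ["b1"]},
    {"name": "C", "forks": 3, "commit_shas": ["c1"]},
]
print(select_projects(sample))  # ['A', 'C']
```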

SLIDE 10

Language Models of Code

SLIDE 11

Learning about identifiers

SLIDE 12

Learning about identifiers

API calls are predictable

SLIDE 13

n-gram log probability (NGLP) as a complexity metric

NGLP is Data-Driven

An n-gram is more complex if it is rarer
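A minimal sketch of the idea: train a model on a corpus, then score snippets by their NGLP, so that token sequences rarely seen in the corpus receive a lower score and count as more complex. The bigram order, add-one smoothing, toy corpus, and function names are assumptions chosen to keep the example small; they are not taken from the slides.

```python
import math
from collections import Counter


def train_bigram(corpus_tokens):
    """Return log2 P(token | prev) for an add-one-smoothed bigram model."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    vocab_size = len(unigrams)

    def log_prob(prev, token):
        # Add-one smoothing (assumption) keeps unseen bigrams above zero.
        return math.log2((bigrams[(prev, token)] + 1) / (unigrams[prev] + vocab_size))

    return log_prob


def nglp(tokens, log_prob):
    """Sum of log-probabilities of each token given its predecessor."""
    return sum(log_prob(prev, tok) for prev, tok in zip(tokens, tokens[1:]))


# Toy training corpus: the same loop header repeated three times.
loop = "for ( int i = 0 ; i < n ; i ++ )".split()
model = train_bigram(loop * 3)

common = ["for", "(", "int"]   # ordering seen in the corpus
rare = ["int", "(", "for"]     # same tokens, ordering never seen

print(nglp(common, model) > nglp(rare, model))  # True
```

Because the score comes entirely from corpus counts, retraining on a different corpus changes which snippets are rated as complex.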

SLIDE 14

Complexity trade-offs

from elasticsearch

SLIDE 15

vs

from elasticsearch

SLIDE 16

Identifier Information Metric (IIM)

IIM = H_full - H_collapsed

Evaluate domain specificity of code
Larger IIM, more domain-specific identifiers
Use to evaluate code reusability

File                          IIM
ContinuationPending.java      5.2
FastDtoa.java                 5.0
PrivateAccessClass.java       4.7
JSSetter.java                 1.0
GeneratedClassLoader.java     1.1
UintMap.java                  1.2
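The difference of cross-entropies can be sketched as follows, reading "collapsed" as replacing every identifier with a single placeholder token before training and scoring. The placeholder name, the small keyword subset, the regex used to spot identifiers, and the add-one-smoothed bigram model are all assumptions for illustration, not the authors' implementation.

```python
import math
import re
from collections import Counter

# Illustrative subset of Java keywords; a real tokeniser would know them all.
JAVA_KEYWORDS = {"int", "return", "if", "else", "for", "while", "new",
                 "public", "void", "class"}
IDENTIFIER = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")


def collapse(tokens):
    """Replace every identifier with the placeholder <ID> (hypothetical name)."""
    return ["<ID>" if IDENTIFIER.fullmatch(t) and t not in JAVA_KEYWORDS else t
            for t in tokens]


def cross_entropy(train, test):
    """Bits per token of `test` under an add-one-smoothed bigram model of `train`."""
    unigrams = Counter(train)
    bigrams = Counter(zip(train, train[1:]))
    vocab_size = len(set(train) | set(test))
    total = 0.0
    for prev, tok in zip(test, test[1:]):
        total += math.log2((bigrams[(prev, tok)] + 1) / (unigrams[prev] + vocab_size))
    return -total / (len(test) - 1)


def iim(train, test):
    """H_full - H_collapsed for `test`, with both models trained on `train`."""
    return cross_entropy(train, test) - cross_entropy(collapse(train), collapse(test))


train = "int count = 0 ; int total = 0 ;".split()
generic = "int count = 0 ;".split()          # identifier seen in training
domain = "int dtoaExponent = 0 ;".split()    # made-up, unseen identifier

print(iim(train, domain) > iim(train, generic))  # True
```

The snippet with the unseen identifier loses more predictability when identifiers are kept, so it gets the larger IIM, which matches the slide's reading of a large IIM as a sign of domain-specific naming.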

SLIDE 17

Contributions

GitHub Java Corpus
New gigatoken language models
API calls are predictable
Data-driven code complexity metrics
Metric of domain-specificity

SLIDE 19

n-gram Language Models

SLIDE 20

Language Models - Metrics

Log Probability (NGLP)
Cross Entropy (H)

SLIDE 21

Learning about identifiers

SLIDE 22

Learning about identifiers

Method and type identifiers are equally hard to predict, irrespective of the amount of data.