Analyzing 750 billion events and 46 TB of code: What you can learn from GitHub's shared data on BigQuery



slide-1
SLIDE 1

Analyzing 750 billion events and 46 TB of code

What you can learn from GitHub's shared data on BigQuery

Felipe Hoffa, Developer Advocate, @felipehoffa

slide-2
SLIDE 2 @felipehoffa
slide-3
SLIDE 3 @felipehoffa
slide-4
SLIDE 4 @felipehoffa
slide-5
SLIDE 5 @felipehoffa
slide-6
SLIDE 6 @felipehoffa
slide-7
SLIDE 7 @felipehoffa

DATA

slide-8
SLIDE 8

Who wants to analyze GitHub?

slide-9
SLIDE 9

Project maintainers

  • Popularity
      • Who and how?
  • Change management:
      • New APIs?
      • Breaking changes?
  • Is my project healthy?
      • Issues closed on time?
      • Community participation?
slide-10
SLIDE 10

Project users

  • What other projects to follow?
  • Requesting features
      • Data-based requests
      • Effective phrasing
slide-11
SLIDE 11

Project choosers

  • Is this project popular?
  • Is this project healthy?
  • Is this project well adopted?
  • Related projects?
slide-12
SLIDE 12

Data lovers

  • Data integrators
  • You
  • Me :)
slide-13
SLIDE 13

3 main datasets:

  • GitHub Archive
      • 8.7 billion events
      • Hourly updates
  • GHTorrent
      • These events annotated
      • Real-time updates
  • GitHub repos on BigQuery
      • 46 TB of code
slide-14
SLIDE 14 @felipehoffa

Google BigQuery

slide-15
SLIDE 15

Google BigQuery

  • Fast: terabytes in seconds
  • Simple: SQL
  • Scalable: from bytes to petabytes
  • No CAPEX: always on
  • Interoperable: Tableau, R, Python...
  • Instant sharing
  • Free monthly quota
slide-16
SLIDE 16

Top projects by stars 2016?

slide-17
SLIDE 17 @felipehoffa
slide-18
SLIDE 18

Really?

slide-19
SLIDE 19 @felipehoffa
slide-20
SLIDE 20

I got stars! What else did they star?

slide-21
SLIDE 21 @felipehoffa
slide-22
SLIDE 22

How did they find me?

Hacker News?

slide-23
SLIDE 23 @felipehoffa
slide-24
SLIDE 24 @felipehoffa
slide-25
SLIDE 25

Project health

  • Projects with most issues
  • Projects with most people filing issues
  • Projects with most engagement
  • Best projects at closing issues
  • Best phrasing for issue closing
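
As a rough illustration of the first few metrics above, here is a minimal Python sketch over hand-made issue events. The field names (repo, actor, action, issue) and the sample data are assumptions made for this example; they are not the GitHub Archive schema, and the talk's own queries ran in BigQuery rather than in Python.

```python
from collections import defaultdict

# Hypothetical, hand-made issue events; the field names are illustrative
# and are not the real GitHub Archive schema.
EVENTS = [
    {"repo": "a/one", "actor": "ann", "action": "opened", "issue": 1},
    {"repo": "a/one", "actor": "bob", "action": "opened", "issue": 2},
    {"repo": "a/one", "actor": "ann", "action": "closed", "issue": 1},
    {"repo": "b/two", "actor": "cat", "action": "opened", "issue": 1},
    {"repo": "b/two", "actor": "cat", "action": "opened", "issue": 2},
    {"repo": "b/two", "actor": "cat", "action": "closed", "issue": 1},
    {"repo": "b/two", "actor": "dan", "action": "closed", "issue": 2},
]


def issue_health(events):
    """Return {repo: (issues opened, distinct filers, share of opened issues closed)}."""
    opened = defaultdict(set)
    closed = defaultdict(set)
    filers = defaultdict(set)
    for e in events:
        if e["action"] == "opened":
            opened[e["repo"]].add(e["issue"])
            filers[e["repo"]].add(e["actor"])
        elif e["action"] == "closed":
            closed[e["repo"]].add(e["issue"])
    report = {}
    for repo, issues in opened.items():
        # Only count closes of issues that were opened inside the sample.
        closed_share = len(issues & closed[repo]) / len(issues)
        report[repo] = (len(issues), len(filers[repo]), closed_share)
    return report


HEALTH = issue_health(EVENTS)
```

Sorting HEALTH by any of the three tuple fields gives a toy version of "most issues", "most people filing issues", and "best at closing issues".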
slide-26
SLIDE 26 @felipehoffa
slide-27
SLIDE 27 @felipehoffa
slide-28
SLIDE 28

Even text analysis?

slide-29
SLIDE 29 @felipehoffa
slide-30
SLIDE 30

So where's the code?

slide-31
SLIDE 31 @felipehoffa
slide-32
SLIDE 32 @felipehoffa
slide-33
SLIDE 33 @felipehoffa
slide-34
SLIDE 34 @felipehoffa
slide-35
SLIDE 35 @felipehoffa

Rules to analyze [bigquery-public-data:github_repos.contents]

  • Text files <1MB
  • One copy of each unique file
  • JOIN with [github_repos.files] for paths
  • Don't JOIN with [github_repos.files] to get contents*path.
  • Extract first, analyze later
  • [github_repos.sample_contents]
      • >10% of contents, top projects, 1 sample path.
  • Only open source projects - https://developer.github.com/v3/licenses/
  • Some projects missing - why?
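
The "one copy of each unique file" and "extract first, analyze later" rules can be pictured with a small Python sketch. The two toy tables below stand in for the files and contents tables; the column names (id, repo, path) and the data are assumptions for illustration, not the dataset's documented schema.

```python
# Toy stand-ins for the two tables; the column names are assumptions
# made for this example.
FILES = [
    {"repo": "a/one", "path": "src/Main.java", "id": "f1"},
    {"repo": "b/two", "path": "lib/Main.java", "id": "f1"},  # same content, second repo
    {"repo": "b/two", "path": "README.md", "id": "f2"},
]
CONTENTS = {
    "f1": "import java.util.List;\nclass Main {}\n",
    "f2": "# readme\n",
}


def extract(files, contents, suffix):
    """Step 1: keep one copy of each unique file whose path ends with suffix."""
    wanted_ids = {f["id"] for f in files if f["path"].endswith(suffix)}
    return {file_id: contents[file_id] for file_id in wanted_ids if file_id in contents}


def analyze(extracted):
    """Step 2: run the analysis on the small extract (here: count lines)."""
    return {file_id: len(text.splitlines()) for file_id, text in extracted.items()}


JAVA = extract(FILES, CONTENTS, ".java")
LINE_COUNTS = analyze(JAVA)
```

Paths are only used to pick file ids, so content shared by two repositories is read and analyzed once rather than once per path.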
slide-36
SLIDE 36

Top java imports growth 2013-16
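
The query behind this slide did not survive in the text. As a hedged sketch of the general idea, the Python below counts Java import statements with a regular expression; the pattern, the function name top_imports, and the sample sources are all assumptions for illustration.

```python
import re
from collections import Counter

# Matches lines such as "import java.util.List;" or
# "import static org.junit.Assert.*;" and captures the imported name.
IMPORT_RE = re.compile(
    r"^\s*import\s+(?:static\s+)?([\w.]+(?:\.\*)?)\s*;", re.MULTILINE
)


def top_imports(sources, n=3):
    """Count each import once per file and return the n most common."""
    counts = Counter()
    for text in sources:
        counts.update(set(IMPORT_RE.findall(text)))
    return counts.most_common(n)


# Hand-made sample files, not taken from the dataset.
SOURCES = [
    "package a;\nimport java.util.List;\nimport java.util.Map;\nclass A {}\n",
    "import java.util.List;\nimport java.util.List;\nclass B {}\n",
    "import static org.junit.Assert.*;\nimport java.util.List;\nclass C {}\n",
]

TOP = top_imports(SOURCES, n=1)
```

Running the same count separately on files from different years would give the growth comparison the slide title describes.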

slide-37
SLIDE 37 @felipehoffa
slide-38
SLIDE 38

Requesting a feature for Go

slide-39
SLIDE 39 @felipehoffa
slide-40
SLIDE 40 @felipehoffa
slide-41
SLIDE 41

Beyond regex

Static code analysis with UDFs

slide-42
SLIDE 42 @felipehoffa
slide-43
SLIDE 43 @felipehoffa
slide-44
SLIDE 44 @felipehoffa
slide-45
SLIDE 45 @felipehoffa
slide-46
SLIDE 46 @felipehoffa

Spaces vs Tabs - GitHub on BigQuery edition

The rules:
  • Data source: GitHub files stored in BigQuery.
  • Stars matter: We’ll only consider the top 400,000 repositories, ranked by the number of stars they got on GitHub during the period Jan-May 2016.
  • No small files: Files need to have at least 10 lines that start with a space or a tab.
  • No duplicates: Duplicate files only have one vote, regardless of how many repos they live in.
  • One vote per file: Some files use a mix of spaces and tabs. We’ll count each file on the side of whichever method it uses more.
  • Top languages: We’ll look into files with the extensions (.java, .h, .js, .c, .php, .html, .cs, .json, .py, .cpp, .xml, .rb, .cc, .go).
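
The rules above can be sketched in a few lines of Python. This is only an illustration on hand-made files, not the BigQuery query from the talk: the star-based repository filter is left out, the function names are made up, and treating an exact tie as "no vote" is an assumption the rules do not cover.

```python
import os
from collections import Counter, defaultdict

EXTENSIONS = {".java", ".h", ".js", ".c", ".php", ".html", ".cs", ".json",
              ".py", ".cpp", ".xml", ".rb", ".cc", ".go"}
MIN_INDENTED_LINES = 10


def vote(text):
    """Return 'spaces', 'tabs', or None when the file casts no vote."""
    lines = text.splitlines()
    spaces = sum(1 for line in lines if line.startswith(" "))
    tabs = sum(1 for line in lines if line.startswith("\t"))
    # No small files; an exact tie casts no vote (an assumption).
    if spaces + tabs < MIN_INDENTED_LINES or spaces == tabs:
        return None
    return "spaces" if spaces > tabs else "tabs"


def tally(files):
    """files: iterable of (path, text). Returns {extension: Counter of votes}."""
    seen = set()
    result = defaultdict(Counter)
    for path, text in files:
        ext = os.path.splitext(path)[1].lower()
        if ext not in EXTENSIONS or text in seen:
            continue  # wrong language, or a duplicate that already voted
        seen.add(text)
        choice = vote(text)
        if choice:
            result[ext][choice] += 1
    return result


# Hand-made sample files.
SPACES_FILE = "def f():\n" + "    pass\n" * 12
TABS_FILE = "func main() {\n" + "\tx()\n" * 11 + "}\n"
FILES = [
    ("a/x.py", SPACES_FILE),
    ("b/copy.py", SPACES_FILE),   # duplicate content: no second vote
    ("c/main.go", TABS_FILE),
    ("d/small.py", "  x\n"),      # fewer than 10 indented lines
    ("e/notes.txt", SPACES_FILE), # extension not in the list
]
RESULT = tally(FILES)
```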
slide-47
SLIDE 47 @felipehoffa

Spaces vs Tabs - Extract

slide-48
SLIDE 48 @felipehoffa

Spaces vs Tabs - Apply the rules

slide-49
SLIDE 49 @felipehoffa

Spaces vs Tabs - Results

slide-50
SLIDE 50

Who wants to analyze GitHub?

  • Project maintainers
  • Project users
  • Project choosers
  • Data lovers
  • YOU!

slide-51
SLIDE 51

GitHub

slide-52
SLIDE 52 @felipehoffa

Way more:

slide-53
SLIDE 53 @felipehoffa
slide-54
SLIDE 54

Questions?

  • News: reddit.com/r/bigquery
  • Ask: stackoverflow.com
  • Felipe Hoffa, @felipehoffa

Rate me?

bit.ly/bqfeedback

slide-55
SLIDE 55 @felipehoffa
slide-56
SLIDE 56 @felipehoffa
slide-57
SLIDE 57 @felipehoffa
slide-58
SLIDE 58 @felipehoffa

2016 top imports vs 2010 top imports
slide-59
SLIDE 59 @felipehoffa
slide-60
SLIDE 60 @felipehoffa