a functional data scientist Richard Minerich, Director of R&D - - PowerPoint PPT Presentation

SLIDE 1

A day in the life of a functional data scientist

Richard Minerich, Director of R&D at Bayard Rock @Rickasaurus

SLIDE 2
SLIDE 3

Projecting onto a 2D Plane

SLIDE 4

The Pairwise Entity Resolution Process

Blocking

  • Two Datasets (Customer Data and Sanctions)
  • Pairs of Somehow Similar Records

Scoring

  • Pairs of Records
  • Probability of Representing Same Entity

Review

  • Records, Probability, Similarity Features
  • True/False Labels (Mostly by Hand)
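
The three stages above can be sketched in a few lines of F#. This is a minimal illustration, not the pipeline from the talk: the `Person` record, the first-letter blocking key, and the "mean of the features" score are all assumptions chosen to keep the example small, and the sample records are made up.

```fsharp
// Hypothetical record shape; the field names are assumptions for illustration.
type Person = { Id: int; Name: string; Dob: string }

// Blocking key (an assumed heuristic): upper-cased first letter of the name.
let blockKey (p: Person) =
    if System.String.IsNullOrWhiteSpace(p.Name) then '?'
    else System.Char.ToUpperInvariant(p.Name.Trim().[0])

// Blocking: two datasets in, pairs of somehow-similar records out.
let block (customers: Person list) (sanctions: Person list) =
    let byKey = sanctions |> List.groupBy blockKey |> Map.ofList
    [ for c in customers do
        match Map.tryFind (blockKey c) byKey with
        | Some candidates ->
            for s in candidates do
                yield (c, s)
        | None -> () ]

// 1.0 where two values agree, else 0.0.
let agree a b = if a = b then 1.0 else 0.0

// Similarity features for one pair of records.
let features (c: Person, s: Person) =
    [ agree (c.Name.ToUpperInvariant()) (s.Name.ToUpperInvariant())
      agree c.Dob s.Dob ]

// Scoring stand-in: the mean of the features, not a calibrated probability.
let score (pair: Person * Person) = features pair |> List.average

// Review: order pairs so a human labels the most likely matches first.
let reviewQueue customers sanctions =
    block customers sanctions
    |> List.map (fun pair -> pair, features pair, score pair)
    |> List.sortByDescending (fun (_, _, s) -> s)

// Made-up example data.
let customers =
    [ { Id = 1; Name = "Larry O Brian"; Dob = "1979-01-01" }
      { Id = 2; Name = "Maria Lopez"; Dob = "1980-10-24" } ]
let sanctions =
    [ { Id = 10; Name = "LARRY O BRIAN"; Dob = "1979-01-01" }
      { Id = 11; Name = "Luis Ortega"; Dob = "1970-05-05" } ]
let queue = reviewQueue customers sanctions
```

Blocking keeps the number of candidate pairs far below the full cross product of the two datasets, which is why it comes before any scoring.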
SLIDE 5

Blocking

SLIDE 6

Scoring: Risk vs Probability (The Ideal)

Axis labels: Likely to Launder Money (risk); Probably the Same Person (probability)

SLIDE 7

The Reality (Dominated by Garbage)

Plot annotations: Tiny Bump; Upper Threshold. Values shown: 937; 161; 161,358

SLIDE 8

Let’s dig into a single point

Jimmy Cournoyer El: 95/ SI:16

SLIDE 9
SLIDE 10

Citation Network (Safe View)

SLIDE 11

Relationship Network (Safe View)

SLIDE 12

Flow of Drugs

Diagram labels:

  • Places: British Columbia, Quebec, New York/NYC, California
  • Organizations: Rizzuto Crime Family, Bonanno Crime Family, Hells Angels, Sinaloa Cartel
  • People: Jimmy “Cosmo” “Superman” Cournoyer, John “Big Man” Venizelos, El Chapo
  • Annotation: Reinvested in Cocaine

SLIDE 13

Jorge Hank Rhon

Family & Friends

Diagram labels: $100s of Millions; Citibank, CH; Brother Murdered

SLIDE 14

Bar chart: % Time Spent (axis ticks from 0.1 to 0.6) for three categories: Munging Data; Redoing Work / Investigating Problems; Fun Algorithms

SLIDE 15

Disgustingly Bad but Fairly Large Datasets

  • Both Wide (many fields) and Tall (many records)
  • From different systems (different encodings)
  • Missing data
  • Poorly merged data
  • Extra data
  • Non-unique IDs

Every client is awful in a completely different way.

Example record:

NAME      LARRY O BRIAN
STATE     CANADA
CITY      121 Buffalo Drive, Montreal, Quebec H3G 1Z2
ADDRESS   NULL
ZIP       00000
DOB       10/24/80; 1/1/1979
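
A record like this one shows why cleaning has to be programmable. As a rough sketch (not SAM's actual cleaning code), the following F# treats a few placeholder values as missing and splits a multi-valued date field. The placeholder list and the two date formats are assumptions based only on the sample record above.

```fsharp
open System
open System.Globalization

// Placeholder values treated as missing (an assumed list, based on the sample record).
let placeholders = set [ ""; "NULL"; "00000" ]

// Turn a raw field into Some cleaned value, or None when it is missing or a placeholder.
let cleanField (raw: string) =
    let v = if isNull raw then "" else raw.Trim()
    if Set.contains (v.ToUpperInvariant()) placeholders then None else Some v

// Split a multi-valued date field and keep every value that parses.
let parseDates (raw: string) =
    let formats = [| "M/d/yyyy"; "M/d/yy" |]
    raw.Split(';')
    |> Array.map (fun s -> s.Trim())
    |> Array.choose (fun s ->
        match DateTime.TryParseExact(s, formats, CultureInfo.InvariantCulture, DateTimeStyles.None) with
        | true, d -> Some d
        | _ -> None)
    |> Array.toList

// Fields taken from the sample record.
let address = cleanField "NULL"
let zip = cleanField "00000"
let dobs = parseDates "10/24/80; 1/1/1979"
```

Keeping both dates of birth, rather than picking one, leaves the decision to a later scoring step. Two-digit years are resolved by the culture's calendar settings, so a real pipeline would want an explicit rule for them.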

SLIDE 16

SAM – Building for Bad Data

  • Lazy Pure Functional Core
  • Programmable Data Cleaning
  • Programmable ETL
  • Ad-Hoc Behaviors

All with an F# Core and Barb for scripting.

Architecture diagram: UI (C#) & Analysis (C#); Glue (F# and Barb); Algorithms (F#); Data & Config In; Data Out

SLIDE 17

Other Kinds of Problems (sometimes even my fault)

  • Extra / Missing Data (e.g. incorrect subset or incorrect joins)
  • Wrong version of data (e.g. bad sync in SQL)
  • Bad configuration of dependencies

The data lives in a locked-down environment, so feedback cycles are slow.

Lesson: Be Paranoid

SLIDE 18

F# Tools From Bayard Rock http://github.com/BayardRock

Tokens                    Classification
Pegasus Airlines          ORGANIZATION
Istanbul                  LOCATION
Sochi                     LOCATION
Russia                    LOCATION
Turkey                    LOCATION
Transportation Ministry   ORGANIZATION

SLIDE 19

FSharpWebIntellisense

https://github.com/BayardRock/FSharpWebIntellisense

SLIDE 20

iFSharp Notebook

https://github.com/BayardRock/IfSharp

SLIDE 21

Barb, a simple .NET record query language

Name.Contains "John" and (Age > 20 or Weight > 200)

https://github.com/Rickasaurus/Barb

SLIDE 22

MITIE Dot Net (a wrapper for MIT’s MITIE)

https://github.com/BayardRock/MITIE-Dot-Net

A Pegasus Airlines plane landed at an Istanbul airport Friday after a passenger "said that there was a bomb on board" and wanted the plane to land in Sochi, Russia, the site of the Winter Olympics, said officials with Turkey's Transportation Ministry.

Tokens                    Classification
Pegasus Airlines          ORGANIZATION
Istanbul                  LOCATION
Sochi                     LOCATION
Russia                    LOCATION
Turkey                    LOCATION
Transportation Ministry   ORGANIZATION

SLIDE 23

Other F# Community Tools (Not by Us)

  • Data Type Providers (SQL, OData, CSV, etc.)
  • Language Type Providers (R, Matlab, Python soon)
  • Deedle (like Pandas but for F#)
  • F# Charting

SLIDE 24

The Magic of Type Providers

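open Microsoft.FSharp.Data.TypeProviders

type Netflix = ODataService<"http://odata.netflix.com">

// A data context is needed before the service can be queried.
let netflix = Netflix.GetDataContext()

// Up to 100 titles whose name contains "Avatar", sorted by name.
let avatarTitles =
    query { for t in netflix.Titles do
            where (t.Name.Contains "Avatar")
            sortBy t.Name
            take 100 }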

SLIDE 25

How it works!

Diagram labels: The World!; Type Provider; Types; Compiler; Erased Types

Type Providers! Libraries For Free!

SLIDE 26

Deedle (Like Python’s pandas but for F#)

  • Designed with Data Type Providers in Mind
  • Interops with the R Type Provider

SLIDE 27

But what about algorithmic code?

SLIDE 28

Ranking vs Regression

  • Regression: you’re trying to guess a number; only distance matters. It may do a very bad job at ordering.
  • Ranking: you’re trying to figure out some order; only order matters. It may do a very bad job at providing a meaningful number.

Example: You’re a doctor with 20 spots open and 100 patients who want to see you today. Which method would be best for selecting 20?
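
The difference can be made concrete with two metrics. The sketch below is illustrative only: `mse` measures distance, `pairwiseAccuracy` measures order, and the labels and predictions are made-up numbers. One set of predictions is numerically close but swaps neighbours; the other is far off in value but perfectly ordered.

```fsharp
// Mean squared error: only distance to the label matters.
let mse (ys: float list) (ps: float list) =
    List.map2 (fun y p -> (y - p) ** 2.0) ys ps |> List.average

// Fraction of pairs with different labels that the predictions put in the right order.
let pairwiseAccuracy (ys: float list) (ps: float list) =
    let items = List.zip ys ps
    let pairs =
        [ for (i, (y1, p1)) in List.indexed items do
            for (j, (y2, p2)) in List.indexed items do
                if i < j && y1 <> y2 then
                    yield (sign (y1 - y2) = sign (p1 - p2)) ]
    let correct = pairs |> List.filter id |> List.length
    float correct / float (List.length pairs)

// Made-up labels and two made-up sets of predictions.
let labels = [ 1.0; 2.0; 3.0; 4.0 ]
let closeButMisordered = [ 1.6; 1.5; 3.6; 3.5 ]
let farButOrdered = [ 10.0; 20.0; 30.0; 40.0 ]

let mseClose = mse labels closeButMisordered
let mseFar = mse labels farButOrdered
let orderClose = pairwiseAccuracy labels closeButMisordered
let orderFar = pairwiseAccuracy labels farButOrdered
```

For picking 20 of 100 patients, only the order of the scores matters, so the second set of predictions would be the more useful one despite its much larger squared error.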

SLIDE 29

Regression

$y = X\beta + \varepsilon$

$y$ is labels
$X$ is features
$\beta$ is weights
$\varepsilon$ is errors

SLIDE 30

“OLS” Regression via Gradient Descent in F#
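
The code from the original slide is not reproduced in this transcript. As a stand-in, here is a minimal sketch of batch gradient descent on mean squared error in F#. The function names, the learning rate, the iteration count, and the toy data are all assumptions, not the speaker's code.

```fsharp
// Dot product of two equal-length vectors.
let dot (a: float[]) (b: float[]) =
    Array.fold2 (fun acc x y -> acc + x * y) 0.0 a b

// Gradient of the mean squared error (1/n) * sum (x_i . w - y_i)^2 with respect to w.
let gradient (xs: float[][]) (ys: float[]) (w: float[]) =
    let n = float xs.Length
    let addSample (g: float[]) (x: float[], y: float) =
        let err = dot x w - y
        Array.map2 (fun gj xj -> gj + 2.0 * err * xj / n) g x
    Array.zip xs ys |> Array.fold addSample (Array.zeroCreate (w.Length))

// One descent step: move the weights against the gradient.
let step (rate: float) (xs: float[][]) (ys: float[]) (w: float[]) =
    Array.map2 (fun wj gj -> wj - rate * gj) w (gradient xs ys w)

// Run a fixed number of steps starting from all-zero weights.
let fit (rate: float) (iterations: int) (xs: float[][]) (ys: float[]) =
    let w0 : float[] = Array.zeroCreate (xs.[0].Length)
    Seq.fold (fun w _ -> step rate xs ys w) w0 (seq { 1 .. iterations })

// Toy data generated from y = 1 + 2x; the first feature is a constant 1 for the intercept.
let xs = [| [| 1.0; 0.0 |]; [| 1.0; 1.0 |]; [| 1.0; 2.0 |]; [| 1.0; 3.0 |] |]
let ys = [| 1.0; 3.0; 5.0; 7.0 |]
let weights = fit 0.05 5000 xs ys
```

On this noise-free toy data the weights approach the intercept 1 and slope 2 used to generate it. The learning rate has to be small enough for the data at hand; too large a value makes the updates diverge.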

SLIDE 31

Simple Ranking? You Can Use Regression.

  • The features are the difference in would-be regression features
  • The value to predict is the difference in rank

Select 2 labeled samples randomly => (x1, y1), (x2, y2)

x = x1 - x2
y = y1 - y2

               Sample 1   Sample 2   Result
Names?         1          1
Addresses?     1          1
DOB?                      1          -1
Same Person?

SLIDE 32

Simple Ranking in F#
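
The code from the original slide is not reproduced in this transcript. The sketch below shows one way the pairwise-difference idea could look in F#: pick two labeled samples at random, subtract features and labels, and take a stochastic gradient step on the squared error. The names, the seed, the learning rate, and the toy data are assumptions; labels are treated as "higher means ranked first".

```fsharp
// Dot product of two equal-length vectors.
let dot (a: float[]) (b: float[]) =
    Array.fold2 (fun acc x y -> acc + x * y) 0.0 a b

// Learn ranking weights by regressing on differences between random pairs.
let trainRanker (seed: int) (rate: float) (steps: int) (samples: (float[] * float)[]) =
    let rng = System.Random(seed)
    let pick () = samples.[rng.Next(samples.Length)]
    let update (w: float[]) =
        let (x1, y1) = pick ()
        let (x2, y2) = pick ()
        let x = Array.map2 (-) x1 x2   // x = x1 - x2
        let y = y1 - y2                // y = y1 - y2
        let err = dot x w - y
        Array.map2 (fun wj xj -> wj - rate * 2.0 * err * xj) w x
    let w0 : float[] = Array.zeroCreate ((fst samples.[0]).Length)
    Seq.fold (fun w _ -> update w) w0 (seq { 1 .. steps })

// Toy samples: (features, rank label), where the label is 3 * f0 + 1 * f1.
let samples =
    [| ([| 1.0; 0.0 |], 3.0)
       ([| 0.0; 1.0 |], 1.0)
       ([| 1.0; 1.0 |], 4.0)
       ([| 0.0; 0.0 |], 0.0)
       ([| 2.0; 1.0 |], 7.0) |]

let w = trainRanker 42 0.05 5000 samples

// Sort the samples by learned score and read off their labels.
let ranked =
    samples
    |> Array.sortByDescending (fun (x, _) -> dot x w)
    |> Array.map snd
```

Because only differences are used, any constant offset in the labels cancels out: the learned score orders items but its absolute value carries no meaning.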

SLIDE 33

Combined Ranking and Regression – D. Sculley

You can improve your regression with ranking, and your ranking with regression. The best of both worlds!
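
One way to write the combined objective, following the general form in Sculley's paper as best recalled here, is a weighted sum of a regression loss and a pairwise ranking loss. The symbols are assumptions for this sketch: $D$ is the set of labeled examples, $P$ the set of candidate pairs built from them, $\alpha \in [0, 1]$ the trade-off, and $\lambda$ the regularization strength.

```latex
\min_{w} \; \alpha \, L(w, D) \; + \; (1 - \alpha) \, L(w, P) \; + \; \frac{\lambda}{2} \lVert w \rVert_2^2
```

Setting $\alpha = 1$ recovers plain regression and $\alpha = 0$ recovers pure pairwise ranking. In a stochastic setting this amounts to taking a regression step with probability $\alpha$ and a pairwise-difference step otherwise.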

SLIDE 34

Combined Ranking and Regression – D. Sculley @ Google, Inc
SLIDE 35

Thank You!

Check out the NYC F# User Group: http://www.meetup.com/nyc-fsharp
Find out more about F#: http://fsharp.org
Contact me on Twitter: @Rickasaurus