Inferring the Purposes of Network Traffic in Mobile Apps

SLIDE 1

Inferring the Purposes of Network Traffic in Mobile Apps

Who (which app) sends the data?
Where is the data being sent?
What data is being collected?
Why is the data being collected?

Haojian Jin, Minyi Liu, Yuanchun Li, Gaurav Srivastava, Matthew Fredrikson, Yuvraj Agarwal, Jason Hong

SLIDE 2

Who requests what data, and why

who: Camera app
what: location
why: to tag photos

who: Uber
what: location
why: to locate pickup location

SLIDE 3

These descriptions are shown only at the user interface layer and can be arbitrary text. There is no way to verify them, and they are not yet widely adopted.

SLIDE 4

APIs

User interface layer: arbitrary text, no way to verify.

Network perspective: when an app calls an API and posts data to remote servers over the network.

SLIDE 5

APIs

Can we index the privacy attributes of each network request (who, where, what, why), much as the permission dialog does?

SLIDE 6

https://maps.google.com

Who (which app) sends the data? Uber
Where is the data being sent? Google
What data is being collected? Location
Why is the data being collected? Map/navigation

SLIDE 7

Towards a public, large-scale privacy database to improve the transparency of mobile data collection

SLIDE 8

Related work

SLIDE 9

https://maps.google.com

Who (which app) sends the data? Uber
Where is the data being sent? Google
What data is being collected? Location
Why is the data being collected? Map/navigation

State of the art: [1, 2]

[1] Who Knows What About Me? A Survey of Behind the Scenes Personal Data Sharing to Third Parties by Mobile Apps
[2] ReCon: Revealing and Controlling PII Leaks in Mobile Network Traffic

SLIDE 10

Who Knows What About Me?

Zang et al. https://techscience.org/a/2015103001/

Related work

SLIDE 11

Who Knows What About Me?

https://techscience.org/a/2015103001/

Related work

  • Recruit ~10 participants
  • Install VPN on their phones
  • Test 10-100 apps
  • Raw data types (e.g., email address)

SLIDE 12

https://maps.google.com

Who (which app) sends the data? Uber
Where is the data being sent? Google
What data is being collected? Location
Why is the data being collected? Map/navigation

State of the art

Less explored

SLIDE 13

Related work

Exposing the Data Sharing Practices of Smartphone Apps [CHI ’17]

Expectation and Purpose [Ubicomp ’12]

SLIDE 14

Related work

Exposing the Data Sharing Practices of Smartphone Apps [CHI ’17]

Expectation and Purpose [Ubicomp ’12]

Purposes are manually annotated by researchers.

SLIDE 15

MobiPurpose is a scalable in-lab solution that can index fine-grained privacy attributes (who, where, what, why) of outgoing network requests.

SLIDE 16

3 modules

1. Scalable network tracing
2. Data types & purposes taxonomy
3. Automated inference

SLIDE 17

1. Network tracing

large-scale network requests at a low cost

SLIDE 18

Hardware & software setup (1)

……

downloaded 185,173 apps

SLIDE 19

Hardware & software setup (2)

……

installed 30,075 apps (due to OS compatibility, etc.)

SLIDE 20

Hardware & software setup (3)

  • a man-in-the-middle VPN proxy app
  • 3 minutes of UI automation for each app
  • running for 50 days

We open-source the tools at: http://bit.ly/mobipurpose

SLIDE 21

Raw Traffic Data

Traffic request snapshot

source app: com.inkcreature.predatorfree
connect to host: inkcreature.com
server path: /_predatorServer/
key-value pairs in request body:
  myLat: 40.4435877
  myLon: -79.9452883
  ....
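The snapshot above can be held in a small record so that the "who" (source app), the "where" (host and path), and the key-value pairs are addressable separately. This is a minimal sketch, not the format used by the authors' tools: the `TrafficRequest` class, its field names, and the `parse_snapshot` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TrafficRequest:
    """One captured outgoing request (hypothetical field names)."""
    source_app: str   # who: package name of the sending app
    host: str         # where: destination host
    path: str         # where: server path
    body: Dict[str, str] = field(default_factory=dict)  # key-value pairs


def parse_snapshot(text: str) -> TrafficRequest:
    """Parse 'label: value' lines laid out like the snapshot on the slide."""
    known = {
        "source app": "source_app",
        "connect to host": "host",
        "server path": "path",
    }
    fields: Dict[str, str] = {}
    body: Dict[str, str] = {}
    for line in text.strip().splitlines():
        if ":" not in line:
            continue  # skip elisions such as '....'
        label, _, value = line.partition(":")
        label, value = label.strip(), value.strip()
        if label in known:
            fields[known[label]] = value
        elif value:
            # Any other labelled line with a value is treated as a body pair.
            body[label] = value
    return TrafficRequest(body=body, **fields)


snapshot = """
source app: com.inkcreature.predatorfree
connect to host: inkcreature.com
server path: /_predatorServer/
key-value pairs in request body:
myLat: 40.4435877
myLon: -79.9452883
....
"""
request = parse_snapshot(snapshot)
print(request.source_app, request.host, request.path, request.body)
```

Values are kept as strings here; deciding that `myLat` and `myLon` are coordinates is left to the data type inference step.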

SLIDE 22

Raw Traffic Data

Traffic request snapshot

source app: com.inkcreature.predatorfree (Who?)
connect to host: inkcreature.com (Where?)
server path: /_predatorServer/
key-value pairs in request body: (Key-value pairs)
  myLat: 40.4435877
  myLon: -79.9452883
  ....

SLIDE 23

Traffic request snapshot

source app: com.inkcreature.predatorfree
connect to host: inkcreature.com
server path: /_predatorServer/
key-value pairs in request body:
  myLat: 40.4435877
  myLon: -79.9452883
  ....

Traffic data stats

  • 2,008,912 unique traffic requests
  • from 14,910 apps
  • contacting 12,046 unique domains
  • 302,893 unique URLs

We publish the dataset at: http://bit.ly/purposedata

SLIDE 24

2. Taxonomy

define and categorize purposes

SLIDE 25

“usage strings” in iOS/Android

Arbitrary text is hard to aggregate, analyze, and verify.

SLIDE 26

Observations

  • Many apps collect users’ data for similar purposes.
  • There are enumerable purposes: 10-50, depending on the granularity.

SLIDE 27

  • generate text describing the purpose
  • build a taxonomy and classify the purpose

SLIDE 28

Design the Taxonomy

1. Comprehensive and extendable: covers the majority of use cases
2. Meaningful granularity: neither too narrow nor too broad
3. Understandable: minimal explanation for developers and users

SLIDE 29

  • 10 CS graduate students
  • categorizing 1000+ network requests and 300+ permission usages
  • 3 independent sessions
  • affinity diagram

SLIDE 30

purpose granularity

Purpose at App level

why a user downloads the app (e.g., app categories - Game)

Purpose at Network level

why an app sends the request (e.g., library categories - Ad)

SLIDE 31

purpose granularity

Purpose at App level

why a user downloads the app (e.g., app categories - Game)

Purpose at Network level

why an app sends the request (e.g., library categories - Ad)

Purpose at Data level

why a developer collects the data (e.g., nearby search)

SLIDE 32

purpose granularity

Purpose at App level

why a user downloads the app (e.g., app categories - Game)

Purpose at Network level

why an app sends the request (e.g., library categories - Ad)

Purpose at Data level

why a developer collects the data (e.g., usage descriptions)

The data level contains the most privacy details and is consistent with usage strings.

SLIDE 33

data types

location

taxonomy typology

SLIDE 34

data types: location

data purposes examples: nearby search

SLIDE 35

data types: location

data purposes examples: nearby search, location-based customization

SLIDE 36

data types: location, ……

data purposes examples: nearby search, location-based customization, ad analytics, ……

SLIDE 37

Data purposes for location data

See the complete taxonomy at: http://bit.ly/mobitaxonomy

SLIDE 38

data types

SLIDE 39

extensibility

Bluetooth

SLIDE 40

3. Automated inference

SLIDE 41

input: Traffic request snapshot

source app: com.inkcreature.predatorfree
connect to host: inkcreature.com
server path: /_predatorServer/
key-value pairs in request body:
  myLat: 40.4435877
  myLon: -79.9452883
  ....

output:

What data is being collected?
Why is the data being collected?

SLIDE 42

intuitions

1. Self-explainable patterns

  • userAdvertisingId : 901e3310-3a26-487e-83c7-2fa26ac2786c
    key: advertising, Id; value: machine generated UUID
  • http://reports.crashlytics.com
    host: report, crash, analytics
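The "self-explainable patterns" intuition can be illustrated by splitting keys and hostnames into words and by recognising machine-generated values. The sketch below is an assumption about how such signals might be extracted, not the feature extraction described in the paper; `split_identifier` and `looks_like_uuid` are hypothetical helper names. It splits only on case changes and separators, so it does not recover sub-words such as "crash" from "crashlytics".

```python
import re
from typing import List

# Canonical 8-4-4-4-12 hexadecimal UUID layout.
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def split_identifier(name: str) -> List[str]:
    """Split a camelCase, snake_case, or dotted name into lowercase words."""
    words: List[str] = []
    for chunk in re.split(r"[^A-Za-z0-9]+", name):
        # Matches acronym runs, capitalised or lowercase words, and digit runs.
        words += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", chunk)
    return [w.lower() for w in words]


def looks_like_uuid(value: str) -> bool:
    """Return True when the value has the canonical UUID shape."""
    return bool(UUID_RE.match(value))


print(split_identifier("userAdvertisingId"))
print(split_identifier("reports.crashlytics.com"))
print(looks_like_uuid("901e3310-3a26-487e-83c7-2fa26ac2786c"))
```

A key whose words include "advertising" and "id", paired with a UUID-shaped value, is the kind of pattern the slide calls self-explainable.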

SLIDE 43

intuitions

1. Self-explainable patterns
2. External knowledge (app type, server domain)

a game app sends location data to http://admob.com, a mobile ad company

SLIDE 44

data type inference

a bootstrapping method to predict the data type

SLIDE 45

taxonomy lookup to get the purpose candidates

purposes candidates:

  • search nearby
  • location-based customization
  • transportation information
  • recording
  • map/navigation
  • geosocial networking
  • geotagging
  • location spoofing
  • alert and remind
  • location-based game
  • reverse geocoding
  • advertising
  • analytics
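The lookup step amounts to mapping an inferred data type to its candidate purposes. A minimal sketch, assuming the taxonomy is stored as a dictionary; the `TAXONOMY` table below is an abbreviated, illustrative excerpt for location data only, and `purpose_candidates` is a hypothetical helper name.

```python
from typing import Dict, List

# Abbreviated, illustrative excerpt; the complete taxonomy is published by the authors.
TAXONOMY: Dict[str, List[str]] = {
    "location": [
        "search nearby",
        "location-based customization",
        "map/navigation",
        "geotagging",
        "reverse geocoding",
    ],
}


def purpose_candidates(data_type: str) -> List[str]:
    """Return the candidate purposes for a data type, or [] if it is unknown."""
    return list(TAXONOMY.get(data_type.lower(), []))


print(purpose_candidates("location"))
```

The classifier then only has to choose among these candidates rather than among every purpose in the taxonomy.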

SLIDE 46

Traffic request snapshot

source app: com.inkcreature.predatorfree
connect to host: inkcreature.com
server path: /_predatorServer/
key-value pairs in request body:
  myLat: 40.4435877
  myLon: -79.9452883
  ....

purpose features

Source app feature: predator is an offender registry search app

SLIDE 47

Traffic request snapshot

source app: com.inkcreature.predatorfree
connect to host: inkcreature.com
server path: /_predatorServer/
key-value pairs in request body:
  myLat: 40.4435877
  myLon: -79.9452883
  ....

purpose features

Source app feature: predator is an offender registry search app
Textual feature: the app sends data to its own server

SLIDE 48

Traffic request snapshot

source app: com.inkcreature.predatorfree
connect to host: inkcreature.com
server path: /_predatorServer/
key-value pairs in request body:
  myLat: 40.4435877
  myLon: -79.9452883
  ....

purpose features

Source app feature: predator is an offender registry search app
Textual feature: the app sends data to its own server
Domain feature:

  • company business type (Crunchbase)
  • decompile app files to mine the domain references
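The slides do not say how "the app sends data to its own server" is detected. One plausible heuristic, shown here purely as an assumption, is to compare the distinctive parts of the package name with those of the destination host. The `GENERIC` stop list and both function names are hypothetical.

```python
from typing import Set

# Name parts too common to identify a developer (illustrative list).
GENERIC = {"com", "org", "net", "io", "co", "app", "www", "android"}


def name_tokens(dotted: str) -> Set[str]:
    """Split a dotted name into its distinctive lowercase parts."""
    return {p for p in dotted.lower().split(".") if p and p not in GENERIC}


def sends_to_own_server(package_name: str, host: str) -> bool:
    """Guess first-party traffic: package name and host share a distinctive part."""
    return bool(name_tokens(package_name) & name_tokens(host))


print(sends_to_own_server("com.inkcreature.predatorfree", "inkcreature.com"))
print(sends_to_own_server("com.inkcreature.predatorfree", "admob.com"))
```

For the snapshot above, the shared part `inkcreature` marks the request as first-party, while a request to `admob.com` would not match.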

SLIDE 49

source app feature: predator is an offender registry search app

textual feature: the app sends data to its own server

domain feature:

  • company business type from Crunchbase
  • decompile app files at scale to mine the domain references

supervised machine learning

purposes candidates:

  • search nearby
  • location-based customization
  • transportation information
  • recording
  • map/navigation
  • geosocial networking
  • geotagging
  • location spoofing
  • alert and remind
  • location-based game
  • reverse geocoding
  • advertising
  • analytics

probability: 0.72 0.2 0.03 0.02 0.02 0.01

SLIDE 50

Evaluation

accuracy & recall

SLIDE 51

data set labeling

Labeling “what” & “why” in each traffic request. Each request has been labeled by three people.

SLIDE 52

data set & method

  • 1059 traffic requests in total across 7 data categories
  • consensus on 98% of data type labels and 88% of purpose labels
  • method: 10-fold cross-validation
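In 10-fold cross-validation the labeled requests are split into ten disjoint folds, and each fold serves once as the held-out test set while the other nine are used for training. The sketch below shows only the index bookkeeping for 1059 items; the shuffling seed and the `k_fold_indices` helper are assumptions, and the slides do not say whether the authors' folds were stratified.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n: int, k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(1059, k=10))
print(len(splits), sorted({len(test) for _, test in splits}))
```

Every request appears in exactly one test fold, so each reported metric is computed on data the model did not see while training for that fold.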

SLIDE 53

results

Data type inference:

  • Overall precision of 95.9%
  • precision above 93% for all 7 classes
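Per-class precision is the share of predictions for a class that were correct: true positives divided by all predictions of that class. A minimal sketch of the calculation follows; the toy labels are made up for illustration and are not results from the talk.

```python
from collections import Counter
from typing import Dict, List


def per_class_precision(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Precision per predicted class: correct predictions / all predictions."""
    predicted = Counter(y_pred)
    correct = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / count for label, count in predicted.items()}


# Made-up toy labels, not data from the evaluation.
y_true = ["location", "location", "id", "id", "location"]
y_pred = ["location", "id", "id", "id", "location"]
precision = per_class_precision(y_true, y_pred)
print(precision)
```

Classes that are never predicted are absent from the result, since their precision is undefined.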

SLIDE 54

results

Data purpose inference:

  • Overall precision of 84% for 19 unique categories
  • confusion matrix for ID

SLIDE 55

results

Data purpose inference:

  • Overall precision of 84% for 19 unique categories
  • confusion matrix for ID purposes

See more details in the paper.

SLIDE 56
SLIDE 57

Beta web: http://bit.ly/mobipurposeweb

SLIDE 58

Recap

APIs

MobiPurpose: who, where, what, why

Make privacy a native feature by inspecting network requests

SLIDE 59

Outreach

1. Network tracing tools: http://bit.ly/mobipurposetool
2. Traffic requests data set: http://bit.ly/mobipurposedata
3. Data type & purpose taxonomy: http://bit.ly/mobitaxonomy
4. Beta web: http://bit.ly/mobipurposeweb

Traffic request snapshot

source app: com.inkcreature.predatorfree
connect to host: inkcreature.com
server path: /_predatorServer/
key-value pairs in request body:
  myLat: 40.4435877
  myLon: -79.9452883
  ....

SLIDE 60

Inferring the Purposes of Network Traffic in Mobile Apps

Who (which app) sends the data?
Where is the data being sent?
What data is being collected?
Why is the data being collected?

Haojian Jin (haojian@cs.cmu.edu)

SLIDE 61

Backup slides

SLIDE 62

Privacy frameworks

  • Approximate information flow
  • Contextual Integrity

SLIDE 63

user study vs. in-lab devices

recruiting participants

  + proportional to real usage
  - not scalable
  - may not be ethical

in-lab devices

  - not proportional
  + scalable
  + cheap