DS504/CS586: Big Data Analytics Application I
- Prof. Yanhua Li
Welcome to
Time: 6:00pm – 8:50pm R, Location: KH 116, Fall 2017
16 critiques, and next Thur we have the last critique. Already graded 4 of them; plan to grade 1-2 more.
– Final reports in the discussion forum (by 11:59pm 12/12 Tue)
– Self-and-peer evaluation form for project 2 (by 11:59pm 12/12 Tue)
– Review of the semester
– Plus the last critique/review
Urban Sensing & Data Acquisition
Participatory Sensing, Crowd Sensing, Mobile Sensing
Traffic, Road Networks, POIs, Air Quality, Human Mobility, Meteorology, Social Media, Energy
Urban Data Management
Spatio-temporal index, streaming, trajectory, and graph data management,...
Urban Data Analytics
Data Mining, Machine Learning, Visualization
Service Providing
Improve urban planning, Ease Traffic Congestion, Save Energy, Reduce Air Pollution, ...
Urban Computing: Concepts, Methodologies, and Applications. Zheng, Y., et al. ACM Transactions on Intelligent Systems and Technology.
Authors: Yu Zheng, Microsoft Research Asia
– NO2, SO2
– Aerosols: PM2.5, PM10
– Healthcare
– Pollution control and dispersal
– Building a measurement station is not easy
– A limited number of stations (poor coverage)
Beijing has only 22 air quality monitoring stations in its urban areas (50km × 40km)
2PM, June 17, 2013
– Weather, traffic, land use…
– Hard to model with a clear formula
[Figure: proportion vs. deviation of PM2.5 between stations S12 and S13; annotation: >35%. A) Beijing (8/24/2012 - 3/8/2013)]
– Linear interpolation
– Classical dispersion models
geometry, the roughness coefficient of the urban surface…
– Satellite remote sensing
– Outsourced crowd sensing using portable devices
30,000+ USD, 10 µg/m3, 202×85×168 (mm)
Meteorology, Traffic, POIs, Road networks, Human mobility, Historical air quality data, Real-time air quality reports
– Fine-grained pollution alert
– Routing based on air quality
[Figure: map of air quality monitoring stations S1 to S10]
– Spatially-related data: POIs, road networks
– Temporally-related data: traffic, meteorology, human mobility
– Limited number of stations
– Many places to infer
– Meteorological features
– Traffic features
– Human mobility features
– POI features
– Road network features
– Predict the AQI labels
– Data sparsity
– Two classifiers
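Predicting AQI labels means turning a numeric index into one of a few categories. The sketch below shows one way to do that mapping. The breakpoints (50, 100, 150, 200, 300) follow the widely used US EPA AQI scale and are an assumption here, since the slides list only the label names; `aqi_label` is a hypothetical helper name.

```python
# Upper bound of each AQI band and its label.
# Assumption: US EPA-style breakpoints; the slides only give the label names.
AQI_BREAKPOINTS = [
    (50, "Good"),
    (100, "Moderate"),
    (150, "Unhealthy-S"),      # unhealthy for sensitive groups
    (200, "Unhealthy"),
    (300, "Very Unhealthy"),
]


def aqi_label(aqi: float) -> str:
    """Map a numeric AQI value to a categorical label."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    for upper, label in AQI_BREAKPOINTS:
        if aqi <= upper:
            return label
    return "Hazardous"  # anything above the last breakpoint


if __name__ == "__main__":
    for value in (42, 101, 250):
        print(value, aqi_label(value))
```

Treating the task as classification over these labels, rather than regression on the raw index, is what lets the two classifiers exchange confident labels later on.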
[Figure: AQI of PM10, August to Dec. 2012 in Beijing; legend: Good, Moderate, Unhealthy-S, Unhealthy, Very Unhealthy]
[Figure: AQI labels vs. traffic features; speed bins 0≤v<20, 20≤v<40, v≥40; E(v), D(v)]
GPS trajectories generated by over 30,000 taxis from August to Dec. 2012 in Beijing
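As a rough illustration of the traffic features, the sketch below computes the share of speed readings in each bin (0≤v<20, 20≤v<40, v≥40) together with E(v) and D(v) for one grid cell. It assumes speeds are in km/h and that D(v) is the population standard deviation; both are assumptions, and `traffic_features` is a hypothetical name.

```python
from statistics import mean, pstdev


def traffic_features(speeds_kmh):
    """Summarize taxi speed readings observed in one grid cell.

    Returns the fraction of readings in each speed bin plus E(v) and D(v).
    Assumption: D(v) is the population standard deviation of the speeds.
    """
    if not speeds_kmh:
        raise ValueError("need at least one speed reading")
    n = len(speeds_kmh)
    return {
        "0<=v<20": sum(1 for v in speeds_kmh if 0 <= v < 20) / n,
        "20<=v<40": sum(1 for v in speeds_kmh if 20 <= v < 40) / n,
        "v>=40": sum(1 for v in speeds_kmh if v >= 40) / n,
        "E(v)": mean(speeds_kmh),
        "D(v)": pstdev(speeds_kmh),
    }


if __name__ == "__main__":
    print(traffic_features([10, 30, 50, 30]))
```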
– Traffic flow
– Land use of a location
– Function of a region (like residential or business areas)
– Number of arrivals $f_a$ and leavings $f_l$
[Figure: A) AQI of PM10, B) AQI of NO2, plotted over fa and fl; legend: Good, Moderate, Unhealthy-S, Unhealthy, Very Unhealthy]
Number of arrivals fa and leavings (departures) fl: parks vs. factories
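A minimal sketch of how fa and fl could be counted per grid cell, assuming each taxi trip has already been reduced to a (pickup cell, drop-off cell) pair. That trip representation and the name `arrivals_and_leavings` are illustrative assumptions, not the authors' pipeline.

```python
from collections import Counter


def arrivals_and_leavings(trips):
    """Count arrivals (f_a) and leavings (f_l) per grid cell.

    trips: iterable of (pickup_cell, dropoff_cell) pairs.
    """
    f_a, f_l = Counter(), Counter()
    for pickup_cell, dropoff_cell in trips:
        f_l[pickup_cell] += 1    # someone left the pickup cell
        f_a[dropoff_cell] += 1   # someone arrived at the drop-off cell
    return f_a, f_l


if __name__ == "__main__":
    trips = [("park", "factory"), ("park", "mall"), ("factory", "park")]
    f_a, f_l = arrivals_and_leavings(trips)
    print("arrivals:", dict(f_a))
    print("leavings:", dict(f_l))
```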
– Indicate the land use and the function of the region
– The traffic patterns in the region
– Numbers of POIs over categories
– Portion of vacant places
– The changes in the number of POIs
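Two of these POI features, the per-category counts and the change in counts between snapshots, can be sketched as below. The input format (a list of category names per snapshot) and the name `poi_features` are assumptions for illustration; the portion of vacant places is not covered because the slides do not define it.

```python
from collections import Counter


def poi_features(pois_earlier, pois_later):
    """Per-category POI counts (later snapshot) and their change between snapshots.

    pois_*: iterable of category names, one entry per POI in the region.
    """
    before, after = Counter(pois_earlier), Counter(pois_later)
    categories = sorted(set(before) | set(after))
    counts = {c: after[c] for c in categories}
    change = {c: after[c] - before[c] for c in categories}
    return counts, change


if __name__ == "__main__":
    q1 = ["factory", "park", "park"]
    q3 = ["factory", "factory", "park", "park", "mall"]
    print(poi_features(q1, q3))
```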
– Have a strong correlation with traffic flows
– A good complement to traffic modeling
– Total length of highways $f_h$
– Total length of other (low-level) road segments $f_r$
– The number of intersections $f_s$ in the grid's affecting region
– States of air quality
– Generation of air pollutants
– Two sets of features
[Figure: stations s1 to s4 and a location l at time steps t1, t2, …, ti; axes: Time and Geospace; legend: a location with AQI labels, a location to be inferred, temporal dependency, spatial correlation]
[Figure: co-training framework]
Spatial: POIs: Fp, Road networks: Fr → Spatial Classifier
Temporal: Meteorologic: Fm, Traffic: Ft, Human mobility: Fh → Temporal Classifier
Spatial Classifier + Temporal Classifier: Co-Training
– Model the spatial correlation between AQI of different locations
– Using spatially-related features
– Based on a neural network
– Select n stations to pair with
– Perform m rounds
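[Figure: structure of the spatial classifier. Input generation: for each paired station k (1 to n), operators D1 and D2 are applied to the features Fp, Fr and location l of station k and of the target x, giving ∆Pkx, ∆Rkx, and dkx, used together with the label c of station k. ANN: weights w11…wpq, w'11…w'qr, w1…wr; biases b1…bq, b'1…b'r, b''. Output: cx]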
Input (X1, X2, X3) and Output (Y):

X1  X2  X3  Y
1   0   0   0
1   0   1   1
1   1   0   1
1   1   1   1
0   0   1   0
0   1   0   0
0   1   1   1
0   0   0   0

Output Y is 1 if at least two of the three inputs are equal to 1.
[Figure: perceptron with input nodes X1, X2, X3, weights 0.3, 0.3, 0.3, a summation node, and an output node Y with threshold t=0.4; in general, a black box with weights w1, w2, w3 and threshold t]
Perceptron Model
[Figure: neuron i with inputs I1, I2, I3, weights wi1, wi2, wi3, weighted sum Si, threshold t, activation function g(Si), and output Oi]
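The perceptron example can be checked in a few lines. The sketch below uses the weights (0.3, 0.3, 0.3) and threshold t=0.4 from the slide and fires when the weighted sum exceeds the threshold. The step activation and the function name `perceptron` are illustrative choices.

```python
from itertools import product

WEIGHTS = (0.3, 0.3, 0.3)   # one weight per input node, as on the slide
THRESHOLD = 0.4             # t on the slide


def perceptron(inputs, weights=WEIGHTS, threshold=THRESHOLD):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0


if __name__ == "__main__":
    # Enumerate all eight input combinations of the truth table.
    for x in product((0, 1), repeat=3):
        print(x, "->", perceptron(x))
```

With two active inputs the sum is 0.6, above 0.4; with one it is 0.3, below 0.4. This is why the unit outputs 1 exactly when at least two inputs are 1.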
[Figure: multi-layer network with an input layer (x1 to x5), a hidden layer, and an output layer (y)]
Training ANN means learning the weights of the neurons
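To make "learning the weights" concrete, here is a minimal sketch of the classic perceptron learning rule applied to the "at least two of three inputs" function. It covers a single neuron only; multi-layer networks are normally trained with backpropagation, which is not shown. The learning rate, epoch cap, and function names are illustrative assumptions.

```python
from itertools import product


def predict(weights, bias, x):
    """Step-activated neuron: 1 if w.x + b > 0, else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


def train_perceptron(samples, learning_rate=1.0, max_epochs=100):
    """Learn weights and bias with the perceptron update rule.

    samples: list of (inputs, target) with targets in {0, 1}.
    """
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in samples:
            error = target - predict(weights, bias, x)
            if error:
                mistakes += 1
                # Move the decision boundary toward the misclassified sample.
                weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
                bias += learning_rate * error
        if mistakes == 0:   # every sample classified correctly
            break
    return weights, bias


if __name__ == "__main__":
    data = [(x, 1 if sum(x) >= 2 else 0) for x in product((0, 1), repeat=3)]
    w, b = train_perceptron(data)
    print("weights:", w, "bias:", b)
```

Because this target function is linearly separable, the update rule is guaranteed to stop making mistakes after a finite number of updates.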
– Model the temporal dependency of the air quality in a location
– Using temporally related features
– Based on a Linear-Chain Conditional Random Field (CRF)
[Figure: linear-chain CRF over time steps t-1, t, t+1; each label Yt is linked to the previous label Yt-1 and to the features Fm(t), Ft(t), Fh(t) of its time step]
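Once a linear-chain model has scores for each label at each time step and for each label-to-label transition, the most likely label sequence is typically found with Viterbi decoding. The sketch below shows only that decoding step, with hand-picked log-domain scores; in a trained CRF these scores would come from learned weights over Fm, Ft, and Fh. All names and numbers are illustrative assumptions.

```python
def viterbi(emission_scores, transition_scores):
    """Best label sequence for a linear chain.

    emission_scores: list (one entry per time step) of {label: score}.
    transition_scores: {(previous_label, label): score}.
    Scores are additive (log-domain); higher is better.
    """
    labels = list(emission_scores[0])
    best = {y: emission_scores[0][y] for y in labels}
    backpointers = []
    for scores in emission_scores[1:]:
        new_best, pointers = {}, {}
        for cur in labels:
            # Best previous label to transition from into `cur`.
            prev = max(labels, key=lambda p: best[p] + transition_scores[(p, cur)])
            new_best[cur] = best[prev] + transition_scores[(prev, cur)] + scores[cur]
            pointers[cur] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    path = [max(labels, key=best.get)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    stay, switch = 0.0, -2.0
    transitions = {("G", "G"): stay, ("U", "U"): stay, ("G", "U"): switch, ("U", "G"): switch}
    emissions = [{"G": 0.0, "U": -3.0}, {"G": -1.0, "U": -0.5}, {"G": 0.0, "U": -3.0}]
    print(viterbi(emissions, transitions))
```

In the example the middle time step slightly favors "U" on its own, but the transition penalty keeps the decoded sequence at "G" throughout, which is the temporal dependency at work.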
[Figure: co-training framework. Training: the ANN-based spatial classifier (with input generation) takes spatially-related features and the temporal classifier takes temporally-related features; both are trained on labeled data and use unlabeled data. Inference]
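The co-training loop can be sketched as follows: two classifiers, each seeing a different feature view, take turns labeling the unlabeled examples they are most confident about, and those examples join the labeled set for the next round. This is a toy sketch under heavy assumptions: each view is a single number, and a nearest-centroid rule stands in for both the ANN and the CRF. All names are hypothetical.

```python
def fit_centroids(xs, ys):
    """Per-class mean of a 1-D feature (a stand-in for a real classifier)."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_with_confidence(centroids, x):
    """Return (label, margin); margin is the gap between the two nearest centroids."""
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    best = ranked[0]
    if len(ranked) == 1:
        return best, 0.0
    return best, abs(x - centroids[ranked[1]]) - abs(x - centroids[best])


def co_train(labeled, unlabeled, rounds=5, per_round=1):
    """labeled: [((spatial_x, temporal_x), label)]; unlabeled: [(spatial_x, temporal_x)]."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        ys = [y for _, y in labeled]
        models = [fit_centroids([x[v] for x, _ in labeled], ys) for v in (0, 1)]
        picked = {}
        for view, model in enumerate(models):
            # Rank pool items by this view's confidence, most confident first.
            scored = sorted(
                ((predict_with_confidence(model, x[view]), i) for i, x in enumerate(pool)),
                key=lambda item: item[0][1],
                reverse=True,
            )
            for (label, _), i in scored[:per_round]:
                picked.setdefault(i, label)
        # Confidently labeled items become training data for both views.
        labeled += [(pool[i], label) for i, label in picked.items()]
        pool = [x for i, x in enumerate(pool) if i not in picked]
    ys = [y for _, y in labeled]
    return [fit_centroids([x[v] for x, _ in labeled], ys) for v in (0, 1)]


if __name__ == "__main__":
    seed = [((0.0, 0.0), "Good"), ((10.0, 10.0), "Unhealthy")]
    pool = [(1.0, 0.5), (9.0, 9.5), (2.0, 1.0), (8.0, 9.0)]
    spatial_model, temporal_model = co_train(seed, pool)
    print(spatial_model, temporal_model)
```

The point of the two views is that an example one classifier labels confidently can still be informative for the other, which is how unlabeled locations help with label sparsity.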
[Figure: inference. The ANN-based spatial classifier (with input generation) takes spatially-related features; the linear-chain CRF temporal classifier takes temporally-related features Fm(t), Ft(t), Fh(t)]
Each classifier outputs one probability per class: $\langle p_{c_1}, p_{c_2}, \ldots, p_{c_n} \rangle$ and $\langle p'_{c_1}, p'_{c_2}, \ldots, p'_{c_n} \rangle$. The two scores are multiplied per class and the class with the largest product is chosen:
$c = \arg\max_{c_i \in \mathcal{C}} \left( p_{c_i} \times p'_{c_i} \right) = \arg\max_{c_i \in \mathcal{C}} \left( P^{c_i}_{SC} \times P^{c_i}_{TC} \right)$
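The inference rule, picking the class that maximizes the product of the spatial classifier (SC) and temporal classifier (TC) probabilities, can be sketched as below. The probability values and the name `combine` are made up for illustration.

```python
def combine(p_sc, p_tc):
    """Return the class with the largest product of SC and TC probabilities.

    p_sc, p_tc: {class_label: probability} from the spatial and temporal classifiers.
    """
    scores = {c: p_sc[c] * p_tc[c] for c in p_sc}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    p_sc = {"Good": 0.6, "Moderate": 0.3, "Unhealthy": 0.1}
    p_tc = {"Good": 0.2, "Moderate": 0.7, "Unhealthy": 0.1}
    print(combine(p_sc, p_tc))
```

Here the spatial classifier alone would answer "Good", but the product (0.12 for Good vs. 0.21 for Moderate) favors "Moderate" because the temporal classifier strongly disagrees.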
Data sources                Beijing             Shanghai            Shenzhen           Wuhan
POI    2012 Q1              271,634             321,529             107,061            102,467
POI    2012 Q3              272,109             317,829             107,171            104,634
Road   #. Segments          162,246             171,191             45,231             38,477
Road   Highways             1,497km             1,963km             256km              1,193km
Road   Roads                18,525km            25,530km            6,100km            9,691km
Road   #. Intersec.         49,981              70,293              32,112             25,359
AQI    #. Station           22                  10                  9                  10
AQI    Hours                23,300              8,588               6,489              6,741
AQI    Time spans           8/24/2012-3/8/2013  1/19/2013-3/8/2013  2/4/2013-3/8/2013  2/4/2013-3/8/2013
Urban size (grids)          50×50km (2500)      50×50km (2500)      57×45km (2565)     45×25km (1165)
[Figure: maps of air quality monitoring stations in A) Beijing, B) Shanghai, C) Shenzhen, D) Wuhan]
– Remove a station
– Cross cities
– Linear and Gaussian Interpolations
– Classical Dispersion Model
– Decision Tree (DT)
– CRF-ALL
– ANN-ALL
                    PM10                  NO2
Features            Precision  Recall     Precision  Recall
Fm                  0.572      0.514      0.477      0.454
Ft                  0.341      0.36       0.371      0.35
Fh                  0.327      0.364      0.411      0.483
Fp+Fr               0.441      0.443      0.307      0.354
Fm+Ft               0.664      0.675      0.634      0.635
Fm+Ft+Fp+Fr         0.731      0.734      0.701      0.691
Fm+Ft+Fp+Fr+Fh      0.773      0.754      0.723      0.704
[Figure: accuracy of SC, TC, and Co-Training]
[Figure: accuracy on NO2 and PM10 for U-Air, Linear, Gaussian, Classical, DT, CRF-ALL, ANN-ALL]
Confusion matrix (rows: ground truth, columns: predictions)

Ground truth   G      M      S      U      Recall
G              3789   402    102    0      0.883
M              602    3614   204    0      0.818
S              41     200    532    50     0.646
U              0      22     70     219    0.704
Precision      0.855  0.853  0.586  0.814  0.828 (accuracy)
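The per-class precision and recall and the overall accuracy follow directly from the confusion matrix counts. The sketch below recomputes them, reading the blank cells as 0; that reading is an assumption, but it reproduces every reported value.

```python
LABELS = ["G", "M", "S", "U"]

# Rows: ground truth, columns: predictions. Blank cells are read as 0 (assumption).
CONFUSION = [
    [3789, 402, 102, 0],
    [602, 3614, 204, 0],
    [41, 200, 532, 50],
    [0, 22, 70, 219],
]


def recall(matrix, i):
    """Correct predictions of class i divided by all true instances of class i."""
    return matrix[i][i] / sum(matrix[i])


def precision(matrix, i):
    """Correct predictions of class i divided by all predictions of class i."""
    return matrix[i][i] / sum(row[i] for row in matrix)


def accuracy(matrix):
    """Diagonal total divided by the total number of instances."""
    return sum(matrix[i][i] for i in range(len(matrix))) / sum(map(sum, matrix))


if __name__ == "__main__":
    for i, label in enumerate(LABELS):
        print(label, round(precision(CONFUSION, i), 3), round(recall(CONFUSION, i), 3))
    print("accuracy", round(accuracy(CONFUSION), 3))
```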
            PM2.5           PM10            NO2
Cities      Prec.   Rec.    Prec.   Rec.    Prec.   Rec.
Beijing     0.764   0.763   0.762   0.745   0.730   0.749
Shanghai    0.705   0.725   0.702   0.718   0.715   0.706
Shenzhen    0.740   0.737   0.710   0.742   0.732   0.722
Wuhan       0.727   0.723   0.731   0.739   0.744   0.719
Procedures                       Time (ms)
Feature extraction (per grid)
  Ft & Fh                        53.2
  (unlabeled)                    28.8
  (unlabeled)                    14.4
Inference (per grid)
  SC                             21.5
  TC                             13.1
Total                            131
– Real-time and historical air quality readings from existing stations
– Other data sources: meteorology, POIs, road network, human mobility, and traffic condition
– Deal with data sparsity by learning from unlabeled data
– Model the spatial correlation among the air quality of different locations
– Model the temporal dependency of the air quality in a location
– 0.82 with traffic data (co-training)
– 0.76 if only using spatial classifier