Welcome to
DS504/CS586: Big Data Analytics Data Management
Prof. Yanhua Li
Time: 6:00pm–8:50pm R Location: AK233 Spring 2018
Urban Computing
[Figure: Cities OS, People, The Environment (Win, Win, Win)]
Urban Sensing & Data Acquisition
Participatory Sensing, Crowd Sensing, Mobile Sensing
Traffic, Road Networks, POIs, Air Quality, Human Mobility, Meteorology, Social Media, Energy
Urban Data Management
Spatio-temporal index, streaming, trajectory, and graph data management,...
Urban Data Analytics
Data Mining, Machine Learning, Visualization
Service Providing
Improve urban planning, Ease Traffic Congestion, Save Energy, Reduce Air Pollution, ...
Zheng, Y., et al. Urban Computing: Concepts, Methodologies, and Applications. ACM Transactions on Intelligent Systems and Technology.
K Nearest Neighbour (KNN) Queries: Given a point or an object, find the nearest object that satisfies given conditions.
Region (Range) Query: Ask for objects that lie partially or fully inside a specified region.
v Range queries
§ E.g., retrieve the trajectories of vehicles passing a given rectangular region R between 2pm and 4pm in the past month
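The range query example above can be sketched as a plain linear scan, before any index is involved. This is a minimal illustration, not code from the lecture: the `Rect` class, the `range_query` function, the sample trajectories, and the use of hours since midnight as the time unit are all assumptions. It tests sampled points only, so a trajectory that crosses R between two samples would be missed.

```python
from dataclasses import dataclass

@dataclass
class Rect:
    """Axis-aligned rectangular region R (hypothetical helper)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def contains(self, x, y):
        return self.xmin <= x <= self.xmax and self.ymin <= y <= self.ymax

def range_query(trajectories, region, t_start, t_end):
    """Return ids of trajectories with at least one sample inside
    `region` during [t_start, t_end]. Linear scan, no index."""
    hits = []
    for tid, points in trajectories.items():
        if any(region.contains(x, y) and t_start <= t <= t_end
               for x, y, t in points):
            hits.append(tid)
    return hits

# Each sample is (x, y, t); t is hours since midnight (assumed unit).
trajectories = {
    "Tr1": [(0.0, 0.0, 13.0), (2.0, 2.0, 14.5)],
    "Tr2": [(5.0, 5.0, 15.0)],
    "Tr3": [(2.5, 2.5, 18.0)],
}
R = Rect(1.0, 1.0, 3.0, 3.0)
result = range_query(trajectories, R, 14.0, 16.0)
print(result)
```

Every query here touches every sample of every trajectory, which is the cost that the indexing structures later in the lecture aim to avoid.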
[Figure: A) Range Query with region R; B) KNN Point Query with query points q1, q2, over trajectories Tr1, Tr2, Tr3]
v E.g., retrieve the trajectories of people with the minimum aggregated distance to a set of query points
§ Publications: [1][2] for a single point query, [3] for multiple query points
v E.g., retrieve the trajectories of people with the minimum aggregated distance to a query trajectory
§ Publications: Chen et al., SIGMOD05; Vlachos et al., ICDE02; Yi et al., ICDE98.
[Figure: C) KNN Trajectory Query with query trajectory qt over trajectories Tr1, Tr2, Tr3]
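A brute-force sketch may help make "minimum aggregated distance to a set of query points" concrete. It makes two assumptions that the cited papers may define differently: the distance from a query point to a trajectory is the Euclidean distance to its closest sample, and the aggregation is a plain sum. The function names and sample data are illustrative only.

```python
import math

def point_to_trajectory(q, traj):
    """Distance from query point q to the closest sample of traj."""
    return min(math.dist(q, p) for p in traj)

def aggregated_distance(queries, traj):
    """Sum of the per-query-point distances (assumed aggregation)."""
    return sum(point_to_trajectory(q, traj) for q in queries)

def knn_trajectories(queries, trajectories, k):
    """Ids of the k trajectories with the smallest aggregated distance."""
    ranked = sorted(
        trajectories,
        key=lambda tid: aggregated_distance(queries, trajectories[tid]),
    )
    return ranked[:k]

trajectories = {
    "Tr1": [(0, 0), (1, 1), (2, 2)],
    "Tr2": [(0, 3), (1, 4), (2, 5)],
    "Tr3": [(5, 0), (6, 1)],
}
queries = [(0, 0), (2, 2)]
nearest = knn_trajectories(queries, trajectories, k=2)
print(nearest)
```

With a single query point this reduces to the KNN point query; replacing `queries` with the samples of a query trajectory gives one simple, though not the only, notion of a KNN trajectory query.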
[1] E. Frentzos, et al. Algorithms for nearest neighbor search on moving object trajectories. Geoinformatica, 2007.
[2] D. Pfoser, et al. Novel approaches in query processing for moving object trajectories. VLDB, 2000.
[3] Zaiben Chen, et al. Searching Trajectories by Locations: An Efficiency Study. SIGMOD, 2010.
v Temporal Indexing (1-D data)
v Space Partition-Based Indexing Structures (2-D
v Example
v B-tree is the most commonly used data
v It is fully dynamic, that is it can grow and
v Root node - contains node pointers to
v Branch node - contains pointers to leaf
v Leaf node - contains index items
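For 1-D temporal data, what matters in a B-tree is that keys stay ordered, so a time range maps to a contiguous run of index items. The sketch below is not a B-tree: it swaps in a sorted Python list with `bisect` as a simplified stand-in for the ordered leaf level, with no root or branch nodes. The class name `TemporalIndex` and the sample data are assumptions.

```python
import bisect

class TemporalIndex:
    """Sorted-array stand-in for the ordered leaf level of a B-tree."""

    def __init__(self):
        self._times = []  # sorted timestamps (the keys)
        self._items = []  # index items, parallel to _times

    def insert(self, t, item):
        # Find the slot that keeps _times sorted, then insert there.
        i = bisect.bisect_right(self._times, t)
        self._times.insert(i, t)
        self._items.insert(i, item)

    def range(self, t_start, t_end):
        """Items whose timestamp lies in [t_start, t_end]."""
        lo = bisect.bisect_left(self._times, t_start)
        hi = bisect.bisect_right(self._times, t_end)
        return self._items[lo:hi]

index = TemporalIndex()
for t, item in [(15.0, "c"), (13.0, "a"), (14.5, "b"), (18.0, "d")]:
    index.insert(t, item)
afternoon = index.range(14.0, 16.0)
print(afternoon)
```

Lookups are logarithmic as in a B-tree, but each insert shifts list elements and costs linear time; avoiding that shifting is what the B-tree's node structure is for.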
v Temporal Indexing (1-D data)
v Space Partition-Based Indexing Structures
[Figure: grid index example: g1 → p1, p3; g2 → p4]
v Indexing
§ Partition the space into disjoint and uniform grids
§ Build an index between each grid and the points in the grid
v Range Query
§ Find the grids intersecting the range query
§ Retrieve the points from the grids and identify the points in the range
[Figure: points p1 to p4 in grids g1 to g4]
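The indexing and range query steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the `GridIndex` class, its method names, the cell size, and the point coordinates are all hypothetical, and the grid is stored as a dictionary from cell coordinates to the points in that cell.

```python
from collections import defaultdict

class GridIndex:
    """Uniform grid: maps each cell (col, row) to the points inside it."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)  # (col, row) -> [(pid, x, y)]

    def _cell(self, x, y):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, pid, x, y):
        self.cells[self._cell(x, y)].append((pid, x, y))

    def range_query(self, xmin, ymin, xmax, ymax):
        # Step 1: find the cells intersecting the query rectangle.
        c0, r0 = self._cell(xmin, ymin)
        c1, r1 = self._cell(xmax, ymax)
        hits = []
        for col in range(c0, c1 + 1):
            for row in range(r0, r1 + 1):
                # Step 2: filter the points of each candidate cell.
                for pid, x, y in self.cells.get((col, row), []):
                    if xmin <= x <= xmax and ymin <= y <= ymax:
                        hits.append(pid)
        return sorted(hits)

grid = GridIndex(cell_size=10.0)
for pid, x, y in [("p1", 2, 3), ("p2", 12, 4), ("p3", 8, 9), ("p4", 15, 15)]:
    grid.insert(pid, x, y)
found = grid.range_query(5, 0, 13, 10)
print(found)
```

Only points in intersecting cells are checked against the rectangle, so the work depends on the query size and local density rather than on the total number of points.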
v Nearest neighbor query
§ Euclidean distance
§ Road network distance is quite different
The nearest object is within the grid The nearest object is
Fast approximation
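The difference between the fast approximation and an exact answer can be sketched with Euclidean distance. The assumptions here are that the fast approximation inspects only the query's own cell and that the exact search expands square rings of cells until no unseen cell can hold a closer point; all function names and data are hypothetical. Road network distance would need a different distance function and stopping rule.

```python
import math
from collections import defaultdict

def build_grid(points, cell_size):
    cells = defaultdict(list)  # (col, row) -> [(pid, x, y)]
    for pid, (x, y) in points.items():
        cells[(int(x // cell_size), int(y // cell_size))].append((pid, x, y))
    return cells

def nearest_in_cells(q, cells, keys):
    """Best (distance, pid) among the points stored in the given cells."""
    best = (math.inf, None)
    for key in keys:
        for pid, x, y in cells.get(key, []):
            best = min(best, (math.dist(q, (x, y)), pid))
    return best

def approx_nn(q, cells, cell_size):
    """Fast approximation: look only in the query's own cell."""
    home = (int(q[0] // cell_size), int(q[1] // cell_size))
    return nearest_in_cells(q, cells, [home])[1]

def exact_nn(q, cells, cell_size):
    """Expand rings of cells until no unseen cell can hold a closer point."""
    if not cells:
        return None
    cq, rq = int(q[0] // cell_size), int(q[1] // cell_size)
    max_ring = max(max(abs(c - cq), abs(r - rq)) for c, r in cells)
    best = (math.inf, None)
    for ring in range(max_ring + 1):
        keys = [(c, r)
                for c in range(cq - ring, cq + ring + 1)
                for r in range(rq - ring, rq + ring + 1)
                if max(abs(c - cq), abs(r - rq)) == ring]
        best = min(best, nearest_in_cells(q, cells, keys))
        # Points in farther rings are at least ring * cell_size away.
        if best[0] <= ring * cell_size:
            break
    return best[1]

points = {"p1": (1, 1), "p2": (11, 9)}
cells = build_grid(points, cell_size=10.0)
q = (9, 9)
print(approx_nn(q, cells, 10.0), exact_nn(q, cells, 10.0))
```

In this example the query sits near a cell border, so the point in its own cell (p1) is farther away than a point in the neighboring cell (p2), which is the case where the fast approximation gives the wrong answer.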
v Advantages
§ Easy to implement and understand
§ Very efficient for processing range and nearest neighbor queries
v Disadvantages
§ Index size could be big
§ Difficult to deal with unbalanced data
§ Think about what we discussed last time
– Each node of a quad-tree is associated with a rectangular region of space; the top node is associated with the entire target space.
– Each non-leaf node divides its region into four equal-sized quadrants.
– Leaf nodes have between zero and some fixed maximum number of points (set to 1 in example).
[Figure: quad-tree example with quadrants labeled 1, 2, 3 and sub-quadrants labeled 00, 02, 03, 20, 23, 30, 31, 32, 33]
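The three quad-tree properties above can be sketched as a small point quad-tree with a leaf capacity of 1, as in the example. The `QuadTree` class, its method names, and the quadrant ordering are assumptions for illustration, and the sketch assumes distinct points (inserting the same point twice with capacity 1 would split forever).

```python
class QuadTree:
    """Square region that splits into four equal quadrants when full."""

    def __init__(self, x, y, size, capacity=1):
        self.x, self.y, self.size = x, y, size  # lower-left corner, side
        self.capacity = capacity  # max points per leaf
        self.points = []          # used only while this node is a leaf
        self.children = None      # four sub-quadrants once split

    def _child_for(self, px, py):
        half = self.size / 2
        col = 1 if px >= self.x + half else 0
        row = 1 if py >= self.y + half else 0
        return self.children[2 * row + col]

    def _split(self):
        half = self.size / 2
        self.children = [
            QuadTree(self.x + dx * half, self.y + dy * half, half, self.capacity)
            for dy in (0, 1) for dx in (0, 1)
        ]
        # Push the stored points down into the new quadrants.
        for p in self.points:
            self._child_for(*p).insert(*p)
        self.points = []

    def insert(self, px, py):
        if self.children is not None:
            self._child_for(px, py).insert(px, py)
        elif len(self.points) < self.capacity:
            self.points.append((px, py))
        else:
            self._split()
            self._child_for(px, py).insert(px, py)

    def depth(self):
        if self.children is None:
            return 0
        return 1 + max(c.depth() for c in self.children)

    def all_points(self):
        if self.children is None:
            return list(self.points)
        return [p for c in self.children for p in c.all_points()]

tree = QuadTree(0, 0, 8)
for p in [(1, 1), (6, 6), (1, 3)]:
    tree.insert(*p)
print(tree.depth(), sorted(tree.all_points()))
```

Unlike the uniform grid, the tree only subdivides where points are dense: here the lower-left quadrant is split a second time while the other three quadrants stay as single leaves.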
v Temporal Indexing (1-D data)
v Space Partition-Based Indexing Structures (2-D
[Figure: Taxi GPS Trajectory; Mobile User Trajectory]
(System: A cluster of 3 machines with 24 Intel X5670 2.93GHz processors, 94GB memory.)
[Figure: Data Indexing Structure (ST-indexed data with Lng, Lat, Time; index leaf node list; inverted index from r1, r2, r3, … to occurrence time) and Sampling and Estimation (query q; B sampled index leaf nodes; trajectory list r1, r2 | r3, r5 | r6, r7 | r9, r10 | …)]
is the maximum number of trajectories in an index leaf node.
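The idea of sampling B index leaf nodes and extrapolating can be sketched with a generic scale-up estimator. This is not the estimator from the lecture, whose formulas are not reproduced in this text. It assumes n is the number of index leaf nodes, that leaves are sampled uniformly without replacement, and that the quantity of interest is a sum of per-leaf values; scaling the sample sum by n/B is then unbiased. All names and the synthetic data are hypothetical.

```python
import random

def estimate_total(leaf_values, B, rng):
    """Sample B of the n leaves uniformly without replacement and
    scale the sample sum by n / B."""
    n = len(leaf_values)
    sample = rng.sample(range(n), B)
    return (n / B) * sum(leaf_values[i] for i in sample)

def relative_error(estimate, truth):
    """Signed relative error of an estimate against the ground truth."""
    return (estimate - truth) / truth

rng = random.Random(0)
# Synthetic per-leaf counts standing in for real index leaf nodes.
leaf_values = [rng.randint(0, 50) for _ in range(1000)]
truth = sum(leaf_values)

# Averaging many independent estimates approaches the ground truth.
estimates = [estimate_total(leaf_values, B=100, rng=rng) for _ in range(2000)]
mean_estimate = sum(estimates) / len(estimates)
print(relative_error(mean_estimate, truth))
```

A single estimate can be off by several percent at B/n = 10% on this synthetic data; the spread shrinks as B grows, which is the accuracy versus processing time trade-off that the evaluation measures.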
v Dataset: 3TB real human mobility data in a large city
v Baseline Algorithm
v Evaluation metric
Statistics                     Value
City size                      400 square miles
City population size           three million people
Duration                       eight days at the end of 2010
Number of trajectories         109,914 3G users
# of spatio-temporal points    400 million (407,040,083)
[Figure: two panels plotted against B/n (%). Relative error ratio for n=7k, n=13k, n=23k, with the ground truth as reference; query processing time (s) for n=7k (ES: 112s), n=13k (ES: 115s), n=23k (ES: 117s).]
Relative error: up to 2%
Processing time: 5 times reduction
Unbiased Estimators:
v Approximate query processing
v 6 weeks
v https://weka.waikato.ac.nz/dataminingwithweka/preview
v https://www.futurelearn.com/courses/data-mining-with-