[PPT] - DS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li PowerPoint Presentation

SLIDE 1

DS504/CS586: Big Data Analytics Data Management

Prof. Yanhua Li

Welcome to

Time: 6:00pm–8:50pm R; Location: AK233; Spring 2018

SLIDE 2

[Figure: Urban Computing as a win-win-win among People, Cities (OS), and The Environment]

Tackle the Big challenges in Big cities using Big data!

Urban Sensing & Data Acquisition

Participatory Sensing, Crowd Sensing, Mobile Sensing
Data sources: Traffic, Road Networks, POIs, Air Quality, Human Mobility, Meteorology, Social Media, Energy

Urban Data Management

Spatio-temporal index, streaming, trajectory, and graph data management,...

Urban Data Analytics

Data Mining, Machine Learning, Visualization

Service Providing

Improve urban planning, Ease Traffic Congestion, Save Energy, Reduce Air Pollution, ...

Urban Computing: Concepts, Methodologies, and Applications. Zheng, Y., et al. ACM Transactions on Intelligent Systems and Technology.

SLIDE 3

2D-Spatial Queries

K Nearest Neighbour (KNN) Queries: Given a point or an object, find the nearest object that satisfies given conditions.

Region (Range) Query: Ask for objects that lie partially or fully inside a specified region.
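The two query types can be sketched with a brute-force scan, before any index is involved. This is a minimal illustration only: the sample points and the function names `range_query` and `knn_query` are assumptions made for this example, and objects are simplified to 2-D points.

```python
import heapq
import math

# Hypothetical sample data: (object id, x, y)
points = [("a", 1.0, 1.0), ("b", 2.0, 3.0), ("c", 5.0, 4.0), ("d", 6.0, 1.0)]

def range_query(points, xmin, ymin, xmax, ymax):
    """Return ids of points inside the axis-aligned rectangle (borders included)."""
    return [pid for pid, x, y in points if xmin <= x <= xmax and ymin <= y <= ymax]

def knn_query(points, qx, qy, k):
    """Return ids of the k points closest to (qx, qy) by Euclidean distance."""
    nearest = heapq.nsmallest(k, points, key=lambda p: math.hypot(p[1] - qx, p[2] - qy))
    return [pid for pid, _, _ in nearest]

in_region = range_query(points, 0.0, 0.0, 3.0, 3.0)
two_nearest = knn_query(points, 5.0, 3.0, 2)
```

Both functions touch every point, which is the cost that the indexing structures on the following slides aim to reduce.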

SLIDE 4

Trajectory Data Management

v Range queries

§ E.g. Retrieve the trajectories of vehicles passing a given rectangular region R between 2pm-4pm in the past month

v KNN queries

§ KNN point query. E.g. Retrieve the trajectories of people with the minimum aggregated distance to a set of query points. Publications: [1][2] for a single point query, [3] for multiple query points
§ KNN trajectory query. E.g. Retrieve the trajectories of people with the minimum aggregated distance to a query trajectory. Publications: Chen et al., SIGMOD05; Vlachos et al., ICDE02; Yi et al., ICDE98.

[Figure: A) Range Query over region R, B) KNN Point Query with query points q1, q2, C) KNN Trajectory Query with query trajectory qt; trajectories Tr1, Tr2, Tr3]

[1] E. Frentzos, et al. Algorithms for nearest neighbor search on moving object trajectories. Geoinformatica, 2007.
[2] D. Pfoser, et al. Novel approaches in query processing for moving object trajectories. VLDB, 2000.
[3] Zaiben Chen, et al. Searching Trajectories by Locations: An Efficiency Study. SIGMOD, 2010.
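A KNN point query over trajectories can be sketched as follows. The distance definition here is an assumption chosen for illustration (each query point is matched to its closest sample point on the trajectory, and the per-point distances are summed); the cited papers define their own measures, and the trajectories below are made up.

```python
import math

# Hypothetical trajectories: id -> list of (x, y) sample points
trajectories = {
    "Tr1": [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],
    "Tr2": [(0.0, 3.0), (2.0, 3.0), (4.0, 3.0)],
    "Tr3": [(5.0, 0.0), (5.0, 1.0), (5.0, 2.0)],
}

def point_to_trajectory(q, traj):
    """Distance from query point q to its closest sample point on the trajectory."""
    return min(math.dist(q, p) for p in traj)

def aggregated_distance(queries, traj):
    """Sum of the per-query-point distances (assumed aggregation)."""
    return sum(point_to_trajectory(q, traj) for q in queries)

def knn_trajectories(trajectories, queries, k):
    """Return the ids of the k trajectories with the smallest aggregated distance."""
    ranked = sorted(trajectories,
                    key=lambda tid: aggregated_distance(queries, trajectories[tid]))
    return ranked[:k]

query_points = [(1.0, 1.0), (2.0, 3.0)]
best = knn_trajectories(trajectories, query_points, 1)
top_two = knn_trajectories(trajectories, query_points, 2)
```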

SLIDE 5

Spatial/Temporal Indexing Structures

v Temporal Indexing (1-D data)

§ List index
§ B-tree

v Space Partition-Based Indexing Structures (2-D data)

§ Grid-based
§ Quad-tree

SLIDE 6

List Index Structure

v Example

§ From YouTube prefixes
§ To YouTube video IDs
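One way to read this example is as a lookup table from each prefix to the list of IDs that begin with it. The sketch below assumes that reading; the function name `build_list_index`, the prefix length, and the IDs are all invented for illustration.

```python
from collections import defaultdict

def build_list_index(video_ids, prefix_len=2):
    """Map each prefix to the list of video IDs that start with it."""
    index = defaultdict(list)
    for vid in video_ids:
        index[vid[:prefix_len]].append(vid)
    return index

# Made-up IDs for illustration only
index = build_list_index(["abX1", "abY2", "cdZ3"])
matches = index["ab"]
```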

SLIDE 7

Full B-Tree Structure

SLIDE 8

B-Tree Index

v The B-tree is one of the most commonly used data structures for indexing.

v It is fully dynamic; that is, it can grow and shrink.

SLIDE 9

Three Types of B-Tree Nodes

v Root node - contains node pointers to branch nodes.

v Branch node - contains pointers to leaf nodes or other branch nodes.

v Leaf node - contains index items.
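The roles of the three node types can be sketched with a lookup that walks from the root to a leaf. This is a simplified, read-only illustration: the `Leaf` and `Branch` classes are hypothetical, the root is just the topmost branch node, and node splitting and merging (the "grow and shrink" behaviour) are not implemented.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass

@dataclass
class Leaf:
    keys: list      # sorted keys, e.g. timestamps
    items: list     # index items, one per key

@dataclass
class Branch:       # the root is simply the topmost Branch
    keys: list      # sorted separator keys
    children: list  # len(children) == len(keys) + 1

def search(node, key):
    """Follow pointers from the root down to a leaf, then look the key up there."""
    while isinstance(node, Branch):
        # Child i holds the keys k with keys[i-1] <= k < keys[i].
        node = node.children[bisect_right(node.keys, key)]
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.items[i]
    return None

root = Branch(keys=[10, 20], children=[
    Leaf([1, 5], ["a", "b"]),
    Leaf([10, 15], ["c", "d"]),
    Leaf([20, 25], ["e", "f"]),
])
found = search(root, 15)
missing = search(root, 7)
```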

SLIDE 10

Spatial/Temporal Indexing Structures

v Temporal Indexing (1-D data)

§ List index
§ B-tree

v Space Partition-Based Indexing Structures (2-D data)

§ Grid-based
§ Quad-tree

SLIDE 11

Grid-based Spatial Indexing

[Figure: grid cells g1 and g2 and the points p1, p3, p4 indexed under them]

v Indexing

§ Partition the space into disjoint and uniform grids
§ Build an index between each grid and the points in the grid
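The two indexing steps can be sketched as a dictionary from grid cell to point ids. The function name `build_grid_index`, the cell size, and the coordinates are assumptions for illustration.

```python
from collections import defaultdict

def build_grid_index(points, cell_size):
    """Map each grid cell (col, row) to the ids of the points falling inside it."""
    index = defaultdict(list)
    for pid, x, y in points:
        cell = (int(x // cell_size), int(y // cell_size))
        index[cell].append(pid)
    return index

# Hypothetical coordinates: p1 and p3 share a cell, p4 falls in the next one
points = [("p1", 0.5, 0.5), ("p3", 0.8, 0.2), ("p4", 1.5, 0.5)]
grid = build_grid_index(points, cell_size=1.0)
```

Only non-empty cells are stored here, which is one way to keep the index small when much of the space is empty.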

SLIDE 12

Grid-based Spatial Indexing

v Range Query

§ Find the grids intersecting the range query
§ Retrieve the points from the grids and identify the points in the range

[Figure: range query over grid cells g1, g2, g3, g4 containing points p1, p2, p3, p4]
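A range query on a grid index follows the two steps listed on this slide: visit only the cells that intersect the rectangle, then filter the points stored in them. The sketch below is an assumption-laden illustration (hypothetical function names, made-up coordinates, uniform square cells).

```python
from collections import defaultdict

def build_grid_index(points, cell_size):
    """Map each grid cell (col, row) to the (id, x, y) records inside it."""
    index = defaultdict(list)
    for pid, x, y in points:
        index[(int(x // cell_size), int(y // cell_size))].append((pid, x, y))
    return index

def grid_range_query(index, cell_size, xmin, ymin, xmax, ymax):
    """Step 1: visit only cells intersecting the rectangle. Step 2: filter their points."""
    hits = []
    for col in range(int(xmin // cell_size), int(xmax // cell_size) + 1):
        for row in range(int(ymin // cell_size), int(ymax // cell_size) + 1):
            for pid, x, y in index.get((col, row), []):
                if xmin <= x <= xmax and ymin <= y <= ymax:
                    hits.append(pid)
    return sorted(hits)

points = [("p1", 0.5, 1.5), ("p2", 1.2, 1.8), ("p3", 1.6, 1.1), ("p4", 1.5, 0.4)]
index = build_grid_index(points, cell_size=1.0)
result = grid_range_query(index, 1.0, 1.0, 1.0, 1.9, 1.9)
```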

SLIDE 13

Grid-based Spatial Indexing

v Nearest neighbor query

§ Euclidean distance
§ Road network distance is quite different

[Figure: two cases for a query point and candidate objects p1, p2]

Case 1: The nearest object is within the grid
Case 2: The nearest object is outside the grid

Fast approximation
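The two cases can be sketched under Euclidean distance. The code assumes that the fast approximation means searching only the query point's own cell; that reading, the function names, and the data are assumptions. The exact search expands ring by ring around the query cell and stops once no unvisited cell can hold a closer point.

```python
import math
from collections import defaultdict

def build_grid_index(points, cell_size):
    index = defaultdict(list)
    for pid, x, y in points:
        index[(int(x // cell_size), int(y // cell_size))].append((pid, x, y))
    return index

def own_cell_nearest(index, cell_size, qx, qy):
    """Assumed fast approximation: look only inside the query point's own cell."""
    cands = index.get((int(qx // cell_size), int(qy // cell_size)), [])
    if not cands:
        return None
    return min(cands, key=lambda p: math.hypot(p[1] - qx, p[2] - qy))[0]

def ring_cells(cx, cy, r):
    """Cells at Chebyshev distance exactly r from cell (cx, cy)."""
    if r == 0:
        return [(cx, cy)]
    cells = []
    for dx in range(-r, r + 1):
        cells += [(cx + dx, cy - r), (cx + dx, cy + r)]
    for dy in range(-r + 1, r):
        cells += [(cx - r, cy + dy), (cx + r, cy + dy)]
    return cells

def grid_nearest(index, cell_size, qx, qy):
    """Exact nearest neighbour: expand rings of cells until the best hit is provably closest."""
    if not index:
        return None
    cx, cy = int(qx // cell_size), int(qy // cell_size)
    max_r = max(max(abs(c - cx), abs(w - cy)) for c, w in index)
    best_id, best_d = None, math.inf
    for r in range(max_r + 1):
        for cell in ring_cells(cx, cy, r):
            for pid, x, y in index.get(cell, []):
                d = math.hypot(x - qx, y - qy)
                if d < best_d:
                    best_id, best_d = pid, d
        # Every unvisited cell is at least r * cell_size away from the query point.
        if best_d <= r * cell_size:
            break
    return best_id

index = build_grid_index([("p1", 0.1, 0.5), ("p2", 1.1, 0.5)], cell_size=1.0)
approx = own_cell_nearest(index, 1.0, 0.9, 0.5)   # p1 shares the query's cell
exact = grid_nearest(index, 1.0, 0.9, 0.5)        # p2 is closer but in the next cell
```

Road network distance would need a different distance function and stopping rule; this sketch covers the Euclidean case only.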

SLIDE 14

Grid-based Spatial Indexing

v Advantages

§ Easy to implement and understand
§ Very efficient for processing range and nearest neighbor queries

v Disadvantages

§ Index size could be big
§ Difficult to deal with unbalanced data
§ Think about what we discussed last time on POI sampling and estimation.

SLIDE 15

Quad-Tree

Indexing

– Each node of a quad-tree is associated with a rectangular region of space; the top node is associated with the entire target space.
– Each non-leaf node divides its region into four equal-sized quadrants.
– Leaf nodes have between zero and some fixed maximum number of points (set to 1 in the example).

[Figure: quad-tree example with top-level quadrants 0 to 3 and sub-quadrants 00, 02, 03, 30, 31, 32, 33]
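The three indexing rules can be sketched as a small point quad-tree with a leaf capacity of 1, together with a range query that prunes quadrants not intersecting the query rectangle. The class name `QuadTree`, the `max_depth` guard, and the sample points are assumptions for illustration; points outside the root square are not handled.

```python
class QuadTree:
    def __init__(self, x, y, size, capacity=1, depth=0, max_depth=16):
        self.x, self.y, self.size = x, y, size   # lower-left corner and side length
        self.capacity, self.depth, self.max_depth = capacity, depth, max_depth
        self.points = []        # used only while this node is a leaf
        self.children = None    # four equal quadrants once the node is split

    def insert(self, pid, px, py):
        if self.children is not None:
            self._child_for(px, py).insert(pid, px, py)
            return
        self.points.append((pid, px, py))
        # Split an over-full leaf; max_depth guards against duplicate coordinates.
        if len(self.points) > self.capacity and self.depth < self.max_depth:
            half = self.size / 2
            self.children = [
                QuadTree(self.x + dx * half, self.y + dy * half, half,
                         self.capacity, self.depth + 1, self.max_depth)
                for dy in (0, 1) for dx in (0, 1)
            ]
            pending, self.points = self.points, []
            for p in pending:
                self._child_for(p[1], p[2]).insert(*p)

    def _child_for(self, px, py):
        half = self.size / 2
        col = 1 if px >= self.x + half else 0
        row = 1 if py >= self.y + half else 0
        return self.children[row * 2 + col]

    def range_query(self, xmin, ymin, xmax, ymax):
        # Prune any node whose square does not intersect the query rectangle.
        if (xmax < self.x or xmin > self.x + self.size or
                ymax < self.y or ymin > self.y + self.size):
            return []
        if self.children is None:
            return [pid for pid, px, py in self.points
                    if xmin <= px <= xmax and ymin <= py <= ymax]
        hits = []
        for child in self.children:
            hits.extend(child.range_query(xmin, ymin, xmax, ymax))
        return hits

tree = QuadTree(0.0, 0.0, 8.0)
for pid, px, py in [("a", 1, 1), ("b", 6, 6), ("c", 5, 7), ("d", 2, 5)]:
    tree.insert(pid, px, py)
upper_right = sorted(tree.range_query(4, 4, 8, 8))
```

Unlike a uniform grid, the tree only subdivides where points are dense, which is how it copes with unbalanced data.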

SLIDE 16

Quad-Tree

Range query (ok)

[Figure: range query on the quad-tree example; quadrants 0 to 3 and sub-quadrants 00, 02, 03, 20, 23, 30, 31, 32, 33]

SLIDE 17

Quad-Tree

Nearest Neighbour Query (hard)

[Figure: nearest neighbour query on the quad-tree example; quadrants 0 to 3 and sub-quadrants 00, 02, 03, 20, 23, 30, 31, 32, 33]

SLIDE 18

Spatial/Temporal (3D) Indexing Structures

v Temporal Indexing (1-D data)

§ List index
§ B-tree

v Space Partition-Based Indexing Structures (2-D data)

§ Grid-based
§ Quad-tree

SLIDE 19

Sampling Big Trajectory Data

SLIDE 20

Big Trajectory Data in Urban Networks

Taxi GPS Trajectory Mobile User Trajectory

Urban roving sensors deliver big trajectory data.
Reveal moving patterns and urban issues.

Challenge: How to manage big trajectory data to enable efficient query processing?

SLIDE 21

Trajectory Aggregate Query

A trajectory aggregate query retrieves statistics of distinct trajectories passing a user-specified spatio-temporal region.

Examples:
# of taxi trips with average speed of more than 5 miles per hour in New York City in 2014;
# of mobile users with iPhone in Hong Kong in 2013.

SLIDE 22

Exhaustive Search

ri: a trajectory, i.e. a sequence of GPS points, each of the form (TID, Lat, Lng, Time)
q: a trajectory aggregate query with Nq trajectories
Spatio-temporal indexing: B-tree, Quad-tree, etc.
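Exhaustive search can be sketched as a full scan that collects the distinct trajectory ids whose GPS points fall inside the query region. The records, the bounding values, and the function name `exhaustive_count` are made up for illustration; a real system would use the spatio-temporal index to narrow the scan.

```python
# Hypothetical GPS records in the slide's (TID, Lat, Lng, Time) layout
records = [
    ("r1", 22.54, 114.05, 10), ("r1", 22.55, 114.06, 20),
    ("r2", 22.60, 114.10, 15), ("r3", 22.54, 114.05, 99),
]

def exhaustive_count(records, lat_min, lat_max, lng_min, lng_max, t_min, t_max):
    """Scan every GPS point and count the distinct trajectories inside the region."""
    tids = {tid for tid, lat, lng, t in records
            if lat_min <= lat <= lat_max
            and lng_min <= lng <= lng_max
            and t_min <= t <= t_max}
    return len(tids)

n_q = exhaustive_count(records, 22.5, 22.56, 114.0, 114.07, 0, 50)
```

Trajectory r1 has two points in the region but is counted once, because the query asks for distinct trajectories.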

SLIDE 23

Challenges with Big Trajectory Data

Long response time for a large trajectory dataset
Shenzhen, China, in 2013
Query: # of iPhone users and taxi/bus trips

(System: A cluster of 3 machines with 24 Intel X5670 2.93GHz processors, 94GB memory.)

Dataset        Size       Coverage
Mobility Data  788.6 TB   6 million users
Taxi GPS       1.58 TB    22,083 taxis
Bus GPS        1.34 TB    8,427 buses

Query          Exact answer
iPhone Users   0.8 million
Taxi GPS       302 million trips
Bus GPS        1.38 billion trips

12 minutes to get the exact answers!

SLIDE 24

Key Challenges on Exact Answer

A trajectory ri may traverse multiple index leaf nodes
Cannot pre-compute and store the results on indices
Summing up two answers leads to over-counting

[Figure: trajectories r1 and r2 over two index leaf nodes, with per-node counts 2 and 1]
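The over-counting problem can be shown with two leaf nodes. The leaf contents below are hypothetical: r1 traverses both leaf nodes and r2 only the first.

```python
# Hypothetical leaf contents: r1 traverses both leaf nodes, r2 only the first
leaf_nodes = {"leaf_a": {"r1", "r2"}, "leaf_b": {"r1"}}

# Adding pre-computed per-leaf counts counts r1 twice
summed = sum(len(tids) for tids in leaf_nodes.values())

# The true answer needs the union of trajectory ids
distinct = len(set().union(*leaf_nodes.values()))
```

The summed count is 3 while only 2 distinct trajectories exist, which is why per-leaf counts cannot simply be stored and added.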

SLIDE 25

Motivation & Problem Definition

How to sample B index leaf nodes to estimate # of trajectories in q with a guaranteed error bound?

q covers n index leaf nodes

SLIDE 26

Random Index Sampling

[Figure: Random Index Sampling framework in three parts.
Data: ST-indexed data over Lat, Lng, Time.
Indexing Structure: an index leaf node list in which each leaf node holds a trajectory list (e.g. r1, r2; r3, r5; r6, r7; r9, r10), plus an inverted index from each trajectory (r1, r2, r3, ...) to its index leaf node list and its occurrence time kq.
Sampling and Estimation: B sampled index leaf nodes for query q.]

SLIDE 27

Random Index Sampling

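Stage 1: Sampling Stage
Sample B index leaf nodes uniformly at random, with replacement.

Stage 2: Estimation Stage (Unbiased Estimator)
[Estimator formula missing in the source]

Convergence analysis:
[Formulas missing in the source: a condition on B, the resulting guarantee, and a symbol defined as the maximum number of trajectories in an index leaf node.]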
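The two stages can be sketched under an assumed form of the estimator. Each sampled leaf contributes the sum of 1/k over its trajectories, where k is the number of leaf nodes covered by q in which that trajectory occurs, and the average contribution is scaled by n. This form is consistent with the occurrence counts kept in the inverted index, but it is an assumption and not confirmed as the deck's exact formula. The leaf contents and function names are hypothetical.

```python
import random
from collections import Counter

def occurrence_counts(leaves):
    """k[r] = number of leaf nodes covered by q that contain trajectory r.
    Computed by a scan here; the inverted index would supply it in practice."""
    return Counter(r for leaf in leaves for r in set(leaf))

def ris_estimate(leaves, B, rng=None):
    """Estimate the number of distinct trajectories from B sampled leaf nodes."""
    rng = rng or random.Random()
    n = len(leaves)
    k = occurrence_counts(leaves)
    total = 0.0
    for _ in range(B):
        leaf = rng.choice(leaves)                      # uniform, with replacement
        total += sum(1.0 / k[r] for r in set(leaf))    # assumed per-leaf contribution
    return n * total / B

# Hypothetical leaf nodes covered by q; r1 and r2 each occur in two leaves
leaves = [["r1", "r2"], ["r3", "r5"], ["r6", "r7", "r1"], ["r9", "r10", "r2"]]
exact = len({r for leaf in leaves for r in leaf})
estimate = ris_estimate(leaves, B=2000, rng=random.Random(0))
```

Summing the per-leaf contributions over all n leaves gives exactly the number of distinct trajectories, because each trajectory's k shares of 1/k add up to 1. That is why the scaled sample average is unbiased under this assumption.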

SLIDE 28

Evaluation

v Dataset: 3TB real human mobility data in a large city in eastern China

v Baseline Algorithm

§ Exhaustive search

v Evaluation metric

§ Relative error & response time

Statistics                    Value
City size                     400 square miles
City population size          three million people
Duration                      eight days at the end of 2010
Number of trajectories        109,914 3G users
# of spatio-temporal points   400 million (407,040,083)

SLIDE 29

Evaluation Results

[Figure, Relative error: relative error ratio vs. B/n (%) for n=7k, n=13k, n=23k, compared with the ground truth]

[Figure, Processing time: query processing time (s) vs. B/n (%) for n=7k (ES: 112s), n=13k (ES: 115s), n=23k (ES: 117s)]

Relative error: up to 2%
Processing time: 5 times reduction

SLIDE 30

Concurrent Random Index Sampling

Practical Issue:
A large number of concurrent aggregate queries
Idea of Concurrent Random Index Sampling (CRIS):
Sampling Reuse
Stratified Sampling Technique

SLIDE 31

Concurrent Random Index Sampling

Unbiased Estimators:

SLIDE 32

Summary

v Approximate query processing

§ Single trajectory aggregate query via Random Index Sampling (RIS)
§ Concurrent trajectory aggregate queries via Concurrent Random Index Sampling (CRIS)

SLIDE 33

Any Comments & Critiques?

SLIDE 34

Weka

v 6 weeks

v https://weka.waikato.ac.nz/dataminingwithweka/preview

v https://www.futurelearn.com/courses/data-mining-with-