DS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li



slide-1
SLIDE 1

Welcome to DS504/CS586: Big Data Analytics
Data Management

  • Prof. Yanhua Li

Time: 6:00pm–8:50pm R
Location: AK233
Spring 2018

slide-2
SLIDE 2

Urban Computing

[Figure: Cities, OS, People, The Environment (Win-Win-Win)]

Tackle the Big challenges in Big cities using Big data!

Urban Sensing & Data Acquisition

Participatory Sensing, Crowd Sensing, Mobile Sensing

Data sources: Traffic, Road Networks, POIs, Air Quality, Human Mobility, Meteorology, Social Media, Energy

Urban Data Management

Spatio-temporal index, streaming, trajectory, and graph data management,...

Urban Data Analytics

Data Mining, Machine Learning, Visualization

Service Providing

Improve urban planning, Ease Traffic Congestion, Save Energy, Reduce Air Pollution, ...

Zheng, Y., et al. Urban Computing: Concepts, Methodologies, and Applications. ACM Transactions on Intelligent Systems and Technology.

slide-3
SLIDE 3

2D-Spatial Queries

  • K Nearest Neighbour (KNN) Queries
    – Given a point or an object, find the nearest object that satisfies given conditions.
  • Region (Range) Query
    – Ask for objects that lie partially or fully inside a specified region.
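
As a baseline for the two query types above, here is a minimal brute-force sketch over 2D points. The function names and the sample points are made up for illustration; the indexing structures later in the deck exist to avoid scanning every point like this.

```python
import heapq
import math

def range_query(points, x_min, y_min, x_max, y_max):
    """Return the points lying inside the axis-aligned rectangle (inclusive)."""
    return [(x, y) for (x, y) in points
            if x_min <= x <= x_max and y_min <= y <= y_max]

def knn_query(points, q, k):
    """Return the k points closest to q by Euclidean distance, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, q))

# Illustrative data only.
points = [(1.0, 1.0), (2.0, 5.0), (4.0, 4.0), (7.0, 2.0)]
in_box = range_query(points, 0.0, 0.0, 4.0, 4.0)
nearest_two = knn_query(points, (3.0, 3.0), 2)
```

Both functions touch every point, so the cost grows linearly with the dataset size.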

slide-4
SLIDE 4

Trajectory Data Management

  • Range queries
    – E.g. Retrieve the trajectories of vehicles passing a given rectangular region R between 2pm and 4pm in the past month
  • KNN queries
    – KNN point query: E.g. Retrieve the trajectories of people with the minimum aggregated distance to a set of query points. Publications: [1][2] for a single point query, [3] for multiple query points
    – KNN trajectory query: E.g. Retrieve the trajectories of people with the minimum aggregated distance to a query trajectory. Publications: Chen et al, SIGMOD05; Vlachos et al, ICDE02; Yi et al, ICDE98.

[Figure: A) Range Query with region R over Tr1, Tr2, Tr3; B) KNN Point Query with query points q1, q2; C) KNN Trajectory Query with query trajectory qt]

[1] E. Frentzos, et al. Algorithms for nearest neighbor search on moving object trajectories. Geoinformatica, 2007.
[2] D. Pfoser, et al. Novel approaches in query processing for moving object trajectories. VLDB, 2000.
[3] Zaiben Chen, et al. Searching Trajectories by Locations: An Efficiency Study. SIGMOD, 2010.
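
The trajectory range query and the KNN point query on this slide can be sketched without any index as follows. This is an illustration under stated assumptions, not the method of the cited papers: trajectories are stored as lists of (x, y, t) samples, a trajectory "passes" a region if at least one sample falls inside it, and the aggregated distance is taken as the sum, over query points, of the distance to the closest sample. All names and data are hypothetical.

```python
import math

# Hypothetical layout: trajectory id -> list of (x, y, t) samples.
trajectories = {
    "Tr1": [(0.0, 0.0, 13.0), (2.0, 2.0, 14.5), (5.0, 5.0, 16.5)],
    "Tr2": [(9.0, 9.0, 14.0), (8.0, 8.0, 15.0)],
    "Tr3": [(1.0, 3.0, 9.0), (2.0, 3.0, 10.0)],
}

def trajectory_range_query(trajs, rect, t_start, t_end):
    """Ids of trajectories with a sample inside rect during [t_start, t_end]."""
    x_min, y_min, x_max, y_max = rect
    return sorted(
        tid for tid, pts in trajs.items()
        if any(x_min <= x <= x_max and y_min <= y <= y_max
               and t_start <= t <= t_end for x, y, t in pts)
    )

def aggregated_distance(pts, query_points):
    """Assumed definition: sum of closest-sample distances to each query point."""
    return sum(min(math.dist((x, y), q) for x, y, _ in pts)
               for q in query_points)

def knn_point_query(trajs, query_points, k):
    """The k trajectory ids with the smallest aggregated distance."""
    ranked = sorted(trajs,
                    key=lambda tid: aggregated_distance(trajs[tid], query_points))
    return ranked[:k]

hits = trajectory_range_query(trajectories, (1.0, 1.0, 3.0, 3.0), 14.0, 16.0)
best = knn_point_query(trajectories, [(2.0, 2.0), (5.0, 4.0)], 1)
```

A sample-based test like this can miss a trajectory whose path crosses the region between two samples; handling that would require interpolating segments.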

slide-5
SLIDE 5

Spatial/Temporal Indexing Structures

  • Temporal Indexing (1-D data)
    – List index
    – B-tree
  • Space Partition-Based Indexing Structures (2-D data)
    – Grid-based
    – Quad-tree

slide-6
SLIDE 6

List Index Structure

  • Example
    – From YouTube prefixes
    – To YouTube video IDs
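
Reading the example as a lookup list that maps an ID prefix to the full video IDs sharing it, a list index could be sketched as below. The prefix length, the function name, and the IDs are assumptions made up for illustration; the slide does not specify them.

```python
from collections import defaultdict

def build_prefix_index(video_ids, prefix_len=2):
    """Map each prefix to the list of video IDs that start with it."""
    index = defaultdict(list)
    for vid in video_ids:
        index[vid[:prefix_len]].append(vid)
    return index

# Made-up IDs, not real YouTube videos.
video_ids = ["aB3xQ9", "aB7kLm", "Zt0pR2"]
index = build_prefix_index(video_ids)
matches = index.get("aB", [])
```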

slide-7
SLIDE 7

Full B-Tree Structure

slide-8
SLIDE 8

B-Tree Index

  • The B-tree is one of the most commonly used data structures for indexing.
  • It is fully dynamic; that is, it can grow and shrink.

slide-9
SLIDE 9

Three Types of B-Tree Nodes

  • Root node: contains pointers to branch nodes.
  • Branch node: contains pointers to leaf nodes or other branch nodes.
  • Leaf node: contains index items.
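
A full B-tree with root, branch, and leaf nodes is longer than fits here. As a stand-in, the sketch below keeps timestamps in one sorted array and uses binary search, which gives the same kind of ordered lookup and temporal range scan that the leaf level of a B-tree provides. It is not a B-tree: inserts shift array elements instead of splitting nodes. The class name and data are hypothetical.

```python
import bisect

class SortedTimeIndex:
    """Sorted-array stand-in for a 1-D temporal index (not a real B-tree)."""

    def __init__(self):
        self._times = []   # timestamps, kept in ascending order
        self._items = []   # index items, aligned with _times

    def insert(self, t, item):
        # Binary search for the slot, then insert to keep the order.
        pos = bisect.bisect_right(self._times, t)
        self._times.insert(pos, t)
        self._items.insert(pos, item)

    def range(self, t_start, t_end):
        """Items whose timestamp lies in [t_start, t_end]."""
        lo = bisect.bisect_left(self._times, t_start)
        hi = bisect.bisect_right(self._times, t_end)
        return self._items[lo:hi]

idx = SortedTimeIndex()
for t, item in [(30, "c"), (10, "a"), (20, "b"), (40, "d")]:
    idx.insert(t, item)
window = idx.range(15, 30)
```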

slide-10
SLIDE 10

Spatial/Temporal Indexing Structures

  • Temporal Indexing (1-D data)
    – List index
    – B-tree
  • Space Partition-Based Indexing Structures (2-D data)
    – Grid-based
    – Quad-tree

slide-11
SLIDE 11

Grid-based Spatial Indexing

[Figure: a grid index mapping grids g1, g2 to points p1, p3, p4]

  • Indexing
    – Partition the space into disjoint and uniform grids
    – Build an index between each grid and the points in the grid

slide-12
SLIDE 12

Grid-based Spatial Indexing

  • Range Query
    – Find the grids intersecting the range query
    – Retrieve the points from the grids and identify the points in the range

[Figure: a range query over grids g1, g2, g3, g4 containing points p1, p2, p3, p4]
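
The grid indexing and range-query steps can be sketched as follows: points are bucketed by uniform cell, the query visits only the cells that intersect the rectangle, and each candidate point is then checked against the exact range. The class name, cell size, and coordinates are illustrative assumptions.

```python
import math
from collections import defaultdict

class GridIndex:
    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)   # (cx, cy) -> [(name, x, y), ...]

    def _cell(self, x, y):
        return (math.floor(x / self.cell_size), math.floor(y / self.cell_size))

    def insert(self, name, x, y):
        self.cells[self._cell(x, y)].append((name, x, y))

    def range_query(self, x_min, y_min, x_max, y_max):
        # Step 1: find the grid cells intersecting the query rectangle.
        cx0, cy0 = self._cell(x_min, y_min)
        cx1, cy1 = self._cell(x_max, y_max)
        result = []
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                # Step 2: filter the points of each cell against the range.
                for name, x, y in self.cells.get((cx, cy), []):
                    if x_min <= x <= x_max and y_min <= y <= y_max:
                        result.append(name)
        return sorted(result)

grid = GridIndex(cell_size=10.0)
for name, x, y in [("p1", 2.0, 3.0), ("p2", 12.0, 4.0),
                   ("p3", 8.0, 9.0), ("p4", 15.0, 18.0)]:
    grid.insert(name, x, y)
found = grid.range_query(5.0, 2.0, 13.0, 10.0)
```

Here the query rectangle intersects four cells, but only p2 and p3 pass the exact check; p1 shares a cell with p3 and is filtered out in step 2.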

slide-13
SLIDE 13

Grid-based Spatial Indexing

  • Nearest neighbor query
    – Euclidean distance
    – Road network distance is quite different

[Figure: two cases. The nearest object is within the grid; the nearest object is outside the grid]

Fast approximation
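
The two cases on this slide can be sketched with Euclidean distance as below. `approximate_nn` looks only inside the query's own grid cell, which is one reading of "fast approximation" and is an assumption here; `nearest_neighbor` expands ring by ring around that cell and stops once no unvisited cell can hold a closer point. Names and data are hypothetical.

```python
import math
from collections import defaultdict

def build_grid(points, cell_size):
    """Map each cell (cx, cy) to the names of the points inside it."""
    cells = defaultdict(list)
    for name, (x, y) in points.items():
        cells[(math.floor(x / cell_size), math.floor(y / cell_size))].append(name)
    return cells

def approximate_nn(points, cells, cell_size, q):
    """Nearest point within the query's own cell only (may miss the true answer)."""
    qc = (math.floor(q[0] / cell_size), math.floor(q[1] / cell_size))
    return min(cells.get(qc, []),
               key=lambda n: math.dist(points[n], q), default=None)

def nearest_neighbor(points, cells, cell_size, q):
    """Exact nearest point, searching rings of cells around the query cell."""
    if not points:
        return None
    qc = (math.floor(q[0] / cell_size), math.floor(q[1] / cell_size))
    # Farthest ring that still contains an occupied cell.
    max_ring = max(max(abs(cx - qc[0]), abs(cy - qc[1])) for cx, cy in cells)
    best_name, best_dist = None, math.inf
    for ring in range(max_ring + 1):
        for cx in range(qc[0] - ring, qc[0] + ring + 1):
            for cy in range(qc[1] - ring, qc[1] + ring + 1):
                if max(abs(cx - qc[0]), abs(cy - qc[1])) != ring:
                    continue  # visit only the cells on this ring
                for name in cells.get((cx, cy), []):
                    d = math.dist(points[name], q)
                    if d < best_dist:
                        best_name, best_dist = name, d
        # Any unvisited cell is at least ring * cell_size away from q.
        if best_dist <= ring * cell_size:
            break
    return best_name

points = {"p1": (1.0, 1.0), "p2": (11.0, 9.0)}
cells = build_grid(points, 10.0)
approx = approximate_nn(points, cells, 10.0, (9.0, 9.0))
exact = nearest_neighbor(points, cells, 10.0, (9.0, 9.0))
```

With the query at (9, 9), the only point in the same cell is p1, while the true nearest neighbor p2 sits just across the cell boundary.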

slide-14
SLIDE 14

Grid-based Spatial Indexing

  • Advantages
    – Easy to implement and understand
    – Very efficient for processing range and nearest neighbor queries
  • Disadvantages
    – Index size could be big
    – Difficult to deal with unbalanced data
    – Think about what we discussed last time on the POI sampling and estimation.
slide-15
SLIDE 15

Quad-Tree

  • Indexing
    – Each node of a quad-tree is associated with a rectangular region of space; the top node is associated with the entire target space.
    – Each non-leaf node divides its region into four equal-sized quadrants.
    – Leaf nodes have between zero and some fixed maximum number of points (set to 1 in example).

[Figure: quad-tree example with top-level quadrants 0 to 3 and sub-quadrants 00, 02, 03, 30, 31, 32, 33]

slide-16
SLIDE 16

Quad-Tree

  • Range query (ok)

[Figure: quad-tree range query over quadrants 0 to 3 with sub-quadrants 00, 02, 03, 20, 23, 30, 31, 32, 33]
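
The quad-tree indexing rules and the range query can be sketched together as below, with the leaf capacity set to 1 as in the slide's example. The class layout and sample points are assumptions for illustration, and the sketch assumes all inserted points are distinct (duplicates would split forever at capacity 1).

```python
class QuadTree:
    def __init__(self, x0, y0, x1, y1, capacity=1):
        self.bounds = (x0, y0, x1, y1)
        self.capacity = capacity
        self.points = []       # used only while this node is a leaf
        self.children = None   # four sub-quadrants after a split

    def insert(self, x, y):
        if self.children is not None:
            self._child_for(x, y).insert(x, y)
            return
        self.points.append((x, y))
        if len(self.points) > self.capacity:
            self._split()

    def _split(self):
        # Divide the region into four equal-sized quadrants.
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.children = [
            QuadTree(x0, y0, mx, my, self.capacity),
            QuadTree(mx, y0, x1, my, self.capacity),
            QuadTree(x0, my, mx, y1, self.capacity),
            QuadTree(mx, my, x1, y1, self.capacity),
        ]
        pending, self.points = self.points, []
        for x, y in pending:
            self._child_for(x, y).insert(x, y)

    def _child_for(self, x, y):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        return self.children[(1 if x >= mx else 0) + (2 if y >= my else 0)]

    def range_query(self, qx0, qy0, qx1, qy1):
        x0, y0, x1, y1 = self.bounds
        if qx1 < x0 or qx0 > x1 or qy1 < y0 or qy0 > y1:
            return []  # query rectangle misses this node's region
        if self.children is None:
            return [(x, y) for x, y in self.points
                    if qx0 <= x <= qx1 and qy0 <= y <= qy1]
        found = []
        for child in self.children:
            found.extend(child.range_query(qx0, qy0, qx1, qy1))
        return found

tree = QuadTree(0.0, 0.0, 8.0, 8.0)
for x, y in [(1.0, 1.0), (6.0, 6.0), (5.0, 7.0), (2.0, 6.0)]:
    tree.insert(x, y)
hits = sorted(tree.range_query(4.0, 4.0, 8.0, 8.0))
```

The range query prunes whole subtrees whose region does not intersect the query rectangle, which is why it is the easy case compared with the nearest neighbour query on the next slide.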

slide-17
SLIDE 17

Quad-Tree

  • Nearest Neighbour Query (hard)

[Figure: quad-tree nearest neighbour query over quadrants 0 to 3 with sub-quadrants 00, 02, 03, 20, 23, 30, 31, 32, 33]

slide-18
SLIDE 18

Spatial/Temporal (3D) Indexing Structures

  • Temporal Indexing (1-D data)
    – List index
    – B-tree
  • Space Partition-Based Indexing Structures (2-D data)
    – Grid-based
    – Quad-tree

slide-19
SLIDE 19

Sampling Big Trajectory Data

slide-20
SLIDE 20

Big Trajectory Data in Urban Networks

[Figure panels: Taxi GPS Trajectory; Mobile User Trajectory]

  • Urban roving sensors deliver big trajectory data.
  • Reveal moving patterns and urban issues.

Challenge: How to manage big trajectory data to enable efficient query processing?

slide-21
SLIDE 21

Trajectory Aggregate Query

  • A trajectory aggregate query
    – Retrieves statistics of distinct trajectories passing a user-specified spatio-temporal region;
  • Examples,
    – # of taxi trips with average speed of more than 5 miles per hour in New York City in 2014;
    – # of mobile users with iPhone in Hong Kong in 2013.
slide-22
SLIDE 22

Exhaustive Search

  • r_i: a trajectory, i.e., a sequence of GPS points of the form (TID, Lat, Lng, Time)
  • q: a trajectory aggregate query with N_q trajectories
  • Spatio-temporal indexing: B-tree, Quad-tree, etc.
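
An exhaustive search for a trajectory aggregate query can be sketched as a single scan over all GPS points, collecting the distinct trajectory IDs that fall in the spatio-temporal region. The tuple layout follows (TID, Lat, Lng, Time) from the slide; the function name, region encoding, and sample values are assumptions for illustration.

```python
def exhaustive_count(points, region):
    """Count distinct trajectories with a point inside the region.

    points: iterable of (tid, lat, lng, time) tuples.
    region: (lat_min, lat_max, lng_min, lng_max, t_min, t_max), inclusive.
    """
    lat_min, lat_max, lng_min, lng_max, t_min, t_max = region
    seen = set()  # a set keeps each trajectory ID only once
    for tid, lat, lng, t in points:
        if (lat_min <= lat <= lat_max and lng_min <= lng <= lng_max
                and t_min <= t <= t_max):
            seen.add(tid)
    return len(seen)

# Made-up GPS points for three trajectories.
gps_points = [
    ("r1", 22.50, 114.00, 100),
    ("r1", 22.51, 114.01, 160),
    ("r2", 22.52, 114.02, 130),
    ("r3", 22.90, 114.50, 120),
    ("r2", 22.53, 114.03, 900),
]
n_q = exhaustive_count(gps_points, (22.4, 22.6, 113.9, 114.1, 0, 600))
```

The scan gives the exact N_q, but its cost grows with the number of points touched, which motivates the sampling approach that follows.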
slide-23
SLIDE 23

Challenges with Big Trajectory Data

  • Long response time for large trajectory datasets
  • In 2013, Shenzhen, China;
  • Query: # of iPhone users and taxi/bus trips

(System: A cluster of 3 machines with 24 Intel X5670 2.93GHz processors, 94GB memory.)

Dataset: Size, Coverage
Mobility Data: 788.6 TB, 6 million users
Taxi GPS: 1.58 TB, 22,083 taxis
Bus GPS: 1.34 TB, 8,427 buses

Query answers
iPhone Users: 0.8 million
Taxi GPS: 302 million trips
Bus GPS: 1.38 billion trips

12 minutes to get the exact answers!

slide-24
SLIDE 24

Key Challenges on Exact Answer

  • A trajectory ri may traverse multiple index leaf nodes
  • Cannot pre-compute and store the results on indices
  • Summing up two answers leads to over-counting

[Figure: trajectories r1 and r2 over two index leaf nodes, with per-node counts 2 and 1]
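
The over-counting problem can be seen in a tiny example. Assuming, for illustration, that trajectory r1 traverses both leaf nodes and r2 only the first, the per-node counts add up to more than the number of distinct trajectories. The leaf names are hypothetical.

```python
# Hypothetical leaf nodes and the trajectories passing through each.
leaf_nodes = {
    "leaf_1": {"r1", "r2"},
    "leaf_2": {"r1"},
}

# Pre-computed per-leaf counts, summed: r1 is counted twice.
summed = sum(len(trajs) for trajs in leaf_nodes.values())

# The correct answer needs the union of trajectory IDs.
distinct = len(set().union(*leaf_nodes.values()))
```

Here `summed` is 3 while `distinct` is 2, so per-leaf counts cannot simply be stored on the index and added up.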

slide-25
SLIDE 25

Motivation & Problem Definition

How to sample B index leaf nodes to estimate # of trajectories in q with a guaranteed error bound?

q covers n index leaf nodes

slide-26
SLIDE 26

Random Index Sampling

[Figure: Random Index Sampling framework. Data: ST-indexed data (Lat, Lng, Time). Indexing structure: the index leaf node list for query q, plus an inverted index from each trajectory r1, r2, r3, … to its index leaf node list and occurrence time k^q_1, k^q_2, …. Sampling and estimation: B sampled index leaf nodes with their trajectory lists (e.g. r1, r2; r3, r5; r6, r7; r9, r10).]

slide-27
SLIDE 27

Random Index Sampling

  • Stage 1: Sampling Stage:
    – Uniformly at random sample B index leaf nodes with replacement
  • Stage 2: Estimation Stage: (Unbiased Estimator)
  • Convergence analysis: [equation not preserved], stated in terms of the maximum number of trajectories in an index leaf node.
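
The slide's estimator formula did not survive extraction, so the sketch below shows one unbiased estimator that is consistent with the occurrence counts on the framework slide; it is an assumption, not necessarily the exact formula used in the lecture. Each trajectory in a sampled leaf node contributes 1/k, where k is the number of leaf nodes in q that contain it, and the total is scaled by n/B. Summed over all n leaf nodes, each trajectory contributes exactly 1, which is why the expectation equals N_q. Function and variable names are hypothetical.

```python
import random
from collections import Counter

def ris_estimate(leaf_nodes, budget, rng):
    """Estimate the number of distinct trajectories covered by a query.

    leaf_nodes: list of sets, the trajectory IDs in each of the n leaf nodes of q.
    budget: B, the number of leaf nodes sampled with replacement.
    rng: a random.Random instance.
    """
    n = len(leaf_nodes)
    # Occurrence count k of each trajectory: how many leaf nodes contain it.
    # A real system would read this from a pre-built inverted index.
    occurrences = Counter(tid for trajs in leaf_nodes for tid in trajs)
    total = 0.0
    for _ in range(budget):
        leaf = leaf_nodes[rng.randrange(n)]   # uniform, with replacement
        total += sum(1.0 / occurrences[tid] for tid in leaf)
    return n * total / budget

# Made-up query with n = 4 leaf nodes and 4 distinct trajectories.
leaf_nodes = [{"r1", "r2"}, {"r2", "r3"}, {"r3"}, {"r4", "r1"}]
estimate = ris_estimate(leaf_nodes, budget=2000, rng=random.Random(7))
```

With a larger budget B the estimate concentrates around the true count of 4; the price is reading more leaf nodes per query.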

slide-28
SLIDE 28

Evaluation

  • Dataset: 3TB real human mobility data in a large city in eastern China
  • Baseline algorithm
    – Exhaustive search
  • Evaluation metrics
    – Relative error & response time

Statistics: Value
City size: 400 square miles
City population size: three million people
Duration: eight days at the end of 2010
Number of trajectories: 109,914 3G users
# of spatio-temporal points: 400 million (407,040,083)

slide-29
SLIDE 29

Evaluation Results

[Figure: Relative error. x-axis: B/n (%); y-axis: relative error ratio; series: n=7k, n=13k, n=23k, ground truth]

[Figure: Processing time. x-axis: B/n (%); y-axis: query processing time (s); series: n=7k (ES: 112s), n=13k (ES: 115s), n=23k (ES: 117s)]

  • Up to 2% relative error
  • 5 times reduction

slide-30
SLIDE 30

Concurrent Random Index Sampling

  • Practical Issue:
    – A large number of concurrent aggregate queries
  • Idea of Concurrent Random Index Sampling (CRIS):
    – Sampling Reuse
    – Stratified Sampling Technique
slide-31
SLIDE 31

Concurrent Random Index Sampling

Unbiased Estimators:

slide-32
SLIDE 32

Summary

  • Approximate query processing
    – Single trajectory aggregate query via Random Index Sampling (RIS)
    – Concurrent trajectory aggregate queries via Concurrent Random Index Sampling (CRIS)
slide-33
SLIDE 33

Any Comments & Critiques?

slide-34
SLIDE 34

Weka

  • 6 weeks
  • https://weka.waikato.ac.nz/dataminingwithweka/preview
  • https://www.futurelearn.com/courses/data-mining-with-weka