Scaling real-time analytics using Postgres in the cloud (PowerPoint presentation)

SLIDE 1

Scaling real-time analytics using Postgres in the cloud

Sai.Srirampur@microsoft.com Colton.Shepard@microsoft.com

SLIDE 2

Why PostgreSQL?

  • Proven resilience and stability: thousands of mission-critical workloads
  • Open source: large developer community, and extensible
  • Rich feature set that solves a multitude of use cases:
      • Constraints
      • Extensions: PostGIS / geospatial, HLL, TopN, Citus, foreign data wrappers
      • JSONB
      • Rich SQL: CTEs, window functions, full-text search, datatypes

SLIDE 3

PostgreSQL is more popular than ever

[Charts: Stack Overflow Developer Survey 2019 "most loved" and "most wanted" databases; DB-Engines "DBMS of the Year" award; DB-Engines ranking of PostgreSQL popularity]

Sources:
https://insights.stackoverflow.com/survey/2019?utm_source=so-owned&utm_medium=blog&utm_campaign=dev-survey-2019&utm_content=launch-blog
https://db-engines.com/en/blog_post/76
https://db-engines.com/en/ranking_trend/system/PostgreSQL

SLIDE 4

More and more organizations are shifting open source workloads to the cloud to benefit from key advantages:

  • Improved manageability and security
  • Improved performance and intelligence
  • Global scalability

The cloud only makes PostgreSQL better

SLIDE 5

Azure Database for PostgreSQL

SLIDE 6

Azure Database for PostgreSQL is available in two deployment options

Single Server

Fully managed, single-node PostgreSQL

Example use cases:

  • Apps with JSON, geospatial support, or full-text search
  • Transactional and operational analytics workloads
  • Cloud-native apps built with modern frameworks

Hyperscale (Citus)

High-performance Postgres for scale-out

Example use cases:

  • Scaling PostgreSQL multi-tenant, SaaS apps
  • Real-time operational analytics
  • Building high throughput transactional apps

Enterprise-ready, fully managed community PostgreSQL with built-in HA and multi-layered security

SLIDE 7

The benefits of Azure Database for PostgreSQL:

  • High-performance scale-out with Hyperscale (Citus) NEW
  • Intelligent performance optimization
  • Flexible and open
  • Fully managed and secure

Available in two deployment options: Single Server and Hyperscale (Citus).

Build or migrate your workloads with confidence

SLIDE 8

Real-time Analytics

SLIDE 9

Scenario: Real-time Analytics

SLIDE 10

Common requirements for Real-time analytics applications

SLIDE 11

Architecting Real-time Analytics with Postgres in the cloud

SLIDE 12

Example Architecture for a real-time analytics application

SLIDE 13

Typical Recipe for real-time analytics

SLIDE 14

What is a rollup table?

Period | Customer | Country | Site | Hit Count
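The column headers above suggest a table keyed by period and dimensions, holding a pre-computed aggregate. A minimal sketch of such a rollup table (column names are assumptions, matching the headers shown):

```sql
-- Hypothetical rollup table: one row per (period, customer, country, site)
-- with a pre-aggregated hit count, so dashboards never scan raw events.
CREATE TABLE rollup (
    period      timestamptz not null,  -- e.g. hourly time bucket
    customer_id bigint not null,
    country     text not null,
    site_id     bigint not null,
    hit_count   bigint not null default 0,
    PRIMARY KEY (period, customer_id, country, site_id)
);
```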

SLIDE 15

Why Rollup tables?

SLIDE 16

Typical Recipe for real-time analytics

  • Ingest large volumes of data to a raw table

SLIDE 17

Schema for ingesting data into a raw table

CREATE TABLE events (
    event_id    bigserial,
    event_time  timestamptz default now(),
    customer_id bigint,
    event_type  text,
    country     text,
    browser     text,
    device_id   bigint,
    session_id  bigint,
    details     jsonb
);

SLIDE 18

Fast data loading - COPY
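The code for this slide was not captured. A typical COPY invocation for bulk-loading into the events table sketched earlier (file name and format are assumptions):

```sql
-- Bulk-load CSV data into the raw events table in a single statement;
-- COPY is much faster than row-by-row INSERTs.
COPY events (event_time, customer_id, event_type, country, browser,
             device_id, session_id, details)
FROM '/data/events.csv' WITH (FORMAT csv, HEADER true);

-- From a client machine, psql's \copy streams the file over the connection:
-- \copy events (...) from 'events.csv' with (format csv, header true)
```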

SLIDE 19

Best practices for data loading

SLIDE 20

Ways to have small indexes
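The slide body was lost in extraction. Common Postgres techniques for keeping indexes small include partial indexes, expression indexes, and BRIN; a sketch against the events table (index names and predicates are illustrative):

```sql
-- Partial index: only index the rows that hot queries actually touch.
CREATE INDEX events_page_views_idx ON events (customer_id, event_time)
WHERE event_type = 'page_view';

-- BRIN index: a tiny block-range index, effective when data is
-- naturally ordered on disk, as event_time is for append-only ingest.
CREATE INDEX events_time_brin ON events USING brin (event_time);
```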

SLIDE 21

Expiring old data

SLIDE 22

Bloat and Fragmentation

SLIDE 23

Partitioning

Keep your data sorted by bucketing it.


SLIDE 24

CREATE EXTENSION pg_partman
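The setup commands were not captured. With pg_partman, time-based partitioning of the events table might look like this (schema name and interval are assumptions, and the create_parent signature varies between pg_partman versions):

```sql
CREATE SCHEMA partman;
CREATE EXTENSION pg_partman SCHEMA partman;

-- Partition events by day on event_time (pg_partman 4.x style).
SELECT partman.create_parent(
    p_parent_table => 'public.events',
    p_control      => 'event_time',
    p_type         => 'native',
    p_interval     => 'daily'
);
```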

SLIDE 25

Now expiry becomes
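The slide's punchline (lost in extraction) is that with partitioning, expiring old data becomes a cheap partition drop instead of a bloat-inducing bulk DELETE. A sketch using pg_partman's retention settings (the retention values are assumptions):

```sql
-- Keep one week of data; run_maintenance() creates upcoming partitions
-- and detaches/drops partitions older than the retention window.
UPDATE partman.part_config
SET retention = '1 week',
    retention_keep_table = false
WHERE parent_table = 'public.events';

SELECT partman.run_maintenance();
```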

SLIDE 26

Typical Recipe for real-time analytics

  • Ingest large volumes of data to a raw table
  • Periodically aggregate events into a rollup table
  • Have application query the rollup table

SLIDE 27

Typical Structure of Rollups

SLIDE 28

Choose granularity and dimensions

Time, Customer, Country → Aggregates: ~100 rows per period/customer
Time, Customer, Site → Aggregates: ~20 rows per period/customer
Time, Customer, Country, Site → Aggregates: ~20*100 = 2000 rows per period/customer

SLIDE 29

Coarse-grained rollup for fast lookups

Look up records by primary key columns: hits_by_country

SLIDE 30

Fine-grained rollup for versatility

Sum across all sites: hits_by_country_site

SLIDE 31

Build coarse grained from fine grained rollups

Roll up from hits_by_country_site into hits_by_country. Useful if you want to keep coarse-grained data for much longer.

SLIDE 32

Summary: Designing rollups


SLIDE 33

Computing rollups

SLIDE 34

Append-only vs incremental aggregation

SLIDE 35

Append-only aggregation

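The SQL on this slide was lost in extraction apart from its leading "period,". A minimal sketch of append-only aggregation, assuming the events and rollup schemas from earlier slides: each completed period is aggregated exactly once and inserted, with no upserts needed:

```sql
-- Aggregate only fully completed periods, so every rollup row is
-- written exactly once (:last_rollup_time is a placeholder for the
-- end of the previously aggregated period).
INSERT INTO rollup (period, customer_id, country, site_id, hit_count)
SELECT date_trunc('minute', event_time) AS period,
       customer_id, country, site_id, count(*)
FROM events
WHERE event_time >= :last_rollup_time
  AND event_time <  date_trunc('minute', now())
GROUP BY 1, 2, 3, 4;
```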

SLIDE 36

Incremental Aggregation

SLIDE 37

Keeping track of aggregated events

SLIDE 38

Track sequence number


SLIDE 39

Function to do transactional rollup

CREATE FUNCTION do_aggregation() RETURNS void
LANGUAGE plpgsql AS $function$
DECLARE
    s bigint;
    e bigint;
BEGIN
    -- Get and update the rollup window
    SELECT * FROM safe_rollup_window('rollup') INTO s, e;

    INSERT INTO rollup
    SELECT period, dimensions, aggregates
    FROM events
    WHERE event_id BETWEEN s AND e
    GROUP BY period, dimensions
    ON CONFLICT (dimensions)
    DO UPDATE SET aggregates = aggregates + EXCLUDED.aggregates;
END;
$function$;
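safe_rollup_window is used above but never defined on the slides. A hypothetical sketch of what it could look like, assuming a bookkeeping table that records the last aggregated event_id and that event_id is drawn from the events_event_id_seq sequence; note a production version must also account for in-flight transactions that have claimed sequence values but not yet committed:

```sql
-- Hypothetical bookkeeping table:
--   CREATE TABLE rollups (name text primary key, last_aggregated_id bigint);
CREATE FUNCTION safe_rollup_window(rollup_name text,
                                   OUT window_start bigint,
                                   OUT window_end bigint)
LANGUAGE plpgsql AS $function$
BEGIN
    -- Lock this rollup's row so concurrent aggregation runs serialize.
    SELECT last_aggregated_id + 1 INTO window_start
    FROM rollups WHERE name = rollup_name FOR UPDATE;

    -- Highest event_id handed out so far (naive: ignores uncommitted writers).
    SELECT coalesce(pg_sequence_last_value('events_event_id_seq'), 0)
    INTO window_end;

    UPDATE rollups SET last_aggregated_id = window_end
    WHERE name = rollup_name;
END;
$function$;
```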

SLIDE 40

Advanced aggregations – HLL and TopN

SLIDE 41

Some metrics can’t be rolled up easily

SLIDE 42

Solution: Use Approximations

SLIDE 43

HLL

HyperLogLog starts by taking a hash of the items counted:

hll_hash_text('54.33.98.12')

The hash function produces a uniformly distributed bit string. Observing unlikely bit patterns indicates high cardinality: if a hash value with n leading 0-bits has been observed, there are roughly 2^n distinct items. HyperLogLog divides the values into m streams and averages the results.

SLIDE 44

HyperLogLog Process

SLIDE 45

Incremental Aggregation using HLL

Use hll_add_agg and hll_union to do incremental rollups.

CREATE TABLE hourly_rollup (
    customer_id bigint not null,
    period      timestamptz not null,
    unique_ips  hll not null,
    PRIMARY KEY (customer_id, period)
);

INSERT INTO hourly_rollup
SELECT customer_id, date_trunc('hour', created_at),
       hll_add_agg(hll_hash_text(ip))  -- hll_add_agg expects hashed values
FROM page_views
WHERE event_id BETWEEN start_id AND end_id
GROUP BY 1, 2
ON CONFLICT (customer_id, period)
DO UPDATE SET unique_ips = hll_union(unique_ips, EXCLUDED.unique_ips);

SLIDE 46

Dashboard queries with HLL

Use hll_union_agg to merge HLL objects and hll_cardinality to extract distinct count.

SELECT period::date,
       hll_cardinality(hll_union_agg(unique_ips)) AS uniques
FROM hourly_rollup
WHERE customer_id = 1283
  AND period >= now() - interval '1 week'
GROUP BY 1 ORDER BY 1;

   period   │ uniques
────────────┼─────────
 2018-08-29 │   14712
 2018-08-30 │   33280
 …
(7 rows)

SLIDE 47

TopN

TopN keeps track of a set of counters (e.g. 1000) in JSONB with the explicit goal of determining the top N (e.g. 10) most heavy hitters.

{
  "184.31.49.1": 1124712,
  "22.203.1.77": 28371,
  "54.68.19.33": 62183,
  …
}

SLIDE 48

Merging TopN objects

Like HLL, TopN objects can be merged across time periods and dimensions:

topn_union(tn1, tn2)

  { "184.31.49.1": 1124712,         { "184.31.49.1": 3407,
    "22.203.1.77": 28371,      +      "22.203.1.77": 22,
    "54.68.19.33": 62183,            "54.68.19.33": 1,
    … }                              … }

SLIDE 49

Incremental aggregation using TopN

Use topn_add_agg and topn_union to do incremental rollups.

CREATE TABLE heavy_hitters_hour (
    customer_id bigint not null,
    period      timestamptz not null,
    top_ips     jsonb not null,
    PRIMARY KEY (customer_id, period)
);

INSERT INTO heavy_hitters_hour
SELECT customer_id, date_trunc('hour', created_at), topn_add_agg(ip)
FROM page_views
WHERE event_id BETWEEN start_id AND end_id
GROUP BY 1, 2
ON CONFLICT (customer_id, period)
DO UPDATE SET top_ips = topn_union(top_ips, EXCLUDED.top_ips);

SLIDE 50

Dashboard queries with TopN

Use topn_union_agg to merge TopN objects, topn to extract top N counts.

SELECT (topn(topn_union_agg(top_ips), 10)).*
FROM heavy_hitters_hour
WHERE customer_id = 1283
  AND period >= now() - interval '1 day';

    item     │ frequency
─────────────┼───────────
 184.31.49.1 │   1124712
 54.68.19.33 │     62183
 …
(10 rows)

A cheap index look-up with aggregation across 24 rows.

SLIDE 51

Recipe for real-time analytics

  • Ingest large volumes of data to a raw table
  • Periodically aggregate events into a rollup table
  • Have application query the rollup table
  • Automate all of this with pg_cron

SLIDE 52

Automate jobs with pg_cron

  • PostgreSQL extension that lets you run cron jobs within the database
  • Makes it easy to schedule jobs without requiring external tools

Example: delete old data at midnight using pg_cron:

SELECT cron.schedule('0 0 * * *', $$
    DELETE FROM events
    WHERE event_time < date_trunc('day', now() - interval '1 week')
$$);

SLIDE 53

Periodic aggregation using pg_cron

Run aggregation every 5 minutes: SELECT cron.schedule('*/5 * * * *', $$ SELECT do_aggregation() $$);

SLIDE 54

Scale out using Hyperscale (Citus)

SLIDE 55

Architecture

  • Shard your PostgreSQL database across multiple nodes to give your application more memory, compute, and disk storage
  • Easily add worker nodes to achieve horizontal scale, while delivering parallelism even within each node
  • Scale out to 100s of nodes

[Diagram: a coordinator node holds the table metadata and routes queries to worker nodes. Each node runs PostgreSQL with Citus installed; 1 shard = 1 PostgreSQL table.]

SLIDE 56

Scaling with Hyperscale (Citus)


SELECT create_distributed_table('events', 'customer_id');

SLIDE 57

Multi-tenancy and colocation

Tenant ID provides a natural sharding dimension for many applications. Citus automatically co-locates event and rollup data for the same tenant:

SELECT create_distributed_table('events', 'tenant_id');
SELECT create_distributed_table('rollup', 'tenant_id');

Aggregations can be done locally, without network traffic:

INSERT INTO rollup SELECT tenant_id, … FROM events …

Dashboard queries are always for a particular tenant:

SELECT … FROM rollup WHERE tenant_id = 1238 …

Or are parallelized when you want to compare tenants:

SELECT tenant_id, … FROM rollup GROUP BY tenant_id …

SLIDE 58

Benefits of Hyperscale (Citus)

SLIDE 59

Data loading

[Diagram: COPY into the events table goes through the coordinator, which routes the rows to the events shards on the worker nodes.]

SLIDE 60

Aggregation and rollups

[Diagram: the coordinator rewrites the rollup query into per-shard queries on the worker nodes.]

INSERT INTO rollup SELECT … FROM events GROUP BY …

becomes, on the worker nodes:

INSERT INTO rollup_102182 SELECT … FROM events_102010 GROUP BY …
INSERT INTO rollup_102180 SELECT … FROM events_102008 GROUP BY …

SLIDE 61

Querying rollups

[Diagram: a single-tenant query is routed to the one shard that holds that tenant.]

SELECT … FROM rollup WHERE tenant_id = 1283 …

becomes, on a single worker node:

SELECT … FROM rollup_102180 WHERE tenant_id = 1283 …

SLIDE 62

Querying across tenants

[Diagram: a cross-tenant query is parallelized across all rollup shards.]

SELECT tenant_id, … FROM rollup GROUP BY tenant_id …

becomes, on the worker nodes:

SELECT tenant_id, … FROM rollup_102182 GROUP BY tenant_id …
SELECT tenant_id, … FROM rollup_102180 GROUP BY tenant_id …

SLIDE 63

Summary

SLIDE 64

To build a real-time application which

SLIDE 65

You should use:

SLIDE 66

Hands on Lab

SLIDE 67

Scenario: you are a cloud services provider helping businesses monitor their HTTP traffic

SLIDE 68

Instructions link: http://tinyurl.com/yxreau7d

* Write down your session id *

(if you have browser issues: disable your popup blocker, or try incognito mode)

SLIDE 69

Setup (page 1 to page 4)

SLIDE 70

Design and Implementation (page 5 to page 9)

SLIDE 71

Thank You Q&A