Improve your SQL workload with observability PostgresOpen 2019 @ - - PowerPoint PPT Presentation

improve your sql workload with observability
SMART_READER_LITE
LIVE PREVIEW

Improve your SQL workload with observability PostgresOpen 2019 @ - - PowerPoint PPT Presentation

Improve your SQL workload with observability PostgresOpen 2019 @ Orlando Wilfried ROSET, Engineering Manager 1 Speaker Wilfried ROSET wilfriedroset @ github Engineering Manager @ OVH 2 WE BUILD A a truly Cloud SMART DIFFERENT


slide-1
SLIDE 1

1

Improve your SQL workload with

  • bservability

PostgresOpen 2019 @ Orlando

Wilfried ROSET, Engineering Manager

slide-2
SLIDE 2

2

Speaker

  • Wilfried ROSET
  • wilfriedroset @ github
  • Engineering Manager @ OVH
slide-3
SLIDE 3

3

SIMPLE AND QUICK TO SET UP MULTI-LOCAL, CLOSE TO EACH PERSON

THROUGHOUT THE WORLD WHOSE COST IS ACCESSIBLE AND PREDICTABLE

REVERSIBLE, OPEN AND INTEROPERABLE TRANSPARENT AND RESPONSIBLE

WE BUILD A DIFFERENT CLOUD

a truly Cloud SMART

TO GROW AND SUCCEED TOGETHER

slide-4
SLIDE 4

4

WE ARE A GLOBAL CLOUD PROVIDER WE ARE AN INDUSTRIAL PLAYER

Hillsboro x1

20 Tbps 34 Point of Presence

IN 3 CONTINENTS

BHS x6 Vinthill x1

+1 500 000

CUSTOMERS

IN 132

COUNTRIES

Singapore x1

2 200

EMPLOYEES IN 18 COUNTRIES

Sydney x1 Europe x18

BUILDER OF ITS OWN SERVERS SINCE 2002 + 1 million SERVERS built since 1999

28

DATA CENTERS in 12 locations EXISTING SINCE

1999

slide-5
SLIDE 5

5

Everyone uses SQL

  • Directly or Indirectly:

– CEO / CTO / CXO – Dev / DevOps / SRE – Support & Run

slide-6
SLIDE 6

6

Is this happening to you?

« We need a bigger DB! » « Our API is slow because of the DB! »

slide-7
SLIDE 7

7

No No Observability

slide-8
SLIDE 8

8

Ugly Observability

slide-9
SLIDE 9

9

Good Observability

slide-10
SLIDE 10

10

Our internal Databases perimeter

Applications Databases Clusters Disk space Person Hours per day

3k 400 60 20TB 2 à 4 24

slide-11
SLIDE 11

11

How to isolate a SLOW query

db# SELECT username FROM customers WHERE username LIKE ‘_wilfried%’; Time: 9433.400 ms (00:09.433) Tips: it’s not a missing index ;)

slide-12
SLIDE 12

12

Internal DBs Infra

Primary Replica Backup Prod Dev Replica Dev Read/Write Read-only Replication Read/Write 24-hour restore point Archive

slide-13
SLIDE 13

13

Old request process

Team DB

"Why is my DB slow?" "Can I have an extract from my logs?"

slide-14
SLIDE 14

14

Easier, Better, Faster, Stronger

  • Self-service for complete autonomy
  • Meaningful KPIs
  • Capacity planning
  • Ownership and proactivity
slide-15
SLIDE 15

15

Prerequisites

  • As-a-service

– OVH Logs – OVH Metrics – UI

  • Open source

– Data collection

slide-16
SLIDE 16

16

Logs

slide-17
SLIDE 17

17

First things first Observability is not about

ho how to to co

collect ct da data

but

wha what to to d do wit

with it it

slide-18
SLIDE 18

18

DBMS Configuration

  • PostgreSQL

– Everything happens in postgresql.conf – Format log to produce report

  • log_line_prefix = '%t [%p]: [%l-1] db=%d,user=%u,app=%a,client=%h '

– Log slow queries

  • log_min_duration_statement = '1000'
slide-19
SLIDE 19

19

DBMS Configuration

  • MySQL / MariaDB

– Everything happens in my.cnf – Performance Schema – Log slow queries

  • slow_query_log

= 1

  • slow_query_log_file = /var/log/mysql/slow.log
slide-20
SLIDE 20

20

Give meaning to logs

  • Process log files and produce fully readable reports

– pgbadger for PostgreSQL – pt-query-digest for MySQL/MariaDB

slide-21
SLIDE 21

21

Observability: step 1/5

Team DB

Reporting "Why is my DB slow?" "Can I have an extract from my logs?"

slide-22
SLIDE 22

22

Output Sample

slide-23
SLIDE 23

23

Logs, Logs, Logs…

  • Logs are useful only if they are usable

– <3 grep, less, … – Avoid ssh

slide-24
SLIDE 24

24

Logs pipeline

DBMS Servers Logstash Clusters ElasticSearch indexes

slide-25
SLIDE 25

25

Let’s grok

User@Host: ap app_u _user er @ [1. 1.2. 2.3. 3.4] ^# User@Host: %{WO WORD:query_user}\s*@\s*\[%{IP IP:query_ip}?\].* { "query_user": "app_user", "query_ip": "1.2.3.4" }

slide-26
SLIDE 26

26

Sending Data

  • Use whatever is Efficient && Easy to setup for your case

– Filebeat – Syslog

slide-27
SLIDE 27

27

Filebeat

# /etc/filebeat /filebeat.yml

  • utput.logstash:

hosts: ["graX.logs.ovh.com:6514"] ssl.certificate_authorities: ["/etc/filebeat/cert.pem"] filebeat.prospectors:

  • type: log

enabled: true paths:

  • /var/log/postgresql/postgresql.log
slide-28
SLIDE 28

28

Syslog

destination d_pg_ldp { tcp( "graX.logs.ovh.com", port(6514), ts_format("iso"), keep-alive(yes), so_keepalive(yes), log-fifo-size(10000), ); };

slide-29
SLIDE 29

29

Observability: step 2/5

Team DB

Reporting

Logs

Syslog/Filebeat

"Why is my DB slow?" "Can I have an extract from my logs?"

slide-30
SLIDE 30

30

Graylog

slide-31
SLIDE 31

31

Oops, Proxy and Connection Pooler…

VIP & Load Balancer Connection Pooler Load Balancer Connection Pooler

Primary Replica

Load Balancer Connection Pooler

Replica

With our setup we lost Source IP:

  • Less usefull pg_hba
  • Harder to track bad queries
  • Security audit
slide-32
SLIDE 32

32

… Proxy Protocol

What about Proxy Protocol? We want to help:

  • Proxy Protocol support @ PostgreSQL Hackers à https://bit.ly/2MN2H8U
  • PR#390 @ pgbouncer à https://github.com/pgbouncer/pgbouncer/pull/390
slide-33
SLIDE 33

33

Metrics

slide-34
SLIDE 34

34

It’s not about how

Metrics is is no

not ab

about t whi which to tools ls to use But

wha what to to d do wi

with yo your dat data

slide-35
SLIDE 35

35

Collect your metrics

  • There are several options for doing this:

– Homemade à please don’t do that – Collectl, Collectd, Statsd... – Telegraf, Prometheus…

slide-36
SLIDE 36

36

Telegraf’s plug-ins

  • Plug-ins:

– System-related

  • cpu, disk, diskio, system, network ...

– DBMS

  • PostgreSQL, MySQL

– Homemade

  • Exec
  • There is a lot more plug-ins in telegraf
slide-37
SLIDE 37

37

Let’s push some metrics

# /etc/telegraf/telegraf.conf [agent] interval = "30s" flush_interval = "30s" [[outputs.influxdb]] urls = ["https://influxdb.graXXX.metrics.ovh.net"] timeout = "15s" username = "telegraf" password = "write.token.from.metrics.manager"

slide-38
SLIDE 38

38

Observability: step 3/5

Team DB

Reporting

Logs

Metrics

Syslog/Logstash Telegraf

"Why is my DB slow?" "Can I have an extract from my logs?"

slide-39
SLIDE 39

39

System dashboard

slide-40
SLIDE 40

40

I/O dashboard

slide-41
SLIDE 41

41

DBMS dashboard

slide-42
SLIDE 42

42

Grafana is beautiful!

slide-43
SLIDE 43

43

Observability: step 4/5

Reporting

Logs

Syslog/Logstash Telegraf

"Why is my DB slow?" "Can I have an extract from my logs?"

Metrics

slide-44
SLIDE 44

44

Slow queries count/cumul. time

slide-45
SLIDE 45

45

Spot bugs

slide-46
SLIDE 46

46

Spot fixes

slide-47
SLIDE 47

47

We already knew that!

  • This is well known methods

– Utilization Saturation Errors – Rate Errors Durations

– Four Golden Signals

slide-48
SLIDE 48

48

80/20

  • Don’t try to fix the world…
  • Focus where it matters

– Start by your Top producers and iterate

slide-49
SLIDE 49

49

Talk…

Hey folks, I’ve open 60 tickets for your database slow queries and errors. XOXO <3

slide-50
SLIDE 50

50

Talk, again…

Folks, come on! What the F***

slide-51
SLIDE 51

51

Always talk (louder) …

That’s it folks! I’m done with

  • you. No more prod for you

until you fixe everything.

slide-52
SLIDE 52

52

… But do it gently

Ok folks, my bad. Let’s try again.

slide-53
SLIDE 53

53

Gamification

  • Make it fun
  • Make it interesting
  • Turn it to a game
  • Each game needs a ….
slide-54
SLIDE 54

54

Leader Board

slide-55
SLIDE 55

55

Leader Board Weekly Mail

Hello, You will find information which can help you identify your queries in our welcome guide: < insert documentation link > Tldr:

  • Database 33 did great because of blablabla
  • Database 94 is our #1 producer of slow queries
  • Blablabla

slide-56
SLIDE 56

56

Engage regularly

slide-57
SLIDE 57

57

What have we learned?

  • Observability can be set up in le

less ss tha than a a we week

  • Choose the right tools for the job

– Grafana for dashboard (mix and match sources) – Graylog for search – Reporting tools are not used by developers

  • 1 year down the road, we can expect…

– A widely-adopted gamechanger – Inspire others to do the same kind of report <3

– x4 x4 le less ss slo slow que queries

  • KPIs
slide-58
SLIDE 58

58

What’s next?

  • The easy part is done, we need to dig deeper

– Engage at higher level – Point everyone in the right direction… Broadcast the KPI – Help Developper refactor app & schema

  • Improve monitoring
  • Feed the data to Machine Learning
  • Automatic indexes recommendations

– Thanks to Percona blog – Based on pg_qualstats & hypopg

slide-59
SLIDE 59

59

Observability: step 5/5

Machine Learning

Logs

Syslog/Logstash Telegraf

"Why is my DB slow?" "Can I have an extract from my logs?"

Metrics

slide-60
SLIDE 60

60

Remember this SLOW query

db# SELECT username FROM customers WHERE username LIKE ‘_wilfried%’; Time: 9433.400 ms (00:09.433) Tips: it’s not a missing index ;)

slide-61
SLIDE 61

61

PS: We are hiring!

  • Opensource database expert
  • Site Reliability Engineers (Private cloud, Openstack, DNS, Deploy, Observability)
  • Software engineers (containers, baremetal, webhosting)
  • Back-end developers (go, python)
  • Engineering manager webhosting
  • … And a lot more
slide-62
SLIDE 62

62

Questions?