1
Improve your SQL workload with
- bservability
PostgresOpen 2019 @ Orlando
Wilfried ROSET, Engineering Manager
Improve your SQL workload with observability PostgresOpen 2019 @ - - PowerPoint PPT Presentation
Improve your SQL workload with observability PostgresOpen 2019 @ Orlando Wilfried ROSET, Engineering Manager 1 Speaker Wilfried ROSET wilfriedroset @ github Engineering Manager @ OVH 2 WE BUILD A a truly Cloud SMART DIFFERENT
1
PostgresOpen 2019 @ Orlando
Wilfried ROSET, Engineering Manager
2
Speaker
3
THROUGHOUT THE WORLD WHOSE COST IS ACCESSIBLE AND PREDICTABLE
TO GROW AND SUCCEED TOGETHER
4
Hillsboro x1
IN 3 CONTINENTS
BHS x6 Vinthill x1
CUSTOMERS
IN 132
COUNTRIES
Singapore x1
EMPLOYEES IN 18 COUNTRIES
Sydney x1 Europe x18
BUILDER OF ITS OWN SERVERS SINCE 2002 + 1 million SERVERS built since 1999
DATA CENTERS in 12 locations EXISTING SINCE
5
Everyone uses SQL
– CEO / CTO / CXO – Dev / DevOps / SRE – Support & Run
6
Is this happening to you?
« We need a bigger DB! » « Our API is slow because of the DB! »
7
No No Observability
8
Ugly Observability
9
Good Observability
10
Our internal Databases perimeter
Applications Databases Clusters Disk space Person Hours per day
3k 400 60 20TB 2 à 4 24
11
How to isolate a SLOW query
db# SELECT username FROM customers WHERE username LIKE ‘_wilfried%’; Time: 9433.400 ms (00:09.433) Tips: it’s not a missing index ;)
12
Internal DBs Infra
Primary Replica Backup Prod Dev Replica Dev Read/Write Read-only Replication Read/Write 24-hour restore point Archive
13
Old request process
Team DB
"Why is my DB slow?" "Can I have an extract from my logs?"
14
Easier, Better, Faster, Stronger
15
Prerequisites
– OVH Logs – OVH Metrics – UI
– Data collection
16
17
First things first Observability is not about
18
DBMS Configuration
– Everything happens in postgresql.conf – Format log to produce report
– Log slow queries
19
DBMS Configuration
– Everything happens in my.cnf – Performance Schema – Log slow queries
= 1
20
Give meaning to logs
– pgbadger for PostgreSQL – pt-query-digest for MySQL/MariaDB
21
Observability: step 1/5
Team DB
Reporting "Why is my DB slow?" "Can I have an extract from my logs?"
22
Output Sample
23
Logs, Logs, Logs…
– <3 grep, less, … – Avoid ssh
24
Logs pipeline
DBMS Servers Logstash Clusters ElasticSearch indexes
25
Let’s grok
User@Host: ap app_u _user er @ [1. 1.2. 2.3. 3.4] ^# User@Host: %{WO WORD:query_user}\s*@\s*\[%{IP IP:query_ip}?\].* { "query_user": "app_user", "query_ip": "1.2.3.4" }
26
Sending Data
– Filebeat – Syslog
27
Filebeat
# /etc/filebeat /filebeat.yml
hosts: ["graX.logs.ovh.com:6514"] ssl.certificate_authorities: ["/etc/filebeat/cert.pem"] filebeat.prospectors:
enabled: true paths:
28
Syslog
destination d_pg_ldp { tcp( "graX.logs.ovh.com", port(6514), ts_format("iso"), keep-alive(yes), so_keepalive(yes), log-fifo-size(10000), ); };
29
Observability: step 2/5
Team DB
Reporting
Logs
Syslog/Filebeat
"Why is my DB slow?" "Can I have an extract from my logs?"
30
Graylog
31
Oops, Proxy and Connection Pooler…
VIP & Load Balancer Connection Pooler Load Balancer Connection Pooler
Primary Replica
Load Balancer Connection Pooler
Replica
With our setup we lost Source IP:
32
… Proxy Protocol
What about Proxy Protocol? We want to help:
33
34
It’s not about how
35
Collect your metrics
– Homemade à please don’t do that – Collectl, Collectd, Statsd... – Telegraf, Prometheus…
36
Telegraf’s plug-ins
– System-related
– DBMS
– Homemade
37
Let’s push some metrics
# /etc/telegraf/telegraf.conf [agent] interval = "30s" flush_interval = "30s" [[outputs.influxdb]] urls = ["https://influxdb.graXXX.metrics.ovh.net"] timeout = "15s" username = "telegraf" password = "write.token.from.metrics.manager"
38
Observability: step 3/5
Team DB
Reporting
Logs
Metrics
Syslog/Logstash Telegraf
"Why is my DB slow?" "Can I have an extract from my logs?"
39
System dashboard
40
I/O dashboard
41
DBMS dashboard
42
Grafana is beautiful!
43
Observability: step 4/5
Reporting
Logs
Syslog/Logstash Telegraf
"Why is my DB slow?" "Can I have an extract from my logs?"
Metrics
44
Slow queries count/cumul. time
45
Spot bugs
46
Spot fixes
47
We already knew that!
– Utilization Saturation Errors – Rate Errors Durations
– Four Golden Signals
48
80/20
– Start by your Top producers and iterate
49
Talk…
Hey folks, I’ve open 60 tickets for your database slow queries and errors. XOXO <3
50
Talk, again…
Folks, come on! What the F***
51
Always talk (louder) …
That’s it folks! I’m done with
until you fixe everything.
52
… But do it gently
Ok folks, my bad. Let’s try again.
53
Gamification
54
Leader Board
55
Leader Board Weekly Mail
Hello, You will find information which can help you identify your queries in our welcome guide: < insert documentation link > Tldr:
…
56
Engage regularly
57
What have we learned?
less ss tha than a a we week
– Grafana for dashboard (mix and match sources) – Graylog for search – Reporting tools are not used by developers
– A widely-adopted gamechanger – Inspire others to do the same kind of report <3
– x4 x4 le less ss slo slow que queries
58
What’s next?
– Engage at higher level – Point everyone in the right direction… Broadcast the KPI – Help Developper refactor app & schema
– Thanks to Percona blog – Based on pg_qualstats & hypopg
59
Observability: step 5/5
Machine Learning
Logs
Syslog/Logstash Telegraf
"Why is my DB slow?" "Can I have an extract from my logs?"
Metrics
60
Remember this SLOW query
db# SELECT username FROM customers WHERE username LIKE ‘_wilfried%’; Time: 9433.400 ms (00:09.433) Tips: it’s not a missing index ;)
61
PS: We are hiring!
62