The SmugMug Tale Who are we? Premium photo & video sharing. - - PowerPoint PPT Presentation



slide-1
SLIDE 1

The SmugMug Tale

slide-2
SLIDE 2

Premium photo & video sharing. Bootstrapped in ’02. $10M+ as of ’07. Profitable. No debt. Top 400 website. Doubling yearly.

Who are we?

slide-3
SLIDE 3

Premium means “more” and “better”. Unlimited storage. Unlimited bandwidth. Big photos (48Mpix). 500M+ of them. Big video (1920x1080p). Lots of photos per page. Super fast.

Our challenge

slide-4
SLIDE 4

LAMP(hp). x86 (mostly AMD) on Linux (~300 4+ core hosts?). 4 datacenters: 2 x SV, 1 x VA, 1 x SEA. 2 Ops guys. :) Majority of boxes are diskless. Consume lots of cloud services (S3, EC2, etc).

Architecture overview

slide-5
SLIDE 5

Binary data (photos, video, etc): Stored in Amazon’s S3. PBs. Akamai fronts for caching and acceleration.
Structured data (Database, etc): MySQL (InnoDB mostly). 4+ cores, 64GB, >2TB storage. Memcached fronts for caching.

Storage

slide-6
SLIDE 6

Photo & video processing / encoding: Handled in Amazon EC2. Totally autonomous scaling (SkyNet).
Customer facing: Diskless web boxes (PXE boot). Scaled up *and* out. MySQL. Memcached (~1TB).

Compute

slide-7
SLIDE 7

Super-fast CDN: Reads often already close to customer. More than just a CDN: HTML/AJAX/etc inspection for pre-fetch. Anticipate requests and get data to within low ms. Optimal data path to SmugMug. DNS latency reduction. $$$ but worth it. Get what you pay for.

Secret Weapon: Akamai

slide-8
SLIDE 8

Screaming fast. ~1TB of data stored. >96% hit rate. Contains MySQL row data, avoid SELECTs. Misc other data cached, but MySQL biggest win. Fall back on MySQL for cold data.

Secret Weapon: memcached

slide-9
SLIDE 9

Most important technology at SmugMug. Super dependent on replication: performance, reliability / high availability. No MySQL data loss in >7 years. No JOINs. (Or lots of 4.x+ features, either.) Vertically partitioned, not horizontally (no shards).

Secret Weapon: MySQL
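
To make "vertically partitioned, not horizontally (no shards)" concrete, here is a minimal sketch. The table and cluster names are invented for illustration and are not SmugMug's actual layout; the sharding function is shown only as a contrast.

```python
# Hypothetical table-to-cluster map. Every name below is invented for
# illustration; the slides do not describe the real layout.
TABLE_TO_CLUSTER = {
    "users": "db-accounts",
    "albums": "db-galleries",
    "photos": "db-photos",
    "comments": "db-community",
}


def cluster_for(table):
    """Vertical partitioning: the table name alone decides the cluster."""
    return TABLE_TO_CLUSTER[table]


def shard_for(table, key, shard_count):
    """Horizontal sharding, for contrast only: the row key also matters."""
    return f"{table}-shard-{key % shard_count}"


if __name__ == "__main__":
    # Every photos row lives on the same cluster, whatever its key.
    print(cluster_for("photos"))
    # With sharding, neighbouring keys would land on different shards.
    print(shard_for("photos", 10, 4), shard_for("photos", 11, 4))
```

With vertical partitioning, growth in one table is handled by scaling that table's cluster up, which fits the "big RAM, lots of cores, large storage" hardware described later in the deck.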

slide-10
SLIDE 10

Most important technology at SmugMug. Huge thanks to Heikki, Oracle, Percona and Google! Running 1.0.3+patches in production. Big performance gains with recent releases.

Secret Weapon: InnoDB

slide-11
SLIDE 11

Crazy concentration of talent under one roof. Best MySQL dollars we’ve ever spent. Helped us out of a major bind. Have you heard of the ‘back_log’ mysqld setting? Me neither. Hope you never do. Percona had. Helped build, integrate, and test InnoDB patches.

Secret Weapon: Percona

slide-12
SLIDE 12

We care about write latency above all. Well, ok, maybe data integrity. ;) Scaling reads “easy”: replication and memcached. Replication needs to stay current (<1 sec). MySQL concurrency problems. (Much improved!) Parallel I/O - lots of cores. Large storage (TBs). Big RAM (64GB+) to keep indexes hot.

MySQL details
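
One way to act on "replication needs to stay current (<1 sec)" is to route reads only to replicas under that lag and fall back to the master otherwise. The sketch below is an assumption about how that could look, not the method from the talk; the host names and the idea that lag is measured elsewhere (for example from a heartbeat row) are hypothetical.

```python
MAX_LAG_SECONDS = 1.0  # the "<1 sec" target quoted on the slide


def pick_read_host(replica_lag, master="master", max_lag=MAX_LAG_SECONDS):
    """Pick the least-lagged replica under max_lag, else fall back to master.

    replica_lag maps a (hypothetical) replica name to its measured lag in
    seconds. How the lag is measured is outside this sketch.
    """
    current = {name: lag for name, lag in replica_lag.items() if lag < max_lag}
    if not current:
        return master
    return min(current, key=current.get)


if __name__ == "__main__":
    print(pick_read_host({"replica-a": 0.2, "replica-b": 5.0}))
    print(pick_read_host({"replica-a": 3.0}))
```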

slide-13
SLIDE 13

Mostly SELECT pkey FROM table WHERE index; On cache miss, SELECT * FROM table WHERE pkey; UPDATEs/DELETEs mostly on single rows by pkey. Easy memcached expiration. Easy slave-delay tracking. Very denormalized. No JOINs or complex SELECTs. OLTP benchmark imperfect. Time for sysbench-web?

MySQL query details
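
The query pattern above (index lookup returns primary keys, rows come from cache, a miss triggers a single-row SELECT by primary key, writes expire one cache entry) can be sketched as follows. This is an illustration under stated assumptions: `sqlite3` stands in for MySQL, a plain dict stands in for memcached, and the `photos` schema and function names are invented.

```python
import sqlite3

# sqlite3 is a stand-in for MySQL; the schema is hypothetical.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE photos (id INTEGER PRIMARY KEY, album_id INTEGER, title TEXT)"
)
db.execute("CREATE INDEX photos_album ON photos (album_id)")
db.executemany(
    "INSERT INTO photos VALUES (?, ?, ?)",
    [(1, 10, "sunrise"), (2, 10, "sunset"), (3, 11, "noon")],
)

cache = {}  # stand-in for memcached, keyed by primary key
stats = {"db_row_reads": 0}  # full-row reads that had to hit the database


def photos_in_album(album_id):
    # Step 1: the indexed query returns primary keys only.
    ids = [
        r[0]
        for r in db.execute(
            "SELECT id FROM photos WHERE album_id = ? ORDER BY id", (album_id,)
        )
    ]
    rows = []
    for pk in ids:
        row = cache.get(pk)
        if row is None:
            # Step 2: on a cache miss, fetch the whole row by primary key.
            row = db.execute(
                "SELECT * FROM photos WHERE id = ?", (pk,)
            ).fetchone()
            stats["db_row_reads"] += 1
            cache[pk] = row
        rows.append(row)
    return rows


def rename_photo(pk, title):
    # Single-row UPDATE by primary key, then expire exactly that cache entry.
    db.execute("UPDATE photos SET title = ? WHERE id = ?", (title, pk))
    cache.pop(pk, None)
```

Because every write touches one row by primary key, the cache key to expire is known without any extra lookup, which is what makes the expiration "easy".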

slide-14
SLIDE 14

Better filesystem: CentOS Linux shop (lots of expertise). MySQL is storage intensive (iops, size, etc). ext3 old and busted. fsck, well, sucks. ext4 already old and busted. :( Want good volume management. Serialized writes (non-parallel). Ugh.

MySQL Issues: Filesystems

slide-15
SLIDE 15

Transactional. Copy-on-write. End-to-end data integrity. On-the-fly corruption detection & repair. Integrated volume management. Snapshots & clones. Open source software.

Filesystem Solution - ZFS!

slide-16
SLIDE 16

We run Linux. ZFS doesn’t run on Linux. Crap.

The REAL Issue

slide-17
SLIDE 17

Unknown state on crash: Did *.info get written at commit? Or is it *2 months* out of date? Bringing TB+ slaves online quickly. Backups using LVM/ZFS a pain. Keeping up with master. Single thread for replication SQL. Master promotion kludgy.

MySQL Issues: Replication

slide-18
SLIDE 18

Transactional replication patches: Slave always in known state. Either ok to bring back up or CHANGE MASTER. Safe to take snapshots anytime, no effort. Safe to use innodb_flush_log_at_trx_commit=2. InnoDB only. Stopgap. Global trx IDs better. Using in pre-production. Production next week?

Replication solutions

slide-19
SLIDE 19

Toro aka S7410. NAS storage with a few twists. 2 x Quad-Core Opteron + 64GB RAM. 100GB Readzilla SSD. 2 x 18GB Writezilla SSD. 20K write iops. 22 x 1TB 7200rpm HDD. Clustered HA configuration.

Secret Weapon: Sushi

slide-20
SLIDE 20

ZFS on Linux! SSD is here! SSD performance is cheap! Consume via NFS, iSCSI, CIFS, HTTP, FTP, etc. Massive flexibility - no more DAS. Fishworks interface is a dream. Analytics is a game changer.

Mmm, Toro tastes good.

slide-21
SLIDE 21

Initial sticker shock - $80K?! $142K clustered?! No one pays list price. Whew. Startup Essentials. Double-whew. Paradigm shift. Biggest whew! DAS -> NAS. So much IO, in theory, can “stack” lots of clients. In practice, can stack *lots* of clients. We now have 5 clustered configs. :)

Sushi’s quite reasonable

slide-22
SLIDE 22

Crazy fast. 9.6K iops, 4.5K under 43us, 8K under 166us

Sushi served fast

slide-23
SLIDE 23
Scalable. 15K 4K write iops w/16 threads. Low latency. ~250us @ 3K iops, ~700us @ 10K.

Sushi served fast

[Chart: fio write benchmark. Y axis: 4K write iops (5000 to 20000). X axis: threads (1, 2, 4, 8, 16, 32).]
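
The latency and iops figures on this slide can be related with Little's law (mean requests in flight = throughput × mean latency). The sketch below assumes the quoted latencies are averages, which the slide does not state.

```python
def outstanding_ios(iops, latency_us):
    """Little's law: mean I/Os in flight = arrival rate x mean latency."""
    return iops * latency_us / 1_000_000  # latency converted from us to s


if __name__ == "__main__":
    # ~250us at 3K iops: less than one I/O in flight on average.
    print(outstanding_ios(3_000, 250))
    # ~700us at 10K iops: several I/Os in flight at once.
    print(outstanding_ios(10_000, 700))
```

Read this way, the higher iops figures only become reachable once several requests are outstanding at a time, which is consistent with the benchmark being plotted against thread count.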

slide-24
SLIDE 24

So fast, we’re stacking like crazy. 5 different MySQL workloads on single clustered Toro. 8 slaves on single Toro. Each used to have 15K disks + write cache. Lots of excess io and space capacity still. Compression “for free” (no client CPU usage). Crazy fast. ~1.5X ratio across TBs of InnoDB.

Sushi today

slide-25
SLIDE 25

Backups a breeze. Automatic snapshots every n minutes / hours / days. No need to LOCK / shutdown / STOP SLAVE / etc. Rollback anytime. Skip bad SQL statements. New slave? Click snapshot. Click clone. Done. Slaves share unchanged data on disk and in RAM. Future bright: clone + de-dupe = insanely efficient.

Sushi today

slide-26
SLIDE 26

DTrace on Linux! Never had analytics on storage before. Vendor used to say: “Um, we dunno. Buy more spindles?” Now I know all. Vendor now says: “What does Analytics say?” Drill down on everything. Correlate anything. God-like power.

Analyzing sushi

slide-27
SLIDE 27

NFSv3 (rather than v4). 16KB record size in ZFS (InnoDB). Mirrored (RAID1+0) disks w/striped Logzilla. MySQL concurrency bound - can’t use all the I/O. If compressing, use LZJB. In theory, can optimize InnoDB: doublewrite = 0, checksums = 0. ZFS does these. In practice, no big gain with our workload.

MySQL on Toro so far

slide-28
SLIDE 28

Replication *.info files not sync’d over NFS. Found a slave with *2 month old* info files. Transactional replication to the rescue! NFS locking and InnoDB: warnings on the Net. No hard data. Actively researching. What’s the problem?

MySQL on Toro problems

slide-29
SLIDE 29

10GbE for reduced latency? Actively testing this. Driver tuning required. Defaults for throughput. Cards (Intel) & switches (Arista) cheap & fast. Less than $500/port. Copper twinax SFP+ cables cheap. Optical XFP not. $50 vs $1000+. Toro doesn’t support SFP+ cards yet. :(

Even faster?

slide-30
SLIDE 30

Everything runs better on Toro. :) Revision control. Stateless Linux mounts. Email. Developer home directories. Built-in, automatic replication for multi-site backups. Photo and video serving?

Kitchen sink on Toro

slide-31
SLIDE 31

100% SSD. Still too $$ for TB+ installs. Even better InnoDB. Community on fire. Oracle/MySQL accepting patches! Multi-threaded replication. Preview release is out. Yes! New storage engines: PBXT, Falcon, Maria, oh my!

The future?

slide-32
SLIDE 32

MySQL is a crown jewel. Not a gateway drug to Oracle. Different customers. Kill btrfs. GPL ZFS. MySQL and InnoDB under one roof = opportunity. OpenStorage is game changer. Don’t kill it. Listen to your new communities. I’m busy. I’m up here because this is important.

Oracle wishlist

slide-33
SLIDE 33

Thanks!

Blog: http://blogs.smugmug.com/don
Twitter: DonMacAskill
Email: don@smugmug.com
Percona Conference: Upstairs :)