Scaling Websites

Some Lessons Learned at


Who are you guys?


we both work in web operations @ mozilla

Not just firefox?

  • mozilla.org (Firefox downloads)
  • input.mozilla.org (happy/sad face)
  • crash-stats.mozilla.org (crash reporting)
  • support.mozilla.org (user support community)
  • ... and hundreds more

summary

architecture
load balancers
databases
async jobs
caching
self service
paas
cloud

architecture

clusters
admin node
web nodes
database nodes
some clusters shared

Load Balancers

Zeus (now Stingray)
software solution

Platform

  • RHEL6
  • HP DL360
  • Myricom Myri-10G

Details

in front of nearly everything

  • apache, mysql, elasticsearch, hadoop
  • 200k packets per second for SSL

Databases

/dev/null is web scale.

MySQL Multi-Master

Only ONE is active for writes in the Load Balancer.

Read Slaves

Write to master, but only read from slaves.


 DATABASES = {
    ...

    'slave': {
       'ENGINE': 'django.db.backends.mysql',
       'NAME': 'mozillians_org',
       'USER': 'mozillians',
       'PASSWORD': 'YoUt#!nk+h1$izR3@l?',
       'HOST': 'generic-ro-zeus',
       'PORT': '3306',
    },
 }

 SLAVE_DATABASES = ['slave']
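
A minimal sketch of how reads could be sent to the 'slave' alias while writes stay on the master. Django database routers are plain classes with db_for_read / db_for_write methods; the class name ReadSlaveRouter, the 'default' master alias, and the round-robin choice are assumptions for illustration, not necessarily what runs in production here.

```python
import itertools

# Mirrors the SLAVE_DATABASES setting above; 'default' is assumed to be the
# master alias (Django's conventional name for the primary connection).
SLAVE_DATABASES = ['slave']
MASTER_DATABASE = 'default'


class ReadSlaveRouter:
    """Hypothetical router: reads go to a slave, writes go to the master."""

    def __init__(self, slaves=SLAVE_DATABASES, master=MASTER_DATABASE):
        self.master = master
        # Cycle through the slaves round-robin; fall back to the master
        # when no slave is configured.
        self._slaves = itertools.cycle(slaves or [master])

    def db_for_read(self, model, **hints):
        return next(self._slaves)

    def db_for_write(self, model, **hints):
        return self.master

    def allow_relation(self, obj1, obj2, **hints):
        # Every alias points at the same replicated data set.
        return True
```

In a Django project a class like this would be listed in the DATABASE_ROUTERS setting. Replication lag means a read right after a write may not see it yet, which this sketch does not handle.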
						

Hardware

No virtualization in production.

  • HP blades
  • Fusion-IO
  • HP and Kingston SSDs

DBAs

AWESOME DBAs are AWESOME!
+ query optimization, like code reviews.

A'SYNC Jobs

webscale boy band

Celery

  • don't block the web app
  • written in python & we use django
  • supervisord for celeryd

Rabbit MQ

  • message queue between web app & celery service
  • cluster per datacenter
  • puppet module to horizontally scale
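
The "don't block the web app" idea, sketched with the standard library only. A queue.Queue stands in for RabbitMQ and a background thread stands in for a celeryd worker; this is not Celery's API, and send_welcome_email / handle_signup are made-up names.

```python
import queue
import threading

# Stand-in for the RabbitMQ queue between the web app and the workers.
jobs = queue.Queue()
results = []


def send_welcome_email(address):
    """Hypothetical slow task; a real one would talk to an SMTP server."""
    results.append("sent to " + address)


def worker():
    """Stand-in for a celeryd worker: pull jobs off the queue and run them."""
    while True:
        func, args = jobs.get()
        try:
            func(*args)
        finally:
            jobs.task_done()


def handle_signup(address):
    """Web request handler: enqueue the slow work and return right away."""
    jobs.put((send_welcome_email, (address,)))
    return "202 Accepted"


# Daemon thread so the worker never keeps the process alive on exit.
threading.Thread(target=worker, daemon=True).start()
```

The handler returns as soon as the job is queued; the worker picks it up on its own schedule, which is the property the real Celery + RabbitMQ setup provides across machines.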

Cache

rules everything around me

Memcache

  • we use vanilla memcached

memcache::data

  • ephemeral data (sessions/rss feeds/etc)
  • short lived and can be lost without impact

memcache::databases

  • django-cachemachine
  • object manager, looks in cache first for data
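
A rough sketch of the "look in cache first" pattern. DictCache is an in-process stand-in for a memcached client, and CachingManager is a hypothetical name; this is not django-cachemachine's code, and invalidation on writes is left out.

```python
class DictCache:
    """In-process stand-in for a memcached client (get/set only)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class CachingManager:
    """Hypothetical object manager that checks the cache before the database."""

    def __init__(self, cache, fetch_from_db):
        self.cache = cache
        self.fetch_from_db = fetch_from_db  # callable: pk -> object
        self.db_hits = 0

    def get(self, pk):
        key = "obj:%s" % pk
        obj = self.cache.get(key)
        if obj is None:
            # Cache miss: go to the database, then populate the cache.
            obj = self.fetch_from_db(pk)
            self.db_hits += 1
            self.cache.set(key, obj)
        return obj
```

Fetching the same primary key twice through one manager should touch the database only once.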

Local HTTP caching

  • We use Zeus
  • You can also use: Varnish, Squid

Global HTTP Caching: CDN

    • ~450 million Firefox users (6 wk updates)
    • vendors: Akamai/EdgeCast (65%/35%)
    • balance traffic with DynECT based on response

    Akamai::FF18 HPS

    Jan 10, 2013 -> Jan 13, 2013 inclusive.

    • Total hits: 5.5 billion
    • Peak HPS: 58,379.7 hits/sec

    Akamai::FF18 Bandwidth

    Jan 10, 2013 -> Jan 13, 2013 inclusive.

    • Total volume: 2.1PB
    • Peak traffic: 163.177 GBit/sec
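
For a sense of how spiky a release is, the totals above can be turned into averages and compared with the peaks. A back-of-the-envelope sketch, assuming the window is four full days and that PB means decimal petabytes (10^15 bytes):

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS = 4  # Jan 10 through Jan 13 inclusive

total_hits = 5.5e9
peak_hps = 58379.7
total_bytes = 2.1e15  # 2.1 PB, assuming decimal petabytes
peak_gbps = 163.177

window = DAYS * SECONDS_PER_DAY
avg_hps = total_hits / window
avg_gbps = total_bytes * 8 / window / 1e9  # bytes -> bits -> Gbit/sec

print("average hits/sec: %.0f (peak is %.1fx)" % (avg_hps, peak_hps / avg_hps))
print("average Gbit/sec: %.1f (peak is %.1fx)" % (avg_gbps, peak_gbps / avg_gbps))
```

Under those assumptions both peaks land between three and four times the average, so capacity has to be sized for the peak, not the mean.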

    Scale Out

    or you fail out

    Config Management

    We chose Puppet, but there are other great options, like Chef and CFEngine.

    Disposable Web Heads

    • nothing is shared
    • Seamicro Xeon
    • common files (uploads/css/js) in NetApp NFS
    • S3 to replace NFS for upload storage (amo/marketplace)

    AMD Seamicro

    • deployed for increased compute efficiency
    • saves up to 75% in space/power
    • enables 192 vs. 64 hosts per 45U rack

    The Future

    where we're going, we don't need roads

    DevOps culture

    • blameless postmortems
    • all invested in the same mission
    • continuous improvement (always try to make the process better)
    • hire the best f$*!ing people

    Self Service::Goal

    to become platform engineers!

    Self Service::Continuous Deployment

    • django-waffle
    • dark launching / feature flags
    • sumo, amo, input, mdn
    • flag_is_active checks: cookies, superuser, group, "dice roll"
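
A toy version of the kind of checks a feature flag can make: cookie, superuser, group, then a percentage "dice roll". This is a sketch and not django-waffle's implementation; the function name, the dict shapes, and the cookie naming are all made up for illustration.

```python
import random


def flag_is_active_sketch(flag, user, cookies, rng=random.random):
    """Hypothetical flag check; NOT django-waffle's real implementation.

    flag: dict with 'name' and optional 'superusers' (bool), 'groups'
    (set of names), 'percent' (0-100).
    user: dict with optional 'is_superuser' and 'groups'.
    cookies: mutable dict standing in for the browser's cookies.
    """
    cookie_name = "flag_" + flag["name"]
    # A cookie from an earlier dice roll keeps the answer sticky per browser.
    if cookie_name in cookies:
        return cookies[cookie_name] == "True"
    if flag.get("superusers") and user.get("is_superuser"):
        return True
    if flag.get("groups", set()) & set(user.get("groups", [])):
        return True
    percent = flag.get("percent")
    if percent is not None:
        # "Dice roll": enable for roughly `percent` of first-time visitors.
        active = rng() * 100 < percent
        cookies[cookie_name] = str(active)
        return active
    return False
```

Storing the roll in a cookie means a visitor who lands in the dark-launched group stays there on later requests instead of flipping between old and new behavior.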

    Self Service::Chief

    • 90% of site pushes to prod by end of 2013Q1

    Self Service::Jenkins

    • socorro - tarballs
    • stage autodeploys

    Self Service::Graphite

    everyone has access to the graphs, real time.
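
Getting a number onto a graph can be as small as one line of text. A sketch using Graphite's plaintext protocol (metric path, value, Unix timestamp, newline, sent to carbon on port 2003); the host name and metric path are placeholders.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one data point as 'path value timestamp' plus a newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)


def send_metric(path, value, host="graphite.example.com", port=2003):
    """Send one data point to carbon; the default host is a placeholder."""
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(graphite_line(path, value).encode("ascii"))
```

For example, graphite_line("web.requests", 42, 1357776000) produces the line "web.requests 42 1357776000" followed by a newline.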

    Self Service::Logstash

    With Kibana, everyone has access to the logs.
    yup, real time.

    Self Service::Sentry

    everyone has access to exception tracking.
    you guessed it, real time!

    PaaS

    • we chose Stackato by ActiveState (built on CloudFoundry)
    • evaluated CloudFoundry, OpenShift & various hosted options
    • chose the most product-focused

    Cloud

    • dynamically scale in cloud, base footprint in datacenter
    • PaaS -> add DEA instances for scaling extra capacity.
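
The "base footprint plus burst" idea comes down to simple arithmetic. A hypothetical sizing helper; the function name, the per-instance throughput, and the example numbers are assumptions, not measurements from this deployment.

```python
import math


def extra_instances_needed(expected_rps, rps_per_instance, base_instances):
    """Return how many cloud instances to add on top of the datacenter base."""
    required = math.ceil(expected_rps / rps_per_instance)
    # Never go below the base footprint; only burst when demand exceeds it.
    return max(0, required - base_instances)
```

With a base of 8 instances that each handle 100 requests/sec, an expected 1,050 requests/sec would call for 3 extra instances, and 500 requests/sec would call for none.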

    keep on rockin'
    the free web