Jaime Gago Condensing Information Systems From the Vapor Of Data


Scaled Out AWS Graphite Federated Cluster

I've hit the limit of "my" AWS Graphite single-server install (i2.xl) and needed to come up with a scaled-out design. I'm really looking forward to Jason Dixon's Graphite book, but can't wait for it, so I went down the interwebs rabbit hole. This is what I've come up with for my needs; maybe it helps somebody else out there, and I'm also hoping for feedback from the community.


This design should be able to handle a maximum of about 1 million metrics, kept online for 6 months (1-minute resolution for 90 days, 10-minute resolution for 180 days), highly available and backed up twice, for about $2k/month (AWS bill). The core is 5 i2.xl EC2 instances. Version-wise, I'm planning to use the master branches of carbon, whisper and graphite-web (currently on a patched 0.9.12).


Architecture (see attached diagram)

  • HA is achieved via a mix of ELB + carbon relays with a replication factor of 2 (each metric is sent by load-balanced top relays to bottom relays using consistent hashing)
  • Metrics are backed up via rsync to a dedicated EBS volume
  • Core nodes are not easily added: adding one means rebalancing/healing via custom scripts, which is a PITA even with carbonate. This is a by-product of consistent hashing.
  • Metrics read/write proxy nodes are easily added
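To make the replication and rebalancing points concrete, here is a minimal consistent-hashing sketch. It assumes MD5-based ring positions, 100 virtual nodes per core, and hypothetical node names (core1 to core6) and metric names; it is illustrative only and not carbon's actual ConsistentHashRing code. Each metric goes to the first 2 distinct nodes found walking clockwise from its hash, and adding a sixth core shows why a new core node forces part of the existing data to move.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative sketch,
    not carbon's ConsistentHashRing)."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self.ring = []      # sorted list of (position, node)
        self.nodes = set()
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _position(key):
        # Use the first 32 bits of the MD5 digest as the ring position.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest()[:8], 16)

    def add_node(self, node):
        self.nodes.add(node)
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._position(f"{node}:{i}"), node))

    def get_nodes(self, metric, replication_factor=2):
        """Walk clockwise from the metric's position, collecting distinct nodes."""
        if replication_factor > len(self.nodes):
            raise ValueError("replication factor exceeds number of nodes")
        i = bisect.bisect_left(self.ring, (self._position(metric), ""))
        chosen = []
        while len(chosen) < replication_factor:
            node = self.ring[i % len(self.ring)][1]
            if node not in chosen:
                chosen.append(node)
            i += 1
        return chosen


# Hypothetical 5-core cluster and metric names.
ring = HashRing([f"core{n}" for n in range(1, 6)])
metrics = [f"servers.web{n}.cpu.user" for n in range(5000)]
before = {m: set(ring.get_nodes(m)) for m in metrics}

# Add a sixth core and count metrics whose replica set changed.
ring.add_node("core6")
moved = sum(1 for m in metrics if set(ring.get_nodes(m)) != before[m])
print(f"{moved} of {len(metrics)} metrics ({moved / len(metrics):.0%}) must move")
```

The metrics whose replica set now includes the new node are exactly the whisper files that the rebalancing/healing scripts (or carbonate) have to copy over.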

Metrics Projections & Calculations

  • 1m:90d,10m:180d retention => 1,866,280 bytes ≈ 1.87 MB of storage per metric
  • i2.xl instances have an 800 GB SSD; we need to keep 10% free for SSD over-provisioning, leaving 720 GB usable. Provisioning 5 of them totals 3,600,000 MB => 3,600,000 MB / 1.87 MB = 1,925,133 metrics capacity => with a replication factor of 2, total capacity is 1,925,133 / 2 = 962,566 metrics, so about 1 million
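The capacity arithmetic above can be double-checked with a few lines of Python. This sketch uses decimal units (1 GB = 1,000 MB) and the rounded 1.87 MB per-metric figure, as the bullets do, then redoes the division with the exact 1,866,280-byte file size for comparison. The constant names are just illustrative.

```python
SSD_GB = 800              # one SSD per i2.xl core node
OVERPROVISION_PCT = 10    # kept free for SSD over-provisioning
CORE_NODES = 5
REPLICATION_FACTOR = 2
METRIC_MB = 1.87          # rounded whisper file size for 1m:90d,10m:180d
METRIC_BYTES = 1_866_280  # exact whisper file size

# Integer math: 5 nodes * 800 GB * 90% = 3,600 GB = 3,600,000 MB
usable_mb = CORE_NODES * SSD_GB * (100 - OVERPROVISION_PCT) // 100 * 1000
raw_capacity = int(usable_mb / METRIC_MB)            # metric files that fit
unique_metrics = raw_capacity // REPLICATION_FACTOR  # each metric stored twice

# Same calculation with the exact per-file byte count
exact_unique = usable_mb * 1_000_000 // METRIC_BYTES // REPLICATION_FACTOR

print(f"usable: {usable_mb:,} MB")
print(f"capacity: {raw_capacity:,} files -> {unique_metrics:,} unique metrics")
print(f"with exact file size: {exact_unique:,} unique metrics")
```

Using the exact byte size yields roughly 1,900 more unique metrics, so the rounded 1.87 MB figure is slightly conservative.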

AWS Costs (Reserved Instances are assumed)

  • m3.medium: $372/year, $31/month * 4 = $124/month
  • i2.xl: $3,114/year, $259/month * 5 = $1,295/month
  • EBS 800 GB: $80/month * 5 = $400/month
  • RDS db.m3.xl: $439/year, $36/month * 2 = $72/month
  • ELBs: ~$100/month * 2 = $200/month
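A quick tally of the monthly line items above (per-unit monthly price times unit count, exactly as listed) shows where the ~$2k/month figure comes from. This is plain arithmetic on the numbers in the list, not a pricing lookup.

```python
# (line item, $/month per unit, units) as listed above
items = [
    ("m3.medium", 31, 4),
    ("i2.xl", 259, 5),
    ("EBS 800GB", 80, 5),
    ("RDS db.m3.xl", 36, 2),
    ("ELB", 100, 2),
]

for name, per_unit, count in items:
    print(f"{name:<14} {count} x ${per_unit:>3} = ${per_unit * count:>5,}/month")

total = sum(per_unit * count for _, per_unit, count in items)
print(f"{'Total':<14} ${total:,}/month")
```

That comes to $2,091/month, i.e. about $2.1k.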

For kicks

962,566 metrics with the proposed retention represent

(129,600 + 25,920) * 962,566 = 149,698,264,320 data points, so about 150 billion
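The per-metric point counts come straight from the retention definition: for each archive, retention divided by seconds per point. A small sketch of that calculation, counting each metric once (with a replication factor of 2 the cluster stores twice as many points on disk):

```python
DAY = 86_400  # seconds

# (secondsPerPoint, retention in seconds) for 1m:90d,10m:180d
archives = [(60, 90 * DAY), (600, 180 * DAY)]
points_per_metric = sum(retention // step for step, retention in archives)

metrics = 962_566
total_points = points_per_metric * metrics
print(f"{points_per_metric:,} points/metric x {metrics:,} metrics = {total_points:,}")
```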

Source: running whisper-info.py on a whisper metric file with the proposed retention:

~$ whisper-info.py my_metric.wsp
maxRetention: 15552000
xFilesFactor: 0.5
aggregationMethod: average
fileSize: 1866280

Archive 0
retention: 7776000
secondsPerPoint: 60
points: 129600
size: 1555200
offset: 40

Archive 1
retention: 15552000
secondsPerPoint: 600
points: 25920
size: 311040
offset: 1555240
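The whisper-info.py numbers can also be derived by hand. A whisper file starts with a 16-byte metadata header, then a 12-byte info record per archive, then 12 bytes per point (a 4-byte timestamp and an 8-byte double); the struct formats below mirror the ones defined in whisper.py. `whisper_layout` is a hypothetical helper, and its retention parsing is a simplified take on storage-schemas.conf syntax (single-letter units s/m/h/d/w/y, a bare number after the colon meaning a point count).

```python
import struct

# On-disk formats as defined in whisper.py (network byte order, no padding).
METADATA_FMT = "!2LfL"    # aggregationType, maxRetention, xFilesFactor, archiveCount
ARCHIVE_INFO_FMT = "!3L"  # offset, secondsPerPoint, points
POINT_FMT = "!Ld"         # timestamp, value

UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def to_seconds(spec):
    """'1m' -> 60, '90d' -> 7776000, '60' -> 60 (bare number = seconds)."""
    if spec[-1] in UNITS:
        return int(spec[:-1]) * UNITS[spec[-1]]
    return int(spec)


def whisper_layout(retentions):
    """Hypothetical helper: compute what whisper-info.py reports for a
    retention string such as '1m:90d,10m:180d'."""
    archives = []
    for part in retentions.split(","):
        precision, duration = part.split(":")
        step = to_seconds(precision)
        # A bare number is a point count; a value with a unit is a duration.
        points = int(duration) if duration.isdigit() else to_seconds(duration) // step
        archives.append((step, points))

    # Point data starts right after the header and the archive info records.
    offset = struct.calcsize(METADATA_FMT) + len(archives) * struct.calcsize(ARCHIVE_INFO_FMT)
    info = []
    for step, points in archives:
        size = points * struct.calcsize(POINT_FMT)
        info.append({"retention": step * points, "secondsPerPoint": step,
                     "points": points, "size": size, "offset": offset})
        offset += size
    return {"maxRetention": max(a["retention"] for a in info),
            "fileSize": offset, "archives": info}


layout = whisper_layout("1m:90d,10m:180d")
print("maxRetention:", layout["maxRetention"], "fileSize:", layout["fileSize"])
for n, archive in enumerate(layout["archives"]):
    print(f"Archive {n}", archive)
```

The computed offsets, sizes and file size line up with the whisper-info.py output, which is where the 1.87 MB per-metric figure comes from.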

Comments and backed WTFs greatly appreciated.



Comments
  1. I went with EBS volumes for the cores instead of i2 instances. They're easier to deal with (e.g. you can scale IOPS and size on demand) and have built-in snapshots that are easy to automate, which makes Graphite data backup simpler.
