Evergreens

Mark Smith's Journal

Work related musings of a geek.

the state of Dreamwidth: load, capacity, etc

[staff profile] mark
Things are calm at the moment, so it seems a good time for me to ruminate on the current state of Dreamwidth's load/capacity/etc. Please let me know if anything is unclear, or if you have any concerns, or whatever -- I'll do my best to answer everything.

The summary -- Dreamwidth has definitely been hit with a lot of extra load in the past few weeks, but it's maybe not as much as you thought. We're over double what we were a month ago, but it's still only double -- we already had a good amount of load. Here's a good graph showing the bump:

[Graph: Dreamwidth bandwidth usage]

There are several main "systems" that make up Dreamwidth (or, really, most web sites). They are the frontend, the web servers, the cache, the databases, and the miscellaneous services.

Let's take it one by one... we'll start with the easy things. We're going to talk about the current state of each system and how it scales. "Scaling" is a term that loosely means "making it handle more traffic". (Where traffic is more users, more features, more whatever.)

Miscellaneous Services


These are pretty straightforward. Scaling these is, typically, a matter of just running more of them. We're nowhere near capacity on most of these, and any that are, we can just run more of the worker processes. I'm not too worried about these -- even if they do get overloaded, they won't stop the site itself from working. People can still read, post, and do stuff.

If these overload, it will affect emails going out, search, payments, and similar things that are considered non-critical services. (I.e., if search goes down for a day, it's frustrating and I will do my best to get it back online ASAP, but it's a lower priority than web servers or databases.)

Cache


We use memcached for most of our caching. Having this service online is crucial for the site to be working -- without our cache, the databases will overload and croak. No good.
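To see why the cache matters so much, here is a minimal Python sketch of the usual "check the cache first, fall back to the database" read path. It is an illustration only, not Dreamwidth's code: a plain dict stands in for memcached, and `CountingDatabase` and `cached_fetch` are made-up names.

```python
class CountingDatabase:
    """Stand-in for a real database; counts how often it is queried."""

    def __init__(self, rows):
        self.rows = rows
        self.queries = 0

    def fetch(self, key):
        self.queries += 1
        return self.rows[key]


def cached_fetch(cache, db, key):
    """Cache-aside read: try the cache first, fall back to the database."""
    if key in cache:
        return cache[key]
    value = db.fetch(key)
    cache[key] = value  # remember the answer for the next reader
    return value


db = CountingDatabase({"user:1": "mark"})
cache = {}  # a plain dict standing in for memcached (assumption)
for _ in range(100):
    cached_fetch(cache, db, "user:1")
print(db.queries)  # 1: only the first read reached the database
```

Take the cache away and all 100 reads would land on the database instead of one.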

Thankfully, though, this system is nearly free to scale. All we have to do is deploy another few instances and update the site config to use them. The downside is that adding more instances will cause the site to slow down for a little while because the entire cache has to be emptied and redistributed across the new, larger cache cluster.
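The slowdown after adding instances can be sketched in a few lines of Python. This assumes the client picks an instance by hashing the key modulo the number of instances, which is a common approach in memcached clients but only an assumption about the setup here; the MD5 hash and the key names are stand-ins for illustration.

```python
import hashlib


def server_for(key, num_servers):
    """Pick a cache instance by hashing the key modulo the pool size."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_servers


def fraction_remapped(keys, old_size, new_size):
    """Fraction of keys whose instance changes when the pool is resized."""
    moved = sum(
        1 for k in keys if server_for(k, old_size) != server_for(k, new_size)
    )
    return moved / len(keys)


# Hypothetical key names, purely for illustration.
keys = [f"userpic:{i}" for i in range(10_000)]
moved = fraction_remapped(keys, 4, 6)
print(f"{moved:.0%} of keys land on a different instance going from 4 to 6")
```

Under this scheme roughly two thirds of the keys change instance when the pool grows from 4 to 6. Every one of those is a cache miss until it is fetched again, which is why the change is best done at a quiet time.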

We're getting close to the point where I want to deploy new cache instances, and I will be doing that when I get the new databases up and ready. I'll schedule it for a low traffic time so it should have minimal impact on the site.

I'm very comfortable with our status here and our ability to scale out for more capacity.

Web servers


These are the actual machines that handle processing the web pages, as you might expect. The nice thing about them, though, is that they are horizontally scalable. This means that adding more of them adds more capacity in a linear fashion. If we have ten web servers and they're overloaded, adding ten more doubles our capacity for this service.

We currently have six machines handling web requests and we can easily add more. It just takes about 48 hours' notice to our hosting provider to get them to spin up and deploy a new machine. As soon as I notice us getting close to capacity on this tier, I submit a request and we get more up. Since the big bump of users two weeks ago, we've added two more web servers. If the load holds where it is now, we'll stay at this level -- but again, it's easy to add more.

I'm very comfortable with our status here, too.

Databases


We're currently running on MySQL databases. These machines are a lot more expensive than web servers -- more RAM, fancy disks, a RAID card with BBU, etc -- and they're a lot harder to scale than the webs. Harder, but not really impossible.

Physically, we have two machines. Logically, though, there are two types of databases -- the global database cluster and the user clusters. We have to talk about scaling the database in terms of its logical components, since they have different scaling requirements.

The user clusters are effectively horizontally scalable, just like the web servers. We bring two more machines online, create a new user cluster on them, and then start moving people over to it. We can balance the load on the user databases by increasing or decreasing the number of users that "live" on each cluster. You can see what user cluster you're on with our Where am I? tool.
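As a rough illustration of the idea (not Dreamwidth's actual code), the Python sketch below assumes a lookup table that records which cluster each user lives on, so that balancing load means changing entries in that table. `ClusterDirectory` and its methods are hypothetical names, and copying the user's data between clusters is not modeled.

```python
class ClusterDirectory:
    """Hypothetical lookup table: which user cluster each account lives on."""

    def __init__(self, cluster_ids):
        self.cluster_ids = list(cluster_ids)
        self.user_cluster = {}

    def add_cluster(self, cluster_id):
        """Register a new, empty user cluster."""
        self.cluster_ids.append(cluster_id)

    def load(self):
        """Number of users per cluster, including empty clusters."""
        counts = {cid: 0 for cid in self.cluster_ids}
        for cid in self.user_cluster.values():
            counts[cid] += 1
        return counts

    def assign(self, user):
        """Place a new user on the least-populated cluster."""
        counts = self.load()
        target = min(self.cluster_ids, key=lambda cid: counts[cid])
        self.user_cluster[user] = target
        return target

    def move_user(self, user, cluster_id):
        """Point a user at a different cluster (data copy not modeled)."""
        if cluster_id not in self.cluster_ids:
            raise ValueError(f"unknown cluster {cluster_id}")
        self.user_cluster[user] = cluster_id

    def where_am_i(self, user):
        return self.user_cluster[user]


directory = ClusterDirectory([1])
for name in ["alice", "bob", "carol", "dave"]:
    directory.assign(name)

directory.add_cluster(2)          # new machines come online
directory.move_user("carol", 2)   # start moving people over
directory.move_user("dave", 2)
print(directory.load())           # {1: 2, 2: 2}
```

Because every request looks the user up first, adding a cluster never forces existing users to move; they only move when someone decides to rebalance.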

The global cluster is harder to scale. There are some bits of data that have to live in one place because keeping them in several places makes the code very, very hard to get right. Think about it like having two bosses -- if you have two bosses who do the same thing, you're never really sure who to listen to. Jim may tell you to work on project X, but Sally might say work on project Y. How do you decide what to do?

On the plus side, our global cluster is a lot, lot smaller than the user clusters. It only stores things like payments, user login information, and some other data that is pretty small and lightly used. It has a much higher capacity (how much load we can throw at it) before we have to consider scaling it.

Even then, scaling it can be done by adding more machines in as slaves -- i.e., exact copies of the master global database. This will buy us a decent amount of headroom before we have to consider doing something fancier like moving to SSDs instead of rotating disks. We can also add more cache machines to give us even more capacity.
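A minimal Python sketch of that arrangement, under the assumption that the copies only answer reads while every write still goes to the single master (so there is still only one "boss"). The class and host names are invented for illustration.

```python
import itertools


class GlobalClusterRouter:
    """Hypothetical router: one master takes writes, copies share reads."""

    def __init__(self, master, replicas):
        self.master = master
        # With no copies yet, reads simply fall back to the master.
        self._read_pool = itertools.cycle(replicas or [master])

    def for_write(self):
        """All writes go to the one master, so there is a single authority."""
        return self.master

    def for_read(self):
        """Reads rotate across the copies, spreading the load."""
        return next(self._read_pool)


router = GlobalClusterRouter(
    "global-master", ["global-copy-1", "global-copy-2"]
)
print(router.for_write())
print([router.for_read() for _ in range(4)])
```

This only adds read capacity. Write capacity is still bounded by the one master, which is why the fancier options come up later.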

We're hitting close to capacity on our existing databases, but we have two more machines on their way right now. They should be set up pretty soon (in the next day or so) and then we'll have more than double our current capacity. Also, we're still running on a MySQL version that is two years old -- there have been a lot of improvements to MySQL (particularly the Percona branch) since then, and I will be upgrading us soon.

All told, I'm pretty comfortable with our scaling here. Our existing systems are getting loaded but there's a very clear path from here to get us to more than 10x our current size. Once we start getting that big we'll have to do some more interesting work, but if we get to 10x our current size, we should have enough money that it will be no problem at all.

Frontend



Finally, the frontend -- our load balancer -- the machine that handles getting all of the user traffic from the Internet to our web servers. We're running a combination of software on this machine, primarily Pound and Perlbal. (Although soon I will be adding Varnish to help with userpic caching.)

Scaling the frontend is easy up to a certain point, after which it becomes really hard. Thankfully that "certain point" is fairly far off. Right now we're at about 25% capacity on this machine -- this is after the doubled load! -- and adding in a Varnish cache for userpics should help reduce that to about 15%.

When we start getting closer to that point I have a few ideas that will help with the load -- notably offloading the Perlbal instances to another machine -- and that will allow us to go up to the bandwidth limit of the machine. We're doing up to about 25Mbps right now and we can go close to 800Mbps before we start to hit capacity on that front.
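For the bandwidth figures, the arithmetic is short enough to check directly. In the sketch below only the 25Mbps and 800Mbps numbers come from the post; the function name is made up.

```python
def headroom_factor(current_mbps, limit_mbps):
    """How many times the current traffic fits under the bandwidth limit."""
    if current_mbps <= 0:
        raise ValueError("current traffic must be positive")
    return limit_mbps / current_mbps


factor = headroom_factor(25, 800)
print(f"{factor:.0f}x current traffic before hitting the bandwidth limit")
```

That works out to 32x on bandwidth alone, assuming nothing else on the machine becomes the bottleneck first.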

In short, then, I believe we're in good shape on this front and have a clear path to scaling this out to more than 10x our current load.

Code/other concerns



Honestly, the part that is most likely to bite us is also one of the easiest to fix -- and that's our code. There are certainly inefficient things in our codebase and we will have to address them as they come up. This is also exactly the kind of thing that has led LJ to temporarily suspend ONTD and similar communities from time to time, because that's the most expedient way to get the service back to normal for everybody else while they isolate and fix the problem triggered by the heavy users.

Dreamwidth will have the same policy, too. If the site goes down and it turns out to be because of a particular community or heavy user, we'll take what action we need to bring the site back -- and then we'll work our tails off to get service restored to that particular user/group. I also promise that we will communicate with anybody affected by this and let you know what's up -- you won't sit and wonder what happened.

Open floor!



All that said... any questions? Fire away, I'll answer them to the best of my ability. (Although I will say that right now I'm going to step away from the computer and go make some bread. It's New Year's Eve and I'd like to spend some time with my partner, [personal profile] aposiopetic. I'll check back in though!)

And, if I haven't said it enough, thank you for using Dreamwidth. It's really gratifying to see people moving in and giving things a whirl. We've worked really hard on this site for the past few years -- this is our baby! -- and I'm so excited to share.
Page 2 of 2 << [1] [2] >>
01.01.2012 04:02 am (UTC)

(no subject)

kyrielle: A photo of kyrielle, in profile, turned slightly toward the viewer (Default)
Posted by [personal profile] kyrielle
Oh man, Varnish. I know very little about it and technically it may be appropriate, but it's most of the errors I've seen over at LJ and the errors they show are really confusing / unpleasant to a layperson or semi-layperson - especially as it's very easy (at least, I like to think it is because I did it REPEATEDLY until someone told me) to read it as 'Vanish' instead of 'Varnish'. Vanish is what the site did when I was seeing those; I thought someone was being a smart-aleck with messages!
02.01.2012 04:46 am (UTC)

(no subject)

kyrielle: A photo of kyrielle, in profile, turned slightly toward the viewer (Default)
Posted by [personal profile] kyrielle
Good to know! LJ was for a while serving up something like "503 varnish error - service unavailable" instead of pages. Which is a cruddy error to present users. I'd never heard of varnish and misread as vanish, which made the error message seem like a techie's bad in-joke. (Having had to fix a bug report at work where the reported error was "and this can't actually happen" ... Many thanks to the overconfident coworker who put that in! Of course, I also rejected a change on review that included the error message "and something useful here" but he'd meant to replace that and was embarrassed. So I'm maybe a bit ready to suspect technical staff burnout or silliness when I see what I perceive to be a meaningless or snide error....)
01.01.2012 04:15 am (UTC)

(no subject)

mdlbear: the positively imaginary half of a cubic mandelbrot set (Default)
Posted by [personal profile] mdlbear
Just want to say thank you -- both for the site, and the informative post. (And just because I've been programming professionally since forever, that doesn't mean I know a damned thing about databases. Some things are best left to the experts. :)
01.01.2012 04:30 pm (UTC)

(no subject)

risha: Illustration for "Naptime" by Martha Wilson (Default)
Posted by [personal profile] risha
Comments like this always make me consider taking the time to learn MySQL. I loved my database classes in college, thoroughly enjoy writing SQL, and make a habit of enjoying software that everyone else hates (I currently work on DOC1 full time).
01.01.2012 07:30 am (UTC)

(no subject)

sepdet: (Georgie)
Posted by [personal profile] sepdet
Happy New Year from my new iPad whose very first bookmark was Dreamwidth. I was a paid user on LJ for 9 years. I hope to be a paid user on DW for at least ten!
01.01.2012 10:03 am (UTC)

(no subject)

arise: (cagalli | snowboarding! fuck yeah)
Posted by [personal profile] arise
Just wanted to add my voice to the chorus of thank you very much for such a sensemake explanation, and have a great New Year :)
01.01.2012 10:06 am (UTC)

(no subject)

claidheamhmor: (Default)
Posted by [personal profile] claidheamhmor
I love the detail you give in your explanations; it gives me the comforting feeling that you know what you're talking about and that you've put planning into everything.
01.01.2012 11:00 am (UTC)

(no subject)

kotturinn: (Default)
Posted by [personal profile] kotturinn
Thank you for an interesting and very well-written post. As someone whose job also involves a mix of systems and documentation I take off the hats to you. At some point I would like to offer as a volunteer for something but I can't say when - I'm afraid I need a combination of more paws and more spoons first.

One (two :-) ) question(s). Was MySQL part of the LJ inheritance? Do you intend to continue using it or might you consider migrating to a different DB at some point?
Edited 01.01.2012 11:01 am (UTC)
01.01.2012 11:01 am (UTC)

(no subject)

pensnest: bright-eyed baby me (Default)
Posted by [personal profile] pensnest
I'm just going to echo what [personal profile] claidheamhmor said, and add, best wishes for a happy, healthy and prosperous 2012.
01.01.2012 01:12 pm (UTC)

(no subject)

anatsuno: a black and wide photo of anatsuno, grinning (all about ana)
Posted by [personal profile] anatsuno
Happy new year, Mark! This post was a joy to read. Thank you for all you do. :D
01.01.2012 03:53 pm (UTC)

(no subject)

prk: (GeeksRock)
Posted by [personal profile] prk
Thanks for the summary, very interesting and good to see that most components are significantly scalable.

Out of curiosity, what's your plan if DW becomes exponentially more successful, and you find that not even the most powerful server has the capacity to serve as a global DB Master? Can you shard that data?

On the SLB side, any thoughts / potential plans around Global SLB & multiple data centre sites?

Prk
01.01.2012 06:01 pm (UTC)

(no subject)

vampslayer04: (Family Don't End with Blood)
Posted by [personal profile] vampslayer04
Anyone else having problems with LJlogin and Dreamwidth? It's not Dreamwidth's fault, but I just needed to vent. The Dreamwidth "Friends Page" is not labeled as such, and so when LJlogin tries to load it automatically, like I have it set to do everywhere else, it gives me a 404 error.

I use a lot of journals, so going without the login is kinda not gonna happen, and I don't like having to make it load my journal first, then click through to my reading page.

I know, I know, cry moar. Especially since this isn't really something that Dreamwidth needs to change.
01.01.2012 08:33 pm (UTC)

(no subject)

amphibologies: (Default)
Posted by [personal profile] amphibologies
I've never really used LJlogin to auto-load a page (I wasn't even aware it could, tbh) but I'm not having any problems with it. :|a Mind, my usual tendency is to log in via LJlogin and then just type the url into the appropriate bar.

Sorry I can't be more helpful, though. :(
01.01.2012 10:17 pm (UTC)

(no subject)

lassarina: (Far From Home)
Posted by [personal profile] lassarina
Thank you so much for breaking this down into things I can understand! I am so delighted that you and Denise are running this place, and explaining things so clearly, and I really enjoy this kind of behind-the-scenes post.
02.01.2012 06:45 pm (UTC)

(no subject)

aesopian: (pen and ink)
Posted by [personal profile] aesopian
Oh dear... touched in the stats place. Ah... questions and inquiries regarding the chart.

1) Do you have a visual record of Dreamwidth's data usage since its creation? (... or just a statistical record.)
2) Do you have or know of any other spikes like this in Dreamwidth's history? If so, do you remember the causes and/or effects of that spike?
3) Will you be able to provide charts like this for the next 3-4 months as more communities migrating from LJ become active on DW?
4) Ah... there's probably not a way to get a usage chart like this from LJ... I know you're no longer with their company, but do you have a suggestion about who could be contacted for this information?

... Finally, if I do pursue this event as the subject matter for an academic research paper, would it be acceptable to contact you or Denise for an interview?

(Please note, this is a highly hypothetical proposition at the moment, but growing stronger with the more information I see.)
02.01.2012 11:48 pm (UTC)

(no subject)

Posted by [personal profile] six_echo
Um. Totally off topic, but I can't find where to ask. My inactive icons from LJ are all importing over to my new DW accounts. First, yay! and thanks, but second, I didn't think this was possible, is it a new feature?
03.01.2012 12:00 am (UTC)

(no subject)

denise: Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome (Default)
Posted by [staff profile] denise
It's actually a change Mark made to work around a bug with how LJ was delivering data to us -- we could've spent hours and hours and hours debugging things, or we could've just said "eh", imported everything, and sorted it out later. We decided to not sweat it and be lazy instead. :)
03.01.2012 12:06 am (UTC)

(no subject)

Posted by [personal profile] six_echo
03.01.2012 12:11 am (UTC)

(no subject)

Posted by [staff profile] denise
03.01.2012 12:14 am (UTC)

(no subject)

Posted by [personal profile] six_echo
03.01.2012 03:50 am (UTC)

(no subject)

Posted by [identity profile] acdragonmaster.livejournal.com
I'm a little curious, if this isn't too late to ask questions, but given that a large number of the people migrating or considering migrating from LiveJournal are RP communities, do you anticipate that this would cause a higher strain on the services than regular users? I'm wondering because of the fact that RP accounts generally have and make use of more icons than most regular users that I've seen, as well as tending to have much longer threads of comments in a much shorter period of time, which generally tends to lend itself to using quite a bit more bandwidth. This isn't so much a question of comm migration (though I have a question or two on that shortly XD) as the influx of user activity some of the larger communities would bring with them, especially with the games where a single post can easily get 2-5k comments in the span of hours at times. It's a pretty specific concern of mine, since there's definitely a difference between 100 more "average" users who only post a dozen or two comments in a day, and 100 RPers who can sometimes rack up over a thousand comments apiece in a week. And, well, I've seen a site be accidentally taken down by its own users at times because too many of them were refreshing a page too much at once, so that's something I'm definitely wondering about.

I'm also curious, since I couldn't find the answer anywhere else I was linked or poked around, but am I correct in inferring that when communities are migrated, it can be done more or less piecemeal, if necessary? That is, I saw a remark about importing entries first and comments later, so I was wondering about how that worked; what if, for example, entries were imported without their comments, and someone commented here to one of those entries, then the comments were later imported? Would this cause any problems? Or similarly if posts on a community (or journal) were forward or back dated, would that sort of thing affect things if new posts were made here and the old ones migrated later? It's not something I've seen addressed/explained in detail, but which I'd imagine may make a difference to some people.

...and I could swear I had another question but I completely forgot it so it must not have been that important. XD

...also, on a minor point, for some reason the confirmation e-mail for using openID didn't show up until a good 20+ minutes later, although after I'd waited about fifteen minutes to no avail and had it resent it showed up immediately. :|a
04.01.2012 02:06 am (UTC)

(no subject)

Posted by [identity profile] acdragonmaster.livejournal.com
04.01.2012 04:59 am (UTC)

(no subject)

Posted by [personal profile] syphilis
04.01.2012 05:36 am (UTC)

(no subject)

Posted by [identity profile] acdragonmaster.livejournal.com
05.01.2012 05:51 am (UTC)

Not an official answer

Posted by [personal profile] foxfirefey
03.01.2012 06:08 am (UTC)

♥♥♥!

tobu_ishi: (Default)
Posted by [personal profile] tobu_ishi
Man, I love the feeling of receiving a clear, preemptive and helpful explanation from staff who treat me like a sensible human being. I was hesitant to switch over from LJ at first, but the more time I spend on Dreamwidth, the more I feel like I've come home. ♥