This document is meant to be read by people with sysadmin experience. I'll go back at some point and clean it up, break it down into sections, etc. But for now I'm just trying to dump as much information as possible so that Matthew and Robby have some state on how things are.
- 1 Links
- 2 Nagios
- 3 Cacti
- 4 Traffic Flow
- 5 Perlbal
- 6 MogileFS
- 7 Gearman
- 8 Memcached
- 9 TheSchwartz
- 10 Workers
- 11 Incoming Mail
- 12 Outgoing Mail
- 13 Databases
- 14 Webservers
- 15 SSL
- 16 ddlockd
- Cacti: http://z.dreamwidth.org/cacti/
- Nagios: http://z.dreamwidth.org/nagios3/
- Healthy?: http://www.dreamwidth.org/admin/healthy.bml
The Nagios setup runs on dfw-admin01 in /etc/nagios3. Most of the configuration files are in /etc/nagios3/conf.d, as you can imagine. You can poke around if you want to change it; it's pretty straightforward.
If you do change things, you probably want to commit them to the operations repository.
    make your changes... etc etc
    $ sync-back-nagios
    $ cd /root/dw-ops/nagios/conf.d
    $ hg status
    if everything looks good, then:
    $ commit -a mark -m "Some commit message."
Replace mark with matthew or alierak as appropriate.
Most of the graphs are more or less useful. I spend a lot of time looking at dfw-lb01 which shows all of the incoming site traffic. In particular: eth0 is always the "Internet" interface, on all slices. eth1 is the "Internal/Private" interface. And lo is lo.
The only time lo is really interesting is on the dfw-lb01/dfw-lb02 machines. Look at the SSL configuration to see why, but lo is the measure of how much SSL traffic we're doing.
This summarizes the flow of traffic. There are a lot more sections that talk far more in depth about various things, but here you go...
- Site external IP is on dfw-lb01 (or dfw-lb02), which runs Perlbal.
- User connects to Perlbal. If it's a static request, it serves it locally. If it's dynamic, it hands off to a webserver.
- Perlbal connects to dfw-webXX and proxies the request.
- Webservers connect to lots of things: databases, memcache, mogilefsd, gearmand, etc.
- Response is returned.
That's the basic flow of things and what connects to what. There's a separate flow that happens when the user requests a userpic (or any other MogileFS resource, but for now it's just userpics).
- User -> Perlbal, "GET /userpic/XXXX/YYY"
- Perlbal -> Webserver, "GET /userpic/XXXX/YYY"
- Webserver replies: X-REPROXY-URL: http://dfw-mog01/dev1/0/00/000/234.fid
- Perlbal -> dfw-mog01, "GET /dev1/..."
- Mogile storage node replies with image
- Perlbal combines the headers from the webserver's original reply with the body of the image from the mogile storage node, and returns that to the user.
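The reproxy step above can be sketched roughly as follows. This is a minimal illustration of the idea, not Perlbal's actual implementation: `build_client_response` and the `fetch_body` callback are hypothetical names, and the real balancer handles many more headers and error cases.

```python
def build_client_response(web_headers, fetch_body):
    """Merge the webserver's headers with a body fetched from storage.

    web_headers: dict of headers from the webserver's reply.
    fetch_body:  hypothetical callback that GETs a URL and returns bytes.
    """
    # Header names are case-insensitive, so normalise before looking up.
    headers = {name.lower(): value for name, value in web_headers.items()}

    # The reproxy header is an instruction to the balancer, not to the user,
    # so it is removed before the response goes out.
    reproxy_url = headers.pop("x-reproxy-url", None)
    if reproxy_url is None:
        raise ValueError("no X-REPROXY-URL header; nothing to reproxy")

    body = fetch_body(reproxy_url)

    # The webserver's own reply had no useful body, so the length must
    # describe the body that came from the storage node instead.
    headers["content-length"] = str(len(body))
    return headers, body


# Usage with a stand-in for the storage node:
headers, body = build_client_response(
    {"X-REPROXY-URL": "http://dfw-mog01/dev1/0/00/000/234.fid",
     "Content-Type": "image/png"},
    lambda url: b"PNGDATA",
)
```

The point of the design is that the webserver only decides where the file lives; the bytes themselves never pass through it.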
SSL is different again:
- User -> Pound.
- Pound handles the SSL handshake and decryption/encryption.
- Pound connects to localhost:80 (Perlbal).
- Same process now as originally.
Perlbal is the main software load balancer. With the Dreamwidth configuration, it doesn't do terribly much except handle reproxying.
- Runs on: dfw-lb01, dfw-lb02
- Admin port: 60000
Nagios is set up to monitor HTTP and SSL on these machines, though not necessarily the admin port. (That could be useful.)
If Perlbal happens to crash or otherwise become unavailable, you can start/restart it.
    run as root on dfw-admin01
    $ bin/restart-perlbal
Kareila made a script for doing status checks on the perlbals. You can run it like this:
    [dw @ dfw-admin01 - ~/current] -> bin/pbadm 1
    Name "LJ::PERLBAL_SERVERS" used only once: possible typo at bin/pbadm line 37.
    Tue May 19 05:44:52 2009: [lb01 - 003, 0000] [lb02 - 000, 0000]
    Tue May 19 05:44:53 2009: [lb01 - 001, 0000] [lb02 - 000, 0000]
    Tue May 19 05:44:54 2009: [lb01 - 004, 0000] [lb02 - 000, 0000]
Ignore the warning. These lines should be color coded: green is okay, yellow is intriguing, red is problematic. But generally as long as the numbers look pretty low, it should be alright. (Unless it says DOWN of course...)
MogileFS stores userpics. Mark didn't get around to documenting this.
- Runs on: dfw-mog01, dfw-mog02, dfw-mog03
- mogilefsd (tracker) on port 7001
- mogstored (backend) on port 7500 (admin on 7501)
- lighttpd (GET access) on port 81
Mark put in a cron job to kill off one mogilefsd queryworker every 15 minutes, due to a memory leak (in Perl 5.10.0?) that caused these machines to swap.
Perlbal is running on these machines but I don't think it's supposed to be.
Very simple server that just handles jobs. If this goes down, it should be started back up.
- Runs on: dfw-jobs01, dfw-jobs02
- Port: 7003
There is no administrative port. I think there are some commands you can use to see how deep the queues are, but I don't know off the top of my head. We use gearman for only one thing right now (userpic resizes? directory searches?) so I can't imagine it falling behind.
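On the queue-depth question: Gearman servers generally accept plain-text admin commands on the same port as job traffic. A minimal sketch, assuming the server here answers the standard `status` command (one tab-separated line per function, terminated by a line containing only `.`); I have not verified this against the version running on dfw-jobs01/02, and the function name in the example is made up.

```python
import socket


def parse_gearman_status(text):
    """Parse 'status' output into {function: {total, running, workers}}."""
    queues = {}
    for line in text.splitlines():
        if line == ".":          # terminator line
            break
        if not line.strip():
            continue
        name, total, running, workers = line.split("\t")
        queues[name] = {
            "total": int(total),      # jobs in the queue
            "running": int(running),  # jobs currently being worked on
            "workers": int(workers),  # workers able to take this function
        }
    return queues


def fetch_gearman_status(host, port=7003, timeout=5.0):
    """Send 'status' to a gearman server and return the parsed reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"status\n")
        data = b""
        # Read until the lone "." line that ends the reply.
        while not (data == b".\n" or data.endswith(b"\n.\n")):
            chunk = conn.recv(4096)
            if not chunk:
                break
            data += chunk
    return parse_gearman_status(data.decode("ascii", "replace"))


# Parsing a hypothetical reply (the function name is an assumption):
example = parse_gearman_status("resize_userpic\t3\t1\t2\n.\n")
```

A `total` that keeps growing while `workers` is 0 would suggest the workers for that function are down.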
Restarting gearman is manual; I have no tool to do it. SSH to the servers that run gearman and use the /etc/init.d/gearman-server script.
The memcached instances generally stay up and never give any trouble. They store data; each one is basically an LRU cache. We don't push them that hard right now -- you can find all of the basic information in Cacti, where I set up a nice memcached graphing library with interesting stats.
- Runs on: dfw-memc01, dfw-memc02
- Port: 11211
Same as for Perlbal:
    as root on dfw-admin01
    $ bin/restart-memcache
KEEP IN MIND: Restarting memcache puts a heavy strain on the databases. While we can get away with it without any trouble right now (our databases are bored), at some point in the future restarting memcache will become synonymous with shooting the site in the knee and watching it hobble along.
If you want the nitty gritty you can telnet to one of the instances on the port above and type stats which will give you a nice dump.
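If you'd rather not read the dump by eye, it is easy to parse. The memcached text protocol answers `stats` with lines of the form `STAT <name> <value>` followed by `END`. The sketch below parses that text and computes a hit ratio from the `get_hits` and `get_misses` counters; the helper names are my own, and the sample numbers are invented.

```python
def parse_memcached_stats(text):
    """Turn the reply to 'stats' into a {name: value} dict of strings."""
    stats = {}
    for line in text.splitlines():   # splitlines() also handles \r\n
        if line == "END":
            break
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def hit_ratio(stats):
    """Fraction of GETs served from cache; 0.0 if there were no GETs."""
    hits = int(stats.get("get_hits", 0))
    misses = int(stats.get("get_misses", 0))
    total = hits + misses
    return hits / total if total else 0.0


# Invented sample reply, as you would see it over telnet:
sample = "STAT get_hits 90\r\nSTAT get_misses 10\r\nSTAT curr_items 42\r\nEND\r\n"
stats = parse_memcached_stats(sample)
ratio = hit_ratio(stats)
```

A hit ratio that drops sharply right after a restart is expected; that is the period when the databases take the extra load.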
There's not much to mention here. The actual work is done by workers, which is in the Workers section of this document. TheSchwartz database is maintained on the global database (see Databases section). Logical db name is dw_schwartz.
The workers do async tasks that we don't need to happen inline with someone doing something on the website. Okay, so I lied, some workers actually are synchronous (thinking of the Gearman things here).
- Runs on: dfw-jobs01, dfw-jobs02
There is no port or management for these; they're just tasks. Typically speaking, you can see if they're running by looking at ps on the machine.
    as root on dfw-admin01
    $ bin/restart-jobs
CAVEAT LECTOR: Restarting the workers can be hard on the content-importer workers, since they allocate 12 hours to process entry and comment imports. If you restart workers while an import is in progress, it will cause that user's import to effectively pause halfway for 12 hours until it gets retried later.
There is no current way around this. You just have to know when is a good time to restart workers. While I'm gone, if you need to restart them, just do it. If a user has a problem with a delayed import, support will be awesome and let them know that it might take a while.
If you want to check on the importers...
    [root @ dfw-admin01 - ~] -> bin/importer-status
    dfw-mail01
     4084 ? S 2:11 content-importer [bored]
    dfw-jobs01
     5663 ? S 4:29 content-importer [bored]
    25034 ? S 0:05 content-importer [bored]
    dfw-jobs02
    28323 ? S 0:04 content-importer [bored]
    28528 ? S 0:05 content-importer [bored]
Note they're all bored. That means you are safe to just restart the workers. On the other hand, if it says it's posting entries or comments, you might want to wait. (But if it's an emergency, just do it.)
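The "are they all bored?" check can be automated before a restart. A minimal sketch, assuming each importer shows up in the output as `content-importer [<status>]` as in the sample above; the helper names are hypothetical, and the busy status text in the example is a guess at the wording, not copied from real output.

```python
import re

# Captures whatever appears between the brackets after the process name.
STATUS_RE = re.compile(r"content-importer \[([^\]]*)\]")


def importer_statuses(output):
    """Return the bracketed status of every content-importer in the output."""
    return STATUS_RE.findall(output)


def safe_to_restart(output):
    """True only when every importer reports 'bored'.

    Note: this is also True when no importers are listed at all.
    """
    return all(status == "bored" for status in importer_statuses(output))


idle = """dfw-jobs01
 5663 ? S 4:29 content-importer [bored]
25034 ? S 0:05 content-importer [bored]
"""
# Hypothetical busy status; the real wording may differ.
busy = idle + "28323 ? S 0:04 content-importer [posting entries]\n"
```

In an emergency you would skip the check and restart anyway, as noted above.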
The machine dfw-mail01 handles incoming mail. It's a postfix system, with the MySQL module so that it can handle mail aliases/forwarding for users.
Sorry this is lacking in detail. If you're familiar with postfix you can dig around /etc/postfix for some more information. Specifically the /etc/postfix/dw directory.
All outgoing mail comes from one of three places, depending on the source. All of the machines in the cluster are configured to use nullmailer to route all outgoing mail through dfw-mail01. So, cron mails and the like go out this way.
Mail generated by the site itself goes out via send-email jobs on dfw-jobs01 and dfw-jobs02. These are regular workers and you can see them sending by watching ps on the machines.
In theory, everything is configured as well as it can be. Reverse DNS is set up for the IPs, forward MX records are in place, etc. If you have any suggestions on how to improve our "presence" so we don't get blocked, I'd appreciate it.
There are eight "real" databases and one "off" database for doing email aliases. I'm going to ignore the mail slave database, it's using a configuration so that it only replicates the dw_global.email_aliases table.
The main databases are broken down into four sets of two. But the sets are further broken down into types:
Global Database Cluster
The global cluster is dfw-db01 and dfw-db02. It's in a master/slave configuration. Right now, dfw-db01 is master, but this is subject to change if we have a failure of db01. The best bet is to use dbcheck (see below).
User Database Clusters
User clusters are master/master.
FIXME: add documentation on how to switch the active master... it's in etc/config-private.pl, I think Matthew knows how to do that. (We don't have a tool for it. Also, if you mark a cluster readonly, the site basically goes down, because readonly doesn't work on DW very well? Code problems?)
There's a nifty tool that you can use to check on the databases at a glance:
    [dw @ dfw-admin01 - ~/current] -> bin/dbcheck.pl
    1 dfw-db01       repl: - < 3> conn: 0/ 162 UTC 5.0.67 (master)
    2 dfw-db02    1  repl: 0 < 3> conn: 0/   1 UTC 5.0.67 (slave)
    3 dfw-db-a01  4  repl: 0 < 3> conn: 1/   2 UTC 5.0.67 (cluster4a)
    4 dfw-db-a02  3  repl: 0 < 3> conn: 0/  44 UTC 5.0.67 (cluster4b)
    5 dfw-db-b01  6  repl: 0 < 3> conn: 0/  46 UTC 5.0.67 (cluster5a)
    6 dfw-db-b02  5  repl: 0 < 3> conn: 1/   2 UTC 5.0.67 (cluster5b)
    7 dfw-db-c01  8  repl: 0 < 3> conn: 0/  41 UTC 5.0.67 (cluster6a)
    8 dfw-db-c02  7  repl: 0 < 3> conn: 0/   2 UTC 5.0.67 (cluster6b)
This has a lot of information, I'm sure you can sort most of it out. The "repl: 0" column shows how many bytes (NOT seconds, heh) behind in replication this database is. The "< 3>" next to it is how many binlogs are on that server currently.
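If you want to alert on replication lag instead of eyeballing it, the output is regular enough to parse. A rough sketch, assuming one database per line in the format shown above; the regular expression is inferred from that single sample, so treat it as a starting point. The helper names are mine, and the lagging value in the example is invented.

```python
import re

# id, hostname, optional id of the replication peer, then "repl: N < B>".
ROW_RE = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<host>\S+)\s+(?:(?P<peer>\d+)\s+)?"
    r"repl:\s*(?P<repl>-|\d+)\s*<\s*(?P<binlogs>\d+)>"
)


def parse_dbcheck(output):
    """Return one dict per database row; unrecognised lines are skipped."""
    rows = []
    for line in output.splitlines():
        match = ROW_RE.match(line)
        if match is None:
            continue  # prompt line, blank line, or an unknown format
        repl = match.group("repl")
        rows.append({
            "host": match.group("host"),
            # "-" means the host is not replicating from anything.
            "repl_bytes": None if repl == "-" else int(repl),
            "binlogs": int(match.group("binlogs")),
        })
    return rows


def lagging_hosts(output, threshold=0):
    """Hosts whose replication is more than `threshold` bytes behind."""
    return [row["host"] for row in parse_dbcheck(output)
            if row["repl_bytes"] is not None and row["repl_bytes"] > threshold]


# Sample with an invented lag of 512 bytes on the slave:
sample = (
    "1 dfw-db01 repl: - < 3> conn: 0/ 162 UTC 5.0.67 (master)\n"
    "2 dfw-db02 1 repl: 512 < 3> conn: 0/ 1 UTC 5.0.67 (slave)\n"
)
rows = parse_dbcheck(sample)
behind = lagging_hosts(sample)
```

Since the column is bytes rather than seconds, a small nonzero value on a busy master/master pair is probably not worth paging anyone over; pick the threshold accordingly.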
Serves web requests. Pretty straightforward.
- Runs on: dfw-webXX
- Port: 80 :)
You're probably used to this by now. There is a handy tool to do the restarts, but this one lets you give it a "delay". If you have an emergency, and need to restart everything, you can just run the command with a 0 argument.
    run as root on dfw-admin01
    $ bin/restart-webs 5
That restarts the webservers with a 5 second delay. If you don't specify a delay, 5 seconds is used presently.
We use Pound to do SSL unwrapping. It's configured precisely where you expect it is.
- Runs on: dfw-lb01, dfw-lb02
- Port: 443
Nagios checks it on each of our servers to make sure that the full SSL flow is working on both load balancers.
(Why not have Apache handle SSL?)
    [14:09] <xb95> Security is a big reason.
    [14:09] <xb95> I don't want my SSL certificates installed on the system that is doing code management.
    [14:09] <xb95> Or running the code.
    [14:11] <xb95> Load is another. SSL is traditionally heavy when done in software, so LiveJournal (and many other sites) use hardware load balancers to do SSL termination.
    [14:11] <xb95> Big-IP, NetScaler, etc
    [14:11] <xb95> But if you terminate the SSL on Apache, then you have no real choice except to let Apache handle the request, whatever it is.
    [14:12] <xb95> Ooh, and ANOTHER reason is that it would break X-REPROXY-FILE if you did the SSL termination on Apache, and if you break that, you again are stuck with Apache doing things it's really not designed for -- i.e. serving large static files.
    [14:13] <xb95> So in short: there are many very good reasons to not use Apache's SSL support. Sure, it'd be easier for developers, but meh. Pound takes about five minutes to set up.

    [14:24] <xb95> https://gist.github.com/1305026
    [14:24] <xb95> skakri: that is the pound.cfg for dreamwidth production + the snippet for Perlbal
    [14:24] <xb95> internet -> Pound (listens on port 443) -> Perlbal -> Apache
    [14:25] <skakri> xb95, thank you!
    [14:25] <xb95> the only other component is that you need two config options in your site config
    [14:26] <xb95> $SSL_HEADER = "X-DW-SSL";
    [14:26] <xb95> $USE_SSL = 1;
    [14:26] <xb95> config-private.pl: $SSLDOCS = "$HOME/ssldocs";
    [14:26] <xb95> config-private.pl: $SSLROOT ||= "https://$DOMAIN_WEB";
    [14:26] <xb95> config.pl: # SSL prefix defaults
    [14:26] <xb95> config.pl: $SSLIMGPREFIX ||= "$SSLROOT/img";
    [14:26] <xb95> config.pl: $SSLSTATPREFIX ||= "$SSLROOT/stc";
    [14:26] <xb95> config.pl: $SSLJSPREFIX ||= "$SSLROOT/js";
    [14:26] <xb95> config.pl: $SSLWSTATPREFIX ||= "$SSLROOT/stc";
    [14:26] <xb95> config-private.pl: $CONCAT_RES_SSL = 0; # we have this too, not sure if it defaults to off or on
    [14:27] <skakri> xb95, same in config.pl?
    [14:27] <skakri> (the last line)
    [14:27] <xb95> actually the last one, and the first two, are in config-private.pl
Restarting Pound is manual, for now. SSH to the machine in question and use /etc/init.d/pound to do the work.
NOTE: there is currently a problem with the configuration. Whenever the server reboots, pound won't start, because /var/run/pound does not exist. Create this directory and pound will start.
This is a very damn simple locking daemon.
- Runs on: dfw-jobs01, dfw-jobs02
- Port: 7002
Given that it's a locking system, it won't actually restart. The following command will start them up if they're down, that's it:
    run as root on dfw-admin01
    $ bin/restart-ddlockd
You can telnet to the port and issue the command status to see what's going on. It's a little terse.