Production Notes

From Dreamwidth Notes

Latest revision as of 03:33, 7 November 2021


This document is for the Dreamwidth staff. Most of this won't work for you, but if you're curious, feel free to look. We host on AWS.

Links

  • Dashboard: https://app.datadoghq.com/dashboard/v8n-hrk-jgt/dreamwidth-health
  • Healthy?: http://www.dreamwidth.org/admin/healthy.bml

Traffic Flow

Traffic looks like:

1. We use Cloudflare for some things (www, userpics, attachments, some high-traffic domains).
2. AWS CloudFront. Use this to configure caching and such.
3. AWS WAF. Use this to configure mitigations against bad actors and other abusive traffic.
4. AWS Application Load Balancer. Use this to route traffic to different internal endpoints.
5. Destination instance(s).

It's hard to say what #5 is since it varies depending on the request. In general though, it hits one of our EC2 instances which handles the request.
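
If you need to check which edge layer actually answered a request, the response headers give it away. A sketch under assumed header names (cf-ray is the standard Cloudflare header, x-amz-cf-id the standard CloudFront one; if a request rides through both, Cloudflare's shows up too):

```shell
# Sketch: classify which edge layer answered, from response headers on stdin.
# Header names are each service's standard ones, not something Dreamwidth-specific.
classify_edge() {
    headers=$(cat)
    if printf '%s\n' "$headers" | grep -qi '^cf-ray:'; then
        echo "edge: cloudflare"
    elif printf '%s\n' "$headers" | grep -qi '^x-amz-cf-id:'; then
        echo "edge: cloudfront"
    else
        echo "edge: unknown (straight to the ALB?)"
    fi
}

# e.g.: curl -sI https://www.dreamwidth.org/ | classify_edge
```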

Gearman

Very simple server that just handles jobs. If this goes down, it should be started back up.

  • Runs on: dfw-jobs01, dfw-jobs02
  • Port: 7003

There is no separate administrative port. I think there are some commands you can use to see how deep the queues are, but I don't know off the top of my head. We use Gearman for only one thing right now (userpic resizes? directory searches?) so I can't imagine it falling behind.
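
For queue depths specifically: gearmand speaks a plain-text admin protocol on the job port itself, so you can nc to port 7003, send status, and get back one tab-separated line per function (name, total queued, running, available workers), terminated by a lone dot. A sketch that flags deep queues (the threshold and function names are made up):

```shell
# Sketch: flag deep gearman queues from the admin "status" output.
# Each line is: function<TAB>total<TAB>running<TAB>available_workers
flag_deep_queues() {
    threshold=$1
    awk -F'\t' -v t="$threshold" '$1 != "." && NF == 4 {
        # force numeric comparison so "3" vs "100" is not compared as strings
        if ($2 + 0 > t + 0) printf "DEEP: %s has %s queued\n", $1, $2
        else                printf "ok: %s has %s queued\n", $1, $2
    }'
}

# e.g.: printf 'status\n' | nc dfw-jobs01 7003 | flag_deep_queues 100
```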

Start/Restart

This is manual, I have no tool to do it. SSH to the servers that run gearman and use the /etc/init.d/gearman-server script.
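
Since there's no tool, a minimal loop covers it (host list and init script path are from this doc; the DRY_RUN switch is just for checking what it would do):

```shell
# Sketch: restart gearman-server on each job host over ssh.
# Set DRY_RUN=1 to print the commands instead of running them.
restart_gearman() {
    for host in dfw-jobs01 dfw-jobs02; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "would run on $host: /etc/init.d/gearman-server restart"
        else
            ssh "root@$host" /etc/init.d/gearman-server restart
        fi
    done
}
```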

Dreamhacks

See Setting up Gearman

Memcached

These generally stay up and never give any trouble. They store data; it's basically an LRU cache. We don't push them that hard right now -- you can find all of the basic information in Cacti; I set up a nice memcached graphing library with interesting stats.

  • Runs on: dfw-memc01, dfw-memc02
  • Port: 11211

Start/Restart

Same as for Perlbal:

   as root on dfw-admin01
   $ bin/restart-memcache

KEEP IN MIND: Restarting memcache puts a heavy strain on the databases. While we can get away with it without any trouble right now (our databases are bored), at some point in the future restarting memcache will become synonymous with shooting the site in the knee and watching it hobble along.

Admin Stats

If you want the nitty gritty you can telnet to one of the instances on the port above and type stats which will give you a nice dump.
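
The stats dump works non-interactively too (nc instead of telnet so it can be piped). A sketch that pulls out the get hit ratio, which is usually the number you actually want:

```shell
# Sketch: compute the memcached get hit ratio from a "stats" dump.
# Stat lines look like: STAT get_hits 123
hit_ratio() {
    tr -d '\r' | awk '
        $1 == "STAT" && $2 == "get_hits"   { hits = $3 }
        $1 == "STAT" && $2 == "get_misses" { misses = $3 }
        END {
            total = hits + misses
            if (total > 0) printf "get hit ratio: %.1f%%\n", 100 * hits / total
            else           print "no gets recorded"
        }'
}

# e.g.: printf 'stats\nquit\n' | nc dfw-memc01 11211 | hit_ratio
```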

TheSchwartz

There's not much to mention here. The actual work is done by workers, which are covered in the Workers section of this document. TheSchwartz database is maintained on the global database (see Databases section). Logical db name is dw_schwartz.

Workers

The workers do async tasks that we don't need to happen inline with someone doing something on the website. Okay, so I lied, some workers actually are synchronous (thinking of the Gearman things here).

  • Runs on: dfw-jobs01, dfw-jobs02

There is no port or management for these, they're just tasks. Typically speaking, you can see if they're running by looking at ps on the machine.

Start/Restart

Second verse...

   as root on dfw-admin01
   $ bin/restart-jobs

CAVEAT LECTOR: Restarting the workers can be hard on the content-importer workers, since they allocate 12 hours to process entry and comment imports. If you restart workers while an import is in progress, it will cause that user's import to effectively pause halfway for 12 hours until it gets retried later.

There is no current way around this. You just have to know when is a good time to restart workers. While I'm gone, if you need to restart them, just do it. If a user has a problem with a delayed import, support will be awesome and let them know that it might take a while.

If you want to check on the importers...

   [root @ dfw-admin01 - ~] -> bin/importer-status
   dfw-mail01
    4084 ?        S      2:11 content-importer [bored]
   dfw-jobs01
    5663 ?        S      4:29 content-importer [bored]
   25034 ?        S      0:05 content-importer [bored]
   dfw-jobs02
   28323 ?        S      0:04 content-importer [bored]
   28528 ?        S      0:05 content-importer [bored]

Note they're all bored. That means you are safe to just restart the workers. On the other hand, if it says it's posting entries or comments, you might want to wait. (But if it's an emergency, just do it.)
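
That eyeball check is easy to script; a sketch that greps the importer-status output for the [bored] tag described above:

```shell
# Sketch: decide from importer-status output whether a worker restart is safe.
# Safe means every content-importer process reports [bored].
safe_to_restart() {
    busy=$(grep 'content-importer' | grep -cv '\[bored\]' || true)
    if [ "$busy" -eq 0 ]; then
        echo "all importers bored; safe to restart"
    else
        echo "$busy importer(s) busy; consider waiting"
        return 1
    fi
}

# e.g.: bin/importer-status | safe_to_restart && bin/restart-jobs
```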

Incoming Mail

The machine dfw-mail01 handles incoming mail. It's a postfix system, with the MySQL module so that it can handle mail aliases/forwarding for users.

Sorry this is lacking in detail. If you're familiar with postfix you can dig around /etc/postfix for some more information. Specifically the /etc/postfix/dw directory.
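
For orientation, a postfix-to-MySQL alias lookup is wired up with a map file shaped roughly like this. The file name, table, and column names below are illustrative guesses, not the actual contents of /etc/postfix/dw; running postconf -n on dfw-mail01 will show which mysql: maps are really in play.

```
# Hypothetical /etc/postfix/dw/virtual-alias.cf -- the real file names and
# query under /etc/postfix/dw will differ.
user     = dw
password = (from the ops secrets)
hosts    = (the global database host)
dbname   = dw_global
query    = SELECT rcpt FROM email_aliases WHERE alias = '%s'
```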

Outgoing Mail

Amazon SES is our outgoing mail provider. You can use the AWS dashboards if needed.

Databases

We use Amazon RDS running Aurora. This is basically MySQL, but it automates things like leader election/follower promotion, backups, etc.

Webservers

Serves web requests. Pretty straightforward.

  • Runs on: va-web{01,02,03,04,05,06}

Start/Restart

You're probably used to this by now. There is a handy tool to do the restarts, but this one lets you give it a "delay". If you have an emergency, and need to restart everything, you can just run the command with a 0 argument.

   run as root on va-admin01
   $ bin/restart-webs 5

That restarts the webservers with a 5 second delay between each one. If you don't specify a delay, it currently defaults to 5 seconds.
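
The delay exists so you don't take every webserver down at once. For illustration, the pattern amounts to something like this (host list is from this doc; the per-host restart command is an assumption, not what restart-webs actually runs, and DRY_RUN is just for safe testing):

```shell
# Sketch of a delayed rolling restart, in the spirit of bin/restart-webs.
# The restart command (apache2ctl graceful) is a guess; adjust to whatever
# the webs actually run. Set DRY_RUN=1 to print actions instead of doing them.
rolling_restart() {
    delay=${1:-5}   # default matches restart-webs' current default
    shift
    for host in "$@"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "restart $host then sleep $delay"
        else
            ssh "root@$host" 'apache2ctl graceful'
            sleep "$delay"
        fi
    done
}

# e.g.: rolling_restart 5 va-web01 va-web02 va-web03
```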

ddlockd

This is a very damn simple locking daemon.

  • Runs on: dfw-jobs01, dfw-jobs02
  • Port: 7002

Start/Restart

Given that it's a locking system, it won't actually be restarted. The following command will just start the daemons up if they're down, that's it:

   run as root on dfw-admin01
   $ bin/restart-ddlockd

Status

You can telnet to the port and issue the command status to see what's going on. It's a little terse.