A friend's site that runs Drupal 4.6 is becoming more popular by the month. With increased popularity, comes increased traffic, and the need for more resources.
Problem: Slow Site
After moving to a dedicated server, things were better for a while, then in late March to early April, users were reporting page load times that are too slow. I mean really slow, as in 25 seconds per page per request.
Hardware is not the issue here. This is a Debian Linux server with a dual Pentium 4 2.8GHz with 2GB or RAM.
Investigation: CPU Bound
After logging in and doing some investigation, I saw the following output from vmstat:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
10 0 126820 380848 127940 938080 0 0 110 239 386 498 97 3 0 0
8 0 126820 357348 127976 938544 0 0 25 133 340 412 97 3 0 0
11 0 126820 362136 128004 938680 0 0 7 91 283 280 98 2 0 0
5 0 126820 389988 128032 939192 0 0 32 137 342 392 98 2 0 0
6 0 126820 417436 128060 939504 0 0 15 112 293 391 93 2 5 0
2 0 126820 400424 128088 939816 0 0 9 133 327 444 98 2 0 0
6 0 126820 365852 128132 940684 0 0 71 111 324 410 97 3 0 0
0 0 126820 382328 128160 941568 0 0 70 121 345 368 89 2 8 0
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
5 0 126820 362820 129520 967064 0 0 73 149 395 324 97 3 0 0
5 0 126820 348348 129572 967500 0 0 35 93 331 268 86 3 11 0
1 0 126820 378004 129596 968004 0 0 36 128 326 299 83 3 14 0
1 0 126820 408256 129632 968580 0 0 46 94 343 281 78 2 21 0
3 0 126820 418516 129660 968840 0 0 16 107 392 339 66 3 30 0
7 0 126820 348000 129696 969572 0 0 56 106 369 418 95 3 1 0
5 0 126820 355796 129720 969840 0 0 13 109 307 259 93 3 5 0
3 0 126820 387436 129748 970480 0 0 51 126 335 333 76 3 21 0
8 0 126820 364508 129772 970908 0 0 33 96 335 281 96 3 1 0
As you can see, there is a large number of processes that are in the run queue, waiting for the CPU to be available. Meanwhile, the CPU idle time is close to non-existent.
This is confirmed by the output from this ps command:
ps -el| grep '^. R'
R 999 26312 10623 2 78 0 - 6489 - ? 00:00:08 libhttpd.ep
5 R 999 26349 10623 4 80 0 - 6761 - ? 00:00:12 libhttpd.ep
5 R 999 26432 10623 11 80 0 - 6559 - ? 00:00:20 libhttpd.ep
5 R 999 26487 10623 2 78 0 - 6800 - ? 00:00:03 libhttpd.ep
5 R 999 26553 10623 2 78 0 - 6419 - ? 00:00:01 libhttpd.ep
5 R 999 26559 10623 10 80 0 - 6191 - ? 00:00:06 libhttpd.ep
5 R 999 26573 10623 8 79 0 - 6443 - ? 00:00:04 libhttpd.ep
5 R 999 26579 10623 11 79 0 - 6481 - ? 00:00:06 libhttpd.ep
5 R 999 26583 10623 10 80 0 - 6493 - ? 00:00:05 libhttpd.ep
After I tailed the Apache log, I found that the amount of hits to node/feed and other */feed is amazing (several a second). This is due to the fact that the site's audience are mainly Mac users, and they use all kinds of widgets to subscribe to RSS feeds. These widgets poll the site every so often, and when you have a lot of users doing this, the load on the site increases dramatically.
Step 1: Disabling Feeds
So, the first thing I did was modify modules/node.module to force node_feed() to return early, thus saving the load from the feeds. This improved things a bit, but not by a whole lot. Here is the vmstat output.
Tue Apr 18 20:31:08 PDT 2006
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 101232 708352 147608 663464 0 0 0 96 217 155 8 0 92 0
1 0 101232 708148 147612 663548 0 0 6 61 212 172 20 1 80 0
1 0 101232 718048 147612 663632 0 0 2 70 207 185 28 0 71 0
1 0 101232 710224 147612 663732 0 0 0 97 224 207 69 2 30 0
1 0 101232 689008 147612 663836 0 0 0 88 229 294 56 1 43 0
0 0 101232 646712 147620 663920 0 0 0 88 269 278 46 2 52 0
0 0 101232 680540 147624 664016 0 0 0 91 215 240 52 1 47 0
0 0 101232 701752 147624 664096 0 0 0 94 225 223 48 1 51 0
3 0 101232 701444 147624 664144 0 0 0 111 220 204 36 1 64 0
Step 2: Zeroing In On the Real Issue
Next, we noticed that another site using Drupal 4.6 on the same server is running relatively fast, so it had to be something specific to the slow site (either code or database), and not hardware, operating system, or system software.
Then, we noticed that a local copy of the same Drupal installation with the same database on a single user machine is also very slow. So, the problem is not load related (number of users accessing the site simultaneously).
When using the devel.module on a local install using a Pentium III 550 MHz, it took the front page 25 to 30 seconds to load on a local LAN. The inner pages took 13 to 19 seconds. This is page generation time as reported by devel.module.
Step 3: Lots of Aliases
Upon checking the database, I found that there were 8479 aliases in the url_alias table.
The problem is that the url_alias table, which contains the search engine friendly URLs for all nodes and feeds (e.g. the_article_title instead of node/3), was getting loaded in memory and indexed as an associative array for each request.
This is fine when one has only a few hundred nodes (my sites have 600+ and 700+ with no problems). However, in this case, the 8400+ aliases in the table are too much processing per request. For every page load, those were a) read from the database, and b) indexed as an array in memory, which caused the severe slow down for each request.
There is an enhancement that was done for 4.7 to not load all the aliases, but get them from the database on-demand. This was backported to 4.6 in this issue, and specifically in this patch by Matt Westgate.
After applying this patch, which applies cleanly to 4.6.6 with an offset, the site was flying fast again. On the test Pentium III, the front page took 2.2 to 2.8 seconds, and inner pages took 1.2 to 1.5 seconds.
I applied the patch to the live site, and asked the site owner to run update.php.
Here is the vmstat output after the patch on the live server (dual Pentium 4).
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 214300 133892 124580 1309908 1 3 2 2 4 1 41 1 58 0
2 0 214300 129780 124592 1309976 0 0 2 82 230 387 17 1 82 0
0 0 214300 120592 124596 1310036 0 0 1 77 215 420 20 1 79 0
0 0 214300 117864 124596 1310104 0 0 0 103 242 486 7 1 92 0
0 0 214300 117932 124604 1310148 0 0 0 64 202 328 16 1 82 0
0 0 214300 107216 124608 1310184 0 0 0 102 223 272 12 1 87 0
0 0 214300 109792 124608 1310188 0 0 0 98 208 77 0 0 99 0
0 0 214300 109776 124612 1310200 0 0 0 39 202 94 1 0 99 0
0 0 214300 117316 124616 1310204 0 0 0 37 204 58 0 0 99 0
1 0 214300 123028 124616 1310212 0 0 0 35 198 125 13 0 87 0
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 214300 119920 124692 1311136 1 3 2 2 4 1 41 1 58 0
0 0 214300 119852 124692 1311168 0 0 0 61 258 237 4 0 96 0
0 0 214300 145376 124692 1311188 0 0 0 38 292 169 3 0 96 0
0 0 214300 132276 124692 1311228 0 0 0 94 256 239 34 1 66 0
1 0 214300 119664 124700 1311268 0 0 0 76 229 289 5 1 94 0
1 0 214300 110760 124708 1311344 0 0 0 208 339 451 11 1 88 0
0 0 214300 103772 124712 1311404 0 0 0 67 246 447 7 1 92 0
1 0 214300 97004 124716 1311448 0 0 0 59 250 318 12 1 88 0
1 0 214300 106728 124720 1311508 0 0 0 78 274 374 8 1 91 0
0 0 214300 121628 124720 1311556 0 0 0 91 248 264 12 0 87 0
And the ps command shows no processes at all in the run queue:
ps -el | grep "^. R"
0 R 59556 21340 3669 0 76 0 - 745 - pts/0 00:00:00 ps
Voila! Success at last!
Conclusion
If we ever roll Drupal 4.6.7, then this patch should be part of that release. It seems that 4.6 will live on for sometime, even after the imminent release of 4.7, perhaps in a few weeks.
Comments
adrian (not verified)
Definitely should be rolled in.
Mon, 2006/04/24 - 05:31We patched our system at Bryght with this before we even released the first 4.6 version.
The only way we could concievably enable pathauto for our clients.
Ramdak (not verified)
Rolling the patch into a 4.6 release
Tue, 2006/04/25 - 10:12Well, despite several requests, Killes is on record saying he will not roll it in for 4.6.
Is this one of those things that requires a greater level of consensus to happen? I am not trying to stir the pot or any such thing, but it is really unfortunate this was deemed not necessary for 4.6.6.
greggles (not verified)
Disabling feeds
Mon, 2006/04/24 - 09:42Was disabling feeds was that just a test and you have re-enabled them or did you change the function in a way that makes it faster while keeping the functionality? If you just disable feeds I imagine that some of the site's users would be upset.
Great investigation and great article!
Khalid
Thanks
Mon, 2006/04/24 - 09:56Thanks for the typo. I fixed the article, and removed your comment about it so no one would get confused.
As for the feeds, I did not make it faster. I just disabled it. If I could make it faster, I would have contributed a patch. I think we should think of flat file caching for feeds or something of that sort. It is too much work done several times a second on some sites.
The site owner was desparate at that point, and we announced on the front page that RSS feeds will be disabled for a while while we trouble shoot the problem.
mike3k (not verified)
Redirecting feeds
Mon, 2006/04/24 - 11:16I use FeedBurner for my feeds. Over 75% of my traffic comes from RSS feeds, since NetNewsWire includes my feed as one of the defaults. I ended up creating a private feed URL which calls node_feed() and which only FeedBurner subscribes to, while node/feed is redirected to the feedburner feed.
At first I had node.module do the redirect itself, but I found I could reduce the load even further by using a redirect permanent in .htaccess.