Those of you who follow me on Twitter no doubt noticed that yesterday was no fun at all when it came to this site, mostly due to some unexpected actions by my web host, the venerable Pair.
Red Alert #1
The first sign of trouble was a biggie. At 1:42PM local time I discovered that this site was completely down. As in, every page would produce an HTTP 500 "Internal Server Error". This is the kind of thing that can make a small ISV just about wet himself. I depend on the site to sell my software, and I depend on software sales to keep my house and buy food and stuff like that.
Now I've just done a completely new site design, with a whole mess of PHP code, and I had been fixing a couple of things in the morning. Naturally, I assumed the problem was my fault, and so I started desperately trying to figure out what I had broken.
This is where I first ran into trouble with Pair. If you're getting internal server errors, the first step is to look at the server error log and see what it's telling you. On Pair you can't get raw logs on a shared server, but their Account Control Center is supposed to show you errors related to your sites. But there was nothing there. As far as the ACC was concerned my site was fine, even though the site kept spewing 500s.
Left with no clues to go on, all I knew was "something's broke!". I checked my local copy of the site (the one that runs on my Mac) versus the live files. I made sure to undo every change I had made since I last knew things were working. I investigated whether the database had been hacked. Nothing.
I asked friends in #macsb for help, and one pointed out that Pair had announced a PHP update yesterday, about 90 minutes before I noticed the site was broken:
php5.cgi will be upgraded today to PHP 5.2.2. This upgrade is necessary to patch a security vulnerability that was recently announced.
This will only affect you if you are currently using scripts with .php5 extensions or using php5.cgi through an '.htaccess' file for PHP-CGIWrap.
That's me right there. I run PHP as a CGI rather than an Apache module because it means I don't have to make my site files writable to everyone on the same shared server. Unix gave me file permissions, and I'm going to use them. The above was followed by instructions on how to upgrade to the latest PHP CGI.
It didn't help, though. I got on the phone to Pair. Eventually they managed to get me a working copy of PHP. The site came back up.
Normally I trust Pair to be reliable and not to break things unexpectedly. In this case I think they blew it in a major way. Their PHP upgrade immediately broke my web site, and this was done with no more than 90 minutes' notice (the time from when they made the upgrade until I noticed something amiss). The announcement of the change was made on an internal Pair usenet server and on an RSS feed. You can't reasonably expect all of your customers to be constantly watching those. With this sudden, drastic change you need to be calling people's cell phones to alert them. And you can't make a change that's going to break people's web sites without telling them it's going to break the site. The announcement made it sound like the upgrade was something I should do, not something I must immediately do to prevent my site from going offline. Geez, what if I'd been out of town? The site could have been down for weeks. I'm just lucky I was online to catch this before it went on too long.
Red Alert #2
It turned out that not all was well.
For Chimey and MondoMouse I use Aquatic Prime to generate license files. Aquatic Prime requires some server support if you want these to be generated automatically. For me the normal process is:
- Someone orders the software through my eSellerate-powered web store.
- eSellerate sends my server an HTTP POST which contains information on the sale.
- If the POST data looks good, Aquatic Prime generates a license file and emails it to the customer.
And of course, I'm using a PHP implementation of Aquatic Prime.
I soon discovered that although the site was up, my Aquatic Prime code was broken. Gaah! This meant that if someone ordered my software, they wouldn't get a registration code! Oh shit!
This led to much near-panicked research involving having eSellerate send some "preview" mode test sales to the server and new debug code in my copy of Aquatic Prime. I discovered that while eSellerate was correctly sending the POST data, said data was not actually making it into my PHP code. No sale data, no license file, no email.
The stock Aquatic Prime code for working with eSellerate (contributed to the project by yours truly) grabs the POST data out of PHP's global $HTTP_RAW_POST_DATA variable. eSellerate sends XML instead of key/value pairs, so that's a convenient way to get at the XML. But now $HTTP_RAW_POST_DATA was always an empty string.
I tried running a phpinfo() on the new PHP, and discovered that the always_populate_raw_post_data setting was "Off". That certainly explained what I was seeing, because without that you don't get $HTTP_RAW_POST_DATA. But why would it suddenly be off? I guessed that Pair had changed the PHP configuration file without warning, a charge they denied (although it did take several emails back and forth before I could even get someone from support to understand what I was asking about).
Having no useful clues from Pair, or from PHP's release notes or changelog, I proceeded to find a work-around to get the license code going again. My fix is as follows:
if ($_SERVER["REQUEST_METHOD"] == "POST") {
    $HTTP_RAW_POST_DATA = file_get_contents("php://input");
}
This bypasses the always_populate_raw_post_data setting to get at the POST data using a PHP URL wrapper. I'm not sure this is the best solution but it does the job.
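For anyone who wants to go a step further, here is a rough sketch of how the same idea might look if the script skipped $HTTP_RAW_POST_DATA entirely and parsed the XML straight from php://input. This is not the stock Aquatic Prime code. The function name parse_sale_xml and the XML used to exercise it are made up for illustration, and it assumes the SimpleXML extension is available.

```php
<?php
// Sketch only: parse_sale_xml() is a hypothetical helper, not part of
// Aquatic Prime or eSellerate. Assumes the SimpleXML extension is loaded.

// Turn a raw POST body into a SimpleXMLElement, or null if the body is
// empty or is not well-formed XML.
function parse_sale_xml($rawBody)
{
    if ($rawBody === '' || $rawBody === false) {
        return null; // nothing arrived, which is the symptom described above
    }

    // Collect XML parse errors quietly instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($rawBody);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return ($xml === false) ? null : $xml;
}

// Only touch the request body when this is actually a POST.
if (isset($_SERVER['REQUEST_METHOD']) && $_SERVER['REQUEST_METHOD'] === 'POST') {
    // php://input yields the raw body regardless of always_populate_raw_post_data.
    $sale = parse_sale_xml(file_get_contents('php://input'));
    if ($sale === null) {
        // Log it, so an empty or broken POST doesn't fail silently.
        error_log('Order POST arrived with an empty or malformed body');
    }
}
```

The point of returning null and logging is that the failure I hit was silent: no sale data, no license file, no email, and nothing telling me so.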
That got me running again, but left me with one glaring question: WHY had this happened? If Pair didn't change their PHP config, and if the PHP changelog made no mention of this, why was my script suddenly broken after running correctly for so long?
Fortunately nobody tried to buy my software while the license system was down. But you know you're having a really bad day when you start a sentence with "Fortunately nobody tried to buy my software...", regardless of how you finish the sentence.
Apparently this is a known bug in PHP 5.2.2. And it's not just me who's affected. XML-RPC is broken by this for many people, such as WordPress's implementation (they've arrived at more or less the same solution as me). Drupal's implementation of XML-RPC was already using the php://input approach and is therefore unaffected.
Technically I guess I have to lay this problem on the PHP developers. They seem to have a problem with $HTTP_RAW_POST_DATA, because the changelog indicates that this exact bug was fixed previously in version 5.0.2 and version 5.1. Now it's back for a third round.
But a big part of the reason I'm with Pair in the first place is that I trust them not to stick me with buggy code. They don't always have the latest versions of everything, and that's just fine with me if it means I'm trading currency of releases for stability. I don't need the latest features of everything, I need my site to run reliably. Of course you're unlikely to get a release of something like PHP without there being some known bugs. At the same time, XML-RPC is an enormously popular system, and I would have expected a PHP release that broke it to also have failed Pair's vetting process.
What to do about all this? I've hosted with Pair since 2002. This is the first time I've had any serious trouble with my site since some time in 2004, and even that turned out not to be Pair's fault. And of course they still beat the crap out of places like Dreamhost for reliability. At the same time this is a very disappointing and worrying failure on their part. I'm not looking at switching just yet but I'm a lot less confident in my current setup than I was a week ago.

Tue, 05/15/2007 - 14:52
For a client of mine who, like yourself, runs mission critical stuff at Pair (i.e., we lose money when it's down), we use a dedicated box and build our own PHP CGI so we are in charge of all the upgrades. I think you can do that on a shared box too (not 100% sure).
Anyways, it would be nice if Pair did some site sniffing and, if they discover a 500 error or other down-ness of your site after they do an upgrade, notified you. I wouldn't demand such a service but it could be a nice add-on for them to offer. I use similar polling services for other clients. Montastic is a nice free option.
http://www.montastic.com
And on that same note I can't blame them for patching a known security hole with haste. They are one of the biggest shared hosting companies in the world, with a ton of client data to protect. I would much rather my site fail, I get notified, and fix the issue than they not patch and my client list is stolen.
Part B of your story is poor code review by the PHP team, hardly something Pair should be blamed for, though unfortunate nonetheless.
You wrote:
"The announcement of the change was made on an internal Pair usenet server and on an RSS feed. You can't reasonably expect all of your customers to be constantly watching those."
True. But customers who have mission critical software should be. It's the life of an indie developer + system admin and yeah it sucks. System Admining is probably my least favorite aspect of my gig as an indie. I dream of the day someone else is doing it for me.
Tue, 05/15/2007 - 15:49
Zorn: Maybe I didn't emphasize the suddenness of the change enough. When I referred to "constantly" watching those sources, I meant in the sense of reloading the feed every 30 seconds or so and being prepared to drop everything every time you do.
They made the PHP change without advance warning, and as soon as they did the site went down. Anyone affected by the problem would have their site offline for however long their RSS client's refresh cycle lasts, or more if they don't immediately read their feeds. Unfortunately there's no amount of preparedness on my part that would have prevented at least some downtime here. The way the change was made meant that it was only possible to react after the fact.
Tue, 05/15/2007 - 14:57
These things happen everywhere, yet maybe not as bad as yours... who knows.
I work for a large datacenter/isp and these issues happen from time to time. Without redundancy/clustering etc, it is hard to not be impacted by a single machine not being 100%, at 100% of the time.
Glad everything is back up. Stinks they took so long to fix it.
Thu, 05/17/2007 - 03:33
Tom: The new site design looks awesome! What an improvement.
Fortunately this bug didn't affect me (I have stuck with php4, partly because php5 has a reputation for bugginess). But it did affect some of my customers, and cost me support time dealing with customers for whom things had "suddenly broken" with MarsEdit.
I put more of the blame on PHP than on Pair. It sounds like the update was presented as a critical security fix. In many ways Pair is merely the facilitator between you and your trusted software packages. In this case I think you're saying, by using PHP, that you trust PHP's "critical security updates" to be safe.
I think the circumstances of this release from PHP made a LOT of people look culpable. WordPress looked broken. Pair looks responsible. Etc. Etc. Etc.
I'm curious to know how a company like Pair balances the "trust/verify" relationship when it comes to major software packages like this. I guess they should test everything extensively before deploying it, but again, it sounds like this release was marketed as a security update, and PHP in fact changed quite a bit more than that?
Thu, 05/17/2007 - 13:55
The latest PHP update (5.2.2) was described as having "major stability and security enhancements" over the previous release. However Pair doesn't always keep up with the latest version. By going to 5.2.2 they leapfrogged 7 or 8 interim PHP updates, and I don't know what would have been in them.
As I've gradually come to understand, the 500 errors weren't because of the new version of PHP per se, but rather because of the way Pair's CGIWrap system works. I don't know if this is the same as the SourceForge CGIWrap project or not, but basically by using CGIWrap your CGI code runs as your user ID instead of the web server's UID. Pair has a page describing how to use this with PHP4. It can also be done with PHP5 but you have to learn about it from strangers on the streets (i.e. Pair's internal newsgroups).
Pair explains to me that:
I don't understand CGIWrap enough to know why that would matter, but I guess the upshot is the same-- internal server errors until I got a new version of php5.cgi.
Fri, 05/25/2007 - 03:36
Hi there, I was setting up WordPress for the first time this week and whilst getting MySQL and PHP up and running, I recall stumbling over a forum post about this. Many have used the same solution as you have. However a few replies have commented that the HTTP_* variables have been deprecated.
I'd agree Pair should send out notices of what upgrades they are doing and do some checks about the impacts.
The issue for WordPress is that it claims compatibility with a wide range of PHP versions. As developers we know what a nightmare it can be to have code that works correctly on all versions. And that's if we or our ISPs keep the various versions of MySQL, PHP and WordPress up to date.