Bad Hardware Day

The site seems to be running nicely on the old server in the cellar. I’m sure browsing photos isn’t a terribly fast experience any more, but until I can find a reasonably priced hosting provider I can trust, this is the way it has to be. As detailed here, my last dedicated hosting provider turned out to be less than dedicated and not much of a host to his paying guest.

As I’ve mentioned before, the new hardware that Web Host Plus put me on after managed.com sold my hard drive to them was less than reliable. I suspected a few causes, one of which was bad RAM. This theory now seems to have been borne out by an experiment I did. Before I rsynced our photo gallery across the Atlantic, I rebooted the problem box in New Jersey and reduced its operational RAM from 2 GB to 512 MB. In that new configuration, I was able to spend many a joyous hour copying my precious data over the transatlantic pipe without the originating box going AWOL. QED, I’d say.
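The post doesn't say how the RAM was capped. One common way on Linux is to boot with the `mem=512M` kernel parameter, and that is purely an assumption here. Whatever the method, it is worth confirming after the reboot that the limit took effect. The helper below (`mem_total_mb` is a made-up name) is a minimal sketch that reads `/proc/meminfo`-style input and reports the total in whole megabytes; expect a figure slightly under the cap, since the kernel keeps some memory for itself.

```shell
#!/bin/sh
# Convert the MemTotal line of /proc/meminfo-style input (stdin)
# into whole megabytes. The field is reported by the kernel in kB.
mem_total_mb() {
  awk '$1 == "MemTotal:" { print int($2 / 1024) }'
}

# Usage on the box in question:
#   mem_total_mb < /proc/meminfo
```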

I’ll be cancelling service with managed.com, Web Host Plus or whoever’s running the show now as soon as my next invoicing cycle starts. Good riddance to bad rubbish, as they say.

In the meantime, any DNS oddness you may have been seeing should now be a thing of the past. All slaves are in sync and handing out correct data. Incorrect data in the caches of other DNS servers should now also have expired. Normal service has now been resumed.

Gallery Restored

Our photo gallery is back on-line.

I was able to bring it back much more quickly than I had first anticipated, because I used an old Gallery 1.x back-up to seed the albums in the Gallery 2.x installation. That copied hundreds of extra, unnecessary files, but they were easily removed afterwards by rsync.

I did this by getting a list of all the album directories on the remote server:

cd /var/www/g2data/albums
find . -type d > /tmp/file_list

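I then copied this list over to the new server, which held the Gallery 1.x back-up, and used it to seed each Gallery 2.x album directory from the old album of the same name: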

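# Read one directory per line, so names containing spaces survive
while IFS= read -r i; do
  album=${i##*/}
  # Find the same-named album in the old back-up (first match only)
  src=$(find /var/www/html/albums -type d -name "$album" | head -n 1)
  [ -n "$src" ] && rsync -av "$src/" "/var/www/g2data/albums/$i"
done < /tmp/file_list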
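The clean-up pass mentioned earlier, in which rsync removed the surplus Gallery 1.x files, is not shown in the post. A sketch of how it could be done follows. It relies on rsync's `--delete` option, which removes files on the receiving side that are absent on the sender. The function name `prune_extras`, the `RSYNC` override variable and the host alias in the usage comment are all assumptions made for illustration.

```shell
#!/bin/sh
# Allow the rsync binary to be overridden (handy for testing).
RSYNC=${RSYNC:-rsync}

# Mirror $1 onto $2, deleting anything in $2 that $1 lacks.
# Pass -n as a third argument for a dry run that only lists changes.
prune_extras() {
  $RSYNC -av --delete ${3:+"$3"} "$1" "$2"
}

# Hypothetical usage; check the dry run before doing it for real:
#   prune_extras ulysses:/var/www/g2data/albums/ /var/www/g2data/albums/ -n
#   prune_extras ulysses:/var/www/g2data/albums/ /var/www/g2data/albums/
```

Because `--delete` is destructive, running with `-n` first and reading the list of deletions is the safer order of operations.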

In the end, I needed to copy from New Jersey only the photos we had taken since mid-February, which is when I had done a full back-up in preparation for migrating from Gallery 1.x to 2.x.

Somehow, one of the tables in the MySQL database had got corrupted in the move:

060518 18:02:38 [ERROR] Got error 134 when reading table './gallery2/g2_ImageBlockCacheMap'

This was easily corrected:

mysql> repair table g2_ImageBlockCacheMap;
+--------------------------------+--------+----------+--------------------------------------------+
| Table                          | Op     | Msg_type | Msg_text                                   |
+--------------------------------+--------+----------+--------------------------------------------+
| gallery2.g2_ImageBlockCacheMap | repair | warning  | Number of rows changed from 45465 to 45460 |
| gallery2.g2_ImageBlockCacheMap | repair | status   | OK                                         |
+--------------------------------+--------+----------+--------------------------------------------+

And, with that, the rescue and salvage operation to yank caliban.org from the incompetent clutches of the unholy alliance of Managed.com and Web Host Plus is 95% or more complete.

Once the residual DNS propagation issues evaporate, I’ll be able to fully exhale once again.

Pain And Suffering

Much of the last 24 hours has been spent seizing those rare moments during which my server — migrated through no desire of mine to Web Host Plus — is up and on the network, and using them to perform a migration of my own, namely to the server in my cellar.

I’m knackered, but a lot has been accomplished today. DNS and e-mail have now been fully migrated, including Web mail and mailing lists. The Web site, which you’re now reading, is also up and running on the new (well, actually quite old) server.

The main thing that’s not yet back up is our gallery of photos. That’s because it’s 19 GB of data, which would be slow to copy from a reliable server on a fast network. Well, I have to copy it from a machine that keeps crashing and is not on a fast network. It could be a couple of weeks before I manage to get all of my data off it… if I’m lucky. I don’t want to even contemplate the notion of not being able to recover all of my data.
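Copying from a box that keeps falling over usually means restarting the transfer again and again. The sketch below shows one way to automate that with a small retry wrapper. The `retry` function, the attempt count and delay, and the host and paths in the usage comment are illustrative assumptions, not the command actually used. rsync's `--partial` option keeps partially transferred files, so a later attempt can pick up where the last one stopped.

```shell
#!/bin/sh
# Run a command until it succeeds, trying at most $1 times and
# sleeping $2 seconds between attempts. Returns 1 if every try fails.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  until "$@"; do
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
}

# Hypothetical usage: up to 100 tries, one minute apart.
#   retry 100 60 rsync -av --partial ulysses:/var/www/g2data/ /var/www/g2data/
```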

I thought I’d left this kind of sysadmin drudgery behind me when I stopped working. Indeed, I moved my domain to dedicated hosting to reduce the downtime and maintenance that I had repeatedly endured when I hosted it myself on a domestic DSL line. Little did I know that I would get to enjoy such advantages for barely a year before falling victim to the worst kind of professional incompetence: that with no sense of responsibility for one’s customers.

And so caliban.org is back on a domestic DSL line, albeit one that has proved itself more reliable than those I had in the US. The upstream bandwidth is also somewhat better.

Anyway, don’t bother looking for new photos — or old ones — in the near future. I’ll announce when — if? — they’re once again available.

Wankers

Yes: wankers. Wankers! WANKERS! WANKERS! __WANKERS!__

Who am I talking about? Managed.com, of course, the company to which I give good money each month to host this site.

What happened?

Well, managed.com decided to move its network from California to New Jersey. At least, that’s as much as they told us, the paying customers.

In preparation for this, they sent all of their customers an e-mail asking to be supplied with the customer’s root password via plain text e-mail. For those of you who aren’t in the field of computer system and network administration, let me state that this is a violation of one of the most basic and universally lauded principles of the profession: never, under any circumstances, send passwords in the clear.

And yet, my hosting provider was asking me to do just this. In hindsight, that should have been enough to spur me into action. I should have found another hosting provider, right there and then, and moved my data prior to the migration. But I decided to wait until after the migration to seek a better provider. As always, laziness, compounded by a failure to recognise the urgency of the situation, won out.

Anyway, managed.com were supposed to back up their customers’ data, firstly with a full back-up and then, shortly before the migration, with a further incremental back-up. The migration was supposed to be barely noticeable, with a guaranteed maximum of two hours of downtime.

I was sceptical, but kept my fingers crossed.

Can you believe that managed.com didn’t tell its customers in their notification e-mail when this migration would actually take place? We were left to guess. E-mails to them on the subject went unanswered, as did requests for a secure channel through which to supply one’s root password.

When I noticed one day that my machine had been rebooted without my permission, I incorrectly assumed the migration had already taken place. If I’d known at that time that things would be moving to New Jersey, not just around the corner in California, I could have run a traceroute and seen that my machine had not actually gone anywhere. At that time, however, I thought they were just moving locally. What else could I think? Managed.com had told me virtually nothing in their e-mail.
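For readers unfamiliar with the traceroute check: the last few routers on the path to a machine often carry hostnames that hint at a city or data centre, and a coast-to-coast move would normally change them. The helper below is a small sketch for pulling those final hops out of traceroute output. `final_hops` is a made-up name, and it assumes the usual output format in which each hop line starts with the hop number and unanswered probes appear as `*`.

```shell
#!/bin/sh
# Print the last N hops (default 3) that answered, reading
# traceroute output on stdin. The header line and hops that
# returned only "*" are skipped.
final_hops() {
  awk '$1 ~ /^[0-9]+$/ && $2 != "*"' | tail -n "${1:-3}"
}

# Usage:
#   traceroute caliban.org | final_hops 3
```

Comparing the output before and after a suspected move shows whether the tail of the path has changed.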

caliban.org mysteriously went off the network on 9th May. It remained inaccessible for almost three days. So much for the two hours of guaranteed downtime.

All of my e-mails to managed.com went unanswered in this period. Only when I threatened them with legal action (a trick I picked up in America) did they finally respond by rebooting the machine and getting it back on-line.

Naïvely, I thought that would be an end to my problems. Yes, that was very naïve of me.

You see, managed.com restored my service from a week-old back-up. I’ve no idea what happened to the promised incremental back-up. To be of any use, it would have had to cover the whole of the last week’s data, not just the day before the migration. I suspect it was never even made.

The net effect? I found I was missing a week’s worth of e-mail, multiple DNS changes had been lost, the last week’s worth of blog entries had effectively never been written, and sundry other less serious issues now needed to be fixed, such as recent software updates becoming undone.

More e-mails to managed.com went unanswered. Due to an oversight on my part, my own off-site back-ups had not taken place in recent times, so I had no private back-up from which I could recover my data. Typical.

I began work on the system to repair the damage my hosting provider had done to it, but before I could achieve very much, the system went down again. This time it was off-line for more than a day. Once again, e-mail threats were required to get it back on-line.

So what’s going on?

Exploration of my system’s log messages shows that the new hardware on which my data resides is not the same as the old. For one thing, the system has a different Ethernet card. Now, either that card is flaky or the Linux driver for it is, because the system regularly gives up the ghost and all but crashes: TCP connections to open ports hang without response; processes can no longer be forked; even syslogging stops.

Yet, even if the new hardware had presented no problems, it’s inconceivable that a company would move a working Linux (or any other) system to new hardware and just expect it to work. What if I had not had the driver for the new network card compiled for my kernel? My machine would have had absolutely no way of ever getting back onto the network after the migration. It’s sheer luck that I can sometimes still log into my machine and that it’s not completely dead to the world.

So, the networking on the new hardware is extremely unreliable. rsyncs regularly fail with checksum errors. The more network traffic one pumps over the interface, the more such errors occur. Eventually, the system becomes unstable and then unreachable.

It’s also possible that the machine has bad RAM or ineffective cooling, either at the CPU or the data centre level. Witness these messages, culled from my log in a rare moment of accessibility.

May 15 06:39:58 ulysses CPU0: Temperature above threshold
May 15 06:39:58 ulysses CPU0: Running in modulated clock mode

The system is now on heavy-duty medication: cold reboots, at first twice daily, but that proved inadequate, so cron now reboots the machine every hour. That’s the only way to avoid the machine locking up completely, which then puts me at the mercy of managed.com to reboot it. That’s something that now seems to take more than 24 hours to accomplish.
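The actual cron entry isn't shown in the post. As a rough sketch, an hourly restart could be added to a crontab file as below. The `/sbin/shutdown -r now` command, the function name `add_hourly_reboot` and the temporary file in the usage comment are all assumptions. The schedule `0 * * * *` means minute 0 of every hour.

```shell
#!/bin/sh
# Append an hourly reboot entry to the crontab file $1,
# unless an identical line is already present.
add_hourly_reboot() {
  file=$1
  entry='0 * * * * /sbin/shutdown -r now'
  # -x matches whole lines only, -F treats the entry as a literal string
  grep -qxF "$entry" "$file" 2>/dev/null || printf '%s\n' "$entry" >> "$file"
}

# Hypothetical usage, as root:
#   crontab -l > /tmp/cron.new 2>/dev/null
#   add_hourly_reboot /tmp/cron.new
#   crontab /tmp/cron.new
```

Checking for an existing line keeps the entry from piling up if the snippet is run more than once.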

Clearly, this appalling state of affairs can’t be allowed to continue, so I’m already on the look-out for alternative hosting providers.

A year ago, when I selected this company to host my services, people seemed happy with it. I, too, was happy with the service until earlier this year. In the last couple of months, however, things have been going downhill, which is never a good portent for the future. Nevertheless, I was not prepared for what has now befallen me. These people are lacking even the most basic system administration skills.

So, what happened? Well, a little research shows that managed.com is not really performing a migration. The hard drives and the data have moved to the other side of the country, yes, but not because managed.com is doing it. No, managed.com has been sold, you see? My data now turns out to be at the mercy of Web Host Plus, so the current disaster is actually largely due to their mismanagement and incompetence.

In fact, it turns out that a great many people are in a [similar or even worse state](http://www.webhostingtalk.com/showthread.php?t=508358), thanks to this bunch of clowns.

Sixty-three pages of utter misery and appalling professional disregard of one’s customers come to light.

Anyway, to say that I am in the market for a new hosting provider is an understatement. If you have any recommendations, I’d be glad to hear them. Ideally, they should not be located in the US, due to that country’s Draconian legal stance with regard to privacy.

Thanks to Google, I was able to rescue the missing blog entries from the Google cache. I had to add back the article comments by hand, which caused the loss of the original time of entry, but at least the text of the article itself has been recovered.

The week of missing e-mail, on the other hand, is simply gone. Calls to Web Host Plus to make available the missing incremental back-up simply fall on deaf ears.

I’m utterly appalled to experience first-hand how this company has lost my data and now ignores my complaints. I’m left bewildered as to the precise ratio of incompetence to deliberate professional disregard, but I am 100% sure that I have to get my data away from this bunch of wankers as soon as I possibly can.

Until that time, expect the server to be up and down like a yo-yo.

Miff TV

I went back to troubleshooting my expensive box of dysfunctional hardware today, but I actually couldn’t test any further. The next thing to check was the power supply, but I didn’t have a spare ATX unit to pop in.

Sadly, I had to bite the bullet and take the thing into a trusted PC shop. There, the bloke tried a new power supply, but it made no difference. That leaves pretty much just the CPU and motherboard to test. I bet it ends up being the sodding motherboard. That’ll give me another thing to send back to the place I purchased it. What a hassle. Buying hardware on-line is fine when everything works, but when it doesn’t…

Anyway, they probably won’t even start to look at it until early next week, so I’ll forget about it for the next few days.