no longer 100%

Until yesterday we had a perfect uptime history: 1,375 days online with no interruptions. Today that changed: starting at 9:16am PDT, INTERDUBS was not reachable for 19 minutes.

Absolutely our fault. Of all possible scenarios this is actually our preferred one, since we can fix it. We did, and this problem will never occur again. It still sucks to have lost our outage virginity. All outages are avoidable, and so was this one. And, as usual, it was a lack of imagination that caused us not to see this coming.

What happened?

An upload at 550 Mb/s triggered an automated protection system that took a network interface offline. It should have impacted only the one address generating that amount of traffic. But it was doing its job wrong and shot in both directions, rendering us unreachable.
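To illustrate the intended behavior (block only the offending address, leave everyone else reachable), here is a minimal sketch. It is not our actual protection system: the class name `PerAddressGuard`, the 500 Mb/s threshold, and the example addresses are all hypothetical, and a plain rate threshold stands in for the more specific traffic-pattern detection the real system uses.

```python
class PerAddressGuard:
    """Hypothetical guard that blocks single addresses, never the whole interface."""

    def __init__(self, limit_bits_per_s):
        # Assumed threshold, chosen for illustration only.
        self.limit = limit_bits_per_s
        self.blocked = set()

    def observe(self, rates_bits_per_s):
        """Take measured per-address rates; return the addresses newly blocked."""
        newly_blocked = {
            addr
            for addr, rate in rates_bits_per_s.items()
            if rate > self.limit and addr not in self.blocked
        }
        self.blocked |= newly_blocked
        return newly_blocked

    def is_reachable_for(self, addr):
        """An address stays served unless it was blocked itself."""
        return addr not in self.blocked


guard = PerAddressGuard(limit_bits_per_s=500e6)
newly = guard.observe({"203.0.113.7": 550e6, "198.51.100.2": 2e6})
print(sorted(newly))
print(guard.is_reachable_for("198.51.100.2"))
```

The point of the design is that the block list is keyed by address, so one heavy uploader can never take the service away from other clients.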

This system is meant to guard INTERDUBS against malicious brute force attacks. Not a bad idea, if implemented right. 550 Mb/s is of course still very, very far from the limit of our network capabilities. It was the volume and the specific traffic pattern that caused the emergency shut-off. We neither envisioned nor tested this specific condition. No two ways around this: Our fault. Embarrassing. Please accept our apologies (and money, if you like; see below for details).

We reconfigured the system and are confident today's outage will never happen again.

Since we were unaware of the bug in the system configuration, it took us a little while to identify the cause. Since the system is responsible for security, we also had to spend a couple of minutes testing the changes that then became the fix. 19 minutes is a long time for an outage. Looking at what needed to get done to bring us back online, we feel that we did OK. Not great, but OK. Of course there is room for improvement, and we started to implement those changes today.

INTERDUBS' overall uptime has now dropped from 100% to 99.99904%.
For the month we are down to 99.95%, well below the 99.999% we promise. None of our clients has to pay for INTERDUBS this month if they choose not to. A simple email is enough: we will discount the whole month.
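For anyone who wants to check the nines, the arithmetic can be sketched as follows. The 1,375 days and 19 minutes come from this post; the 30-day month and the helper name `uptime_percent` are assumptions made for the example.

```python
MINUTES_PER_DAY = 24 * 60


def uptime_percent(total_minutes, downtime_minutes):
    """Share of the period during which the service was reachable, in percent."""
    return 100.0 * (1.0 - downtime_minutes / total_minutes)


# Overall history: 1,375 days online with a single 19-minute outage.
overall = uptime_percent(1375 * MINUTES_PER_DAY, 19)

# Current month, assuming a 30-day month.
monthly = uptime_percent(30 * MINUTES_PER_DAY, 19)

# Downtime that a 99.999% promise allows in a 30-day month, in seconds.
allowed_seconds = 30 * MINUTES_PER_DAY * 60 * (1 - 0.99999)

print(f"overall: {overall:.5f}%")
print(f"monthly: {monthly:.3f}%")
print(f"allowed at 99.999%: {allowed_seconds:.0f} s per month")
```

Under these assumptions the overall figure matches the 99.99904% above, the monthly figure lands just under 99.96%, and a 99.999% promise leaves room for only about 26 seconds of downtime per month.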

Since counting nines of uptime is not really what most people want to think about, we decided to change our policy, effective immediately: we now guarantee 100% uptime. No longer ‘only’ 99.999%. If a client feels that INTERDUBS wasn’t there for them when they needed it, then they don’t pay. Simple as that.

This page documents all outages and service disruptions we have ever experienced.
