Housekeeping: Not Enough Fiber In Our Diet

Some of you might have noticed that the site was down today. Apparently it is possible to cut the connection to Riverside Green’s highly available and highly expensive colocation facility with one backhoe.

It took five hours to restore the clipped fiber. Needless to say, I’m not satisfied. When I was at Honda, they walked people out the door for five MINUTES of downtime. There may be changes ahead for how we host the site. Apologies for the inconvenience.

25 Replies to “Housekeeping: Not Enough Fiber In Our Diet”

  1. James

    Unrelated to the excavator-instigated island-wide blackout in Puerto Rico? I’ve had a server farm go down because someone plugged in a kettle and overloaded the circuit (same circuit!). Another one went down when a cleaner unplugged a rack to plug in his vacuum; the big plug was in the way.

  2. roamer

    Honestly, I’m not surprised. Most datacenters don’t have redundancy in their telco demarcs; only a Tier 4 DC might have multiple exit points, and most won’t. Redundancy at that level is stupid expensive, and only very large (read: .gov) organizations are willing to pay for it. Most companies substitute by using multiple DCs.

    Once an outside circuit has been cut (known in the industry as ‘backhoe fade’), it usually takes at least a couple of hours for the circuit to be fixed, depending on accessibility and the level of damage done. Realize that the datacenter has zero control over the circuit once it’s out of their building. One circuit I know of was down for almost eight hours recently; it turned out to be routed over an abandoned railroad bridge somewhere near St. Louis, and the bridge caught fire. (I wish I were kidding.) The repair teams arrived quickly, but they couldn’t get to work until the local fire department cleared the scene.

    Redundancy is the only real way to ensure uptime: add a server in another DC (near the west coast, so you can geo-load balance) with a load balancer set up to route traffic away from a server if it becomes unavailable.
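    A rough sketch of that last part, assuming nginx is doing the balancing; the IPs and the filename below are placeholders, not anything real:

        # /etc/nginx/conf.d/riverside.conf (hypothetical)
        upstream riverside {
            # nginx stops routing to a box after repeated failed requests
            server 203.0.113.10:80 max_fails=3 fail_timeout=30s;   # east DC (placeholder)
            server 198.51.100.20:80 max_fails=3 fail_timeout=30s;  # west DC (placeholder)
        }

        server {
            listen 80;
            location / {
                proxy_pass http://riverside;
            }
        }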

      • roamer

        You take the current top-level IP address for the site and assign it to the external side of the load balancer. Requests come in via that IP and are redirected by it (them, really; you want a backup balancer just in case) to one server or the other, based on whichever factors you choose. If you lose connectivity to a server, it simply pushes all traffic to the available one.
        Virtually all companies with an online presence above a certain size work this way; every time you log into WoW*, or the online game of your choice, it directs traffic from a central IP to the datacenter nearest you that has a colo for that company. This smooths out the load and helps ensure low ping times and good responsiveness from the game. It also means that maintenance is just a matter of pulling the whole DC from the active group in the load balancer and working on it while no traffic is moving to it (a rough sketch of that follows below the footnote).

        *the highest processing density I’ve ever seen was a WoW colo cage. Custom extra-tall rack, every space occupied by the latest blade server, with a switch on top, a RAID array, and a bottom-mount PDU supplying 440V power, which increased efficiency and reduced the power bill. Very cool, and paid for in cash by Blizzard’s legions of monthly dues-payers.
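        Pulling a box out of rotation for maintenance, continuing the hypothetical nginx config sketched further up the thread, would look roughly like this:

            # mark the east-DC box "down" so all traffic shifts to the other DC
            # (GNU sed shown; use `sed -i ''` on BSD)
            sed -i 's/203.0.113.10:80 max_fails=3 fail_timeout=30s;/203.0.113.10:80 down;/' \
                /etc/nginx/conf.d/riverside.conf
            # validate the edit, then reload without dropping live connections
            nginx -t && nginx -s reload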

        • Disinterested-Observer

          Be that as it may, you are still dependent on the load-balancer site not failing in some way. I don’t know how high availability sites get around that.

          • roamer

            Failover, like I said. You have a secondary LB that gets failed over to if the primary goes down. Beyond that, you can set up a backup in another location with IP redirection (it repoints if the top-level IP stops responding), but that involves work beyond the level I’ve been around.
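            A bare-bones sketch of that kind of repointing, assuming the zone is hosted in Route 53 and this runs from cron on a box outside both DCs; the IPs, zone ID, and hostname are all placeholders:

                #!/bin/sh
                # if the primary load balancer misses three checks in a row,
                # point the site's A record at the standby
                PRIMARY=203.0.113.10          # placeholder addresses
                STANDBY=198.51.100.20
                ZONE=Z0000000000EXAMPLE       # placeholder hosted-zone ID
                NAME=www.example.com          # placeholder hostname

                for try in 1 2 3; do
                    curl -fsS --max-time 10 "http://$PRIMARY/" >/dev/null && exit 0
                    sleep 30
                done

                aws route53 change-resource-record-sets --hosted-zone-id "$ZONE" \
                    --change-batch "{\"Changes\":[{\"Action\":\"UPSERT\",\"ResourceRecordSet\":
                      {\"Name\":\"$NAME\",\"Type\":\"A\",\"TTL\":60,
                       \"ResourceRecords\":[{\"Value\":\"$STANDBY\"}]}}]}"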

  3. Shortest Circuit

    *sigh* I feel your pain; I found out the same way that our two redundant IXP uplinks were routed parallel to each other. Backhoe, darkness. At least I learnt new curse words from the engineers patching the fiber back up. Some tools they had…

  4. -Nate

    How difficult is it to splice the optic cable back together once they’ve located the damage?

    It sounds like some here might know this.

    TIA,

    -Nate

    • roamer

      Splicing the fiber itself is not difficult. But remember, the backhoe (or fire, or rifle bullet, or what-have-you) had to go through the other lines, the inner protective wrap, the outer rubberized waterproofing, the protective steel armor jacket (if the cable is big enough), and whatever else was in the way. So you have to sort out the mess, peel everything apart, figure out what’s broken, and then start fixing. That can take a while.

  5. MrGreenMan

    Even the best colo will fail you. I had a rack at a place with the world’s greatest backup diesel generator, all sorts of stuff, certified EEs on site… And they still toasted my servers when their diesel generator, which had a hole in the intake line, thrashed on and off for about 30 minutes before going down completely.

    It was nearly as economically damaging as the day Adria Richards decided to knife every SendGrid customer and we lost two big accounts because the messages weren’t getting through due to that DDoS.

  6. Salubrious

    I was working with a Fortune 100 client who had custom applications that my company developed and supported. After some server outages, we were exploring adding redundancy to the hosting. One of the IT managers at the client asked us why we would need redundancy when we had an SLA that guaranteed uptime. We almost started laughing.

    • dejal

    Didn’t Jack do a story about management that got to a certain level and faked it every day?

    The Peter Principle on steroids.

  7. Dirty Dingus McGee

    Was the backhoe hovering over the line yesterday? I was unable to connect yesterday afternoon and haven’t had a chance to try until just now. It could well be my local provider, as I truly believe their infrastructure is one step removed from two soup cans and a string.

    • Tom Klockau

      Yes, it was yesterday. Worked ok at lunch, then later that afternoon, nothing. I checked my phone too, and no dice there either.

  8. DirtRoads

    I wondered what was going on. I thought my work computer just didn’t like Jack’s website anymore.

  9. mas

    At least backhoes make honest, if stupid, mistakes. Hunters shooting the cables off the poles in the forest…
    Yes, it did happen. It took the team 2-3 days of ATVing to get to the location and fix it.

  10. Ben Johnson

    High availability for poor people:

    crontab: @hourly /root/sync

    /root/sync: (do it twice to have a good chance that the DB copy is somewhat consistent)
    /usr/local/bin/rsync -avzPp --exclude=/dev --exclude=/sys / root@someotherhost:/
    /usr/local/bin/rsync -avzPp --exclude=/dev --exclude=/sys / root@someotherhost:/

    (Adjust paths and exclusions if you’re using Linux. Use FreeBSD, you Philistine.)

    There’s a 99% chance that MySQL will spin up fine, even though you’re not supposed to be doing that. (And you should be using PostgreSQL if you care.)

    Do some DNS magic through Route 53 and you’re done.
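    If you want to take some of the luck out of the MySQL part, one variation (just a sketch, with the same placeholder host and paths as above) is to dump the databases right before the copy:

        #!/bin/sh
        # /root/sync, revised: dump the DB to a flat file first, so the standby
        # can restore a consistent snapshot instead of trusting raw datadir files
        # (--single-transaction gives a consistent dump of InnoDB tables)
        /usr/local/bin/mysqldump --all-databases --single-transaction \
            > /var/backups/all-databases.sql
        /usr/local/bin/rsync -avzPp --exclude=/dev --exclude=/sys \
            / root@someotherhost:/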

    That being said, someone has thrown good hardware at Riverside Green and done some good tuning; it’s darn responsive, especially when you consider that it’s WordPress. Kudos!

