[Olpc-sysadmin] mock.l.o outage (0400 EDT - 1930 EDT) & questions.

Michael Stone michael at laptop.org
Sat Sep 13 19:47:03 EDT 2008


1. mock.laptop.org was reported down to me at 1026 EDT by cscott on
    IRC. cscott also reported that it had been up as late as 0400 EDT.

2. I began investigating the situation at about 1900 EDT. 

    a) First, I determined that mock still responded to pings. 

    b) Then, I determined that mock was not responsive to SSH attempts.

    c) Next, I determined that mock was connected to port 2 of a KVM
       switch in the 1cc server closet. I asked the switch to display
       port 1 (which correctly showed pilgrim.laptop.org), then port 2.
       The switch reported 'no signal' for port 2. Mock did not respond
       to keyboard input through the KVM switch, including Ctrl-Alt-Delete.

    d) At this point, I power-cycled mock and watched it come up normally
       on port 2 of the KVM switch.

    e) I then verified that mock was accessible over SSH and HTTP.

    f) I then briefly inspected mock's log files looking for hints about
       the cause of failure. Unfortunately, I learned nothing.

         - Security Note: My ssh-agent's connection was briefly forwarded
           to mock post-reboot while I inspected the machine, before I
           thought to disable such forwarding.
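
The SSH and HTTP reachability checks in 2(b) and 2(e) could be scripted for
future outages. What follows is only a minimal sketch, not something that was
run during this incident. It assumes SSH listens on port 22 and HTTP on port
80 (the usual defaults, not confirmed for mock), and the function names are
made up for illustration. It tests only whether a TCP connection can be
opened, so it does not replace the ping check in 2(a) or an actual login.

```python
import socket


def tcp_port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def check_services(host, ports=None, probe=tcp_port_open):
    """Probe each named port on host and return a {name: reachable} dict.

    The default port numbers are assumptions, not taken from mock's config.
    """
    if ports is None:
        ports = {"ssh": 22, "http": 80}
    return {name: probe(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, ok in check_services("mock.laptop.org").items():
        print(f"{name}: {'open' if ok else 'unreachable'}")
```

A machine in the state described above (answering pings but refusing SSH)
would presumably show "unreachable" for ssh here, though a port that accepts
connections can still sit in front of a hung service.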

3. Questions:

    a) Why did mock go down?

    b) Given that we don't know why mock went down, should we leave it up
       pending further diagnosis?

    c) What further diagnosis should be conducted?
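
One possible starting point for (c) is to find when mock stopped logging, by
looking for the longest silence between consecutive log entries. The sketch
below is an assumption-laden illustration: it assumes traditional syslog
lines that begin with a timestamp such as "Sep 13 04:00:00", it needs the
year supplied separately because that format omits it, and the function name
and the log path in the usage note are hypothetical.

```python
from datetime import datetime


def largest_gap(lines, year):
    """Return (gap, line_before, line_after) for the longest silence in a log.

    Lines whose first 15 characters are not a syslog-style timestamp are
    skipped. Returns None if fewer than two timestamped lines are found.
    """
    best = None
    prev_stamp = None
    prev_line = None
    for line in lines:
        line = line.rstrip("\n")
        try:
            # Classic syslog timestamps carry no year, so prepend one.
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue
        if prev_stamp is not None:
            gap = stamp - prev_stamp
            if best is None or gap > best[0]:
                best = (gap, prev_line, line)
        prev_stamp, prev_line = stamp, line
    return best
```

It could be used as `largest_gap(open("/var/log/messages"), 2008)`, assuming
that is where mock keeps its system log. The entries just before the gap are
the ones most worth reading closely.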

Regards,

Michael

