[TYPO3-typo3org] Post Mortem gerrit outage from 2015-01-13 late evening through 2015-01-14

Peter Niederlag peter.niederlag at typo3.org
Wed Jan 14 18:10:17 CET 2015


Dear TYPO3 Contributors,

As reported earlier, we had a severe crash on our server due to a power
outage in the data center on Monday. It seems we are still seeing some
after-effects.

Late on Tuesday 2015-01-13 we noticed fatal errors in one of gerrit's
central database tables. For this reason we had to shut down gerrit
for disaster recovery.

At first we spent an estimated 10 hours trying to fix the problem.
Around noon on 2015-01-14 we decided to simply restore one specific table
from a 24-hour-old backup. At that point we were fairly confident we would
find a way to recreate the missing data programmatically later on. After we
restored the table, gerrit was put back into production around 3 PM GMT+1.

Lieuwe Hummel, one of our community members, noticed our problem on
Twitter and sent us a bash script he had once used to fix things in a
similar situation. We adapted the script to our setup and use case and
were able to restore all patch requests that had been submitted on
2015-01-13 within another four hours of work.

Thx Lieuwe!

Lessons learned? InnoDB can be very tricky in case of severe failures.

Many thanks also to Steffen, who spent half the night trying to bring
back the data with the Percona disaster toolset.

Peter Niederlag

P.S.:
For those interested: CHECK TABLE reported 'InnoDB: The B-tree of index
"PRIMARY" is corrupted'.
-- 
Peter Niederlag
http://www.niekom.de * TYPO3 & EDV Dienstleistungen *


