On September 28, I gave a talk titled “Monitoring and Debugging Big Clusters Running Real-Time NoSQL Apps” at the Surge conference in Baltimore. Here’s the abstract:
Running a big NoSQL cluster used for data-mining Map/Reduce jobs in production is challenging, but fairly common these days. Keeping a hundred-plus-node NoSQL-based cluster up while it serves sub-second responses to fickle web users can be terrifying for both operations and engineering staff. We’ve been operating such a system at blekko for two years now, and we’ve learned the hard way about the kinds of monitoring, debugging, and debuggability features needed to survive without pulling out too much of our hair. Come and learn from our mistakes!
If this sort of thing interests you, please take a look at the video and slides!