I spent most of my week in Berlin attending Monitorama EU 2013. It was an enjoyable, well-organized event, and the talks were generally of good quality. All the presentations were recorded, so you can watch them if you’re interested.
The conference centered on one common goal: improving monitoring. A welcoming community is forming, and the people involved are generally polite and respectful, sharing a common focus on co-operation, open source and open platforms. Everyone agrees on the need to move the industry forward, regardless of the differences between devs, ops, dev-ops and marketeers.
As I reflected on the conference, it struck me how lost we are as an industry. One attendee told me he had come to the event expecting to be told how easy monitoring really is and that he was the one doing it all wrong. Instead, he soon learned that everyone is struggling with this stuff. That observation rang true. No one dared to evangelize an approach. Talk after talk focused on the struggles and challenges in the industry, and while some solutions were shared, many of the more significant questions remained unanswered.
Luckily, it is not quite that hopeless. Monitoring has evolved and come a long way. As a community we have solved a large number of monitoring problems: much of the collection, storage and processing of data has been worked out over the past few years. The biggest challenge now facing the community is figuring out what we should actually do with all this data.
During the conference, issues around alerting and surfacing problems came up again and again, and it is apparent that many problems in these areas remain unsolved, partly because of the limited skill set of the software engineers involved. Our domains are unpredictable, with complex behaviour that will not, and cannot, fit the mathematical models built for the predictable, steadily rhythmic metrics of a factory line. The result is an unacceptable number of false positives, with true positives missed, lost, or simply ignored amid the noise. We need better models… but with those come many challenges.
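To make the false-positive problem concrete, here is a minimal sketch (my own illustration, not something presented at the conference): even a perfectly predictable seasonal metric with a little noise defeats a naive static-threshold alert, and real production metrics are far messier than this. All names and numbers below are made up for the example.

```python
import math
import random

random.seed(42)

# Simulate one week of hourly samples: a clean daily sine cycle plus
# Gaussian noise, standing in for a "rhythmic" metric like request rate.
samples = [
    50 + 30 * math.sin(2 * math.pi * hour / 24) + random.gauss(0, 10)
    for hour in range(24 * 7)
]

# A single fixed threshold, as a naive alerting rule might use. The daily
# peak sits around 80, so noise alone pushes the metric over the line.
THRESHOLD = 85

alerts = [hour for hour, value in enumerate(samples) if value > THRESHOLD]
print(f"{len(alerts)} alerts fired out of {len(samples)} samples")
```

Every one of those alerts is noise around a healthy daily peak, which is exactly the kind of paging that trains people to ignore alerts; a model that knows the expected shape of the day would fire on none of them.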
By the end of the conference I had formed the opinion that alerts are good intentions paving the road to hell. Monitoring is not just about the CPU usage of servers. As monitoring grows, I hope next year’s event will include more talks from data scientists, usability experts and business people who can share experiences beyond the CPU gauge and the disk-space alert.
I am looking forward to next year and hope we can have some Monitoring<3.