I'm pleased to announce that I've finished the first phase of implementing a system to monitor the status of our deployed IT services and systems. The system builds heavily on initial work done by Shaun Beavan at HQ. To determine the health of our services, it currently monitors the online status of over 50 devices and measures over 300 different system and service metrics. I plan to add more devices and metrics in the near future.
As part of the first phase of the project, I've put together a public status dashboard on our IT website that provides a high-level overview of the state of our online services.
The dashboard is updated every 5 minutes with the latest data from the monitoring system.
The second phase of the project is to automatically create helpdesk tickets when a device goes offline or a metric enters a critical state. Ideally, creating tickets automatically on service disruption will let IT respond before anyone notices the issue. This second phase should be completed by the end of the month, once we've fully tested the system for potential false-positive notifications.
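For anyone curious how false positives can be kept down, here is a minimal Python sketch of one common approach: only open a ticket after a check has failed several times in a row, and only once per outage. This is an illustration under my own assumptions, not the logic the monitoring system actually uses; the class name `TicketTrigger`, the `failures_required` threshold of 3, and the check name `router-1` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TicketTrigger:
    """Decides when a failing check should open a helpdesk ticket.

    Hypothetical sketch: the threshold and names are assumptions,
    not settings taken from the real monitoring system.
    """

    failures_required: int = 3  # assumed number of consecutive failures
    _streak: dict = field(default_factory=dict)  # check id -> failures in a row
    _open: set = field(default_factory=set)      # checks with a ticket already open

    def observe(self, check_id: str, healthy: bool) -> bool:
        """Record one poll result; return True if a new ticket should be opened."""
        if healthy:
            # Recovery resets the streak and allows a future outage to open a new ticket.
            self._streak[check_id] = 0
            self._open.discard(check_id)
            return False

        self._streak[check_id] = self._streak.get(check_id, 0) + 1
        if self._streak[check_id] >= self.failures_required and check_id not in self._open:
            self._open.add(check_id)
            return True
        return False


if __name__ == "__main__":
    trigger = TicketTrigger(failures_required=3)
    # A single blip (two failures, then a recovery) should not open a ticket,
    # while a sustained outage should open exactly one.
    polls = [False, False, True, False, False, False, False]
    for healthy in polls:
        print(trigger.observe("router-1", healthy))
```

With a 5-minute polling interval, a threshold of 3 would delay a ticket by roughly 10 to 15 minutes after the first failed check, so the threshold is a trade-off between noise and response time.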