Videos provided by OpenStack Summit via OpenStack Foundation YouTube Channel
Operating a large OpenStack installation can be a daunting task, but over time you'll find that the key metrics, problems, and warning signs you probably spend your own time and energy watching for aren't at all difficult to train a machine to watch for instead, especially if you want that machine to become more accurate and more reliable over time in the flags it raises.
During the Hong Kong summit, we examined how a flow of logs, telemetry, and other performance data can be piped through various systems, and some possible open-source projects that can help.
In this session, we will explore in much more technical detail a setup in which logs and metrics are centralized, combined along the same timeline, and presented intuitively for simpler visual inspection, as well as how to employ machine learning techniques to train your log collectors or other monitoring machines to look for patterns in the data. Having such tools in your environment will help prevent a barrage of alerts that lack credibility, help identify early warning signs, and predict potential problems before they cause a degradation in performance or, even worse, a service outage that interrupts your TV dinner.
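To make the idea of a shared timeline concrete, here is a minimal sketch, not the setup presented in the session. It assumes events arrive as time-sorted `(timestamp, kind, payload)` tuples, and it uses a rolling z-score as a simple statistical stand-in for the machine learning techniques mentioned above. The function names `merge_timeline` and `rolling_zscore_flags` and the default window and threshold are hypothetical.

```python
import heapq
import statistics
from collections import deque


def merge_timeline(logs, metrics):
    """Interleave two time-sorted streams of (timestamp, kind, payload) tuples.

    Both inputs are assumed to be sorted by timestamp already.
    """
    return list(heapq.merge(logs, metrics, key=lambda event: event[0]))


def rolling_zscore_flags(values, window=5, threshold=3.0):
    """Flag values that deviate strongly from the preceding `window` values.

    A value is flagged when it lies more than `threshold` standard
    deviations from the mean of the previous `window` samples. This is a
    simple baseline, not a trained model.
    """
    history = deque(maxlen=window)
    flags = []
    for value in values:
        if len(history) == window:
            mean = statistics.fmean(history)
            stdev = statistics.pstdev(history)
            # A flat history (stdev == 0) gives no scale, so nothing is flagged.
            flags.append(stdev > 0 and abs(value - mean) / stdev > threshold)
        else:
            flags.append(False)  # not enough history yet
        history.append(value)
    return flags


logs = [(1.0, "log", "nova-api restarted"), (3.0, "log", "timeout talking to db")]
metrics = [(2.0, "metric", 0.42), (4.0, "metric", 0.97)]
timeline = merge_timeline(logs, metrics)
flags = rolling_zscore_flags([10, 11, 10, 11, 10, 50])
```

Merging by timestamp lets a log line and a metric spike that happen close together appear side by side, which is what makes visual inspection simpler.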
These patterns need not be used only for simple alerting, as is traditionally the case; they are better used to take proactive, predefined actions, watch for recoveries, and send alerts only as needed.
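The act-first, alert-later flow described here can be sketched as follows. This is an illustrative outline under stated assumptions, not the speakers' implementation: `handle_anomaly` and its callbacks (`remediate`, `is_healthy`, `send_alert`, `wait`) are hypothetical names, and the retry count and interval are arbitrary.

```python
import time


def handle_anomaly(remediate, is_healthy, send_alert,
                   max_checks=3, interval=1.0, wait=time.sleep):
    """Run a predefined action, watch for recovery, and alert only on failure.

    All callbacks are supplied by the caller; nothing here is tied to a
    particular monitoring system.
    """
    remediate()  # proactive, predefined action (e.g. restart a service)
    for _ in range(max_checks):
        if is_healthy():
            return "recovered"  # recovery observed, so no alert is sent
        wait(interval)
    send_alert("remediation did not restore service")
    return "alerted"


# Usage with stand-in callbacks: the fix works, so no alert is raised.
state = {"healthy": False}
alerts = []
result = handle_anomaly(
    remediate=lambda: state.update(healthy=True),
    is_healthy=lambda: state["healthy"],
    send_alert=alerts.append,
    wait=lambda seconds: None,  # skip real sleeping in this example
)
```

Keeping the alert as the last step means a human is only interrupted when the automated action has already been tried and has not worked.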
Previous experience with machine learning is not necessary, but you should be slightly familiar with the inherent challenges in monitoring, alerting, and analyzing persistent event flows from distributed systems.