Video recording and production done by DevOpsDays.
There’s been much discussion at [my company] about how valuable post mortems are in understanding the circumstances that led to an outage or event. How to conduct and memorialize a post mortem, however, remained unsystematic.
My talk will be about why another engineer and I built an internal post mortem tool called Morgue, and the effect the tool has had on our organization. Morgue formalized and systematized the way [my company] as a whole runs post mortems by focusing both the leader and the attendees of the post mortem on the most important aspects of resolving and understanding the event in a consistent way. In addition, the tool has improved relations between Ops and Engineering by increasing awareness of Ops' involvement in an outage and by making all of the post mortems easily available to anyone in the organization. Lastly, all of our developers have access to the Morgue repository and have continued to develop features for the tool as improvements for conducting a post mortem have been suggested.
A high-level talk about what is weak about the way we currently run infrastructure, how to think in systems, and how to deal with scale and complexity using real scientific methods (instead of guesswork). And finally, why "APIs will be our undoing" and why devops is crucial to the success of infrastructure in the future.
Trading firms are not about high fives, flashy suits and Maseratis. Behind the scenes, the technology that powers these firms walks a delicate line, balancing regulatory risk against the need for rapid technology response when market conditions change. DevOps is a natural fit for these organizations; getting it right, however, is an entirely different story.
This talk will describe an initiative I led to implement a DevOps capability for a large online broker, focusing on the challenges we faced and the (painful) lessons we learnt.
I’ll discuss the conditions that made it possible, talk through the cultural challenges, outline the technology issues, and focus on what worked and what didn’t.
The Business: Why DevOps was the answer.
The Project: Funding a DevOps initiative using "risk".
Team Culture: How we assembled an amazing team from different silos in IT and unknowingly created a nightmare.
Company Culture: Communication challenges between siloed teams.
The Processes: Working with and around existing processes.
The Technology: Working with technical debt in "legacy" mission critical environments.
Progress: Early warning signs we missed and how we dealt with timeframes.
Lessons: The good, the bad and the ugly.
The importance and impact of Open Source Software can hardly be overstated: it has spread into most companies and organizations worldwide.
This talk will focus on how to contribute to someone else's project, as well as some ideas on how to set up your own project and accept other people's contributions with as little friction as possible.
Covering the social aspects of interaction with unknown people, as well as knowing when to pass the torch, this talk will focus on some of the good and bad things about the OSS culture, and how we may improve them.
In this talk we will show how our team is learning about and using continuous improvement, Toyota-style, to help us reduce nagging technical debt. We will share our successes and failures, as well as give you a taste of what focused continuous improvement and the scientific method can help a DevOps team get done.
We'll show how our team was consumed by never-ending emergencies and interruptions. We also had drive-by project management and release troubles galore. Learn how we are freeing up half of our team's time (yes, 50%!) to tackle this critical work while still getting our important tasks and projects done.
We'll show you the basics of our continuous improvement process. It's light and simple, so we can walk you through our daily kata. We will explain the details of and around our technical debt projects. We will illustrate our initial obstacles and some of the experiments we used to slay them! We will paint a vivid picture of what it felt like to use science every day instead of jumping to conclusions. We will share one key difference between a Cowboy and a Scientist, and why we chose to become Scientists. We will discuss improving the performance of a constraint à la "The Phoenix Project" and "The Goal" in real life.
Monitoring all the things is good, but can lead to information overload. In a few years I think we'll have great tools for doing things like AI and big-data mining on metrics, and ideas like Etsy's Oculus and Skyline will be more broadly used, but realistically it takes a while for improvements to hit mainstream tools and get adopted. Assuming that most of us are currently using something not much different from Nagios + Graphite, what can we do Right Now, with very little effort, to improve the signal-to-noise ratio?
I'd argue that changing which metrics we pay the most attention to is a great place to start. That's because some metrics are more important than others. We have systems for a reason: to do work for us. Metrics that relate directly to that work are the most meaningful. A close second is metrics that indicate the availability of resources, such as free disk space -- in my research full disks have been the single biggest cause of system downtime.
Other metrics may seem to be just as revealing, but when you look at how they're actually generated in systems, you'll usually find that they're effects of the work, and therefore they're secondary signals, which don't strengthen the primary signals much, if at all. So by focusing on a handful of metrics, I believe we can significantly reduce the monitoring surface area and improve time to insight. Of course, I think we should keep all the other metrics too -- we just shouldn't drive ourselves nutty trying to keep an active eye on them all the time.
In this talk I'll explain what the workload- and resource-related metrics are and how I suggest keeping an eye on them.
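To make the idea concrete, here is a minimal sketch of the kind of filtering the abstract describes: separating work- and resource-availability metrics (the primary signals) from effects-of-the-work metrics (secondary signals) so only the former go on the always-watched dashboard. All metric names and the pattern list are hypothetical, not drawn from any real monitoring setup.

```python
# Hypothetical name prefixes for primary signals: metrics tied directly to
# the work the system does (requests, errors, latency) or to resource
# availability (free disk, free memory).
PRIMARY_PREFIXES = ("requests.", "errors.", "latency.", "disk.free", "mem.free")

def is_primary(metric_name):
    """True if the metric is a work- or resource-availability signal."""
    return metric_name.startswith(PRIMARY_PREFIXES)

def dashboard_metrics(all_metrics):
    """Keep only primary metrics for active watching; secondary metrics
    stay stored for later investigation but are not on the dashboard."""
    return [m for m in all_metrics if is_primary(m)]

metrics = [
    "requests.per_second",   # work being done: primary
    "latency.p99",           # quality of the work: primary
    "disk.free.percent",     # resource availability: primary
    "cpu.context_switches",  # an effect of the work: secondary
    "tcp.retransmits",       # an effect of the work: secondary
]
print(dashboard_metrics(metrics))
```

The point of the sketch is the classification rule, not the mechanism: any alerting stack that can tag or filter metrics could apply the same primary/secondary split.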