Growing up, I wore a lot of hand-me-downs. I thought I looked great – all three of my older siblings were cool. It didn’t matter how many times I tripped on my brother’s torn JNCO jeans or how long my sister, six years older than me, laughed when I got my head stuck in the armhole of her WWF t-shirts. I wanted what they had because they were so influential in my life – even if I was never comfortable and nothing ever fit.
What happens then when we apply DevOps strategies innovated at companies like Amazon and Starbucks to our mid-size applications? Do they really fit?
In this talk, we’ll discuss how to approach DevOps at an average company – how much should we automate? If we can’t afford to stress test everything, how do we choose which pieces to test? Should our devs work on development tools to expedite their process in the future or spend time on the features we need to ship? Do we need to trace everything? When do we turn to DevOps tools like Chef and Puppet? How can we avoid getting our heads stuck in the armhole of a shirt that doesn’t fit? We’ll investigate how to answer these questions, and how to make the most out of others’ success while we learn how to be happy being average.
This is a story about a cryptojacking security incident involving an AWS development account belonging to one of CHT’s customers. I will discuss not only the incident and response, but also some of the ways to prevent this type of event from occurring, how to detect it, and the forensics data to look for.
It’s a personal story/lesson from the field and I would like to share it with people in the industry to help them avoid this pitfall.
About a year ago I was a part of an incident where one of our customers reported a security event involving one of their numerous AWS accounts used for development purposes. The AWS account in question already generated about $40k worth of EC2 compute charges when they discovered this breach. There were about 100 or so top-end, CPU-heavy Windows machines spread out across the world running at 100% CPU for over 2 weeks, all apparently mining bitcoin.
No CloudTrail audit configuration was enabled for this account so there was no audit data available to identify what happened and who did what.
Our Cloud governance platform captures the state of AWS accounts and configurations, and historical data for some of these settings. Our team spent a few days reconstructing the state of things and how events transpired by looking at our backup data on a timeline. It was clear from our data that the AWS credentials for one of their employee admin accounts had been compromised or leaked and were then used to spin up all these resources. It also appeared that this attack was entirely automated and only needed sufficient AWS credentials as an input. The attacker also covered their tracks and tried to “frame” another innocent user!
Luckily, the data we had cleared this user’s name and the customer received full AWS credits for the breach to cover the loss, but this was an important lesson for us and for our customer. This was an entirely preventable incident.
We saw a similar pattern/attack on our own infrastructure. The attempt failed due to some simple security measures we’ve taken. I’ll talk about some of the ways to prevent this event from occurring, how to detect it, and the forensics data to look for: things like enforcing MFA, using an external IdP, removing the need for the AWS key and secret key by leveraging roles and instance profiles to grant permissions, etc. I will also talk about the importance of having the CloudTrail audit feature enabled by default in AWS.
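As a minimal sketch of the “enforcing MFA” measure mentioned above, the following builds an IAM policy document that denies requests made without MFA. The function name and the list of exempt actions are illustrative assumptions, not the configuration used in this incident; `aws:MultiFactorAuthPresent` is the AWS condition key for this check.

```python
import json


def deny_without_mfa_policy(exempt_actions):
    """Return an IAM policy document (as a dict) that denies every action
    except `exempt_actions` when the request was not authenticated with MFA.

    `exempt_actions` is a hypothetical allow-list so that users can still
    enroll an MFA device before the deny applies to them.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAllExceptListedIfNoMFA",
                "Effect": "Deny",
                "NotAction": sorted(exempt_actions),
                "Resource": "*",
                # BoolIfExists also matches requests where the key is absent,
                # such as calls made with long-lived access keys.
                "Condition": {
                    "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}
                },
            }
        ],
    }


policy = deny_without_mfa_policy(
    ["iam:EnableMFADevice", "iam:ListMFADevices", "sts:GetSessionToken"]
)
print(json.dumps(policy, indent=2))
```

A policy like this would likely be attached to a group of human users; workloads on EC2 would use roles and instance profiles instead, so they never hold long-lived keys.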
Along with the rise of Data Science as the “Most Glamorous Job Of The 21st Century” came the realization that roughly 80% of a data scientist’s time was being spent collecting, cleaning, and storing the information they needed to do their job. The emerging role of the Data Engineer was created to offload that work onto a separate team and reduce duplication of effort and inefficient use of time. This separation between the management and reliability of an organization’s data and the exploration and interpretation of that information has led to the same tensions that exist between developers and technical operations which we have been working to ease for the past decade.
As practitioners and teams concerned with the engineering and science of data have gained experience and maturity they have also begun re-learning the same lessons that the DevOps transformation has been imparting to the teams concerned with creation and delivery of software. Along the way, data engineers have begun building more processes and tools around automation, testing, monitoring, and alerting of the systems that they are responsible for.
There is a lot to be learned in both directions as data becomes increasingly critical in any successful software system and more complex systems are required to manage all of the moving parts. I’m here to discuss areas where data engineers and operations teams overlap at the technical and social level, the types of tools that can be adopted to improve effectiveness in both directions, and how we can extend the impact of DevOps transformations to more units in the business organization.
By the time you leave you will have a better appreciation of what data engineers do, and of the many lessons we can teach them and they can teach us. This talk will also help data engineers identify their blind spots and how to address them.
Instead of talking “Free as in beer”, let’s talk “Craft as in beer.” Not too long ago the entire beer market was set to be two companies, Anheuser-Busch and Miller-Coors, but plucky little startup brewers have not only survived, but thrived. On a similar time-scale, most software was being developed in large development houses behind closed doors. Open Source was largely thought of as an unsustainable mistake. Now, small open source projects pop up every day and even successful companies are based entirely on Open Source Software. How is this possible?
The overlap here: Building Strong Communities.
What is the value of community?
How can we still practice sharing if we can’t share our code?
Where can we start building this community?
As infrastructure increases in complexity and monitoring increases in granularity, engineering teams can be notified about each and every hiccup in each and every server, container, or process. In this talk, I’ll be discussing how we can stay in tune with our systems without tuning out.
The ability to monitor infrastructure has exploded, with new tools on the market and new integrations that let those tools speak to one another. The result is even more tools and a potentially very loud monitoring environment, in which members of the engineering team find themselves muting channels, individual alerts, or even alert sources so they can focus long enough to complete other tasks. There has to be a better way: a way to configure comprehensive alerts that send out notifications with the appropriate level of urgency to the appropriate persons at the appropriate time. And in fact there is. During this talk I’ll walk through different alert patterns and discuss what we need to know, who needs to know it, and how soon and how often they need to know.
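The questions of who needs to know, how soon, and how often can be sketched as a small routing table. This is an illustrative sketch only: the severities, recipients, and intervals below are assumptions, not patterns taken from the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    recipient: str        # who needs to know
    notify_within_s: int  # how soon they need to know
    repeat_every_s: int   # how often to repeat (0 = notify once)


# Hypothetical routing table keyed by severity.
ROUTES = {
    "critical": Route("on-call pager", 60, 300),
    "warning": Route("team chat channel", 900, 3600),
    "info": Route("daily digest", 86400, 0),
}


def route_alert(severity: str) -> Route:
    """Pick a route; unknown severities fall back to the digest so they
    are still recorded without interrupting anyone."""
    return ROUTES.get(severity, ROUTES["info"])


def should_notify(route: Route, seconds_since_last: Optional[float]) -> bool:
    """Decide whether to send a notification for a still-firing alert."""
    if seconds_since_last is None:   # first notification for this alert
        return True
    if route.repeat_every_s == 0:    # notify-once routes never repeat
        return False
    return seconds_since_last >= route.repeat_every_s
```

Keeping urgency and repetition in one table makes it easier to review who gets interrupted and why, rather than leaving each person to mute alerts on their own.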
This talk discusses surveillance as a business model of the Internet, the value and the meaning of privacy, and how the idea of privacy has changed in the 21st century.
Are site reliability engineering and serverless a match made in heaven or a disaster waiting to happen? In this talk we will discuss just what this serverless thing is anyway, why it matters, and what it means for building reliable systems. We will also explore each of the SRE principles, such as embracing risk, service level objectives, monitoring, and others, and map them to their serverless counterparts to identify challenges and best practices. Finally, we will make a few predictions about our serverless future and what it means for software engineering and operations as a whole.
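To make “embracing risk” and service level objectives concrete, here is a minimal error-budget sketch. It assumes a request-based SLO over a fixed window; the function names and the numbers in the usage lines are illustrative, not from the talk.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of requests allowed to fail in the window under the SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, total_requests)
    return 1.0 - failed_requests / budget


# Hypothetical window: 1,000,000 invocations against a 99.9% SLO.
allowed = error_budget(0.999, 1_000_000)
remaining = budget_remaining(0.999, 1_000_000, 250)
print(f"allowed failures: {allowed:.0f}, budget remaining: {remaining:.0%}")
```

The same arithmetic applies whether the requests hit a server fleet or a function platform; what changes in serverless is which failures the team can actually control.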
As an organization with 25 production environments and over 6,000 containerized applications we have a lot of secrets to manage. Enough to run 12 geographically distributed HA Vault deployments across multiple cloud providers. Even though the Open Source community provides us with some really awesome tooling, managing that many Vault environments can be tricky.
This talk is about how we manage all those Vault servers and some of the unexpected bumps we’ve hit along the road to stability. We’ll cover standard outage events, bugs in Vault itself, dealing with unintended consequences of various authentication paradigms, and even migrating from one backend data store to another without needing to re-key.
During and after Hurricane Harvey, people from all over the world mobilized to help, including many technology professionals. This talk will discuss challenges and lessons learned from the chaotic first week of participating and organizing Houston’s tech response. We had to create flexible structures that allowed us to rapidly respond to changing needs, unpredictable volunteer capacity, and knowledge bottlenecks.
We created ways of rapidly disseminating information, even when our point people were stuck in meetings all day. We made on-the-fly decisions about who to add to teams and who to give write access to and needed to trust each other to make good decisions. Teams had to work together with people they had never met, and we needed ways to help people quickly on-board and take ownership of problems. By telling our story, we hope to share lessons that are applicable to leadership and structures beyond disaster situations.
Many presentations on Microservices offer a high-level view; rarely does one hear what it’s like to work in such an environment. Individual services are somewhat trivial to develop, but now you suddenly have countless others to track. You’ll become obsessed over how they communicate. You’ll have to start referring to the whole thing as “the Platform”. You will have to take on some considerable DevOps work and start learning about deployment pipelines, metrics, and logging.
Don’t panic. In this presentation we’ll discuss what we learned over the past four years by highlighting our mistakes. We’ll examine what a development lifecycle might look like for adding a new service, developing a feature, or fixing bugs. We’ll see how team communication is more important than one might realize. Most importantly, we’ll show how - while an individual service is simple - the infrastructure demands are now much more complicated: your organization will need to introduce and become increasingly dependent on various technologies, procedures, and tools - ranging from the ELK stack to Grafana to Kubernetes. Lastly, you’ll come away with the understanding that your resident SREs will become the most valued members of your team.
If you ask ten people to define DevOps, you’ll likely get a dozen different answers. Somehow, it’s 2018 and we still can’t agree on what DevOps is, only what it looks like. Many companies want all the benefits from DevOps without making any changes to their organization. The truth is that successful DevOps implementations require hard work over long periods of time.
DevOps at our company is a survival mechanism. We need to be lean and innovative or we’ll simply not exist as a business. When I say DevOps, I’m not talking about using Chef or deploying to AWS. I’m talking about optimizing early for the behaviors we wanted as an ops organization working with software developers. We focused on these efforts to get everyone working towards the same shared goals. We work to lower the risk of change through both the tools we created AND the culture we grew.
This is a story in 3 Acts. Each Act leads into the next one; the results compound the impact to the team.
How We Engineer for Rapid Change
Optimizing for Visibility
Building Accountable Engineers
In this session, you will learn how we turn ideas into reality, quickly and safely. We’ll go over how we design our telemetry system to support useful, actionable metrics and the steps we take to level up our engineers, giving them the ownership and accountability to own the applications they build. We’ll share what produced good results, what generated more trouble than it was worth, and what concrete ideas you can take back to improve how work gets done within your organization.
Many times during DevOps transformation, creating a technology architecture comes first. Then the uphill battle of closing skill and process gaps during adoption begins, which forces people to adjust to the technology. What if this process were reversed? This talk describes a real-world experience about how our Platform team created a distributed and non-optimized toolset architecture that allowed for teams with varying maturity to rapidly adopt and transform without that uphill battle.
When you talk to anyone who has been through a failed attempt at implementing DevOps, you will usually hear, resoundingly, that it was a “culture/people” aspect that hindered adoption. Since one of the tenets of DevOps is to constantly change, why should that not also include evolving the architecture?
The outline of the talk will include: the background of the company in which we architected this toolset; size, industry, VM count & geographic distribution; application architecture & stack (high level), business-as-usual software development process (high level); and the technical and cultural team challenges. Also: Platform team background; the approach we took in architecting; overall architecture; Custom UI, Jenkins, Artifactory, SVN; the Platform team ‘Day 2’ operations optimizations for managing the toolset, including script promotion process, developer ‘onboarding kit’, Jenkins server provisioning/monitoring/maintenance, and developer onboarding guidelines. Examples of different team experiences during onboarding will be provided, along with process and organizational changes we implemented to help facilitate adoption. Finally, we’ll review our results after 2 years (hint: we were successful), lessons learned, and key success criteria.
The audience will be able to identify with some, if not all, of the company characteristics and challenges discussed, and will be able to hear small, actionable steps they can implement in their own organization.
This talk was born, quite surprisingly, out of listening to Carrie Fisher talk about her struggle with addiction. What really resonated with me was when she talked about how she coped with her struggles by talking about them with others.
“It creates community when you talk about private things and you can find other people that have the same things. Otherwise, I don’t know - I felt very lonely with some of the issues that I had or history that I had. And when I shared about it, I found that others had it, too.” – Carrie Fisher on Fresh Air with Terry Gross, 28 November 2016
At that moment, I decided I had to talk about my struggle with Impostor Syndrome. As I reflect on my life, I’ve identified Impostor Syndrome as a central theme. Over the years I’ve been terrified to raise my hand in class, speak up in meetings, file bug reports, and — especially — contribute to open source projects. I was constantly afraid that I’d be seen as a tragically uninformed dolt not worthy of the position I’ve been put in. So, most often, I kept quiet.
Until recently, I had no idea what this feeling was and where it came from. I didn’t have a word for it until reading an article where Luke Kanies described these feelings and provided me with a name for it: Impostor Syndrome. Once I had a word for what I was feeling, I was finally able to understand it. I was able to identify ways to cope with it and learn about it and how other people handled it. This is my story of Impostor Syndrome and how I’ve coped with it and come to understand it as one of my biggest strengths; hopefully, it’ll help others do the same.
In 2018, most of us understand what burnout is and why it’s an occupational hazard for site reliability engineers. But while we’re aware of how toil can sneak up on us, or how losing sleep to pages can destroy our productivity, we aren’t always as cognizant that we’re all starting from different baselines. Finding space in SRE can be challenging for people who struggle to read documentation quickly, or to speak up during meetings, or even to have a face-to-face conversation at all. Even worse, well-meaning attempts at accommodation can stifle personal growth or become career-limiting.
But those of us with unusual deficits have gotten where we are by leveraging our strengths. The trauma we live with can also teach us coping skills, seeing the world a different way can lead to unique insights, and being anxious can lead to triple-checking configurations that everyone else only double-checks. The time is right to draw on lessons from the “mad pride” movement and learn to embrace difference on your team by providing tangible support without othering or being paternalistic, because one of the most ethical ways your organization can retain the best talent is by rejecting sanity as a requirement.
Unfortunately, rolling out changes for humans is not as easy as merging a pull request. How many times have you seen a new project management tool get rolled out, sometimes with much fanfare and polish, and just not get adopted? Have you ever seen your company announce a new business unit which led to a minor revolt? How scary is the word “reorganization”? If you have ideas on how your group’s processes could be better, want to launch a new tool that will work better as your company grows, or have to adjust the way you do things to meet new regulations, learning the basics of change management will help you to get your plans going, launch them effectively, and ensure they stick around.
I’ve helped implement several tools, projects, and process changes over the years. In this talk, I’ll walk you through the basics of organizational change management with specific examples about:
Why it’s so hard (Newton’s first law of motion; nobody likes surprises)
Points to consider while implementing (Am I sure I’ve identified all of the stakeholders? What do I do if I can’t satisfy all of their wants?)
Tactics to increase adoption (If you haven’t created evangelists, you’re probably creating enemies)
Keeping the change alive once launched (Can I keep showing improvements? Do I have materials that help newcomers?)
Pitfalls to avoid (Do you really have executive buy-in?)
It sounds simple to say that we will build one feature at a time, give it an API interface and allow it to connect with other features and microservices. The implementation is anything but simple. This talk explores how you can start migrating your existing features and services to a more modular, testable, and resilient system.
Since containers are not state-aware, how do you make changes to their presentation without needing to rebuild them entirely? With feature flags, your container can be stable and your presentation dynamic. How can you test a distributed architecture on your laptop? How can you simulate partial outages? This talk is going to touch on some of the best practices that you can use to bring new life to your brownfields.
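The feature-flag idea above can be sketched in a few lines: the code shipped in the container stays the same, and only externally supplied flag state changes what is presented. The flag name, the JSON format, and the banner text are illustrative assumptions; in practice the flag state might come from a mounted config file or a flag service.

```python
import json


def load_flags(raw_json: str) -> dict:
    """Parse flag state from JSON text supplied from outside the image."""
    flags = json.loads(raw_json)
    return {name: bool(value) for name, value in flags.items()}


def render_banner(flags: dict) -> str:
    """Choose the presentation at request time from the current flags."""
    if flags.get("new_banner", False):  # hypothetical flag name
        return "Welcome to the redesigned dashboard"
    return "Welcome"


# Same code, same image; only the flag state differs between these calls.
print(render_banner(load_flags('{"new_banner": false}')))
print(render_banner(load_flags('{"new_banner": true}')))
```

Defaulting a missing flag to off keeps the stable behavior in place if the flag source is unavailable.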
More than ever, blog posts, workshops, tutorial videos, and books are presenting quick and easy ways to get started with new tools and technologies. While these types of media provide an excellent jumping off point for engineers to learn and experiment with new subjects, they also set unrealistic expectations about the complexity of the topics involved. Unfortunately, it is becoming more common for non-technical stakeholders to cite these works as evidence that engineering work should be completed faster and deadlines should be shorter. As a result, technical work that delves deeper into solving the hard problems of production operations, stability and scaling is being devalued and misunderstood.
In this talk, we’ll explore the patterns we’ve developed in our communications as a tech community that have led to the conflict between “This tech makes things fast and easy!” and “These problems are hard; we should invest time in solving them correctly.” We’ll dive into some real-world examples of the frustrating impact that current perception has on engineering teams. And finally, we’ll formulate some concrete takeaways to improve the way we present, share and frame technical topics across diverse technical and non-technical audiences.
Modern cloud applications today are built as distributed microservices. These microservices talk to each other over L7 protocols: HTTP, gRPC, Redis, Kafka, and more. In this world, L7 proxies have assumed a crucial role in managing and observing L7 protocols. In this talk, I’ll discuss the evolution of service architectures, the role L7 proxies play in this world, and how there is now a battle raging between Envoy Proxy, HAProxy, and NGINX. I’ll wrap by talking about why we chose Envoy Proxy as the anchor of our Ambassador API Gateway and show how that has enabled a number of new capabilities.
As we continue to work on improving our technology processes, there is much we can learn from the discipline of product management. I believe that by applying the techniques and approach of treating your infrastructure and service offerings as products, we can provide a more delightful experience and continuously improve.
Using principles and concepts from people like Marty Cagan and other experts in the product space, I will demonstrate examples of applying the concepts of feedback, planning, and iterative improvement to practices including software delivery process, service desk response, and infrastructure as code.
You have some challenges after attrition, team re-orgs (agile or top-down), and de-prioritized cross-training have left engineers frantically searching email archives for “fixed content deployment” or staring blankly as git blame lists only their most talented former colleagues. If only there were a structured way to have ensured these services lived on.
You have to deliver a new service that talks to vendors around the world, and you know we’re using AWS now, but you’re not sure what a secure network looks like, and you’ve only got a week to get it live, so you leave it open for now and hope you’ll get to go back around and fix it when you’ve got time. If only there were a structured way to peer with an engineer who could have provided that information in context when it was cheaper and less risky.
There is a structured way to have a centralized team of sustaining engineers tasked with preventing these and similar problems. Think Brownfields-focused SREs.
Containers are becoming the standard practice to build, manage, and deploy software, especially as cloud adoption continues to accelerate. To help with this shift, a variety of services and tools have been created, such as Kubernetes, Docker Swarm, and Nomad, in addition to cloud services such as Amazon’s Elastic Container Service (ECS) and Google’s Kubernetes Engine (GKE). While these services and tools help with the orchestration and deployment of containers, the underlying infrastructure management is often still left up to the user and can add significant overhead. This talk will go over the pros and cons of the various options you have to manage this infrastructure as well as go through a few live examples. Hopefully you will leave this talk with a better understanding of the complexities involved with provisioning, scaling, and maintaining infrastructure for large scale container deployments and possible solutions. Some of the topics covered will include:
Self-managed Kubernetes on the cloud
Amazon’s ECS and Fargate
Software expertise is a vital component of a software project’s success. Software products are created by people – their skills and talents determine the result to a large extent. However, people can’t really be attached to a project forever; they change work places and companies, taking their knowledge with them. It’s a well-known fact that subject domain experts are important and dangerous at the same time. In this presentation I will share my experience of dealing with highly qualified programmers and transferring their knowledge to project artifacts. I will present a concept of experts-free software development, which is practiced in our distributed teams.
Containers and Kubernetes are becoming the de-facto standard for software distribution, management, and operations. But deploying and managing Kubernetes requires significant in-house time and expertise.
For those considering embarking on a containerization and Kubernetes strategy, I will outline important considerations such as the benefits of a centralized Kubernetes operations layer based on open-source components giving IT teams more control and understanding. I’ll also dive into the unique challenges to consider in any Kubernetes deployment, such as the need for centralized monitoring, identity and access management, backup and disaster recovery, infrastructure management, security, and reliable cluster self-coordination and self-healing.
Attendees will leave with an understanding of how to ensure successful container and Kubernetes “Day 2” operations.
Users care as much about how your service performs as they do about what job it helps them accomplish. This is the modern performance imperative. So why do we still see so many 503 errors and slow apps? The answer is empathy, or rather a lack thereof. Performance is the most tangible element of “non-functional” quality criteria we regularly ignore until it’s too late. The cloud doesn’t save us magically either. “Building quality in” starts with a mindset guided by technical experience, team alignment, and empathy for your customers.
The automated delivery process is the substrate we all work in now, and the viscosity of our app/configuration/deployment code often dictates our rate of flow. Reducing these friction points with fast and complete feedback loops helps us improve the pace of delivery. In this presentation, we’ll uncover:
how to include performance criteria in stories and features, and improve them
key performance indicators for cloud deployments and high-availability services
progressive performance testing strategies to radiate early feedback in our pipelines
advanced scenarios of dynamically provisioning test environments
Hopefully, you’ll walk away with some cool new techniques and questions to ask your own team, but the goal is to root our work in empathy in order to reset our vision of how to deliver high-performance systems.
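One way to picture performance criteria feeding early feedback into a pipeline is a simple latency gate. This is a sketch under stated assumptions: it uses the nearest-rank percentile method, and the 95th percentile, the 200 ms budget, and the sample values are hypothetical rather than taken from the talk.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def passes_latency_budget(samples_ms, budget_ms, pct=95):
    """Return True when the chosen percentile is within the budget."""
    return percentile(samples_ms, pct) <= budget_ms


# Hypothetical response times (ms) from a short smoke-level load test.
samples = [120, 130, 110, 900]
print(passes_latency_budget(samples, budget_ms=200))
```

A gate like this could run on a small sample early in the pipeline and on larger, longer tests in later stages, which is one reading of “progressive” testing.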