Special Broadcast 5: Interesting Stories About DevOps Organizations and Culture

Hello, I’m Shi Xuefeng, and it’s time for another special broadcast. As we approach the end of this column, I want to talk to you once again about DevOps organization and culture.

DevOps culture looks like a contradiction. On the one hand, culture seems to be something you sense rather than something you can put into words; on the other hand, its importance to DevOps practice is undeniable.

At industry conferences, talks about culture are always scarce. The reason is simple: culture is hard to explain, and even when it can be explained, explaining it changes little. Changing a culture is not as simple as introducing a tool; it usually requires a change in mindset.

Speaking of DevOps culture, I recall an event last year when a few friends and I organized a book-breakdown activity for “The DevOps Handbook.” Over several weeks of online sharing sessions, we summarized and distilled the core knowledge of the book.

During the sharing process, there was one thing that left a deep impression on me. It started with a passage in the 14th chapter of the original book:

The team has nothing to hide from the customers, just as they have nothing to hide from themselves. Instead of treating problems that affect the online system as a secret, it is better to make them as transparent as possible and proactively announce internal issues to external users.

The IT manager of a large company happened to be responsible for sharing this chapter. He said that he kept this passage out of respect for the original text, but that it is not practical in China. Even he, a staunch practitioner of DevOps, finds this level hard to reach, because if a company openly disclosed all of its internal issues to customers, the people involved would probably be packing up and going home the next day.

Precisely because companies generally do not announce failures right away, these incidents mostly come to light through outlets like “Cloud Headlines.”

However, it seems that everyone has a poor memory. Often, these things are quickly forgotten, and apart from hearing rumors about who is taking the blame or being implicated, there is nothing particularly noteworthy.

This is understandable. After all, dirty laundry should not be aired in public. It’s fine to vent within the company, but if everything were announced publicly, wouldn’t the company’s image be ruined? It could also shake users’ confidence in the company. Just think: if you always appear to be the one with the most problems, who would dare to use your service?

We all know the keywords of DevOps culture: collaboration, shared responsibility, blamelessness, learning from mistakes, and so on. Everyone understands these principles. Whether we still follow them when a real problem arises and the interests of different departments have to be balanced is another matter.

In plain terms, if you want to see if a team has a DevOps culture, it’s more important to look at what they do rather than what they say. So today, I’ll share a few stories with you to see how other companies have dealt with the same problems, and to reflect on why this is a better approach.

The Story of GitLab’s Database Deletion #

Let’s go back to January 31, 2017, when GitLab, one of the largest code hosting and collaboration platforms in the world, experienced an 18-hour outage due to a surprising incident: an IT engineer accidentally deleted the data from the production database.

At the time, GitLab was facing a spider attack, and replication between the primary and backup databases fell so far behind that it exceeded what the write-ahead log (WAL) could cover. As a result, the backup database could no longer catch up. The typical fix for this problem is to remove all data from the backup database and trigger a new full synchronization. Unfortunately, because of a series of database configuration issues, such as the limit on concurrent connections, the resynchronization kept failing.

By then, it was already 11:30 PM UTC, which was past midnight for the on-duty engineer in the Netherlands. He suspected that the failed synchronization was caused by residual data from previous attempts, so he decided to manually clear the data on the backup server once more.

However, perhaps due to negligence, he didn’t realize that he was operating on the production database. Just a few seconds later, when he came to his senses and tried to cancel the operation, it was too late. The end result was the loss of over 300 GB of live data, directly leading to the service entering recovery mode.
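One common safeguard against this kind of mix-up is to make a destructive script check which machine it is running on before it does anything. The sketch below is only an illustration and not what GitLab deployed; the function name `confirm_destructive_action`, the exception `WrongHostError`, and the host names are all hypothetical.

```python
import socket


class WrongHostError(RuntimeError):
    """Raised when a destructive action is attempted on an unexpected host."""


def confirm_destructive_action(expected_host, current_host=None, typed_confirmation=""):
    """Allow a destructive action only on the intended host.

    expected_host: the host the operator believes they are on (hypothetical name).
    current_host: overrides the detected host name, mainly for testing.
    typed_confirmation: the operator must retype the host name to proceed.
    """
    current = current_host if current_host is not None else socket.gethostname()
    if current != expected_host:
        raise WrongHostError(
            f"refusing to run: this is {current!r}, expected {expected_host!r}"
        )
    if typed_confirmation != expected_host:
        raise WrongHostError("refusing to run: confirmation does not match the host name")
    return True
```

A cleanup script would call this function first and only delete data if it returns without raising. Forcing the operator to retype the host name adds a deliberate pause, which matters most when someone is working late and tired.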

To be fair, although such incidents are hard to accept, they are not uncommon. What made this one far more serious is that when GitLab tried to recover the data, they discovered that none of their supposedly “meticulously designed” backup mechanisms could restore what had been deleted.

Even more astonishingly, they only realized at that moment that due to a mismatch in tool versions after an upgrade, the scheduled database backups had been continuously failing. They had assumed that emails would alert them to this problem, but coincidentally, the alerts for the automated tasks did not work either.
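A backup job that fails silently can be caught by inspecting the backup output itself, rather than trusting the job’s own alerts. The following is a minimal sketch under stated assumptions: backups are written as files into a single directory, and the thresholds and the function name `check_latest_backup` are hypothetical. It does not describe GitLab’s actual setup.

```python
import time
from pathlib import Path


def check_latest_backup(backup_dir, max_age_seconds=24 * 3600, min_size_bytes=1, now=None):
    """Return a list of problems with the newest backup file.

    An empty list means the newest file is recent enough and large enough.
    `now` can be passed in (seconds since the epoch) to make the check testable.
    """
    directory = Path(backup_dir)
    if not directory.is_dir():
        return ["backup directory does not exist"]

    files = [p for p in directory.iterdir() if p.is_file()]
    if not files:
        return ["no backup files found"]

    now = time.time() if now is None else now
    newest = max(files, key=lambda p: p.stat().st_mtime)
    stat = newest.stat()

    problems = []
    if now - stat.st_mtime > max_age_seconds:
        problems.append(f"{newest.name} is older than {max_age_seconds} seconds")
    if stat.st_size < min_size_bytes:
        problems.append(f"{newest.name} is smaller than {min_size_bytes} bytes")
    return problems
```

The point of the design is independence: the check runs separately from the backup job and reports through a different channel, so one broken alert path cannot hide both failures. A file that exists and is fresh still does not prove that a restore works, so a periodic restore drill remains necessary.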

At this point, they had two choices: hide the truth and give the outside world a vague, harmless-sounding explanation, or disclose the problem fully, down to every detail. Which would you choose?

GitLab chose the latter. They immediately took the system offline and recorded all the details and analysis of the incident in a publicly accessible Google Doc. Moreover, they live-streamed the entire recovery process on YouTube, the world’s largest video sharing platform.

Considering that some users might not watch YouTube, they also posted status updates on Twitter, effectively turning a mishap into a hot topic. At its peak, more than 5,000 viewers were watching the live stream at the same time, and it briefly reached second place on the trending charts.

In addition, a few days later, the CEO personally provided a 4,000-word postmortem document, including the background of the incident, a timeline, core cause analysis, explanations of each backup mechanism, and nearly 20 follow-up improvement measures. This transparent approach gained the trust and recognition of their users. It can be said that in this aspect, they truly achieved transparency, openness, and honesty to the extreme.

Postmortem document: https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

As for the unfortunate engineer’s fate, you might have heard about it: he was made to watch several tens of minutes of the “Rainbow Cat” (Nyan Cat) animation. If such an incident happened to one of us, we would probably be fired outright. I know you’re probably curious about what this animation is; I was curious too and looked it up. To be honest, it is a bit boring.

Afterwards, GitLab became even more open. You can now check the live status of their services at any time, including analyses of past incidents. A Twitter account named “GitLab Status” posts real-time updates on current problems, with the goal of exposing issues proactively before users notice them. To date, it has published nearly 6,000 updates. You can also view detailed monitoring dashboards and data for the GitLab service, as well as GitLab’s operations manual and backup scripts. All of these are open to the public. You can use them directly if you want, and if you think something is unreliable, you can submit changes to them. Here are some links for your reference.

  1. GitLab Status Twitter: https://twitter.com/gitlabstatus
  2. GitLab Status Website: https://status.gitlab.com/
  3. GitLab Internal Monitoring Dashboard: https://dashboards.gitlab.com/

GitLab is not going crazy. In fact, openness has become the standard practice for mainstream companies. For example, on GitHub, you can also see similar information.

With that, the story can come to an end. To a large extent, a company’s attitude towards incidents reflects its culture.

First, it’s about learning from mistakes.

GitLab’s incident report not only describes the issue itself but also aims to share their experience, especially during the recovery process, to accumulate knowledge from mistakes, improve existing processes and tools, and thus completely avoid similar problems.

Everyone and every company makes mistakes. The attitude towards errors, and how seriously they are taken, determines how much a company grows. So if I were interviewing at a company and the interviewer asked whether I had any questions, what I would care about most is the company’s attitude towards mistakes and the specific actions it takes.

Another aspect is building trust and providing timely feedback through openness and transparency. This applies not only to external users but also to the departments and organizations involved in internal collaboration. Only with full transparency can trust be earned and issues be discussed openly; otherwise, building a culture of collaboration and shared responsibility is just empty talk.

When starting to build a DevOps culture, you need to first understand whether the information needed by upstream and downstream can be autonomously and easily accessed at any time. If not, this is a good potential area for improvement.

The Story of Etsy’s Sweater with Three Sleeves #

Etsy is an American handmade crafts e-commerce platform. Since its IPO in 2015, its market value has reached nearly 8 billion dollars. Apart from its rapidly growing market value, it is most widely recognized for its DevOps capabilities, with many of its case studies featured in the book “The DevOps Handbook.”

So, why is this relatively unknown company able to achieve such success? In fact, through a small incident, we can see the reason.

What you may not know is that the most frequently visited single page of an online e-commerce company is not the homepage or any product page, but the site’s error page, commonly referred to as the “502 page.” Some companies even take advantage of this traffic by polishing the user experience of the 502 page and using it to promote products.

When Etsy’s website is unavailable, you will see an image of a girl knitting a sweater, and this sweater strangely has three sleeves.

In fact, the “sweater with three sleeves” symbolizes Etsy’s attitude towards mistakes. We all know that a sweater should have only two sleeves; that is common sense. If someone actually knits a third sleeve, our first reaction is that it is ridiculous and just a personal mistake. We rarely ask why they did something against common sense, or what the underlying reasons were.

But that’s not the case for Etsy. At the annual year-end summary meeting, the company presents various awards, and one of the prizes is the “sweater with three sleeves,” awarded to the individual responsible for the biggest problem introduced that year.

This is because, in their view, making mistakes is not a big deal. Mistakes are not the personal problem of an individual, but rather problems with the company’s systems and processes. It is precisely because of these mistakes that the company has room for improvement and growth. In a way, it is a contribution.

Of course, in addition to creating buzz, this practice also expresses the kind of culture the company values: psychological safety, rapid change, timely feedback, and encouragement of innovation, all of which lift the morale and effectiveness of the entire team.

Coincidentally, the 2019 State of DevOps Report also specifically pointed out that a culture of psychological safety contributes to team productivity. More importantly, the report treats it as an important capability and includes it in the DevOps capability model.

Only when employees feel psychologically safe can they focus on solving problems and completing tasks quickly, instead of spending a lot of time attacking each other and playing departmental politics. When another department asks for cooperation, they think about how to maximize the organization’s value, rather than thinking “someone is coming to move my cheese, so how can I raise the barriers to protect my own interests?” This is critical for DevOps, a development model that emphasizes collaboration.

The Story of Netflix Hiring Only Adults #

Silicon Valley in the United States is home to most of the world’s elite IT companies. Among the elites, the top tier is known as FAANG, an acronym for Facebook, Apple, Amazon, Netflix, and Google. These five companies essentially set the trend for technology in Silicon Valley. While most people are familiar with the other four companies, they know very little about Netflix. So, what makes this company one of the elite?

Imagine this: at Netflix, every engineer not only receives a top-tier salary, but also has the freedom to decide when and for how long they want to take vacations. Expense reimbursements do not require approval; employees submit their expenses and receive reimbursement accordingly. Furthermore, even if someone joins the company and leaves after just one day, they are provided with enough compensation to sustain themselves for a year or more.

At this point, you may be thinking that the boss of this company must be insane.

Well, the boss, Reed Hastings, is not insane at all. Everything I just mentioned is documented in a book on Netflix’s culture, whose author also wrote the “Netflix Culture Deck,” a document that has been called the most important to come out of Silicon Valley.

Netflix believes that instead of establishing various processes to constrain employees, it is better to eliminate unnecessary processes and give employees the space to create value freely. The company treats every employee as an adult and expects them to take responsibility for their actions and contribute to the company’s development. Treated this way, employees make the best choices they can and put in their greatest effort.

It is this open atmosphere that has led Netflix to open-source 171 projects and plugins to date. Among them, Chaos Monkey (the tool that pioneered chaos engineering), the circuit breaker library Hystrix, the service registry Eureka, and the deployment tool Spinnaker are all well-known open-source tools in the DevOps field.

This spirit of open-source sharing is becoming one of the forces driving more and more companies to take open source seriously. Truly exceptional people deliver the most value when they are allowed to do valuable work, rather than being held back by complex processes, internal politics, and meaningless tasks. The same applies to DevOps.

Summary #

With that, the three stories have been told. Let’s summarize the key aspects of DevOps culture they illustrate:

  1. Establish a blameless culture and learn from mistakes.
  2. Foster trust and collaboration through openness and transparency.
  3. Create a psychologically safe environment that encourages innovation.
  4. Embrace the spirit of open source and prioritize sharing.

Changing a corporate culture is not something one person or one statement can accomplish. The support and guidance of management are crucial. However, we should not expect every company to become a Silicon Valley giant like the FAANG companies. So let’s start with ourselves and with what lies within our own sphere of influence. Most important of all, don’t assume that culture has nothing to do with you.

Lastly, after finishing this lecture, I hope you will reevaluate your own team. Has it established a constructive mechanism for reviewing errors after the fact? Are you encouraging internal sharing and innovation? Are you prioritizing openness and collaboration with both upstream and downstream partners? Are you actively reducing duplicated work through your own practice?

Reflection Questions #

What impressed you the most about today’s content? What DevOps culture-related stories can you share?

Feel free to write your thoughts and answers in the comments section. Let’s discuss and learn together. If you found this article helpful, you’re welcome to share it with your friends.