39 The Way to Deal With Legacy Systems

39 What to do when facing legacy systems #

Hello, I’m Zheng Ye.

In the previous talk, I used the scenario of “joining a new company” to show how to apply what we have learned to a specific situation. In this talk, let’s pick another typical work scenario: facing a legacy system.

In “34 | How did your code become a mess?,” I talked about how code decays over time, whether or not anyone intends it to. Even in the best case, where the code is well designed and carefully maintained, technology keeps advancing, so the system will still need to be gradually upgraded and replaced.

For example, we used to think that telecommunications was a unique field, completely independent of IT, and that mastering CT (Communication Technology) was enough to be worry-free. But as IT has continued to develop, the telecommunications field has also started to break down barriers and embrace IT, giving rise to the concept of ICT (Information and Communications Technology).

Therefore, system upgrades and transformations are inevitable, no matter what. The problem is, if you are not willing to maintain the code you wrote three months ago, then when facing complex legacy systems, what should you do?

Many people’s immediate reaction is to rewrite the system. Rewriting without thinking is like buying a lottery ticket. It requires good luck to write it well, but most people are not that lucky, and we can’t always rely on winning the lottery to change our lives. So, is there a slightly more reliable path?

Distinguishing Symptoms from Root Causes #

When facing a massive legacy system, we can once again return to the thinking framework to find a direction.

  • Where are we?
  • Where are we going?
  • How can we get there?

The first question: When facing a legacy system, what is our current situation?

In the earlier part of this column, we mostly discussed how to answer questions about goals and implementation paths. We paid less attention to the “current situation.” This is because in most cases, the current situation is quite obvious. But this time it’s different. You might say, what’s so different? It’s just a legacy system with bad code, so fix it quickly. But please wait!

Are the legacy system and the bad code actually the problem? In fact, they are not. They are only symptoms, not root causes.

Before making any changes, we need to analyze the problem and find its root causes. For example, if a requirement that looks like two days of work actually takes two weeks or longer, the root cause might be excessive code coupling, where one change touches too many places. Another example is when performance optimization hits a bottleneck and, no matter what you change, performance does not improve. The root cause might be a flawed architectural design.

Therefore, it is best to gather the team together and answer the first question together: What is the current situation like? Do you remember the retro I mentioned in “25 | Problems Repeatedly Occur During Development, What Should You Do?”? This is a good method to let the team collectively confirm what the current situation is like and find the root causes.

Why is it necessary to do this analysis first instead of just rewriting everything? Because if you don’t do a root cause analysis, you will have a hard time determining where the problem lies, and more importantly, you won’t be able to judge whether rewriting can truly solve the problem.

If it is an architectural issue, simply adjusting the model won’t solve the problem. Similarly, if the model is not clear, optimizing the architecture will also be a waste of time. Therefore, we must find the root cause of the problem to prevent ourselves from going down the same old path again.

Determining the Plan #

Assume you and your team have analyzed the root causes of the problems in the legacy system and answered the first question. Next comes the second question: What is the goal? For a legacy system, this question is actually the easiest to answer: rewrite certain code.

You may ask, why rewrite instead of refactor? Based on my understanding of most enterprises, if refactoring can solve the problem, they either don’t consider it a problem or have already fixed it, so our goal is most likely to rewrite certain code.

However, before continuing the discussion, I strongly suggest trying to refactor your code first, making small adjustments to the existing code as much as possible, and not embarking on a large-scale transformation, because refactoring has the lowest cost.

Our real focus is on the third question: How to do it? We need to break down the goal.

To rewrite a module, you need to consider how to ensure that the code we rewrite is functionally equivalent to the original code. The only reliable answer to this question is testing. Run the same tests on both systems, and if the results are the same, we consider them to have the same functionality.

No matter what your previous views on testing were, at this point, you will greatly wish that you already have a large number of tests. If not, it is best to add tests to this module first. Because only when you build a test safety net, can subsequent modifications be considered on a solid path.
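The idea of running the same tests against both the old and the new code can be sketched roughly as follows. This is only an illustration under stated assumptions: `legacy_shipping_fee`, `new_shipping_fee`, and the sample inputs are hypothetical stand-ins, not code from any real system. The old implementation serves as the reference, and any input where the two disagree is reported.

```python
# Hypothetical example: both functions and the sample inputs are invented
# for illustration and stand in for real old and rewritten code.

def legacy_shipping_fee(weight_kg, is_member):
    """Stand-in for the old code whose behavior must be preserved."""
    fee = 10
    if weight_kg > 5:
        fee = fee + (weight_kg - 5) * 2
    if is_member:
        fee = fee * 0.5
    return fee


def new_shipping_fee(weight_kg, is_member):
    """Stand-in for the rewritten code."""
    base_fee = 10 + max(0, weight_kg - 5) * 2
    return base_fee * 0.5 if is_member else base_fee


def find_mismatches(old_fn, new_fn, cases):
    """Feed the same inputs to both implementations and collect differences."""
    mismatches = []
    for args in cases:
        expected = old_fn(*args)  # the old code is the reference
        actual = new_fn(*args)
        if expected != actual:
            mismatches.append((args, expected, actual))
    return mismatches


# Sample inputs chosen to cover the boundary at 5 kg and both member flags.
CASES = [(1, False), (5, False), (6, False), (12, True), (0, True)]

if __name__ == "__main__":
    problems = find_mismatches(legacy_shipping_fee, new_shipping_fee, CASES)
    print("mismatches:", problems)
```

In practice the input cases would more likely come from existing tests or recorded real requests than from a hand-written list; the point is only that both implementations are judged by the same cases.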

When it comes to legacy code and testing, I recommend a classic book: “Working Effectively with Legacy Code” by Michael Feathers. As the title makes clear, it is a book about legacy code. If you plan to deal with legacy code, it is well worth reading.

In 2007, I wrote a review of this book. My verdict was: “This is a book about how to write tests.” It will teach you how to write tests for real code.

What left the deepest impression on me was the book’s definition of legacy code: legacy code is code without tests. This definition is truly enlightening. By this criterion, many teams’ code is legacy code. In other words, by writing code without tests, they are harming themselves.

With the test safety net in place, the next question is how to replace the legacy system. The answer is to break it down into small pieces and replace them step by step. As you can see, the idea of task decomposition is at work again.

In “Why do some people think you can build a Taobao for only 50,000 yuan?” I mentioned how Taobao migrated its system to Java: it divided the business into several small modules and upgraded only one module at a time. The old module was only maintained, with no new features added, while new features were developed in the new module. The old and new modules shared the same database. When a new feature went live, the corresponding feature in the old module was turned off, and once all features had been replaced, the old module was taken offline.

This principle is universally applicable; the only difference is the size of the modules. If your “small module” is a whole system, deploy both the old and new systems and control traffic at the front entrance, gradually redirecting it from the old system to the new one. If the “small module” exists only at the code level, you need a piece of dispatch code that routes calls to different code based on parameters. Then, as development progresses, gradually reduce the calls to the old code until nothing depends on it anymore.
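For the code-level case, the dispatch code might look something like the minimal sketch below. All names here (`handle_order`, `legacy_handle_order`, `new_handle_order`, `MIGRATED_ORDER_TYPES`, and the order types) are hypothetical and chosen only for illustration. Callers use a single entry point, and a parameter of the request decides whether the old or the new code runs.

```python
# Hypothetical names throughout: the handlers, the order types, and the
# MIGRATED_ORDER_TYPES set are invented for illustration.

def legacy_handle_order(order):
    """Stand-in for the old code path."""
    return f"legacy:{order['type']}"


def new_handle_order(order):
    """Stand-in for the rewritten code path."""
    return f"new:{order['type']}"


# Order types whose handling has already been rewritten and verified.
# This set grows as the migration progresses.
MIGRATED_ORDER_TYPES = {"book"}


def handle_order(order, migrated=MIGRATED_ORDER_TYPES):
    """Single entry point: route each call based on a request parameter."""
    if order["type"] in migrated:
        return new_handle_order(order)
    return legacy_handle_order(order)


if __name__ == "__main__":
    print(handle_order({"type": "book"}))     # routed to the new code
    print(handle_order({"type": "grocery"}))  # still handled by the old code
```

Once every order type is in the migrated set, nothing calls `legacy_handle_order` anymore, and both it and the dispatch branch can be deleted.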

Here’s another small suggestion: according to the modular approach, put the new code in a new module and write the new code according to new standards, such as achieving 100% test coverage. Then, make the entry points dependent on this new module.

Finally, with tests and a replacement plan in place, there is still a critical question: how should the new code be written?

To answer this question, we must go back to the beginning. Why are we making this change? Because the system is already overburdened. But will our new changes definitely solve the problem? Not necessarily.

Many programmers think that the code left behind by others is a mess. But if they get the chance to rewrite it, how can they guarantee that they won’t make a mess of it themselves? This is a question that many people haven’t thought about carefully.

Even if you don’t think about this question, if you rewrite this piece of code today, tomorrow you will complain that the person who wrote this code didn’t write it properly. It’s just that the person being complained about is yourself.

In order to slow down the rate of code decay, you must spend more effort on software design. On the one hand, establish a good domain model; on the other hand, seek the latest industry understanding of system construction.

I have mentioned the value of the domain model many times before in this column. Many industries have developed their own best practices in domain modeling, such as the e-commerce field, which you can refer to. This can save a lot of exploration costs.

Let’s delve into the latter point, “seeking the latest industry understanding.” In short, we need to know at what level the industry has now reached in terms of development.

For example, if you are building a high-traffic system today, you need caching systems and a CDN instead of sending all traffic directly to the database. What makes this possible is that the cost of memory has dropped dramatically, making caching systems a standard part of the architecture. And thanks to REST, the industry’s understanding of HTTP has advanced by leaps and bounds, which has given CDNs much more room to be useful.

Today’s caching systems are no longer simple big maps, and some well-implemented caching systems can support many different data structures and even complex queries. To some extent, they have become a performance-optimized “database”.
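To make the idea of “not sending all traffic to the database” concrete, here is a minimal sketch of one common read pattern, often called cache-aside. It is an illustration under assumptions, not a recommendation of a specific product: `FakeDatabase` and `CacheAsideReader` are hypothetical names, and a plain in-process dictionary stands in for what would normally be a dedicated caching system.

```python
import time


class FakeDatabase:
    """Hypothetical stand-in for a slow data store; counts its queries."""

    def __init__(self, rows):
        self.rows = rows
        self.query_count = 0

    def load(self, key):
        self.query_count += 1
        return self.rows.get(key)


class CacheAsideReader:
    """Read through a cache first and fall back to the database on a miss."""

    def __init__(self, database, ttl_seconds=60.0, clock=time.monotonic):
        self.database = database
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        # A dict stands in for a real caching system: key -> (value, expires_at)
        self._entries = {}

    def get(self, key):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # cache hit: the database is not touched
        value = self.database.load(key)  # miss or expired: go to the database
        self._entries[key] = (value, now + self.ttl_seconds)
        return value


if __name__ == "__main__":
    db = FakeDatabase({"sku-1": "keyboard"})
    reader = CacheAsideReader(db)
    reader.get("sku-1")
    reader.get("sku-1")
    print("database queries:", db.query_count)
```

The expiry time is the simplest possible answer to stale data; a real system would also need to decide how writes invalidate or update cached entries.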

With this understanding, when making technology choices, you can choose the appropriate technology based on the characteristics of your own system, rather than solving today’s problems with yesterday’s technology, resulting in code that is outdated as soon as it is written.

The example mentioned earlier involves technology selection. Another aspect of “the latest understanding” is the industry’s understanding of best practices.

In fact, in this column I have talked a lot about various “best practices,” such as writing tests, continuous integration, and automation. These things may seem simple, but if you don’t do them, the team is very likely to fall back into the quagmire and keep struggling.

If you choose to rewrite code, at the very least, the new code should be done according to “best practices” to minimize the rate of code decay.

In short, when transforming a legacy system, a key point is to not return to the old path.

Conclusion #

We have applied much of what we learned earlier to “transforming legacy systems.” As long as the product is still under development, system transformation is inevitable. The prerequisite for transforming a legacy system is to understand the current situation and know why the system needs to be transformed. Is it because of architectural issues or a chaotic domain model? Only by knowing the root cause can we carry out a targeted transformation.

When transforming legacy systems, I have a few suggestions for you:

  • Build a test safety net to ensure the consistency of new and old modules.
  • Divide it into small pieces and replace them gradually.
  • Build a solid domain model.
  • Look for the latest understanding of system construction in the industry.

If there is only one thing you can remember from today’s content, please remember: Take small steps to transform legacy systems and do not go back to the old path.

Finally, I would like to ask you to share your experience in transforming legacy systems. Feel free to share your practices in the comments section.

Thank you for reading. If you find this article helpful, feel free to share it with your friends.