42 Detailed Precautions for Technical Research and Development #

Hello, I’m Jingxiao.

Technical research and development has always been one of the core functions of major companies. The quality of these efforts directly affects the quality of the product as well as the user experience. It is therefore important to establish a standardized and robust development system. Today, I will discuss some considerations for technical research and development.

Choosing the Right Programming Language #

When developing a system, we first need to choose an appropriate programming language for each part of the system based on its specific requirements. Generally, we prefer C++ for the infrastructure layer, and languages such as Python, Java, or PHP for the pure server-side. Taking a search engine as an example, I have drawn a simplified architecture diagram below:

You can see the general workflow: the user enters a query on the client-side, which sends a request to the server-side; the server-side forwards the request to the NLP service, which analyzes it and produces various signals. After obtaining the signals, the server-side sends requests to the backend, which performs feature extraction and uses ML models to retrieve and rank the results. Finally, the results are returned to the server-side and then to the client-side.
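To make this flow concrete, here is a minimal Python sketch of the middle tier orchestrating the other services. The function names (`nlp_service`, `backend`, `server_side`) and the stubbed logic are hypothetical illustrations; in a real system each call would be an RPC to a separate service with its own timeout and error handling.

```python
# Hypothetical sketch of the search request flow. The real NLP service and
# backend are separate processes (written in C++); they are stubbed here as
# plain Python functions to show the orchestration only.

def nlp_service(query: str) -> dict:
    # Analyze the request and produce signals (tokens, intent, etc.).
    return {"tokens": query.lower().split(), "intent": "search"}

def backend(signals: dict) -> list:
    # Feature extraction + ML retrieval and ranking (stubbed).
    return [f"result for '{t}'" for t in signals["tokens"]]

def server_side(query: str) -> list:
    # Middle tier: forward to the NLP service, then query the backend,
    # then return the ranked results to the client-side.
    signals = nlp_service(query)
    return backend(signals)
```

Here `server_side("Machine Learning")` returns one stub result per token; the point is only the layering, not the logic inside each stub.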

We will be using C++ for both the NLP service and the backend. This is because this part of the process is the most complex and time-consuming, involving feature extraction and model serving, and has very high requirements for latency. Only languages like C/C++ can meet these demands.

As for the server-side, or middle tier, we will be using languages like Python, Java, or PHP. This part involves neither particularly complex processing nor strict latency requirements; it mainly focuses on business logic. Additionally, these high-level languages make it easier for programmers to get started and to debug.

Effective Use of Caching #

Caching is vital in practical engineering; without it, most of the products we use today would likely crash. Caching saves a significant amount of CPU time and greatly reduces latency.

Let’s take the example of a search engine system. We set up caching on the client-side, server-side, and backend. On the client-side, we typically cache the user’s search history. For example, when you click on the search box, the suggested keywords that automatically appear can be cached results. This does not require sending a request to the server-side; the suggestions can be loaded directly from the client-side cache.

Similarly, we also set up caching on the server-side to store some search results. This way, if the same user sends the same request multiple times, there is no need to request data from the backend again. The results can be fetched directly from the cache, making it fast and efficient.

Likewise, caching is also needed on the backend. For example, in model serving, we usually have a cache that lasts several hours. Otherwise, providing real-time online services would impose a heavy workload on the CPU and result in high latency.
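As a concrete illustration, here is a minimal time-to-live (TTL) cache in Python, sketching the several-hour model-serving cache described above. The class name, the three-hour TTL, and the `rank_fn` interface are all illustrative assumptions, not the actual implementation.

```python
import time

class TTLCache:
    """A minimal time-to-live cache: entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:  # stale entry: evict and report a miss
            del self._store[key]
            return None
        return value

    def put(self, key, value):
        self._store[key] = (value, time.time() + self.ttl)

cache = TTLCache(ttl_seconds=3 * 3600)  # illustrative: cache for ~3 hours

def serve(query, rank_fn):
    cached = cache.get(query)
    if cached is not None:      # hit: skip expensive real-time model serving
        return cached
    results = rank_fn(query)    # miss: run retrieval/ranking, then cache it
    cache.put(query, results)
    return results
```

The trade-off discussed below lives entirely in `ttl_seconds`: a longer TTL lowers CPU load but serves staler results.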

In summary, the lack of caching can cause a lot of problems.

  • The server load quickly spikes, increasing the chance of crashing.
  • End-to-end latency quickly increases, increasing the probability of request timeouts.

However, is more caching always better? Obviously not.

First of all, caching is usually expensive, so there is always a limit on how much we can use.

Secondly, caching is not omnipotent, and excessive caching can even harm the user experience. For example, the retrieval and ranking of search results should ideally be done with real-time model serving, because this provides more accurate and up-to-date personalized results for users. The few-hour cache exists mainly for performance reasons; extending it from a few hours to a few days would clearly be inappropriate and would have a significant negative impact on the user experience.

Therefore, the duration and amount of caching often need to be considered in relation to user engagement and performance. This requires making decisions based on specific analysis and A/B testing.

Robust Logging System #

A robust logging system is particularly crucial. Large-scale systems in companies are often composed of thousands or even millions of small systems. If a failure occurs, such as a sudden service outage in Google or Facebook, we need to quickly identify the cause and make repairs. What do we rely on? We rely on a robust logging system that allows us to easily break down the error and trace it layer by layer until we find the root cause.

Generally speaking, there are two types of logging modes we need in an online environment.

The first is real-time logging. Considering the pressure on servers, downsampling is usually performed, such as logging only 1% of the actual traffic. The advantage of this is that we can track various metrics in a timely manner and trigger alerts immediately if something goes wrong.
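A downsampled real-time logger along these lines can be sketched in a few lines of Python. The 1% sample rate comes from the text above; the function name and record format are assumptions.

```python
import random

SAMPLE_RATE = 0.01  # log roughly 1% of the actual traffic

def maybe_log(record: dict, rate: float = SAMPLE_RATE) -> bool:
    """Real-time logging with downsampling: emit only a fraction of
    records to limit server pressure. Returns True if logged."""
    if random.random() < rate:
        print(record)  # in production: send to the real-time log pipeline
        return True
    return False
```

Downstream, a monitoring job would read this stream, scale metric counts back up by `1 / rate`, and trigger an alert when a tracked metric deviates from its normal range.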

For example, at 12:00 noon an engineer pushes some code to production that causes the server to crash. The real-time logging system detects the anomaly and sends out an alert. The relevant personnel then investigate; if they discover that the push time of this code coincides with the time the alert was triggered, they can quickly revert the change and minimize the negative impact.

At the same time, real-time logging also benefits us in conducting various online experiments. For example, ML teams often need to tune parameters for A/B testing. Our usual approach is to check the real-time logging table every few hours and make moderate parameter adjustments based on various metrics.

The second type is daily full logging, which is updated once a day. This helps us gather some information for analysis, such as creating a dashboard for tracking daily metrics and progress. In addition, the full logging table is often used as training data for ML teams.

Profiling Is Essential #

Regarding profiling, we have mentioned before that it is a very important feature in actual development. It allows developers to understand the efficiency of each part of the system in detail and make improvements accordingly.

In an online environment, we usually add profiling code in many necessary places. This way, we can know the latency of this piece of code, which part of it has particularly severe latency, and so on, and then take targeted measures.
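One common way to add such profiling code is with a timing decorator, sketched here in Python. The decorator name and the in-memory `LATENCIES` table are illustrative; a real system would export these numbers to a metrics service rather than keep them in a dictionary.

```python
import time
from functools import wraps

LATENCIES = {}  # function name -> list of observed latencies in ms

def profiled(fn):
    """Wrap a function so that every call records its wall-clock latency."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            LATENCIES.setdefault(fn.__name__, []).append(elapsed_ms)
    return wrapper

@profiled
def extract_features(query: str) -> list:
    time.sleep(0.005)  # stand-in for real feature-extraction work
    return [len(query)]
```

With each expensive step wrapped like this, developers can see at a glance which part of the pipeline contributes most to end-to-end latency.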

If there is no profiling, it is easy for developers to add features indiscriminately without optimization. As a result, over time, the system becomes more redundant and the latency increases. Therefore, a mature system will definitely have profiling code to help developers monitor changes in various internal indicators at any time.

Test, Test, Test #

This point has been emphasized in previous articles: testing is crucial and should never be overlooked. Whether it’s unit testing, integration testing, or any other form, it is an effective way to ensure code quality and reduce the probability of bugs.

In well-regulated companies or teams, developers cannot pass code review if they add or change a feature without writing tests. Tests are therefore a must, especially as the system grows more complex: many engineers build new features on top of it, making it hard to guarantee that each part remains unaffected. Testing is a good solution to this problem.
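For example, a unit test in Python's built-in `unittest` style might look like this. The `normalize_query` helper is hypothetical, standing in for a small piece of the feature under change.

```python
import unittest

def normalize_query(q: str) -> str:
    # Hypothetical helper: lowercase the query and collapse whitespace.
    return " ".join(q.lower().split())

class TestNormalizeQuery(unittest.TestCase):
    def test_lowercases(self):
        self.assertEqual(normalize_query("Hello World"), "hello world")

    def test_collapses_whitespace(self):
        self.assertEqual(normalize_query("  a   b "), "a b")
```

Run with `python -m unittest`. In a well-regulated team, a change to `normalize_query` without a matching test like this would not pass code review.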

Apart from the tests written during regular development, it is also advisable to add an additional layer of testing before pushing the code to production. Taking the example of a search engine system again, it is known that companies like Google or Facebook have dedicated services that simulate different users sending requests and check if the responses meet the required standards. If an error occurs, it prevents the code from being pushed, alerting developers that there may be issues in their code that need to be rechecked.
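Such a pre-push check can be sketched as follows. The handler, the sample queries, and the success criteria are all hypothetical stand-ins for the dedicated request-simulation services the text describes.

```python
def handle_request(query: str) -> dict:
    # Stand-in for the candidate build being validated before the push.
    return {"status": 200, "results": [query.lower()]}

SMOKE_QUERIES = ["weather", "News Today", "python tutorial"]

def smoke_test(handler) -> bool:
    """Simulate representative user requests; return True only if every
    response meets the standard (here: status 200 and non-empty results)."""
    for q in SMOKE_QUERIES:
        resp = handler(q)
        if resp.get("status") != 200 or not resp.get("results"):
            return False  # block the push and alert the developers
    return True
```

A failing `smoke_test` prevents the code from being pushed, which is exactly the gate described above: developers are alerted that something in their change needs to be rechecked.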

Conclusion #

Regarding precautions for technical research and development, I have mainly emphasized the points above. In fact, there are many details in daily development work that deserve special attention, and for error-prone areas, a systematic process is an efficient solution. So, in your daily work, are there any practices worth sharing that you pay particular attention to, or any questions you want to discuss? Feel free to write down your thoughts in the comments section.