04 Use Regular Expressions With Caution

04 Use regular expressions with caution #

Hello, I’m Liu Chao.

In the previous lecture, when I was talking about optimizing the String object, I mentioned the Split() method that uses regular expressions, which may cause backtracking issues. Today, let’s delve deeper into this and understand what’s going on.

Before we begin, let’s take a look at an example that can help you better understand the content.

In a small-scale project development, I encountered a problem. To promote a new product, we developed a mini-program. According to the estimated number of visits, this activity is expected to have over 300,000 participating users, with a peak TPS (transactions per second) of around 3,000.

This result came from the micro-benchmark performance test I performed on the interface. I usually use the ab tool (which can be quickly installed by running yum -y install httpd-tools) to test the interface’s HTTP requests on another machine.

I can simulate peak requests on the production environment by setting the number of requests with -n and the number of concurrent users with -c. Then I evaluate the interface’s performance based on three indicators: TPS, RT (response time per second), and the distribution of requests per second. The following diagram shows this (with my server address hidden):

img

When I was performing the performance test, I found that the TPS of a submission interface was not increasing. In theory, this business logic is very simple, so the possibility of performance bottleneck is not high.

I quickly used the process of elimination to find the problem. First, I commented out all the business logic code in the method, leaving an empty method, and then checked the performance. This approach can effectively distinguish whether it is a framework performance issue or a business code performance issue.

I quickly identified it as a business code issue and immediately started examining the code one by one to find the cause. After adding the database insertion operation code, the TPS slightly decreased, but the reason was still not found. In the end, only the Split() method operation remained. Indeed, after adding the Split() method, the TPS significantly decreased.

But why does a Split() method affect the TPS? Let’s now delve into the related content of regular expressions, and we’ll find the answer when we complete the learning.

What is a regular expression? #

A regular expression is a concept in computer science that has been implemented in many programming languages. Regular expressions use specific metacharacters to retrieve, match, and replace strings that adhere to certain rules.

The metacharacters used to construct the syntax of regular expressions include ordinary characters, standard characters, quantifiers (limiting characters), and position characters (boundary characters). For more details, please refer to the following diagram:

img

Regular Expression Engine #

A regular expression is a formula written with regular symbols. The program performs syntax analysis on this formula and builds a syntax tree. Then, based on this analysis tree, it generates an execution program (which we call a state machine or automaton) combined with the regular expression engine, used for character matching.

The regular expression engine is a set of core algorithms used to build the state machine.

Currently, there are two ways to implement a regular expression engine: DFA (Deterministic Finite Automaton) and NFA (Non-deterministic Finite Automaton).

Comparatively, constructing a DFA is more costly than constructing an NFA, but the execution efficiency of a DFA is higher than that of an NFA.

Assuming the length of a string is n, if a DFA is used as the regular expression engine, the time complexity of matching is O(n); if an NFA is used, due to the existence of multiple branches and backtracking during the matching process, assuming the number of states in the NFA is s, the time complexity of this matching algorithm is O(ns).

The advantage of an NFA is that it supports more features. For example, capturing groups, lookaround, possessive quantifiers, and other advanced features are all based on matching subexpressions independently. Therefore, regular expression libraries used in programming languages are all based on NFA implementation.

So how does an NFA engine perform matching? Let me illustrate with the following characters and expression.

text = “aabcab” regex = “bc”

The NFA engine reads each character of the regular expression, uses it to match the target string, and if a match is successful, it moves on to the next character of the regular expression; otherwise, it continues to match the next character of the target string. Let’s break down the process:

First, compare the first matching symbol of the regular expression with the first character of the string. “b” does not match “a”; move on to the next character of the string, which is also “a”, which does not match either; continue to the next character, which is “b” and it matches.

[image]

Then, similarly, compare the second matching symbol of the regular expression with the fourth character of the string. “c” matches “c”; continue to read the next character of the regular expression, but there are no more characters to match, so the matching ends.

[image]

This is the matching process of the NFA engine. Although in practical applications, regular expressions encountered are more complex than this, the matching method is the same.

Backtracking in NFA Engine #

Complex regular expressions implemented with NFA engines often cause backtracking issues during the matching process. Extensive backtracking can occupy the CPU for a long time, resulting in system performance overhead. Let me provide an example:

text = “abbc” regex = “ab{1,3}c”

This example aims to match a string that starts with “a”, ends with “c”, and has 1-3 occurrences of the character “b” in between. The process of parsing with an NFA engine is as follows:

First, compare the first matching symbol “a” of the regular expression with the first character “a” of the string, and it matches.

[image]

Then, compare the second matching symbol “b{1,3}” of the regular expression with the second character “b” of the string, and it matches. However, since “b{1,3}” represents 1-3 occurrences of the character “b” and the NFA engine is greedy, it does not move on to the next matching symbol of the regular expression. Instead, it continues to match the third character “b” of the string, and it still matches.

[image]

Next, continue to compare “b{1,3}” with the fourth character “c” of the string, and find that it does not match. At this point, backtracking occurs, and the fourth character “c” that was already read is discarded, and the pointer returns to the position of the third character “b”.

[image]

How to avoid backtracking issues? #

Since backtracking can cause performance overhead for the system, how can we deal with it? If you have carefully read the above case, you will find that the greedy nature of the NFA automaton is the fuse, which is closely related to the matching pattern of regular expressions. Let’s learn about it together.

1. Greedy mode

As the name suggests, in quantity matching, if +, ?, *, or {min, max} quantifiers are used alone, the regular expression will match as much content as possible.

For example, in the above example:

text = “abbc” regex = “ab{1,3}c”

In greedy mode, the NFA automaton reads the largest matching range, that is, it matches 3 b characters. If the match fails once, it triggers a backtracking. If the match result is “abbbc”, it will succeed.

text = “abbbc” regex = “ab{1,3}c”

2. Reluctant mode

In this mode, the regular expression tries to match as few characters as possible. If it matches successfully, it will continue to match the remaining string.

For example, by adding a “?” after the character in the previous example, you can enable the reluctant mode.

text = “abc” regex = “ab{1,3}?c”

The match result is “abc”. In this mode, the NFA automaton first chooses the smallest matching range, which is to match 1 b character, thus avoiding the backtracking issue.

3. Possessive mode

Similar to the greedy mode, the possessive mode also maximizes matching more content; however, in the possessive mode, once the match fails, the matching ends without backtracking.

In the previous example, adding a “+” after the character can enable the possessive mode.

text = “abbc” regex = “ab{1,3}+bc”

The result is not a match, the matching ends, and there is no backtracking issue. By now, you should be very clear that the method to avoid backtracking is to use the reluctant mode and the possessive mode.

Regarding the initial doubt of “why does a split() method affect TPS,” you should now also understand, right?

I used the split() method to extract the domain name and check if the request parameters comply with the rules. Split() generated a lot of backtracking when matching groups, so I added a character required for matching and “+” after the regular expression to solve this problem.

\\?(([A-Za-z0-9-~_=%]++\\&{0,1})+)

Optimization of Regular Expressions #

The performance issues caused by regular expressions have sounded an alarm for me, and I would like to share some insights with you here. Any minor issue can potentially lead to performance problems, reflecting our insufficient understanding of this technology. Therefore, I encourage you to study performance optimization, master the methodology, and learn to see the essence through the appearance. Below, I will summarize several methods for optimizing regular expressions.

1. Use lazy quantifiers instead of greedy quantifiers #

Greedy quantifiers can cause backtracking issues, which can be avoided by using lazy quantifiers. I have explained this in detail earlier, so I won’t go into it again here.

2. Reduce branching choices #

Regular expressions with branching choices like “(X|Y|Z)” can reduce performance. We should try to minimize their usage during development. If necessary, we can optimize them in the following ways:

Firstly, we need to consider the order of choices and place the more frequently used choices at the beginning so that they can be matched more quickly.

Secondly, we can try to extract common patterns. For example, replace “(abcd|abef)” with “ab(cd|ef)”. The latter has a faster matching speed because the NFA automaton will try to match “ab” first, and if not found, it won’t attempt any other options.

Finally, for simple branching choices, we can use three separate indexes instead of “(X|Y|Z)”. If you test it, you will find that using three separate indexes is more efficient than using “(X|Y|Z)”.

3. Reduce capturing groups #

Before discussing this method, let me briefly introduce what capturing groups and non-capturing groups are.

A capturing group refers to saving the matched content of a sub-expression in a numbered or explicitly named array in the regular expression for easy reference later. Generally, each pair of parentheses () represents a capturing group, and capturing groups can be nested.

A non-capturing group, on the other hand, participates in the match but does not create a separate capture group. Its expression is usually composed of (?:exp).

In a regular expression, each capturing group has a number, with 0 representing the entire matched content. Let’s look at the following example:

public static void main( String[] args )
{
    String text = "<input high=\"20\" weight=\"70\">test</input>";
    String reg="(<input.*?>)(.*?)(</input>)";
    Pattern p = Pattern.compile(reg);
    Matcher m = p.matcher(text);
    while(m.find()) {
        System.out.println(m.group(0));// the entire matched content
        System.out.println(m.group(1));//(<input.*?>)
        System.out.println(m.group(2));//(.*?)
        System.out.println(m.group(3));//(</input>)
    }
}

Output:

<input high=\"20\" weight=\"70\">test</input>
<input high=\"20\" weight=\"70\">
test
</input>

If you don’t need to retrieve the text within a specific group, use a non-capturing group instead. For example, replace “(X)” with “(?:X)”. Let’s see another example:

public static void main( String[] args )
{
    String text = "<input high=\"20\" weight=\"70\">test</input>";
    String reg="(?:<input.*?>)(.*?)(?:</input>)";
    Pattern p = Pattern.compile(reg);
    Matcher m = p.matcher(text);
    while(m.find()) {
        System.out.println(m.group(0));// the entire matched content
        System.out.println(m.group(1));//(.*?)
    }
}

Output:

<input high=\"20\" weight=\"70\">test</input>
test

In summary, reducing unnecessary capturing groups can improve the performance of regular expressions.

Summary #

Although regular expressions are small, they have powerful matching capabilities. We often use them, for example, to validate mobile phone numbers or email addresses on registration pages.

However, many times, we neglect their usage rules because they are small, and some special cases are not covered in the test cases, resulting in situations where problems occur after going live.

Based on my past experience, if using regular expressions can make your code concise and convenient, then you can use them as long as you have done proper performance testing. If not, try to avoid using regular expressions to avoid causing more performance issues.

Thought question #

In addition to the Split() method using regular expressions, there are actually other methods in Java that also use regular expressions to implement certain functions, which may easily lead us into traps. Now, please think about which other utility methods in the JDK use regular expressions?