Login With Github

Hidden Traps in Regular Expressions

A few days ago, a monitoring system for an online project reported an exception suddenly. After checking the usage of the related resources, we found that the CPU utilization rate was nearly 100%. Then we exported the stack information for the problem with the thread Dump tool that comes with Java.

We can see that all the stacks point to a method called validateUrl, which gets more than 100 error messages on the stack. By troubleshooting the code, we know that the main function of the method is to verify whether the URL is legal.

So how a regex can lead to a high CPU utilization. In order to reproduce the problem, we extract the key code and make a simple unit test.

public static void main(String[] args) {
    String badRegex = "^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)+([A-Za-z0-9-~\\\\/])+$";
    String bugUrl = "http://www.fapiao.com/dddp-web/pdf/download?request=6e7JGxxxxx4ILd-kExxxxxxxqJ4-CHLmqVnenXC692m74H38sdfdsazxcUmfcOH2fAfY1Vw__%5EDadIfJgiEf";
    if (bugUrl.matches(badRegex)) {
        System.out.println("match!!");
    } else {
        System.out.println("no match!!");
    }
}

When we run the example above, through the resource monitor we can see that a process called java has a CPU utilization that has soared to 91.4%.

Now we almost can make a judge that the regex is the reason that leads to a high CPU utilization!

So, let's focus on the regex:

^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)+([A-Za-z0-9-~\\/])+$

The regex looks fine and can be divided into three parts:

It matches the http and https protocols in the first part, matches the www. character in the second part, and matches other characters in the third part. I stared at the regex for a long time and didn't find any big problem.

In fact, the key reason for the high CPU usage here is that the engine implementation used by Java regex is the NFA, which performs backtracking when character matching is performed. Once the backtracking occurs, the time it takes will become very long. It may be a few minutes or even a few hours. The amount of time depends on the number and complexity of backtracking.

By the way, maybe some still hasn't been very clear about what backtracking is. It doesn't matter, let's start with the principles of regex.

Regex Engine

Regex is a set of convenient match symbols. To achieve such a complex and powerful matching syntax, we must have a set of algorithms, and the implementation of the algorithm is called the regex engine. Simply put, there are two ways to implement a regex engine: the DFA (Deterministic Final Automata) and the NFA (Non deterministic Finite Automaton).

These two Automatons are different, and we're not going to go deep into their principles. Simply put, the time complexity of the DFA is linear, which is more stable but has limited functionality. The time complexity of NFA is relatively unstable, so sometimes it's very good and sometimes it's not, depending on the regex you write. But the advantage of NFA is that its functionality is even more powerful, so languages ​​such as Java, .NET, Perl, Python, Ruby, and PHP use NFA to implement their regex.

How does the NFA match? We use the following characters and expressions as examples.

text="Today is a nice day."
regex="day"

Remember that NFA match is based on regex. That is, NFA will read one character of the regex and match it with the target string. If the match succeeds, it will turn to the next character of the regex, otherwise it will continue to compare with the next character of the target string.

Let's go through the examples above step by step.

  • First, take the first match character of the regex: d. Then compare it with the first character of the string, which is T. It doesn't match, so turn to the next one. The second character is o, and it doesn't match either. So move on to the next one, which is d now. It matches. Then read the second character of the regular: a.
  • The second match character of the regex: a. And it will be compared with the fourth character of the string a. It matches again. Then go on to read the third character of the regex y.
  • The third match character of the regex is y. Let's continue to match it with the fifth character of the string, and it matches. Then try to read the next character of the regex and find that there is none, so the match ends.

The above is the matching process of the NFA, and the actual matching process is much more complicated. However, the principle of matching is the same.

Backtracking of NFA

Now that you've learned how NFA performs string matching, let's talk about the focus of the article: Backtracking. In order to explain the backtracking better, we'll use the following example.

text="abbc"
regex="ab{1,3}c"

This is a relatively simple example. The regex starts with a and ends with c, and between them there is a string of 1-3 b characters. The match process of NFA is like this:

  • First, take the first match character of the regex, which is a, and compare it with the first character of the string a. It matches, so move to the second character of the regex.
  • Take the second match character of the regex, which is b{1,3}, and compare it with the second character of the string b. It matches again. But since b{1,3} represents 1-3 b strings and the greedy nature of the NFA (that is, to match as much as possible), it won't read the next character of the regex at this time but still compare b{1,3} with the third character of the string, which is b too.  And it matches too. Then it will continue using b{1,3} to be compared with the fourth character of the string c, and find that it doesn't match. Backtracking occurs at this point.
  • How does backtracking work? After the backtracking, the fourth character (that is c) of the string which has been read will be spit out and the pointer will return to the third character of the string. After that, it will read the next character c of the regex, and compare it with the next character c of the current pointer, and it matches. Then read next, but it's over.

Let's go back and have a look at that regex which is used to validate a URL:

^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)+([A-Za-z0-9-~\\/])+$

The URL where the problem occurred is:

http://www.fapiao.com/dzfp-web/pdf/download?request=6e7JGm38jfjghVrv4ILd-kEn64HcUX4qL4a4qJ4-CHLmqVnenXC692m74H5oxkjgdsYazxcUmfcOH2fAfY1Vw__%5EDadIfJgiEf

We divide the regex into three parts:

  • Part 1: The verification protocol.^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://).
  • Part 2: Verify the domain. (([A-Za-z0-9-~]+).)+.
  • Part 3: Verify the parameters. ([A-Za-z0-9-~\\/])+$.

It can be found that there is no problem with the part of the regex verification protocol http://, but when verifying www.fapiao.com, it uses the way of xxxx. to verify. So the matching process is like this:

  • Match to www.
  • Matches to fapiao.
  • Match to com/dzfp-web/pdf/download?request=6e7JGm38jf....., you will see that because of the greedy nature, the program will always try to read the subsequent strings to match, and finally it find that there is no dot, so it starts characters backtracking one by one.

This is the first problem within the regex.

Another problem is in the third part of the regex. It can be found that the URL with problem has underscore (_) and percent sign (%), but the regex corresponding to the third part doesn't have. So only after matching for a long string of characters, it can be found that it doesn't match and then go backtracking.

This is the second problem within this regex.

Solution

You've learned that backtracking is the cause of the problem. So the solution of the problem is reducing backtracking. In fact, you will find that if you add the underscore and the percent sign into the third part, the program will become normal.

public static void main(String[] args) {
    String badRegex = "^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)+([A-Za-z0-9-~_%\\\\/])+$";
    String bugUrl = "http://www.fapiao.com/dddp-web/pdf/download?request=6e7JGxxxxx4ILd-kExxxxxxxqJ4-CHLmqVnenXC692m74H38sdfdsazxcUmfcOH2fAfY1Vw__%5EDadIfJgiEf";
    if (bugUrl.matches(badRegex)) {
        System.out.println("match!!");
    } else {
        System.out.println("no match!!");
    }
}

Run the above program and it will print out the match!!.

What if there are other URLs that contain messy characters in the future? Change it again? Certainly it's not realistic!

In fact, there are three modes in regex: Greedy mode, Reluctant mode, and Possessive mode.

If you add a sign ? in the regex, the Greedy mode will become Reluctant mode, that is, it will match as little as possible. However, backtracking will still occur in the Reluctant mode. For example:

text="abbc"
regex="ab{1,3}?c"

The first character of the regex: a, matches the first character of the string a. And the second operator of the regex, which is b{1,3}?, matches the second character b of the string. Because of the principle of minimum matching, the third operator c of the regex doesn't match the third character b of the string. So it goes backtracking and compares the second operator of the regex b{1,3}? with the third character b of the string, and now the match is successful. Then the third character of the regex c matches the fourth character c of the string. Ends.

If you add a sign + instead, the original Greedy mode will become Exclusive mode, that is, it will match as much as possible, but won't go backtracking.

Therefore, if you want to solve the problem completely, it must be guaranteed functionality while ensuring no backtracking. I add a plus sign to the second part of the regex that verifies the URL above:

^([hH][tT]{2}[pP]:\/\/|[hH][tT]{2}[pP][sS]:\/\/)
(([A-Za-z0-9-~]+).)++    --->>> (added + here)
([A-Za-z0-9-~_%\\\/])+$

Now there is no problem running the program.

Finally, I recommend a website that can check if there is a problem with the regex you write and the corresponding string match.

Online regex tester and debugger: PHP, PCRE, Python, Golang and JavaScript

For example, the URL that has problem in this article will be prompted after using the site check: catastrophic backgracking.

When you click on "regex debugger" in the bottom left corner, it will tell you how many steps have been checked, and will list all the steps and indicate where the backtracking occurred.

The regex in this article automatically stops after a 110,000-step attempt. It shows that the regex does have problems and needs to be improved.

But when I test it with the modified regex as below:

^([hH][tT]{2}[pP]:\/\/|[hH][tT]{2}[pP][sS]:\/\/)(([A-Za-z0-9-~]+).)++([A-Za-z0-9-~\\\/])+$

It's prompted that it takes only 58 steps to complete the check.

The difference of one character causes the huge performance gap.

Something to Say

It's amazing how a small regex can make the CPU die. It also gives us a wake-up call when encountering regex, it should be paid attention to greedy mode and backtracking problems.

5 Comments

temp

You can also eliminate the catestrophic backtracing by simplifying the regexp:

  1. Initial regexp: 84 steps, ~293ms
    ^([hH][tT]{2}[pP]:\/\/|[hH][tT]{2}[pP][sS]:\/\/)(([A-Za-z0-9-~]+).)++([A-Za-z0-9-~%_\\\/])+$
  2. Use case insensitive match: 84 steps, ~297ms
    ^(ht{2}p:\/\/|ht{2}ps:\/\/)(([a-z0-9-~]+).)++([a-z0-9-~%_\\\/])+$
  3. Eliminate inital OR clause case insensitive match: 85 steps, ~293ms
    ^(ht{2}ps?:\/\/)(([a-z0-9-~]+).)++([a-z0-9-~%_\\\/])+$
  4. Eliminate inital OR clause: 85 steps, ~285ms
    ^(ht{2}ps?:\/\/)(([a-z0-9-~]+).)++([a-z0-9-~%_\\\/])+$
  5. Eliminate "++" clause (no more catastrophic backtracing): 85 steps, ~261ms
    ^(ht{2}ps?:\/\/)(([a-z0-9-~]+).)+([a-z0-9-~%_\\\/])+$

In the end, we have a regexp that has more steps, but that is

  • Easier to read
  • Easier to maintain due to less complexity (no OR, no need to handle casing)
  • More portable (some regexp implementations do not support "++")
  • ~13% faster.

Maybe don't roll your own validation. Sounds like a soup.

Is that "." really supposed to be written like that (to mean any character)? Or should it be "\."?

Very insightful. I stop when i have a reserved working. Nie ilI' be conscious of tuning. Thanks for waking me up!

Simon is right, the problem here is that your period should be escaped. The catastophic backtracking here is caused because you have a "+" pattern inside another "+" pattern with no delimiter. The regex engine is trying every combination of groups to try to get past the _ and % characters. You can see this on really simple regexes, like ^(\w+)+$ - it only takes 15 or so \w characters followed by a non-\w character to see that fail hard.

Once the period is escaped, you'll find that it doesn't match your url anyway, because the question mark and equals sign are unaccounted for. Mridang is right - use a library.