overview

Origin

In my first company, I was mainly responsible for content-related business, and most of my work involved cleaning and organizing the content. Of course, part of the cleaning process inevitably involved filtering based on keywords, you know. The requirements were as follows:

The backend staff would add N keywords, and then clean all the content on the main website. Any content containing these keywords would be set as invalid.

Approach

When I first received this requirement, my initial solution was relatively brute force: create a HashSet for the keywords, then iterate through the entire database, and for each article, iterate through the Set to check if it contains any keywords…Okay, I admit that this is not a good method. As the number of keywords and data increases, the time consumed by this process grows exponentially!

So I turned to Baidu and discovered an algorithm: DFA.

Introduction to DFA

DFA stands for Deterministic Finite Automaton, which is a deterministic finite automaton. Since my algorithm skills are not good, those who are interested can refer to this blog post: Analysis of DFA-based Sensitive Word Query Algorithm

In essence, the principle is not difficult to explain. It involves constructing a tree using all the keywords, and then traversing the tree using the text. If traversal reaches a leaf node, it indicates that the article contains this keyword.

Ignoring the time it takes to construct the keyword tree, each text search only requires O(n) complexity.

Hutool has organized and improved the DFA algorithm and some online implementations, eventually forming the current Hutool-dfa module.