TokenizerUtil

Introduction

Many Chinese tokenizer (word segmentation) libraries are used in search engines and natural language processing, and each has its own API. Although plugins exist that adapt them to Lucene and Elasticsearch, switching from one library to another still means learning a new API.

Hutool unifies the interfaces of the common Chinese tokenizer libraries behind one set of abstractions that hides the differences between them, so the same code keeps working when the underlying library is swapped.

The engines that Hutool currently encapsulates include Ansj and HanLP, among others.

Note: This tool and module are supported starting from Hutool-4.4.0.

Principle

Similar to the idea of a Java logging facade, Hutool abstracts tokenization into three concepts:

  • TokenizerEngine: The tokenizer engine, which wraps the underlying tokenizer library object.
  • Result: The tokenization result interface, which abstracts the output of tokenizing a text. It implements both the Iterator and Iterable interfaces so the segmented words can be traversed.
  • Word: Represents one word in the segmented text and exposes information such as the word text, start position, and end position.

Because every supported library is wrapped behind these three interfaces, users can ignore the differences between tokenizer libraries and tokenize text the same way with any of them.
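The three concepts above can be sketched with a few lines of plain Java. This is a simplified illustration, not Hutool's source: the interface names come from this page, while the method names (`getText`, `getStartOffset`, `getEndOffset`, `parse`) and the toy `WhitespaceEngine` are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class TokenizerFacadeSketch {

    /** One segmented word plus its position in the source text (assumed method names). */
    public interface Word {
        String getText();
        int getStartOffset();
        int getEndOffset();
    }

    /** A tokenization result that is both an Iterator and an Iterable of words. */
    public interface Result extends Iterator<Word>, Iterable<Word> {
        @Override
        default Iterator<Word> iterator() {
            return this;
        }
    }

    /** Wraps one concrete tokenizer library behind a common entry point. */
    public interface TokenizerEngine {
        Result parse(CharSequence text);
    }

    /** Simple immutable Word implementation used by the toy engine. */
    public static class SimpleWord implements Word {
        private final String text;
        private final int start;
        private final int end;

        public SimpleWord(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override public String getText() { return text; }
        @Override public int getStartOffset() { return start; }
        @Override public int getEndOffset() { return end; }
    }

    /** Hypothetical toy engine: splits on spaces instead of calling a real library. */
    public static class WhitespaceEngine implements TokenizerEngine {
        @Override
        public Result parse(CharSequence text) {
            String s = text.toString();
            List<Word> words = new ArrayList<>();
            int start = 0;
            while (start < s.length()) {
                int end = s.indexOf(' ', start);
                if (end < 0) {
                    end = s.length();
                }
                if (end > start) { // skip empty tokens produced by repeated spaces
                    words.add(new SimpleWord(s.substring(start, end), start, end));
                }
                start = end + 1;
            }
            final Iterator<Word> it = words.iterator();
            return new Result() {
                @Override public boolean hasNext() { return it.hasNext(); }
                @Override public Word next() { return it.next(); }
            };
        }
    }

    /** Caller code depends only on the interfaces, never on a concrete library. */
    public static String join(TokenizerEngine engine, String text, String separator) {
        StringBuilder sb = new StringBuilder();
        for (Word word : engine.parse(text)) {
            if (sb.length() > 0) {
                sb.append(separator);
            }
            sb.append(word.getText());
        }
        return sb.toString();
    }
}
```

Only `join` represents user code here. Replacing `WhitespaceEngine` with an engine backed by a real tokenizer library would not require any change to it, which is the point of the facade.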

Through the TokenizerFactory, Hutool also automatically selects which library to use, based on which tokenizer library jar the user has added to the project.
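One common way to implement this kind of jar-based selection is to probe the classpath for a class from each supported library and use the first one that loads. The sketch below shows that general technique only; it is not Hutool's actual TokenizerFactory code, and the class `EngineDetectionSketch` and its methods are hypothetical names.

```java
import java.util.List;
import java.util.Optional;

public class EngineDetectionSketch {

    /** Returns the first candidate class name that can be loaded, in list order. */
    public static Optional<String> firstAvailable(List<String> candidateClassNames) {
        for (String name : candidateClassNames) {
            if (isPresent(name)) {
                return Optional.of(name);
            }
        }
        return Optional.empty();
    }

    /** Checks whether a class is on the classpath without initializing it. */
    public static boolean isPresent(String className) {
        try {
            Class.forName(className, false, EngineDetectionSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // The library jar (or one of its dependencies) is not available
            return false;
        }
    }
}
```

A factory built this way would pass in one marker class per supported library, in its preferred order, and then instantiate the engine that matches the first hit.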

Usage

Parsing text and segmenting words

// Automatically select the engine based on the user's introduced tokenizer library jar
TokenizerEngine engine = TokenizerUtil.createEngine();

// Parse the text
String text = "这两个方法的区别在于返回值";
Result result = engine.parse(text);
// Output: 这 两个 方法 的 区别 在于 返回 值
String resultStr = CollUtil.join((Iterator<Word>)result, " ");

When you add Ansj to your project, parsing is automatically routed to Ansj's library; when you add HanLP, it is routed to HanLP; and so on. In other words, with Hutool the code stays the same regardless of which tokenizer library you use.

Custom tokenizer engine

To use a specific engine instead of relying on automatic selection, instantiate it directly. Here’s an example using HanLP:

TokenizerEngine engine = new HanLPEngine(); 
// Parse the text 
String text = "这两个方法的区别在于返回值"; 
Result result = engine.parse(text); 
// Output: 这 两个 方法 的 区别 在于 返回 值 
String resultStr = CollUtil.join((Iterator<Word>)result, " ");