HtmlUtil

Origin

Hutool provides this utility class for common HTML-processing tasks on the content returned by HTTP requests.

For example, when a crawler fetches HTML pages, the returned HTML usually needs further processing: removing specific tags (such as advertising bars), removing JS, removing styles, and so on. HtmlUtil can perform all of these operations.

Methods

HtmlUtil.escape

Escapes special HTML characters, including:

  1. ' replaced with &#039;
  2. " replaced with &quot;
  3. & replaced with &amp;
  4. < replaced with &lt;
  5. > replaced with &gt;
String html = "<html><body>123'123'</body></html>";
// Result: &lt;html&gt;&lt;body&gt;123&#039;123&#039;&lt;/body&gt;&lt;/html&gt;
String escape = HtmlUtil.escape(html);
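
To make the five replacements concrete, here is a minimal, self-contained sketch of the same mapping in plain Java. It is not Hutool's implementation; the class name `HtmlEscapeSketch` and its `escape` method are hypothetical names used only for illustration.

```java
// Hypothetical illustration class; not part of Hutool.
final class HtmlEscapeSketch {
    private HtmlEscapeSketch() {}

    /** Replaces ', ", &, < and > with the entities listed in this section. */
    static String escape(String text) {
        if (text == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '\'': sb.append("&#039;"); break;
                case '"':  sb.append("&quot;"); break;
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Walking the input one character at a time means each character is escaped exactly once, so an `&` produced by an earlier replacement is never escaped again.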

HtmlUtil.unescape

Restores escaped HTML special characters.

String escape = "&lt;html&gt;&lt;body&gt;123&#039;123&#039;&lt;/body&gt;&lt;/html&gt;";
// Result: <html><body>123'123'</body></html>
String unescape = HtmlUtil.unescape(escape);
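
The reverse direction can be sketched in the same way. The snippet below is an assumption-laden illustration rather than Hutool's code: `HtmlUnescapeSketch` is a hypothetical class, and it only handles the five entities produced by `escape`, whereas a real unescaper typically recognises many more.

```java
// Hypothetical illustration class; not part of Hutool.
final class HtmlUnescapeSketch {
    private HtmlUnescapeSketch() {}

    /** Restores the five entities produced by escaping. */
    static String unescape(String text) {
        if (text == null) {
            return null;
        }
        return text
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&#039;", "'")
                .replace("&quot;", "\"")
                // "&amp;" goes last so that "&amp;lt;" becomes "&lt;" and is not unescaped twice.
                .replace("&amp;", "&");
    }
}
```

The ordering matters: handling `&amp;` first would turn `&amp;lt;` into `&lt;` and then into `<`, which is not what the original text contained.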

HtmlUtil.removeHtmlTag

Removes specified HTML tags and the content surrounded by the tags.

String str = "pre<img src=\"xxx/dfdsfds/test.jpg\">";
// Result: pre
String result = HtmlUtil.removeHtmlTag(str, "img");

HtmlUtil.cleanHtmlTag

Clears all HTML tags but retains the content within the tags.

String str = "pre<div class=\"test_div\">\r\n\t\tdfdsfdsfdsf\r\n</div><div class=\"test_div\">BBBB</div>";
// Result: pre\r\n\t\tdfdsfdsfdsf\r\nBBBB
String result = HtmlUtil.cleanHtmlTag(str);
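
Tag stripping of this kind is commonly done with a regular expression. The following is a rough sketch of the idea only, not Hutool's actual pattern; `TagCleanSketch` and `cleanTags` are hypothetical names. It assumes attribute values contain no `<` or `>` characters.

```java
// Hypothetical illustration class; not part of Hutool.
final class TagCleanSketch {
    private TagCleanSketch() {}

    /**
     * Deletes anything that looks like an opening or closing tag
     * ("<" or "</" followed by a letter, up to the next ">") and keeps all other text.
     */
    static String cleanTags(String html) {
        return html.replaceAll("</?[a-zA-Z][^<>]*>", "");
    }
}
```

Requiring a letter right after `<` or `</` leaves text such as `a < b` untouched. For untrusted or heavily malformed markup, a real HTML parser is the safer choice.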

HtmlUtil.unwrapHtmlTag

Removes the specified HTML tags but keeps the content they enclose.

String str = "pre<div class=\"test_div\">abc</div>";
// Result: preabc
String result = HtmlUtil.unwrapHtmlTag(str, "div");
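
The difference between removeHtmlTag (tag and content go) and unwrapHtmlTag (only the tag goes) can be sketched with one regex-based helper. This is an illustration under stated assumptions, not Hutool's implementation: `TagRemovalSketch`, `removeTag`, and the `withContent` flag are hypothetical, and nested tags of the same name are not handled.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Hutool.
final class TagRemovalSketch {
    private TagRemovalSketch() {}

    /**
     * Removes every occurrence of the given tag.
     * withContent = true  : also drops everything up to the matching close tag.
     * withContent = false : drops only the opening and closing tags.
     */
    static String removeTag(String html, String tagName, boolean withContent) {
        String tag = Pattern.quote(tagName);
        String open = "<" + tag + "(\\s[^>]*)?/?>";   // <img ...>, <br/>, <div class="x">
        String close = "</" + tag + "\\s*>";           // </div>
        String regex = withContent
                ? open + "(.*?" + close + ")?"         // the close part is optional for void tags
                : open + "|" + close;
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL)
                .matcher(html)
                .replaceAll("");
    }
}
```

With the inputs used in this document, `removeTag("pre<div class=\"test_div\">abc</div>", "div", false)` yields `preabc`, while passing `true` yields `pre`.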

HtmlUtil.removeHtmlAttr

Removes the specified attributes from HTML tags. If several tags carry the same attribute, it is removed from all of them.

String html = "<div class=\"test_div\"></div><span class=\"test_div\"></span>";
// Result: <div></div><span></span>
String result = HtmlUtil.removeHtmlAttr(html, "class");
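
One way to picture attribute removal is to visit each opening tag and delete the matching `name=value` pair inside it. The sketch below is an assumption-based illustration and not Hutool's code; `AttrRemovalSketch` and `removeAttr` are hypothetical names, and attribute values are assumed not to contain `<` or `>`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Hutool.
final class AttrRemovalSketch {
    // Opening tags only: "<" followed by a letter, up to the next ">".
    private static final Pattern TAG = Pattern.compile("<[a-zA-Z][^<>]*>");

    private AttrRemovalSketch() {}

    /** Removes the named attribute from every opening tag; text outside tags is left alone. */
    static String removeAttr(String html, String attrName) {
        Pattern attr = Pattern.compile(
                "\\s+" + Pattern.quote(attrName) + "\\s*=\\s*(\"[^\"]*\"|'[^']*'|[^\\s>]+)",
                Pattern.CASE_INSENSITIVE);
        Matcher m = TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String cleaned = attr.matcher(m.group()).replaceAll("");
            m.appendReplacement(out, Matcher.quoteReplacement(cleaned));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Restricting the replacement to the text of each tag keeps ordinary page content such as `class="b"` written in a paragraph from being deleted by accident.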

HtmlUtil.removeAllHtmlAttr

Removes all attributes of specified tags.

String html = "<div class=\"test_div\" width=\"120\"></div>";
// Result: <div></div>
String result = HtmlUtil.removeAllHtmlAttr(html, "div");
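
Stripping every attribute from one kind of tag amounts to rewriting each opening tag to its bare form. The following is a sketch of that idea only, with hypothetical names (`AllAttrRemovalSketch`, `removeAllAttrs`); it is not taken from Hutool and assumes attribute values contain no `<` or `>`.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Hutool.
final class AllAttrRemovalSketch {
    private AllAttrRemovalSketch() {}

    /** Rewrites <tag a="1" b="2"> to <tag>, keeping a trailing "/" on self-closing tags. */
    static String removeAllAttrs(String html, String tagName) {
        Pattern p = Pattern.compile(
                "<(" + Pattern.quote(tagName) + ")\\s[^<>]*?(/?)>",
                Pattern.CASE_INSENSITIVE);
        // $1 is the tag name as written in the input, $2 is "/" or empty.
        return p.matcher(html).replaceAll("<$1$2>");
    }
}
```

Tags with other names are not matched, so `removeAllAttrs(html, "div")` leaves the attributes of a `<span>` in place.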

HtmlUtil.filter

Filters HTML text to guard against XSS attacks.

String html = "<alert></alert>";
// Result: ""
String filter = HtmlUtil.filter(html);