HtmlUtil
Origin
Hutool provides this utility class for HTML-related processing of the content returned by HTTP requests.
For example, when a crawler fetches HTML pages, the returned content usually needs processing: removing specified tags (such as advertising bars), removing JavaScript, removing styles, and so on. HtmlUtil can perform these operations.
Methods
HtmlUtil.escape
Escapes special HTML characters, including:
' is replaced with &#039;
" is replaced with &quot;
& is replaced with &amp;
< is replaced with &lt;
> is replaced with &gt;
String html = "<html><body>123'123'</body></html>";
// Result: &lt;html&gt;&lt;body&gt;123&#039;123&#039;&lt;/body&gt;&lt;/html&gt;
String escape = HtmlUtil.escape(html);
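The character-to-entity mapping can be sketched with plain JDK code. This is only an illustration of the replacements, not Hutool's implementation; the class name EscapeSketch and its escape method are hypothetical.

```java
// JDK-only sketch of the five replacements (hypothetical helper, not Hutool code).
final class EscapeSketch {

    /** Replaces ', ", &, < and > with their HTML entities in a single pass. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '\'': sb.append("&#039;"); break;
                case '"':  sb.append("&quot;"); break;
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints: &lt;html&gt;&lt;body&gt;123&#039;123&#039;&lt;/body&gt;&lt;/html&gt;
        System.out.println(escape("<html><body>123'123'</body></html>"));
    }
}
```

Working character by character means each input character is examined exactly once, so an ampersand produced by one replacement is never escaped a second time.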
HtmlUtil.unescape
Restores escaped HTML special characters.
String escape = "&lt;html&gt;&lt;body&gt;123&#039;123&#039;&lt;/body&gt;&lt;/html&gt;";
// Result: <html><body>123'123'</body></html>
String unescape = HtmlUtil.unescape(escape);
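The reverse direction can be sketched in the same way. The hypothetical UnescapeSketch below handles only the five entities produced by escaping; it is an assumption-laden illustration, and the real HtmlUtil.unescape may recognise more entities than these.

```java
// JDK-only sketch of restoring the five escaped characters (hypothetical helper).
final class UnescapeSketch {

    /** Turns &#039;, &quot;, &lt;, &gt; and &amp; back into the literal characters. */
    static String unescape(String text) {
        return text
                .replace("&#039;", "'")
                .replace("&quot;", "\"")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                // &amp; goes last so that "&amp;lt;" becomes "&lt;" rather than "<".
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        // Prints: <html><body>123'123'</body></html>
        System.out.println(unescape(
                "&lt;html&gt;&lt;body&gt;123&#039;123&#039;&lt;/body&gt;&lt;/html&gt;"));
    }
}
```

The ordering matters: replacing &amp; first would turn text that was escaped twice into live markup after a single unescape.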
HtmlUtil.removeHtmlTag
Removes the specified HTML tags together with the content they enclose.
String str = "pre<img src=\"xxx/dfdsfds/test.jpg\">";
// Result: pre
String result = HtmlUtil.removeHtmlTag(str, "img");
HtmlUtil.cleanHtmlTag
Clears all HTML tags but retains the content within the tags.
String str = "pre<div class=\"test_div\">\r\n\t\tdfdsfdsfdsf\r\n</div><div class=\"test_div\">BBBB</div>";
// Result: pre\r\n\t\tdfdsfdsfdsf\r\nBBBB
String result = HtmlUtil.cleanHtmlTag(str);
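As a rough illustration of what "clearing all tags" means, the hypothetical CleanTagSketch below strips anything that looks like a tag using one simple regular expression. It is a naive sketch, not the pattern Hutool uses, and it assumes that no attribute value contains a literal > character.

```java
import java.util.regex.Pattern;

// Naive JDK-only sketch of stripping every tag while keeping the text (hypothetical helper).
final class CleanTagSketch {

    // "<", then one or more characters that are not ">", then ">".
    // The negated character class also matches line breaks inside a tag.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    /** Removes every tag-like span and leaves the surrounding text untouched. */
    static String stripTags(String html) {
        return ANY_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String str = "pre<div class=\"test_div\">abc</div><div class=\"test_div\">BBBB</div>";
        // Prints: preabcBBBB
        System.out.println(stripTags(str));
    }
}
```

Whitespace between the tags, such as the \r\n\t\t in the example above, is text content and therefore survives the stripping.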
HtmlUtil.unwrapHtmlTag
Removes the specified HTML tags but keeps the content they enclose.
String str = "pre<div class=\"test_div\">abc</div>";
// Result: preabc
String result = HtmlUtil.unwrapHtmlTag(str, "div");
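The difference between removing a tag and unwrapping it can be sketched side by side. The hypothetical TagRemovalSketch below uses simple regular expressions and is not Hutool's implementation; it assumes the named tag is not nested inside itself.

```java
import java.util.regex.Pattern;

// JDK-only sketch contrasting "remove" and "unwrap" (hypothetical helpers).
final class TagRemovalSketch {

    /** Drops the tag and everything it encloses; also handles standalone tags such as <img>. */
    static String removeTag(String html, String tag) {
        String t = Pattern.quote(tag);
        // Paired form: <tag ...> ... </tag>, matched lazily and across line breaks.
        String paired = "(?is)<" + t + "\\b[^>]*>.*?</" + t + "\\s*>";
        // Leftover standalone or self-closing form: <tag ...> or <tag ... />.
        String single = "(?i)<" + t + "\\b[^>]*>";
        return html.replaceAll(paired, "").replaceAll(single, "");
    }

    /** Drops only the opening and closing tags and keeps the enclosed content. */
    static String unwrapTag(String html, String tag) {
        String t = Pattern.quote(tag);
        return html.replaceAll("(?i)</?" + t + "\\b[^>]*>", "");
    }

    public static void main(String[] args) {
        String str = "pre<div class=\"test_div\">abc</div>";
        System.out.println(removeTag(str, "div")); // Prints: pre
        System.out.println(unwrapTag(str, "div")); // Prints: preabc
    }
}
```

The \b word boundary keeps a request for "div" from also matching a longer tag name that merely starts with the same letters.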
HtmlUtil.removeHtmlAttr
Removes the specified attributes from HTML tags. If several tags carry the same attribute, it is removed from all of them.
String html = "<div class=\"test_div\"></div><span class=\"test_div\"></span>";
// Result: <div></div><span></span>
String result = HtmlUtil.removeHtmlAttr(html, "class");
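Attribute removal can be approximated with one regular expression per attribute name. The hypothetical AttrRemovalSketch below is a simplified sketch rather than Hutool's own pattern; it assumes attribute values are double-quoted, single-quoted, or unquoted without whitespace.

```java
import java.util.regex.Pattern;

// JDK-only sketch of deleting one attribute from every tag (hypothetical helper).
final class AttrRemovalSketch {

    /** Removes name="value", name='value' or name=value wherever it appears. */
    static String removeAttr(String html, String attr) {
        // Leading whitespace is consumed too, so "<div class=\"x\">" becomes "<div>".
        String pattern = "(?i)\\s+" + Pattern.quote(attr)
                + "\\s*=\\s*(\"[^\"]*\"|'[^']*'|[^\\s>]+)";
        return html.replaceAll(pattern, "");
    }

    public static void main(String[] args) {
        String html = "<div class=\"test_div\"></div><span class=\"test_div\"></span>";
        // Prints: <div></div><span></span>
        System.out.println(removeAttr(html, "class"));
    }
}
```

Requiring whitespace before the attribute name prevents a request for "class" from also hitting a longer name such as data-class.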
HtmlUtil.removeAllHtmlAttr
Removes all attributes of specified tags.
String html = "<div class=\"test_div\" width=\"120\"></div>";
// Result: <div></div>
String result = HtmlUtil.removeAllHtmlAttr(html, "div");
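Stripping every attribute from one kind of tag is a small variation on the same idea. The hypothetical AllAttrRemovalSketch below is again a regex-based sketch, not Hutool's code, and it collapses a self-closing tag such as <br class="x"/> to <br>.

```java
import java.util.regex.Pattern;

// JDK-only sketch of removing all attributes from the opening tags of one element type.
final class AllAttrRemovalSketch {

    /** Rewrites every opening tag of the given name to its bare form. */
    static String removeAllAttrs(String html, String tag) {
        // Group 1 keeps the original "<tag" spelling, so upper-case tags stay upper-case.
        String pattern = "(?i)(<" + Pattern.quote(tag) + "\\b)[^>]*>";
        return html.replaceAll(pattern, "$1>");
    }

    public static void main(String[] args) {
        String html = "<div class=\"test_div\" width=\"120\"></div>";
        // Prints: <div></div>
        System.out.println(removeAllAttrs(html, "div"));
    }
}
```

Closing tags are left alone because the pattern requires the tag name to follow the < directly.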
HtmlUtil.filter
Filters HTML text to guard against XSS attacks.
String html = "<alert></alert>";
// Result: "" (empty string)
String filter = HtmlUtil.filter(html);