Comparison of HTML parsers

HTML parsers are software for automated Hypertext Markup Language (HTML) parsing. They have two main purposes:

HTML traversal: offering an interface that lets programmers easily access and modify the HTML source. Canonical example: DOM parsers.
HTML cleaning: fixing invalid HTML and improving the layout and indent style of the resulting markup. Canonical example: HTML Tidy.
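
As an illustration of the traversal purpose, the following is a minimal sketch that uses Python's standard-library html.parser module, which is not one of the parsers compared below. The class name LinkCollector and the function collect_links are hypothetical names chosen for this example. The sketch is event-based rather than a DOM parser: it only reads link targets out of the markup and does not modify it.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: records the href value of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this hook;
        # attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(markup):
    """Return the href values found in markup, in document order."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links
```

Calling collect_links on a document returns a plain list of strings, so the caller never has to handle the raw markup itself.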

Parser	License	Implementation language(s)	Latest date*	HTML parsing[1]	HTML5-compliant parsing	Clean HTML**	Update HTML***
HTML Tidy	W3C license	ANSI C	2017-03-01[2]	Yes[3]	Yes	Yes[3]	Yes
HtmlUnit	Apache License 2.0	Java	2019-08-24[4]	Yes	?	No	No
libxml2 HTMLparser	MIT License	C	2017-11-02[5]	Yes	No	?	?

* Date of the latest release with significant changes.

** Sanitizes (generates a standards-compatible web page, reduces spam, etc.) and cleans (strips out surplus presentational tags, removes XSS code, etc.) HTML code.

*** Updates HTML 4.x to XHTML or to HTML5, converting deprecated tags (e.g. CENTER) to valid ones (e.g. DIV with style="text-align:center;").
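
The CENTER-to-DIV conversion mentioned in this note can be sketched as follows. This is a minimal illustration built on Python's standard-library html.parser module, not the method used by HTML Tidy or any other parser in the table. The names CenterUpdater and update_center are hypothetical. The sketch assumes reasonably well-formed input and re-emits end tags in lowercase.

```python
from html.parser import HTMLParser


class CenterUpdater(HTMLParser):
    """Hypothetical helper: re-emits markup, replacing <center> with a styled <div>."""

    def __init__(self):
        # Keep character references unconverted so they can be re-emitted as written.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "center":
            self.out.append('<div style="text-align:center;">')
        else:
            # Re-emit any other start tag exactly as it appeared in the source.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</div>" if tag == "center" else f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def update_center(markup):
    """Return markup with deprecated CENTER elements rewritten as DIV elements."""
    updater = CenterUpdater()
    updater.feed(markup)
    updater.close()
    return "".join(updater.out)
```

A production tool would also need to repair mismatched tags and handle the other deprecated elements; this sketch covers only the single conversion named in the note.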

References

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.

[1] "12.2 Parsing HTML documents", HTML Standard. Archived 2013-01-16 at the Wayback Machine.

[2] HTML Tidy release 5.4.0

[3] What is Tidy?

[4] HtmlUnit Release 2.36.0

[5] xml2 Releases