Comparison of HTML parsers
HTML parsers are software for automated parsing of Hypertext Markup Language (HTML). They serve two main purposes:
- HTML traversal: offering programmers an interface to access and modify the HTML source code. Canonical example: DOM parsers.
- HTML cleaning: fixing invalid HTML and improving the layout and indent style of the resulting markup. Canonical example: HTML Tidy.
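As a rough illustration of the traversal purpose, the sketch below uses html.parser, the Python standard-library parser listed in the table, to pull link targets out of a markup string. Note that html.parser is event-based rather than a DOM parser, and that the names LinkCollector and collect_links are hypothetical, chosen only for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs, where value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(markup):
    """Hypothetical helper: return all link targets found in markup."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links
```

For example, calling collect_links on a fragment containing two anchors returns their two href values in document order. A DOM-style library such as Beautiful Soup or jsoup would instead build a tree that can be queried and modified after parsing.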
Parser | License | Implementation language(s) | Latest date* | HTML parsing[1] | HTML5-compliant parsing | Clean HTML** | Update HTML*** |
---|---|---|---|---|---|---|---|
Lambda Soup | BSD-2-Clause | OCaml | 2016-12-10[2] | Yes | Yes | ? | ? |
html.parser | Python S. F. L. | Python | 2016-06-27[3] | Yes | ? | No | No |
Html Agility Pack | Microsoft Public License | C# | 2016-07-14[4] | Yes | ? | No | ? |
Beautiful Soup | Python S. F. L. | Python | 2016-08-02[5] | Yes | Partial[6] | Yes | Yes |
Gumbo | Apache License 2.0 | C | 2015-05-01 | Yes | Yes | ? | ? |
html5ever | Apache License 2.0 | Rust | 2016-02-23 | Yes | Yes | ? | ? |
html5lib | MIT License | Python (and PHP, six years ago) | 2016-07-15[7] | Yes | Yes | Yes | No |
HTML::Parser | Perl license | Perl | 2013-03-28 | Yes | No[8] | ? | ? |
WebGear | GPL3 | Perl | 2017-03-10 | Yes | Yes | ? | ? |
htmlPurifier | GNU Lesser GPL | PHP | 2009-03-25[9] | No | No | Yes | Yes |
HTML Tidy | W3C license | ANSI C | 2017-03-01[10] | Yes[11] | Yes | Yes[11] | Yes |
HtmlUnit | Apache License 2.0 | Java | 2016-05-27[12] | Yes | ? | No | No |
HtmlCleaner | BSD License[13] | Java | 2015-08-24 | No | No | Yes | ? |
Hubbub | MIT License | C | 2016-02-16 | Yes | Yes[14] | ? | ? |
Jaunt API | Jaunt Beta License | Java | 2013-08-01 | Yes | ? | Yes | No |
Jericho HTML Parser | Eclipse Public License | Java | 2015-10-24[15] | Yes | ? | ? | ? |
jsdom | MIT license | JavaScript | 2018-08-19 | Yes | Yes | ? | ? |
jsoup | MIT license | Java | 2018-04-15[16] | Yes | Yes[17] | Yes | Yes |
JTidy | JTidy License | Java | 2012-10-09[18] | No | ? | Yes | ? |
libxml2 HTMLparser | MIT License | C | 2012-09-11[19] | Yes | No | ? | ? |
NekoHTML | Apache License 2.0 | Java | 2014-06-02[20] | Yes | ? | ? | ? |
TagSoup | Apache License 2.0 | Java | 2011-07-07 | No | ? | ? | ? |
Validator.nu HTML Parser | MIT License | Java | 2012-06-05 | Yes | Yes | ? | ? |
PHP Simple HTML DOM Parser | MIT License | PHP | 2014-08-28 | Yes | ? | No | No |
The PHP DOMDocument-class | PHP License | PHP | 2014-10-04 | Yes | ? | No | No |
Nokogiri | MIT License | Ruby | 2016-10-03[21] | Yes | ? | No | No |
AVHTML | AGPL | C++ | 2015-08-27[22] | Yes | ? | No | Yes |
BrilliantHTML5Parser | Apache License 2.0 | Swift 3 | 2016-11-10 | Yes | ? | No | No |
MyHTML | LGPL | C | 2018-09-06 | Yes | Yes | No | No |
Aspose.HTML | Proprietary | C# | 2018-06-06 | Yes | Yes | ? | ? |
Lexbor | Apache License 2.0 | C | - | Yes | Yes | No | No |
- * Date of the latest release with significant changes.
- ** Sanitizes (generates standards-compliant web pages, reduces spam, etc.) and cleans (strips out surplus presentational tags, removes XSS code, etc.) HTML code.
- *** Updates HTML 4.x to XHTML or to HTML5, converting deprecated tags (e.g. CENTER) to valid ones (e.g. DIV with style="text-align:center;").
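The kind of update described in the last note can be sketched with Python's standard html.parser by re-emitting the markup and swapping CENTER for a styled DIV. This is a minimal sketch under stated assumptions, not how any parser in the table implements the feature: the names CenterToDiv and update_center are hypothetical, and any attributes on the CENTER tag are simply dropped.

```python
from html.parser import HTMLParser


class CenterToDiv(HTMLParser):
    """Re-emits markup, replacing <center> with a styled <div>."""

    def __init__(self):
        # Keep character references unconverted so they can be echoed back.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "center":
            # Assumption: attributes on <center> are discarded.
            self.out.append('<div style="text-align:center;">')
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are echoed once, unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</div>" if tag == "center" else f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def update_center(markup):
    """Hypothetical helper: return markup with CENTER converted to DIV."""
    parser = CenterToDiv()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Because the parser reports tag names in lowercase, both CENTER and center are converted, and end tags are written back in lowercase. Dedicated tools such as HTML Tidy handle many more deprecated constructs than this single substitution.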
References
- ↑ 12.2 Parsing HTML documents — HTML Standard Archived 2013-01-16 at the Wayback Machine.
- ↑ Lambda Soup 0.6.1
- ↑ Python 3.5.2
- ↑ Nuget Html AgilityPack
- ↑ Beautiful Soup 4.5.1
- ↑ via html5lib
- ↑ Releases · html5lib/html5lib-python
- ↑ Bug #53300 for HTML-Parser: HTML 5
- ↑ HTML Tidy for Windows
- ↑ HTML Tidy release 5.4.0
- ↑ What is Tidy?
- ↑ HtmlUnit Release 2.22 Changes
- ↑ HtmlCleaner is distributed under BSD License
- ↑ according to project's home page
- ↑ Jericho HTML Parser - Browse /jericho-html/3.4 at SourceForge.net
- ↑ jsoup release 1.11.3
- ↑ Per project homepage: https://jsoup.org/
- ↑ JTidy - Browse /JTidy at SourceForge.net
- ↑ libxml2 Releases
- ↑ NekoHTML | Change History
- ↑ Nokogiri release 1.6.8.1
- ↑ Latest commit 8c0d99f on 27 Aug 2015
This article is issued from Wikipedia.
The text is licensed under Creative Commons - Attribution - Sharealike.
Additional terms may apply for the media files.