Comparison of HTML parsers
Revision as of 10:21, 21 November 2019
HTML parsers are software for automated parsing of Hypertext Markup Language (HTML). They serve two main purposes:
- HTML traversal: offering programmers an interface to access and modify the parsed HTML source. Canonical example: DOM parsers.
- HTML cleaning: fixing invalid HTML and improving the layout and indentation of the resulting markup. Canonical example: HTML Tidy.
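The traversal use case can be illustrated with a minimal sketch based on Python's standard-library `html.parser`, one of the parsers listed in the table. Note that `html.parser` is event-driven rather than a DOM parser, so this only approximates the idea; the class name `LinkCollector` and the helper `collect_links` are hypothetical names chosen for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> element seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs, where value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(markup):
    """Hypothetical helper: return all link targets found in markup."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links
```

A DOM-style parser would instead build a tree that can be queried and modified after parsing; the event-driven form is shown here only because it needs no third-party package.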
Parser | License | Implementation language(s) | Latest date* | HTML parsing[1] | HTML5-compliant parsing | Clean HTML** | Update HTML*** |
---|---|---|---|---|---|---|---|
Lambda Soup | BSD-2-Clause | OCaml | 2016-12-10[2] | Yes | Yes | ? | ? |
html.parser | Python S. F. L. | Python | 2016-06-27[3] | Yes | ? | No | No |
Html Agility Pack | MIT License | C# | 2019-07-07[4] | Yes | ? | No | ? |
HTML Monkey | Microsoft Public License | C# | 2018-12-14 | Yes | ? | ? | ? |
Beautiful Soup | Python S. F. L. | Python | 2019-01-07[5] | Yes | Partial[6] | Yes | Yes |
Gumbo | Apache License 2.0 | C | 2015-05-01 | Yes | Yes | ? | ? |
html5ever | Apache License 2.0 | Rust | 2016-02-23 | Yes | Yes | ? | ? |
html5lib | MIT License | Python (and formerly PHP) | 2016-07-15[7] | Yes | Yes | Yes | No |
HTML::Parser | Perl license | Perl | 2013-03-28 | Yes | No[8] | ? | ? |
WebGear | GPLv3 | Perl | 2017-03-10 | Yes | Yes | ? | ? |
HTML Purifier | GNU Lesser GPL | PHP | 2019-07-14[9] | No | No | Yes | Yes |
HTML Tidy | W3C license | ANSI C | 2017-03-01[10] | Yes[11] | Yes | Yes[11] | Yes |
HtmlUnit | Apache License 2.0 | Java | 2019-08-24[12] | Yes | ? | No | No |
HtmlCleaner | BSD License[13] | Java | 2015-08-24 | No | No | Yes | ? |
Hubbub | MIT License | C | 2016-02-16 | Yes | Yes[14] | ? | ? |
Jaunt API | Jaunt Beta License | Java | 2013-08-01 | Yes | ? | Yes | No |
Jericho HTML Parser | Eclipse Public License | Java | 2015-10-24[15] | Yes | ? | ? | ? |
jsdom | MIT License | JavaScript | 2018-08-19 | Yes | Yes | ? | ? |
jsoup | MIT License | Java | 2019-05-12[16] | Yes | Yes[17] | Yes | Yes |
JTidy | JTidy License | Java | 2012-10-09[18] | No | ? | Yes | ? |
libxml2 HTMLparser | MIT License | C | 2017-11-02[19] | Yes | No | ? | ? |
NekoHTML | Apache License 2.0 | Java | 2014-06-02[20] | Yes | ? | ? | ? |
TagSoup | Apache License 2.0 | Java | 2011-07-07 | No | ? | ? | ? |
Validator.nu HTML Parser | MIT License | Java | 2012-06-05 | Yes | Yes | ? | ? |
PHP Simple HTML DOM Parser | MIT License | PHP | 2014-08-28 | Yes | ? | No | No |
PHP DOMDocument class | PHP License | PHP | 2014-10-04 | Yes | ? | No | No |
Nokogiri | MIT License | Ruby | 2016-10-03[21] | Yes | ? | No | No |
AVHTML | AGPL | C++ | 2015-08-27[22] | Yes | ? | No | Yes |
BrilliantHTML5Parser | Apache License 2.0 | Swift 3 | 2016-11-10 | Yes | ? | No | No |
MyHTML | LGPL | C | 2018-09-06 | Yes | Yes | No | No |
Aspose.HTML | Proprietary | C# | 2018-06-06 | Yes | Yes | ? | ? |
Lexbor (HTML module) | Apache License 2.0 | C | 2019-11-18 | Yes | Yes | No | No |
tooska | LGPL | C++ | 2019-06-29 | Yes | ? | ? | ? |
- * Date of the latest release with significant changes.
- ** Sanitizes HTML (generates a standards-compliant web page, reduces spam, etc.) and cleans it (strips surplus presentational tags, removes XSS code, etc.).
- *** Updates HTML 4.x to XHTML or HTML5, converting deprecated tags (e.g. CENTER) to valid ones (e.g. DIV with style="text-align:center;").
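The tag conversion described in the last note can be sketched as follows, again using Python's standard-library `html.parser`. This is a simplified illustration under stated assumptions, not how any parser in the table implements its "Update HTML" feature: the names `CenterRewriter` and `rewrite_center` are hypothetical, attributes on the original CENTER tag are dropped, and end tags are re-emitted in lowercase.

```python
from html.parser import HTMLParser


class CenterRewriter(HTMLParser):
    """Re-emit markup, replacing <center> with a styled <div>."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "center":
            # Assumption: attributes on <center> are not carried over.
            self.out.append('<div style="text-align:center;">')
        else:
            # Copy every other start tag through exactly as written.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</div>" if tag == "center" else f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def rewrite_center(markup):
    """Hypothetical helper: return markup with CENTER converted to DIV."""
    rewriter = CenterRewriter()
    rewriter.feed(markup)
    rewriter.close()
    return "".join(rewriter.out)
```

For example, `rewrite_center('<CENTER><b>Hi</b></CENTER>')` yields `<div style="text-align:center;"><b>Hi</b></div>`. A production tool would also need to handle malformed nesting, which this sketch copies through unchanged.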
References
1. 12.2 Parsing HTML documents, HTML Standard. Archived 2013-01-16 at the Wayback Machine.
2. Lambda Soup 0.6.1
3. Python 3.5.2
4. Nuget Html AgilityPack
5. Beautiful Soup 4.7.1
6. via html5lib
7. Releases · html5lib/html5lib-python
8. Bug #53300 for HTML-Parser: HTML 5
9. HTML Purifier
10. HTML Tidy release 5.4.0
11. What is Tidy?
12. HtmlUnit Release 2.36.0
13. HtmlCleaner is distributed under BSD License
14. According to the project's home page
15. Jericho HTML Parser - Browse /jericho-html/3.4 at SourceForge.net
16. jsoup release 1.12.1
17. Per project homepage, https://jsoup.org/
18. JTidy - Browse /JTidy at SourceForge.net
19. libxml2 Releases
20. NekoHTML | Change History
21. Nokogiri release 1.6.8.1
22. Latest commit 8c0d99f on 27 Aug 2015