Comparison of HTML parsers

From Wikipedia, the free encyclopedia

Revision as of 10:21, 21 November 2019

HTML parsers are software that automatically parses Hypertext Markup Language (HTML). They serve two main purposes:

  • HTML traversal: offering an interface that lets programmers easily access and modify the parsed HTML. Canonical example: DOM parsers.
  • HTML cleaning: fixing invalid HTML and improving the layout and indentation of the resulting markup. Canonical example: HTML Tidy.
| Parser | License | Implementation language(s) | Latest date* | HTML parsing[1] | HTML5-compliant parsing | Clean HTML** | Update HTML*** |
|---|---|---|---|---|---|---|---|
| Lambda Soup | BSD-2-Clause | OCaml | 2016-12-10[2] | Yes | Yes | ? | ? |
| html.parser | Python S. F. L. | Python | 2016-06-27[3] | Yes | ? | No | No |
| Html Agility Pack | MIT License | C# | 2019-07-07[4] | Yes | ? | No | ? |
| HTML Monkey | Microsoft Public License | C# | 2018-12-14 | Yes | ? | ? | ? |
| Beautiful Soup | Python S. F. L. | Python | 2019-01-07[5] | Yes | Partial[6] | Yes | Yes |
| Gumbo | Apache License 2.0 | C | 2015-05-01 | Yes | Yes | ? | ? |
| html5ever | Apache License 2.0 | Rust | 2016-02-23 | Yes | Yes | ? | ? |
| html5lib | MIT License | Python (and PHP, six years ago) | 2016-07-15[7] | Yes | Yes | Yes | No |
| HTML::Parser | Perl license | Perl | 2013-03-28 | Yes | No[8] | ? | ? |
| WebGear | GPL3 | Perl | 2017-03-10 | Yes | Yes | ? | ? |
| htmlPurifier | GNU Lesser GPL | PHP | 2019-07-14[9] | No | No | Yes | Yes |
| HTML Tidy | W3C license | ANSI C | 2017-03-01[10] | Yes[11] | Yes | Yes[11] | Yes |
| HtmlUnit | Apache License 2.0 | Java | 2019-08-24[12] | Yes | ? | No | No |
| HtmlCleaner | BSD License[13] | Java | 2015-08-24 | No | No | Yes | ? |
| Hubbub | MIT License | C | 2016-02-16 | Yes | Yes[14] | ? | ? |
| Jaunt API | Jaunt Beta License | Java | 2013-08-01 | Yes | ? | Yes | No |
| Jericho HTML Parser | Eclipse Public License | Java | 2015-10-24[15] | Yes | ? | ? | ? |
| jsdom | MIT license | JavaScript | 2018-08-19 | Yes | Yes | ? | ? |
| jsoup | MIT license | Java | 2019-05-12[16] | Yes | Yes[17] | Yes | Yes |
| JTidy | JTidy License | Java | 2012-10-09[18] | No | ? | Yes | ? |
| libxml2 HTMLparser | MIT License | C | 2017-11-02[19] | Yes | No | ? | ? |
| NekoHTML | Apache License 2.0 | Java | 2014-06-02[20] | Yes | ? | ? | ? |
| TagSoup | Apache License 2.0 | Java | 2011-07-07 | No | ? | ? | ? |
| Validator.nu HTML Parser | MIT License | Java | 2012-06-05 | Yes | Yes | ? | ? |
| PHP Simple HTML DOM Parser | MIT License | PHP | 2014-08-28 | Yes | ? | No | No |
| The PHP DOMDocument-class | PHP License | PHP | 2014-10-04 | Yes | ? | No | No |
| Nokogiri | MIT License | Ruby | 2016-10-03[21] | Yes | ? | No | No |
| AVHTML | AGPL | C++ | 2015-08-27[22] | Yes | ? | No | Yes |
| BrilliantHTML5Parser | Apache License 2.0 | Swift 3 | 2016-11-10 | Yes | ? | No | No |
| MyHTML | LGPL | C | 2018-09-06 | Yes | Yes | No | No |
| Aspose.HTML | Proprietary | C# | 2018-06-06 | Yes | Yes | ? | ? |
| Lexbor (HTML module) | Apache License 2.0 | C | 2019-11-18 | Yes | Yes | No | No |
| tooska | LGPL | C++ | 2019-06-29 | Yes | ? | ? | ? |

* Date of the latest release (with significant changes).
** Sanitizes (generates a standards-compliant web page, reduces spam, etc.) and cleans (strips out surplus presentational tags, removes XSS code, etc.) HTML code.
*** Updates HTML 4.x to XHTML or to HTML5, converting deprecated tags (e.g. CENTER) to valid ones (e.g. DIV with style="text-align:center;").
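
The kind of update described in the last note can be sketched with Python's standard-library html.parser. This is a minimal illustration under stated assumptions, not how any parser in the table implements it: the names CenterUpdater and update_center are hypothetical, attributes on the original CENTER tag are dropped, and the rest of the markup is re-emitted largely as written.

```python
from html.parser import HTMLParser


class CenterUpdater(HTMLParser):
    """Re-emits markup, replacing <center> with a styled <div>."""

    def __init__(self):
        # convert_charrefs=False keeps references such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "center":
            # Assumption: any attributes on <center> are discarded.
            self.out.append('<div style="text-align:center;">')
        else:
            # Reuse the original text of the start tag unchanged.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through as written.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</div>" if tag == "center" else f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def update_center(markup):
    """Return markup with CENTER elements rewritten as centered DIVs."""
    updater = CenterUpdater()
    updater.feed(markup)
    updater.close()
    return "".join(updater.out)
```

For instance, update_center('<CENTER>Hi</CENTER>') yields '<div style="text-align:center;">Hi</div>'. A production tool would also need to handle malformed nesting and the other deprecated presentational elements, which this sketch does not attempt.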

References