Parsing HTML with PHP 8.4

A look at the new HTML5 parser, CSS selector support, and new DOM classes

Dec 09, 2024

PHP 8.4, released last month, brings three major improvements to HTML parsing, DOM traversal and manipulation:

A new HTML5 parser that accurately processes modern web content
Powerful CSS selector support for element retrieval
New DOM classes that better align with the DOM spec

For developers working with web scraping, content extraction, or HTML transformation, these are significant improvements in functionality and performance.

These features haven't received as much attention as they deserve in the PHP 8.4 release coverage. And there’s still very little documentation on the PHP website. Having recently begun updating the PHP port of Mozilla’s Readability to use these new features, I wanted to share more information.

Technical Foundation

At the core of these improvements is Lexbor, a C-based HTML parser created by Alexander Borisov. It provides fast, standards-compliant HTML parsing and CSS selector support. It’s now included in PHP 8.4's official DOM extension, which comes enabled by default — no extra configuration needed.

The new DOM classes follow the DOM spec more closely. If you're familiar with DOM traversal and manipulation in JavaScript, you'll find many familiar methods and properties now available in PHP, including querySelector and querySelectorAll.

The Old Way: Parsing with libxml

PHP has previously relied on libxml for parsing both XML and HTML. Unfortunately libxml struggles with modern HTML, and many pages get mangled by the parser. Let's look at a simple example that demonstrates the problem.

Here's a valid HTML5 document containing two paragraphs and a script tag:

<!DOCTYPE html>
<title>Valid HTML5 Document</title>
<p>Paragraph 1</p>
<script>console.log("</html>Console log text");</script>
<p>Paragraph 2</p>

When trying to parse this document and count its paragraphs, PHP finds three elements, not two:

$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html);
$paragraphs = $dom->getElementsByTagName('p');
echo "{$paragraphs->length} paragraphs found.";
// Output: 3 paragraphs found.

Why does it find three paragraphs instead of two? The presence of </html> in the script element trips up the libxml parser. Instead of treating it as text within the script, libxml interprets it as a closing HTML tag. When we serialize the resulting DOM back to HTML, we can see how the document was mangled:

<html>
  <body>
   <p>Paragraph 1</p>
   <script>console.log("</script>
  </body>
</html>
<html>
  <p>Console log text");</p>
  <p>Paragraph 2</p>
</html>

To work around these limitations, many developers have turned to alternative parsers. HTML5-PHP is popular, but it’s written in PHP rather than C, making it noticeably slower than libxml. It’s also unclear how much effort has been put in to keep up with the HTML living standard (more on that below).

The New Way: Parsing with Lexbor

PHP 8.4 solves these parsing challenges with its new HTML5 parser. Let's parse the same HTML with the new parser:

$newDom = DOM\HTMLDocument::createFromString($html);
$paragraphs = $newDom->getElementsByTagName('p');
echo "{$paragraphs->length} paragraphs found.";
// Output: 2 paragraphs found.

The parser now correctly identifies two paragraphs.
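For anyone who wants to run this locally, here is a self-contained sketch of the same check. The variable names are mine, and it assumes PHP 8.4 with the DOM extension enabled. It also reads back the script element's text to confirm that </html> stayed inside the script.

```php
<?php
// The sample document from earlier, in a nowdoc so nothing is interpolated.
$html = <<<'HTML'
<!DOCTYPE html>
<title>Valid HTML5 Document</title>
<p>Paragraph 1</p>
<script>console.log("</html>Console log text");</script>
<p>Paragraph 2</p>
HTML;

$newDom = DOM\HTMLDocument::createFromString($html);

// Count the paragraphs and read the script's text content.
$paragraphCount = $newDom->getElementsByTagName('p')->length;
$scriptText = $newDom->querySelector('script')->textContent;

echo "{$paragraphCount} paragraphs found.\n";
echo $scriptText, "\n";
```

The script text should come back unchanged, closing tag and all, because the HTML5 parser treats everything up to </script> as script data.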

According to Niels Dossche, who is responsible for these new additions, performance is comparable to libxml parsing, if not a little faster.

Lexbor vs. HTML5-PHP

For current HTML5-PHP users, switching to the new DOM API and parser offers some advantages.

Performance

Lexbor, written in C, should perform much better than HTML5-PHP. In my tests Lexbor was 3.6 times faster when processing HTML pages containing blog posts and news articles. According to Niels, the speed advantage should become even more pronounced when processing larger HTML documents.

Standards compliance

The HTML specification is a living standard that continuously evolves, and parsers can vary in their implementation of current standards.

HTML5-PHP was started in 2013, and its README still references a 2012 version of the W3C HTML5 standard. Lexbor was started in 2018 and is based on the newer standard maintained by the WHATWG, which is now the sole publisher of the HTML standard. So Lexbor is likely closer to the current standard than HTML5-PHP.

It’s also worth noting that HTML5-PHP currently relies on PHP's old DOM classes, which don't support the improved features of PHP's new DOM API covered in the rest of this article.

Working with the New DOM Classes

For backward compatibility, PHP 8.4 introduces new DOM classes alongside the existing ones. This means you can continue using DOMDocument if needed, or even use both old and new classes in the same codebase.

Here's how to get started:

$dom = DOM\HTMLDocument::createFromString($html);

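The new classes follow a simpler naming convention under the DOM namespace. For example, DOMNode becomes DOM\Node, DOMElement becomes DOM\Element, and DOMText becomes DOM\Text.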
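A quick way to see the naming in practice is to check what the new parser hands back. This is a minimal sketch, assuming PHP 8.4. The variable names are my own, and the last check shows that the new document is not an instance of the legacy DOMDocument class.

```php
<?php
$dom = DOM\HTMLDocument::createFromString('<!DOCTYPE html><p>Hello</p>');
$paragraph = $dom->querySelector('p');

// Nodes from the new parser belong to the namespaced class hierarchy.
$isNewDocument = $dom instanceof DOM\Document;
$isNewElement = $paragraph instanceof DOM\Element;
$isNewText = $paragraph->firstChild instanceof DOM\Text;

// The old and new hierarchies are separate.
$isLegacyDocument = $dom instanceof DOMDocument;

var_dump($isNewDocument, $isNewElement, $isNewText, $isLegacyDocument);
```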

Top-Level HTML Elements as DOM Properties

You can now access the main parts of an HTML document through these convenient DOM\Document properties:

head (read only)
“The first head element that is a child of the html element. These need to be in the HTML namespace. If no element matches, this evaluates to null.”
body
“The first child of the html element that is either a body tag or a frameset tag. These need to be in the HTML namespace. If no element matches, this evaluates to null.”
title
“The title of the document as set by the title element for HTML or the SVG title element for SVG. If there is no title, this evaluates to the empty string.”

Example:

$dom = DOM\HTMLDocument::createFromString('<p>My document</p>');
echo $dom->saveHTML($dom->body);
// Output: <body><p>My document</p></body>
$dom->title = 'My title';
echo $dom->saveHTML($dom->head);
// Output: <head><title>My title</title></head>
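The following sketch, assuming PHP 8.4, looks at the same properties on a document that has no explicit head, body, or title. The HTML5 parser builds the head and body elements itself, so for a parsed HTML document the null case described above should not normally come up. The variable names are illustrative.

```php
<?php
$dom = DOM\HTMLDocument::createFromString('<!DOCTYPE html><p>My document</p>');

// head and body exist even though the source never mentions them.
$hasHead = $dom->head !== null;
$hasBody = $dom->body !== null;

// No title element yet, so the property is an empty string.
$initialTitle = $dom->title;

// Assigning to title creates the element inside head.
$dom->title = 'My title';
$titleElement = $dom->querySelector('head > title');

var_dump($hasHead, $hasBody, $initialTitle);
echo $titleElement->textContent, "\n";
```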

Working with innerHTML

PHP 8.4 also introduces innerHTML, a property that provides an easier way to work with an element's content. Instead of manipulating DOM nodes directly, you work with HTML strings:

$dom = DOM\HTMLDocument::createFromString('<body><h1>Test</h1></body>');
echo $dom->body->innerHTML;
// Output: <h1>Test</h1>
$dom->body->innerHTML = '<p>Something new</p>';
echo $dom->saveHTML();
// Output: <html><head></head><body><p>Something new</p></body></html>

Note that there is no outerHTML support yet.
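Until outerHTML arrives, passing a node to saveHTML, as in the body example earlier, gives a similar result because the serialized string includes the element itself. A minimal sketch, assuming PHP 8.4 and using variable names of my own:

```php
<?php
$dom = DOM\HTMLDocument::createFromString(
    '<!DOCTYPE html><body><h1>Test</h1><p>Intro</p></body>'
);
$heading = $dom->querySelector('h1');

// Replace only the heading's content using an HTML string.
$heading->innerHTML = 'Test <em>updated</em>';

// innerHTML covers the children; saveHTML($node) also includes the element.
$inner = $heading->innerHTML;
$outer = $dom->saveHTML($heading);

echo $inner, "\n";
echo $outer, "\n";
```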

Modern CSS Selector Support

One of the most powerful additions in PHP 8.4 is comprehensive support for modern CSS selectors. You can now use querySelector and querySelectorAll to find elements using the same selectors you're familiar with from frontend development:

querySelector($selectors)
“Returns the first descendant element that matches the CSS selectors”
querySelectorAll($selectors)
“Returns a NodeList containing all descendant elements that match the CSS selectors”
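As a small sketch of how the two methods differ in use, assuming PHP 8.4: querySelector returns a single element or null when nothing matches, while querySelectorAll returns a list you can loop over. The sample markup and variable names are invented for illustration.

```php
<?php
$dom = DOM\HTMLDocument::createFromString(
    '<!DOCTYPE html><p>One</p><p>Two</p>'
);

// First match, or null when there is none.
$first = $dom->querySelector('p');
$missing = $dom->querySelector('table');

// Every match, iterable with foreach.
$texts = [];
foreach ($dom->querySelectorAll('p') as $paragraph) {
    $texts[] = $paragraph->textContent;
}

echo $first->textContent, "\n";
var_dump($missing);
echo implode(', ', $texts), "\n";
```

Because querySelector can return null, it's worth checking the result before calling methods on it.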

Here’s the previous code for getting paragraphs, but with querySelectorAll replacing getElementsByTagName:

$newDom = DOM\HTMLDocument::createFromString($html);
$paragraphs = $newDom->querySelectorAll('p');
echo "{$paragraphs->length} paragraphs found.";

This produces the same result as before, which is not very remarkable. But the new selector support enables much more sophisticated queries. Let's explore some practical examples:

Find Multiple Element Types

Get all paragraph and heading elements — returned in document order:

$elements = $dom->querySelectorAll('p, h1, h2, h3, h4, h5, h6');
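To illustrate the ordering, here is a sketch on a small made-up document, assuming PHP 8.4. The results follow the order of the elements in the document, not the order of the selectors in the list.

```php
<?php
$dom = DOM\HTMLDocument::createFromString(
    '<!DOCTYPE html><h1>Title</h1><p>Intro</p><h2>Section</h2><p>Body</p>'
);

// Collect the local name of each match in the order it is returned.
$names = [];
foreach ($dom->querySelectorAll('p, h1, h2, h3, h4, h5, h6') as $element) {
    $names[] = $element->localName;
}

echo implode(', ', $names), "\n";
```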

Avoid repetition with :is and :where

Get paragraphs and main headings that are direct children of the article:

$elements = $dom->querySelectorAll('article > :is(p, h1, h2)');
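For comparison, this sketch runs the :is form next to the longhand selector list it replaces, on an invented document, assuming PHP 8.4. Both should match the same two direct children and skip the nested paragraph.

```php
<?php
$dom = DOM\HTMLDocument::createFromString(
    '<!DOCTYPE html><article><h1>Title</h1><p>Intro</p>'
    . '<div><p>Nested</p></div></article><p>Outside</p>'
);

// Short form using :is.
$short = $dom->querySelectorAll('article > :is(p, h1, h2)');

// Equivalent longhand, repeating the "article >" prefix for each type.
$long = $dom->querySelectorAll('article > p, article > h1, article > h2');

echo "{$short->length} matches with :is, {$long->length} with the longhand.\n";
```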

You can also narrow your search to specific elements:

$elements = $dom->querySelector('article')->querySelectorAll('p, h1, h2');

Note that this is not technically equivalent to the earlier code, because we’re not limiting results to direct children only. To do that we’d need to use the :scope selector, which Lexbor doesn't yet support:

$dom->querySelector('article')->querySelectorAll(':scope > :is(p, h1, h2)');
// Throws: DOMException: Invalid selector (Selectors. Not supported: scope)

The good news is that a fix for this issue, contributed by Niels, is currently under review in Lexbor.
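In the meantime, one possible workaround is to run the scoped query and keep only the matches whose parent is the article itself. This is my own sketch rather than anything recommended by the PHP or Lexbor maintainers, and it assumes PHP 8.4.

```php
<?php
$dom = DOM\HTMLDocument::createFromString(
    '<!DOCTYPE html><article><h1>Title</h1><p>Intro</p>'
    . '<div><p>Nested</p></div></article>'
);
$article = $dom->querySelector('article');

// All matching descendants, including the nested paragraph.
$descendants = $article->querySelectorAll('p, h1, h2');

// Keep only direct children of the article.
$directChildren = [];
foreach ($descendants as $element) {
    if ($element->parentNode->isSameNode($article)) {
        $directChildren[] = $element;
    }
}

echo "{$descendants->length} descendants, ";
echo count($directChildren), " direct children.\n";
```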

Find empty or non-empty elements with :empty and :not

Get all empty p elements:

$elements = $dom->querySelectorAll('p:empty');

Get all non-empty p elements:

$elements = $dom->querySelectorAll('p:not(:empty)');
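Here is a sketch of both selectors on a small invented document, assuming PHP 8.4. A paragraph containing only a br element counts as non-empty, because :empty requires that there are no child elements or text at all.

```php
<?php
$dom = DOM\HTMLDocument::createFromString(
    '<!DOCTYPE html><p></p><p>Text</p><p><br></p>'
);

$empty = $dom->querySelectorAll('p:empty');
$nonEmpty = $dom->querySelectorAll('p:not(:empty)');

echo "{$empty->length} empty, {$nonEmpty->length} non-empty.\n";
```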

Match parent or previous sibling elements with :has

Get all paragraphs in article that have at least one link inside them:

$elements = $dom->querySelectorAll('article p:has(a)');

Get h1 headings that are immediately followed by an h2 heading:

$elements = $dom->querySelectorAll('h1:has(+ h2)');
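The two :has queries can be tried on a made-up document like the one below, assuming PHP 8.4. Only the paragraph that contains a link and the h1 directly followed by an h2 should be returned.

```php
<?php
$dom = DOM\HTMLDocument::createFromString(
    '<!DOCTYPE html><article><p>No link</p>'
    . '<p>With <a href="/x">a link</a></p></article>'
    . '<h1>A</h1><h2>B</h2><h1>C</h1><p>D</p>'
);

// Paragraphs inside article that contain at least one link.
$withLinks = $dom->querySelectorAll('article p:has(a)');

// h1 elements whose next sibling is an h2.
$headings = $dom->querySelectorAll('h1:has(+ h2)');
$headingTexts = [];
foreach ($headings as $heading) {
    $headingTexts[] = $heading->textContent;
}

echo "{$withLinks->length} paragraph(s) with links.\n";
echo implode(', ', $headingTexts), "\n";
```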

Attribute selectors

Get all external links — URLs starting with “http” and not containing “example.com”, case insensitive:

$elements = $dom->querySelectorAll(
    'a[href ^= "http" i]:not([href *= "example.com" i])'
);
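To see what the selector keeps and drops, here is a sketch with a handful of invented links, assuming PHP 8.4. The i flag makes both attribute comparisons case insensitive, so HTTP:// still counts as starting with "http" and EXAMPLE.com still counts as containing "example.com".

```php
<?php
$dom = DOM\HTMLDocument::createFromString(
    '<!DOCTYPE html>'
    . '<a href="https://example.com/about">Internal</a>'
    . '<a href="https://www.EXAMPLE.com/blog">Internal, mixed case</a>'
    . '<a href="/contact">Relative</a>'
    . '<a href="HTTP://other.org/page">External</a>'
    . '<a href="mailto:me@example.com">Email</a>'
);

$elements = $dom->querySelectorAll(
    'a[href ^= "http" i]:not([href *= "example.com" i])'
);

// Collect the href of each external link.
$externalUrls = [];
foreach ($elements as $link) {
    $externalUrls[] = $link->getAttribute('href');
}

echo implode("\n", $externalUrls), "\n";
```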

For more examples of available selectors, you can refer to MDN's documentation on CSS selectors and combinators, and PHP’s selectors tests folder.

Update

This article was updated on 11th December 2024 based on feedback from Niels Dossche.

To be continued…

In part 2 I’ll be looking at:

XPath selectors
Namespaces
Renaming elements
Serialisation — turning the DOM tree back into HTML
And the small differences between the old and new DOM APIs

Credit

Huge thanks to Niels Dossche, both for introducing these fantastic new changes to PHP, and also for providing valuable feedback on this article.

And also a huge thanks to Alexander Borisov, who is the creator of Lexbor. Lexbor is not only responsible for HTML parsing in this PHP release, but also its CSS selector support.

Keyvan's blog
