Interview with Niels Dossche

PHP, HTML, the DOM, and living standards

Dec 25, 2024

Niels Dossche, a PhD student at Ghent University in Belgium, is responsible for the major DOM improvements introduced in PHP 8.4. These bring HTML5 support, CSS selectors, and modern DOM features to PHP.

I spoke with Niels to learn more about how these changes came about. The following interview has been edited for clarity and length.

How did you get involved with PHP?

I have worked with PHP before, but not in a professional way. I think about ten years ago I played a bit with PHP, made some small websites just for myself, as a hobby. And then I did nothing with it. I started studying at university, like seven years ago. When I graduated, I got the opportunity to work as a researcher at the university. Indirectly, through my research, I became involved in PHP. It happened because I do research in static analysis. I don't know if you're familiar with it?

Vaguely. It's kind of low-level, difficult stuff.

Right. So, like analyzing the code without executing it to find bugs upfront. I do research on that, and I apply it to C and C++ code. At one point, I was thinking, well, now I need to actually choose some open source projects to test this on so I can report the results in a publication. What are some security-critical, complex C code bases? Well, PHP is one of those. Maybe I should get myself familiar with how PHP works internally, then I can run my analysis tools on it. So I cloned the PHP repository, started playing with it. I noticed some issues, small ones, but issues that seemed easy to fix, and I just started sending pull requests to the PHP repository.

These issues, did you discover them through your own curiosity or were they found by your tool?

Both. Initially they were just through my own curiosity, just things that I stumbled upon while I was trying to understand how everything works. Eventually, when the tool was mature enough to actually run on PHP, I also got reports from my tool of some bugs, and I started fixing those as well. I gradually became more and more involved in PHP in that way.

I ended up working with the DOM stuff because I just browsed the bug tracker at one point and noticed that there were a lot of crash bugs in the DOM extension. I didn't really know much about the DOM at that point, but I just started to get familiar with it and started to try to fix these issues and learn more about the DOM spec and how the W3C and WHATWG write the specs, and how it all evolved.

So it started with crash bugs. Then I became more aware of the HTML parsing issues, and also the semantic issues with the API not entirely doing what it's supposed to do. Part of that was because some things were not implemented correctly. But some were correct, based on standards from back in the day when we had the W3C managing the APIs. But some of these APIs changed when the WHATWG took them over and changed how they should work.

Version Numbers and Living Standards

It hadn't really occurred to me that by doing away with version numbers in standards,[1] it becomes difficult to compare HTML5 parsers.

People say HTML5 instead of HTML Living Standard because nobody really knows what living standard means, and HTML5 is the term people stuck with. If you look at the specification document, it just says last updated on this date. And that's kind of the version number, but you have all kinds of different HTML5 parsers, also in userland PHP, that each adhere to a different point in time of the specification.

It also complicates how we need to handle this in PHP because let's say that they relax some parsing rules about some particular elements — I know that there's talk about relaxing some parsing rules regarding form elements — so let's say that gets implemented. If we implement these changes in PHP, then we break backwards compatibility because people may rely on the old behavior.

But doesn’t it also affect you if you're a front-end developer?

Yeah, it affects everyone actually.

I like your decision to have separate DOM classes in PHP 8.4. There must be so much code relying on the old DOM.

Yeah, this is actually something that I ran into. At some point I tried to fix these semantic bugs — not crash bugs but just incorrect behavior. I tried to fix these and people started complaining that, well, now it broke my code.

So I reverted the changes and then other people started complaining, well, now we don't get the fix.

I remember the DOM API used to have inElementWhitespace or isElementWhitespace or some kind of confusing name.[2] I’d made use of that and then I noticed it's not available in the new DOM API, they removed it in the living standard.

Yeah.

PHP aside — because in PHP you know there's a transition to a new DOM API — but how would Mozilla deal with that? Would they also remove it from their DOM API or would they try and maintain some backward compatibility?

I don't know the answer for Mozilla but I know the answer for Chrome. If you visit a webpage in Chrome, it also tracks some statistics about which APIs are used.

I think this information is actually public. The Chrome developers have data on how often each DOM API is used. As far as I know, an API is marked as deprecated if its usage drops below a certain percentage, like 0.0 whatever, and then they can consider it for removal. They can't consider it for removal if its usage is too high.

SerenityOS

So your interest in the DOM started from looking at bug reports?

Yeah, I saw these bugs, I saw that no one was fixing them and I was like, well, I think I can do it. I also follow this YouTube channel — I don't know if you know the person but his name is Andreas Kling.

I've heard of the name.

He made the SerenityOS project and out of that came a web browser, it's called Ladybird. It's fully open source, not based on any pre-existing engine. And he often records videos about how he approaches implementing HTML and DOM stuff. And that also sparked my curiosity and that's also kind of-

He's doing that from scratch?

He's implementing everything from scratch, although he's not alone anymore. There are many people that contribute to the Ladybird project.

By seeing that project I also got interested in doing DOM stuff and seeing all these DOM bugs I was like, well, this is actually a perfect opportunity to explore some of that HTML and DOM stuff myself.

PHP 8.4 DOM Update

Was there already a plan to update the DOM classes in some way by others in the PHP team?

Nobody planned this. The only reason it actually happened is that, by July 2023, I had already made all these improvements to fix DOM bugs. It was at that point pretty much crash-free. Much more stable than it ever had been. And then someone complained on Mastodon about the lack of modern DOM features, and he gave a list. I thought, well, I can probably implement those. But then one of the bullet points was the lack of HTML5 support. At that point I wasn't even sure that was possible to add because of how old the DOM code was.

So I just started experimenting with it. And then I showed some other people within the PHP community and got some feedback from them. Once the feedback was positive and I knew what the way forward was, I volunteered to implement HTML5 support. And if that was successful, I was like, okay, I'll try to spend some time making the API compliant. So I basically just volunteered to do it myself. Of course I got a lot of feedback from people in the community, but in terms of the programming effort, it was all on me, in my spare time.

How did you come across Lexbor?[3]

Kind of by coincidence. I was looking for suitable parsers to integrate into PHP, and those are actually quite hard to find, because most of the compliant parsers are in browsers, or in big projects, and they can't easily be decoupled from them. For example, we had one person saying, why don't you just use the parser from Firefox? And that's not possible, because it's so tightly coupled to the Firefox code base, I can't take it out.

There were some candidates, like Gumbo, which is a parser made by Google, but unfortunately, it has not been maintained since 2018 or something like that.

There was also the parser from Servo. Servo is a research project by Mozilla to implement a browser engine in Rust. Its HTML parser could also have been used in theory, but it lacked some of the encoding support that's required by the HTML spec. Also, integrating a Rust library into the large C code base that is PHP was not going to be easy at all.

I searched further and came across Lexbor purely by coincidence, and it was a very well-tested library. I also saw that it was used in some Python libraries and some D and Crystal libraries. So I knew it was mature enough to give it a shot. I prototyped the initial version in like a week and a half or so, and I thought, okay, this is probably the way to go.

HTML Parsing

In one of the Reddit posts about the PHP DOM update, there was a thread about how HTML is a bit of a mess.

Ah, yes.

There was a debate about why HTML parsers have to be very forgiving, making the parser spec very complicated, when nothing else is as forgiving.

Take any random webpage, put it into an HTML validator, and see if it validates. You'll see that 95% of the web doesn't validate.

Why does it have to be forgiving? Because almost all webpages would be broken otherwise. That's what complicates HTML parsing. For everything that can go wrong, the parsing specification says what should happen. And it's sometimes very complicated because it tries to automatically fix the mistake of the developer, and change the document in such a way that it's probably what the developer intended. There are a lot of these complicated algorithms to make that happen, and it's also a constant source of parser differentials. I don't know if you're familiar with parser differentials?

No.

So I will give an example from a security point of view. Let's say that you run a content management system, and a user writes a comment on a blog post. You want to sanitize the HTML that the user provides. If you don't, you run the risk of XSS.[4]

How do you sanitize the user comment? Well, there are some allowed tags, some tags that are not allowed, and you want to filter those out. If the parser at the server side handles some edge cases differently than on the client side, it's possible that some forbidden tags might slip through. The server may think, okay, this HTML is sane, nothing can go wrong, but if you give it to the client, it may actually do something unexpected.

So those are parser differentials.

And in general, every time there's a difference in how a client parses something and how a server parses something, there are bound to be problems.
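To make the sanitizer scenario above concrete, here is a toy sketch in Python. It is an illustration under stated assumptions, not how any real sanitizer or browser works: the "server" is a hypothetical regex-based check that compares tag names case-sensitively, and Python's standard `html.parser` stands in for the client, since it lowercases tag names the way HTML parsers do. The function names and the sample comment are invented for this example, and the differential is deliberately much simpler than the ones real attacks rely on.

```python
import re
from html.parser import HTMLParser

# Tags the hypothetical CMS refuses to accept in user comments.
FORBIDDEN = {"script", "iframe"}


def server_tags(html):
    """Hypothetical server-side view: a regex that keeps tag names as written."""
    return set(re.findall(r"<([A-Za-z][A-Za-z0-9]*)", html))


def server_thinks_safe(html):
    """Naive check: passes if no forbidden name appears (case-sensitive bug)."""
    return not (server_tags(html) & FORBIDDEN)


class TagCollector(HTMLParser):
    """Stand-in for the client: html.parser reports tag names in lower case."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)


def client_tags(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


comment = "<p>Nice post!</p><SCRIPT>alert(1)</SCRIPT>"

# The server sees "SCRIPT", which is not in its lowercase deny list...
print("server thinks safe:", server_thinks_safe(comment))
# ...while the client-side stand-in sees a "script" element.
print("client sees forbidden tags:", client_tags(comment) & FORBIDDEN)
```

The two sides disagree about what the same bytes mean, which is the whole problem: the check passes on the server, yet the client ends up with a script element. Sanitizing with the same parsing rules the browser uses narrows that gap.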

There's a recent paper, published in IEEE Security and Privacy, that talks about this exact problem.[5]

Big thanks to Niels for sharing his story and insights, and for dedicating so much time to improving PHP's DOM implementation for the rest of us.

I'll be continuing my look at the new PHP DOM API soon. If you missed the first post about this, you can read it here: Parsing HTML with PHP 8.4.

[1] The W3C used to publish specs with version numbers. The WHATWG advocated for a faster-moving living standard, which eventually won out.

[2] isElementContentWhitespace() and isWhitespaceInElementContent()

[3] Lexbor is the HTML5 parser used in PHP 8.4’s new DOM API.

[4] Cross-site scripting attacks.

[5] Parse Me, Baby, One More Time: Bypassing HTML Sanitizer via Parsing Differentials [PDF]

Keyvan's blog
