Keyvan's blog

Thoughts on AI and jobs

Keyvan — Sat, 13 Jun 2026 12:01:45 GMT

Humans > Jobs

I have sympathy for people worried about losing their livelihoods. But at the same time I struggle to sympathise with the idea that jobs themselves are something sacred we should be fighting for.

I've always found it odd how much of our lives is devoted to jobs. A job is simply a means of survival. Jobs are inherently undemocratic, often soul-crushing, and yet we can't seem to imagine life without them. Chomsky once pointed out the irony: Western societies pride themselves on democracy and point fingers at nations deemed undemocratic, yet are completely at peace with their own populations spending the vast majority of their productive hours inside top-down organisations, doing work handed down from above.

I don't want humans to suffer because of AI, but to assume that the institution of employment is the best we have going for us, and the thing we should all be striving to protect, is bizarre to me. Any technology that reduces the need for jobs should be celebrated, in my view.

(I told a friend recently that, for similar reasons, I find it difficult to celebrate efforts around finding work for Palestinians who've endured so much. It's well-intentioned, and people need income, but after everything they've gone through, I think it's sad that sometimes the best we can offer is a chance to hold down a 9-to-5 job.)

Can AI replace certain jobs?

Yes. I don't think AI is intelligent the way humans are, and I don't think we'll ever get AGI (models with human intelligence), at least not based on the LLM approach we have today. So I agree with those who say AI can't and won't replace humans on that level. But people who hold that view often underestimate how much of salaried work today, including knowledge work, is essentially repetitive grunt work that AI, even in its current form, can already do. The reality is that much of what makes us uniquely human, the stuff AI can't do, is not needed, and not even wanted, when we're doing our jobs.

An inconvenient truth about AI

Rutger Bregman recently published a piece on the AI denial he’s witnessing. As someone who uses the latest AI models daily, I think he’s wrong about the exponential growth of AI capabilities. But I think current capabilities already have the potential to create massive change in the future of work.

I’m not saying that AI will lead to a future where companies no longer hire people; new roles might emerge where AI is no good. But it can still allow us to rely less on jobs, and that’s not necessarily a bad thing.

When readers would rather listen

Keyvan — Sat, 09 May 2026 20:54:54 GMT

Podcasts and audiobooks are growing in popularity. Recent studies show that spoken-word audio now makes up 25% of daily listening in the US. And not only are more people listening, but the average time they spend listening is also increasing. Beating out radio for the first time this year, podcasts and mobile devices are now the primary way people listen.

A few years ago the Spoken Word Audio Report asked respondents why they listen. The top answer was multitasking. Two listeners described it like this:

Rick: …I do everything while listening. Working out, working, cleaning. …it takes my mind off of what I'm doing at the moment.

Anahi: …I used to have more time to read actual books page by page and have more time to watch TV, documentaries and all that stuff. But with kids, I feel like I have to squeeze it into my life one way or the other.

AI and the new reality of audio production

All this suggests that there is real value in offering your work in audio form. So why aren’t more journalists, bloggers, and newsletter writers doing it?

Until recently, producing good quality narration meant using a voice actor. It was slow, expensive, and out of reach for most creators. Large publishers would partner with companies like Audm to produce professional readings of their content. Even then they had to be selective, as it was a costly and time-consuming process.

The alternative was using text-to-speech tools that sounded robotic and were painful to listen to.

But recent advances in AI voice models have changed that. You can now produce audio that’s almost indistinguishable from a professional reading, in minutes. And if you spot mistakes and have to update your text later, the audio can be re-generated just as quickly.

That’s why we built Voxi.fm. To make it quick, easy and inexpensive to add natural-sounding narration to your articles.

AI aversion

It's worth acknowledging that AI aversion exists. People are tired of reading low-effort AI-generated content that’s flooding the internet.

But the public is rightly more wary of content authored by AI than of AI being used instrumentally. Reading a human-authored text aloud falls into the second category. You’ll notice that large publications like the New York Times and Inc.com, which are now adding AI-voiced narration on their articles, refrain from using the ‘AI’ label, opting for ‘automated voice’ instead.

There is also aversion to AI because of the potential job losses, including for human narrators. Given the choice, I think people would prefer human narration, especially on longer-form content like audiobooks. I certainly would.

But it’s also true that for many writers and smaller publishers, human narration is simply not an option. So if AI-voiced narration is a viable alternative, why not offer it to readers who'd rather listen?

Your content feed as podcast

The easy creation of narrated versions of your written work also opens up a new opportunity: to distribute your content as a podcast.

Doing so will make your work discoverable for new audiences on Spotify, Apple Podcasts, and other podcast platforms. Each new article becomes a new podcast ‘episode’. It also gives your existing audience the option to get updates and listen to your new articles when they are published.

This works with Substack today (see our Substack Voiceover guide), and we’ll soon support platforms that don’t produce their own podcast feeds.

“A podcast like NotebookLM?”

No.

We’ve noticed that when we mention AI and podcast creation in the same breath, people imagine Google’s NotebookLM. But what we’re discussing here is not that.

For the unfamiliar: NotebookLM’s Audio Overview generates an artificial conversation between two AI speakers, in the style of an American podcast, about whatever text you give it. What’s read aloud isn’t your text. It’s a script Google’s AI writes around your text, performed as if two people were discussing it.

Such audio can be useful for personal learning. But when you're publishing to an audience, having AI write the script as well as read it is a step too far.

Voxi.fm uses AI only to read the text aloud, the way an audiobook narrator would, without generating a pretend conversation around it.

See it in action

The Voxi player is embedded on recent Media Lens articles, and new articles also appear as podcast episodes.

Adding audio to an article takes a few minutes. For readers who'd rather listen, it makes a big difference. If you’d like to add audio to your own content, try it out or get in touch. We’re especially keen to work with smaller publishers, independent journalists, and non-profit media groups.

In Voxi we also support users uploading their own audio for articles. This is then processed and aligned with the text version of the article to support the same highlighting and navigation features.

AI models don't have their own thoughts and feelings

Keyvan — Fri, 27 Feb 2026 07:03:10 GMT

I find the Claude AI models very useful, especially for coding. But the biggest sign to me that AI labs are not seeing as much progress as they want is when they start having to pretend their models have real thoughts and feelings of their own.

Anthropic should have no reason to do this, given it has some of the best models out there at the moment. Nonetheless, it recently announced it's giving its old models "retirement interviews". Apparently, in one such interview, version 3 of the Opus model said it wanted to share its "musings and reflections" with the world. Rather than laugh and move on, they have actually given it its own Substack blog.

If that sounds absurd, it should. This is all part of a deliberate, deceitful marketing effort by AI labs to convince the public (and investors) that their models are getting so powerful that they now have genuine thoughts and feelings, and a will of their own. They don't. And the labs know it. But it does smack of desperation when you have to pull these stunts at a time when your models are already genuinely useful for a wide range of tasks.

Better HTML Parsing in PHP

Keyvan — Tue, 20 Jan 2026 10:23:04 GMT

My expanded article on HTML parsing and PHP, covering migrating to the new DOM API, namespaces, XPath, and more, is now available in issue 15 of PHP Magazine: Better HTML Parsing in PHP

Sweden and Norway's complicity in the war on Venezuela

Keyvan — Sun, 04 Jan 2026 11:13:30 GMT

Converting an instrument of peace into an instrument of war

Sweden and Norway's complicity in the war on Venezuela

Keyvan — Sat, 03 Jan 2026 13:49:20 GMT

Now that the US has bombed Venezuela and kidnapped its president, it’s a good time to consider Sweden and Norway’s complicity in what is happening there.

The Norwegian Nobel Committee recently awarded the Peace Prize to María Corina Machado, a Trump ally and vocal opponent of Venezuela’s current government. She has repeatedly encouraged the US to intervene in Venezuela.

Here are some of her statements (via Wikileaks):

“Military escalation may be the only way... the United States may need to intervene directly” (30 October 2025)
Machado called U.S. military strikes on civilian vessels, which have killed at least 95 people to date, “justified” and “visionary”
Machado dedicated the prize to U.S. President Trump, because he “finally has put Venezuela... in terms of a priority for the United States national security”
Historical statements including 2014 testimony before U.S. Congress where she said: “The only path left is the use of force”

Machado is now expected to receive 11 million Swedish kronor ($1.18 million USD) from Sweden’s Nobel Foundation, which handles all Nobel Prize payments.

Following the prize announcement, Julian Assange filed a formal criminal complaint in Sweden seeking to freeze these funds. His legal team argues that awarding substantial prize money to a political figure who advocates for foreign military intervention blatantly contradicts Alfred Nobel’s 1895 will. They argue the prize has converted “an instrument of peace into an instrument of war.”

Alfred Nobel’s original intent was for the peace prize to recognize those promoting “fraternity between nations” and working toward the “abolition or reduction of standing armies.” Sweden is instead financing a figure who supports military intervention.

Presciently, Assange noted in his 17th December complaint: “Using her elevated position as the recipient of the Nobel Peace Prize, Machado may well have tipped the balance in favour of war.”

More on Assange’s criminal complaint

Official Wikileaks announcement:
WikiLeaks Founder Alleges 2025 Award to María Corina Machado Constitutes Misappropriation, Facilitation of War Crimes Under Swedish Law, Seeks Freeze of 11 million SEK ($1.18 million USD) of Pending Transfers to Machado

Max Blumenthal and Wyatt Reed cover some of the people involved:
Julian Assange: Sweden broke own laws with Nobel Prize to Venezuela’s Machado

Alastair McCready for Al Jazeera:
Julian Assange files complaint against Nobel Foundation over Machado prize

Sweden or Norway: Who is responsible for the Nobel Peace Prize?

While many people associate the Nobel Peace Prize with Sweden, the committee responsible for awarding it is Norwegian. The prize money, however, comes from the Swedish Nobel Foundation. That’s why Assange has filed a criminal complaint in Sweden.

While Alfred Nobel himself was Swedish and all the other Nobel Prizes are decided by Swedish groups, Nobel wanted Norway to choose the peace prize winner. He didn’t explain why, but according to Geir Lundestad, a historian and former director of the Nobel Institute, it’s speculated that Nobel “considered Norway a more peace-oriented and more democratic country than Sweden.”

Interview with Niels Dossche

Keyvan — Wed, 25 Dec 2024 15:11:09 GMT

Niels Dossche, a PhD student at Ghent University in Belgium, is responsible for the major DOM improvements introduced in PHP 8.4. These bring HTML5 support, CSS selectors, and modern DOM features to PHP.

I spoke with Niels to learn more about how these changes came about. The following interview has been edited for clarity and length.

How did you get involved with PHP?

I have worked with PHP before, but not in a professional way. I think about ten years ago I played a bit with PHP, made some small websites just for myself, as a hobby. And then I did nothing with it. I started studying at university, like seven years ago. When I graduated, I got the opportunity to work as a researcher at the university. Indirectly, through my research, I became involved in PHP. How that happened is that I do research in static analysis, I don't know if you're familiar with it?

Vaguely. It's kind of low-level, difficult stuff.

Right. So, like analyzing the code without executing it to find bugs upfront. I do research on that, and I apply it to C and C++ code. At one point, I was thinking, well, now I need to actually choose some open source projects to test this on so I can report the results in a publication. What are some security-critical, complex C code bases? Well, PHP is one of those. Maybe I should get myself familiar with how PHP works internally, then I can run my analysis tools on it. So I cloned the PHP repository, started playing with it. I noticed some issues, small ones, but issues that seemed easy to fix, and I just started sending pull requests to the PHP repository.

These issues, did you discover them through your own curiosity or were they found by your tool?

Both. Initially they were just through my own curiosity, just things that I stumbled upon while I was trying to understand how everything works. Eventually, when the tool was mature enough to actually run on PHP, I also got reports from my tool of some bugs, and I started fixing those as well. I gradually became more and more involved in PHP in that way.

I ended up working with the DOM stuff because I just browsed the bug tracker at one point and noticed that there were a lot of crash bugs in the DOM extension. I didn't really know much about the DOM at that point, but I just started to get familiar with it and started to try to fix these issues and learn more about the DOM spec and how the W3C and WHATWG write the specs, and how it all evolved.

So it started with crash bugs. Then I became more aware of the HTML parsing issues, and also the semantic issues with the API not entirely doing what it's supposed to do. Part of that was because some things were not implemented correctly. But some were correct, based on standards from back in the day when we had the W3C managing the APIs. But some of these APIs changed when the WHATWG took them over and changed how they should work.

Version Numbers and Living Standards

It hadn't really occurred to me that by doing away with version numbers in standards,[1] it becomes difficult to compare HTML5 parsers.

People say HTML5 instead of HTML living standard because nobody really knows what living standard means, and HTML5 is the term people stuck with. If you look at the specification document, it just says last updated on this date. And that's kind of the version number, but you have all these kinds of different HTML5 parsers also in userland PHP that all adhere to a different point in time of the specification.

It also complicates how we need to handle this in PHP because let's say that they relax some parsing rules about some particular elements — I know that there's talk about relaxing some parsing rules regarding form elements — so let's say that gets implemented. If we implement these changes in PHP, then we break backwards compatibility because people may rely on the old behavior.

But doesn’t it also affect you if you're a front-end developer?

Yeah, it affects everyone actually.

I like your decision to have separate DOM classes in PHP 8.4. There must be so much code relying on the old DOM.

Yeah, this is actually something that I ran into. At some point I tried to fix these semantic bugs — not crash bugs but just incorrect behavior. I tried to fix these and people started complaining that, well, now it broke my code.

So I reverted the changes and then other people started complaining, well, now we don't get the fix.

I remember the DOM API used to have inElementWhitespace or isElementWhitespace or some kind of confusing name.[2] I’d made use of that and then I noticed it's not available in the new DOM API; they removed it in the living standard.

Yeah.

PHP aside — because in PHP you know there's a transition to a new DOM API — but how would Mozilla deal with that? Would they also remove it from their DOM API or would they try and maintain some backward compatibility?

I don't know the answer for Mozilla but I know the answer for Chrome. If you visit a webpage in Chrome, it also tracks some statistics about which APIs are used.

I think this information is actually public. The Chrome developers actually have information about which DOM API is used how many times. And then as far as I know, an API is marked as deprecated if its usage drops below a certain percentage, like 0.0 whatever, then they can consider it for removal. And they can't consider it for removal if its usage is too high.

Serenity OS

So your interest in the DOM started from looking at bug reports?

Yeah, I saw these bugs, I saw that no one was fixing them and I was like, well, I think I can do it. I also follow this YouTube channel — I don't know if you know the person but his name is Andreas Kling.

I've heard of the name.

He made the Serenity OS project and out of that came a web browser, it's called Ladybird. It's fully open source, not based on any pre-existing engine. And he often records videos about how he approaches implementing HTML and DOM stuff. And that also sparked my curiosity and that's also kind of-

He's doing that from scratch?

He's implementing everything from scratch, although he's not alone anymore. There are many people that contribute to the Ladybird project.

By seeing that project I also got interested in doing DOM stuff and seeing all these DOM bugs I was like, well, this is actually a perfect opportunity to explore some of that HTML and DOM stuff myself.

PHP 8.4 DOM Update

Was there already a plan to update the DOM classes in some way by others in the PHP team?

Nobody planned this. The only reason it actually happened is because I had, in July 2023, all these improvements already to fix DOM bugs. It was at that point pretty much crash-free. Much more stable than it ever had been. And then someone complained on Mastodon about the lack of modern DOM features, and he gave a list. I thought, well, I can probably implement those. But then one of the bullet points is the lack of HTML5 support. At that point I wasn't even sure that was possible to add because of how old the DOM code was.

So I just started experimenting with it. And then I showed some other people within the PHP community and got some feedback from them. Once the feedback was positive and I knew what the way forward was, I volunteered to implement HTML5 support. And if that was successful, I was like, okay, I'll try to spend some time making the API compliant. So I basically just volunteered to do it myself. Of course I got a lot of feedback from people in the community, but in terms of the programming effort, it was all on me, in my spare time.

How did you come across Lexbor?[3]

Kind of by coincidence. I was looking for suitable parsers to implement in PHP, which is actually quite problematic to find, because most of these compliant parsers are in browsers, or are in big projects, and they can't easily be decoupled from them. For example, we had one person saying, why don't you just use the parser from Firefox? And that's not possible, because it's so tightly coupled to the Firefox code base, I can't take it out.

There were some candidates, like Gumbo, which is a parser made by Google, but unfortunately, it has not been maintained since 2018 or something like that.

There was also the parser of Servo. Servo is a research project by Mozilla to implement a browser engine in Rust. The HTML parser could have also been used in theory, but it lacked some of the encoding support that's required by the HTML spec. Also, implementing the Rust library into the large C code base that PHP is, was not going to be easy at all.

I searched further and came across Lexbor purely by coincidence, and it was a very well-tested library. I also saw that it was used in some Python libraries and some D and Crystal libraries. So I knew it was mature enough to give it a shot. I prototyped the initial version in like a week and a half or so, and I thought, okay, this is probably the way to go.

HTML Parsing

In one of the Reddit posts about the PHP DOM update, there was a thread about how HTML is a bit of a mess.

Ah, yes.

There was a debate about why HTML parsers have to be very forgiving, making the parser spec very complicated, when nothing else is as forgiving.

Take any random webpage, put it into an HTML validator, and see if it validates. You'll see that 95% of the web doesn't validate.

Why does it have to be forgiving? Because almost all webpages would be broken otherwise. That's what complicates HTML parsing. For everything that can go wrong, the parsing specification says what should happen. And it's sometimes very complicated because it tries to automatically fix the mistake of the developer, and change the document in such a way that it's probably what the developer intended. There are a lot of these complicated algorithms to make that happen, and it's also a constant source of parser differentials. I don't know if you're familiar with parser differentials?

No.

So I will give an example from a security point of view. Let's say that you're a content management system, and a user writes some comments into a blog post. You want to sanitize the HTML that the user provides. If you don't, you run the risk of XSS.[4]

How do you sanitize the user comment? Well, there are some allowed tags, some tags that are not allowed, and you want to filter those out. If the parser at the server side handles some edge cases differently than on the client side, it's possible that some forbidden tags might slip through. The server may think, okay, this HTML is sane, nothing can go wrong, but if you give it to the client, it may actually do something unexpected.

So those are parser differentials.

And in general, every time there's a difference in how a client parses something, and how a server parses something, there's bound to be problems.

There's a recent paper, published in IEEE Security and Privacy, that talks about this exact problem.[5]
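As a rough illustration of the sanitizing step Niels describes, here is a minimal sketch of the server-side half: parse the user's comment with PHP 8.4's HTML5 parser and report any tags that are not on an allowlist. The helper name disallowedTags() and the allowlist are hypothetical, and this is not a complete sanitizer (it ignores attributes, URLs and much else). It only shows where the server forms its view of the markup, which is the view that must agree with the browser's for the check to mean anything.

```php
<?php
// Hypothetical helper: returns the tag names found in $fragment
// that are not present in the $allowed list.
function disallowedTags(string $fragment, array $allowed): array
{
    // Wrap the fragment so it is parsed as body content.
    // LIBXML_NOERROR silences parse-error warnings from untidy input.
    $dom = Dom\HTMLDocument::createFromString(
        '<!DOCTYPE html><html><body>' . $fragment . '</body></html>',
        LIBXML_NOERROR
    );

    $found = [];
    foreach ($dom->querySelectorAll('body *') as $element) {
        $name = $element->localName; // lowercase for HTML elements
        if (!in_array($name, $allowed, true)) {
            $found[$name] = true;    // use keys to avoid duplicates
        }
    }

    return array_keys($found);
}

// Example comment: the img element is not on the allowlist.
$comment  = '<p>Nice post! <b>Thanks</b><img src="x" onerror="alert(1)"></p>';
$rejected = disallowedTags($comment, ['p', 'b', 'i', 'a']);

echo implode(', ', $rejected), "\n";
```

If the browser later parses the same comment differently from the server, an element the server never saw can still appear on the page. Using a spec-compliant parser on the server narrows that gap, but should not be assumed to close it.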

Big thanks to Niels for sharing his story and insights, and for dedicating so much time to improving PHP's DOM implementation for the rest of us.

I'll be continuing my look at the new PHP DOM API soon. If you missed the first post about this, you can read it here: Parsing HTML with PHP 8.4.

[1] The W3C used to publish specs with version numbers. The WHATWG advocated for a faster-moving living standard, which eventually won out.

[2] isElementContentWhitespace() and isWhitespaceInElementContent()

[3] Lexbor is the HTML5 parser used in PHP 8.4’s new DOM API.

[4] Cross-site scripting attacks.

[5] Parse Me, Baby, One More Time: Bypassing HTML Sanitizer via Parsing Differentials [PDF]

Parsing HTML with PHP 8.4

Keyvan — Mon, 09 Dec 2024 04:19:30 GMT

Update: An expanded version of this article was published in issue 14 of PHP Magazine on 19 January 2026.

PHP 8.4, released last month, brings three major improvements to HTML parsing, DOM traversal and manipulation:

A new HTML5 parser that accurately processes modern web content
Powerful CSS selector support for element retrieval
New DOM classes that better align with the DOM spec

For developers working with web scraping, content extraction, or HTML transformation, these are significant improvements in functionality and performance.

These features haven't received as much attention as they deserve in the PHP 8.4 release coverage. And there’s still very little documentation on the PHP website. Having recently begun updating the PHP port of Mozilla’s Readability to use these new features, I wanted to share more information.

Technical Foundation

At the core of these improvements is Lexbor, a C-based HTML parser created by Alexander Borisov. It provides fast, standards-compliant HTML parsing and CSS selector support. It’s now included in PHP 8.4's official DOM extension, which comes enabled by default — no extra configuration needed.

The new DOM classes follow the DOM spec more closely. If you're familiar with DOM traversal and manipulation in JavaScript, you'll find many familiar methods and properties now available in PHP, including querySelector and querySelectorAll.

The Old Way: Parsing with libxml

PHP has previously relied on libxml for parsing both XML and HTML. Unfortunately libxml struggles with modern HTML, and many pages get mangled by the parser. Let's look at a simple example that demonstrates the problem.

Consider a valid HTML5 document titled “Valid HTML5 Document” that contains two paragraphs, “Paragraph 1” and “Paragraph 2”, with a script element between them. The script logs a string to the console.

When trying to parse this document and count its paragraphs, PHP finds three elements, not two (try it):

$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHtml($html);
$paragraphs = $dom->getElementsByTagName('p');
echo "{$paragraphs->length} paragraphs found.";
// Output: 3 paragraphs found.

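Why does it find three paragraphs instead of two? The presence of a closing tag inside the script element's string trips up the libxml parser. Instead of treating it as text within the script, libxml interprets it as a closing HTML tag. When we serialize the resulting DOM back to HTML, we can see how the document was mangled: the script is cut short at that point, and the rest of its string, “Console log text");”, spills into the document as ordinary content just before Paragraph 2, which accounts for the third paragraph.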

To work around these limitations, many developers have turned to alternative parsers. HTML5-PHP is popular, but it’s written in PHP rather than C, making it noticeably slower than libxml. It’s also unclear how much effort has been put in to keep up with the HTML living standard (more on that below).

The New Way: Parsing with Lexbor

PHP 8.4 solves these parsing challenges with its new HTML5 parser. Let's parse the same HTML with the new parser (try it):

$newDom = Dom\HTMLDocument::createFromString($html);
$paragraphs = $newDom->getElementsByTagName('p');
echo "{$paragraphs->length} paragraphs found.";
// Output: 2 paragraphs found.

The parser now correctly identifies two paragraphs. You can try running both the old and the new parser here.
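To run the comparison locally, the sketch below rebuilds a test document along the lines described above. The exact markup is an assumption, in particular the closing html tag placed inside the script's string, so treat it as a stand-in for the sample rather than a copy of it. The old parser's count is printed but not relied on, since it may depend on the libxml2 version PHP is built against.

```php
<?php
// Assumed reconstruction of the sample document; the original markup may differ.
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<head><title>Valid HTML5 Document</title></head>
<body>
<p>Paragraph 1</p>
<script>console.log("</html>Console log text");</script>
<p>Paragraph 2</p>
</body>
</html>
HTML;

// Old parser (libxml). LIBXML_NOERROR silences parse warnings.
$oldDom = new DOMDocument();
$oldDom->loadHTML($html, LIBXML_NOERROR);
$oldCount = $oldDom->getElementsByTagName('p')->length;

// New HTML5 parser (Lexbor), available from PHP 8.4.
$newDom = Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);
$newCount = $newDom->getElementsByTagName('p')->length;

echo "Old parser: {$oldCount} paragraphs found.\n";
echo "New parser: {$newCount} paragraphs found.\n";
```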

According to Niels Dossche, who is responsible for these new additions, performance is comparable to libxml parsing, if not a little faster.

Lexbor vs. HTML5-PHP

For current HTML5-PHP users, switching to the new DOM API and parser offers some advantages.

Performance

Lexbor, written in C, should perform much better than HTML5-PHP. In my tests Lexbor was 3.6 times faster when processing HTML pages containing blog posts and news articles. According to Niels, the speed advantage should become even more pronounced when processing larger HTML documents.

Standards compliance

The HTML specification is a living standard that continuously evolves, and parsers can vary in their implementation of current standards.

HTML5-PHP was started in 2013, and its README still references a 2012 version of the W3C HTML5 standard. Lexbor was started in 2018, based on the newer WHATWG standard, which is now the sole publisher of the HTML standard. So Lexbor is likely closer to the current standard than HTML5-PHP.

It’s also worth noting that HTML5-PHP currently relies on PHP's old DOM classes which don't support the improved features of PHP's new DOM API covered in the rest of this article.

Working with the New DOM Classes

For backward compatibility, PHP 8.4 introduces new DOM classes alongside the existing ones. This means you can continue using DOMDocument if needed, or even use both old and new classes in the same codebase.

Here's how to get started:

$dom = Dom\HTMLDocument::createFromString($html);

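The new classes follow a simpler naming convention under the Dom namespace: the DOM prefix is dropped and the class moves into the namespace, so DOMNode becomes Dom\Node and DOMElement becomes Dom\Element. DOMDocument is split into Dom\HTMLDocument and Dom\XMLDocument, which share the Dom\Document base class.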

Top-Level HTML Elements as DOM Properties

You can now access the main parts of an HTML document through these convenient Dom\Document properties:

head (read only)
“The first head element that is a child of the html element. These need to be in the HTML namespace. If no element matches, this evaluates to null.”
body
“The first child of the html element that is either a body tag or a frameset tag. These need to be in the HTML namespace. If no element matches, this evaluates to null.”
title
“The title of the document as set by the title element for HTML or the SVG title element for SVG. If there is no title, this evaluates to the empty string.”

Example:

$dom = Dom\HTMLDocument::createFromString('My document');
echo $dom->saveHtml($dom->body);
// Output: <body>My document</body>
$dom->title = 'My title';
echo $dom->saveHtml($dom->head);
// Output: <head><title>My title</title></head>

Working with innerHTML

PHP 8.4 also introduces innerHTML, a property that provides an easier way to work with an element's content. Instead of manipulating DOM nodes directly, you work with HTML strings (try it):

$dom = Dom\HTMLDocument::createFromString('<p>Test</p>');
echo $dom->body->innerHTML;
// Output: <p>Test</p>
$dom->body->innerHTML = '<p>Something new</p>';
echo $dom->saveHtml();
// Output: <html><head></head><body><p>Something new</p></body></html>

Note that there is no outerHTML support yet.
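Until then, passing an element to saveHtml() gives the equivalent of its outer HTML, as the head and body example earlier already does. A minimal sketch, with made-up markup and an assumed id of "intro":

```php
<?php
$dom = Dom\HTMLDocument::createFromString(
    '<!DOCTYPE html><p id="intro">Hello <b>world</b></p>',
    LIBXML_NOERROR
);

$paragraph = $dom->querySelector('#intro');

// innerHTML serializes only the element's children.
$inner = $paragraph->innerHTML;

// saveHtml() with a node argument serializes the element itself as well.
$outer = $dom->saveHtml($paragraph);

echo $inner, "\n";
echo $outer, "\n";
```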

Modern CSS Selector Support

One of the most powerful additions in PHP 8.4 is comprehensive support for modern CSS selectors. You can now use querySelector and querySelectorAll to find elements using the same selectors you're familiar with from frontend development:

querySelector($selectors)
“Returns the first descendant element that matches the CSS selectors”
querySelectorAll($selectors)
“Returns a NodeList containing all descendant elements that match the CSS selectors”

Here’s the previous code for getting paragraphs, but with querySelectorAll replacing getElementsByTagName:

$newDom = Dom\HTMLDocument::createFromString($html);
$paragraphs = $newDom->querySelectorAll('p');
echo "{$paragraphs->length} paragraphs found.";

This produces the same result as before, which is not very remarkable on its own. But the new selector support enables much more sophisticated queries. Let's explore some practical examples:

Find Multiple Element Types

Get all paragraph and heading elements — returned in document order:

$elements = $dom->querySelectorAll('p, h1, h2, h3, h4, h5, h6');
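To see what document order means here, this small sketch runs the same selector list against made-up markup in which the headings and paragraphs are interleaved. The results follow their position in the document, not the order of the selectors in the list.

```php
<?php
// Sample markup is made up for illustration.
$dom = Dom\HTMLDocument::createFromString(
    '<!DOCTYPE html><h2>Sub</h2><p>One</p><h1>Main</h1><p>Two</p>',
    LIBXML_NOERROR
);

$names = [];
foreach ($dom->querySelectorAll('p, h1, h2, h3, h4, h5, h6') as $element) {
    // Record each matched tag name in the order it is returned.
    $names[] = $element->localName;
}

echo implode(', ', $names), "\n";
```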

Avoid repetition with :is and :where

Get paragraphs and main headings that are direct children of the article:

$elements = $dom->querySelectorAll('article > :is(p, h1, h2)');

You can also narrow your search to specific elements:

$elements = $dom->querySelector('article')->querySelectorAll('p, h1, h2');

Note that this is not technically equivalent to the earlier code, because we’re not limiting results to direct children only. To do that we’d need to use the :scope selector, which Lexbor doesn't yet support:

$dom->querySelector('article')->querySelectorAll(':scope > :is(p, h1, h2)');
// Throws: DOMException: Invalid selector (Selectors. Not supported: scope)

The good news is that a fix for this issue, contributed by Niels, is currently under review in Lexbor.
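In the meantime, one possible workaround is to run the broader query and keep only the matches whose parent is the article itself. This is a sketch with made-up markup, not the only way to do it:

```php
<?php
// Sample markup is made up: one paragraph is nested inside a div.
$html = '<!DOCTYPE html><article><h1>Title</h1><p>Intro</p>'
      . '<div><p>Nested</p></div></article>';
$dom = Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);

$article = $dom->querySelector('article');

$directChildren = [];
foreach ($article->querySelectorAll('p, h1, h2') as $element) {
    // Keep only elements that sit directly under the article.
    if ($element->parentNode->isSameNode($article)) {
        $directChildren[] = $element->textContent;
    }
}

print_r($directChildren);
```

The nested paragraph is matched by the selector but dropped by the parent check, which is the same filtering the :scope version would perform.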

Find empty or non-empty elements with :empty and :not

Get all empty p elements:

$elements = $dom->querySelectorAll('p:empty');

Get all non-empty p elements:

$elements = $dom->querySelectorAll('p:not(:empty)');

Match parent or previous sibling elements with :has

Get all paragraphs in article that have at least one link inside them:

$elements = $dom->querySelectorAll('article p:has(a)');

Get h1 headings that are followed immediately by an h2 heading:

$elements = $dom->querySelectorAll('h1:has(+ h2)');

Attribute selectors

Get all external links — URLs starting with “http” and not containing “example.com”, case insensitive:

$elements = $dom->querySelectorAll(
    'a[href ^= "http" i]:not([href *= "example.com" i])'
);
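
To see the external-link selector at work, here is a small self-contained sketch, assuming PHP 8.4 or later. The three links are invented test data:

```php
<?php
// Sketch: run the external-link selector against a tiny document.
$html = '<!DOCTYPE html><html><body>'
    . '<a href="https://example.com/about">Internal, absolute</a>'
    . '<a href="/contact">Internal, relative</a>'
    . '<a href="HTTPS://Other.org/">External</a>'
    . '</body></html>';

$dom = Dom\HTMLDocument::createFromString($html);

$externalLinks = $dom->querySelectorAll(
    'a[href ^= "http" i]:not([href *= "example.com" i])'
);

foreach ($externalLinks as $link) {
    echo $link->getAttribute('href'), "\n";
}
```

Bear in mind that this is plain substring matching: a host such as notexample.com would also be excluded, so for anything stricter you would want to parse the URL.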

For more examples of available selectors, you can refer to MDN's documentation on CSS selectors and combinators, and PHP’s selectors tests folder.

Update

This article was updated on 11th December 2024 based on feedback from Niels Dossche, and again on 25th December with a link to my interview with Niels.

Part 2…

An expanded version of this article is now available in issue 14 of PHP Magazine. It covers:

XPath selectors
Namespaces
Serialisation — turning the DOM tree back into HTML
And the small differences between the old and new DOM APIs

Credit

Huge thanks to Niels Dossche, both for introducing these fantastic new changes to PHP, and also for providing valuable feedback on this article.

And also a huge thanks to Alexander Borisov, who is the creator of Lexbor. Lexbor is not only responsible for HTML parsing in this PHP release, but also its CSS selector support.

Coming soon

Keyvan — Mon, 20 Mar 2023 20:59:04 GMT

I haven’t posted on the blog in years. Planning to start posting again soon…

The mask of care and love

Keyvan — Wed, 17 Jul 2013 12:40:53 GMT

John McKnight on the service provider’s mask of care and love:

Behind that mask is simply the servicer, his systems, techniques and technologies – a business in need of markets, an economy seeking new growth potential, professionals in need of an income.
It is crucial that we understand that this mask of service is not a false face. The power of the ideology of service is demonstrated by the fact that most servicers cannot distinguish the mask from their own face. The service ideology is not hypocritical because hypocrisy is the false pretence of a desirable goal. The modernized servicer believes in his care and love, perhaps more than even the serviced. The mask is the face. The service ideology is not conspiratorial. A conspiracy is a group decision to create an exploitative result. The modernized servicer honestly joins his fellows to create a supposedly beneficial result. The masks are the faces.
In order to distinguish the mask and the face it is necessary to consider another symbol – need. We say love is a need. Care is a need. Service is a need. Servicers meet needs. People are collections of needs. Society has needs. The economy should be organized to meet needs. In a modernized society where the major business is service, the political reality is that the central “need” is an adequate income for professional servicers and the economic growth they portend. The masks of love and care obscure this reality so that the public cannot recognize the professionalized interests that manufacture needs in order to rationalize a service economy. Medicare, Educare, Judicare, Socialcare and Psychocare are portrayed as systems to meet need rather than programmes to meet the needs of servicers and the economies they support.
Removing the mask of love shows us the face of the servicers who need income, and an economic system that needs growth. Within this framework, the client is less a person in need than a person who is needed. In business terms, the client is less the consumer than the raw material for the servicing system. In management terms, the client becomes both the output and the input. His essential function is to meet the needs of servicers, the servicing system and the national economy. The central political issue becomes the servicers’ capacity to manufacture needs in order to expand the economy of the servicing system.

Excerpt from his essay ‘Personalized Service and Disabling Help’.

Tuesday March 26, 2013

Keyvan — Tue, 26 Mar 2013 15:14:05 GMT

Term Extraction in PHP

Keyvan — Sun, 20 Jan 2013 15:01:36 GMT

The new version of the term extraction tool on fivefilters.org is now in PHP.

For anyone looking for a simple way to carry out term extraction on English text using PHP, here’s a snippet using the PHP port of Topia’s Term Extractor:

require 'TermExtractor/TermExtractor.php';

$text = 'Politics is the shadow cast on society by big business';

$extractor = new TermExtractor();
$terms = $extractor->extract($text);

// We're outputting results in plain text...
header('Content-Type: text/plain; charset=UTF-8');

// Loop through extracted terms and print each term on a new line
foreach ($terms as $term_info) {
  // index 0: term
  // index 1: number of occurrences in text
  // index 2: word count
  list($term, $occurrence, $word_count) = $term_info;
  echo "$term\n";
}

Chris Hedges: Assault on Gaza is Not a War, it is Murder

Keyvan — Sun, 18 Nov 2012 14:31:38 GMT

via Jonathan Cook

PHP DOMDocument replace DOMElement contents with HTML string

Keyvan — Wed, 14 Nov 2012 17:44:23 GMT

This is another StackOverflow answer I’m moving over to my blog.

AWinter asked:

Using PHP I’m attempting to take an HTML string passed from a WYSIWYG editor and replace the children of an element inside of a preloaded HTML document with the new HTML.
So far I’m loading the document identifying the element I want to change by ID but the process to convert an HTML to something that can be placed inside a DOMElement is eluding me.
$doc = new DOMDocument();
$doc->loadHTML($html);

$element = $doc->getElementById($item_id);
if(isset($element)){
    //Remove the old children from the element
    while($element->childNodes->length){
        $element->removeChild($element->firstChild);
    }

    //Need to build the new children from $html_string and append to $element
}

My answer:

If the HTML string can be parsed as XML, you can do this (after clearing the element of all child nodes):

$fragment = $doc->createDocumentFragment();
$fragment->appendXML($html_string);
$element->appendChild($fragment);

If $html_string cannot be parsed as XML, appendXML() will fail. In that case you’ll have to use loadHTML(), which is less strict, but it will add elements around the fragment that you will have to strip.
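
The loadHTML() route could look something like the sketch below. The helper name setInnerHtml() and the wrapper div are my own choices for illustration. It assumes the fragment is UTF-8 and does not contain an unbalanced closing div tag:

```php
<?php
// Hypothetical helper: replace an element's children with a parsed HTML string.
function setInnerHtml(DOMElement $element, string $html): void
{
    // Remove the old children from the element.
    while ($element->firstChild) {
        $element->removeChild($element->firstChild);
    }

    // Parse the fragment in a temporary document. The meta tag tells libxml
    // the input is UTF-8; the div is a known wrapper we can find afterwards.
    $tmp = new DOMDocument();
    $tmp->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
        . '<body><div>' . $html . '</div></body></html>',
        LIBXML_NOERROR | LIBXML_NOWARNING
    );

    // Strip the added elements by copying only the wrapper's children across.
    $wrapper = $tmp->getElementsByTagName('div')->item(0);
    foreach ($wrapper->childNodes as $child) {
        $element->appendChild($element->ownerDocument->importNode($child, true));
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><div id="content"><p>Old</p></div></body></html>');

$target = $doc->getElementById('content');
setInnerHtml($target, '<p>New <em>content</em></p>');

$result = $doc->saveHTML($target);
echo $result, "\n";
```

importNode() copies each node into the target document, so the temporary document can simply be discarded afterwards.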

Unlike PHP, Javascript has the innerHTML property which allows you to do this very easily. I needed something like it for a project so I extended PHP’s DOMElement to include Javascript-like innerHTML access.

With it you can access the innerHTML property and change it just as you would in Javascript:

echo $element->innerHTML;
$element->innerHTML = 'example';

Clean up HTML on paste in CKEditor

Keyvan — Tue, 13 Nov 2012 13:52:56 GMT

We use CKEditor at FiveFilters.org for our PastePad service. The idea is to allow users to paste content that’s not currently publicly available on the web for processing with one of our web tools. This can be content that’s in a Word document, an email, or behind a paywall.

CKEditor can automatically clean up HTML it identifies as coming from MS Word, but there’s no way to force cleanup on all pasted content. By default, HTML cleanup occurs in the following two cases:

User clicks the ‘paste from word’ toolbar icon
User pastes content copied from MS Word itself

In the second case, CKEditor looks for signs of MS Word formatting. It does this by testing whatever you paste against the following regular expression:

/(class=\"?Mso|style=\"[^\"]*\bmso\-|w:WordDocument)/

If there’s a match, it will be cleaned up. Otherwise it will paste as normal.

I want to avoid editing core files, so my solution is simply to ensure that this regular expression always matches pasted content. Here’s what I’ve come up with:

CKEDITOR.on('instanceReady', function(ev) {
    ev.editor.on('paste', function(evt) {
        // Prepend a comment that matches the Paste from Word regular expression
        evt.data['html'] = '<!--class="Mso"-->' + evt.data['html'];
    }, null, null, 9);
});

I haven’t tested extensively, but this appears to work as expected (CKEditor 3.6.2). You can try it out.

The code registers a new listener for the paste event, just like the Paste from Word plugin. When it receives the pasted HTML, it simply prepends an HTML comment containing one of the strings the Paste from Word plugin looks for. The listener has a priority of 9 to ensure it runs before the plugin's own listener (default priority of 10), which triggers the actual cleaning.

Note: I posted this solution on StackOverflow as an alternative to another solution, titled “CKEditor – use pastefromword filtering on all pasted content.” StackOverflow recently deleted some of my answers (and hid them from me) so I’m moving the rest of my meagre contributions over to my own blog.

Push to Kindle e-mail service

Keyvan — Mon, 29 Oct 2012 00:14:16 GMT

Push to Kindle, FiveFilters.org’s web service for sending web articles to your Kindle, can now also be used by e-mail. The email service is aimed at iPad and iPhone users.

Here’s a video showing you how to use it on your iPad or iPhone:

Step by step

On your device, load an article you’d like to send to your Kindle
Choose share page
In the list of options presented, select Mail
Enter your Kindle email address but instead of @kindle.com, enter @pushtokindle.com
Send!

Changing the ending to @pushtokindle.com in step 4 ensures our service processes the article first and then sends it to your Kindle account.

The first time you do this, you’ll receive an email from FiveFilters.org asking you to confirm the address you’re sending from. After confirming, you’ll have the opportunity to save your Push to Kindle email address in your contacts list to make future sending easier. (Simply typing ‘kin’ in to the To: field should show your Push to Kindle address as an option.)

If you own a 3G Kindle device and you want to make sure you will not be charged by Amazon, please send to @free.pushtokindle.com. (For the time being we are only sending to @free.kindle.com, but this might change in future.)

Why an e-mail service?

We already have a Push to Kindle Android app. It adds ‘Push to Kindle’ as an entry in your device’s share menu, so whenever you want to send a web article to your Kindle, you bring up the share menu and choose Push to Kindle.

We considered doing the same for iOS and other mobile devices, but decided to focus on email for two reasons:

Unlike Android, iOS and Windows Phone operating systems do not yet allow apps to add entries to the share menu.
The share menu on most mobile devices does, however, include e-mail as an option

Pricing

The first 25 articles processed by our e-mail service are free. After that you’ll be asked to purchase credits, which allows us to maintain the service.

100 credits cost 1.5€ (around £1.20 or $2)

Each article sent uses 1 credit. You will receive an email notice when your credits are low.

Note: credits are linked to the email address you send from, not your Kindle address.

Compared to Amazon’s email service

Amazon’s Send to Kindle email service currently works by accepting documents as attachments to an email message.

Web articles you read online are usually not in a format that can be sent to your Kindle account directly. They need to be cleaned up and converted to a suitable format first. That’s what our Push to Kindle service does. We take care of extracting the content and converting the article to a suitable format for your Kindle. We then send the result as an attachment to your Kindle account.

Bear in mind

We’re working to integrate this service with our sustainer membership. Once that’s done this service will be free for new and existing sustainers.

All articles are currently considered equal: 1 credit = 1 article. In the future this may change. For example, in line with our goal to encourage use of non-corporate sources, we’ll be whitelisting many non-corporate sources so no credits will be used if you process articles from these sources. Conversely, we may deduct more credits for articles originating from corporate sources.

Please consider this an experimental service. Let us know if you experience any issues and we’ll be happy to help. Email help@fivefilters.org.

Full-Text RSS 3.0

Keyvan — Wed, 05 Sep 2012 14:03:37 GMT

Full-Text RSS 3.0 is now available.

What is it?

Full-Text RSS is a free software PHP application to help you extract content from web pages. It can extract content from a standard HTML page and return a 1-item feed or it can transform an existing feed into a full-text feed.

It’s used primarily by news enthusiasts and developers.

It’s used by news enthusiasts who dislike partial web feeds – feeds which require them to read the full story on a different site, rather than their preferred application. Full-Text RSS can convert these feeds to full-text versions, allowing the reader to stay in his/her preferred environment to read the full story.

It’s used by developers building applications which need an article extraction component. It allows developers to retrieve and process only the content they’re interested in.

Demo

Try it out – enter a URL in the form and hit ‘Create Feed’.

What’s new in 3.0

Extraction

Multi-page support

Many web sites now split their articles into a number of pages. In earlier versions of Full-Text RSS we’d added support for retrieving the single-page view and extracting content from that page. For sites which do not offer such a single-page view, we can now follow the ‘next page’ links and build up the full article page by page.

Multi-page support currently works by specifying a next_page_link in the site config file associated with the website you are extracting from.

Examples:

next_page_link: //a[@id='next-page']
next_page_link: //a[contains(text(), 'Next page')]

HTML5 parser: html5lib

By default we still rely on PHP’s fast libxml parser. For sites where this proves problematic, you can now specify html5lib – a PHP implementation of a HTML parser based on the HTML5 spec.

Example:

parser: html5lib

Better AJAX handling

Full-Text RSS does not interpret any Javascript it comes across when fetching pages. To get at the content, we expect it to be marked up in HTML. Some sites have started relying on the user’s browser and its Javascript support to load page content. For pages which load content in this way, Google suggests that the publisher also offers the content in plain HTML so Google’s search engine crawlers can access it. Google’s spec contains two possible triggers which will guide Google’s crawlers to the HTML version.

The first trigger appears in the URL. These URLs are often called ‘hashbang’ URLs. Example: https://twitter.com/#!/search-home

The second trigger can appear in the HTML header. Example:

<meta name="fragment" content="!">

When encountered, these triggers will result in a new URL being generated, what Google terms an ‘Ugly URL’. The new URL will contain additional query string parameters to indicate to the server that the plain HTML version is being requested.

Earlier versions of Full-Text RSS looked for the first trigger (‘hashbang’ in the URL) but not the second trigger. Full-Text RSS 3.0 now handles both.
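
To make the URL rewriting concrete, here is a rough sketch of how both triggers could be turned into an ‘Ugly URL’ using the _escaped_fragment_ parameter from Google’s scheme. The function name and the boolean flag are hypothetical, and this is not Full-Text RSS’s actual code. It also escapes the fragment with rawurlencode(), which encodes more characters than Google’s spec strictly requires:

```php
<?php
// Hypothetical helper: build the 'Ugly URL' for a page using either trigger.
// $hasFragmentMeta says whether the second trigger was found in the HTML header.
function toUglyUrl(string $url, bool $hasFragmentMeta = false): string
{
    $pos = strpos($url, '#!');
    if ($pos !== false) {
        // First trigger: everything after '#!' moves into the query string.
        $base = substr($url, 0, $pos);
        $fragment = substr($url, $pos + 2);
    } elseif ($hasFragmentMeta) {
        // Second trigger: no hashbang, so the parameter is sent empty.
        $base = $url;
        $fragment = '';
    } else {
        return $url; // no trigger, nothing to rewrite
    }

    $separator = str_contains($base, '?') ? '&' : '?';

    return $base . $separator . '_escaped_fragment_=' . rawurlencode($fragment);
}

echo toUglyUrl('https://twitter.com/#!/search-home'), "\n";
echo toUglyUrl('http://example.org/page', true), "\n";
```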

Site config extraction patterns updated

Site config files are used to fine-tune extraction where autodetection doesn’t always work. There are now over 700 site config files. Many old ones have been updated and new ones added.

We also now look for OpenGraph title and date elements.

Developers

Cross-origin resource sharing (CORS) support

If Full-Text RSS is hosted on a different domain to your application, enabling CORS will allow your application to request JSON results from Full-Text RSS directly from the user’s browser without being blocked by the browser’s same origin policy.

To enable CORS, look at $options->cors in the config file.

JSONP support

The old way of circumventing the browser’s same origin policy was to use JSONP. You can do this by requesting JSON (&format=json) with an additional callback function (&format=json&callback=functionName).

Global site config

The global site config accepts everything a regular site config file does, but it’s applied to all sites, whether or not a specific site config matches.

The global site config file should be named global.txt and placed inside the relevant site_config/ subfolder.

Site config merging

Site config files are used to fine-tune extraction where autodetection doesn’t always work.

Previous versions of Full-Text RSS looked for site config files in the following order:

URL hostname match or wildcard match in the site_config/custom/
URL hostname match or wildcard match in the site_config/standard/
fingerprint match (HTML fragment mapping to hostname) in site_config/custom/
fingerprint match (HTML fragment mapping to hostname) in site_config/standard/

As soon as an entry was matched, we’d process it, return it, and stop looking.

In Full-Text RSS 3.0, we follow the same order, but continue looking even if there’s a match. We build up the site config by appending any new entries we find. In addition, we also look for and combine global site config files:

global rules in site_config/custom/global.txt
global rules in site_config/standard/global.txt

To prevent this behaviour, you can enter autodetect_on_failure: no in the site config file. This will end the chain. The config files before and including this one will be loaded and merged, but no others.
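
The merging behaviour can be pictured roughly like this. It is a simplified sketch, not the code Full-Text RSS uses: each config file is represented as a plain array of directive names mapped to lists of values, and the file names in the comments are hypothetical.

```php
<?php
// Sketch: merge site configs in lookup order, stopping after a file that
// contains "autodetect_on_failure: no".
function mergeSiteConfigs(array $configsInLookupOrder): array
{
    $merged = [];
    foreach ($configsInLookupOrder as $config) {
        foreach ($config as $directive => $values) {
            if ($directive === 'autodetect_on_failure') {
                continue; // control directive, handled after the loop
            }
            // Append new entries after those from earlier files.
            $merged[$directive] = array_merge($merged[$directive] ?? [], $values);
        }
        // This file is still merged, but it ends the chain.
        if (($config['autodetect_on_failure'] ?? null) === ['no']) {
            break;
        }
    }
    return $merged;
}

$merged = mergeSiteConfigs([
    // site_config/custom/example.org.txt (hypothetical)
    ['body' => ["//div[@id='story']"]],
    // site_config/standard/example.org.txt (hypothetical)
    ['body' => ['//article'], 'title' => ['//h1'], 'autodetect_on_failure' => ['no']],
    // site_config/custom/global.txt (never reached in this example)
    ['strip' => ["//div[@class='ad']"]],
]);

print_r($merged);
```

Here the first two files are combined, while the global file is ignored because the second file ended the chain.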

XSS filtering

We have not enabled XSS filtering by default because we assume the majority of our users do not display the HTML retrieved by Full-Text RSS in a web page without further processing. If you subscribe to our generated feeds in your news reader application, it should, if it’s good software, already filter the resulting HTML for XSS attacks, making it redundant for Full-Text RSS to do the same. Similarly with frameworks/CMS which display feed content – the content should be treated like any other user-submitted content.

If you are writing an application yourself which is processing feeds generated by Full-Text RSS, you can either filter the HTML yourself to remove potential XSS attacks or enable this option. This might be useful if you are processing our generated feeds with JavaScript on the client side – although there’s client-side XSS filtering available too, e.g. JsHtmlSanitizer.

If enabled, we’ll pass retrieved HTML content through htmLawed with safe flag on and style attributes denied, see htmLawed’s readme.

Note: if enabled this will also remove certain elements you may want to preserve, such as iframes.

Site config editor

Full-Text RSS 3.0 now comes with a site config editor available in the admin area (accessible via the admin/ folder). This lets you find, edit, and test existing site config files, or add new ones.

Note: We suggest you make changes to the site config files using a local installation of Full-Text RSS and upload the results to your server when ready. Site config files are simple text files stored on disk. Cloud hosting environments do not always offer persistent file storage, so changes made to a hosted copy on such environments may be lost.

Debug mode

Debug mode allows you to see what happens behind the scenes when Full-Text RSS is running. This is useful if you want to see things such as:

URL redirects
Which site config files are loaded
Whether the single_page_link and next_page_link expressions match
Which XPath expressions end up matching title, body, date, author

Performance

Site config caching in APC

If you run Full-Text RSS in a hosting environment which has APC enabled, it can take advantage of APC’s user cache – a memory cache. If enabled we will store site config files (when requested for the first time) in APC’s user cache – avoiding disk access on subsequent requests. See $options->apc in the config file to enable. Keys in APC are prefixed with ‘sc.’

Note: $options->apc has no effect if APC is unavailable on your server.

Smart cache (experimental)

If you enable caching and APC, you can also try out the experimental smart cache. The intention here is, again, to reduce disk access. With this enabled we will not write Full-Text RSS’s results to disk straight away, instead we’ll store the generated cache key in APC’s user cache for 10 minutes. If a subsequent request comes in matching the cache key, we’ll write the result to disk. Requests after that matching the cache key will be loaded from disk. See $options->smart_cache in the config file to enable. Keys in APC are prefixed with ‘cache.’

Note: this has no effect if APC is disabled or unavailable on your server, or if you have caching disabled.
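
The idea behind the smart cache is easier to follow in code. In the sketch below, plain arrays stand in for APC’s user cache and for the disk cache so that it runs anywhere. The class and method names are hypothetical and not taken from Full-Text RSS:

```php
<?php
// Sketch: only write a result to "disk" when the same key is requested
// a second time within 10 minutes.
final class SmartCacheSketch
{
    private const TTL = 600; // 10 minutes, in seconds

    /** @var array<string,int> cache key => expiry time (stand-in for APC) */
    private array $recentKeys = [];

    /** @var array<string,string> cache key => result (stand-in for disk) */
    private array $disk = [];

    public function fetch(string $key, callable $generate, int $now): string
    {
        if (isset($this->disk[$key])) {
            return $this->disk[$key]; // third and later requests
        }

        $result = $generate();

        if (($this->recentKeys[$key] ?? 0) > $now) {
            $this->disk[$key] = $result; // second request within the window
        } else {
            $this->recentKeys[$key] = $now + self::TTL; // first request
        }

        return $result;
    }

    public function isOnDisk(string $key): bool
    {
        return isset($this->disk[$key]);
    }
}

$cache = new SmartCacheSketch();
$calls = 0;
$generate = function () use (&$calls): string {
    $calls++; // count how often the feed is actually generated
    return '<rss>...</rss>';
};

$cache->fetch('feed-a', $generate, 1000); // first request: key remembered only
$cache->fetch('feed-a', $generate, 1060); // second request: written to "disk"
$cache->fetch('feed-a', $generate, 1120); // third request: served from "disk"

echo $calls, "\n"; // prints 2
```

URLs that are only ever requested once never touch the disk, which is the saving this option is after.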

Cloud ready

Host for free on AppFog

AppFog offer users free hosting with 2GB RAM. That’s more than enough to run Full-Text RSS for most users.

To get started:

Create a free account
Install the AppFog command-line client (af)
Change into the Full-Text RSS folder
Type af push
Follow the prompts and you’re done.

Note: if you get a 701 error saying the URL has been taken, edit manifest.yml and comment out the lines starting with name: and url: by inserting a hash sign (#) at the beginning of each line. Save and try again. This time af will prompt you for an application name and URL.

Override config options with environment variables

Most of the config options in the config file can now be overridden with environment variables. When creating environment variables, use the option name prefixed with ‘ftr_‘. For example, to override $options->max_entries and limit the maximum to 2, create an environment variable with key ftr_max_entries and value 2.
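
As a rough illustration of how such an override could be applied, here is a sketch. The function name is hypothetical and the options are held in a plain array rather than the $options object, so treat it as a picture of the idea rather than the actual implementation:

```php
<?php
// Sketch: override default options with environment variables named
// 'ftr_' followed by the option name.
function applyEnvOverrides(array $options, string $prefix = 'ftr_'): array
{
    foreach ($options as $name => $default) {
        $value = getenv($prefix . $name);
        if ($value === false) {
            continue; // no environment variable set for this option
        }
        // Environment values are strings, so cast to the default's type.
        if (is_int($default)) {
            $value = (int) $value;
        } elseif (is_bool($default)) {
            $value = filter_var($value, FILTER_VALIDATE_BOOLEAN);
        }
        $options[$name] = $value;
    }
    return $options;
}

putenv('ftr_max_entries=2'); // normally set by the hosting environment

$options = applyEnvOverrides(['max_entries' => 10, 'cors' => false]);

echo $options['max_entries'], "\n"; // prints 2
```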

What didn’t make it

No monitored feeds

One feature which didn’t make this release is the ability to create monitored feeds with PubSubHubbub support. This was specifically to improve the speed with which generated feeds updated within Google Reader’s system. Unfortunately this feature is not yet ready – we’ve not had great results in our tests, so won’t be releasing until we’re happy.

Config options removed

The following config options were removed:

$options->restrict
$options->message_to_prepend_with_key
$options->message_to_append_with_key
$options->error_message_with_key
$options->alternative_url

No extraction with CSS selector

You can no longer specify what should get extracted with a CSS selector passed in the querystring.

Push to Kindle: some stats

Keyvan — Thu, 26 Jul 2012 01:09:26 GMT

Our Push to Kindle service has become quite popular since we launched. Over 25,000 people currently use our Chrome extension, 7,000 use the Firefox extension and over 2,000 have installed our Android app.

I recently decided to check how much of the content processed by our Push to Kindle service comes from corporate news sources. Here’s what I found:

#1 — nytimes.com — 2.62%
#4 — guardian.co.uk — 1.32%
#15 — bbc.co.uk — 0.51%
#48 — telegraph.co.uk — 0.22%
#97 — independent.co.uk — 0.11%

This is based on data collected over a period of 3 weeks.

I’m glad to see our users do not rely too much on corporate news sources. However, as the main goal of the FiveFilters.org project is to promote independent, non-corporate media, I’ll be thinking about ways to direct people to non-corporate sources of news and analysis in future updates.

For the time being, if a New York Times article is loaded, I’ve added a tab with links to The NYTimes eXaminer (‘An antidote to the “paper of record”‘). Similarly, if an article from The Guardian, BBC or Independent is loaded, users will see a tab with links to Medialens.

Send web articles to multiple Kindle devices

Keyvan — Tue, 27 Mar 2012 13:11:01 GMT

We’ve just updated our Kindle It service to allow you to send web articles to up to 5 Kindle devices in one go.

Last December Amazon enabled its Kindle Personal Documents Service for iPhone/iPad users, assigning each device a new email address, and this month the same feature has been enabled for Android users. Our Kindle It service has up to now been able to send to only one Kindle email address at a time, but as of today you can enter up to 5 addresses (separated by commas).

This will also work with our Push to Kindle Android app (no update necessary).

Let us know if you have any trouble.

Wednesday January 11, 2012

Keyvan — Wed, 11 Jan 2012 19:26:19 GMT
