PHP Port of Arc90’s Readability
Update 2011-03-23: Readers may also be interested in how we use PHP Readability at FiveFilters.org: Content Extraction at FiveFilters.org
Last year I ported Arc90’s Readability to use in the Five Filters project. It’s been over a year now and Readability has improved a lot — thanks to Chris Dary and the rest of the team at Arc90.
As part of an update to the Full-Text RSS service I started porting a more recent version (1.6.2) to PHP and the code is now online:
PHP Readability
(view source code)
For anyone not familiar, Readability was created for use as a browser addon (a bookmarklet). With one click it transforms web pages for easy reading and strips away clutter. Apple recently incorporated it into Safari Reader.
It’s also very handy for content extraction, which is why I wanted to port it to PHP in the first place. Here’s an example of how to use the PHP port:
require_once 'Readability.php';
header('Content-Type: text/plain; charset=utf-8');
// get latest Medialens alert
// (change this URL to whatever you'd like to test)
$url = 'http://medialens.org/alerts/index.php';
$html = file_get_contents($url);
// PHP Readability works with UTF-8 encoded content.
// If $html is not UTF-8 encoded, use iconv() or
// mb_convert_encoding() to convert to UTF-8.
// If we've got Tidy, let's clean up input.
// This step is highly recommended - PHP's default HTML parser
// often does a terrible job and results in strange output.
if (function_exists('tidy_parse_string')) {
$tidy = tidy_parse_string($html, array(), 'UTF8');
$tidy->cleanRepair();
$html = $tidy->value;
}
// give it to Readability
$readability = new Readability($html, $url);
// print debug output?
// useful to compare against Arc90's original JS version -
// simply click the bookmarklet with FireBug's
// console window open
$readability->debug = false;
// convert links to footnotes?
$readability->convertLinksToFootnotes = true;
// process it
$result = $readability->init();
// does it look like we found what we wanted?
if ($result) {
echo "== Title ===============================\n";
echo $readability->getTitle()->textContent, "\n\n";
echo "== Body ===============================\n";
$content = $readability->getContent()->innerHTML;
// if we've got Tidy, let's clean it up for output
if (function_exists('tidy_parse_string')) {
$tidy = tidy_parse_string($content,
array('indent'=>true, 'show-body-only'=>true),
'UTF8');
$tidy->cleanRepair();
$content = $tidy->value;
}
echo $content;
} else {
echo 'Looks like we couldn\'t find the content.';
}
Differences between the PHP port and the original
Arc90’s Readability is designed to run in the browser. It works on the DOM tree (the parsed HTML) after the page’s CSS styles have been applied and Javascript code executed. This PHP port does not run inside a browser. We use PHP’s ability to parse HTML to build our DOM tree, but we cannot rely on CSS or Javascript support. As such, the results will not always match Arc90’s Readability. (For example, if a web page contains CSS style rules or Javascript code which hide certain HTML elements from display, Arc90’s Readability will dismiss those from consideration but our PHP port, unable to understand CSS or Javascript, will not know any better.)
Another significant difference is that the aim of Arc90’s Readability is to re-present the main content block of a given web page so users can read it more easily in their browsers. Correct identification, clean up, and separation of the content block is only a part of this process. This PHP port is only concerned with this part, it does not include code that relates to presentation in the browser — Arc90 already do that extremely well, and for PDF output there’s FiveFilters.org’s PDF Newspaper.
Finally, this class contains methods that might be useful for developers working on HTML document fragments. So without deviating too much from the original code (which I don’t want to do because it makes debugging and updating more difficult), I’ve tried to make it a little more developer friendly. You should be able to use the methods here on existing DOMElement objects without passing an entire HTML document to be parsed.