Content Extraction at FiveFilters.org

Mar 23, 2011

Full-Text RSS 2.7 from FiveFilters.org is now available. I thought I’d write about one area of improvement in this release: content extraction.

Automatic Extraction

Up to now we’ve relied mainly on PHP Readability to automatically identify and extract articles from web pages, and this is still how the majority of articles are extracted. It works extremely well for most pages, but it occasionally fails: it may pick out the wrong HTML element, or find nothing at all. Improving PHP Readability will be one area of focus for future releases.

In 2.7 we still use PHP Readability, but we now recognise and prioritise hNews microformatting – if detected, we extract the first element marked entry-title and all elements marked entry-content. This is a standard that will hopefully be used more widely on the web. (For those who’ve asked, Twitter updates are now extracted properly because of hNews support.)
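To make the hNews rule concrete, here is a minimal sketch in Python (Full-Text RSS itself is written in PHP, so this is an illustration of the idea, not our implementation). It collects the text of the first element marked entry-title and of every element marked entry-content. The names HNewsExtractor and extract_hnews are made up for this example, and the sketch assumes reasonably well-nested markup.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class HNewsExtractor(HTMLParser):
    """Hypothetical helper: first entry-title plus all entry-content elements."""

    def __init__(self):
        super().__init__()
        self.title = None    # text of the first entry-title element
        self.content = []    # text of each entry-content element, in order
        self._depth = 0      # current element nesting depth
        self._open = []      # active captures: [kind, depth, text_parts]

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        self._depth += 1
        classes = (dict(attrs).get("class") or "").split()
        active = {kind for kind, _, _ in self._open}
        # Only the first entry-title counts; later ones are ignored.
        if "entry-title" in classes and self.title is None and "title" not in active:
            self._open.append(["title", self._depth, []])
        # Every top-level entry-content element is captured.
        if "entry-content" in classes and "content" not in active:
            self._open.append(["content", self._depth, []])

    def handle_data(self, data):
        for _, _, parts in self._open:
            parts.append(data)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        # Close any capture that was opened at the depth we are leaving.
        for capture in [c for c in self._open if c[1] == self._depth]:
            kind, _, parts = capture
            text = " ".join("".join(parts).split())  # normalise whitespace
            if kind == "title":
                self.title = text
            else:
                self.content.append(text)
            self._open.remove(capture)
        self._depth -= 1


def extract_hnews(html):
    """Return (title, [content, ...]) or None when hNews markup is missing."""
    parser = HNewsExtractor()
    parser.feed(html)
    parser.close()
    if parser.title is None or not parser.content:
        return None
    return parser.title, parser.content


SAMPLE = (
    '<div class="hentry">'
    '<h1 class="entry-title">Hello <em>world</em></h1>'
    '<p class="entry-content">First part.</p>'
    '<h2 class="entry-title">Ignored</h2>'
    '<div class="entry-content">Second <b>part</b>.</div>'
    '</div>'
)
```

Returning None when either the title or the content is missing lets a caller fall back to another extraction method.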

Site Patterns

Recognising that auto-detection does sometimes fail, in version 2.5 we introduced custom extraction patterns: a way for users to override auto-detection and tell Full-Text RSS (using CSS selectors) which element it should extract as the content block.

The biggest change in 2.7 is the introduction of site patterns. Site patterns sit between custom extraction and auto-detection. They allow fine-grained control over extraction on a per-site basis. A site, identified by its domain name, can now have its own config file detailing extraction rules. Each time a URL is processed, we check to see if a corresponding site config exists, and if it does, we refer to it for instructions. Users can specify XPath expressions to match title and body elements and define rules to strip superfluous elements.

Rather than create our own configuration format for site patterns, we chose to adopt the same format used by Instapaper. Here’s what the entry for wikipedia.org looks like:

body: //div[@id = 'content']
strip_id_or_class: editsection
strip_id_or_class: toc
prune: no

Instapaper users will find these patterns by visiting instapaper.com/bodytext/ (login required).
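As the wikipedia.org entry shows, each line of a site config is a key, a colon, and a value, and a key such as strip_id_or_class may appear more than once. A rough Python sketch of reading that format follows. The function name parse_site_config is hypothetical, and skipping lines that start with # is an assumption of this example, not something described above.

```python
def parse_site_config(text):
    """Parse 'key: value' lines into a dict mapping each key to a list of values."""
    rules = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: blank lines and '#' comment lines carry no rules.
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so XPath values stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'key: value' line
        rules.setdefault(key.strip(), []).append(value.strip())
    return rules


WIKIPEDIA = """\
body: //div[@id = 'content']
strip_id_or_class: editsection
strip_id_or_class: toc
prune: no
"""

print(parse_site_config(WIKIPEDIA))
```

Keeping every value in a list means repeated keys are preserved in the order they were written.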

One big advantage for us in using the same config format is that we can make use of all the existing rules listed on Instapaper. Marco, Instapaper’s creator, has opened up the database to allow for public contributions. So, included in Full-Text RSS 2.7 are over 100 site configuration files, which will be applied automatically (look inside the site_config/standard/ directory). Most of these are borrowed from Instapaper, but we’ll soon be adding our own, which we’ll be sharing with everyone.

Users can also create their own site config files and drop them in the site_config/custom/ directory. Each site config is simply a text file named after the site. For example, if I wanted a special rule for extracting content from this site, I would create a keyvan.net.txt file with the appropriate rules inside.
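The mapping from a URL to a config file name can be sketched as below. This is a Python illustration only: the function name site_config_candidates is hypothetical, and two details are assumptions of the example, namely that a leading "www." is dropped and that the custom directory is checked before the standard one.

```python
import posixpath
from urllib.parse import urlparse


def site_config_candidates(url, root="site_config"):
    """List the config file paths that could apply to a URL, in lookup order."""
    host = (urlparse(url).hostname or "").lower()
    # Assumption: 'www.example.org' and 'example.org' share one config file.
    if host.startswith("www."):
        host = host[len("www."):]
    if not host:
        return []
    # Assumption: user-supplied rules take precedence over the bundled ones.
    return [posixpath.join(root, folder, host + ".txt")
            for folder in ("custom", "standard")]


print(site_config_candidates("http://www.keyvan.net/2011/03/content-extraction/"))
```

A caller would then use the first of these paths that exists on disk, if any.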

Extraction Process Overview

To summarise, Full-Text RSS 2.7 attempts to extract in the following order:

1. Custom Extraction Pattern
2. Site Patterns
3. hNews
4. PHP Readability

If at any stage we find we’ve got a successful title and body match, we do not proceed further. If, however, there is no match, we move down the list until there is one (the only exception is custom extraction patterns: if the supplied CSS selector does not match, no further attempt is made).
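The order and the one exception can be summarised in a few lines of Python. This is a sketch of the control flow only: run_extraction and the stub extractors are hypothetical names, and each extractor is assumed to return a (title, body) pair on success or None on failure.

```python
def run_extraction(page, custom_pattern, fallbacks):
    """Try extractors in order and return (method_name, (title, body)) or None.

    custom_pattern: a callable, or None if the user supplied no CSS selector.
    fallbacks: ordered list of (name, callable) pairs tried when there is
    no custom pattern.
    """
    if custom_pattern is not None:
        # The exception: a user-supplied pattern is final, with no fallback.
        result = custom_pattern(page)
        if result is None:
            return None
        return "custom extraction pattern", result
    for name, extractor in fallbacks:
        result = extractor(page)
        if result is not None:
            return name, result  # stop at the first title and body match
    return None


# Stub extractors standing in for the real methods.
def no_match(page):
    return None


def readability_stub(page):
    return ("Some title", "Some body")


FALLBACKS = [
    ("site patterns", no_match),
    ("hNews", no_match),
    ("PHP Readability", readability_stub),
]

print(run_extraction("<html>...</html>", None, FALLBACKS))
```

With no custom pattern the stubs above fall through to the last method, while a failing custom pattern ends the attempt immediately.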

Sound useful?

Full-Text RSS 2.7 is licensed under the AGPL and available to try or buy at fivefilters.org/content-only/.

Keyvan's blog
