Full-Text RSS 3.0

Sep 05, 2012

What is it?

Full-Text RSS is a free software PHP application to help you extract content from web pages. It can extract content from a standard HTML page and return a 1-item feed or it can transform an existing feed into a full-text feed.

It’s used primarily by news enthusiasts and developers.

It’s used by news enthusiasts who dislike partial web feeds – feeds which require them to read the full story on a different site, rather than their preferred application. Full-Text RSS can convert these feeds to full-text versions, allowing the reader to stay in his/her preferred environment to read the full story.

It’s used by developers building applications which need an article extraction component. It allows developers to retrieve and process only the content they’re interested in.

Demo

Try it out – enter a URL in the form and hit ‘Create Feed’.

What’s new in 3.0

Extraction

Multi-page support

Many web sites now split their articles into a number of pages. In earlier versions of Full-Text RSS we’d added support for retrieving the single-page view and extracting content from that page. For sites which do not offer such a single-page view, we can now follow the ‘next page’ links and build up the full article page by page.

Multi-page support currently works by specifying a next_page_link in the site config file associated with the website you are extracting from.

Examples:

next_page_link: //a[@id='next-page']
next_page_link: //a[contains(text(), 'Next page')]
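
To illustrate the idea (this is a rough sketch, not Full-Text RSS's actual PHP code), the loop below follows a 'next page' link until none is found and joins the pieces. The PAGES dictionary, the collect_article function and the story div are all hypothetical stand-ins. Python's ElementTree only supports a subset of XPath and needs relative paths, so the expression is written as .//a[@id='next-page'] rather than //a[@id='next-page'].

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory "site": URL -> well-formed XHTML. A real fetcher
# would download each page and use a forgiving HTML parser instead.
PAGES = {
    "/story?p=1": "<html><body><div id='story'>Part one.</div>"
                  "<a id='next-page' href='/story?p=2'>Next page</a></body></html>",
    "/story?p=2": "<html><body><div id='story'>Part two.</div>"
                  "<a id='next-page' href='/story?p=3'>Next page</a></body></html>",
    "/story?p=3": "<html><body><div id='story'>Part three.</div></body></html>",
}


def collect_article(start_url, pages, next_page_xpath=".//a[@id='next-page']",
                    max_pages=10):
    """Follow next-page links and return the combined article text."""
    parts, seen, url = [], set(), start_url
    # Stop when there is no next link, when a page repeats, or at max_pages.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        root = ET.fromstring(pages[url])
        body = root.find(".//div[@id='story']")
        if body is not None:
            parts.append("".join(body.itertext()).strip())
        link = root.find(next_page_xpath)
        url = link.get("href") if link is not None else None
    return "\n".join(parts)


article = collect_article("/story?p=1", PAGES)
```

The seen set and max_pages limit are a design choice of this sketch: they guard against sites whose last page links back to an earlier one.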

HTML5 parser: html5lib

By default we still rely on PHP’s fast libxml parser. For sites where this proves problematic, you can now specify html5lib – a PHP implementation of an HTML parser based on the HTML5 spec.

Example:

parser: html5lib

Better AJAX handling

Full-Text RSS does not interpret any JavaScript it comes across when fetching pages. To get at the content, we expect it to be marked up in HTML. Some sites have started relying on the user’s browser and its JavaScript support to load page content. For pages which load content in this way, Google suggests that the publisher also offers the content in plain HTML so Google’s search engine crawlers can access it. Google’s spec contains two possible triggers which will guide Google’s crawlers to the HTML version.

The first trigger appears in the URL. These URLs are often called ‘hashbang’ URLs. Example: https://twitter.com/#!/search-home

The second trigger can appear in the HTML header. Example: <meta name="fragment" content="!">

When encountered, these triggers will result in a new URL being generated, what Google terms an ‘Ugly URL’. The new URL will contain additional query string parameters to indicate to the server that the plain HTML version is being requested.

Earlier versions of Full-Text RSS looked for the first trigger (‘hashbang’ in the URL) but not the second trigger. Full-Text RSS 3.0 now handles both.
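
As a rough illustration of the URL rewriting involved, here is a small sketch based on our reading of Google's AJAX crawling scheme, in which the part after #! moves into an _escaped_fragment_ query parameter. It is not the code Full-Text RSS uses. The function name to_ugly_url and the has_fragment_meta flag are made up for this example.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left unescaped in the fragment: printable ASCII except
# '#', '%', '&' and '+' (space and non-ASCII are always escaped by quote()).
_SAFE = "".join(chr(c) for c in range(0x21, 0x7F) if chr(c) not in "#%&+")


def to_ugly_url(url, has_fragment_meta=False):
    """Return the 'ugly URL' for a hashbang URL or a page with the meta trigger.

    has_fragment_meta is a hypothetical flag: the caller sets it after
    finding the second trigger in the page's HTML header.
    """
    parts = urlsplit(url)
    if parts.fragment.startswith("!"):
        escaped = quote(parts.fragment[1:], safe=_SAFE)   # first trigger
    elif has_fragment_meta and not parts.fragment:
        escaped = ""                                      # second trigger
    else:
        return url                                        # no trigger: unchanged
    param = "_escaped_fragment_=" + escaped
    query = parts.query + "&" + param if parts.query else param
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


ugly = to_ugly_url("https://twitter.com/#!/search-home")
```

For the hashbang example above, ugly is https://twitter.com/?_escaped_fragment_=/search-home.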

Site config extraction patterns updated

Site config files are used to fine-tune extraction where autodetection doesn’t always work. There are now over 700 site config files. Many old ones have been updated and new ones added.

We also now look for OpenGraph title and date elements.

Developers

Cross-origin resource sharing (CORS) support

If Full-Text RSS is hosted on a different domain to your application, enabling CORS will allow your application to request JSON results from Full-Text RSS directly from the user’s browser without being blocked by the browser’s same origin policy.

To enable CORS, look at $options->cors in the config file.

JSONP support

The old way of circumventing the browser’s same origin policy was to use JSONP. You can do this by requesting JSON (&format=json) with an additional callback function (&format=json&callback=functionName).
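
To show the shape of a JSONP exchange, here is a small sketch that builds a request URL and unwraps a JSONP response on the receiving side. Only the format and callback parameters come from the text above. The endpoint address, the url parameter name and both function names are assumptions made for the example.

```python
import json
import re
from urllib.parse import urlencode


def build_jsonp_url(endpoint, article_url, callback):
    """Build a request URL asking for JSON wrapped in a callback.

    'endpoint' and the 'url' parameter name are assumptions of this sketch.
    """
    query = urlencode({"url": article_url, "format": "json", "callback": callback})
    return endpoint + "?" + query


def unwrap_jsonp(payload, callback):
    """Strip the callback wrapper, e.g. functionName({...}); and parse the JSON."""
    pattern = r"\s*" + re.escape(callback) + r"\((.*)\)\s*;?\s*"
    match = re.fullmatch(pattern, payload, re.DOTALL)
    if match is None:
        raise ValueError("response is not wrapped in the expected callback")
    return json.loads(match.group(1))


request_url = build_jsonp_url("http://example.org/full-text-rss/feed",
                              "http://example.com/article", "functionName")
data = unwrap_jsonp('functionName({"title": "Example"});', "functionName")
```

In a browser, the wrapped response would be loaded with a script tag and passed straight to functionName. The unwrapping step here is only needed when a non-browser client consumes the same output.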

Global site config

The global site config accepts everything a regular site config file does, but it’s applied to all sites, whether or not a specific site config matches.

The global site config file should be named global.txt and placed inside the relevant site_config/ subfolder.

Site config merging

Site config files are used to fine-tune extraction where autodetection doesn’t always work.

Previous versions of Full-Text RSS looked for site config files in the following order:

URL hostname match or wildcard match in site_config/custom/
URL hostname match or wildcard match in site_config/standard/
fingerprint match (HTML fragment mapping to hostname) in site_config/custom/
fingerprint match (HTML fragment mapping to hostname) in site_config/standard/

As soon as an entry was matched, we’d process it, return it, and stop looking.

In Full-Text RSS 3.0, we follow the same order, but continue looking even if there’s a match. We build up the site config by appending any new entries we find. In addition, we also look for and combine global site config files:

global rules in site_config/custom/global.txt
global rules in site_config/standard/global.txt

To prevent this behaviour, you can enter autodetect_on_failure: no in the site config file. This will end the chain. The config files before and including this one will be loaded and merged, but no others.
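
The merging behaviour can be sketched roughly as follows. This is an illustration only, not the real implementation: it assumes each config is a block of 'key: value' lines, that the caller passes the matched files in lookup order, and that global files come at the end of the chain. The function names are made up.

```python
def parse_site_config(text):
    """Parse 'key: value' lines into a list of (key, value) pairs."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition(":")
        if sep:
            entries.append((key.strip(), value.strip()))
    return entries


def merge_site_configs(config_texts):
    """Append entries from each matched config, in lookup order.

    A config containing 'autodetect_on_failure: no' is still merged,
    but nothing after it in the chain is loaded.
    """
    merged = []
    for text in config_texts:
        entries = parse_site_config(text)
        merged.extend(entries)
        if ("autodetect_on_failure", "no") in entries:
            break
    return merged


chain = [
    "body: //div[@id='story']",                   # e.g. hostname match in custom/
    "title: //h1\nautodetect_on_failure: no",     # e.g. hostname match in standard/
    "parser: html5lib",                           # e.g. a global.txt, never reached
]
merged = merge_site_configs(chain)
```

Here merged holds the body and title entries plus the autodetect_on_failure line, and the last file in the chain is ignored.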

XSS filtering

We have not enabled XSS filtering by default because we assume the majority of our users do not display the HTML retrieved by Full-Text RSS in a web page without further processing. If you subscribe to our generated feeds in your news reader application, it should, if it’s good software, already filter the resulting HTML for XSS attacks, making it redundant for Full-Text RSS to do the same. Similarly with frameworks/CMS which display feed content – the content should be treated like any other user-submitted content.

If you are writing an application yourself which is processing feeds generated by Full-Text RSS, you can either filter the HTML yourself to remove potential XSS attacks or enable this option. This might be useful if you are processing our generated feeds with JavaScript on the client side – although there’s client-side XSS filtering available too, e.g. JsHtmlSanitizer.

If enabled, we’ll pass retrieved HTML content through htmLawed with the safe flag on and style attributes denied (see htmLawed’s readme).

Note: if enabled this will also remove certain elements you may want to preserve, such as iframes.

Site config editor

Full-Text RSS 3.0 now comes with a site config editor available in the admin area (accessible via the admin/ folder). This lets you find, edit, and test existing site config files, or add new ones.

Note: We suggest you make changes to the site config files using a local installation of Full-Text RSS and upload the results to your server when ready. Site config files are simple text files stored on disk. Cloud hosting environments do not always offer persistent file storage, so changes made to a hosted copy on such environments may be lost.

Debug mode

Debug mode allows you to see what happens behind the scenes when Full-Text RSS is running. This is useful if you want to see things such as:

URL redirects
Which site config files are loaded
Whether the single_page_link and next_page_link expressions match
Which XPath expressions end up matching title, body, date, author

Performance

Site config caching in APC

If you run Full-Text RSS in a hosting environment which has APC enabled, it can take advantage of APC’s user cache – a memory cache. If enabled we will store site config files (when requested for the first time) in APC’s user cache – avoiding disk access on subsequent requests. See $options->apc in the config file to enable. Keys in APC are prefixed with ‘sc.’

Note: $options->apc has no effect if APC is unavailable on your server.
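
The idea is simple enough to sketch in a few lines. In this illustration a plain dictionary stands in for APC's user cache, and read_from_disk is a hypothetical callable that loads a site config file. Only the 'sc.' key prefix comes from the description above.

```python
class SiteConfigCache:
    """Read-through memory cache for site config files (illustrative only)."""

    def __init__(self, read_from_disk):
        self.read_from_disk = read_from_disk  # callable: hostname -> config text
        self.memory = {}                      # stands in for APC's user cache

    def get(self, hostname):
        key = "sc." + hostname
        if key not in self.memory:
            # First request for this site: hit the disk once and remember it.
            self.memory[key] = self.read_from_disk(hostname)
        return self.memory[key]


disk_reads = []


def fake_read(hostname):
    disk_reads.append(hostname)
    return "body: //div[@id='story']"


cache = SiteConfigCache(fake_read)
first = cache.get("example.com")
second = cache.get("example.com")   # served from memory, no second disk read
```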

Smart cache (experimental)

If you enable caching and APC, you can also try out the experimental smart cache. The intention here is, again, to reduce disk access. With this enabled we will not write Full-Text RSS’s results to disk straight away. Instead, we’ll store the generated cache key in APC’s user cache for 10 minutes. If a subsequent request comes in matching the cache key, we’ll write the result to disk. Requests after that matching the cache key will be loaded from disk. See $options->smart_cache in the config file to enable. Keys in APC are prefixed with ‘cache.’

Note: this has no effect if APC is disabled or unavailable on your server, or if you have caching disabled.
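
The decision logic described above can be sketched like this. It is a simplified model, not the real code: two dictionaries stand in for APC's user cache and the disk cache, the class and method names are made up, and the clock is injectable so the 10-minute window can be exercised without waiting.

```python
import time


class SmartCache:
    """Only write a result to disk once the same key is requested twice."""

    def __init__(self, ttl_seconds=600, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock
        self.seen = {}   # stands in for APC: "cache." + key -> expiry time
        self.disk = {}   # stands in for the disk cache: key -> result

    def fetch(self, key, generate):
        """Return (result, source) for a request identified by key."""
        if key in self.disk:
            return self.disk[key], "disk"
        result = generate()
        now = self.clock()
        apc_key = "cache." + key
        if self.seen.get(apc_key, 0) > now:
            # Second request inside the window: worth persisting.
            self.disk[key] = result
            del self.seen[apc_key]
            return result, "generated+written"
        # First request (or window expired): just remember the key.
        self.seen[apc_key] = now + self.ttl
        return result, "generated"


now = [0]
cache = SmartCache(clock=lambda: now[0])
steps = [cache.fetch("k", lambda: "feed")[1] for _ in range(3)]
```

After the three requests, steps reads generated, generated+written, disk. One-off requests never touch the disk at all, which is the point of the feature.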

Cloud ready

Host for free on AppFog

AppFog offer users free hosting with 2GB RAM. That’s more than enough to run Full-Text RSS for most users.

To get started:

Create a free account
Install the AppFog command-line client (af)
Change into the Full-Text RSS folder
Type af push
Follow the prompts and you’re done.

Note: if you get a 701 error saying the URL has been taken, edit manifest.yml and comment out the lines starting with name: and url: by inserting a hash sign (#) at the beginning of each line. Save and try again. This time af will prompt you for an application name and URL.

Override config options with environment variables

Most of the config options in the config file can now be overridden with environment variables. When creating environment variables, use the option name prefixed with ‘ftr_’. For example, to override $options->max_entries and limit the maximum to 2, create an environment variable with key ftr_max_entries and value 2.
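
A rough sketch of how such an override step might work is shown below. The function name and the type handling are assumptions made for illustration (environment variables are always strings, so the sketch converts them to the type of the default value). Only the ftr_ prefix and the max_entries example come from the text above.

```python
import os


def apply_env_overrides(options, environ=os.environ, prefix="ftr_"):
    """Return a copy of options with any ftr_* environment overrides applied."""
    overridden = dict(options)
    for name, default in options.items():
        raw = environ.get(prefix + name)
        if raw is None:
            continue  # no override for this option
        if isinstance(default, bool):      # check bool before int: bool is an int
            overridden[name] = raw.strip().lower() in ("1", "true", "yes", "on")
        elif isinstance(default, int):
            overridden[name] = int(raw)
        else:
            overridden[name] = raw
    return overridden


defaults = {"max_entries": 10, "cors": False}
options = apply_env_overrides(defaults, environ={"ftr_max_entries": "2"})
```

With the environment shown, options["max_entries"] becomes 2 while cors keeps its default.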

What didn’t make it

No monitored feeds

One feature which didn’t make this release is the ability to create monitored feeds with PubSubHubbub support. This was specifically to improve the speed with which generated feeds updated within Google Reader’s system. Unfortunately this feature is not yet ready – we’ve not had great results in our tests, so won’t be releasing until we’re happy.

Config options removed

The following config options were removed:

$options->restrict
$options->message_to_prepend_with_key
$options->message_to_append_with_key
$options->error_message_with_key
$options->alternative_url

No extraction with CSS selector

You can no longer specify what should get extracted with a CSS selector passed in the querystring.

Keyvan's blog
