Adding support for your own RSS feeds to web2lrf
Basic Example
To add support for your own RSS feeds, you will need to create a small Python script. For example, suppose you want to convert the feed http://wonderfulwebsite.or/rss/feed1.xml to an ebook. Create the file wonderfulwebsite.py with the following contents:
from libprs500.ebooks.lrf.web.profiles import DefaultProfile

class WonderfulWebsite(DefaultProfile):

    title = 'My Wonderful Website Feed'
    max_recursions = 2

    def get_feeds(self):
        return [ ('Feed 1', 'http://wonderfulwebsite.or/rss/feed1.xml') ]
Now run web2lrf as follows:
web2lrf --verbose --user-profile wonderfulwebsite.py
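Since get_feeds returns a list of (title, URL) pairs, a profile can presumably cover several feeds by returning more than one pair. The following is only a sketch under that assumption: it uses a stand-in DefaultProfile class so that it runs without libprs500 installed, and the second feed URL (feed2.xml) is hypothetical.

```python
# Stand-in for libprs500's DefaultProfile so this sketch runs on its own.
# In a real profile you would instead use:
#   from libprs500.ebooks.lrf.web.profiles import DefaultProfile
class DefaultProfile(object):
    pass


class WonderfulWebsite(DefaultProfile):

    title = 'My Wonderful Website Feeds'

    def get_feeds(self):
        # Each entry is a (feed title, feed URL) pair.
        # feed2.xml is a hypothetical second feed, used only for illustration.
        return [
            ('Feed 1', 'http://wonderfulwebsite.or/rss/feed1.xml'),
            ('Feed 2', 'http://wonderfulwebsite.or/rss/feed2.xml'),
        ]


feeds = WonderfulWebsite().get_feeds()
for feed_title, feed_url in feeds:
    print(feed_title + ': ' + feed_url)
```

Running the sketch directly with Python is a quick way to check the titles and URLs before handing the real profile to web2lrf.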
Customization
Most websites have a lot of cruft that makes direct conversion into an ebook problematic. While the above example should work for a very simple feed and website, you will need to customize the process for real-world feeds. Fortunately, the user profile framework is very flexible. See the detailed HOWTO on creating your own user profiles at the bottom of the page.
Print version
The first thing you will probably want to do is use the print version of the articles. You can do this by adding a method to WonderfulWebsite, as shown below:
class WonderfulWebsite(DefaultProfile):
    ...
    def print_version(self, url):
        return url + '/print_version'
The method print_version will be called with the URL for every article and should return the modified URL that points to the print version of that article.
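The URL rewriting can be tried outside of web2lrf before it goes into a profile. The sketch below is a standalone function (no self argument) that mirrors the method above; the article URLs are hypothetical, and stripping a trailing slash is an extra precaution that the original method does not take.

```python
def print_version(url):
    # Assumed URL scheme: the print view lives at '<article URL>/print_version'.
    # Strip any trailing slash first so we never produce '//print_version'.
    return url.rstrip('/') + '/print_version'


# Hypothetical article URLs, used only to exercise the function.
article_urls = [
    'http://wonderfulwebsite.or/articles/1',
    'http://wonderfulwebsite.or/articles/2/',
]

print_urls = [print_version(u) for u in article_urls]
for u in print_urls:
    print(u)
```

Real sites differ in how they expose print pages (a query parameter, a different path, and so on), so the body of the function has to be adapted to the site in question.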
Preprocessing Article HTML
You can preprocess the downloaded HTML to remove banners, ads, and unwanted HTML and graphics, change styles, etc. This is done by adding a member to WonderfulWebsite, as shown:
import re

class WonderfulWebsite(DefaultProfile):
    ...
    preprocess_regexps = [
        (re.compile(r'<div class="banner">.*?</div>', re.IGNORECASE | re.DOTALL),
         lambda match : ''),
    ]
This example removes banners (<div> elements of class "banner") from the HTML before converting.
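To see what such a rule does, you can apply it to a snippet of HTML by hand. The sketch below assumes that each (pattern, function) pair is applied in order with the pattern's sub method; that is a guess at how web2lrf uses the list, not a description of its internals. The helper apply_rules and the sample HTML are hypothetical.

```python
import re

preprocess_regexps = [
    (re.compile(r'<div class="banner">.*?</div>', re.IGNORECASE | re.DOTALL),
     lambda match: ''),
]


def apply_rules(html, rules):
    # Assumed behaviour: apply each (pattern, function) pair in order,
    # replacing every match with the function's return value.
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html


# Hypothetical article HTML with a banner that spans several lines.
sample = '<body><div class="banner">\nBuy now!\n</div><p>Article text</p></body>'
cleaned = apply_rules(sample, preprocess_regexps)
print(cleaned)
```

Note that the non-greedy .*? stops at the first closing </div>, so a banner that contains nested <div> elements would be removed only partially and would need a more specific pattern.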
More Information
The above examples are just the tip of the iceberg when it comes to the capabilities of web2lrf. The best way to learn how to write user profiles is to look at the built-in profiles that web2lrf already has. A more detailed HOWTO to guide you in doing that is available here.
- [source:trunk/src/libprs500/ebooks/lrf/web/profiles/newsweek.py newsweek]
- [source:trunk/src/libprs500/ebooks/lrf/web/profiles/bbc.py bbc]
- [source:trunk/src/libprs500/ebooks/lrf/web/profiles/nytimes.py nytimes]
Finally, you should look at the definition of [source:trunk/src/libprs500/ebooks/lrf/web/profiles/__init__.py DefaultProfile].
User provided profiles
- Faz.net (StDo)
- AJC.com (ivan)
- Wired.com (DaveC)
- Taipei Times (lorenzogoehr)
- Die Zeit (StDo)
- Spiegel Online (StDo)
- Wall Street Journal (JTravers)
- Barron's (JTravers)
- Portfolio.com (JTravers)
- Daily Dilbert Comic (StDo)
- CNN.com (majorde)
- The HK Standard (Peter Tang)
- Jinghua (Peter Tang)
- Heise News (Stefan Hempe)
- Golem News (Stefan Hempe)
- Kathimerini.gr Greek News
- IN.GR Greek News
- Pathfinder.gr Greek News
Attachments
- ajc.py (3.5 KB) - added by kovidgoyal 14 months ago.
- faznet.py (1.5 KB) - added by stdo 14 months ago. Version 0.10 of faznet.py
- wired.py (3.5 KB) - added by DaveC 14 months ago.
- Taipeitimes.py (1.1 KB) - added by kovidgoyal 14 months ago.
- zeitde.py (1.4 KB) - added by stdo 14 months ago. Version 0.08 of zeitde.py - RSS feed www.zeit.de
- spiegelde.py (1.7 KB) - added by stdo 13 months ago. Version 0.10 of spiegelde.py - RSS feed www.spiegel.de
- portfolio.py (2.1 KB) - added by jtravers 13 months ago.
- wsj.py (6.6 KB) - added by jtravers 13 months ago.
- barrons.py (3.4 KB) - added by jtravers 13 months ago.
- dilbert.py (1.2 KB) - added by stdo 13 months ago. The Daily Dilbert Comic
- cnn.py (3.3 KB) - added by majorde 13 months ago.
- heise_newsticker.py (1.9 KB) - added by shempe 12 months ago. Version 0.02 of heise_newsticker.py - RSS feed www.heise.de
- Golem.py (1.8 KB) - added by shempe 12 months ago. Version 0.01 of Golem.py - RSS feed www.golem.de
- hkstandard.py (1.8 KB) - added by kovidgoyal 12 months ago.
- jinghua.py (2.9 KB) - added by kovidgoyal 12 months ago.
- kathimerini.py (2.5 KB) - added by activea 12 months ago. kathimerini.gr RSS
- ingr.py (2.8 KB) - added by activea 12 months ago. in.gr RSS
- pathfinder.py (3.0 KB) - added by activea 12 months ago. Pathfinder.gr Greek News
- thenation.py (4.8 KB) - added by secretsubscribe 12 months ago. The Nation.
- ag2.py (1.2 KB) - added by Deputy-Dawg 12 months ago. Agenzia Fides
- chr_mon.py (1.5 KB) - added by Deputy-Dawg 12 months ago. Christian Science Monitor
- jrpost.py (1.8 KB) - added by Deputy-Dawg 12 months ago. Jerusalem Post
- reuters.py (1.8 KB) - added by Deputy-Dawg 11 months ago. Reuters News - fixed minor typo
- ap.py (2.2 KB) - added by Deputy-Dawg 11 months ago. Associated Press
- upi.py (1.6 KB) - added by Deputy-Dawg 11 months ago. United Press International (UPI)
- wash_post.py (1.8 KB) - added by Deputy-Dawg 11 months ago. Washington Post
- DiePresse.py (1.8 KB) - added by woodman 11 months ago. Austria
- futurezone-orf.py (2.0 KB) - added by woodman 11 months ago. Austria
- Standard-Wissenschaft.py (1.8 KB) - added by woodman 11 months ago. Austria
- greader.py (1.7 KB) - added by davec 10 months ago. Google Reader, for new feeds2lrf framework (the other davec...)
- cyberpresse.py (0.9 KB) - added by balok 6 weeks ago.
