API Documentation for recipes

Defines various abstract base classes that can be subclassed to create powerful news fetching recipes. The useful subclasses are:

BasicNewsRecipe

class calibre.web.feeds.news.BasicNewsRecipe

Abstract base class that contains a number of members and methods to customize the fetching of contents in your recipes. All recipes must inherit from this class or a subclass of it.

The members and methods are organized as follows:

Customizing e-book download

BasicNewsRecipe.title

The title to use for the ebook

Default: 'Unknown News Source'

BasicNewsRecipe.description

A couple of lines that describe the content this recipe downloads. This will be used primarily in a GUI that presents a list of recipes.

Default: ''

BasicNewsRecipe.__author__

The author of this recipe

Default: 'calibre'

BasicNewsRecipe.max_articles_per_feed

Maximum number of articles to download from each feed. This is primarily useful for feeds that don’t have article dates. For most feeds, you should use BasicNewsRecipe.oldest_article instead.

Default: 100

BasicNewsRecipe.oldest_article

Oldest article to download from this news source. In days.

Default: 7.0
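
As a rough illustration of how an age cutoff expressed in days behaves, the sketch below filters articles by publication time. It is not calibre's implementation; the helper is_recent and the sample dates are hypothetical and exist only for this example.

```python
from datetime import datetime, timedelta

def is_recent(published, now, oldest_article=7.0):
    """Return True if `published` is at most `oldest_article` days before `now`."""
    return now - published <= timedelta(days=oldest_article)

# Hypothetical timestamps used only for this illustration.
now = datetime(2008, 3, 14, 12, 0)
fresh = datetime(2008, 3, 10, 9, 30)   # roughly 4 days old
stale = datetime(2008, 3, 1, 9, 30)    # roughly 13 days old

print(is_recent(fresh, now))  # True
print(is_recent(stale, now))  # False
```

Because the value is a float, a fractional cutoff such as 0.5 (twelve hours) works the same way in this sketch.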

BasicNewsRecipe.recursions

Number of levels of links to follow on article webpages

Default: 0

BasicNewsRecipe.delay

Delay between consecutive downloads in seconds

Default: 0

BasicNewsRecipe.simultaneous_downloads

Number of simultaneous downloads. Set to 1 if the server is picky. Automatically reduced to 1 if BasicNewsRecipe.delay > 0.

Default: 5
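
The interaction between delay and simultaneous_downloads can be pictured with the small sketch below. The function effective_downloads is a hypothetical name, and the code only restates the rule given above rather than reproducing calibre's internals.

```python
def effective_downloads(simultaneous_downloads=5, delay=0):
    """Return the number of parallel downloads after applying the delay rule."""
    # Any positive delay forces downloads to run one at a time.
    return 1 if delay > 0 else simultaneous_downloads

print(effective_downloads())                 # 5
print(effective_downloads(delay=2))          # 1
print(effective_downloads(3, delay=0))       # 3
```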

BasicNewsRecipe.timeout

Timeout for fetching files from server in seconds

Default: 120.0

BasicNewsRecipe.timefmt

The format string for the date shown on the first page. By default: Day_Name, Day_Number Month_Name Year

Default: ' [%a, %d %b %Y]'
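
The directives in the default value are standard strftime codes. As a quick check of what they expand to, the following sketch formats a fixed, arbitrarily chosen date with Python's own datetime module; weekday and month names depend on the active locale.

```python
from datetime import date

timefmt = ' [%a, %d %b %Y]'  # the default value shown above

# %a = abbreviated weekday, %d = day of month, %b = abbreviated month, %Y = year
stamp = date(2008, 3, 14).strftime(timefmt)
print(repr(stamp))  # in the default C/English locale: ' [Fri, 14 Mar 2008]'
```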

BasicNewsRecipe.feeds

List of feeds to download. Can be either [url1, url2, ...] or [('title1', url1), ('title2', url2), ...]

Default: None
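
To make the two accepted forms concrete, the sketch below converts either one into a uniform list of (title, url) pairs. The helper normalize_feeds and the example.com URLs are hypothetical; calibre's own handling may differ.

```python
def normalize_feeds(feeds):
    """Turn either accepted form into a list of (title, url) pairs.

    A bare URL gets a title of None, meaning "no title supplied".
    """
    pairs = []
    for entry in feeds or []:
        if isinstance(entry, str):
            pairs.append((None, entry))
        else:
            title, url = entry
            pairs.append((title, url))
    return pairs

# Hypothetical feed lists in each of the two forms.
plain = ['http://example.com/rss/world', 'http://example.com/rss/tech']
titled = [('World', 'http://example.com/rss/world'),
          ('Technology', 'http://example.com/rss/tech')]

print(normalize_feeds(plain))
print(normalize_feeds(titled))
```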

BasicNewsRecipe.no_stylesheets

Convenient flag to disable loading of stylesheets for websites that have overly complex stylesheets unsuitable for conversion to e-book formats. If True, stylesheets are not downloaded and processed.

Default: False

BasicNewsRecipe.encoding

Specify an override encoding for sites that have an incorrect charset specification. The most common case is a site that declares latin1 but actually uses cp1252. If None, try to detect the encoding.

Default: None
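
The latin1 versus cp1252 mismatch is easiest to see on typographic quotes, which cp1252 stores in bytes that latin1 treats as control characters. The sketch below uses only Python's built-in codecs; the byte string is a made-up sample.

```python
raw = b'\x93Breaking news\x94'  # bytes as served by a hypothetical site

as_latin1 = raw.decode('latin1')   # what the declared charset would give
as_cp1252 = raw.decode('cp1252')   # what the site actually means

print(ascii(as_latin1))  # '\x93Breaking news\x94' (invisible control characters)
print(ascii(as_cp1252))  # '\u201cBreaking news\u201d' (curly quotation marks)
```

In a case like this, setting encoding = 'cp1252' in the recipe would be the natural override.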

static BasicNewsRecipe.get_browser()

Return a browser instance used to fetch documents from the web. By default it returns a mechanize browser instance that supports cookies, ignores robots.txt, handles refreshes and has a mozilla firefox user agent.

If your recipe requires that you log in first, override this method in your subclass. For example, the following code is used in the New York Times recipe to log in for full access:

def get_browser(self):
    br = BasicNewsRecipe.get_browser()
    # Log in only when both credentials have been supplied.
    if self.username is not None and self.password is not None:
        br.open('http://www.nytimes.com/auth/login')
        br.select_form(name='login')
        br['USERID']   = self.username
        br['PASSWORD'] = self.password
        br.submit()
    return br

BasicNewsRecipe.get_cover_url()

Return a URL to the cover image for this issue, or None. By default it returns the value of the member self.cover_url, which is normally None. If you want your recipe to download a cover for the e-book, override this method in your subclass, or set the member variable self.cover_url before this method is called.

BasicNewsRecipe.get_feeds()

Return a list of RSS feeds to fetch for this profile. Each element of the list must be a 2-element tuple of the form (title, url). If title is None or an empty string, the title from the feed is used. This method is useful if your recipe needs to do some processing to figure out the list of feeds to download. If so, override it in your subclass.

BasicNewsRecipe.parse_index()

This method should be implemented in recipes that parse a website instead of feeds to generate a list of articles. Typical uses are for news sources that have a “Print Edition” webpage that lists all the articles in the current print edition. If this function is implemented, it will be used in preference to BasicNewsRecipe.parse_feeds().

It must return a list. Each element of the list must be a 2-element tuple of the form ('feed title', list of articles).

Each list of articles must contain dictionaries of the form:

{
'title'       : article title,
'url'         : URL of print version,
'date'        : The publication date of the article as a string,
'description' : A summary of the article,
'content'     : The full article (can be an empty string). This is used by FullContentProfile
}

For an example, see the recipe for downloading The Atlantic.
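
The nesting of the return value can be easier to follow with a concrete instance. The sketch below builds one by hand and checks its shape; check_index, the feed titles, and the example.com URLs are all hypothetical, and treating all five keys as required is an assumption read off the description above.

```python
REQUIRED_KEYS = {'title', 'url', 'date', 'description', 'content'}

def check_index(index):
    """Raise ValueError unless `index` has the shape described above."""
    for feed in index:
        if len(feed) != 2:
            raise ValueError('each feed must be a (title, articles) tuple')
        feed_title, articles = feed
        for article in articles:
            missing = REQUIRED_KEYS - set(article)
            if missing:
                raise ValueError('%s: missing keys %s' % (feed_title, sorted(missing)))
    return True

# A hand-written value of the kind parse_index() is expected to return.
index = [
    ('Front Page', [
        {'title': 'Lead story',
         'url': 'http://example.com/print/lead',
         'date': '14 Mar 2008',
         'description': 'A short summary.',
         'content': ''},
    ]),
    ('Opinion', []),
]

print(check_index(index))  # True
```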

Customizing feed parsing

BasicNewsRecipe.summary_length

Max number of characters in the short description

Default: 500

BasicNewsRecipe.use_embedded_content

Normally we try to guess if a feed has full articles embedded in it based on the length of the embedded content. If None, then the default guessing is used. If True, then we always assume the feeds have embedded content, and if False, we always assume the feeds do not have embedded content.

Default: None

BasicNewsRecipe.get_article_url(article)

Override in a subclass to customize extraction of the URL that points to the content for each article. Return the article URL. It is called with article, an object representing a parsed article from a feed. See feedparser. By default it returns article.link.

static BasicNewsRecipe.print_version(url)

Take a url pointing to the webpage with article content and return the URL pointing to the print version of the article. By default does nothing. For example:

def print_version(self, url):
    return url + '?&pagewanted=print'

BasicNewsRecipe.parse_feeds()

Create a list of articles from the list of feeds returned by BasicNewsRecipe.get_feeds(). Return a list of Feed objects.

Pre/post processing of downloaded HTML

BasicNewsRecipe.extra_css

Specify any extra CSS that should be added to downloaded HTML files. It will be inserted into <style> tags, just before the closing </head> tag, thereby overriding all CSS except that which is declared using the style attribute on individual HTML tags. For example:

extra_css = '.heading { font: serif x-large }'

Default: None

BasicNewsRecipe.match_regexps

List of regular expressions that determine which links to follow. If empty, it is ignored. For example:

match_regexps = [r'page=[0-9]+']

will match all URLs that have page=some number in them.

Only one of BasicNewsRecipe.match_regexps or BasicNewsRecipe.filter_regexps should be defined.

Default: []

BasicNewsRecipe.filter_regexps

List of regular expressions that determine which links to ignore. If empty, it is ignored. For example:

filter_regexps = [r'ads\.doubleclick\.net']

will remove all URLs that have ads.doubleclick.net in them.

Only one of BasicNewsRecipe.match_regexps or BasicNewsRecipe.filter_regexps should be defined.

Default: []
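
The effect of the two lists on link selection can be sketched with the standard re module. The function should_follow is hypothetical and is not calibre's implementation; it assumes a pattern may match anywhere in the URL, which is what the two examples above suggest.

```python
import re

def should_follow(url, match_regexps=(), filter_regexps=()):
    """Decide whether a link would be followed under the rules above."""
    if match_regexps:
        # Follow only links that match at least one pattern.
        return any(re.search(p, url) for p in match_regexps)
    if filter_regexps:
        # Follow everything except links that match a pattern.
        return not any(re.search(p, url) for p in filter_regexps)
    return True  # both lists empty: nothing is restricted

print(should_follow('http://example.com/story?page=2',
                    match_regexps=[r'page=[0-9]+']))              # True
print(should_follow('http://ads.doubleclick.net/banner',
                    filter_regexps=[r'ads\.doubleclick\.net']))   # False
```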

BasicNewsRecipe.remove_tags

List of tags to be removed. Specified tags are removed from downloaded HTML. A tag is specified as a dictionary of the form:

{
 name      : 'tag name',   #e.g. 'div'
 attrs     : a dictionary, #e.g. {'class': 'advertisement'}
}

All keys are optional. For a full explanation of the search criteria, see Beautiful Soup. A common example:

remove_tags = [dict(name='div', attrs={'class':'advert'})]

This will remove all <div class="advert"> tags and all their children from the downloaded HTML.

Default: []

BasicNewsRecipe.remove_tags_after

Remove all tags that occur after the specified tag. For the format for specifying a tag see BasicNewsRecipe.remove_tags. For example:

remove_tags_after = [dict(id='content')]

will remove all tags after the first element with id="content".

Default: None

BasicNewsRecipe.remove_tags_before

Remove all tags that occur before the specified tag. For the format for specifying a tag see BasicNewsRecipe.remove_tags. For example:

remove_tags_before = [dict(id='content')]

will remove all tags before the first element with id="content".

Default: None

BasicNewsRecipe.keep_only_tags

Keep only the specified tags and their children. For the format for specifying a tag see BasicNewsRecipe.remove_tags. If this list is not empty, then the <body> tag will be emptied and re-filled with the tags that match the entries in this list. For example:

keep_only_tags = [dict(id=['content', 'heading'])]

will keep only tags that have an id attribute of “content” or “heading”.

Default: []

BasicNewsRecipe.preprocess_regexps

List of regexp substitution rules to run on the downloaded HTML. Each element of the list should be a two element tuple. The first element of the tuple should be a compiled regular expression and the second a callable that takes a single match object and returns a string to replace the match. For example:

preprocess_regexps = [
   (re.compile(r'<!--Article ends here-->.*</body>', re.DOTALL|re.IGNORECASE),
    lambda match: '</body>'),
]

will remove everything from <!--Article ends here--> to </body>.

Default: []
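
To see what a rule of this form does to a page, the sketch below applies the example rule to a small HTML string using only the re module. The helper apply_rules and the sample markup are hypothetical; the point is that each compiled pattern is passed to sub() together with its callable.

```python
import re

preprocess_regexps = [
    (re.compile(r'<!--Article ends here-->.*</body>', re.DOTALL | re.IGNORECASE),
     lambda match: '</body>'),
]

def apply_rules(html, rules):
    """Run each (compiled pattern, callable) rule over the HTML in order."""
    for pattern, replacement in rules:
        html = pattern.sub(replacement, html)
    return html

html = ('<html><body><p>Story</p>'
        '<!--Article ends here--><div>Ads</div></body></html>')
cleaned = apply_rules(html, preprocess_regexps)
print(cleaned)  # <html><body><p>Story</p></body></html>
```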

BasicNewsRecipe.template_css

The CSS that is used to style the templates, i.e., the navigation bars and the Tables of Contents. Rather than overriding this variable, you should use BasicNewsRecipe.extra_css in your recipe to customize look and feel.

Default: u'\n            .article_date {\n                font-size: x-small; color: gray; font-family: monospace;\n            }\n            \n            .article_description {\n                font-size: small; font-family: sans; text-indent: 0pt;\n            }\n            \n            a.article {\n                font-weight: bold; font-size: large;\n            }\n            \n            a.feed {\n                font-weight: bold; font-size: large;\n            }\n            \n            .navbar {\n                font-family:monospace; font-size:8pt\n            }\n'

BasicNewsRecipe.preprocess_html(soup)

This method is called with the source of each downloaded HTML file, before it is parsed for links and images. It can be used to do arbitrarily powerful pre-processing on the HTML. It should return soup after processing it.

soup: A BeautifulSoup instance containing the downloaded HTML.

BasicNewsRecipe.postprocess_html(soup, first_fetch)

This method is called with the source of each downloaded HTML file, after it is parsed for links and images. It can be used to do arbitrarily powerful post-processing on the HTML. It should return soup after processing it.

soup: A BeautifulSoup instance containing the downloaded HTML.

first_fetch: True if this is the first page of an article.

Convenience methods

BasicNewsRecipe.cleanup()

Called after all articles have been downloaded. Use it to do any cleanup, such as logging out of subscription sites.

BasicNewsRecipe.index_to_soup(url_or_raw)

Convenience method that takes a URL to the index page and returns a BeautifulSoup of it.

url_or_raw: Either a URL or the downloaded index page as a string

BasicNewsRecipe.sort_index_by(index, weights)

Convenience method to sort the titles in index according to weights. index is sorted in place. Returns index.

index: A list of titles.

weights: A dictionary that maps titles to weights. If any titles in index are not in weights, they are assumed to have a weight of 0.
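
The sorting behaviour can be sketched in a few lines of standard Python. The function sort_titles is a hypothetical stand-in, not the method itself. It assumes the dictionary is keyed by title, as the rule about missing titles implies, and that lower weights sort first.

```python
from collections import defaultdict

def sort_titles(index, weights):
    """Sort `index` in place by weight and return it; unknown titles weigh 0."""
    lookup = defaultdict(int, weights)
    index.sort(key=lambda title: lookup[title])
    return index

# Hypothetical section titles and weights.
sections = ['Sports', 'World', 'Opinion', 'Front Page']
weights = {'Front Page': -10, 'World': 1, 'Sports': 5}

sort_titles(sections, weights)
print(sections)  # ['Front Page', 'Opinion', 'World', 'Sports']
```

'Opinion' has no entry in weights, so it is treated as 0 and lands between the negative and positive weights.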

static BasicNewsRecipe.tag_to_string(tag, use_alt=True)

Convenience method to take a BeautifulSoup Tag and extract the text from it recursively, including any CDATA sections and alt tag attributes. Return a possibly empty unicode string.

use_alt: If True try to use the alt attribute for tags that don’t have any textual content

tag: BeautifulSoup Tag

CustomIndexRecipe

class calibre.web.feeds.news.CustomIndexRecipe

This class is useful for getting content from websites that don’t follow the “multiple articles in several feeds” content model.

CustomIndexRecipe.custom_index()

Return the filesystem path to a custom HTML document that will serve as the index for this recipe. The index document will typically contain many <a href="..."> tags that point to resources on the internet that should be downloaded.
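
As a rough idea of what producing such an index document might involve, the sketch below writes a small HTML file of links to a temporary directory and returns its path. The helper write_index, the file name index.html, and the example.com URLs are assumptions made for illustration; a real recipe would do this work inside its custom_index() override.

```python
import html
import os
import tempfile

def write_index(links, directory=None):
    """Write a minimal HTML index and return its filesystem path.

    `links` is a list of (title, url) pairs.
    """
    directory = directory or tempfile.mkdtemp()
    path = os.path.join(directory, 'index.html')
    items = '\n'.join(
        '<li><a href="%s">%s</a></li>' % (html.escape(url, quote=True),
                                          html.escape(title))
        for title, url in links
    )
    with open(path, 'w', encoding='utf-8') as f:
        f.write('<html><body><ul>\n%s\n</ul></body></html>\n' % items)
    return path

index_path = write_index([
    ('Part one', 'http://example.com/part1.html'),
    ('Part two', 'http://example.com/part2.html'),
])
print(index_path)
```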