Defines various abstract base classes that can be subclassed to create powerful news fetching recipes. The useful subclasses are:
Abstract base class that contains a number of members and methods to customize the fetching of contents in your recipes. All recipes must inherit from this class or a subclass of it.
The members and methods are organized as follows:
The title to use for the ebook
Default: 'Unknown News Source'
A couple of lines that describe the content this recipe downloads. This will be used primarily in a GUI that presents a list of recipes.
Default: ''
The author of this recipe
Default: 'calibre'
Maximum number of articles to download from each feed. This is primarily useful for feeds that don’t have article dates. For most feeds, you should use BasicNewsRecipe.oldest_article instead.
Default: 100
Age, in days, of the oldest article to download from this news source.
Default: 7.0
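The interaction between the age limit and the per-feed cap can be sketched as follows. This is only an illustration, not calibre's implementation: the helper name select_articles, the 'published' key, and the choice to keep undated articles are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def select_articles(articles, now, oldest_article=7.0, max_articles_per_feed=100):
    """Hypothetical helper: drop articles older than the cutoff, then cap the count."""
    cutoff = now - timedelta(days=oldest_article)
    recent = [
        a for a in articles
        # Assumption: undated articles (published is None) are kept,
        # which is why a per-feed cap is still useful for such feeds.
        if a['published'] is None or a['published'] >= cutoff
    ]
    return recent[:max_articles_per_feed]


now = datetime(2009, 1, 10)
articles = [
    {'title': 'new', 'published': datetime(2009, 1, 9)},
    {'title': 'old', 'published': datetime(2008, 12, 1)},
    {'title': 'undated', 'published': None},
]
kept = select_articles(articles, now)
```

Here the article dated more than seven days before now is dropped, while the undated one survives and is limited only by the cap.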
Number of levels of links to follow on article webpages
Default: 0
Delay between consecutive downloads in seconds
Default: 0
Number of simultaneous downloads. Set to 1 if the server is picky. Automatically reduced to 1 if BasicNewsRecipe.delay > 0.
Default: 5
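The rule that a non-zero delay forces serial downloads can be stated in one line. The function name effective_concurrency is hypothetical and exists only for this sketch.

```python
def effective_concurrency(simultaneous_downloads=5, delay=0):
    """Hypothetical helper: a positive delay reduces concurrency to 1."""
    return 1 if delay > 0 else simultaneous_downloads
```

A recipe that sets delay = 2 would therefore fetch one file at a time, regardless of the value of simultaneous_downloads.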
Timeout for fetching files from server in seconds
Default: 120.0
The format string for the date shown on the first page. By default: Day_Name, Day_Number Month_Name Year
Default: ' [%a, %d %b %Y]'
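To see what the default format string produces, it can be passed to the standard strftime machinery. The sketch below assumes the format is interpreted with the usual strftime codes and an English or C locale; the date is arbitrary.

```python
from datetime import date

timefmt = ' [%a, %d %b %Y]'

# 5 January 2009 was a Monday.
stamp = date(2009, 1, 5).strftime(timefmt)
print(stamp)  # ' [Mon, 05 Jan 2009]' in an English or C locale
```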
List of feeds to download. Can be either [url1, url2, ...] or [('title1', url1), ('title2', url2), ...].
Default: None
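Code that consumes this list has to cope with both shapes. A minimal sketch of one way to normalize them into (title, url) pairs follows; normalize_feeds is a hypothetical helper, and using None as the title of an untitled feed is an assumption of the example.

```python
def normalize_feeds(feeds):
    """Hypothetical helper: turn either accepted form into (title, url) pairs."""
    pairs = []
    for entry in feeds or []:          # feeds may be None (the default)
        if isinstance(entry, str):     # bare URL form
            pairs.append((None, entry))
        else:                          # ('title', url) form
            title, url = entry
            pairs.append((title, url))
    return pairs


plain = normalize_feeds(['http://example.com/rss'])
titled = normalize_feeds([('World', 'http://example.com/world.rss')])
```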
Convenient flag to disable loading of stylesheets for websites that have overly complex stylesheets unsuitable for conversion to ebook formats. If True, stylesheets are not downloaded and processed.
Default: False
Specify an override encoding for sites that have an incorrect charset specification. The most common case is a site that declares latin1 but actually uses cp1252. If None, try to detect the encoding.
Default: None
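The latin1 versus cp1252 mix-up is easiest to see with typographic quotes, which cp1252 places at bytes 0x93 and 0x94. The sample bytes below are made up for illustration.

```python
raw = b'\x93Breaking news\x94'

# Decoded as latin1, the quote bytes become invisible C1 control characters.
as_latin1 = raw.decode('latin1')

# Decoded as cp1252, they become the intended curly quotes.
as_cp1252 = raw.decode('cp1252')
```

For a site like this, setting encoding = 'cp1252' in the recipe would be the appropriate override.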
Return a browser instance used to fetch documents from the web. By default it returns a mechanize browser instance that supports cookies, ignores robots.txt, handles refreshes and has a Mozilla Firefox user agent.
If your recipe requires that you log in first, override this method in your subclass. For example, the following code is used in the New York Times recipe to log in for full access:
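def get_browser(self):
    # Start from the default browser; the unbound call needs self passed explicitly
    br = BasicNewsRecipe.get_browser(self)
    if self.username is not None and self.password is not None:
        # Fill in and submit the login form before any articles are fetched
        br.open('http://www.nytimes.com/auth/login')
        br.select_form(name='login')
        br['USERID'] = self.username
        br['PASSWORD'] = self.password
        br.submit()
    return br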
This method should be implemented in recipes that parse a website instead of feeds to generate a list of articles. Typical uses are for news sources that have a “Print Edition” webpage that lists all the articles in the current print edition. If this function is implemented, it will be used in preference to BasicNewsRecipe.parse_feeds().
It must return a list. Each element of the list must be a 2-element tuple of the form ('feed title', list of articles).
Each list of articles must contain dictionaries of the form:
{
    'title'       : article title,
    'url'         : URL of print version,
    'date'        : The publication date of the article as a string,
    'description' : A summary of the article,
    'content'     : The full article (can be an empty string). This is used by FullContentProfile
}
For an example, see the recipe for downloading The Atlantic.
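The shape of the return value can be sketched without any scraping. In the example below, build_index and its input (a mapping from feed title to a list of (title, url) pairs) are hypothetical; a real recipe would obtain that data by parsing the site's index page. Only the returned structure follows the description above.

```python
def build_index(sections):
    """Hypothetical helper: build the list of ('feed title', articles) tuples."""
    index = []
    for feed_title, items in sections.items():
        articles = [
            {
                'title': title,
                'url': url,
                'date': '',          # publication date as a string, if known
                'description': '',   # summary of the article
                'content': '',       # full article text; may be empty
            }
            for title, url in items
        ]
        index.append((feed_title, articles))
    return index


index = build_index({
    'Front Page': [('Lead story', 'http://example.com/lead')],
    'Opinion': [('Editorial', 'http://example.com/editorial')],
})
```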
Max number of characters in the short description
Default: 500
Normally we try to guess if a feed has full articles embedded in it based on the length of the embedded content. If None, then the default guessing is used. If True, we always assume the feed has embedded content, and if False, we always assume the feed does not have embedded content.
Default: None
Take a url pointing to the webpage with article content and return the URL pointing to the print version of the article. By default does nothing. For example:
def print_version(self, url):
    return url + '?&pagewanted=print'
Specify any extra CSS that should be added to downloaded HTML files. It will be inserted into <style> tags, just before the closing </head> tag, thereby overriding all CSS except that which is declared using the style attribute on individual HTML tags. For example:
extra_css = '.heading { font: serif x-large }'
Default: None
List of regular expressions that determine which links to follow. If empty, it is ignored. For example:
match_regexps = [r'page=[0-9]+']
will match all URLs that have page=some number in them.
Only one of BasicNewsRecipe.match_regexps or BasicNewsRecipe.filter_regexps should be defined.
Default: []
List of regular expressions that determine which links to ignore. If empty, it is ignored. For example:
filter_regexps = [r'ads\.doubleclick\.net']
will remove all URLs that have ads.doubleclick.net in them.
Only one of BasicNewsRecipe.match_regexps or BasicNewsRecipe.filter_regexps should be defined.
Default: []
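The effect of the two lists on a candidate link can be sketched with the standard re module. This is an illustration under stated assumptions, not calibre's code: should_follow is a hypothetical helper, and it assumes patterns are matched anywhere in the URL (re.search), which is what the examples above suggest.

```python
import re


def should_follow(url, match_regexps=(), filter_regexps=()):
    """Hypothetical helper: decide whether a link passes the configured lists."""
    # A URL matching any filter pattern is ignored.
    if any(re.search(pattern, url) for pattern in filter_regexps):
        return False
    # A non-empty match list restricts following to URLs matching a pattern.
    if match_regexps:
        return any(re.search(pattern, url) for pattern in match_regexps)
    # Both lists empty: nothing is excluded.
    return True


paged = should_follow('http://example.com/story?page=2',
                      match_regexps=[r'page=[0-9]+'])
advert = should_follow('http://ads.doubleclick.net/banner',
                       filter_regexps=[r'ads\.doubleclick\.net'])
```

The helper accepts both lists only so that one function can demonstrate either setting; a recipe should still define just one of them.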
List of tags to be removed. Specified tags are removed from downloaded HTML. A tag is specified as a dictionary of the form:
{
    name  : 'tag name',    # e.g. 'div'
    attrs : a dictionary,  # e.g. {class: 'advertisement'}
}
All keys are optional. For a full explanation of the search criteria, see the Beautiful Soup documentation. A common example:
remove_tags = [dict(name='div', attrs={'class':'advert'})]
This will remove all <div class="advert"> tags and all their children from the downloaded HTML.
Default: []
Remove all tags that occur after the specified tag. For the format for specifying a tag see BasicNewsRecipe.remove_tags. For example:
remove_tags_after = [dict(id='content')]
will remove all tags after the first element with id="content".
Default: None
Remove all tags that occur before the specified tag. For the format for specifying a tag see BasicNewsRecipe.remove_tags. For example:
remove_tags_before = [dict(id='content')]
will remove all tags before the first element with id="content".
Default: None
Keep only the specified tags and their children. For the format for specifying a tag see BasicNewsRecipe.remove_tags. If this list is not empty, then the <body> tag will be emptied and re-filled with the tags that match the entries in this list. For example:
keep_only_tags = [dict(id=['content', 'heading'])]
will keep only tags that have an id attribute of “content” or “heading”.
Default: []
List of regexp substitution rules to run on the downloaded HTML. Each element of the list should be a two element tuple. The first element of the tuple should be a compiled regular expression and the second a callable that takes a single match object and returns a string to replace the match. For example:
preprocess_regexps = [
    (re.compile(r'<!--Article ends here-->.*</body>', re.DOTALL|re.IGNORECASE),
     lambda match: '</body>'),
]
will remove everything from <!--Article ends here--> to </body>.
Default: []
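How such a rule list acts on a page can be sketched with plain re. The helper apply_rules and the sample HTML are hypothetical; the sketch assumes the rules are applied in order, each to the output of the previous one.

```python
import re

preprocess_regexps = [
    (re.compile(r'<!--Article ends here-->.*</body>', re.DOTALL | re.IGNORECASE),
     lambda match: '</body>'),
]


def apply_rules(html, rules):
    """Hypothetical helper: run each (pattern, callable) rule over the HTML."""
    for pattern, func in rules:
        # pattern.sub calls func with each match object and uses its return value
        html = pattern.sub(func, html)
    return html


html = '<body><p>Story</p><!--Article ends here--><div>ads</div></body>'
cleaned = apply_rules(html, preprocess_regexps)
```

After the substitution, cleaned contains only the story paragraph inside the body.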
The CSS that is used to style the templates, i.e., the navigation bars and the Tables of Contents. Rather than overriding this variable, you should use BasicNewsRecipe.extra_css in your recipe to customize look and feel.
Default: u'\n .article_date {\n font-size: x-small; color: gray; font-family: monospace;\n }\n \n .article_description {\n font-size: small; font-family: sans; text-indent: 0pt;\n }\n \n a.article {\n font-weight: bold; font-size: large;\n }\n \n a.feed {\n font-weight: bold; font-size: large;\n }\n \n .navbar {\n font-family:monospace; font-size:8pt\n }\n'
This method is called with the source of each downloaded HTML file, before it is parsed for links and images. It can be used to do arbitrarily powerful pre-processing on the HTML. It should return soup after processing it.
soup: A BeautifulSoup instance containing the downloaded HTML.
This method is called with the source of each downloaded HTML file, after it is parsed for links and images. It can be used to do arbitrarily powerful post-processing on the HTML. It should return soup after processing it.
soup: A BeautifulSoup instance containing the downloaded HTML.
first_fetch: True if this is the first page of an article.
Convenience method that takes a URL to the index page and returns a BeautifulSoup of it.
url_or_raw: Either a URL or the downloaded index page as a string
Convenience method to sort the titles in index according to weights. index is sorted in place. Returns index.
index: A list of titles.
weights: A dictionary that maps titles to weights. If any titles in index are not in weights, they are assumed to have a weight of 0.
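The sorting behaviour can be sketched as a standalone function. This is a minimal illustration rather than calibre's code; it assumes weights is keyed by title and that lower weights sort first.

```python
def sort_index_by(index, weights):
    """Sketch: sort titles in place by weight; missing titles weigh 0."""
    index.sort(key=lambda title: weights.get(title, 0))
    return index


index = ['Sports', 'World', 'Opinion']
result = sort_index_by(index, {'World': 1, 'Sports': 2})
```

Because 'Opinion' has no entry in weights, it is treated as weight 0 and moves to the front; the returned list is the same object as index.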
Convenience method to take a BeautifulSoup Tag and extract the text from it recursively, including any CDATA sections and alt tag attributes. Return a possibly empty unicode string.
use_alt: If True try to use the alt attribute for tags that don’t have any textual content
tag: BeautifulSoup Tag