Detailed HOWTO on creating your own User Profiles (by DaveC)

The easiest way to create your own custom User Profile (we'll call ours MyProfile.py) is to use one of the built-in profiles as a starting point and modify the code for the web site you want to render as an LRF. Keep in mind that most web pages have quite a few extra HTML features, such as buttons, links to other pages in the web site, and multimedia links, that just don't render well on the Sony Reader. It is primarily these things that you are getting rid of, so that you get down to the main text of the articles that you actually want to see on the reader.

A bit of understanding of how HTML pages are structured is helpful at this point. Very roughly, an HTML web page uses paired matching tags enclosed in <>'s to define its structure and content:

(Note, this is a very simplified representation/abstraction of a web page)

<HTML>   <-sets the start and end points of the web page's data/contents

  <HEAD>

    <TITLE>The Web Page's Title</TITLE>

    <META>...various meta data defining the author, language, character set, encoding, etc </META>
    
    <LINK>...common resources like style sheets used for multiple pages on the web site</LINK>
    
    <STYLE>...more style sheet info</STYLE>
    
    <SCRIPT>...shared javascript code</SCRIPT>

  </HEAD>

  <BODY>     <-additional parameters may be contained in this tag, e.g. class
                Most of what you want for your LRF conversion will be in the <BODY> section

    <div id="header">...</div>  <--various Web Page Elements like header graphics, ads, menu buttons, etc.

    <div id="article">
      <a href="...">A link to another URL</a>
      <img src="a banner ad">

      <div id="article body">
        <img src="...">
        The text you want lives here
        blah blah blah
        <p>                <--New Paragraph
        blah blah blah

        <script>some JavaScript can go here, e.g. to open a window for comments.</script>

      </div>
    </div>

    <div id="footer">...</div>  <--additional stuff at the bottom of the web page, ads, links, etc.
    <table>e.g. a bunch of links to other pages on this site</table>

  </BODY>
</HTML>

Most web pages are not hand coded but generated from an internal database of articles, ad banner locations, affiliate links, internal menus, etc. The web server code assembles these bits and pieces into one big web page (the stuff between the various <div> tags is plugged together like Lego blocks). This makes our job a lot easier, as all we have to do is identify the tags that a particular web site consistently uses to bound the text portion we want for the LRF conversion. Unfortunately, no set of tags is guaranteed to be consistent across web sites, as each site is usually very different from the next, but from document to document within one site there may be patterns.

1) Fire up your web browser and point it to the site's RSS URL. Open up a few articles and look for a print version (look for the printer icon) or a one-page version (for multipage articles). These usually have the fewest extra banners, ads, graphics and menus to wade through and will have the whole article text on one page.

If there is a print version, look at its URL and compare it with the original URL. If you're lucky, you might see a pattern like this:

Original article URL: http://www.example.com/2007/11/07/article=1111

Print version URL: http://www.example.com/print/2007/11/07/article=1111

In that case, add this definition to your MyProfile.py:

  def print_version(self, url):
    return url.replace('http://www.example.com/', 'http://www.example.com/print/')

This inserts "print/" right after the site address in the URL, so web2lrf will always pull down the print version if it is available.

Or if the URLs look more like this:

Original article URL: http://www.example.com/2007/11/07/article=1111

Print version URL: http://www.example.com/2007/11/07/article=1111/print

You'd add this code to your MyProfile.py:

  def print_version(self, url):
    return url + '/print'

This will add "/print" to all article URLs to get the print version.

You may need to do more complex substitutions if you're not lucky (Kovid, could you comment on how we might do this if this is possible?).
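As one possibility, a regular expression can pull the useful part out of the article URL and build the print URL from it. The sketch below is only an illustration: the print.php?id= layout is made up, and it is written as a plain function so it can be tried on its own. Inside MyProfile.py it would be a method, def print_version(self, url).

```python
import re

# Hypothetical URL layout, assumed only for this illustration:
#   article: http://www.example.com/2007/11/07/article=1111
#   print:   http://www.example.com/print.php?id=1111
ARTICLE_URL = re.compile(
    r'^http://www\.example\.com/\d{4}/\d{2}/\d{2}/article=(\d+)$')

def print_version(url):
    """Return the print URL for an article URL, or the URL unchanged."""
    match = ARTICLE_URL.match(url)
    if match is None:
        # Not an article URL we recognise, so leave it alone.
        return url
    # group(1) is the article number captured by (\d+).
    return 'http://www.example.com/print.php?id=' + match.group(1)

print(print_version('http://www.example.com/2007/11/07/article=1111'))
```

Returning the URL untouched when the pattern does not match means an unexpected link is still fetched in its normal form instead of being turned into a broken address.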


For the next part, you'll have to be familiar with the use of Regular Expressions in Python. There are many good online resources that provide a tutorial or primer on this subject; here is a good place to start: http://docs.python.org/lib/re-syntax.html

I was able to use the existing example code and modify it to my needs by some simple substitution of the search strings without any fancy pattern matching. If you need to do some more complex/sophisticated pattern matching, read up on how to create Regular Expressions (REGEX) at the link above.


The next step involves modifying the HTML code to strip out the bits you don't want to bother trying to put into the LRF. You'll use variations of this code to accomplish this:

    preprocess_regexps = [
       (re.compile(r'<start>.*?<end>', re.IGNORECASE | re.DOTALL), lambda match : '<New Stuff>'),
    ]

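The .*? is a non-greedy wild card: it matches as little text as possible between <start> and the next <end>. With re.DOTALL the match can span multiple lines, and with re.IGNORECASE the tags match in upper or lower case.

For example, with the code above, the text: <start>this is some text<end> would be matched and replaced by: <New Stuff>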

These text segments would also be matched and replaced as above: <start>blah blah blah<end> <start>this is some more text<end>

But this wouldn't match and would be left alone: <begin>keep this text<end>
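The matching rules above can be checked with a few lines of plain Python outside of web2lrf. This is just a quick sketch using the same made-up <start> and <end> tags; the sample strings are my own.

```python
import re

pattern = re.compile(r'<start>.*?<end>', re.IGNORECASE | re.DOTALL)

samples = [
    '<start>this is some text<end>',   # matches
    '<START>spans\ntwo lines<END>',    # matches: IGNORECASE and DOTALL
    '<begin>keep this text<end>',      # no <start>, so left alone
]
results = [pattern.sub(lambda match: '<New Stuff>', text) for text in samples]
for before, after in zip(samples, results):
    print('%r -> %r' % (before, after))

# Because .*? is non-greedy, two spans are replaced separately and the
# text between them survives.
two_spans = pattern.sub(lambda match: '<New Stuff>',
                        '<start>a<end> keep <start>b<end>')
print(two_spans)
```

With a greedy .* instead of .*?, the last example would collapse into a single <New Stuff> and the word "keep" would be lost.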

If you want to completely remove the text, have the lambda return an empty string (''). This deletes all instances of any text bounded by <start> and <end>, including the tags themselves:

    preprocess_regexps = [
       (re.compile(r'<start>.*?<end>', re.IGNORECASE | re.DOTALL), lambda match : ''),
    ]

You can also have multiple (re.compile(...), lambda ...), lines to match multiple patterns:

    preprocess_regexps = [
       (re.compile(r'<start>.*?<end>', re.IGNORECASE | re.DOTALL), lambda match : '<New Stuff>'),
       (re.compile(r'<advertisement>.*?</advertisement>', re.IGNORECASE | re.DOTALL), lambda match : ''),
       (re.compile(r'<banner>.*?</banner>', re.IGNORECASE | re.DOTALL), lambda match : ''),
    ]

This swaps out the stuff between <start> and <end> tags with <New Stuff> and gets rid of stuff between the <advertisement> and <banner> tags.
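To try a list like this without running a full conversion, a small helper can apply each pair in turn. This is a sketch that assumes web2lrf applies each compiled pattern with its sub() method in list order; the helper name apply_regexps and the sample string are hypothetical and not part of web2lrf.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL
preprocess_regexps = [
    (re.compile(r'<start>.*?<end>', FLAGS), lambda match: '<New Stuff>'),
    (re.compile(r'<advertisement>.*?</advertisement>', FLAGS), lambda match: ''),
    (re.compile(r'<banner>.*?</banner>', FLAGS), lambda match: ''),
]

def apply_regexps(html, regexps):
    """Hypothetical helper: run every (pattern, function) pair over html in order."""
    for pattern, func in regexps:
        html = pattern.sub(func, html)
    return html

sample = ('<start>old<end><advertisement>Buy now!</advertisement>'
          'text<banner>ad</banner>')
cleaned = apply_regexps(sample, preprocess_regexps)
print(cleaned)
```

For this sample the result is <New Stuff>text: the first rule swaps in the new text and the other two delete their spans.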

This code will perform a substitution within the matching tags:

       (re.compile(r'<meta http-equiv="Content-Type" content="text/html; charset=(\S+)"', re.IGNORECASE | re.DOTALL), lambda match : match.group().replace(match.group(1), 'UTF-8')),

With this line, <meta http-equiv="Content-Type" content="text/html; charset=ASCII" /> will become <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
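Here is the same substitution as a stand-alone check. It is a sketch that calls sub() directly on the compiled pattern, which I am assuming is how web2lrf uses each entry; the variable names are my own.

```python
import re

charset_pattern = re.compile(
    r'<meta http-equiv="Content-Type" content="text/html; charset=(\S+)"',
    re.IGNORECASE | re.DOTALL)

def force_utf8(match):
    # match.group() is the whole matched tag text, match.group(1) is the
    # charset name captured by (\S+).
    return match.group().replace(match.group(1), 'UTF-8')

before = '<meta http-equiv="Content-Type" content="text/html; charset=ASCII" />'
after = charset_pattern.sub(force_utf8, before)
print(after)
```

Note that replace() swaps every occurrence of the captured text inside the matched tag, so a charset name that also appeared earlier in the tag would be replaced there as well. For normal charset names this is not a problem.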


2) Look for the "View Source" or "Page Source" menu option in your web browser, usually found under the "View" menu. You will see a lot of busy, complex HTML code, but scroll through and find where the <BODY> section starts and where it ends at </BODY>. Then look between these tags and identify the start and end of the actual article text. Do this for several articles until you figure out some sort of pattern.

Let's say the article pages all seem to adhere to our example HTML code above.

The following code in MyProfile.py would get rid of everything between <HEAD> and </HEAD>, which has the nice effect of removing lots of link, meta and JavaScript code (and other external program code) that isn't going into the LRF anyway:

    preprocess_regexps = [
       (re.compile(r'<HEAD>.*?</HEAD>', re.IGNORECASE | re.DOTALL), lambda match : '<HEAD></HEAD>'),
    ]

However, someday web2lrf may use some of the meta information, and really the main thing you want to get rid of is any CSS references (these can override the LRF font sizes used in the conversion, making for hard-to-read, overly large or small fonts). The easiest way to ignore CSS is to add the following line:

  no_stylesheets = True

Similarly, if we wanted to get rid of all the stuff between the <BODY> tag and the Article text that we want, and the useless stuff at the end of the page, we could add the following lines:

    preprocess_regexps = [
       (re.compile(r'<BODY>.*?<div id="article body">', re.IGNORECASE | re.DOTALL), lambda match : '<BODY><div id="article body">'),
       (re.compile(r'<div id="footer">.*?</BODY>', re.IGNORECASE | re.DOTALL), lambda match : '</BODY>'),
    ]
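Real pages often carry attributes on the <BODY> tag (as noted in the sample page), and then the literal <BODY> in the pattern above will not match. A sketch of a more tolerant pattern follows; the sample page and the class="story" attribute are made up for the illustration.

```python
import re

# <BODY[^>]*> accepts a bare <BODY> as well as <body class="story">:
# [^>]* matches any attributes up to the closing >.
strip_before_article = re.compile(
    r'<BODY[^>]*>.*?<div id="article body">', re.IGNORECASE | re.DOTALL)

page = ('<html><body class="story"><div id="header">menu</div>'
        '<div id="article body">The text you want</div></body></html>')
cleaned = strip_before_article.sub(
    lambda match: '<BODY><div id="article body">', page)
print(cleaned)
```

The replacement writes a plain <BODY>, so any attributes on the original tag are dropped along with the unwanted header.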

If you wanted to remove the JavaScript, you could add this line:

       (re.compile(r'<script>.*?</script>', re.IGNORECASE | re.DOTALL), lambda match : ''),

So if our MyProfile.py looked like this:

  preprocess_regexps = [
    (re.compile(r'<HEAD>.*?</HEAD>', re.IGNORECASE | re.DOTALL), lambda match : '<HEAD></HEAD>'),
    (re.compile(r'<BODY>.*?<div id="article body">', re.IGNORECASE | re.DOTALL), lambda match : '<BODY><div id="article body">'),
    (re.compile(r'<div id="footer">.*?</BODY>', re.IGNORECASE | re.DOTALL), lambda match : '</BODY>'),
    (re.compile(r'<script>.*?</script>', re.IGNORECASE | re.DOTALL), lambda match : ''),
    ]

Our very simple Sample HTML code would look like this after being processed:

<HTML>   <-sets the start and end points of the web page's data/contents

  <HEAD></HEAD>

  <BODY><div id="article body">
        <img src="...">

        The text you want lives here
        blah blah blah
        <p>                <--New Paragraph
        blah blah blah

      </div>
    </div>

  </BODY>
</HTML>

Note that there may be an odd extra unmatched </div> tag, which web2lrf doesn't seem to care much about.
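The whole sequence can be tried on a cut-down copy of the sample page. This sketch assumes web2lrf applies the patterns in list order using sub(); the sample string is a shortened version of the page above without the annotations.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL
preprocess_regexps = [
    (re.compile(r'<HEAD>.*?</HEAD>', FLAGS), lambda match: '<HEAD></HEAD>'),
    (re.compile(r'<BODY>.*?<div id="article body">', FLAGS),
     lambda match: '<BODY><div id="article body">'),
    (re.compile(r'<div id="footer">.*?</BODY>', FLAGS), lambda match: '</BODY>'),
    (re.compile(r'<script>.*?</script>', FLAGS), lambda match: ''),
]

sample = (
    '<HTML><HEAD><TITLE>Title</TITLE><SCRIPT>shared()</SCRIPT></HEAD>'
    '<BODY><div id="header">ads</div>'
    '<div id="article"><a href="...">link</a>'
    '<div id="article body">The text you want<p>more text'
    '<script>comments()</script></div></div>'
    '<div id="footer">links</div><table>more links</table>'
    '</BODY></HTML>'
)

html = sample
for pattern, func in preprocess_regexps:
    # Each rule works on the output of the previous one.
    html = pattern.sub(func, html)
print(html)
```

The output keeps one opening <div> but two closing </div> tags, which is the unmatched tag mentioned above: the opening <div id="article"> was removed together with the junk before the article body.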


That's pretty much how I made the wired.py User Profile: through some trial and error, looking at the HTML source of a bunch of different pages to find common, consistent patterns that bounded the start and end of the interesting bits of the main article text, and trying out the patterns.