Tutorial 6: Generating an RSS/Atom Feed from a Product Search

In this tutorial we will go over configuring screen-scraper to generate an RSS or Atom feed based on extracted data. We will continue using the "Shopping Site" scraping session we created in Tutorial 2. To use the RSS/Atom functionality you need to be running the Enterprise Edition of screen-scraper.

If you haven't already gone through Tutorial 2, this tutorial will make more sense if you do so first.

If you decided not to go through Tutorial 2, or don't still have the scraping session and scripts you created in it, you can download and import them into screen-scraper by following these steps:

  1. Download the zip file located here and unzip it. You should now have an "interpreted_java" directory and a "vbscript" directory.
  2. If you're running Windows, and prefer to program in VBScript, import the "Shopping Site (Scraping Session).sss" scraping session located in the "vbscript" directory; otherwise, import the one located in the "interpreted_java" directory. Instructions on importing objects into screen-scraper can be found here.

Once you've imported the scraping session into screen-scraper, you're ready to roll. Continue to the "Tutorial Details" page below to get going.

Tutorial 6: Page 2: Tutorial Details

Before going on, take a minute to read over the Generating RSS and Atom Feeds page in our documentation. That should give you a basic overview.

We're going to configure our "Shopping Site" scraping session so that it generates a feed of products based on a search parameter. That is, we'll give it a search keyword (e.g., "bug" or "dvd"), it will extract the product data, then create an XML feed out of the scraped data. For testing purposes we'll just access the XML feed from a web browser, though you could just as easily access it from an RSS/Atom reader.

Tutorial 6: Page 3: Setting Up the Scraping Session

If you read over the Generating RSS and Atom Feeds page, you can probably guess how we'll need to modify the scraping session. Let's start by altering the name of the extractor pattern that grabs the product details. In screen-scraper, click on the "Details page" scrapeable file for the "Shopping Site" scraping session, then click the "Extractor Patterns" tab. Change the name of the extractor pattern from "PRODUCTS" to "XML_FEED". This pattern will extract the DataSet that will hold our entire feed.

We now need to designate the fields for the individual items in the feed. Click on the "Sub-Extractor Patterns" tab for our feed. We're extracting several fields, but for the sake of simplicity we'll deal with just two of them. For the "TITLE" portion of our feed we're in luck, because we already have a "TITLE" sub-extractor pattern. For the "DESCRIPTION" part of the feed item, we're not currently extracting the full description from the product details page, so just to provide an example let's use the "MODEL" field instead. Change the name of the "MODEL" sub-extractor pattern to "DESCRIPTION" so that it looks like this:

>Model: ~@DESCRIPTION@~<

There are two more elements we need for our XML feed: "LINK" and "PUBLISHED_DATE". We're obviously not extracting either of these, so let's write a quick script to set them for us. Create a new script by clicking on the pencil and paper icon in the button bar. Give the script the name "Set URL and published date". Copy and paste this in for the text of the script:

// Set the "LINK" element to the URL of the current product details page.
dataRecord.put( "LINK", scrapeableFile.getCurrentURL() );

// Set the "PUBLISHED_DATE" element to the current date and time.
dataRecord.put( "PUBLISHED_DATE", new Date() );

Once you've created the script, associate it with the "XML_FEED" extractor pattern by clicking on the "Details page" scrapeable file, then on the "Extractor Patterns" tab. Click on the "Add Script" button, then select "Set URL and published date" under the "Script Name" column and "After each pattern application" under the "When to Run" column.

The script is fairly straightforward. We first set the "LINK" element to the URL of the product details page we're currently on. You'll notice that we're setting the value via the "put" method on the current DataRecord object. Because this script is invoked after each pattern application, the "dataRecord" object will be in scope. You'll likely remember from previous tutorials that the "dataRecord" object can be thought of as the current row in the spreadsheet of extracted data. Here we're simply adding a cell to the current row of the spreadsheet for the "LINK" element of the feed. The second element we set is "PUBLISHED_DATE". For those unfamiliar with Java, "new Date()" creates a Date object representing the current date and time, so each feed item is marked as published at the moment it was scraped.
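To make the "row in a spreadsheet" picture concrete, here is a minimal standalone Java sketch. It is not screen-scraper code: a plain Map stands in for the "dataRecord" object, and the class name, method name, and sample values are all made up for illustration. Inside screen-scraper the script shown earlier is all you need, since the runtime supplies "dataRecord" and "scrapeableFile" for you.

```java
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

// Standalone illustration only: a plain Map stands in for screen-scraper's
// dataRecord object, which the real script receives from the runtime.
class FeedItemSketch {

    // Adds the two feed fields that the extractor pattern does not supply.
    // "currentUrl" is a hypothetical stand-in for scrapeableFile.getCurrentURL().
    static Map<String, Object> addFeedFields(Map<String, Object> row, String currentUrl) {
        row.put("LINK", currentUrl);
        row.put("PUBLISHED_DATE", new Date());
        return row;
    }

    public static void main(String[] args) {
        // Cells that the sub-extractor patterns would already have filled in
        // (the values here are invented).
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("TITLE", "Example product");
        row.put("DESCRIPTION", "Example model");

        addFeedFields(row, "http://www.example.com/product?id=1");

        // Print each cell in the row, one per line.
        for (Map.Entry<String, Object> cell : row.entrySet()) {
            System.out.println(cell.getKey() + " = " + cell.getValue());
        }
    }
}
```

After the call, the row holds four cells: the two extracted ones plus "LINK" and "PUBLISHED_DATE", which mirrors what the script does to each record of the "XML_FEED" pattern.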

If you haven't done so previously, you'll also want to disable the "Shopping Site--initialize session" script. We'll be passing values in externally, and this script would otherwise overwrite those values. To disable the script, click on the "Shopping Site" scraping session in the tree on the left, then on the "Scripts" tab. Un-check the box in the table under the "Enabled?" column.

Take a minute now to save your work.

That's it for setting up the scraping session. We're now going to generate the feed.

Tutorial 6: Page 4: Generating the XML Feed

Let's run a quick test just to make sure the scraping session works. After that, we'll add a few more bells and whistles. Start up screen-scraper as a server. If you need help with that, try this page. Once that's up, assuming you haven't altered the default "SOAP Server" port (which is also the web server port), and that you're running screen-scraper on your local machine, try entering this URL into your browser:

http://localhost:8779/ss/xmlfeed?scraping_session=Shopping+Site&SEARCH=bug

If all goes well the browser should take a little bit to load, then you should see an XML document appear containing the extracted information. If you got an error message or the document didn't appear as you expected it to, check screen-scraper's log. Just as with scraping sessions run remotely, screen-scraper will create a log file in its "log" folder corresponding to each RSS/Atom scraping session.
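If you would rather assemble the feed URL in code than type it by hand, the encoding can be left to the standard library. The following is a rough sketch in plain Java, assuming only the URL layout shown above; the class and method names are hypothetical and are not part of screen-scraper.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper that assembles a feed URL in the format shown above.
class FeedUrlSketch {

    // Encodes a single query-string value (a space becomes "+").
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(e);
        }
    }

    // host and port identify the machine running screen-scraper as a server.
    static String buildFeedUrl(String host, int port, String scrapingSession, String search) {
        return "http://" + host + ":" + port + "/ss/xmlfeed"
                + "?scraping_session=" + encode(scrapingSession)
                + "&SEARCH=" + encode(search);
    }

    public static void main(String[] args) {
        System.out.println(buildFeedUrl("localhost", 8779, "Shopping Site", "bug"));
    }
}
```

Called with "localhost", 8779, "Shopping Site", and "bug", this produces the same URL as the one given above, with the space in the session name encoded as "+".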

Dealing with the URL directly can be a bit cryptic, what with the encoding and all. As such, let's make use of a little HTML file that will allow us to generate feeds using different search parameters and formats. You can access it here. Note that this HTML file assumes that you're running screen-scraper as a server on your local machine on port 8779. If any of that isn't the case you'll want to download the HTML file to your local machine, alter it with your settings, then open it back up in your browser.

Try experimenting with the form a bit. It gives you control over almost all of the available features, including the format of the feed. Also take a close look at the URL. screen-scraper simply converts the GET parameters in the URL to session variables in the scraping session. If you'd like, you can even open the feed in your favorite RSS/Atom reader to ensure that the format is valid.
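The idea of turning GET parameters into named variables can be sketched in a few lines of plain Java. This is only an illustration of the mapping, not screen-scraper's actual implementation; the class name is hypothetical, and the sketch treats every parameter the same way, which may differ from how screen-scraper handles control parameters such as "scraping_session".

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: splits a query string into decoded name/value pairs,
// much as a server might before exposing them as session variables.
class QueryToVariablesSketch {

    static Map<String, String> parse(String query) {
        Map<String, String> variables = new LinkedHashMap<>();
        if (query == null || query.isEmpty()) {
            return variables;
        }
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            // A parameter without "=" is kept with an empty value.
            String name = eq >= 0 ? pair.substring(0, eq) : pair;
            String value = eq >= 0 ? pair.substring(eq + 1) : "";
            variables.put(decode(name), decode(value));
        }
        return variables;
    }

    // Reverses URL encoding ("+" becomes a space, %XX becomes a character).
    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> vars = parse("scraping_session=Shopping+Site&SEARCH=bug");
        for (Map.Entry<String, String> v : vars.entrySet()) {
            System.out.println(v.getKey() + " = " + v.getValue());
        }
    }
}
```

With the query string from the test URL, "SEARCH" maps to "bug", which is the value the scraping session would then read as its search keyword.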

Tutorial 6: Page 5: Where to Go From Here

The ability to generate RSS/Atom feeds directly from scraped data opens up quite a few interesting possibilities. Where you take things from this point is left to your imagination...