Tutorial 6: Page 3: Setting Up the Scraping Session
If you read over the Generating RSS and Atom Feeds page you can probably guess at how we'll need to modify the scraping session. Let's start by altering the name of the extractor pattern that grabs the product details. In screen-scraper click on the "Details page" scrapeable file for the "Shopping Site" scraping session, then click the "Extractor Patterns" tab. Change the name of the extractor pattern from "PRODUCTS" to "XML_FEED". This pattern will extract the DataSet that will hold our entire feed.

We'll now need to designate the fields for the individual items in the feed. Click on the "Sub-Extractor Patterns" tab for our feed. There are several fields we're extracting, but for the sake of simplicity we'll just deal with two of them. For the "TITLE" portion of our feed we're in luck, because we already have a "TITLE" sub-extractor pattern. For the "DESCRIPTION" part of the feed item, we're not currently extracting the full description from the product details page, so just for the sake of providing an example let's use the "MODEL" field instead. Change the name of the "MODEL" sub-extractor pattern to "DESCRIPTION" so that it looks like this:

>Model: ~@DESCRIPTION@~<

There are two more elements we need for our XML feed: "LINK" and "PUBLISHED_DATE". We're not extracting either of these, so let's write a quick script to set them for us. Create a new script by clicking on the pencil and paper icon in the button bar. Give the script the name "Set URL and published date". Copy and paste this in for the text of the script:
// Set the "LINK" element to the URL of the current product details page.

Once you've created the script, associate it with the "XML_FEED" extractor pattern by clicking on the "Details page" scrapeable file, then on the "Extractor Patterns" tab. Click on the "Add Script" button, select "Set URL and published date" under the "Script Name" column, and "After each pattern application" under the "When to Run" column.

The script is fairly straightforward. We first set the "LINK" element to the URL of the product details page we're currently on. You'll notice that we're setting the value via the "put" method on the current DataRecord object. Because this script will get invoked for each pattern application, the "dataRecord" object will be in scope. You'll likely remember from previous tutorials that the "dataRecord" object can be thought of as the current row on the spreadsheet of extracted data. Here we're simply adding a cell to the current row of the spreadsheet for the "LINK" element of the feed. The second element we set is the "PUBLISHED_DATE". For those unfamiliar with Java, passing it "new Date()" simply indicates that the feed item was published on the current date.

If you haven't done so previously, you'll also want to disable the "Shopping Site--initialize session" script. We'll be passing values in externally, and this script would otherwise overwrite those values. To disable the script, click on the "Shopping Site" scraping session in the tree on the left, then on the "Scripts" tab. Uncheck the box in the table under the "Enabled?" column.

Take a minute now to save your work. That's it for setting up the scraping session. We're now going to generate the feed.
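
If you'd like to see the idea behind the "Set URL and published date" script outside of screen-scraper, here is a minimal standalone sketch in plain Java. It is not the tutorial's script and does not use the screen-scraper API: a plain Map stands in for the "dataRecord" object (one row of the spreadsheet), and the class name, the "addFeedFields" helper, and the example URL are all hypothetical. Inside screen-scraper the URL would come from the scraping session rather than being passed in by hand.

```java
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

public class FeedItemSketch {

    // Adds the two extra feed fields to one "row" of extracted data.
    // ASSUMPTION: the Map is only a stand-in for screen-scraper's dataRecord,
    // and currentUrl is supplied by the caller for illustration.
    static Map<String, Object> addFeedFields(Map<String, Object> dataRecord, String currentUrl) {
        // Add a "LINK" cell to the current row.
        dataRecord.put("LINK", currentUrl);
        // new Date() captures the moment the record is processed.
        dataRecord.put("PUBLISHED_DATE", new Date());
        return dataRecord;
    }

    public static void main(String[] args) {
        // Pretend the extractor pattern already matched these two fields.
        Map<String, Object> dataRecord = new HashMap<>();
        dataRecord.put("TITLE", "Example product");
        dataRecord.put("DESCRIPTION", "Example model");

        // Hypothetical details-page URL, used only for this sketch.
        addFeedFields(dataRecord, "http://example.com/product?id=1");

        // The row now carries all four feed fields.
        System.out.println(dataRecord.keySet());
    }
}
```

Because the helper runs once per row, each feed item ends up with its own "LINK" and "PUBLISHED_DATE", which mirrors running the script "After each pattern application".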