Tutorial 2: Scraping an E-commerce Site

In this tutorial we'll be scraping search results from a basic e-commerce site. We'll also demonstrate logging in to a web site before scraping data. Data you'll be scraping from web sites is often in the form of "records", or data that might fit into a spreadsheet in rows and columns. It's also often necessary to log in to a web site before you can scrape the data you're interested in. Hopefully getting some practice with these situations in this tutorial will let you apply the experience to other similar situations. For example, you would likely apply the same approach we'll go over here to extracting data from online directories, real estate listings, or product descriptions.

If you haven't already gone through tutorial 1 we'd recommend that you do so before continuing with this one. This tutorial, however, doesn't depend on scraping sessions or other objects you might have created in the previous tutorials. You may wish to download and import the completed scraping session that goes with this tutorial. The scraping session and complete output file are available below.

The site we'll be scraping information from is found here: http://www.screen-scraper.com/shop/. Feel free to click around and explore for a minute.

The scraping session you are about to create and the output file the scraping session will generate:

Attachment                                Size
dvds.txt                                  897 bytes
Shopping Site (Scraping Session).sss      10.2 KB

Tutorial 2: Page 2: Screen-Scraping Overview Review

As you'll remember from the previous tutorials, extracting information from web sites using screen-scraper typically involves four main steps:

1. Use the proxy server to determine the exact files that need to be requested in order to get the information you're after.
2. Create a scraping session with scrapeable files that define the sequence of pages screen-scraper will request.
3. Generate extractor patterns to define the exact information you need screen-scraper to grab from each page.
4. Write small scripts or programming code to invoke screen-scraper and/or work with the data it extracts.

Tutorial 2: Page 3: Recording Search Results

As in the first tutorial, we'll be recording a browser session using the proxy server. Remember that a proxy session holds all of the HTTP requests and responses from your browser for the period of time you run it.

Create a new proxy session now either by clicking the "New Proxy Session" button (looks like a globe) or by selecting "New Proxy Session" from the "File" menu. When the proxy session appears type in "Shopping Site" in the "Name" field. In your web browser go to this URL: http://www.screen-scraper.com/shop/ (remember that you may want to use one browser with the proxy server and one to view the tutorials).

At this point start up the proxy server by clicking the "Start Proxy Server" button, then configure your web browser as you did in the first tutorial (refer back to it if you need help). In screen-scraper, ensure that the "Don't log binary files" checkbox is checked. Now click on the "Progress" tab so that you can see the pages appear as they get recorded.

We'll be doing a search in the shopping web site for the term "dvd" in the various products. Do this by typing "dvd" (without the quotes) into the search box located in the upper-right corner of the home page, then click the "Search" button. You'll see screen-scraper work for a bit, then, once it finishes, you should just see one row in the "HTTP Transactions" table. We'll want to traverse all of the search results, so, in your web browser, click the "Next >>" link. screen-scraper will work again for a bit while it records the next search results page. Later on we'll be scraping the details pages, so let's record one of those now. Click on the "Speed" link to view details on this DVD. These are the only pages we're interested in at this point, so go ahead and stop the proxy session by clicking the "Stop Proxy Server" button on the "General" tab. You'll also want to re-configure your web browser so that it's no longer using screen-scraper as a proxy server.

Tutorial 2: Page 4: Creating the Scraping Session

Create a scraping session either by clicking the "New Scraping Session" button (looks like a gear) or by selecting "New Scraping Session" from the "File" menu. In the "Name" field enter "Shopping Site" (if you already downloaded and imported the scraping session at the start of this tutorial you'll want to name your scraping session something else--perhaps "My Shopping Site"). This is the scraping session that will hold all of the files we'll be extracting data from. Remember that a scraping session is simply a container for all of the files and other objects that will allow us to extract data from a given web site.

We'll now be adding scrapeable files to our scraping session. You'll remember from the first tutorial that a scrapeable file represents a web page you'd like screen-scraper to request.

Add the first scrapeable file to the scraping session by clicking the "Shopping Site" proxy session in the tree on the left (the first of the two "Shopping Site" nodes), then on the "Progress" tab. Find the row in the "HTTP Transactions" table with the following URL (probably the second in the table):

http://www.screen-scraper.com/shop/index.php?main_page=advanced_search_result&keyword=dvd&sort=2a&page=2

This URL corresponds to the second page in the search results. We'll use this file because it should contain all of the parameters in the URL we need to request any of the search results pages (including the first). After clicking on this row in the table, information corresponding to the file will appear in the lower pane. Add the file to the "Shopping Site" scraping session by selecting it in the "Generate scrapeable file in" drop-down list, and clicking the "Go" button next to the "Generate scrapeable file in" drop-down list.

After the scrapeable file appears under the scraping session rename it to "Search results". Next, click on the "Parameters" tab. Remember that when we generate a scrapeable file in this way, screen-scraper pulls out the parameters from the URL and puts them under the "Parameters" tab for us. Because these are "GET" parameters (as opposed to "POST" parameters), when the scrapeable file is invoked by screen-scraper in a running scraping session, the parameters will get appended again to the URL. Let's take a closer look at each of the parameters that were embedded in the URL:

* main_page: advanced_search_result
* keyword: dvd
* sort: 2a
* page: 2

The only two that we're likely interested in are "keyword" and "page". We can guess that "keyword" refers to the text we typed into the search box initially. The "page" parameter refers to what page we're on in the search results. We can guess that if we were to replace the "2" in the "page" parameter of the URL with a "1" it would bring up the first page in the search results. Try this by bringing up the following page in your web browser:

http://www.screen-scraper.com/shop/index.php?main_page=advanced_search_result&keyword=dvd&sort=2a&page=1

Looks like our theory was correct. You should see the first page of search results. It's also important to note that the "keyword" and "page" parameters are those that will need to be dynamic. We'll get to that in a minute.
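If it helps to see the URL-to-parameters step spelled out, here is a minimal Java sketch of how a query string can be split into name/value pairs. It's only an illustration of the idea, not screen-scraper's own code: the class and method names are made up for this example, and it doesn't bother with URL decoding.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: QueryParamsSketch is a made-up name, not part of screen-scraper.
final class QueryParamsSketch {

    // Splits the part of the URL after '?' into name/value pairs, keeping their order.
    // No URL decoding is performed in this sketch.
    static Map<String, String> parse(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = url.indexOf('?');
        if (q < 0) {
            return params; // the URL has no query string
        }
        for (String pair : url.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                params.put(pair, ""); // a name with no value
            } else {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    public static void main(String[] args) {
        String url = "http://www.screen-scraper.com/shop/index.php"
                + "?main_page=advanced_search_result&keyword=dvd&sort=2a&page=2";
        // Prints one "name: value" line per parameter, in URL order.
        parse(url).forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```

Running it against the second search results URL lists the same four parameters shown under the "Parameters" tab.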

Tutorial 2: Page 5: Creating the Script to Initialize the Scraping Session

We're now going to create a small script to initialize our scraping session. It's a common practice to run a script at the very beginning of a scraping session that can initialize variables and such. That's what we'll be doing here.

Generate the script either by clicking the "New Script" button (looks like a pencil and paper) or by selecting "New Script" from the "File" menu. In the "Name" field type "Shopping Site--initialize session". You'll remember from the first tutorial that screen-scraper scripts get invoked when certain events occur. We'll be invoking this script before the scraping session begins, as in an earlier tutorial.

If you prefer to code in Java (or JavaScript), select "Interpreted Java" from the "Language" drop-down, then copy and paste the following text into the "Script Text" box:

// Set the session variables.
session.setVariable( "SEARCH", "dvd" );
session.setVariable( "PAGE", "1" );


If you prefer to code in VBScript, select "VBScript" from the "Language" drop-down, then copy and paste the following text into the "Script Text" box:

' Set the session variables.
Call session.SetVariable( "SEARCH", "dvd" )
Call session.SetVariable( "PAGE", "1" )


We set two session variables on our current scraping session. The one item to note is the "PAGE" session variable. We start at 1 so that the first search results page will get requested first.

Before trying out this script let's modify the parameters for our scrapeable file so that they make use of the session variables. Click on the "Search results" scrapeable file, then on the "Parameters" tab. Change the value of the "keyword" parameter from "dvd" to "~#SEARCH#~" (without the quotes), and change the value of the "page" parameter from "2" to "~#PAGE#~" (again, omit the quotes).

The ~#SEARCH#~ and ~#PAGE#~ tokens will be replaced at runtime with the values of the corresponding session variables. As such, the first URL will be as follows:

http://www.screen-scraper.com/shop/index.php?main_page=advanced_search_result&keyword=dvd&sort=2a&page=1

That is, screen-scraper will take all of our "GET" parameters, append them to the end of the URL, then replace any embedded session variables (surrounded by the ~# #~ markers) with their corresponding values.
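The append-then-substitute behavior described above can be sketched in plain Java. This is a rough stand-in written for this tutorial, assuming tokens always look like ~#NAME#~; the class and method names are hypothetical, and screen-scraper's actual implementation may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: SessionTokenSketch is a made-up name, not part of screen-scraper.
final class SessionTokenSketch {

    // Matches tokens of the form ~#NAME#~ and captures NAME.
    private static final Pattern TOKEN = Pattern.compile("~#(\\w+)#~");

    // Replaces each ~#NAME#~ token with the matching variable's value.
    // Tokens with no matching variable are left untouched in this sketch.
    static String substitute(String text, Map<String, String> sessionVariables) {
        Matcher m = TOKEN.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = sessionVariables.get(m.group(1));
            String replacement = (value != null) ? value : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Appends the GET parameters to the base URL, then substitutes the tokens.
    static String buildUrl(String base, Map<String, String> getParams,
                           Map<String, String> sessionVariables) {
        StringBuilder url = new StringBuilder(base);
        char separator = '?';
        for (Map.Entry<String, String> p : getParams.entrySet()) {
            url.append(separator).append(p.getKey()).append('=').append(p.getValue());
            separator = '&';
        }
        return substitute(url.toString(), sessionVariables);
    }

    // Rebuilds the first search results URL from this tutorial's values.
    static String firstSearchUrl() {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("main_page", "advanced_search_result");
        params.put("keyword", "~#SEARCH#~");
        params.put("sort", "2a");
        params.put("page", "~#PAGE#~");

        Map<String, String> session = new LinkedHashMap<>();
        session.put("SEARCH", "dvd");
        session.put("PAGE", "1");

        return buildUrl("http://www.screen-scraper.com/shop/index.php", params, session);
    }

    public static void main(String[] args) {
        System.out.println(firstSearchUrl());
    }
}
```

With "SEARCH" set to "dvd" and "PAGE" set to "1", the sketch produces the same first search results URL shown above.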

Note that we could achieve the same effect by deleting all of the parameters from the "Parameters" tab, and replacing our URL with this:

http://www.screen-scraper.com/shop/index.php?main_page=advanced_search_result&keyword=~#SEARCH#~&sort=2a&page=~#PAGE#~

Breaking out the parameters under the "Parameters" tab simply makes them easier to manage, which is why we take that approach.

We'll now need to associate our script with our scraping session so that it gets invoked before the scraping session begins. To do that, click on the scraping session in the tree on the left, then on the "Scripts" tab. Click the "Add Script" button to add a script. In the "Script Name" column select "Shopping Site--initialize session". The "When to Run" column should show "Before scraping session begins", and the "Enabled" checkbox should be checked. This will cause our script to get executed at the very beginning of the scraping session so that the two session variables can get set.

All right, we're ready to try it all out. This scraping session will generate a larger log than the one we worked on earlier, so it may be a good idea to increase the number of lines screen-scraper will display in its log. To do that, click on the scraping session in the tree on the left, then on the "Log" tab. In the text box labeled "Show only the following number of lines" enter the number 1000.

Run the scraping session by selecting it in the tree on the left, then click the "Run Scraping Session" button. View the progress of the scraping session by clicking on it in the tree on the left, then clicking on the "Log" tab. You'll notice that the URL of the requested file is the one given above. You can also verify that the correct URL was requested by clicking on the "Search results" scrapeable file, then on the "Last Response" tab, then on the "Render HTML" or "Display Response in Browser" buttons. The page should resemble the one you saw in your web browser.

Remember that it's a good idea to run scraping sessions often as you make changes, and watch the log and last responses to ensure that things are working as you expect them to. You'll also want to save your work frequently. Do that now by hitting the "Save" button (the one with the disk icon).

Tutorial 2: Page 6: Creating Extractor Patterns for Links

This particular part of the tutorial is one that covers important principles that often seem confusing to people at first. If you've been speeding through the tutorial up to this point, it would probably be a good idea to slow down a bit and read more carefully.

We're now going to create a couple of extractor patterns to extract information for the "Next" link and the product details links. Remember that an extractor pattern is a block of text (usually HTML) that contains special tokens that will match pieces of data you're interested in extracting.

When creating extractor patterns we recommend that you always use the HTML from the "Last Response" tab in screen-scraper. By default, after screen-scraper requests a page it "tidies" the HTML found in it, which makes it differ from the HTML that you would get by viewing the source in your web browser (and also makes it more consistent, facilitating extraction). Click on the "Search results" scrapeable file in the tree on the left, then on the "Last Response" tab. The text box contains HTML because we just ran the scraping session. Copy all of the HTML and paste it into a text editor, such as Notepad or TextMate.

If you click either the "Render HTML" or "Display Response in Browser" button in screen-scraper you'll see a page basically resembling the search results page in your web browser. We're going to extract a portion of each of the product details links so that we can subsequently request each details page and extract information from them. The first details link corresponds to the "A Bug's Life" DVD. Find that in the text editor you just pasted the HTML into (specifically search for the text "A Bug's Life"). Here is the block of HTML representing this product:

<tr class="productListing-odd">
<td align="center" class="productListing-data">&nbsp;<a href="http://www.screen-scraper.com/shop/index.php?main_page=product_info&amp;products_id=8"><img src="images/dvd/a_bugs_life.gif" border="0" alt="A Bug's Life" title=" A Bug's Life " width="100" height="80" /></a>&nbsp;</td>
<td class="productListing-data">&nbsp;<a href="http://www.screen-scraper.com/shop/index.php?main_page=product_info&amp;products_id=8">A Bug's Life</a>&nbsp;</td>
<td align="right" class="productListing-data">&nbsp;$35.99&nbsp;</td>
<td align="center" class="productListing-data"><a href="http://www.screen-scraper.com/shop/index.php?main_page=advanced_search_result&amp;keyword=dvd&amp;sort=2a&amp;page=1&amp;action=buy_now&amp;products_id=8"><img src="includes/templates/template_default/buttons/english/button_buy_now.gif" border="0" alt="Buy Now" title=" Buy Now " width="60" height="30" /></a>&nbsp;</td>
</tr>


This may seem like a bit of a mess, but if we look closely we can pick out the details link:

<td class="productListing-data">&nbsp;<a href="http://www.screen-scraper.com/shop/index.php?main_page=product_info&amp;products_id=8">A Bug's Life</a>&nbsp;</td>


Breaking it down a bit more we get the URL:

http://www.screen-scraper.com/shop/index.php?main_page=product_info&amp;products_id=8


By the way, you might notice that the typical & symbols in the URL have been replaced by &amp;. Don't be alarmed; it's just part of the tidying process screen-scraper applies to the HTML. Again, if we examine the parameters in the URL we can guess that the important one is "products_id", which likely identifies the product whose details we're interested in. We'll guess that the "products_id" is the only one we'll need to extract. This will give us enough information to request a details page. At this point, click on the "Search results" scrapeable file in the tree on the left, then click on the "Extractor Patterns" tab. We'll create an extractor pattern to grab out the product IDs from each link. Here's the extractor pattern we'll use:

<td class="productListing-data">&nbsp;<a href="http://www.screen-scraper.com/shop/index.php?main_page=product_info&amp;products_id=~@PRODUCTID@~">~@PRODUCT_TITLE@~</a>&nbsp;</td>


Create the extractor pattern by clicking on the "Add Extractor Pattern" button, then copying and pasting the text above into the resulting box. Also, give the extractor pattern the name "Product details link". Remember that extractor pattern tokens (delineated by the ~@ @~ markers) indicate data points we're interested in extracting. In this case, we want to extract the ID of the product (embedded in the URL), and the title of the product.

Double-click the ~@PRODUCTID@~ token (or select the text between the ~@ @~ delimiters, right-click it and select "Edit token"), and, in the box that appears, check the "Save in session variable" checkbox. Click on the "Regular Expression" tab, and select "Non-double quotes". You'll notice that when you do that the text [^"]* shows up in the text box just above the drop-down list. This is the regular expression that we'll be using. You could also edit it manually, but generally won't need to.

Let's slow down at this point and go over what we just did to the ~@PRODUCTID@~ extractor pattern token. You might remember from an earlier tutorial that by checking the "Save in session variable" box we're telling screen-scraper to preserve the value for us so that we can use it at a later point. We'll get to that in a bit. This time we also selected a regular expression for it to use. In most cases you'll want to designate a regular expression for extractor pattern tokens. If you're not very familiar with regular expressions, don't worry. In the vast majority of cases you can simply use the regular expressions found in that drop-down list. Let's go over what effect designating a regular expression has. By indicating the "Non-double quotes" regular expression we're saying that we want that token to match any character except a double-quote (i.e., the " character). You'll notice in our extractor pattern that a double-quote character just follows our ~@PRODUCTID@~ extractor pattern token. By using a regular expression we limit what the token will match so that we can ensure we get only what we want. You might think of it as putting a little fence around the token. We want it to match any characters underneath the ~@PRODUCTID@~ extractor pattern token, up to (but not including) the double-quote character.
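To see the "fence" in action outside of screen-scraper, here is a small Java sketch that runs two different token expressions against a trimmed-down copy of the product link HTML. The greedy .* wildcard is only a hypothetical "unfenced" token chosen for contrast; we're not claiming it's what screen-scraper uses when no regular expression is selected. The class and method names are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: TokenFenceSketch is a made-up name, not part of screen-scraper.
final class TokenFenceSketch {

    // A shortened version of the image link from the search results page.
    static final String HTML =
            "<a href=\"http://www.screen-scraper.com/shop/index.php"
            + "?main_page=product_info&amp;products_id=8\">"
            + "<img src=\"images/dvd/a_bugs_life.gif\" border=\"0\" /></a>";

    // Captures whatever the given token expression matches between
    // "products_id=" and a following double-quote.
    static String capture(String tokenRegex) {
        Pattern p = Pattern.compile("products_id=(" + tokenRegex + ")\"");
        Matcher m = p.matcher(HTML);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Fenced: stops at the first double-quote, so only the ID is captured.
        System.out.println("fenced:   " + capture("[^\"]*"));
        // Unfenced (hypothetical): runs on to the last double-quote in the HTML.
        System.out.println("unfenced: " + capture(".*"));
    }
}
```

The fenced expression captures just the product ID, while the unfenced one swallows a chunk of the markup that follows it.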

A line from that last paragraph is worth repeating. In most cases you'll want to designate a regular expression for extractor pattern tokens. Using regular expressions also makes extractor patterns more resilient to changes in the web site. That is, if the web site makes minor changes to its HTML (e.g., altering a font style or color), often if you've been using regular expressions your extractor patterns will still match. Also, by using regular expressions we can often decrease the amount of HTML we need to use in our extractor patterns. That is, by using regular expressions we indicate more precisely what the data will look like that our tokens will match. By doing this, we can often reduce the amount of HTML we include at the beginning and end of our extractor patterns. In general, if you can reduce the amount of HTML in your extractor patterns, and increase the number of regular expressions you use in tokens, your extractor patterns will be more resilient to changes that get made in the HTML of the pages.

Now close the "Edit Token" box, which saves our settings.

Now let's alter the settings for the ~@PRODUCT_TITLE@~ token. We're not interested in saving the value for this token in a session variable, but we include it since it will differ for each section of HTML we want to match. Double-click the ~@PRODUCT_TITLE@~ extractor pattern token to bring up the "Edit token" dialogue box. Click on the "Regular expression" tab, then select "Non-HTML tags". Again, take a look at the characters on the left and right sides of our ~@PRODUCT_TITLE@~ extractor pattern token. By using this regular expression we tell it not to include any greater than (>) or less than (<) symbols. This way we create a boundary for the token so that we can ensure it matches only what we want it to.

Why even include an extractor pattern token for data we don't want to save? This is another important principle. By using extractor pattern tokens for data we don't necessarily want to save, we make the extractor pattern more resilient to changes in the HTML. By using these extra tokens we can "future proof" our extractor patterns against changes the site owners might make down the road. There are also often situations (such as the present one) where data points adjacent to data we want to extract will differ for each pattern match. Here we only want the product ID, but we also include the product title because of its proximity to the data we want to extract, and because its value will differ each time the extractor pattern matches.

If those last few paragraphs strike you as a little bit confusing, don't worry. As you get more experience using screen-scraper you'll see why they're important. For now just take our word for it that you'll generally want to use regular expressions with extractor pattern tokens, and that it's often a good idea to use extractor pattern tokens to match data points you don't necessarily want to save. As you get more experience it will become more apparent when to use extractor pattern tokens for data you don't want to save.

Let's give our new extractor pattern a try. Click the "Apply Pattern to Last Scraped Data" button. You should see a window come up that shows the extracted data.

Again, let's slow down a moment and review what this window contains. When an extractor pattern matches, it produces a DataSet. You can think of a DataSet like a spreadsheet--it contains rows, columns, and cells. Each row in a DataSet is called a DataRecord. Again, a DataRecord can be thought of as being analogous to a row in a spreadsheet. In this particular case our DataSet has three columns. Two of them should be familiar--they correspond to the PRODUCT_TITLE and PRODUCTID extractor pattern tokens. The "Sequence" column indicates the order in which each row was extracted. You'll notice that the sequence is zero-based, meaning the first DataRecord in the DataSet is referenced with an index of 0. You'll also notice that the DataSet has 10 records--one for each product found in the search results page. Later on when we start talking more about DataSets and DataRecords, just remember the spreadsheet analogy--a DataSet is like the entire spreadsheet, and a DataRecord is like a single row in the spreadsheet.
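The spreadsheet analogy can be made concrete with a short Java sketch that turns an extractor pattern into a regular expression and collects the matches as a list of rows. Treat it as a rough model only: the class is made up for this tutorial, the lazy wildcard used for tokens without a regular expression is our own assumption, and the second product's title ("Example Title") is invented for the sample HTML.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a rough model of an extractor pattern, not screen-scraper's code.
final class ExtractorPatternSketch {

    // Matches tokens of the form ~@NAME@~ and captures NAME.
    private static final Pattern TOKEN = Pattern.compile("~@(\\w+)@~");

    private final List<String> tokenNames = new ArrayList<>();
    private final Pattern compiled;

    // Literal text is quoted; each token becomes one capturing group.
    // Token regexes are assumed to contain no capturing groups of their own.
    ExtractorPatternSketch(String patternText, Map<String, String> tokenRegexes) {
        StringBuilder regex = new StringBuilder();
        Matcher m = TOKEN.matcher(patternText);
        int last = 0;
        while (m.find()) {
            regex.append(Pattern.quote(patternText.substring(last, m.start())));
            String name = m.group(1);
            tokenNames.add(name);
            // Assumption for this sketch: tokens without a regex use a lazy wildcard.
            regex.append('(').append(tokenRegexes.getOrDefault(name, ".*?")).append(')');
            last = m.end();
        }
        regex.append(Pattern.quote(patternText.substring(last)));
        compiled = Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // Returns one row (our stand-in for a DataRecord) per match.
    List<Map<String, String>> apply(String html) {
        List<Map<String, String>> records = new ArrayList<>();
        Matcher m = compiled.matcher(html);
        while (m.find()) {
            Map<String, String> record = new LinkedHashMap<>();
            for (int i = 0; i < tokenNames.size(); i++) {
                record.put(tokenNames.get(i), m.group(i + 1));
            }
            record.put("Sequence", String.valueOf(records.size())); // zero-based
            records.add(record);
        }
        return records;
    }

    // Builds one product link cell in the same shape as the tidied HTML.
    static String row(String id, String title) {
        return "<td class=\"productListing-data\">&nbsp;<a href=\"http://www.screen-scraper.com"
                + "/shop/index.php?main_page=product_info&amp;products_id=" + id + "\">"
                + title + "</a>&nbsp;</td>";
    }

    // Applies the "Product details link" pattern to two sample rows.
    static List<Map<String, String>> demo() {
        String patternText = row("~@PRODUCTID@~", "~@PRODUCT_TITLE@~");
        Map<String, String> regexes = new HashMap<>();
        regexes.put("PRODUCTID", "[^\"]*");     // "Non-double quotes"
        regexes.put("PRODUCT_TITLE", "[^<>]*"); // "Non-HTML tags"
        String html = row("8", "A Bug's Life") + "\n" + row("34", "Example Title");
        return new ExtractorPatternSketch(patternText, regexes).apply(html);
    }

    public static void main(String[] args) {
        for (Map<String, String> record : demo()) {
            System.out.println(record);
        }
    }
}
```

Each printed map plays the role of one DataRecord, and the whole list plays the role of the DataSet.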

Another good habit to get into is applying your extractor patterns frequently to ensure they correctly match the text you want extracted. Go ahead and close the "DataSet" window now.

Now for our "Next" link. In the text editor where you pasted the full HTML from the web page, search for the text "Next". Around that area you'll find the HTML for the link:

&nbsp;&nbsp;<a href="http://www.screen-scraper.com/shop/index.php?main_page=advanced_search_result&amp;keyword=dvd&amp;sort=2a&amp;page=2" title=" Next Page ">[Next&nbsp;&gt;&gt;]</a>&nbsp;</td>


Fortunately, we're already familiar with the URL, and we know that the only parameters we need to worry about are "keyword" and "page". Create a new extractor pattern, call it "Next link", and use the following to grab the values of those parameters out:

&nbsp;&nbsp;<a href="http://www.screen-scraper.com/shop/index.php?main_page=advanced_search_result&amp;keyword=~@KEYWORD@~&amp;sort=2a&amp;page=~@PAGE@~" title=" Next Page ">[Next&nbsp;&gt;&gt;]</a>&nbsp;</td>


As with the previous extractor pattern, double-click the ~@PAGE@~ token, and, in the box that appears, check the "Save in session variable" checkbox. Click on the "Regular Expression" tab, and select "Number" from the "Select" drop-down list.

Close the "Edit Token" box to save your settings. If you're interested, the "Number" regular expression \d* simply indicates that we only want the PAGE token to match numbers (\d signifies a digit, and the * signifies "zero or more").

Next, double-click the "KEYWORD" extractor pattern token to edit it. Click on the "Regular Expression" tab, then select "URL GET parameter" from the "Select" drop-down list. This indicates that the "KEYWORD" extractor pattern token should match only characters that would be found in a "GET" parameter of a URL. We could have used the "Non-double quotes" regular expression as we did above, but used this one instead as it's a bit more specific about what we do and don't want the token to match. You'll notice that we didn't check the box to save the "KEYWORD" extractor pattern token in a session variable. We already have that value in a session variable, so we don't bother getting it again.

Try out the extractor pattern by clicking the "Apply Pattern to Last Scraped Data" button. Excellent! We have two matches--one for each "Next" link on the page (the top and bottom of the page).
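Here's a quick Java sketch of the same idea applied to the "Next" link. The \d* expression is the "Number" expression mentioned above; the expression used for the keyword, [^&"]*, is only our guess at what a "URL GET parameter" expression might look like. The sample page simply repeats the link twice to mimic the top and bottom of the results page, and the class name is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: NextLinkSketch is a made-up name, not part of screen-scraper.
final class NextLinkSketch {

    // Group 1 is the keyword (assumed GET-parameter expression),
    // group 2 is the page number ("Number" expression, \d*).
    private static final Pattern NEXT_LINK = Pattern.compile(
            "keyword=([^&\"]*)&amp;sort=2a&amp;page=(\\d*)\" title=\" Next Page \">");

    // Returns the page number captured from every "Next" link found.
    static List<String> nextPageNumbers(String html) {
        List<String> pages = new ArrayList<>();
        Matcher m = NEXT_LINK.matcher(html);
        while (m.find()) {
            pages.add(m.group(2));
        }
        return pages;
    }

    // A stand-in for the first results page: the same link at top and bottom.
    static String samplePage() {
        String link = "&nbsp;&nbsp;<a href=\"http://www.screen-scraper.com/shop/index.php"
                + "?main_page=advanced_search_result&amp;keyword=dvd&amp;sort=2a&amp;page=2\""
                + " title=\" Next Page \">[Next&nbsp;&gt;&gt;]</a>&nbsp;</td>";
        return link + "\n<!-- product rows omitted -->\n" + link;
    }

    public static void main(String[] args) {
        System.out.println(nextPageNumbers(samplePage()));
    }
}
```

Both matches carry the same page number, which is worth keeping in mind when we decide how often to follow the link.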

Now would be a good time to save your work. Do that by selecting "Save" from the "File" menu or by clicking the floppy disk icon.

OK, let's try out the whole thing once more. Click on the "Shopping Site" scraping session in the tree on the left, then on the "Log" tab. Click the "Clear Log" button--we're going to run it again and we don't want to get confused by the log text from the last run. As before, click on the "Run Scraping Session" button to get it going. You'll see quite a bit more text in the log this time. Take a minute to look through it to ensure you understand what's going on.

Tutorial 2: Page 7: Scraping Pages from Scripts

For each details link we're going to scrape the corresponding details page. This is a common scenario in screen-scraping--given a search results page, you need to extract details for each product, which means following each of the product details links. For each details page you'll likely want to extract out pieces of information corresponding to the products.

Let's start by creating a scrapeable file for the details page. We could create it from the proxy session, but it's pretty simple, so let's just create it from scratch. Click on the "Shopping Site" scraping session, the "General" tab, then click the "Add Scrapeable File" button. Give the scrapeable file the name "Details page", and the following URL:

http://www.screen-scraper.com/shop/index.php?main_page=product_info&products_id=~#PRODUCTID#~

You'll notice that this time we're leaving all of the parameters embedded in the URL. Sometimes with shorter URLs it's more convenient to take this approach rather than breaking them out under the "Parameters" tab. As before, when the scraping session runs, the ~#PRODUCTID#~ token will be replaced by the value of the "PRODUCTID" session variable. At this point, click the "This scrapeable file will be invoked manually from a script" checkbox. If we didn't do this, screen-scraper would invoke this scrapeable file in sequence (after the search results page), which we don't want. Instead, we're going to tell screen-scraper to invoke this scrapeable file from a script.

In screen-scraper, links are generally followed by invoking a script after an extractor pattern finds matches. Let's go over this in more detail. First, create a new script and call it "Scrape details page". If you're using Interpreted Java enter the following code:

session.scrapeFile( "Details page" );


If you're using VBScript enter the following:

Call session.ScrapeFile( "Details page" )


OK, this is where the logic may get a little tricky. For each product ID our "Product details link" extractor pattern extracts, we want to scrape the product details page using the PRODUCTID it extracts. Go to the "Product details link" extractor pattern by clicking the "Search results" scrapeable file, then the "Extractor Patterns" tab. Note the "Scripts" pane under the extractor pattern. Click the "Add Script" button. This will allow us to have a script execute as the pattern finds matches. Under the "Script Name" column, if it isn't already selected, select our "Scrape details page" script. Leave the "Sequence" as is, and, under the "When to Run" column, select "After each pattern application".

Let's walk through this a bit more slowly. After the search results page is requested the "Product details link" extractor pattern will be applied to the HTML in the page. Remember that this particular extractor pattern will match 10 times--once for each product details link. Each time it matches it will grab a different product ID and save the value of that product ID into the PRODUCTID session variable. The "Scrape details page" script will get invoked after each of these matches, and each time the PRODUCTID session variable will hold a different product ID. As such, each time the "Details page" gets scraped its URL will point to a different product. For example, the first time the extractor pattern matches the PRODUCTID session variable will hold "8", and the URL will be:

http://www.screen-scraper.com/shop/index.php?main_page=product_info&products_id=8

The next time the product ID will be 34, yielding the URL:

http://www.screen-scraper.com/shop/index.php?main_page=product_info&products_id=34

If it helps, think again about the spreadsheet analogy. You can imagine screen-scraper walking through each row in the spreadsheet. It encounters a row, saves any needed data in session variables (the product ID, in this case), then invokes the "Scrape details page" script. Because it just matched a specific product ID, and saved its value in a session variable, when the "Details page" scrapeable file gets invoked by the script, the current product ID in the PRODUCTID session variable will be used. Once it's finished invoking the "Details page" scrapeable file, it will go on to the next row (or DataRecord) in the spreadsheet (or DataSet). Again, it will save the next product ID in a session variable, then execute the "Scrape details page" script, which in turn invokes the "Details page" scrapeable file. Because we indicated that the script should be invoked "After each pattern application", this will occur 10 times--once for each search result. If we had designated "After pattern is applied", the script would only have been executed once--after it traversed the spreadsheet and reached the very end.
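The difference between the two "When to Run" settings can be mimicked with an ordinary loop. In the Java sketch below (a simulation we wrote for illustration, with made-up names, not screen-scraper's API), the map stands in for the session variables, the line inside the loop stands in for a script run "After each pattern application", and the line after the loop stands in for a script run "After pattern is applied".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a simulation of script timing, not screen-scraper's code.
final class ScriptTimingSketch {

    static final String DETAILS_URL =
            "http://www.screen-scraper.com/shop/index.php?main_page=product_info&products_id=";

    // Walks the matched product IDs the way the tutorial describes and
    // records what each kind of script would see.
    static List<String> simulate(List<String> matchedProductIds) {
        Map<String, String> session = new HashMap<>(); // stand-in for session variables
        List<String> log = new ArrayList<>();

        for (String id : matchedProductIds) {
            // The token's value is saved in the session variable on each match.
            session.put("PRODUCTID", id);
            // "After each pattern application": runs once per match.
            log.add("after each: request " + DETAILS_URL + session.get("PRODUCTID"));
        }

        // "After pattern is applied": runs once, after the last match,
        // so the session variable holds the value from the final match.
        log.add("after all: PRODUCTID is " + session.get("PRODUCTID"));
        return log;
    }

    public static void main(String[] args) {
        for (String line : simulate(Arrays.asList("8", "34"))) {
            System.out.println(line);
        }
    }
}
```

With two matched IDs the "after each" line appears twice, once per product, while the "after all" line appears only once.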

Hopefully that's not too repetitive :) This is another area that people new to screen-scraper find confusing, so it's probably worth it to slow down a bit and ensure you understand what's going on.

Now would be a good time to try out the whole scraping session again. Do that like you did before by clearing out the log for the scraping session, then clicking the "Run Scraping Session" button. You'll see each details page getting requested one-by-one. Note especially each URL, which will have a different product ID at the end of each. If you'd prefer not to wait for the entire session to run you can click the "Stop Scraping Session" button. As before, it would be a good idea to go through the log carefully to ensure that you understand what it's doing.

At this point we still need to deal with the "Next" page link. We already have an extractor pattern to grab out the page number of the next page. Let's create a script to scrape the search results page again for each "Next" link. Generate a new script and call it "Scrape search results". If you're using Interpreted Java enter the following:

if( dataSet.getNumDataRecords() > 0 ){
     session.scrapeFile( "Search results" );
}


If you're using VBScript enter the following code (again, be sure to select "VBScript" from the "Language" drop-down box):

If dataSet.getNumDataRecords > 0 Then
    Call session.ScrapeFile( "Search results" )
End If


You'll notice that the script makes use of a "dataSet" variable. When the script is invoked screen-scraper will automatically create a variable corresponding to the current DataSet. This variable allows you to get access to all of the information that was extracted by the current extractor pattern. You can read more about objects available in scripts and their scope in our documentation, at the Using Scripts and API Documentation pages.

In this particular case, the script first checks the number of records in the current DataSet. That is, it looks at the number of DataRecords (or rows) in the DataSet (or spreadsheet). This effectively just checks to see if any "Next" link was found in the page. If so, it tells screen-scraper to scrape the "Search results" scrapeable file.
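The effect of this check can be sketched in plain Java, outside of screen-scraper. In the sketch below the map of "Next" links is a hypothetical stand-in for the site, and a missing entry plays the role of an empty DataSet; none of these names come from screen-scraper itself.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a hypothetical model of following "Next" links.
class NextLinkSketch {

    // "Scrapes" a page, then scrapes the following page only if a "Next" link was found.
    static void scrapeSearchResults(int page, Map<Integer, Integer> nextLinks, List<Integer> visited) {
        visited.add(page);
        Integer next = nextLinks.get(page); // null means no "Next" link matched
        if (next != null) {
            scrapeSearchResults(next, nextLinks, visited);
        }
    }

    public static void main(String[] args) {
        Map<Integer, Integer> nextLinks = new LinkedHashMap<>();
        nextLinks.put(1, 2); // page 1 links to page 2; page 2 has no "Next" link
        List<Integer> visited = new ArrayList<>();
        scrapeSearchResults(1, nextLinks, visited);
        System.out.println(visited);
    }
}
```

Each page triggers the next one until a page with no "Next" link is reached, at which point the chain stops on its own.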

After creating the script return to the "Next link" extractor pattern, then click the "Add Script" button. Select the "Scrape search results" script. This time there's something slightly different we'll need to do under the "When to Run" column. First, click the "Apply Pattern to Last Scraped Data" button. You'll notice that the pattern matches twice. The problem is that we only want to follow one of the "Next" links (that is, we don't want to scrape the second page twice). This is easily dealt with by selecting "After pattern is applied" under the "When to run" column. In other words, the script will only get invoked once--after the extractor pattern has matched as many times as it can. Note, though, that because we're saving the value of the ~@PAGE@~ extractor pattern token in a session variable it will still hold the correct value when the page gets scraped. Because we indicate that the script is to be invoked "After pattern is applied", the "dataSet" variable will be in scope. See the Variable scope section in our documentation for more detail on which variables are in scope depending on when a given script is run.

OK, run the scraping session once more. Clear the scraping session log, then click the "Run Scraping Session" button again. If you let it run for a while you'll notice that it will request each details page for the products found on the first search results page, request the second search results page, then request each of the details pages for that page.

Tutorial 2: Page 8: Extracting Product Details

Extracting Product Details

At this point we're able to scrape the details pages for each of the products. We're now ready to extract the information we're really interested in: data about each DVD. To do this we're going to use sub-extractor patterns. Again, this is a point in the tutorial where you may want to slow down a bit. Sub-extractor patterns are another important concept that can be a bit confusing at first.

Sub-extractor patterns allow us to define a small region within a larger HTML page from which we'll extract individual snippets of information. This helps to eliminate most of the HTML text we're not interested in, allowing us to be more precise about the data we'd like to extract. It also makes our extractor patterns more resilient to future changes in the HTML page, as they allow us to reduce the amount of HTML we need to include.

If you let the scraping session run through to completion the last URL in the scraping session log will be the following:

http://www.screen-scraper.com/shop/index.php?main_page=product_info&products_id=7

If that's not the exact one you have, don't worry; it won't make a difference for our extractor patterns. We'll need to examine the HTML for this page in order to generate the extractor patterns for it. Do this by clicking on the "Details page" scrapeable file in the tree on the left, then on the "Last Response" tab. You'll remember that screen-scraper records the HTML for the last time each page was requested. Bring up the URL above in your web browser. We'll be extracting the DVD title, price, model, shipping weight, and manufacturer.

It should be apparent in examining the page that most of the elements in it aren't of interest to us. For example, we don't care about the header, footer, or any of the boxes along the sides of the page. We'll first define a region that basically surrounds the elements we're interested in. Here is that full region:

<tr>
<td colspan="2" class="pageHeading" valign="top">
<h1>You've Got Mail</h1>
</td>
</tr>

<tr>
<td align="center" valign="top" class="smallText" rowspan="2">
<script language="javascript" type="text/javascript">
<!--
document.write('<a href="javascript:popupWindow(\'http://www.screen-scraper.com/shop/index.php?main_page=popup_image&amp;pID=7\')"><img src="images/dvd/youve_got_mail.gif" border="0" alt="You\'ve Got Mail" title=" You\'ve Got Mail " width="100" height="80" hspace="5" vspace="5" /><br />larger image<\/a>');
//-->
</script>



<noscript><a href="http://www.screen-scraper.com/shop/index.php?main_page=images/dvd/youve_got_mail.gif" target="_blank"><img src="images/dvd/youve_got_mail.gif" border="0" alt="You've Got Mail" title=" You've Got Mail " width="100" height="80" hspace="5" vspace="5" /><br />
larger image</a></noscript> </td>
<td class="main" align="center" valign="top">Model: DVD-YGEM</td>
</tr>

<tr>
<td class="main" align="center"></td>
</tr>

<tr>
<td align="center" class="pageHeading">$34.99</td>
<td class="main" align="center">Shipping Weight: 7.00 lbs.</td>
</tr>

<tr>
<td>&nbsp;</td>
<td class="main" align="center">10 Units in Stock</td>
</tr>

<tr>
<td class="main" align="center">Manufactured by: Warner</td>
<td align="center">
<table border="0" width="150px" cellspacing="2" cellpadding="2">
<tr>
<td align="center" class="cartBox">&nbsp;Quantity


That might seem like a large chunk of HTML, but it's actually a relatively small percentage of the entire page.

Before defining sub-extractor patterns we first define an extractor pattern with a special ~@DATARECORD@~ token in it. If you're familiar with computer programming in general, the ~@DATARECORD@~ token can be thought of as a "reserved word". That is, it's a token that has a special meaning in that it defines the sub-region of the HTML page containing the data elements we're interested in. You'll always use the ~@DATARECORD@~ token when using sub-extractor patterns.

Here's the extractor pattern we'll use:

<tr>
<td colspan="2" class="pageHeading~@DATARECORD@~Quantity


Notice that we simply replaced most of the middle portion of the large block of HTML with a ~@DATARECORD@~ token. If you look at the text before and after ~@DATARECORD@~ you can see that the same text is also found at the beginning and end of the large HTML block. The basic idea here is to include only as much HTML around the sub-region as necessary to uniquely identify it in the page. Any of the HTML covered by the ~@DATARECORD@~ token will be picked up by screen-scraper, and will define our sub-region that we'll be extracting the individual pieces of data from.
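If you're comfortable with regular expressions, it may help to picture the ~@DATARECORD@~ token as "everything between this leading text and this trailing text". The stand-alone Java sketch below illustrates that idea only. It is not how screen-scraper is implemented, the class and method names are made up, and we're assuming the region ends at the first occurrence of the trailing text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: a rough analogy for the ~@DATARECORD@~ token, not screen-scraper's API.
class DataRecordRegionSketch {

    // Returns the text between the first occurrence of prefix and the next
    // occurrence of suffix, or null if the pair isn't found.
    static String extractRegion(String html, String prefix, String suffix) {
        Pattern pattern = Pattern.compile(
                Pattern.quote(prefix) + "(.*?)" + Pattern.quote(suffix),
                Pattern.DOTALL); // let the region span multiple lines
        Matcher matcher = pattern.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // A heavily shortened stand-in for the details page.
        String html = "<html><body><table><tr>\n"
                + "<td colspan=\"2\" class=\"pageHeading\" valign=\"top\"><h1>You've Got Mail</h1></td></tr>\n"
                + "<tr><td class=\"main\">Model: DVD-YGEM</td></tr>\n"
                + "<tr><td align=\"center\" class=\"cartBox\">&nbsp;Quantity</td></tr>"
                + "</table></body></html>";
        String region = extractRegion(html, "<td colspan=\"2\" class=\"pageHeading", "Quantity");
        System.out.println(region);
    }
}
```

Only the text captured between the two anchors is kept; the anchors themselves are not part of the region.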

Create a new extractor pattern using the text given above (remember we're still using the "Details page" scrapeable file), then give it the name "PRODUCTS". Now click the "Apply Pattern to Last Scraped Data" button. In the window that appears, copy the text from the "DATARECORD" column and paste it into your text editor. The easiest way to select all of the text in that box is to triple-click it, use the keyboard to copy the text (Ctrl-C in Windows and Linux), then paste it into your text editor. The text should look like this:

" valign="top"><h1>You've Got Mail</h1></td></tr><tr><td align="center" valign="top" class="smallText" rowspan="2"><script language="javascript" type="text/javascript"><!--document.write( '<a href="javascript:popupWindow(\'http://www.screen-scraper.com/shop/index.php?main_page=popup_image &pID=7\')"><img src="images/dvd/youve_got_mail.gif" border="0" alt="You\'ve Got Mail" title=" You\'ve Got Mail " width="100" height="80" hspace="5" vspace="5" /><br />larger image<\/a>'); //--></script> <noscript><a href="http://www.screen-scraper.com/shop/index.php? main_page=images/dvd/youve_got_mail.gif" target="_blank"><img src="images/dvd/youve_got_mail.gif" border="0" alt="You've Got Mail" title=" You've Got Mail " width="100" height="80" hspace="5" vspace="5" /><br />larger image</a></noscript> </td><td class="main" align="center" valign="top"> Model: DVD-YGEM</td></tr><tr><td class="main" align="center"></td></tr><tr><td align="center" class="pageHeading">$34.99</td><td class="main" align="center">Shipping Weight: 7.00 lbs.</td> </tr><tr><td>&nbsp;</td><td class="main" align="center">10 Units in Stock</td></tr> <tr><td class="main" align="center">Manufactured by: Warner</td><td align="center"> <table border="0" width="150px" cellspacing="2" cellpadding="2"><tr><td align="center" class="cartBox">&nbsp;


This is the HTML we're after, but it's all in one large block. This occurs because screen-scraper strips out unnecessary white space when extracting information in order to make the extraction process more efficient. This can make sifting through the HTML a little more difficult, but the search feature in your text editor should make this relatively straightforward. You could also deal with the HTML found directly in the "Last Response" tab. You'd just have to be sure that you're only grabbing portions of the page that would be covered by the ~@DATARECORD@~ extractor pattern token.

First off, we're interested in the DVD title. In your text editor do a search for the first word in the title of the DVD whose page you're viewing (e.g., if you're viewing the HTML for the last DVD in the search results you'll search for "You've"). This should highlight the first word in the title. In order to extract this piece of information we'll use a small sub-extractor pattern:

<h1>~@TITLE@~</h1>


Once again, we include only as much HTML around the piece of data that we're interested in as is necessary. If we do this just right we'll still be able to extract information even if the web site itself makes minor changes. On our "PRODUCTS" extractor pattern, click the "Sub-Extractor Patterns" tab, then on the "Add Sub-Extractor Pattern" button. In the text box that appears paste the text for the sub-extractor pattern we've included above. Edit the ~@TITLE@~ extractor pattern token by double-clicking it, click the "Regular Expression" tab, then select "Non-HTML tags" from the drop-down list (as a side note, "Non-HTML tags" is probably the most common regular expression you'll use). Click on the "Apply Sub-Extractor Pattern to Last Scraped Data" button to try it out. You should see a DataSet with a single row and columns for the DATARECORD and TITLE tokens.

Next, create the following sub-extractor patterns for the remaining data elements we want to extract (note that each line of text will be a separate sub-extractor pattern):

>$~@PRICE@~<

>Model: ~@MODEL@~<

>Shipping Weight: ~@SHIPPING_WEIGHT@~<

>Manufactured by: ~@MANUFACTURED_BY@~<


For each token in the sub-extractor patterns give it the "Non-HTML tags" regular expression, as you did for the ~@TITLE@~ token.

As sub-extractor patterns match data, they aggregate the pieces into a single data record. That is, when our PRODUCTS extractor pattern is applied along with its sub-extractor patterns, the following data record will be produced:


TITLE            PRICE  MODEL     SHIPPING_WEIGHT  MANUFACTURED_BY
You've Got Mail  34.99  DVD-YGEM  7.00 lbs.        Warner


You can see this by clicking the "Apply Pattern to Last Scraped Data" button.
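For the curious, the way sub-extractor patterns aggregate into one record can be approximated in stand-alone Java. This is only an analogy: the class and method names are hypothetical, and we're assuming the "Non-HTML tags" regular expression behaves roughly like [^<>]* (any run of characters that contains no angle brackets).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: approximates sub-extractor patterns with java.util.regex.
class SubExtractorSketch {

    // Assumed stand-in for the "Non-HTML tags" regular expression.
    static final String TOKEN = "([^<>]*)";

    // Applies each small pattern to the region and collects the matches into one record.
    static Map<String, String> extract(String region) {
        Map<String, String> patterns = new LinkedHashMap<>();
        patterns.put("TITLE", "<h1>" + TOKEN + "</h1>");
        patterns.put("PRICE", ">\\$" + TOKEN + "<");
        patterns.put("MODEL", ">Model: " + TOKEN + "<");
        patterns.put("SHIPPING_WEIGHT", ">Shipping Weight: " + TOKEN + "<");
        patterns.put("MANUFACTURED_BY", ">Manufactured by: " + TOKEN + "<");

        Map<String, String> record = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : patterns.entrySet()) {
            Matcher matcher = Pattern.compile(entry.getValue()).matcher(region);
            if (matcher.find()) {
                record.put(entry.getKey(), matcher.group(1)); // one column per token
            }
        }
        return record;
    }

    public static void main(String[] args) {
        // A shortened version of the DATARECORD text shown earlier.
        String region = "<h1>You've Got Mail</h1></td>"
                + "<td class=\"main\">Model: DVD-YGEM</td>"
                + "<td class=\"pageHeading\">$34.99</td>"
                + "<td class=\"main\">Shipping Weight: 7.00 lbs.</td>"
                + "<td class=\"main\">Manufactured by: Warner</td>";
        System.out.println(extract(region));
    }
}
```

Because each pattern is tiny and independent, a change elsewhere in the region (say, a new table cell) leaves the other columns unaffected.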

If you'd like, at this point try running the scraping session again by clearing the log and hitting the "Run Scraping Session" button. If you examine the log while the session runs you'll see that it extracts out details for each of the DVDs.

Tutorial 2: Page 9: Saving the Data

Saving the Data

Once screen-scraper extracts data there are a number of things that can be done with it. For example, you might be invoking screen-scraper from an ASP script, which, after telling screen-scraper to extract data, might display it to the user. In our case we'll simply write the data out to a text file. To do this, we'll once again write a script. Create a new script, call it "Write data to a file", and use either the following Interpreted Java:

FileWriter out = null;

try
{
    session.log( "Writing data to a file." );

    // Open up the file to be appended to.
    out = new FileWriter( "dvds.txt", true );

    // Write out the data to the file.
    out.write( dataRecord.get( "TITLE" ) + "\t" );
    out.write( dataRecord.get( "PRICE" ) + "\t" );
    out.write( dataRecord.get( "MODEL" ) + "\t" );
    out.write( dataRecord.get( "SHIPPING_WEIGHT" ) + "\t" );
    out.write( dataRecord.get( "MANUFACTURED_BY" ) );
    out.write( "\n" );

    // Close up the file.
    out.close();
}
catch( Exception e )
{
    session.log( "An error occurred while writing the data to a file: " + e.getMessage() );
}

Or the following VBScript (remember to select "VBScript" from the "Language" drop-down box):

' Generate objects to write data to a file.
Set objFSO = CreateObject( "Scripting.FileSystemObject" )
' The "8" indicates that we want to append data to the file.
Set objDVDFile = objFSO.OpenTextFile( "dvds.txt", 8, True )

' Write out the data to the file.
objDVDFile.Write dataRecord.Get( "TITLE" ) & vbTab
objDVDFile.Write dataRecord.Get( "PRICE" ) & vbTab
objDVDFile.Write dataRecord.Get( "MODEL" ) & vbTab
objDVDFile.Write dataRecord.Get( "SHIPPING_WEIGHT" ) & vbTab
objDVDFile.Write dataRecord.Get( "MANUFACTURED_BY" )
objDVDFile.Write vbCrLf

' Close the file and clean up.
objDVDFile.Close
Set objDVDFile = Nothing
Set objFSO = Nothing

Our script simply takes the contents of the current data record (which for us will be the data record that constitutes a single DVD) and appends it to a "dvds.txt" text file.
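If you later move this kind of logic into an ordinary Java program, one variation worth knowing is try-with-resources, which closes the file even when a write fails. The sketch below is stand-alone Java with hypothetical class and method names; it doesn't use the "session" or "dataRecord" objects, and we haven't assumed that screen-scraper's Interpreted Java accepts this syntax.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Illustration only: a stand-alone take on appending one tab-delimited row to a file.
class TabDelimitedAppendSketch {

    // Joins the fields with tabs and ends the row with a newline.
    static String toLine(List<String> fields) {
        return String.join("\t", fields) + "\n";
    }

    // Appends one row; the writer is closed automatically, even if write() throws.
    static void appendRecord(String path, List<String> fields) throws IOException {
        try (FileWriter out = new FileWriter(path, true)) { // true = append
            out.write(toLine(fields));
        }
    }

    public static void main(String[] args) throws IOException {
        // Write to a temporary file so the sketch doesn't touch a real dvds.txt.
        File file = File.createTempFile("dvds", ".txt");
        file.deleteOnExit();
        appendRecord(file.getPath(),
                Arrays.asList("You've Got Mail", "34.99", "DVD-YGEM", "7.00 lbs.", "Warner"));
        System.out.println("Appended one row to " + file.getPath());
    }
}
```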

If you're familiar with VBScript or Java, hopefully the scripts make sense. There is one important point worth noting, though. You'll notice that each script makes use of a "DataRecord" object (referenced as the "dataRecord" variable in the scripts). This object refers to the current DataRecord as the script is executed. Again, think of the spreadsheet. When the script gets invoked, a specific DataRecord (or row in the spreadsheet) will be current. This DataRecord automatically becomes a variable you can use in your script. The DataRecord object has a "get" method, which allows you to retrieve the value for a key it contains (i.e., you're referencing a specific cell in the spreadsheet). Again, you can read more about objects available in scripts and their scope in our documentation, at the Using Scripts and API Documentation pages.

Click on the "Details page" scrapeable file, then on the "Extractor Patterns" tab. Below the extractor pattern text click the "Add Script" button. In the "Script Name" column, select "Write data to a file" and in the "When to Run" column select "After each pattern application" (even though there will only be one match per page). For each DVD we'll execute the script that will write the information out to a file.

To clarify a bit further, because we're invoking the script "After each pattern application", the "dataRecord" variable will be in scope. In other words, for each row in the spreadsheet (which happens to be a single row in this case) screen-scraper will execute the "Write data to a file" script. Each time it gets invoked a DataRecord will be current (again, think of it walking through each row in the spreadsheet). As such, we have access to the current row in the spreadsheet by way of the "dataRecord" variable. Had we indicated that the script was to be invoked "After pattern is applied", the "dataRecord" would not be in scope. Again using the spreadsheet analogy, scripts that get invoked "After pattern is applied" would run after screen-scraper had walked through all of the rows in the spreadsheet, so no DataRecord would be in scope (i.e., it's at the end of the spreadsheet--after the very last row). See the Variable scope section in our documentation for more detail on which variables are in scope depending on when a given script is run.

Once again, run the scraping session. This time if you check the directory where screen-scraper is installed you'll notice a dvds.txt file that will grow as the DVD details pages get scraped.

Note that as an alternative to the above scripts you could do the following in Interpreted Java (professional and enterprise editions only):

dataSet.writeToFile( "dvds.txt" );

Or in VBScript:

Call dataSet.WriteToFile( "dvds.txt" )

We included the first example to demonstrate referencing data records in scripts.

If you would like more information on saving extracted data to a database, please consult our FAQ on the topic.

Tutorial 2: Page 10: Logging In

Logging In

Oftentimes it's necessary to log in to a web site before extracting the information you're interested in. This is generally quite a bit easier than it might seem. Typically this simply involves creating a scrapeable file to handle the login that will get invoked before any of the other pages. The shopping site we're scraping from doesn't require us to log in before performing searches, but for the sake of this tutorial we'll set it up as if it did.

Before we look at the page that handles the actual login, we need to have screen-scraper request the home page for the shopping site. This is necessary because it allows for a few initial cookies to be set before we attempt to log in. If you're familiar with web programming, we're requesting the home page so that the server can create a session for us (tracked by the cookies) prior to our attempting a login. By having screen-scraper request the home page, those cookies will get set, and screen-scraper will then automatically track them for us.

Create a scrapeable file for the home page by clicking on the "Shopping Site" scraping session (the one with a gear) in the tree on the left, then on the "Add Scrapeable File" button. Give the new scrapeable file the name "Home". Leave its sequence as "1", and give it the URL "http://www.screen-scraper.com/shop/".

Login HTTP requests are usually POST requests, which makes it trickier to tell what parameters are being passed to the server (i.e., the parameters won't appear in the URL). The proxy server can make viewing the parameters easier, so let's make use of it. Open your web browser to the shopping login page:

http://www.screen-scraper.com/shop/index.php?main_page=login

In screen-scraper click on the "Shopping Site" proxy session, then on the "Start Proxy Server" button (found on the "General" tab). Now click on the "Progress" tab. Go ahead and remove any HTTP transactions that are already there by clicking the "Clear All Transactions" button. Configure your web browser to use screen-scraper as a proxy server as you did earlier.

In your web browser, in the "E-Mail Address" field enter test@test.com and in the "Password" field enter testing, then click the "login" button. After screen-scraper works for a bit, return to the "General" tab and click the "Stop Proxy Server" button. Re-configure your web browser so that it no longer uses screen-scraper as a proxy server.

If you paid close attention to screen-scraper as it was working you may have noticed that two rows were added to the "HTTP Transactions" table (it's actually possible that three were added; if so just delete the last one by highlighting it and hitting the "Delete" key on your keyboard). Click on the second to last row in the table. Its URL should begin with:

http://www.screen-scraper.com/shop/index.php?main_page=login

This is the actual login POST request. If you scroll down in the lower section and look in the "POST data" text box you'll see the email address and password we entered in earlier. You'll also notice that "x" and "y" parameters were passed in (these simply represent the coordinates where you clicked the "login" button). If you click on the "Response" tab, once again in the lower section, you'll notice that the "Status Line" field shows a response code of "302 Found". This is a redirect response, which indicates that the browser should be redirected to a different URL. When this response was issued by the server your browser faithfully followed the redirect to this other URL, creating the last row in the "HTTP Transactions" table.

At this point we'll want to copy the login POST request to our scraping session. We only need the second to last transaction in the table (the login request itself) and not the request representing the redirect, since screen-scraper will automatically follow redirects for us. Copy the HTTP transaction to your scraping session by clicking on the second to last row in the table (the one corresponding to the POST request), ensure that the "Shopping Site" scraping session is selected in the drop-down, then click the "Go" button. After the new scrapeable file is created under the scraping session rename it "Login". Also, set its sequence to 2. It should be requested right after the home page is requested. screen-scraper automatically tracks cookies, just like a web browser, so by requesting it near the beginning any subsequent pages that are protected by the login will be accessible.

Now click the "Parameters" tab in our "Login" scrapeable file. You'll notice that screen-scraper automatically extracted out the various POST parameters and added them to the scrapeable file. If you're familiar with URL encoding, you'll also notice that screen-scraper decoded the "email_address" parameter to "test@test.com". screen-scraper automatically URL encodes parameters found under the "Parameters" tab before passing them up to the server.
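If URL encoding is new to you, the stand-alone Java sketch below shows roughly what happens to the parameters on their way to the server. It uses the standard java.net.URLEncoder class rather than anything from screen-scraper. The "email_address" parameter name and both values come from this tutorial, but the "password" parameter name is our assumption, as are the class and method names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: builds a URL-encoded POST body from decoded parameters.
class PostBodySketch {

    // URL-encodes a single key or value as UTF-8.
    static String encodeComponent(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
    }

    // Joins the encoded key=value pairs with '&', in insertion order.
    static String encode(Map<String, String> parameters) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> entry : parameters.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(encodeComponent(entry.getKey()))
                .append('=')
                .append(encodeComponent(entry.getValue()));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> parameters = new LinkedHashMap<>();
        parameters.put("email_address", "test@test.com");
        parameters.put("password", "testing"); // parameter name assumed for illustration
        System.out.println(encode(parameters)); // email_address=test%40test.com&password=testing
    }
}
```

Note how the "@" becomes "%40" in the encoded form, which is the reverse of the decoding you see under the "Parameters" tab.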

At this point feel free to run the scraping session again. Because our site doesn't require logging in before searching can take place it won't make much difference, but you'll at least be able to see the login page being requested in the log for the scraping session.

Tutorial 2: Page 11: Where to Go From Here

Where to Go From Here

Congratulations! At this point you should have the basics under your belt to scrape most web sites. From here you could continue on with one of the subsequent tutorials, if they seem relevant to your project. It may also be a good idea to look through a bit more of our documentation in order to get familiar with other details of screen-scraper. Either way, probably the best way to learn screen-scraper is to use it. Try it on one of your own projects!