Tutorial 7: Page 3: Altering the Scraping Session

The changes we'll be making to our "Shopping Site" scraping session in order to add this new functionality are actually pretty minor. First, let's deal with the trickiest part (which really isn't all that tricky): creating the script that will read in the file containing our search terms, and run each search.

Create a new script by clicking the pencil and paper icon in the button bar. Give the script the name "Read search terms". Leave the "Language" drop-down list with the value "Interpreted Java". Paste in the following for the content of the script:

// Create a file object that will point to the file containing
// the search terms.
File inputFile = new File( "search_terms.txt" );

// These two objects are needed to read the file.
FileReader in = new FileReader( inputFile );
BufferedReader buffRead = new BufferedReader( in );

// Read the file in line-by-line.  Each line in the text file
// will contain a search term.
while( ( searchTerm = buffRead.readLine() ) != null )
{
  // Set a session variable corresponding to the search term.
  session.setVariable( "SEARCH", searchTerm );

  // Remember we need to initialize the PAGE session variable, just
  // in case we need to iterate through multiple pages of search results.
  // We begin at page 1 for each search.
  session.setVariable( "PAGE", "1" );

  // Get search results for this particular search term.
  session.scrapeFile( "Search results" );
}

// Close up the reader to indicate we're done reading the file.
// Closing the BufferedReader also closes the FileReader it wraps.
buffRead.close();

The script is pretty heavily commented, so it may be apparent what's going on, but let's walk through it a bit, just in case.

First off we create a few objects that are going to allow us to read in search terms from a file called "search_terms.txt". We read the search terms in line-by-line in a "while" loop. For each search term we're going to invoke the scrapeable file "Search results". You might remember that the "Search results" scrapeable file is the one that handles issuing the search to the e-commerce web site, and walks through all of the search results pages. It also has an extractor pattern that pulls the details links, following each one of those to the "Details page" scrapeable file.
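If it helps to see the loop in isolation, here is a minimal sketch in plain Java that can be run outside of screen-scraper. It makes a few assumptions: "FakeSession" is a hypothetical stand-in for the "session" object that screen-scraper supplies to scripts (it only records the calls made on it), and the two search terms are made up to stand in for the contents of "search_terms.txt".

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class ReadSearchTermsSketch {

    // Hypothetical stand-in for screen-scraper's built-in "session" object.
    // It only records each call so the order of operations can be inspected.
    static final class FakeSession {
        final List<String> calls = new ArrayList<>();

        void setVariable(String name, String value) {
            calls.add("set " + name + "=" + value);
        }

        void scrapeFile(String scrapeableFileName) {
            calls.add("scrape " + scrapeableFileName);
        }
    }

    // Mirrors the loop in the "Read search terms" script: one search per line.
    static void runSearches(Reader source, FakeSession session) {
        try (BufferedReader buffRead = new BufferedReader(source)) {
            String searchTerm;
            while ((searchTerm = buffRead.readLine()) != null) {
                session.setVariable("SEARCH", searchTerm);
                // Every search starts again at the first page of results.
                session.setVariable("PAGE", "1");
                session.scrapeFile("Search results");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up search terms, one per line, as they would appear in the file.
        String fileContents = "dvd\nbook\n";
        FakeSession session = new FakeSession();
        runSearches(new StringReader(fileContents), session);
        for (String call : session.calls) {
            System.out.println(call);
        }
    }
}
```

Running the sketch prints three lines per search term: the SEARCH variable being set, the PAGE variable being reset to 1, and the "Search results" scrapeable file being invoked. That per-term reset of PAGE is what lets each search begin on its first page of results.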

That might sound a bit complicated, so let's put the rest of the pieces in place, run it, then walk through it again.

There are just a few more modifications we need to make. Please do the following:

  1. If you haven't done so previously, you'll also want to disable the "Shopping Site--initialize session" script. We'll be reading search terms from our external file, and this script would otherwise overwrite those values. To disable the script, click on the "Shopping Site" scraping session in the tree on the left, then on the "Scripts" tab. Un-check the box in the table under the "Enabled?" column.
  2. Click on the "Home" scrapeable file, then check the box labeled "This scrapeable file will be invoked manually from a script". Do the same for the "Login" scrapeable file. Because our script never invokes these two files, checking the box essentially has the effect of disabling them. We don't want to log in to the site every time we perform a search (and it's not necessary), so we just forgo these scrapeable files for now.
  3. Click on the "Search results" scrapeable file, then check the box labeled "This scrapeable file will be invoked manually from a script". This time we check the box because we're going to want to run the search for each search term, rather than just letting it happen in sequence for a single term. That is, we're going to explicitly tell the scrapeable file when to run from our "Read search terms" script rather than let screen-scraper invoke it in sequence.
  4. Click on the "Details page" scrapeable file, then on the "Extractor Patterns" tab. For the "PRODUCTS" extractor pattern, in its "Scripts" section (below the box for the pattern text) ensure that the "Enabled?" box for the "Write data to a file" script is checked. We disabled it in a previous tutorial, but we'll need it now so that the data gets written out.
  5. Click on the "Shopping Site" scraping session, then on the "Scripts" tab. Click the "Add Script" button. Under "Script Name" select "Read search terms". Under the "When to Run" column, leave it with the value "Before scraping session begins". We're going to invoke our script at the very beginning of the scraping session. The "Read search terms" script can be thought of as a type of "controller" script. Rather than letting screen-scraper invoke scrapeable files in sequence, our script will instead explicitly initiate searches by invoking the "Search results" scrapeable file.
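
To see how the steps above fit together, here is a toy model in plain Java of the resulting flow. It is only a sketch of the behavior described in this tutorial, not screen-scraper's actual implementation: the "ScrapeableFile" class and the "simulate" method are hypothetical, the search terms are made up, and the "Details page" file is left out because it is reached by following links from the search results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ControllerFlowSketch {

    // Hypothetical model of a scrapeable file and its
    // "This scrapeable file will be invoked manually from a script" checkbox.
    static final class ScrapeableFile {
        final String name;
        final boolean invokedManually;

        ScrapeableFile(String name, boolean invokedManually) {
            this.name = name;
            this.invokedManually = invokedManually;
        }
    }

    // Returns a log of which scrapeable files would be invoked, and by what.
    static List<String> simulate(List<ScrapeableFile> files, List<String> searchTerms) {
        List<String> log = new ArrayList<>();

        // The controller script runs "Before scraping session begins" and
        // explicitly invokes "Search results" once per search term.
        for (String term : searchTerms) {
            log.add("script invokes Search results for " + term);
        }

        // In this toy model, the sequence pass skips every file whose
        // "invoked manually" box is checked.
        for (ScrapeableFile file : files) {
            if (!file.invokedManually) {
                log.add("sequence invokes " + file.name);
            }
        }
        return log;
    }

    public static void main(String[] args) {
        // All three boxes are checked, as in steps 2 and 3.
        List<ScrapeableFile> files = Arrays.asList(
            new ScrapeableFile("Home", true),
            new ScrapeableFile("Login", true),
            new ScrapeableFile("Search results", true));
        for (String line : simulate(files, Arrays.asList("dvd", "book"))) {
            System.out.println(line);
        }
    }
}
```

With all three boxes checked, the only entries in the log come from the script, one per search term. If the box for "Home" were left un-checked, the model would also log a sequence invocation of "Home", which is the behavior step 2 is meant to avoid.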

That should do it. Click ahead to finalize setup and run the scraping session.