Tutorial 7: Scraping a Site Multiple Times Based on Search Terms
It's often the case in screen-scraping that you want to submit a form multiple times using different parameters each time. For example, you may be extracting locations from the "store locator" service on a site, and need to submit the form for a series of zip codes. In this tutorial we'll provide an example of how to go about that. We'll continue using the "Shopping Site" scraping session we generated in Tutorial 2.
If you haven't already gone through Tutorial 2, this tutorial will make more sense if you do so first.
If you decided not to go through Tutorial 2, or don't still have the scraping session and scripts you created in it, you can download and import them into screen-scraper by following these steps:
- Download the zip file located here and unzip it. You should now have an "interpreted_java" directory and a "vbscript" directory.
- If you're running Windows, and prefer to program in VBScript, import the "Shopping Site (Scraping Session).sss" scraping session located in the "vbscript" directory; otherwise, import the one located in the "interpreted_java" directory. Instructions on importing objects into screen-scraper can be found here.
Once you've got the scraping session imported into screen-scraper you're ready to roll. Click on the "Tutorial Details" link below to get going.
Tutorial 7: Page 2: Tutorial Details
Our "Shopping Site" example is pretty limited in that it can only handle one search term. What if we want to extract products for multiple search terms? For example, we may want to scrape various DVD titles that would fit with the other titles in our collection. We could search for the new DVDs using a series of keywords.
We're going to alter the existing "Shopping Site" scraping session so that it reads in a file containing search terms, and performs a search for each one. Just as before, as it performs a search it will follow the "details" links and extract out information for each product. Once the information is extracted it will write it out to a file.
Tutorial 7: Page 3: Altering the Scraping Session
The changes we'll be making to our "Shopping Site" scraping session in order to add this new functionality are actually pretty minor. First, let's deal with the trickiest part (which really isn't all that tricky): creating the script that will read in the file containing our search terms, and run each search.
Create a new script by clicking the pencil and paper icon in the button bar. Give the script the name "Read search terms". Leave the "Language" drop-down list with the value "Interpreted Java". Paste in the following for the content of the script:
// Create a file object that will point to the file containing
// the search terms.
File inputFile = new File( "search_terms.txt" );

// These two objects are needed to read the file.
FileReader in = new FileReader( inputFile );
BufferedReader buffRead = new BufferedReader( in );

// Read the file in line-by-line. Each line in the text file
// will contain a search term.
while( ( searchTerm = buffRead.readLine() )!=null)
{
    // Set a session variable corresponding to the search term.
    session.setVariable( "SEARCH", searchTerm );

    // Remember we need to initialize the PAGE session variable, just
    // in case we need to iterate through multiple pages of search results.
    // We begin at page 1 for each search.
    session.setVariable( "PAGE", "1" );

    // Get search results for this particular search term.
    session.scrapeFile( "Search results" );
}

// Close up the objects to indicate we're done reading the file.
in.close();
buffRead.close();
The script is pretty heavily commented, so it may be apparent what's going on, but let's walk through it a bit, just in case.
First off we create a few objects that are going to allow us to read in search terms from a file called "search_terms.txt". We read the search terms in line-by-line in a "while" loop. For each search term we're going to invoke the scrapeable file "Search results". You might remember that the "Search results" scrapeable file is the one that handles issuing the search to the e-commerce web site, and walks through all of the search results pages. It also has an extractor pattern that pulls the details links, following each one of those to the "Details page" scrapeable file.
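If you'd like to try the same read loop outside of screen-scraper, here is a minimal sketch in plain Java. The "session" object only exists inside screen-scraper, so a callback named runSearch (our own invention) stands in for the two session.setVariable calls and the session.scrapeFile call. The class and method names are also just for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class SearchTermLoop {

    // Reads one search term per line and hands each one to runSearch.
    // runSearch is a hypothetical stand-in for the session calls that
    // the "Read search terms" script makes inside screen-scraper.
    static int forEachTerm(Reader source, Consumer<String> runSearch) {
        int count = 0;
        // try-with-resources closes the reader even if a search fails.
        try (BufferedReader reader = new BufferedReader(source)) {
            String searchTerm;
            while ((searchTerm = reader.readLine()) != null) {
                runSearch.accept(searchTerm);
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        // A StringReader stands in for the search_terms.txt file.
        int n = forEachTerm(new StringReader("bug\nspeed\nblade\n"), seen::add);
        System.out.println(n + " searches: " + seen);
        // prints: 3 searches: [bug, speed, blade]
    }
}
```

The loop condition is the same one the script uses: readLine() returns null once the end of the input is reached, which is what ends the "while" loop.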
That might sound a bit complicated, so let's put the rest of the pieces in place, run it, then walk through it again.
There are just a few more modifications we need to make. Please do the following:
- If you haven't done so previously, you'll want to disable the "Shopping Site--initialize session" script. We'll be reading search terms from our external file, and this script would otherwise overwrite those values. To disable the script, click on the "Shopping Site" scraping session in the tree on the left, then on the "Scripts" tab. Un-check the box in the table under the "Enabled?" column.
- Click on the "Home" scrapeable file, then check the box labeled "This scrapeable file will be invoked manually from a script". Do the same for the "Login" scrapeable file. Because no script ever invokes them, this essentially has the effect of disabling them. We don't want to log in to the site every time we perform a search (and it's not necessary), so we just forgo these scrapeable files for now.
- Click on the "Search results" scrapeable file, then check the box labeled "This scrapeable file will be invoked manually from a script". This time we check the box because we're going to want to run the search for each search term, rather than just letting it happen in sequence for a single term. That is, we're going to explicitly tell the scrapeable file when to run from our "Read search terms" script rather than let screen-scraper invoke it in sequence.
- Click on the "Details page" scrapeable file, then on the "Extractor Patterns" tab. For the "PRODUCTS" extractor pattern, in its "Scripts" section (below the box for the pattern text) ensure that the "Enabled?" box for the "Write data to a file" script is checked. We disabled it in a previous tutorial, but we'll need it now so that the data gets written out.
- Click on the "Shopping Site" scraping session, then on the "Scripts" tab. Click the "Add Script" button. Under "Script Name" select "Read search terms". Under the "When to Run" column, leave it with the value "Before scraping session begins". We're going to invoke our script at the very beginning of the scraping session. The "Read search terms" script can be thought of as a type of "controller" script. Rather than letting screen-scraper invoke scrapeable files in sequence, our script will instead explicitly initiate searches by invoking the "Search results" scrapeable file.
That should do it. Click ahead to finalize setup and run the scraping session.
Tutorial 7: Page 4: Running the Scraping Session
The last item we need to take care of is creating the text file that will contain our search terms. Let's keep it simple. Fire up your favorite text editor and create a file called "search_terms.txt" inside of screen-scraper's installation folder (e.g., "C:\Program Files\screen-scraper professional edition\search_terms.txt"). Add the following three lines to the text file:
bug
speed
blade
Those search terms should yield at least a few DVDs we can add to our collection.
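The reason the file location matters is that our script opens "search_terms.txt" by a relative name, and Java resolves relative names against the working directory of the running program. The sketch below is plain Java, not a screen-scraper script. It assumes screen-scraper is started from its installation folder, and it simply prints where a relative name would be looked up. The class name is made up for illustration.

```java
import java.io.File;

class SearchTermsLocation {

    // Returns the absolute path that a relative file name resolves to.
    static String resolve(String relativeName) {
        return new File(relativeName).getAbsolutePath();
    }

    public static void main(String[] args) {
        File termsFile = new File("search_terms.txt");
        // user.dir is the directory the Java process was started from.
        System.out.println("Working directory: " + System.getProperty("user.dir"));
        System.out.println("Script would read: " + resolve("search_terms.txt"));
        System.out.println("File exists: " + termsFile.exists());
    }
}
```

If the scraping session later complains that it can't find the file, comparing these paths against where you saved the file is a reasonable first check.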
All right, now's the moment of truth. Run the new scraping session by clicking on it in screen-scraper and clicking the "Run Scraping Session" button. After that, click on the "Log" tab to watch it do its thing. If all goes well, once it's done, you should have a "dvds.txt" file in screen-scraper's install folder containing scraped data for all of the search terms.
Take a look carefully through the log. If it all seems to make sense, you're done. If not, read on so that we can walk through it a bit more carefully.
The flow of events goes like this, once you hit the "Run Scraping Session" button:
- The scraping session starts up, and immediately invokes the "Read search terms" script.
- The "Read search terms" script creates a few objects, then reads in the first line of the "search_terms.txt" file: "bug".
- The "Read search terms" script sets the "SEARCH" session variable with the value "bug". You'll remember from the earlier tutorial that the "SEARCH" session variable is used to perform each search. Check the "URL" field for the "Search results" scrapeable file for a reminder on where it's used.
- The "Read search terms" script initializes the "PAGE" session variable to "1". It turns out that this is probably unnecessary in this particular case, but you'll want to remember it for future projects. For each search term we're performing a completely separate search, so we need to make sure we start on the first page.
- The "Read search terms" script invokes the "Search results" scrapeable file. This is essentially the same thing as clicking the "Search" button on the search form with the current search term ("bug", in this case).
- The "Search results" scrapeable file makes the HTTP request, then applies the "PRODUCT" extractor pattern to the HTML in order to get all of the "details" links.
- For each match by the "PRODUCT" extractor pattern the script "Scrape details page" gets invoked.
- At this point screen-scraper will loop zero or more times. It will scrape the "Details page" scrapeable file for each link found on the search results page.
- Each time the "Details page" scrapeable file is invoked it requests the page, extracts out the data we want, then invokes the "Write data to a file" script, which writes out the extracted data to the "dvds.txt" file.
- Once screen-scraper has finished performing the search for "bug", control flows back to our original "Read search terms" script, where it moves on to the next search term in the file: "speed". From there you can go back to step 3, where it begins the search process again.
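The flow above boils down to two nested loops: an outer loop over search terms and an inner loop over the details links found for each term. Here is a rough sketch of that shape in plain Java. It does no real scraping. The map of fake links, the class name, and the trace messages are all invented for illustration, and the paging of search results is left out to keep it short.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ScrapeFlowTrace {

    // Records the order of events for each search term: set the session
    // variables, run the search, then visit every details link it found.
    static List<String> trace(Map<String, List<String>> detailsLinksByTerm) {
        List<String> log = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : detailsLinksByTerm.entrySet()) {
            log.add("set SEARCH=" + entry.getKey());
            log.add("set PAGE=1");
            log.add("scrape Search results");
            // Runs zero or more times, once per details link.
            for (String link : entry.getValue()) {
                log.add("scrape Details page " + link);
                log.add("write data to dvds.txt");
            }
        }
        return log;
    }

    public static void main(String[] args) {
        // Made-up results: two links for "bug", none for "speed".
        Map<String, List<String>> fake = new LinkedHashMap<>();
        fake.put("bug", List.of("details-1", "details-2"));
        fake.put("speed", List.of());
        trace(fake).forEach(System.out::println);
    }
}
```

Notice that a term with no matching products still costs one search request, but the inner loop simply doesn't run for it.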
Remember that the "Log" tab is key to understanding the flow of events in screen-scraper. If you're still a bit fuzzy on how things are working, try looking more carefully through the log to piece together how the site is being scraped.
Tutorial 7: Page 5: Where to Go From Here
At this point feel free to experiment a bit. You may want to try adding a few more search terms to the "search_terms.txt" file.
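One thing to watch for as the file grows: the script treats every line as a search term, so a blank line would trigger a search with an empty "SEARCH" value, and a repeated term would be searched twice. A small clean-up step, sketched below in plain Java, is one way to guard against that. This is our own suggestion rather than part of the original scraping session, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class SearchTermCleaner {

    // Trims each line, drops blank ones, and removes duplicates while
    // keeping the terms in their original order.
    static List<String> clean(List<String> rawLines) {
        Set<String> unique = new LinkedHashSet<>();
        for (String line : rawLines) {
            String term = line.trim();
            if (!term.isEmpty()) {
                unique.add(term);
            }
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        List<String> raw = List.of("bug", "", "  speed ", "bug", "blade");
        System.out.println(clean(raw));
        // prints: [bug, speed, blade]
    }
}
```

Inside the "Read search terms" script, the same idea could be as simple as trimming each line and skipping it when the result is empty, before setting the "SEARCH" session variable.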
Probably the best way to build on what this tutorial covers would be to try your own project. If you're faced with the task of scraping a web site multiple times for various numbers or search keywords, chances are the scraping session you'll create won't differ too significantly from the one we've presented here.