Memory Conscious Next Page

If you're scraping a site with lots of "next page" links, use the following script instead of the other two listed here.

Conceptually, the problem with calling a script at the end of a scrapeableFile that calls the same scrapeableFile over and over again is that you're stacking the scrapeableFiles on top of one another. None of them leaves memory until the last page has completed, at which point the whole stack quickly unwinds. This style of scraping is called "recursive".

If you can't predict how many pages there will be, then this idea should scare you :) Instead, you should use an "iterative" approach. Rather than chaining scrapeableFiles onto the end of one another, you call one, let it finish and return to the script that called it, and then that script calls the next one. A while or for loop is well suited to this.

Here's a quick side-by-side illustration, so that you can properly visualize the difference. Script code follows.

// the non-loop "recursive" approach:
search results for category "A"
|- next results
     |- next results
         |- next results
             |- next results
search results for category "B"
|- next results
     |- next results
         |- next results
             |- next results
                 |- next results
                     |- next results

// Now here's the for-loop "iterative" approach, via a single control script:
search results for category "A"
next results
next results
next results
next results

search results for category "B"
next results
next results
next results
next results
next results
next results

Much more memory efficient.
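The same contrast can be sketched in plain Java, outside of screen-scraper. Everything here is a hypothetical stand-in: `lastPage` plays the role of the site running out of "next page" links, and the `// ... scrape page ...` comments mark where the real scraping work would happen.

```java
public class NextPageStyles {

    // Recursive style: each page's stack frame stays alive until the
    // final page returns, so memory use grows with the page count.
    // Returns the number of pages "scraped".
    static int scrapeRecursively(int page, int lastPage) {
        // ... scrape page `page` here ...
        if (page < lastPage) {
            return 1 + scrapeRecursively(page + 1, lastPage); // stacks another frame
        }
        return 1;
    }

    // Iterative style: each page is scraped and finishes before the
    // next one begins, so only one frame is alive at a time.
    static int scrapeIteratively(int lastPage) {
        int scraped = 0;
        boolean hasNextPage = true;
        for (int page = 1; hasNextPage; page++) {
            scraped++; // ... scrape page `page` here ...
            hasNextPage = page < lastPage;
        }
        return scraped;
    }

    public static void main(String[] args) {
        System.out.println("recursive: " + scrapeRecursively(1, 5) + " pages");
        System.out.println("iterative: " + scrapeIteratively(5) + " pages");
    }
}
```

Both versions visit the same pages; the difference is only in how many of them are "in flight" at once.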

So here's how to do it. When you get to the point where you need to start iterating search results, call a script which will be a little controller for the iteration of pages. This will handle page numbers and offset values (in the event that page iteration isn't using page numbers).

First, your search results page should have an extractor pattern that detects whether a next page exists. This removes the need to know what the next page number actually is, reducing "next page" to a simple boolean true or false. The pattern should match some text that only appears when a next page is present. In the example code below, I've named the token "HAS_NEXT_PAGE". Be sure to save it to a session variable. If there is no next page, the variable won't be set at all; that is the flag that tells the script to stop iterating pages.
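For instance, if the site rendered its next link as the (hypothetical) HTML `<a href="results?page=2" class="next">Next</a>`, an extractor pattern like the following would match only while a next page exists. What the token captures doesn't matter; only whether the pattern matched does:

```
class="next">~@HAS_NEXT_PAGE@~</a>
```

Check "Save in session variable" on the HAS_NEXT_PAGE token so the controller script can read it.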

// If using an offset, this number should be the first search results page's offset, be it 0 or 1.
int initialOffset = 0;

// ... and this number is the amount that the offset increases by each
// time you push the "next page" link on the search results.
int offsetStep = 20;

String fileToScrape = "Search Results ScrapeableFile Name";

/* Generally no need to edit below here */

String hasNextPage = "true"; // dummy value to allow the first page to be scraped
for (int currentPage = 1; hasNextPage != null; currentPage++)
{
    // Clear this out, so the next page can find its own value for this variable.
    session.setVariable("HAS_NEXT_PAGE", null);
    session.setVariable("PAGE", currentPage);
    session.setVariable("OFFSET", (currentPage - 1) * offsetStep + initialOffset);
    session.scrapeFile(fileToScrape);
    hasNextPage = session.getVariable("HAS_NEXT_PAGE");
}
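Outside of screen-scraper, the loop's termination logic can be exercised with a mock session: a plain `Map` stands in for the session variables, and `scrapeFile` here is a hypothetical stub that records the PAGE/OFFSET it was handed and sets HAS_NEXT_PAGE only while more pages remain, just as the extractor pattern would.

```java
import java.util.HashMap;
import java.util.Map;

public class NextPageLoopDemo {
    static final int LAST_PAGE = 3; // the stub stops flagging a next page here
    static Map<String, Object> session = new HashMap<>();
    static StringBuilder log = new StringBuilder();

    // Hypothetical stub for session.scrapeFile(): logs the variables it
    // received, and sets HAS_NEXT_PAGE only while more pages remain.
    static void scrapeFile(String name) {
        log.append("PAGE=").append(session.get("PAGE"))
           .append(" OFFSET=").append(session.get("OFFSET")).append("\n");
        if ((Integer) session.get("PAGE") < LAST_PAGE) {
            session.put("HAS_NEXT_PAGE", "true");
        }
    }

    public static void main(String[] args) {
        int initialOffset = 0;
        int offsetStep = 20;

        Object hasNextPage = "true"; // dummy value so the first page is scraped
        for (int currentPage = 1; hasNextPage != null; currentPage++) {
            // Clear this out, so the next page can find its own value.
            session.put("HAS_NEXT_PAGE", null);
            session.put("PAGE", currentPage);
            session.put("OFFSET", (currentPage - 1) * offsetStep + initialOffset);
            scrapeFile("Search Results ScrapeableFile Name");
            hasNextPage = session.get("HAS_NEXT_PAGE");
        }
        System.out.print(log);
    }
}
```

The loop stops on its own once the stub leaves HAS_NEXT_PAGE unset, which is exactly how the real script detects the last page.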

The script provides you with a "PAGE" session variable and an "OFFSET" session variable. Feel free to use either one, whichever your situation calls for.

Given the default values in the script, OFFSET will be 0, 20, 40, 60, and so on.
PAGE will be 1, 2, 3, 4, 5, and so on.