Using Scraping Sessions


Overview

A scraping session is simply a way to group the files that you want scraped. Typically you'll create one scraping session for each site you want to scrape information from.

You can create a new scraping session by clicking the New Scraping Session button (looks like a gear) or by selecting "File->New Scraping Session" from the menu.

General tab



The "General" tab allows you to manage basic actions and information related to the scraping session.

  • Run Scraping Session: Starts the scraping session. Once the scraping session begins running you can watch its progress under the "Log" tab.
  • Delete: Deletes the scraping session.
  • Add Scrapeable File: Adds a new scrapeable file to this scraping session. See the using scrapeable files page for more information.
  • Export: Allows you to export the scraping session to an XML file. This might be useful for backing up your work or transferring information to a different screen-scraper installation.
  • Name: Used to identify the scraping session. The name should be unique relative to other scraping sessions.
  • Notes: Useful for keeping notes specific to the scraping session.

Scripts tab



On this tab you can designate scripts to run either before or after the scraping session runs. This is useful for tasks such as initializing session variables or cleaning up once the session has finished. Each row of the table describes one script: the "Script Name" column identifies the script to run, the "Sequence" column determines the order in which the scripts are invoked, and the "When to Run" column indicates the event that triggers the script. A script runs only if the checkbox in its "Enabled?" column is checked.
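The way these columns work together can be pictured with a small, self-contained Java sketch. It is only a model of the behavior described above, not screen-scraper's actual implementation: the ScriptRow class, the scriptsToRun method, and the "before"/"after" event labels are all hypothetical names invented for the illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ScriptOrderSketch {

    // Hypothetical stand-in for one row of the table on the "Scripts" tab.
    static class ScriptRow {
        final String scriptName;
        final int sequence;
        final String whenToRun; // "before" or "after" (labels assumed for this sketch)
        final boolean enabled;

        ScriptRow(String scriptName, int sequence, String whenToRun, boolean enabled) {
            this.scriptName = scriptName;
            this.sequence = sequence;
            this.whenToRun = whenToRun;
            this.enabled = enabled;
        }
    }

    // Returns the names of the scripts that would run for the given event:
    // disabled rows are skipped, and the rest are ordered by sequence.
    static List<String> scriptsToRun(List<ScriptRow> rows, String event) {
        List<ScriptRow> selected = new ArrayList<>();
        for (ScriptRow row : rows) {
            if (row.enabled && row.whenToRun.equals(event)) {
                selected.add(row);
            }
        }
        selected.sort(Comparator.comparingInt((ScriptRow r) -> r.sequence));

        List<String> names = new ArrayList<>();
        for (ScriptRow row : selected) {
            names.add(row.scriptName);
        }
        return names;
    }

    public static void main(String[] args) {
        List<ScriptRow> rows = new ArrayList<>();
        rows.add(new ScriptRow("Clean up", 1, "after", true));
        rows.add(new ScriptRow("Initialize variables", 2, "before", true));
        rows.add(new ScriptRow("Log in", 1, "before", true));
        rows.add(new ScriptRow("Old init", 3, "before", false)); // unchecked "Enabled?"

        // prints [Log in, Initialize variables]
        System.out.println(scriptsToRun(rows, "before"));
    }
}
```

In this model, unchecking "Enabled?" removes a script from the run without deleting its row, which is handy when you want to switch a script off temporarily while debugging.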

Log tab



The "Log" tab displays messages as the scraping session is running. This is one of the most valuable tools in working with and debugging scraping sessions. As you're creating your scraping session you'll want to run it frequently and check the log to ensure that it's doing what you expect it to.

Advanced tab



This tab contains a number of settings that may be required when working with certain sites.

  • Max requests per file: (professional and enterprise editions only) In some cases web sites may not be completely reliable, which could necessitate making the request for a given page more than once. For example, a small site receiving a lot of traffic may not respond to the first two or three requests, but could on subsequent requests. The "Max requests per file" text box allows you to control the maximum number of attempts screen-scraper should make in requesting a given file. For example, if this value is set to 10, screen-scraper will try to request a given file up to 10 times before giving up on it.
  • Cookie policy: (professional and enterprise editions only) This drop-down list controls the way screen-scraper works with cookies. In most cases you won't need to modify this setting. There may be instances, however, where you find yourself unable to log in to a web site or advance through pages as you're expecting to. If you've checked other settings, such as POST and GET parameters, you may need to adjust the cookie policy. Some web sites issue cookies in uncommon ways, and adjusting this setting will allow screen-scraper to work correctly with them. In some cases you may also want to reject cookies completely.
  • HTTP client: (professional and enterprise editions only) In certain rare cases a site will only function when accessed with Internet Explorer. The "HTTP client" drop-down list allows you to indicate that screen-scraper should use the Internet Explorer browser to make its requests instead of its own internal HTTP client. This feature only works when screen-scraper is running on Microsoft Windows. Note also that it should only be used as a last resort as it will cause the scraping process to take longer and to consume more memory.
  • Use HTTP strict mode: (professional and enterprise editions only) This setting goes hand-in-hand with the "Cookie policy" drop-down list. If you're having trouble advancing through pages on a site you might try checking this box as well as adjusting the cookie policy.
  • External proxy settings: These text boxes are used in cases where you need to connect to the Internet via an external proxy server.
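To make the "Max requests per file" setting more concrete, the following self-contained Java sketch shows the general retry pattern it describes: keep requesting until a response arrives or the limit is reached. It is an illustration only and does not show how screen-scraper implements the setting internally. The requestWithRetries method and the simulated flaky site are hypothetical.

```java
import java.util.Optional;
import java.util.function.Supplier;

class MaxRequestsSketch {

    // Calls the request up to maxRequests times and returns the first
    // response received, or Optional.empty() if every attempt fails.
    static Optional<String> requestWithRetries(Supplier<Optional<String>> request,
                                               int maxRequests) {
        for (int attempt = 1; attempt <= maxRequests; attempt++) {
            Optional<String> response = request.get();
            if (response.isPresent()) {
                return response;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical overloaded site: it ignores the first three requests
        // and answers from the fourth one onward.
        int[] calls = {0};
        Supplier<Optional<String>> flakySite = () ->
                ++calls[0] <= 3 ? Optional.empty() : Optional.of("<html>ok</html>");

        Optional<String> page = requestWithRetries(flakySite, 10);

        // prints "true after 4 requests"
        System.out.println(page.isPresent() + " after " + calls[0] + " requests");
    }
}
```

With a limit of 10 the simulated site is reached on the fourth attempt. With a limit of 3 the same site would be given up on, which is why a higher value can help with unreliable sites at the cost of longer waits on pages that never respond.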

Anonymization tab



See the Anonymization page of the documentation for details on this pane.

Running Scraping Sessions Within Scraping Sessions (enterprise edition only)

It is also possible to run a scraping session from within a scraping session that is already running by using the RunnableScrapingSession class. Detailed documentation on the methods available for the RunnableScrapingSession class is in our API documentation. Here's a specific example of how the RunnableScrapingSession class might be used in a screen-scraper script:

// Generate a new RunnableScrapingSession object that will inherit
// from the current scraping session.  This object will be used
// to run the scraping session named "My Session".
myRunnableScrapingSession = new com.screenscraper.scraper.RunnableScrapingSession( "My Session", session );

// Because we passed the "session" object to the RunnableScrapingSession
// it will have access to all of the session variables within the
// currently running session.  As such, there's no need to explicitly
// set any new session variables.  We simply tell it to scrape.
myRunnableScrapingSession.scrape();

// Once it's done scraping, because it inherited from our currently
// running scraping session, we have access to any session variables
// that were set when the RunnableScrapingSession ran in the context
// of our currently running scraping session.  For example, let's
// suppose that when the RunnableScrapingSession ran it set a new
// variable called "MY_VAR".  Because of the inheritance, we could
// do something like this to see the new value:
session.log( "MY_VAR: " + session.getVariable( "MY_VAR" ) );

