The Scraping Engine

An Overview of the Scraping Engine

Purpose

The screen-scraper application contains a scraping engine, which is intended to provide an intuitive and convenient way to designate the web pages you want information scraped from. Many of the details of screen-scraping that are typically handled in code can instead be handled through a graphical user interface (called the "workbench" in screen-scraper).

Basic concepts

There are several basic elements used by screen-scraper in extracting data from web sites. The first is a scraping session, which consists of a series of files, called scrapeable files, that screen-scraper will request in a designated sequence. A common example is a site that requires authentication before the desired data can be accessed. The first file, or HTTP request, might go to a server-side script that handles a user's login. It might then be necessary to follow a few links, each of which would involve creating another scrapeable file, until the page containing the desired data can be requested.

Any number of parameters can be associated with a scrapeable file. These can be GET or POST parameters, or credentials for Basic, Digest, or NTLM authentication, that need to be sent when the file is requested. For each scrapeable file that's requested, any number of extractor patterns can be applied to the text retrieved from the page in order to extract the desired pieces. Throughout this process scripts can be invoked to perform tasks such as inserting extracted data into a database or requesting subsequent scrapeable files. As a scraping session runs, screen-scraper logs the activity and records each request and response for every scrapeable file that gets requested.


API Documentation

Using Scraping Sessions

Overview

A scraping session is simply a way to collect together files that you want scraped. Typically you'll create a scraping session for each site you want to scrape information from.

You can create a new scraping session by clicking the New Scraping Session button (looks like a gear) or by selecting "File->New Scraping Session" from the menu.

General tab



The "General" tab allows you to manage basic actions and information related to the scraping session.

  • Run Scraping Session: Starts the scraping session. Once the scraping session begins running you can watch its progress under the "Log" tab.
  • Delete: Deletes the scraping session.

  • Add Scrapeable File: Adds a new scrapeable file to this scraping session. See the using scrapeable files page for more information.
  • Export: Allows you to export the scraping session to an XML file. This might be useful for backing up your work or transferring information to a different screen-scraper installation.
  • Name: Used to identify the scraping session. The name should be unique relative to other scraping sessions.
  • Notes: Useful for keeping notes specific to the scraping session.

Scripts tab



Using this tab, scripts can be designated to run either before or after the scraping session runs. This can be useful for functions like initializing session variables and performing clean-up after the session is finished. The script to be run is designated under the "Script Name" column. The order in which the scripts are invoked is determined by the "Sequence" column. Indicate the event that should trigger the script using the "When to Run" column. If the checkbox in the "Enabled?" column is not checked, the script will not be run.

Log tab



The "Log" tab displays messages as the scraping session is running. This is one of the most valuable tools in working with and debugging scraping sessions. As you're creating your scraping session you'll want to run it frequently and check the log to ensure that it's doing what you expect it to.

Advanced tab



This tab contains a number of settings that may be required when working with certain sites.

  • Max requests per file: (professional and enterprise editions only) In some cases web sites may not be completely reliable, which could necessitate making the request for a given page more than once. For example, a small site receiving a lot of traffic may not respond to the first two or three requests, but could on subsequent requests. The "Max requests per file" text box allows you to control the maximum number of attempts screen-scraper should make in requesting a given file. For example, if this value is set to 10, screen-scraper will try to request a given file up to 10 times before giving up on it.
  • Cookie policy: (professional and enterprise editions only) This drop-down list controls the way screen-scraper works with cookies. In most cases you won't need to modify this setting. There may be instances, however, where you find yourself unable to log in to a web site or advance through pages as you're expecting to. If you've checked other settings, such as POST and GET parameters, you may need to adjust the cookie policy. Some web sites issue cookies in uncommon ways, and adjusting this setting will allow screen-scraper to work correctly with them. In some cases you may also want to reject cookies completely.
  • HTTP client: (professional and enterprise editions only) In certain rare cases a site will only function when accessed with Internet Explorer. The "HTTP client" drop-down list allows you to indicate that screen-scraper should use the Internet Explorer browser to make its requests instead of its own internal HTTP client. This feature only works when screen-scraper is running on Microsoft Windows. Note also that it should only be used as a last resort as it will cause the scraping process to take longer and to consume more memory.
  • Use HTTP strict mode: (professional and enterprise editions only) This setting goes hand-in-hand with the "Cookie policy" drop-down list. If you're having trouble advancing through pages on a site you might try checking this box as well as adjusting the cookie policy.
  • External proxy settings: These text boxes are used in cases where you need to connect to the Internet via an external proxy server.
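
The "Max requests per file" behavior described above can be pictured as a bounded retry loop. The following is only a minimal sketch of that idea in plain Java, not screen-scraper's own implementation; the class name, the method name, and the use of an empty Optional to stand for "no response" are all assumptions made for illustration.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Illustration only: a bounded retry loop, not screen-scraper's actual code.
final class MaxRequestsSketch {

    // Calls `request` until it yields a response or `maxRequests` attempts are used up.
    static Optional<String> requestWithLimit(Supplier<Optional<String>> request, int maxRequests) {
        for (int attempt = 1; attempt <= maxRequests; attempt++) {
            Optional<String> response = request.get();
            if (response.isPresent()) {
                return response;
            }
        }
        return Optional.empty(); // every attempt failed, so give up on this file
    }

    public static void main(String[] args) {
        // Hypothetical flaky server: no answer on the first two attempts, then a page.
        int[] calls = {0};
        Supplier<Optional<String>> flaky =
            () -> ++calls[0] < 3 ? Optional.<String>empty() : Optional.of("<html>ok</html>");

        System.out.println(requestWithLimit(flaky, 10).orElse("no response"));
        System.out.println("attempts used: " + calls[0]);
    }
}
```

With the limit set to 10, the hypothetical server above is reached on the third attempt and the remaining attempts are never made.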

Anonymization tab



See the Anonymization page of the documentation for details on this pane.

Running Scraping Sessions Within Scraping Sessions (enterprise edition only)

It is also possible to run a scraping session within a scraping session that is already running via the RunnableScrapingSession class. Detailed documentation on the methods available for the RunnableScrapingSession class is in our API documentation. Here's a specific example of how the RunnableScrapingSession might be used in a screen-scraper script:

// Generate a new RunnableScrapingSession object that will inherit
// from the current scraping session.  This object will be used
// to run the scraping session "My Scraping Session".
myRunnableScrapingSession = new com.screenscraper.scraper.RunnableScrapingSession( "My Scraping Session", session );

// Because we passed the "session" object to the RunnableScrapingSession
// it will have access to all of the session variables within the
// currently running session.  As such, there's no need to explicitly
// set any new session variables.  We simply tell it to scrape.
myRunnableScrapingSession.scrape();

// Once it's done scraping, because it inherited from our currently
// running scraping session, we have access to any session variables
// that were set when the RunnableScrapingSession ran in the context
// of our currently running scraping session.  For example, let's
// suppose that when the RunnableScrapingSession ran it set a new
// variable called "MY_VAR".  Because of the inheritance, we could
// do something like this to see the new value:
session.log( "MY_VAR: " + session.getVariable( "MY_VAR" ) );


Using Scrapeable Files

Overview

A scrapeable file is a URL-accessible file that you want to have retrieved as part of a scraping session. These files are the core of screen-scraping as they determine what information will be made available to extract data from.

Scrapeable files are created by clicking the "Add Scrapeable File" button from the "General" tab for a scraping session. You can delete a scrapeable file by right-clicking (or control-clicking in Mac OS X) it in the tree on the left side of the screen and selecting "Delete".

In addition to working with files on remote servers, screen-scraper can also handle files on local file systems. For example, the following is a valid path to designate in the URL field: C:\wwwroot\myweb\my_file.htm.

Properties tab



The "Properties" tab defines basic settings needed to request a file.

  • Delete: Deletes the scrapeable file.
  • Copy: Copies the scrapeable file. (enterprise edition only)
  • Name: Identifies the scrapeable file.

  • URL: The URL of the file to be scraped. This is likely something like http://www.mysite.com/, but can also contain embedded session variables, like this: http://www.mysite.com/cgi-bin/test.cgi?param1=~#TEST#~. In the latter case the text ~#TEST#~ would get replaced with the value of the corresponding session variable.
  • Sequence: Indicates the order in which this file should be requested.
  • This scrapeable file will be invoked manually from a script: Indicates that this scrapeable file will be invoked within a script, so it should not be scraped in sequence. If this box is checked the "Sequence" text box becomes grayed out.

Parameters tab



"Get" and "Post" Parameters

The "Parameters" tab indicates GET and POST parameters that should be sent when the file is requested. Note that GET parameters can also be embedded in the "URL" field under the "Properties" tab. Parameters are added using the "Add Parameter" button. They can be deleted by selecting them and either hitting the "Delete" key on the keyboard, or by right-clicking (control-clicking in Mac OS X) and selecting "Delete".

Upload a File

In the Enterprise Edition of screen-scraper you can also designate files to be uploaded. This is done by designating "FILE" as the parameter type. The "Key" column would contain the name of the parameter (as found in the corresponding HTML form), and the value would be the local path to the file you'd like to upload (e.g., C:\myfiles\this_file.txt).

Embed Variables

Embedded session variables can be used in the "Key" and "Value" fields for parameters. For example, if you have a "username" POST parameter you might embed a USERNAME session variable in the "Value" field with the token ~#USERNAME#~. This would cause the value of the "USERNAME" session variable to be substituted in at run time.
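
To make the substitution concrete, here is a minimal sketch in plain Java of how a ~#NAME#~ token in a URL or parameter field could be replaced with a session variable's value at run time. It is an illustration only and not screen-scraper's own code: the class and method names are hypothetical, the token name is assumed to consist of letters, digits, and underscores, and leaving an unset variable's token untouched is likewise an assumption.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: mimics ~#NAME#~ substitution; not screen-scraper's actual code.
final class TokenSubstitution {

    // Matches ~#NAME#~ where NAME is made of letters, digits, or underscores (assumed).
    private static final Pattern TOKEN = Pattern.compile("~#(\\w+)#~");

    static String substitute(String template, Map<String, String> sessionVariables) {
        Matcher matcher = TOKEN.matcher(template);
        StringBuilder result = new StringBuilder();
        while (matcher.find()) {
            String name = matcher.group(1);
            // Assumption: a token whose variable is not set is left as-is.
            String value = sessionVariables.getOrDefault(name, matcher.group());
            matcher.appendReplacement(result, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        Map<String, String> vars = Map.of("TEST", "42");
        // Prints http://www.mysite.com/cgi-bin/test.cgi?param1=42
        System.out.println(substitute("http://www.mysite.com/cgi-bin/test.cgi?param1=~#TEST#~", vars));
    }
}
```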

Extractor Patterns tab



This tab holds the various extractor patterns that will be applied to the HTML of this scrapeable file. See the using extractor patterns page for more information.

Scripts tab



Using this tab, scripts can be designated to run either before or after the file is requested. This can be useful for functions like setting session variables and requesting multiple pages of search results. The script to be run is designated under the "Script Name" column. The order in which the scripts are invoked is determined by the "Sequence" column. Indicate the event that should trigger the script using the "When to Run" column. If the checkbox in the "Enabled?" column is not checked, the script will not be run.

Last Request tab



This tab displays the raw HTTP request from the last time this file was retrieved. It can be useful for debugging, since it shows the POST and GET parameters that were sent to the server.

Last Response tab



This tab displays the raw HTTP and HTML from the last time this file was requested. The most common use for this tab is in generating and testing extractor patterns. You can generate an extractor pattern by highlighting a block of text or HTML, right-clicking (control-clicking on Mac OS X) and selecting "Generate extractor pattern from selected text".

The "Render HTML"/"View Source" button allows you to toggle between a rendered version of the page and the raw HTML source. In certain cases the HTML may contain embedded JavaScript and complex DHTML that screen-scraper has difficulty rendering. You can also use the "Display Response in Browser" button to display the web page in your default web browser.

Note that the contents shown under the "Last Response" tab might differ from the original HTML of the page. screen-scraper has the ability to "tidy" the HTML, which can facilitate data extraction. See using extractor patterns for more details on tidying HTML.

When viewed as text, the HTML for the last response can be searched using the "Find..." button.

Advanced tab (professional and enterprise editions only)



This tab contains a few advanced settings.

  • Username and Password: These two text fields are used with sites that make use of Basic, Digest, or NTLM authentication. You can generally recognize when a web site requires this type of authentication because, after requesting the page, a small box will pop up requesting a username and password.
  • Tidy HTML after scraping?: When this box is checked screen-scraper will "tidy" the HTML after requesting the file. This cleans up the HTML, which facilitates extracting data from it. Note that a performance hit is incurred, however, when tidying is done. In cases where performance is critical this box should be un-checked.


Using Extractor Patterns

Overview

Extractor patterns allow you to pinpoint select snippets of data that you want extracted from a web page. They're often the most confusing part of screen-scraper, so you'll want to look over this page carefully. An extractor pattern is a block of text (usually HTML) that contains special tokens that will match pieces of data you're interested in extracting. These tokens are text labels surrounded by the delimiters ~@ and @~. The label between the delimiters should contain only alpha-numeric characters and underscores.

You can think of an extractor pattern like a stencil. A stencil is an image in cut-out form, often made of thin cardboard. As you place a stencil over a piece of paper, apply paint to it, then remove the stencil, the paint remains only where there were holes in the stencil. Analogously, you can think of placing an extractor pattern over the HTML of a web page where the tokens correspond to the holes where the paint would pass through. After an extractor pattern is applied it reveals the portions of the web page you'd like to extract.

Extractor pattern tokens designate regions where data elements are to be captured. For example, given the following HTML snippet:

<p>This is the <b>piece of text</b> I'm interested in.</p>

you would extract "piece of text" by creating an extractor pattern with a token positioned like so:

<p>This is the <b>~@EXTRACTED_TEXT@~</b> I'm interested in.</p>

The extracted text could then be accessed via the identifier "EXTRACTED_TEXT".
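
One way to picture what happens when the pattern is applied is to translate it into an ordinary regular expression: the text outside the tokens must match literally, and each token becomes a capturing group. The sketch below does this in plain Java. It is an illustration under stated assumptions, not screen-scraper's own implementation: the class and method names are hypothetical, each token is assumed to match lazily when no regular expression has been assigned to it, and white space is matched exactly.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: treats an extractor pattern as a regular expression.
final class ExtractorPatternSketch {

    private static final Pattern TOKEN = Pattern.compile("~@(\\w+)@~");

    // Returns one record (token identifier -> extracted text) per match in the HTML.
    static List<Map<String, String>> apply(String extractorPattern, String html) {
        List<String> identifiers = new ArrayList<>();
        StringBuilder regex = new StringBuilder();
        Matcher token = TOKEN.matcher(extractorPattern);
        int last = 0;
        while (token.find()) {
            // Text between tokens has to match literally.
            regex.append(Pattern.quote(extractorPattern.substring(last, token.start())));
            // Assumption: a token without its own regular expression matches lazily.
            regex.append("(.*?)");
            identifiers.add(token.group(1));
            last = token.end();
        }
        regex.append(Pattern.quote(extractorPattern.substring(last)));

        List<Map<String, String>> records = new ArrayList<>();
        Matcher match = Pattern.compile(regex.toString(), Pattern.DOTALL).matcher(html);
        while (match.find()) {
            Map<String, String> record = new LinkedHashMap<>();
            for (int i = 0; i < identifiers.size(); i++) {
                record.put(identifiers.get(i), match.group(i + 1));
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        String html = "<p>This is the <b>piece of text</b> I'm interested in.</p>";
        String pattern = "<p>This is the <b>~@EXTRACTED_TEXT@~</b> I'm interested in.</p>";
        // Prints [{EXTRACTED_TEXT=piece of text}]
        System.out.println(apply(pattern, html));
    }
}
```

Capture groups are tracked by position rather than by name because Java's named groups do not allow underscores, which token identifiers commonly contain.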

If you haven't done so already, we'd recommend going through our first tutorial to get a better feel for using extractor patterns.

Managing extractor patterns



Any number of extractor patterns can be associated with a given scrapeable file. They are managed by clicking on a scrapeable file, then on the "Extractor Patterns" tab. Add an extractor pattern by clicking on the "Add Extractor Pattern" button. The patterns will be applied to the file in a designated sequence, and any number of tokens can appear within an extractor pattern.

The recommended way to create a token is to simply select a region of text in an extractor pattern, right-click (or control-click in Mac OS X) the selected region and select "Generate extractor pattern token from selected text" in the pop-up menu (note that this can be done either when viewing the extractor pattern as HTML or as plain text). Creating an extractor pattern token in this manner will open a window that allows you to edit the attributes of the token. It's also often helpful to use an external text editor when creating extractor patterns where you can store snippets of HTML you're working with. You can then copy text into screen-scraper, as needed.

If an extractor pattern takes too long to match a block of text it will time out. The timeout setting may be adjusted from the "Settings" window (click on the Options->Settings menu item) under the "General" tab. If you find that your extractor pattern is timing out you might try adjusting it by using more precise regular expressions. The tips at the bottom of this page might also help.

Note that when creating extractor patterns you should use the HTML that will be found under the "Last Response" tab associated with a scrapeable file. By default, screen-scraper will "tidy" the HTML once it's been scraped, meaning that it will format it in a consistent way that makes it easier to work with. If you use the HTML by viewing the source for a page in your web browser it will likely be different from the HTML that screen-scraper generates.

Main tab



The main tab allows you to edit the primary attributes of the pattern, and contains the following elements:

  • Delete Extractor Pattern: Deletes the current extractor pattern.

  • Apply Pattern to Last Scraped Data: It's often helpful to test out your extractor pattern to ensure that it's doing what you expect. To test your extractor pattern just click on the "Apply Pattern to Last Scraped Data" button. This will pop up a new window with the results of the match. Depending on how many times your pattern matches in the text of the last response (the HTML that appears under the "Last Response" tab), you should notice one or more data sets. This is the information that will be available from a script using the dataRecord and dataSet variables (see using scripts for details).
  • Copy Pattern: (enterprise edition only) Copies the extractor pattern so that it can be added somewhere else.
  • Identifier: This is a string that will be used to identify the extractor pattern and the data set it generates. You should use only alphanumeric characters and underscores here.
  • Sequence: Determines the order in which the extractor patterns will be applied to the HTML.

  • Pattern text: Used to hold the text for the extractor pattern. This will also include the extractor pattern tokens that are analogous to the holes in the stencil.
  • Scripts: This table allows you to indicate scripts that should be run as the extractor pattern finds matches. Much like other programming languages, screen-scraper can invoke code based on specified events. In this case, you can invoke scripts before the pattern is applied, after each match it finds, or after all matches have been made. For example, if your pattern finds 10 matches, and you designate a script to be run "After each pattern application", that script will get invoked 10 separate times. See using scripts for more details.

Sub-extractor patterns tab



This tab allows you to add, edit, delete, and test sub-extractor patterns. See the "Using sub-extractor patterns" section below for more on this.

Advanced tab (professional and enterprise editions only)



The advanced tab provides extended control over extractor patterns, described below.

  • Automatically save the data set generated by this extractor pattern in a session variable: If this box is checked screen-scraper will place the DataSet object generated when this extractor pattern is applied into a session variable using the identifier as the key. For example, if your extractor pattern were named "PRODUCTS", and you checked this box, screen-scraper would apply the pattern and place the resulting DataSet into a session variable named "PRODUCTS" to be used later on.
  • Filter duplicate records: When this box and the "Cache the data set" box are checked screen-scraper will filter duplicates from extracted records. See the "Filtering duplicates" section below for more details.
  • Cache the data set: In some cases you'll want to store extracted data in a session variable, but the data set will potentially grow to be very large. The "Cache the data set" checkbox will cause the extracted data to be written out to the file system as it's being extracted so that it doesn't consume RAM. When you attempt to access the data set from a script or external code it will be read from the disk into RAM temporarily so that it can be used. You'll also need to check this box if you want to filter duplicates.
  • This extractor pattern will be invoked manually from a script: If you check this box the extractor pattern will not be invoked automatically by screen-scraper. Instead, you'll invoke it in a script using the "extractData" and "extractOneValue" methods described on the using scripts page.

Extractor pattern tokens

Extractor pattern tokens can be edited by double-clicking them, or by selecting the label between the ~@ @~ delimiters, then right-clicking (control-clicking on Mac OS X), and selecting "Edit extractor pattern token". This will display a small dialog box with a tabbed pane. Each pane is described below.

Extractor pattern tokens "General" tab



  • Identifier: This is a string that will be used to identify the piece of data that gets extracted as a result of this token. You should use only alphanumeric characters and underscores here.
  • Save in session variable? Checking this box causes the value extracted by the token to be saved in a session variable using the token's identifier. See using session variables for more information.

Extractor pattern tokens "Regular Expression" tab



Here you can designate a regular expression that will be used to match the text covered by this token. You can either enter one in the text box, or select one from the drop-down list. The regular expressions that appear in the drop-down list can be edited by selecting "Edit regular expressions" from the "Options" menu.

Extractor pattern tokens "Mapping" tab (enterprise edition only)



The mapping tab allows you to alter extracted values. Often once you extract data from a web page you need to put it into a consistent format. For example, you may want products with very similar names to have identical names.

screen-scraper makes use of mapping sets when determining how to map a given extracted value. A mapping set may contain any number of mappings, which screen-scraper will analyze in sequence until it finds a match, or runs out of mappings. As such, you'll often want to put more specific mappings higher in sequence than more general mappings.

The various columns in a mapping are defined below:

  • From: The value screen-scraper should match.
  • To: Once a match is found, indicates the new value the extracted data will assume.
  • Type: Determines the type of match that should be made in working with the value in the "From" field. The "Equals" option will match if an exact match is found, the "Contains" value will match if the value contains the text in the "From" field, and the "Regular Expression" type uses the "From" value as a regular expression to attempt to find a match.
  • Case Sensitive?: Indicates whether or not the match should be case sensitive.
  • Sequence: Determines the sequence in which the particular mapping should be analyzed.

You can create a new mapping set by simply typing a name into the "Set" box. Sets can be deleted via the "Delete Set" button, and an individual mapping can be added by clicking the "Add Mapping" button. Individual mappings can be deleted by selecting them, then right-clicking (control-clicking on Mac OS X) and selecting "Delete".

Consider the mapping set in the screen-shot of the "Mapping" tab above, in which the first mapping has a "From" value of Widget 1 with the "Equals" type, and the second has a "From" value of Widget with the "Contains" type and a "To" value of Product ABC. If the extracted value were Widget 123, screen-scraper would first try to match using the Widget 1 mapping. Because this is an "Equals" match the mapping wouldn't apply, so screen-scraper would proceed to the second mapping. The second mapping would match because a "Contains" type was designated. That is, the text Widget 123 contains the text Widget. As such, the extracted data Widget 123 would become Product ABC, because that is the "To" value designated for the second mapping.

When using regular expressions in your mapping you can also make use of back references. Back references allow you to preserve values in the original text when mapped to the "To" value. For example, if you were mapping the value Widget 123 you could use the regular expression Widget (\d*). In the "To" column you could then enter the value Product \1, which, when mapped, would convert Widget 123 to Product 123. The value in parentheses in the "From" column gets inserted via the \1 marker found in the "To" column.
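
The way a mapping set is walked can be sketched in a few lines of plain Java. This is an illustration only, not screen-scraper's own code. The class, enum, and method names are hypothetical; the "Case Sensitive?" column is left out for brevity; the first mapping's "To" value (Product 1) is invented for the example; and two behaviors are assumptions: a regular expression mapping replaces the whole value with the expanded "To" text, and a value that matches no mapping is returned unchanged.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: walks a mapping set in sequence; the first match wins.
final class MappingSketch {

    enum Type { EQUALS, CONTAINS, REGEX }

    static final class Mapping {
        final String from;
        final String to;
        final Type type;

        Mapping(String from, String to, Type type) {
            this.from = from;
            this.to = to;
            this.type = type;
        }
    }

    // A back reference in the "To" column: a backslash followed by one digit.
    private static final Pattern BACK_REFERENCE = Pattern.compile("\\\\(\\d)");

    static String map(String value, List<Mapping> mappingSet) {
        for (Mapping mapping : mappingSet) {
            if (mapping.type == Type.EQUALS && value.equals(mapping.from)) {
                return mapping.to;
            }
            if (mapping.type == Type.CONTAINS && value.contains(mapping.from)) {
                return mapping.to;
            }
            if (mapping.type == Type.REGEX) {
                Matcher match = Pattern.compile(mapping.from).matcher(value);
                if (match.find()) {
                    return expandBackReferences(mapping.to, match);
                }
            }
        }
        return value; // assumption: unmapped values pass through unchanged
    }

    // Replaces \1, \2, ... in the "To" text with the corresponding captured groups.
    private static String expandBackReferences(String to, Matcher match) {
        Matcher reference = BACK_REFERENCE.matcher(to);
        StringBuilder result = new StringBuilder();
        while (reference.find()) {
            String captured = match.group(Integer.parseInt(reference.group(1)));
            reference.appendReplacement(result, Matcher.quoteReplacement(captured == null ? "" : captured));
        }
        reference.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        List<Mapping> set = List.of(
            new Mapping("Widget 1", "Product 1", Type.EQUALS),     // "To" value is hypothetical
            new Mapping("Widget", "Product ABC", Type.CONTAINS));
        System.out.println(map("Widget 123", set)); // Product ABC

        List<Mapping> regexSet = List.of(new Mapping("Widget (\\d*)", "Product \\1", Type.REGEX));
        System.out.println(map("Widget 123", regexSet)); // Product 123
    }
}
```

Because the loop returns on the first mapping that matches, placing more specific mappings earlier in the sequence keeps a general "Contains" mapping from shadowing them.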

Extractor pattern tokens "Advanced" tab (enterprise edition only)



  • Use to filter duplicates: Indicates that this token should be used when filtering duplicates. See the "Filtering duplicates" section below for more details.
  • Strip HTML: Check this box if you'd like screen-scraper to pull out any HTML tags from the extracted value.
  • Resolve relative URL to absolute URL: If checked, this will resolve a relative URL (e.g., /myimage.gif) into an absolute URL (e.g., http://www.mysite.com/myimage.gif).
  • Convert HTML entities: This will cause any HTML entities to be converted into plain text (e.g., it will convert &amp; into &).

Filtering duplicates (professional and enterprise editions only)

When extracting records from web sites you'll often want to filter out duplicates. screen-scraper provides a method whereby this can be done automatically. To filter duplicates for data extracted by a given extractor pattern, you'll want to go to the "Advanced" tab, then check the boxes labeled "Automatically save the data set generated by this extractor pattern in a session variable", "Filter duplicate records", and "Cache the data set". This will cause screen-scraper to generate a session variable with the same name as the extractor pattern identifier, and will save any records extracted by the pattern to the file system rather than saving them in memory.

Once you've set up the extractor pattern to cache and save the data set, you'll need to designate the fields that would identify a unique record. That is, when filtering duplicates screen-scraper will compare the values for designated columns in order to determine if a duplicate record already exists (more or less like a database compound key). You designate an extractor pattern token to be used in determining uniqueness by editing it, and checking the "Use to filter duplicates" box found under the "Advanced" tab.

Because screen-scraper filters duplicates as it's scraping you'll want to wait until the end to make use of the data. For example, if you want all of the data written to a .CSV file you would want to invoke the script that does that after the scraping session has ended. That way you can guarantee that all of the data has been extracted and filtered before you save it.
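
The compound-key comparison described in this section can be sketched in plain Java as follows. This is only an illustration of the idea, not screen-scraper's own implementation: the class and method names and the sample records are hypothetical, the records are held in memory rather than cached on disk, and keeping the first occurrence of each duplicate is an assumption.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustration only: filters records by a compound key built from chosen tokens.
final class DuplicateFilterSketch {

    // keyTokens plays the role of the tokens marked "Use to filter duplicates".
    static List<Map<String, String>> filter(List<Map<String, String>> records, List<String> keyTokens) {
        Set<List<String>> seenKeys = new HashSet<>();
        List<Map<String, String>> unique = new ArrayList<>();
        for (Map<String, String> record : records) {
            List<String> key = new ArrayList<>();
            for (String token : keyTokens) {
                key.add(record.get(token));
            }
            // add() returns false when an equal key was already seen.
            if (seenKeys.add(key)) {
                unique.add(record);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<Map<String, String>> records = List.of(
            Map.of("NAME", "Widget", "PRICE", "1.00"),
            Map.of("NAME", "Widget", "PRICE", "2.00"),
            Map.of("NAME", "Gadget", "PRICE", "1.00"));

        System.out.println(filter(records, List.of("NAME")).size());          // 2
        System.out.println(filter(records, List.of("NAME", "PRICE")).size()); // 3
    }
}
```

The two calls show why the choice of tokens matters: keyed on NAME alone the second Widget is dropped, while keyed on NAME and PRICE together all three records count as distinct.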

Using sub-extractor patterns

Sub-extractor patterns allow you to extract data in smaller pieces, providing significantly more flexibility in pinpointing the specific pieces you're after. Consider a search results page consisting of rows and columns of data. Using normal extractor patterns you would use a single pattern to extract the data from all columns for a single row. In many cases this works just fine; however, the process gets more complicated when each row differs significantly. For example, certain cells may be in different colors or their contents may be completely missing. With a normal extractor pattern it would be difficult to account for the variability in the cells. By using sub-extractor patterns you could create a normal extractor pattern to extract an entire row, then use individual sub-extractor patterns to pull out the individual cells.

Consider the following HTML table:

Name          Phone                       Address
Juan Ferrero  111-222-3333                123 Elm St.
Joe Bloggs    No contact information available
Sherry Lloyd  234-5678 (needs area code)  456 Maple Rd.

Here is the corresponding HTML source:

<table cellpadding="2" border="1">
<tr><th>Name</th><th>Phone</th><th>Address</th></tr>
<tr><td class="Name">Juan Ferrero</td><td class="Phone">111-222-3333</td><td class="Address">123 Elm St.</td></tr>
<tr><td class="Name" bgcolor="red">Joe Bloggs</td><td colspan="2">No contact information available</td></tr>
<tr><td class="Name">Sherry Lloyd</td><td class="Phone" bgcolor="yellow">234-5678 (needs area code)</td><td class="Address">456 Maple Rd.</td></tr>
</table>

It would be difficult (if not impossible) to write a single extractor pattern that would extract the information for each row because the contents of the cells differ so significantly. The different colored cells and the cell spanning two columns make the data too inconsistent to be extracted using a single pattern.

Consider this extractor pattern:

<tr><td~@DATARECORD@~/td></tr>

If applied to the HTML above the extractor pattern would produce the following three matches:

1.  class="Name">Juan Ferrero</td><td class="Phone">111-222-3333</td><td class="Address">123 Elm St.<
2.  class="Name" bgcolor="red">Joe Bloggs</td><td colspan="2">No contact information available<
3.  class="Name">Sherry Lloyd</td><td class="Phone" bgcolor="yellow">234-5678 (needs area code)</td><td class="Address">456 Maple Rd.<

Sub-extractor patterns would allow you to extract individual pieces of information from each row. For example, consider this sub-extractor pattern:

Name">~@NAME@~</td>

If applied to each of the individual extracted rows above the following three pieces of information would be extracted:

1.  Juan Ferrero
2. 
3.  Sherry Lloyd

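Note that "Joe Bloggs" didn't get extracted because the cell his name is in is red: its tag carries an extra bgcolor attribute, so the literal text Name"> never appears in that row. Let's adjust the sub-extractor pattern slightly: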

Name"~@nonhtml@~>~@NAME@~</td>

The ~@nonhtml@~ token represents an extractor pattern token that uses the "Non-HTML tags" regular expression: [^<>]*. It matches any run of characters, starting from where the token sits, up to the next opening or closing HTML bracket. In this particular case it absorbs any extra attributes, such as bgcolor="red", so all three names get extracted. To extract the phone number you'd use this sub-extractor pattern:

<td class="Phone"~@nonhtml@~>~@PHONE@~</td>

We have the case, however, of the cell in the second row that spans two columns, which would not get extracted by the sub-extractor pattern. We may still want this information, however, so we create the following sub-extractor pattern, just in case the cell exists:

<td colspan="2">~@PHONE@~<

If applied to our data we'd get the following results:

1. 
2. No contact information available
3.

Sub-extractor patterns aggregate everything that's extracted into a single data set. Using all of our extractor and sub-extractor patterns together we'd get the following data set:

Data record #   Name          Phone
Data record #1  Juan Ferrero  111-222-3333
Data record #2  Joe Bloggs    No contact information available
Data record #3  Sherry Lloyd  234-5678 (needs area code)

There are a few important things to note about sub-extractor patterns:

  • Note that we had two sub-extractor patterns holding a token with the same name (PHONE). If a sub-extractor pattern doesn't match anything it simply has no effect, which allows another sub-extractor pattern to match something instead. The sub-extractor pattern that matches something will take precedence over those that don't.
  • The ~@DATARECORD@~ extractor pattern token is special in that it defines a block of data that you wish to apply sub-extractor patterns to.
  • When using sub-extractor patterns only the first match will be used. That is, even if a sub-extractor pattern could match multiple times, only the data corresponding to the first match will be extracted.
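
The whole walk-through above can be reproduced with a short plain-Java sketch that applies the row pattern, then runs each sub-extractor pattern against the DATARECORD block and merges the results into one record per row. It is an illustration under stated assumptions, not screen-scraper's own implementation: the class and method names are hypothetical, a token named nonhtml is hard-wired to the [^<>]* expression and left out of the results, every other token is assumed to match lazily, and when two sub-extractor patterns both match the same token the later one overwrites the earlier.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: an extractor pattern plus sub-extractor patterns.
final class SubExtractorSketch {

    private static final Pattern TOKEN = Pattern.compile("~@(\\w+)@~");

    // Turns pattern text into a regex; `names` receives one entry per capture group.
    static Pattern compile(String patternText, List<String> names) {
        StringBuilder regex = new StringBuilder();
        Matcher token = TOKEN.matcher(patternText);
        int last = 0;
        while (token.find()) {
            regex.append(Pattern.quote(patternText.substring(last, token.start())));
            String name = token.group(1);
            // "nonhtml" stands in for the "Non-HTML tags" expression; others match lazily (assumed).
            regex.append(name.equals("nonhtml") ? "([^<>]*)" : "(.*?)");
            names.add(name);
            last = token.end();
        }
        regex.append(Pattern.quote(patternText.substring(last)));
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    static List<Map<String, String>> extract(String html, String rowPattern, List<String> subPatterns) {
        List<Map<String, String>> dataSet = new ArrayList<>();
        List<String> rowNames = new ArrayList<>();
        Matcher row = compile(rowPattern, rowNames).matcher(html);
        // Falls back to the whole match (group 0) if no DATARECORD token is present.
        int dataRecordGroup = rowNames.indexOf("DATARECORD") + 1;
        while (row.find()) {
            String block = row.group(dataRecordGroup);
            Map<String, String> record = new LinkedHashMap<>();
            for (String subPattern : subPatterns) {
                List<String> names = new ArrayList<>();
                Matcher match = compile(subPattern, names).matcher(block);
                if (match.find()) { // only the first match of a sub-extractor pattern is used
                    for (int i = 0; i < names.size(); i++) {
                        if (!names.get(i).equals("nonhtml")) {
                            record.put(names.get(i), match.group(i + 1));
                        }
                    }
                }
            }
            dataSet.add(record);
        }
        return dataSet;
    }

    public static void main(String[] args) {
        String html = String.join("\n",
            "<table cellpadding=\"2\" border=\"1\">",
            "<tr><th>Name</th><th>Phone</th><th>Address</th></tr>",
            "<tr><td class=\"Name\">Juan Ferrero</td><td class=\"Phone\">111-222-3333</td><td class=\"Address\">123 Elm St.</td></tr>",
            "<tr><td class=\"Name\" bgcolor=\"red\">Joe Bloggs</td><td colspan=\"2\">No contact information available</td></tr>",
            "<tr><td class=\"Name\">Sherry Lloyd</td><td class=\"Phone\" bgcolor=\"yellow\">234-5678 (needs area code)</td><td class=\"Address\">456 Maple Rd.</td></tr>",
            "</table>");
        List<String> subPatterns = List.of(
            "Name\"~@nonhtml@~>~@NAME@~</td>",
            "<td class=\"Phone\"~@nonhtml@~>~@PHONE@~</td>",
            "<td colspan=\"2\">~@PHONE@~<");
        for (Map<String, String> record : extract(html, "<tr><td~@DATARECORD@~/td></tr>", subPatterns)) {
            System.out.println(record);
        }
    }
}
```

Run against the sample table, this prints three records whose NAME and PHONE values match the data set shown above; the header row is skipped because it starts with a th cell rather than a td cell.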

Tips on using extractor patterns

  • Test your patterns frequently. Extractor patterns take some practice. Especially when you're first trying them out, you'll want to test them as you work. It often helps to test a pattern after every couple of tokens you insert.

  • Use regular expressions to make your extractor patterns more precise. One of the most common problems encountered occurs when an extractor pattern matches too much data, which usually includes a lot of HTML. There are a couple of ways to address this problem. One is to extend the pattern outward. That is, include HTML that falls before and after the block you're trying to match. The second approach, which is generally the easier of the two, is to include regular expressions. We've included a number of common regular expressions that you can select from the drop-down list.
  • Ensure that the pattern extracts the number of data sets you expect it to. Oftentimes your pattern might not be as flexible as you think it is. Test it out to make sure it matches as many times as you think it should.

  • Try tidying the HTML. This will ensure that white space is handled consistently and will often clean up extraneous characters. The setting that determines whether or not HTML gets tidied is adjusted under the "General" tab of the settings window (click on Options->Settings from the menu).


Using Scripts

Overview

screen-scraper has a built-in scripting engine to facilitate dynamically scraping sites and working with data once it's been extracted. Depending on your needs scripts can be helpful for such things as interacting with databases and dynamically determining which files get scraped when.

Invoking scripts in screen-scraper is similar to other programming languages in that they're tied to events. Just as you might designate a block of code to be run when a button is clicked in Visual Basic, in screen-scraper you might run a script after an HTML file has been downloaded or data has been extracted from a page.

Depending on your preferences, there are a number of languages that scripts can be written in. screen-scraper supports JavaScript, Interpreted Java, and Python on any platform, and JScript, VBScript, and Perl when running on Windows. Try the links at the bottom of this screen for information specific to each of the scripting languages.

If you haven't done so already, we'd highly recommend taking some time to go through our tutorials in order to get more familiar with how scripts are used.

Managing scripts

Scripts are added by clicking the "New Script" button (looks like a pencil and paper) or by selecting "File->New Script" from the menu bar. Delete a script either by selecting it and pressing the "Delete" key or by right-clicking it (or control-clicking on Mac OS X) and selecting "Delete".

Each script is given a unique name so that you can easily indicate when it should be invoked (e.g. before a scraping session begins or after each application of an extractor pattern). You can also select the language the script is written in. Scripts can be exported to an XML file so that they can be backed up or transferred to other instances of screen-scraper. See the Importing and exporting objects page for more information on this. Clicking on the "Show Script Instances" button will display any locations where this script is invoked in the format scraping session: scrapeable file: extractor pattern.

Finally, you're given a text box in which to write your script. The text editing features for authoring scripts in screen-scraper are currently fairly limited, so you may want to consider using an external editor, then copying and pasting text into screen-scraper.

Using scripts

You designate a script to be executed by associating it with some event. For example, if you click on a scraping session in the tree, then on the "Scripts" tab, you'll notice that you can designate scripts to be invoked either before a scraping session begins or after it completes. Other events that can be used to invoke scripts relate to scrapeable files and extractor patterns. After associating a script with an object in this way it can be disassociated by selecting it in the table and pressing the "Delete" key or by right-clicking it (or control-clicking on Mac OS X) and selecting "Delete". You can also selectively enable and disable scripts using the "Enabled?" checkbox in the rightmost column.

Working with external Java libraries

Existing Java code can be referred to from within scripts. Simply copy any jar files you'd like to reference from within scripts into the "lib\ext" folder found in screen-scraper's directory. Note that you'll still need to use the "import" statement within your scripts to refer to specific classes, like this:

import com.foo.bar.*;

Please note: screen-scraper 4.0 was built on a Java 1.5 platform. Your Java scripts must be compatible with at least a version 1.5 JRE in order to compile and run properly.

Built-in objects

screen-scraper offers a few objects that you can work with in a script. Bear in mind that not all of these variables will be available in all scripts. See the Variable scope section (following this one) for more details. You can view details on all of the objects and their methods in our API Documentation.

Variable scope

Depending on when a script gets run different variables may be in scope. When associating a script with an object, such as a scraping session or scrapeable file, you're asked to specify when the script is to be run. The table that follows specifies what variables will be in scope depending on when a given script is run. Note that none of the variables will be in scope when a script is invoked directly, though it is common in these scripts to create RunnableScrapingSession objects.

When Script is Run              session in scope  scrapeableFile in scope  dataSet in scope  dataRecord in scope
Before scraping session begins  X                 -                        -                 -
After scraping session ends     X                 -                        -                 -
Before file is scraped          X                 X                        -                 -
After file is scraped           X                 X                        -                 -
Before pattern is applied       X                 X                        -                 -
After pattern is applied        X                 X                        X                 -
After each pattern application  X                 X                        X                 X
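
For quick reference while writing scripts, the same information can be kept as a small lookup. This is only a sketch transcribed from the table: the names SCOPE and in_scope are hypothetical helpers for illustration and are not part of screen-scraper.

```python
# Variables in scope for each point at which a script can be run,
# transcribed from the table above.
SCOPE = {
    "Before scraping session begins": {"session"},
    "After scraping session ends": {"session"},
    "Before file is scraped": {"session", "scrapeableFile"},
    "After file is scraped": {"session", "scrapeableFile"},
    "Before pattern is applied": {"session", "scrapeableFile"},
    "After pattern is applied": {"session", "scrapeableFile", "dataSet"},
    "After each pattern application": {
        "session", "scrapeableFile", "dataSet", "dataRecord",
    },
}


def in_scope(when, variable):
    """Return True if `variable` is in scope for a script run at `when`.

    Unknown values of `when` (for example a script invoked directly)
    yield False for every variable.
    """
    return variable in SCOPE.get(when, set())


print(in_scope("After pattern is applied", "dataSet"))     # True
print(in_scope("After pattern is applied", "dataRecord"))  # False
```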

Debugging scripts

One of the best ways to fix errors is to simply watch the scraping session log (under the "Log" tab) and the "error.log" file (located in the "log" directory where screen-scraper was installed) for script errors. When a problem arises in executing a script screen-scraper will output a series of error-related statements to the logs. Often a good approach in debugging is to build your script bit by bit, running it frequently to ensure that it runs without errors as you add each piece.

When screen-scraper is running as a server it will automatically generate individual log files in the "log" directory for each running scraping session (this can be disabled in the settings window). An "error.log" file will also be generated in that same directory when internal screen-scraper errors occur.

The "Breakpoint" window can also be invaluable in debugging scripts. You can invoke it by inserting the line session.breakpoint() into your script. While the "Breakpoint" window is displayed script execution will halt. There are two buttons along the top of the window. The "play" button will simply continue execution of your script. Clicking the "stop" button will cause screen-scraper to halt execution as soon as it can. The "Breakpoint" window also exposes any session variables, data sets, and data records that are in scope. These values can be altered in the "Breakpoint" window as well.
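
As a rough illustration of this approach, the sketch below shows a Python script that logs as it goes and pauses only when something looks wrong. So that it can run outside screen-scraper, it defines a hypothetical StandInSession class and a plain dictionary in place of the built-in session and dataRecord objects. Inside screen-scraper those stand-ins would be dropped, and the exact behavior of session.log and dataRecord.get should be checked against the API Documentation.

```python
class StandInSession:
    """Hypothetical stand-in for the built-in session object."""

    def __init__(self):
        self.messages = []
        self.breakpoints_hit = 0

    def log(self, message):
        # The real object would write to the scraping session log.
        self.messages.append(str(message))
        print(message)

    def breakpoint(self):
        # In the workbench this would open the "Breakpoint" window and
        # halt the script; here the call is only counted.
        self.breakpoints_hit += 1


# Stand-ins for the objects that are in scope when a script runs
# "After each pattern application". The record is made-up sample data.
session = StandInSession()
dataRecord = {"NAME": "Joe Bloggs"}

# --- Body of the script as it might look inside screen-scraper ---
session.log("Extracted name: " + str(dataRecord.get("NAME")))

if dataRecord.get("PHONE") is None:
    # Pause only in the suspicious case so normal records run through.
    session.log("No phone number extracted; pausing to inspect.")
    session.breakpoint()
```

Adding one log line per step in this way makes it easier to see in the log which piece of a script failed.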

