The Scraping Engine
An Overview of the Scraping Engine
Purpose
The screen-scraper application contains a scraping engine, which is intended to provide an intuitive and convenient way to set up specific web pages to have information scraped from them. Many of the details related to screen-scraping that are typically done in code can be handled through a graphical user interface (called the "workbench", in screen-scraper).
Basic concepts
There are several basic elements used by screen-scraper in extracting data from web sites. The first is a scraping session, which consists of a series of files, called scrapeable files, that screen-scraper will request in a designated sequence. A common example might be a site that requires authentication before the data to be extracted can be accessed. The first file, or HTTP request, might be to a server-side script that handles a user's login. It might be necessary to follow a few links, which would involve creating more scrapeable files, until the page that contains the desired data can be requested. Any number of parameters can be associated with a scrapeable file. These would be GET or POST parameters, or authentication credentials (Basic, Digest, or NTLM), that need to be sent when the file is requested. For each scrapeable file that's requested, any number of extractor patterns can be applied to the text retrieved from the page in order to extract the desired pieces. Throughout this process scripts can be invoked to perform tasks such as inserting extracted data into a database or requesting subsequent scrapeable files. As a scraping session runs, screen-scraper logs the activity and records the request and response for each scrapeable file that gets requested.
Using Scraping Sessions
Overview
A scraping session is simply a way to collect together files that you want scraped. Typically you'll create a scraping session for each site you want to scrape information from.
You can create a new scraping session by clicking the New Scraping Session button (looks like a gear) or by selecting "File->New Scraping Session" from the menu.
General tab

The "General" tab allows you to manage basic actions and information related to the scraping session.
Delete: Deletes the scraping session.
Scripts tab

Using this tab scripts can be designated to run either before or after the scraping session runs. This can be useful for functions like initializing session variables and performing clean-up after the session is finished. The script to be run is designated under the "Script Name" column. The sequence the scripts should be invoked in is determined by the "Sequence" column. Indicate the event that should trigger the script using the "When to Run" column. If the checkbox in the "Enabled?" column is not checked the script will not get run.
Log tab

The "Log" tab displays messages as the scraping session is running. This is one of the most valuable tools in working with and debugging scraping sessions. As you're creating your scraping session you'll want to run it frequently and check the log to ensure that it's doing what you expect it to.
Advanced tab

This tab contains a number of settings that may be required when working with certain sites.
Anonymization tab

See the Anonymization page of the documentation for details on this pane.
Running Scraping Sessions Within Scraping Sessions (enterprise edition only)
It is also possible to run a scraping session within a scraping session that is already running via the RunnableScrapingSession class. Detailed documentation on the methods available for the RunnableScrapingSession class is in our API documentation. Here's a specific example of how the RunnableScrapingSession might be used in a screen-scraper script:
// Generate a new RunnableScrapingSession object that will inherit
// from the current scraping session. This object will be used
// to run the scraping session "My Session".
myRunnableScrapingSession = new com.screenscraper.scraper.RunnableScrapingSession( "My Session", session );
// Because we passed the "session" object to the RunnableScrapingSession
// it will have access to all of the session variables within the
// currently running session. As such, there's no need to explicitly
// set any new session variables. We simply tell it to scrape.
myRunnableScrapingSession.scrape();
// Once it's done scraping, because it inherited from our currently
// running scraping session, we have access to any session variables
// that were set when the RunnableScrapingSession ran in the context
// of our currently running scraping session. For example, let's
// suppose that when the RunnableScrapingSession ran it set a new
// variable called "MY_VAR". Because of the inheritance, we could
// do something like this to see the new value:
session.log( "MY_VAR: " + session.getVariable( "MY_VAR" ) );
Using Scrapeable Files
Overview
A scrapeable file is a URL-accessible file that you want to have retrieved as part of a scraping session. These files are the core of screen-scraping as they determine what information will be made available to extract data from.
Scrapeable files are created by clicking the "Add Scrapeable File" button from the "General" tab for a scraping session. You can delete a scrapeable file by right-clicking (or control-clicking in Mac OS X) it in the tree on the left side of the screen and selecting "Delete".
In addition to working with files on remote servers, screen-scraper can also handle files on local file systems. For example, the following is a valid path to designate in the URL field: C:\wwwroot\myweb\my_file.htm.
Properties tab

The "Properties" tab defines basic settings needed to request a file.
Name: Identifies the scrapeable file.
Parameters tab

"Get" and "Post" Parameters
The "Parameters" tab indicates GET and POST parameters that should be sent when the file is requested. Note that GET parameters can also be embedded in the "URL" field under the "Properties" tab. Parameters are added using the "Add Parameter" button. They can be deleted by selecting them and either hitting the "Delete" key on the keyboard, or by right-clicking (control-clicking in Mac OS X) and selecting "Delete".
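To illustrate the note about embedding GET parameters in the URL, here is a minimal sketch of how key/value pairs end up as a query string. The class name, method names, and the example.com URL are hypothetical; this is standard URL encoding, not screen-scraper's own code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: shows how GET parameters map onto a URL query string.
class GetParameterSketch {
    // URL-encodes a single key or value as UTF-8.
    static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Joins the parameters as key=value pairs separated by '&'.
    static String buildQuery(Map<String, String> params) {
        StringBuilder query = new StringBuilder();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            if (query.length() > 0) {
                query.append('&');
            }
            query.append(encode(entry.getKey())).append('=').append(encode(entry.getValue()));
        }
        return query.toString();
    }

    // Appends the parameters to a URL, respecting an existing query string.
    static String appendToUrl(String url, Map<String, String> params) {
        if (params.isEmpty()) {
            return url;
        }
        return url + (url.contains("?") ? "&" : "?") + buildQuery(params);
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "red widgets");
        params.put("page", "2");
        System.out.println(appendToUrl("http://www.example.com/search", params));
    }
}
```

Entering the two parameters in the "Parameters" tab with type GET, or typing the resulting URL into the "URL" field, should produce an equivalent request.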
Upload a File
In the Enterprise Edition of screen-scraper you can also designate files to be uploaded. This is done by designating "FILE" as the parameter type. The "Key" column would contain the name of the parameter (as found in the corresponding HTML form), and the value would be the local path to the file you'd like to upload (e.g., C:\myfiles\this_file.txt).
Embed Variables
Embedded session variables can be used in the "Key" and "Value" fields for parameters. For example, if you have a "username" POST parameter you might embed a USERNAME session variable in the "Value" field with the token ~#USERNAME#~. This would cause the value of the "USERNAME" session variable to be substituted in at run time.
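The substitution described above can be pictured with a short sketch. This is not screen-scraper's implementation; the class and method names are hypothetical, and leaving unknown tokens untouched is an assumption made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: mimics the ~#NAME#~ substitution described above.
class EmbeddedVariableSketch {
    // Matches tokens such as ~#USERNAME#~ and captures the variable name.
    private static final Pattern TOKEN = Pattern.compile("~#(\\w+)#~");

    // Replaces each token with the value of the matching session variable.
    // Tokens with no matching variable are left as-is (an assumption).
    static String substitute(String template, Map<String, String> sessionVariables) {
        Matcher matcher = TOKEN.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = sessionVariables.get(matcher.group(1));
            String replacement = (value != null) ? value : matcher.group(0);
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> vars = new HashMap<>();
        vars.put("USERNAME", "jsmith");
        System.out.println(substitute("~#USERNAME#~", vars));
        System.out.println(substitute("user=~#USERNAME#~&page=~#PAGE#~", vars));
    }
}
```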
Extractor Patterns tab

This tab holds the various extractor patterns that will be applied to the HTML of this scrapeable file. See the using extractor patterns page for more information.
Scripts tab

Using this tab scripts can be designated to run either before or after the file is requested. This can be useful for functions like setting session variables and requesting multiple pages of search results. The script to be run is designated under the "Script Name" column. The sequence the scripts should be invoked in is determined by the "Sequence" column. Indicate the event that should trigger the script using the "When to Run" column. If the checkbox in the "Enabled?" column is not checked the script will not get run.
Last Request tab

This tab will display the raw HTTP request for the last time this file was retrieved. This tab can be useful for debugging in looking at POST and GET parameters that were sent to the server.
Last Response tab

This tab displays the raw HTTP and HTML from the last time this file was requested. The most common use for this tab is in generating and testing extractor patterns. You can generate an extractor pattern by highlighting a block of text or HTML, right-clicking (control-clicking on Mac OS X) and selecting "Generate extractor pattern from selected text".
The "Render HTML"/"View Source" button allows you to toggle between a rendered version of the page and the raw HTML source. In certain cases the HTML may contain embedded JavaScript and complex DHTML that screen-scraper has difficulty rendering. You can also use the "Display Response in Browser" button to display the web page in your default web browser.
Note that the contents shown under the "Last Response" tab might appear different from the original HTML of the page. screen-scraper has the ability to "tidy" the HTML, which can facilitate data extraction. See using extractor patterns for more details on tidying HTML.
When viewed as text, the HTML for the last response can be searched using the "Find..." button.
Advanced tab (professional and enterprise editions only)

This tab contains a few advanced settings.
Using Extractor Patterns
Overview
Extractor patterns allow you to pinpoint select snippets of data that you want extracted from a web page. They're often the most confusing part of screen-scraper, so you'll want to look over this page carefully. An extractor pattern is a block of text (usually HTML) that contains special tokens that will match pieces of data you're interested in extracting. These tokens are text labels surrounded by the delimiters ~@ and @~. The label between the delimiters should contain only alpha-numeric characters and underscores.
You can think of an extractor pattern like a stencil. A stencil is an image in cut-out form, often made of thin cardboard. As you place a stencil over a piece of paper, apply paint to it, then remove the stencil, the paint remains only where there were holes in the stencil. Analogously, you can think of placing an extractor pattern over the HTML of a web page where the tokens correspond to the holes where the paint would pass through. After an extractor pattern is applied it reveals the portions of the web page you'd like to extract.
Extractor pattern tokens designate regions where data elements are to be captured. For example, given the following HTML snippet:
<p>This is the <b>piece of text</b> I'm interested in.</p>
you would extract "piece of text" by creating an extractor pattern with a token positioned like so:
<p>This is the <b>~@EXTRACTED_TEXT@~</b> I'm interested in.</p>
The extracted text could then be accessed via the identifier "EXTRACTED_TEXT".
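One way to picture what happens when this pattern is applied is to treat the literal text as fixed and each token as a wildcard. The sketch below does this with java.util.regex. It is only an illustration under that assumption: the class and method names are hypothetical, and screen-scraper's actual matching may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: turns an extractor pattern into a regular expression.
class ExtractorPatternSketch {
    // Matches tokens such as ~@EXTRACTED_TEXT@~ and captures the label.
    private static final Pattern TOKEN = Pattern.compile("~@(\\w+)@~");

    // Returns the first match as token label -> extracted value,
    // or an empty map when the pattern does not match.
    static Map<String, String> extractFirst(String extractorPattern, String html) {
        List<String> labels = new ArrayList<>();
        StringBuilder regex = new StringBuilder();
        Matcher token = TOKEN.matcher(extractorPattern);
        int last = 0;
        while (token.find()) {
            // Literal text between tokens must match exactly.
            regex.append(Pattern.quote(extractorPattern.substring(last, token.start())));
            // Each token becomes a non-greedy capturing group.
            regex.append("(.*?)");
            labels.add(token.group(1));
            last = token.end();
        }
        regex.append(Pattern.quote(extractorPattern.substring(last)));

        Map<String, String> dataRecord = new LinkedHashMap<>();
        Matcher match = Pattern.compile(regex.toString(), Pattern.DOTALL).matcher(html);
        if (match.find()) {
            for (int i = 0; i < labels.size(); i++) {
                dataRecord.put(labels.get(i), match.group(i + 1));
            }
        }
        return dataRecord;
    }

    public static void main(String[] args) {
        String pattern = "<p>This is the <b>~@EXTRACTED_TEXT@~</b> I'm interested in.</p>";
        String html = "<p>This is the <b>piece of text</b> I'm interested in.</p>";
        System.out.println(extractFirst(pattern, html).get("EXTRACTED_TEXT"));
    }
}
```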
If you haven't done so already, we'd recommend going through our first tutorial to get a better feel for using extractor patterns.
Managing extractor patterns

Any number of extractor patterns can be associated with a given scrapeable file; they are managed by clicking on the scrapeable file, then on the "Extractor Patterns" tab. Add an extractor pattern by clicking on the "Add Extractor Pattern" button. The patterns will be applied to the file in a designated sequence, and any number of tokens can appear within an extractor pattern.
The recommended way to create a token is to simply select a region of text in an extractor pattern, right-click (or control-click in Mac OS X) the selected region and select "Generate extractor pattern token from selected text" in the pop-up menu (note that this can be done either when viewing the extractor pattern as HTML or as plain text). Creating an extractor pattern token in this manner will open a window that allows you to edit the attributes of the token. It's also often helpful to use an external text editor when creating extractor patterns where you can store snippets of HTML you're working with. You can then copy text into screen-scraper, as needed.
If an extractor pattern takes too long to match a block of text it will time out. The timeout setting may be adjusted from the "Settings" window (click on the Options->Settings menu item) under the "General" tab. If you find that your extractor pattern is timing out you might try adjusting it by using more precise regular expressions. The tips at the bottom of this page might also help.
Note that when creating extractor patterns you should use the HTML that will be found under the "Last Response" tab associated with a scrapeable file. By default, screen-scraper will "tidy" the HTML once it's been scraped, meaning that it will format it in a consistent way that makes it easier to work with. If you use the HTML by viewing the source for a page in your web browser it will likely be different from the HTML that screen-scraper generates.
Main tab

The main tab allows you to edit the primary attributes of the pattern, and contains the following elements:
Delete Extractor Pattern: Deletes the current extractor pattern.
dataRecord and dataSet variables (see using scripts for details).
Sequence: Determines the order in which the extractor patterns will be applied to the HTML.
Sub-extractor patterns tab

This tab allows you to add, edit, delete, and test sub-extractor patterns. See the "Using sub-extractor patterns" section below for more on this.
Advanced tab (professional and enterprise editions only)

The advanced tab provides extended control over extractor patterns, described below.
Automatically save the data set generated by this extractor pattern in a session variable: If checked, screen-scraper will place the DataSet object generated when this extractor pattern is applied into a session variable using the identifier as the key. For example, if your extractor pattern were named "PRODUCTS", and you checked this box, screen-scraper would apply the pattern and place the resulting DataSet into a session variable to be used later on.
Extractor pattern tokens
Extractor pattern tokens can be edited by double-clicking them, or by selecting the label between the ~@ @~ delimiters, then right-clicking (control-clicking on Mac OS X), and selecting "Edit extractor pattern token". This will display a small dialog box with a tabbed pane. Each pane is described below.
Extractor pattern tokens "General" tab

Extractor pattern tokens "Regular Expression" tab

Here you can designate a regular expression that will be used to match the text covered by this token. You can either enter one in the text box, or select one from the drop-down list. The regular expressions that appear in the drop-down list can be edited by selecting "Edit regular expressions" from the "Options" menu.
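The effect of giving a token its own regular expression can be seen in a small sketch. It assumes a token is simply a capturing group sitting between two pieces of literal text; the class name, method, and sample HTML are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a token's regular expression narrows what it can match.
class TokenRegexSketch {
    // Matches "before", then the token's regex, then "after",
    // and returns the text the token covered (or null if nothing matched).
    static String extract(String before, String tokenRegex, String after, String html) {
        Pattern pattern = Pattern.compile(
            Pattern.quote(before) + "(" + tokenRegex + ")" + Pattern.quote(after));
        Matcher matcher = pattern.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<td>Price</td><td>42</td>";
        // A permissive token matches the first cell it finds.
        System.out.println(extract("<td>", ".*?", "</td>", html));
        // A digits-only token skips ahead to the numeric cell.
        System.out.println(extract("<td>", "\\d+", "</td>", html));
    }
}
```

A more precise expression also gives the matcher less room to backtrack, which tends to help with patterns that would otherwise run slowly.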
Extractor pattern tokens "Mapping" tab (enterprise edition only)

The mapping tab allows you to alter extracted values. Often once you extract data from a web page you need to put it into a consistent format. For example, you may want products with very similar names to have identical names.
screen-scraper makes use of mapping sets when determining how to map a given extracted value. A mapping set may contain any number of mappings, which screen-scraper will analyze in sequence until it finds a match, or runs out of mappings. As such, you'll often want to put more specific mappings higher in sequence than more general mappings.
The various columns in a mapping are defined below:
To: Once a match is found, indicates the new value the extracted data will assume.
You can create a new mapping set by simply typing a name into the "Set" box. Sets can be deleted via the "Delete Set" button, and an individual mapping can be added by clicking the "Add Mapping" button. Individual mappings can be deleted by selecting them, then right-clicking (control-clicking on Mac OS X) and selecting "Delete".
Consider the screen-shot of the "Mapping" tab above. If the extracted value were Widget 123 screen-scraper would first try to match using the Widget 1 mapping. Because this is an "Equals" match the mapping wouldn't occur, so screen-scraper would proceed to the second mapping. The second mapping would match because a "Contains" type was designated. That is, the text Widget 123 contains the text Widget. As such, the extracted data Widget 123 would become Product ABC, because that is the "To" value designated for the second mapping.
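The walk through the mapping set described above can be sketched as follows. The class and its members are hypothetical, "Product 1" is an invented "To" value for the first mapping, and returning the original value when nothing matches is an assumption.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative model of a mapping set; not screen-scraper's implementation.
class MappingSetSketch {
    enum MatchType { EQUALS, CONTAINS }

    static final class Mapping {
        final String from;
        final MatchType type;
        final String to;

        Mapping(String from, MatchType type, String to) {
            this.from = from;
            this.type = type;
            this.to = to;
        }

        boolean matches(String value) {
            return type == MatchType.EQUALS ? value.equals(from) : value.contains(from);
        }
    }

    // Mirrors the two mappings discussed above, most specific first.
    static List<Mapping> exampleSet() {
        return Arrays.asList(
            new Mapping("Widget 1", MatchType.EQUALS, "Product 1"),
            new Mapping("Widget", MatchType.CONTAINS, "Product ABC"));
    }

    // Checks the mappings in sequence and returns the first match's "To" value.
    static String map(String extracted, List<Mapping> mappings) {
        for (Mapping mapping : mappings) {
            if (mapping.matches(extracted)) {
                return mapping.to;
            }
        }
        return extracted; // nothing matched: keep the extracted value
    }

    public static void main(String[] args) {
        System.out.println(map("Widget 123", exampleSet()));
        System.out.println(map("Widget 1", exampleSet()));
    }
}
```

Swapping the order of the two mappings would send "Widget 1" to "Product ABC" as well, which is why the more specific mapping goes first.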
When using regular expressions in your mapping you can also make use of back references. Back references allow you to preserve values in the original text when mapped to the "To" value. For example, if you were mapping the value Widget 123 you could use the regular expression Widget (\d*). In the "To" column you could then enter the value Product \1, which, when mapped, would convert Widget 123 to Product 123. The value in parentheses in the "From" column gets inserted via the \1 marker found in the "To" column.
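The back reference example can be reproduced with java.util.regex. This sketch substitutes the \1 style markers by hand; the class and method names are hypothetical, and requiring the expression to match the whole extracted value is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: applies a regular expression mapping with \N back references.
class BackReferenceSketch {
    // Finds markers such as \1 in the "To" value.
    private static final Pattern MARKER = Pattern.compile("\\\\(\\d)");

    // Returns the mapped value, or the original value if "from" does not match.
    static String apply(String extracted, String fromRegex, String to) {
        Matcher source = Pattern.compile(fromRegex).matcher(extracted);
        if (!source.matches()) {
            return extracted;
        }
        Matcher marker = MARKER.matcher(to);
        StringBuffer out = new StringBuffer();
        while (marker.find()) {
            int group = Integer.parseInt(marker.group(1));
            // Missing or unmatched groups are replaced with an empty string.
            String value = (group <= source.groupCount() && source.group(group) != null)
                ? source.group(group) : "";
            marker.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        marker.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(apply("Widget 123", "Widget (\\d*)", "Product \\1"));
    }
}
```

Note that Java's own String.replaceAll writes back references as $1 rather than \1, which is why the markers are handled explicitly here.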
Extractor pattern tokens advanced tab (enterprise edition only)

Filtering duplicates (professional and enterprise editions only)
When extracting records from web sites you'll often want to filter out duplicates. screen-scraper provides a method whereby this can be done automatically. To filter duplicates for data extracted by a given extractor pattern you'll want to go to the "Advanced" tab, then check the boxes labeled "Automatically save the data set generated by this extractor pattern in a session variable", "Filter duplicate records", and "Cache the data set". This will cause screen-scraper to generate a session variable with the same name as the extractor pattern identifier, and will save any records extracted by the pattern to the file system rather than saving them in memory.
Once you've set up the extractor pattern to cache and save the data set, you'll need to designate the fields that would identify a unique record. That is, when filtering duplicates screen-scraper will compare the values for designated columns in order to determine if a duplicate record already exists (more or less like a database compound key). You designate an extractor pattern token to be used in determining uniqueness by editing it, and checking the "Use to filter duplicates" box found under the "Advanced" tab.
Because screen-scraper filters duplicates as it's scraping you'll want to wait until the end to make use of the data. For example, if you want all of the data written to a .CSV file you would want to invoke the script that does that after the scraping session has ended. That way you can guarantee that all of the data has been extracted and filtered before you save it.
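The compound-key idea behind duplicate filtering can be sketched in a few lines. This is a generic illustration rather than screen-scraper's code: the class, the record helper, and the field names are hypothetical, and records are modeled as simple maps.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: keeps the first record seen for each compound key.
class DuplicateFilterSketch {
    // Builds a record with three example fields.
    static Map<String, String> record(String name, String phone, String address) {
        Map<String, String> record = new LinkedHashMap<>();
        record.put("NAME", name);
        record.put("PHONE", phone);
        record.put("ADDRESS", address);
        return record;
    }

    // keyFields plays the role of the tokens marked "Use to filter duplicates".
    static List<Map<String, String>> filter(List<Map<String, String>> records,
                                            List<String> keyFields) {
        Set<List<String>> seenKeys = new HashSet<>();
        List<Map<String, String>> unique = new ArrayList<>();
        for (Map<String, String> record : records) {
            List<String> key = new ArrayList<>();
            for (String field : keyFields) {
                key.add(record.get(field));
            }
            // Set.add returns false when the key has been seen before.
            if (seenKeys.add(key)) {
                unique.add(record);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<Map<String, String>> records = new ArrayList<>();
        records.add(record("Juan Ferrero", "111-222-3333", "123 Elm St."));
        records.add(record("Juan Ferrero", "111-222-3333", "123 Elm Street"));
        records.add(record("Sherry Lloyd", "234-5678", "456 Maple Rd."));
        List<String> keys = new ArrayList<>();
        keys.add("NAME");
        keys.add("PHONE");
        System.out.println(filter(records, keys).size());
    }
}
```

Only the designated fields take part in the comparison, so the two "Juan Ferrero" records count as duplicates even though their addresses differ.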
Using sub-extractor patterns
Sub-extractor patterns allow you to extract data in smaller pieces, providing significantly more flexibility in pinpointing the specific pieces you're after. Consider a search results page consisting of rows and columns of data. Using normal extractor patterns you would use a single pattern to extract the data from all columns for a single row. In many cases this works just fine; however, the process gets more complicated when each row differs significantly. For example, certain cells may be in different colors or their contents may be completely missing. With a normal extractor pattern it would be difficult to account for the variability in the cells. By using sub-extractor patterns you could create a normal extractor pattern to extract an entire row, then use individual sub-extractor patterns to pull out the individual cells.
Consider the following HTML table:
| Name | Phone | Address |
|---|---|---|
| Juan Ferrero | 111-222-3333 | 123 Elm St. |
| Joe Bloggs | No contact information available | |
| Sherry Lloyd | 234-5678 (needs area code) | 456 Maple Rd. |
Here is the corresponding HTML source:
<table cellpadding="2" border="1">
It would be difficult (if not impossible) to write a single extractor pattern that would extract the information for each row because the contents of the cells differ so significantly. The different colored cells and the cell spanning two columns make the data too inconsistent to be extracted using a single pattern.
Consider this extractor pattern:
<tr><td~@DATARECORD@~/td></tr>
If applied to the HTML above the extractor pattern would produce the following three matches:
1. class="Name">Juan Ferrero</td><td class="Phone">111-222-3333</td><td class="Address">123 Elm St.<
Sub-extractor patterns would allow you to extract individual pieces of information from each row. For example, consider this sub-extractor pattern:
Name">~@NAME@~</td>
If applied to each of the individual extracted rows above the following three pieces of information would be extracted:
1. Juan Ferrero
2.
3. Sherry Lloyd
Note that "Joe Bloggs" didn't get extracted because the cell his name is in is red. Let's adjust the sub-extractor pattern slightly:
Name"~@nonhtml@~>~@NAME@~</td>
The ~@nonhtml@~ tag represents an extractor pattern token that uses the "Non-HTML tags" regular expression: [^<>]*. It matches everything from its position in the pattern up to the next opening or closing HTML bracket. In this particular case the effect is that all three names get extracted. To extract the phone number you'd use this sub-extractor pattern:
<td class="Phone"~@nonhtml@~>~@PHONE@~</td>
We have the case, however, of the cell in the second row that spans two columns, which would not get extracted by this sub-extractor pattern. We may still want this information, so we create the following sub-extractor pattern, just in case the cell exists:
<td colspan="2">~@PHONE@~<
If applied to our data we'd get the following results:
1.
2. No contact information available
3.
Sub-extractor patterns aggregate everything that's extracted into a single data set. Using all of our extractor and sub-extractor patterns together we'd get the following data set:
| Data record # | Name | Phone |
|---|---|---|
| Data record #1 | Juan Ferrero | 111-222-3333 |
| Data record #2 | Joe Bloggs | No contact information available |
| Data record #3 | Sherry Lloyd | 234-5678 (needs area code) |
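The aggregation shown in the table can be approximated with plain regular expressions. In this sketch the row pattern and sub-extractor patterns are written out as their regex equivalents, with ~@nonhtml@~ as [^<>]*. The HTML string is a hypothetical stand-in for the table source (the bgcolor attributes in particular are assumptions), and the class name is invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: one row pattern plus sub-patterns feeding one data set.
class SubExtractorSketch {
    // Hypothetical sample HTML with three data rows.
    static final String HTML =
        "<table cellpadding=\"2\" border=\"1\">"
        + "<tr><td class=\"Name\">Juan Ferrero</td><td class=\"Phone\">111-222-3333</td>"
        + "<td class=\"Address\">123 Elm St.</td></tr>"
        + "<tr><td class=\"Name\" bgcolor=\"red\">Joe Bloggs</td>"
        + "<td colspan=\"2\">No contact information available</td></tr>"
        + "<tr><td class=\"Name\">Sherry Lloyd</td>"
        + "<td class=\"Phone\" bgcolor=\"yellow\">234-5678 (needs area code)</td>"
        + "<td class=\"Address\">456 Maple Rd.</td></tr>"
        + "</table>";

    // Regex form of <tr><td~@DATARECORD@~/td></tr>
    static final Pattern ROW = Pattern.compile("<tr><td(.*?)/td></tr>");

    // Token label paired with the regex form of each sub-extractor pattern.
    static final String[][] SUB_PATTERNS = {
        {"NAME", "Name\"[^<>]*>(.*?)</td>"},
        {"PHONE", "<td class=\"Phone\"[^<>]*>(.*?)</td>"},
        {"PHONE", "<td colspan=\"2\">(.*?)<"},
    };

    // Applies every sub-pattern to each extracted row and collects
    // whatever matches into one record per row.
    static List<Map<String, String>> extract(String html) {
        List<Map<String, String>> dataSet = new ArrayList<>();
        Matcher row = ROW.matcher(html);
        while (row.find()) {
            Map<String, String> dataRecord = new LinkedHashMap<>();
            for (String[] sub : SUB_PATTERNS) {
                Matcher cell = Pattern.compile(sub[1]).matcher(row.group(1));
                if (cell.find()) {
                    dataRecord.put(sub[0], cell.group(1));
                }
            }
            dataSet.add(dataRecord);
        }
        return dataSet;
    }

    public static void main(String[] args) {
        for (Map<String, String> dataRecord : extract(HTML)) {
            System.out.println(dataRecord);
        }
    }
}
```

Each row yields one record, and a sub-pattern that finds nothing in a row simply contributes no value to that record.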
There are a couple of important things to note about sub-extractor patterns:
Tips on using extractor patterns
Test your patterns frequently. Extractor patterns take some practice. Especially when you're first trying them out you'll want to test them as you're working with them. It often helps to test it after every couple of tokens you insert.
Ensure that the pattern extracts the number of data sets you expect it to. Oftentimes your pattern might not be as flexible as you think it is. Test it out to make sure it matches as many times as you think it should.
Using Scripts
Overview
screen-scraper has a built-in scripting engine to facilitate dynamically scraping sites and working with data once it's been extracted. Depending on your needs scripts can be helpful for such things as interacting with databases and dynamically determining which files get scraped when.
Invoking scripts in screen-scraper is similar to other programming languages in that they're tied to events. Just as you might designate a block of code to be run when a button is clicked in Visual Basic, in screen-scraper you might run a script after an HTML file has been downloaded or data has been extracted from a page.
Depending on your preferences, there are a number of languages that scripts can be written in. screen-scraper supports JavaScript, Interpreted Java, and Python on any platform, and JScript, VBScript, and Perl when running on Windows. Try the links at the bottom of this screen for information specific to each of the scripting languages.
If you haven't done so already, we'd highly recommend taking some time to go through our tutorials in order to get more familiar with how scripts are used.
Managing scripts
Scripts are added by clicking the "New Script" button (looks like a pencil and paper) or by selecting "File->New Script" from the menu bar. Delete a script either by selecting it and pressing the "Delete" key or by right-clicking it (or control-clicking on Mac OS X) and selecting "Delete".
Each script is given a unique name so that you can easily indicate when it should be invoked (e.g. before a scraping session begins or after each application of an extractor pattern). You can also select the language the script is written in. Scripts can be exported to an XML file so that they can be backed up or transferred to other instances of screen-scraper. See the Importing and exporting objects page for more information on this. Clicking on the "Show Script Instances" button will display any locations where this script is invoked in the format scraping session: scrapeable file: extractor pattern.
Finally, you're given a text box in which to write your script. The text editing features for authoring scripts in screen-scraper are currently fairly limited, so you may want to consider using an external editor, then copying and pasting text into screen-scraper.
Using scripts
You designate a script to be executed by associating it with some event. For example, if you click on a scraping session in the tree, then on the "Scripts" tab, you'll notice that you can designate scripts to be invoked either before a scraping session begins or after it completes. Other events that can be used to invoke scripts relate to scrapeable files and extractor patterns. After associating a script with an object in this way it can be disassociated by selecting it in the table and pressing the "Delete" key or by right-clicking it (or control-clicking on Mac OS X) and selecting "Delete". You can also selectively enable and disable scripts using the "Enabled?" checkbox in the rightmost column.
Working with external Java libraries
Existing Java code can be referred to from within scripts. Simply copy any jar files you'd like to reference from within scripts into the "lib\ext" folder found in screen-scraper's directory. Note that you'll still need to use the "import" statement within your scripts to refer to specific classes, like this:
import com.foo.bar.*;
Please note: screen-scraper 4.0 was built on a Java 1.5 platform. Your Java scripts must accept at least a version 1.5 JRE in order to compile and run properly.
Built-in objects
screen-scraper offers a few objects that you can work with in a script. Bear in mind that not all of these variables will be available in all scripts. See the Variable scope section (following this one) for more details. You can view details on all of the objects and their methods in our API Documentation.
Variable scope
Depending on when a script gets run different variables may be in scope. When associating a script with an object, such as a scraping session or scrapeable file, you're asked to specify when the script is to be run. The table that follows specifies what variables will be in scope depending on when a given script is run. Note that none of the variables will be in scope when a script is invoked directly, though it is common in these scripts to create RunnableScrapingSession objects.
| When Script is Run | session in scope | scrapeableFile in scope | dataSet in scope | dataRecord in scope |
|---|---|---|---|---|
| Before scraping session begins | X | | | |
| After scraping session ends | X | | | |
| Before file is scraped | X | X | | |
| After file is scraped | X | X | | |
| Before pattern is applied | X | X | | |
| After pattern is applied | X | X | X | |
| After each pattern application | X | X | X | X |
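For quick reference, the scope table can also be expressed as a small lookup. The enum and class names below are hypothetical and exist only for this sketch; they are not part of the screen-scraper API.

```java
import java.util.EnumSet;

// Illustrative only: encodes the variable scope table as a lookup.
class VariableScopeSketch {
    enum Trigger {
        BEFORE_SESSION_BEGINS, AFTER_SESSION_ENDS,
        BEFORE_FILE_IS_SCRAPED, AFTER_FILE_IS_SCRAPED,
        BEFORE_PATTERN_IS_APPLIED, AFTER_PATTERN_IS_APPLIED,
        AFTER_EACH_PATTERN_APPLICATION
    }

    enum Variable { SESSION, SCRAPEABLE_FILE, DATA_SET, DATA_RECORD }

    // Returns the built-in variables available for a given trigger.
    static EnumSet<Variable> inScope(Trigger trigger) {
        switch (trigger) {
            case BEFORE_SESSION_BEGINS:
            case AFTER_SESSION_ENDS:
                return EnumSet.of(Variable.SESSION);
            case BEFORE_FILE_IS_SCRAPED:
            case AFTER_FILE_IS_SCRAPED:
            case BEFORE_PATTERN_IS_APPLIED:
                return EnumSet.of(Variable.SESSION, Variable.SCRAPEABLE_FILE);
            case AFTER_PATTERN_IS_APPLIED:
                return EnumSet.of(Variable.SESSION, Variable.SCRAPEABLE_FILE, Variable.DATA_SET);
            case AFTER_EACH_PATTERN_APPLICATION:
                return EnumSet.allOf(Variable.class);
            default:
                throw new IllegalArgumentException("Unknown trigger: " + trigger);
        }
    }

    public static void main(String[] args) {
        for (Trigger trigger : Trigger.values()) {
            System.out.println(trigger + " -> " + inScope(trigger));
        }
    }
}
```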
Debugging scripts
One of the best ways to fix errors is to simply watch the scraping session log (under the "Log" tab) and the "error.log" file (located in the "log" directory where screen-scraper was installed) for script errors. When a problem arises in executing a script screen-scraper will output a series of error-related statements to the logs. Often a good approach in debugging is to build your script bit by bit, running it frequently to ensure that it runs without errors as you add each piece.
When screen-scraper is running as a server it will automatically generate individual log files in the "log" directory for each running scraping session (this can be disabled in the settings window). An "error.log" file will also be generated in that same directory when internal screen-scraper errors occur.
The "Breakpoint" window can also be invaluable in debugging scripts. You can invoke it by inserting the line session.breakpoint() into your script. While the "Breakpoint" window is displayed, script execution will halt. There are two buttons along the top of the window. The "play" button will simply continue execution of your script. Clicking the "stop" button will cause screen-scraper to halt execution as soon as it can. The "Breakpoint" window also exposes any session variables, data sets, and data records that are in scope. These values can be altered in the "Breakpoint" window as well.