Documentation

Getting Started

Overview of screen-scraping

What is screen-scraping?

Screen-scraping is the practice of extracting information from web sites so that it can be used in other contexts. It has its roots in an earlier practice that dealt with reading the display from a mainframe terminal, then re-purposing the information via character recognition or some other method in order to persist the functionality of legacy applications.

Why do screen-scraping?

If possible, the preferred method for getting information presented on a web site is via something like an RSS or other XML-based feed. As data that's extracted from web sites is often used directly in existing applications, SOAP is another possible way of getting at needed information. Unfortunately it's not always possible to get information using RSS or SOAP, which makes room for screen-scraping as an approach to get at data. Take a look at our solutions page for specific examples of screen-scraping.

The basic approach

While it's typically fairly easy for a person to log in to a web site, navigate to a particular page, and copy information out of a document, a machine needs a lot more help. Web pages are obviously designed to be viewed and used by humans, so in screen-scraping we typically need to take the same actions that a human would take when copying data from a web page. There are typically three phases in scraping information from a given page:

  1. Request the page. This first part may actually be more complex than it sounds. Oftentimes the page that's needed can only be accessed after logging in to a site and following a series of specific links. Your web browser will typically handle things such as tracking cookies and submitting all of the elements of a form for you, but it becomes a bit more of a manual process when done by a computer.
  2. Extract the information. Once the web page is requested the next step is to parse the HTML text such that specific pieces of data can be extracted and used within computer code. There are several ways to go about this. One possibility is to apply regular expressions, which often work well since they allow for relatively "fuzzy" searches.  Another might be to attempt to turn the HTML in the document into XML so that it can be queried using such methods as XPath.
  3. Do something with the extracted data. From here the information might be inserted into a database or perhaps re-formatted in some way to be presented to a user.
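The three phases above can be sketched in a few lines of Python. This is only an illustration of the general approach using regular expressions, not how screen-scraper itself works. The sample HTML, the `extract_products` and `total_price` helpers, and the field names are all made up for the example, and phase 1 (requesting the page) is replaced by a literal string so the sketch runs without a network connection.

```python
import re

# Phase 1 (request the page) is stood in for by a hypothetical HTML snippet.
SAMPLE_HTML = """
<table>
  <tr><td class="name">Widget</td><td class="price">$4.50</td></tr>
  <tr><td class="name">Gadget</td><td class="price">$12.00</td></tr>
</table>
"""

# Phase 2: a "fuzzy" regular expression that tolerates whitespace between cells.
ROW_PATTERN = re.compile(
    r'<td class="name">\s*(?P<name>.*?)\s*</td>\s*'
    r'<td class="price">\s*\$(?P<price>[\d.]+)\s*</td>',
    re.DOTALL,
)


def extract_products(html):
    """Return one dict per table row matched by ROW_PATTERN."""
    return [
        {"name": m.group("name"), "price": float(m.group("price"))}
        for m in ROW_PATTERN.finditer(html)
    ]


def total_price(products):
    """Phase 3: do something with the extracted data (here, sum the prices)."""
    return sum(p["price"] for p in products)


products = extract_products(SAMPLE_HTML)
print(products, total_price(products))
```

In a real scrape the string would come from an HTTP request, and phase 3 would more likely write to a file or database than compute a sum.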

screen-scraper dramatically reduces the time required to perform all of these steps, so that you can focus on what to do with the extracted information.

Legal issues

A good portion of the information on the web is copyrighted, which obviously has legal implications for screen-scraping. One should use discretion when grabbing data from web sites to be re-purposed.

Getting Started Using screen-scraper

Overview

Using screen-scraper to extract information from web sites typically consists of a few main steps:

  1. Use the proxy server to determine which files to scrape. It's frequently necessary to request a few files before you can get at the file that contains the data you need (e.g. you may need to log in to the site first). The proxy server allows you to surf a site as you normally would, then easily select files you need to have scraped.
  2. Organize and configure files to be scraped. Once you've selected the files to scrape you'll typically need to organize and sequence them. You'll also usually tweak information related to the files, such as POST data to be sent or authentication tokens.
  3. Create extractor patterns. Extractor patterns provide an intuitive way to selectively identify snippets of data you want extracted from individual pages.
  4. Create scripts. Scripts let you do something with the data that gets extracted. This might be writing the data out to a formatted file or inserting the information into a database.

The best way to learn to use screen-scraper is by going through our tutorials.


Installation

Installation Requirements

screen-scraper has been tested on Microsoft Windows, Linux, Mac OS X, and other platforms that support a Java Runtime Environment of 1.4 or higher. The Windows and Linux screen-scraper installers come with a runtime environment included. Mac OS X should already have the Java Runtime Environment installed (use the Software Update utility if not). For help installing screen-scraper on other platforms (e.g., Solaris, FreeBSD) please contact us.

Installation Instructions

See the appropriate download page for the edition you'd like to install:

screen-scraper License Agreement

screen-scraper License Agreement Copyright © 2002-2008 by ekiwi, LLC.
All Rights Reserved.

YOUR AGREEMENT TO THIS LICENSE

After reading this agreement carefully, if you ("Customer") do not agree to all of the terms of this End-User License Agreement ("EULA"), you may not use this Software (hereafter referred to as "Software Product"). Unless you have a different license agreement signed by ekiwi, LLC (hereafter referred to as "ekiwi") that covers this copy of the Software Product, your use of this Software Product indicates your acceptance of this EULA. All updates to the Software Product shall be considered part of the Software Product and subject to the terms of this EULA. Changes to this EULA may accompany updates to the Software Product, in which case by installing such update Customer accepts the terms of the EULA as changed. The EULA is not otherwise subject to addition, amendment, modification, or exception unless in writing signed by an officer of both Customer and ekiwi. A software license and a license key ("Software Product License"), issued to a designated user only by ekiwi, is required for each concurrent user of the Software Product. By explicitly accepting this EULA you are acknowledging and agreeing to be bound by the following terms:

1. EVALUATION PERIOD

This Software Product may be used in conjunction with a free evaluation Software Product License. You may use the evaluation copy of the Software Product for only thirty (30) days in order to determine whether to purchase the Software Product, after which the Software Product will cease to function. ekiwi bears no liability for any damages resulting from use of the Software Product, and has no duty to provide any support before or after the expiration date of an evaluation license.

2. GRANT OF NON-EXCLUSIVE LICENSE

You may not tamper with, alter, or use the Software Product in a way that disables, circumvents, or otherwise defeats its built-in licensing verification and enforcement capabilities. You may not modify or create derivative copies of the Software Product or this EULA. All rights not expressly granted to you are retained by ekiwi.

ekiwi grants the non-exclusive, non-transferable right for a single user to use this Software Product. Each additional concurrent user of the Software Product must obtain an additional Software Product License. You may install the Software Product on as many computer systems as desired, so long as two copies of the same Software Product License never come into concurrent use.

3. INTELLECTUAL PROPERTY

The Software Product is owned by ekiwi and is protected by international copyright laws and treaties, as well as other intellectual property laws and treaties. You must not remove or alter any copyright notices on any copies of the Software Product. This Software Product copy is licensed, not sold. You may not use, copy, or distribute the Software Product, except as granted by this EULA, without written authorization from ekiwi. ekiwi reserves all intellectual property rights, including copyrights, patents, and trademarks.

4. TRANSFERABILITY

Customer may not rent, lease, lend, or in any way distribute or transfer any rights in this EULA or the Software Product to third parties without ekiwi's written approval, and subject to written agreement by the recipient of the terms of this EULA.

5. PROHIBITION ON REVERSE ENGINEERING AND DECOMPILATION

You may not reverse engineer, decompile, defeat license encryption mechanisms, or disassemble the Software Product or Software Product License except and only to the extent that such activity is expressly permitted by applicable law notwithstanding this limitation.

6. INDEMNIFICATION

You hereby agree to indemnify ekiwi against and hold harmless ekiwi from any claims, lawsuits, liability or other losses that arise out of your breach of any provision of this EULA.

7. THIRD PARTY SOFTWARE

Any software provided along with the Software Product that is associated with a separate license agreement is licensed to you under the terms of that license agreement (which license is provided with the Software Product). This license does not apply to those portions of the Software Product.

8. SUPPORT SERVICES

ekiwi may provide you with support services related to the Software Product. Use of any such support services is governed by ekiwi policies and programs described in online documentation and/or other ekiwi-provided materials.

As part of these support services, ekiwi may make available bug lists, planned feature lists, and other supplemental informational materials. ekiwi makes no warranty of any kind for these materials and assumes no liability whatsoever for damages resulting from any use of these materials. Furthermore, you may not use any materials provided in this way to support any claim made against ekiwi.

Any supplemental software code or related materials that ekiwi provides to you as part of the support services, in periodic updates to the Software Product or otherwise, is to be considered part of the Software Product and is subject to the terms and conditions of this EULA.

With respect to any technical information you provide to ekiwi as part of the support services, ekiwi may use such information for its business purposes without restriction, including for product support and development. ekiwi will not use such technical information in a form that personally identifies you without first obtaining your permission.

9. TERMINATION

This EULA terminates on the date of the first occurrence of either of the following events: (1) The expiration of one (1) month from written notice of termination from Customer to ekiwi; or (2) One party materially breaches any terms of this EULA or any terms of any other agreement between Customer and ekiwi, that are either uncorrectable or that the breaching party fails to correct within one (1) month after written notification by the other party.

10. NO WARRANTIES

YOU ACCEPT THE SOFTWARE PRODUCT AND SOFTWARE PRODUCT LICENSE "AS IS," AND EKIWI MAKES NO WARRANTY AS TO ITS USE, PERFORMANCE, OR OTHERWISE. TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, EKIWI DISCLAIMS ALL OTHER REPRESENTATIONS, WARRANTIES, AND CONDITIONS, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE, INCLUDING, BUT NOT LIMITED TO, IMPLIED WARRANTIES OR CONDITIONS OF MERCHANTABILITY, SATISFACTORY QUALITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE, AND NON-INFRINGEMENT. THE ENTIRE RISK ARISING OUT OF USE OR PERFORMANCE OF THE SOFTWARE PRODUCT REMAINS WITH YOU.

11. LIMITATION OF CONSEQUENTIAL DAMAGES

NEITHER EKIWI NOR ANYONE INVOLVED IN THE CREATION, PRODUCTION, OR DELIVERY OF THIS SOFTWARE SHALL BE LIABLE FOR ANY INDIRECT, CONSEQUENTIAL, OR INCIDENTAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE SUCH SOFTWARE EVEN IF EKIWI HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES OR CLAIMS. IN NO EVENT SHALL EKIWI'S LIABILITY FOR ANY DAMAGES EXCEED THE PRICE PAID FOR THE LICENSE TO USE THE SOFTWARE, REGARDLESS OF THE FORM OF CLAIM. EKIWI SHALL IN NO WAY BE HELD LIABLE OR RESPONSIBLE FOR ANY UNLAWFUL OR ILLEGAL USE OF THE SOFTWARE PRODUCT, INCLUDING, BUT NOT LIMITED TO, THE EXTRACTION AND USE OF COPYRIGHTED DATA FROM EXTERNAL SOURCES (E.G. WEB PAGES). THE PERSON USING THE SOFTWARE BEARS ALL RISK AND RESPONSIBILITY AS TO THE USE, QUALITY, AND PERFORMANCE OF THE SOFTWARE.

12. HIGH RISK ACTIVITIES

The Software Product is not fault-tolerant and is not designed, manufactured or intended for use or resale as on-line control equipment in hazardous environments requiring fail-safe performance, including, but not limited to, in the operation of nuclear facilities, aircraft navigation or communication systems, air traffic control, direct life support machines, and weapons systems, in which the failure of the Software Product, or any software, tool, process, or service that was developed using the Software Product, could lead directly to death, personal injury, or severe physical or environmental damage ("High Risk Activities"). Accordingly, ekiwi and its suppliers and licensors specifically disclaim any express or implied warranty of fitness for High Risk Activities. You agree that ekiwi and its suppliers and licensors will not be liable for any claims or damages arising from the use of the Software Product, or any software, tool, process, or service that was developed using the Software Product, in such applications.

13. GENERAL

This EULA is the complete statement of the agreement between the parties on the subject matter, and merges and supersedes all other or prior understandings, purchase orders, agreements and arrangements.

This EULA shall be governed by the laws of the State of Utah. Exclusive jurisdiction and venue for all matters relating to this EULA shall be in courts located in the State of Utah, and you consent to such jurisdiction and venue. If any action is brought by either party to this EULA against the other party regarding the subject matter hereof, the prevailing party shall be entitled to recover, in addition to any other relief granted, reasonable attorney fees and expenses of litigation.

You acknowledge that, in the event of your breach of any of the foregoing provisions, ekiwi will not have an adequate remedy in money or damages. ekiwi shall therefore be entitled to obtain an injunction against such breach from any court of competent jurisdiction immediately upon request. ekiwi's right to obtain injunctive relief shall not limit its right to seek further remedies.

There are no third party beneficiaries of any promises, obligations or representations made by ekiwi, LLC herein. Any waiver by ekiwi, LLC of any violation of this EULA by you shall not constitute or contribute to a waiver of any other or future violation by you of the same provision, or any other provision, of this EULA.

14. CONTACT INFORMATION

If you have any questions about this EULA, or if you want to contact ekiwi for any reason, please direct correspondence to info@screen-scraper.com.

Configuration

Settings

Overview

What follows is a description of each of the elements found in the "Settings" window, which can be displayed by selecting "Settings" from the "Options" menu, or by clicking the wrench icon in the button bar.

General



General settings

  • Connection timeout: At times remote web servers will experience problems after screen-scraper has made a connection. When this happens the server will often hold on to the connection to screen-scraper, causing it to appear to freeze. Designating a connection timeout avoids this situation. Generally around 30 seconds is sufficient.
  • Data extractor timeout: In certain cases complex extractor patterns can take an abnormally long time when being applied. You'll likely want to designate a timeout so that screen-scraper doesn't get stuck while applying a pattern. Typically it should not take longer than 2 or 3 seconds to apply a pattern.
  • Maximum number of concurrent running scraping sessions (professional and enterprise editions only): When screen-scraper is running as a server you'll often want to limit the number of scraping sessions that can be run simultaneously, so as to avoid consuming too many resources on a machine. This setting controls how many will be allowed to run at a time. Note that this only applies when a lazy scrape is being performed.
  • Maximum application memory allocation in megabytes: This setting controls the amount of memory screen-scraper will be allowed to consume on your computer. In cases where you notice sluggish behavior or "OutOfMemoryError" messages appearing in the "error.log" file (found in the "log" directory for your screen-scraper installation folder), you'll likely want to increase this number.
  • Default proxy session to use when running in server mode (enterprise edition only): When screen-scraper is running as a server it can also run the proxy server. If you designate a proxy session in this drop-down box screen-scraper will make use of its scripts.
  • Installation directory: In virtually all cases this setting can be left untouched. If you move the screen-scraper installation directory you may need to manually set this.
  • Allow upgrading to unstable versions (professional and enterprise editions only): If this box is checked when you select "Check for updates" from the "Options" menu screen-scraper will give you the option to download alpha/unstable versions of the software.
  • Default character set (professional and enterprise editions only): Indicates the character set that should be used when not designated by the remote server. If sites are being scraped that make use of characters outside of the standard ASCII set this value should probably be set to UTF-8.
  • Default font (professional and enterprise editions only): The font that should be used in certain text boxes within screen-scraper. This includes the "Last Response" and "Extractor Pattern" boxes, among others.

Servers (professional and enterprise editions only)



Server settings

Server (professional and enterprise editions only)

These settings apply when screen-scraper is running in server mode.

Proxy Server (professional and enterprise editions only)

These settings apply only to the proxy server portion of screen-scraper.

Mail Server (professional and enterprise editions only)

These settings are used with the session.sendMail method in screen-scraper scripts.

SOAP Server (professional and enterprise editions only)

These settings apply only to the SOAP server portion of screen-scraper.

External Proxy



External proxy settings

Please note that, unless you normally connect to the Internet through an external proxy server, you don't need to modify these settings.

Anonymous Proxy



For details on this pane please see the Anonymization page.

Other Settings

There are a handful of settings that are rarely used, so we don't provide a way to adjust them in the "Settings" dialog box. You'll instead need to edit the properties file manually. And, yes, if you look at the properties file we tell you not to do that :)

screen-scraper's properties file is found in its installation folder at "resource\conf\screen-scraper.properties". You can edit it in your favorite text editor. Note that when you alter the file you should do so when screen-scraper is not running. It won't get the new settings until the application restarts, and if you edit while it's running it may overwrite your changes.

With that caveat and introduction, here are the settings:

If you want to run multiple instances of screen-scraper on a single machine you'll also need to modify and/or add several properties. Check this FAQ for instructions on doing that.

Configuring Firefox



Firefox "General" settings


Firefox proxy settings
  1. If you're running Windows, click on the Tools->Options... menu item. If Linux, click on the Edit->Preferences menu item. In Mac OS X click on the Firefox->Preferences... menu item.
  2. Click on the "Settings..." button at the top, then on the "Network" tab.
  3. Click the "Manual proxy configuration" radio button.
  4. In the "HTTP Proxy" field type "localhost", and "8777" in "Port" (assuming you haven't changed the default port number from 8777).
  5. Click on the check box labeled "Use this proxy server for all protocols".
  6. Hit the "OK" button to get back to your web browser.

If you have trouble connecting to screen-scraper's proxy with your web browser, please see this FAQ.


Helpful Firefox Add-ons

  1. SwitchProxy (available for Firefox 2.0): provides a drop-down menu in the toolbar for quickly switching to and from proxy servers.
  2. FoxyProxy (available for Firefox 3.0): provides a button in the status bar for quickly switching to and from proxy servers.
  3. Firebug (available for Firefox 3.0)

For more useful add-ons for other purposes, visit the Browser Tools page.

Vista Users

Please see our note on using the Proxy server within Vista.

Configuring Internet Explorer



IE proxy settings

  1. Click on the Tools->Internet Options menu item.
  2. Go to the "Connections" tab.
  3. Click on "LAN Settings".
  4. Click on the checkbox beginning with "Use a proxy server for...".
  5. Click on the "Advanced..." button.
  6. In the "HTTP" and "Secure" fields type "localhost" under the "Proxy address to use" column, and "8777" under "Port" (assuming you haven't changed the default port number from 8777).
  7. Hit the "OK" button a few times till you get back to your web browser.

If you're using a dial-up connection the setup will differ slightly. Instead of the "LAN Settings" button you'll want to find your dial-up connection under the "Dial-up and Virtual Private Network settings" dialog box, then configure it via the "Settings" button.

If you have trouble connecting to screen-scraper's proxy with your web browser, please see this FAQ.


Vista Users

Please see our note on using the Proxy server within Vista.


Configuring Mozilla


Mozilla proxy settings
  1. Click on the Edit->Preferences menu item.
  2. Expand the "Advanced" node in the tree on the left.
  3. Click on the "Proxies" node.
  4. Select the "Manual proxy configuration" radio button.
  5. Click the "View..." button.
  6. In the "HTTP Proxy" and "SSL Proxy" fields type "localhost", and "8777" in "Port" (assuming you haven't changed the default port number from 8777).
  7. Hit the "OK" button twice to get back to your web browser.

If you have trouble connecting to screen-scraper's proxy with your web browser, please see this FAQ.


Vista Users

Please see our note on using the Proxy server within Vista.


Configuring Opera

To access the Proxy settings in Opera, go to the Tools -> Preferences menu. In the Advanced tab, there is a Network section listed near the bottom on the left, and a Proxy Servers... button inside:


Advanced preferences -> Network -> Proxy Settings...

Clicking on the "Proxy Settings..." button will give you a dialog with several text fields you may enable. For screen-scraper to work, we only need to enable the first "HTTP" checkbox, and enter "localhost" and your proxy port (screen-scraper defaults to 8777):


Enable HTTP, enter "localhost" and your port number.

To simplify the whole process of turning the proxy server on and off, you can add a proxy button to a toolbar. One handy location is on the far right of the tab bar. To add the button, right-click just about any of Opera's interface toolbars and select "Customize...". In the "Buttons" tab, open the "Preferences" section, then drag and drop the "Enable proxy servers" button onto your toolbar to provide instant access to the proxy server:


A quick way to enable/disable your screen-scraper proxy server

New to Opera 9.5.x is a feature called Dragonfly. You can access it through Tools -> Advanced -> Developer Tools, or by pressing Ctrl+Shift+I. If you're familiar with Firefox's "Firebug" add-on, you'll quickly recognize Dragonfly. It's built on the same ideas, allowing you to manipulate CSS in real time and see which properties are being overridden by others. You can debug JavaScript running on the page, find HTML elements by clicking on them so that Dragonfly shows you the corresponding page source, see all final properties of elements, and so on. It's a great tool for sifting through a website.

Vista Users

Please see our note on using the Proxy server within Vista.

Configuring Netscape


Netscape proxy settings
  1. Click on the Edit->Preferences menu item.
  2. Expand the "Advanced" node in the tree on the left.
  3. Click on the "Proxies" node.
  4. Select the "Manual proxy configuration" radio button.
  5. Click the "View..." button.
  6. In the "HTTP Proxy" and "SSL Proxy" fields type "localhost", and "8777" in "Port" (assuming you haven't changed the default port number from 8777).
  7. Hit the "OK" button twice to get back to your web browser.

If you have trouble connecting to screen-scraper's proxy with your web browser, please see this FAQ.


Vista Users

Please see our note on using the Proxy server within Vista.


The Proxy Server

Proxy Server Overview

Purpose

screen-scraper's proxy server allows you to view HTTP requests and responses as they pass between your web browser and remote servers. In scraping files from web sites there are a few more details than you typically worry about when surfing, such as HTTP headers and POST data. The proxy server makes all of these details visible to you.

Description

When running, the proxy server listens on a specified port for incoming HTTP requests from your web browser. Upon receiving a request the proxy server records it, then sends it along to the server it was intended for. When that server responds the response is sent first to the proxy server, which, once again, makes a record of it, then sends it along to your web browser.
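The record-and-forward cycle described above can be sketched without any networking. The snippet below is a simplified model rather than screen-scraper's implementation: `RecordingProxy`, `Transaction`, and `fake_server` are hypothetical names, and the remote server is replaced by a plain function so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Transaction:
    """A request/response pair as logged by the proxy (hypothetical model)."""
    request: str
    response: str = ""


@dataclass
class RecordingProxy:
    # `upstream` stands in for the remote server the request was intended for.
    upstream: Callable[[str], str]
    log: List[Transaction] = field(default_factory=list)

    def handle(self, request: str) -> str:
        transaction = Transaction(request=request)
        self.log.append(transaction)            # record the incoming request
        response = self.upstream(request)       # send it along to the server
        transaction.response = response         # record the server's response
        return response                         # pass it back to the browser


def fake_server(request: str) -> str:
    """Hypothetical remote server used only for this illustration."""
    return "200 OK for " + request


proxy = RecordingProxy(upstream=fake_server)
reply = proxy.handle("GET /index.html")
print(reply, proxy.log)
```

Each entry in `log` pairs one request with its response, which is the information a proxy of this kind needs in order to display what passed between browser and server.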

Viewing HTTPS requests

Often one of the headaches of scraping information from sites that use HTTPS is that it's not always easy to tell what's getting passed back and forth in the way of cookies, POST data, etc. Even if you put a proxy server in the way that lets you view the requests and responses, the information is encrypted as it's leaving your browser and as it's leaving the web server that responds to the request. screen-scraper gets around this problem by using its own temporary certificate to encrypt traffic from itself to the browser and then encrypting each request before sending it up to the server. The result of this is that your browser will issue a warning about the certificate that screen-scraper returned. You can safely accept the certificate and be assured that all your traffic is encrypted.

Running the proxy in server mode

screen-scraper has the ability to act as a proxy while in server mode. Combined with the ability to execute scripts, this functionality opens up many new possibilities for how you use screen-scraper, including setting up blacklists, application integration, and many more. To learn how to configure screen-scraper for this, see the settings documentation and look for "Default proxy session to use when running in server mode".


Using the Proxy Server

Configuring the proxy server

First you will need to create a proxy session, which is really just a way to organize your interactions with specific web sites. Typically you'll have a proxy session for each site you want to scrape. Create a new proxy session by clicking on the New Proxy Session button (looks like a globe) or by selecting "File->New Proxy Session" from the menu.

The settings for the proxy server are the name, the port, and whether you want the proxy server to log binary files such as images. Typically you would name the proxy after the site that you are accessing, leave the port set to 8777, and select "Don't log binary files".

Configuring your web browser

Configuring a web browser to use a proxy server is generally pretty straightforward, but varies somewhat for each browser. For more detailed instructions on setting up your specific browser, see the browser configuration sections of this documentation (Firefox, Internet Explorer, Mozilla, Opera, and Netscape).

Running the proxy server

Assuming you've configured everything and set up a proxy session, from here you should be able to start up the proxy server by selecting the proxy session in the tree on the left, then clicking on the "Start Proxy Server" button. Now just surf away.
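If you'd rather check that the proxy server is listening without changing browser settings, a script can route a request through it instead. The following is a minimal sketch using Python's standard library, assuming the proxy is running on "localhost" at the default port 8777 and accepts requests from non-browser HTTP clients. The function names `build_proxy_opener` and `fetch_via_proxy` are made up for this example.

```python
import urllib.request


def build_proxy_opener(host="localhost", port=8777):
    """Return an opener that sends HTTP and HTTPS requests through the proxy.

    The host and port defaults mirror the values used in the browser
    instructions; change them if the proxy session uses a different port.
    """
    proxy_url = f"http://{host}:{port}"
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_via_proxy(url, timeout=30):
    """Request `url` through the proxy and return (status code, body bytes)."""
    opener = build_proxy_opener()
    with opener.open(url, timeout=timeout) as response:
        return response.status, response.read()
```

With the proxy server started, calling `fetch_via_proxy("http://example.com/")` should, under these assumptions, cause the request to appear alongside any browser traffic the proxy has logged.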

Viewing requests and responses

After you've surfed a bit with your web browser click on the "Progress" tab. From here you can view all of the HTTP and HTTPS requests and responses logged by the proxy server. The upper pane lists all of the transactions (a request/response combination). Clicking on a transaction brings up its details in the lower pane. You can delete transactions by selecting them and either hitting the "Delete" key or by right-clicking them (or Control-clicking on Mac OS X) and selecting "Delete".

The proxy server log

Currently the proxy server just logs very basic information about its activity, and probably isn't of much interest.

Viewing encrypted transactions

In Internet Explorer 7 you have to adjust your security settings. In Tools > Internet Options under the security tab slide the security level to medium. When accessing a site that uses HTTPS encryption you will encounter a browser warning that looks like this:



IE domain mismatch warning

This warning occurs because screen-scraper is using a temporary certificate for encryption that will not match the URL that you are accessing. You can safely ignore this warning by clicking "Continue to this website (not recommended)".

Currently Firefox 3 will not allow you to navigate to a page with a certificate/domain mismatch. We recommend using Opera 9.5 for SSL proxy sessions.

Using an external proxy server

If you normally use an external proxy server when connecting to the Internet (on your local area network, for example), you'll need to set another property within screen-scraper. View the settings screen by selecting Options->Settings from the menu. On the "External Proxy" tab you'll notice a series of boxes toward the bottom that allow you to set parameters related to your proxy server. It should be relatively self-explanatory what needs to be designated. If you happen to be using NTLM (Windows NT) authentication you'll need to designate settings for both the "standard" proxy as well as the NTLM one.


Vista Users

Please see our note on using the Proxy server within Vista.

Using Scripts with the Proxy Server

Overview

screen-scraper has the ability to run custom-made scripts while the proxy server is running. This allows you to harness the full power of the scripting environment, just as you can in scraping sessions. It is recommended that you read "Using Scripts" before continuing, since many of the concepts apply to invoking scripts in the proxy server environment.

Using the scripts

Scripts are added to a proxy session by selecting the proxy session in the tree view, then selecting the "Scripts" tab and clicking on the "Add Script" button. A script will then be added to the scripts table. Click on the script name and select the script that you want to run. The options "Sequence", "When to Run" and "Enabled" function similarly to other places in screen-scraper where scripts can be invoked. In the proxy server environment the "When to Run" options specify when in the proxy cycle the script will be invoked. Depending on when you decide to run your script, certain built-in objects that are unique to the proxy environment will be in scope.

Built-in objects

screen-scraper offers a few objects that you can work with in a script in the proxy environment. See the "Variable scope" section (following this one) for more details.

Variable scope

Depending on when a script gets run different variables may be in scope. The table that follows specifies what variables will be in scope depending on when a given script is run.

When Script is Run (X = in scope)   proxySession   request   response
Beginning of proxy session          X
Before HTTP request                 X              X
After HTTP request                  X              X
Before HTTP response                X              X         X
After HTTP response                 X              X         X

Debugging scripts

One of the best ways to fix errors is to simply watch the proxy session log (under the "Log" tab) and the "error.log" file (located in the "log" directory where screen-scraper was installed) for script errors. When a problem arises in executing a script screen-scraper will output a series of error-related statements to the logs. Often a good approach in debugging is to build your script bit by bit, running it frequently to ensure that it runs without errors as you add each piece.


The Scraping Engine

An Overview of the Scraping Engine

Purpose

The screen-scraper application contains a scraping engine, which is intended to provide an intuitive and convenient way to set up specific web pages to have information scraped from them. Many of the details related to screen-scraping that are typically done in code can be handled through a graphical user interface (called the "workbench", in screen-scraper).

Basic concepts

There are several basic elements used by screen-scraper in extracting data from web sites. The first is a scraping session, which consists of a series of files, called scrapeable files, that screen-scraper will request in a designated sequence. A common example might be a site that requires authentication before the data that is to be extracted can be accessed. The first file, or HTTP request, might be to a server-side script that handles a user's login. It might be necessary to follow a few links, which would involve creating more scrapeable files, until the page that contains the desired data can be requested. Any number of parameters can be associated with a scrapeable file. These may be GET or POST parameters, or authentication credentials (Basic, Digest, or NTLM) that need to be sent when the file is requested. For each scrapeable file that's requested, any number of extractor patterns can be applied to the text retrieved from the page in order to extract the desired pieces. Throughout this process scripts can be invoked to perform tasks such as inserting extracted data into a database or requesting subsequent scrapeable files. As a scraping session runs, screen-scraper logs the activity and records the request and response for each scrapeable file that gets requested.


API Documentation

Using Scraping Sessions

Overview

A scraping session is simply a way to collect together files that you want scraped. Typically you'll create a scraping session for each site you want to scrape information from.

You can create a new scraping session by clicking the New Scraping Session button (looks like a gear) or by selecting "File->New Scraping Session" from the menu.

General tab



The "General" tab allows you to manage basic actions and information related to the scraping session.

Scripts tab



Using this tab scripts can be designated to run either before or after the scraping session runs. This can be useful for functions like initializing session variables and performing clean-up after the session is finished. The script to be run is designated under the "Script Name" column. The sequence the scripts should be invoked in is determined by the "Sequence" column. Indicate the event that should trigger the script using the "When to Run" column. If the checkbox in the "Enabled?" column is not checked the script will not get run.

Log tab



The "Log" tab displays messages as the scraping session is running. This is one of the most valuable tools in working with and debugging scraping sessions. As you're creating your scraping session you'll want to run it frequently and check the log to ensure that it's doing what you expect it to.

Advanced tab



This tab contains a number of settings that may be required when working with certain sites.

Anonymization tab



See the Anonymization page of the documentation for details on this pane.

Running Scraping Sessions Within Scraping Sessions (enterprise edition only)

It is also possible to run a scraping session within a scraping session that is already running via the RunnableScrapingSession class. Detailed documentation on the methods available for the RunnableScrapingSession class is in our API documentation. Here's a specific example of how the RunnableScrapingSession might be used in a screen-scraper script:

// Generate a new RunnableScrapingSession object that will inherit
// from the current scraping session.  This object will be used
// to run the scraping session "My Session".
myRunnableScrapingSession = new com.screenscraper.scraper.RunnableScrapingSession( "My Session", session );

// Because we passed the "session" object to the RunnableScrapingSession
// it will have access to all of the session variables within the
// currently running session.  As such, there's no need to explicitly
// set any new session variables.  We simply tell it to scrape.
myRunnableScrapingSession.scrape();

// Once it's done scraping, because it inherited from our currently
// running scraping session, we have access to any session variables
// that were set when the RunnableScrapingSession ran in the context
// of our currently running scraping session.  For example, let's
// suppose that when the RunnableScrapingSession ran it set a new
// variable called "MY_VAR".  Because of the inheritance, we could
// do something like this to see the new value:
session.log( "MY_VAR: " + session.getVariable( "MY_VAR" ) );


Using Scrapeable Files

Overview

A scrapeable file is a URL-accessible file that you want to have retrieved as part of a scraping session. These files are the core of screen-scraping as they determine what information will be made available to extract data from.

Scrapeable files are created by clicking the "Add Scrapeable File" button from the "General" tab for a scraping session. You can delete a scrapeable file by right-clicking (or control-clicking in Mac OS X) it in the tree on the left side of the screen and selecting "Delete".

In addition to working with files on remote servers, screen-scraper can also handle files on local file systems. For example, the following is a valid path to designate in the URL field: C:\wwwroot\myweb\my_file.htm.

Properties tab



The "Properties" tab defines basic settings needed to request a file.

Parameters tab



"Get" and "Post" Parameters

The "Parameters" tab indicates GET and POST parameters that should be sent when the file is requested. Note that GET parameters can also be embedded in the "URL" field under the "Properties" tab. Parameters are added using the "Add Parameter" button. They can be deleted by selecting them and either hitting the "Delete" key on the keyboard, or by right-clicking (control-clicking in Mac OS X) and selecting "Delete".

Upload a File

In the Enterprise Edition of screen-scraper you can also designate files to be uploaded. This is done by designating "FILE" as the parameter type. The "Key" column would contain the name of the parameter (as found in the corresponding HTML form), and the value would be the local path to the file you'd like to upload (e.g., C:\myfiles\this_file.txt).

Embed Variables

Embedded session variables can be used in the "Key" and "Value" fields for parameters. For example, if you have a "username" POST parameter you might embed a USERNAME session variable in the "Value" field with the token ~#USERNAME#~. This would cause the value of the "USERNAME" session variable to be substituted in at run time.

Extractor Patterns tab



This tab holds the various extractor patterns that will be applied to the HTML of this scrapeable file. See the using extractor patterns page for more information.

Scripts tab



Using this tab scripts can be designated to run either before or after the file is requested. This can be useful for functions like setting session variables and requesting multiple pages of search results. The script to be run is designated under the "Script Name" column. The sequence the scripts should be invoked in is determined by the "Sequence" column. Indicate the event that should trigger the script using the "When to Run" column. If the checkbox in the "Enabled?" column is not checked the script will not get run.

Last Request tab



This tab will display the raw HTTP request for the last time this file was retrieved. This tab can be useful for debugging in looking at POST and GET parameters that were sent to the server.

Last Response tab



This tab displays the raw HTTP and HTML from the last time this file was requested. The most common use for this tab is in generating and testing extractor patterns. You can generate an extractor pattern by highlighting a block of text or HTML, right-clicking (control-clicking on Mac OS X) and selecting "Generate extractor pattern from selected text".

The "Render HTML"/"View Source" button allows you to toggle between a rendered version of the page and the raw HTML source. In certain cases the HTML may contain embedded JavaScript and complex DHTML that screen-scraper has difficulty rendering. You can also use the "Display Response in Browser" button to display the web page in your default web browser.

Note that the contents shown under the "Last Response" tab might appear different from the original HTML of the page. screen-scraper has the ability to "tidy" the HTML, which can facilitate data extraction. See using extractor patterns for more details on tidying HTML.

When viewed as text, the HTML for the last response can be searched using the "Find..." button.

Advanced tab (professional and enterprise editions only)



This tab contains a few advanced settings.


Using Extractor Patterns

Overview

Extractor patterns allow you to pinpoint select snippets of data that you want extracted from a web page. They're often the most confusing part of screen-scraper, so you'll want to look over this page carefully. An extractor pattern is a block of text (usually HTML) that contains special tokens that will match pieces of data you're interested in extracting. These tokens are text labels surrounded by the delimiters ~@ and @~. The label between the delimiters should contain only alpha-numeric characters and underscores.

You can think of an extractor pattern like a stencil. A stencil is an image in cut-out form, often made of thin cardboard. As you place a stencil over a piece of paper, apply paint to it, then remove the stencil, the paint remains only where there were holes in the stencil. Analogously, you can think of placing an extractor pattern over the HTML of a web page where the tokens correspond to the holes where the paint would pass through. After an extractor pattern is applied it reveals the portions of the web page you'd like to extract.

Extractor pattern tokens designate regions where data elements are to be captured. For example, given the following HTML snippet:

<p>This is the <b>piece of text</b> I'm interested in.</p>

you would extract "piece of text" by creating an extractor pattern with a token positioned like so:

<p>This is the <b>~@EXTRACTED_TEXT@~</b> I'm interested in.</p>

The extracted text could then be accessed via the identifier "EXTRACTED_TEXT".
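To make the stencil idea concrete, here is a minimal sketch in plain Java. It is not how screen-scraper is implemented; the class and method names are hypothetical. It assumes every token may match any text, turns the literal parts of the pattern into literal regular expression fragments, and turns each ~@TOKEN@~ into a capturing group.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of screen-scraper: it only illustrates the
// "stencil" idea by turning a pattern with ~@TOKEN@~ markers into a regex.
final class ExtractorPatternSketch {
    // A token is a label of letters, digits, and underscores between ~@ and @~.
    private static final Pattern TOKEN = Pattern.compile("~@(\\w+)@~");

    // Applies the pattern to the text and returns token name -> extracted value
    // for the first match, or an empty map if the pattern does not match.
    static Map<String, String> extract(String extractorPattern, String text) {
        List<String> names = new ArrayList<>();
        StringBuilder regex = new StringBuilder();
        Matcher tokens = TOKEN.matcher(extractorPattern);
        int last = 0;
        while (tokens.find()) {
            // Text outside the tokens must match literally.
            regex.append(Pattern.quote(extractorPattern.substring(last, tokens.start())));
            // Each token becomes a "hole" in the stencil.
            regex.append("(.*?)");
            names.add(tokens.group(1));
            last = tokens.end();
        }
        regex.append(Pattern.quote(extractorPattern.substring(last)));

        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = Pattern.compile(regex.toString(), Pattern.DOTALL).matcher(text);
        if (m.find()) {
            for (int i = 0; i < names.size(); i++) {
                result.put(names.get(i), m.group(i + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<p>This is the <b>piece of text</b> I'm interested in.</p>";
        String pattern = "<p>This is the <b>~@EXTRACTED_TEXT@~</b> I'm interested in.</p>";
        System.out.println(extract(pattern, html).get("EXTRACTED_TEXT")); // piece of text
    }
}
```

In screen-scraper itself you would not write this code; the workbench applies the pattern for you and exposes the value under the token's identifier.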

If you haven't done so already, we'd recommend going through our first tutorial to get a better feel for using extractor patterns.

Managing extractor patterns



Any number of extractor patterns can be associated with a given scrapeable file; they are managed by clicking on a scrapeable file, then on the "Extractor Patterns" tab. Add an extractor pattern by clicking on the "Add Extractor Pattern" button. Extractor patterns are applied to the file in a designated sequence, and any number of tokens can appear within an extractor pattern.

The recommended way to create a token is to simply select a region of text in an extractor pattern, right-click (or control-click in Mac OS X) the selected region and select "Generate extractor pattern token from selected text" in the pop-up menu (note that this can be done either when viewing the extractor pattern as HTML or as plain text). Creating an extractor pattern token in this manner will open a window that allows you to edit the attributes of the token. It's also often helpful to use an external text editor when creating extractor patterns where you can store snippets of HTML you're working with. You can then copy text into screen-scraper, as needed.

If an extractor pattern takes too long to match a block of text it will time out. The timeout setting may be adjusted from the "Settings" window (click on the Options->Settings menu item) under the "General" tab. If you find that your extractor pattern is timing out you might try adjusting it by using more precise regular expressions. The tips at the bottom of this page might also help.

Note that when creating extractor patterns you should use the HTML that will be found under the "Last Response" tab associated with a scrapeable file. By default, screen-scraper will "tidy" the HTML once it's been scraped, meaning that it will format it in a consistent way that makes it easier to work with. If you use the HTML by viewing the source for a page in your web browser it will likely be different from the HTML that screen-scraper generates.

Main tab



The main tab allows you to edit the primary attributes of the pattern, and contains the following elements:

Sub-extractor patterns tab



This tab allows you to add, edit, delete, and test sub-extractor patterns. See the "Using sub-extractor patterns" section below for more on this.

Advanced tab (professional and enterprise editions only)



The advanced tab provides extended control over extractor patterns, described below.

Extractor pattern tokens

Extractor pattern tokens can be edited by double-clicking them, or by selecting the label between the ~@ @~ delimiters, then right-clicking (control-clicking on Mac OS X), and selecting "Edit extractor pattern token". This will display a small dialog box with a tabbed pane. Each pane is described below.

Extractor pattern tokens "General" tab



Extractor pattern tokens "Regular Expression" tab



Here you can designate a regular expression that will be used to match the text covered by this token. You can either enter one in the text box, or select one from the drop-down list. The regular expressions that appear in the drop-down list can be edited by selecting "Edit regular expressions" from the "Options" menu.

Extractor pattern tokens "Mapping" tab (enterprise edition only)



The mapping tab allows you to alter extracted values. Often once you extract data from a web page you need to put it into a consistent format. For example, you may want products with very similar names to have identical names.

screen-scraper makes use of mapping sets when determining how to map a given extracted value. A mapping set may contain any number of mappings, which screen-scraper will analyze in sequence until it finds a match, or runs out of mappings. As such, you'll often want to put more specific mappings higher in sequence than more general mappings.

The various columns in a mapping are defined below:

You can create a new mapping set by simply typing a name into the "Set" box. Sets can be deleted via the "Delete Set" button, and an individual mapping can be added by clicking the "Add Mapping" button. Individual mappings can be deleted by selecting them, then right-clicking (control-clicking on Mac OS X) and selecting "Delete".

Consider the screen-shot of the "Mapping" tab above. If the extracted value were Widget 123 screen-scraper would first try to match using the Widget 1 mapping. Because this is an "Equals" match the mapping wouldn't occur, so screen-scraper would proceed to the second mapping. The second mapping would match because a "Contains" type was designated. That is, the text Widget 123 contains the text Widget. As such, the extracted data Widget 123 would become Product ABC, because that is the "To" value designated for the second mapping.
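The first-match behavior described above can be sketched in plain Java. This is only an illustration under stated assumptions: the class, enum, and method names are hypothetical rather than screen-scraper's API, the "To" value of the first mapping ("Product XYZ") is made up because the text doesn't give it, and returning the value unchanged when nothing matches is an assumption.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a mapping set; the names are illustrative and are
// not screen-scraper's API.
final class MappingSetSketch {
    enum MatchType { EQUALS, CONTAINS }

    // One mapping: how to compare, the "From" value, and the "To" value.
    static final class Mapping {
        final MatchType type;
        final String from;
        final String to;

        Mapping(MatchType type, String from, String to) {
            this.type = type;
            this.from = from;
            this.to = to;
        }

        boolean matches(String value) {
            return type == MatchType.EQUALS ? value.equals(from) : value.contains(from);
        }
    }

    // Tries each mapping in sequence and returns the "To" value of the first
    // match. Returning the value unchanged when nothing matches is an assumption.
    static String map(List<Mapping> mappingSet, String extracted) {
        for (Mapping mapping : mappingSet) {
            if (mapping.matches(extracted)) {
                return mapping.to;
            }
        }
        return extracted;
    }

    // The two mappings from the example; "Product XYZ" is a made-up "To" value.
    static List<Mapping> exampleSet() {
        return Arrays.asList(
            new Mapping(MatchType.EQUALS, "Widget 1", "Product XYZ"),
            new Mapping(MatchType.CONTAINS, "Widget", "Product ABC"));
    }

    public static void main(String[] args) {
        System.out.println(map(exampleSet(), "Widget 123")); // Product ABC
        System.out.println(map(exampleSet(), "Widget 1"));   // Product XYZ
    }
}
```

Swapping the two mappings would send "Widget 1" to Product ABC as well, which is why more specific mappings usually belong higher in the sequence.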

When using regular expressions in your mapping you can also make use of back references. Back references allow you to preserve values in the original text when mapped to the "To" value. For example, if you were mapping the value Widget 123 you could use the regular expression Widget (\d*). In the "To" column you could then enter the value Product \1, which, when mapped, would convert Widget 123 to Product 123. The value in parentheses in the "From" column gets inserted via the \1 marker found in the "To" column.
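If you'd like to try a back reference outside of screen-scraper, the same substitution can be reproduced with Java's standard regular expression support. This is a sketch rather than screen-scraper's own code, and the class name is hypothetical. Note one difference: java.util.regex writes the back reference in the replacement as $1, where the "To" column uses \1.

```java
// Hypothetical stand-alone illustration of the Widget/Product mapping.
final class BackReferenceSketch {
    // "From": Widget (\d*)   "To": Product \1 (written as $1 in Java).
    static String map(String extracted) {
        return extracted.replaceAll("Widget (\\d*)", "Product $1");
    }

    public static void main(String[] args) {
        System.out.println(map("Widget 123")); // Product 123
    }
}
```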

Extractor pattern tokens advanced tab (enterprise edition only)



Filtering duplicates (professional and enterprise editions only)

When extracting records from web sites you'll often want to filter out duplicates. screen-scraper provides a method whereby this can be done automatically. To filter duplicates for data extracted by a given extractor pattern you'll want to go to the "Advanced" tab, then check the boxes labeled "Automatically save the data set generated by this extractor pattern in a session variable", "Filter duplicate records", and "Cache the data set". This will cause screen-scraper to generate a session variable with the same name as the extractor pattern identifier, and will save any records extracted by the pattern to the file system rather than saving them in memory.

Once you've set up the extractor pattern to cache and save the data set, you'll need to designate the fields that would identify a unique record. That is, when filtering duplicates screen-scraper will compare the values for designated columns in order to determine if a duplicate record already exists (more or less like a database compound key). You designate an extractor pattern token to be used in determining uniqueness by editing it, and checking the "Use to filter duplicates" box found under the "Advanced" tab.

Because screen-scraper filters duplicates as it's scraping you'll want to wait until the end to make use of the data. For example, if you want all of the data written to a .CSV file you would want to invoke the script that does that after the scraping session has ended. That way you can guarantee that all of the data has been extracted and filtered before you save it.
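The compound-key comparison can be pictured with a short sketch in plain Java. It is an illustration only, not screen-scraper's implementation: the class and method names are hypothetical, records are modeled as simple maps, and the sample records are made up. A record is kept only if the combination of its key-field values has not been seen before.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of filtering duplicates on a compound key.
final class DuplicateFilterSketch {
    // Keeps the first record seen for each combination of key-field values.
    static List<Map<String, String>> filter(List<Map<String, String>> records,
                                            List<String> keyFields) {
        Set<List<String>> seen = new HashSet<>();
        List<Map<String, String>> unique = new ArrayList<>();
        for (Map<String, String> record : records) {
            // Build the compound key from the designated fields only.
            List<String> key = new ArrayList<>();
            for (String field : keyFields) {
                key.add(record.get(field));
            }
            // Set.add returns false when the key was already present.
            if (seen.add(key)) {
                unique.add(record);
            }
        }
        return unique;
    }

    // Convenience builder for the made-up sample records below.
    static Map<String, String> record(String name, String phone, String note) {
        Map<String, String> r = new LinkedHashMap<>();
        r.put("NAME", name);
        r.put("PHONE", phone);
        r.put("NOTE", note);
        return r;
    }

    public static void main(String[] args) {
        List<Map<String, String>> records = Arrays.asList(
            record("Juan Ferrero", "111-222-3333", "first page"),
            record("Juan Ferrero", "111-222-3333", "second page"),
            record("Sherry Lloyd", "234-5678", "first page"));
        // NAME and PHONE identify a unique record; NOTE is ignored.
        System.out.println(filter(records, Arrays.asList("NAME", "PHONE")).size()); // 2
    }
}
```

Because NOTE is not one of the key fields, the second "Juan Ferrero" record counts as a duplicate even though its note differs.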

Using sub-extractor patterns

Sub-extractor patterns allow you to extract data in smaller pieces, providing significantly more flexibility in pinpointing the specific pieces you're after. Consider a search results page consisting of rows and columns of data. Using normal extractor patterns you would use a single pattern to extract the data from all columns for a single row. In many cases this works just fine; however, the process gets more complicated when each row differs significantly. For example, certain cell rows may be in different colors or their contents may be completely missing. With a normal extractor pattern it would be difficult to account for the variability in the cells. By using sub-extractor patterns you could create a normal extractor pattern to extract an entire row, then use individual sub-extractor patterns to pull out the individual cells.

Consider the following HTML table:

Name           Phone                        Address
Juan Ferrero   111-222-3333                 123 Elm St.
Joe Bloggs     No contact information available (spans Phone and Address)
Sherry Lloyd   234-5678 (needs area code)   456 Maple Rd.

Here is the corresponding HTML source:

<table cellpadding="2" border="1">
<tr><th>Name</th><th>Phone</th><th>Address</th></tr>
<tr><td class="Name">Juan Ferrero</td><td class="Phone">111-222-3333</td><td class="Address">123 Elm St.</td></tr>
<tr><td class="Name" bgcolor="red">Joe Bloggs</td><td colspan="2">No contact information available</td></tr>
<tr><td class="Name">Sherry Lloyd</td><td class="Phone" bgcolor="yellow">234-5678 (needs area code)</td><td class="Address">456 Maple Rd.</td></tr>
</table>

It would be difficult (if not impossible) to write a single extractor pattern that would extract the information for each row because the contents of the cells differ so significantly. The different colored cells and the cell spanning two columns make the data too inconsistent to be extracted using a single pattern.

Consider this extractor pattern:

<tr><td~@DATARECORD@~/td></tr>

If applied to the HTML above the extractor pattern would produce the following three matches:

1.  class="Name">Juan Ferrero</td><td class="Phone">111-222-3333</td><td class="Address">123 Elm St.<
2.  class="Name" bgcolor="red">Joe Bloggs</td><td colspan="2">No contact information available<
3.  class="Name">Sherry Lloyd</td><td class="Phone" bgcolor="yellow">234-5678 (needs area code)</td><td class="Address">456 Maple Rd.<

Sub-extractor patterns would allow you to extract individual pieces of information from each row. For example, consider this sub-extractor pattern:

Name">~@NAME@~</td>

If applied to each of the individual extracted rows above the following three pieces of information would be extracted:

1.  Juan Ferrero
2. 
3.  Sherry Lloyd

Note that "Joe Bloggs" didn't get extracted because the cell his name is in is red: the bgcolor attribute sits between Name" and the closing >, so the pattern no longer matches. Let's adjust the sub-extractor pattern slightly:

Name"~@nonhtml@~>~@NAME@~</td>

The ~@nonhtml@~ tag represents an extractor pattern token that uses the "Non-HTML tags" regular expression: [^<>]*. It matches any characters from the point where the token begins until it encounters either an opening or closing HTML bracket. In this particular case the effect is that all three names get extracted. To extract the phone number you'd use this sub-extractor pattern:

<td class="Phone"~@nonhtml@~>~@PHONE@~</td>

We have the case, however, of the cell in the second row that spans two columns, which would not get extracted by the sub-extractor pattern. We may still want this information, however, so we create the following sub-extractor pattern, just in case the cell exists:

<td colspan="2">~@PHONE@~<

If applied to our data we'd get the following results:

1. 
2. No contact information available
3.

Sub-extractor patterns aggregate everything that's extracted into a single data set. Using all of our extractor and sub-extractor patterns together we'd get the following data set:

Data record #    Name           Phone
Data record #1   Juan Ferrero   111-222-3333
Data record #2   Joe Bloggs     No contact information available
Data record #3   Sherry Lloyd   234-5678 (needs area code)
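The whole walk-through can be reproduced with ordinary regular expressions. The sketch below is plain Java and is not screen-scraper's implementation: the class name is hypothetical, each pattern is hand-translated into a regex (with [^<>]* standing in for ~@nonhtml@~), and letting a later sub-pattern overwrite an earlier one for the same token is an assumption that doesn't come into play with this data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: a row pattern plus sub-patterns, expressed as regexes.
final class SubExtractorSketch {
    static final String SAMPLE_HTML = String.join("\n",
        "<table cellpadding=\"2\" border=\"1\">",
        "<tr><th>Name</th><th>Phone</th><th>Address</th></tr>",
        "<tr><td class=\"Name\">Juan Ferrero</td><td class=\"Phone\">111-222-3333</td>"
            + "<td class=\"Address\">123 Elm St.</td></tr>",
        "<tr><td class=\"Name\" bgcolor=\"red\">Joe Bloggs</td>"
            + "<td colspan=\"2\">No contact information available</td></tr>",
        "<tr><td class=\"Name\">Sherry Lloyd</td><td class=\"Phone\" bgcolor=\"yellow\">"
            + "234-5678 (needs area code)</td><td class=\"Address\">456 Maple Rd.</td></tr>",
        "</table>");

    // Regex equivalent of the row pattern <tr><td~@DATARECORD@~/td></tr>.
    private static final Pattern ROW = Pattern.compile("<tr><td(.*?)/td></tr>");

    // Token name and regex equivalent of each sub-extractor pattern.
    private static final String[][] SUB_PATTERNS = {
        {"NAME", "Name\"[^<>]*>(.*?)</td>"},
        {"PHONE", "<td class=\"Phone\"[^<>]*>(.*?)</td>"},
        {"PHONE", "<td colspan=\"2\">(.*?)<"},
    };

    // Returns one data record per matched row, aggregating all sub-pattern matches.
    static List<Map<String, String>> extract(String html) {
        List<Map<String, String>> dataSet = new ArrayList<>();
        Matcher row = ROW.matcher(html);
        while (row.find()) {
            Map<String, String> dataRecord = new LinkedHashMap<>();
            for (String[] sub : SUB_PATTERNS) {
                // Sub-patterns are applied only to the text the row pattern captured.
                Matcher m = Pattern.compile(sub[1]).matcher(row.group(1));
                if (m.find()) {
                    dataRecord.put(sub[0], m.group(1));
                }
            }
            dataSet.add(dataRecord);
        }
        return dataSet;
    }

    public static void main(String[] args) {
        for (Map<String, String> dataRecord : extract(SAMPLE_HTML)) {
            System.out.println(dataRecord);
        }
    }
}
```

Running it prints three records whose NAME and PHONE values match the data set shown above; the header row is skipped because it uses th cells rather than td cells.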

There are a couple of important things to note about sub-extractor patterns:

Tips on using extractor patterns


Using Scripts

Overview

screen-scraper has a built-in scripting engine to facilitate dynamically scraping sites and working with data once it's been extracted. Depending on your needs scripts can be helpful for such things as interacting with databases and dynamically determining which files get scraped when.

Invoking scripts in screen-scraper is similar to other programming languages in that they're tied to events. Just as you might designate a block of code to be run when a button is clicked in Visual Basic, in screen-scraper you might run a script after an HTML file has been downloaded or data has been extracted from a page.

Depending on your preferences, there are a number of languages that scripts can be written in. screen-scraper supports JavaScript, Interpreted Java, and Python on any platform, and JScript, VBScript, and Perl when running on Windows. Try the links at the bottom of this screen for information specific to each of the scripting languages.

If you haven't done so already, we'd highly recommend taking some time to go through our tutorials in order to get more familiar with how scripts are used.

Managing scripts

Scripts are added by clicking the "New Script" button (looks like a pencil and paper) or by selecting "File->New Script" from the menu bar. Delete a script either by selecting it and pressing the "Delete" key or by right-clicking it (or control-clicking on Mac OS X) and selecting "Delete".

Each script is given a unique name so that you can easily indicate when it should be invoked (e.g. before a scraping session begins or after each application of an extractor pattern). You can also select the language the script is written in. Scripts can be exported to an XML file so that they can be backed up or transferred to other instances of screen-scraper. See the Importing and exporting objects page for more information on this. Clicking on the "Show Script Instances" button will display any locations where this script is invoked in the format scraping session: scrapeable file: extractor pattern.

Finally, you're given a text box in which to write your script. The text editing features for authoring scripts in screen-scraper are currently fairly limited, so you may want to consider using an external editor, then copying and pasting text in to screen-scraper.

Using scripts

You designate a script to be executed by associating it with some event. For example, if you click on a scraping session in the tree, then on the "Scripts" tab, you'll notice that you can designate scripts to be invoked either before a scraping session begins or after it completes. Other events that can be used to invoke scripts relate to scrapeable files and extractor patterns. After associating a script with an object in this way it can be disassociated by selecting it in the table and pressing the "Delete" key or by right-clicking it (or control-clicking on Mac OS X) and selecting "Delete". You can also selectively enable and disable scripts using the "Enabled?" checkbox in the rightmost column.

Working with external Java libraries

Existing Java code can be referred to from within scripts. Simply copy any jar files you'd like to reference from within scripts into the "lib\ext" folder found in screen-scraper's directory. Note that you'll still need to use the "import" statement within your scripts to refer to specific classes, like this:

import com.foo.bar.*;

Please note: screen-scraper 4.0 was built on a Java 1.5 platform. Your Java scripts require at least a version 1.5 JRE in order to compile and run properly.

Built-in objects

screen-scraper offers a few objects that you can work with in a script. Bear in mind that not all of these variables will be available in all scripts. See the Variable scope section (following this one) for more details. You can view details on all of the objects and their methods in our API Documentation.

Variable scope

Depending on when a script gets run different variables may be in scope. When associating a script with an object, such as a scraping session or scrapeable file, you're asked to specify when the script is to be run. The table that follows specifies what variables will be in scope depending on when a given script is run. Note that none of the variables will be in scope when a script is invoked directly, though it is common in these scripts to create RunnableScrapingSession objects.

When Script is Run (X = in scope)   session   scrapeableFile   dataSet   dataRecord
Before scraping session begins      X
After scraping session ends         X
Before file is scraped              X         X
After file is scraped               X         X
Before pattern is applied           X         X
After pattern is applied            X         X                X
After each pattern application      X         X                X         X

Debugging scripts

One of the best ways to fix errors is to simply watch the scraping session log (under the "Log" tab) and the "error.log" file (located in the "log" directory where screen-scraper was installed) for script errors. When a problem arises in executing a script screen-scraper will output a series of error-related statements to the logs. Often a good approach in debugging is to build your script bit by bit, running it frequently to ensure that it runs without errors as you add each piece.

When screen-scraper is running as a server it will automatically generate individual log files in the "log" directory for each running scraping session (this can be disabled in the settings window). An "error.log" file will also be generated in that same directory when internal screen-scraper errors occur.

The "Breakpoint" window can also be invaluable in debugging scripts. You can invoke it by inserting the line session.breakpoint() into your script. While the "Breakpoint" is displayed script execution will halt. There are two buttons along the top of the window. The "play" button will simply continue execution of your script. Clicking the "stop" button will cause screen-scraper to halt execution as soon as it can. The "Breakpoint" window also exposes any session variables, data sets, and data records that are in scope. These values can be altered in the "Breakpoint" window as well.


Scripting in screen-scraper

Using Session Variables

Overview

Session variables allow you to persist values across the life of a scraping session.

Setting session variables

There are a few different ways to set session variables. The first is within a script using the session.setVariable( String identifier, Object value ) method. A second is to designate that the value extracted by a specific token in an extractor pattern should be saved in a session variable (see using extractor patterns for more on this). Third, session variables can be set when using RemoteScrapingSession objects from external sources (such as a PHP or ASP script) via their setVariable methods.

Retrieving values from session variables

There are two ways to retrieve the values of session variables. The first is within a script using the session.getVariable( String identifier ) method. The second is to embed the identifier for the session variable, surrounded by the ~# and #~ delimiters. For example, if you have a session variable identified by QUERY_PARAM you might embed it into the URL field of a scrapeable file like this:

http://www.mydomain.com/myscript.php?query=~#QUERY_PARAM#~

screen-scraper will automatically replace the ~#QUERY_PARAM#~ text with the actual value of the corresponding session variable.
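The substitution can be pictured with a short sketch in plain Java. This is an illustration and not screen-scraper's code: the class and method names are hypothetical, session variables are modeled as a simple map, and leaving unknown tokens untouched (and not URL-encoding the value) are assumptions of the sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of replacing ~#NAME#~ tokens with session variable values.
final class SessionVariableSketch {
    private static final Pattern EMBEDDED = Pattern.compile("~#(\\w+)#~");

    static String substitute(String template, Map<String, String> sessionVariables) {
        Matcher m = EMBEDDED.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = sessionVariables.get(m.group(1));
            // Leaving unknown tokens untouched is an assumption of this sketch.
            String replacement = value != null ? value : m.group();
            // quoteReplacement keeps "$" and "\" in values from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> sessionVariables = new HashMap<>();
        sessionVariables.put("QUERY_PARAM", "widgets");
        String url = "http://www.mydomain.com/myscript.php?query=~#QUERY_PARAM#~";
        // Prints http://www.mydomain.com/myscript.php?query=widgets
        System.out.println(substitute(url, sessionVariables));
    }
}
```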


Scripting in Interpreted Java

screen-scraper uses the BeanShell library to allow for scripting in Java. If you've done some programming in C or JavaScript you'll probably find BeanShell's syntax familiar. Documentation for BeanShell is excellent, and we'd recommend referring to it as you program.

See the using scripts page for details on objects and methods that you can make use of in a script.

Remember that you can access external Java libraries by placing .jar files inside the "ext" directory found in the "lib" folder of your screen-scraper installation. You will need to use at least Java version 1.5.


Scripting in VBScript

If you've programmed in Visual Basic or Active Server Pages you should find scripting in screen-scraper to be similar. Using VBScript within screen-scraper can only be done on a Windows platform, and requires that the VBScript runtime be installed. The chances are good that you've already got the VBScript runtime on your system, but if not you can download it from Microsoft's Script Downloads page. screen-scraper will automatically detect if the VBScript runtime is installed, which you can see by selecting a script within screen-scraper (from the tree on the left of the application) and clicking on the "Language" drop-down list. If you don't see "VBScript" in the list then the runtime needs to be installed.

Please be aware that, because of a bug in the third-party library that allows screen-scraper to integrate with the Microsoft Scripting Engine, problems can occur if multiple VBScript scripts are run simultaneously. If you're using the professional edition of screen-scraper and plan on running multiple scraping sessions simultaneously, you should use Interpreted Java, JavaScript, or Python as a scripting language.

Because screen-scraper uses the native VBScript engine, all ActiveX objects installed on the computer (such as ADO or the FileSystemObject) can be accessed. All of the objects mentioned on the Using scripts page are also available.

Java classes can also be instantiated within a script using the CreateBean function. For example, the following script will instantiate a RunnableScrapingSession for the "Weather" scraping session (which is found in the default screen-scraper installation) and run it:

' Generate a new "Weather" scraping session.
Set runnableScrapingSession = CreateBean( "com.screenscraper.scraper.RunnableScrapingSession", "Weather" )

' Put the zip code in a session variable so we can reference it later.
runnableScrapingSession.SetVariable "ZIP_CODE", "90001"

' Tell the scraping session to scrape.
runnableScrapingSession.Scrape


Scripting in JavaScript

screen-scraper uses Mozilla's Rhino scripting engine to allow scripts to be written in JavaScript. Documentation for Rhino is sparse, but the interpreter adheres strictly to the ECMAScript standard, so just about any JavaScript reference can be used. If you run into difficulties writing scripts in JavaScript because of the lack of documentation, you may want to consider using Interpreted Java instead, which has very similar syntax and is significantly better documented.

If you've worked with client-side JavaScript in web programming, you'll probably be comfortable using JavaScript in screen-scraper. One "gotcha" to be aware of is the method for using external classes. If you'd like to reference a class in the standard Java library, you'd do it like this:

// Declare an ArrayList.
var myArrayList = new java.util.ArrayList();

// Add two elements.
myArrayList.add( "one" );
myArrayList.add( "two" );

// Log the size.
session.log( "Size: " + myArrayList.size() );

However, classes in packages outside of the standard Java library must be prefixed with "Packages". Here's an example of creating and using a DataRecord object:

// Declare a new DataRecord object.
var myDR = new Packages.com.screenscraper.common.DataRecord();

// Give it a key/value pair.
myDR.put( "foo", "bar" );

// Log the value of the key.
session.log( "foo: " + myDR.get( "foo" ) );


Scripting in JScript

Writing scripts in JScript gives you the familiarity of a widely used language, while still providing access to commonly used Windows libraries. Using JScript within screen-scraper can only be done on a Windows platform, and requires that the JScript runtime be installed. The chances are good that you've already got the JScript runtime on your system, but if not you can download it from Microsoft's Script Downloads page. screen-scraper will automatically detect whether the JScript runtime is installed. To check, select a script within screen-scraper (from the tree on the left of the application) and click on the "Language" drop-down list. If you don't see "JScript" in the list then the runtime needs to be installed.

Please be aware that, because of a bug in the third-party library that allows screen-scraper to integrate with the Microsoft Scripting Engine, problems can occur if multiple JScript scripts are run simultaneously. If you're using the professional edition of screen-scraper and plan on running multiple scraping sessions simultaneously, you should use Interpreted Java, JavaScript, or Python as a scripting language.

Because screen-scraper uses the native JScript engine, all ActiveX objects installed on the computer (such as ADO or the FileSystemObject) can be accessed. All of the objects mentioned on the Using scripts page are also available.

Java classes can also be instantiated within a script using the CreateBean function. For example, the following script will instantiate a RunnableScrapingSession for the "Weather" scraping session (which is found in the default screen-scraper installation) and run it:

// Generate a new "Weather" scraping session.
var runnableScrapingSession = CreateBean( "com.screenscraper.scraper.RunnableScrapingSession", "Weather" );

// Put the zip code in a session variable so we can reference it later.
runnableScrapingSession.setVariable( "ZIP_CODE", "90001" );

// Tell the scraping session to scrape.
runnableScrapingSession.scrape();


Scripting in Perl

screen-scraper uses ActiveState's ActivePerl library to allow for scripts to be written in Perl. Using Perl within screen-scraper can only be done on a Windows platform, and requires that the ActivePerl runtime be installed, which can be downloaded from ActiveState's download page for free. screen-scraper will automatically detect if the ActivePerl runtime is installed, which you can see by selecting a script within screen-scraper (from the tree on the left of the application) and clicking on the "Language" drop-down list. If you don't see "Perl" in the list then the runtime needs to be installed.

Java classes can be instantiated within a script using the CreateBean function. For example, the following script will instantiate a RunnableScrapingSession for the "Weather" scraping session (which is found in the default screen-scraper installation) and run it:

# Generate a new "Weather" scraping session.
$runnableScrapingSession = CreateBean( "com.screenscraper.scraper.RunnableScrapingSession", "Weather" );

# Put the zip code in a session variable so we can reference it later.
$runnableScrapingSession->setVariable( "ZIP_CODE", "90001" );

# Tell the scraping session to scrape.
$runnableScrapingSession->scrape();


Scripting in Python

The Jython interpreter is used by screen-scraper to allow for scripting in Python. Jython is a very fast interpreter, and we'd recommend using it if you're familiar with the Python programming language.

When scripting in Python, all of the standard Java classes can be used. Classes must first be imported with an import statement, which is also required if you'd like to create one of screen-scraper's RunnableScrapingSession objects. Here's an example that will run the "Weather" scraping session (which is found in the default screen-scraper installation):

# Import the RunnableScrapingSession class.
from com.screenscraper.scraper import RunnableScrapingSession

# Generate a new "Weather" scraping session.
runnableScrapingSession = RunnableScrapingSession( "Weather" )

# Put the zip code in a session variable so we can reference it later.
runnableScrapingSession.setVariable( "ZIP_CODE", "90001" )

# Tell the scraping session to scrape.
runnableScrapingSession.scrape()

Notice that before the RunnableScrapingSession class can be used it first must be imported.


Writing extracted data to XML (enterprise edition only)

<