Using Scrapeable Files
Overview
A scrapeable file is a URL-accessible file that you want to have retrieved as part of a scraping session. Scrapeable files are the core of screen-scraping because they determine what content is available for data extraction.
Scrapeable files are created by clicking the "Add Scrapeable File" button from the "General" tab for a scraping session. You can delete a scrapeable file by right-clicking (or option-clicking in Mac OS X) it in the tree on the left side of the screen and selecting "Delete".
In addition to working with files on remote servers, screen-scraper can also handle files on local file systems. For example, the following is a valid path to designate in the URL field: C:\wwwroot\myweb\my_file.htm.
Properties tab

The "Properties" tab defines basic settings needed to request a file.
Name: Identifies the scrapeable file.
Parameters tab

"Get" and "Post" Parameters
The "Parameters" tab indicates GET and POST parameters that should be sent when the file is requested. Note that GET parameters can also be embedded in the "URL" field under the "Properties" tab. Parameters are added using the "Add Parameter" button. They can be deleted by selecting them and either hitting the "Delete" key on the keyboard, or by right-clicking (option-clicking in Mac OS X) and selecting "Delete".
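As an illustration of the two ways to supply GET parameters described above (embedded in the URL or listed separately), the following is a minimal Python sketch of how a list of key/value pairs is typically encoded into a request. It is not screen-scraper's own code; the function names `build_get_url` and `build_post_body` and the example URL are hypothetical.

```python
from urllib.parse import urlencode


def build_get_url(base_url, params):
    """Append GET parameters to a URL that may already carry a query string."""
    query = urlencode(params)
    if not query:
        return base_url
    # Use "&" when the URL field already embeds parameters, "?" otherwise.
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + query


def build_post_body(params):
    """Encode POST parameters as a form-urlencoded request body."""
    return urlencode(params).encode("ascii")


# Hypothetical parameters, as they might appear in the Key and Value columns.
params = [("q", "blue widgets"), ("page", "2")]

print(build_get_url("http://www.example.com/search", params))
print(build_post_body(params))
```

With GET, the parameters end up in the URL itself, which is why they can equally be typed directly into the "URL" field. With POST, the same pairs travel in the request body instead.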
Upload a File
In the Enterprise Edition of screen-scraper you can also designate files to be uploaded. This is done by designating "FILE" as the parameter type. The "Key" column contains the name of the parameter (as found in the corresponding HTML form), and the "Value" column contains the local path to the file you'd like to upload (e.g., C:\myfiles\this_file.txt).
Embed Variables
Embedded session variables can be used in the "Key" and "Value" fields for parameters. For example, if you have a "username" POST parameter you might embed a USERNAME session variable in the "Value" field with the token ~#USERNAME#~. This would cause the value of the "USERNAME" session variable to be substituted in at run time.
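The substitution described above can be pictured with a short Python sketch. This is only an approximation of the idea, not screen-scraper's implementation: the function name `substitute_tokens` is hypothetical, and leaving unknown tokens untouched is an assumption made for the example.

```python
import re

# Matches tokens of the form ~#NAME#~ and captures NAME.
TOKEN = re.compile(r"~#(\w+)#~")


def substitute_tokens(text, session_variables):
    """Replace each ~#NAME#~ token with the value of session variable NAME."""
    def replace(match):
        name = match.group(1)
        # Assumption: a token with no matching variable is left as-is.
        return str(session_variables.get(name, match.group(0)))
    return TOKEN.sub(replace, text)


# Hypothetical session variables and parameter values.
session_variables = {"USERNAME": "jsmith"}
print(substitute_tokens("~#USERNAME#~", session_variables))
print(substitute_tokens("user-~#USERNAME#~-~#MISSING#~", session_variables))
```

The first call yields the plain value of the USERNAME variable. The second shows that a token can sit inside a longer string.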
Extractor Patterns tab

This tab holds the various extractor patterns that will be applied to the HTML of this scrapeable file. See the using extractor patterns page for more information.
Scripts tab

On this tab you can designate scripts to run either before or after the file is requested. This is useful for tasks such as setting session variables and requesting multiple pages of search results. Designate the script to run in the "Script Name" column. The "Sequence" column determines the order in which the scripts are invoked, and the "When to Run" column indicates the event that triggers each script. A script runs only if the checkbox in its "Enabled?" column is checked.
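To make the interaction of these columns concrete, here is a small Python model of the table. It is a sketch under stated assumptions rather than a description of screen-scraper internals: the `ScriptEntry` class, the `scripts_to_run` function, the "before"/"after" labels, and the script names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScriptEntry:
    script_name: str      # "Script Name" column
    sequence: int         # "Sequence" column
    when_to_run: str      # "When to Run" column; "before" or "after" here
    enabled: bool = True  # "Enabled?" column


def scripts_to_run(entries, event):
    """Return the names of enabled scripts for an event, in sequence order."""
    selected = [e for e in entries if e.enabled and e.when_to_run == event]
    return [e.script_name for e in sorted(selected, key=lambda e: e.sequence)]


# Hypothetical rows for one scrapeable file.
entries = [
    ScriptEntry("Save results", 2, "after"),
    ScriptEntry("Set search term", 1, "before"),
    ScriptEntry("Request next page", 1, "after"),
    ScriptEntry("Debug dump", 3, "after", enabled=False),
]

print(scripts_to_run(entries, "before"))
print(scripts_to_run(entries, "after"))
```

In this model the disabled "Debug dump" row is skipped, and the two remaining "after" scripts run in the order given by their sequence numbers.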
Last Request tab

This tab displays the raw HTTP request from the last time this file was retrieved. It is useful for debugging because it shows the GET and POST parameters that were sent to the server.
Last Response tab

This tab displays the raw HTTP and HTML from the last time this file was requested. The most common use for this tab is in generating and testing extractor patterns. You can generate an extractor pattern by highlighting a block of text or HTML, right-clicking (option-clicking on Mac OS X) and selecting "Generate extractor pattern from selected text".
The "Render HTML"/"View Source" button allows you to toggle between a rendered version of the page and the raw HTML source. In certain cases the HTML may contain embedded JavaScript and complex DHTML that screen-scraper has difficulty rendering. You can also use the "Display Response in Browser" button to display the web page in your default web browser.
Note that the contents shown under the "Last Response" tab might differ from the original HTML of the page. screen-scraper has the ability to "tidy" the HTML, which can facilitate data extraction. See using extractor patterns for more details on tidying HTML.
When viewed as text, the HTML for the last response can be searched using the "Find..." button.
Advanced tab (professional and enterprise editions only)

This tab contains a few advanced settings.