screen-scraper FAQ - General Technical


screen-scraper supports any character set supported by the Java 1.5 Virtual Machine. A complete list can be found here: http://java.sun.com/j2se/1.5/docs/guide/intl/encoding.doc.html.

No. screen-scraper is designed only to scrape data from web sites. If you're looking for a solution that can extract data from older mainframe-type applications, we'd recommend looking at Jagacy.

In order to install screen-scraper on a machine, you'll likely need administrative or root access. Generally this is not the case with virtual hosting, so you likely will not be able to run screen-scraper on your server.

Oftentimes this won't preclude you from using it, however. A common scenario is to scrape data on a local machine, write the data to a CSV file, then upload it to a server to be imported. If you have a database running on the server, you may also still be able to run screen-scraper from a local machine, then insert the scraped data into your database using the technique we describe in our fifth tutorial.
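As a rough illustration of the "write the data to a CSV file" step, here is a minimal sketch in plain Java. The class name `CsvExport`, the file name `scraped.csv`, and the sample rows are all made up for the example; this is not how screen-scraper itself writes files, just one way to produce a CSV that quotes fields containing commas or quotation marks.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of screen-scraper.
class CsvExport {
    // Quote a field only when it contains a comma, a quote, or a line break.
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Join one record into a single comma-separated line.
    static String toLine(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append(escape(fields.get(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Sample data standing in for scraped records.
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("title", "price"),
            Arrays.asList("Widget, large", "19.99"));
        try (PrintWriter out = new PrintWriter(
                Files.newBufferedWriter(Paths.get("scraped.csv"), StandardCharsets.UTF_8))) {
            for (List<String> row : rows) {
                out.println(toLine(row));
            }
        }
    }
}
```

The resulting file can then be uploaded and imported on the server with whatever tool your host provides.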

Sort of, yes. See this blog posting.

The short answer to this one is, "Sometimes." Almost all widgets (applets, etc.) that communicate with their server via HTTP can be scraped by screen-scraper. Oftentimes, however, they'll use a proprietary protocol. Most of the time Adobe Flash movies use HTTP when they need to communicate with a server, but Java applets and ActiveX controls don't always.

The easiest way to find out is to use screen-scraper's proxy server when interacting with a page containing one of these elements. Take a close look at the HTTP requests and responses passing between the web browser and the server. If you see text in there (often XML or URL-encoded lists of parameters), chances are good that screen-scraper can extract the information being passed between the client and server. Note, however, that the widget may display text that never gets passed between the client and server. Unfortunately, in such cases, screen-scraper is unable to extract that information. The only utilities we're aware of that may allow for scraping that type of information are IBM's Rational Robot and OpenSpan.

If you're using the Enterprise Edition of screen-scraper, this can be done via the web interface.

For the Basic and Professional editions, the best way to go about this is to use an external scheduler, such as the Windows Task Scheduler or the Unix cron daemon. You'll typically set up one of these schedulers to either invoke screen-scraper from the command line or to invoke a separate application, which in turn invokes screen-scraper while it's running as a server.

Unfortunately, the short answer to this question is, "it depends." If you're doing only very simple things with screen-scraper (e.g., scraping a few files once in a while) it could run comfortably in 64MB of RAM with a 500MHz processor. On the other end of the spectrum, if you're running multiple lengthy scraping sessions in parallel the memory and CPU requirements could climb quite a bit. Allocating the right amount of memory to screen-scraper invariably involves some experimentation. For example, you might run your scraping sessions in as realistic a scenario as possible, then use tools such as the Windows Task Manager or top to monitor CPU and memory usage. Remember that you can adjust the amount of memory screen-scraper is allocated by opening the "Settings" dialog box (click on the wrench icon), then altering the value labeled "Maximum memory allocation in megabytes".

It might also be helpful to look over the question below on optimizing scraping sessions.

screen-scraper will automatically follow certain redirects, so it depends on which type the web site uses. There are three types of redirects that are typically used on the web:

1. 3xx HTTP responses. These are probably the most common, and are the ones screen-scraper will automatically follow. For example, instead of responding with a 200: OK HTTP response, the server will respond with 302: Moved Temporarily, then supply the URL the browser is to redirect to in a "Location" HTTP header. In these cases you shouldn't need to do anything at all; screen-scraper will simply follow them as a browser would.

2. META refresh tags. These are special HTML tags, often embedded in a web page, that contain the URL the browser is to redirect to. screen-scraper will not automatically follow these, so you'd need to create a separate scrapeable file that requests the target URL. This might also involve extracting certain parameters from the URL before going to the redirected page.

3. JavaScript redirects. Occasionally sites will utilize client-side JavaScript to send the browser to a new location. As it pertains to screen-scraper, the technique for dealing with these is basically the same as that described in #2.
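To make the second and third cases more concrete, here is a minimal sketch in plain Java that pulls the target URL out of a META refresh tag or a simple JavaScript `location` assignment. The class `RedirectSniffer` and its patterns are illustrative assumptions, not screen-scraper features; within screen-scraper you would normally use an extractor pattern for the same job. The patterns only cover the most common literal forms and will miss URLs built up dynamically in script.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of screen-scraper.
class RedirectSniffer {
    // Matches e.g. <meta http-equiv="refresh" content="5; url=/next.html">
    // (http-equiv must appear before content; URL not wrapped in inner quotes).
    private static final Pattern META_REFRESH = Pattern.compile(
        "<meta[^>]*http-equiv\\s*=\\s*[\"']?refresh[\"']?[^>]*content\\s*=\\s*"
        + "[\"']\\s*\\d+\\s*;\\s*url\\s*=\\s*([^\"'>]+)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Matches e.g. window.location = "/next.html" or location.href='/next.html'
    private static final Pattern JS_REDIRECT = Pattern.compile(
        "(?:window\\.|document\\.)?location(?:\\.href)?\\s*=\\s*[\"']([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Returns the redirect target found in the page, or null if none is detected.
    static String findRedirect(String html) {
        Matcher m = META_REFRESH.matcher(html);
        if (m.find()) {
            return m.group(1).trim();
        }
        m = JS_REDIRECT.matcher(html);
        if (m.find()) {
            return m.group(1).trim();
        }
        return null;
    }
}
```

The extracted URL (or the parameters within it) is what the follow-up scrapeable file would request.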

Yes. This is a common situation, and generally just requires that you create a scrapeable file to handle logging in. This scrapeable file should be run first in the scraping session, allowing the web site to set cookies, which screen-scraper will then track for you.

For example, if you wanted to scrape a list of all auctions you're watching on the eBay web site, you would first create a scrapeable file that logs you in (by issuing a POST request with your username and password). You would then create subsequent scrapeable files to scrape the information you're interested in.
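For reference, the body of such a login POST is usually just a URL-encoded list of form fields. The sketch below, in plain Java, shows how that body is put together; the class `LoginForm` and the field names used in the usage note are invented for illustration, and the real names depend on the login form of the site in question. In screen-scraper you would enter these as POST parameters on the scrapeable file rather than build the string yourself.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;

// Hypothetical helper, not part of screen-scraper.
class LoginForm {
    // Build an application/x-www-form-urlencoded body from name/value pairs.
    static String encode(Map<String, String> params) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is required to be available on every JVM.
            throw new IllegalStateException(ex);
        }
        return body.toString();
    }
}
```

Given a `LinkedHashMap` holding `userid` = `jane@example.com` and `pass` = `p&ss word` (both made-up field names), `LoginForm.encode` yields `userid=jane%40example.com&pass=p%26ss+word`.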

There is also a special type of authentication known as "BASIC" or "WWW-Authenticate". You'll know a web site is using this when, upon attempting to access a particular URL, you are presented with a small dialog box requesting a username and password. To scrape a page that uses this type of authentication, simply enter the username and password in the "Properties" tab under "BASIC Authentication Parameters" for the scrapeable file you set up to scrape the page. Note that you generally only need to enter the username and password once for a given site, on a single scrapeable file, as screen-scraper will retain them for you.
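For the curious, BASIC authentication amounts to sending an `Authorization` header containing the Base64 encoding of `username:password` with each request. The sketch below shows that encoding in plain Java; the class name `BasicAuth` is made up, and screen-scraper builds this header for you once the credentials are entered, so nothing like this is needed in practice.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of screen-scraper.
class BasicAuth {
    // Value of the Authorization header for HTTP BASIC authentication.
    static String headerValue(String user, String password) {
        String pair = user + ":" + password;
        return "Basic " + Base64.getEncoder()
            .encodeToString(pair.getBytes(StandardCharsets.UTF_8));
    }
}
```

With the sample credentials `Aladdin` and `open sesame`, `BasicAuth.headerValue` returns `Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==`. Because Base64 is an encoding and not encryption, BASIC authentication is only safe over HTTPS.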

We give an example of configuring screen-scraper to log in to a site in our second tutorial.

screen-scraper supports HTTPS on all supported platforms except certain early versions of Mac OS X. If you're using the screen-scraper proxy server to access a site that uses HTTPS, follow the directions under "Viewing encrypted transactions" on this documentation page: Using the Proxy Server. When setting up scrapeable files to access pages that use HTTPS, you don't need to treat them any differently than those that use HTTP.

Absolutely. screen-scraper handles cookies (and BASIC authentication tokens) transparently behind the scenes. When setting up screen-scraper to scrape information from your site you rarely need to give any thought to cookies. In certain cases, sites will set cookies in JavaScript. In such cases, you can set them within a screen-scraper script via the session.setCookie method.
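When a cookie is set in JavaScript, the first step is finding its name and value in the page source. The following is a minimal sketch in plain Java that extracts them from a literal `document.cookie = "name=value; ..."` assignment. The class `JsCookie` and its pattern are assumptions made for illustration, and they will not handle cookie values computed at run time. The extracted pair is what you would then hand to session.setCookie inside a screen-scraper script; the exact arguments that method takes are not shown here, so check the scripting API documentation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of screen-scraper.
class JsCookie {
    // Matches a literal assignment such as: document.cookie = "visited=yes; path=/";
    private static final Pattern ASSIGNMENT = Pattern.compile(
        "document\\.cookie\\s*=\\s*[\"']([^=;\"']+)=([^;\"']*)");

    // Returns {name, value}, or null if no literal assignment is found.
    static String[] parse(String script) {
        Matcher m = ASSIGNMENT.matcher(script);
        if (!m.find()) {
            return null;
        }
        // Inside a screen-scraper script, this pair would be passed on to
        // session.setCookie (argument list intentionally not shown; see the API docs).
        return new String[] { m.group(1).trim(), m.group(2).trim() };
    }
}
```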