Getting Started
Overview of screen-scraping
What is screen-scraping?
Screen-scraping is the practice of extracting information from web sites so that it can be used in other contexts. The term has its roots in an earlier practice: reading the display of a mainframe terminal and re-purposing the information, via character recognition or some other method, in order to preserve the functionality of legacy applications.
Why do screen-scraping?
If possible, the preferred way to get information presented on a web site is through an RSS or other XML-based feed. Because data extracted from web sites is often used directly in existing applications, SOAP is another possible way to get at the needed information. Unfortunately, it's not always possible to get information using RSS or SOAP, which leaves screen-scraping as an approach for getting at the data. Take a look at our solutions page for specific examples of screen-scraping.
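To illustrate why a feed is preferable when one exists, here is a minimal Python sketch that reads items from an RSS 2.0 document using only the standard library. The sample feed, its URLs, and the `read_feed_items` helper are hypothetical and exist only for illustration; they are not part of screen-scraper.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document standing in for a real site's feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First story</title><link>https://example.com/1</link></item>
    <item><title>Second story</title><link>https://example.com/2</link></item>
  </channel>
</rss>"""


def read_feed_items(feed_xml):
    """Return a list of (title, link) tuples, one per <item> in the feed."""
    root = ET.fromstring(feed_xml)
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in root.iter("item")
    ]


if __name__ == "__main__":
    for title, link in read_feed_items(SAMPLE_FEED):
        print(title, link)
```

Because the feed is already structured, no knowledge of the page layout is needed. Screen-scraping becomes necessary when no such structured source is available.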
The basic approach
While it's typically fairly easy for a person to log in to a web site, navigate to a particular page, and copy information out of a document, a machine needs a lot more help. Web pages are designed to be viewed and used by humans, so in screen-scraping we typically need to take the same actions that a human would take when copying data from a web page. There are typically three phases in scraping information from a given page:
screen-scraper dramatically reduces the time required to perform all of these steps, so that you can focus on what to do with the extracted information.
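As a rough illustration of the manual work involved, the following Python sketch retrieves a page, pulls data out of the HTML with a pattern, and hands the result to other code. It is a generic standard-library example, not screen-scraper's own API; the `price` table cell, the regular expression, and all function names are assumptions made up for this example.

```python
import re
from typing import Dict, List
from urllib.request import urlopen

# Hypothetical pattern: captures the text inside each <td class="price"> cell.
PRICE_PATTERN = re.compile(r'<td class="price">\s*([^<]+?)\s*</td>')


def fetch_page(url: str) -> str:
    """Retrieve the raw HTML of a page, much as a browser would."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_prices(html: str) -> List[str]:
    """Apply the pattern to the HTML and return every captured value."""
    return PRICE_PATTERN.findall(html)


def to_records(prices: List[str]) -> List[Dict[str, str]]:
    """Package the extracted values so another application can use them."""
    return [{"price": price} for price in prices]


if __name__ == "__main__":
    # Inline sample HTML keeps the example runnable without network access;
    # in practice the string would come from fetch_page(some_url).
    sample_html = """
    <table>
      <tr><td class="name">Widget</td><td class="price"> $9.99 </td></tr>
      <tr><td class="name">Gadget</td><td class="price">$12.50</td></tr>
    </table>
    """
    print(to_records(extract_prices(sample_html)))
```

A hand-written pattern like this breaks as soon as the page layout changes, which is part of why dedicated tooling is useful.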
Legal issues
A good portion of the information on the web is copyrighted, which has legal implications for screen-scraping. Use discretion when grabbing data from web sites to be re-purposed.
Getting Started Using screen-scraper
Overview
Using screen-scraper to extract information from web sites typically consists of a few main steps:
Organize and configure files to be scraped. Once you've selected the files to scrape, you'll typically need to organize and sequence them. You'll also usually tweak information related to the files, such as POST data to be sent or authentication tokens.
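To make "POST data" and "authentication tokens" concrete, here is a small Python sketch of the kind of request details that usually need adjusting. It only builds the request and does not send it. The URL, the form field names, and the bearer-token header are all assumptions for illustration; screen-scraper manages these settings through its own interface, which is not shown here.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_login_request(url, username, password, token):
    """Build (but do not send) a form POST that carries an auth token."""
    # Form fields are URL-encoded and sent as the request body.
    body = urlencode({"username": username, "password": password}).encode("ascii")
    request = Request(url, data=body, method="POST")
    request.add_header("Content-Type", "application/x-www-form-urlencoded")
    # Hypothetical token scheme; real sites vary (cookies, hidden fields, etc.).
    request.add_header("Authorization", f"Bearer {token}")
    return request


if __name__ == "__main__":
    req = build_login_request(
        "https://example.com/login", "alice", "s3cret", "abc123"
    )
    print(req.get_method(), req.full_url)
    print(req.data)
```

Sequencing then amounts to issuing such requests in the right order, for example logging in before requesting a page that requires a session.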
The best way to learn to use screen-scraper is by going through our tutorials.