Getting Started

Overview of screen-scraping

What is screen-scraping?

Screen-scraping is the practice of extracting information from web sites so that it can be used in other contexts. The term has its roots in an earlier practice: reading the display of a mainframe terminal and re-purposing the information, via character recognition or some other method, in order to preserve the functionality of legacy applications.

Why do screen-scraping?

If possible, the preferred method for getting information presented on a web site is via something like an RSS or other XML-based feed. Because data extracted from web sites is often used directly in existing applications, SOAP is another possible way to get at the needed information. Unfortunately, it's not always possible to get information using RSS or SOAP, which leaves screen-scraping as an approach for getting at the data. Take a look at our solutions page for specific examples of screen-scraping.

The basic approach

While it's fairly easy for a person to log in to a web site, navigate to a particular page, and copy information out of a document, a machine needs a lot more help. Web pages are designed to be viewed and used by humans, so in screen-scraping we usually need to take the same actions that a human would take when copying data from a web page. There are typically three phases in scraping information from a given page:

  1. Request the page. This first part may actually be more complex than it sounds. Often the page that's needed can only be accessed after logging in to a site and following a series of specific links. Your web browser will typically handle things such as keeping track of cookies and submitting all of the elements of a form for you, but these details have to be handled explicitly when the requests are made by a program.
  2. Extract the information. Once the web page has been requested, the next step is to parse the HTML text so that specific pieces of data can be extracted and used within computer code. There are several ways to go about this. One possibility is to apply regular expressions, which often work well since they allow for relatively "fuzzy" searches. Another is to attempt to turn the HTML in the document into XML so that it can be queried using methods such as XPath.
  3. Do something with the extracted data. From here the information might be inserted into a database or perhaps re-formatted in some way to be presented to a user.
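
As a rough illustration of the three phases outside of screen-scraper, here is a minimal Python sketch that uses only the standard library. The table markup, the class names "name" and "price", the products table, and all function names are assumptions made up for this example; a real page would need its own pattern, and a login step would need its own form submission.

```python
import re
import sqlite3
import urllib.request
from http.cookiejar import CookieJar


def build_opener_with_cookies():
    """Phase 1 helper: an opener that remembers cookies between requests,
    much as a browser does after you log in."""
    jar = CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def request_page(opener, url):
    """Phase 1: request the page and return its HTML as text."""
    with opener.open(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Phase 2: a deliberately "fuzzy" regular expression. It tolerates extra
# attributes and whitespace around the two (hypothetical) table cells.
PRODUCT_PATTERN = re.compile(
    r'<td[^>]*class="name"[^>]*>\s*(?P<name>.*?)\s*</td>\s*'
    r'<td[^>]*class="price"[^>]*>\s*\$(?P<price>\d+(?:\.\d+)?)\s*</td>',
    re.IGNORECASE | re.DOTALL,
)


def extract_products(html):
    """Phase 2: pull (name, price) pairs out of the HTML text."""
    return [
        (match.group("name"), float(match.group("price")))
        for match in PRODUCT_PATTERN.finditer(html)
    ]


def save_products(connection, products):
    """Phase 3: do something with the data, here an insert into SQLite."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS products (name TEXT, price REAL)"
    )
    connection.executemany("INSERT INTO products VALUES (?, ?)", products)
    connection.commit()
```

A typical use would chain the three calls: fetch the HTML with request_page(build_opener_with_cookies(), url), pass the result to extract_products, and hand the pairs to save_products along with a connection from sqlite3.connect. Reusing one opener across several requests is what lets a login cookie carry over to later pages.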

screen-scraper dramatically reduces the time required to perform all of these steps, so that you can focus on what to do with the extracted information.

Legal issues

A good portion of the information on the web is copyrighted, which has legal implications for screen-scraping. Use discretion when grabbing data from web sites to be re-purposed.

Getting Started Using screen-scraper

Overview

Using screen-scraper to extract information from web sites typically consists of a few main steps:

  1. Use the proxy server to determine which files to scrape. It's frequently necessary to request a few files before you can get at the file that contains the data you need (e.g. you may need to log in to the site first). The proxy server allows you to surf a site as you normally would, then easily select files you need to have scraped.
  2. Organize and configure files to be scraped. Once you've selected the files to scrape you'll typically need to organize and sequence them. You'll also usually tweak information related to the files, such as POST data to be sent or authentication tokens.
  3. Create extractor patterns. Extractor patterns provide an intuitive way to selectively identify snippets of data you want extracted from individual pages.
  4. Create scripts. Scripts let you do something with the data that gets extracted. This might be writing the data out to a formatted file or inserting the information into a database.
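
To give a feel for the idea behind steps 3 and 4, here is a small Python sketch of a token-based pattern followed by a "script" that writes the results out as CSV. This is not screen-scraper's implementation or syntax: the {{NAME}} token notation, the function names, and the sample markup are all hypothetical and exist only to illustrate the concept of marking the snippets you want inside a copy of the page's HTML.

```python
import csv
import io
import re

# Hypothetical token notation for this sketch: {{NAME}} marks a snippet
# of data to capture. It is not claimed to match any real tool's syntax.
TOKEN = re.compile(r"\{\{([A-Z_]+)\}\}")


def literal_to_regex(text):
    """Escape literal text, letting any run of whitespace match loosely."""
    return r"\s*".join(re.escape(chunk) for chunk in re.split(r"\s+", text))


def compile_template(template):
    """Turn a template with {{TOKEN}} markers into a compiled regex
    with one named group per token."""
    parts = []
    position = 0
    for match in TOKEN.finditer(template):
        parts.append(literal_to_regex(template[position:match.start()]))
        parts.append(f"(?P<{match.group(1)}>.*?)")
        position = match.end()
    parts.append(literal_to_regex(template[position:]))
    return re.compile("".join(parts), re.DOTALL)


def apply_template(pattern, html):
    """Pattern step: return one dict of extracted values per match."""
    return [match.groupdict() for match in pattern.finditer(html)]


def rows_to_csv(rows, fieldnames):
    """Script step: format the extracted rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

For example, compiling the template `<td class="name">{{NAME}}</td> <td class="price">${{PRICE}}</td>` and applying it to a page yields one dictionary with NAME and PRICE keys per table row, which rows_to_csv can then format. Keeping the pattern step separate from the script step mirrors the split described above: one part identifies the data, the other decides what to do with it.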

The best way to learn to use screen-scraper is by going through our tutorials.

