An Overview of the Scraping Engine
Purpose

The screen-scraper application provides an intuitive and convenient way to extract data from web pages. Many of the screen-scraping details that are typically handled in code can instead be managed through a graphical user interface (called the "workbench" in screen-scraper).

Basic concepts

There are several basic elements used by screen-scraper in extracting data from web sites. The first is a scraping session, which consists of a series of scrapeable files (or web pages) that screen-scraper requests in a designated sequence. A common example is a site that requires authentication before the data to be extracted can be accessed. The first file, or HTTP request, might go to a server-side script that handles a user's login. It might then be necessary to follow a few links, each involving another scrapeable file, until the page that contains the desired data can be requested.

Any number of parameters can be associated with a scrapeable file. These can be GET or POST parameters, or authentication credentials (such as Basic, Digest, or NTLM) that need to be sent when the file is requested.

For each scrapeable file that is requested, any number of extractor patterns can be applied to the text retrieved from the page in order to extract the desired pieces. Throughout this process, scripts can be invoked to perform tasks such as inserting extracted data into a database or triggering requests for subsequent scrapeable files.

As a scraping session runs, screen-scraper logs the activity and records each request and response corresponding to each scrapeable file that is requested.
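To make the relationship between these elements concrete, here is a minimal sketch in Python. It is not screen-scraper's own API or scripting interface: the names `ScrapeableFile`, `ScrapingSession`, `fake_fetch`, and `build_demo_session`, the example URLs, and the use of regular expressions as extractor patterns are all assumptions made for illustration. The fetch step is passed in as a function so the sketch runs without network access.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ScrapeableFile:
    """One page to request (hypothetical model, not screen-scraper's API)."""
    name: str
    url: str
    method: str = "GET"
    params: Dict[str, str] = field(default_factory=dict)
    # label -> regular expression with one capture group (an assumption;
    # real extractor patterns in screen-scraper have their own syntax)
    extractor_patterns: Dict[str, str] = field(default_factory=dict)


@dataclass
class ScrapingSession:
    """Requests scrapeable files in order, extracts data, and logs activity."""
    files: List[ScrapeableFile]
    fetch: Callable[[ScrapeableFile], str]
    # Optional "script" hook invoked after patterns are applied to a file
    on_extracted: Optional[Callable[[str, Dict[str, List[str]]], None]] = None
    log: List[str] = field(default_factory=list)

    def run(self) -> Dict[str, Dict[str, List[str]]]:
        results: Dict[str, Dict[str, List[str]]] = {}
        for f in self.files:
            self.log.append(f"request {f.method} {f.url} params={f.params}")
            body = self.fetch(f)
            self.log.append(f"response {f.name}: {len(body)} chars")
            extracted = {
                label: re.findall(pattern, body)
                for label, pattern in f.extractor_patterns.items()
            }
            if self.on_extracted is not None:
                self.on_extracted(f.name, extracted)
            results[f.name] = extracted
        return results


def fake_fetch(f: ScrapeableFile) -> str:
    """Stand-in for an HTTP request; returns canned page text by file name."""
    pages = {
        "login": "<p>Welcome, alice</p>",
        "products": "<li>Widget</li><li>Gadget</li>",
    }
    return pages[f.name]


def build_demo_session() -> ScrapingSession:
    """A two-step session: log in first, then request the data page."""
    login = ScrapeableFile(
        name="login",
        url="https://example.com/login",
        method="POST",
        params={"user": "alice", "password": "secret"},
    )
    products = ScrapeableFile(
        name="products",
        url="https://example.com/products",
        extractor_patterns={"item": r"<li>(.*?)</li>"},
    )
    return ScrapingSession(files=[login, products], fetch=fake_fetch)
```

Calling `build_demo_session().run()` requests the login file first and the products file second, applies the `item` pattern to the second page, and leaves one request entry and one response entry per file in `log`. In a real setup, `fetch` would issue the HTTP request and `on_extracted` would play the role of a script that, for example, writes the extracted values to a database.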