Tutorials: Menu

Tutorials

Before diving into screen-scraper we highly recommend that you take some time to go through our tutorials. Each tutorial should take around 30 minutes. The current tutorials cover all of the basics of using screen-scraper, and should be adequate to get you going on most projects. Along with these tutorials, you'll probably find it helpful to look through our documentation.

Tutorial 1: Hello World. This first tutorial will familiarize you with the basics of using screen-scraper and the general approach we recommend in setting up sites to be scraped.

Tutorial 2: Scraping an E-commerce Site. This tutorial covers scraping search results that span multiple pages, using extractor patterns, and logging in to a web site.

Tutorial 3: Extending Hello World. The third tutorial builds off of the first, and covers topics such as richer scripting and interacting with screen-scraper from languages such as Active Server Pages, PHP, and Java.

The rest of the tutorials build off of Tutorial 2, and can be done in any order. They're intended to give examples of more specific tasks you might want to accomplish with screen-scraper. Feel free to read through them and try any that best fit your situation.

Tutorial 4: Scraping an E-commerce Site from External Programs. Here we extend Tutorial 2 with specific examples of invoking screen-scraper from Java, Active Server Pages, PHP, and .NET.

Tutorial 5: Saving Scraped Data to a Database. This tutorial illustrates how to take the data we scraped from our e-commerce site and insert it into a database.

Tutorial 6: Generating an RSS/Atom Feed from a Product Search. Here we go over creating an XML feed based on a search on our shopping site to demonstrate screen-scraper's RSS/Atom capabilities.

Tutorial 7: Scraping a Site Multiple Times Based on Search Terms. A common scenario in screen-scraping is submitting a form multiple times and extracting the search results each time. A typical example is a "store locator" service where you submit many zip codes, then extract the locations corresponding to each one. This tutorial walks you through how to use screen-scraper to tackle such a task.

If you'd like to print the tutorials for easier reading, you can use the "Printer-friendly version" link at the bottom of any given page, or click here to get such a version for all of the tutorials.

Tutorial 1: Hello World!

Hello World!

This tutorial will walk you step-by-step through the process generally used to scrape information from web pages using screen-scraper. It should take you about 20 to 30 minutes to complete, and will familiarize you with the basic principles you'll need to scrape information from web sites. To get the most from this tutorial you should have at least a basic knowledge of HTML and HTTP (really just the way web browsers interact with web servers). This tutorial also assumes that you've successfully downloaded and installed screen-scraper.

If you don't have a lot of experience working with web technologies, or if you'd just like a refresher, you might find these sites helpful:

This is intended to be a very basic tutorial, and, as such, we'll be extracting the words "Hello World" from a web page and writing them to a file. While this is a simple example of pulling a single snippet of text off of a page, you would use a very similar approach for something like a stock quote or product price.

We'll try to keep the pace of the tutorial such that (hopefully) you won't get bored or frustrated. Along the way if you'd like more information on a topic try the links at the bottom of each screen.

The scraping session you are about to create (choose Interpreted Java or VBScript):

Attachment                                                  Size
Hello World (Scraping Session--Interpreted Java).xml        3.78 KB
Hello World (Scraping Session--VBScript).xml                4.28 KB

Tutorial 1: Page 2: Screen-Scraping Overview

Screen-Scraping Overview

In many ways working with screen-scraper is like working with a database, such as MySQL or SQL Server. With databases, you'll generally use an interface (often a graphical user interface) to create objects such as tables, columns, and indexes. Once you've set up the database you'll often write programming code to populate it with data as well as to pull information from it. Likewise, with screen-scraper you'll use its graphical user interface to create the objects needed to extract information from web sites. Once you've set up these objects you'll write programming code to interact with screen-scraper and make use of the data it extracts.

Extracting information from web sites using screen-scraper typically involves four main steps:

1. Use the proxy server to determine the exact files that need to be requested in order to get the information you're after.
2. Create a scraping session with scrapeable files that define the sequence of pages screen-scraper will request.
3. Generate extractor patterns to define the exact information you need screen-scraper to grab from each page.
4. Write small scripts or programming code to invoke screen-scraper and/or work with the data it extracts. If you don't do much programming, don't worry. Generally the scripts you'll need to write to work with screen-scraper are small and simple, and you can usually just modify the example scripts we provide.

We'll now walk through each of these steps in detail.

Tutorial 1: Page 3: Proxy Server Setup

Proxy Server Setup

An HTTP proxy server is basically just a program that sits between a web browser and a web server, passing data back and forth between the two. screen-scraper contains a proxy server that allows you to view all requests that your web browser sends, and the corresponding responses that web servers send in return. The proxy server records all of the pages requested by your browser as you surf so that they can be easily scraped by screen-scraper at a later point.



OK, enough talk; it's time to fire up screen-scraper. If you're running Windows this is done by selecting the appropriate link from the "Start" menu. On Unix/Linux or Mac OS X use the "screen-scraper" link that was created when you installed screen-scraper.

Once screen-scraper has fully loaded you'll see a tree on the left which will contain the objects we'll be creating. Right now we need to set up screen-scraper's proxy server.

In screen-scraper you'll generally use a proxy session for each web site you'd like to extract information from. A proxy session holds all of the HTTP requests and responses recorded from your browser for the period of time you run it. Create a proxy session now by clicking the "New Proxy Session" button (looks like a globe) or by selecting "New Proxy Session" from the "File" menu. screen-scraper should now look like this:






Give the proxy session a name by typing "Hello World" into the "Name" field. The "Port" field determines the port number that your web browser will use when communicating with screen-scraper's proxy server. The bottom checkbox causes the proxy server to ignore binary files (which are generally not very interesting when you're scraping text-based data). For now we're only concerned with the "Port" field, which you should be able to leave as 8777.

Next we need to set up your web browser so that it will use screen-scraper as a proxy server. If you have two web browsers installed on your computer we recommend using one of them to continue through the tutorial and the other to interact with the proxy server. For example, if you have Internet Explorer and Firefox installed you may want to view the tutorial pages using Firefox and use Internet Explorer with the proxy server. Odds are you're using Internet Explorer as your primary browser, so we'll give detailed instructions on setting it up. If you're using a different web browser try one of the following links: Firefox, Opera, Mozilla, or Netscape

Open up Internet Explorer, then click on "Internet Options" from the "Tools" menu. You should get a dialog box like this:






From here click on the "Connections" tab, then on the "LAN Settings" button. Click on the checkbox beginning with "Use a proxy server for...", then on the "Advanced..." button. The dialog box should now look like this:






In the "HTTP" and "Secure" fields type "localhost" under the "Proxy address to use" column, and "8777" under "Port" (assuming you haven't changed the default port number from 8777). Hit the "OK" button a few times till you get back to your web browser. NOTE: Depending on your operating system, instead of "localhost" you may need to use either "127.0.0.1" or the IP address of the machine. If you have trouble connecting to screen-scraper's proxy with your web browser, please see this FAQ.

At this point your browser is set up such that any time you click on a link or submit a form the request will first go to screen-scraper, where it will be recorded, and then get sent to the web server it was intended for. The web server will respond back to screen-scraper, which will record the response, then send it along to your web browser.
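To make the request path concrete, here is a minimal Java sketch of a client that routes a request through a proxy listening on localhost:8777, the same address and port entered in the browser settings above. It uses only the standard java.net classes and is not part of screen-scraper's own API; the class and method names are made up for illustration, and it assumes the proxy server is already running.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;

// Hypothetical helper class, for illustration only.
final class ProxyClientSketch {

    // Describes an HTTP proxy on this machine, e.g. one listening on port 8777.
    static Proxy localProxy(int port) {
        return new Proxy(Proxy.Type.HTTP, new InetSocketAddress("localhost", port));
    }

    // Sends a GET request through the given proxy and returns the HTTP status code.
    static int fetchStatus(String address, Proxy proxy) throws IOException {
        HttpURLConnection connection =
                (HttpURLConnection) URI.create(address).toURL().openConnection(proxy);
        try {
            return connection.getResponseCode();
        } finally {
            connection.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Needs a proxy listening on port 8777; without one this throws an IOException.
        Proxy proxy = localProxy(8777);
        System.out.println(
                fetchStatus("http://www.screen-scraper.com/tutorial/basic_form.php", proxy));
    }
}
```

A request sent this way should travel the same route as one from the configured browser: first to the proxy, then on to the web server.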

If you're running Mac OS X, and are using screen-scraper Professional or Enterprise Edition, there's one more step you'll need to take. In screen-scraper, click the wrench icon to bring up the "Settings" dialog box. Click on the "Servers" button in the panel on the left, then remove any text contained in the "Hosts to allow to connect" text box. Because of the way Mac OS X handles IP addresses, we do this so that screen-scraper will accept connections from your web browser.

At this point we can get the proxy server running. Do this now in screen-scraper by clicking the "Start Proxy Server" button for your proxy session. After this, click on the "Progress" tab, which will display all of the requests and responses recorded by the proxy server.

You're now ready to have screen-scraper record a few pages for you...

Tutorial 1: Page 4: Recording Pages with the Proxy Server

Recording Pages with the Proxy Server

Return now to your web browser and go to the following URL:

http://www.screen-scraper.com/tutorial/basic_form.php

If you take a look at screen-scraper you'll notice that it recorded this page in the "HTTP Transactions" table. If you click on the first row in the table, information related to your browser's request and response will appear in the lower pane:





If you didn't see your page show up in the "HTTP Transactions" table, or if your browser seems to have trouble, take a look at this FAQ for help.

The lower pane shows the details of the HTTP request your browser made--the request line, any HTTP headers (including cookies), as well as POST data (if any was sent). You can view the corresponding response from the server by clicking on the "Response" tab. Don't worry if a lot of what you're seeing doesn't make much sense; for the most part screen-scraper takes care of these kinds of details for you (such as keeping track of cookies).
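If the parts of a request are unfamiliar, the following Java sketch assembles the text of a bare-bones HTTP/1.1 GET request for the tutorial page. It is only meant to show where the request line and headers sit; a real browser sends many more headers (cookies, user agent, and so on), and the class and method names are hypothetical.

```java
import java.net.URI;

// Hypothetical helper class, for illustration only.
final class RawRequestSketch {

    // Builds the text of a minimal HTTP/1.1 GET request for the given URL.
    static String rawGetRequest(URI uri) {
        String target = uri.getRawPath();
        if (uri.getRawQuery() != null) {
            target += "?" + uri.getRawQuery();
        }
        return "GET " + target + " HTTP/1.1\r\n"   // request line
             + "Host: " + uri.getHost() + "\r\n"   // the one header HTTP/1.1 requires
             + "\r\n";                             // blank line ends the headers; no body follows
    }

    public static void main(String[] args) {
        System.out.print(rawGetRequest(
                URI.create("http://www.screen-scraper.com/tutorial/basic_form.php")));
    }
}
```

A POST request would carry its form data after the blank line, which is the "POST data" area the lower pane refers to.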

At this point, in your web browser, type "Hello world!" (without the quotes) into the form text box and click the "Submit" button. This simply submits the form using the GET method to this same page, and displays what you typed in. We now have all of the pages we need recorded, so click on the "General" tab in screen-scraper then click on the "Stop Proxy Server" button. Now might also be a good time to adjust your web browser so that it no longer uses screen-scraper as a proxy server.
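Submitting a form with the GET method means the field values are URL-encoded and appended to the address. The sketch below shows that encoding with the standard java.net.URLEncoder class. The field name text_string is taken from the log shown later in this tutorial; the class and method names are hypothetical, and since the tutorial's log truncates the URL, the exact string screen-scraper prints may differ.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class FormEncodingSketch {

    // Appends one form field to a URL the way a GET form submission does.
    static String withQuery(String url, String name, String value) {
        return url + "?" + URLEncoder.encode(name, StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Spaces become "+" and "!" becomes "%21" under form encoding.
        System.out.println(withQuery(
                "http://www.screen-scraper.com/tutorial/basic_form.php",
                "text_string", "Hello world!"));
    }
}
```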

Tutorial 1: Page 5: Generating a Scrapeable File

Generating a Scrapeable File

At this point we're ready to start creating the objects that screen-scraper will use to extract data from the page. We start by creating a scraping session. A scraping session is simply a container for all of the files and other objects that will allow us to extract data from a given web site. Either click the "New Scraping Session" button (looks like a gear) or click on the "File" menu, then select "New Scraping Session". After the scraping session appears rename it to "Hello World" (note that if you imported the scraping session at the beginning of the tutorial you'll want to name it something else--perhaps "My Hello World"). Your window should now look like this:



Now return to our "Hello World" proxy session by clicking on it in the tree on the left (the one with the globe by it), then click on the "Progress" tab. Click on the second (or last) row in the "HTTP Transactions" table. In the lower pane make sure "Hello World" is selected from the drop-down list labeled "Generate scrapeable file in:", then click the "Go" button. A scrapeable file is a web page that contains information we're interested in extracting. First off, let's rename our scrapeable file "Form submission". Your screen should now look like this:



Just to make sure things are good so far let's run a quick test. Run the "Hello World" scraping session by clicking on it in the tree on the left, then clicking the "Run Scraping Session" button. Now click on the "Log" tab. It should just take a moment to run, after which the log should show the following:

Starting scraper.
Running scraping session: Hello World
Processing scripts before scraping session begins.
Scraping file: "Form Submission"
Form Submission: Preliminary URL: http://www.screen-scraper.com/tutorial/basic_form.php
Form Submission: Using strict mode.
Form Submission: Resolved URL: http://www.screen-scraper.com/tutorial/basic_form.php?text_string=Hello+...
Form Submission: Sending request.
Processing scripts after scraping session has ended.
Scraping session "Hello World" finished.

The log is an invaluable tool for debugging scraping sessions, and you'll want to use it often. In this case it shows that screen-scraper requested the only scrapeable file in our scraping session ("Form submission"). You can view the text of the file that was scraped by clicking on "Form submission" in the tree on the left, then clicking the "Last Response" tab. Click the "Display Response in Browser" button to ensure that the page looks like the one in your browser (it may not look exactly like it, but should resemble it closely). It's often helpful to view the last response for a scrapeable file after running a scraping session so that you can ensure that screen-scraper requested the right page.

!!!!QUICK TIP!!!!
A good principle of software design is to run code often as you make changes. Likewise, with screen-scraper it is a good idea to run your scraping session frequently and watch the log and last responses to ensure that things are working as you intend them to.

Now would be a good time to save your work. Click the "Save" button (looks like a disk) or select the "Save" option from the "File" menu.

Tutorial 1: Page 6: Generating an Extractor Pattern

Generating an Extractor Pattern

This is probably the trickiest part of the tutorial, so if you've been skimming up to this point you'll probably want to read this page a little more carefully. An extractor pattern is a block of text (usually HTML) that contains special tokens that will match pieces of data you're interested in extracting. These tokens are text labels surrounded by the delimiters ~@ and @~.

You can think of an extractor pattern like a stencil. A stencil is an image in cut-out form, often made of thin cardboard. As you place a stencil over a piece of paper, apply paint to it, then remove the stencil, the paint remains only where there were holes in the stencil. Analogously, you can think of placing an extractor pattern over the HTML of a web page where the tokens correspond to the holes where the paint would pass through. After an extractor pattern is applied it reveals the portions of the web page you'd like to extract.

Take a look at the HTML from the page we scraped by clicking on the "Form submission" scrapeable file, then on the "Last Response" tab. If you click the "Render HTML" button you should see a screen resembling the page you saw in your browser. Consider this snippet of HTML from the page:


<table align="center">
<tr>
<td><span style="color: red">You typed: Hello world!</span> </td>
</tr>
</table>


As we're interested in extracting the string "Hello world!" our extractor pattern would look like this:

<table align="center">
<tr>
<td><span style="color: red">You typed: ~@FORM_SUBMITTED_TEXT@~</span> </td>
</tr>
</table>

The string "~@FORM_SUBMITTED_TEXT@~" is the token that corresponds to the data we're interested in, and, after this extractor pattern is applied, would hold the string "Hello world!". Returning to our stencil analogy, the "~@FORM_SUBMITTED_TEXT@~" token is analogous to the hole in the stencil where the paint would pass through. In a bit we'll look at how we might make use of the data extracted by that token.
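For readers who think in code, the stencil idea can be approximated with a regular expression: the literal text of the pattern must match exactly, and each token becomes a "hole" that captures whatever sits in that position. The Java sketch below is only an analogy under that assumption; it is not how screen-scraper is implemented, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; an analogy, not screen-scraper's implementation.
final class StencilSketch {

    // Matches tokens of the form ~@NAME@~ inside an extractor pattern.
    private static final Pattern TOKEN = Pattern.compile("~@(\\w+)@~");

    // Applies the pattern text to the HTML and returns token name -> extracted text.
    static Map<String, String> extract(String patternText, String html) {
        List<String> names = new ArrayList<>();
        StringBuilder regex = new StringBuilder();
        Matcher token = TOKEN.matcher(patternText);
        int last = 0;
        while (token.find()) {
            // Literal text around a token must appear verbatim in the page.
            regex.append(Pattern.quote(patternText.substring(last, token.start())));
            // The token itself is the "hole": capture as little as possible.
            regex.append("(.*?)");
            names.add(token.group(1));
            last = token.end();
        }
        regex.append(Pattern.quote(patternText.substring(last)));

        Map<String, String> extracted = new LinkedHashMap<>();
        Matcher hit = Pattern.compile(regex.toString(), Pattern.DOTALL).matcher(html);
        if (hit.find()) {
            for (int i = 0; i < names.size(); i++) {
                extracted.put(names.get(i), hit.group(i + 1));
            }
        }
        return extracted;
    }

    public static void main(String[] args) {
        String pattern = "<span style=\"color: red\">You typed: ~@FORM_SUBMITTED_TEXT@~</span>";
        String html = "<td><span style=\"color: red\">You typed: Hello world!</span> </td>";
        System.out.println(extract(pattern, html));
    }
}
```

Note that any difference in the literal text, even whitespace, makes this simplified matcher fail, which is one reason to copy pattern text from the page exactly as it was scraped.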

We'll now create an extractor pattern that will extract the "Hello world!" text you typed in to the HTML form. Under the "Form submission" scrapeable file, click on the "Extractor Patterns" tab, then click on the "Add Extractor Pattern" button. Give your extractor pattern the identifier "Form data", and in the "Pattern text" box enter the extractor pattern shown above. Your screen should now look like this:



Go ahead and give the extractor pattern a try by clicking on the "Apply Pattern to Last Scraped Data" button. The following window will appear, displaying the text that our extractor pattern extracted from the page:



Looks like our extractor pattern has matched the snippet of text we were after. The "Apply Pattern to Last Scraped Data" button is another invaluable tool you'll use often to make sure you're getting the right data. It simply takes the HTML from the "Last Response" tab and applies the extractor pattern to it.

!!!!QUICK TIP!!!!
When creating extractor patterns, always be sure you use the HTML from screen-scraper's "Last Response" tab, and not the HTML source shown by your web browser. Before screen-scraper applies an extractor pattern to an HTML page, it "tidies" up the HTML to facilitate extraction. This will generally cause the HTML to be slightly different from the HTML you'd get directly from your web browser.

Before we continue we need to take a look at one more thing. Extractor pattern tokens have properties, one of which we'll need to modify. To modify the properties for our "~@FORM_SUBMITTED_TEXT@~" extractor pattern token double-click it (that is, double-click on the text FORM_SUBMITTED_TEXT found between the ~@ and @~ delimiters in the "Pattern text" box) or select it, right-click it (or Control-click in Mac OS X), then select "Edit token". You'll see the following box:



screen-scraper makes use of session variables which allow you to save and persist objects throughout the life of a scraping session. This means that screen-scraper will save the extracted data in memory so that it can be used later in scripts and such. In this case we'd like to save the text that our "~@FORM_SUBMITTED_TEXT@~" extractor pattern token extracts. Indicate this now by clicking the "Save in session variable?" checkbox, then closing the "Edit Token" window. In other words, when screen-scraper runs this scraping session and extracts the text for this extractor pattern it will save that text (e.g., "Hello world!") in a session variable so that we can do something with it later. Next we'll make use of the data we extract...

Tutorial 1: Page 7: Overview of Writing a Simple Script

Overview of Writing a Simple Script

We'll now do something with the data we've extracted by writing a simple script. A screen-scraper script is a block of code that will get executed when a certain event occurs. For example, you might have a script that gets invoked at the beginning of a scraping session that initializes variables. Another script might get invoked each time a row in a list of search results is extracted from a site so that the information in that search result can be inserted into a database. You can think of this as being analogous to "event handling" mechanisms in other programming languages. For example, in an HTML page you might associate a JavaScript method call with the "onLoad" event for the body tag. In Visual Basic you'll often create a sub-routine that gets invoked when a button is clicked. In the same way, screen-scraper scripts will get invoked when certain events occur related to requesting web pages and extracting data from them.

If you don't have much experience programming, don't worry; scripts written in screen-scraper are generally short and simple. The script we'll be creating will simply write out the text we extract to a file.

In preparation for writing our script click the "New Script" button (looks like a pencil and paper) or select "New Script" from the "File" menu, and give it the identifier "Write extracted data to a file". Your screen should now look like this:






screen-scraper supports scripting in Interpreted Java, JavaScript, and Python when running on any operating system, and JScript, Perl, and VBScript when running on Windows. At this point, depending on the language you prefer, you can continue on with an explanation of scripting in Interpreted Java or VBScript, using one of the links below.

Tutorial 1: Page 8: Writing a Simple Script in Interpreted Java

Writing a Simple Script in Interpreted Java

screen-scraper uses the BeanShell library to allow for scripting in Java. If you've done some programming in C or JavaScript you'll probably find BeanShell's syntax familiar.

Let's get right to it. Copy and paste the following text into the box labeled "Script Text":

// Output a message to the log so we know that we'll be writing the text out to a file.
session.log( "Writing data to a file." );

// Create a FileWriter object that we'll use to write out the text.
out = new FileWriter( "form_submitted_text.txt" );

// Write out the text.
out.write( session.getVariable( "FORM_SUBMITTED_TEXT" ) );

// Close the file.
out.close();

Hopefully it's obvious what's going on, based on the comments in the script. We simply create an object used to write out the text (a "FileWriter"), write it out, then close up the file. Note the session.getVariable( "FORM_SUBMITTED_TEXT" ) method call, which retrieves the value of the "FORM_SUBMITTED_TEXT" session variable. This method call is able to get the value because we indicated earlier that the value for the "FORM_SUBMITTED_TEXT" token was to be saved in a session variable (i.e., when we checked the "Save in session variable?" box).
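If you'd like to experiment with the same logic outside of screen-scraper, here is a rough standalone Java equivalent. Plain Java has no "session" object, so a Map stands in for the session variables; that stand-in, the temporary file, and the class and method names are all assumptions made for this sketch rather than anything screen-scraper provides.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; the Map is a stand-in for screen-scraper's session.
final class WriteVariableSketch {

    // Writes one value from the stand-in "session" map to the given file.
    static void writeVariable(Map<String, Object> session, String name, Path file) {
        // try-with-resources closes the writer even if write() throws.
        try (FileWriter out = new FileWriter(file.toFile())) {
            out.write(String.valueOf(session.get(name)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Runs the round trip against a temporary file and returns what was written.
    static String demo() {
        Map<String, Object> session = new HashMap<>();
        session.put("FORM_SUBMITTED_TEXT", "Hello world!");
        try {
            Path file = Files.createTempFile("form_submitted_text", ".txt");
            writeVariable(session, "FORM_SUBMITTED_TEXT", file);
            String written = Files.readString(file);
            Files.delete(file);
            return written;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The main differences from the script above are the explicit imports and type declarations, which BeanShell lets you omit, and the try-with-resources block, which closes the file even when writing fails.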

If you haven't done much programming, this is where things might seem a little confusing. If so, you may consider trying a basic tutorial on Java or JavaScript, which will hopefully introduce you to the basics of programming. You'll especially want to get an introduction to object-oriented programming.

Tutorial 1: Page 9: Invoking a Script

Invoking a Script

A script is executed in screen-scraper by associating it with some event, such as before or after an extractor pattern is applied to the text of a web page.

The script we've just written needs to be executed after screen-scraper has requested the web page and extracted the data we need from it.

At this point return to the extractor pattern we just created by clicking on the "Form submission" scrapeable file in the tree on the left, then on the "Extractor Patterns" tab. In the lower part of your screen click on the "Add Script" button. Select "Write extracted data to a file" in the column on the left, and select "After pattern is applied" in the third column. Your screen should now look like this:






Our "Write extracted data to a file" script will be invoked after screen-scraper has applied the "Form data" extractor pattern to the web page. That is, once the extractor pattern has been applied as many times as it needs to be (which is only once, in this case), screen-scraper will invoke the script.

The curious might be wondering a bit more about the difference between "After pattern is applied" and "After each pattern application". Consider a web page that contains a table with 10 rows. We might create an extractor pattern that matches a single row in the table. The extractor pattern would match 10 times--one for each row in the table. If we associated a script with the extractor pattern and told it to run "After pattern is applied", the script would only get executed one time (i.e., after the pattern has matched as many times as it needs to). If we had indicated that the script should run "After each pattern application", it would get executed 10 times--one time for each match the pattern makes. In the current case, the pattern only matches one time, so it doesn't make a big difference whether we indicate "After pattern is applied" or "After each pattern application".
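The difference between the two events can be modeled in a few lines of Java. In this sketch the two callbacks stand in for scripts attached to "After each pattern application" and "After pattern is applied"; the class, the method names, and the callback mechanism are hypothetical and only mirror the behavior described above, not screen-scraper's internals.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical model of the two script events; not screen-scraper's API.
final class PatternEventsSketch {

    // Walks over every match, firing one callback per match and one at the end.
    static void applyPattern(List<String> matches,
                             Consumer<String> afterEachApplication,
                             Runnable afterPatternIsApplied) {
        for (String match : matches) {
            afterEachApplication.accept(match);   // once per matched row
        }
        // Once, after all matches; what happens with zero matches is an
        // assumption here, since the tutorial does not cover that case.
        afterPatternIsApplied.run();
    }

    // Returns { runs of the per-match script, runs of the after-pattern script }.
    static int[] countRuns(int rows) {
        int[] runs = new int[2];
        List<String> matches = Collections.nCopies(rows, "row");
        applyPattern(matches, match -> runs[0]++, () -> runs[1]++);
        return runs;
    }

    public static void main(String[] args) {
        int[] runs = countRuns(10);
        System.out.println("After each pattern application: " + runs[0]);
        System.out.println("After pattern is applied: " + runs[1]);
    }
}
```

For the 10-row table in the example, the first counter ends at 10 and the second at 1.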

Tutorial 1: Page 10: Running the Completed Scraping Session

Running the Completed Scraping Session

Finally, we have everything in place to run our scraping session. Click on the "Hello World" scraping session in the tree on the left, then click on the "Log" tab. If there is existing text in the "Log" get rid of it by clicking the "Clear Log" button. Now click on the "Run Scraping Session" button. After it finishes running, take a look at the contents of the "form_submitted_text.txt" file, which will be located in the screen-scraper installation directory (e.g., C:\Program Files\screen-scraper professional edition\).

Tutorial 1: Page 11: Where to Go From Here

Where to Go From Here

Congratulations! You now have the basic core knowledge you need to scrape screens with screen-scraper. While this was a very simple example of a scraping session, we did cover most of the main principles you need to start your own project. If you have the time, we'd highly recommend continuing on to Tutorial 2: Scraping an E-commerce Site, as well as Tutorial 3: Extending Hello World. Otherwise, you may want to consider reading through some of the existing documentation as you work on your own project.

Tutorial 2: Scraping an E-commerce Site

Scraping an E-commerce Site

In this tutorial we'll be scraping search results from a basic e-commerce site. We'll also demonstrate logging in to a web site before scraping data. Data you'll be scraping from web sites is often in the form of "records", or data that might fit into a spreadsheet in rows and columns. It's also often necessary to log in to a web site before you can scrape the data you're interested in. Hopefully getting some practice with these situations in this tutorial will let you apply the experience to other similar situations. For example, you would likely apply the same approach we'll go over here to extracting data such as online directories, real estate listings, or product descriptions.

If you haven't already gone through Tutorial 1 we'd recommend that you do so before continuing with this one. This tutorial, however, doesn't depend on scraping sessions or other objects you might have created in the previous tutorial. You may wish to download and import the completed scraping session that goes with this tutorial. The scraping session and complete output file are available below.

The site we'll be scraping information from is found here: http://www.screen-scraper.com/shop/. Feel free to click around and explore for a minute.

The scraping session you are about to create and the output file the scraping session will generate:

Attachment                                 Size
dvds.txt                                   897 bytes
Shopping Site (Scraping Session).sss       10.2 KB

Tutorial 2: Page 2: Screen-Scraping Overview Review

Screen-Scraping Overview Review

As you'll remember from the first tutorial, extracting information from web sites using screen-scraper typically involves four main steps:

1. Use the proxy server to determine the exact files that need to be requested in order to get the information you're after.
2. Create a scraping session with scrapeable files that define the sequence of pages screen-scraper will request.
3. Generate extractor patterns to define the exact information you need screen-scraper to grab from each page.
4. Write small scripts or programming code to invoke screen-scraper and/or work with the data it extracts.

Tutorial 2: Page 3: Recording Search Results

Recording Search Results

As in the first tutorial, we'll be recording a browser session using the proxy server. Remember that a proxy session holds all of the HTTP requests and responses from your browser for the period of time you run it.

Create a new proxy session now either by clicking the "New Proxy Session" button (looks like a globe) or by selecting "New Proxy Session" from the "File" menu. When the proxy session appears type in "Shopping Site" in the "Name" field. In your web browser go to this URL: http://www.screen-scraper.com/shop/ (remember that you may want to use one browser with the proxy server and one to view the tutorials).

At this point start up the proxy server by clicking the "Start Proxy Server" button, then configure your web browser as you did in the first tutorial (if you need help try this page). In screen-scraper, ensure that the "Don't log binary files" checkbox is checked. Now click on the "Progress" tab so that you can see the pages appear as they get recorded.

We'll be doing a search in the shopping web site for the term "dvd" in the various products. Do this by typing "dvd" (without the quotes) into the search box located in the upper-right corner of the home page, then click the "Search" button. You'll see screen-scraper work for a bit, then, once it finishes, you should just see one row in the "HTTP Transactions" table. We'll want to traverse all of the search results, so, in your web browser, click the "Next >>" link. screen-scraper will work again for a bit while it records the next search results page. Later on we'll be scraping the details pages, so let's record one of those now. Click on the "Speed" link to view details on this DVD. These are the only pages we're interested in at this point, so go ahead and stop the proxy session by clicking the "Stop Proxy Server" button on the "General" tab. You'll also want to re-configure your web browser so that it's no longer using screen-scraper as a proxy server.

Tutorial 2: Page 4: Creating the Scraping Session

Creating the Scraping Session

Create a scraping session either by clicking the "New Scraping Session" button (looks like a gear) or by selecting "New Scraping Session" from the "File" menu. In the "Name" field enter "Shopping Site" (if you already downloaded and imported the scraping session at the beginning of this tutorial you'll want to name your scraping session something else--perhaps "My Shopping Site"). This is the scraping session that will hold all of the files we'll be extracting data from. Remember that a scraping session is simply a container for all of the files and other objects that will allow us to extract data from a given web site.

We'll now be adding scrapeable files to our scraping session. You'll remember from the first tutorial that a scrapeable file represents a web page you'd like screen-scraper to request.

Add the first scrapeable file to the scraping session by clicking the "Shopping Site" proxy session in the tree on the left (the first of the two "Shopping Site" nodes), then on the "Progress" tab. Find the row in the "HTTP Transactions" table with the following URL (probably the second in the table):

http://www.screen-scraper.com/shop/index.php?main_page=advanced_search_result&keyword=dvd&sort=2a&page=2

This URL corresponds to the second page in the search results. We'll use this file because it should contain all of the parameters in the URL we need to request any of the search results pages (including the first). After clicking on this row in the table, information corresponding to the file will appear in the lower pane. Add the file to the "Shopping Site" scraping session by selecting it in the "Generate scrapeable file in" drop-down list, and clicking the "Go" button next to the "Generate scrapeable file in" drop-down list.

After the scrapeable file appears under the scraping session rename it to "Search results". Next, click on the "Parameters" tab. Remember that when we generate a scrapeable file in this way screen-scraper pulls out the parameters from the URL and puts them under the "Parameters" tab for us. Because these are "GET" parameters (as opposed to "POST" parameters), when the scrapeable file is invoked by screen-scraper in a running scraping session, the parameters will get appended again to the URL. Let's take a closer look at each of the parameters that were embedded in the URL:

* main_page: advanced_search_result
* keyword: dvd
* sort: 2a
* page: 2
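
If it helps to see how a URL breaks down into the list above, here is a minimal sketch in plain Java. It is not screen-scraper's own code; the class and method names (QueryParams, parse) are made up for illustration, and it does no URL decoding.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of screen-scraper.
class QueryParams {

    // Split the query string of a URL into name/value pairs, keeping their order.
    // Values are returned exactly as they appear (no URL decoding).
    static Map<String, String> parse(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        int questionMark = url.indexOf('?');
        if (questionMark < 0) {
            return params; // no query string at all
        }
        for (String pair : url.substring(questionMark + 1).split("&")) {
            int equals = pair.indexOf('=');
            if (equals < 0) {
                params.put(pair, ""); // a parameter with no value
            } else {
                params.put(pair.substring(0, equals), pair.substring(equals + 1));
            }
        }
        return params;
    }

    public static void main(String[] args) {
        String url = "http://www.screen-scraper.com/shop/index.php"
                + "?main_page=advanced_search_result&keyword=dvd&sort=2a&page=2";
        // Prints one "name: value" line per parameter, in URL order.
        parse(url).forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```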

The only two that we're likely interested in are "keyword" and "page". We can guess that "keyword" refers to the text we typed into the search box initially. The "page" parameter refers to what page we're on in the search results. We can guess that if we were to replace the "2" in the "page" parameter of the URL with a "1" it would bring up the first page in the search results. Try this by bringing up the following page in your web browser:

http://www.screen-scraper.com/shop/index.php?main_page=advanced_search_result&keyword=dvd&sort=2a&page=1

Looks like our theory was correct. You should see the first page of search results. It's also important to note that the "keyword" and "page" parameters are those that will need to be dynamic. We'll get to that in a minute.

Tutorial 2: Page 5: Creating the Script to Initialize the Scraping Session

We're now going to create a small script to initialize our scraping session. It's a common practice to run a script at the very beginning of a scraping session that can initialize variables and such. That's what we'll be doing here.

Generate the script either by clicking the "New Script" button (looks like a pencil and paper) or by selecting "New Script" from the "File" menu. In the "Name" field type "Shopping Site--initialize session". You'll remember from the first tutorial that screen-scraper scripts get invoked when certain events occur. We'll be invoking this script before the scraping session begins.

If you prefer to code in Java (or JavaScript), select "Interpreted Java" from the "Language" drop-down, then copy and paste the following text into the "Script Text" box:

// Set the session variables.
session.setVariable( "SEARCH", "dvd" );
session.setVariable( "PAGE", "1" );


If you prefer to code in VBScript, select "VBScript" from the "Language" drop-down, then copy and paste the following text into the "Script Text" box:

' Set the session variables.
Call session.SetVariable( "SEARCH", "dvd" )
Call session.SetVariable( "PAGE", "1" )


We set two session variables on our current scraping session. The one item to note is the "PAGE" session variable. We start at 1 so that the first search results page will get requested first.

Before trying out this script let's modify the parameters for our scrapeable file so that they make use of the session variables. Click on the "Search results" scrapeable file, then on the "Parameters" tab. Change the value of the "keyword" parameter from "dvd" to "~#SEARCH#~" (without the quotes), and change the value of the "page" parameter from "2" to "~#PAGE#~" (again, omit the quotes).

The ~#SEARCH#~ and ~#PAGE#~ tokens will be replaced at runtime with the values of the corresponding session variables. As such, the first URL will be as follows:

http://www.screen-scraper.com/shop/index.php?main_page=advanced_search_result&keyword=dvd&sort=2a&page=1

That is, screen-scraper will take all of our "GET" parameters, append them to the end of the URL, then replace any embedded session variables (surrounded by the ~# #~ markers) with their corresponding values.
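
To make the substitution step concrete, here is a small sketch in plain Java. It only imitates the behavior described above and is not screen-scraper's implementation; the class name TokenSubstitution, the resolve method, and the choice to leave unknown tokens untouched are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the runtime replacement of ~#NAME#~ tokens.
class TokenSubstitution {

    // A token is a variable name wrapped in the ~# and #~ markers.
    private static final Pattern TOKEN = Pattern.compile("~#([A-Za-z0-9_]+)#~");

    // Replace each ~#NAME#~ token with the value of the matching variable.
    static String resolve(String template, Map<String, String> sessionVariables) {
        Matcher m = TOKEN.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = sessionVariables.get(m.group(1));
            // Assumption: a token with no matching variable is left as it was.
            String replacement = (value != null) ? value : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> vars = new HashMap<>();
        vars.put("SEARCH", "dvd");
        vars.put("PAGE", "1");
        String template = "http://www.screen-scraper.com/shop/index.php"
                + "?main_page=advanced_search_result&keyword=~#SEARCH#~&sort=2a&page=~#PAGE#~";
        System.out.println(resolve(template, vars));
    }
}
```

With SEARCH set to "dvd" and PAGE set to "1", the result is the first search results URL.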

Note that we could achieve the same effect by deleting all of the parameters from the "Parameters" tab, and replacing our URL with this:

http://www.screen-scraper.com/shop/index.php?main_page=advanced_search_result&keyword=~#SEARCH#~&sort=2a&page=~#PAGE#~

Breaking out the parameters under the "Parameters" tab simply makes them easier to manage, which is why we take that approach.

We'll now need to associate our script with our scraping session so that it gets invoked before the scraping session begins. To do that, click on the scraping session in the tree on the left, then on the "Scripts" tab. Click the "Add Script" button to add a script. In the "Script Name" column select "Shopping Site--initialize session". The "When to Run" column should show "Before scraping session begins", and the "Enabled" checkbox should be checked. This will cause our script to get executed at the very beginning of the scraping session so that the two session variables can get set.

All right, we're ready to try it all out. This scraping session will generate a larger log than the one we worked on earlier, so it may be a good idea to increase the number of lines screen-scraper will display in its log. To do that, click on the scraping session in the tree on the left, then on the "Log" tab. In the text box labeled "Show only the following number of lines" enter the number 1000.

Run the scraping session by selecting it in the tree on the left, then click the "Run Scraping Session" button. View the progress of the scraping session by clicking on it in the tree on the left, then clicking on the "Log" tab. You'll notice that the URL of the requested file is the one given above. You can also verify that the correct URL was requested by clicking on the "Search results" scrapeable file, then on the "Last Response" tab, then on the "Render HTML" or "Display Response in Browser" buttons. The page should resemble the one you saw in your web browser.

Remember that it's a good idea to run scraping sessions often as you make changes, and watch the log and last responses to ensure that things are working as you expect them to. You'll also want to save your work frequently. Do that now by hitting the "Save" button (the one with the disk icon).

Tutorial 2: Page 6: Creating Extractor Patterns for Links

This particular part of the tutorial is one that covers important principles that often seem confusing to people at first. If you've been speeding through the tutorial up to this point, it would probably be a good idea to slow down a bit and read more carefully.

We're now going to create a couple of extractor patterns to extract information for the "Next" link and the product details links. Remember that an extractor pattern is a block of text (usually HTML) that contains special tokens that will match pieces of data you're interested in extracting.

When creating extractor patterns we recommend that you always use the HTML from the "Last Response" tab in screen-scraper. By default, after screen-scraper requests a page it "tidies" the HTML found in it, which makes it differ from the HTML that you would get by viewing the source in your web browser (and also makes it more consistent, facilitating extraction). Click on the "Search results" scrapeable file in the tree on the left, then on the "Last Response" tab. The text box contains HTML because we just ran the scraping session. Copy all of the HTML and paste it into a text editor, such as Notepad or TextMate.

If you click either the "Render HTML" or "Display Response in Browser" button in screen-scraper you'll see a page basically resembling the search results page in your web browser. We're going to extract a portion of each of the product details links so that we can subsequently request each details page and extract information from them. The first details link corresponds to the "A Bug's Life" DVD. Find that in the text editor you just pasted the HTML into (specifically search for the text "A Bug's Life"). Here is the block of HTML representing this product:

<tr class="productListing-odd">
<td align="center" class="productListing-data">&nbsp;<a href="http://www.screen-scraper.com/shop/index.php?main_page=product_info&amp;products_id=8"><img src="images/dvd/a_bugs_life.gif" border="0" alt="A Bug's Life" title=" A Bug's Life " width="100" height="80" /></a>&nbsp;</td>
<td class="productListing-data">&nbsp;<a href="http://www.screen-scraper.com/shop/index.php?main_page=product_info&amp;products_id=8">A Bug's Life</a>&nbsp;</td>
<td align="right" class="productListing-data">&nbsp;$35.99&nbsp;</td>
<td align="center" class="productListing-data"><a href="http://www.screen-scraper.com/shop/index.php?main_page=advanced_search_result&amp;keyword=dvd&amp;sort=2a&amp;page=1&amp;action=buy_now&amp;products_id=8"><img src="includes/templates/template_default/buttons/english/button_buy_now.gif" border="0" alt="Buy Now" title=" Buy Now " width="60" height="30" /></a>&nbsp;</td>
</tr>


This may seem like a bit of a mess, but if we look closely we can pick out the details link:

<td class="productListing-data">&nbsp;<a href="http://www.screen-scraper.com/shop/index.php?main_page=product_info&amp;products_id=8">A Bug's Life</a>&nbsp;</td>


Breaking it down a bit more we get the URL:

http://www.screen-scraper.com/shop/index.php?main_page=product_info&amp;products_id=8


By the way, you might notice that the typical & symbols in the URL have been replaced by &amp;. Don't be alarmed; it's just part of the tidying process screen-scraper applies to the HTML. Again, if we examine the parameters in the URL we can guess that the important one is "products_id", which likely identifies the product whose details we're interested in. We'll guess that the "products_id" is the only one we'll need to extract. This will give us enough information to request a details page. At this point, click on the "Search results" scrapeable file in the tree on the left, then click on the "Extractor Patterns" tab. We'll create an extractor pattern to grab out the product IDs from each link. Here's the extractor pattern we'll use:

<td class="productListing-data">&nbsp;<a href="http://www.screen-scraper.com/shop/index.php?main_page=product_info&amp;products_id=~@PRODUCTID@~">~@PRODUCT_TITLE@~</a>&nbsp;</td>


Create the extractor pattern by clicking on the "Add Extractor Pattern" button, then copying and pasting the text above into the resulting box. Also, give the extractor pattern the name "Product details link". Remember that extractor pattern tokens (delineated by the ~@ @~ markers) indicate data points we're interested in extracting. In this case, we want to extract the ID of the product (embedded in the URL), and the title of the product.

Double-click the ~@PRODUCTID@~ token (or select the text between the ~@ @~ delimiters, right-click it and select "Edit token"), and, in the box that appears, check the "Save in session variable" checkbox. Click on the "Regular Expression" tab, and select "Non-double quotes". You'll notice that when you do that the text [^"]* shows up in the text box just above the drop-down list. This is the regular expression that we'll be using. You could also edit it manually, but generally won't need to.

Let's slow down at this point and go over what we just did to the ~@PRODUCTID@~ extractor pattern token. You might remember from the first tutorial that by checking the "Save in session variable" box we're telling screen-scraper to preserve the value for us so that we can use it at a later point. We'll get to that in a bit. This time we also selected a regular expression for it to use. In most cases you'll want to designate a regular expression for extractor pattern tokens. If you're not very familiar with regular expressions, don't worry. In the vast majority of cases you can simply use the regular expressions found in that drop-down list. Let's go over what effect designating a regular expression has. By indicating the "Non-double quotes" regular expression we're saying that we want that token to match any character except a double-quote (i.e., the " character). You'll notice in our extractor pattern that a double-quote character just follows our ~@PRODUCTID@~ extractor pattern token. By using a regular expression we limit what the token will match so that we can ensure we get only what we want. You might think of it as putting a little fence around the token. We want it to match any characters underneath the ~@PRODUCTID@~ extractor pattern token, up to (but not including) the double-quote character.

A line from that last paragraph is worth repeating. In most cases you'll want to designate a regular expression for extractor pattern tokens. Using regular expressions also makes extractor patterns more resilient to changes in the web site. That is, if the web site makes minor changes to its HTML (e.g., altering a font style or color), often if you've been using regular expressions your extractor patterns will still match. Also, by using regular expressions we can often decrease the amount of HTML we need to use in our extractor patterns. That is, by using regular expressions we indicate more precisely what the data will look like that our tokens will match. By doing this, we can often reduce the amount of HTML we include at the beginning and end of our extractor patterns. In general, if you can reduce the amount of HTML in your extractor patterns, and increase the number of regular expressions you use in tokens, your extractor patterns will be more resilient to changes that get made in the HTML of the pages.
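
If you'd like to see the "fence" in action outside of screen-scraper, the following sketch uses Java's standard regular expression classes. It is only an approximation of what an extractor pattern token does; the class name TokenFence and its extract method are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: compares a fenced token with an unfenced one.
class TokenFence {

    // Literal text that comes right before the token in the extractor pattern.
    private static final String BEFORE = "main_page=product_info&amp;products_id=";

    // Build a regex from the literal prefix, the token's regular expression,
    // and the double quote that follows the token, then return what the token captured.
    static String extract(String html, String tokenRegex) {
        Pattern p = Pattern.compile(Pattern.quote(BEFORE) + "(" + tokenRegex + ")\"");
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.screen-scraper.com/shop/index.php"
                + "?main_page=product_info&amp;products_id=8\">"
                + "<img src=\"images/dvd/a_bugs_life.gif\" border=\"0\" /></a>";

        // "Non-double quotes": the token stops at the first double quote, giving 8.
        System.out.println(extract(html, "[^\"]*"));

        // No fence: the token runs on to the last double quote on the line.
        System.out.println(extract(html, ".*"));
    }
}
```

The first call captures only the product ID, while the second swallows part of the image tag as well, which is exactly what the regular expression is there to prevent.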

Now close the "Edit Token" box, which saves our settings.

Now let's alter the settings for the ~@PRODUCT_TITLE@~ token. We're not interested in saving the value for this token in a session variable, but we include it since it will differ for each section of HTML we want to match. Double-click the ~@PRODUCT_TITLE@~ extractor pattern token to bring up the "Edit Token" box. Click on the "Regular Expression" tab, then select "Non-HTML tags". Again, take a look at the characters on the left and right sides of our ~@PRODUCT_TITLE@~ extractor pattern token. By using this regular expression we tell it not to include any greater than (>) or less than (<) symbols. This way we create a boundary for the token so that we can ensure it matches only what we want it to.

Why even include an extractor pattern token for data we don't want to save? This is another important principle. By using extractor pattern tokens for data we don't necessarily want to save, we make the extractor pattern more resilient to changes in the HTML. By using these extra tokens we can "future proof" our extractor patterns against changes the site owners might make down the road. There are also often situations (such as the present one) where data points adjacent to data we want to extract will differ for each pattern match. Here we only want the product ID, but we also include the product title because of its proximity to the data we want to extract, and because its value will differ each time the extractor pattern matches.

If those last few paragraphs strike you as a little bit confusing, don't worry. As you get more experience using screen-scraper you'll see why they're important. For now just take our word for it that you'll generally want to use regular expressions with extractor pattern tokens, and that it's often a good idea to use extractor pattern tokens to match data points you don't necessarily want to save. As you get more experience it will become more apparent when to use extractor pattern tokens for data you don't want to save.

Let's give our new extractor pattern a try. Click the "Apply Pattern to Last Scraped Data" button. You should see a window come up that shows the extracted data.

Again, let's slow down a moment and review what this window contains. When an extractor pattern matches, it produces a DataSet. You can think of a DataSet like a spreadsheet--it contains rows, columns, and cells. Each row in a DataSet is called a DataRecord. Again, a DataRecord can be thought of as being analogous to a row in a spreadsheet. In this particular case our DataSet has three columns. Two of them should be familiar--they correspond to the PRODUCT_TITLE and PRODUCTID extractor pattern tokens. The "Sequence" column indicates the order in which each row was extracted. You'll notice that the sequence is zero-based, meaning the first DataRecord in the DataSet is referenced with an index of 0. You'll also notice that the DataSet has 10 records--one for each product found in the search results page. Later on when we start talking more about DataSets and DataRecords, just remember the spreadsheet analogy--a DataSet is like the entire spreadsheet, and a DataRecord is like a single row in the spreadsheet.
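
For readers who think in code, the spreadsheet analogy can be sketched in plain Java as a list of rows, where each row maps a column name to a value. This is only a model of the idea, not screen-scraper's DataSet class; the name DataSetSketch and the use of Java regular expressions in place of an extractor pattern are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: models a DataSet as a list of rows (DataRecords).
class DataSetSketch {

    // The "Product details link" extractor pattern rewritten as a Java regex:
    // literal HTML is quoted, and each token becomes a capturing group.
    private static final Pattern PRODUCT_LINK = Pattern.compile(
            Pattern.quote("<td class=\"productListing-data\">&nbsp;<a href=\""
                    + "http://www.screen-scraper.com/shop/index.php"
                    + "?main_page=product_info&amp;products_id=")
            + "([^\"]*)"                      // PRODUCTID: non-double quotes
            + Pattern.quote("\">")
            + "([^<>]*)"                      // PRODUCT_TITLE: non-HTML tags
            + Pattern.quote("</a>&nbsp;</td>"));

    // Apply the pattern to a page and collect one row per match.
    static List<Map<String, String>> apply(String html) {
        List<Map<String, String>> dataSet = new ArrayList<>();
        Matcher m = PRODUCT_LINK.matcher(html);
        while (m.find()) {
            Map<String, String> dataRecord = new LinkedHashMap<>();
            dataRecord.put("Sequence", String.valueOf(dataSet.size())); // zero-based
            dataRecord.put("PRODUCTID", m.group(1));
            dataRecord.put("PRODUCT_TITLE", m.group(2));
            dataSet.add(dataRecord);
        }
        return dataSet;
    }

    public static void main(String[] args) {
        String html = "<td class=\"productListing-data\">&nbsp;<a href=\""
                + "http://www.screen-scraper.com/shop/index.php"
                + "?main_page=product_info&amp;products_id=8\">A Bug's Life</a>&nbsp;</td>";
        for (Map<String, String> dataRecord : apply(html)) {
            System.out.println(dataRecord);
        }
    }
}
```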

Another good habit to get into is applying your extractor patterns frequently to ensure they correctly match the text you want extracted. Go ahead and close the "DataSet" window now.

Now for our "Next" link. In the text editor where you pasted the full HTML from the web page, search for the text "Next". Around that area you'll find the HTML for the link:

&nbsp;&nbsp;<a href="http://www.screen-scraper.com/shop/index.php?main_page=advanced_search_result&amp;keyword=dvd&amp;sort=2a&amp;page=2" title=" Next Page ">[Next&nbsp;&gt;&gt;]</a>&nbsp;</td>


Fortunately, we're already familiar with the URL, and we know that the only parameters we need to worry about are "keyword" and "page". Create a new extractor pattern, call it "Next link", and use the following to grab the values of those parameters out:

&nbsp;&nbsp;<a href="http://www.screen-scraper.com/shop/index.php?main_page=advanced_search_result&amp;keyword=~@KEYWORD@~&amp;sort=2a&amp;page=~@PAGE@~" title=" Next Page ">[Next&nbsp;&gt;&gt;]</a>&nbsp;</td>


As with the previous extractor pattern, double-click the ~@PAGE@~ token, and, in the box that appears, check the "Save in session variable" checkbox. Click on the "Regular Expression" tab, and select "Number" from the "Select" drop-down list.

Close the "Edit Token" box to save your settings. If you're interested, the "Number" regular expression \d* simply indicates that we only want the PAGE token to match numbers (\d signifies a digit, and the * signifies "zero or more").

Next, double-click the "KEYWORD" extractor pattern token to edit it. Click on the "Regular Expression" tab, then select "URL GET parameter" from the "Select" drop-down list. This indicates that the "KEYWORD" extractor pattern token should match only characters that would be found in a "GET" parameter of a URL. We could have used the "Non-double quotes" regular expression as we did above, but used this one instead as it's a bit more specific still to what we do and don't want the token to match. You'll notice that we didn't check the box to save the "KEYWORD" extractor pattern token in a session variable. We already have that value in a session variable, so we don't bother getting it again.

Try out the extractor pattern by clicking the "Apply Pattern to Last Scraped Data" button. Excellent! We have two matches--one for each "Next" link on the page (the top and bottom of the page).
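
Here is a rough Java equivalent of the "Next link" pattern, in case you want to see why it matches twice. Treat it as a sketch: the class name NextLink is made up, and the expression [^&"]* used for KEYWORD is only our guess at something comparable to "URL GET parameter" (the actual expression screen-scraper uses may differ). The \d* for PAGE is the "Number" expression mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: the "Next link" extractor pattern as a Java regex.
class NextLink {

    private static final Pattern NEXT = Pattern.compile(
            Pattern.quote("main_page=advanced_search_result&amp;keyword=")
            + "([^&\"]*)"                     // KEYWORD (assumed expression)
            + Pattern.quote("&amp;sort=2a&amp;page=")
            + "(\\d*)"                        // PAGE: zero or more digits
            + Pattern.quote("\" title=\" Next Page \">"));

    // Return the PAGE value captured by every match, in order.
    static List<String> pages(String html) {
        List<String> pages = new ArrayList<>();
        Matcher m = NEXT.matcher(html);
        while (m.find()) {
            pages.add(m.group(2));
        }
        return pages;
    }

    public static void main(String[] args) {
        String link = "&nbsp;&nbsp;<a href=\"http://www.screen-scraper.com/shop/index.php"
                + "?main_page=advanced_search_result&amp;keyword=dvd&amp;sort=2a&amp;page=2\""
                + " title=\" Next Page \">[Next&nbsp;&gt;&gt;]</a>&nbsp;</td>";
        // The same link appears at the top and bottom of the page.
        String html = link + "\n<p>search results here</p>\n" + link;
        System.out.println(pages(html));
    }
}
```

Both matches carry the same page number, which is why following just one of them is enough.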

Now would be a good time to save your work. Do that by selecting "Save" from the "File" menu or by clicking the floppy disk icon.

OK, let's try out the whole thing once more. Click on the "Shopping Site" scraping session in the tree on the left, then on the "Log" tab. Click the "Clear Log" button--we're going to run it again and we don't want to get confused by the log text from the last run. As before, click on the "Run Scraping Session" button to get it going. You'll see quite a bit more text in the log this time. Take a minute to look through it to ensure you understand what's going on.

Tutorial 2: Page 7: Scraping Pages from Scripts

For each details link we're going to scrape the corresponding details page. This is a common scenario in screen-scraping--given a search results page, you need to extract details for each product, which means following each of the product details links. For each details page you'll likely want to extract out pieces of information corresponding to the products.

Let's start by creating a scrapeable file for the details page. We could create it from the proxy session, but it's pretty simple, so let's just create it from scratch. Click on the "Shopping Site" scraping session, the "General" tab, then click the "Add Scrapeable File" button. Give the scrapeable file the name "Details page", and the following URL:

http://www.screen-scraper.com/shop/index.php?main_page=product_info&products_id=~#PRODUCTID#~

You'll notice that this time we're leaving all of the parameters embedded in the URL. Sometimes with shorter URLs it's more convenient to take this approach rather than breaking them out under the "Parameters" tab. As before, when the scraping session runs, the ~#PRODUCTID#~ token will be replaced by the value of the "PRODUCTID" session variable. At this point, click the "This scrapeable file will be invoked manually from a script" checkbox. If we didn't do this, screen-scraper would invoke this scrapeable file in sequence (after the search results page), which we don't want. Instead, we're going to tell screen-scraper to invoke this scrapeable file from a script.

In screen-scraper, links are generally followed by invoking a script after an extractor pattern finds matches. Let's go over this in more detail. First, create a new script and call it "Scrape details page". If you're using Interpreted Java enter the following code:

session.scrapeFile( "Details page" );


If you're using VBScript enter the following:

Call session.ScrapeFile( "Details page" )


OK, this is where the logic may get a little tricky. For each product ID our "Product details link" extractor pattern extracts, we want to scrape the product details page using the PRODUCTID it extracts. Go to the "Product details link" extractor pattern by clicking the "Search results" scrapeable file, then the "Extractor Patterns" tab. Note the "Scripts" pane under the extractor pattern. Click the "Add Script" button. This will allow us to have a script execute as the pattern finds matches. Under the "Script Name" column, if it isn't already selected, select our "Scrape details page" script. Leave the "Sequence" as is, and, under the "When to Run" column, select "After each pattern application".

Let's walk through this a bit more slowly. After the search results page is requested the "Product details link" extractor pattern will be applied to the HTML in the page. Remember that this particular extractor pattern will match 10 times--once for each product details link. Each time it matches it will grab a different product ID and save the value of that product ID into the PRODUCTID session variable. The "Scrape details page" script will get invoked after each of these matches, and each time the PRODUCTID session variable will hold a different product ID. As such, each time the "Details page" gets scraped the URL will refer to a different product. For example, the first time the extractor pattern matches the PRODUCTID session variable will hold "8", and the URL will be:

http://www.screen-scraper.com/shop/index.php?main_page=product_info&products_id=8

The next time the product ID will be 34, yielding the URL:

http://www.screen-scraper.com/shop/index.php?main_page=product_info&products_id=34

If it helps, think again about the spreadsheet analogy. You can imagine screen-scraper walking through each row in the spreadsheet. It encounters a row, saves any needed data in session variables (the product ID, in this case), then invokes the "Scrape details page" script. Because it just matched a specific product ID, and saved its value in a session variable, when the "Details page" scrapeable file gets invoked by the script, the current product ID in the PRODUCTID session variable will be used. Once it's finished invoking the "Details page" scrapeable file, it will go on to the next row (or DataRecord) in the spreadsheet (or DataSet). Again, it will save the next product ID in a session variable, then execute the "Scrape details page" script, which in turn invokes the "Details page" scrapeable file. Because we indicated that the script should be invoked "After each pattern application", this will occur 10 times--once for each search result. If we had designated "After pattern is applied", the script would only have been executed once--after it traversed the spreadsheet and reached the very end.
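
The difference between the two "When to Run" settings can also be pictured as where the script call sits relative to a loop. The sketch below is a plain Java simulation, not screen-scraper code: the class ScriptTiming and its methods are hypothetical, a local variable stands in for the PRODUCTID session variable, and building a URL stands in for scraping the "Details page" file. It also assumes the session variable simply keeps the value from the last match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simulation only: shows when the script would run under each setting.
class ScriptTiming {

    // Stand-in for scraping the "Details page" file with the current PRODUCTID.
    static String detailsUrl(String productId) {
        return "http://www.screen-scraper.com/shop/index.php"
                + "?main_page=product_info&products_id=" + productId;
    }

    // "After each pattern application": the script runs inside the loop,
    // once per match, each time with a fresh PRODUCTID.
    static List<String> afterEachPatternApplication(List<String> matchedIds) {
        List<String> requested = new ArrayList<>();
        for (String id : matchedIds) {
            String productIdVariable = id;               // saved in session variable
            requested.add(detailsUrl(productIdVariable)); // script runs here
        }
        return requested;
    }

    // "After pattern is applied": the script runs once, after the loop,
    // so it only sees the value left by the last match (an assumption).
    static List<String> afterPatternIsApplied(List<String> matchedIds) {
        String productIdVariable = null;
        for (String id : matchedIds) {
            productIdVariable = id;                       // saved in session variable
        }
        List<String> requested = new ArrayList<>();
        if (productIdVariable != null) {
            requested.add(detailsUrl(productIdVariable)); // script runs here, once
        }
        return requested;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("8", "34", "7");
        System.out.println(afterEachPatternApplication(ids).size() + " requests");
        System.out.println(afterPatternIsApplied(ids).size() + " request");
    }
}
```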

Hopefully that's not too repetitive :) This is another area that people new to screen-scraper find confusing, so it's probably worth it to slow down a bit and ensure you understand what's going on.

Now would be a good time to try out the whole scraping session again. Do that like you did before by clearing out the log for the scraping session, then clicking the "Run Scraping Session" button. You'll see each details page getting requested one-by-one. Note especially each URL, which will have a different product ID at the end of each. If you'd prefer not to wait for the entire session to run you can click the "Stop Scraping Session" button. As before, it would be a good idea to go through the log carefully to ensure that you understand what it's doing.

At this point we still need to deal with the "Next" page link. We already have an extractor pattern to grab out the page number of the next page. Let's create a script to scrape the search results page again for each "Next" link. Generate a new script and call it "Scrape search results". If you're using Interpreted Java enter the following:

if( dataSet.getNumDataRecords() > 0 ){
     session.scrapeFile( "Search results" );
}


If you're using VBScript enter the following code (again, be sure to select "VBScript" from the "Language" drop-down box):

If dataSet.getNumDataRecords > 0 Then
    Call session.ScrapeFile( "Search results" )
End If


You'll notice that the script makes use of a "dataSet" variable. When the script is invoked screen-scraper will automatically create a variable corresponding to the current DataSet. This variable allows you to get access to all of the information that was extracted by the current extractor pattern. You can read more about objects available in scripts and their scope in our documentation, at the Using Scripts and API Documentation pages.

In this particular case, the script first checks the number of records in the current DataSet. That is, it looks at the number of DataRecords (or rows) in the DataSet (or spreadsheet). This effectively just checks to see if any "Next" link was found in the page. If so, it tells screen-scraper to scrape the "Search results" scrapeable file.

After creating the script return to the "Next link" extractor pattern, then click the "Add Script" button. Select the "Scrape search results" script. This time there's something slightly different we'll need to do under the "When to Run" column. First, click the "Apply Pattern to Last Scraped Data" button. You'll notice that the pattern matches twice. The problem is that we only want to follow one of the "Next" links (that is, we don't want to scrape the second page twice). This is easily dealt with by selecting "After pattern is applied" under the "When to run" column. In other words, the script will only get invoked once--after the extractor pattern has matched as many times as it can. Note, though, that because we're saving the value of the ~@PAGE@~ extractor pattern token in a session variable it will still hold the correct value when the page gets scraped. Because we indicate that the script is to be invoked "After pattern is applied", the "dataSet" variable will be in scope. See the Variable scope section in our documentation for more detail on which variables are in scope depending on when a given script is run.

OK, run the scraping session once more. Clear the scraping session log, then click the "Run Scraping Session" button again. If you let it run for a while you'll notice that it will request each details page for the products found on the first search results page, request the second search results page, then request each of the details pages for that page.
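
The order of requests you see in the log can be summarized with a short simulation. This is plain Java with made-up data, not screen-scraper code: the class CrawlOrder is hypothetical, a map of page numbers to product IDs stands in for the shopping site, and the presence of a following page stands in for the "Next link" pattern finding a match. In screen-scraper the scripts call back into the scrapeable files; here that is flattened into a simple loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simulation only: reproduces the order of requests described above.
class CrawlOrder {

    static List<String> crawl(Map<Integer, List<String>> productIdsByPage, int firstPage) {
        List<String> requests = new ArrayList<>();
        int page = firstPage;                            // the PAGE session variable
        while (productIdsByPage.containsKey(page)) {
            requests.add("Search results page=" + page);
            // "Product details link" pattern: one details request per match.
            for (String productId : productIdsByPage.get(page)) {
                requests.add("Details page products_id=" + productId);
            }
            // "Next link" pattern: if it matched, PAGE now holds the next page number.
            // If it did not match, the loop ends because that page does not exist.
            page = page + 1;
        }
        return requests;
    }

    public static void main(String[] args) {
        // Made-up site: two result pages with a few product IDs each.
        Map<Integer, List<String>> site = new TreeMap<>();
        site.put(1, Arrays.asList("8", "34"));
        site.put(2, Arrays.asList("7"));
        for (String request : crawl(site, 1)) {
            System.out.println(request);
        }
    }
}
```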

Tutorial 2: Page 8: Extracting Product Details

At this point we're able to scrape the details pages for each of the products. We're now ready to extract the information we're really interested in: data about each DVD. To do this we're going to use sub-extractor patterns. Again, this is a point in the tutorial where you may want to slow down a bit. Sub-extractor patterns are another important concept that can be a bit confusing at first.

Sub-extractor patterns allow us to define a small region within a larger HTML page from which we'll extract individual snippets of information. This helps to eliminate most of the HTML text we're not interested in, allowing us to be more precise about the data we'd like to extract. It also makes our extractor patterns more resilient to future changes in the HTML page, as they allow us to reduce the amount of HTML we need to include.

If you let the scraping session run through to completion the last URL in the scraping session log will be the following:

http://www.screen-scraper.com/shop/index.php?main_page=product_info&products_id=7

If that's not the exact one you have, don't worry; it won't make a difference for our extractor patterns. We'll need to examine the HTML for this page in order to generate the extractor patterns for it. Do this by clicking on the "Details page" scrapeable file in the tree on the left, then on the "Last Response" tab. You'll remember that screen-scraper records the HTML for the last time each page was requested. Bring up the URL above in your web browser. We'll be extracting the DVD title, price, model, shipping weight, and manufacturer.

It should be apparent in examining the page that most of the elements in it aren't of interest to us. For example, we don't care about the header, footer, or any of the boxes along the sides of the page. We'll first define a region that basically surrounds the elements we're interested in. Here is that full region:

<tr>
<td colspan="2" class="pageHeading" valign="top">
<h1>You've Got Mail</h1>
</td>
</tr>

<tr>
<td align="center" valign="top" class="smallText" rowspan="2">
<script language="javascript" type="text/javascript">
<!--
document.write('<a href="javascript:popupWindow(\'http://www.screen-scraper.com/shop/index.php?main_page=popup_image&amp;pID=7\')"><img src="images/dvd/youve_got_mail.gif" border="0" alt="You\'ve Got Mail" title=" You\'ve Got Mail " width="100" height="80" hspace="5" vspace="5" /><br />larger image<\/a>');
//-->
</script>



<noscript><a href="http://www.screen-scraper.com/shop/index.php?main_page=images/dvd/youve_got_mail.gif" target="_blank"><img src="images/dvd/youve_got_mail.gif" border="0" alt="You've Got Mail" title=" You've Got Mail " width="100" height="80" hspace="5" vspace="5" /><br />
larger image</a></noscript> </td>
<td class="main" align="center" valign="top">Model: DVD-YGEM</td>
</tr>

<tr>
<td class="main" align="center"></td>
</tr>

<tr>
<td align="center" class="pageHeading">$34.99</td>
<td class="main" align="center">Shipping Weight: 7.00 lbs.</td>
</tr>

<tr>
<td>&nbsp;</td>
<td class="main" align="center">10 Units in Stock</td>
</tr>

<tr>
<td class="main" align="center">Manufactured by: Warner</td>
<td align="center">
<table border="0" width="150px" cellspacing="2" cellpadding="2">
<tr>
<td align="center" class="cartBox">&nbsp;Quantity


That might seem like a large chunk of HTML, but it's actually a relatively small percentage of the entire page.

Before defining sub-extractor patterns we first define an extractor pattern with a special ~@DATARECORD@~ token in it. If you're familiar with computer programming in general, the ~@DATARECORD@~ token can be thought of as a "reserved word". That is, it's a token that has a special meaning in that it defines the sub-region of the HTML page containing the data elements we're interested in. You'll always use the ~@DATARECORD@~ token when using sub-extractor patterns.

Here's the extractor pattern we'll use:

<tr>
<td colspan="2" class="pageHeading~@DATARECORD@~Quantity


Notice that we simply replaced most of the middle portion of the large block of HTML with a ~@DATARECORD@~ token. If you look at the text before and after ~@DATARECORD@~ you can see that the same text is also found at the beginning and end of the large HTML block. The basic idea here is to include only as much HTML around the sub-region as necessary to uniquely identify it in the page. Any of the HTML covered by the ~@DATARECORD@~ token will be picked up by screen-scraper, and will define our sub-region that we'll be extracting the individual pieces of data from.
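
As a rough picture of what happens here, the sketch below first cuts out the sub-region and then searches only inside it. It is plain Java, not screen-scraper's implementation: the class DataRecordRegion is hypothetical, the ~@DATARECORD@~ token is approximated by a non-greedy "match anything" group (an assumption about its behavior), and the "Model:" expression is just one example of a piece of data you might then pull from the region.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: a sub-region first, then a search inside that region.
class DataRecordRegion {

    // The extractor pattern above, with ~@DATARECORD@~ as a non-greedy group.
    // \s* tolerates the line break between <tr> and <td ...>.
    private static final Pattern REGION = Pattern.compile(
            "<tr>\\s*" + Pattern.quote("<td colspan=\"2\" class=\"pageHeading")
            + "(.*?)" + Pattern.quote("Quantity"),
            Pattern.DOTALL);

    // Hypothetical example of a pattern applied only within the region.
    private static final Pattern MODEL = Pattern.compile("Model: ([^<>]*)</td>");

    // Return the text the DATARECORD group covers, or null if nothing matched.
    static String region(String html) {
        Matcher m = REGION.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Look for the model number inside an already extracted region.
    static String model(String region) {
        Matcher m = MODEL.matcher(region);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<div>header</div><tr>\n"
                + "<td colspan=\"2\" class=\"pageHeading\" valign=\"top\">\n"
                + "<h1>You've Got Mail</h1>\n</td>\n</tr>\n<tr>\n"
                + "<td class=\"main\" align=\"center\" valign=\"top\">Model: DVD-YGEM</td>\n"
                + "</tr>\n<tr>\n<td align=\"center\" class=\"cartBox\">&nbsp;Quantity"
                + "</td></tr><div>footer</div>";
        String region = region(html);
        System.out.println(model(region));
    }
}
```

Because the second search only ever sees the region, text elsewhere on the page (headers, footers, side boxes) cannot produce accidental matches.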

Create a new extractor pattern using the text given above (remember we're still using the "Details page" scrapeable file), then give it the name "PRODUCTS". Now click the "Apply Pattern to Last Scraped Data" button. In the window that appears, copy the text from the "DATARECORD" column and paste it into your text editor. The easiest way to select all of the text in that box is to triple-click it, use the keyboard to copy the text (Ctrl-C in Windows and Linux), then paste it into your text editor. The text should look like this:

" valign="top"><h1>You've Got Mail</h1></td></tr><tr><td align="center" valign="top" class="smallText" rowspan="2"><script language="javascript" type="text/javascript"><!--document.write( '<a href="javascript:popupWindow(\'http://www.screen-scraper.com/shop/index.php?main_page=popup_image &pID=7\')"><img src="images/dvd/youve_got_mail.gif" border="0" alt="You\'ve Got Mail" title=" You\'ve Got Mail " width="100" height="80" hspace="5" vspace="5" /><br />larger image<\/a>'); //--></script> <noscript><a href="http://www.screen-scraper.com/shop/index.php? main_page=images/dvd/youve_got_mail.gif" target="_blank"><img src="images/dvd/youve_got_mail.gif" border="0" alt="You've Got Mail" title=" You've Got Mail " width="100" height="80" hspace="5" vspace="5" /><br />larger image</a></noscript> </td><td class="main" align="center" valign="top"> Model: DVD-YGEM</td></tr><tr><td class="main" align="center"></td></tr><tr><td align="center" class="pageHeading">$34.99</td><td class="main" align="center">Shipping Weight: 7.00 lbs.</td> </tr><tr><td>&nbsp;</td><td class="main" align="center">10 Units in Stock</td></tr> <tr><td class="main" align="center">Manufactured by: Warner</td><td align="center"> <table border="0" width="150px" cellspacing="2" cellpadding="2"><tr><td align="center" class="cartBox">&nbsp;


This is the HTML we're after, but it's all in one large block. This occurs because screen-scraper strips out unnecessary white space when extracting information in order to make the extraction process more efficient. This can make sifting through the HTML a little more difficult, but the search feature in your text editor should make this relatively straightforward. You could also deal with the HTML found directly in the "Last Response" tab. You'd just have to be sure that you're only grabbing portions of the page that would be covered by the ~@DATARECORD@~ extractor pattern token.

First off, we're interested in the DVD title. In your text editor do a search for the first word in the title of the DVD whose page you're viewing (e.g., if you're viewing the HTML for the last DVD in the search results you'll search for "You've"). This should highlight the first word in the title. In order to extract this piece of information we'll use a small sub-extractor pattern:

<h1>~@TITLE@~</h1>


Once again, we include only as much HTML around the piece of data that we're interested in as is necessary. If we do this just right we'll still be able to extract information even if the web site itself makes minor changes. On our "PRODUCTS" extractor pattern, click the "Sub-Extractor Patterns" tab, then on the "Add Sub-Extractor Pattern" button. In the text box that appears paste the text for the sub-extractor pattern we've included above. Edit the ~@TITLE@~ extractor pattern token by double-clicking it, click the "Regular Expression" tab, then select "Non-HTML tags" from the drop-down list (as a side note, "Non-HTML tags" is probably the most common regular expression you'll use). Click on the "Apply Sub-Extractor Pattern to Last Scraped Data" button to try it out. You should see a DataSet with a single row and columns for the DATARECORD and TITLE tokens.

Next, create the following sub-extractor patterns for the remaining data elements we want to extract (note that each line of text will be a separate sub-extractor pattern):

>$~@PRICE@~<

>Model: ~@MODEL@~<

>Shipping Weight: ~@SHIPPING_WEIGHT@~<

>Manufactured by: ~@MANUFACTURED_BY@~<


For each token in the sub-extractor patterns give it the "Non-HTML tags" regular expression, as you did for the ~@TITLE@~ token.

As sub-extractor patterns match data, they aggregate the pieces into a single data record. That is, when our PRODUCTS extractor pattern is applied along with its sub-extractor patterns, the following data record will be produced:


TITLE            PRICE  MODEL     SHIPPING_WEIGHT  MANUFACTURED_BY
You've Got Mail  34.99  DVD-YGEM  7.00 lbs.        Warner


You can see this by clicking the "Apply Pattern to Last Scraped Data" button.
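To make the aggregation concrete, here is a minimal Java sketch of sub-patterns filling in one record. It assumes the "Non-HTML tags" regular expression behaves like the character class [^<>]*; the class and helper names are hypothetical, and this is an illustration rather than screen-scraper's own code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubExtractorSketch {
    // Assumption: "Non-HTML tags" matches any run of characters other than < and >.
    static final String NON_HTML_TAGS = "[^<>]*";

    // Hypothetical helper: turns a pattern such as ">Model: ~@MODEL@~<" into a
    // regex, applies it to the region, and stores the match under the token name.
    static void applySubPattern(String subPattern, String region, Map<String, String> record) {
        Matcher token = Pattern.compile("~@(\\w+)@~").matcher(subPattern);
        if (!token.find()) {
            return; // no token, so nothing to extract
        }
        String regex = Pattern.quote(subPattern.substring(0, token.start()))
            + "(" + NON_HTML_TAGS + ")"
            + Pattern.quote(subPattern.substring(token.end()));
        Matcher m = Pattern.compile(regex).matcher(region);
        if (m.find()) {
            record.put(token.group(1), m.group(1).trim());
        }
    }

    // Applies every sub-pattern to the same region and collects one record.
    static Map<String, String> extractRecord(String region) {
        Map<String, String> record = new LinkedHashMap<>();
        String[] subPatterns = {
            "<h1>~@TITLE@~</h1>",
            ">$~@PRICE@~<",
            ">Model: ~@MODEL@~<",
            ">Shipping Weight: ~@SHIPPING_WEIGHT@~<",
            ">Manufactured by: ~@MANUFACTURED_BY@~<"
        };
        for (String subPattern : subPatterns) {
            applySubPattern(subPattern, region, record);
        }
        return record;
    }

    public static void main(String[] args) {
        // Abbreviated stand-in for the DATARECORD text.
        String region = "<h1>You've Got Mail</h1></td>"
            + "<td class=\"main\">Model: DVD-YGEM</td>"
            + "<td class=\"pageHeading\">$34.99</td>"
            + "<td class=\"main\">Shipping Weight: 7.00 lbs.</td>"
            + "<td class=\"main\">Manufactured by: Warner</td>";
        System.out.println(extractRecord(region));
    }
}
```

Each sub-pattern contributes one column, and because they all run against the same region the results land in the same row.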

If you'd like, at this point try running the scraping session again by clearing the log and hitting the "Run Scraping Session" button. If you examine the log while the session runs you'll see that it extracts out details for each of the DVDs.

Tutorial 2: Page 9: Saving the Data

Saving the Data

Once screen-scraper extracts data there are a number of things that can be done with it. For example, you might be invoking screen-scraper from an ASP script, which, after telling screen-scraper to extract data, might display it to the user. In our case we'll simply write the data out to a text file. To do this, we'll once again write a script. Create a new script, call it "Write data to a file", and use either the following Interpreted Java:

FileWriter out = null;

try
{
    session.log( "Writing data to a file." );

    // Open up the file to be appended to.
    out = new FileWriter( "dvds.txt", true );

    // Write out the data to the file, one tab-delimited line per DVD.
    out.write( dataRecord.get( "TITLE" ) + "\t" );
    out.write( dataRecord.get( "PRICE" ) + "\t" );
    out.write( dataRecord.get( "MODEL" ) + "\t" );
    out.write( dataRecord.get( "SHIPPING_WEIGHT" ) + "\t" );
    out.write( dataRecord.get( "MANUFACTURED_BY" ) );
    out.write( "\n" );
}
catch( Exception e )
{
    session.log( "An error occurred while writing the data to a file: " + e.getMessage() );
}
finally
{
    // Close up the file, even if one of the writes failed.
    if( out != null )
    {
        try
        {
            out.close();
        }
        catch( Exception closeError )
        {
            session.log( "An error occurred while closing the file: " + closeError.getMessage() );
        }
    }
}

Or the following VBScript (remember to select "VBScript" from the "Language" drop-down box):

' Generate objects to write data to a file.
Set objFSO = CreateObject( "Scripting.FileSystemObject" )
' The "8" indicates that we want to append data to the file.
Set objDVDFile = objFSO.OpenTextFile( "dvds.txt", 8, True )

' Write out the data to the file.
objDVDFile.Write dataRecord.Get( "TITLE" ) & vbTab
objDVDFile.Write dataRecord.Get( "PRICE" ) & vbTab
objDVDFile.Write dataRecord.Get( "MODEL" ) & vbTab
objDVDFile.Write dataRecord.Get( "SHIPPING_WEIGHT" ) & vbTab
objDVDFile.Write dataRecord.Get( "MANUFACTURED_BY" )
objDVDFile.Write vbCrLf

' Close the file and clean up.
objDVDFile.Close
Set objDVDFile = Nothing
Set objFSO = Nothing

Our script simply takes the contents of the current data record (which for us will be the data record that constitutes a single DVD) and appends it to a "dvds.txt" text file.

If you're familiar with VBScript or Java, hopefully the scripts make sense. There is one important point worth noting, though. You'll notice that each script makes use of a "DataRecord" object (referenced as the "dataRecord" variable in the scripts). This object refers to the current DataRecord as the script is executed. Again, think of the spreadsheet. When the script gets invoked, a specific DataRecord (or row in the spreadsheet) will be current. This DataRecord automatically becomes a variable you can use in your script. The DataRecord object has a "get" method, which allows you to retrieve the value for a key it contains (i.e., you're referencing a specific cell in the spreadsheet). Again, you can read more about objects available in scripts and their scope in our documentation, at the Using Scripts and API Documentation pages.

Click on the "Details page" scrapeable file, then on the "Extractor Patterns" tab. Below the extractor pattern text click the "Add Script" button. In the "Script Name" column, select "Write data to a file" and in the "When to Run" column select "After each pattern application" (even though there will only be one match per page). For each DVD we'll execute the script that will write the information out to a file.

To clarify a bit further, because we're invoking the script "After each pattern application", the "dataRecord" variable will be in scope. In other words, for each row in the spreadsheet (which happens to be a single row in this case) screen-scraper will execute the "Write data to a file" script. Each time it gets invoked a DataRecord will be current (again, think of it walking through each row in the spreadsheet). As such, we have access to the current row in the spreadsheet by way of the "dataRecord" variable. Had we indicated that the script was to be invoked "After pattern is applied", the "dataRecord" would not be in scope. Again using the spreadsheet analogy, scripts that get invoked "After pattern is applied" would run after screen-scraper had walked through all of the rows in the spreadsheet, so no DataRecord would be in scope (i.e., it's at the end of the spreadsheet--after the very last row). See the Variable scope section in our documentation for more detail on which variables are in scope depending on when a given script is run.
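The spreadsheet analogy can be sketched in plain Java. This is only an illustration of the scoping idea, with hypothetical names and an ordinary Map standing in for a DataRecord; it does not use screen-scraper's classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ScriptScopeSketch {
    // Hypothetical stand-in for applying a pattern: the per-row step sees the
    // current row, while the final step runs once with no row available.
    static List<String> applyPattern(List<Map<String, String>> dataSet) {
        List<String> log = new ArrayList<>();
        for (Map<String, String> dataRecord : dataSet) {
            // "After each pattern application": dataRecord is in scope here.
            log.add("row: " + dataRecord.get("TITLE"));
        }
        // "After pattern is applied": the loop is over, so no single row is current.
        log.add("done: " + dataSet.size() + " row(s)");
        return log;
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("TITLE", "You've Got Mail");
        row.put("PRICE", "34.99");

        List<Map<String, String>> dataSet = new ArrayList<>();
        dataSet.add(row);

        for (String line : applyPattern(dataSet)) {
            System.out.println(line);
        }
    }
}
```

A script that needs individual values, such as our file-writing script, belongs in the per-row position; one that only needs the whole set can run once at the end.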

Once again, run the scraping session. This time if you check the directory where screen-scraper is installed you'll notice a dvds.txt file that will grow as the DVD details pages get scraped.

Note that as an alternative to the above scripts you could do the following in Interpreted Java (professional and enterprise editions only):

dataSet.writeToFile( "dvds.txt" );

Or in VBScript:

Call dataSet.WriteToFile( "dvds.txt" )

We included the first example to demonstrate referencing data records in scripts.

If you would like more information on saving extracted data to a database please consult our FAQ on the topic here.

Tutorial 2: Page 10: Logging In

Logging In

Oftentimes it's necessary to log in to a web site before extracting the information you're interested in. This is generally quite a bit easier than it might seem. Typically this simply involves creating a scrapeable file to handle the login that will get invoked before any of the other pages. The shopping site we're scraping from doesn't require us to log in before performing searches, but for the sake of this tutorial we'll set it up as if it did.

Before we look at the page that handles the actual login, we need to have screen-scraper request the home page for the shopping site. This is necessary because it allows for a few initial cookies to be set before we attempt to log in. If you're familiar with web programming, we're requesting the home page so that the server can create a session for us (tracked by the cookies) prior to our attempting a login. By having screen-scraper request the home page, those cookies will get set, and screen-scraper will then automatically track them for us.
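If it helps to see the cookie handling spelled out, the sketch below uses Java's standard java.net.CookieManager to record a cookie from one response and replay it on a later request. It illustrates the general mechanism only, not screen-scraper's internals; the class name is hypothetical and "sessionid=abc123" is a made-up cookie, not one the shopping site is known to set.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.URI;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class SessionCookieSketch {
    // Records a Set-Cookie header as if it came from the home page response,
    // then returns the Cookie header that would accompany a request to loginUrl.
    static String cookieHeaderFor(String homeUrl, String setCookie, String loginUrl) {
        try {
            CookieManager cookies = new CookieManager();
            Map<String, List<String>> response =
                Collections.singletonMap("Set-Cookie", Collections.singletonList(setCookie));
            cookies.put(URI.create(homeUrl), response);

            Map<String, List<String>> request =
                cookies.get(URI.create(loginUrl), Collections.<String, List<String>>emptyMap());
            List<String> values = request.get("Cookie");
            return values == null ? "" : String.join("; ", values);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(cookieHeaderFor(
            "http://www.screen-scraper.com/shop/",
            "sessionid=abc123; path=/",
            "http://www.screen-scraper.com/shop/index.php?main_page=login"));
    }
}
```

No network request is made here; the point is simply that a cookie stored from the first response is sent along with later requests to the same site.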

Create a scrapeable file for the home page by clicking on the "Shopping Site" scraping session (the one with a gear) in the tree on the left, then on the "Add Scrapeable File" button. Give the new scrapeable file the name "Home". Leave its sequence as "1", and give it the URL "http://www.screen-scraper.com/shop/".

Login HTTP requests are usually POST requests, which makes it trickier to tell what parameters are being passed to the server (i.e., the parameters won't appear in the URL). The proxy server can make viewing the parameters easier, so let's make use of it. Open your web browser to the shopping login page:

http://www.screen-scraper.com/shop/index.php?main_page=login

In screen-scraper click on the "Shopping Site" proxy session, then on the "Start Proxy Server" button (found on the "General" tab). Now click on the "Progress" tab. Go ahead and remove any HTTP transactions that are already there by clicking the "Clear All Transactions" button. Configure your web browser to use screen-scraper as a proxy server as you did earlier.

In your web browser, in the "E-Mail Address" field enter test@test.com and in the "Password" field enter testing, then click the "login" button. After screen-scraper works for a bit, return to the "General" tab and click the "Stop Proxy Server" button. Re-configure your web browser so that it no longer uses screen-scraper as a proxy server.

If you paid close attention to screen-scraper as it was working you may have noticed that two rows were added to the "HTTP Transactions" table (it's actually possible that three were added; if so just delete the last one by highlighting it and hitting the "Delete" key on your keyboard). Click on the second to last row in the table; its URL should begin with:

http://www.screen-scraper.com/shop/index.php?main_page=login

This is the actual login POST request. If you scroll down in the lower section and look in the "POST data" text box you'll see the email address and password we entered earlier. You'll also notice that "x" and "y" parameters were passed in (these simply represent the coordinates where you clicked the "login" button). If you click on the "Response" tab, once again in the lower section, you'll notice that the "Status Line" field shows a response code of "302 Found". This is a redirect response, which indicates that the browser should be redirected to a different URL. When the server issued this response your browser followed the redirect to this other URL, creating the last row in the "HTTP Transactions" table.

At this point we'll want to copy the login POST request to our scraping session. We only need the second to last transaction in the table (the login request itself) and not the request representing the redirect, since screen-scraper will automatically follow redirects for us. Copy the HTTP transaction to your scraping session by clicking on the second to last row in the table (the one corresponding to the POST request), ensure that the "Shopping Site" scraping session is selected in the drop-down, then click the "Go" button. After the new scrapeable file is created under the scraping session rename it "Login". Also, set its sequence to 2. It should be requested right after the home page is requested. screen-scraper automatically tracks cookies, just like a web browser, so by requesting it near the beginning any subsequent pages that are protected by the login will be accessible.

Now click the "Parameters" tab in our "Login" scrapeable file. You'll notice that screen-scraper automatically extracted out the various POST parameters and added them to the scrapeable file. If you're familiar with URL encoding, you'll also notice that screen-scraper decoded the "email_address" parameter to "test@test.com". screen-scraper automatically URL encodes parameters found under the "Parameters" tab before passing them up to the server.
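For readers less familiar with URL encoding, this short Java sketch shows the round trip using the standard URLEncoder and URLDecoder classes. The class name is hypothetical, and the "x" and "y" parameters are left out for brevity; it illustrates the encoding itself rather than screen-scraper's own code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class PostParameterEncodingSketch {
    // URL encodes a single parameter value.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Reverses the encoding, as happens when the parameters are displayed.
    static String decode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The POST body as it travels to the server: "@" becomes "%40".
        String body = "email_address=" + encode("test@test.com")
            + "&password=" + encode("testing");
        System.out.println(body);

        // The decoded form, as shown under the "Parameters" tab.
        System.out.println(decode("test%40test.com"));
    }
}
```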

At this point feel free to run the scraping session again. Because our site doesn't require logging in before searching can take place it won't make much difference, but you'll at least be able to see the login page being requested in the log for the scraping session.

Tutorial 2: Page 11: Where to Go From Here

Where to Go From Here

Congratulations! At this point you should have the basics under your belt to scrape most web sites. From here you could continue on with one of the subsequent tutorials, if they seem relevant to your project. It may also be a good idea to look through a bit more of our documentation in order to get familiar with other details of screen-scraper. Either way, probably the best way to learn screen-scraper is to use it. Try it on one of your own projects!

Tutorial 3: Extending Hello World

Extending Hello World

This tutorial continues on where Tutorial 1: Hello World left off, and covers aspects of screen-scraper related to richer scripting and interacting with screen-scraper from external languages, including Active Server Pages, PHP, and Java.

If you haven't completed the first tutorial don't worry, but you'll at least need to import the script and scraping session that were created in the first tutorial. To do that, follow these directions:

  1. Download the zip file located here and unzip it. You should now have an "interpreted_java" directory and a "vbscript" directory.
  2. If you're running Windows, and prefer to program in VBScript, import the "Hello World (Scraping Session).xml" scraping session located in the "vbscript" directory; otherwise, import the one located in the "interpreted_java" directory. Instructions on importing objects into screen-scraper can be found here.

The following scraping session is the completed version of the Tutorial 3 scraping session.

Attachment                             Size
Hello World (Scraping Session).sss     3.06 KB

Tutorial 3: Page 2: Embedding Session Variables

Embedding Session Variables

A significant limitation of our first "Hello World" project was that we could only scrape the text from our first request. That is, we were always scraping the text "Hello World!", which really isn't that useful. We'll now adjust our setup so that we can designate the text to be submitted in the form.

At this point we're going to set a session variable that will hold the text we'd like submitted in the form. Within screen-scraper, session variables are used to transfer information between scripts, scrapeable files, and other objects. Session variables are generally set from within scripts, but can also be automatically set within extractor patterns as well as passed in from external applications.

We'll now set up a script to set a session variable before our scraping session runs. Create a new script as you've done before, and call it "Initialize scraping session". If you prefer to script in Interpreted Java, use the following for the body of the script:

// Put the text to be submitted in the form into a
// session variable so we can reference it later.
session.setVariable( "TEXT_TO_SUBMIT", "Hi everybody!" );

If you wrote the script in VBScript, make it look like this:

' Put the text to be submitted in the form into a
' session variable so we can reference it later.
session.SetVariable "TEXT_TO_SUBMIT", "Hi everybody!"

Hopefully the scripts seem straightforward. Each one simply sets a session variable named "TEXT_TO_SUBMIT", and gives it the value "Hi everybody!" (spoken, of course, in your best Dr. Nick voice).

Setting the session variable "TEXT_TO_SUBMIT" will allow us to access that value in other scripts and scrapeable files while our "Hello World" scraping session is running.

We'll now need to associate our script with our scraping session so that it gets invoked before the scraping session begins. To do that, click on the scraping session in the tree on the left, then on the "Scripts" tab. Click the "Add Script" button to add a script. In the "Script Name" column select "Initialize scraping session". The "When to Run" column should show "Before scraping session begins", and the "Enabled" checkbox should be checked. This will cause our script to get executed at the very beginning of the scraping session so that the "TEXT_TO_SUBMIT" session variable can get set.

Just as we use special tokens in extractor patterns to designate values we'd like to extract, we use special tokens to insert values of session variables into the URLs or parameters (GET, POST, or BASIC authentication) of scrapeable files. We'll do this now by embedding it into one of the parameters of our only scrapeable file. Expand the "Hello World" scraping session in the tree on the left, then select the "Form submission" scrapeable file. Click on the "Parameters" tab. In the "Value" column for our "text_string" parameter replace the text "Hello world!" with the text:

~#TEXT_TO_SUBMIT#~

The ~# and #~ delimiters are used to designate a session variable whose value should be inserted into that location when the scrapeable file gets executed. When the scrapeable file gets invoked, screen-scraper will construct the URL by including the "text_string" parameter in it. In other words, the URL for our scrapeable file will become this:

http://www.screen-scraper.com/screen-scraper/tutorial/basic_form.php?text_string=Hi+everybody%21
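The substitution that produces the URL above can be approximated in a few lines of Java. This is a sketch of the idea under our own assumptions (hypothetical class and method names, unknown tokens left untouched), not a description of screen-scraper's internals.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SessionVariableTokenSketch {
    private static final Pattern TOKEN = Pattern.compile("~#(\\w+)#~");

    // Replaces each ~#NAME#~ token with the matching session variable value.
    // Tokens with no matching variable are left as-is (a choice made for this sketch).
    static String resolve(String template, Map<String, String> sessionVariables) {
        Matcher m = TOKEN.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = sessionVariables.get(m.group(1));
            m.appendReplacement(out,
                Matcher.quoteReplacement(value != null ? value : m.group()));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Builds a GET URL with one parameter whose value may contain tokens.
    static String buildUrl(String baseUrl, String name, String valueTemplate,
                           Map<String, String> sessionVariables) {
        try {
            String value = resolve(valueTemplate, sessionVariables);
            return baseUrl + "?" + name + "=" + URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        Map<String, String> sessionVariables = new HashMap<>();
        sessionVariables.put("TEXT_TO_SUBMIT", "Hi everybody!");

        System.out.println(buildUrl(
            "http://www.screen-scraper.com/screen-scraper/tutorial/basic_form.php",
            "text_string", "~#TEXT_TO_SUBMIT#~", sessionVariables));
    }
}
```

The value is substituted first and URL encoded second, which is why the space and exclamation mark appear as "+" and "%21" in the final URL.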

We're going to run our scraping session again, but before doing that clear out the scraping session log by selecting the "Hello World" scraping session in the tree, clicking on the "Log" tab, then on the "Clear Log" button. Start up the scraping session again by clicking the "Run Scraping Session" button. Once the scrape has run, you should notice the following lines in the log:

Form submission: The following data elements were found:
Form data--DataRecord 0:
FORM_SUBMITTED_TEXT=Hi everybody!

And if you look at the contents of the "form_submitted_text.txt" file you'll notice the same text.

Remember that it's a good idea to run scraping sessions often as you make changes, and watch the log and last responses to ensure that things are working as you expect them to.

Tutorial 3: Page 3: Interacting with Screen-Scraper Externally

Interacting with Screen-Scraper Externally

Invoking screen-scraper from the command line

If you've decided to use the basic edition of screen-scraper your only option for invoking screen-scraper externally is from the command line (invoking screen-scraper from the command line is also available in the professional and enterprise editions). You can find full documentation and examples on doing that at our Invoking screen-scraper from the command line documentation page. If you don't need to invoke screen-scraper from the command line you can skip to the Invoking screen-scraper from an external application section.

In order to invoke screen-scraper from the command line, you'll want to create a batch file (in Windows) or a shell script (in Linux or Mac OS X) to invoke the scraping session. If you're using Windows open a text editor (e.g., Notepad) and enter the following:

jre\bin\java -jar screen-scraper.jar -s "Hello World" --params "TEXT_TO_SUBMIT=Hello+World"



Save the batch file (call it "hello_world.bat") in the folder where screen-scraper is installed (e.g., C:\Program Files\screen-scraper professional edition\). Vista users, you will need to save your batch file to a location such as your Documents folder or your Desktop. Then, within Windows Explorer, manually transfer the file to the directory where screen-scraper is installed.

Within screen-scraper, you'll want to disable the "Initialize scraping session" script; otherwise, the value we pass in from the command line would get overwritten once that script is executed. Disable the script by clicking on the "Hello World" scraping session, then on the "Scripts" tab, then un-checking the "Enabled?" check box for the script.

You can then run the batch file by opening a DOS prompt, changing to the folder containing the batch file, then invoking it. You should see the text from screen-scraper's log appear in the DOS window. If you're running Linux or Mac OS X, you'll need to close the workbench before invoking your shell script.
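The value passed with --params in the batch file is URL encoded, which is why the space in "Hello World" is written as "+". The Java sketch below shows how such a string maps to session variable names and values. It assumes name=value pairs separated by "&", which the example above does not itself confirm, and the class name is hypothetical; see the command line documentation page for the authoritative format.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

class CommandLineParamsSketch {
    // Parses a URL-encoded string such as "TEXT_TO_SUBMIT=Hello+World" into
    // variable names and decoded values. The "&" separator is an assumption.
    static Map<String, String> parse(String params) {
        Map<String, String> variables = new LinkedHashMap<>();
        for (String pair : params.split("&")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // skip anything that is not name=value
            }
            variables.put(decode(pair.substring(0, eq)), decode(pair.substring(eq + 1)));
        }
        return variables;
    }

    private static String decode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("TEXT_TO_SUBMIT=Hello+World"));
    }
}
```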

Invoking screen-scraper from an external application

Note that the rest of this tutorial only applies to the professional and enterprise editions of screen-scraper.

Oftentimes you'll want to use a language or platform external to screen-scraper to scrape data. screen-scraper can be controlled externally using Java, PHP, Ruby, Python, .NET, ColdFusion, any COM-friendly language (such as Active Server Pages or Visual Basic), or any language that supports SOAP. In this next part of the tutorial we'll give examples in PHP, Java, ColdFusion, and Active Server Pages.

In order to interact with screen-scraper externally it needs to be running as a server. When running as a server screen-scraper acts much like a database server does. That is, it listens for requests from external sources, services those requests, and sends back responses. For example, when you issue a SQL statement to a database from an ASP script your script is opening up a socket to the database, sending the request over it, then receiving the database's response back over the socket. Once this transaction has been completed the socket will be closed, but the database will continue to listen for other requests. screen-scraper works in a similar way.

At this point we'd recommend reading over the documentation page that discusses running screen-scraper as a server, and gives details on how to start and stop it according to the platform you're running on. Follow the link below, then return back to this page when you're finished:

Running screen-scraper as a server

Before we start writing code to interact with screen-scraper externally we need to configure a few things. Depending on the language you'd like to program in, please follow one of the links below, which will give you an overview of interacting with screen-scraper using that language and guide you through any configuration that needs to take place. Once you're finished return back to this page.

Invoking screen-scraper from ColdFusion

Invoking screen-scraper from a COM-based application

Invoking screen-scraper from Java

Invoking screen-scraper from PHP

Each time you run a scraping session externally screen-scraper will generate a log file corresponding to that scraping session in the "log" folder found inside the folder where you installed screen-scraper. This can be invaluable for debugging, so you'll want to take a look at it if you run into trouble. You can turn server logging off by unchecking the "Generate log files" check box under the "Servers" section of the "Settings" dialog box.

If you haven't already, within screen-scraper, you'll want to disable the "Initialize scraping session" script; otherwise, the value we pass in from our external application would get overwritten once that script is executed. Disable the script by clicking on the "Hello World" scraping session, then on the "Scripts" tab, then un-checking the "Enabled?" check box for the script.

OK, we're now ready to write some code. Follow one of the links below.

Tutorial 3: Page 4: Interacting with screen-scraper from ASP

Interacting with screen-scraper from ASP

The ASP script we'll be writing will invoke our scraping session remotely, passing in a value for the "TEXT_TO_SUBMIT" session variable. Create a new ASP script on your computer, and paste the following code into it:

<%
' Create a RemoteScrapingSession object.
Set objRemoteSession = Server.CreateObject("Screenscraper.RemoteScrapingSession")

' Generate a new "Hello World" scraping session.
Call objRemoteSession.Initialize("Hello World")
   
' Put the text to be submitted in the form into a session variable so we can reference it later.
Call objRemoteSession.SetVariable( "TEXT_TO_SUBMIT", "Hi everybody!" )

' Check for errors.
If objRemoteSession.isError Then
    Response.Write( "Error: " & objRemoteSession.GetErrorMessage )
Else
    ' Tell the scraping session to scrape.
    Call objRemoteSession.Scrape

    ' Write out the text that was scraped:
    Response.Write( "Scraped text: " & objRemoteSession.GetVariable("FORM_SUBMITTED_TEXT") )
End If

' Disconnect from the server.
Call objRemoteSession.Disconnect
%>



There are just a couple of extra steps we take here that we didn't take in our previous script. First, after creating our RemoteScrapingSession object we make a separate call to initialize it for our specific scraping session. Also, you'll notice that before calling the Scrape method we check for any errors that may have occurred up to this point. For example, if for some reason your ASP script can't connect to the server you'd want to know before you tried to tell it to scrape. Finally, we need to explicitly disconnect from the server so that it knows we're done.

OK, we're ready to give our script a try. Start screen-scraper running as a server. If you need help or have trouble with this refer to the documentation page here: Running screen-scraper as a server. If you've succeeded in starting up the server go ahead and load your ASP script in a browser. After a short pause you should see the "Hi everybody!" message output to your browser. If something goes wrong please refer to the "Related pages" section found below for help.

Tutorial 3: Page 4: Interacting with Screen-Scraper from Java

Interacting with Screen-Scraper from Java

The Java class we'll be writing will simply substitute for the "Initialize scraping session" script we wrote previously. That is, our Java class will invoke our scraping session remotely, passing in a value for the "TEXT_TO_SUBMIT" session variable. Create a new Java class on your computer, and paste the following code into it:

import com.screenscraper.scraper.*;

public class HelloWorldRemoteScrapingSession
{
    /**
     * The entry point.
     */
    public static void main( String args[] )
    {
        try
        {
            // Create a remoteSession to communicate with the server.
            RemoteScrapingSession remoteSession = new RemoteScrapingSession( "Hello World" );

            // Put the text to be submitted in the form into a session variable so we can reference it later.
            remoteSession.setVariable( "TEXT_TO_SUBMIT", "Hi everybody!" );

            // Tell the session to scrape.
            remoteSession.scrape();

            // Output the text that was scraped:
            System.out.println( "Scraped text: " + remoteSession.getVariable( "FORM_SUBMITTED_TEXT" ) );

            // Very important! Be sure to disconnect from the server.
            remoteSession.disconnect();
        }
        catch( Exception e )
        {
            System.err.println( e.getMessage() );
        }
    }
}



This Java code is virtually identical to our script. The one notable difference is that we need to explicitly disconnect from the server so that it knows we're done.

OK, we're ready to give our Java class a try. After you've successfully compiled the class (remember to include the "screen-scraper.jar" file in your classpath), start screen-scraper running as a server. If you need help or have trouble with this refer to the documentation page here: Running screen-scraper as a server. If you've succeeded in starting up the server go ahead and run the Java class from a command prompt or console. After a short pause you should see the "Hi everybody!" message output. If something goes wrong please refer to the "Related pages" section found below for help.

Tutorial 3: Page 5: Where to Go From Here

Where to Go From Here

Congratulations! You've now covered all of the basic principles needed to invoke screen-scraper externally. In working on your own projects we'd suggest referring frequently to the screen-scraper documentation available from within the application or on our web site.

Tutorial 2 deals with other topics, including scraping search results (with multiple records) across multiple pages, and logging in to a web site before scraping information.

Tutorial 4: Scraping an E-commerce Site from External Programs

Tutorial Overview

This tutorial illustrates invoking screen-scraper from other programs in ways more complex than those presented in Tutorial 3. From our external program we'll pass search parameters to screen-scraper, invoke the scraping process, get the scraped data back from screen-scraper, then iterate over the data and output it within our application.

Before proceeding it would be a good idea to go through Tutorial 2, if you haven't done so already.

If you haven't gone through Tutorial 2, or don't still have the scraping session you created in it, you can download and load it into screen-scraper by following these steps:

  1. Download the zip file located here and unzip it. You should now have an "interpreted_java" directory and a "vbscript" directory.
  2. If you're running Windows, and prefer to program in VBScript, import the "Shopping Site (Scraping Session).sss" scraping session located in the "vbscript" directory; otherwise, import the one located in the "interpreted_java" directory. Instructions on importing objects into screen-scraper can be found here.

Once you've got the scraping session imported into screen-scraper you're ready to roll. Click on the "Tutorial Details" link below to get going.

Tutorial 4: Page 2: Tutorial Details

Tutorial Details

screen-scraper can be invoked from software applications written in most modern programming languages, including Java, Active Server Pages, PHP, .NET, and anything that supports SOAP. In this tutorial we'll give some examples of applications that do just that.

Our application will pass parameters to screen-scraper corresponding to login information as well as a key phrase for which to search. As in the second tutorial, we're going to pretend that the web site requires us to log in before we can search, for the sake of providing an example, even though it actually doesn't. Once we pass the parameters to screen-scraper we'll tell it to start scraping. screen-scraper will then run the scraping session using the parameters we gave it, extracting the data it normally does. Once it's done, we'll ask it for the extracted information, then output it for the user to see.

Before we begin we'll first need to make a couple of minor changes to the e-commerce scraping session from the second tutorial. If you haven't already, start up screen-scraper. Under the "Shopping Site" scraping session click on the "Login" scrapeable file, then on the "Parameters" tab. We're going to alter the "email_address" and "password" POST parameters so that we can pass those parameters in rather than hard-coding them. For the "email_address" parameter change the value "test@test.com" to ~#EMAIL_ADDRESS#~, and change the "testing" value for the "password" parameter to ~#PASSWORD#~. You might remember from Tutorial 2 that tokens surrounded by the ~# #~ delimiters indicate that the value of a session variable should be inserted. For example, in our case we're going to create an "EMAIL_ADDRESS" session variable and give it the value "test@test.com" so that screen-scraper substitutes it in for the corresponding POST parameter at runtime.
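To make the token idea concrete, here is a minimal Java sketch of what substituting ~#NAME#~ tokens with session variable values amounts to. It is an illustration only, not screen-scraper's actual implementation: the class name, the method name, and the choice to leave unknown tokens untouched are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: mimics how ~#NAME#~ tokens could be replaced by
// session variable values. This is NOT screen-scraper's own code.
class TokenSubstitutionSketch {

    // Matches tokens such as ~#EMAIL_ADDRESS#~ and captures the variable name.
    private static final Pattern TOKEN = Pattern.compile("~#(\\w+)#~");

    // Replaces each token with the value of the matching session variable.
    // Tokens with no matching variable are left as-is (an assumption made
    // for this sketch).
    static String substitute(String template, Map<String, String> sessionVariables) {
        Matcher matcher = TOKEN.matcher(template);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String value = sessionVariables.get(matcher.group(1));
            String replacement = (value != null) ? value : matcher.group();
            matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        Map<String, String> sessionVariables = new LinkedHashMap<>();
        sessionVariables.put("EMAIL_ADDRESS", "test@test.com");
        sessionVariables.put("PASSWORD", "testing");

        // prints "email_address=test@test.com"
        System.out.println(substitute("email_address=~#EMAIL_ADDRESS#~", sessionVariables));
        // prints "password=testing"
        System.out.println(substitute("password=~#PASSWORD#~", sessionVariables));
    }
}
```

In the tutorial itself you never write this code; screen-scraper performs the substitution for you once the session variables have been set from your external application.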

In addition, click on the "Details page" scrapeable file. On the "PRODUCTS" extractor pattern, select the "Advanced" tab and check the box next to "Automatically save the data set generated by this extractor pattern in a session variable."

The code that we'll be writing in our external application will essentially take the place of the current "Shopping Site--initialize session" script. Let's disable that script, since it would otherwise overwrite the values we'll be passing in externally. To do that click on the "Shopping Site" scraping session in the tree on the left, then on the "Scripts" tab. In the scripts table, un-check the "Enabled?" check box for the "Shopping Site--initialize session" script. Save your changes and exit screen-scraper.

Where you go next depends on which programming language you're interested in. Use one of the links below according to your preference.

Tutorial 4: Page 3: Invoking screen-scraper from ASP

Invoking screen-scraper from ASP

In order to invoke screen-scraper from ASP, screen-scraper needs to be running in server mode. If you'd like a refresher on how to start up screen-scraper in server mode go ahead and follow that link, then come back here.

Okay, let's try it out before we go over the code. Download the shopping.asp file here, then save it to a directory where it will be web-accessible (i.e., within your IIS web directory). After that start up screen-scraper in server mode.

Open up your web browser and go to the URL corresponding to the "shopping.asp" file (e.g., http://localhost/screen-scraper/shopping.asp). You'll see a simple search form. Type in a product keyword, such as "bug", then hit the "Go" button. If all goes well the page will take a little while to load (it's waiting as screen-scraper extracts the data), then it will output the corresponding products.

If that didn't go quite as you expected here are some things to check:

  • Make sure screen-scraper is running as a server, and that nothing is blocking its ports (such as a firewall running on your machine).
  • If you're running screen-scraper on a different machine than the one your ASP file resides on, make sure that screen-scraper is allowing connections from the ASP machine. In the screen-scraper workbench click on the wrench icon, then on the "Servers" button, and check that the "Hosts to allow to connect" setting includes the IP address (or perhaps just the first part of the IP address) of the ASP machine.
  • Check screen-scraper's "log" folder for a "Shopping Site" log file. If you find one it means that screen-scraper is at least receiving the request. Open the log file in a text editor to see if you find any error messages.
  • If you still can't seem to get it to work feel free to drop us a support request.
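For the first item in the list above, a quick way to see whether anything is listening on the server's port is to try opening a TCP connection to it. The sketch below does that in Java. The host and port are assumptions: "localhost" and 8778 are used as defaults here, so substitute whatever host and port your screen-scraper server is actually configured to use. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Illustration only: checks whether a TCP port accepts connections.
// It says nothing about whether the listener is actually screen-scraper.
class ServerCheckSketch {

    // Returns true if a TCP connection to host:port succeeds within
    // timeoutMillis, false if it is refused, times out, or otherwise fails.
    static boolean isPortOpen(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed defaults; pass your own host and port as arguments.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 8778;

        if (isPortOpen(host, port, 2000)) {
            System.out.println("Something is listening on " + host + ":" + port);
        } else {
            System.out.println("Could not connect to " + host + ":" + port
                    + " (server not running, wrong port, or a firewall in the way)");
        }
    }
}
```

If the connection succeeds from the machine running screen-scraper but fails from the ASP machine, a firewall or the allowed-hosts setting is the more likely culprit.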

Assuming that test worked, fire up your favorite ASP editor and open the "shopping.asp" file in it. The file is pretty heavily commented, so hopefully it makes sense what's going on. If not, try reviewing our COM documentation, posting to our forum, or sending us a support request.

When you invoke screen-scraper as a server it creates log files corresponding to your scraping session in its "log" folder. Find your "Shopping Site" log file in that folder and look through it. It should look similar to what you see when you run scraping sessions in the workbench.
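If the log is long, a small program can pull out the lines worth reading first. The following Java sketch scans every file in a folder whose name starts with the session name and collects the lines that mention "error". It rests on a few assumptions: that the log file names begin with "Shopping Site", that error lines contain the word "error", and that the folder path you pass in is your screen-scraper "log" folder. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Illustration only: a plain text search over log files, nothing
// specific to screen-scraper's log format.
class LogScanSketch {

    // Returns "file name: line" for every line containing "error"
    // (case-insensitive) in files whose names start with sessionName.
    // sessionName is used in a glob, so it should not contain * ? [ or {.
    static List<String> findErrorLines(Path logDir, String sessionName) throws IOException {
        List<String> matches = new ArrayList<>();
        try (DirectoryStream<Path> logs = Files.newDirectoryStream(logDir, sessionName + "*")) {
            for (Path log : logs) {
                // ISO-8859-1 decodes any byte sequence without throwing,
                // which is enough for spotting an ASCII keyword.
                for (String line : Files.readAllLines(log, StandardCharsets.ISO_8859_1)) {
                    if (line.toLowerCase().contains("error")) {
                        matches.add(log.getFileName() + ": " + line);
                    }
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default: a "log" folder relative to the current directory.
        Path logDir = Paths.get(args.length > 0 ? args[0] : "log");
        for (String match : findErrorLines(logDir, "Shopping Site")) {
            System.out.println(match);
        }
    }
}
```

An empty result only means no line matched the keyword; it is still worth reading the log itself if the scrape did not behave as expected.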

T