Tutorials
Before diving in to screen-scraper we highly recommend that you take some time to go through our tutorials. Each tutorial should take around 30 minutes. The current tutorials cover all of the basics of using screen-scraper, and should be adequate to get you going on most projects. Along with these tutorials you'll probably find it helpful to look through our documentation and API.
Tutorial 1: Hello World This first tutorial will familiarize you with the basics of using screen-scraper and the general approach we recommend in setting up sites to be scraped.
Tutorial 2: Scraping an E-commerce Site This tutorial covers scraping search results that span multiple pages, using extractor patterns, and logging in to a web site.
Tutorial 3: Extending Hello World The third tutorial builds off of the first, and covers topics such as richer scripting and interacting with screen-scraper from languages such as Active Server Pages, PHP, and Java.
The rest of the tutorials build off of Tutorial 2, and can be done in any order. They're intended to give examples of more specific tasks you might want to accomplish with screen-scraper. Feel free to read through them and try any that best fit your situation.
Tutorial 4: Scraping an E-commerce Site from External Programs Here we extend Tutorial 3 with specific examples of invoking screen-scraper from Java, Active Server Pages, PHP, and .NET.
Tutorial 5: Saving Scraped Data to a Database This tutorial illustrates how to take the data we scraped from our e-commerce site and insert it into a database.
Tutorial 6: Generating an RSS/Atom Feed from a Product Search Here we go over creating an XML feed based on a search on our shopping site to demonstrate screen-scraper's RSS/Atom capabilities.
Tutorial 7: Scraping a Site Multiple Times Based on Search Terms A common scenario in screen-scraping is submitting a form multiple times and extracting the search results. A typical example is a "store locator" service where you submit many zip codes, then extract the various locations corresponding to those zip codes. This tutorial walks you through how to use screen-scraper to tackle such a task.
If you'd like to print the tutorials for easier reading, you can use the "Printer-friendly version" link at the bottom of any given page, or click here to get such a version for all of the tutorials.
Hello World!
| Click here for a video version of this tutorial (this will open a new window which may take a moment to load) |
This tutorial will walk you step-by-step through the process generally used to scrape information from web pages using screen-scraper. It should take you about 20 to 30 minutes to complete, and will familiarize you with the basic principles you'll need to scrape information from web sites. To get the most from this tutorial you should have at least a basic knowledge of HTML and HTTP (e.g., if you don't know the difference between a GET and POST request you ought to read through one of the HTTP articles below). This tutorial also assumes that you've successfully downloaded and installed screen-scraper.
If you don't have a lot of experience working with web technologies, or if you'd just like a refresher, you might find these sites helpful:
This is intended to be a very basic tutorial, and, as such, we'll be extracting the words "Hello World" from a web page and writing them to a file. While this is a simple example of pulling a single snippet of text off of a page, you would use a very similar approach for something like a stock quote or product price.
We'll try to keep the pace of the tutorial such that (hopefully) you won't get bored or frustrated.
If you'd like to take a peek at the final product you'll be creating, you can download and import the scraping session below. If you want to learn to use screen-scraper you're probably better off not importing the scraping session, and instead following along closely with the tutorial. If, however, you're just trying to get a feel for what it's like to use screen-scraper, it might be helpful to import the scraping session.
| Attachment | Size |
|---|---|
| Hello World (Scraping Session).sss | 2.27 KB |
Screen-Scraping Overview
In many ways working with screen-scraper is like working with a database, such as MySQL or SQL Server. With databases, you'll generally use an interface (often a graphical interface) to create objects such as tables, columns, and indexes. Once you've set up the database you'll often write programming code to populate it with data as well as to pull information from it. Likewise with screen-scraper you'll use its graphical user interface to create objects needed to extract information from web sites. Once you've set up these objects you'll write programming code to interact with screen-scraper and make use of the data it extracts.
Extracting information from web sites using screen-scraper typically involves four main steps:
1. Use the proxy server to determine the exact files that need to be requested in order to get the information you're after.
2. Create a scraping session with scrapeable files that define the sequence of pages screen-scraper will request.
3. Generate extractor patterns to define the exact information you need screen-scraper to grab from each page.
4. Write small scripts or programming code to invoke screen-scraper and/or work with the data it extracts. If you don't do much programming, don't worry. Generally the scripts you'll need to write to work with screen-scraper are small and simple, and you can often just modify the example scripts we provide.
We'll now walk through each of these steps in detail.
Proxy Server Setup
An HTTP proxy server is basically just a program that sits in between a web browser and a web server, passing bits between each. screen-scraper contains a proxy server that allows you to view all requests that your web browser sends, and the corresponding responses that web servers send in return. You can think of it as a program that simply records your browser session as you click links, submit forms, etc. The proxy server records all of the pages requested by your browser as you surf so that they can be easily scraped by screen-scraper at a later point.

OK, enough talk; it's time to fire up screen-scraper. If you're running Windows this is done by selecting the appropriate link from the "Start" menu. On Unix/Linux or Mac OS X use the "screen-scraper" link that was created when you installed screen-scraper.
Once screen-scraper has fully loaded you'll see a panel on the left with a folder which will contain the objects we'll be creating. Right now we need to set up screen-scraper's proxy server.
In screen-scraper you'll generally use a proxy session for each web site you'd like to extract information from. A proxy session holds all of the HTTP requests and responses recorded from your browser for the period of time you run it. Create a proxy session now by clicking the "New Proxy Session" button (looks like a globe) or by selecting "New Proxy Session" from the "File" menu. screen-scraper should now look like this:



Recording Pages with the Proxy Server
Return now to your web browser and go to the following URL:
http://www.screen-scraper.com/tutorial/basic_form.php
If you take a look at screen-scraper you'll notice that it recorded this page in the "HTTP Transactions" table. If you click on the first row in the table information related to your browser's request and response will appear in the lower pane:

If you didn't see your page show up in the "HTTP Transactions" table, or if your browser seems to have trouble, take a look at this FAQ for help.
The lower pane shows the details of the HTTP request your browser made--the request line, any HTTP headers (including cookies), as well as POST data (if any was sent). You can view the corresponding response from the server by clicking on the "Response" tab. Don't worry if a lot of what you're seeing doesn't make much sense; for the most part screen-scraper takes care of these kinds of details for you (such as keeping track of cookies).
At this point, in your web browser, type "Hello world!" (without the quotes) into the form text box and click the "Submit" button. This simply submits the form using the GET method to this same page, and displays what you typed in. We now have all of the pages we need recorded, so click on the "General" tab in screen-scraper then click on the "Stop Proxy Server" button. Now might also be a good time to adjust your web browser so that it no longer uses screen-scraper as a proxy server.
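To make the GET submission a little more concrete, here is a minimal Java sketch of how a form value typically ends up in the query string of the requested URL. The parameter name text_string is taken from the log shown later in this tutorial; the class and method names are hypothetical and are not part of screen-scraper, and the sketch assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of screen-scraper.
class GetRequestSketch {

    // Builds a GET URL by appending a single form-encoded parameter to a base URL.
    static String buildGetUrl(String baseUrl, String name, String value) {
        return baseUrl + "?"
            + URLEncoder.encode(name, StandardCharsets.UTF_8) + "="
            + URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The value typed into the form in this tutorial.
        System.out.println(buildGetUrl(
            "http://www.screen-scraper.com/tutorial/basic_form.php",
            "text_string", "Hello world!"));
    }
}
```

Form encoding turns the space into a "+" and percent-encodes the exclamation mark, which is why the recorded URL begins with text_string=Hello+ rather than containing the text exactly as you typed it.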
Generating a Scrapeable File
At this point we're ready to start creating the objects that screen-scraper will use to extract data from the page. We start by creating a scraping session. A scraping session is simply a container for all of the files and other objects that will allow us to extract data from a given web site. Either click the "New Scraping Session" button (looks like a gear) or click on the "File" menu, then select "New Scraping Session". After the scraping session appears rename it to "Hello World" (note that if you imported the scraping session at the beginning of the tutorial you'll want to name it something else--perhaps "My Hello World"). Your window should now look like this:

Now return to our "Hello World" proxy session by clicking on it in the tree on the left (the one with the globe by it), then click on the "Progress" tab. Click on the second or last row in the "HTTP Transactions" table. In the lower pane make sure "Hello World" is selected from the drop-down list labeled "Generate scrapeable file in:", then click the "Go" button. A scrapeable file is a web page that contains information we're interested in extracting. First off, let's rename our scrapeable file "Form submission". Your screen should now look like this:

Just to make sure things are good so far let's run a quick test. Run the "Hello World" scraping session by clicking on it in the tree on the left, then clicking the "Run Scraping Session" button, which will start the scraping session and transition you to the "Log" tab. It should just take a moment to run, after which the log should show the following:
Starting scraper.
Running scraping session: Hello World
Processing scripts before scraping session begins.
Scraping file: "Form Submission"
Form Submission: Preliminary URL: http://www.screen-scraper.com/tutorial/basic_form.php
Form Submission: Using strict mode.
Form Submission: Resolved URL: http://www.screen-scraper.com/tutorial/basic_form.php?text_string=Hello+... Submission: Sending request.
Processing scripts after scraping session has ended.
Scraping session "Hello World" finished.
The log is an invaluable tool for debugging scraping sessions, and you'll want to use it often. In this case it shows that screen-scraper requested the only scrapeable file in our scraping session ("Form submission"). You can view the text of the file that was scraped by clicking on "Form submission" in the tree on the left, then clicking the "Last Response" tab. Click the "Display Response in Browser" button to ensure that the page looks like the one in your browser (it may not look exactly like it, but should resemble it closely). It's often helpful to view the last response for a scrapeable file after running a scraping session so that you can ensure that screen-scraper requested the right page.
QUICK TIP
A good principle of software design is to run code often as you make changes. Likewise, with screen-scraper it is a good idea to run your scraping session frequently and watch the log and last responses to ensure that things are working as you intend them to.
Now would be a good time to save your work. Click the "Save" button (looks like a disk) or select the "Save" option from the "File" menu.
Generating an Extractor Pattern
This is probably the trickiest part of the tutorial, so if you've been skimming up to this point you'll probably want to read this page a little more carefully. An extractor pattern is a block of text (usually HTML) that contains special tokens that will match pieces of data you're interested in extracting. These tokens are text labels surrounded by the delimiters ~@ and @~.
You can think of an extractor pattern like a stencil. A stencil is an image in cut-out form, often made of thin cardboard. As you place a stencil over a piece of paper, apply paint to it, then remove the stencil, the paint remains only where there were holes in the stencil. Analogously, you can think of placing an extractor pattern over the HTML of a web page where the tokens correspond to the holes where the paint would pass through. After an extractor pattern is applied it reveals the portions of the web page you'd like to extract.
Take a look at the HTML from the page we scraped by clicking on the "Form submission" scrapeable file, then on the "Last Response" tab. If you click the "Render HTML" button you should see a screen resembling the page you saw in your browser. Consider this snippet of HTML from the page:
As we're interested in extracting the string "Hello world!" our extractor pattern would look like this:
>You typed: ~@FORM_SUBMITTED_TEXT@~<
The string "~@FORM_SUBMITTED_TEXT@~" is the token that corresponds to the data we're interested in, and, after this extractor pattern is applied, would hold the string "Hello world!". Returning to our stencil analogy, the "~@FORM_SUBMITTED_TEXT@~" token is analogous to the hole in the stencil where the paint would pass through. In a bit we'll look at how we might make use of the data extracted by that token.
You can add as many extractor patterns as you'd like to a given scrapeable file, and screen-scraper will invoke each of them in sequence after it's requested the web page.
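If it helps to see the stencil idea in code, the following Java sketch treats an extractor pattern as literal text with "holes" and turns it into a regular expression. This is only an illustration of the concept under simplifying assumptions; it is not how screen-scraper itself is implemented. The class name and the sample HTML are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the stencil analogy; not screen-scraper's own code.
class ExtractorPatternSketch {

    // Matches tokens of the form ~@NAME@~ inside the pattern text.
    private static final Pattern TOKEN = Pattern.compile("~@(\\w+)@~");

    // Applies the pattern text to the HTML once and returns token name -> extracted text.
    static Map<String, String> apply(String patternText, String html) {
        List<String> names = new ArrayList<>();
        StringBuilder regex = new StringBuilder();
        Matcher token = TOKEN.matcher(patternText);
        int last = 0;
        while (token.find()) {
            // Literal text around a token must match exactly (the solid part of the stencil).
            regex.append(Pattern.quote(patternText.substring(last, token.start())));
            // The token itself is the hole: capture as little text as possible.
            regex.append("(.*?)");
            names.add(token.group(1));
            last = token.end();
        }
        regex.append(Pattern.quote(patternText.substring(last)));

        Map<String, String> extracted = new LinkedHashMap<>();
        Matcher match = Pattern.compile(regex.toString(), Pattern.DOTALL).matcher(html);
        if (match.find()) {
            for (int i = 0; i < names.size(); i++) {
                extracted.put(names.get(i), match.group(i + 1));
            }
        }
        return extracted;
    }

    public static void main(String[] args) {
        // Made-up HTML standing in for the page's "Last Response".
        String html = "<html><body><p>You typed: Hello world!</p></body></html>";
        System.out.println(apply(">You typed: ~@FORM_SUBMITTED_TEXT@~<", html));
    }
}
```

The surrounding characters ">You typed: " and "<" act as the solid part of the stencil, and whatever falls between them is what the token captures.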
We'll now create an extractor pattern that will extract the "Hello world!" text you typed in to the HTML form. Under the "Form submission" scrapeable file, click on the "Extractor Patterns" tab, then click on the "Add Extractor Pattern" button. Give your extractor pattern the identifier "Form data", and in the "Pattern text" box enter the extractor pattern shown above. Your screen should now look like this:

Go ahead and give the extractor pattern a try by clicking on the "Apply Pattern to Last Scraped Data" button. The following window will appear, displaying the text that our extractor pattern extracted from the page:

Looks like our extractor pattern has matched the snippet of text we were after. The "Apply Pattern to Last Scraped Data" button is another invaluable tool you'll use often to make sure you're getting the right data. It simply takes the HTML from the "Last Response" tab and applies the extractor pattern to it.
QUICK TIP
When creating extractor patterns, always be sure to use the HTML from screen-scraper's "Last Response" tab rather than the HTML source shown in your web browser. Before screen-scraper applies an extractor pattern to an HTML page, it "tidies" up the HTML to facilitate extraction. This will generally cause the HTML to be slightly different from the HTML you'd get directly from your web browser.
Before we continue we need to take a look at one more thing. Extractor pattern tokens have properties, one of which we'll need to modify. To modify the properties for our "~@FORM_SUBMITTED_TEXT@~" extractor pattern token double-click it (that is, double-click on the text FORM_SUBMITTED_TEXT found between the ~@ and @~ delimiters in the "Pattern text" box) or select it, right-click it (or Control-click in Mac OS X), then select "Edit token". You'll see the following box:

screen-scraper makes use of session variables which allow you to save and persist objects throughout the life of a scraping session. This means that screen-scraper will save the extracted data in memory so that it can be used later in scripts and such. In this case we'd like to save the text that our "~@FORM_SUBMITTED_TEXT@~" extractor pattern token extracts. Indicate this now by clicking the "Save in session variable?" checkbox, then closing the "Edit Token" window. In other words, when screen-scraper runs this scraping session and extracts the text for this extractor pattern it will save that text (e.g., "Hello world!") in a session variable so that we can do something with it later. Next we'll make use of the data we extract...
Overview of Writing a Simple Script
We'll now do something with the data we've extracted by writing a simple script. A screen-scraper script is a block of code that will get executed when a certain event occurs. For example, you might have a script that gets invoked at the beginning of a scraping session that initializes variables. Another script might get invoked each time a row in a list of search results is extracted from a site so that the information in that search result can be inserted into a database. You can think of this as being analogous to "event handling" mechanisms in other programming languages. For example, in an HTML page you might associate a JavaScript method call with the "onLoad" event for the body tag. In Visual Basic you'll often create a sub-routine that gets invoked when a button is clicked. In the same way, screen-scraper scripts will get invoked when certain events occur related to requesting web pages and extracting data from them.
If you don't have much experience programming, don't worry. Generally scripts written in screen-scraper are short and simple. The script we'll be creating will simply write out the text we extract to a file.
In preparation for writing our script click the "New Script" button (looks like a pencil and paper) or select "New Script" from the "File" menu, and give it the identifier "Write extracted data to a file". Your screen should now look like this:

Writing a Simple Script in Interpreted Java
screen-scraper uses the BeanShell library to allow for scripting in Java. If you've done some programming in C or JavaScript you'll probably find BeanShell's syntax familiar.
Let's get right to it. Copy and paste the following text into the box labeled "Script Text":
// Output a message to the log so we know that we'll be writing the text out to a file.
session.log( "Writing data to a file." );
// Create a FileWriter object that we'll use to write out the text.
out = new FileWriter( "form_submitted_text.txt" );
// Write out the text.
out.write( session.getVariable( "FORM_SUBMITTED_TEXT" ) );
// Close the file.
out.close();
Hopefully it's obvious what's going on, based on the comments in the script. We simply create an object used to write out the text (a "FileWriter"), write it out, then close up the file. Note the session.getVariable( "FORM_SUBMITTED_TEXT" ) method call, which retrieves the value of the "FORM_SUBMITTED_TEXT" session variable. This method call is able to get the value because we indicated earlier that the value for the "FORM_SUBMITTED_TEXT" token was to be saved in a session variable (i.e., when we checked the "Save in session variable?" box).
If you haven't done much programming, this is where things might seem a little confusing. If so, you may consider trying a basic tutorial on Java or JavaScript, which will hopefully introduce you to the basics of programming. You'll especially want to get an introduction to object-oriented programming.
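For readers more at home in ordinary compiled Java, the script's logic could be sketched as a standalone program like the one below. Here a plain Map stands in for screen-scraper's session variables, which is purely an assumption for illustration; the class and method names are hypothetical, and writing an empty string for a missing variable is a choice made by this sketch rather than documented screen-scraper behavior.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical standalone version of the script; a Map plays the role of the session.
class WriteExtractedDataSketch {

    // Writes the named variable to the target file, closing the file even if writing fails.
    static void writeVariable(Map<String, Object> variables, String name, Path target) {
        Object value = variables.get(name);
        try (FileWriter out = new FileWriter(target.toFile())) {
            // Assumption of this sketch: a missing variable is written as an empty string.
            out.write(value == null ? "" : value.toString());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the whole file back so the result can be checked.
    static String readAll(Path file) {
        try {
            return new String(Files.readAllBytes(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> variables = new HashMap<>();
        variables.put("FORM_SUBMITTED_TEXT", "Hello world!");
        Path target = Paths.get("form_submitted_text.txt");
        writeVariable(variables, "FORM_SUBMITTED_TEXT", target);
        System.out.println(readAll(target));
    }
}
```

The try-with-resources block does the same job as the explicit out.close() call in the script, with the added benefit that the file is closed even when an error occurs part way through.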
Invoking a Script
A script is executed in screen-scraper by associating it with some event, such as before or after an extractor pattern is applied to the text of a web page.
The script we've just written needs to be executed after screen-scraper has requested the web page and extracted the data we need from it.
At this point return to the extractor pattern we just created by clicking on the "Form submission" scrapeable file in the tree on the left, then on the "Extractor Patterns" tab. In the lower part of your screen click on the "Add Script" button. Select "Write extracted data to a file" in the column on the left, and select "After pattern is applied" in the third column. Your screen should now look like this:

Our "Write extracted data to a file" script will be invoked after screen-scraper has applied the "Form data" extractor pattern to the web page. That is, once the extractor pattern has been applied as many times as it needs to (which is only once, in this case), it will invoke the script.
The curious might be wondering a bit more about the difference between "After pattern is applied" and "After each pattern application". Consider a web page that contains a table with 10 rows. We might create an extractor pattern that matches a single row in the table. The extractor pattern would match 10 times--one for each row in the table. If we associated a script with the extractor pattern and told it to run "After pattern is applied", the script would only get executed one time (i.e., after the pattern has matched as many times as it needs to). If we had indicated that the script should run "After each pattern application", it would get executed 10 times--one time for each match the pattern makes. In the current case, the pattern only matches one time, so it doesn't make a difference whether we indicate "After pattern is applied" or "After each pattern application".
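The difference between the two events can be sketched in plain Java as two callbacks, one fired inside the matching loop and one fired after it. This is a conceptual illustration only; the class, method, and parameter names are hypothetical, and the ten-row table is made up to mirror the example above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the two script events; not screen-scraper's own code.
class PatternEventSketch {

    // Applies the pattern repeatedly, running one callback per match and one after all matches.
    static int applyPattern(Pattern pattern, String text,
                            Runnable afterEachPatternApplication,
                            Runnable afterPatternIsApplied) {
        Matcher matcher = pattern.matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
            afterEachPatternApplication.run(); // once per matched row
        }
        afterPatternIsApplied.run();           // once, after the last match
        return matches;
    }

    public static void main(String[] args) {
        // Build a made-up table with 10 rows.
        StringBuilder table = new StringBuilder("<table>");
        for (int i = 1; i <= 10; i++) {
            table.append("<tr><td>Row ").append(i).append("</td></tr>");
        }
        table.append("</table>");

        int[] counts = new int[2];
        applyPattern(Pattern.compile("<tr><td>(.*?)</td></tr>"), table.toString(),
            () -> counts[0]++, () -> counts[1]++);
        System.out.println("After each pattern application ran " + counts[0] + " time(s)");
        System.out.println("After pattern is applied ran " + counts[1] + " time(s)");
    }
}
```

A script that inserts each row into a database would belong in the per-match callback, while a script that writes a summary or closes a file would belong in the one that runs after all matches.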
Running the Completed Scraping Session
Finally, we have everything in place to run our scraping session. Click on the "Hello World" scraping session in the tree on the left, then click on the "Log" tab. If there is existing text in the "Log" get rid of it by clicking the "Clear Log" button. Now click on the "Run Scraping Session" button. After it finishes running, take a look at the contents of the "form_submitted_text.txt" file, which will be located in the screen-scraper installation directory (e.g., C:\Program Files\screen-scraper professional edition\).
Where to Go From Here
Congratulations! You now have the basic core knowledge you need to scrape screens with screen-scraper. While this was a very simple example of a scraping session, we covered most of the main principles you need to start your own project. If you have the time, we'd highly recommend continuing on to Tutorial 2: Scraping an E-commerce Site, as well as Tutorial 3: Extending Hello World. Otherwise, you may want to consider reading through some of the existing documentation as you work on your own project.
Scraping an E-commerce Site
| Click here for a video version of this tutorial (this will open a new window which may take a moment to load) |
In this tutorial we'll be scraping search results from a basic e-commerce site. We'll also demonstrate logging in to a web site before scraping data. Data you'll be scraping from web sites is often in the form of "records", or data that might fit into a spreadsheet in rows and columns. It's also often necessary to log in to a web site before you can scrape the data you're interested in. Hopefully getting some practice with these situations in this tutorial will let you apply the experience to other similar situations. For example, you would likely apply the same approach we'll go over here to extracting data such as online directories, real estate listings, or product descriptions.
If you haven't already gone through Tutorial 1 we'd recommend that you do so before continuing with this one. This tutorial, however, doesn't depend on scraping sessions or other objects you might have created in the previous tutorial.
The site we'll be scraping information from is found here: http://www.screen-scraper.com/shop/. Feel free to click around and explore for a minute.
If you're interested in seeing the final scraping session you'll be creating, along with the output file that will get generated, you'll find them in the table below. You can import the scraping session by following the instructions found here. If you want to learn to use screen-scraper you're probably better off not importing the scraping session, and instead following along closely with the tutorial. If, however, you're just trying to get a feel for what it's like to use screen-scraper, it might be helpful to import the scraping session.
| Attachment | Size |
|---|---|
| dvds.txt | 897 bytes |
| Shopping Site (Scraping Session).sss | 10.2 KB |
Screen-Scraping Overview Review
As you'll remember from the previous tutorial, extracting information from web sites using screen-scraper typically involves four main steps:
1. Use the proxy server to determine the exact files that need to be requested in order to get the information you're after.
2. Create a scraping session with scrapeable files that define the sequence of pages screen-scraper will request.
3. Generate extractor patterns to define the exact information you need screen-scraper to grab from each page.
4. Write small scripts or programming code to invoke screen-scraper and/or work with the data it extracts.
Recording Search Results
As in the first tutorial, we'll be recording a browser session using the proxy server. Remember that a proxy session holds all of the HTTP requests and responses from your browser for the period of time you run it.
Create a new proxy session now either by clicking the "New Proxy Session" button (looks like a globe) or by selecting "New Proxy Session" from the "File" menu. When the proxy session appears type in "Shopping Site" in the "Name" field. In your web browser go to this URL: http://www.screen-scraper.com/shop/ (remember that you may want to use one browser with the proxy server and one to view the tutorials).
At this point start up the proxy server by clicking the "Start Proxy Server" button, then configure your web browser as you did in the first tutorial (if you need help try this page). In screen-scraper, ensure that the "Don't log binary files" checkbox is checked. Now click on the "Progress" tab so that you can see the pages appear as they get recorded.
We'll be doing a search in the shopping web site for the term "dvd" in the product catalog. Do this by typing "dvd" (without the quotes) into the search box located in the upper-right corner of the home page, then click the "Search" button. You'll see screen-scraper work for a bit, then, once it finishes, you should just see one row in the "HTTP Transactions" table. We'll want to traverse all of the search results, so, in your web browser, click the "Next >>" link. screen-scraper will work again for a bit while it records the next search results page. Later on we'll be scraping the details pages, so let's record one of those now. Click on the "Speed" link to view details on this DVD. These are the only pages we're interested in at this point, so go ahead and stop the proxy session by clicking the "Stop Proxy Server" button on the "General" tab. You'll also want to re-configure your web browser so that it's no longer using screen-scraper as a proxy server.
Creating the Scraping Session
Create a scraping session either by clicking the "New Scraping Session" button (looks like a gear) or by selecting "New Scraping Session" from the "File" menu. In the "Name" field enter "Shopping Site" (if you already downloaded and imported the scraping session at the beginning of this tutorial you'll want to name your scraping session something else--perhaps "My Shopping Site"). This is the scraping session that will hold all of the files we'll be extracting data from. Remember that a scraping session is simply a container for all of the files and other objects that will allow us to extract data from a given web site.
We'll now be adding scrapeable files to our scraping session. You'll remember from the first tutorial that a scrapeable file represents a web page you'd like screen-scraper to request.
Add the first scrapeable file to the scraping session by clicking the "Shopping Site" proxy session in the tree on the left (the first of the two "Shopping Site" nodes), then on the "Progress" tab. Find the row in the "HTTP Transactions" table with the following URL (probably the second in the table):
http://www.screen-scraper.com/shop/index.php?main_page=advanced_search_result&keyword=dvd&sort=2a&page=2
After the scrapeable file appears under the scraping session rename it to "Search results". Next, click on the "Parameters" tab. Remember that when we generate a scrapeable file in this way screen-scraper pulls out the parameters from the URL and puts them under the "Parameters" tab for us. Because these are "GET" parameters (as opposed to "POST" parameters), when the scrapeable file is invoked by screen-scraper in a running scraping session, the parameters will get appended again to the URL. Let's take a closer look at each of the parameters that were embedded in the URL:
The only two that we're likely interested in are "keyword" and "page". We can guess that "keyword" refers to the text we typed into the search box initially. The "page" parameter refers to what page we're on in the search results. We can guess that if we were to replace the "2" in the "page" parameter of the URL with a "1" it would bring up the first page in the search results. Try this by bringing up the following page in your web browser:
http://www.screen-scraper.com/shop/index.php?main_page=advanced_search_result&keyword=dvd&sort=2a&page=1
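As a rough picture of what it means to pull the GET parameters out of a URL, the following Java sketch splits the query string of the recorded search URL into name/value pairs. It only illustrates the idea; it is not screen-scraper's own code, the class and method names are hypothetical, and it assumes Java 10 or later.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of separating GET parameters from a URL.
class QueryParameterSketch {

    // Returns the query-string parameters of the URL in the order they appear.
    static Map<String, String> parse(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        String query = URI.create(url).getRawQuery();
        if (query == null || query.isEmpty()) {
            return params;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(URLDecoder.decode(name, StandardCharsets.UTF_8),
                       URLDecoder.decode(value, StandardCharsets.UTF_8));
        }
        return params;
    }

    public static void main(String[] args) {
        String url = "http://www.screen-scraper.com/shop/index.php"
            + "?main_page=advanced_search_result&keyword=dvd&sort=2a&page=2";
        // Print one parameter per line, much like rows under the "Parameters" tab.
        parse(url).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Changing the value stored under "page" and rebuilding the URL is the programmatic equivalent of editing the "2" by hand in your browser's address bar.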
Creating the Script to Initialize the Scraping Session
We're now going to create a small script to initialize our scraping session. It's a common practice to run a script at the very beginning of a scraping session that can initialize variables and such. That's what we'll be doing here.
Generate the script either by clicking the "New Script" button (looks like a pencil and paper) or by selecting "New Script" from the "File" menu. In the "Name" field type "Shopping Site--initialize session". You'll remember from the first tutorial that screen-scraper scripts get invoked when certain events occur. We'll be invoking this script before the scraping session begins.
Ensure that "Interpreted Java" is selected in the "Language" drop-down, then copy and paste the following text into the "Script Text" box:
// Set the session variables.
session.setVariable( "SEARCH", "dvd" );
session.setVariable( "PAGE", "1" );
We set two session variables on our current scraping session. The one item to note is the "PAGE" session variable. We start at 1 so that the first search results page will get requested first.
Before trying out this script let's modify the parameters for our scrapeable file so that they make use of the session variables. Click on the "Search results" scrapeable file, then on the "Parameters" tab. Change the value of the "keyword" parameter from "dvd" to "~#SEARCH#~" (without the quotes), and change the value of the "page" parameter from "2" to "~#PAGE#~" (again, omit the quotes).
The ~#SEARCH#~ and ~#PAGE#~ tokens will be replaced at runtime with the values of the corresponding session variables. As such, the first URL will be as follows:
http://www.screen-scraper.com/shop/index.php?main_page=advanced_search_result&keyword=dvd&sort=2a&page=1
Note that we could achieve the same effect by deleting all of the parameters from the "Parameters" tab, and replacing our URL with this:
http://www.screen-scraper.com/shop/index.php?main_page=advanced_search_result&keyword=~#SEARCH#~&sort=2a&page=~#PAGE#~
We'll now need to associate our script with our scraping session so that it gets invoked before the scraping session begins. To do that, click on the scraping session in the tree on the left, then on the "Scripts" tab. Click the "Add Script" button to add a script. In the "Script Name" column select "Shopping Site--initialize session". The "When to Run" column should show "Before scraping session begins", and the "Enabled" checkbox should be checked. This will cause our script to get executed at the very beginning of the scraping session so that the two session variables can get set.
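The replacement of ~#SEARCH#~ and ~#PAGE#~ with session variable values, described earlier on this page, can be pictured as a simple find-and-replace over the URL. The Java sketch below is a conceptual illustration rather than screen-scraper's implementation: a Map stands in for the session variables, the class and method names are hypothetical, and leaving an unknown token untouched is an assumption of this sketch. It assumes Java 9 or later.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of resolving ~#NAME#~ tokens from session variables.
class SessionTokenSketch {

    // Matches tokens of the form ~#NAME#~.
    private static final Pattern TOKEN = Pattern.compile("~#(\\w+)#~");

    // Replaces each token with the value of the variable of the same name.
    static String resolve(String template, Map<String, String> variables) {
        Matcher matcher = TOKEN.matcher(template);
        StringBuilder resolved = new StringBuilder();
        while (matcher.find()) {
            String name = matcher.group(1);
            // Assumption of this sketch: tokens without a matching variable are left as-is.
            String value = variables.containsKey(name) ? variables.get(name) : matcher.group();
            matcher.appendReplacement(resolved, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(resolved);
        return resolved.toString();
    }

    public static void main(String[] args) {
        // The same values the initialization script sets.
        Map<String, String> variables = new LinkedHashMap<>();
        variables.put("SEARCH", "dvd");
        variables.put("PAGE", "1");

        String template = "http://www.screen-scraper.com/shop/index.php"
            + "?main_page=advanced_search_result&keyword=~#SEARCH#~&sort=2a&page=~#PAGE#~";
        System.out.println(resolve(template, variables));
    }
}
```

Because the URL is resolved each time the scrapeable file is requested, changing the "PAGE" variable between requests is enough to step through the search results one page at a time.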
All right, we're ready to try it all out. This scraping session will generate a larger log than the one we worked on earlier, so it may be a good idea to increase the number of lines screen-scraper will display in its log. To do that, click on the scraping session in the tree on the left, then on the "Log" tab. In the text box labeled "Show only the following number of lines" enter the number 1000.
Run the scraping session by selecting it in the tree on the left, then click the "Run Scraping Session" button, which will cause the "Log" tab to be activated, allowing you to watch the scraping session progress. You'll notice that the URL of the requested file is the one given above. You can also verify that the correct URL was requested by clicking on the "Search results" scrapeable file, then on the "Last Response" tab, then on the "Render HTML" or "Display Response in Browser" buttons. The page should resemble the one you saw in your web browser.
Remember that it's a good idea to run scraping sessions often as you make changes, and watch the log and last responses to ensure that things are working as you expect them to. You'll also want to save your work frequently. Do that now by hitting the "Save" button (the one with the disk icon).
Creating Extractor Patterns for Links
This particular part of the tutorial is one that covers important principles that often seem confusing to people at first. If you've been speeding through the tutorial up to this point, it would probably be a good idea to slow down a bit and read more carefully.
We're now going to create a couple of extractor patterns to extract information for the "Next" link and the product details links. Remember that an extractor pattern is a block of text (usually HTML) that contains special tokens that will match pieces of data you're interested in extracting.
When creating extractor patterns we recommend that you always use the HTML from the "Last Response" tab in screen-scraper. By default, after screen-scraper requests a page it "tidies" the HTML found in it, which makes it differ from the HTML that you would get by viewing the source in your web browser (and also makes it more consistent, facilitating extraction). Click on the "Search results" scrapeable file in the tree on the left, then on the "Last Response" tab. The text box contains HTML because we just ran the scraping session. Copy all of the HTML and paste it into a text editor, such as Notepad or TextMate.
If you click either the "Render HTML" or "Display Response in Browser" button in screen-scraper you'll see a page basically resembling the search results page in your web browser. We're going to extract a portion of each of the product details links so that we can subsequently request each details page and extract information from them. The first details link corresponds to the "A Bug's Life" DVD. Find that in the text editor you just pasted the HTML into (specifically search for the text "A Bug's Life"). Here is the block of HTML representing this product:
<tr class="productListing-odd">
<td align="center" class="productListing-data"> <a href="http://www.screen-scraper.com/shop/index.php?main_page=product_info&products_id=8"><img src="images/dvd/a_bugs_life.gif" border="0" alt="A Bug's Life" title=" A Bug's Life " width="100" height="80" /></a> </td>
<td class="productListing-data"> <a href="http://www.screen-scraper.com/shop/index.php?main_page=product_info&products_id=8">A Bug's Life</a> </td>
<td align="right" class="productListing-data"> $35.99 </td>
<td align="center" class="productListing-data"><a href="http://www.screen-scraper.com/shop/index.php?main_page=advanced_search_result&keyword=dvd&sort=2a&page=1&action=buy_now&products_id=8"><img src="includes/templates/template_default/buttons/english/button_buy_now.gif" border="0" alt="Buy Now" title=" Buy Now " width="60" height="30" /></a> </td>
</tr>
This may seem like a bit of a mess, but if we look closely we can pick out the details link:
<td class="productListing-data"> <a href="http://www.screen-scraper.com/shop/index.php?main_page=product_info&products_id=8">A Bug's Life</a> </td>
Breaking it down a bit more we get the URL:
http://www.screen-scraper.com/shop/index.php?main_page=product_info&products_id=8
By the way, you might notice that in the HTML on the "Last Response" tab the typical & symbols in the URL have been replaced by &amp;. Don't be alarmed, it's just part of the tidying process screen-scraper applies to the HTML.
Again, if we examine the parameters in the URL we can guess that the important one is "products_id", which likely identifies the product whose details we're interested in. We'll guess that the "products_id" is the only one we'll need to extract. This will give us enough information to request a details page.
At this point, click on the "Search results" scrapeable file in the tree on the left, then click on the "Extractor Patterns" tab. We'll create an extractor pattern to grab out the product IDs from each link. Here's the extractor pattern we'll use:
<td class="productListing-data"> <a href="http://www.screen-scraper.com/shop/index.php?main_page=product_info&products_id=~@PRODUCTID@~">~@PRODUCT_TITLE@~</a> </td>
Create the extractor pattern by clicking on the "Add Extractor Pattern" button, then copying and pasting the text above into the resulting box. Also, give the extractor pattern the name "Product details link". Remember that extractor pattern tokens (delineated by the ~@ @~ markers) indicate data points we're interested in extracting. In this case, we want to extract the ID of the product (embedded in the URL), and the title of the product.
Double-click the ~@PRODUCTID@~ token (or select the text between the ~@ @~ delimiters, right-click it and select "Edit token"), and, in the box that appears, check the "Save in session variable" checkbox. Under the "Regular Expression" section, select "URL GET parameter". You'll notice that when you do that the text [^&"]* shows up in the text box just above the drop-down list. This is the regular expression that we'll be using. You could also edit it manually, but generally won't need to.
Let's slow down at this point and go over what we just did to the ~@PRODUCTID@~ extractor pattern token. You might remember from the second tutorial that by checking the "Save in session variable" box we're telling screen-scraper to preserve the value for us so that we can use it at a later point. We'll get to that in a bit. This time we also selected a regular expression for it to use. In most cases you'll want to designate a regular expression for extractor pattern tokens. If you're not very familiar with regular expressions, don't worry. In the vast majority of cases you can simply use the regular expressions found in that drop-down list. Let's go over what effect designating a regular expression has. By indicating the "URL GET parameter" regular expression we're saying that we want that token to match any character except a double-quote or an ampersand (i.e., the " and & characters). You'll notice in our extractor pattern that a double-quote character just follows our ~@PRODUCTID@~ extractor pattern token. By using a regular expression we limit what the token will match so that we can ensure we get only what we want. You might think of it as putting a little fence around the token. We want it to match the characters at the position of the ~@PRODUCTID@~ extractor pattern token, up to (but not including) the double-quote character.
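To see the "fence" at work outside of screen-scraper, here is a minimal Java sketch using the standard java.util.regex classes. It compares an unrestricted match with the [^&"]* expression on a line containing two links. The TokenFenceDemo class, the firstGroup helper, the shortened relative URLs, and the second link's title are all made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of why a token's regular expression acts as a fence.
class TokenFenceDemo {
    // Two links on one line; the second title is a made-up placeholder.
    static final String HTML =
            "<a href=\"index.php?main_page=product_info&products_id=8\">A Bug's Life</a>"
            + " <a href=\"index.php?main_page=product_info&products_id=34\">Hypothetical Title</a>";

    // Returns the first captured group, or null when the pattern does not match.
    static String firstGroup(String regex, String html) {
        Matcher matcher = Pattern.compile(regex).matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Unfenced: the greedy .* runs on to the last double-quote in the text.
        System.out.println(firstGroup("products_id=(.*)\"", HTML));
        // Fenced: [^&"]* stops at the first double-quote or ampersand.
        System.out.println(firstGroup("products_id=([^&\"]*)\"", HTML));
    }
}
```

The first call captures far more than the product ID because nothing stops it at the closing quote of the first link; the second call captures only the ID.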
A line from that last paragraph is worth repeating. In most cases you'll want to designate a regular expression for extractor pattern tokens. Using regular expressions also makes extractor patterns more resilient to changes in the web site. That is, if the web site makes minor changes to its HTML (e.g., altering a font style or color), often if you've been using regular expressions your extractor patterns will still match. Also, by using regular expressions we can often decrease the amount of HTML we need to use in our extractor patterns. That is, by using regular expressions we indicate more precisely what the data will look like that our tokens will match. By doing this, we can often reduce the amount of HTML we include at the beginning and end of our extractor patterns. In general, if you can reduce the amount of HTML in your extractor patterns, and increase the number of regular expressions you use in tokens, your extractor patterns will be more resilient to changes that get made in the HTML of the pages.
Now close the "Edit Token" box, which saves our settings.
Now let's alter the settings for the ~@PRODUCT_TITLE@~ token. We're not interested in saving the value for this token in a session variable, but we include it since it will differ for each section of HTML we want to match. Double-click the ~@PRODUCT_TITLE@~ extractor pattern token to bring up the "Edit token" dialogue box. Under the "Regular expression" section, select "Non-HTML tags". Again, take a look at the characters on the left and right sides of our ~@PRODUCT_TITLE@~ extractor pattern token. By using this regular expression we tell it not to include any greater than (>) or less than (<) symbols. This way we create a boundary for the token so that we can ensure it matches only what we want it to.
Why even include an extractor pattern token for data we don't want to save? This is another important principle. By using extractor pattern tokens for data we don't necessarily want to save, we make the extractor pattern more resilient to changes in the HTML. By using these extra tokens we can "future proof" our extractor patterns against changes the site owners might make down the road. There are also often situations (such as the present one) where data points adjacent to data we want to extract will differ for each pattern match. Here we only want the product ID, but we also include the product title because of its proximity to the data we want to extract, and because its value will differ each time the extractor pattern matches.
If those last few paragraphs strike you as a little bit confusing, don't worry. As you get more experience using screen-scraper you'll see why they're important. For now just take our word for it that you'll generally want to use regular expressions with extractor pattern tokens, and that it's often a good idea to use extractor pattern tokens to match data points you don't necessarily want to save. As you get more experience it will become more apparent when to use extractor pattern tokens for data you don't want to save.
Let's give our new extractor pattern a try. Click the "Apply Pattern to Last Scraped Data" button. You should see a window come up that shows the extracted data.
Again, let's slow down a moment and review what this window contains. When an extractor pattern matches, it produces a DataSet. You can think of a DataSet like a spreadsheet--it contains rows, columns, and cells. Each row in a DataSet is called a DataRecord. Again, a DataRecord can be thought of as being analogous to a row in a spreadsheet. In this particular case our DataSet has three columns. Two of them should be familiar--they correspond to the PRODUCT_TITLE and PRODUCTID extractor pattern tokens. The "Sequence" column indicates the order in which each row was extracted. You'll notice that the sequence is zero-based, meaning the first DataRecord in the DataSet is referenced with an index of 0. You'll also notice that the DataSet has 10 records--one for each product found in the search results page. Later on when we start talking more about DataSets and DataRecords, just remember the spreadsheet analogy--a DataSet is like the entire spreadsheet, and a DataRecord is like a single row in the spreadsheet.
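If it helps to see the spreadsheet analogy in code, the following standalone Java sketch builds a list of maps that plays the role of a DataSet, with each map standing in for a DataRecord. It does not use screen-scraper's own DataSet or DataRecord classes. The DataSetSketch class is a hypothetical name, the regular expression is only a rough counterpart of our "Product details link" extractor pattern (we assume [^<>]* for "Non-HTML tags"), and the second product title is a placeholder.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in: a List plays the DataSet, each Map plays a DataRecord.
class DataSetSketch {
    // Rough regex counterpart of the "Product details link" extractor pattern.
    private static final Pattern LINK = Pattern.compile(
            "<a href=\"[^\"]*products_id=([^&\"]*)\">([^<>]*)</a>");

    static List<Map<String, String>> extract(String html) {
        List<Map<String, String>> dataSet = new ArrayList<>();
        Matcher matcher = LINK.matcher(html);
        while (matcher.find()) {
            Map<String, String> dataRecord = new LinkedHashMap<>();
            dataRecord.put("PRODUCTID", matcher.group(1));
            dataRecord.put("PRODUCT_TITLE", matcher.group(2));
            // The sequence is zero-based, like the "Sequence" column.
            dataRecord.put("Sequence", String.valueOf(dataSet.size()));
            dataSet.add(dataRecord);
        }
        return dataSet;
    }

    public static void main(String[] args) {
        String base = "http://www.screen-scraper.com/shop/index.php?main_page=product_info";
        String html =
                "<td class=\"productListing-data\"> <a href=\"" + base
                + "&products_id=8\">A Bug's Life</a> </td>"
                + "<td class=\"productListing-data\"> <a href=\"" + base
                + "&products_id=34\">Hypothetical Title</a> </td>";
        for (Map<String, String> dataRecord : extract(html)) {
            System.out.println(dataRecord);
        }
    }
}
```

Each printed map corresponds to one row of the window you just looked at, and the list as a whole corresponds to the entire DataSet.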
Another good habit to get into is applying your extractor patterns frequently to ensure they correctly match the text you want extracted. Go ahead and close the "DataSet" window now.
Now for our "Next" link. In the text editor where you pasted the full HTML from the web page, search for the text "Next". Around that area you'll find the HTML for the link:
<a href="http://www.screen-scraper.com/shop/index.php?main_page=advanced_search_result&keyword=dvd&sort=2a&page=2" title=" Next Page ">[Next >>]</a> </td>
Fortunately, we're already familiar with the URL, and we know that the only parameters we need to worry about are "keyword" and "page". Create a new extractor pattern, call it "Next link", and use the following to grab the values of those parameters out:
<a href="http://www.screen-scraper.com/shop/index.php?main_page=advanced_search_result&keyword=~@KEYWORD@~&sort=2a&page=~@PAGE@~" title=" Next Page ">[Next >>]</a> </td>
As with the previous extractor pattern, double-click the ~@PAGE@~ token, and, in the box that appears, check the "Save in session variable" checkbox. Under the "Regular Expression" section, select "Number" from the "Select" drop-down list.
Close the "Edit Token" box to save your settings. If you're interested, the "Number" regular expression [\d,]+ simply indicates that we only want the PAGE token to match numbers (the regular expression essentially says, "match anything that contains either numbers or commas, and has at least one of either of those types of characters in it.").
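For the curious, the behavior of the "Number" expression can be checked with a few lines of standard Java. The NumberTokenDemo class is a hypothetical name; only the expression itself comes from the paragraph above.

```java
// Hypothetical illustration of what the "Number" expression accepts.
class NumberTokenDemo {
    // The "Number" expression quoted above, written as a Java string literal.
    static final String NUMBER = "[\\d,]+";

    static boolean matchesNumber(String candidate) {
        // String.matches requires the whole candidate to match the expression.
        return candidate.matches(NUMBER);
    }

    public static void main(String[] args) {
        for (String candidate : new String[] {"2", "1,024", "2a", ""}) {
            System.out.println("\"" + candidate + "\" -> " + matchesNumber(candidate));
        }
    }
}
```

A page number such as "2" is accepted, as is a figure with thousands separators, while text containing letters and the empty string are rejected.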
Next, double-click the "KEYWORD" extractor pattern token to edit it. Under the "Regular Expression" section, select "URL GET parameter" from the "Select" drop-down list. This indicates that the "KEYWORD" extractor pattern should match only characters that would be found in a "GET" parameter of a URL. You'll notice that we didn't check the box to save the "KEYWORD" extractor pattern token in a session variable. We already have that value in a session variable, so we don't bother getting it again.
Try out the extractor pattern by clicking the "Apply Pattern to Last Scraped Data" button. Excellent! We have two matches--one for each "Next" link on the page (the top and bottom of the page).
Now would be a good time to save your work. Do that by selecting "Save" from the "File" menu or by clicking the floppy disk icon.
OK, let's try out the whole thing once more. Click on the "Shopping Site" scraping session in the tree on the left, then on the "Log" tab. Click the "Clear Log" button--we're going to run it again and we don't want to get confused by the log text from the last run. As before, click on the "Run Scraping Session" button to get it going. You'll see quite a bit more text in the log this time. Take a minute to look through it to ensure you understand what's going on.
Scraping Pages from Scripts
For each details link we're going to scrape the corresponding details page. This is a common scenario in screen-scraping--given a search results page, you need to extract details for each product, which means following each of the product details links. For each details page you'll likely want to extract out pieces of information corresponding to the products.
Let's start by creating a scrapeable file for the details page. We could create it from the proxy session, but it's pretty simple, so let's just create it from scratch. Click on the "Shopping Site" scraping session, the "General" tab, then click the "Add Scrapeable File" button. Give the scrapeable file the name "Details page", and the following URL:
http://www.screen-scraper.com/shop/index.php?main_page=product_info&products_id=~#PRODUCTID#~
In screen-scraper, links are generally followed by invoking a script after an extractor pattern finds matches. Let's go over this in more detail. First, create a new script and call it "Scrape details page". Enter the following code:
session.scrapeFile( "Details page" );
OK, this is where the logic may get a little tricky. For each product ID our "Product details link" extractor pattern extracts, we want to scrape the product details page using the PRODUCTID it extracts. Go to the "Product details link" extractor pattern by clicking the "Search results" scrapeable file, then the "Extractor Patterns" tab. Note the "Scripts" pane under the extractor pattern. Click the "Add Script" button. This will allow us to have a script execute as the pattern finds matches. Under the "Script Name" column, if it isn't already selected, select our "Scrape details page" script. Leave the "Sequence" as is, and, under the "When to Run" column, select "After each pattern application".
Let's walk through this a bit more slowly. After the search results page is requested the "Product details link" will be applied to the HTML in the page. Remember that this particular extractor pattern will match 10 times--once for each product details link. Each time it matches it will grab a different product ID and save the value of that product ID into the PRODUCTID session variable. The "Scrape details page" script will get invoked after each of these matches, and each time the PRODUCTID session variable will hold a different product ID. As such, when the "Details page" gets scraped the URL will get a different product. For example, the first time the extractor pattern matches the PRODUCTID session variable will hold "8", and the URL will be:
http://www.screen-scraper.com/shop/index.php?main_page=product_info&products_id=8
Likewise, when the PRODUCTID session variable holds "34", the URL will be:
http://www.screen-scraper.com/shop/index.php?main_page=product_info&products_id=34
Hopefully that's not too repetitive :) This is another area that people new to screen-scraper find confusing, so it's probably worth it to slow down a bit and ensure you understand what's going on.
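The sequence just described (match, save PRODUCTID, run the script, request the details page) can be sketched as a simple loop in plain Java. This is only a simulation under our own assumptions: the PerMatchScriptDemo class is a hypothetical name, a Map stands in for the session variables, and a list of URLs stands in for the actual HTTP requests.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of a script that runs "After each pattern application".
class PerMatchScriptDemo {
    static final String DETAILS_URL =
            "http://www.screen-scraper.com/shop/index.php?main_page=product_info"
            + "&products_id=~#PRODUCTID#~";

    static List<String> run(List<String> matchedProductIds) {
        Map<String, String> sessionVariables = new HashMap<>();
        List<String> requestedUrls = new ArrayList<>();
        for (String productId : matchedProductIds) {
            // "Save in session variable": each match overwrites PRODUCTID.
            sessionVariables.put("PRODUCTID", productId);
            // The script runs right after the match, so the details URL is
            // built from the value that was just saved.
            requestedUrls.add(DETAILS_URL.replace(
                    "~#PRODUCTID#~", sessionVariables.get("PRODUCTID")));
        }
        return requestedUrls;
    }

    public static void main(String[] args) {
        for (String url : run(Arrays.asList("8", "34"))) {
            System.out.println(url);
        }
    }
}
```

The key point is the ordering inside the loop: the variable is updated first and the request happens immediately afterwards, once per match.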
Now would be a good time to try out the whole scraping session again. Do that like you did before by clearing out the log for the scraping session, then clicking the "Run Scraping Session" button. You'll see each details page getting requested one-by-one. Note especially each URL, which will have a different product ID at the end of each. If you'd prefer not to wait for the entire session to run you can click the "Stop Scraping Session" button. As before, it would be a good idea to go through the log carefully to ensure that you understand what it's doing.
At this point we still need to deal with the "Next" page link. We already have an extractor pattern to grab out the page number of the next page. Let's create a script to scrape the search results page again for each "Next" link. Generate a new script and call it "Scrape search results". Enter the following:
if( dataSet.getNumDataRecords() > 0 ){
    session.scrapeFile( "Search results" );
}
You'll notice that the script makes use of a "dataSet" variable. When the script is invoked screen-scraper will automatically create a variable corresponding to the current DataSet. This variable allows you to get access to all of the information that was extracted by the current extractor pattern. You can read more about objects available in scripts and their scope in our documentation, at the Scripting in screen-scraper and API documentation sections.
In this particular case, the script first checks the number of records in the current DataSet. That is, it looks at the number of DataRecords (or rows) in the DataSet (or spreadsheet). This effectively just checks to see if any "Next" link was found in the page. If so, it tells screen-scraper to scrape the "Search results" scrapeable file.
After creating the script, return to the "Next link" extractor pattern, then click the "Add Script" button. Select the "Scrape search results" script. This time there's something slightly different we'll need to do under the "When to Run" column. First, click the "Apply Pattern to Last Scraped Data" button. You'll notice that the pattern matches twice. The problem is that we only want to follow one of the "Next" links (that is, we don't want to scrape the second page twice). This is easily dealt with by selecting "After pattern is applied" under the "When to run" column. In other words, the script will only get invoked once--after the extractor pattern has matched as many times as it can. Selecting "After pattern is applied" under the "When to run" column guarantees that the script will get invoked once and only once. Note, though, that because we're saving the value of the ~@PAGE@~ extractor pattern token in a session variable it will still hold the correct value when the page gets scraped. Because we indicate that the script is to be invoked "After pattern is applied", the "dataSet" variable will be in scope. See the Variable scope section in our documentation for more detail on which variables are in scope depending on when a given script is run.
OK, run the scraping session once more. Clear the scraping session log, then click the "Run Scraping Session" button again. If you let it run for a while you'll notice that it will request each details page for the products found on the first search results page, request the second search results page, then request each of the details pages for that page.
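The order of requests you see in the log can be reproduced with a small simulation in plain Java. Everything here is a stand-in: the CrawlOrderDemo class is a hypothetical name, the PAGES map is an invented two-page site, and the assignment of product IDs to pages is made up. The sketch only mirrors the two "When to Run" choices we made: details pages are scraped after each product match, and the next results page is scraped once after the "Next link" pattern has been applied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of the request order in our scraping session.
class CrawlOrderDemo {
    // Invented stand-in for the site: page number -> product IDs on that page.
    static final Map<Integer, List<String>> PAGES = new HashMap<>();
    static {
        PAGES.put(1, Arrays.asList("8", "34"));
        PAGES.put(2, Arrays.asList("7"));
    }

    static void scrapeSearchResults(int page, List<String> log) {
        log.add("search page " + page);
        // "Product details link": the script runs after each match.
        for (String productId : PAGES.get(page)) {
            log.add("details " + productId);
        }
        // "Next link": matches twice (top and bottom) unless this is the last page.
        int nextLinkMatches = PAGES.containsKey(page + 1) ? 2 : 0;
        // "After pattern is applied": run once, and only if something matched.
        if (nextLinkMatches > 0) {
            scrapeSearchResults(page + 1, log);
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        scrapeSearchResults(1, log);
        for (String line : log) {
            System.out.println(line);
        }
    }
}
```

Had the check been made once per "Next" match instead of once per pattern application, the second results page would have been requested twice.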
Extracting Product Details
At this point we're able to scrape the details pages for each of the products. We're now ready to extract the information we're really interested in: data about each DVD. To do this we're going to use sub-extractor patterns. Again, this is a point in the tutorial where you may want to slow down a bit. Sub-extractor patterns are another important concept that can be a bit confusing at first.
Sub-extractor patterns allow us to define a small region within a larger HTML page from which we'll extract individual snippets of information. This helps to eliminate most of the HTML text we're not interested in, allowing us to be more precise about the data we'd like to extract. It also makes our extractor patterns more resilient to future changes in the HTML page, as they allow us to reduce the amount of HTML we need to include.
If you let the scraping session run through to completion the last URL in the scraping session log will be the following:
http://www.screen-scraper.com/shop/index.php?main_page=product_info&products_id=7
It should be apparent in examining the page that most of the elements in it aren't of interest to us. For example, we don't care about the header, footer, or any of the boxes along the sides of the page. We'll first define a region that basically surrounds the elements we're interested in. Here is that full region:
<tr>
<td colspan="2" class="pageHeading" valign="top">
<h1>You've Got Mail</h1>
</td>
</tr>
<tr>
<td align="center" valign="top" class="smallText" rowspan="2">
<script language="javascript" type="text/javascript">
<!--
document.write('<a href="javascript:popupWindow(\'http://www.screen-scraper.com/shop/index.php?main_page=popup_image&pID=7\')"><img src="images/dvd/youve_got_mail.gif" border="0" alt="You\'ve Got Mail" title=" You\'ve Got Mail " width="100" height="80" hspace="5" vspace="5" /><br />larger image<\/a>');
//-->
</script>
<noscript><a href="http://www.screen-scraper.com/shop/index.php?main_page=images/dvd/youve_got_mail.gif" target="_blank"><img src="images/dvd/youve_got_mail.gif" border="0" alt="You've Got Mail" title=" You've Got Mail " width="100" height="80" hspace="5" vspace="5" /><br />
larger image</a></noscript> </td>
<td class="main" align="center" valign="top">Model: DVD-YGEM</td>
</tr>
<tr>
<td class="main" align="center"></td>
</tr>
<tr>
<td align="center" class="pageHeading">$34.99</td>
<td class="main" align="center">Shipping Weight: 7.00 lbs.</td>
</tr>
<tr>
<td> </td>
<td class="main" align="center">10 Units in Stock</td>
</tr>
<tr>
<td class="main" align="center">Manufactured by: Warner</td>
<td align="center">
<table border="0" width="150px" cellspacing="2" cellpadding="2">
<tr>
<td align="center" class="cartBox"> Quantity
That might seem like a large chunk of HTML, but it's actually a relatively small percentage of the entire page.
Before defining sub-extractor patterns we first define an extractor pattern with a special ~@DATARECORD@~ token in it. If you're familiar with computer programming in general, the ~@DATARECORD@~ token can be thought of as a "reserved word". That is, it's a token that has a special meaning in that it defines the sub-region of the HTML page containing the data elements we're interested in. You'll always use the ~@DATARECORD@~ token when using sub-extractor patterns.
Here's the extractor pattern we'll use:
<tr>
<td colspan="2" class="pageHeading~@DATARECORD@~Quantity
Notice that we simply replaced most of the middle portion of the large block of HTML with a ~@DATARECORD@~ token. If you look at the text before and after ~@DATARECORD@~ you can see that the same text is also found at the beginning and end of the large HTML block. The basic idea here is to include only as much HTML around the sub-region as necessary to uniquely identify it in the page. Any of the HTML covered by the ~@DATARECORD@~ token will be picked up by screen-scraper, and will define our sub-region that we'll be extracting the individual pieces of data from.
Create a new extractor pattern using the text given above (remember we're still using the "Details page" scrapeable file), then give it the name "PRODUCTS". Now click the "Apply Pattern to Last Scraped Data" button. In the window that appears, copy the text from the "DATARECORD" column and paste it into your text editor. The easiest way to select all of the text in that box is to triple-click it, use the keyboard to copy the text (Ctrl-C in Windows and Linux), then paste it into your text editor. The text should look like this:
" valign="top"><h1>You've Got Mail</h1></td></tr><tr><td align="center" valign="top" class="smallText" rowspan="2"><script language="javascript" type="text/javascript"><!--document.write( '<a href="javascript:popupWindow(\'http://www.screen-scraper.com/shop/index.php?main_page=popup_image &pID=7\')"><img src="images/dvd/youve_got_mail.gif" border="0" alt="You\'ve Got Mail" title=" You\'ve Got Mail " width="100" height="80" hspace="5" vspace="5" /><br />larger image<\/a>'); //--></script> <noscript><a href="http://www.screen-scraper.com/shop/index.php? main_page=images/dvd/youve_got_mail.gif" target="_blank"><img src="images/dvd/youve_got_mail.gif" border="0" alt="You've Got Mail" title=" You've Got Mail " width="100" height="80" hspace="5" vspace="5" /><br />larger image</a></noscript> </td><td class="main" align="center" valign="top"> Model: DVD-YGEM</td></tr><tr><td class="main" align="center"></td></tr><tr><td align="center" class="pageHeading">$34.99</td><td class="main" align="center">Shipping Weight: 7.00 lbs.</td> </tr><tr><td> </td><td class="main" align="center">10 Units in Stock</td></tr> <tr><td class="main" align="center">Manufactured by: Warner</td><td align="center"> <table border="0" width="150px" cellspacing="2" cellpadding="2"><tr><td align="center" class="cartBox">
This is the HTML we're after, but it's all in one large block. This occurs because screen-scraper strips out unnecessary white space when extracting information in order to make the extraction process more efficient. This can make sifting through the HTML a little more difficult, but the search feature in your text editor should make this relatively straightforward. You could also deal with the HTML found directly in the "Last Response" tab. You'd just have to be sure that you're only grabbing portions of the page that would be covered by the ~@DATARECORD@~ extractor pattern token.
First off, we're interested in the DVD title. In your text editor do a search for the first word in the title of the DVD whose page you're viewing (e.g., if you're viewing the HTML for the last DVD in the search results you'll search for "You've"). This should highlight the first word in the title. In order to extract this piece of information we'll use a small sub-extractor pattern:
<h1>~@TITLE@~</h1>
Once again, we include only as much HTML around the piece of data that we're interested in as is necessary. If we do this just right we'll still be able to extract information even if the web site itself makes minor changes. On our "PRODUCTS" extractor pattern, click the "Sub-Extractor Patterns" tab, then the "Add Sub-Extractor Pattern" button. In the text box that appears paste the text for the sub-extractor pattern we've included above. Edit the ~@TITLE@~ extractor pattern token by double-clicking it, then, under the "Regular Expression" section, select "Non-HTML tags" from the drop-down list (as a side note, "Non-HTML tags" is probably the most common regular expression you'll use). Click the "Apply Sub-Extractor Pattern to Last Scraped Data" button to try it out. You should see a DataSet with a single row and columns for the DATARECORD and TITLE tokens.
Next, create the following sub-extractor patterns for the remaining data elements we want to extract (note that each line of text will be a separate sub-extractor pattern):
>$~@PRICE@~<
>Model: ~@MODEL@~<
>Shipping Weight: ~@SHIPPING_WEIGHT@~<
>Manufactured by: ~@MANUFACTURED_BY@~<
For each token in the sub-extractor patterns give it the "Non-HTML tags" regular expression, as you did for the ~@TITLE@~ token.
As sub-extractor patterns match data, they aggregate the pieces into a single data record. That is, when our PRODUCTS extractor pattern is applied along with its sub-extractor patterns, the following data record will be produced:
| TITLE | PRICE | MODEL | SHIPPING_WEIGHT | MANUFACTURED_BY |
|---|---|---|---|---|
| You've Got Mail | 34.99 | DVD-YGEM | 7.00 lbs. | Warner |
You can see this by clicking the "Apply Pattern to Last Scraped Data" button.
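The two-step idea behind sub-extractor patterns (first isolate the DATARECORD region, then pull individual values out of it and combine them into one record) can be sketched in plain Java. This is an approximation under our own assumptions, not screen-scraper's implementation: the SubExtractorSketch class is a hypothetical name, the outer regular expression is a simplified version of the PRODUCTS pattern, we assume [^<>]* for "Non-HTML tags", and the sample page is a heavily trimmed version of the details page.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical two-step extraction: outer region first, then sub-patterns.
class SubExtractorSketch {
    // Heavily trimmed stand-in for the "You've Got Mail" details page.
    static final String PAGE =
            "<div>header</div><tr><td colspan=\"2\" class=\"pageHeading\" valign=\"top\">"
            + "<h1>You've Got Mail</h1></td></tr>"
            + "<tr><td class=\"main\" align=\"center\" valign=\"top\">Model: DVD-YGEM</td></tr>"
            + "<tr><td align=\"center\" class=\"pageHeading\">$34.99</td>"
            + "<td class=\"main\" align=\"center\">Shipping Weight: 7.00 lbs.</td></tr>"
            + "<tr><td class=\"main\" align=\"center\">Manufactured by: Warner</td>"
            + "<td align=\"center\" class=\"cartBox\"> Quantity</td></tr><div>footer</div>";

    static Map<String, String> extract(String page) {
        Map<String, String> record = new LinkedHashMap<>();
        // Step 1: the outer pattern isolates the region covered by DATARECORD.
        Matcher region = Pattern
                .compile("class=\"pageHeading(.*?)Quantity", Pattern.DOTALL)
                .matcher(page);
        if (!region.find()) {
            return record;
        }
        String dataRecord = region.group(1);

        // Step 2: each sub-pattern is applied only within that region.
        Map<String, String> subPatterns = new LinkedHashMap<>();
        subPatterns.put("TITLE", "<h1>([^<>]*)</h1>");
        subPatterns.put("PRICE", ">\\$([^<>]*)<");
        subPatterns.put("MODEL", ">Model: ([^<>]*)<");
        subPatterns.put("SHIPPING_WEIGHT", ">Shipping Weight: ([^<>]*)<");
        subPatterns.put("MANUFACTURED_BY", ">Manufactured by: ([^<>]*)<");
        for (Map.Entry<String, String> sub : subPatterns.entrySet()) {
            Matcher matcher = Pattern.compile(sub.getValue()).matcher(dataRecord);
            if (matcher.find()) {
                // All values end up in the same record.
                record.put(sub.getKey(), matcher.group(1));
            }
        }
        return record;
    }

    public static void main(String[] args) {
        System.out.println(extract(PAGE));
    }
}
```

Because the header and footer fall outside the region found in step 1, nothing in them can accidentally match one of the small sub-patterns.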
If you'd like, at this point try running the scraping session again by clearing the log and hitting the "Run Scraping Session" button. If you examine the log while the session runs you'll see that it extracts out details for each of the DVDs.
Saving the Data
Once screen-scraper extracts data there are a number of things that can be done with it. For example, you might be invoking screen-scraper from an ASP script, which, after telling screen-scraper to extract data, might display it to the user. In our case we'll simply write the data out to a text file. To do this, we'll once again write a script. Create a new script, call it "Write data to a file", and use the following Interpreted Java:
FileWriter out = null;
try
{
    session.log( "Writing data to a file." );
    // Open up the file to be appended to.
    out = new FileWriter( "dvds.txt", true );
    // Write out the data to the file.
    out.write( dataRecord.get( "TITLE" ) + "\t" );
    out.write( dataRecord.get( "PRICE" ) + "\t" );
    out.write( dataRecord.get( "MODEL" ) + "\t" );
    out.write( dataRecord.get( "SHIPPING_WEIGHT" ) + "\t" );
    out.write( dataRecord.get( "MANUFACTURED_BY" ) );
    out.write( "\n" );
    // Close up the file.
    out.close();
}
catch( Exception e )
{
    session.log( "An error occurred while writing the data to a file: " + e.getMessage() );
}
Our script simply takes the contents of the current data record (which for us will be the data record that constitutes a single DVD) and appends it to a "dvds.txt" text file.
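If you'd like to experiment with the same append-a-line logic outside of screen-scraper, here is a minimal sketch in ordinary compiled Java. It is not meant to be pasted into a screen-scraper script: it uses try-with-resources, which we don't assume the Interpreted Java engine supports, and a plain list of values stands in for the dataRecord object. The TsvAppendDemo class and its methods are hypothetical names.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

// Hypothetical standalone version of "append one tab-separated line per record".
class TsvAppendDemo {
    static void appendRecord(Path file, List<String> values) {
        // try-with-resources closes the writer even if a write fails.
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            out.write(String.join("\t", values));
            out.write("\n");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Appends the same record twice to a temporary file and reads it back.
    static List<String> demo() {
        List<String> record = Arrays.asList(
                "You've Got Mail", "34.99", "DVD-YGEM", "7.00 lbs.", "Warner");
        try {
            Path file = Files.createTempFile("dvds", ".txt");
            appendRecord(file, record);
            appendRecord(file, record);
            List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
            Files.delete(file);
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (String line : demo()) {
            System.out.println(line);
        }
    }
}
```

The second call adds a second line rather than overwriting the first, which is the same effect the "true" argument to FileWriter has in our script.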
If you're familiar with Java, hopefully the script makes sense. There is one important point worth noting, though. You'll notice that the script makes use of a "DataRecord" object (referenced as the "dataRecord" variable in the script). This object refers to the current DataRecord as the script is executed. Again, think of the spreadsheet. When the script gets invoked, a specific DataRecord (or row in the spreadsheet) will be current. This DataRecord automatically becomes a variable you can use in your script. The DataRecord object has a "get" method, which allows you to retrieve the value for a key it contains (i.e., you're referencing a specific cell in the spreadsheet). Again, you can read more about objects available in scripts and their scope in our documentation, at the Using Scripts and API documentation pages.
Click on the "Details page" scrapeable file, then on the "Extractor Patterns" tab. Below the extractor pattern text click the "Add Script" button. In the "Script Name" column, select "Write data to a file" and in the "When to Run" column select "After each pattern application" (even though there will only be one match per page). For each DVD we'll execute the script that will write the information out to a file.
To clarify a bit further, because we're invoking the script "After each pattern application", the "dataRecord" variable will be in scope. In other words, for each row in the spreadsheet (which happens to be a single row in this case) screen-scraper will execute the "Write data to a file" script. Each time it gets invoked a DataRecord will be current (again, think of it walking through each row in the spreadsheet). As such, we have access to the current row in the spreadsheet by way of the "dataRecord" variable. Had we indicated that the script was to be invoked "After pattern is applied", the "dataRecord" would not be in scope. Again using the spreadsheet analogy, scripts that get invoked "After pattern is applied" would run after screen-scraper had walked through all of the rows in the spreadsheet, so no DataRecord would be in scope (i.e., it's at the end of the spreadsheet--after the very last row). See the Variable scope section in our documentation for more detail on which variables are in scope depending on when a given script is run.
Once again, run the scraping session. This time if you check the directory where screen-scraper is installed you'll notice a dvds.txt file that will grow as the DVD details pages get scraped.
Note that as an alternative to the "Write data to a file" script you could use the following code (professional and enterprise editions only):
dataSet.writeToFile( "dvds.txt" );
We included the first example to demonstrate referencing data records in scripts.
If you would like more information on saving extracted data to a database please consult our FAQ on the topic here.
![]() |
Logging In |
Oftentimes it's necessary to log in to a web site before extracting the information you're interested in. This is generally quite a bit easier than it might seem. Typically this simply involves creating a scrapeable file to handle the login that will get invoked before any of the other pages. The shopping site we're scraping from doesn't require us to log in before performing searches, but for the sake of this tutorial we'll set it up as if it did.
Before we look at the page that handles the actual login, we need to have screen-scraper request the home page for the shopping site. This is necessary because it allows for a few initial cookies to be set before we attempt to log in. If you're familiar with web programming, we're requesting the home page so that the server can create a session for us (tracked by the cookies) prior to our attempting a login. By having screen-scraper request the home page, those cookies will get set, and screen-scraper will then automatically track them for us.
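For readers who like to see the mechanism, here is a rough sketch of the same cookie bookkeeping using the JDK's own `java.net.CookieManager`. It makes no network requests and says nothing about how screen-scraper implements this internally; the cookie name `sessionid` and its value are invented for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CookieTrackingSketch {

    // Returns the Cookie header values a client would send with the login
    // request after the home page response has set a session cookie.
    static List<String> cookiesForLogin() {
        CookieManager cookies = new CookieManager();
        try {
            // Pretend the home page response carried this header.
            // The cookie name and value are made up for the sketch.
            URI home = URI.create("http://www.screen-scraper.com/shop/");
            Map<String, List<String>> responseHeaders = new HashMap<>();
            responseHeaders.put("Set-Cookie",
                    Collections.singletonList("sessionid=abc123; Path=/shop/"));
            cookies.put(home, responseHeaders);

            // Ask which cookies apply to the login URL.
            URI login = URI.create(
                    "http://www.screen-scraper.com/shop/index.php?main_page=login");
            Map<String, List<String>> requestHeaders =
                    cookies.get(login, new HashMap<String, List<String>>());
            List<String> header = requestHeaders.get("Cookie");
            return header == null ? Collections.<String>emptyList() : header;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(cookiesForLogin());
    }
}
```

Because the cookie was stored when the home page "responded", it is offered again for the login URL on the same host and path, which is the behavior the paragraph above relies on.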
Create a scrapeable file for the home page by clicking on the "Shopping Site" scraping session (the one with a gear) in the tree on the left, then on the "Add Scrapeable File" button. Give the new scrapeable file the name "Home", set its sequence to "1", and give it the URL "http://www.screen-scraper.com/shop/".
Login HTTP requests are usually POST requests, which makes it trickier to tell what parameters are being passed to the server (i.e., the parameters won't appear in the URL). The proxy server can make viewing the parameters easier, so let's make use of it. Open your web browser to the shopping login page:
http://www.screen-scraper.com/shop/index.php?main_page=login
In your web browser, in the "E-Mail Address" field enter test@test.com and in the "Password" field enter testing, then click the "login" button. After screen-scraper works for a bit, return to the "General" tab and click the "Stop Proxy Server" button. Re-configure your web browser so that it no longer uses screen-scraper as a proxy server.
If you paid close attention to screen-scraper as it was working you may have noticed that two rows were added to the "HTTP Transactions" table (it's actually possible that three were added; if so, just delete the last one by highlighting it and hitting the "Delete" key on your keyboard). Click on the first row in the table; its URL should begin with:
http://www.screen-scraper.com/shop/index.php?main_page=login
At this point we'll want to copy the login POST request to our scraping session. We only need the first transaction in the table (or whichever one corresponds to the login request itself--it should be the one with the POST data) and not the request representing the redirect, since screen-scraper will automatically follow redirects for us. Copy the HTTP transaction to your scraping session by clicking on the first row in the table (the one corresponding to the POST request), ensure that the "Shopping Site" scraping session is selected in the drop-down list, then click the "Go" button. After the new scrapeable file is created under the scraping session rename it "Login". Also, set its sequence to 2. It should be requested right after the home page is requested. screen-scraper automatically tracks cookies, just like a web browser, so by requesting it near the beginning any subsequent pages that are protected by the login will be accessible.
Now click the "Parameters" tab in our "Login" scrapeable file. You'll notice that screen-scraper automatically extracted out the various POST parameters and added them to the scrapeable file. If you're familiar with URL encoding, you'll also notice that screen-scraper decoded the "email_address" parameter to "test@test.com". screen-scraper automatically URL encodes parameters found under the "Parameters" tab before passing them up to the server.
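To make the encoding step concrete, here is a small sketch of how decoded parameters turn into a URL-encoded POST body, using the JDK's `java.net.URLEncoder`. It is not screen-scraper's code, and it includes only the two parameters named in this tutorial; the real login form may send others.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class FormEncodingSketch {

    // Joins name=value pairs with '&', URL-encoding each name and value.
    static String encodeForm(Map<String, String> params) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> entry : params.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(entry.getKey(), "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(entry.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // Decoded values, as they appear under the "Parameters" tab.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("email_address", "test@test.com");
        params.put("password", "testing");
        System.out.println(encodeForm(params));
    }
}
```

The "@" in the e-mail address becomes "%40" on the wire, which is the encoded form you would have seen in the raw proxy transaction.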
At this point feel free to run the scraping session again. Because our site doesn't require logging in before searching can take place it won't make much difference, but you'll at least be able to see the login page being requested in the log for the scraping session.
![]() |
Where to Go From Here |
Congratulations! At this point you should have the basics under your belt to scrape most web sites. From here you could continue on with one of the subsequent tutorials, if they seem relevant to your project. It may also be a good idea to look through a bit more of our documentation in order to get familiar with other details of screen-scraper. Either way, probably the best way to learn screen-scraper is to use it. Try it on one of your own projects!
![]() |
Tutorial Overview |
This tutorial continues on where Tutorial 1: Hello World left off, and covers aspects of screen-scraper related to richer scripting and interacting with screen-scraper from external languages, including Active Server Pages, PHP, and Java.
If you haven't completed the first tutorial don't worry, but you'll at least need to import the script and scraping session that were created in the first tutorial. To do that, download and import the scraping session found here.
If you'd like to see the final version of the scraping session you'll be creating in this tutorial you can download it below.
| Attachment | Size |
|---|---|
| Hello World (Scraping Session).sss | 3.06 KB |
![]() |
Embedding Session Variables |
A significant limitation of our first "Hello World" project was that we could only scrape the text from our first request. That is, we were always scraping the text "Hello World!", which really isn't that useful. We'll now adjust our setup so that we can designate the text to be submitted in the form.
At this point we're going to set a session variable that will hold the text we'd like submitted in the form. Within screen-scraper, session variables are used to transfer information between scripts, scrapeable files, and other objects. Session variables are generally set from within scripts, but can also be automatically set within extractor patterns as well as passed in from external applications.
We'll now set up a script to set a session variable before our scraping session runs. Create a new script as you've done before, and call it "Initialize scraping session". Use the following for the body of the script:
// Put the text to be submitted in the form into a
// session variable so we can reference it later.
session.setVariable( "TEXT_TO_SUBMIT", "Hi everybody!" );
Hopefully the script seems straightforward. It simply sets a session variable named "TEXT_TO_SUBMIT", and gives it the value "Hi everybody!" (spoken, of course, in your best Dr. Nick voice).
Setting the session variable "TEXT_TO_SUBMIT" will allow us to access that value in other scripts and scrapeable files while our "Hello World" scraping session is running.
We'll now need to associate our script with our scraping session so that it gets invoked before the scraping session begins. To do that, click on the scraping session in the tree on the left, then on the "Scripts" tab. Click the "Add Script" button to add a script. In the "Script Name" column select "Initialize scraping session". The "When to Run" column should show "Before scraping session begins", and the "Enabled" checkbox should be checked. This will cause our script to get executed at the very beginning of the scraping session so that the "TEXT_TO_SUBMIT" session variable can get set.
Just as we use special tokens in extractor patterns to designate values we'd like to extract, we use special tokens to insert values of session variables into the URLs or parameters (GET, POST, or BASIC authentication) of scrapeable files. We'll do this now by embedding it into one of the parameters of our only scrapeable file. Expand the "Hello World" scraping session in the tree on the left, then select the "Form submission" scrapeable file. Click on the "Parameters" tab. In the "Value" column for our "text_string" parameter replace the text "Hello world!" with the text:
~#TEXT_TO_SUBMIT#~
The ~# and #~ delimiters are used to designate a session variable whose value should be inserted into that location when the scrapeable file gets executed. When the scrapeable file gets invoked, screen-scraper will construct the URL by including the "text_string" parameter in it. In other words, the URL for our scrapeable file will become this:
http://www.screen-scraper.com/screen-scraper/tutorial/basic_form.php?text_string=Hi+everybody%21
Run the scraping session again. In the log you should see output like the following:
Form submission: The following data elements were found:
Form data--DataRecord 0:
FORM_SUBMITTED_TEXT=Hi everybody!
And if you look at the contents of the "form_submitted_text.txt" file you'll notice the same text.
Remember that it's a good idea to run scraping sessions often as you make changes, and watch the log and last responses to ensure that things are working as you expect them to.
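The token substitution described in this section can be approximated in a few lines of Java. This is a sketch of the idea only, not screen-scraper's implementation: the `expand` and `buildUrl` methods are hypothetical, and leaving unknown tokens untouched is an assumption made for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenSubstitutionSketch {

    // Matches tokens of the form ~#NAME#~ and captures NAME.
    private static final Pattern TOKEN = Pattern.compile("~#(\\w+)#~");

    // Replaces each token with the matching session variable value.
    // Unknown tokens are left as-is (an assumption made for this sketch).
    static String expand(String template, Map<String, String> sessionVariables) {
        Matcher matcher = TOKEN.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = sessionVariables.get(matcher.group(1));
            String replacement = (value == null) ? matcher.group() : value;
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    // Expands the parameter value, URL-encodes it, and appends it to the URL.
    static String buildUrl(String baseUrl, String paramName, String paramTemplate,
                           Map<String, String> sessionVariables) {
        try {
            String value = expand(paramTemplate, sessionVariables);
            return baseUrl + "?" + paramName + "=" + URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> vars = new HashMap<>();
        vars.put("TEXT_TO_SUBMIT", "Hi everybody!");
        System.out.println(buildUrl(
                "http://www.screen-scraper.com/screen-scraper/tutorial/basic_form.php",
                "text_string", "~#TEXT_TO_SUBMIT#~", vars));
    }
}
```

The substitution happens on the decoded value first and the encoding is applied afterwards, which is why the space and exclamation mark show up as "+" and "%21" in the final URL.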
![]() |
Interacting with Screen-Scraper Externally |
Invoking screen-scraper from the command line
If you've decided to use the basic edition of screen-scraper your only option for invoking screen-scraper externally is from the command line (invoking screen-scraper from the command line is also available in the professional and enterprise editions). You can find full documentation and examples on doing that at our Invoking screen-scraper from the command line documentation page. If you don't need to invoke screen-scraper from the command line you can skip to the Invoking screen-scraper from an external application section.
In order to invoke screen-scraper from the command line, you'll want to create a batch file (in Windows) or a shell script (in Linux or Mac OS X) to invoke the scraping session. If you're using Windows open a text editor (e.g., Notepad) and enter the following:
jre\bin\java -jar screen-scraper.jar -s "Hello World" --params "TEXT_TO_SUBMIT=Hello+World"
Save the batch file (call it "hello_world.bat") in the folder where screen-scraper is installed (e.g., C:\Program Files\screen-scraper professional edition\). If the version of screen-scraper you're running is prior to 4.5, and you're running Windows Vista, you will need to save your batch file to a location such as your Documents folder or your Desktop. Then, within Windows Explorer, manually transfer the file to the directory where screen-scraper is installed.
If you're running Linux, the shell script would look like this:
jre/bin/java -jar screen-scraper.jar -s "Hello World" --params "TEXT_TO_SUBMIT=Hello+World"
And for Mac OS X, you'd use this:
java -jar screen-scraper.jar -s "Hello World" --params "TEXT_TO_SUBMIT=Hello+World"
Within screen-scraper, you'll want to disable the "Initialize scraping session" script; otherwise, the value we pass in from the command line would get overwritten once that script is executed. Disable the script by clicking on the "Hello World" scraping session, then on the "Scripts" tab, then un-checking the "Enabled?" check box for the script.
You can then run the batch file by opening a DOS prompt (or terminal in Linux or Mac OS X), changing to the folder containing the batch file, then invoking it. You should see the text from screen-scraper's log appear in the DOS window. If you're running Linux or Mac OS X, you'll need to close the workbench before invoking your shell script.
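If you would rather launch the same command from a Java program than from a batch file or shell script, something along these lines may work. It is a sketch under a few assumptions: `java` is on the path (as in the Mac OS X variant above), the install folder passed to `prepare` is a placeholder you would supply, and the value is URL-encoded because the commands above pass "Hello+World". Only the `-s` and `--params` flags shown in this tutorial are used.

```java
import java.io.File;
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.List;

class CommandLineSketch {

    // Builds the same argument list as the Mac OS X command shown above.
    static List<String> buildCommand(String sessionName, String variable, String value) {
        try {
            String param = variable + "=" + URLEncoder.encode(value, "UTF-8");
            return Arrays.asList("java", "-jar", "screen-scraper.jar",
                                 "-s", sessionName, "--params", param);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Prepares, but does not start, a process rooted in the install folder.
    static ProcessBuilder prepare(String installDir, List<String> command) {
        return new ProcessBuilder(command)
                .directory(new File(installDir))
                .inheritIO();
    }

    public static void main(String[] args) {
        List<String> command = buildCommand("Hello World", "TEXT_TO_SUBMIT", "Hello World");
        System.out.println(String.join(" ", command));
        // To launch it, call prepare(yourInstallFolder, command).start().
    }
}
```

Passing the arguments as a list avoids the quoting problems that come up when a session name such as "Hello World" contains a space.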
Invoking screen-scraper from an external application
Note that the rest of this tutorial only applies to the professional and enterprise editions of screen-scraper.
Oftentimes you'll want to use a language or platform external to screen-scraper to scrape data. screen-scraper can be controlled externally using Java, PHP, Ruby, Python, .NET, ColdFusion, any COM-friendly language (such as Active Server Pages or Visual Basic), or any language that supports SOAP. In this next part of the tutorial we'll give examples in PHP, Java, ColdFusion, and Active Server Pages.
In order to interact with screen-scraper externally it needs to be running as a server. When running as a server screen-scraper acts much like a database server does. That is, it listens for requests from external sources, services those requests, and sends back responses. For example, when you issue a SQL statement to a database from an ASP script your script is opening up a socket to the database, sending the request over it, then receiving the database's response back over the socket. Once this transaction has been completed the socket will be closed, but the database will continue to listen for other requests. screen-scraper works in a similar way.
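The request/response cycle described above can be demonstrated with a toy server written against the JDK's socket classes. This has nothing to do with screen-scraper's actual wire protocol; the "handled: " reply and the single-line message format are invented purely to show a client opening a socket, sending a request, and reading the response before the connection closes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

class RequestResponseSketch {

    // Starts a one-shot server on a free local port, sends it one request
    // line, and returns the one response line it sends back.
    static String roundTrip(String request) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            Thread worker = new Thread(() -> serveOnce(server));
            worker.start();
            try (Socket socket = new Socket(loopback, server.getLocalPort());
                 PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(socket.getInputStream()))) {
                out.println(request);   // send the request over the socket
                return in.readLine();   // receive the response
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Accepts a single connection, services it, then closes that connection.
    private static void serveOnce(ServerSocket server) {
        try (Socket client = server.accept();
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(client.getInputStream()));
             PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
            out.println("handled: " + in.readLine());
        } catch (IOException e) {
            // The sketch simply gives up on I/O errors.
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("scrape Hello World"));
    }
}
```

A real server would keep accepting connections in a loop after each one closes, which is the "continue to listen for other requests" part of the description.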
At this point we'd recommend reading over the documentation page that discusses running screen-scraper as a server, and gives details on how to start and stop it according to the platform you're running on. Follow the link below, then return back to this page when you're finished:
Running screen-scraper as a server
Before we start writing code to interact with screen-scraper externally we need to configure a few things. Depending on the language you'd like to program in, please follow one of the links below, which will give you an overview of interacting with screen-scraper using that language and guide you through any configuration that needs to take place. Once you're finished return back to this page.
Invoking screen-scraper from ColdFusion
Invoking screen-scraper from a COM-based application
Invoking screen-scraper from Java
Invoking screen-scraper from PHP
Each time you run a scraping session externally screen-scraper will generate a log file corresponding to that scraping session in the "log" folder found inside the folder where you installed screen-scraper. This can be invaluable for debugging, so you'll want to take a look at it if you run into trouble. You can turn server logging off by unchecking the "Generate log files" check box under the "Servers" section of the "Settings" dialog box.
If you haven't already, within screen-scraper, you'll want to disable the "Initialize scraping session" script; otherwise, the value we pass in from our external application would get overwritten once that script is executed. Disable the script by clicking on the "Hello World" scraping session, then on the "Scripts" tab, then un-checking the "Enabled?" check box for the script.
OK, we're now ready to write some code. Follow one of the links below.
![]() |
Interacting with screen-scraper from ASP |
The ASP script we'll be writing will invoke our scraping session remotely, passing in a value for the "TEXT_TO_SUBMIT" session variable. Create a new ASP script on your computer, and paste the following code into it:
<%
' Create a RemoteScrapingSession object.
Set objRemoteSession = Server.CreateObject("Screenscraper.RemoteScrapingSession")
' Generate a new "Hello World" scraping session.
Call objRemoteSession.Initialize("Hello World")
' Put the text to be submitted in the form into a session variable so we can reference it later.
Call objRemoteSession.SetVariable( "TEXT_TO_SUBMIT", "Hi everybody!" )
' Check for errors.
If objRemoteSession.isError Then
    Response.Write( "Error: " & objRemoteSession.GetErrorMessage )
Else
    ' Tell the scraping session to scrape.
    Call objRemoteSession.Scrape
    ' Write out the text that was scraped:
    Response.Write( "Scraped text: " & objRemoteSession.GetVariable("FORM_SUBMITTED_TEXT") )
End If
' Disconnect from the server.
Call objRemoteSession.Disconnect
%>
OK, we're ready to give our script a try. Start screen-scraper running as a server. If you need help or have trouble with this refer to the documentation page here: Running screen-scraper as a server. If you've succeeded in starting up the server go ahead and load your ASP script in a browser. After a short pause you should see the "Hi everybody!" message output to your browser.
We'll be creating two different scripts to interact with screen-scraper via ColdFusion. The first will be using ColdFusion tags, and the second will be using ColdFusion script. Each of these scripts will invoke our scraping session remotely, passing in a value for the "TEXT_TO_SUBMIT" session variable.
Now is probably a good time to review the setup to work with screen-scraper from ColdFusion. Take a minute now to read our Invoking screen-scraper from ColdFusion page, get everything set up, then return back here.
All right, we're ready to write some code. Create a new ColdFusion script on your computer, and paste the following code into it:
<html>
<head>
<title>ColdFusion Tag Example</title>
</head>
<body>
<cfobject
action = "create"
type = "java"
class = "com.screenscraper.scraper.RemoteScrapingSession"
name = "RemoteScrapingSession">
<cfset remoteSession = RemoteScrapingSession.init("Hello World","localhost",8778)>
<cfset remoteSession.setVariable( "TEXT_TO_SUBMIT", urlEncodedFormat("Hi everybody!") )>
<cfset remoteSession.scrape()>
<cfset test = remoteSession.getVariable("FORM_SUBMITTED_TEXT")>
<cfset remoteSession.disconnect()>
<cfoutput>
textReturned: #test#
</cfoutput>
</body>
</html>
You can probably follow the logic to see that this code is virtually identical to our script. The one notable difference is that we need to explicitly disconnect from the server so that it knows we're done.
OK, we're ready to give our ColdFusion script a try. Start screen-scraper running as a server. If you need help or have trouble with this refer to the documentation page here: Running screen-scraper as a server. If you've succeeded in starting up the server go ahead and access your ColdFusion script from your browser. After a short pause you should see the "Hi everybody!" message output.
If you prefer using ColdFusion script to program, you can use the following code instead of the code we give above:
<html>
<head>
<title>ColdFusion Script Example</title>
</head>
<body>
<cfscript>
RemoteScrapingSession = CreateObject("java","com.screenscraper.scraper.RemoteScrapingSession");
remoteSession = RemoteScrapingSession.init("Hello World","localhost",8778);
remoteSession.setVariable( "TEXT_TO_SUBMIT", "Hi everybody!" );
remoteSession.scrape();
xlnt = remoteSession.getVariable( "FORM_SUBMITTED_TEXT" );
remoteSession.disconnect();
</cfscript>
<cfoutput>
test:#xlnt#<br>
</cfoutput>
</body>
</html>
![]() |
Interacting with Screen-Scraper from Java |
The Java class we'll be writing will simply substitute for the "Initialize scraping session" script we wrote previously. That is, our Java class will invoke our scraping session remotely, passing in a value for the "TEXT_TO_SUBMIT" session variable. Create a new Java class on your computer, and paste the following code into it:
import com.screenscraper.scraper.*;

public class HelloWorldRemoteScrapingSession
{
    /**
     * The entry point.
     */
    public static void main( String args[] )
    {
        try
        {
            // Create a remoteSession to communicate with the server.
            RemoteScrapingSession remoteSession = new RemoteScrapingSession( "Hello World" );
            // Put the text to be submitted in the form into a session variable so we can reference it later.
            remoteSession.setVariable( "TEXT_TO_SUBMIT", "Hi everybody!" );
            // Tell the session to scrape.
            remoteSession.scrape();
            // Output the text that was scraped:
            System.out.println( "Scraped text: " + remoteSession.getVariable( "FORM_SUBMITTED_TEXT" ) );
            // Very important! Be sure to disconnect from the server.
            remoteSession.disconnect();
        }
        catch( Exception e )
        {
            System.err.println( e.getMessage() );
        }
    }
}
OK, we're ready to give our Java class a try. After you've successfully compiled the class (remember to include the "screen-scraper.jar" file in your classpath), start screen-scraper running as a server. If you need help or have trouble with this refer to the documentation page here: Running screen-scraper as a server. If you've succeeded in starting up the server go ahead and run the Java class from a command prompt or console. After a short pause you should see the "Hi everybody!" message output.
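Since disconnecting is described as very important, one possible refinement is to make sure it happens even when scraping fails, by moving the call into a `finally` block. The sketch below shows that pattern with a hypothetical `ScrapeClient` interface and an in-memory `FakeClient` standing in for the remote session, so that it runs without screen-scraper.jar; neither type is part of screen-scraper, and only the method names are borrowed from the class above.

```java
import java.util.HashMap;
import java.util.Map;

class DisconnectSketch {

    // Hypothetical stand-in that mirrors only the method names used above.
    interface ScrapeClient {
        void setVariable(String name, String value);
        void scrape();
        String getVariable(String name);
        void disconnect();
    }

    // Runs one scrape and always disconnects, even if scrape() throws.
    static String scrapeText(ScrapeClient client, String text) {
        try {
            client.setVariable("TEXT_TO_SUBMIT", text);
            client.scrape();
            return client.getVariable("FORM_SUBMITTED_TEXT");
        } finally {
            client.disconnect();
        }
    }

    // In-memory fake that echoes the submitted text, or fails on request.
    static class FakeClient implements ScrapeClient {
        private final Map<String, String> variables = new HashMap<>();
        private final boolean failOnScrape;
        boolean disconnected = false;

        FakeClient(boolean failOnScrape) {
            this.failOnScrape = failOnScrape;
        }

        public void setVariable(String name, String value) {
            variables.put(name, value);
        }

        public void scrape() {
            if (failOnScrape) {
                throw new IllegalStateException("simulated scraping failure");
            }
            variables.put("FORM_SUBMITTED_TEXT", variables.get("TEXT_TO_SUBMIT"));
        }

        public String getVariable(String name) {
            return variables.get(name);
        }

        public void disconnect() {
            disconnected = true;
        }
    }

    public static void main(String[] args) {
        FakeClient client = new FakeClient(false);
        System.out.println("Scraped text: " + scrapeText(client, "Hi everybody!"));
        System.out.println("Disconnected: " + client.disconnected);
    }
}
```

In the tutorial class the `disconnect()` call sits inside the `try` block, so an exception thrown by `scrape()` would skip it; the `finally` form avoids that.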
The PHP script we'll be writing will invoke our scraping session remotely, passing in a value for the "TEXT_TO_SUBMIT" session variable. Create a new PHP script on your computer, and paste the following code into it:
<?php
/**
* Note that in order to run this script the file
* remote_scraping_session.php must be in the same
* directory.
*/
require( 'remote_scraping_session.php' );
// Instantiate a remote scraping session.
$session = new RemoteScrapingSession;
// Initialize the "Hello World" session.
echo "Initializing the session.<br>";
flush();
$session->initialize( "Hello World" );
// Put the text to be submitted in the form into a session variable so we can reference it later.
$session->setVariable( "TEXT_TO_SUBMIT", urlencode( "Hi everybody!" ) );
// Check for errors.
if( $session->isError() )
{
    echo "An error occurred: " . $session->getErrorMessage() . "<br>";
    exit();
}
// Tell the session to scrape.
echo "Scraping<br>";
flush();
$session->scrape();
// Write out the text that was scraped:
echo "Scraped text: " . $session->getVariable( "FORM_SUBMITTED_TEXT" ) . "<br>";
// Very important! Be sure to disconnect from the server.
$session->disconnect();
// Indicate that we're finished.
echo "Finished.";
?>
There are just a couple of extra steps we take here that we didn't take in our previous script. First, after creating our RemoteScrapingSession object we make a separate call to initialize it for our specific scraping session. Also, you'll notice that before calling the scrape method we check for any errors that may have occurred up to that point. For example, if for some reason your PHP script can't connect to the server you'd want to know before you tried to tell it to scrape. Finally, we need to explicitly disconnect from the server so that it knows we're done.
OK, we're ready to give our script a try. Start screen-scraper running as a server. If you need help or have trouble with this refer to the documentation page here: Running screen-scraper as a server. Remember also that the "remote_scraping_session.php" file needs to be in the same directory as your PHP script. If you've succeeded in starting up the server go ahead and load your PHP script in a browser. After a short pause you should see the "Hi everybody!" message output to your browser.
![]() |
Where to Go From Here |
Congratulations! You've now covered all of the basic principles needed to invoke screen-scraper externally. In working on your own projects we'd suggest referring frequently to the screen-scraper documentation available from within the application or on our web site.
Tutorial 2 deals with other topics, including scraping search results (with multiple records) across multiple pages, and logging in to a web site before scraping information.
![]() |
Tutorial Overview |
This tutorial illustrates invoking screen-scraper from other programs in ways more complex than those presented in Tutorial 3. From our external program we'll be passing to screen-scraper search parameters, invoking the scraping process, getting the scraped data from screen-scraper, then iterating over the data, and outputting it within our application.
Before proceeding it would be a good idea to go through Tutorial 2, if you haven't done so already.
If you haven't gone through Tutorial 2, or don't still have the scraping session you created in it, you can download it here and import it into screen-scraper.
Once you've got the scraping session imported into screen-scraper you're ready to roll. Click on the "Tutorial Details" link below to get going.
![]() |
Tutorial Details |
screen-scraper can be invoked from software applications written in most modern programming languages, including Java, Active Server Pages, PHP, .NET, and anything that supports SOAP. In this tutorial we'll give some examples of applications that do just that.
Our application will pass parameters to screen-scraper corresponding to login information as well as a key phrase for which to search. As in Tutorial 2, we're going to pretend that the web site requires us to log in before we can search, for the sake of providing an example, even though it actually doesn't. Once we pass the parameters to screen-scraper we'll tell it to start scraping. screen-scraper will then run the scraping session using the parameters we gave it, extracting out the data it normally does. Once it's done, we'll ask it for the extracted information, then output it for the user to see.
Before we begin we'll first need to make a couple of minor changes to the e-commerce scraping session from Tutorial 2. If you haven't already, start up screen-scraper. Under the "Shopping Site" scraping session click on the "Login" scrapeable file, then on the "Parameters" tab. We're going to alter the "email_address" and "password" POST parameters so that we can pass those parameters in rather than hard-coding them. For the "email_address" parameter change the value "test@test.com" to ~#EMAIL_ADDRESS#~, and change the "testing" value for the "password" parameter to ~#PASSWORD#~. You might remember from Tutorial 2 that tokens surrounded by the ~# #~ delimiters indicate that the value of a session variable should be inserted. For example, in our case we're going to create an "EMAIL_ADDRESS" session variable and give it the value "test@test.com" such that screen-scraper substitutes it in for the corresponding POST parameter at runtime.
In addition, click on the "Details page" scrapeable file. On the "PRODUCTS" extractor pattern, select the "Advanced" tab and check the box next to "Automatically save the data set generated by this extractor pattern in a session variable."
The code that we'll be writing in our external application will also be essentially taking the place of the current "Shopping Site--initialize session" script. Let's disable that since it would otherwise overwrite the values we'll be passing in externally. To do that click on the "Shopping Site" scraping session in the tree on the left, then on the "Scripts" tab. In the scripts table, un-check the "Enabled?" check box for the "Shopping Site--initialize session" script. Save your changes and exit screen-scraper.
Where you go next depends on which programming language you're interested in. Use one of the links below according to your preference.
![]() |
Invoking screen-scraper from ASP |
In order to invoke screen-scraper from ASP, screen-scraper needs to be running in server mode. If you'd like a refresher on how to start up screen-scraper in server mode go ahead and follow that link, then come back here.
Okay, let's try it out before we go over the code. Right-click and download the shopping.asp file here, then save it to a directory where it will be web-accessible (i.e., within your IIS web dir). After that start up screen-scraper in server mode.
Open up your web browser and go to the URL corresponding to the "shopping.asp" file (e.g., http://localhost/screen-scraper/shopping.asp). You'll see a simple search form. Type in a product keyword, such as "bug", then hit the "Go" button. If all goes well the page will take a little while to load (it's waiting as screen-scraper extracts the data), then it will output the corresponding products.
If that didn't go quite as you expected here are some things to check:
Assuming that test worked, fire up your favorite ASP editor and open the "shopping.asp" file in it. The file is pretty heavily commented, so hopefully it makes sense what's going on. If not, try reviewing our COM documentation or posting to our forum.
When you invoke screen-scraper as a server it creates log files corresponding to your scraping session in its "log" folder. Find your "Shopping Site" log file in that folder and look through it. It should look similar to what you see when you run scraping sessions in the workbench.
![]() |
Invoking screen-scraper from C#.NET |
Before we dig into the code take a minute to review our Invoking screen-scraper via .NET documentation page. The C# file we'll be referring to can be downloaded here.
In order to invoke screen-scraper from C#.NET, screen-scraper needs to be running in server mode. If you'd like a refresher on how to start up screen-scraper in server mode go ahead and follow that link, then come back here.
Okay, let's try it out before we dive into the code. Start screen-scraper running as a server. From your .NET environment compile and execute the "shopping.cs" file.
If that didn't go quite as you expected here are some things to check:
Assuming that test worked, take a closer look over the "shopping.cs" class. The file is pretty heavily commented, so hopefully it makes sense what's going on. If not, try reviewing our .NET documentation or posting to our forum.
When you invoke screen-scraper as a server it creates log files corresponding to your scraping session in its "log" folder. Find your "Shopping Site" log file in that folder and look through it. It should look similar to what you see when you run scraping sessions in the workbench.
![]() |
Invoking screen-scraper from ColdFusion |
Before we dig into the code you'll probably want to take a minute to review our Invoking screen-scraper from ColdFusion documentation page. Remember that you need to add the "screen-scraper.jar" file to your classpath in order to be able to interact with screen-scraper.
In order to invoke screen-scraper from ColdFusion, screen-scraper needs to be running in server mode. If you'd like a refresher on how to start up screen-scraper in server mode go ahead and follow that link, then come back here.
Okay, let's try it out before we go over the code. Download the shopping.cfm.txt file here, then save it in a directory that will be accessible from your web server. Rename the file from "shopping.cfm.txt" to "shopping.cfm". After that start up screen-scraper in server mode.
Open up your web browser and go to the URL corresponding to the "shopping.cfm" file (e.g., http://localhost/screen-scraper/shopping.cfm). You'll see a simple search form. Type in a product keyword, such as "bug", then hit the "Go" button. If all goes well the page will take a little while to load (it's waiting as screen-scraper extracts the data), then it will output the corresponding products.
If that didn't go quite as you expected here are some things to check:
Assuming that test worked, fire up your favorite ColdFusion editor and open the "shopping.cfm" file in it. The file is pretty heavily commented, so hopefully it makes sense what's going on. If not, try reviewing our ColdFusion documentation or posting to our forum.
When you invoke screen-scraper as a server it creates log files corresponding to your scraping session in its "log" folder. Find your "Shopping Site" log file in that folder and look through it. It should look similar to what you see when you run scraping sessions in the workbench.
![]() |
Invoking screen-scraper from Java |
Before we dig into the code let's review a few things related to invoking screen-scraper via Java. First, your Java code will need to have two jars in its classpath: screen-scraper.jar (found in the root screen-scraper install folder) and log4j.jar (found in screen-scraper's "lib" folder). For convenience we've packaged all of the files you'll need in this zip file. Download that file and unzip it. You'll notice that we also include an Ant build file that you can use to compile and run the sample class.
In order to invoke screen-scraper from Java, screen-scraper needs to be running in server mode. If you'd like a refresher on how to start up screen-scraper in server mode go ahead and follow that link, then come back here.
Okay, let's try it out before we dive into the code. Start screen-scraper running as a server. If you're using Ant, simply type "ant run" at a command prompt inside the folder where the build.xml file is found.
If that didn't go quite as you expected here are some things to check:
Assuming that test worked, fire up your favorite Java editor and open the "Shopping.java" file in it. The file is pretty heavily commented, so hopefully it makes sense what's going on. If not, try reviewing our Java documentation or posting to our forum.
When you invoke screen-scraper as a server it creates log files corresponding to your scraping session in its "log" folder. Take a look in that folder for your "Shopping Site" log file and take a look through it. It should look similar to what you see when you run scraping sessions in the workbench.
![]() |
Invoking screen-scraper from PHP |
Before we dig into the code let's review a few things related to invoking screen-scraper via PHP. First, your PHP code will need to refer to screen-scraper's PHP driver, called "remote_scraping_session.php". You can find this file in the "misc\php\" folder of your screen-scraper installation. You'll want to put a copy of that file into the directory where you plan on putting the PHP file that will invoke screen-scraper.
In order to invoke screen-scraper from PHP, screen-scraper needs to be running in server mode. If you'd like a refresher on how to start up screen-scraper in server mode go ahead and follow that link, then come back here.
Okay, let's try it out before we go over the code. Download the shopping.php.zip file here, unzip it, then save it in the same directory where you copied the "remote_scraping_session.php" file. Rename the file from "shopping.php.txt" to "shopping.php". After that start up screen-scraper in server mode.
Open up your web browser and go to the URL corresponding to the "shopping.php" file (e.g., http://localhost/screen-scraper/shopping.php). You'll see a simple search form. Type in a product keyword, such as "bug", then hit the "Go" button. If all goes well the page will take a little while to load (it's waiting as screen-scraper extracts the data), then it will output the corresponding products.
If that didn't go quite as you expected here are some things to check:
Assuming that test worked, fire up your favorite PHP editor and open the "shopping.php" file in it. The file is pretty heavily commented, so hopefully it makes sense what's going on. If not, try reviewing PHP documentation or posting to our forum.
When you invoke screen-scraper as a server it creates log files corresponding to your scraping session in its "log" folder. Take a look in that folder for your "Shopping Site" log file and take a look through it. It should look similar to what you see when you run scraping sessions in the workbench.
![]() |
Invoking screen-scraper from Python |
Before we dig into the code let's review a few things related to invoking screen-scraper via Python. First, your Python code will need to refer to screen-scraper's Python driver, called "remote_scraping_session.py". You can find this file in the "misc\python\" folder of your screen-scraper installation, or you can download it here. You'll want to put a copy of that file into the directory where you plan on putting the Python file that will invoke screen-scraper.
In order to invoke screen-scraper from Python, screen-scraper needs to be running in server mode. If you'd like a refresher on how to start up screen-scraper in server mode go ahead and follow that link, then come back here.
Okay, let's try it out before we go over the code. Download the shopping.py.txt file here, then save it in the same directory where you copied the "remote_scraping_session.py" file. Rename the file from "shopping.py.txt" to "shopping.py". After that start up screen-scraper in server mode.
Run the command "python shopping.py" in your console. You'll be asked which keyword to search for. Type in a product keyword, such as "bug", then press the "Enter" key. If all goes well the program will take a little while to respond (it's waiting as screen-scraper extracts the data), then it will output the corresponding products.
If that didn't go quite as you expected here are some things to check:
Assuming that test worked, fire up your favorite Python editor and open the "shopping.py" file in it. The file is pretty heavily commented, so hopefully it makes sense what's going on. If not, try reviewing Python documentation or posting to our forum.
When you invoke screen-scraper as a server it creates log files corresponding to your scraping session in its "log" folder. Take a look in that folder for your "Shopping Site" log file and take a look through it. It should look similar to what you see when you run scraping sessions in the workbench.
![]() |
Invoking screen-scraper from VB.NET |
Before we dig into the code take a minute to review our Invoking screen-scraper via .NET documentation page. The VB file we'll be referring to can be downloaded here.
In order to invoke screen-scraper from VB.NET, screen-scraper needs to be running in server mode. If you'd like a refresher on how to start up screen-scraper in server mode go ahead and follow that link, then come back here.
Okay, let's try it out before we dive into the code. Start screen-scraper running as a server. From your .NET environment compile and execute the "shopping.vb" file.
If that didn't go quite as you expected here are some things to check:
Assuming that test worked, take a closer look over the "shopping.vb" class. The file is pretty heavily commented, so hopefully it makes sense what's going on. If not, try reviewing our .NET documentation or posting to our forum.
When you invoke screen-scraper as a server it creates log files corresponding to your scraping session in its "log" folder. Take a look in that folder for your "Shopping Site" log file and take a look through it. It should look similar to what you see when you run scraping sessions in the workbench.
![]() |
Where to Go From Here |
The approach we outline in this tutorial works great for relatively small sets of data. When we extract records from the shopping site we're probably not going to extract more than 25 or so. When screen-scraper extracts the data it is saved in memory (remember we checked the "Automatically save the data set generated by this extractor pattern in a session variable" check box for the "DETAILS" extractor pattern, which is what causes this to happen), so it works fine because there aren't that many products.
So what happens when we want to extract and save large numbers of records? The simple answer is that you need to save them out as they're extracted rather than having screen-scraper keep them in memory. Usually this means either inserting the scraped records into a database or writing them out to a text file. Tutorial 5 walks you through saving scraped data to a database. You might also find this FAQ helpful. Additionally, we provide an example in Tutorial 2 that illustrates how to write the data out to a file. Just remember that if you're writing the data out to a file you'll want to uncheck the box labeled "Automatically save the data set generated by this extractor pattern in a session variable" for the extractor pattern that pulls out the data you want to save. If it's checked, screen-scraper will store all of the data in memory, which could cause it to run out of memory while it's running.
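The idea of saving records as they're extracted, rather than collecting them all first, can be sketched in a few lines of Python. This is only an illustration and not screen-scraper code: the `scrape_products` generator, the field names, and the sample values are hypothetical stand-ins for whatever produces your extracted records.

```python
import csv
import io


def scrape_products():
    """Hypothetical stand-in for a scrape; yields one record at a time."""
    yield {"title": "A Bug's Life", "price": "35.99"}
    yield {"title": "Speed", "price": "39.99"}


def save_as_extracted(records, out):
    """Write each record as soon as it arrives instead of keeping a list."""
    writer = csv.writer(out)
    count = 0
    for record in records:
        writer.writerow([record["title"], record["price"]])
        count += 1
    return count


# io.StringIO stands in for an open output file.
buffer = io.StringIO()
written = save_as_extracted(scrape_products(), buffer)
```

Because only one record is held at a time, memory use stays flat no matter how many records the scrape produces.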
![]() |
Saving Scraped Data to a Database |
We continue on with our e-commerce site in this tutorial by inserting the data we scrape into a database. Generally once you've extracted data from a web site you want to save it out either to a file or a database. We already went over saving the data to a file in Tutorial 2, so here we'll cover inserting the information into a database.
If you haven't gone through Tutorial 2, or don't still have the scraping session you created in it, you can download it here and import it into screen-scraper.
Once you've got the scraping session and script imported into screen-scraper you're ready to roll. Click on the "Tutorial Details" link below to get going.
![]() |
Tutorial Details |
There are a number of ways to insert scraped data into a database, which we outline in this FAQ. Take a minute now to look through that. We'll be giving an example of the last option mentioned, which is one of the easier methods to implement.
If you're using the Enterprise Edition of screen-scraper, you should be aware of screen-scraper's ability to handle scraped data in real time (available only in the Enterprise Edition). As of right now, this has been implemented in the Java and PHP drivers for screen-scraper. If you're running the Enterprise Edition, and want to interact with screen-scraper using either of those languages, read over the "Handling Scraped Data in Real Time" section of either our Invoking screen-scraper from Java or Invoking screen-scraper from PHP pages for details on this. The current tutorial doesn't cover this approach, but it's quite a bit easier and cleaner to implement than the method that will be described here. Later on we'll likely create a tutorial for Enterprise Edition users that makes use of this approach.
The basic idea in this tutorial is that we'll have a special scrapeable file that will POST data to a PHP file, which will handle inserting the data into a database. The flow of events will look like this:
We'll start by modifying our existing "Shopping Site" scraping session a bit, adding to it the scrapeable file that will POST the data to our PHP file.
![]() |
Setting Up the Scraping Session |
We'll first modify our existing scraping session a bit to get it ready to save the scraped data to our database. First, click on the "Details page" scrapeable file in the tree on the left, then on the "Extractor Patterns" tab, then click the "Sub-Extractor Patterns" tab for our "DETAILS" extractor pattern. We're going to update each of our extractor pattern tokens so that they save their extracted values in a session variable. Do this by double-clicking each of them (e.g., on ~@TITLE@~) or right-clicking (control-clicking on Mac OS X) and selecting "Edit token". In the "Edit Token" box click the "Save in session variable?" check box, then close the "Edit Token" window. Do that for each extractor pattern token (~@TITLE@~, ~@PRICE@~, etc.).
We need to save the values in session variables so that we can use them as POST parameters in the scrapeable file that POSTs to our PHP file.
Let's create that scrapeable file now. Click on the "Shopping Site" scrapeable file in the tree on the left, then click the "Add Scrapeable File" button, found on the "General" tab. Once the scrapeable file appears give it the name "Save product". In the URL field enter:
http://www.screen-scraper.com/support/tutorials/tutorial5/db/save_product.php
Click on the "Parameters" tab for the new scrapeable file, and give it five POST parameters, as shown in the screen-shot below:

You might remember that the ~# #~ delimiters indicate that the value of the corresponding session variable should be substituted in. For example, in our case the value of the TITLE session variable (e.g., "A Bug's Life") will be substituted in for the ~#TITLE#~ token. This value will be the one that gets submitted to the PHP file so that it can be inserted into the database.
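To make the substitution concrete, here is a rough Python sketch of what replacing ~# #~ tokens with session variable values amounts to. It is not how screen-scraper implements it internally. The `substitute_tokens` helper is hypothetical, and replacing an unknown variable with an empty string is an assumption made only for this example.

```python
import re

# Matches tokens of the form ~#NAME#~ and captures NAME.
TOKEN = re.compile(r"~#(\w+)#~")


def substitute_tokens(template, session_variables):
    """Replace each ~#NAME#~ token with the matching session variable."""

    def lookup(match):
        # Assumption: an unknown variable becomes an empty string.
        return session_variables.get(match.group(1), "")

    return TOKEN.sub(lookup, template)


session_variables = {"TITLE": "A Bug's Life", "PRICE": "35.99"}
title_value = substitute_tokens("~#TITLE#~", session_variables)
```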
Finally, we need to create a simple script that will invoke our new scrapeable file. Click on the "New Script" button (looks like a pencil and paper) in the button bar. Give the script the name "Save product", and give it the "Script Text":
session.scrapeFile( "Save product" );
The script simply tells screen-scraper to invoke the "Save product" scrapeable file.
Now we need to tell screen-scraper when to invoke the scrapeable file. We need it invoked for each product, so that they all get saved to the database. As such, we'll invoke the script after the "Details page" is requested. Do this by clicking on the "Details page" scrapeable file in the tree on the left, then on the "Scripts" tab. Click the "Add Script" button, and in the "Script Name" column select "Save product". Under the "When to Run" column select "After file is scraped".
Okay, we're done setting up screen-scraper, so we're ready to give our scraping session a run. Before we invoke it, let's make one minor tweak so that the session doesn't take quite so long to run. In the "Shopping Site--initialize session" script, change the value for the "SEARCH" session variable from "dvd" to "bug". We'll get the two "Bug's Life" DVDs rather than every DVD in the system. Once you've done that click on the "Shopping Site" scraping session in the tree on the left, then on the "Run Scraping Session" button.
Once the scraping session has run its course, click on the "Save product" scrapeable file, then on the "Last Response" tab. You should see something like this for the response:
<?xml version="1.0" encoding="UTF-8"?>
<result>
<status>Success</status>
<product>
<title>A Bug\'s Life \"Multi Pak\"</title>
<price>35.99</price>
<manufactured_by>Warner</manufactured_by>
<model>DVD-ABUG</model>
<shipping_weight>7.00 lbs.</shipping_weight>
</product>
</result>
This indicates that the last product was successfully inserted.
Now it's time to take a closer look at the PHP file...
![]() |
The PHP File |
Hopefully this is obvious, but we could just as easily use a web script written in ASP, ColdFusion, or anything else that can be accessed via HTTP. In this tutorial we use PHP as an example simply because it's one of the most commonly used languages. Don't worry if you're not familiar with PHP, though; most of what we'll be going through is simple pseudo-code.
First, try accessing the PHP file directly here: http://www.screen-scraper.com/support/tutorials/tutorial5/db/save_product.php.
You'll notice that when the PHP script is accessed directly (i.e., not via a POST request) it simply displays a form that you can use to test it. Go ahead and try that now. Enter some bogus information, then submit the form. If you entered at least a title and a price you should get a small XML document that resembles the one you saw in screen-scraper. Go back to the form, leave the "Price" field blank, then re-submit the form. This time you get a message indicating that the data is incomplete.
When you submit the form you're taking exactly the same action that screen-scraper does when it invokes its "Save product" scrapeable file. The data is submitted to the PHP file via a POST request, validated, then inserted into the database.
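For readers who want to see what such a POST looks like outside of a browser or screen-scraper, here is a minimal Python sketch that builds the request without sending it. The parameter names ("title", "price") are assumptions modeled on the element names in the XML response, since the actual form field names aren't listed here.

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request

SAVE_URL = "http://www.screen-scraper.com/support/tutorials/tutorial5/db/save_product.php"


def build_save_request(product):
    """Build (but do not send) a form-encoded POST request for one product."""
    body = urlencode(product).encode("utf-8")
    # Supplying a data payload makes urllib treat the request as a POST.
    return Request(SAVE_URL, data=body)


# Hypothetical parameter names; check the real form for the actual ones.
request = build_save_request({"title": "A Bug's Life", "price": "35.99"})
```

Passing the request to `urllib.request.urlopen` would actually submit it; that step is left out so the sketch has no side effects.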
We've simplified our example some to mostly pseudo-code. Take a look over the code for the PHP file found in this zip file: http://www.screen-scraper.com/support/tutorials/tutorial5/db/save_product.php.zip. It's pretty heavily commented, so hopefully you can follow it even if you don't know PHP. Your code will obviously vary depending on the database you're using and any data validation you want to perform.
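If you'd rather see the same validate-then-insert idea in runnable form, the following Python sketch mirrors it using an in-memory SQLite database. It is not a translation of the PHP file: the table layout, the parameter names, and the wording of the failure status are all assumptions made for illustration.

```python
import sqlite3
from xml.sax.saxutils import escape


def save_product(connection, form):
    """Validate the posted fields, insert a row, and return an XML status."""
    title = form.get("title", "").strip()
    price = form.get("price", "").strip()
    if not title or not price:
        # Assumption: the real script's failure message may be worded differently.
        return "<result><status>Incomplete data</status></result>"
    # Placeholders keep quotes in titles such as "A Bug's Life" from breaking the SQL.
    connection.execute(
        "INSERT INTO products (title, price) VALUES (?, ?)", (title, price)
    )
    connection.commit()
    return (
        "<result><status>Success</status>"
        "<product><title>%s</title><price>%s</price></product></result>"
        % (escape(title), escape(price))
    )


# Hypothetical table, created in memory so the sketch leaves nothing behind.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE products (title TEXT, price TEXT)")
ok_response = save_product(connection, {"title": "A Bug's Life", "price": "35.99"})
bad_response = save_product(connection, {"title": "A Bug's Life", "price": ""})
row_count = connection.execute("SELECT COUNT(*) FROM products").fetchone()[0]
```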
Having your web application return some kind of status message allows you to handle error conditions and such within screen-scraper. In this case you would probably want to set up an extractor pattern for the "Save product" scrapeable file that might look something like this:
<status>~@STATUS@~</status>
You might then write a script that does something special in the case of an error.
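As a rough illustration of what that extractor pattern and follow-up script would do together, here is a Python sketch that pulls the status out of a response and checks it. The regular expression is only an approximation of the ~@STATUS@~ token, and the `extract_status` helper is hypothetical.

```python
import re

# Approximates the extractor pattern <status>~@STATUS@~</status>.
STATUS_PATTERN = re.compile(r"<status>(.*?)</status>", re.DOTALL)


def extract_status(response_text):
    """Return the text inside <status>...</status>, or None if it is absent."""
    match = STATUS_PATTERN.search(response_text)
    return match.group(1).strip() if match else None


sample_response = "<result>\n<status>Success</status>\n</result>"
status = extract_status(sample_response)
# A script could log or retry whenever the status is not "Success".
needs_attention = status != "Success"
```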
![]() |
Where to Go From Here |
The best way to proceed would probably be to try this on your own project. If you run into any glitches don't hesitate to post to our forum so that we can lend a hand.
![]() |
Generating an RSS/Atom Feed from a Product Search |
In this tutorial we'll go over configuring screen-scraper to generate an RSS or Atom feed based on extracted data. This functionality is available only in the Enterprise Edition of screen-scraper, so you'll need that edition to follow along. We will continue on using the "Shopping Site" scraping session we generated in Tutorial 2.
If you haven't already gone through Tutorial 2, this tutorial will make more sense if you do so first.
If you haven't gone through Tutorial 2, or don't still have the scraping session you created in it, you can download it here and import it into screen-scraper.
Once you've got the scraping session imported into screen-scraper you're ready to roll. Click on the "Tutorial Details" link below to get going.
![]() |
Tutorial Details |
Before going on, take a minute to read over the Generating RSS and Atom Feeds page in our documentation. That should give you a basic overview.
We're going to configure our "Shopping Site" scraping session so that it generates a feed of products based on a search parameter. That is, we'll give it a search keyword (e.g., "bug" or "dvd"), it will extract the product data, then create an XML feed out of the scraped data. For testing purposes we'll just access the XML feed from a web browser, though you could just as easily access it from an RSS/Atom reader.
![]() |
Setting Up the Scraping Session |
If you read over the Generating RSS and Atom Feeds page you can probably guess at how we'll need to modify the scraping session. Let's start by altering the name of the extractor pattern that grabs the product details. In screen-scraper click on the "Details page" scrapeable file for the "Shopping Site" scraping session, then click the "Extractor Patterns" tab. Change the name of the extractor pattern from "PRODUCTS" to "XML_FEED". This pattern will extract out the DataSet that will hold our entire feed. We'll now need to designate the fields for the individual items in the feed. Click on the "Sub-Extractor Patterns" tab for our feed. There are several fields we're extracting, but for the sake of simplicity we'll just deal with two of them. For the "TITLE" portion of our feed we're in luck because we already have a "TITLE" sub-extractor pattern. For the "DESCRIPTION" part of the feed item we're not currently extracting the full description from the product details page. Just for the sake of providing an example let's use the "MODEL" field instead. Change the name of the "MODEL" sub-extractor pattern to "DESCRIPTION" so that it looks like this:
>Model: ~@DESCRIPTION@~<
There are two more elements we need for our XML feed: "LINK" and "PUBLISHED_DATE". We're obviously not extracting either of these, so let's write a quick script to set them for us. Create a new script by clicking on the pencil and paper icon in the button bar. Give the script the name "Set URL and published date". Copy and paste this in for the text of the script:
// Set the "LINK" element to the URL of the current product details page.
dataRecord.put( "LINK", scrapeableFile.getCurrentURL() );
// Create a formatted date representing the current date.
dataRecord.put( "PUBLISHED_DATE", new Date() );
Once you've created the script associate it with the "XML_FEED" extractor pattern by clicking on the "Details page" scrapeable file, then on the "Extractor Patterns" tab. Click on the "Add Script" button, select "Set URL and published date" under the "Script Name" column, and "After each pattern application" under the "When to Run" column.
The script is fairly straightforward. We first set the "LINK" element to the URL of the product details page we're currently on. You'll notice that we're setting the value via the "put" method on the current DataRecord object. Because this script will get invoked for each pattern application the "dataRecord" object will be in scope. You'll likely remember from previous tutorials that the "dataRecord" object can be thought of as the current row on the spreadsheet of extracted data. Here we're simply adding a cell to the current row of the spreadsheet for the "LINK" element of the feed. The second element we set is the "PUBLISHED_DATE". For those unfamiliar with Java, passing it "new Date()" simply indicates that the feed item was published on the current date.
If you haven't done so previously, you'll also want to disable the "Shopping Site--initialize session" script. We'll be passing values in externally, and this script would otherwise overwrite those values. To disable the script, click on the "Shopping Site" scraping session in the tree on the left, then on the "Scripts" tab. Un-check the box in the table under the "Enabled?" column.
Take a minute now to save your work.
That's it for setting up the scraping session. We're now going to generate the feed.
![]() |
Generating the XML Feed |
Let's run a quick test just to make sure the scraping session works. After that, we'll add a few more bells and whistles. Start up screen-scraper as a server. If you need help on that try this page. Once that's up, assuming you haven't altered the default "Web/SOAP Server" port (which is also the web server port), and that you're running screen-scraper on your local machine, try entering this URL into your browser:
http://localhost:8779/ss/xmlfeed?scraping_session=Shopping+Site&SEARCH=bug
If all goes well the browser should take a little bit to load, then you should see an XML document appear containing the extracted information. If you got an error message or the document didn't appear as you expected it to, check screen-scraper's log. Just as with scraping sessions run remotely, screen-scraper will create a log file in its "log" folder corresponding to each RSS/Atom scraping session.
Dealing with the URL directly can be a bit cryptic, what with the encoding and all. As such, let's make use of a little HTML file that will allow us to generate feeds using different search parameters and formats. You can access it here. Note that this HTML file assumes that you're running screen-scraper as a server on your local machine on port 8779. If any of that isn't the case you'll want to download the HTML file to your local machine, alter it with your settings, then open it back up in your browser.
Try experimenting with the form a bit. It gives you control over almost all of the features that are available, including the format of the feed. Also take a close look at the URL. screen-scraper simply converts the GET parameters in the URL to session variables in the scraping session. If you'd like, you can even open the feed in your favorite RSS/Atom reader to ensure that the format is valid.
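If you'd rather build the feed URL in code than by hand, the encoding can be left to a library. Here is a small Python sketch under the same assumptions as above (screen-scraper running locally on port 8779); the `build_feed_url` helper is hypothetical and is not part of screen-scraper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_feed_url(scraping_session, host="localhost", port=8779, **session_variables):
    """Build an xmlfeed URL; extra keyword arguments become GET parameters."""
    params = {"scraping_session": scraping_session}
    params.update(session_variables)
    # urlencode handles the encoding, e.g. the space in "Shopping Site" becomes "+".
    return "http://%s:%d/ss/xmlfeed?%s" % (host, port, urlencode(params))


feed_url = build_feed_url("Shopping Site", SEARCH="bug")
# Decoding the query string shows the name/value pairs the server receives.
decoded = parse_qs(urlsplit(feed_url).query)
```

Here `feed_url` comes out identical to the URL entered in the browser earlier, and `decoded` shows the name/value pairs that screen-scraper turns into session variables.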
![]() |
Where to Go From Here |
The ability to generate RSS/Atom feeds directly from scraped data opens up quite a few interesting possibilities. Where you take things from this point is left to your imagination...
![]() |
Scraping a Site Multiple Times Based on Search Terms |
It's often the case in screen-scraping that you want to submit a form multiple times using different parameters each time. For example, you may be extracting locations from the "store locator" service on a site, and need to submit the form for a series of zip codes. In this tutorial we'll provide an example on how to go about that. We will continue on using the "Shopping Site" scraping session we generated in Tutorial 2.
If you haven't already gone through Tutorial 2, this tutorial will make more sense if you do so first.
If you haven't gone through Tutorial 2, or don't still have the scraping session you created in it, you can download it here and import it into screen-scraper.
Once you've got the scraping session imported into screen-scraper you're ready to roll. Click on the "Tutorial Details" link below to get going.
![]() |
Tutorial Details |
Our "Shopping Site" example is pretty limited in that it can only handle one search term. What if we want to extract products for multiple search terms? For example, we may want to scrape various DVD titles that would fit with the other titles in our collection. We could search for the new DVDs using a series of keywords.
We're going to alter the existing "Shopping Site" scraping session so that it reads in a file containing search terms, and performs a search for each one. Just as before, as it performs a search it will follow the "details" links and extract out information for each product. Once the information is extracted it will write it out to a file.
![]() |
Altering the Scraping Session |
The changes we'll be making to our "Shopping Site" scraping session in order to add this new functionality are actually pretty minor. First, let's deal with the trickiest part (which really isn't all that tricky): creating the script that will read in the file containing our search terms, and run each search.
Create a new script by clicking the pencil and paper icon in the button bar. Give the script the name "Read search terms". Leave the "Language" drop-down list with the value "Interpreted Java". Paste in the following for the content of the script:
// Create a file object that will point to the file containing |
The script is pretty heavily commented, so it may be apparent what's going on, but let's walk through it a bit, just in case.
First off we create a few objects that are going to allow us to read in search terms from a file called "search_terms.txt". We read the search terms in line-by-line in a "while" loop. For each search term we're going to invoke the scrapeable file "Search results". You might remember that the "Search results" scrapeable file is the one that handles issuing the search to the e-commerce web site, and walks through all of the search results pages. It also has an extractor pattern that pulls the details links, following each one of those to the "Details page" scrapeable file.
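The loop described above can be sketched in Python to show its overall shape. This is not the screen-scraper script itself, which is written in Interpreted Java. The `run_search` callback is a hypothetical stand-in for invoking the "Search results" scrapeable file, and skipping blank lines is an assumption.

```python
import io


def run_searches(lines, run_search):
    """Call run_search once for every non-blank search term, in order."""
    count = 0
    for line in lines:
        term = line.strip()
        if not term:
            # Assumption: blank lines in the terms file are ignored.
            continue
        run_search(term)
        count += 1
    return count


# io.StringIO stands in for an open "search_terms.txt" file.
terms_file = io.StringIO("bug\nspeed\nblade\n")
searched = []
total = run_searches(terms_file, searched.append)
```

Each call to `run_search` corresponds to one full pass through the search results pages and their details pages for that term.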
That might sound a bit complicated, so let's put the rest of the pieces in place, run it, then walk through it again.
There are just a few more modifications we need to make. Please do the following:
That should do it. Click ahead to finalize setup and run the scraping session.
![]() |
Running the Scraping Session |
The last item we need to take care of is creating the text file that will contain our search terms. Let's keep it simple. Fire up your favorite text editor and create a file called "search_terms.txt" inside of screen-scraper's installation folder (e.g., "C:\Program Files\screen-scraper professional edition\search_terms.txt"). Add the following three lines to the text file:
bug
speed
blade
Those search terms should yield at least a few DVDs we can add to our collection.
All right, now's the moment of truth. Run the new scraping session by clicking on it in screen-scraper and clicking the "Run Scraping Session" button, then watch the "Log" tab to see it do its thing. If all goes well, once it's done, you should have a "dvds.txt" file in screen-scraper's install folder containing scraped data for all of the search terms.
Take a look carefully through the log. If it all seems to make sense, you're done. If not, read on so that we can walk through it a bit more carefully.
The flow of events goes like this, once you hit the "Run Scraping Session" button:
Remember that the "Log" tab is key to understanding the flow of events in screen-scraper. If you're still a bit fuzzy on how things are working, try looking more carefully through the log to piece together how the site is being scraped.
![]() |
Where to Go From Here |
At this point feel free to experiment a bit. You may want to try adding a few more search terms to the "search_terms.txt" file.
Probably the best way to extend on what this tutorial covers would be to try your own project. If you're faced with the task of scraping a web site multiple times for various numbers or search keywords, chances are the scraping session you'll create won't differ too significantly from the one we've presented here.