Tutorial 4: Scraping an E-commerce Site from External Programs

Tutorial Overview

This tutorial illustrates invoking screen-scraper from other programs in ways more complex than those presented in Tutorial 3. From our external program we'll pass search parameters to screen-scraper, invoke the scraping process, retrieve the scraped data, then iterate over it and output it within our application.

Before proceeding it would be a good idea to go through Tutorial 2, if you haven't done so already.

If you haven't gone through Tutorial 2, or don't still have the scraping session you created in it, you can download and load it into screen-scraper by following these steps:

  1. Download the zip file located here and unzip it. You should now have an "interpreted_java" directory and a "vbscript" directory.
  2. If you're running Windows, and prefer to program in VBScript, import the "Shopping Site (Scraping Session).sss" scraping session located in the "vbscript" directory; otherwise, import the one located in the "interpreted_java" directory. Instructions on importing objects into screen-scraper can be found here.

Once you've got the scraping session imported into screen-scraper you're ready to roll. Click on the "Tutorial Details" link below to get going.

Tutorial 4: Page 2: Tutorial Details

Tutorial Details

screen-scraper can be invoked from software applications written in most modern programming languages, including Java, Active Server Pages, PHP, .NET, and anything that supports SOAP. In this tutorial we'll give some examples of applications that do just that.

Our application will pass parameters to screen-scraper corresponding to login information as well as a key phrase for which to search. As in the second tutorial, we're going to pretend that the web site requires us to log in before we can search, for the sake of providing an example, even though it actually doesn't. Once we pass the parameters to screen-scraper we'll tell it to start scraping. screen-scraper will then run the scraping session using the parameters we gave it, extracting the data it normally does. Once it's done, we'll ask it for the extracted information, then output it for the user to see.
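The flow just described can be sketched in Python with a stand-in class. Everything here is hypothetical: "StandInSession", its method names, the "SEARCH" variable name, and the canned product records are assumptions made for illustration, not the API of screen-scraper's actual drivers. The sample files on the language-specific pages show the real calls.

```python
class StandInSession:
    """Hypothetical stand-in for a remote scraping session (not the real driver)."""

    def __init__(self, name):
        self.name = name
        self.variables = {}

    def set_variable(self, key, value):
        # Store a session variable to be used while scraping.
        self.variables[key] = value

    def scrape(self):
        # A real session would run the scraping session on the server here.
        # This stand-in just makes up two records for the search phrase.
        phrase = self.variables.get("SEARCH", "")
        self.variables["PRODUCTS"] = [
            {"TITLE": f"{phrase} item {n}", "PRICE": f"{n}.99"} for n in (1, 2)
        ]

    def get_variable(self, key):
        return self.variables.get(key)


def run_search(phrase):
    session = StandInSession("Shopping Site")
    # 1. Pass in the login details and the search phrase.
    session.set_variable("EMAIL_ADDRESS", "test@test.com")
    session.set_variable("PASSWORD", "testing")
    session.set_variable("SEARCH", phrase)
    # 2. Start scraping and wait for it to finish.
    session.scrape()
    # 3. Ask for the extracted data and turn each record into an output line.
    products = session.get_variable("PRODUCTS") or []
    return [f"{p['TITLE']}: {p['PRICE']}" for p in products]


if __name__ == "__main__":
    for line in run_search("bug"):
        print(line)
```

The point of the sketch is the order of operations: set the variables first, scrape second, read the results last.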

Before we begin we'll first need to make a couple of minor changes to the e-commerce scraping session from the second tutorial. If you haven't already, start up screen-scraper. Under the "Shopping Site" scraping session click on the "Login" scrapeable file, then on the "Parameters" tab. We're going to alter the "email_address" and "password" POST parameters so that we can pass those parameters in rather than hard-coding them. For the "email_address" parameter change the value "test@test.com" to ~#EMAIL_ADDRESS#~, and change the "testing" value for the "password" parameter to ~#PASSWORD#~. You might remember from Tutorial 2 that tokens surrounded by the ~# #~ delimiters indicate that the value of a session variable should be inserted. For example, in our case we're going to create an "EMAIL_ADDRESS" session variable and give it the value "test@test.com" so that screen-scraper substitutes it in for the corresponding POST parameter at runtime.
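To make the token idea concrete, here is a minimal Python sketch of that kind of substitution. It only illustrates the concept and is not screen-scraper's own implementation; the "substitute" function and the choice to leave unknown tokens untouched are assumptions.

```python
import re

# Matches tokens such as ~#EMAIL_ADDRESS#~ and captures the variable name.
TOKEN = re.compile(r"~#(\w+)#~")


def substitute(text, session_variables):
    """Replace each ~#NAME#~ token with the matching session variable."""
    def lookup(match):
        name = match.group(1)
        # Leave the token untouched when no variable of that name is set
        # (an assumption; screen-scraper's own behavior may differ).
        return session_variables.get(name, match.group(0))
    return TOKEN.sub(lookup, text)


variables = {"EMAIL_ADDRESS": "test@test.com", "PASSWORD": "testing"}
print(substitute("email_address=~#EMAIL_ADDRESS#~", variables))
print(substitute("password=~#PASSWORD#~", variables))
```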

In addition, click on the "Details page" scrapeable file. On the "PRODUCTS" extractor pattern, select the "Advanced" tab and check the box next to "Automatically save the data set generated by this extractor pattern in a session variable."

The code that we'll be writing in our external application will also be essentially taking the place of the current "Shopping Site--initialize session" script. Let's disable that since it would otherwise overwrite the values we'll be passing in externally. To do that click on the "Shopping Site" scraping session in the tree on the left, then on the "Scripts" tab. In the scripts table, un-check the "Enabled?" check box for the "Shopping Site--initialize session" script. Save your changes and exit screen-scraper.

Where you go next depends on which programming language you're interested in. Use one of the links below according to your preference.

Tutorial 4: Page 3: Invoking screen-scraper from ASP

Invoking screen-scraper from ASP

In order to invoke screen-scraper from ASP, screen-scraper needs to be running in server mode. If you'd like a refresher on how to start up screen-scraper in server mode go ahead and follow that link, then come back here.

Okay, let's try it out before we go over the code. Download the shopping.asp file here, then save it to a directory where it will be web-accessible (i.e., within your IIS web dir). After that start up screen-scraper in server mode.

Open up your web browser and go to the URL corresponding to the "shopping.asp" file (e.g., http://localhost/screen-scraper/shopping.asp). You'll see a simple search form. Type in a product keyword, such as "bug", then hit the "Go" button. If all goes well the page will take a little while to load (it's waiting as screen-scraper extracts the data), then it will output the corresponding products.

If that didn't go quite as you expected here are some things to check:

  • Make sure screen-scraper is running as a server, and that nothing is blocking its ports (such as a firewall running on your machine).
  • If you're running screen-scraper on a different machine than the one your ASP file resides on, make sure that screen-scraper is allowing connections from the ASP machine. In the screen-scraper workbench click on the wrench icon, then on the "Servers" button, and check that the "Hosts to allow to connect" field includes the IP address (or perhaps just the first part of the IP address) of the ASP machine.
  • Check screen-scraper's "log" folder for a "Shopping Site" log file. If you find one it means that screen-scraper is at least receiving the request. Open the log file in a text editor to see if you find any error messages.
  • If you still can't seem to get it to work feel free to drop us a support request.
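For the first item in the checklist above, a quick way to see whether anything is blocking the server's port is to try opening a TCP connection to it. The following Python sketch does just that. The port number 8778 is an assumption; substitute whichever port your screen-scraper server is configured to listen on.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 8778 is an assumed port number; use whatever port your server listens on.
    print(port_is_reachable("localhost", 8778))
```

A result of False only tells you the connection failed; it could be a firewall, or the server may simply not be running.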

Assuming that test worked, fire up your favorite ASP editor and open the "shopping.asp" file in it. The file is pretty heavily commented, so hopefully it makes sense what's going on. If not, try reviewing our COM documentation, posting to our forum, or sending us a support request.

When you invoke screen-scraper as a server it creates log files corresponding to your scraping session in its "log" folder. Take a look in that folder for your "Shopping Site" log file and take a look through it. It should look similar to what you see when you run scraping sessions in the workbench.

Tutorial 4: Page 3: Invoking screen-scraper from C#.NET

Invoking screen-scraper from C#.NET

Before we dig into the code take a minute to review our Invoking screen-scraper via .NET documentation page. The C# file we'll be referring to can be downloaded here.

In order to invoke screen-scraper from C#.NET, screen-scraper needs to be running in server mode. If you'd like a refresher on how to start up screen-scraper in server mode go ahead and follow that link, then come back here.

Okay, let's try it out before we dive into the code. Start screen-scraper running as a server. From your .NET environment compile and execute the "shopping.cs" file.

If that didn't go quite as you expected here are some things to check:

  • Make sure screen-scraper is running as a server, and that nothing is blocking its ports (such as a firewall running on your machine).
  • If you're running screen-scraper on a different machine than the one your C# class resides on, make sure that screen-scraper is allowing connections from the C# machine. In the screen-scraper workbench click on the wrench icon, then on the "Servers" button, and check that the "Hosts to allow to connect" field includes the IP address (or perhaps just the first part of the IP address) of the C# machine.
  • Check screen-scraper's "log" folder for a "Shopping Site" log file. If you find one it means that screen-scraper is at least receiving the request. Open the log file in a text editor to see if you find any error messages.
  • If you still can't seem to get it to work feel free to drop us a support request.
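The "first part of the IP address" hint in the checklist suggests that allowed hosts are matched by prefix. Here is a small Python sketch of that idea, useful for reasoning about which entry to add. Both the prefix matching and the comma-separated format are assumptions made for illustration, so treat this as a mental model rather than a description of screen-scraper's actual check.

```python
def host_is_allowed(client_ip, allowed_hosts):
    """Return True if client_ip starts with any entry in allowed_hosts.

    allowed_hosts is a comma-separated string such as "127.0.0.1,192.168."
    (the separator is an assumption made for this sketch).
    """
    entries = [entry.strip() for entry in allowed_hosts.split(",")]
    return any(entry and client_ip.startswith(entry) for entry in entries)


print(host_is_allowed("192.168.1.20", "127.0.0.1,192.168."))  # True
print(host_is_allowed("10.0.0.5", "127.0.0.1,192.168."))      # False
```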

Assuming that test worked, take a closer look over the "shopping.cs" class. The file is pretty heavily commented, so hopefully it makes sense what's going on. If not, try reviewing our .NET documentation, posting to our forum, or sending us a support request.

When you invoke screen-scraper as a server it creates log files corresponding to your scraping session in its "log" folder. Take a look in that folder for your "Shopping Site" log file and take a look through it. It should look similar to what you see when you run scraping sessions in the workbench.

Tutorial 4: Page 3: Invoking screen-scraper from Cold Fusion

Invoking screen-scraper from Cold Fusion

Before we dig into the code you'll probably want to take a minute to review our Invoking screen-scraper from ColdFusion documentation page. Remember that you need to add the "screen-scraper.jar" file to your classpath in order to be able to interact with screen-scraper.

In order to invoke screen-scraper from Cold Fusion, screen-scraper needs to be running in server mode. If you'd like a refresher on how to start up screen-scraper in server mode go ahead and follow that link, then come back here.

Okay, let's try it out before we go over the code. Download the shopping.cfm.txt file here, then save it in a directory that will be accessible from your web server. Rename the file from "shopping.cfm.txt" to "shopping.cfm". After that start up screen-scraper in server mode.

Open up your web browser and go to the URL corresponding to the "shopping.cfm" file (e.g., http://localhost/screen-scraper/shopping.cfm). You'll see a simple search form. Type in a product keyword, such as "bug", then hit the "Go" button. If all goes well the page will take a little while to load (it's waiting as screen-scraper extracts the data), then it will output the corresponding products.

If that didn't go quite as you expected here are some things to check:

  • Make sure screen-scraper is running as a server, and that nothing is blocking its ports (such as a firewall running on your machine).
  • If you're running screen-scraper on a different machine than the one your Cold Fusion file resides on, make sure that screen-scraper is allowing connections from the Cold Fusion machine. In the screen-scraper workbench click on the wrench icon, then on the "Servers" button, and check that the "Hosts to allow to connect" field includes the IP address (or perhaps just the first part of the IP address) of the Cold Fusion machine.
  • Ensure that the permissions on the "shopping.cfm" file are such that your web server can execute it.
  • Check screen-scraper's "log" folder for a "Shopping Site" log file. If you find one it means that screen-scraper is at least receiving the request. Open the log file in a text editor to see if you find any error messages.
  • If you still can't seem to get it to work feel free to drop us a support request.

Assuming that test worked, fire up your favorite Cold Fusion editor and open the "shopping.cfm" file in it. The file is pretty heavily commented, so hopefully it makes sense what's going on. If not, try reviewing Cold Fusion documentation, posting to our forum, or sending us a support request.

When you invoke screen-scraper as a server it creates log files corresponding to your scraping session in its "log" folder. Take a look in that folder for your "Shopping Site" log file and take a look through it. It should look similar to what you see when you run scraping sessions in the workbench.

Tutorial 4: Page 3: Invoking screen-scraper from Java

Invoking screen-scraper from Java

Before we dig into the code let's review a few things related to invoking screen-scraper via Java. First, your Java code will need to have two jars in its classpath: screen-scraper.jar (found in the root screen-scraper install folder) and log4j.jar (found in screen-scraper's "lib" folder). For convenience we've packaged all of the files you'll need in this zip file. Download that file and unzip it. You'll notice that we also include an Ant build file that you can use to compile and run the sample class.

In order to invoke screen-scraper from Java, screen-scraper needs to be running in server mode. If you'd like a refresher on how to start up screen-scraper in server mode go ahead and follow that link, then come back here.

Okay, let's try it out before we dive into the code. Start screen-scraper running as a server. If you're using Ant simply type "ant run" at a command prompt inside of the folder where the build.xml file is found.

If that didn't go quite as you expected here are some things to check:

  • Make sure screen-scraper is running as a server, and that nothing is blocking its ports (such as a firewall running on your machine).
  • If you're running screen-scraper on a different machine than the one your Java class resides on, make sure that screen-scraper is allowing connections from the Java machine. In the screen-scraper workbench click on the wrench icon, then on the "Servers" button, and check that the "Hosts to allow to connect" field includes the IP address (or perhaps just the first part of the IP address) of the Java machine.
  • Check screen-scraper's "log" folder for a "Shopping Site" log file. If you find one it means that screen-scraper is at least receiving the request. Open the log file in a text editor to see if you find any error messages.
  • If you still can't seem to get it to work feel free to drop us a support request.

Assuming that test worked, fire up your favorite Java editor and open the "Shopping.java" file in it. The file is pretty heavily commented, so hopefully it makes sense what's going on. If not, try reviewing our Java documentation, posting to our forum, or sending us a support request.

When you invoke screen-scraper as a server it creates log files corresponding to your scraping session in its "log" folder. Take a look in that folder for your "Shopping Site" log file and take a look through it. It should look similar to what you see when you run scraping sessions in the workbench.
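If the log file is long, a few lines of Python can pull out the lines worth reading first. This is a rough sketch: it assumes the log file names begin with the scraping session name and that problems are flagged with the word "error", neither of which is guaranteed.

```python
from pathlib import Path


def find_error_lines(log_dir, session_name="Shopping Site"):
    """Collect lines mentioning "error" from log files for one scraping session."""
    hits = []
    # Assumes log file names start with the scraping session name.
    for log_file in sorted(Path(log_dir).glob(f"{session_name}*")):
        if not log_file.is_file():
            continue
        with log_file.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if "error" in line.lower():
                    hits.append((log_file.name, number, line.rstrip()))
    return hits


if __name__ == "__main__":
    # "log" is relative to the screen-scraper install folder (adjust as needed).
    for name, number, line in find_error_lines("log"):
        print(f"{name}:{number}: {line}")
```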

Tutorial 4: Page 3: Invoking screen-scraper from PHP

Invoking screen-scraper from PHP

Before we dig into the code let's review a few things related to invoking screen-scraper via PHP. First, your PHP code will need to refer to screen-scraper's PHP driver, called "remote_scraping_session.php". You can find this file in the "misc\php\" folder of your screen-scraper installation. You'll want to put a copy of that file into the directory where you plan on putting the PHP file that will invoke screen-scraper.

In order to invoke screen-scraper from PHP, screen-scraper needs to be running in server mode. If you'd like a refresher on how to start up screen-scraper in server mode go ahead and follow that link, then come back here.

Okay, let's try it out before we go over the code. Download the shopping.php.txt file here, then save it in the same directory where you copied the "remote_scraping_session.php" file. Rename the file from "shopping.php.txt" to "shopping.php". After that start up screen-scraper in server mode.

Open up your web browser and go to the URL corresponding to the "shopping.php" file (e.g., http://localhost/screen-scraper/shopping.php). You'll see a simple search form. Type in a product keyword, such as "bug", then hit the "Go" button. If all goes well the page will take a little while to load (it's waiting as screen-scraper extracts the data), then it will output the corresponding products.

If that didn't go quite as you expected here are some things to check:

  • Make sure screen-scraper is running as a server, and that nothing is blocking its ports (such as a firewall running on your machine).
  • If you're running screen-scraper on a different machine than the one your PHP file resides on, make sure that screen-scraper is allowing connections from the PHP machine. In the screen-scraper workbench click on the wrench icon, then on the "Servers" button, and check that the "Hosts to allow to connect" field includes the IP address (or perhaps just the first part of the IP address) of the PHP machine.
  • Ensure that the permissions on the "shopping.php" and "remote_scraping_session.php" files are such that your web server can execute them.
  • Check screen-scraper's "log" folder for a "Shopping Site" log file. If you find one it means that screen-scraper is at least receiving the request. Open the log file in a text editor to see if you find any error messages.
  • If you still can't seem to get it to work feel free to drop us a support request.

Assuming that test worked, fire up your favorite PHP editor and open the "shopping.php" file in it. The file is pretty heavily commented, so hopefully it makes sense what's going on. If not, try reviewing PHP documentation, posting to our forum, or sending us a support request.

When you invoke screen-scraper as a server it creates log files corresponding to your scraping session in its "log" folder. Take a look in that folder for your "Shopping Site" log file and take a look through it. It should look similar to what you see when you run scraping sessions in the workbench.

Tutorial 4: Page 3: Invoking screen-scraper from Python

Invoking screen-scraper from Python

Before we dig into the code let's review a few things related to invoking screen-scraper via Python. First, your Python code will need to refer to screen-scraper's Python driver, called "remote_scraping_session.py". You can find this file in the "misc\python\" folder of your screen-scraper installation, or you can download it here. You'll want to put a copy of that file into the directory where you plan on putting the Python file that will invoke screen-scraper.

In order to invoke screen-scraper from Python, screen-scraper needs to be running in server mode. If you'd like a refresher on how to start up screen-scraper in server mode go ahead and follow that link, then come back here.

Okay, let's try it out before we go over the code. Download the shopping.py.txt file here, then save it in the same directory where you copied the "remote_scraping_session.py" file. Rename the file from "shopping.py.txt" to "shopping.py". After that start up screen-scraper in server mode.

Run the command "python shopping.py" in your console. You'll be asked which keyword to search. Type in a product keyword, such as "bug", then press the "Enter" key. If all goes well the program will take a little while to respond (it's waiting as screen-scraper extracts the data), then it will output the corresponding products.

If that didn't go quite as you expected here are some things to check:

  • Make sure screen-scraper is running as a server, and that nothing is blocking its ports (such as a firewall running on your machine).
  • If you're running screen-scraper on a different machine than the one your Python file resides on, make sure that screen-scraper is allowing connections from the Python machine. In the screen-scraper workbench click on the wrench icon, then on the "Servers" button, and check that the "Hosts to allow to connect" field includes the IP address (or perhaps just the first part of the IP address) of the Python machine.
  • Ensure that the permissions on the "shopping.py" and "remote_scraping_session.py" files are such that you can execute them.
  • Check screen-scraper's "log" folder for a "Shopping Site" log file. If you find one it means that screen-scraper is at least receiving the request. Open the log file in a text editor to see if you find any error messages.
  • If you still can't seem to get it to work feel free to drop us a support request.
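For the permissions item in the checklist above, a short script can confirm that both files exist and are readable by the user running them. This sketch assumes that read access is what matters when a script is started with "python shopping.py"; "unreadable_files" is a helper name invented for the example.

```python
import os


def unreadable_files(paths):
    """Return the paths that are missing or not readable by the current user."""
    return [p for p in paths if not (os.path.isfile(p) and os.access(p, os.R_OK))]


if __name__ == "__main__":
    problems = unreadable_files(["shopping.py", "remote_scraping_session.py"])
    print("OK" if not problems else f"Check these files: {problems}")
```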

Assuming that test worked, fire up your favorite Python editor and open the "shopping.py" file in it. The file is pretty heavily commented, so hopefully it makes sense what's going on. If not, try reviewing Python documentation, posting to our forum, or sending us a support request.

When you invoke screen-scraper as a server it creates log files corresponding to your scraping session in its "log" folder. Take a look in that folder for your "Shopping Site" log file and take a look through it. It should look similar to what you see when you run scraping sessions in the workbench.

Tutorial 4: Page 3: Invoking screen-scraper from VB.NET

Invoking screen-scraper from VB.NET

Before we dig into the code take a minute to review our Invoking screen-scraper via .NET documentation page. The VB file we'll be referring to can be downloaded here.

In order to invoke screen-scraper from VB.NET, screen-scraper needs to be running in server mode. If you'd like a refresher on how to start up screen-scraper in server mode go ahead and follow that link, then come back here.

Okay, let's try it out before we dive into the code. Start screen-scraper running as a server. From your .NET environment compile and execute the "shopping.vb" file.

If that didn't go quite as you expected here are some things to check:

  • Make sure screen-scraper is running as a server, and that nothing is blocking its ports (such as a firewall running on your machine).
  • If you're running screen-scraper on a different machine than the one your VB class resides on, make sure that screen-scraper is allowing connections from the VB machine. In the screen-scraper workbench click on the wrench icon, then on the "Servers" button, and check that the "Hosts to allow to connect" field includes the IP address (or perhaps just the first part of the IP address) of the VB machine.
  • Check screen-scraper's "log" folder for a "Shopping Site" log file. If you find one it means that screen-scraper is at least receiving the request. Open the log file in a text editor to see if you find any error messages.
  • If you still can't seem to get it to work feel free to drop us a support request.

Assuming that test worked, take a closer look over the "shopping.vb" class. The file is pretty heavily commented, so hopefully it makes sense what's going on. If not, try reviewing our .NET documentation, posting to our forum, or sending us a support request.

When you invoke screen-scraper as a server it creates log files corresponding to your scraping session in its "log" folder. Take a look in that folder for your "Shopping Site" log file and take a look through it. It should look similar to what you see when you run scraping sessions in the workbench.

Tutorial 4: Page 4: Where to Go From Here

Where to Go From Here

The approach we outline in this tutorial works great for relatively small sets of data. When we extract records from the shopping site we're probably not going to extract more than 25 or so. When screen-scraper extracts the data it is saved in memory (remember we checked the "Automatically save the data set generated by this extractor pattern in a session variable" check box for the "PRODUCTS" extractor pattern, which is what causes this to happen), so it works fine because there aren't that many products.

So what happens when we want to extract and save large numbers of records? The simple answer is that you need to save them out as they're extracted rather than having screen-scraper keep them in memory. Usually this means either inserting the scraped records into a database or writing them out to a text file. We'll soon have a tutorial up that gives an example of saving records to a database. For now, take a look at this FAQ. We also provide an example in Tutorial 2 that illustrates how to write the data out to a file. Just remember that if you're writing the data out to a file you'll want to uncheck the box labeled "Automatically save the data set generated by this extractor pattern in a session variable" for the extractor pattern that pulls out the data you want to save. If it's checked it will cause screen-scraper to store all of the data in memory, which could cause it to run out of memory while it's running.
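The difference between keeping records in memory and saving them as they're extracted can be sketched in a few lines of Python. This is a generic illustration, not screen-scraper code: "fake_extracted_records" is a hypothetical stand-in for records arriving one at a time, and the "TITLE" and "PRICE" field names are made up for the example.

```python
import csv


def save_records(records, out_file, fieldnames):
    """Write each record as soon as it arrives; return how many were written."""
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    count = 0
    for record in records:  # records may be a generator; nothing is accumulated
        writer.writerow(record)
        count += 1
    return count


def fake_extracted_records(n):
    # Hypothetical stand-in for records arriving one at a time from a scrape.
    for i in range(1, n + 1):
        yield {"TITLE": f"Product {i}", "PRICE": f"{i}.00"}


if __name__ == "__main__":
    with open("products.csv", "w", newline="", encoding="utf-8") as handle:
        print(save_records(fake_extracted_records(1000), handle, ["TITLE", "PRICE"]))
```

Because each record is written and then discarded, memory use stays roughly flat no matter how many records come through.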