Tutorial 3: Extending Hello World

This tutorial continues where Tutorial 1: Hello World left off. It covers richer scripting and interacting with screen-scraper from external languages, including Active Server Pages, PHP, and Java.

If you haven't completed the first tutorial, don't worry. You will, however, need to import the script and scraping session that were created in it. To do that, follow these directions:

  1. Download the zip file located here and unzip it. You should now have an "interpreted_java" directory and a "vbscript" directory.
  2. If you're running Windows, and prefer to program in VBScript, import the "Hello World (Scraping Session).xml" scraping session located in the "vbscript" directory; otherwise, import the one located in the "interpreted_java" directory. Instructions on importing objects into screen-scraper can be found here.

The following scraping session is the completed version of the Tutorial 3 scraping session.

Attachment: Hello World (Scraping Session).sss (3.06 KB)

Tutorial 3: Page 2: Embedding Session Variables

A significant limitation of our first "Hello World" project was that we could only scrape the text from our first request. That is, we were always scraping the text "Hello World!", which really isn't that useful. We'll now adjust our setup so that we can designate the text to be submitted in the form.

At this point we're going to set a session variable that will hold the text we'd like submitted in the form. Within screen-scraper, session variables are used to transfer information between scripts, scrapeable files, and other objects. Session variables are generally set from within scripts, but can also be automatically set within extractor patterns as well as passed in from external applications.

We'll now set up a script to set a session variable before our scraping session runs. Create a new script as you've done before, and call it "Initialize scraping session". If you prefer to script in Interpreted Java, use the following for the body of the script:

// Put the text to be submitted in the form into a
// session variable so we can reference it later.
session.setVariable( "TEXT_TO_SUBMIT", "Hi everybody!" );

If you wrote the script in VBScript, make it look like this:

' Put the text to be submitted in the form into a
' session variable so we can reference it later.
session.SetVariable "TEXT_TO_SUBMIT", "Hi everybody!"

Hopefully the scripts seem straightforward. Each one simply sets a session variable named "TEXT_TO_SUBMIT" and gives it the value "Hi everybody!" (spoken, of course, in your best Dr. Nick voice).

Setting the session variable "TEXT_TO_SUBMIT" will allow us to access that value in other scripts and scrapeable files while our "Hello World" scraping session is running.

We'll now need to associate our script with our scraping session so that it gets invoked before the scraping session begins. To do that, click on the scraping session in the tree on the left, then on the "Scripts" tab. Click the "Add Script" button to add a script. In the "Script Name" column select "Initialize scraping session". The "When to Run" column should show "Before scraping session begins", and the "Enabled" checkbox should be checked. This will cause our script to get executed at the very beginning of the scraping session so that the "TEXT_TO_SUBMIT" session variable can get set.

Just as we use special tokens in extractor patterns to designate values we'd like to extract, we use special tokens to insert values of session variables into the URLs or parameters (GET, POST, or BASIC authentication) of scrapeable files. We'll do this now by embedding it into one of the parameters of our only scrapeable file. Expand the "Hello World" scraping session in the tree on the left, then select the "Form submission" scrapeable file. Click on the "Parameters" tab. In the "Value" column for our "text_string" parameter replace the text "Hello world!" with the text:

~#TEXT_TO_SUBMIT#~

The ~# and #~ delimiters are used to designate a session variable whose value should be inserted into that location when the scrapeable file gets executed. When the scrapeable file gets invoked, screen-scraper will construct the URL by including the "text_string" parameter in it. In other words, the URL for our scrapeable file will become this:

http://www.screen-scraper.com/screen-scraper/tutorial/basic_form.php?text_string=Hi+everybody%21
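To make the substitution concrete, here is a minimal Java sketch of the idea. It is not screen-scraper's own implementation: the class name TokenExpansionSketch and the expand method are hypothetical, and replacing an unknown token with an empty string is simply an assumption made for the example. It finds each ~#NAME#~ token, looks up the value, URL-encodes it, and splices it into the URL.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative stand-in only: this is not screen-scraper's own code.
class TokenExpansionSketch
{
    // Matches tokens of the form ~#NAME#~ and captures NAME.
    private static final Pattern TOKEN = Pattern.compile( "~#(\\w+)#~" );

    // URL-encodes a value so it can be placed in a query string.
    static String encode( String value )
    {
        try
        {
            return URLEncoder.encode( value, "UTF-8" );
        }
        catch( UnsupportedEncodingException e )
        {
            // UTF-8 is always available, so this should never happen.
            throw new IllegalStateException( e );
        }
    }

    // Replaces each ~#NAME#~ token with the URL-encoded value of NAME.
    // Tokens with no matching variable become an empty string
    // (an assumption made for this sketch).
    static String expand( String template, Map<String, String> variables )
    {
        Matcher matcher = TOKEN.matcher( template );
        StringBuffer result = new StringBuffer();
        while( matcher.find() )
        {
            String value = variables.get( matcher.group( 1 ) );
            String encoded = ( value == null ) ? "" : encode( value );
            matcher.appendReplacement( result, Matcher.quoteReplacement( encoded ) );
        }
        matcher.appendTail( result );
        return result.toString();
    }

    public static void main( String[] args )
    {
        Map<String, String> variables = new HashMap<String, String>();
        variables.put( "TEXT_TO_SUBMIT", "Hi everybody!" );

        String url = "http://www.screen-scraper.com/screen-scraper/tutorial/basic_form.php"
            + "?text_string=~#TEXT_TO_SUBMIT#~";

        // Prints the URL shown above, ending in text_string=Hi+everybody%21
        System.out.println( expand( url, variables ) );
    }
}
```

Notice that the space becomes "+" and the exclamation point becomes "%21", which is why the URL above doesn't contain the literal text "Hi everybody!".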

We're going to run our scraping session again, but before doing that clear out the scraping session log by selecting the "Hello World" scraping session in the tree, clicking on the "Log" tab, then on the "Clear Log" button. Start up the scraping session again by clicking the "Run Scraping Session" button. Once the scrape has run, you should notice the following lines in the log:

Form submission: The following data elements were found:
Form data--DataRecord 0:
FORM_SUBMITTED_TEXT=Hi everybody!

And if you look at the contents of the "form_submitted_text.txt" file you'll notice the same text.

Remember that it's a good idea to run scraping sessions often as you make changes, and watch the log and last responses to ensure that things are working as you expect them to.

Tutorial 3: Page 3: Interacting with Screen-Scraper Externally

Invoking screen-scraper from the command line

If you've decided to use the basic edition of screen-scraper, your only option for invoking it externally is the command line (which is also available in the professional and enterprise editions). You can find full documentation and examples at our Invoking screen-scraper from the command line documentation page. If you don't need to invoke screen-scraper from the command line, you can skip to the Invoking screen-scraper from an external application section.

In order to invoke screen-scraper from the command line, you'll want to create a batch file (in Windows) or a shell script (in Linux or Mac OS X) to invoke the scraping session. If you're using Windows, open a text editor (e.g., Notepad) and enter the following on a single line:

jre\bin\java -jar screen-scraper.jar -s "Hello World" --params "TEXT_TO_SUBMIT=Hello+World"

Save the batch file (call it "hello_world.bat") in the folder where screen-scraper is installed (e.g., C:\Program Files\screen-scraper professional edition\). Vista users, you will need to save your batch file to a location such as your Documents folder or your Desktop. Then, within Windows Explorer, manually transfer the file to the directory where screen-scraper is installed.

Within screen-scraper, you'll want to disable the "Initialize scraping session" script; otherwise, the value we pass in from the command line would get overwritten once that script is executed. Disable the script by clicking on the "Hello World" scraping session, then on the "Scripts" tab, then un-checking the "Enabled?" check box for the script.

You can then run the batch file by opening a DOS prompt, changing to the folder containing the batch file, then invoking it. You should see the text from screen-scraper's log appear in the DOS window. If you're running Linux or Mac OS X, you'll need to close the workbench before invoking your shell script.
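If you'd rather build the command-line arguments in code than type them by hand, the following Java sketch shows one way to do it. The class and method names are hypothetical, and it rests on two assumptions: that the value after --params is URL-encoded (as the "Hello+World" in the batch file suggests), and that multiple variables would be joined with "&" as in a query string. The sketch only assembles the arguments; it doesn't launch screen-scraper.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative helper only; the class and method names are hypothetical.
class CommandLineSketch
{
    // URL-encodes a single value, e.g. "Hello World" becomes "Hello+World".
    static String encode( String value )
    {
        try
        {
            return URLEncoder.encode( value, "UTF-8" );
        }
        catch( UnsupportedEncodingException e )
        {
            // UTF-8 is always available, so this should never happen.
            throw new IllegalStateException( e );
        }
    }

    // Builds the text that follows --params. Joining several variables
    // with "&" is an assumption based on query-string conventions.
    static String buildParams( Map<String, String> variables )
    {
        StringBuilder params = new StringBuilder();
        for( Map.Entry<String, String> entry : variables.entrySet() )
        {
            if( params.length() > 0 )
            {
                params.append( "&" );
            }
            params.append( entry.getKey() ).append( "=" ).append( encode( entry.getValue() ) );
        }
        return params.toString();
    }

    // Assembles the same arguments used in the batch file above.
    static List<String> buildCommand( String sessionName, Map<String, String> variables )
    {
        List<String> command = new ArrayList<String>();
        command.add( "jre\\bin\\java" );
        command.add( "-jar" );
        command.add( "screen-scraper.jar" );
        command.add( "-s" );
        command.add( sessionName );
        command.add( "--params" );
        command.add( buildParams( variables ) );
        return command;
    }

    public static void main( String[] args )
    {
        Map<String, String> variables = new LinkedHashMap<String, String>();
        variables.put( "TEXT_TO_SUBMIT", "Hello World" );

        // The last element of the printed list is TEXT_TO_SUBMIT=Hello+World
        System.out.println( buildCommand( "Hello World", variables ) );
    }
}
```

A list like this could be handed to java.lang.ProcessBuilder, with the working directory set to the folder where screen-scraper is installed, but we haven't tested that setup here.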

Invoking screen-scraper from an external application

Note that the rest of this tutorial only applies to the professional and enterprise editions of screen-scraper.

Oftentimes you'll want to use a language or platform external to screen-scraper to scrape data. screen-scraper can be controlled externally using Java, PHP, Ruby, Python, .NET, ColdFusion, any COM-friendly language (such as Active Server Pages or Visual Basic), or any language that supports SOAP. In this next part of the tutorial we'll give examples in PHP, Java, ColdFusion, and Active Server Pages.

In order to interact with screen-scraper externally, it needs to be running as a server. When running as a server, screen-scraper acts much like a database server does. That is, it listens for requests from external sources, services those requests, and sends back responses. For example, when you issue a SQL statement to a database from an ASP script, your script opens a socket to the database, sends the request over it, then receives the database's response back over the socket. Once this transaction has been completed the socket is closed, but the database continues to listen for other requests. screen-scraper works in a similar way.

At this point we'd recommend reading over the documentation page that discusses running screen-scraper as a server, and gives details on how to start and stop it according to the platform you're running on. Follow the link below, then return back to this page when you're finished:

Running screen-scraper as a server

Before we start writing code to interact with screen-scraper externally we need to configure a few things. Depending on the language you'd like to program in, please follow one of the links below, which will give you an overview of interacting with screen-scraper using that language and guide you through any configuration that needs to take place. Once you're finished return back to this page.

Invoking screen-scraper from ColdFusion

Invoking screen-scraper from a COM-based application

Invoking screen-scraper from Java

Invoking screen-scraper from PHP

Each time you run a scraping session externally screen-scraper will generate a log file corresponding to that scraping session in the "log" folder found inside the folder where you installed screen-scraper. This can be invaluable for debugging, so you'll want to take a look at it if you run into trouble. You can turn server logging off by unchecking the "Generate log files" check box under the "Servers" section of the "Settings" dialog box.

If you haven't already, within screen-scraper, you'll want to disable the "Initialize scraping session" script; otherwise, the value we pass in from our external application would get overwritten once that script is executed. Disable the script by clicking on the "Hello World" scraping session, then on the "Scripts" tab, then un-checking the "Enabled?" check box for the script.

OK, we're now ready to write some code. Follow one of the links below.

Tutorial 3: Page 4: Interacting with screen-scraper from ASP

The ASP script we'll be writing will invoke our scraping session remotely, passing in a value for the "TEXT_TO_SUBMIT" session variable. Create a new ASP script on your computer, and paste the following code into it:

<%
' Create a RemoteScrapingSession object.
Set objRemoteSession = Server.CreateObject("Screenscraper.RemoteScrapingSession")

' Generate a new "Hello World" scraping session.
Call objRemoteSession.Initialize("Hello World")

' Put the text to be submitted in the form into a session variable so we can reference it later.
Call objRemoteSession.SetVariable( "TEXT_TO_SUBMIT", "Hi everybody!" )

' Check for errors.
If objRemoteSession.isError Then
    Response.Write( "Error: " & objRemoteSession.GetErrorMessage )
Else
    ' Tell the scraping session to scrape.
    Call objRemoteSession.Scrape

    ' Write out the text that was scraped.
    ' Use & (not +) so the concatenation is always treated as a string operation.
    Response.Write( "Scraped text: " & objRemoteSession.GetVariable("FORM_SUBMITTED_TEXT") )
End If

' Disconnect from the server.
Call objRemoteSession.Disconnect
%>



There are just a couple of extra steps we take here that we didn't take in our previous script. First, after creating our RemoteScrapingSession object we make a separate call to initialize it for our specific scraping session. Also, you'll notice that before calling the Scrape method we check for any errors that may have occurred up to this point. For example, if for some reason your ASP script can't connect to the server you'd want to know before you tried to tell it to scrape. Finally, we need to explicitly disconnect from the server so that it knows we're done.

OK, we're ready to give our script a try. Start screen-scraper running as a server. If you need help or have trouble with this refer to the documentation page here: Running screen-scraper as a server. If you've succeeded in starting up the server go ahead and load your ASP script in a browser. After a short pause you should see the "Hi everybody!" message output to your browser. If something goes wrong please refer to the "Related pages" section found below for help.

Tutorial 3: Page 4: Interacting with Screen-Scraper from Java

The Java class we'll be writing will simply substitute for the "Initialize scraping session" script we wrote previously. That is, our Java class will invoke our scraping session remotely, passing in a value for the "TEXT_TO_SUBMIT" session variable. Create a new Java class on your computer, and paste the following code into it:

import com.screenscraper.scraper.*;

public class HelloWorldRemoteScrapingSession
{
    /**
     * The entry point.
     */
    public static void main( String[] args )
    {
        try
        {
            // Create a remoteSession to communicate with the server.
            RemoteScrapingSession remoteSession = new RemoteScrapingSession( "Hello World" );

            // Put the text to be submitted in the form into a session variable so we can reference it later.
            remoteSession.setVariable( "TEXT_TO_SUBMIT", "Hi everybody!" );

            // Tell the session to scrape.
            remoteSession.scrape();

            // Output the text that was scraped:
            System.out.println( "Scraped text: " + remoteSession.getVariable( "FORM_SUBMITTED_TEXT" ) );

            // Very important! Be sure to disconnect from the server.
            remoteSession.disconnect();
        }
        catch( Exception e )
        {
            System.err.println( e.getMessage() );
        }
    }
}



This Java code is virtually identical to our "Initialize scraping session" script. The one notable difference is that we need to explicitly disconnect from the server so that it knows we're done.
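Because disconnecting matters so much, you may want to guarantee that it happens even when something goes wrong partway through. The sketch below shows one way to do that with a finally block. So that it can be run without a server, it uses a hypothetical stand-in class, FakeRemoteSession, which is not part of screen-scraper and only mimics the method names used above; in your own code you would apply the same pattern to the real RemoteScrapingSession.

```java
import java.util.HashMap;
import java.util.Map;

class DisconnectSketch
{
    // Hypothetical stand-in for a remote session. This is NOT screen-scraper's
    // RemoteScrapingSession; it only records calls so the pattern can be run
    // without a server.
    static class FakeRemoteSession
    {
        private final Map<String, String> variables = new HashMap<String, String>();
        private final boolean failOnScrape;
        boolean disconnected = false;

        FakeRemoteSession( boolean failOnScrape )
        {
            this.failOnScrape = failOnScrape;
        }

        void setVariable( String name, String value )
        {
            variables.put( name, value );
        }

        void scrape()
        {
            if( failOnScrape )
            {
                throw new IllegalStateException( "Simulated scrape failure" );
            }
            // Pretend the form echoed back the text that was submitted.
            variables.put( "FORM_SUBMITTED_TEXT", variables.get( "TEXT_TO_SUBMIT" ) );
        }

        String getVariable( String name )
        {
            return variables.get( name );
        }

        void disconnect()
        {
            disconnected = true;
        }
    }

    // Runs the scrape and always disconnects afterwards.
    // Returns the scraped text, or null if the scrape failed.
    static String scrapeText( FakeRemoteSession remoteSession, String text )
    {
        try
        {
            remoteSession.setVariable( "TEXT_TO_SUBMIT", text );
            remoteSession.scrape();
            return remoteSession.getVariable( "FORM_SUBMITTED_TEXT" );
        }
        catch( RuntimeException e )
        {
            System.err.println( e.getMessage() );
            return null;
        }
        finally
        {
            // Runs whether or not an exception was thrown.
            remoteSession.disconnect();
        }
    }

    public static void main( String[] args )
    {
        FakeRemoteSession working = new FakeRemoteSession( false );
        System.out.println( "Scraped text: " + scrapeText( working, "Hi everybody!" ) );
        System.out.println( "Disconnected: " + working.disconnected );

        FakeRemoteSession failing = new FakeRemoteSession( true );
        System.out.println( "Scraped text: " + scrapeText( failing, "Hi everybody!" ) );
        System.out.println( "Disconnected: " + failing.disconnected );
    }
}
```

In both cases the session ends up disconnected, which is the point of moving the disconnect call into the finally block.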

OK, we're ready to give our Java class a try. After you've successfully compiled the class (remember to include the "screen-scraper.jar" file in your classpath), start screen-scraper running as a server. If you need help or have trouble with this refer to the documentation page here: Running screen-scraper as a server. If you've succeeded in starting up the server go ahead and run the Java class from a command prompt or console. After a short pause you should see the "Hi everybody!" message output. If something goes wrong please refer to the "Related pages" section found below for help.

Tutorial 3: Page 5: Where to Go From Here

Congratulations! You've now covered all of the basic principles needed to invoke screen-scraper externally. In working on your own projects we'd suggest referring frequently to the screen-scraper documentation available from within the application or on our web site.

The next tutorial deals with other topics, including scraping search results (with multiple records) across multiple pages, and logging in to a web site before scraping information.