Invoking screen-scraper from PHP

Overview

A PHP script interacts with screen-scraper through a PHP class called RemoteScrapingSession. To use this class, include the file remote_scraping_session.php (found in the misc/php directory of your screen-scraper installation) in your PHP script.

screen-scraper must be running as a server before you invoke it from a PHP script.

Methods

The following is a reference for all of the methods found in the RemoteScrapingSession class.

  • initialize( $name ). Initializes a RemoteScrapingSession identified by name. When this form of the method is called, the default host (localhost) and port (8778) are used.

    $session->initialize("Hello World");

  • initialize( $name, $host, $port ). Initializes a RemoteScrapingSession identified by name and connects to the server found at host, listening on port.

    $session->initialize("Hello World", "127.0.0.1", "8080");

  • setVariable( $var_name, $value ). Sets a session variable using the given var_name and value.

    $session->setVariable("TEXT_TO_SUBMIT", "Hi everybody!" );

  • scrape(). Causes the session to start. This is equivalent to clicking the Run Scraping Session button from within screen-scraper on the General tab for a scraping session.

    $session->scrape();

  • getVariable( $var_name ). Gets the value of a session variable that was set during the course of the scraping session. If the object identified by $var_name is a data record, an associative array is returned. If it is a data set, a two-dimensional array is returned: a numerically indexed array of associative arrays, one per data record.

    Currently only Strings, DataRecords, and DataSets can be accessed by this method.

    $session->getVariable("FORM_SUBMITTED_TEXT");

  • isError(). Indicates whether or not an error has occurred in the scraping process.

    $session->isError();

  • getErrorMessage(). Returns the last error message returned from the server, if one was returned.

    $session->getErrorMessage();

  • disconnect(). Disconnects from the remote server. This should be called once a scraping session is complete so that system resources can be freed up.

    $session->disconnect();

  • getNumDataRecordsInDataSet( $data_set_name ). Returns the number of data records found in the data set named by data_set_name.

    $session->getNumDataRecordsInDataSet( "PRODUCTS" );

  • getDataRecordFromDataSet( $data_set_name, $index ). Returns a single data record (an associative array) from the data set named by data_set_name at the given index.

    $session->getDataRecordFromDataSet( "PRODUCTS", 2 );

  • setDoLazyScrape( $doLazyScrape ). Indicates whether or not a scraping session should be run in a separate thread. By default this value is false.

    Calling this method will only have an effect if it's done before calling the scrape method. If this value is set to true, after the scrape method is called, program flow will return immediately, but the scraping session will still be running in screen-scraper.

    $session->setDoLazyScrape( true );
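
Taken together, the methods above suggest a typical call order: initialize, set variables, scrape, check for errors, read results, disconnect. The following is a minimal sketch of that flow, not a definitive implementation. The helper name runScrapingSession and its parameters are hypothetical, and the session object is passed in rather than created inside the function so that the flow can be read on its own.

```php
<?php
// Sketch only. In a real script you would first include the driver file
// (remote_scraping_session.php) and create the session object yourself.

/**
 * Hypothetical helper: runs a scraping session and returns one session
 * variable. Throws a RuntimeException if the session reports an error.
 */
function runScrapingSession($session, $sessionName, array $inputs, $resultName)
{
    $session->initialize($sessionName);        // default host and port

    foreach ($inputs as $name => $value) {
        $session->setVariable($name, $value);  // set inputs before scraping
    }

    $session->scrape();

    // Read everything we need before disconnecting.
    $failed  = $session->isError();
    $message = $failed ? $session->getErrorMessage() : '';
    $result  = $failed ? null : $session->getVariable($resultName);

    $session->disconnect();                    // free server resources either way

    if ($failed) {
        throw new RuntimeException('Scraping session failed: ' . $message);
    }
    return $result;
}
```

With the driver included, a call might look like runScrapingSession( new RemoteScrapingSession(), "Hello World", array( "TEXT_TO_SUBMIT" => "Hi everybody!" ), "FORM_SUBMITTED_TEXT" ), assuming the class can be constructed without arguments.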

Receiving Data in Real Time

This feature is only available to Enterprise editions of screen-scraper.

If you create a special PHP class, your code can handle extracted data as it is being scraped rather than after the scrape has finished. In other words, you do not need to wait for the scraping session to complete before getting access to the extracted data.

We recommend calling the class DataReceiver.

DataReceiver Class

The DataReceiver class must contain the following method. You can add other methods as needed to process the data, but this one is required.

  • function receiveData( $key, $value ). The key portion is simply a string you'll designate in a screen-scraper script. The value parameter holds the value you pass from screen-scraper to your code.
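
Because the key is an arbitrary string chosen in the screen-scraper script, one receiver can handle several kinds of data by branching on it. The sketch below illustrates that idea under stated assumptions: the class name KeyedDataReceiver and the key "ProductRecord" are illustrative, not part of the driver.

```php
<?php
// Hypothetical receiver that routes incoming data by its key.
class KeyedDataReceiver
{
    public $products = array();  // values received under the "ProductRecord" key
    public $other    = array();  // everything else, kept as (key, value) pairs

    // The one method the driver requires on a data receiver.
    function receiveData($key, $value)
    {
        if ($key === 'ProductRecord') {
            $this->products[] = $value;
        } else {
            $this->other[] = array($key, $value);
        }
    }
}
```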

Real Time RemoteScrapingSession Methods

Once you have created a DataReceiver class containing the receiveData method, attach an instance of it to the RemoteScrapingSession using the setDataReceiver method. The other methods listed here let you control the flow of real-time information.

  • setDataReceiver( $data_receiver ). Sets the DataReceiver object specified by data_receiver on the RemoteScrapingSession object.

    $session->setDataReceiver( $my_data_receiver );

  • getDataReceiver(). Use this to see if a DataReceiver has already been set.

    $session->getDataReceiver( );

  • setPollFrequency( $poll_frequency ). Sets the frequency in seconds with which screen-scraper should be polled for data to be sent. The default is five seconds.

    $session->setPollFrequency( 1 );

  • getPollFrequency(). Gets the current poll frequency, in seconds.

    $session->getPollFrequency( );
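
As a rough sketch of how these methods might fit together, the helper below sets the receiver and the poll frequency and then starts the scrape. It assumes both should be set before scrape() is called, which the text above does not state explicitly. The names CollectingReceiver and startRealTimeScrape are hypothetical.

```php
<?php
// Minimal receiver used only to demonstrate the wiring.
class CollectingReceiver
{
    public $items = array();

    function receiveData($key, $value)
    {
        $this->items[] = array($key, $value);
    }
}

// Hypothetical helper. $session is expected to be a RemoteScrapingSession
// created after including remote_scraping_session.php.
function startRealTimeScrape($session, $sessionName, $receiver, $pollSeconds = 1)
{
    $session->initialize($sessionName);
    $session->setDataReceiver($receiver);      // object exposing receiveData($key, $value)
    $session->setPollFrequency($pollSeconds);  // the documented default is five seconds
    $session->scrape();
    $session->disconnect();
    return $receiver;
}
```

A shorter poll frequency delivers data to receiveData sooner at the cost of more frequent requests to the server.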

Passing Information in Real Time

On the screen-scraper side, whenever you'd like to send data from screen-scraper back to your code, you simply invoke the session.sendDataToClient method. Data sent through this method will be processed through the receiveData method.

Examples

In screen-scraper

As a specific example, let's suppose you've created a scraping session that extracts product records from a shopping web site. As each product record is being scraped, you might simply output them to a CSV file, but you decide instead that you'd like to insert them into your database, and determine that it would be best for you to write your own code to perform the database insertion. On the screen-scraper side, in your scraping session, you might have a script that contains the following:

 session.sendDataToClient( "ProductRecord", dataRecord );

You set up this script to be invoked After each pattern match for the extractor pattern that pulls the product information. For example, the extractor pattern might get the price, title, and weight of the product. Because the script is being invoked After each pattern match, the current dataRecord object will hold all of that information. You invoke session.sendDataToClient so that each record can be processed by your code as it gets extracted.

In PHP Script

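In your PHP code you create a class that implements the receiveData( $key, $value ) method. You create an instance of this class and pass it to your RemoteScrapingSession object so that you can process each of the product records as they get extracted. Your DataReceiver class implementation might look something like this:

class DataReceiver
{
   // Called once for each session.sendDataToClient call made in screen-scraper.
   function receiveData( $key, $value )
   {
      echo "Received data from ss:\n";
      echo "Key: $key\n";
      // $value may be an array (e.g. a data record), so print it with print_r.
      echo "Value: " . print_r( $value, true ) . "\n";
      flush();
      // writeRow is your own function, e.g. one that performs the database insertion.
      writeRow( $value );
   }
}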

You would instantiate the class and set it on your session like so:

$data_receiver = new DataReceiver;
$session->setDataReceiver( $data_receiver );

Each time you invoke session.sendDataToClient in screen-scraper, there will be a corresponding method call made to your receiveData method, which will allow you to handle each of the data pieces individually.
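
For the database insertion described in this example, a receiver might wrap a prepared statement. The following is a sketch under several assumptions: the data record arrives as an associative array, its keys are TITLE, PRICE, and WEIGHT, and $insert is a prepared statement (for instance a PDOStatement) with :title, :price, and :weight placeholders. The class name and column names are hypothetical.

```php
<?php
// Hypothetical receiver that inserts each product record into a database.
class DatabaseDataReceiver
{
    private $insert;

    // $insert: any object with execute(array $params), e.g. a PDOStatement.
    function __construct($insert)
    {
        $this->insert = $insert;
    }

    function receiveData($key, $value)
    {
        // Ignore anything that is not a product data record.
        if ($key !== 'ProductRecord' || !is_array($value)) {
            return;
        }
        $this->writeRow($value);
    }

    // Maps the assumed record keys onto the statement's named placeholders.
    private function writeRow(array $record)
    {
        $this->insert->execute(array(
            ':title'  => isset($record['TITLE'])  ? $record['TITLE']  : null,
            ':price'  => isset($record['PRICE'])  ? $record['PRICE']  : null,
            ':weight' => isset($record['WEIGHT']) ? $record['WEIGHT'] : null,
        ));
    }
}
```

With PDO, $insert could come from a call such as $pdo->prepare( "INSERT INTO products (title, price, weight) VALUES (:title, :price, :weight)" ), where the products table is likewise an assumption.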

For other examples of using the PHP driver please see Tutorial 3: Extending Hello World and Tutorial 4: Scraping an E-commerce Site from External Programs.