Invoking screen-scraper from PHP

screen-scraper must be running as a server before you can invoke it from a PHP script. If you haven't already, please read the documentation on running screen-scraper as a server now. For examples of using the PHP driver, please see Tutorial 3: Extending Hello World and Tutorial 4: Scraping an E-commerce Site from External Programs.

A PHP script interacts with screen-scraper through a PHP class called "RemoteScrapingSession". To use this class, include the file "remote_scraping_session.php" (found in the misc/php directory of your screen-scraper installation) in your PHP script.

Full documentation on all of the methods found in the RemoteScrapingSession class is given below:

  • initialize( $name ). Initializes a RemoteScrapingSession identified by $name. When this form is called, the default host (localhost) and port (8778) are used.
  • initialize( $name, $host, $port ). Initializes a RemoteScrapingSession identified by $name and connects to the server found at $host, listening on $port.
  • setVariable( $var_name, $value ). Sets a session variable using the given name and value.
  • scrape(). Causes the session to start. This is equivalent to clicking the "Run Scraping Session" button from within screen-scraper on the "General" tab for a scraping session.
  • getVariable( $var_name ). Gets the value of a session variable that was set during the course of the scraping session. If the object identified by $var_name is a data record, an associative array will be returned. If the object identified by $var_name is a data set, a numerically indexed array of associative arrays (in effect, a two-dimensional array) will be returned. Note that currently only Strings, DataRecords, and DataSets can be accessed by this method.
  • isError(). Indicates whether or not an error has occurred in the scraping process.
  • getErrorMessage(). Returns the last error message returned from the server, if one was returned.
  • disconnect(). Disconnects from the remote server. This should be called once a scraping session is complete so that system resources can be freed up.
  • getNumDataRecordsInDataSet( $data_set_name ). Returns the number of data records found in the data set named by $data_set_name.
  • getDataRecordFromDataSet( $data_set_name, $index ). Returns a single data record (an associative array) from the data set named by $data_set_name at the given $index.
  • setDoLazyScrape( $doLazyScrape ). Indicates whether or not a scraping session should be run in a separate thread. By default this value is false. Note that calling this method will only have an effect if it's done before calling the scrape method. If this value is set to true, after the scrape method is called, program flow will return immediately, but the scraping session will still be run by screen-scraper.
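
The methods above are typically called in the order initialize, setVariable, scrape, getVariable, disconnect. The following is a minimal sketch of that sequence, not a definitive implementation. The helper name run_products_session, the scraping session name "Shopping Site", the session variable "SEARCH", and the data set name "PRODUCTS" are all hypothetical; substitute the names used in your own scraping session.

```php
<?php
// Hypothetical helper: runs a scraping session and returns the records of
// one data set. $session is expected to be a RemoteScrapingSession (or any
// object that offers the methods documented above).
function run_products_session( $session, $search_term )
{
   // Assumed names: replace with your own scraping session and variables.
   $session->initialize( "Shopping Site" );
   $session->setVariable( "SEARCH", $search_term );
   $session->scrape();

   $error = null;
   $records = array();
   if( $session->isError() )
   {
      $error = $session->getErrorMessage();
   }
   else
   {
      // For a data set, getVariable returns an array of associative arrays.
      $data_set = $session->getVariable( "PRODUCTS" );
      if( is_array( $data_set ) )
      {
         $records = $data_set;
      }
   }

   // Always disconnect so that system resources can be freed up.
   $session->disconnect();

   return array( "error" => $error, "records" => $records );
}
```

With screen-scraper running as a server, and assuming the script is run from the screen-scraper installation directory, the helper might be called like this:

 require( "misc/php/remote_scraping_session.php" );
 $result = run_products_session( new RemoteScrapingSession, "dvd" );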

Handling Scraped Data in Real Time

By creating a special PHP class, your code can handle extracted data as it is being scraped. That is, you need not wait until the scraping session has finished before getting access to the extracted data. This class must contain this method:

  • function receiveData( $key, $value ). The key portion is simply a string you'll designate in a screen-scraper script. The value parameter holds the value you pass from screen-scraper to your code.

We recommend calling the class DataReceiver. Once you have created the DataReceiver class containing the receiveData method, you'll then pass an instance of the class to the RemoteScrapingSession via:

  • setDataReceiver( $data_receiver )

Other useful methods include:

  • getDataReceiver(). Use this to see if a DataReceiver has already been set.
  • setPollFrequency( $poll_frequency ). Sets the frequency in seconds with which screen-scraper should be polled for data to be sent. The default is five seconds.
  • getPollFrequency(). Gets the current poll frequency, in seconds.

On the screen-scraper side, whenever you'd like to send data from screen-scraper back to your code, you simply invoke the session.sendDataToClient method. Data sent through this method will show up through the receiveData method.
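
The PHP side of this arrangement can be sketched as follows. This is only an illustration under stated assumptions: the helper name scrape_with_receiver is hypothetical, the two-second poll frequency is an arbitrary choice, and registering the receiver after initialize but before scrape is an assumed ordering rather than a documented requirement.

```php
<?php
// Hypothetical helper. $session is expected to be a RemoteScrapingSession and
// $receiver any object that has a receiveData( $key, $value ) method.
function scrape_with_receiver( $session, $receiver, $session_name )
{
   $session->initialize( $session_name );

   // Register the receiver and poll more often than the five-second default.
   $session->setDataReceiver( $receiver );
   $session->setPollFrequency( 2 );

   // Data sent with session.sendDataToClient is handed to
   // $receiver->receiveData while the scraping session runs.
   $session->scrape();

   $error = $session->isError() ? $session->getErrorMessage() : null;

   // Free up system resources on the server.
   $session->disconnect();

   // Returns null on success, or the last error message otherwise.
   return $error;
}
```

Keeping the receiver as a parameter means the same helper can be reused with different receiver classes, for example one that prints records and one that stores them.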

As a specific example, let's suppose you've created a scraping session that extracts product records from an e-commerce web site. As each product record is being scraped, you might simply output them to a CSV file, but you decide instead that you'd like to insert them into your database, and determine that it would be best for you to write your own code to perform the database insertion. On the screen-scraper side, in your scraping session, you might have a script that contains the following:

 session.sendDataToClient( "ProductRecord", dataRecord );

You set up this script to be invoked "After each pattern application" for the extractor pattern that pulls the product information. For example, the extractor pattern might get the price, title, and weight of the product. Because the script is being invoked "After each pattern application", the current "dataRecord" object will hold all of that information. You invoke session.sendDataToClient so that each record can be processed by your code as it gets extracted.

In your PHP code, you create a class that implements the receiveData( $key, $value ) method. You then create an instance of this class and pass it to your RemoteScrapingSession object so that you can process each of the product records as it gets extracted. Your DataReceiver class implementation might look something like this:

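class DataReceiver
{
   // Called once for each session.sendDataToClient call made in screen-scraper.
   function receiveData( $key, $value )
   {
      echo "Received data from ss:\n";
      echo "Key: $key\n";
      // print_r handles both plain strings and data records (arrays).
      echo "Value: " . print_r( $value, true ) . "\n";
      flush();
      // writeRow is a function you write yourself, e.g. to perform the database insertion.
      writeRow( $value );
   }
}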

Each time you invoke session.sendDataToClient in screen-scraper, there will be a corresponding method call made to your receiveData method, which will allow you to handle each of the data pieces individually.
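
To make the database scenario above more concrete, here is one possible receiver that performs the insertion itself using PHP's PDO extension. It is a sketch under several assumptions: the class name DatabaseDataReceiver, the table "products" and its columns, and the data record keys "TITLE", "PRICE", and "WEIGHT" are all hypothetical, the table is assumed to exist already, and the data record is assumed to arrive in receiveData as an associative array.

```php
<?php
// Hypothetical receiver that inserts each product record into a database.
class DatabaseDataReceiver
{
   private $statement;

   // $pdo is an open PDO connection; the products table must already exist.
   function __construct( PDO $pdo )
   {
      // Assumed table and column names.
      $this->statement = $pdo->prepare(
         "INSERT INTO products ( title, price, weight ) VALUES ( ?, ?, ? )" );
   }

   function receiveData( $key, $value )
   {
      // Ignore anything that is not a product record sent under the
      // key used in session.sendDataToClient.
      if( $key !== "ProductRecord" || !is_array( $value ) )
      {
         return;
      }

      // Assumed record keys; missing fields are stored as NULL.
      $this->statement->execute( array(
         $value["TITLE"] ?? null,
         $value["PRICE"] ?? null,
         $value["WEIGHT"] ?? null
      ) );
   }
}
```

An instance would be created with your own connection, for example new DatabaseDataReceiver( $pdo ), and passed to setDataReceiver in place of the simpler DataReceiver shown earlier. Using a prepared statement keeps scraped text from being interpreted as SQL.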