Invoking screen-scraper from Java

Overview

A Java application or servlet interacts with screen-scraper via the RemoteScrapingSession class (com.screenscraper.scraper.RemoteScrapingSession). To use the class, include the screen-scraper.jar and lib\log4j.jar files in your CLASSPATH.

screen-scraper must be running as a server before you invoke it from a Java class.

You can also reference your own Java code from within screen-scraper.

Methods

The following is a reference for all of the methods found in the RemoteScrapingSession class.

  • RemoteScrapingSession( String identifier ). Instantiates a RemoteScrapingSession identified by identifier. If this constructor is called, the default host (localhost) and port (8778) will be used.

    import com.screenscraper.scraper.*;
    RemoteScrapingSession remoteSession = new RemoteScrapingSession( "Hello World" );
  • RemoteScrapingSession( String identifier, String host, int port ). Instantiates a RemoteScrapingSession identified by identifier, and connecting to the server found at host listening on port.

    import com.screenscraper.scraper.*;
    RemoteScrapingSession remoteSession = new RemoteScrapingSession( "Hello World", "localhost", 8080 );
  • RemoteScrapingSession( String identifier, String host, int port, String characterSet ). Instantiates a RemoteScrapingSession identified by identifier, connecting to the server found at host listening on port, and utilizing the given characterSet.

    import com.screenscraper.scraper.*;
    RemoteScrapingSession remoteSession = new RemoteScrapingSession( "Hello World", "localhost", 8080, "UTF-8" );
  • int getNumDataRecordsInDataSet( String dataSetName ) throws RemoteScrapingSessionException. Gets the number of records found in the DataSet named by dataSetName.

    remoteSession.getNumDataRecordsInDataSet( "PRODUCTS" );

  • disconnect() throws IOException. Disconnects from the screen-scraper server and closes up the network socket.

    remoteSession.disconnect();
  • DataRecord getDataRecordFromDataSet( String dataSetName, int index ) throws RemoteScrapingSessionException. Gets the DataRecord specified by the index found in the DataSet named dataSetName.

    remoteSession.getDataRecordFromDataSet( "PRODUCTS", 2 );

  • getVariable( String varName ) throws RemoteScrapingSessionException. Gets the value of a session variable, varName, that was set during the course of the scraping session.

    Currently only Strings, DataRecords, and DataSets can be accessed by this method.

    remoteSession.getVariable( "FORM_SUBMITTED_TEXT" );

  • loadVariables( String fileToReadFrom ) throws RemoteScrapingSessionException. Causes screen-scraper to load variables from the file fileToReadFrom. More details can be found in the documentation for the loadVariables method.

    remoteSession.loadVariables( "variables.txt" );

  • scrape() throws RemoteScrapingSessionException. Causes the session to scrape. This is equivalent to clicking the Run Scraping Session button from within screen-scraper on the General tab of the scraping session.

    remoteSession.scrape();

  • boolean sessionTimedOut() throws RemoteScrapingSessionException. For non-lazy scrapes, this method can be called after the scrape method returns to determine whether or not a scraping session timed out. This method may only return true if the setTimeout method was called prior to calling scrape.

    remoteSession.sessionTimedOut();

  • setDoLazyScrape( boolean doLazyScrape ) throws RemoteScrapingSessionException. If set to true, screen-scraper will execute the scraping session in a separate thread, returning execution flow to the calling application immediately after the scrape method is called. This is false by default.

    remoteSession.setDoLazyScrape( true );

  • setOutputLogFiles( boolean outputLogFiles ) throws RemoteScrapingSessionException. Indicates whether or not screen-scraper should output a log file to the log folder when running this scraping session. This is true by default.

    remoteSession.setOutputLogFiles( false );

  • setTimeout( int timeout ) throws RemoteScrapingSessionException. Sets the number of minutes a scraping session should be allowed to run before it automatically stops itself.

    remoteSession.setTimeout( 60 );

  • setVariable( String varName, String value ) throws RemoteScrapingSessionException. Sets a session variable, varName, in the session that will be accessible from within a screen-scraper script.

    remoteSession.setVariable( "TEXT_TO_SUBMIT", "Hi everybody!" );

  • stopServer() throws RemoteScrapingSessionException. Tells the server to stop.

    The server cannot be started remotely.

    remoteSession.stopServer();

  • DataRecord getNextCachedDataRecord( String dataSetName ) throws RemoteScrapingSessionException and
    DataSet getNextCachedDataRecord( String dataSetName, int numRecordsToRetrive ) throws RemoteScrapingSessionException. In the case of a data set that's been cached, this allows for individual DataRecord objects to be retrieved in piecemeal fashion. This is desirable in cases where a large amount of data is to be extracted throughout the life of the scraping session, and retaining it all in memory could cause problems. DataSet objects are cached by checking the Cache the data set check box under the Advanced tab for an extractor pattern.

    remoteSession.getNextCachedDataRecord( "PRODUCTS" );
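
The methods above are usually called in a fixed order: set variables and a timeout, scrape, check for a timeout, read the results, and disconnect. The sketch below illustrates that order only. So that it compiles without screen-scraper.jar, it defines a hypothetical Session interface that mirrors a subset of the documented method names, plus an in-memory FakeSession; neither is part of screen-scraper. It also assumes that records behave like a map of field names to values, that data set indexes start at zero, and that a TITLE field exists. In real code you would make the same calls on a RemoteScrapingSession instance and handle the checked exceptions listed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CallOrderSketch {

    // Hypothetical stand-in mirroring a subset of the documented method names.
    // It is NOT the screen-scraper API. The real methods also declare checked
    // exceptions, which are omitted here to keep the sketch short.
    interface Session {
        void setVariable(String varName, String value);
        void setTimeout(int minutes);
        void scrape();
        boolean sessionTimedOut();
        int getNumDataRecordsInDataSet(String dataSetName);
        Map<String, String> getDataRecordFromDataSet(String dataSetName, int index);
        void disconnect();
    }

    // In-memory fake so the sketch runs without a screen-scraper server.
    static class FakeSession implements Session {
        final List<Map<String, String>> products = new ArrayList<>();
        boolean disconnected = false;

        public void setVariable(String varName, String value) { }
        public void setTimeout(int minutes) { }
        public void scrape() {
            // Pretend the scrape extracted two product records.
            for (String title : Arrays.asList("Widget", "Gadget")) {
                Map<String, String> record = new LinkedHashMap<>();
                record.put("TITLE", title); // "TITLE" is an assumed field name
                products.add(record);
            }
        }
        public boolean sessionTimedOut() { return false; }
        public int getNumDataRecordsInDataSet(String dataSetName) { return products.size(); }
        public Map<String, String> getDataRecordFromDataSet(String dataSetName, int index) {
            return products.get(index);
        }
        public void disconnect() { disconnected = true; }
    }

    // Configure, scrape, read results, and always disconnect.
    static List<String> run(Session session) {
        List<String> titles = new ArrayList<>();
        try {
            session.setVariable("TEXT_TO_SUBMIT", "Hi everybody!");
            session.setTimeout(60);   // minutes; must be set before scrape()
            session.scrape();         // non-lazy: returns when the scrape ends
            if (session.sessionTimedOut()) {
                return titles;        // results may be incomplete, so skip them
            }
            int count = session.getNumDataRecordsInDataSet("PRODUCTS");
            for (int i = 0; i < count; i++) { // zero-based indexing is assumed
                titles.add(session.getDataRecordFromDataSet("PRODUCTS", i).get("TITLE"));
            }
        } finally {
            session.disconnect();     // release the socket even on failure
        }
        return titles;
    }

    public static void main(String[] args) {
        System.out.println(run(new FakeSession()));
    }
}
```

Placing disconnect in a finally block is a design choice of this sketch: it releases the network socket even if one of the earlier calls fails.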

Built-In Objects/Classes

It is also possible to store data sets and data records in session variables, which can then be accessed via the RemoteScrapingSession class. Data set objects are analogous to database result sets and data records are analogous to individual records within a result set. When an extractor pattern is applied a data set is generated. Storing the resulting data set in a session variable (within a screen-scraper script) will allow for it to be accessed via a RemoteScrapingSession.getVariable call. More information on these classes can be found in the DataRecord and DataSet API documentation pages.

Receiving Data in Real Time

This feature is only available to Enterprise editions of screen-scraper.

DataReceiver Interface

The DataReceiver (com.screenscraper.scraper.DataReceiver) interface allows your code to handle extracted data as it is being scraped. That is, you need not wait until the scraping session has finished before getting access to the extracted data. This interface contains a single method:

  • receiveData( String key, Object value ) throws RemoteScrapingSessionException. The key portion is simply a string you'll designate in a screen-scraper script. The value parameter holds the value you pass from screen-scraper to your code.

Real Time RemoteScrapingSession Methods

Once you have implemented the DataReceiver interface in one of your own classes, pass an instance of that class to the RemoteScrapingSession via the setDataReceiver method. The following methods control the flow of real-time information.

  • setDataReceiver( DataReceiver dataReceiver ) throws RemoteScrapingSessionException. Sets the DataReceiver whose receiveData method will be given the data sent from screen-scraper.

    remoteSession.setDataReceiver( dataReceiver );

  • DataReceiver getDataReceiver() throws RemoteScrapingSessionException. Use this to see if a DataReceiver has already been set.

    remoteSession.getDataReceiver();

  • setPollFrequency( int pollFrequency ) throws RemoteScrapingSessionException. Sets the frequency in seconds with which screen-scraper should be polled for data to be sent. The default is five seconds.

    remoteSession.setPollFrequency( 2 );

  • int getPollFrequency() throws RemoteScrapingSessionException. Gets the current poll frequency, in seconds.

    remoteSession.getPollFrequency();

Passing Information in Real Time

On the screen-scraper side, whenever you'd like to send data from screen-scraper back to your code, you simply invoke the session.sendDataToClient method. Data sent through this method will show up through the receiveData method.

Examples

In screen-scraper

As a specific example, let's suppose you've created a scraping session that extracts product records from a shopping web site. As each product record is being scraped, you might simply output them to a CSV file, but you decide instead that you'd like to insert them into your database, and determine that it would be best for you to write your own code to perform the database insertion. In your scraping session, you might have a script that contains the following:

 session.sendDataToClient( "ProductRecord", dataRecord );

You set up this script to be invoked After each pattern match for the extractor pattern that pulls the product information. For example, the extractor pattern might get the price, title, and weight of the product. Because the script is being invoked After each pattern match, the current dataRecord object will hold all of that information. You invoke session.sendDataToClient so that each record can be processed by your code as it gets extracted.

In Java Code

In your Java code you create a class that implements the DataReceiver interface. You create an instance of this class and pass it to your RemoteScrapingSession object so that you can process each of the product records as they get extracted. Your receiveData method implementation might look something like this:

public void receiveData( String key, Object value ) throws RemoteScrapingSessionException
{
    if( key.equals( "ProductRecord" ) && value instanceof DataRecord )
    {
        // Here you would include code that might make
        // use of an existing JDBC connection to insert
        // or update the record in your database.
    }
}

Each time you invoke session.sendDataToClient in screen-scraper, there will be a corresponding method call made to your receiveData method, which will allow you to handle each of the data pieces individually.
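
Because receiveData is called once per record, a receiver that writes to a database may want to group records into batches rather than insert them one at a time. The following is a minimal sketch of that idea, not part of screen-scraper. To compile without screen-scraper.jar it declares its own stand-in DataReceiver interface with the same method name (the real method also declares RemoteScrapingSessionException). The BatchingReceiver class, its batch size, and the in-memory list standing in for the database are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class BufferingReceiverSketch {

    // Stand-in for com.screenscraper.scraper.DataReceiver so that the sketch
    // compiles on its own. The real interface declares a checked exception.
    interface DataReceiver {
        void receiveData(String key, Object value);
    }

    // Hypothetical receiver that collects "ProductRecord" values into batches.
    static class BatchingReceiver implements DataReceiver {
        private final int batchSize;
        private final List<Object> pending = new ArrayList<>();
        private final List<List<Object>> flushed = new ArrayList<>();

        BatchingReceiver(int batchSize) {
            this.batchSize = batchSize;
        }

        // Synchronized as a precaution: the documentation above does not say
        // which thread invokes receiveData.
        @Override
        public synchronized void receiveData(String key, Object value) {
            if (!"ProductRecord".equals(key)) {
                return; // ignore keys this receiver does not handle
            }
            pending.add(value);
            if (pending.size() >= batchSize) {
                flush();
            }
        }

        // Hands the pending records over as one batch. Real code might run a
        // JDBC batch insert here instead of storing the batch in a list.
        synchronized void flush() {
            if (pending.isEmpty()) {
                return;
            }
            flushed.add(new ArrayList<>(pending));
            pending.clear();
        }

        synchronized List<List<Object>> batches() {
            return flushed;
        }
    }

    public static void main(String[] args) {
        BatchingReceiver receiver = new BatchingReceiver(2);
        receiver.receiveData("ProductRecord", "record 1");
        receiver.receiveData("ProductRecord", "record 2");
        receiver.receiveData("ProductRecord", "record 3");
        receiver.flush(); // flush the remainder once the scrape has finished
        System.out.println(receiver.batches().size() + " batches");
    }
}
```

Calling flush once more after the scraping session ends ensures that a final, partially filled batch is not lost.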

For other examples of using the Java driver, please see Tutorial 3: Extending Hello World and Tutorial 4: Scraping a Shopping Site from External Programs.