Invoking screen-scraper from Java
Overview
screen-scraper needs to be running as a server before you invoke it from a Java class. If you haven't already, please read the documentation on running screen-scraper as a server. For examples of using the Java driver, please see Tutorial 3: Extending Hello World and Tutorial 4: Scraping an E-commerce Site from External Programs.
A Java application or servlet interacts with screen-scraper via the class com.screenscraper.scraper.RemoteScrapingSession. You can utilize the com.screenscraper.scraper.RemoteScrapingSession class by including the "screen-scraper.jar" and "lib\log4j.jar" files in your CLASSPATH.
Methods
There are only a handful of methods in the RemoteScrapingSession class, which are documented below:
- RemoteScrapingSession( String identifier ). Instantiates a RemoteScrapingSession identified by identifier. If this constructor is called, the default host (localhost) and port (8778) will be used.
- RemoteScrapingSession( String identifier, String host, int port ). Instantiates a RemoteScrapingSession identified by identifier, and connecting to the server found at host listening on port.
- int getNumDataRecordsInDataSet( String dataSetName ) throws RemoteScrapingSessionException. Gets the number of records found in the DataSet named by dataSetName.
- DataRecord getDataRecordFromDataSet( String dataSetName, int index ) throws RemoteScrapingSessionException. Gets the DataRecord specified by the index found in the DataSet named by dataSetName.
- getVariable( String varName ) throws RemoteScrapingSessionException. Gets the value of a session variable that was set during the course of the scraping session. Note that currently only Strings, DataRecords, and DataSets can be accessed by this method.
- loadVariables( String fileToReadFrom ) throws RemoteScrapingSessionException. This method will cause screen-scraper to load variables in from the file named by fileToReadFrom. More details on this method can be found here.
- scrape() throws RemoteScrapingSessionException. Causes the session to scrape. This is equivalent to clicking the "Run Scraping Session" button from within screen-scraper on the "General" tab for a scraping session.
- boolean sessionTimedOut() throws RemoteScrapingSessionException. For non-lazy scrapes, this method can be called after the scrape method returns to determine whether or not a scraping session timed out. This method may only return true if the setTimeout method was called prior to calling scrape.
- setDoLazyScrape( boolean doLazyScrape ) throws RemoteScrapingSessionException. If set to true, screen-scraper will execute the scraping session in a separate thread, returning execution flow to the calling application immediately after the scrape method is called. This is false by default.
- setOutputLogFiles( boolean outputLogFiles ) throws RemoteScrapingSessionException. Indicates whether or not screen-scraper should output a log file to the "log" folder when running this scraping session. This is true by default.
- setTimeout( int timeout ) throws RemoteScrapingSessionException. Sets the number of minutes a scraping session should be allowed to run before it automatically stops itself. The timeout value is in minutes.
- setVariable( String varName, String value ) throws RemoteScrapingSessionException. Sets a session variable in the session that will be accessible from within a screen-scraper script.
- stopServer() throws RemoteScrapingSessionException. Tells the server to stop. Note that the server cannot be started remotely.
- DataRecords getNextCachedDataRecord( String dataSetName ) throws RemoteScrapingSessionException and DataSet getNextCachedDataRecord( String dataSetName, int numRecordsToRetrive ) throws RemoteScrapingSessionException. In the case of a data set that's been cached, this allows for individual DataRecord objects to be retrieved in piecemeal fashion. This is desirable in cases where a large amount of data is to be extracted throughout the life of the scraping session, and retaining it all in memory could cause problems. DataSet objects are cached by checking the "Cache the data set" check box under the "Advanced" tab for an extractor pattern.
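A typical call sequence with the methods above can be sketched as follows. This is a minimal sketch, not an example taken from the screen-scraper distribution. So that it compiles without screen-scraper.jar, it is written against a hypothetical stand-in interface, ScrapingClient, that mirrors a handful of the documented RemoteScrapingSession methods. The variable name SEARCH_TERM, the ten-minute timeout, the zero-based record index, and the use of a Map in place of DataRecord are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the subset of RemoteScrapingSession used below.
// In real code these calls would be made directly on
// com.screenscraper.scraper.RemoteScrapingSession, whose methods throw
// RemoteScrapingSessionException rather than the generic Exception used here.
interface ScrapingClient {
    void setVariable(String varName, String value) throws Exception;
    void setTimeout(int timeoutMinutes) throws Exception;
    void scrape() throws Exception;
    boolean sessionTimedOut() throws Exception;
    int getNumDataRecordsInDataSet(String dataSetName) throws Exception;
    // Assumption: a record is modelled as a Map; the real method returns a DataRecord.
    Map<String, Object> getDataRecordFromDataSet(String dataSetName, int index) throws Exception;
}

final class ScrapeRunner {
    // Sets an input variable, runs a non-lazy scrape with a timeout,
    // then copies every record of the named data set into a list.
    static List<Map<String, Object>> runAndCollect(
            ScrapingClient client, String searchTerm, String dataSetName) throws Exception {
        client.setVariable("SEARCH_TERM", searchTerm); // hypothetical variable name
        client.setTimeout(10);                         // minutes; must precede scrape()
        client.scrape();                               // blocks until the session ends

        // sessionTimedOut() is only meaningful because setTimeout() was called first.
        if (client.sessionTimedOut()) {
            throw new IllegalStateException("scraping session timed out");
        }

        int count = client.getNumDataRecordsInDataSet(dataSetName);
        List<Map<String, Object>> records = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {              // assumes zero-based indexes
            records.add(client.getDataRecordFromDataSet(dataSetName, i));
        }
        return records;
    }
}
```

In an actual application you would construct a RemoteScrapingSession with the name of your scraping session and call the same methods on it directly, in the same order.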
Other Classes
It is also possible to store data sets and data records in session variables, which can then be accessed via the RemoteScrapingSession class. Data set objects are analogous to database result sets and data records are analogous to individual records within a result set. When an extractor pattern is applied a data set is generated. Storing the resulting data set in a session variable (within a screen-scraper script) will allow for it to be accessed via a RemoteScrapingSession.getVariable call. More information on these classes can be found in the DataRecord and DataSet API documentation pages.
Handling Scraped Data in Real Time
The com.screenscraper.scraper.DataReceiver interface allows your code to handle extracted data as it is being scraped. That is, you need not wait until the scraping session has finished before getting access to the extracted data. This interface contains a single method:
- receiveData( String key, Object value ) throws RemoteScrapingSessionException. The key portion is simply a string you'll designate in a screen-scraper script. The value parameter holds the value you pass from screen-scraper to your code.
Simply implement the DataReceiver interface in any of your own classes, then pass an instance of that class to the RemoteScrapingSession via:
- setDataReceiver( DataReceiver dataReceiver ) throws RemoteScrapingSessionException
Other useful methods include:
- DataReceiver getDataReceiver() throws RemoteScrapingSessionException. Use this to see if a DataReceiver has already been set.
- setPollFrequency( int pollFrequency ) throws RemoteScrapingSessionException. Sets the frequency in seconds with which screen-scraper should be polled for data to be sent. The default is five seconds.
- int getPollFrequency() throws RemoteScrapingSessionException. Gets the current poll frequency, in seconds.
On the screen-scraper side, whenever you'd like to send data from screen-scraper back to your code, you simply invoke the session.sendDataToClient method. Data sent through this method will show up through the receiveData method.
As a specific example, let's suppose you've created a scraping session that extracts product records from an e-commerce web site. As each product record is scraped, you could simply write it to a CSV file, but you decide instead to insert the records into your database, and determine that it would be best to write your own code to perform the insertion. In your scraping session, you might have a script that contains the following:
session.sendDataToClient( "ProductRecord", dataRecord );
You set up this script to be invoked "After each pattern application" for the extractor pattern that pulls the product information. For example, the extractor pattern might get the price, title, and weight of the product. Because the script is being invoked "After each pattern application", the current "dataRecord" object will hold all of that information. You invoke session.sendDataToClient so that each record can be processed by your code as it gets extracted.
In your Java code you create a class that implements the DataReceiver interface. You create an instance of this class and pass it to your RemoteScrapingSession object so that you can process each of the product records as they get extracted. Your receiveData method implementation might look something like this:
public void receiveData( String key, Object value )
{
    if( key.equals( "ProductRecord" ) && value instanceof DataRecord )
    {
        DataRecord productRecord = ( DataRecord )value;
        // Here you would include code that might make
        // use of an existing JDBC connection to insert
        // or update the record in your database.
    }
}
Each time you invoke session.sendDataToClient in screen-scraper, there will be a corresponding method call made to your receiveData method, which will allow you to handle each of the data pieces individually.
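The receiver side of this exchange can be sketched as a complete class. This is a minimal sketch under stated assumptions: so that it compiles without screen-scraper.jar, it declares a local stand-in DataReceiver interface with the same single method described above, and it treats each record as a java.util.Map rather than a DataRecord. The class name ProductReceiver and the snapshot method are hypothetical, and the synchronization is a precaution based on the assumption that receiveData may be called from a polling thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Local stand-in for com.screenscraper.scraper.DataReceiver. In real code,
// delete this declaration and import the interface from screen-scraper.jar.
interface DataReceiver {
    void receiveData(String key, Object value) throws Exception;
}

// Hypothetical receiver that keeps every product record it is handed.
final class ProductReceiver implements DataReceiver {
    private final List<Map<?, ?>> received = new ArrayList<>();

    @Override
    public synchronized void receiveData(String key, Object value) {
        // Ignore anything that is not a product record. Assumption: the
        // record is checked as a Map here; real code would test for DataRecord.
        if ("ProductRecord".equals(key) && value instanceof Map) {
            received.add((Map<?, ?>) value);
            // A database insert or update would go here.
        }
    }

    // Returns a copy so callers can iterate while more records arrive.
    synchronized List<Map<?, ?>> snapshot() {
        return new ArrayList<>(received);
    }
}
```

With the real library, an instance of such a class would be passed to setDataReceiver before calling scrape, and setPollFrequency could be used to change how often screen-scraper is polled for new data.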

