Invoking screen-scraper from Java
Overview
screen-scraper must be running as a server before you invoke it from a Java class. If you haven't already, please read the documentation section on running screen-scraper as a server. For an example of using the Java driver, see Tutorial 2: Extending Hello World.
A Java application or servlet interacts with screen-scraper via the class com.screenscraper.scraper.RemoteScrapingSession. To use this class, include the "screen-scraper.jar" and "lib\log4j.jar" files in your CLASSPATH.
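One way to confirm that the CLASSPATH is set up correctly is to try loading the driver class by name before using it. The following is a minimal sketch; the class name ClasspathCheck and its helper method are hypothetical and not part of screen-scraper.

```java
/** Hypothetical helper: checks whether a class can be loaded from the classpath. */
class ClasspathCheck {

    /** Returns true if the named class is found; the class is not initialized. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String driver = "com.screenscraper.scraper.RemoteScrapingSession";
        if (isOnClasspath(driver)) {
            System.out.println("Found " + driver);
        } else {
            System.out.println("Missing " + driver
                    + "; add screen-scraper.jar and lib\\log4j.jar to the CLASSPATH");
        }
    }
}
```

Running this with and without the two jar files on the CLASSPATH shows whether the driver class is visible to your application.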
Methods
The RemoteScrapingSession class has only a handful of methods, which are documented below:
RemoteScrapingSession identified by identifier. If this constructor is called, the default host (localhost) and port (8778) are used.
RemoteScrapingSession identified by identifier, and connecting to the server found at host listening on port.
dataSetName.
DataRecord specified by the index found in the DataSet named by dataSetName.
fileToReadFrom. More details on this method can be found here.
scrape() throws RemoteScrapingSessionException. Causes the session to scrape. This is equivalent to clicking the "Run Scraping Session" button from within screen-scraper on the "General" tab for a scraping session.
scrape method returns to determine whether or not a scraping session timed out. This method may only return true if the setTimeout method was called prior to calling scrape.
resource/conf/screen-scraper.properties.
setTimeout( int timeout ) throws RemoteScrapingSessionException. Sets the number of minutes a scraping session is allowed to run before it automatically stops itself.
DataRecord objects to be retrieved in piecemeal fashion. This is desirable when a large amount of data is extracted over the life of the scraping session and retaining it all in memory could cause problems. To cache a DataSet, check the "Cache the data set" check box under the "Advanced" tab for an extractor pattern.
Other Classes
It is also possible to store data sets and data records in session variables, which can then be accessed via the RemoteScrapingSession class. Data sets are analogous to database result sets, and data records are analogous to individual records within a result set. When an extractor pattern is applied, a data set is generated. Storing the resulting data set in a session variable (within a screen-scraper script) allows it to be accessed via a RemoteScrapingSession getVariable call.
The data record class (com.screenscraper.common.DataRecord) simply extends Java's Hashtable (java.util.Hashtable). The methods of the data set class (com.screenscraper.common.DataSet) are documented below:
DataRecord objects as an ArrayList of DataRecords.
DataRecord at position dataRecordNumber containing data extracted from a single application of an ExtractorPattern.
DataRecords held by this object.
identifier from the DataRecord at dataRecordNumber.
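Because DataRecord extends Hashtable, its fields can be read with ordinary Hashtable calls. The sketch below uses a plain Hashtable and ArrayList as stand-ins for a DataRecord and the list of records held by a DataSet, so it runs without screen-scraper.jar. The class name DataRecordSketch and the field names TITLE and PRICE are hypothetical; actual field names depend on the tokens in your extractor pattern.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;

/** Hypothetical sketch: reading fields from Hashtable-based records. */
class DataRecordSketch {

    /** Formats each record as "title: price"; missing fields become "?". */
    static List<String> summarize(List<Hashtable<String, Object>> records) {
        List<String> lines = new ArrayList<>();
        for (Hashtable<String, Object> record : records) {
            // A DataRecord is a Hashtable, so standard lookups apply.
            Object title = record.getOrDefault("TITLE", "?");
            Object price = record.getOrDefault("PRICE", "?");
            lines.add(title + ": " + price);
        }
        return lines;
    }

    public static void main(String[] args) {
        Hashtable<String, Object> record = new Hashtable<>();
        record.put("TITLE", "Widget");
        record.put("PRICE", "9.99");

        List<Hashtable<String, Object>> records = new ArrayList<>();
        records.add(record);

        summarize(records).forEach(System.out::println);
    }
}
```

In real code, the records would come from a DataSet retrieved from the scraping session rather than being built by hand.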
Handling Scraped Data in Real Time
The com.screenscraper.scraper.DataReceiver interface allows your code to handle extracted data as it is being scraped. That is, you need not wait until the scraping session has finished before getting access to the extracted data. This interface contains a single method:
key portion is simply a string you'll designate in a screen-scraper script. The value parameter holds the value you pass from screen-scraper to your code.
Simply implement the DataReceiver interface in any of your own classes, then pass an instance of that class to the RemoteScrapingSession via:
setDataReceiver( DataReceiver dataReceiver ) throws RemoteScrapingSessionException
Other useful methods include:
DataReceiver has already been set.
On the screen-scraper side, whenever you'd like to send data from screen-scraper back to your code, you simply invoke the session.sendDataToClient method. Data sent through this method will show up through the receiveData method.
As a specific example, suppose you've created a scraping session that extracts product records from an e-commerce web site. You could simply write each product record to a CSV file as it is scraped, but you decide instead to insert the records into your database, using your own code to perform the insertion. In your scraping session, you might have a script that contains the following:
session.sendDataToClient( "ProductRecord", dataRecord );
You set up this script to be invoked "After each pattern application" for the extractor pattern that pulls the product information. For example, the extractor pattern might get the price, title, and weight of the product. Because the script is being invoked "After each pattern application", the current "dataRecord" object will hold all of that information. You invoke session.sendDataToClient so that each record can be processed by your code as it gets extracted.
In your Java code you create a class that implements the DataReceiver interface. You create an instance of this class and pass it to your RemoteScrapingSession object so that you can process each of the product records as they get extracted. Your receiveData method implementation might look something like this:
public void receiveData( String key, Object value ) throws RemoteScrapingSessionException
Each time you invoke session.sendDataToClient in screen-scraper, there will be a corresponding method call made to your receiveData method, which will allow you to handle each of the data pieces individually.
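A fuller version of such a receiver might look like the sketch below. It assumes the value arrives as the DataRecord that was sent and, because DataRecord extends Hashtable, treats it as a Map. The class name ProductRecordReceiver, the getProducts method, and the TITLE field are hypothetical. To keep the sketch compilable without screen-scraper.jar, the class does not declare that it implements DataReceiver; in real code it would.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical receiver for "ProductRecord" messages. In real code this class
 * would declare "implements com.screenscraper.scraper.DataReceiver".
 */
class ProductRecordReceiver {

    // Records received so far; real code might insert each one into a database.
    private final List<Map<String, Object>> products = new ArrayList<>();

    // Same name and parameters as receiveData in DataReceiver. The throws
    // clause is omitted, which an implementing method is allowed to do.
    public void receiveData(String key, Object value) {
        if (!"ProductRecord".equals(key)) {
            return; // ignore keys this receiver does not handle
        }
        if (value instanceof Map) {
            // Copy the fields so later changes to the record do not affect us.
            Map<String, Object> copy = new LinkedHashMap<>();
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                copy.put(String.valueOf(entry.getKey()), entry.getValue());
            }
            products.add(copy);
        }
    }

    public List<Map<String, Object>> getProducts() {
        return products;
    }

    public static void main(String[] args) {
        // A plain Hashtable stands in for the DataRecord sent by the script.
        Hashtable<String, Object> record = new Hashtable<>();
        record.put("TITLE", "Widget");

        ProductRecordReceiver receiver = new ProductRecordReceiver();
        receiver.receiveData("ProductRecord", record);
        System.out.println("Records received: " + receiver.getProducts().size());
    }
}
```

Checking the key first lets a single receiver handle several kinds of messages, since every call to session.sendDataToClient arrives through the same method.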