scrapeableFile

scrapeableFile Methods

addHTTPParameter

scrapeableFile.addHTTPParameter( HTTPParameter parameter )
Description
Dynamically adds an HTTPParameter to the current scrapeable file. The HTTPParameter constructor is as follows: HTTPParameter( String key, String value, int sequence, int type ). Valid types for the constructor are TYPE_GET, TYPE_POST, and TYPE_FILE. Calling this method will have no effect unless it's invoked before the file is scraped.
Example
// Adds a new POST HTTP parameter to the current file.
scrapeableFile.addHTTPParameter
(
 new com.screenscraper.common.HTTPParameter
 (
  "key",
  "value",
  1,
  com.screenscraper.common.HTTPParameter.TYPE_POST
 )
);

extractData

scrapeableFile.extractData( String text, String name ) (professional and enterprise editions only)

Description
Manually invokes an extractor pattern and returns the extracted data in a DataSet object. The text parameter should be a string containing the HTML you'd like to extract information from. The name parameter should be the name of an extractor pattern, in the form [scraping session]:[scrapeable file]:extractor pattern, where the scraping session and scrapeable file portions are optional. For example, if you pass in "My Scraping Session:My Scrapeable File:My Extractor Pattern", screen-scraper looks in the scraping session "My Scraping Session" for the scrapeable file "My Scrapeable File", and within that file for the extractor pattern "My Extractor Pattern". If you pass in "My Scrapeable File:My Extractor Pattern", screen-scraper looks in the currently running scraping session for the scrapeable file "My Scrapeable File" and its extractor pattern "My Extractor Pattern". If the extractor pattern is associated with the current scrapeable file, you can simply pass in its name (e.g., "My Extractor Pattern").
Example
// Applies the "PRODUCT" extractor pattern to the text found in the
// productDescriptionText variable. The resulting DataSet from
// extractData is stored in the variable productData.

import com.screenscraper.common.*;

DataSet productData = scrapeableFile.extractData( productDescriptionText, "PRODUCT" );

Example
// Expanded example that applies the "PRODUCT" extractor pattern to the text
// found in the productDescriptionText variable. The resulting DataSet from
// extractData is stored in the variable myDataSet, which may hold multiple
// data records. Each myDataRecord has a PRICE and a PRODUCT_ID.

import com.screenscraper.common.*;

myDataSet = scrapeableFile.extractData( productDescriptionText, "PRODUCT" );
for (i = 0; i < myDataSet.getNumDataRecords(); i++) {
    myDataRecord = myDataSet.getDataRecord(i);

    session.setVariable("PRICE", myDataRecord.get("PRICE"));
    session.setVariable("PRODUCT_ID", myDataRecord.get("PRODUCT_ID"));
}
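
The three accepted forms of the name parameter can be illustrated with a small stand-alone sketch. This is not screen-scraper's own lookup code: the ExtractorPatternName class and its parse method are hypothetical, and the sketch assumes that none of the names contain a colon.

```java
// Hypothetical helper, for illustration only: splits a name of the form
// [scraping session]:[scrapeable file]:extractor pattern into its parts.
class ExtractorPatternName {
    final String session; // null means "use the current scraping session"
    final String file;    // null means "use the current scrapeable file"
    final String pattern;

    ExtractorPatternName(String session, String file, String pattern) {
        this.session = session;
        this.file = file;
        this.pattern = pattern;
    }

    static ExtractorPatternName parse(String name) {
        // Assumes names themselves never contain ':'.
        String[] parts = name.split(":", -1);
        switch (parts.length) {
            case 1:
                return new ExtractorPatternName(null, null, parts[0]);
            case 2:
                return new ExtractorPatternName(null, parts[0], parts[1]);
            case 3:
                return new ExtractorPatternName(parts[0], parts[1], parts[2]);
            default:
                throw new IllegalArgumentException("Too many ':' separators in: " + name);
        }
    }

    public static void main(String[] args) {
        ExtractorPatternName n = parse("My Scrapeable File:My Extractor Pattern");
        System.out.println("session=" + n.session + ", file=" + n.file + ", pattern=" + n.pattern);
    }
}
```

The optional portions are dropped from the left, so a single name is always read as the extractor pattern and a two-part name as scrapeable file plus extractor pattern.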

See also: How to manually extract data using the session.extractData method

extractOneValue

scrapeableFile.extractOneValue( String text, String name ) (professional and enterprise editions only)
scrapeableFile.extractOneValue( String text, String name, String token ) (professional and enterprise editions, version 4.0.20a and above only)

Description
This method is similar to extractData, except that it returns a single string. The two-argument form returns the first column in the first row of the resulting DataSet object; the three-argument form returns the column named token in the first row. The text parameter should be a string containing the HTML you'd like to extract information from. The name parameter should be the name of an extractor pattern associated with the current scrapeable file. The token parameter should be the name of a token in that extractor pattern.
Example
// Applies the extractor pattern "PRODUCT_NAME" to the data found in
// the variable productDescriptionText. The extracted string is
// stored in the productName variable.
// Returns the value of the first token in the extractor pattern,
// or null if no token is found.
productName = scrapeableFile.extractOneValue( productDescriptionText, "PRODUCT_NAME" );
Example
// Applies the extractor pattern "PRODUCT_NAME" to the data found in
// the variable productDescriptionText. The extracted string is
// stored in the productName variable.
// Returns the value of the token "NAME" in the extractor pattern,
// or null if no token is found.
productName = scrapeableFile.extractOneValue( productDescriptionText, "PRODUCT_NAME", "NAME" );
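
The difference between the two forms can be sketched outside of screen-scraper by modelling the extracted rows as a list of ordered maps. The OneValueSketch class below is hypothetical and only mirrors the behavior described above; it assumes columns keep the order in which they were added, which may not match how a real DataSet orders its columns.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of "one value from the first row"; not the real DataSet API.
class OneValueSketch {
    // Mirrors the two-argument form: first column of the first row, or null.
    static String firstValue(List<Map<String, String>> rows) {
        if (rows.isEmpty() || rows.get(0).isEmpty()) {
            return null;
        }
        return rows.get(0).values().iterator().next();
    }

    // Mirrors the three-argument form: the named column of the first row, or null.
    static String namedValue(List<Map<String, String>> rows, String token) {
        if (rows.isEmpty()) {
            return null;
        }
        return rows.get(0).get(token);
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("NAME", "Widget");
        row.put("PRICE", "9.99");
        List<Map<String, String>> rows = new ArrayList<>();
        rows.add(row);

        System.out.println(firstValue(rows));          // first column of the first row
        System.out.println(namedValue(rows, "PRICE")); // column selected by token name
    }
}
```

When a pattern contains more than one token, naming the token explicitly avoids depending on column order.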

getContentAsString

scrapeableFile.getContentAsString()
Description
Gets the content that was retrieved when the scrapeable file was requested.
Example
// Sends the HTML of the current file to the log.
session.log( scrapeableFile.getContentAsString() );

getCurrentPOSTData

scrapeableFile.getCurrentPOSTData()
Description
Returns the POST data for the scrapeable file. Note that if this method is invoked after the scrapeable file is requested, the returned POST data will have all of its session variable tokens resolved.
Example
// Stores the POST data from the scrapeable file in the
// currentPOSTData variable.
currentPOSTData = scrapeableFile.getCurrentPOSTData();

getCurrentURL

scrapeableFile.getCurrentURL()
Description
Returns the URL of the scrapeable file. Note that if this method is invoked after the scrapeable file is requested, the returned URL will have all of its session variable tokens resolved.
Example
// Stores the current URL in the variable currentURL.
currentURL = scrapeableFile.getCurrentURL();
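
What "resolved" means for a URL or POST string can be sketched as a simple substitution. The example below is stand-alone and hypothetical: the TokenResolutionSketch class is not part of screen-scraper, and the ~#NAME#~ token syntax it recognizes is an assumption made for the illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of resolving session variable tokens in a string.
class TokenResolutionSketch {
    // Assumed token syntax for this example: ~#NAME#~
    private static final Pattern TOKEN = Pattern.compile("~#(\\w+)#~");

    static String resolve(String template, Map<String, String> sessionVariables) {
        Matcher matcher = TOKEN.matcher(template);
        StringBuffer resolved = new StringBuffer();
        while (matcher.find()) {
            String value = sessionVariables.get(matcher.group(1));
            // Tokens without a matching variable are left untouched in this sketch.
            String replacement = (value != null) ? value : matcher.group();
            matcher.appendReplacement(resolved, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(resolved);
        return resolved.toString();
    }

    public static void main(String[] args) {
        Map<String, String> vars = new HashMap<>();
        vars.put("PAGE", "3");
        System.out.println(resolve("http://www.foo.com/search?page=~#PAGE#~", vars));
    }
}
```

Before the request is made, the unresolved string would still contain the raw tokens, which is why the timing of the call matters.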

getName

scrapeableFile.getName()
Description
Gets the name of the current scrapeable file.
Example
// Outputs the name of the scrapeable file to the log.
session.log( "Current scrapeable file: " + scrapeableFile.getName() );

getNonTidiedHTML

scrapeableFile.getNonTidiedHTML() (enterprise edition only)
Description
If screen-scraper has been configured to retain non-tidied HTML, this method will return the original HTML sent from the web server before it was tidied by screen-scraper. This can be useful in debugging in cases where sometimes tidying succeeds and sometimes it doesn't.
Example
// Outputs the non-tidied HTML from the scrapeable file
// to the log.
session.log( "Non-tidied HTML: " + scrapeableFile.getNonTidiedHTML() );

getRetainNonTidiedHTML

scrapeableFile.getRetainNonTidiedHTML() (enterprise edition only)
Description
Indicates whether or not non-tidied HTML is to be retained for this scrapeable file. See scrapeableFile.getNonTidiedHTML for more details.
Example
// Outputs to the log whether or not non-tidied HTML is
// being retained.
session.log( "Retaining non-tidied HTML: " + scrapeableFile.getRetainNonTidiedHTML() );

getStatusCode

scrapeableFile.getStatusCode() (professional and enterprise editions only)
Description
If this method is invoked after the HTTP request has been made for a scrapeable file, it will return the HTTP status code sent by the server (e.g., 200, 403, 404, 500).
Example
// Check for a 404 response (file not found).
if( scrapeableFile.getStatusCode()==404 )
{
 session.log( "Warning! The server returned a 404 response." );
}

noExtractorPatternsMatched

scrapeableFile.noExtractorPatternsMatched()
Description
Will return true if no extractor patterns associated with the scrapeable file found a match. This can be a useful error-handling mechanism.
Example
// If no patterns matched, outputs a message indicating such
// to the session log.
if( scrapeableFile.noExtractorPatternsMatched() )
{
 session.log( "Warning! No extractor patterns matched." );
}

removeAllHTTPParameters

scrapeableFile.removeAllHTTPParameters() (professional and enterprise editions only)
Description
Removes all of the HTTP parameters from the current scrapeable file. This can be useful in cases where scrapeable files are requested multiple times and parameters are added dynamically using the addHTTPParameter method.
Example
// Removes all of the HTTP parameters from the current scrapeable file.
scrapeableFile.removeAllHTTPParameters();

removeHTTPParameter

scrapeableFile.removeHTTPParameter( int sequence )
Description
Dynamically removes the HTTPParameter identified by its sequence from the current scrapeable file. The order of the remaining parameters is adjusted automatically as soon as the method is called. (NOTE: If you call this method more than once in the same script, or use it in conjunction with the addHTTPParameter method, it is important to keep track of how the list is reordered before calling either method again.) This method can be used for both GET and POST parameters. Calling this method will have no effect unless it's invoked before the file is scraped.
Example
// Removes the eighth HTTP parameter from the current file.
scrapeableFile.removeHTTPParameter( 8 );
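
The reordering mentioned in the note can be illustrated with an ordinary list. This is a stand-alone sketch, not screen-scraper code: the ParameterSequenceSketch class is hypothetical, and it assumes sequences are 1-based positions that close up after a removal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a parameter list addressed by 1-based sequence.
class ParameterSequenceSketch {
    // Removes the parameter at the given sequence; later parameters shift down by one.
    static void removeAt(List<String> params, int sequence) {
        params.remove(sequence - 1);
    }

    // Returns the current sequence of a parameter key, or 0 if it is absent.
    static int sequenceOf(List<String> params, String key) {
        return params.indexOf(key) + 1;
    }

    public static void main(String[] args) {
        List<String> params = new ArrayList<>(Arrays.asList("user", "pass", "token", "submit"));

        removeAt(params, 2); // removes "pass"
        // "token" started at sequence 3 but is now at sequence 2.
        System.out.println("token is now at sequence " + sequenceOf(params, "token"));
        System.out.println(params);
    }
}
```

When removing several parameters in one script, working from the highest sequence down keeps the lower sequences unchanged between calls.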

saveFileBeforeTidying

scrapeableFile.saveFileBeforeTidying( String filePath ) (professional and enterprise editions only)
Description
Calling this method will cause screen-scraper to output to filePath the original HTML sent from the web server before it was tidied by screen-scraper. This can be useful in debugging in cases where sometimes tidying succeeds and sometimes it doesn't. This method must be called before the file is scraped.
Example
// Causes the non-tidied HTML from the scrapeable file
// to be output to the file path.
scrapeableFile.saveFileBeforeTidying( "C:/non-tidied.html" );

saveFileOnRequest

scrapeableFile.saveFileOnRequest( String pathToSaveTo ) (enterprise edition only)
Description
Causes the file to be saved to the local file system after being requested by screen-scraper. This method must be called before the file is scraped. That is, the script calling this method should be associated with the scrapeable file, and should be invoked "Before file is scraped". Note that the preferred method for downloading files to the file system is session.downloadFile, but this method is useful in cases where a POST request is required to request the file. For example, if you'd like to download and save a PDF that is accessible only through a POST request it would be appropriate to use this method.
Example
// When the current file is requested it will be saved to the
// local file system as "sample.pdf".
scrapeableFile.saveFileOnRequest( "C:/downloaded_files/sample.pdf" );

setContentType

scrapeableFile.setContentType( String contentType ) (professional and enterprise editions only)
Description
In certain rare cases it may be necessary to explicitly set the content type of the POST data of an HTTP request. This may be required in cases where a site is using AJAX, and the POST payload of a request is sent as XML (e.g., using the setRequestEntity method). This method must be invoked before the HTTP request is made (e.g., "Before file is scraped" for a scrapeable file).
Example
// Sets the type of the POST entity to XML.
scrapeableFile.setContentType( "text/xml" );

setReferer

scrapeableFile.setReferer( String url ) (professional and enterprise editions only)
Description
Dynamically sets the HTTP header referer for the current scrapeable file. This method must be called before the file is scraped. That is, the script calling this method should be associated with the scrapeable file, and should be invoked "Before file is scraped".
Example
// Sets "http://www.foo.com/" as the HTTP header
// referer for the current scrapeable file.
scrapeableFile.setReferer( "http://www.foo.com/" );

setRequestEntity

scrapeableFile.setRequestEntity( String requestEntity ) (professional and enterprise editions only)
Description
Sets the complete value that will be sent in the POST payload portion of the request. This method allows you to set the entity portion of a POST request that would otherwise be set by designating parameters under the "Parameters" tab for a scrapeable file. This is rarely necessary, but can be useful in cases where an entire string of XML must be sent (e.g., in many AJAX applications).
Example
// Sets the request entity to an XML document.
scrapeableFile.setRequestEntity( "<outerNode><innerNode>my data</innerNode></outerNode>" );

setRetainNonTidiedHTML

scrapeableFile.setRetainNonTidiedHTML( boolean retainNonTidiedHTML ) (enterprise edition only)
Description
Sets whether or not non-tidied HTML is to be retained for the current scrapeable file. This defaults to false. See scrapeableFile.getNonTidiedHTML for more details.
Example
// Tells screen-scraper to retain non-tidied HTML for the current
// scrapeable file.
scrapeableFile.setRetainNonTidiedHTML( true );

setUserAgent

scrapeableFile.setUserAgent( String userAgent ) (professional and enterprise editions only)
Description
In certain rare cases it may be desirable to explicitly set the "User-Agent" header screen-scraper will send for a given HTTP request. That is, screen-scraper will identify itself as if it were a specific web browser. If unspecified, the user agent "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)" will be used. Note that this method must be invoked before the file is scraped.
Example
// Causes screen-scraper to identify itself as a Mozilla (Gecko-based)
// browser running on Linux.
scrapeableFile.setUserAgent( "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020826" );

wasErrorOnRequest

scrapeableFile.wasErrorOnRequest()
Description
Indicates whether the server responded with a status code outside the 200 or 300 range, or whether the connection to the server timed out. Each time a server responds to a request made by screen-scraper, it sends back a three-digit code indicating the status of the response. Responses in the 200 or 300 range indicate that there was no error in the transaction. Responses in the 400 or 500 range indicate some kind of error. This method returns true when such an error occurs.
Example
// If an error occurred when the file was requested, an error
// message indicating such gets output to the log.
if( scrapeableFile.wasErrorOnRequest() )
{
 session.log( "An error occurred when the file was requested." );
}
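
The rule described above can be written out as a small stand-alone check. This is a sketch of the stated rule only, not screen-scraper's implementation; the RequestErrorSketch class and its isError method are hypothetical names.

```java
// Hypothetical sketch of the error rule: anything outside 200-399, or a timeout.
class RequestErrorSketch {
    static boolean isError(int statusCode, boolean timedOut) {
        if (timedOut) {
            return true; // no usable response was received
        }
        boolean successOrRedirect = statusCode >= 200 && statusCode < 400;
        return !successOrRedirect;
    }

    public static void main(String[] args) {
        int[] codes = { 200, 302, 404, 500 };
        for (int code : codes) {
            System.out.println(code + " -> error: " + isError(code, false));
        }
        System.out.println("timeout -> error: " + isError(0, true));
    }
}
```

To distinguish between kinds of errors, for example a 404 from a 500, check scrapeableFile.getStatusCode() in addition to this method.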