When writing scripts within screen-scraper, there are a number of objects and methods available to you. The Using Scripts page provides an overview of working with scripts, whereas this page provides details on specific objects and methods you'll use when scripting within screen-scraper.
The API documentation emphasizes Interpreted Java, as Java is the language in which screen-scraper proper is written. That should not deter you from using whatever language you desire; all the methods are available in whatever language you choose.
The examples given here assume you're using Interpreted Java as the scripting language, but there should be very little difference in syntax if you decide to use another language. For example, if you're scripting in VBScript, you would simply omit the semi-colon at the end of each line, and for methods that don't return a value you would precede them with the VBScript keyword Call (either that, or omit the parentheses around the method parameters).
The screen-scraper internal API has been divided into three groups for convenience.
The two main groups are the scraping engine and the proxy server. The various objects available in these sections are exclusive to running screen-scraper in one of these two ways. The one exception is the RunnableScrapingSession, which has been grouped with the scraping engine simply because it is unlikely to be needed or used with the proxy server.
The utilities are available to scripts run in either the scraping engine or the proxy server and have since been separated from both. These represent classes that we have written to simplify some common tasks that are performed with retrieved data.
There are many additional classes that are available through Java Libraries that we did not create/modify that are especially worthy of note. Regardless of the language that you are using to program in screen-scraper you can have access to these.
There are a few other APIs to be aware of. They are particular to dealing with screen-scraper in certain ways or certain versions. Make sure that you understand the implications of using these APIs before you start playing with them.
The scraping engine is the backbone of screen-scraper and provides four built-in objects. These objects are: session, scrapeableFile, dataSet, and dataRecord. We have also included the RunnableScrapingSession class as it best pertains to the engine.
For details on which objects are available to scripts in the context of a scrape see the variable scope section of the documentation.
The dataRecord object is populated using the names of tokens from extractor patterns.
This object gives access to the most recently extracted data record. This will most likely only be used in scripts that get accessed after each time an extractor pattern is applied. This object simply extends Hashtable (documentation on its methods can be found in Java's documentation).
You'll find a few of the most commonly used methods below. DataRecord objects can also be created from scratch, and subsequently added to DataSet objects using the addDataRecord method.
See example usage: Iterate over DataSets & DataRecords.
Create a new DataRecord object.
This method does not receive any parameters.
Returns DataRecord object.
Version | Description |
---|---|
4.5 | Available for all editions. |
com.screenscraper.common.DataRecord
See additional example usage: Iterate over DataSets & DataRecords.
Get the value of a DataRecord field.
Returns the value associated with the specified key. Usually it will be a string but, if you have manually added fields, it can be an integer, boolean, long, or other object.
Version | Description |
---|---|
4.5 | Available for all editions. |
Add a new field to the DataRecord or update the value of an existing field.
Returns the value previously associated with the specified key. If the key did not exist then it will return null.
Version | Description |
---|---|
4.5 | Available for all editions. |
See additional example usage: Iterate over DataSets & DataRecords.
Remove a field from the DataRecord.
Returns the value previously associated with the specified key. If the key did not exist then it will return null.
Version | Description |
---|---|
4.5 | Available for all editions. |
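
Because the dataRecord object extends Hashtable, its field methods follow the usual Hashtable contract for return values. The sketch below uses a plain `java.util.Hashtable` as a stand-in for a DataRecord so that it can run outside screen-scraper; the class name and the field names (`TITLE`, `QUANTITY`, `PRICE`) are made up for illustration.

```java
import java.util.Hashtable;

public class DataRecordSketch {
    // Stand-in for a DataRecord; the real class extends Hashtable.
    public static Hashtable<String, Object> sampleRecord() {
        Hashtable<String, Object> record = new Hashtable<>();
        record.put("TITLE", "Widget");               // extracted values are usually strings
        record.put("QUANTITY", Integer.valueOf(3));  // manually added fields may be other types
        return record;
    }

    public static void main(String[] args) {
        Hashtable<String, Object> record = sampleRecord();
        // put returns the previous value, or null if the key was new.
        System.out.println(record.put("TITLE", "Gadget"));
        System.out.println(record.put("PRICE", "9.99"));
        // get returns the value currently stored under the key.
        System.out.println(record.get("TITLE"));
        // remove returns the removed value, or null if the key was absent.
        System.out.println(record.remove("QUANTITY"));
        System.out.println(record.remove("QUANTITY"));
    }
}
```

Inside a screen-scraper script the same calls would presumably be made on the built-in dataRecord variable rather than on a Hashtable created by hand.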
The dataSet object holds all data records extracted by an extractor pattern after it has been applied as many times as possible to the HTML retrieved by a scrapeable file. A data set is analogous to a result or record set that would be returned from a database query. A data set contains any number of data records, which are analogous to rows in a database.
The dataSet object provides methods to aid in getting at the information that has been gathered.
See example usage: Iterate over DataSets & DataRecords.
Manually create a DataSet.
Returns DataSet object.
Version | Description |
---|---|
4.5 | Available for all editions. |
com.screenscraper.common.DataSet
See additional example usage: Iterate over DataSets & DataRecords.
Add a DataRecord to a DataSet.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
See additional example usage: Iterate over DataSets & DataRecords.
Remove all DataRecord objects from the DataSet.
This method does not receive any parameters.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
See additional example usage: Iterate over DataSets & DataRecords.
Remove a DataRecord from the DataSet.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
Retrieve a field's value in a data set based on another field.
Returns the value in the returned column, usually a string (unless records have been manually added). If no match is found, null is returned.
Version | Description |
---|---|
5.0 | Added for all editions. |
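
As a rough illustration of looking up one field's value based on another, the sketch below models a data set as a list of Hashtables and returns the requested field from the first record whose matching field equals the given value. The helper name `lookup`, the first-match behavior, and the sample fields are assumptions made for this example; they are not taken from the screen-scraper API.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;

public class DataSetLookupSketch {
    // Builds two hypothetical records, each with a TITLE and a PRICE field.
    public static List<Hashtable<String, Object>> sampleRecords() {
        List<Hashtable<String, Object>> records = new ArrayList<>();
        records.add(record("Widget", "9.99"));
        records.add(record("Gadget", "19.99"));
        return records;
    }

    private static Hashtable<String, Object> record(String title, String price) {
        Hashtable<String, Object> r = new Hashtable<>();
        r.put("TITLE", title);
        r.put("PRICE", price);
        return r;
    }

    // Returns returnField from the first record whose matchField equals
    // matchValue, or null when no record matches.
    public static Object lookup(List<Hashtable<String, Object>> records,
                                String matchField, Object matchValue,
                                String returnField) {
        for (Hashtable<String, Object> record : records) {
            if (matchValue.equals(record.get(matchField))) {
                return record.get(returnField);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Hashtable<String, Object>> records = sampleRecords();
        System.out.println(lookup(records, "TITLE", "Gadget", "PRICE"));
        System.out.println(lookup(records, "TITLE", "Gizmo", "PRICE"));
    }
}
```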
Get a single piece of data held by a DataRecord in the DataSet.
Returns the value associated with the DataRecord identifier. It will be a string unless you have added values to the DataRecord whose values are not strings.
Version | Description |
---|---|
4.5 | Available for all editions. |
Get all DataRecords in the DataSet.
This method does not receive any parameters.
Returns an ArrayList of DataRecord objects.
Version | Description |
---|---|
4.5 | Available for all editions. |
This method is provided as a convenience. The recommended way to iterate over data records in a data set is to use getNumDataRecords and getDataRecord.
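
The index-based loop built on getNumDataRecords and getDataRecord might look like the following sketch. `StandInDataSet` is a hypothetical minimal class written only so the example runs on its own; it mimics the three method names mentioned in this documentation and is not the real `com.screenscraper.common.DataSet`.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;

public class DataSetIterationSketch {
    // Hypothetical stand-in exposing only the methods used in the loop.
    public static class StandInDataSet {
        private final List<Hashtable<String, Object>> records = new ArrayList<>();

        public void addDataRecord(Hashtable<String, Object> record) {
            records.add(record);
        }

        public int getNumDataRecords() {
            return records.size();
        }

        public Hashtable<String, Object> getDataRecord(int index) {
            return records.get(index);
        }
    }

    // Collects the value of one field from every record, in order.
    public static List<Object> collect(StandInDataSet dataSet, String field) {
        List<Object> values = new ArrayList<>();
        for (int i = 0; i < dataSet.getNumDataRecords(); i++) {
            Hashtable<String, Object> record = dataSet.getDataRecord(i);
            values.add(record.get(field));
        }
        return values;
    }

    public static void main(String[] args) {
        StandInDataSet dataSet = new StandInDataSet();
        Hashtable<String, Object> first = new Hashtable<>();
        first.put("TITLE", "Widget");
        dataSet.addDataRecord(first);
        Hashtable<String, Object> second = new Hashtable<>();
        second.put("TITLE", "Gadget");
        dataSet.addDataRecord(second);
        System.out.println(collect(dataSet, "TITLE"));
    }
}
```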
Get the character set being applied to the scraped data.
This method does not receive any parameters.
Returns the character set applied to the scraped data, as a string. If a character set has not been specified then it will default to the character set specified in the settings dialog box.
Version | Description |
---|---|
5.0 | Added for all editions. |
Get one DataRecord in the DataSet.
Returns a DataRecord (Hashtable object). If there is not a DataRecord at the specified index an error will be thrown.
Version | Description |
---|---|
4.5 | Available for all editions. |
Get the first non-null value, in a data set, for a given token.
Returns the first non-null value in the column, usually a string (unless records have been manually added). If none is found, null is returned.
Version | Description |
---|---|
5.0 | Added for all editions. |
Get the number of DataRecords in the DataSet.
This method does not receive any parameters.
Returns the number of DataRecords in the DataSet, as an integer.
Version | Description |
---|---|
4.5 | Available for all editions. |
Merge data records from two data sets.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
Set the character set to be used for rendering dataSet values.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
This will only change the character set on the current data set. If you want it to be changed for all data sets, you would need to change it in the settings dialog box or screen-scraper.properties file.
Get the number of DataRecords in the DataSet.
This method does not receive any parameters.
Returns the number of DataRecords in the DataSet, as an integer.
Version | Description |
---|---|
6.0.3a | Available for all editions. |
Write DataSet string and integer contents to a file. The fields will be tab-delimited and records hard-return delimited.
Returns void. If the file cannot be written to then an error will be thrown.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
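
To make the output layout concrete, the sketch below formats records as tab-delimited fields with one record per line and writes the result to a temporary file. It is an approximation under stated assumptions: the column order is supplied by the caller, values that are neither strings nor integers are written as empty cells, and the class and method names are invented for this example. How the real method orders columns or treats other value types is not described here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Hashtable;
import java.util.List;

public class TabDelimitedSketch {
    // Renders each record as one line, with fields separated by tabs.
    // Assumption: only String and Integer values are written; anything
    // else (or a missing field) becomes an empty cell.
    public static String format(List<Hashtable<String, Object>> records,
                                List<String> columns) {
        StringBuilder out = new StringBuilder();
        for (Hashtable<String, Object> record : records) {
            List<String> cells = new ArrayList<>();
            for (String column : columns) {
                Object value = record.get(column);
                boolean writable = value instanceof String || value instanceof Integer;
                cells.add(writable ? String.valueOf(value) : "");
            }
            out.append(String.join("\t", cells)).append("\n");
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        Hashtable<String, Object> record = new Hashtable<>();
        record.put("TITLE", "Widget");
        record.put("QUANTITY", Integer.valueOf(3));
        List<Hashtable<String, Object>> records = new ArrayList<>();
        records.add(record);

        String text = format(records, Arrays.asList("TITLE", "QUANTITY"));
        Path file = Files.createTempFile("dataset", ".txt");
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
        System.out.println("Wrote " + file);
    }
}
```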
This object contains various methods used to log information about a running scraping session to log files, the workbench "Log" tab, and the web interface.
Creates an automatic progress bar and adds it to the progress bars. These progress bars match their progress to a value from a session variable and a list of values. When web messages are output with the webDebug, webInfo, webWarn, or webError methods, a progress bar will be drawn to give a visual representation of the current progress of the scrape.
Note that when using auto progress bars, it is advised to not use any manually monitored ones, as it can cause conflicts. Anytime an auto progress bar has no session variable set for its monitored key, it deletes itself and all children progress bars (including manual ones). As long as you keep that in mind, it should be safe to use both types together.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.31a | Available in enterprise edition. |
5.5.43a | Moved from session to log class. |
Watches for all session variables whose keys end with the postfix specified, and will output their values when monitored variables are logged.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.42a | Moved from session to log class. |
Watches for all session variables whose keys begin with the prefix specified, and will output their values when monitored variables are logged.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.42a | Moved from session to log class. |
Adds a specific name and value to be logged with the web messages methods or logMonitoredValues method
The previous value associated with the name, or null if there wasn't one
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.42a | Moved from session to log class. |
Watches the value of a session variable, and will output it each time monitored variables are output
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.42a | Moved from session to log class. |
Adds a new progress bar. If no progress bar exists, this will be set as the root; otherwise it will be the child of the lowest progress bar. When web messages are output with the webDebug, webInfo, webWarn, or webError methods, a progress bar will be drawn to give a visual representation of the current progress of the scrape. The addProgressBarIfNotStopped versions add the progress bar only if the scrape has not been stopped, which is useful for determining when a scrape was stopped.
This method returns a reference to the new progress bar, which can be used to update the current progress
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.31a | Available in enterprise edition. |
5.5.43a | Moved from session to log class. |
Appends a status message to be displayed in the web interface.
None
Version | Description |
---|---|
5.5.32a | Available in Enterprise edition. |
5.5.43a | Moved from session to log class. |
Adds a file to the cache. This can be used to add anything to the cache, from a text file to an image that was downloaded, or any other file that would be useful.
A File that represents the cached file.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
5.5.43a | Moved from session to log class. |
Caches the HTML and headers of the scrapeable file. This will include both the request and response headers.
A File that represents the cached file.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
5.5.43a | Moved from session to log class. |
Adds text to the cache. This will create a new text file in the cache and store the given content in it.
A File that represents the cached file.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
5.5.43a | Moved from session to log class. |
Write message to the log.
Returns void.
Version | Description |
---|---|
5.5 | Now accepts any Object as a message |
4.5 | Available for all editions. |
When the workbench is running, this will be found under the log tab for the scraping session. When screen-scraper is running in server mode, the message will get sent to the corresponding .log file found in screen-scraper's log folder. When screen-scraper is invoked from the command line, the message will get sent to standard out.
Enables caching for this scrape. When caching is enabled, each time a scrapeable file is downloaded it will be saved to the file system. Once the session is completed the cache will be either zipped or the directory renamed, depending on the conditions that were specified when the cache was enabled. Optionally this will save the log files to the cached location, and will save everything from the error.log file that was added while the cache was enabled.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
5.5.32a | Renamed from enableCache to enableCaching |
5.5.43a | Moved from session to log class. |
Ends the caching for the scrape. This method will be called once all the scripts and files are run/scraped. It can be called in a script to end the caching early (thereby only caching a portion of the scrape). This only deals with saving downloaded content to the file system, not with reading it back in during a scrape.
This method takes no parameters
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
5.5.32a | Renamed from endCache to endCaching. |
5.5.43a | Moved from session to log class. |
Write message to the log.
Returns void.
Version | Description |
---|---|
5.5 | Now accepts any Object as a message |
4.5 | Available for all editions. |
When the workbench is running, this will be found under the log tab for the scraping session. When screen-scraper is running in server mode, the message will get sent to the corresponding .log file found in screen-scraper's log folder. When screen-scraper is invoked from the command line, the message will get sent to standard out.
Returns whether or not the cache is enabled for the scrape. When enabled, it simply means that each ScrapeableFile will save the content it downloads from the server to the file system so it can be viewed later, generally for debugging purposes.
This method takes no parameters
Returns true if caching is currently enabled for this session
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.32a | Available enterprise and professional editions (Returns false in basic edition, but doesn't throw an Exception). Renamed from getCacheEnabled to getCachingEnabled. |
5.5.43a | Moved from session to log class. |
Returns the specified progress bar. If the index is given, returns the progress bar at that index (0 is the root, 1 is the first child, and so on). If the title is given, returns the most recently added progress bar with the given title.
The ProgressBar indicated, or null if none was found matching the required criteria
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.31a | Available in enterprise edition. |
5.5.43a | Moved from session to log class. |
Write message to the log.
Returns void.
Version | Description |
---|---|
5.5 | Now accepts any Object as a message |
4.5 | Available for all editions. |
When the workbench is running, this will be found under the log tab for the scraping session. When screen-scraper is running in server mode, the message will get sent to the corresponding .log file found in screen-scraper's log folder. When screen-scraper is invoked from the command line, the message will get sent to standard out.
Write message to the log.
Returns void.
Version | Description |
---|---|
5.5 | Now accepts any Object as a message |
4.5 | Available for all editions. |
When the workbench is running, this will be found under the log tab for the scraping session. When screen-scraper is running in server mode, the message will get sent to the corresponding .log file found in screen-scraper's log folder. When screen-scraper is invoked from the command line, the message will get sent to standard out.
Logs all the values in a Data Record to the log, with one line per value. If a value in the record is a List, Set, Map, Data Set, Scrapeable File, or Exception, it will have detailed output as well.
This method returns nothing
Version | Description |
---|---|
5.5.26a | Available in all editions. |
5.5.43a | Moved from session to log class. |
The output from the above call might look something like this:
    DataRecord
    --- A_FLOAT : 3.14159
    --- A_LIST : List
    ------ Element 0 : Value 1
    ------ Element 1 : Value 2
    ------ Element 2 : Value 3
    ------ Element 3 : Set
    --------- Element : A value
    --------- Element : More value
    --------- Element : Other stuff
    --- A_MAP : Map
    ------ KEY_1 : 1
    ------ KEY_2 : 2
    ------ KEY_3 : 3
    --- A_SET : Set Logged above as "------ Element 3 : "
    --- A_STRING : Screen-Scraper
    --- AN_INT : 5
Logs an Exception, with a full stack trace, at the Error level
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.43a | Moved from session to log class. |
Logs the values of all the currently monitored variables, the progress of the scrape, if known, and puts the message at the top. Also logs any additional values being watched. Logs values at the specified level.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.43a | Moved from session to log class. |
Logs closing values to indicate the scrape is complete and what values were when everything finished. It will log at whatever the highest level logged to was. For instance, if a webWarn had been logged during the scrape, this will log at the warning level.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.43a | Moved from session to log class. |
Logs the Object in a semi-intelligent way. For example, Maps are logged as key-value pairs, lists are logged with one element per line, and all elements of a set are logged. Objects that aren't a standard type of data set or list are simply logged using String.valueOf().
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.43a | Moved from session to log class. |
Logs useful information about the current instance of Screen-Scraper, as well as the Java VM and the General Utility version being used. Information will be logged as an info message in the web interface (when running in server mode) and the log.
This method takes no parameters
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.43a | Moved from session to log class. |
Stops watching for a postfix in session variables
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.43a | Moved from session to log class. |
Stops watching for a prefix in session variables
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.43a | Moved from session to log class. |
Removes a specific name from the manually set values to be logged. Doesn't affect the value of session variables
The previous value associated with the name, or null if there wasn't one
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.43a | Moved from session to log class. |
Stops watching the specified variable
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.43a | Moved from session to log class. |
Removes the specified progress bar. The removeProgressBarIfNotStopped version removes the progress bar if the scrape has not been stopped, which is useful for determining when a scrape was stopped.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.31a | Available in enterprise edition. |
5.5.43a | Moved from session to log class. |
Write message to the log.
Returns void.
Version | Description |
---|---|
5.5 | Now accepts any Object as a message |
4.5 | Available for all editions. |
When the workbench is running, this will be found under the log tab for the scraping session. When screen-scraper is running in server mode, the message will get sent to the corresponding .log file found in screen-scraper's log folder. When screen-scraper is invoked from the command line, the message will get sent to standard out.
Logs closing values to indicate the scrape is complete and what values were when everything finished. It will log at whatever the highest level logged to was. For instance, if a webWarn had been logged during the scrape, this will log at the warning level. When running in Professional edition, this simply outputs to the log.
Using this method is preferred over logMonitoredValuesClose (which only logs to the log), because if at a later point the scrape is run in server mode for enterprise edition, a useful message is output in the web interface without needing to modify the scrape.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
5.5.43a | Moved from session to log class. |
Logs a debug message to the web interface status message area. Uses the message header as the top of the message, and then logs all currently monitored session variables underneath as well as the current progress (if known) of the scrape. Also outputs the message to the log. When running in Professional edition, this simply outputs to the log.
Using this method is preferred over logMonitoredValues (which only logs to the log), because if at a later point the scrape is run in server mode for enterprise edition, a useful message is output in the web interface without needing to modify the scrape.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
5.5.43a | Moved from session to log class. |
Logs an error message to the web interface status message area. Uses the message header as the top of the message, and then logs all currently monitored session variables underneath as well as the current progress (if known) of the scrape. Also outputs the message to the log. When running in Professional edition, this simply outputs to the log.
Using this method is preferred over logMonitoredValues (which only logs to the log), because if at a later point the scrape is run in server mode for enterprise edition, a useful message is output in the web interface without needing to modify the scrape.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
5.5.43a | Moved from session to log class. |
Logs an info message to the web interface status message area. Uses the message header as the top of the message, and then logs all currently monitored session variables underneath as well as the current progress (if known) of the scrape. Also outputs the message to the log. When running in Professional edition, this simply outputs to the log.
Using this method is preferred over logMonitoredValues (which only logs to the log), because if at a later point the scrape is run in server mode for enterprise edition, a useful message is output in the web interface without needing to modify the scrape.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
5.5.43a | Moved from session to log class. |
Logs a warning message to the web interface status message area. Uses the message header as the top of the message, and then logs all currently monitored session variables underneath as well as the current progress (if known) of the scrape. Also outputs the message to the log. When running in Professional edition, this simply outputs to the log.
Using this method is preferred over logMonitoredValues (which only logs to the log), because if at a later point the scrape is run in server mode for enterprise edition, a useful message is output in the web interface without needing to modify the scrape.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
5.5.43a | Moved from session to log class. |
This is a class that can be instantiated within a script in order to run a scraping session.
The Maximum number of concurrent running scraping sessions in the settings dialog box will control how many scraping sessions can be run simultaneously.
Initiates a RunnableScrapingSession object using the name of an existing scraping session.
Returns a RunnableScrapingSession. On failure an error will be thrown.
Version | Description |
---|---|
5.0 | inheritHttpState added as optional parameter. |
4.5 | Available for professional and enterprise editions. |
com.screenscraper.scraper
Retrieve the name of the scraping session in the runnableScrapingSession.
This method does not receive any parameters.
Returns a string with the name of the scraping session.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
Get the timeout of the session in the runnableScrapingSession.
This method does not receive any parameters.
Returns an integer representing the timeout length in minutes.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
Retrieve the value of a session variable. This method should be called after the scrape method has returned.
Returns the value of the session variable: object, boolean, int, string, etc. If the variable doesn't exist it returns null.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
Run the scraping session.
This method does not receive any parameters.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
The default is for the script to continue executing without waiting for the scraping session to finish. You can use setDoLazyScrape to force the script to wait until the scrape finishes before continuing.
Indicate whether or not the scraping session should run concurrently with (at the same time as) other scraping sessions. The default for doLazyScrape is true.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
We recommend not setting this value to false! When running scraping sessions in the workbench, it will cause the interface to freeze up until sessions have completed.
If you'd like to run multiple scraping sessions serially (one after another), the best option is to set the Maximum number of concurrent running scraping sessions to 1 in the settings window.
Sets the timeout of the session. That is, after the given number of minutes have passed the session will automatically terminate.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
This method must be called before scrape.
Set the value of a session variable.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
The scrapeableFile object refers to the current file being requested from a given server. It houses both the request for a file and the response, and can be manipulated to meet any necessary requirements: GET and POST parameters, referer information, cookies, FILE parameters, HTTP headers, character set, and such.
Dynamically adds a GET parameter to the URL of the current scrapeable file. If a parameter with the given sequence already exists, it will be replaced by the one created from this method call. Calling this method is the equivalent in the workbench of adding a parameter under the "Parameters" tab, and designating the type as GET. Once the scraping session is completed the original HTTP parameters (those under the "Parameters" tab in the workbench) will be restored.
None
Version | Description |
---|---|
5.5.32a | Available in Professional and Enterprise editions. |
Add an HTTP header to be sent along with the request.
Returns void. If you are not using enterprise edition it will throw an error.
Version | Description |
---|---|
5.0 | Available for professional and enterprise edition. |
4.5 | Available for enterprise edition. |
In certain rare cases it may be necessary to explicitly add a custom header of the POST data of an HTTP request. This may be required in cases where a site is using AJAX, and the POST payload of a request is sent as XML (e.g., using the setRequestEntity method). This method must be invoked before the HTTP request is made (e.g., "Before file is scraped" for a scrapeable file).
Dynamically add an HTTPParameter to the current scrapeable file.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
The HTTPParameter constructor is as follows: HTTPParameter( String key, String value, int sequence, String type ). Valid types for the constructor are GET, POST, and FILE. Calling this method will have no effect unless it's invoked before the file is scraped.
Dynamically adds a POST parameter to the existing set of POST parameters. If a parameter with the given sequence already exists, it will be replaced by the one created from this method call. If the method call is used that doesn't take a sequence, the new POST parameter will carry a sequence just higher than the highest existing sequence. Calling this method is the equivalent in the workbench of adding a parameter under the "Parameters" tab, and designating the type as POST. Once the scraping session is completed the original HTTP parameters (those under the "Parameters" tab in the workbench) will be restored.
None
Version | Description |
---|---|
5.5.32a | Available in Professional and Enterprise editions. |
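
The sequence rules described above (an existing sequence is replaced, and a parameter added without a sequence is placed just after the highest one) can be modeled with a sorted map. This is only a sketch of that behavior; the class `PostParameterSketch` and its methods are hypothetical, no URL encoding is performed, and starting at sequence 1 when the set is empty is an assumption.

```java
import java.util.Map;
import java.util.TreeMap;

public class PostParameterSketch {
    // Parameters keyed by sequence; TreeMap keeps them in sequence order.
    private final TreeMap<Integer, String[]> bySequence = new TreeMap<>();

    // Adds a parameter at an explicit sequence, replacing any parameter
    // that already occupies that sequence.
    public void add(String key, String value, int sequence) {
        bySequence.put(sequence, new String[] {key, value});
    }

    // Adds a parameter just after the highest existing sequence.
    // Assumption: an empty set starts at sequence 1.
    public void add(String key, String value) {
        int next = bySequence.isEmpty() ? 1 : bySequence.lastKey() + 1;
        add(key, value, next);
    }

    // Joins the parameters in sequence order as key=value pairs.
    public String render() {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<Integer, String[]> entry : bySequence.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(entry.getValue()[0]).append('=').append(entry.getValue()[1]);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        PostParameterSketch params = new PostParameterSketch();
        params.add("user", "alice", 1);
        params.add("page", "1", 2);
        params.add("page", "2", 2); // replaces the parameter at sequence 2
        params.add("sort", "asc");  // lands at sequence 3
        System.out.println(params.render());
    }
}
```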
Manually apply an extractor pattern to a string.
Returns DataSet on success. Failures will be written out to the log as errors.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
An example of how to manually extract data is available.
Manually retrieve the value of a single extractor token.
Returns the match from the last data record, as a string, on success. On failure it returns null and writes an error to the log.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
If you want it to be from the first data record you could use getDataRecord.
Gets the ASPX .NET values from the string. The standard values are __VIEWSTATE, __EVENTTARGET, __EVENTVALIDATION, and __EVENTARGUMENT. Values will be stored in the returned DataRecord as ASPX_VIEWSTATE, ASPX_EVENTTARGET, etc...
A DataRecord object with each ASPX name as ASPX_[NAME] mapped to its value. Note that when onlyStandard is false, any parameter whose name starts with __ will be returned in this DataRecord.
Version | Description |
---|---|
5.5.26a | Available in all editions. |
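
The naming scheme above (`__VIEWSTATE` becomes `ASPX_VIEWSTATE`, and so on) can be illustrated with a small regular-expression sketch. This is not screen-scraper's implementation: the class name, the pattern, and the assumption that each hidden input lists its `name` attribute before its `value` attribute are all choices made for this example, and a Hashtable stands in for the returned DataRecord.

```java
import java.util.Arrays;
import java.util.Hashtable;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AspxValuesSketch {
    // The four standard field names listed in the description above.
    private static final List<String> STANDARD = Arrays.asList(
        "__VIEWSTATE", "__EVENTTARGET", "__EVENTVALIDATION", "__EVENTARGUMENT");

    // Matches name="__SOMETHING" followed, within the same tag, by value="...".
    private static final Pattern HIDDEN_FIELD =
        Pattern.compile("name=\"(__\\w+)\"[^>]*?value=\"([^\"]*)\"");

    // Returns a map of ASPX_[NAME] to value. When onlyStandard is false,
    // every field whose name starts with __ is included.
    public static Hashtable<String, Object> extract(String html, boolean onlyStandard) {
        Hashtable<String, Object> values = new Hashtable<>();
        Matcher matcher = HIDDEN_FIELD.matcher(html);
        while (matcher.find()) {
            String name = matcher.group(1);
            if (onlyStandard && !STANDARD.contains(name)) {
                continue;
            }
            // Drop the leading "__" and add the ASPX_ prefix.
            values.put("ASPX_" + name.substring(2), matcher.group(2));
        }
        return values;
    }

    public static void main(String[] args) {
        String html =
            "<input type=\"hidden\" name=\"__VIEWSTATE\" id=\"__VIEWSTATE\" value=\"abc123\" />"
            + "<input type=\"hidden\" name=\"__VIEWSTATEGENERATOR\" value=\"CA0B0334\" />";
        System.out.println(extract(html, true));
        System.out.println(extract(html, false).size());
    }
}
```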
Retrieve the authentication expectation of the request.
This method does not receive any parameters.
Returns a boolean indicating whether the scrapeable file expects to have to authenticate, in which case it sends the authentication information with the initial request instead of waiting for the server to ask for it.
Version | Description |
---|---|
5.0 | Available for all editions. |
Get the character set being used in the page response rendering.
This method does not receive any parameters.
Returns the character set applied to the scraped page, as a string. If a character set has not been specified then it will default to the character set specified in settings dialog box.
Version | Description |
---|---|
4.5 | Available for all editions. |
If you are having trouble with characters displaying incorrectly, we encourage you to read about how to go about finding a solution using one of our FAQs.
Retrieve contents of the response.
This method does not receive any parameters.
Returns contents of the last response, as a string. If the file has not been scraped it will return an empty string.
Version | Description |
---|---|
4.5 | Available for all editions. |
Retrieve the POST payload type being used to interpret the page. This can be important when scraping some sites' implementations of AJAX, where the payload is explicitly set as XML.
This method does not receive any parameters.
Returns the content type, as a string (e.g., text/html or text/xml).
Version | Description |
---|---|
5.0 | Available for all editions. |
Retrieve the POST data.
This method does not receive any parameters.
Returns the POST data for the scrapeable file, as a string. If called after the file has been scraped the session variable tokens will be resolved to their values; otherwise, the tokens will simply be removed from the string.
Version | Description |
---|---|
4.5 | Available for all editions. |
Get the URL of the file.
This method does not receive any parameters.
Returns the URL of the scrapeable file, as a string. If called after the file has been scraped the session variable tokens will be resolved to their values; otherwise, the tokens will simply be removed from the string.
Version | Description |
---|---|
4.5 | Available for all editions. |
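The resolved-versus-removed behavior described for tokens in the URL and POST data can be pictured with the following stand-alone sketch. It assumes session variable tokens are written as `~#NAME#~`. The class `TokenResolutionSketch`, the `scraped` flag, and the choice to substitute an empty string for an unset variable are hypothetical and only illustrate the idea.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; not how screen-scraper resolves tokens internally. */
class TokenResolutionSketch {
    // Assumed token syntax: ~#NAME#~
    private static final Pattern TOKEN = Pattern.compile("~#(\\w+)#~");

    static String resolve(String template, Map<String, String> sessionVariables, boolean scraped) {
        Matcher m = TOKEN.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Before the file is scraped, tokens are simply removed.
            String value = scraped ? sessionVariables.get(m.group(1)) : null;
            m.appendReplacement(out, Matcher.quoteReplacement(value == null ? "" : value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```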
Indicates whether or not the most recent extractor pattern application timed out.
None
Version | Description |
---|---|
5.5.36a | Available in all editions. |
Determine whether or not the contents of this response are being forced to be recognized as non-binary.
This method does not receive any parameters.
Returns true if the scrapeable file is being forced to be treated as non-binary; otherwise, it returns false.
Version | Description |
---|---|
5.0 | Added for all editions. |
Gets the value of the header in the response of the scrapeable file, or returns null if it couldn't be found.
The value of the header, or null if not found
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
Gets the header section of the HTTP Response
This method takes no parameters
A String containing the HTTP Response Headers
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
Gets the headers of the HTTP Response as a map, and returns them.
This method takes no parameters
A Map from header name to header value for the response headers.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
Indicates whether or not the most recent attempt to tidy the HTML failed.
None
Version | Description |
---|---|
5.5.36a | Available in all editions. |
Indicates whether or not the maximum attempts to request a given scrapeable file were reached.
None
Version | Description |
---|---|
5.5.36a | Available in all editions. |
Retrieve the kilobyte limit for information retrieved by the scrapeable file, any additional information will not be retrieved.
This method does not receive any parameters.
Returns the current kilobyte limit on the response, as an integer.
Version | Description |
---|---|
5.0 | Added for professional and enterprise editions. |
Get the name of the scrapeable file.
This method does not receive any parameters.
Returns the name of the scrapeable file, as a string.
Version | Description |
---|---|
4.5 | Available for all editions. |
Retrieve the non-tidied HTML of the scrapeable file.
This method does not receive any parameters.
Returns the non-tidied contents of the scrapeable file, as a string. On failure it returns null.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
By default non-tidied html is not retained. For this method to return anything other than null you must use setRetainNonTidiedHTML to force non-tidied html to be retained.
Gets an array of strings containing the redirect URLs for the current scrapeable file request attempt.
This method does not receive any parameters.
Returns the array of strings; may be empty.
Version | Description |
---|---|
6.0.24a | Available in Professional and Enterprise editions. |
Determine if the scrapeable file is set to retain non-tidied html.
This method does not receive any parameters.
Returns boolean flag for non-tidied contents being retained.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Returns the retry policy. Note that in any 'After file is scraped' scripts this is null.
This method takes no parameters.
The Retry Policy that will be used by this scrapeable file
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
Determine the HTTP status code sent by the server.
This method does not receive any parameters.
Returns integer corresponding to the HTTP status code of the response.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
Retrieve the name of the user agent making the request.
This method does not receive any parameters.
Returns the user agent, as a string.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
Determine if an input or output error occurred when requesting file.
This method does not receive any parameters.
Returns true if an error has occurred; otherwise, it returns false.
Version | Description |
---|---|
5.0 | Added for all editions. |
This method should be run after the scrapeable file has been scraped.
Determine whether any extractor patterns associated with the scrapeable file found a match.
This method does not receive any parameters.
Returns boolean corresponding to whether any extractor pattern matched in the scrapeable file.
Version | Description |
---|---|
4.5 | Available for all editions. |
Remove all of the HTTP parameters from the current scrapeable file.
This method does not receive any parameters.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
Remove an HTTP header from a scrapeable file.
Returns void.
Version | Description |
---|---|
5.0.5a | Introduced for enterprise edition. |
Dynamically removes an HTTPParameter. The order of the remaining parameters is adjusted immediately.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
5.5.32a | Added method call that takes a String. Available for Professional and Enterprise editions. |
If calling this method more than once in the same script, when used in conjunction with the addHTTPParameter method, it is important to keep track of how the list is reordered before calling either method again.
Calling this method will have no effect unless it's invoked before the file is scraped.
This method can be used for both GET and POST parameters.
Resequences an HTTP parameter.
None
Version | Description |
---|---|
5.5.32a | Available in Professional and Enterprise editions. |
Resolves a relative URL to an absolute URL based on the current URL of this scrapeable file.
Returns string containing the complete url to the file. On failure it will return the relative path and an error will be written to the log.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
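For readers who want to see what relative-to-absolute resolution looks like, the following stand-alone sketch uses the standard `java.net.URI.resolve` method. It is an illustration of the general technique, not screen-scraper's own code. The class name `RelativeUrlSketch` is hypothetical, and the fallback of returning the relative path unchanged simply mirrors the failure behavior described above.

```java
import java.net.URI;

/** Illustrative sketch of resolving a relative URL against the current URL. */
class RelativeUrlSketch {
    static String resolve(String currentUrl, String relativeUrl) {
        try {
            return URI.create(currentUrl).resolve(relativeUrl).toString();
        } catch (IllegalArgumentException e) {
            // On failure, report the problem and hand back the relative path as-is.
            System.err.println("Could not resolve " + relativeUrl + ": " + e.getMessage());
            return relativeUrl;
        }
    }
}
```

With a current URL of `http://www.example.com/products/list.php`, the relative URL `detail.php?id=5` resolves to `http://www.example.com/products/detail.php?id=5`, and `../images/a.png` resolves to `http://www.example.com/images/a.png`.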
Write non-tidied contents of the scrapeable file response to a text file.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
This method must be called before the file is scraped.
Because the response headers are also saved in the file, if the file is anything except a text file it will not be valid (e.g., images, PDFs).
Save the file returned from a scrapeable file request.
Returns void.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
This method must be called from a scrapeable file before the file is scraped. Do not call this method from a script which is invoked by other means such as after an extractor pattern match or from within another script.
It is preferable to use downloadFile; however, at times you may have to send POST parameters in order to access a file. If that is the case, you would use this method.
This method cannot save local file requests to another location.
Set the authentication expectation of the request.
Returns void.
Version | Description |
---|---|
5.0 | Available for all editions. |
Set the character set used in a specific scrapeable file's response renderings. This can be particularly helpful when the page renders characters incorrectly.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
This method must be called before the file is scraped.
If you are having trouble with characters displaying incorrectly, we encourage you to read about how to go about finding a solution using one of our FAQs.
Set POST payload type. This is particularly helpful when scraping some sites' implementations of AJAX, where the payload is explicitly set as XML.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
This method must be called before the file is scraped.
This method is usually used in connection with setRequestEntity as that method specifies the content of the POST data.
Set content type header to multipart/form-data.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
This method must be called before the file is scraped.
Occasionally a site will expect a multi-part request when a file is not being sent in the request.
If you include a file upload parameter under the parameters tab of the scrapeable file the request will automatically be multi-part.
Set whether or not the contents of this response should be forced to be treated as non-binary. Default forceNonBinary value is false.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
This is provided in the case where screen-scraper misidentifies a non-binary file as a binary file. It doesn't happen often but is possible.
Determines whether or not a POST request should be forced.
Returns void.
Version | Description |
---|---|
6.0.14a | Available in Professional and Enterprise editions. |
Sets the request type to use.
ScrapeableFile.RequestType is an enum with the following options as values
If the method sets the request to one of those types, all parameters set as GET in the parameters tab will be appended to the URL (as usual) and all parameters set as POST will be used to build the request entity. If there are POST values on a type that doesn't support a request entity, an exception will be thrown when the request is issued.
Returns void.
Version | Description |
---|---|
6.0.55a | Available in Professional and Enterprise editions. |
Overwrite the content of the "last response"
Returns void.
This method must be called from an extractor pattern before the pattern is run.
Limit the amount of information retrieved by the scrapeable file. This method can be useful in cases of very large responses where the desired information is found in the first portion of the response. It can also help to make the scraping process more efficient by only downloading the needed information.
Returns void.
Version | Description |
---|---|
5.0 | Added for professional and enterprise editions. |
This method must be called before the file is scraped.
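The effect of a kilobyte limit on a response can be pictured as reading from the response stream only until the limit is reached. The sketch below is a stand-alone illustration under stated assumptions: the class `ResponseLimitSketch` is hypothetical, one kilobyte is taken to be 1024 bytes, and nothing here reflects how screen-scraper actually enforces the limit.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Illustrative sketch of capping how much of a response is read. */
class ResponseLimitSketch {
    static byte[] readAtMost(InputStream in, int limitKilobytes) {
        int limit = limitKilobytes * 1024; // assumption: 1 KB = 1024 bytes
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (out.size() < limit) {
                // Never ask for more than the remaining allowance.
                int n = in.read(buffer, 0, Math.min(buffer.length, limit - out.size()));
                if (n < 0) {
                    break; // end of stream before the limit was reached
                }
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```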
Set referer HTTP header.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
This method must be called before the file is scraped.
Set POST payload data. This is particularly helpful when scraping some sites' implementations of AJAX, where the payload is explicitly set as XML.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
This method must be called before the file is scraped.
This method is usually used in connection with setContentType as that method specifies the content type of the POST data.
Though you can set plain text POST data using this method it is preferable to use the addHTTPParameter method for this task.
Set whether or not non-tidied HTML is to be retained for the current scrapeable file.
Returns void.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
If, after the file is scraped, you want to be able to use getNonTidiedHTML this method has to be called before the file is scraped.
Sets a Retry Policy that will be run to check if a page should be re-downloaded or not. The policy will be checked after all the extractors have run, and will check for an error on the page based on a set of conditions. If the policy shows an error on the page, it can run scripts or other code to attempt to remedy the situation, and then it will rescrape the file.
The file will be re-downloaded without rerunning any of the scripts that run before the file is downloaded, and before any of the scripts marked to run after the file is scraped. If there is any change that needs to be made to session variables/headers, etc... they should be made in the script or runnable that will be executed. Also, the policy can specify that session variables should be restored to their previous values before the file is rescraped. If it does, they will be reset after the error checking portion of the policy but before the policy runs the code to make changes before a retry.
The retry policy should be set in a script run 'Before file is scraped', but can also be set by a script on an extractor pattern. If it is set on an extractor pattern, session variables will not be restored if the retry is required.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
Explicitly state the user agent making the request.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
This method must be called before the file is scraped.
Determine if an error occurred with the request. Errors are considered to be server timeouts as well as any status code outside of the range 200-399.
This method does not receive any parameters.
Returns true for server timeouts as well as any status code outside of the range 200-399; otherwise, it returns false.
Version | Description |
---|---|
4.5 | Available for all editions. |
This method must be called after the file is scraped.
If you want to know what the status code was you can use getStatusCode.
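The error rule stated above (a server timeout, or any status code outside 200-399) reduces to a one-line check. The sketch below restates it as stand-alone Java. The class `RequestErrorSketch` and its parameters are hypothetical and are not the method's actual implementation.

```java
/** Illustrative restatement of the documented error rule. */
class RequestErrorSketch {
    static boolean wasError(boolean timedOut, int statusCode) {
        // A timeout is always an error; otherwise only 200 through 399 count as success.
        return timedOut || statusCode < 200 || statusCode > 399;
    }
}
```

Under this rule a 302 redirect is not an error, while a 404 or 500 is.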
This object refers to the current scraping session that is running. To make the methods a little easier to sort through they have been grouped into related methods. The groups have been named to ease in finding them when they are needed.
The following methods are provided to aid you in setting up an anonymous scraping session. If you are using your own server proxy pool you will use the methods to allow screen-scraper to interact with and manage your proxy pool. If you are using automatic anonymization then the only method you will use is currentProxyServerIsBad as screen-scraper will manage the servers using the anonymization settings from your setup.
See an example of Anonymization via Manual Proxy Pools.
Remove proxy server from proxy pool. This is only used with anonymization and indicates that one server in the pool is bad and should be removed.
This method does not receive any parameters.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
If you are using automatic anonymization or manual proxy pools, a new proxy server will be created as a result of the method call.
When checking whether a request you have made is invalid, it is best not to rely on the HTTP status code (e.g., 404) alone, as status codes are not always accurate. It is recommended that you also scrape a known string (e.g., "Not found") from the response HTML that validates the status code.
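One way to read this advice is to treat a request as confirmed invalid only when both signals agree. The stand-alone sketch below shows that check. The class `InvalidRequestSketch`, its parameters, and the decision to require both signals are assumptions made for illustration. In a real script the status code and response text would come from the scrapeable file.

```java
/** Illustrative sketch: confirm a bad status code with a known string in the response. */
class InvalidRequestSketch {
    static boolean isConfirmedInvalid(int statusCode, String responseHtml, String knownErrorText) {
        boolean errorStatus = statusCode < 200 || statusCode > 399;
        boolean errorTextFound = responseHtml != null && responseHtml.contains(knownErrorText);
        // Only trust the status code when the page content backs it up.
        return errorStatus && errorTextFound;
    }
}
```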
Get the current proxy server from the proxy server pool.
This method does not receive any parameters.
Returns the current proxy server being used.
Version | Description |
---|---|
4.5 | Available for all editions. |
Holds the proxy server pool object that allows proxies to be cycled through.
Returns true if there is an available proxy server pool.
Version | Description |
---|---|
4.5 | Available for all editions. |
Determine whether proxies are set to be terminated when the scrape ends.
This method does not receive any parameters.
Returns true if a proxy will be terminated; otherwise, it returns false.
Version | Description |
---|---|
5.0 | Available for all editions. |
Determine whether proxies are being used from proxy pool.
This method does not receive any parameters.
Returns true if a proxy pool is being used; otherwise, it returns false.
Version | Description |
---|---|
4.5 | Available for all editions. |
Associate a proxy pool with a scraping session.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
Manually set the outcome of proxies when the scrape ends.
Returns void.
Version | Description |
---|---|
5.0 | Available for all editions. |
Set whether proxies from a proxyServerPool should be used when making scrapeable file requests.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
If you are already going through a proxy server, screen-scraper must be told the credentials in order to get out to the internet. These methods are all provided to manually tell screen-scraper how to get through your external proxy.
If you always go through the same external proxy you would probably want to set the credentials in screen-scraper's proxy settings so that you don't have to specify them in all of your scrapes.
Retrieve the external NT proxy domain.
This method does not receive any parameters.
Returns the external NT domain, as a string.
Version | Description |
---|---|
5.0 | Added for all editions. |
Retrieve the external NT proxy host.
This method does not receive any parameters.
Returns the external NT host, as a string.
Version | Description |
---|---|
5.0 | Added for all editions. |
Retrieve the external NT proxy password.
This method does not receive any parameters.
Returns the external NT password, as a string.
Version | Description |
---|---|
5.0 | Added for all editions. |
Retrieve the external NT proxy username.
This method does not receive any parameters.
Returns the external NT username, as a string.
Version | Description |
---|---|
5.0 | Added for all editions. |
Retrieve the external proxy host.
This method does not receive any parameters.
Returns the external host, as a string.
Version | Description |
---|---|
5.0 | Available for all editions. |
Retrieve the external proxy password.
This method does not receive any parameters.
Returns the external password, as a string.
Version | Description |
---|---|
5.0 | Available for all editions. |
Retrieve the external proxy port.
This method does not receive any parameters.
Returns the external port, as a string.
Version | Description |
---|---|
5.0 | Available for all editions. |
Retrieve the external proxy username.
This method does not receive any parameters.
Returns the external username, as a string.
Version | Description |
---|---|
5.0 | Available for all editions. |
Manually set external NT proxy domain.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
If you are using this method on all of your scripts you might want to set it in screen-scraper's external NT proxy settings.
If you are using NTLM (Windows NT) authentication you'll need to designate settings for both the standard external proxy as well as the external NT proxy.
Manually set external NT proxy host/domain.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
If you are using this method on all of your scripts you might want to set it in screen-scraper's external NT proxy settings.
If you are using NTLM (Windows NT) authentication you'll need to designate settings for both the standard external proxy as well as the external NT proxy.
Manually set external NT proxy password.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
If you are using this method on all of your scripts you might want to set it in screen-scraper's external NT proxy settings.
If you are using NTLM (Windows NT) authentication you'll need to designate settings for both the standard external proxy as well as the external NT proxy.
Manually set external NT proxy username.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
If you are using this method on all of your scripts you might want to set it in screen-scraper's external NT proxy settings.
If you are using NTLM (Windows NT) authentication you'll need to designate settings for both the standard external proxy as well as the external NT proxy.
Manually set external proxy host/domain.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
If you are using this method on all of your scripts you might want to set it in screen-scraper's external proxy settings.
Manually set external proxy password.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
If you are using this method on all of your scripts you might want to set it in screen-scraper's external proxy settings.
Manually set external proxy port.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
If you are using this method on all of your scripts you might want to set it in screen-scraper's external proxy settings.
Manually set external proxy username.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
If you are using this method on all of your scripts you might want to set it in screen-scraper's external proxy settings.
Logging is a great tool for ensuring that your scrapes are working correctly and for troubleshooting problems that arise. Logging large amounts of information may slow down a scrape, but the best way around this is not to remove the log calls; instead, reduce the verbosity of the logging when running the scrape in a production environment. If you do this, be aware that it becomes harder to troubleshoot some problems should they arise.
The number of methods provided is merely to enhance your ability to log information according to importance.
Get the name of the current log file.
This method does not receive any parameters.
Returns the name of the log file, as a string.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
This method can be very helpful when screen-scraper is running in server mode and you are tracking the log where the scrape of a record is located, or for tracking the location of errors in larger scrapes.
Write message to the log.
Returns void.
Version | Description |
---|---|
5.5 | Now accepts any Object as a message |
4.5 | Available for all editions. |
When the workbench is running, this will be found under the log tab for the scraping session. When screen-scraper is running in server mode, the message will get sent to the corresponding .log file found in screen-scraper's log folder. When screen-scraper is invoked from the command line, the message will get sent to standard out.
Write current date and time to log (at most verbose level). It is formatted to be human readable.
This method does not receive any parameters.
Returns void. If an error occurs, an error will be thrown.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
Write current time to log (at most verbose level). The time is formatted to be human readable.
This method does not receive any parameters.
Returns void. If an error occurs, an error will be thrown.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
Write message to the log, at the debug level (most verbose).
Returns void.
Version | Description |
---|---|
5.5 | Now accepts any Object as a message |
4.5 | Available for professional and enterprise editions. |
Write scrape run time to the log (at most verbose level). It is formatted to be human readable, including breaking it into days, hours, minutes, and seconds.
This method does not receive any parameters.
Returns void. If an error occurs, an error will be thrown.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
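Breaking a run time into days, hours, minutes, and seconds is simple integer arithmetic. The stand-alone sketch below shows one way to do it. The class `ElapsedTimeSketch` and the exact wording of the output are assumptions, and screen-scraper's actual log format may differ.

```java
/** Illustrative sketch of splitting an elapsed time into days, hours, minutes, and seconds. */
class ElapsedTimeSketch {
    static String format(long elapsedMillis) {
        long totalSeconds = elapsedMillis / 1000;
        long days = totalSeconds / 86400;             // 86400 seconds per day
        long hours = (totalSeconds % 86400) / 3600;   // remainder of the day, in hours
        long minutes = (totalSeconds % 3600) / 60;    // remainder of the hour, in minutes
        long seconds = totalSeconds % 60;
        return days + " days, " + hours + " hours, " + minutes + " minutes, " + seconds + " seconds";
    }
}
```

For example, 93,784,000 milliseconds is formatted as "1 days, 2 hours, 3 minutes, 4 seconds".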
Write message to the log, at the error level (least verbose).
Returns void. If an error occurs, an error will be thrown.
Version | Description |
---|---|
5.5 | Now accepts any Object as a message |
4.5 | Available for professional and enterprise editions. |
Write message to the log, at the info level (second most verbose).
Returns void. If an error occurs, an error will be thrown.
Version | Description |
---|---|
5.5 | Now accepts any Object as a message |
4.5 | Available for professional and enterprise editions. |
Write all session variables to log.
This method does not receive any parameters.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
Write message to the log, at the warn level (third most verbose).
Returns void. If an error occurs, an error will be thrown.
Version | Description |
---|---|
5.5 | Now accepts any Object as a message |
4.5 | Available for professional and enterprise editions. |
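The four logging levels form an ordering from most verbose (debug) to least verbose (error). The stand-alone sketch below shows how such an ordering is typically used to decide whether a message is written at a given verbosity setting. The class `LogLevelSketch` and its `shouldWrite` method are hypothetical and are not screen-scraper's implementation.

```java
/** Illustrative sketch of verbosity filtering across the four log levels. */
class LogLevelSketch {
    // Declared from most verbose to least verbose, matching the descriptions above.
    enum Level { DEBUG, INFO, WARN, ERROR }

    /** A message is written when its level is at or above the configured verbosity. */
    static boolean shouldWrite(Level messageLevel, Level configuredVerbosity) {
        return messageLevel.ordinal() >= configuredVerbosity.ordinal();
    }
}
```

With verbosity set to WARN, error and warn messages are written while info and debug messages are skipped, which is why lowering verbosity in production speeds up a scrape without removing any log calls.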
These methods are used in connection with the web interface of screen-scraper. Their use will provide the interface with more detailed information regarding the state of a running scrape. If you are not running the scrapes using the web interface then these methods are not particularly helpful to you.
As the web interface is an enterprise edition feature, these methods are only available to enterprise edition users.
Add to the value of duplicate records scraped. (As opposed to new or error records.)
Returns void.
Version | Description |
---|---|
7.0 | Available for enterprise edition. |
Add to the value of error records. (As opposed to duplicate or new records.)
Returns void.
Version | Description |
---|---|
7.0 | Available for enterprise edition. |
Add to the value of new records scraped. (As opposed to duplicate or error records.)
Returns void.
Version | Description |
---|---|
7.0 | Available for enterprise edition. |
Add to the value of number of records scraped.
Returns void.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Append an error message to any existing error messages.
Returns void.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Get the current error message.
This method does not receive any parameters.
Returns current error message, as a string.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Determine the fatal error status of the scrape.
This method does not receive any parameters.
Returns whether a fatal error has occurred, as a boolean.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Get the number of records that have been scraped.
This method does not receive any parameters.
Returns number of records scraped, as an integer.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Reset the count on the number of scraped records.
This method does not receive any parameters.
Returns void.
Version | Description |
---|---|
5.0 | Available for all editions. |
Set the current error message.
Returns void.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Set the fatal error status of the scrape.
Returns void.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Set the number of records that have been scraped.
Returns void.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Add a runnable that will be executed at the given time.
Note: session.addEventCallback is automatically executed at a priority of 0.
Returns void.
Version | Description |
---|---|
6.0.55a | Introduced for pro and enterprise editions. |
The EventFireTime is an interface which defines the methods that a fire time must have and so the addEventCallback method can take different types of fire times.
A number of different types of classes based on this interface have been defined for you which call out the various parts of a scrape that you can add event handlers to. Those are defined below.
Version | Description |
---|---|
6.0.55a | Introduced for pro and enterprise editions. |
*Note: When using the Async HTTP client you will have access to the request builder from ScrapeableFileEventData.getRedirectRequestBuilder() which can be used to modify and adjust the request before it is sent. If you use the Apache HTTP client the getRedirectRequestBuilder() method will always return null.
Version | Description |
---|---|
6.0.55a | Introduced for pro and enterprise editions. |
Returns the RedirectToURL value for the object.
This method does not receive any parameters.
Returns the RedirectToURL value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Version | Description |
---|---|
6.0.55a | Introduced for pro and enterprise editions. |
*Note: Calling a setVariable or getVariable method in here WILL trigger the events for those again. Avoid infinite recursion please!
Version | Description |
---|---|
6.0.55a | Introduced for pro and enterprise editions. |
Version | Description |
---|---|
6.0.55a | Introduced for pro and enterprise editions. |
Creates an EventHandler callback object which will be called when the event triggers
Version | Description |
---|---|
6.0.55a | Introduced for pro and enterprise editions. |
Returns the name of the handler. This method doesn't need to be implemented but helps with debugging.
This method does not receive any parameters.
Returns the name of the handler. This method doesn't need to be implemented but helps with debugging.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Processes the event, and potentially returns a useful value modifying something in the internal code as defined by the EventFireTime used to launch this event.
Returns a value based on which AbstractEventData class is used.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
The AbstractEventData class is an abstract class which allows for the accessing of various data values found within screen-scraper.
AbstractEventData is extended by the following classes and it is those classes that should be used in place of AbstractEventData.
Returns the LastReturnValue for the object. This is the value previously returned by another callback. This can be null, if no callbacks have been fired yet for this event. A null value is also the default return value for the given event.
This method does not receive any parameters.
Returns the LastReturnValue for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Sets the LastReturnValue for the object.
Returns void.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
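The role of the LastReturnValue is easiest to see when several callbacks are registered for the same event: each one can inspect what the previous one returned. The stand-alone sketch below models that chaining with hypothetical stand-in types. `EventData`, `Handler`, and `fire` are invented for this illustration and are not the screen-scraper classes, and the way the engine actually passes values between callbacks is an assumption.

```java
import java.util.List;

/** Illustrative sketch of chaining callbacks through a "last return value". */
class CallbackChainSketch {
    /** Hypothetical stand-in for the event data handed to each callback. */
    static class EventData {
        private Object lastReturnValue; // null until some callback has returned a value

        Object getLastReturnValue() { return lastReturnValue; }

        void setLastReturnValue(Object value) { lastReturnValue = value; }
    }

    /** Hypothetical stand-in for an event handler. */
    interface Handler {
        Object handleEvent(EventData data);
    }

    /** Runs the handlers in order, exposing each result to the next handler. */
    static Object fire(List<Handler> handlers) {
        EventData data = new EventData();
        for (Handler handler : handlers) {
            data.setLastReturnValue(handler.handleEvent(data));
        }
        // With no handlers this is still null, matching the default described above.
        return data.getLastReturnValue();
    }
}
```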
ExtractorPatternEventData extends AbstractEventData
This contains the data for various extractor pattern operations
Inherits the following methods from AbstractEventData
Returns the status of the extractor pattern timeout. Returns true if and only if the extractor pattern was applied and timed out while doing so. Otherwise it will return false.
This method does not receive any parameters.
Returns a boolean value representing the status of the extractor pattern timeout.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the DataRecord value for the object.
This method does not receive any parameters.
Returns the DataRecord value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the DataSet value for the object.
This method does not receive any parameters.
Returns the DataSet value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the ExtractorPattern value for the object.
This method does not receive any parameters.
Returns the ExtractorPattern value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the ScrapeableFile value for the object.
This method does not receive any parameters.
Returns the ScrapeableFile value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the Session value for the object.
This method does not receive any parameters.
Returns the Session value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
ScrapeableFileEventData extends AbstractEventData
This contains the data for various scrapeable file operations
Inherits the following methods from AbstractEventData
Returns the HttpResponseData for the object.
This method does not receive any parameters.
Returns the HttpResponseData for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the RedirectRequestBuilder for the object. Use this to add headers, etc., for the redirect. It can be null, depending on the HTTP client being used and whether or not it supports manually handling the redirect.
This method does not receive any parameters.
Returns the RedirectRequestBuilder for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the ScrapeableFile value for the object.
This method does not receive any parameters.
Returns the ScrapeableFile value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the Session value for the object.
This method does not receive any parameters.
Returns the Session value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
ScriptEventData extends AbstractEventData
This contains the data for various script operations
Inherits the following methods from AbstractEventData
Returns the DataRecord value for the object.
This method does not receive any parameters.
Returns the DataRecord value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the DataSet value for the object.
This method does not receive any parameters.
Returns the DataSet value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the ScrapeableFile value for the object.
This method does not receive any parameters.
Returns the ScrapeableFile value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the ScriptException for the object.
This method does not receive any parameters.
Returns the ScriptException for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the ScriptName value for the object.
This method does not receive any parameters.
Returns the ScriptName value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the Session value for the object.
This method does not receive any parameters.
Returns the Session value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
SessionEventData extends AbstractEventData
This contains the data for various session operations
Inherits the following methods from AbstractEventData
Returns the IncrementRecordsAmount value for the object.
This method does not receive any parameters.
Returns the IncrementRecordsAmount value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the Session value for the object.
This method does not receive any parameters.
Returns the Session value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the VariableName value for the object.
This method does not receive any parameters.
Returns the VariableName value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the VariableValue value for the object.
This method does not receive any parameters.
Returns the VariableValue value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
StringEventData extends AbstractEventData
This contains the data for various string operations
Inherits the following methods from AbstractEventData
Returns the Input value for the object.
This method does not receive any parameters.
Returns the Input value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Add to the value of a session variable.
Returns void. If the variable doesn't exist, or is not a string or integer, a message will be added to the log. If it cannot add to the variable for any other reason it will write an error to the log.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
Pause scrape and display breakpoint window. If the scrape is running in server mode, to avoid the break, logVariables will be called in place of breakpoint.
This method does not receive any parameters.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
Remove all session variables.
This method does not receive any parameters.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
Clear stored cookies.
This method does not receive any parameters.
Returns void.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Clears the value of all session variables that match the keys in the Map. This will ignore a key of DATARECORD.
This method accepts a Map or Collection rather than a List or Set so that it works more easily with the setSessionVariables method.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.43a | Changed from session.removeSessionVariablesInMap to session.clearVariables. |
Decode HTML Entities on a session variable.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
Downloads the file to the local file system.
Returns true on successful download of the file; otherwise, it returns false.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. Lazy scrape only available for enterprise edition. |
If the file to download requires that POST data is sent in order to get the file you would use saveFileOnRequest with a scrapeable file.
Using this method in a script takes the place of requesting the target URL as a scrapeable file.
Manually start the execution of a script.
Returns void. If the file doesn't exist a message will be written to the log. If the called script has an error in it a warning will be written to the log.
Version | Description |
---|---|
5.0 | Scripts called using this method are now exported with the scraping session. |
4.5 | Available for professional and enterprise editions. |
Executes the named script, but preserves the current context (dataRecord, scrapeableFile, etc.).
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
Get the general character set being used in page response renderings.
This method does not receive any parameters.
Returns the character set applied to the scraping session's files, as a string. If a character set has not been specified, it will default to the character set specified in the settings dialog box.
Version | Description |
---|---|
4.5 | Available for all editions. |
If you are having trouble with characters displaying incorrectly, we encourage you to read about how to go about finding a solution using one of our FAQs.
Retrieve the timeout value for scrapeable files in the session.
This method does not receive any parameters.
Returns the timeout value in milliseconds, as an integer.
Version | Description |
---|---|
5.0.1a | Introduced for all editions. |
Get the current cookies.
This method does not receive any parameters.
Returns an array of the cookies in the session.
Version | Description |
---|---|
5.0 | Available for all editions. |
Checks to see if this is currently set to run in debug mode. This is useful for developing scrapes, as enabling debug mode logs a warning message, so it is easier to notice a scrape with hard-coded values used for development. A warning is also written to the web interface or log each time monitored variables are logged with the logMonitoredValues or webMessage methods.
This method takes no parameters.
True if debug mode is enabled, false otherwise.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
Gets the default retry policy to be used by each scrapeable file when one wasn't set for it.
This method takes no parameters.
The default retry policy, or null if there isn't one.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
Get how long the current session has been running.
This method does not receive any parameters.
Returns number of milliseconds the scrape has been running, as a long (8-byte integer).
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
If you would like to log the running time of the scraping session you should use logElapsedRunningTime.
Get the logging level of the scrape.
This method does not receive any parameters.
Returns the logging level, as an integer. Currently there are four levels: 1 = Debug, 2 = Info, 3 = Warn, 4 = Error.
Version | Description |
---|---|
5.0.1a | Introduced for all editions. |
Retrieve the maximum number of concurrent file downloads being allowed.
This method does not receive any parameters.
Returns the max number of concurrent file downloads allowed, as an integer.
Version | Description |
---|---|
5.0 | Added for professional and enterprise editions. |
Retrieve the number of attempts that scrapeable files should make to get the requested page.
This method does not receive any parameters.
Returns the number of attempts that will be made, as an integer.
Version | Description |
---|---|
5.0 | Available for all editions. |
Get the total number of scripts allowed on the stack before the scraping session is forcibly stopped.
This method does not receive any parameters.
Returns max number of scripts that can be running at a time, as an integer.
Version | Description |
---|---|
5.0 | Added for all editions. |
Get the name of the current scraping session.
This method does not receive any parameters.
Returns the name of the scraping session, as a string.
Version | Description |
---|---|
4.5 | Available for all editions. |
Get the number of scripts currently running.
This method does not receive any parameters.
Returns number of running scripts, as an integer.
Version | Description |
---|---|
5.0 | Added for all editions. |
Determine whether or not non-tidied HTML is to be retained for all scrapeable files in this scraping session.
This method does not receive any parameters.
Returns whether non-tidied HTML is to be retained for all scrapeable files or not, as a boolean.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Get the unique identifier for the scraping session.
This method does not receive any parameters.
Returns unique session id for the scraping session, as an integer.
Version | Description |
---|---|
5.0 | Added for enterprise edition. |
Retrieve the time at which the scrape started.
This method does not receive any parameters.
Returns the start time of the scrape in milliseconds, as a long.
Version | Description |
---|---|
4.5 | Available for all editions. |
Gets the current time zone of the Scraping Session
This method takes no parameters.
The time zone this scrape is set to.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
Retrieve the value of a saved session variable.
Returns the value of the session variable. This will be a string unless you have used setVariable to place something other than a string into a session variable.
Version | Description |
---|---|
4.5 | Available for all editions. |
Retrieve the value of a saved session variable (alias of getVariable).
Returns the value of the session variable. This will be a string unless you have used setVariable to place something other than a string into a session variable.
Version | Description |
---|---|
4.5 | Added for all editions. |
Returns whether or not we are currently running in the command line. This is a convenience method for doing something different in a script when running in the command line as opposed to other modes.
This method does not receive any parameters.
Returns true if and only if the scrape is currently running in the command line.
Version | Description |
---|---|
6.0.37a | Introduced for all editions. |
Returns whether or not we are currently running in the server. This is a convenience method for doing something different in a script when running in the server as opposed to other modes.
This method does not receive any parameters.
Returns true if and only if the scrape is currently running in the server.
Version | Description |
---|---|
6.0.37a | Introduced for all editions. |
Returns whether or not we are currently running in the workbench. This is a convenience method for doing something different in a script when running in the workbench as opposed to other modes.
This method does not receive any parameters.
Returns true if and only if the scrape is currently running in the workbench.
Version | Description |
---|---|
6.0.37a | Introduced for all editions. |
Loads the state that would have been previously saved by invoking the session.saveStateToString method.
None
Version | Description |
---|---|
5.5.30a | Available in Professional and Enterprise editions. |
Load session variables from a file.
Returns void. If there is a problem retrieving the file contents an I/O error will be written to the log.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
See also: saveVariables.
If you want to create your own file of session variables, the format is a hard return-delimited list of name/value pairs. Both the key and value should be URL-encoded.
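The file format described above could be produced with something like the following. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, UTF-8 is assumed for the URL encoding, and `=` is assumed as the separator between name and value, which the description does not specify.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringJoiner;

final class SessionVariableFileSketch {
    /** Builds one line per variable, with the key and value both URL-encoded. */
    static String toFileContents(Map<String, String> variables) {
        StringJoiner lines = new StringJoiner("\n"); // hard return between pairs
        for (Map.Entry<String, String> entry : variables.entrySet()) {
            String key = URLEncoder.encode(entry.getKey(), StandardCharsets.UTF_8);
            String value = URLEncoder.encode(entry.getValue(), StandardCharsets.UTF_8);
            // Assumption: '=' separates name and value; the text does not name the separator.
            lines.add(key + "=" + value);
        }
        return lines.toString();
    }
}
```

Encoding both sides keeps characters such as `&`, `=`, and line breaks inside a value from being mistaken for delimiters.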
Saves the current state of the scraping session to a string. An example use case for this method would be a scraping session that logs in to a site, extracts some information, and then is stopped, saving its state out to a file. A second scraping session could then be run, loading the state back in from the file, which would keep the session logged in so that other information could be obtained without logging in once again. By default the scraping session will save out information such as the URL to use as a referer. More information can be saved using the boolean flags described below.
None
Version | Description |
---|---|
5.5.30a | Available in Professional and Enterprise editions. |
Saves all current string and integer variables to a file.
Returns void. If there is a problem retrieving the file contents an I/O error will be written to the log.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Manually scrape a scrapeable file.
Returns void. If there is a problem accessing the scrapeable file a message will be written to the log.
Version | Description |
---|---|
4.5 | Available for all editions. |
Invokes a scrapeable file using a string of content instead of a web page or local file.
None
Version | Description |
---|---|
5.5.13a | Available in all editions. |
Send data to the external script that initiated the scrape. This isn't currently supported with all drivers (e.g., remote scraping session); check the documentation on the language of the external script for more information.
Returns void.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Set the general character set used in page response renderings. This can be particularly helpful when the pages render characters incorrectly.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
This method must be invoked before the session starts.
If you are having trouble with characters displaying incorrectly, we encourage you to read about how to go about finding a solution using one of our FAQs.
Set the timeout value for scrapeable files in the session.
Returns void.
Version | Description |
---|---|
5.0.1a | Introduced for all editions. |
Manually set a cookie in the current session state.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
This method should be rarely used as screen-scraper automatically manages cookies. In cases where cookies are set via JavaScript, this function might be necessary.
Sets the debug state for the scrape. Enabled debug mode simply outputs a warning periodically while running, to help prevent running a production scrape in debug mode.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
Sets a retry policy that will affect all files in the scrape. This policy will be used by all scrapeable files that do not have a retry policy set for them. If a retry policy was manually set for them, this one will not be used.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
Sets the path to the keystore file. Some web sites require a special type of authentication that requires the use of a keystore file. See our blog entry on Using Client Certificates for more detail. Calling this method is the equivalent of setting the corresponding value under the "Advanced" tab for the scraping session in the workbench.
None
Version | Description |
---|---|
5.5.10a | Available in all editions. |
Sets the password for the keystore file. Some web sites require a special type of authentication that requires the use of a keystore file. See our blog entry on Using Client Certificates for more detail. Calling this method is the equivalent of setting the corresponding value under the "Advanced" tab for the scraping session in the workbench.
None
Version | Description |
---|---|
5.5.10a | Available in all editions. |
Set the logging level of the scrape.
Returns void.
Version | Description |
---|---|
5.0.1a | Introduced for all editions. |
Set the maximum number of concurrent file downloads to allow.
Returns void.
Version | Description |
---|---|
5.0 | Added for professional and enterprise editions. |
Set the number of attempts that scrapeable files should make to get the requested page.
Returns void.
Version | Description |
---|---|
5.0 | Available for all editions. |
Set the total number of scripts that can be running concurrently. Default value for maxScriptsOnStack is 50.
Returns void.
Version | Description |
---|---|
5.0 | Added for enterprise edition. |
Before you start upping the value of the number of scripts that can be on the stack you should make sure that your scrape is not eating more than it should. One thing to consider is recursion instead of iterating. This is discussed in more detail on our blog or in the Tips, Tricks, and Samples section of this site.
Causes the "User-Agent" header sent by screen-scraper to be randomized. The user agent strings from which screen-scraper will select are found in the "resource\conf\user_agents.txt" file.
None
Version | Description |
---|---|
5.5.34a | Available in Professional and Enterprise editions. |
Set whether or not non-tidied HTML is to be retained for all scrapeable files.
Returns void.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
If, after the file is scraped, you want to be able to use getNonTidiedHTML this method has to be called before a file is scraped.
Sets the value of all session variables that match the keys in the Map to the values in the Map. This will ignore a key of DATARECORD.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.43a | Changed from session.setSessionVariablesFromMap to session.setSessionVariables. |
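The behavior described above, including the skipped DATARECORD key, can be pictured with plain maps. This is only a sketch: the class and method names are hypothetical, session variables are modeled as a `Map<String, Object>`, and we assume each remaining key in the source map is written into the session whether or not a variable of that name already exists.

```java
import java.util.Map;

final class SetVariablesFromMapSketch {
    /** Copies every entry except the DATARECORD key into the target variables. */
    static void copyInto(Map<String, Object> sessionVariables, Map<String, Object> source) {
        for (Map.Entry<String, Object> entry : source.entrySet()) {
            if ("DATARECORD".equals(entry.getKey())) {
                continue; // this key is ignored, as described above
            }
            sessionVariables.put(entry.getKey(), entry.getValue());
        }
    }
}
```

Skipping DATARECORD makes it safe to pass in a map that was built from a data record without overwriting anything under that name.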
Sets a status message to be displayed in the web interface.
None
Version | Description |
---|---|
5.5.32a | Available in Enterprise edition. |
If this method is passed the value of true, it will cause screen-scraper to stop the current scraping session if an extractor pattern timeout occurs.
None
Version | Description |
---|---|
5.5.36a | Available in Professional and Enterprise editions. |
If this method is passed the value of true, it will cause screen-scraper to stop the current scraping session if the maximum attempts to request a file is reached.
None
Version | Description |
---|---|
5.5.36a | Available in Professional and Enterprise editions. |
If this method is passed the value of true, it will cause screen-scraper to stop the current scraping session if a script error occurs.
None
Version | Description |
---|---|
5.5.36a | Available in Professional and Enterprise editions. |
Sets the time zone that will be used when using a method that returns a time formatted as a string.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
If this method is passed the value of true, it will cause screen-scraper to utilize whatever character set is specified by the server in its "Content-Type" response header. If no such header exists, screen-scraper will default to either the character set indicated for the scraping session or the global character set (in that order).
None
Version | Description |
---|---|
5.5.11a | Available in all editions. |
Sets the user agent to be used for all requests.
None
Version | Description |
---|---|
5.5.23a | Available in Professional and Enterprise editions. |
Set the value of a session variable.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
Set the value of a session variable (alias of setVariable).
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
Determine if the scrape has been stopped. This can be done using the stop button in the workbench or the stop scraping button on the web interface (for enterprise users).
This method does not receive any parameters.
Returns true if the scrape has been requested to stop; otherwise, it returns false.
Version | Description |
---|---|
5.0 | Added for enterprise edition. |
Stop the current scraping session.
This method does not receive any parameters.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
Waits for any file downloads to complete before returning. This should be used in tandem with the session.downloadFile method call that takes the "doLazy" parameter.
None
None
Version | Description |
---|---|
5.5.43a | Available in Enterprise edition. |
The sutil class provides general functions used to manipulate and work with extracted data. It also allows you to get information regarding screen-scraper such as its memory usage or version.
In the course of a scrape you might want to gather images associated with the other information being gathered. These methods are provided not only to download the images but also to gather size information and resize them to your desired size.
These methods are only available to enterprise edition users.
Get the height of an image.
Returns the height in pixels of the image file, as an integer. If the file doesn't exist or is not an image an error will be thrown and -1 will be returned.
Version | Description |
---|---|
5.0 | Moved from session to sutil. |
4.5 | Available for enterprise edition. |
Get the width of an image.
Returns the width in pixels of the image file, as an integer. If the file doesn't exist or is not an image an error will be thrown and -1 will be returned.
Version | Description |
---|---|
5.0 | Moved from session to sutil. |
4.5 | Available for enterprise edition. |
Internally, only one function is used to resize all images; however, to facilitate the resizing of images, we have provided you with three methods. Each method will help you specify what measurement is most important (width or height) and whether the image should retain its aspect ratio.
Resize image, retaining aspect ratio, based on specified height.
Returns void. If an error is encountered it will be thrown.
Version | Description |
---|---|
5.0 | Moved from session to sutil. |
4.5 | Available for enterprise edition. |
Resize image, retaining aspect ratio, based on specified width.
Returns void. If an error is encountered it will be thrown.
Version | Description |
---|---|
5.0 | Moved from session to sutil. |
4.5 | Available for enterprise edition. |
Resize image to a specified size.
Returns void. If an error is encountered it will be thrown.
Version | Description |
---|---|
5.0 | Moved from session to sutil. |
4.5 | Available for enterprise edition. |
This method can distort the image if the aspect ratios of the original and target images differ.
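The arithmetic behind an aspect-ratio-preserving resize, and the check for when a fixed target size would distort the image, can be sketched as follows. The class and method names here are hypothetical and are not part of sutil; rounding to the nearest pixel is our assumption.

```java
final class AspectRatioSketch {
    /** Width that keeps the original aspect ratio at the requested height. */
    static int widthForHeight(int originalWidth, int originalHeight, int targetHeight) {
        return (int) Math.round((double) originalWidth * targetHeight / originalHeight);
    }

    /** Height that keeps the original aspect ratio at the requested width. */
    static int heightForWidth(int originalWidth, int originalHeight, int targetWidth) {
        return (int) Math.round((double) originalHeight * targetWidth / originalWidth);
    }

    /** True when a fixed target size has a different aspect ratio than the original. */
    static boolean wouldDistort(int originalWidth, int originalHeight,
                                int targetWidth, int targetHeight) {
        // Cross-multiplying avoids comparing floating-point ratios.
        return (long) originalWidth * targetHeight != (long) originalHeight * targetWidth;
    }
}
```

For an 800 by 600 image, asking for a height of 300 gives a width of 400, and a 400 by 400 target would be flagged as distorting.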
To be used in conjunction with the ImageDecoder class.
This class represents decoded images. The objects can be queried for the text that was in the image, as well as any error that occurred while the image was being decoded. When the returned text is incorrect, there is a method that can be used to report it as bad. This can be used for sites like decaptcher.com, where refunds are given for incorrectly interpreted images.
Gets any error message, or returns null if there was no error
This method takes no parameters
The error message returned
Version | Description |
---|---|
5.5.29a | Available in all editions. |
Gets the result from decoding the image. Most likely this will be a String, but each implementation could return a specific object type.
This method takes no parameters
The text extracted from the image, or null if there was an error
Version | Description |
---|---|
5.5.29a | Available in all editions. |
Handles an incorrectly resolved image. Some types of decoders won't have anything here
This method takes no parameters
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
Returns true if there was an error, false otherwise. Also returns false if the image has not been resolved yet
This method takes no parameters
True if there was an error, false otherwise
Version | Description |
---|---|
5.5.29a | Available in all editions. |
Class to convert images to text for interacting with CAPTCHA challenges. There are currently two implementations:
When a reference to an image is passed to an instance of this class, it returns a DecodedImage object that can be queried for the resulting text, errors, and can report an image as poorly converted.
See example attached.
Requires an account with decaptcher.com.
Type of ImageDecoder in the com.screenscraper.util.images package that uses the decaptcher.com service to convert images to text. The constructor is DecaptcherDecoder(ScrapingSession session, String username, String password) or DecaptcherDecoder(ScrapingSession session, String username, String password, String apiUrl).
Returns void. If it runs into any problems accessing the decaptcher.com service an error will be thrown.
Version | Description |
---|---|
5.5.29a | Available in all editions |
5.5.40a | Added the port parameter. The service now requires the correct port in order to authenticate. |
Initialization script
Type of ImageDecoder in the com.screenscraper.util.images package that uses a popup window prompting the user to enter the text read from an image. Useful for debugging purposes, as the input text should always be correct (so long as it is typed correctly). Helpful during testing to avoid costs associated with paid-for CAPTCHA decoding services such as decaptcher.com.
Returns void. If it runs into any problems decoding an image an error will be thrown.
Version | Description |
---|---|
5.5.29a | Available in all editions |
Initialize script
Converts the image given to a DecodedImage that will handle it. Does not delete the file.
A DecodedImage used to get the text, errors, and possibly report a result as bad.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
Converts the image at the given URL to a DecodedImage that will handle it. Temporarily saves the file in the screen-scraper root folder, but deletes it once it has been decoded. By default, this will use the scraping session's HttpClient to request the URL.
A DecodedImage used to get the text, errors, and possibly report a result as bad.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
Converts the Date given to a string in a specified format, or in the "MM/dd/yyyy HH:mm:ss.SS zzz" format if no format is given.
A String representing the date given
Version | Description |
---|---|
5.5.26a | Available in all editions. |
Decode HTML Entities.
Returns string with decoded HTML entities.
Version | Description |
---|---|
5.0 | Added for all editions. |
Converts a String to a Date object using the given format. If null is given as a format, "MM/dd/yyyy HH:mm:ss.SS zzz" is used.
The Date object matching the date given in the String, or null if it couldn't be parsed with the given format
Version | Description |
---|---|
5.5.26a | Available in all editions. |
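The default pattern and the null-on-failure behavior described above can be reproduced with the standard `java.text.SimpleDateFormat` class. This is a minimal sketch, not the sutil implementation: the class and method names are hypothetical, and passing the time zone explicitly and fixing the locale to `Locale.US` are our assumptions.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

final class DateFormatSketch {
    static final String DEFAULT_FORMAT = "MM/dd/yyyy HH:mm:ss.SS zzz";

    /** Builds a formatter, falling back to the default pattern when format is null. */
    private static SimpleDateFormat formatter(String format, TimeZone zone) {
        SimpleDateFormat sdf =
                new SimpleDateFormat(format == null ? DEFAULT_FORMAT : format, Locale.US);
        sdf.setTimeZone(zone);
        return sdf;
    }

    static String format(Date date, String format, TimeZone zone) {
        return formatter(format, zone).format(date);
    }

    /** Returns the parsed Date, or null if the text does not match the format. */
    static Date parse(String text, String format, TimeZone zone) {
        try {
            return formatter(format, zone).parse(text);
        } catch (ParseException e) {
            return null;
        }
    }
}
```

Because a null result signals a parse failure, callers should check for null before using the returned Date.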
Replaces the UTF variants on whitespace with a regular space character.
Returns the converted string.
Version | Description |
---|---|
6.0.55a | Available in all editions. |
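One way to approximate this kind of whitespace normalization in plain Java is shown below. It is a sketch only: the class and method names are hypothetical, and we assume that "UTF variants on whitespace" corresponds to the Unicode space-separator category (`\p{Zs}`), which may not be the exact set of characters the real method handles.

```java
final class WhitespaceSketch {
    /** Replaces every Unicode space separator (e.g. U+00A0, U+2003) with a plain space. */
    static String normalizeSpaces(String input) {
        if (input == null) {
            return null;
        }
        // \p{Zs} covers no-break space, en/em spaces, ideographic space, and similar.
        return input.replaceAll("\\p{Zs}", " ");
    }
}
```

Tabs and line breaks are not in that category, so they are left untouched by this sketch.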
Checks to see if one date is within a certain number of days of another.
Version | Description |
---|---|
5.5.13a | Available in all editions. |
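A check of this kind can be expressed as a comparison of millisecond timestamps. The following is a minimal sketch with hypothetical names; we assume the comparison is symmetric (either date may be the earlier one), that a day is a fixed 24 hours, and that the boundary is inclusive, none of which is stated above.

```java
import java.util.Date;
import java.util.concurrent.TimeUnit;

final class DateWindowSketch {
    /** True if the two dates are at most the given number of 24-hour days apart. */
    static boolean isWithinDays(Date first, Date second, int days) {
        long differenceMillis = Math.abs(first.getTime() - second.getTime());
        return differenceMillis <= TimeUnit.DAYS.toMillis(days);
    }
}
```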
Compare two strings ignoring case.
Returns true if the values of the two strings are equal when case is not considered; otherwise, it returns false.
Version | Description |
---|---|
5.0 | Added for all editions. |
Returns a number formatted in such a way that it could be parsed as a Float, such as xxxxxxxxx.xxxx. It attempts to determine whether the number is formatted in European or American style; if it cannot determine which, it defaults to American. If the number ends in k, the k is converted to thousand (as 000). It will also try to convert m for million and b for billion. It assumes that you won't have a number like 3.123k or 3.765m, although 3.54m is fine; the reasoning is that if you wanted all three of those digits you would have specified it as 3765k or 3,765k.
Returns a String containing the number in a form that could be parsed as a Float, such as xxxxxxxxx.xxxx.
Version | Description |
---|---|
5.5.26a | Available in all editions. |
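The suffix handling described above (k, m, and b) can be illustrated with `java.math.BigDecimal`. This is a deliberately narrow sketch with hypothetical names: it treats commas as American-style thousands separators and makes no attempt at the European versus American detection the real method performs.

```java
import java.math.BigDecimal;
import java.util.Locale;

final class NumberSuffixSketch {
    /** Expands a trailing k, m, or b into thousands, millions, or billions. */
    static String expand(String raw) {
        if (raw == null) {
            return null;
        }
        // Assumption: commas are thousands separators (American style only).
        String text = raw.trim().replace(",", "").toLowerCase(Locale.ROOT);
        char last = text.isEmpty() ? ' ' : text.charAt(text.length() - 1);
        BigDecimal multiplier;
        switch (last) {
            case 'k': multiplier = new BigDecimal("1000"); break;
            case 'm': multiplier = new BigDecimal("1000000"); break;
            case 'b': multiplier = new BigDecimal("1000000000"); break;
            default:  multiplier = null;
        }
        if (multiplier == null) {
            return new BigDecimal(text).toPlainString(); // no suffix to expand
        }
        BigDecimal value = new BigDecimal(text.substring(0, text.length() - 1).trim());
        // stripTrailingZeros drops the ".00" left over from the multiplication.
        return value.multiply(multiplier).stripTrailingZeros().toPlainString();
    }
}
```

Input that is not numeric after these steps causes `BigDecimal` to throw a `NumberFormatException`, which a real script would want to catch.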
Converts a String to a US formatted phone number, as +1 (123) 456-7890x2. Expects a 7 digit or 10+ digit phone number. The extension is optional, and will be any digits found after an x. This allows for extensions listed as ext, x, or extension.
Returns a String formatted as a phone number, such as +1 (123) 456-7890x2, or null if the input was null
Version | Description |
---|---|
5.5.26a | Available in all editions. |
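The formatting rule described above can be sketched for the 10-digit case. The class and method names are hypothetical, and the sketch intentionally rejects anything other than 10 digits, since the output for 7-digit or longer numbers is not spelled out here. Treating everything after the first "x" as the extension is what lets "ext", "x", and "extension" all work.

```java
import java.util.Locale;

final class PhoneFormatSketch {
    /** Formats a 10-digit US number as +1 (123) 456-7890, with an optional x extension. */
    static String format(String raw) {
        if (raw == null) {
            return null;
        }
        String lower = raw.toLowerCase(Locale.ROOT);
        int x = lower.indexOf('x'); // "x", "ext", and "extension" all contain an x
        String main = (x >= 0 ? lower.substring(0, x) : lower).replaceAll("\\D", "");
        String extension = x >= 0 ? lower.substring(x + 1).replaceAll("\\D", "") : "";
        if (main.length() != 10) {
            // Assumption: this sketch covers only the 10-digit case.
            throw new IllegalArgumentException("expected 10 digits, got " + main.length());
        }
        String formatted = "+1 (" + main.substring(0, 3) + ") "
                + main.substring(3, 6) + "-" + main.substring(6);
        return extension.isEmpty() ? formatted : formatted + "x" + extension;
    }
}
```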
Formats and returns a US style zip code as 12345-6789. If the given zip code isn't 5 or 9 digits, will log a warning, but it will put 5 digits before the - and anything else (if any) after the -
Zip code formatted String, such as 12345-6789 or 12345
Version | Description |
---|---|
5.5.26a | Available in all editions. |
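The splitting rule described above (five digits before the hyphen, anything else after it) can be sketched as follows. The names are hypothetical, stripping non-digit characters first is our assumption, and the warning that is logged for unusual lengths is omitted.

```java
final class ZipFormatSketch {
    /** Puts the first five digits before the hyphen and any remaining digits after it. */
    static String format(String raw) {
        if (raw == null) {
            return null;
        }
        String digits = raw.replaceAll("\\D", ""); // assumption: non-digits are discarded
        if (digits.length() <= 5) {
            return digits; // nothing to place after the hyphen
        }
        return digits.substring(0, 5) + "-" + digits.substring(5);
    }
}
```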
Returns the current date in a specified format, or uses "MM/dd/yyyy HH:mm:ss.SS zzz" if null is given. Uses the session's time zone.
A String representing the date and time this method was invoked
Version | Description |
---|---|
5.5.26a | Available in all editions. |
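The default pattern quoted above follows the letter conventions of java.text.SimpleDateFormat, so the fallback behavior can be sketched in plain Java. This is only an illustration under that assumption: the class name is hypothetical, and the date and time zone are passed in explicitly here rather than taken from the clock and the scraping session.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical stand-alone sketch; not part of the screen-scraper API.
class DateSketch {
    static final String DEFAULT_FORMAT = "MM/dd/yyyy HH:mm:ss.SS zzz";

    // Formats the given instant, falling back to the default pattern on null.
    static String format(Date when, String pattern, TimeZone zone) {
        String effective = (pattern == null) ? DEFAULT_FORMAT : pattern;
        SimpleDateFormat formatter = new SimpleDateFormat(effective, Locale.US);
        formatter.setTimeZone(zone);
        return formatter.format(when);
    }
}
```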
Retrieve the file path of the screen-scraper installation.
This method does not receive parameters.
Returns the installation directory file path, as a string.
Version | Description |
---|---|
5.0 | Added for all editions. |
Get memory usage of screen-scraper.
This method does not receive any parameters.
Returns the average percentage of memory used by screen-scraper over the past 30 seconds, as an integer.
Version | Description |
---|---|
5.0 | Moved from session to sutil. |
4.5 | Available for enterprise edition. |
For tips on optimizing screen-scraper's memory usage so that it can run faster, see our FAQ on optimization.
Get the mime-type of a local file.
Returns the mime-type of the file, as a string.
Version | Description |
---|---|
5.0 | Added for all editions. |
Get the number of runnable scraping sessions.
This method does not receive any parameters.
Returns the number of scraping sessions in this instance of screen-scraper, as an integer.
Version | Description |
---|---|
5.0 | Added for all editions. |
Gets the number of scraping sessions that are currently being run.
An int representing the number of running scraping sessions.
Version | Description |
---|---|
5.5.42a | Available in Enterprise edition. |
Gets a DataSet containing each of the elements of a <select> tag. The returned DataRecords will contain a key for the text found between the tags (possibly with html tags removed), a value indicating if it was the selected option, and the value to submit for the specific option. Note that this only looks for option tags, and as such passing in text containing more than a single select tag will produce incorrect output.
A DataSet with one record per option. Values extracted will be stored in the following keys:
VALUE : The value the browser would submit for this option
TEXT : The text that was between the tags
SELECTED : A boolean that is true if this option was selected by default
Version | Description |
---|---|
5.5.26a | Available in all editions. |
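To make the record layout concrete, here is a stand-alone sketch that pulls VALUE, TEXT, and SELECTED out of option tags with regular expressions. It is not the screen-scraper implementation: the class name is hypothetical, a list of maps stands in for the DataSet, only double-quoted value attributes are handled, and falling back to the option text when no value attribute is present is an assumption based on how browsers behave.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; not part of the screen-scraper API.
class SelectSketch {
    private static final Pattern OPTION = Pattern.compile(
            "<option\\b([^>]*)>(.*?)</option>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern VALUE = Pattern.compile(
            "\\bvalue\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern SELECTED = Pattern.compile(
            "\\bselected\\b", Pattern.CASE_INSENSITIVE);

    static List<Map<String, Object>> parse(String selectHtml) {
        List<Map<String, Object>> records = new ArrayList<>();
        Matcher option = OPTION.matcher(selectHtml);
        while (option.find()) {
            String attrs = option.group(1);
            // Text between the tags, with any nested HTML tags removed.
            String text = option.group(2).replaceAll("<[^>]+>", "").trim();
            Matcher value = VALUE.matcher(attrs);
            String submitted = value.find() ? value.group(1) : text;
            // Remove the value attribute so its contents cannot look like "selected".
            String rest = VALUE.matcher(attrs).replaceAll("");
            Map<String, Object> record = new LinkedHashMap<>();
            record.put("VALUE", submitted);
            record.put("TEXT", text);
            record.put("SELECTED", SELECTED.matcher(rest).find());
            records.add(record);
        }
        return records;
    }
}
```

Because the sketch only looks for option tags, it shares the limitation noted above: text containing more than one select tag would be merged into a single list.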
Gets all the options from a radio button group. The values are returned in a data record. Any labels that are to be ignored will not be included in the returned set. Not all buttons have a label, as radio buttons do not require a label, and it would be difficult to know in a regular expression exactly what to extract as the label unless there is a label tag.
DataSet containing one record for each of the extracted radio buttons. Values will be stored in the following keys:
VALUE : The value the browser would submit for this radio button
TEXT : The text that represents this button, or null if no label could be found for it
SELECTED : A boolean that is true if this button was selected by default
ID : The ID of the radio button, or null if no ID was found
Version | Description |
---|---|
5.5.29a | Available in all editions. |
Gets a random referrer page from a list of many different search engine web sites and a few other pages.
This method does not receive any parameters.
Returns a random referrer URL.
Version | Description |
---|---|
6.0.1a | Introduced for all editions. |
Returns a random User Agent. The list isn't closely monitored, so it may not include newer user agents, and may include extremely old ones as well.
This method does not receive any parameters.
Returns a random user agent.
Version | Description |
---|---|
6.0.1a | Introduced for all editions. |
Get edition of screen-scraper instance.
This method does not receive any parameters.
Returns the edition name, as a string.
Version | Description |
---|---|
5.0 | Added for all editions. |
Get version of screen-scraper instance.
This method does not receive any parameters.
Returns the version number, as a string.
Version | Description |
---|---|
5.0 | Added for all editions. |
Determine if the value of a string is an integer.
Returns true if the string is an integer; otherwise, it returns false. If it is passed an object that is not a string, including an integer, an error will be thrown.
Version | Description |
---|---|
5.0 | Added for all editions. |
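A check of this kind is commonly built on Integer.parseInt. The sketch below shows that approach in plain Java; it is an assumption about the general technique, not the screen-scraper source, and the class name is hypothetical. Note that the sketch returns false for null rather than throwing.

```java
// Hypothetical stand-alone sketch; not part of the screen-scraper API.
class IntCheckSketch {
    static boolean isInt(String candidate) {
        if (candidate == null) {
            return false; // assumption: treat null as "not an integer"
        }
        try {
            Integer.parseInt(candidate);
            return true;
        } catch (NumberFormatException e) {
            return false; // not parseable as a 32-bit integer
        }
    }
}
```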
Determine if an object's value is null or empty.
Returns true if the value of the object is null or an empty string; otherwise, it returns false.
Version | Description |
---|---|
5.0 | Added for all editions. |
Determine if operating system is a Linux platform.
This method does not receive parameters.
Returns true if the operating system is Linux; otherwise, it returns false.
Version | Description |
---|---|
5.0 | Added for all editions. |
Determine if operating system is a Mac platform.
This method does not receive parameters.
Returns true if the operating system is Mac; otherwise, it returns false.
Version | Description |
---|---|
5.0 | Added for all editions. |
Determine if operating system is a Windows platform.
This method does not receive parameters.
Returns true if the operating system is Windows; otherwise, it returns false.
Version | Description |
---|---|
5.0 | Added for all editions. |
Retrieve the response contents of a GET request.
Returns contents of the response, as a string.
Version | Description |
---|---|
5.0 | Added for all editions. |
This method will use any proxy settings that have been specified in the Settings dialog box.
Makes a GET request and returns the result as a string. This method will use the proxy settings indicated in the "Settings" dialog box, if any.
This method does not receive any parameters.
Version | Description |
---|---|
6.0.6a | Introduced for all editions. |
Makes a GET request and returns the result as a string. This method will use the proxy settings attached to the current scraping session.
This method does not receive any parameters.
Version | Description |
---|---|
6.0.6a | Introduced for all editions. |
Retrieve the response header contents.
Returns contents of the response, as a two-dimensional array.
Version | Description |
---|---|
5.0 | Added for all editions. |
This method will use any proxy settings that have been specified in the Settings dialog box.
Merges two data records by copying all values from the second record over values of the first record, and returning a new DataRecord with these values. Doesn't modify either original record
A new DataRecord with the merged values
Version | Description |
---|---|
5.5.26a | Available in all editions. |
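The merge semantics described above (second record wins, neither original is modified) can be sketched with ordinary maps. A java.util.Map stands in for the DataRecord here, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; a plain Map stands in for a DataRecord.
class MergeSketch {
    static Map<String, Object> merge(Map<String, Object> first, Map<String, Object> second) {
        Map<String, Object> merged = new HashMap<>(first); // copy, so first is untouched
        merged.putAll(second); // values from second overwrite values from first
        return merged;
    }
}
```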
Get an object in string format.
Returns an empty string if the value of the object is null; otherwise, returns the value of the toString method of the object.
Version | Description |
---|---|
5.0 | Added for all editions. |
Attempts to parse a string to a name. The parser is not perfect and works best on English formatted names (for example, "John Smith Jr." or "Guerrero, Antonio K"). This uses standard settings for the parser. To get more control over how the name is parsed, use the EnglishNameParser class.
Returns the parsed name, as a Name object.
Version | Description |
---|---|
6.0.59a | Available for professional and enterprise editions. |
Attempts to parse a string to a name. The parser is not perfect and works best on English formatted names (for example, "John Smith Jr." or "Guerrero, Antonio K"). This uses standard settings for the parser. To get more control over how the name is parsed, use the EnglishNameParser class.
Returns the parsed name, as a Name object.
Version | Description |
---|---|
6.0.59a | Available for professional and enterprise editions. |
Attempts to parse a string to an address. The parser is not perfect and works best on US addresses. Most likely other address formats can be parsed with the USAddressParser class by providing different constraints in the builder. This method is here for convenience in working with US addresses.
Returns the parsed address, as an Address object.
Version | Description |
---|---|
6.0.59a | Available for professional and enterprise editions. |
Pause scraping session.
Returns void.
Version | Description |
---|---|
5.0 | Moved from session to sutil. |
4.5 | Available for professional and enterprise editions. |
Pausing the scraping session also pauses the execution of the scripts including the one that initiates the pause.
Pauses for a random amount of time. This is also set up to stop immediately if the stop scrape button is clicked, and to allow breakpoints to be triggered while it is pausing.
Returns void.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
Change a date format.
Returns formatted date according to the specified format, as a string.
Version | Description |
---|---|
5.0 | Moved from session to sutil. |
4.5 | Available for professional and enterprise editions. Unspecified source format available for enterprise edition. |
The date formats are not the same for the two methods. Read carefully.
Send an email using SMTP mail server specified in the settings.
Returns void. If it runs into any problems while attempting to send the email an error will be thrown.
Version | Description |
---|---|
6.0.35a | Now supports alternate content types. |
5.0 | Moved from session to sutil. |
4.5 | Available for enterprise edition. |
Sorts the elements in a set into an ordered list.
This method returns a sorted list of elements that are in the set.
Version | Description |
---|---|
5.5.26a | Available in all editions. |
Determine if one string is the start of another, without regards for case.
Returns true if string starts with start when case is not considered; otherwise, it returns false.
Version | Description |
---|---|
5.0 | Added for all editions. |
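In plain Java, a case-insensitive prefix test can be written with String.regionMatches. The following is a minimal sketch of that technique, not the screen-scraper source; the class name is hypothetical and both arguments are assumed to be non-null.

```java
// Hypothetical stand-alone sketch; not part of the screen-scraper API.
class PrefixSketch {
    static boolean startsWithIgnoreCase(String string, String start) {
        // Compare the first start.length() characters, ignoring case.
        return string.regionMatches(true, 0, start, 0, start.length());
    }
}
```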
Parse string into a floating point number.
Returns the string's value as a floating point number.
Version | Description |
---|---|
5.0.1a | Introduced for professional and enterprise editions. |
Strips HTML from a string, replacing some tags with corresponding text-only equivalents.
Returns the stripped content.
Version | Description |
---|---|
6.0.20a | Available in only the Enterprise edition. |
Tidies the DataRecord by performing actions based on the values of the given settings map. If no settings are given, the values obtained from sutil.getDefaultTidySettings() are used. Each value in the record that is a string will be tidied. Keys are not modified. The record given will not be modified; a new record with the tidied values is returned.
The tidy settings and their default values are given below. If a key is missing in the settings map, that operation will not be performed.
Map Key | Default Value | Description of operation performed |
---|---|---|
trim | true | Trims whitespace from values |
convertNullStringToLiteral | true | Converts the string 'null' (without quotes) to the null literal (unless it has quotes around it, such as "null") |
convertLinks | true | Preserves links by converting <a href="link">text</a> to text (link), will try to resolve urls if scrapeableFile isn't null. Note that if there isn't a start and end <a> tag, this will do nothing |
removeTags | true | Remove html tags, and attempts to convert line break HTML tags such as <br> to a new line in the result |
removeSurroundingQuotes | true | Remove quotes from values surrounded by them -- "value" becomes value |
convertEntities (professional and enterprise editions only) | true | Convert html entities |
removeNewLines | false | Remove all new lines from the text. Replaces them with a space |
removeMultipleSpaces | true | Convert multiple spaces to a single space, and preserve new lines |
convertBlankToNull | false | Convert blank strings to null literal |
A new DataRecord containing all the tidied values and any values that were not Strings in the original record. The values that were Strings but were not tidied as well as the DATARECORD value will not be in the returned record.
Version | Description |
---|---|
5.5.26a | Available in all editions. |
5.5.28a | Now uses a Map for the settings, rather than bit flags. |
Tidies the string by performing actions based on the values of the settings map.
The tidy settings and their default values are given below. If a key is missing in the settings map, that operation will not be performed.
Map Key | Default Value | Description of operation performed |
---|---|---|
trim | true | Trims whitespace from values |
convertNullStringToLiteral | true | Converts the string 'null' (without quotes) to the null literal (unless it has quotes around it, such as "null") |
convertLinks | true | Preserves links by converting <a href="link">text</a> to text (link), will try to resolve urls if scrapeableFile isn't null. Note that if there isn't a start and end <a> tag, this will do nothing |
removeTags | true | Remove html tags, and attempts to convert line break HTML tags such as <br> to a new line in the result |
removeSurroundingQuotes | true | Remove quotes from values surrounded by them -- "value" becomes value |
convertEntities (professional and enterprise editions only) | true | Convert html entities |
removeNewLines | false | Remove all new lines from the text. Replaces them with a space |
removeMultipleSpaces | true | Convert multiple spaces to a single space, and preserve new lines |
convertBlankToNull | false | Convert blank strings to null literal |
The tidied string
Version | Description |
---|---|
5.5.26a | Available in all editions. |
5.5.28a | Now uses a Map for the settings, rather than bit flags. |
Assuming the extracted text's HTML code was:
<a href="http://www.somelink.com">This</a> was great because of these reasons:<br />
1 - Some reason<br />
2 - Another reason<br />
3 - Final reason
The output text would be:
This (http://www.somelink.com) was great because of these reasons:
1 - Some reason
2 - Another reason
3 - Final reason
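The way a settings map drives the tidy operations can be sketched in plain Java. The example below covers only a handful of the keys from the table above, and several details are assumptions rather than documented behavior: the map is typed as Map<String, Boolean>, the operations run in the order shown, and the class name is hypothetical. It does reflect the rule that an operation whose key is missing from the map is not performed.

```java
import java.util.Map;

// Hypothetical stand-alone sketch; not part of the screen-scraper API.
class TidySketch {
    static String tidy(String value, Map<String, Boolean> settings) {
        if (value == null) {
            return null;
        }
        String result = value;
        if (on(settings, "removeNewLines")) {
            result = result.replaceAll("\\r?\\n", " "); // new lines become spaces
        }
        if (on(settings, "removeMultipleSpaces")) {
            result = result.replaceAll(" {2,}", " "); // spaces only, new lines survive
        }
        if (on(settings, "trim")) {
            result = result.trim();
        }
        if (on(settings, "removeSurroundingQuotes") && result.length() >= 2
                && result.startsWith("\"") && result.endsWith("\"")) {
            result = result.substring(1, result.length() - 1);
        }
        if (on(settings, "convertBlankToNull") && result.isEmpty()) {
            return null;
        }
        return result;
    }

    // A missing key means the operation is skipped.
    private static boolean on(Map<String, Boolean> settings, String key) {
        return Boolean.TRUE.equals(settings.get(key));
    }
}
```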
Unzip a zipped file. Contents will appear in the same directory as the zipped file.
Returns void. If a file input/output error is experienced it will be thrown.
Version | Description |
---|---|
5.0 | Added for all editions. |
Write to a file.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
screen-scraper provides three built-in objects for proxy sessions. These objects are: proxySession, request, and response. See the Variable scope section for details on which objects are available based on when scripts are run.
This object gives you the ability to control interactions with the proxy session. It is only for use in scripts that are associated with proxy sessions.
Retrieve the value of the proxy session variable.
Returns the value of the session variable.
Version | Description |
---|---|
4.5 | Available for all editions. |
Write to the log.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
Set the value of a proxy session variable.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
A request object references a proxySession page request. Through this object you can control various aspects of the request.
Scripts run in the scraping engine use the scrapeable file to manipulate server requests.
Manually add an HTTP header.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
Add POST parameter to HTTP request.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
Retrieve the URL of the request.
This method does not receive any parameters.
Returns the URL of the request, as a string.
Version | Description |
---|---|
4.5 | Available for all editions. |
Manually remove an HTTP header. Both the key and value have to be specified as HTTP headers allow for multiple headers with the same key.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
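The reason both key and value are required can be seen in a small model of a header list that permits repeated keys. This is only an illustration: the class and its methods are hypothetical and are not the proxy request object's API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch of a header list that allows duplicate keys.
class HeaderListSketch {
    private final List<String[]> headers = new ArrayList<>();

    void add(String key, String value) {
        headers.add(new String[] {key, value});
    }

    // Removes only the header whose key and value both match.
    void remove(String key, String value) {
        headers.removeIf(h -> h[0].equalsIgnoreCase(key) && h[1].equals(value));
    }

    int count(String key) {
        int matches = 0;
        for (String[] h : headers) {
            if (h[0].equalsIgnoreCase(key)) {
                matches++;
            }
        }
        return matches;
    }
}
```

With two headers sharing the key Cookie, removing by key alone would be ambiguous; matching on the value as well leaves the other header in place.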
Remove POST parameter from HTTP request.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
Manually set the request line.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
The response class provides you with a means for editing the responses received by the proxy server.
Scripts run in the scraping engine use the scrapeable file to manipulate server responses.
Add HTTP header to response.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
Retrieve the content of the response.
This method does not receive any parameters.
Returns the content of the response, as a string.
Version | Description |
---|---|
4.5 | Available for all editions. |
Retrieve the status line of the response.
This method does not receive any parameters.
Returns the status line of the response, as a string.
Version | Description |
---|---|
4.5 | Available for all editions. |
Remove HTTP header from response.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
Manually set the response content.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
Manually set the status line.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
There are many classes that can be very helpful in getting your scripts to run correctly. Many of these were initially developed in-house to speed up coding and, once they proved stable, were offered to the public. You will need to import the packages for all of these classes; they are not automatically imported like the built-in screen-scraper objects.
The Apache Lang library provides enhancements to the standard Lang library of Java and can be particularly useful for completing tasks. As it is not a class that we maintain we will not document the methods in case they change without our notice but we invite you to look over how to use it in their API.
The CSVReader is not a class that is part of screen-scraper but is very useful and well put together. We have used it extensively. It is part of the opencsv package, which actually holds the underpinnings of our own CsvWriter. As it is not a class that we maintain we will not document the methods in case they change without our notice, but we invite you to look over how to use it in their API or brief documentation.
To use the CSVReader, simply import it in your script, the same as you would any other utility class. The opencsv.jar file is already included in the Professional and Enterprise Editions of screen-scraper's default installation.
This CsvWriter has been created to work particularly well with the screen-scraper objects. It is simple to use and provided to ease the task of keeping track of everything when creating a csv file.
The most used methods are documented here but if you would like more information you can read the JavaDoc for the CsvWriter.
Create a csv file writer.
Returns a CsvWriter object. If it encounters an error it will be thrown.
Version | Description |
---|---|
5.0 | Available for Professional and Enterprise editions. |
4.5.18a | Introduced in alpha version. |
com.screenscraper.csv.CsvWriter
Clear the buffer contents and close the file.
This method does not receive any parameters.
Returns void.
Version | Description |
---|---|
5.0 | Available for all editions. |
4.5.18a | Introduced in alpha version. |
Write the buffer contents to the file.
This method does not receive any parameters.
Returns void.
Version | Description |
---|---|
5.0 | Available for all editions. |
4.5.18a | Introduced in alpha version. |
Set the header row of the csv document. If the document already exists the headers will not be written. Also creates a data record mapping to ease writing to file.
Returns void.
Version | Description |
---|---|
5.0 | Available for all editions. |
4.5.18a | Introduced in alpha version. |
If you want to use the data record mapping, then the extractor token names should be all caps and all spaces should be replaced with underscores.
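The naming rule for the data record mapping amounts to a simple transformation of each header. The helper below is a hypothetical illustration of that rule in plain Java, not a CsvWriter method.

```java
import java.util.Locale;

// Hypothetical stand-alone sketch; not part of the CsvWriter API.
class HeaderKeySketch {
    // Derives the token name expected for a given CSV header.
    static String toRecordKey(String header) {
        return header.toUpperCase(Locale.ROOT).replace(' ', '_');
    }
}
```

For example, a header of First Name would pair with an extractor token named FIRST_NAME.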
Write to the CsvWriter object.
Returns void.
Version | Description |
---|---|
5.0 | Available for all editions. |
4.5.18a | Introduced in alpha version. |
This class is used to instantiate a data manager object. This is done to simplify the process of creating a data manager of a given type. Currently it only creates SqlDataManagers. A SQL data manager can be created without the use of this class, but it is simplified greatly through its use.
This class should no longer be used. Use a java.sql.BasicDataSource or com.screenscraper.datamanager.SshDataSource instead. See the SqlDataManager.buildSchemas page for examples
This class is only available for Professional and Enterprise editions of screen-scraper.
This method is no longer supported. Use a java.sql.BasicDataSource or com.screenscraper.datamanager.SshDataSource instead. See the SqlDataManager.buildSchemas page for examples.
Create a MsSQL data manager object.
Returns a SqlDataManager object. If an error is experienced it will be thrown.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
In order to create the MsSQL data manager you will need to make sure to install the appropriate jdbc driver. This can be done by downloading the MsSQL JDBC driver and placing it in the lib/ext folder in the screen-scraper installation directory.
This method is no longer supported. Use a java.sql.BasicDataSource or com.screenscraper.datamanager.SshDataSource instead. See the SqlDataManager.buildSchemas page for examples.
Create a MySQL data manager object.
Returns a SqlDataManager object. If an error is experienced it will be thrown.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
In order to create the MySQL data manager you will need to make sure to install the appropriate jdbc driver. This can be done by downloading the MySQL JDBC driver and placing it in the lib/ext folder in the screen-scraper installation directory.
This method is no longer supported. Use a java.sql.BasicDataSource or com.screenscraper.datamanager.SshDataSource instead. See the SqlDataManager.buildSchemas page for examples.
Create an Oracle data manager object.
Returns a SqlDataManager object. If an error is experienced it will be thrown.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
In order to create the Oracle data manager you will need to make sure to install the appropriate jdbc driver. This can be done by downloading the Oracle JDBC driver and placing it in the lib/ext folder in the screen-scraper installation directory.
This method is no longer supported. Use a java.sql.BasicDataSource or com.screenscraper.datamanager.SshDataSource instead. See the SqlDataManager.buildSchemas page for examples.
Create a PostgreSQL data manager object.
Returns a SqlDataManager object. If an error is experienced it will be thrown.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
In order to create the PostgreSQL data manager you will need to make sure to install the appropriate JDBC driver. This can be done by downloading the PostgreSQL JDBC driver and placing it in the lib/ext folder in the screen-scraper installation directory.
This method is no longer supported. Use a java.sql.BasicDataSource or com.screenscraper.datamanager.SshDataSource instead. See the SqlDataManager.buildSchemas page for examples.
Create a SQLite data manager object.
Returns a SqlDataManager object. If an error is experienced it will be thrown.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
In order to create the SQLite data manager you will need to make sure to install the appropriate JDBC driver. This can be done by downloading the SQLite JDBC driver and placing it in the lib/ext folder in the screen-scraper installation directory.
The proxy server pool object is used to aid with manual anonymization of scrapes. An example of how to set up manual proxy pools is available in the documentation. You will likely want to read that page first if you are new to the process.
Additionally, you should reference the methods available in the Anonymous API.
Initiate a ProxyServerPool object.
This method does not receive any parameters.
Returns a ProxyServerPool.
Version | Description |
---|---|
4.5 | Available for all editions. |
com.screenscraper.util.ProxyServerPool
Set the timeout that will render a proxy as being bad.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
Retrieve the number of available proxy servers.
This method does not receive any parameters.
Returns the number of available proxy servers, as an integer.
Version | Description |
---|---|
4.5 | Available for all editions. |
Write list of proxies to log.
This method does not receive any parameters.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
Add proxy servers to pool using a text file.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
Enables or disables automatic proxy cycling. When this is set to false (default is true) the current proxy that was automatically selected from the pool will be used each time the next proxy is requested. When set to true, each call to the getNextProxy method will cycle as normal between all available proxies.
A boolean value.
None
Version | Description |
---|---|
5.5.17a | Available in Professional and Enterprise editions. |
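The effect of the cycling flag can be modeled with a small stand-alone class. This is a sketch under stated assumptions, not the ProxyServerPool implementation: the class and method names are hypothetical, proxies are plain strings, and selection is simple round-robin, which may differ from how the real pool chooses a proxy.

```java
import java.util.List;

// Hypothetical stand-alone model of the cycling flag; not the ProxyServerPool API.
class CyclingSketch {
    private final List<String> proxies;
    private int index = -1;          // no proxy selected yet
    private boolean cycling = true;  // cycling is on by default

    CyclingSketch(List<String> proxies) {
        this.proxies = proxies;
    }

    void setCycling(boolean cycling) {
        this.cycling = cycling;
    }

    // Advances to the next proxy only when cycling is enabled
    // (or when nothing has been selected yet).
    String next() {
        if (index < 0 || cycling) {
            index = (index + 1) % proxies.size();
        }
        return proxies.get(index);
    }
}
```

With cycling turned off, repeated requests keep returning the proxy that was current at that moment; turning it back on resumes the rotation from there.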
Set the number of proxies that can be tested concurrently.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
Set threshold to get more proxy servers.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
Write list of proxies after invalid proxies have been removed.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
Retry Policies are objects that tell a scrapeable file how to check for errors, and optionally what to do before retrying to download the files. Some of the things that can be done are executing scripts when the page loads incorrectly or running Runnables. Usually these things would either request a new proxy, output some helpful information, or could simply stop the scrape. RetryPolicy is an interface and can be implemented to create a custom retry policy, or there is a RetryPolicyFactory class that can be used to create some standard policies.
This policy is checked AFTER all the extractors have been run. This allows for checks on whether extractor patterns matched or not, and also allows a page to have its 'error status' based off of another page (since extractor patterns could execute scripts that scrape other files, and those files could set a variable that acts as a flag to a previous retry policy). This could also cause some problems if the scrape isn't built to handle a page whose extractors shouldn't be run before the error checking occurs.
This interface is in the com.screenscraper.util.retry package.
If you need a custom retry policy, you can implement your own version of it. Be aware that you will need to ensure the references it has to the scrapeableFile are to the correct scrapeableFile. This could be tricky if you use the session.setDefaultRetryPolicy method. When using the scrapeableFile.setRetryPolicy method, the scrapeableFile will be the correct object. The interface is given below.
To help ensure you can create custom retry policies that have access to the scraping session and the scrapeable file that is currently being checked, there is an AbstractRetryPolicy class in the same package as the interface. This class defines some default behavior and adds protected fields for the session and scrapeable file that get set before the policy is run. If you extend this abstract class you can access the session and scrapeable file through this.scrapingSession and this.theScrapeableFile. Due to some oddities with the interpreter it is best to reference these variables with 'this.' to eliminate a few problems that arise in a few specific cases.
Returns a map that can be used to output an error message to indicate what checks failed. For instance, you could map the key "Status Code" to the value "200", or the key "Valid Page" to the value "false".
This method takes no parameters
Map of keys, or null if no values are indicated
Version | Description |
---|---|
5.5.29a | Available in all editions. |
Return the maximum number of times this policy allows for a retry before terminating in an error
This method takes no parameters
The maximum number of times to allow the ScrapeableFile to be rescraped before resulting in an error
Version | Description |
---|---|
5.5.29a | Available in all editions. |
Checks to see if the page loaded incorrectly
This method takes no parameters
True on errors, false otherwise
Version | Description |
---|---|
5.5.29a | Available in all editions. |
Returns true if the referrer should be reset before attempting to rescrape the file, if there was an error. This can be useful to reset so the referrer doesn't show the page you just requested.
This method takes no parameters
True if the referrer should be reset if there was an error, false otherwise.
Version | Description |
---|---|
6.0.36a | Available in all editions. |
Returns true if the session variables should be reset before attempting to rescrape the file, if there was an error. This can be useful especially if extractors null session variables when they don't match, but the value is needed to rescrape the file.
This method takes no parameters
True if session variables should be reset if there was an error, false otherwise.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
This will be called if all the retry attempts for the scrapeable file failed. In other words, if the policy said to retry 25 times, after 25 failures this method will be called. Note that runOnError will be called just before this, as it is called after each time the scrapeable file fails to load correctly, including the last time it fails to load.
This should only contain code that handles the final error. Any proxy rotating, cookie clearing, etc. should generally be done in the runOnError method, especially since it will still be called after the final error.
This method takes no parameters
This method returns void
Version | Description |
---|---|
6.0.37a | Available in all editions. |
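The ordering described above, where runOnError runs after every failed attempt and the final-failure handler runs once after the last one, can be sketched as a loop. Everything here is hypothetical scaffolding for illustration: SketchPolicy is a cut-down stand-in rather than the real RetryPolicy interface, and apart from runOnError the method names are invented for the example.

```java
// Hypothetical stand-alone sketch of the retry flow; not screen-scraper code.
class RetryLoopSketch {
    // Cut-down stand-in for a retry policy; names other than runOnError are invented.
    interface SketchPolicy {
        int maxRetries();
        boolean isError();
        void runOnError();
        void runOnAllAttemptsFailed();
    }

    // Returns true if some attempt succeeded, false if every attempt failed.
    static boolean scrapeWithRetries(Runnable download, SketchPolicy policy) {
        for (int attempt = 0; attempt < policy.maxRetries(); attempt++) {
            download.run();
            if (!policy.isError()) {
                return true;
            }
            policy.runOnError(); // runs after every failure, including the last
        }
        policy.runOnAllAttemptsFailed(); // runs once, after the final failure
        return false;
    }
}
```

This is why proxy rotation and similar recovery steps belong in runOnError: by the time the final handler runs, runOnError has already executed for the last failure.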
Runs this code when the page had an error. This could include things such as rotating the proxy. This code will be executed just before the page is downloaded again.
This method takes no parameters
This method returns void
Version | Description |
---|---|
5.5.29a | Available in all editions. |
Returns true if errors should be logged to the log/web interface when they occur
This method takes no parameters
True if errors should be logged to the log/web interface when they occur
Version | Description |
---|---|
5.5.29a | Available in all editions. |
Class used to create simple Retry Policies. See the RetryPolicy page for more details on what a RetryPolicy does. This class is found in the com.screenscraper.util.retry package.
Policy that retries if there was an error on the request by status code. Executes the runnable given before retrying.
The RetryPolicy to set in the ScrapeableFile
Version | Description |
---|---|
5.5.29a | Available in all editions. |
Policy that returns no error. Useful for having a session-wide retry policy, but then using this for a particular scrapeable file so it doesn't use the session's policy
The RetryPolicy to set in the ScrapeableFile
Version | Description |
---|---|
6.0.25a | Available in all editions. |
Policy that requires a Regular Expression to match the page content (including headers) in order to be considered valid.
The RetryPolicy to set in the ScrapeableFile
Version | Description |
---|---|
5.5.29a | Available in all editions. |
Policy that requires a Regular Expression NOT to match the page content (including headers) in order to be considered valid. In other words, if the Regular Expression matches, it means that the page should be rescraped.
The RetryPolicy to set in the ScrapeableFile
Version | Description |
---|---|
5.5.29a | Available in all editions. |
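The two regular-expression policies described above differ only in which outcome counts as an error. The following is a minimal sketch of that decision in plain Java; it does not use screen-scraper's retry classes, and the class and method names (RegexRetryCheck, isErrorWhenRequired, isErrorWhenForbidden) are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

// Illustrative only: models the error decision, not screen-scraper's own implementation.
class RegexRetryCheck {

    // "Must match" policy: the page is treated as an error when the pattern is NOT found.
    static boolean isErrorWhenRequired(String regex, String pageContent) {
        return !Pattern.compile(regex).matcher(pageContent).find();
    }

    // "Must NOT match" policy: the page is treated as an error when the pattern IS found.
    static boolean isErrorWhenForbidden(String regex, String pageContent) {
        return Pattern.compile(regex).matcher(pageContent).find();
    }

    public static void main(String[] args) {
        // Page content including headers, as the policies above describe.
        String page = "HTTP/1.1 200 OK\n\n<html><body>Access denied</body></html>";

        // The expected results table is missing, so a "must match" policy would retry.
        System.out.println(isErrorWhenRequired("<table id=\"results\"", page));

        // The block message is present, so a "must not match" policy would retry.
        System.out.println(isErrorWhenForbidden("Access denied", page));
    }
}
```

In both cases a result of true means the page should be rescraped.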
This object simplifies your interactions with a JDBC-compliant SQL database. It can work with various types of databases and even in a multi-threaded format to allow scrapes to continue without having to wait for the queries to process. View an example of how to use the SqlDataManager.
This feature is only available for Professional and Enterprise editions of screen-scraper.
Prefer a more traditional approach? See an example of Working with MySQL databases.
In order to use the SqlDataManager you will need to make sure to install the appropriate JDBC driver. This can be done by downloading the driver and placing it in the lib/ext folder in the screen-scraper installation directory.
Add an event callback to SqlDataManager object.
This feature is only available for Professional and Enterprise editions of screen-scraper.
Before adding an event to the SqlDataManager, you must build the schema of any tables you will use, because events are related to table operations such as inserting data.
The listener has a single method that needs to be implemented: public void handleEvent(DataManagerEvent event). The DataManagerEvent has a method getDataNode() to retrieve the relevant DataNode.
Returns the same DataManagerEventListener object that was passed in.
Version | Description |
---|---|
5.5 | Available for professional and enterprise editions. |
Add data to fields, in preparation for insertion into a database.
When adding data in a many-to-many relation, if setAutoManyToMany is set to false, a null row should be inserted into the relating table so the datamanager will link the keys correctly between related tables. For example, dm.addData("many_to_many", null);
Before adding data the first time, you must build the schema of any tables you will use, as well as add foreign keys if you are not using a database engine that natively supports them (such as InnoDB for MySQL).
The SqlDataManager will attempt to convert a value that is given to the correct format for the database. For example, if the database requires an int for a column named age, dm.addData("table", "age", "32") will convert the String "32" to an int before adding it to the database. See the table below the examples for other types of java objects and how they map to SQL types.
The table and columnName parameters are not case sensitive. The same is true for the key values in the data map.
Returns void.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
Since the DataManager is designed with screen-scraper in mind, all inputs support using the String type in addition to their corresponding Java object type, but the String needs to be parseable into the corresponding data type. For example, if there is a column that is defined as an Integer in the database, then the String needs to be parseable by Integer.parseInt(String value). Here is a mapping of the SQL types (based on java.sql.Types) to Java objects:
SQL Type | Java Object |
---|---|
java.sql.Types.CHAR | String |
java.sql.Types.VARCHAR | String |
java.sql.Types.LONGVARCHAR | String |
java.sql.Types.LONGNVARCHAR | String |
java.sql.Types.NUMERIC | BigDecimal |
java.sql.Types.DECIMAL | BigDecimal |
java.sql.Types.TINYINT | Integer |
java.sql.Types.SMALLINT | Integer |
java.sql.Types.INTEGER | Integer |
java.sql.Types.BIGINT | Long |
java.sql.Types.REAL | Float |
java.sql.Types.FLOAT | Double |
java.sql.Types.DOUBLE | Double |
java.sql.Types.BIT | Boolean |
java.sql.Types.BINARY | ByteArray |
java.sql.Types.VARBINARY | ByteArray |
java.sql.Types.LONGVARBINARY | ByteArray |
java.sql.Types.DATE | SQLDate or Long |
java.sql.Types.TIME | SQLTime or Long |
java.sql.Types.TIMESTAMP | SQLTime or Long |
java.sql.Types.ARRAY | Object |
java.sql.Types.BLOB | ByteArray |
java.sql.Types.CLOB | Object |
java.sql.Types.JAVA_OBJECT | Object |
java.sql.Types.OTHER | Object |
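To make the String-to-type mapping above more concrete, here is a rough sketch of how a scraped String could be parsed according to a column's java.sql.Types code. It covers only the character and numeric rows of the table, and the class name SqlValueParser and its parse method are hypothetical; the data manager's actual conversion code is not shown in this documentation.

```java
import java.math.BigDecimal;
import java.sql.Types;

// Illustrative only: a partial stand-in for the conversion the table above describes.
class SqlValueParser {

    // Parses a String into the Java object listed for the given SQL type.
    // Throws NumberFormatException if the String is not parseable as a numeric type.
    static Object parse(int sqlType, String value) {
        if (value == null) {
            return null;
        }
        switch (sqlType) {
            case Types.CHAR:
            case Types.VARCHAR:
            case Types.LONGVARCHAR:
                return value;
            case Types.NUMERIC:
            case Types.DECIMAL:
                return new BigDecimal(value);
            case Types.TINYINT:
            case Types.SMALLINT:
            case Types.INTEGER:
                return Integer.parseInt(value);
            case Types.BIGINT:
                return Long.parseLong(value);
            case Types.REAL:
                return Float.parseFloat(value);
            case Types.FLOAT:
            case Types.DOUBLE:
                return Double.parseDouble(value);
            case Types.BIT:
                return Boolean.parseBoolean(value);
            default:
                throw new IllegalArgumentException("Type not covered by this sketch: " + sqlType);
        }
    }

    public static void main(String[] args) {
        // Mirrors the "age" example: the String "32" becomes an Integer.
        Object age = parse(Types.INTEGER, "32");
        System.out.println(age + " (" + age.getClass().getSimpleName() + ")");
    }
}
```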
Manually setup table connection (key matching).
If SqlDataManager.buildSchemas is called, any foreign keys manually added before that point will be overridden or erased.
Returns void.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
If the database has some indication of foreign keys then these will be followed automatically. If the database does not allow for foreign key references then you will need to build the table connections using this method.
Manually add session variable data to fields, in preparation for insertion into a database.
The keys from the session will be matched in a case insensitive way to the column names of the database.
Returns void.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
Automatically add corresponding session variables to a table when that table is committed.
Returns void.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
Collect the database schema information, including foreign key relations between tables.
Schemas must be built for any tables that will be used by this DataManager before data can be added.
Returns void.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
Clear all data from the data manager without writing it to the database. This includes all data previously committed but not yet written.
This method does not receive any parameters.
Returns void.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
Clear session variables corresponding to the fields of a specific table (case insensitive).
Returns void.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
Clear session variables corresponding to a committed table automatically.
Returns void.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
Close data manager's connections.
If there is data that has not yet been written to the database when this method is called it will not be written.
This method does not receive any parameters.
Returns void.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
Commit a prepared row of data into the queue. Once called, the data can no longer be edited. When working with multiple tables that relate by a foreign key, it is important to commit rows in the correct order. The rows in each of the child tables should be committed before the parent, or they will not be correctly linked when written to the database.
This does not write the row of data to the database, but rather puts it in a queue to be written at a later time.
Returns void.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
Commit prepared rows of data for all tables into the queue. Once called, the data can no longer be edited.
This method does not receive any parameters.
Returns void.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
Write committed data to the database. Any data that has not been committed using either the commit or commitAll method will be lost and not written to the database.
This method does not receive any parameters.
Returns true if the data was successfully written to the database; otherwise, it returns false.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
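The difference between adding, committing, and flushing data can be pictured with a small in-memory model. The sketch below is a toy for a single table, written under the assumption that the behavior is exactly as described in the text: committed rows wait in a queue, a flush writes the queue, and anything not committed at flush time is lost. The QueueModel class is hypothetical and is not part of screen-scraper.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a toy model of the add / commit / flush life cycle.
class QueueModel {
    private Map<String, String> pendingRow = new LinkedHashMap<>();      // row still being built
    private final List<Map<String, String>> queue = new ArrayList<>();  // committed, not yet written
    final List<Map<String, String>> written = new ArrayList<>();        // stands in for the database

    // Column names are treated case-insensitively, as the addData notes describe.
    void addData(String column, String value) {
        pendingRow.put(column.toLowerCase(), value);
    }

    // Moves the prepared row into the queue; it can no longer be edited afterwards.
    void commit() {
        if (!pendingRow.isEmpty()) {
            queue.add(pendingRow);
            pendingRow = new LinkedHashMap<>();
        }
    }

    // Writes queued rows and discards any row that was never committed.
    boolean flush() {
        written.addAll(queue);
        queue.clear();
        pendingRow = new LinkedHashMap<>();
        return true;
    }

    public static void main(String[] args) {
        QueueModel dm = new QueueModel();
        dm.addData("Name", "Alice");
        dm.commit();                 // queued for writing
        dm.addData("Name", "Bob");   // never committed
        dm.flush();
        System.out.println(dm.written); // only the committed row is present
    }
}
```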
Retrieve the connection object of the data manager. This can be helpful if you want to do something that the data manager cannot do easily, such as query the database.
Be sure to close the connection once it is no longer needed. Failure to do so could exhaust the connection pool used by the data manager, which will cause the scraping session to hang.
This method does not receive any parameters.
Returns a connection object matching the one used in the data manager.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
Retrieve the last autogenerated primary key, if any, for the given table.
Case-insensitive table name.
Returns a com.screenscraper.datamanager.DataObject containing the primary key.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
Sets whether or not the data manager should automatically take care of many-to-many relationships.
Returns void.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
If the many-to-many table has more information than just the keys then you will want to leave this feature turned off so that you can add more data than just the keys before committing.
This feature is only available for Professional and Enterprise editions of screen-scraper.
Set global merge status. When conflicts exist in data, a merge of true will take the newer values and save them over previous null values.
When merging or updating values in a table, that table must have a Primary Key. When the Primary Key is set to autoincrement, if the value of that key was not set with the addData method the DataManager will create a new row rather than update or merge with an existing row. One solution is to use an SqlDuplicateFilter to set fields that would identify an entry as a duplicate and automatically insert the value of the autoincrement key when data is committed.
Update | Merge | Resulting Action |
---|---|---|
false | false | Ignore row on duplicate |
true | false | Update only values whose corresponding columns are currently NOT NULL in the database |
false | true | Update only values whose corresponding columns are currently NULL in the database |
true | true | Update all values to new data |
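The table above can be read as a per-column rule: merge controls what happens to columns that are currently NULL, and update controls what happens to columns that already hold a value. Here is a minimal sketch of that rule in plain Java. The MergeUpdateRule class and its resolve method are hypothetical, and the sketch assumes the rule applies column by column once a duplicate row has been found.

```java
// Illustrative only: encodes the Update / Merge table as a per-column decision.
class MergeUpdateRule {

    // Returns the value a column holds after a duplicate row is processed.
    static String resolve(String existing, String incoming, boolean update, boolean merge) {
        if (existing == null) {
            // Columns that are NULL in the database are only filled in when merge is on.
            return merge ? incoming : null;
        }
        // Columns that are NOT NULL are only overwritten when update is on.
        return update ? incoming : existing;
    }

    public static void main(String[] args) {
        // merge only: a NULL phone number is filled in, an existing name is kept.
        System.out.println(resolve(null, "555-0100", false, true));
        System.out.println(resolve("Ann", "Anne", false, true));
    }
}
```

With both flags false, every column keeps its existing value, which corresponds to ignoring the row on duplicate.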
Returns void.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
This feature is only available for Professional and Enterprise editions of screen-scraper.
Set update status globally. When conflicts exist in data, an update of true will take the newer values and save them over previous non-null values.
When merging or updating values in a table, that table must have a Primary Key. When the Primary Key is set to autoincrement, if the value of that key was not set with the addData method the DataManager will create a new row rather than update or merge with an existing row. One solution is to use an SqlDuplicateFilter to set fields that would identify an entry as a duplicate and automatically insert the value of the autoincrement key when data is committed.
Update | Merge | Resulting Action |
---|---|---|
false | false | Ignore row on duplicate |
true | false | Update only values whose corresponding columns are currently NOT NULL in the database |
false | true | Update only values whose corresponding columns are currently NULL in the database |
true | true | Update all values to new data |
Returns void.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
Set the error logging level. Currently only DEBUG and ERROR levels are supported. At the DEBUG level, all queries and results will be output to the log.
Returns void.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
This feature is only available for Professional and Enterprise editions of screen-scraper.
Set merge status for a table. When conflicts exist in data, a merge of true will take the newer values and save them over previous null values.
When merging or updating values in a table, that table must have a Primary Key. When the Primary Key is set to autoincrement, if the value of that key was not set with the addData method the DataManager will create a new row rather than update or merge with an existing row. One solution is to use an SqlDuplicateFilter to set fields that would identify an entry as a duplicate and automatically insert the value of the autoincrement key when data is committed.
Update | Merge | Resulting Action |
---|---|---|
false | false | Ignore row on duplicate |
true | false | Update only values whose corresponding columns are currently NOT NULL in the database |
false | true | Update only values whose corresponding columns are currently NULL in the database |
true | true | Update all values to new data |
Returns void.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
Set number of threads that the data manager can have open at once. When set higher than one, the scraping session can continue to run and download pages while the database is being written. This can decrease the time required to run a scrape, but also makes debugging harder as there is no guarantee about the order in which data will be written. It is recommended to leave this setting alone while developing a scrape. Also, the flush method will always return true if more than one thread is being used to write to the database, even if the write failed.
Returns void.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
This feature is only available for Professional and Enterprise editions of screen-scraper.
Set update status for a given table. When conflicts exist in data, an update of true will take the newer values and save them over previous non-null values.
When merging or updating values in a table, that table must have a Primary Key. When the Primary Key is set to autoincrement, if the value of that key was not set with the addData method the DataManager will create a new row rather than update or merge with an existing row. One solution is to use an SqlDuplicateFilter to set fields that would identify an entry as a duplicate and automatically insert the value of the autoincrement key when data is committed.
Update | Merge | Resulting Action |
---|---|---|
false | false | Ignore row on duplicate |
true | false | Update only values whose corresponding columns are currently NOT NULL in the database |
false | true | Update only values whose corresponding columns are currently NULL in the database |
true | true | Update all values to new data |
Returns void.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
Initiate a SqlDataManager object.
Before adding data to the SqlDataManager, you must build the schema of any tables you will use, as well as add foreign keys if you are not using a database engine that natively supports them (such as InnoDB for MySQL).
Returns a SqlDataManager. If an error is experienced it will be thrown.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
com.screenscraper.datamanager.sql.SqlDataManager
SqlDuplicateFilters are designed to filter duplicates when more information than just a primary key might define a duplicate entry. For example, you might define a unique person by their SSN, driver's license number, or by their first name, last name, and phone number. It is also possible that a single person may have multiple phone numbers, and if any of them match then the duplicate constraint should be met. Using an SqlDuplicateFilter can check for conditions such as this and correctly recognize duplicate entries.
This feature is only available for Professional and Enterprise editions of screen-scraper.
Sometimes the data will need to be filtered across multiple tables, or different constraints might each indicate a duplicate. For example, a person might be a duplicate if their SSN matches OR if their driver's license number matches. Alternatively, they may be a duplicate when they have the same first name, last name, and phone number.
Duplicate filters are checked in the order they are added, so consider performance when creating duplicate filters. If, for instance, most duplicates will match on the social security number, create that filter before the others. Also make sure to add indexes in your database on the columns you are selecting by, or else performance will rapidly degrade as your database gets large.
Duplicates will be filtered by any one of the filters created. If multiple fields must all match for an entry to be a duplicate, create a single filter and add each of those fields as constraints. In other words, constraints added to a single filter will be ANDed together, while separate filters will be ORed.
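The AND/OR behavior described above can be sketched with plain collections. In this hypothetical model, each filter is a list of column names that must all match, and the filters are tried in the order they were added. The DuplicateCheck class, the column names, and the choice to treat a missing value as a non-match are all assumptions made for illustration; this is not the SqlDuplicateFilter API.

```java
import java.util.List;
import java.util.Map;

// Illustrative only: models "constraints are ANDed, filters are ORed".
class DuplicateCheck {

    // A single filter matches only if every one of its columns matches (AND).
    static boolean matches(List<String> filter, Map<String, String> incoming, Map<String, String> existing) {
        for (String column : filter) {
            String value = incoming.get(column);
            if (value == null || !value.equals(existing.get(column))) {
                return false;
            }
        }
        return true;
    }

    // Filters are checked in order; the first one that matches is enough (OR).
    static boolean isDuplicate(List<List<String>> filters, Map<String, String> incoming, Map<String, String> existing) {
        for (List<String> filter : filters) {
            if (matches(filter, incoming, existing)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Cheapest and most likely filter first, as recommended above.
        List<List<String>> filters = List.of(
                List.of("ssn"),
                List.of("drivers_license"),
                List.of("first_name", "last_name", "phone"));

        Map<String, String> existing = Map.of(
                "ssn", "111", "drivers_license", "D1",
                "first_name", "Ann", "last_name", "Lee", "phone", "555");
        Map<String, String> incoming = Map.of(
                "ssn", "999", "drivers_license", "D1",
                "first_name", "Anna", "last_name", "Lee", "phone", "555");

        // Matches on the second filter (driver's license) even though the SSN differs.
        System.out.println(isDuplicate(filters, incoming, existing));
    }
}
```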
Add a constraint that checks the value of new entries against the value of entries already in the database for a given column and table.
Returns void.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
Create an SqlDuplicateFilter for a specific table and register it with the data manager.
Returns an SqlDuplicateFilter that can then be configured for duplicate entries.
Version | Description |
---|---|
5.0 | Available for professional and enterprise editions. |
Oftentimes you want to write extracted data directly to an XML file. This class facilitates doing that. Before working with the methods below, you may wish to read our documentation about writing extracted data to XML, which contains examples of scripts that utilize these methods.
This feature is only available to Enterprise editions of screen-scraper.
Initiate an XmlWriter object.
Returns an XmlWriter. If an error is experienced it will be thrown.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
5.5.3a | Added the constructor that takes a character set. |
com.screenscraper.xml.XmlWriter
Add a node to the XML file.
Returns the added element object.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Add multiple nodes under a single node (new or already in existence).
Returns the main added element object, if one was created. If no main element was added, it returns void.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Close the XmlWriter.
This method does not receive any parameters.
Returns void.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
The REST API was first released in the stable version 5.0 (alpha 4.5.18a). It is not a true REST API but rather an API accessible via GET requests; for the sake of naming, we call it the screen-scraper REST API. It allows you to issue web interface commands through GET requests.
The basic structure to all REST API requests is to specify the action GET parameter with what you want to do. Some actions will require other parameters to be set as well. Here are some available actions and their parameters.
For any of this to work screen-scraper has to be running in server mode.
This feature is only available to Enterprise editions of screen-scraper.
The returned file now contains the scrapeable_session_id of the scrape to ease in manipulating it with other REST Interface actions.
All requests require that you pass your registered email address, which will be determined when you sign up for the anonymization service. This is passed as a URL-encoded string in the URL query string using the key registered_email. Your password will also be required, which is passed to the server via the password parameter.
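As a rough illustration of the query string described above, the following sketch URL-encodes the two credentials and appends them to a base address. The parameter names registered_email and password come from the text; the base URL, the class name AnonymizationRequestUrl, and the sample credentials are placeholders, since the real server address is not given here.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative only: builds the query string for a request to the anonymization service.
class AnonymizationRequestUrl {

    // baseUrl is a placeholder; substitute the address supplied with your account.
    static String build(String baseUrl, String email, String password) {
        return baseUrl
                + "?registered_email=" + URLEncoder.encode(email, StandardCharsets.UTF_8)
                + "&password=" + URLEncoder.encode(password, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical host and credentials.
        System.out.println(build("https://proxies.example.com/api", "me@example.com", "s3cret&x"));
        // prints https://proxies.example.com/api?registered_email=me%40example.com&password=s3cret%26x
    }
}
```

Encoding matters here because characters such as @ and & would otherwise be misread as part of the URL structure.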
Each call to the server is done via a GET request. The possible requests are described below:
Expect an average delay of around 20 seconds before receiving a response from the system for each request made.
Here's an example of what would be returned from this request:
ec2-75-101-238-93.compute-1.amazonaws.com:3128 i-61955e08
ec2-75-131-250-53.compute-1.amazonaws.com:3128 i-6e955e07
Each proxy gets its own line. The host and port are given first, then a space character, then the instance ID.
You'll use the instance ID if you want to report a proxy as bad (so that it will be terminated and one will be spawned in its place).
After terminating a proxy, it will take a minute or two to spawn one in its place. You'll want to query the server periodically in order to refresh your current pool of proxies.
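Because each line of the response follows the same "host:port, space, instance ID" layout, it is straightforward to parse. The sketch below shows one way to do so in plain Java; the ProxyListParser and ProxyEntry names are hypothetical, and skipping malformed lines silently is a design choice made for the example rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: parses the proxy list format shown in the sample response.
class ProxyListParser {

    static final class ProxyEntry {
        final String host;
        final int port;
        final String instanceId;

        ProxyEntry(String host, int port, String instanceId) {
            this.host = host;
            this.port = port;
            this.instanceId = instanceId;
        }
    }

    static List<ProxyEntry> parse(String response) {
        List<ProxyEntry> proxies = new ArrayList<>();
        for (String line : response.split("\\R")) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            // First token is "host:port", the remainder is the instance ID.
            String[] parts = line.split("\\s+", 2);
            if (parts.length != 2) {
                continue; // skip lines that do not follow the expected layout
            }
            int colon = parts[0].lastIndexOf(':');
            if (colon < 0) {
                continue;
            }
            String host = parts[0].substring(0, colon);
            int port = Integer.parseInt(parts[0].substring(colon + 1));
            proxies.add(new ProxyEntry(host, port, parts[1].trim()));
        }
        return proxies;
    }

    public static void main(String[] args) {
        String response =
                "ec2-75-101-238-93.compute-1.amazonaws.com:3128 i-61955e08\n"
              + "ec2-75-131-250-53.compute-1.amazonaws.com:3128 i-6e955e07\n";
        for (ProxyEntry p : parse(response)) {
            System.out.println(p.host + " port " + p.port + " instance " + p.instanceId);
        }
    }
}
```

Keeping the instance ID alongside each host makes it easy to report a bad proxy later and to refresh the pool when the list is queried again.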
When writing scripts within screen-scraper, there are a number of objects and methods available to you. You can view the stable objects and classes available to scripts in the API section of our documentation. This section only documents those methods that are in a current alpha release. You are welcome to use them, but know that they are prone to change. We always work for backwards compatibility of stable features, but with alpha features we will not guarantee compatibility until they appear in a stable version.
Alpha methods and objects are only available if you have upgraded screen-scraper to an unstable version. We don't guarantee that the methods will not change after their introduction if improvements are required or desired, or if their purposes change.
The examples are given using Interpreted Java as the scripting language. This is in accordance with the stable API.
Applies an XPath expression to the current HTML response. If tidying the response failed this method will also fail.
An XmlNode. See example for usage.
Version | Description |
---|---|
6.0.1a | Available in Professional and Enterprise editions. |
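For readers unfamiliar with XPath, the sketch below shows the general idea using the XPath support that ships with the JDK rather than screen-scraper's own XmlNode type. It assumes the HTML has already been tidied into well-formed markup, which is why a parse failure is surfaced as an error, in line with the note above. The XPathSketch class and its select method are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative only: uses the JDK's XPath API, not screen-scraper's XmlNode.
class XPathSketch {

    // Applies an XPath expression to well-formed (tidied) markup and returns the matching nodes.
    static NodeList select(String tidiedHtml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(tidiedHtml)));
            return (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
        } catch (Exception e) {
            // Markup that is not well formed, or an invalid expression, ends up here.
            throw new IllegalArgumentException("Could not apply XPath expression", e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"/a\">A</a><a href=\"/b\">B</a></body></html>";
        NodeList links = select(html, "//a/@href");
        for (int i = 0; i < links.getLength(); i++) {
            System.out.println(links.item(i).getNodeValue());
        }
    }
}
```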