The scraping engine is the backbone of screen-scraper and provides four built-in objects: session, scrapeableFile, dataSet, and dataRecord. The RunnableScrapingSession class is also documented here, as it relates most closely to the engine.
For details on which objects are available to scripts in the context of a scrape, see the variable scope section of the documentation.
The dataRecord object is populated using the names of tokens from extractor patterns.
This object gives access to the most recently extracted data record. It will most likely be used only in scripts that run after each application of an extractor pattern. The object simply extends Hashtable (documentation on its methods can be found in Java's documentation).
You'll find a few of the most commonly used methods below. DataRecord objects can also be created from scratch and subsequently added to DataSet objects using the addDataRecord method.
See example usage: Iterate over DataSets & DataRecords.
Create a new DataRecord object.
This method does not receive any parameters.
Returns DataRecord object.
Version | Description |
---|---|
4.5 | Available for all editions. |
com.screenscraper.common.DataRecord
See additional example usage: Iterate over DataSets & DataRecords.
Get the value of a DataRecord field.
Returns the value associated with the specified key. Usually it will be a string but, if you have manually added fields, it can be an integer, boolean, long, or other object.
Version | Description |
---|---|
4.5 | Available for all editions. |
Add a new field to the DataRecord or update the value of an existing field.
Returns the value previously associated with the specified key. If the key did not exist then it will return null.
Version | Description |
---|---|
4.5 | Available for all editions. |
See additional example usage: Iterate over DataSets & DataRecords.
Remove a field from the DataRecord.
Returns the value previously associated with the specified key. If the key did not exist then it will return null.
Version | Description |
---|---|
4.5 | Available for all editions. |
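Because the dataRecord object extends Hashtable, the return values described above for getting, adding, and removing fields are those of `java.util.Hashtable`. The following is a minimal sketch that uses a plain `Hashtable` as a stand-in for a DataRecord (inside a script you would work with the `dataRecord` object instead); the field names `TITLE` and `PRICE` are hypothetical.

```java
import java.util.Hashtable;

final class DataRecordSketch {
    /** Builds a stand-in record; a plain Hashtable plays the role of a DataRecord here. */
    static Hashtable<String, Object> newRecord() {
        Hashtable<String, Object> record = new Hashtable<>();
        record.put("TITLE", "Widget"); // hypothetical token name and value
        return record;
    }

    public static void main(String[] args) {
        Hashtable<String, Object> record = newRecord();

        // put returns the previous value, or null if the key was not present.
        Object before = record.put("PRICE", 5);   // null: PRICE is new
        Object replaced = record.put("PRICE", 7); // 5: the old value

        // get returns whatever was stored, so manually added fields need not be strings.
        Object price = record.get("PRICE");       // Integer 7

        // remove also returns the previous value, or null for a missing key.
        Object removed = record.remove("TITLE");
        Object missing = record.remove("NO_SUCH_FIELD");

        System.out.println(before + " " + replaced + " " + price + " " + removed + " " + missing);
    }
}
```

Note that a Hashtable rejects null keys and null values, so a field cannot be "blanked" by storing null; remove it instead.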
The dataSet object holds all data records extracted by an extractor pattern after it has been applied as many times as possible to the HTML retrieved by a scrapeable file. A data set is analogous to a result or record set that would be returned from a database query. A data set contains any number of data records, which are analogous to rows in a database.
The dataSet object provides methods to aid in getting at the information that has been gathered.
See example usage: Iterate over DataSets & DataRecords.
Manually create a DataSet.
Returns DataSet object.
Version | Description |
---|---|
4.5 | Available for all editions. |
com.screenscraper.common.DataSet
See additional example usage: Iterate over DataSets & DataRecords.
Add a DataRecord to a DataSet.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
See additional example usage: Iterate over DataSets & DataRecords.
Remove all DataRecord objects from the DataSet.
This method does not receive any parameters.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
See additional example usage: Iterate over DataSets & DataRecords.
Remove a DataRecord from the DataSet.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
Retrieve a field's value in a data set based on another field.
Returns the value in the returned column, usually a string (unless records have been manually added). If no match is found, null is returned.
Version | Description |
---|---|
5.0 | Added for all editions. |
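The lookup described above behaves much like a single-column query against a result set. The sketch below illustrates the idea on a list of `Hashtable` stand-in records; it is not the dataSet implementation. The helper name `lookup`, the field names `SKU` and `PRICE`, and the choice to return the first matching record are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;

final class LookupSketch {
    /** Convenience builder for a stand-in record with two hypothetical fields. */
    static Hashtable<String, Object> record(String sku, String price) {
        Hashtable<String, Object> r = new Hashtable<>();
        r.put("SKU", sku);
        r.put("PRICE", price);
        return r;
    }

    /**
     * Returns the value of returnField from the first record whose matchField
     * equals matchValue, or null when no record matches.
     */
    static Object lookup(List<Hashtable<String, Object>> records,
                         String matchField, Object matchValue, String returnField) {
        for (Hashtable<String, Object> r : records) {
            if (matchValue != null && matchValue.equals(r.get(matchField))) {
                return r.get(returnField);
            }
        }
        return null; // no record matched
    }

    public static void main(String[] args) {
        List<Hashtable<String, Object>> records = new ArrayList<>();
        records.add(record("A-1", "4.99"));
        records.add(record("B-2", "12.50"));

        System.out.println(lookup(records, "SKU", "B-2", "PRICE"));
        System.out.println(lookup(records, "SKU", "Z-9", "PRICE"));
    }
}
```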
Get a single piece of data held by a DataRecord in the DataSet.
Returns the value associated with the DataRecord identifier. It will be a string unless you have added values to the DataRecord whose values are not strings.
Version | Description |
---|---|
4.5 | Available for all editions. |
Get all DataRecords in the DataSet.
This method does not receive any parameters.
Returns an ArrayList of DataRecord objects.
Version | Description |
---|---|
4.5 | Available for all editions. |
This method is provided as a convenience; the recommended way to iterate over data records in a data set is to use getNumDataRecords and getDataRecord.
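The index-based loop recommended above can be sketched as follows. `StandInDataSet` is a hypothetical stand-in that exposes only the accessor names mentioned in this section, so that the example runs on its own; it is not the real DataSet class, and zero-based indexing is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;

final class IterationSketch {
    /** Stand-in exposing only the accessors named above; not the real DataSet class. */
    static final class StandInDataSet {
        private final List<Hashtable<String, Object>> records = new ArrayList<>();

        void addDataRecord(Hashtable<String, Object> record) { records.add(record); }
        int getNumDataRecords() { return records.size(); }
        Hashtable<String, Object> getDataRecord(int index) { return records.get(index); }
    }

    /** Collects one field from every record using the index-based loop. */
    static List<Object> collect(StandInDataSet dataSet, String field) {
        List<Object> values = new ArrayList<>();
        for (int i = 0; i < dataSet.getNumDataRecords(); i++) {
            values.add(dataSet.getDataRecord(i).get(field));
        }
        return values;
    }

    public static void main(String[] args) {
        StandInDataSet dataSet = new StandInDataSet();
        for (String name : new String[] {"first", "second"}) {
            Hashtable<String, Object> record = new Hashtable<>();
            record.put("NAME", name); // hypothetical token name
            dataSet.addDataRecord(record);
        }
        System.out.println(collect(dataSet, "NAME"));
    }
}
```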
Get the character set being applied to the scraped data.
This method does not receive any parameters.
Returns the character set applied to the scraped data, as a string. If a character set has not been specified then it will default to the character set specified in the settings dialog box.
Version | Description |
---|---|
5.0 | Added for all editions. |
Get one DataRecord in the DataSet.
Returns a DataRecord (Hashtable object). If there is not a DataRecord at the specified index an error will be thrown.
Version | Description |
---|---|
4.5 | Available for all editions. |
Get the first non-null value, in a data set, for a given token.
Returns the first non-null value in the column, usually a string (unless records have been manually added). If none is found, null is returned.
Version | Description |
---|---|
5.0 | Added for all editions. |
Get the number of DataRecords in the DataSet.
This method does not receive any parameters.
Returns the number of DataRecords in the DataSet, as an integer.
Version | Description |
---|---|
4.5 | Available for all editions. |
Merge data records from two data sets.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
Set the character set to be used for rendering dataSet values.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
This will only change the character set on the current data set. If you want it to be changed for all data sets, you would need to change it in the settings dialog box or screen-scraper.properties file.
Get the number of DataRecords in the DataSet.
This method does not receive any parameters.
Returns the number of DataRecords in the DataSet, as an integer.
Version | Description |
---|---|
6.0.3a | Available for all editions. |
Write DataSet string and integer contents to a file. The fields will be tab-delimited and records hard-return delimited.
Returns void. If the file cannot be written to then an error will be thrown.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
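To make the file layout concrete, here is a rough sketch of tab-delimited output with one record per line. It is not the method's implementation. Several details are assumptions made for illustration: the column order is passed in explicitly, values that are neither strings nor integers are written as empty cells, no header row is written, and `\n` is used as the hard return.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TabDelimitedSketch {
    /** Formats records as tab-delimited lines, keeping only String and Integer values. */
    static String format(List<Map<String, Object>> records, List<String> columns) {
        StringBuilder out = new StringBuilder();
        for (Map<String, Object> record : records) {
            List<String> cells = new ArrayList<>();
            for (String column : columns) {
                Object value = record.get(column);
                boolean writable = value instanceof String || value instanceof Integer;
                cells.add(writable ? value.toString() : "");
            }
            out.append(String.join("\t", cells)).append("\n");
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        Map<String, Object> record = new HashMap<>();
        record.put("NAME", "Widget"); // hypothetical field names
        record.put("QTY", 5);

        String text = format(Arrays.asList(record), Arrays.asList("NAME", "QTY"));
        Path file = Files.createTempFile("dataset", ".txt");
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
        System.out.println("Wrote " + file);
    }
}
```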
This object contains various methods used to log information about a running scraping session to log files, the workbench "Log" tab, and the web interface.
Creates an automatic progress bar and adds it to the progress bars. These progress bars match their progress to a value from a session variable and a list of values. When web messages are output with the webDebug, webInfo, webWarn, or webError methods, a progress bar will be drawn to give a visual representation of the current progress of the scrape.
Note that when using auto progress bars, it is advised to not use any manually monitored ones, as it can cause conflicts. Anytime an auto progress bar has no session variable set for its monitored key, it deletes itself and all children progress bars (including manual ones). As long as you keep that in mind, it should be safe to use both types together.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.31a | Available in enterprise edition. |
5.5.43a | Moved from session to log class. |
Watches for all session variables whose keys end with the postfix specified, and will output their values when monitored variables are logged.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.42a | Moved from session to log class. |
Watches for all session variables whose keys begin with the prefix specified, and will output their values when monitored variables are logged.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.42a | Moved from session to log class. |
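Prefix and postfix watching amounts to selecting session variables by the shape of their keys. The sketch below shows that selection on a plain map; the helper name `matching` and the variable names are hypothetical, and the real log object may select and order variables differently.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class MonitorSketch {
    /**
     * Returns the variables whose key starts with prefix or ends with postfix.
     * Pass null for either argument to skip that test.
     */
    static Map<String, Object> matching(Map<String, Object> sessionVariables,
                                        String prefix, String postfix) {
        Map<String, Object> watched = new TreeMap<>(); // sorted for stable output
        for (Map.Entry<String, Object> entry : sessionVariables.entrySet()) {
            String key = entry.getKey();
            boolean byPrefix = prefix != null && key.startsWith(prefix);
            boolean byPostfix = postfix != null && key.endsWith(postfix);
            if (byPrefix || byPostfix) {
                watched.put(key, entry.getValue());
            }
        }
        return watched;
    }

    public static void main(String[] args) {
        Map<String, Object> vars = new HashMap<>();
        vars.put("SEARCH_TERM", "dvd");
        vars.put("PAGE_COUNT", 3);
        vars.put("LAST_URL", "http://example.com/");

        // Watches everything starting with SEARCH_ or ending with _COUNT.
        System.out.println(matching(vars, "SEARCH_", "_COUNT"));
    }
}
```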
Adds a specific name and value to be logged with the web message methods or the logMonitoredValues method.
Returns the previous value associated with the name, or null if there wasn't one.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.42a | Moved from session to log class. |
Watches the value of a session variable and outputs it each time monitored variables are output.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.42a | Moved from session to log class. |
Adds a new progress bar. If no progress bar exists, this will be set as the root, otherwise it will be the child of the lowest progress bar. When web messages are output with the webDebug, webInfo, webWarn, or webError methods, a progress bar will be drawn to give a visual representation of the current progress of the scrape. The addProgressBarIfNotStopped versions remove the progress bar if the scrape has not been stopped, which is useful for determining when a scrape was stopped.
This method returns a reference to the new progress bar, which can be used to update the current progress
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.31a | Available in enterprise edition. |
5.5.43a | Moved from session to log class. |
Appends a status message to be displayed in the web interface.
None
Version | Description |
---|---|
5.5.32a | Available in Enterprise edition. |
5.5.43a | Moved from session to log class. |
Adds a file to the cache. This can be used to add anything to the cache, from a text file to an image that was downloaded, or any other file that would be useful.
A File that represents the cached file.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
5.5.43a | Moved from session to log class. |
Caches the HTML and headers of the scrapeable file. This will include both the request and response headers.
A File that represents the cached file.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
5.5.43a | Moved from session to log class. |
Adds text to the cache. This will create a new text file in the cache and store the given content in it.
A File that represents the cached file.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
5.5.43a | Moved from session to log class. |
Write message to the log.
Returns void.
Version | Description |
---|---|
5.5 | Now accepts any Object as a message |
4.5 | Available for all editions. |
When the workbench is running, this will be found under the log tab for the scraping session. When screen-scraper is running in server mode, the message will get sent to the corresponding .log file found in screen-scraper's log folder. When screen-scraper is invoked from the command line, the message will get sent to standard out.
Enables caching for this scrape. When caching is enabled, each time a scrapeable file is downloaded it will be saved to the file system. Once the session is completed the cache will be either zipped or the directory renamed, depending on the conditions that were specified when the cache was enabled. Optionally this will save the log files to the cached location, and will save everything from the error.log file that was added while the cache was enabled.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
5.5.32a | Renamed from enableCache to enableCaching |
5.5.43a | Moved from session to log class. |
Ends the caching for the scrape. This method will be called once all the scripts and files are run/scraped. It can be called in a script to end the caching early (thereby only caching a portion of the scrape). This only deals with saving downloaded content to the file system, not with reading it back in during a scrape.
This method takes no parameters
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
5.5.32a | Renamed from endCache to endCaching. |
5.5.43a | Moved from session to log class. |
Write message to the log.
Returns void.
Version | Description |
---|---|
5.5 | Now accepts any Object as a message |
4.5 | Available for all editions. |
When the workbench is running, this will be found under the log tab for the scraping session. When screen-scraper is running in server mode, the message will get sent to the corresponding .log file found in screen-scraper's log folder. When screen-scraper is invoked from the command line, the message will get sent to standard out.
Returns whether or not the cache is enabled for the scrape. When enabled, it simply means that each ScrapeableFile will save the content it downloads from the server to the file system so it can be viewed later, generally for debugging purposes.
This method takes no parameters
Returns true if caching is currently enabled for this session
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.32a | Available in enterprise and professional editions (returns false in basic edition, but doesn't throw an Exception). Renamed from getCacheEnabled to getCachingEnabled. |
5.5.43a | Moved from session to log class. |
Returns the progress bar specified. If the index is given, returns the progress bar at that index (0 is the root, 1 is the first child, and so on). If the title is given, returns the most recently added progress bar with the given title.
Returns the ProgressBar indicated, or null if none was found matching the required criteria.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.31a | Available in enterprise edition. |
5.5.43a | Moved from session to log class. |
Write message to the log.
Returns void.
Version | Description |
---|---|
5.5 | Now accepts any Object as a message |
4.5 | Available for all editions. |
When the workbench is running, this will be found under the log tab for the scraping session. When screen-scraper is running in server mode, the message will get sent to the corresponding .log file found in screen-scraper's log folder. When screen-scraper is invoked from the command line, the message will get sent to standard out.
Write message to the log.
Returns void.
Version | Description |
---|---|
5.5 | Now accepts any Object as a message |
4.5 | Available for all editions. |
When the workbench is running, this will be found under the log tab for the scraping session. When screen-scraper is running in server mode, the message will get sent to the corresponding .log file found in screen-scraper's log folder. When screen-scraper is invoked from the command line, the message will get sent to standard out.
Logs all the values in a Data Record to the log, with one line per value. If a value in the record is a List, Set, Map, Data Set, Scrapeable File, or Exception, it will have detailed output as well.
This method returns nothing
Version | Description |
---|---|
5.5.26a | Available in all editions. |
5.5.43a | Moved from session to log class. |
The output from a call to this method might look something like this:
DataRecord
--- A_FLOAT : 3.14159
--- A_LIST : List
------ Element 0 : Value 1
------ Element 1 : Value 2
------ Element 2 : Value 3
------ Element 3 : Set
--------- Element : A value
--------- Element : More value
--------- Element : Other stuff
--- A_MAP : Map
------ KEY_1 : 1
------ KEY_2 : 2
------ KEY_3 : 3
--- A_SET : Set Logged above as "------ Element 3 : "
--- A_STRING : Screen-Scraper
--- AN_INT : 5
Logs an Exception, with a full stack trace, at the Error level
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.43a | Moved from session to log class. |
Logs the values of all the currently monitored variables, the progress of the scrape, if known, and puts the message at the top. Also logs any additional values being watched. Logs values at the specified level.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.43a | Moved from session to log class. |
Logs closing values to indicate the scrape is complete and what values were when everything finished. It will log at whatever the highest level logged to was. For instance, if a webWarn had been logged during the scrape, this will log at the warning level.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.43a | Moved from session to log class. |
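The "highest level logged" rule described above can be pictured as a logger that remembers the most severe level it has seen. This is only an illustrative sketch: the `Level` enum, its ordering, and the starting level of `DEBUG` are assumptions, not details taken from the log object.

```java
final class ClosingLevelSketch {
    /** Levels in ascending severity; the ordering is an assumption of this sketch. */
    enum Level { DEBUG, INFO, WARN, ERROR }

    private Level highest = Level.DEBUG;

    /** Prints the message and remembers the most severe level seen so far. */
    void log(Level level, String message) {
        if (level.compareTo(highest) > 0) {
            highest = level;
        }
        System.out.println("[" + level + "] " + message);
    }

    /** Level at which a closing summary would be logged. */
    Level closingLevel() {
        return highest;
    }

    public static void main(String[] args) {
        ClosingLevelSketch log = new ClosingLevelSketch();
        log.log(Level.INFO, "Scrape started");
        log.log(Level.WARN, "Page 3 returned no records");
        log.log(Level.INFO, "Scrape finished");
        // A warning was logged earlier, so the closing message goes out at WARN.
        log.log(log.closingLevel(), "Closing values");
    }
}
```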
Logs the Object in a semi-intelligent way. For example, Maps are logged as key-value pairs, lists are logged with one element per line, all elements of a set are logged, and so on. Objects that are not a standard type of data set or list are simply logged using String.valueOf().
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.43a | Moved from session to log class. |
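As a rough illustration of type-aware logging, the sketch below expands maps and lists recursively and falls back to `String.valueOf()` for everything else. The `---` indentation mirrors the sample data record output shown earlier in this section; the helper name `describe` and the exact formatting rules are assumptions, and sets, data sets, and exceptions are left out for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ObjectLogSketch {
    /** Returns the lines that would be logged for one labelled value. */
    static List<String> describe(String label, Object value) {
        List<String> lines = new ArrayList<>();
        describe(label, value, 1, lines);
        return lines;
    }

    private static void describe(String label, Object value, int depth, List<String> lines) {
        StringBuilder prefix = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            prefix.append("---"); // three dashes per nesting level
        }
        prefix.append(' ').append(label).append(" : ");

        if (value instanceof Map) {
            lines.add(prefix + "Map");
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                describe(String.valueOf(entry.getKey()), entry.getValue(), depth + 1, lines);
            }
        } else if (value instanceof List) {
            lines.add(prefix + "List");
            List<?> list = (List<?>) value;
            for (int i = 0; i < list.size(); i++) {
                describe("Element " + i, list.get(i), depth + 1, lines);
            }
        } else {
            lines.add(prefix + String.valueOf(value)); // fallback for plain values
        }
    }

    public static void main(String[] args) {
        Map<String, Object> map = new LinkedHashMap<>();
        map.put("KEY_1", 1);
        map.put("A_LIST", Arrays.asList("Value 1", "Value 2"));
        for (String line : describe("A_MAP", map)) {
            System.out.println(line);
        }
    }
}
```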
Logs useful information about the current instance of Screen-Scraper, as well as the Java VM and the General Utility version being used. Information will be logged as an info message in the web interface (when running in server mode) and the log.
This method takes no parameters
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.43a | Moved from session to log class. |
Stops watching for a postfix in session variables
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.43a | Moved from session to log class. |
Stops watching for a prefix in session variables
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.43a | Moved from session to log class. |
Removes a specific name from the manually set values to be logged. Doesn't affect the value of session variables
The previous value associated with the name, or null if there wasn't one
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.43a | Moved from session to log class. |
Stops watching the specified variable
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.43a | Moved from session to log class. |
Removes the specified progress bar. The removeProgressBarIfNotStopped version removes the progress bar if the scrape has not been stopped, which is useful for determining when a scrape was stopped.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.31a | Available in enterprise edition. |
5.5.43a | Moved from session to log class. |
Write message to the log.
Returns void.
Version | Description |
---|---|
5.5 | Now accepts any Object as a message |
4.5 | Available for all editions. |
When the workbench is running, this will be found under the log tab for the scraping session. When screen-scraper is running in server mode, the message will get sent to the corresponding .log file found in screen-scraper's log folder. When screen-scraper is invoked from the command line, the message will get sent to standard out.
Logs closing values to indicate the scrape is complete and what values were when everything finished. It will log at whatever the highest level logged to was. For instance, if a webWarn had been logged during the scrape, this will log at the warning level. When running in Professional edition, this simply outputs to the log.
Using this method is preferred over logMonitoredValuesClose (which only logs to the log), because if at a later point the scrape is run in server mode for enterprise edition, a useful message is output in the web interface without needing to modify the scrape.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
5.5.43a | Moved from session to log class. |
Logs a debug message to the web interface status message area. Uses the message header as the top of the message, and then logs all currently monitored session variables underneath as well as the current progress (if known) of the scrape. Also outputs the message to the log. When running in Professional edition, this simply outputs to the log.
Using this method is preferred over logMonitoredValues (which only logs to the log), because if at a later point the scrape is run in server mode for enterprise edition, a useful message is output in the web interface without needing to modify the scrape.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
5.5.43a | Moved from session to log class. |
Logs an error message to the web interface status message area. Uses the message header as the top of the message, and then logs all currently monitored session variables underneath as well as the current progress (if known) of the scrape. Also outputs the message to the log. When running in Professional edition, this simply outputs to the log.
Using this method is preferred over logMonitoredValues (which only logs to the log), because if at a later point the scrape is run in server mode for enterprise edition, a useful message is output in the web interface without needing to modify the scrape.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
5.5.43a | Moved from session to log class. |
Logs an info message to the web interface status message area. Uses the message header as the top of the message, and then logs all currently monitored session variables underneath as well as the current progress (if known) of the scrape. Also outputs the message to the log. When running in Professional edition, this simply outputs to the log.
Using this method is preferred over logMonitoredValues (which only logs to the log), because if at a later point the scrape is run in server mode for enterprise edition, a useful message is output in the web interface without needing to modify the scrape.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
5.5.43a | Moved from session to log class. |
Logs a warning message to the web interface status message area. Uses the message header as the top of the message, and then logs all currently monitored session variables underneath as well as the current progress (if known) of the scrape. Also outputs the message to the log. When running in Professional edition, this simply outputs to the log.
Using this method is preferred over logMonitoredValues (which only logs to the log), because if at a later point the scrape is run in server mode for enterprise edition, a useful message is output in the web interface without needing to modify the scrape.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
5.5.43a | Moved from session to log class. |
This is a class that can be instantiated within a script in order to run a scraping session.
The "Maximum number of concurrent running scraping sessions" setting in the settings dialog box controls how many scraping sessions can be run simultaneously.
Initiates a RunnableScrapingSession object using the name of an existing scraping session.
Returns a RunnableScrapingSession. On failure an error will be thrown.
Version | Description |
---|---|
5.0 | inheritHttpState added as optional parameter. |
4.5 | Available for professional and enterprise editions. |
com.screenscraper.scraper
Retrieve the name of the scraping session in the runnableScrapingSession.
This method does not receive any parameters.
Returns a string with the name of the scraping session.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
Get the timeout of the session in the runnableScrapingSession.
This method does not receive any parameters.
Returns an integer representing the timeout length in minutes.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
Retrieve the value of a session variable. This method should be called after the scrape method has returned.
Returns the value of the session variable: object, boolean, int, string, etc. If the variable doesn't exist it returns null.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
Run the scraping session.
This method does not receive any parameters.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
The default is for the script to continue executing without waiting for the scraping session to finish. You can use setDoLazyScrape to force the script to wait until the scrape finishes before continuing.
Indicate whether or not the scraping session should run concurrently with (at the same time as) other scraping sessions. The default for doLazyScrape is true.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
We recommend not setting this value to false! When running scraping sessions in the workbench, it will cause the interface to freeze up until sessions have completed.
If you'd like to run multiple scraping sessions serially (one after another), the best option is to set the Maximum number of concurrent running scraping sessions to 1 in the settings window.
Sets the timeout of the session. That is, after the given number of minutes have passed the session will automatically terminate.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
This method must be called before scrape.
Set the value of a session variable.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
The scrapeableFile object refers to the current file being requested from a given server. It houses both the request for a file and the response, and can be manipulated to meet any necessary requirements: GET and POST parameters, referer information, cookies, FILE parameters, HTTP headers, character set, and such.
Dynamically adds a GET parameter to the URL of the current scrapeable file. If a parameter with the given sequence already exists, it will be replaced by the one created from this method call. Calling this method is the equivalent in the workbench of adding a parameter under the "Parameters" tab, and designating the type as GET. Once the scraping session is completed the original HTTP parameters (those under the "Parameters" tab in the workbench) will be restored.
None
Version | Description |
---|---|
5.5.32a | Available in Professional and Enterprise editions. |
Add an HTTP header to be sent along with the request.
Returns void. If you are not using enterprise edition it will throw an error.
Version | Description |
---|---|
5.0 | Available for professional and enterprise edition. |
4.5 | Available for enterprise edition. |
In certain rare cases it may be necessary to explicitly add a custom header of the POST data of an HTTP request. This may be required in cases where a site is using AJAX, and the POST payload of a request is sent as XML (e.g., using the setRequestEntity method). This method must be invoked before the HTTP request is made (e.g., "Before file is scraped" for a scrapeable file).
Dynamically add an HTTPParameter to the current scrapeable file.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
The HTTPParameter constructor is as follows: HTTPParameter( String key, String value, int sequence, String type ). Valid types for the constructor are GET, POST, and FILE. Calling this method will have no effect unless it's invoked before the file is scraped.
Dynamically adds a POST parameter to the existing set of POST parameters. If a parameter with the given sequence already exists, it will be replaced by the one created from this method call. If the method call is used that doesn't take a sequence, the new POST parameter will carry a sequence just higher than the highest existing sequence. Calling this method is the equivalent in the workbench of adding a parameter under the "Parameters" tab, and designating the type as POST. Once the scraping session is completed the original HTTP parameters (those under the "Parameters" tab in the workbench) will be restored.
None
Version | Description |
---|---|
5.5.32a | Available in Professional and Enterprise editions. |
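The sequence rules described above (replace on a matching sequence, otherwise append just past the highest one) can be sketched with a sorted map keyed by sequence. This is an illustration only: the class and method names are hypothetical, starting at sequence 1 for an empty set is an assumption, and URL encoding of keys and values is deliberately omitted.

```java
import java.util.TreeMap;

final class ParameterSequenceSketch {
    // Parameters keyed by sequence; TreeMap keeps them in sequence order.
    private final TreeMap<Integer, String> bySequence = new TreeMap<>();

    /** Adds a parameter at an explicit sequence, replacing any parameter already there. */
    void add(String key, String value, int sequence) {
        bySequence.put(sequence, key + "=" + value);
    }

    /** Adds a parameter one past the highest existing sequence and returns that sequence. */
    int add(String key, String value) {
        int sequence = bySequence.isEmpty() ? 1 : bySequence.lastKey() + 1;
        add(key, value, sequence);
        return sequence;
    }

    /** Joins the parameters in sequence order, as they would appear in a POST body. */
    String body() {
        return String.join("&", bySequence.values());
    }

    public static void main(String[] args) {
        ParameterSequenceSketch params = new ParameterSequenceSketch();
        params.add("page", "1", 1);
        params.add("sort", "asc", 2);
        params.add("sort", "desc", 2); // same sequence: replaces sort=asc
        params.add("query", "dvd");    // no sequence: appended after the highest
        System.out.println(params.body());
    }
}
```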
Manually apply an extractor pattern to a string.
Returns DataSet on success. Failures will be written out to the log as errors.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
An example of how to manually extract data is available.
Manually retrieve the value of a single extractor token.
Returns the match from the last data record, as a string, on success. On failure it returns null and writes an error to the log.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
If you want it to be from the first data record you could use getDataRecord.
Gets the ASPX .NET values from the string. The standard values are __VIEWSTATE, __EVENTTARGET, __EVENTVALIDATION, and __EVENTARGUMENT. Values will be stored in the returned DataRecord as ASPX_VIEWSTATE, ASPX_EVENTTARGET, and so on.
Returns a DataRecord object with each ASPX name, as ASPX_[NAME], mapped to its value. Note that when onlyStandard is false, any parameter whose name starts with __ will be returned in this DataRecord.
Version | Description |
---|---|
5.5.26a | Available in all editions. |
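The naming scheme above (`__NAME` in the page becoming `ASPX_NAME` in the record) can be illustrated with a small regular-expression sketch. It is not how the method itself is implemented: the pattern assumes hidden `input` tags with double-quoted attributes and `name` appearing before `value`, the helper name `extract` is hypothetical, and a `Hashtable` stands in for the returned DataRecord.

```java
import java.util.Arrays;
import java.util.Hashtable;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AspxSketch {
    private static final List<String> STANDARD =
            Arrays.asList("VIEWSTATE", "EVENTTARGET", "EVENTVALIDATION", "EVENTARGUMENT");

    // Simplifying assumption: name="__X" appears before value="..." within one input tag.
    private static final Pattern HIDDEN = Pattern.compile(
            "<input[^>]*\\bname=\"__(\\w+)\"[^>]*\\bvalue=\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    /** Maps each __NAME field found in the HTML to ASPX_NAME in the returned record. */
    static Hashtable<String, String> extract(String html, boolean onlyStandard) {
        Hashtable<String, String> record = new Hashtable<>();
        Matcher matcher = HIDDEN.matcher(html);
        while (matcher.find()) {
            String name = matcher.group(1).toUpperCase();
            if (!onlyStandard || STANDARD.contains(name)) {
                record.put("ASPX_" + name, matcher.group(2));
            }
        }
        return record;
    }

    public static void main(String[] args) {
        String html = "<input type=\"hidden\" name=\"__VIEWSTATE\" id=\"__VIEWSTATE\" value=\"abc\" />"
                + "<input type=\"hidden\" name=\"__CUSTOM\" value=\"x\" />";
        System.out.println(extract(html, true));
        System.out.println(extract(html, false));
    }
}
```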
Retrieve the authentication expectation of the request.
This method does not receive any parameters.
Returns whether the scrapeable file expects to have to authenticate and so will send the information initially instead of waiting for the request for it, as a boolean.
Version | Description |
---|---|
5.0 | Available for all editions. |
Get the character set being used in the page response rendering.
This method does not receive any parameters.
Returns the character set applied to the scraped page, as a string. If a character set has not been specified then it will default to the character set specified in the settings dialog box.
Version | Description |
---|---|
4.5 | Available for all editions. |
If you are having trouble with characters displaying incorrectly, we encourage you to read about how to go about finding a solution using one of our FAQs.
Retrieve contents of the response.
This method does not receive any parameters.
Returns contents of the last response, as a string. If the file has not been scraped it will return an empty string.
Version | Description |
---|---|
4.5 | Available for all editions. |
Retrieve the POST payload type being used to interpret the page. This can be important when scraping some sites' implementations of AJAX, where the payload is explicitly set as XML.
This method does not receive any parameters.
Returns the content type, as a string (e.g., text/html or text/xml).
Version | Description |
---|---|
5.0 | Available for all editions. |
Retrieve the POST data.
This method does not receive any parameters.
Returns the POST data for the scrapeable file, as a string. If called after the file has been scraped the session variable tokens will be resolved to their values; otherwise, the tokens will simply be removed from the string.
Version | Description |
---|---|
4.5 | Available for all editions. |
Get the URL of the file.
This method does not receive any parameters.
Returns the URL of the scrapeable file, as a string. If called after the file has been scraped the session variable tokens will be resolved to their values; otherwise, the tokens will simply be removed from the string.
Version | Description |
---|---|
4.5 | Available for all editions. |
Indicates whether or not the most recent extractor pattern application timed out.
None
Version | Description |
---|---|
5.5.36a | Available in all editions. |
Determine whether or not the contents of this response are being forced to be recognized as non-binary.
This method does not receive any parameters.
Returns true if the scrapeable file is being forced to be treated as non-binary; otherwise, it returns false.
Version | Description |
---|---|
5.0 | Added for all editions. |
Gets the value of the header in the response of the scrapeable file, or returns null if it couldn't be found.
The value of the header, or null if not found.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
Gets the header section of the HTTP Response.
This method takes no parameters.
A String containing the HTTP Response Headers.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
Gets the headers of the HTTP Response as a map, and returns them.
This method takes no parameters.
A Map from header name to header value for the response headers.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
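The three header methods above expose the same information in different shapes: the raw header section, a name-to-value map, and a single lookup that yields null when the header is absent. The following sketch shows how the shapes relate. It assumes a conventional `Name: value` header layout, and `parse_headers` is a hypothetical helper, not part of the screen-scraper API.

```python
def parse_headers(header_section):
    """Turn a raw header block into a {name: value} dict."""
    headers = {}
    for line in header_section.splitlines():
        # Skip the status line and blank lines, which have no "name: value" shape.
        if ":" not in line:
            continue
        name, value = line.split(":", 1)
        headers[name.strip()] = value.strip()
    return headers

raw = "HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=UTF-8\r\nServer: ExampleServer\r\n"
headers = parse_headers(raw)

print(headers["Content-Type"])
print(headers.get("X-Missing"))   # a missing header yields None, like the null return above
```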
Indicates whether or not the most recent attempt to tidy the HTML failed.
None
Version | Description |
---|---|
5.5.36a | Available in all editions. |
Indicates whether or not the maximum attempts to request a given scrapeable file were reached.
None
Version | Description |
---|---|
5.5.36a | Available in all editions. |
Retrieve the kilobyte limit for information retrieved by the scrapeable file. Any information beyond the limit will not be retrieved.
This method does not receive any parameters.
Returns the current kilobyte limit on the response, as an integer.
Version | Description |
---|---|
5.0 | Added for professional and enterprise editions. |
Get the name of the scrapeable file.
This method does not receive any parameters.
Returns the name of the scrapeable file, as a string.
Version | Description |
---|---|
4.5 | Available for all editions. |
Retrieve the non-tidied HTML of the scrapeable file.
This method does not receive any parameters.
Returns the non-tidied contents of the scrapeable file, as a string. On failure it returns null.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
By default non-tidied html is not retained. For this method to return anything other than null you must use setRetainNonTidiedHTML to force non-tidied html to be retained.
Gets an array of strings containing the redirect URLs for the current scrapeable file request attempt.
This method does not receive any parameters.
Returns the array of strings; may be empty.
Version | Description |
---|---|
6.0.24a | Available in Professional and Enterprise editions. |
Determine if the scrapeable file is set to retain non-tidied html.
This method does not receive any parameters.
Returns boolean flag for non-tidied contents being retained.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Returns the retry policy. Note that in any 'After file is scraped' scripts this is null.
This method takes no parameters.
The Retry Policy that will be used by this scrapeable file.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
Determine the HTTP status code sent by the server.
This method does not receive any parameters.
Returns integer corresponding to the HTTP status code of the response.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
Retrieve the name of the user agent making the request.
This method does not receive any parameters.
Returns the user agent, as a string.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
Determine if an input or output error occurred when requesting file.
This method does not receive any parameters.
Returns true if an error has occurred; otherwise, it returns false.
Version | Description |
---|---|
5.0 | Added for all editions. |
This method should be run after the scrapeable file has been scraped.
Determine whether any extractor patterns associated with the scrapeable file found a match.
This method does not receive any parameters.
Returns boolean corresponding to whether any extractor pattern matched in the scrapeable file.
Version | Description |
---|---|
4.5 | Available for all editions. |
Remove all of the HTTP parameters from the current scrapeable file.
This method does not receive any parameters.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
Remove an HTTP header from a scrapeable file.
Returns void.
Version | Description |
---|---|
5.0.5a | Introduced for enterprise edition. |
Dynamically removes an HTTPParameter. The order of the remaining parameters is adjusted immediately.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
5.5.32a | Added method call that takes a String. Available for Professional and Enterprise editions. |
If calling this method more than once in the same script, when used in conjunction with the addHTTPParameter method, it is important to keep track of how the list is reordered before calling either method again.
Calling this method will have no effect unless it's invoked before the file is scraped.
This method can be used for both GET and POST parameters.
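The reordering caveat above is easiest to see with a plain list. This is only a sketch: the parameter list and `remove_parameter` are hypothetical stand-ins, and the 1-based sequence numbering is an assumption made for the illustration.

```python
# Hypothetical stand-in for a scrapeable file's ordered parameter list.
params = [("user", "jdoe"), ("token", "abc123"), ("page", "1")]

def remove_parameter(params, sequence):
    """Remove the parameter at a 1-based sequence position (assumed numbering)."""
    del params[sequence - 1]

remove_parameter(params, 2)   # removes "token"

# "page" has now moved from sequence 3 to sequence 2, so a second call
# must use the new position rather than the original one.
remove_parameter(params, 2)   # removes "page"

print(params)
```

Removing by name, where that is available, avoids having to track the shifting positions.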
Resequences an HTTP parameter.
None
Version | Description |
---|---|
5.5.32a | Available in Professional and Enterprise editions. |
Resolves a relative URL to an absolute URL based on the current URL of this scrapeable file.
Returns string containing the complete url to the file. On failure it will return the relative path and an error will be written to the log.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
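For readers unfamiliar with relative URL resolution, the standard library call below performs the same kind of resolution against a base URL. It illustrates the concept only and is not the screen-scraper method; the URLs are made up.

```python
from urllib.parse import urljoin

# Stand-in for the current URL of the scrapeable file.
current_url = "http://www.example.com/products/list.php?page=1"

print(urljoin(current_url, "detail.php?id=5"))    # sibling page in the same directory
print(urljoin(current_url, "/images/logo.png"))   # path relative to the site root
print(urljoin(current_url, "../about.html"))      # one directory up
```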
Write non-tidied contents of the scrapeable file response to a text file.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
This method must be called before the file is scraped.
Because the response headers are also saved in the file, if the file is anything except a text file it will not be valid (e.g. images, pdfs).
Save the file returned from a scrapeable file request.
Returns void.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
This method must be called from a scrapeable file before the file is scraped. Do not call this method from a script which is invoked by other means such as after an extractor pattern match or from within another script.
It is preferable to use downloadFile; however, at times you may have to send POST parameters in order to access a file. If that is the case, you would use this method.
This method cannot save local file requests to another location.
Set the authentication expectation of the request.
Returns void.
Version | Description |
---|---|
5.0 | Available for all editions. |
Set the character set used in a specific scrapeable file's response renderings. This can be particularly helpful when the page renders characters incorrectly.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
This method must be called before the file is scraped.
If characters are displaying incorrectly, one of our FAQs explains how to go about finding a solution.
Set POST payload type. This is particularly helpful when scraping some sites' implementations of AJAX, where the payload is explicitly set as XML.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
This method must be called before the file is scraped.
This method is usually used in connection with setRequestEntity as that method specifies the content of the POST data.
Set content type header to multipart/form-data.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
This method must be called before the file is scraped.
Occasionally a site will expect a multi-part request when a file is not being sent in the request.
If you include a file upload parameter under the parameters tab of the scrapeable file the request will automatically be multi-part.
Set whether or not the contents of this response should be forced to be treated as non-binary. Default forceNonBinary value is false.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
This is provided in the case where screen-scraper misidentifies a non-binary file as a binary file. It doesn't happen often but is possible.
Determines whether or not a POST request should be forced.
Returns void.
Version | Description |
---|---|
6.0.14a | Available in Professional and Enterprise editions. |
Sets the request type to use.
ScrapeableFile.RequestType is an enum with the following options as values
If the method sets the request to one of those types, all parameters set as GET in the parameters tab will be appended to the URL (like normal) and all parameters set as POST parameters will be used to build the request entity. If there are POST values on a type that doesn't support a request entity an exception will be thrown when the request is issued.
Returns void.
Version | Description |
---|---|
6.0.55a | Available in Professional and Enterprise editions. |
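The handling of GET and POST parameters described above can be sketched as follows. This is not the library's code: `build_request` is hypothetical, and the set of request types that may carry a request entity is an assumption chosen for the example.

```python
from urllib.parse import urlencode

# Assumption for this sketch: only these request types carry a request entity.
TYPES_WITH_ENTITY = {"POST", "PUT"}

def build_request(request_type, url, get_params, post_params):
    """Return (url, body): GET parameters go on the URL, POST parameters in the entity."""
    if get_params:
        separator = "&" if "?" in url else "?"
        url = url + separator + urlencode(get_params)
    if post_params and request_type not in TYPES_WITH_ENTITY:
        # Mirrors the exception described above for types without an entity.
        raise ValueError(request_type + " does not support a request entity")
    body = urlencode(post_params) if post_params else None
    return url, body

print(build_request("PUT", "http://www.example.com/api", [("id", "7")], [("name", "x")]))
```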
Overwrite the content of the "last response".
Returns void.
This method must be called from an extractor pattern before the pattern is run.
Limit the amount of information retrieved by the scrapeable file. This method can be useful in cases of very large responses where the desired information is found in the first portion of the response. It can also help to make the scraping process more efficient by only downloading the needed information.
Returns void.
Version | Description |
---|---|
5.0 | Added for professional and enterprise editions. |
This method must be called before the file is scraped.
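The effect of a response limit can be pictured as a bounded read: once the limit is reached, the remainder is never downloaded. The sketch below assumes 1 kilobyte equals 1024 bytes, and `read_limited` is a hypothetical helper used only to show the idea.

```python
import io

def read_limited(stream, limit_kb):
    """Read at most limit_kb kilobytes; the rest is never retrieved."""
    return stream.read(limit_kb * 1024)

response = io.BytesIO(b"x" * 5000)   # stand-in for a 5000-byte response
body = read_limited(response, 2)

print(len(body))
```

Extractor patterns can then only match data that falls inside the retained portion, so the limit should be set generously enough to cover the information you need.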
Set referer HTTP header.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
This method must be called before the file is scraped.
Set POST payload data. This is particularly helpful when scraping some sites' implementations of AJAX, where the payload is explicitly set as XML.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
This method must be called before the file is scraped.
This method is usually used in connection with setContentType as that method specifies the content type of the POST data.
Though you can set plain text POST data using this method it is preferable to use the addHTTPParameter method for this task.
Set whether or not non-tidied HTML is to be retained for the current scrapeable file.
Returns void.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
If, after the file is scraped, you want to be able to use getNonTidiedHTML this method has to be called before the file is scraped.
Sets a Retry Policy that will be run to check if a page should be re-downloaded or not. The policy will be checked after all the extractors have run, and will check for an error on the page based on a set of conditions. If the policy shows an error on the page, it can run scripts or other code to attempt to remedy the situation, and then it will rescrape the file.
The file will be re-downloaded without rerunning any of the scripts that run before the file is downloaded, and before any of the scripts marked to run after the file is scraped. If any change needs to be made to session variables, headers, and so on, it should be made in the script or runnable that will be executed. Also, the policy can specify that session variables should be restored to their previous values before the file is rescraped. If it does, they will be reset after the error checking portion of the policy but before the policy runs the code to make changes before a retry.
The retry policy should be set in a script run 'Before file is scraped', but can also be set by a script on an extractor pattern. If it is set on an extractor pattern, session variables will not be restored if the retry is required.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
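The check, remedy, and re-download cycle described above can be sketched in a few lines. Every name here (`SimpleRetryPolicy`, `is_error`, `run_on_error`, `scrape_with_retries`, `fake_download`) is hypothetical and chosen for illustration; the sketch does not reproduce the screen-scraper retry policy interface and leaves out the optional restoring of session variables.

```python
class SimpleRetryPolicy:
    """Hypothetical policy; names are illustrative, not the screen-scraper interface."""
    def __init__(self, max_retries):
        self.max_retries = max_retries

    def is_error(self, page):
        # Error condition checked after the page has been processed.
        return "Access denied" in page

    def run_on_error(self, session_variables):
        # Remedy step, for example switching to a fresh login token.
        session_variables["TOKEN"] = "refreshed"

def scrape_with_retries(download, policy, session_variables):
    """Download, check the policy, then remedy and re-download until clean or out of retries."""
    page = download(session_variables)
    retries = 0
    while policy.is_error(page) and retries < policy.max_retries:
        policy.run_on_error(session_variables)
        page = download(session_variables)
        retries += 1
    return page, retries

def fake_download(session_variables):
    # Stand-in for the HTTP request: fails until the token has been refreshed.
    return "Welcome" if session_variables.get("TOKEN") == "refreshed" else "Access denied"

page, retries = scrape_with_retries(fake_download, SimpleRetryPolicy(3), {"TOKEN": "stale"})
print(page, retries)
```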
Explicitly state the user agent making the request.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
This method must be called before the file is scraped.
Determine if an error occurred with the request. Errors are considered to be server timeouts as well as any status code outside of the range 200-399.
This method does not receive any parameters.
Returns true for server timeouts as well as any status code outside of the range 200-399; otherwise, it returns false.
Version | Description |
---|---|
4.5 | Available for all editions. |
This method must be called after the file is scraped.
If you want to know what the status code was you can use getStatusCode.
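The rule stated above reduces to a one-line predicate. The function below is a sketch of that rule under the stated range, not the method's source; `was_error_on_request` and its parameters are named for illustration only.

```python
def was_error_on_request(status_code, timed_out=False):
    """True for a server timeout or any status code outside 200-399."""
    return timed_out or not (200 <= status_code <= 399)

for code in (200, 302, 404, 500):
    print(code, was_error_on_request(code))
```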
This object refers to the current scraping session that is running. To make the methods a little easier to sort through they have been grouped into related methods. The groups have been named to ease in finding them when they are needed.
The following methods are provided to aid you in setting up an anonymous scraping session. If you are using your own server proxy pool you will use the methods to allow screen-scraper to interact with and manage your proxy pool. If you are using automatic anonymization then the only method you will use is currentProxyServerIsBad as screen-scraper will manage the servers using the anonymization settings from your setup.
See an example of Anonymization via Manual Proxy Pools.
Remove proxy server from proxy pool. This is only used with anonymization and indicates that one server in the pool is bad and should be removed.
This method does not receive any parameters.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
If you are using automatic anonymization or manual proxy pools, a new proxy server will be created as a result of the method call.
When checking if a request you have made is invalid it is best not to rely on the HTTP status code (e.g. 404) alone as the status codes are not always accurate. It is recommended that you also scrape a known string (e.g. "Not found") from the response HTML that validates the status code.
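One way to combine the two signals is sketched below. The function name, the three labels, and the default marker text are all assumptions made for the example; in a real scrape the marker would come from an extractor pattern applied to the response.

```python
def classify_response(status_code, html, marker="Not found"):
    """Combine the status code with a known string taken from the page."""
    has_marker = marker in html
    if status_code == 404 and has_marker:
        return "missing"    # status code confirmed by the page content
    if status_code == 404 or has_marker:
        return "suspect"    # the two signals disagree; inspect further
    return "ok"

print(classify_response(404, "<h1>Not found</h1>"))
print(classify_response(200, "<h1>Not found</h1>"))
print(classify_response(200, "<h1>Results</h1>"))
```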
Get the current proxy server from the proxy server pool.
This method does not receive any parameters.
Returns the current proxy server being used.
Version | Description |
---|---|
4.5 | Available for all editions. |
Holds the proxy server pool object that allows proxies to be cycled through.
Returns true if there is an available proxy server pool.
Version | Description |
---|---|
4.5 | Available for all editions. |
Determine whether proxies are set to be terminated when the scrape ends.
This method does not receive any parameters.
Returns true if a proxy will be terminated; otherwise, it returns false.
Version | Description |
---|---|
5.0 | Available for all editions. |
Determine whether proxies are being used from proxy pool.
This method does not receive any parameters.
Returns true if a proxy pool is being used; otherwise, it returns false.
Version | Description |
---|---|
4.5 | Available for all editions. |
Associate a proxy pool with a scraping session.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
Manually set the outcome of proxies when the scrape ends.
Returns void.
Version | Description |
---|---|
5.0 | Available for all editions. |
Set whether proxies from a proxyServerPool should be used when making scrapeable file requests.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
If you are already going through a proxy server, screen-scraper must be told the credentials in order to get out to the internet. These methods are all provided to manually tell screen-scraper how to get through your external proxy.
If you always go through the same external proxy you would probably want to set the credentials in screen-scraper's proxy settings so that you don't have to specify them in all of your scrapes.
Retrieve the external NT proxy domain.
This method does not receive any parameters.
Returns the external NT domain, as a string.
Version | Description |
---|---|
5.0 | Added for all editions. |
Retrieve the external NT proxy host.
This method does not receive any parameters.
Returns the external NT host, as a string.
Version | Description |
---|---|
5.0 | Added for all editions. |
Retrieve the external NT proxy password.
This method does not receive any parameters.
Returns the external NT password, as a string.
Version | Description |
---|---|
5.0 | Added for all editions. |
Retrieve the external NT proxy username.
This method does not receive any parameters.
Returns the external NT username, as a string.
Version | Description |
---|---|
5.0 | Added for all editions. |
Retrieve the external proxy host.
This method does not receive any parameters.
Returns the external host, as a string.
Version | Description |
---|---|
5.0 | Available for all editions. |
Retrieve the external proxy password.
This method does not receive any parameters.
Returns the external password, as a string.
Version | Description |
---|---|
5.0 | Available for all editions. |
Retrieve the external proxy port.
This method does not receive any parameters.
Returns the external port, as a string.
Version | Description |
---|---|
5.0 | Available for all editions. |
Retrieve the external proxy username.
This method does not receive any parameters.
Returns the external username, as a string.
Version | Description |
---|---|
5.0 | Available for all editions. |
Manually set external NT proxy domain.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
If you are using this method on all of your scripts you might want to set it in screen-scraper's external NT proxy settings.
If you are using NTLM (Windows NT) authentication you'll need to designate settings for both the standard external proxy as well as the external NT proxy.
Manually set external NT proxy host/domain.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
If you are using this method on all of your scripts you might want to set it in screen-scraper's external NT proxy settings.
If you are using NTLM (Windows NT) authentication you'll need to designate settings for both the standard external proxy as well as the external NT proxy.
Manually set external NT proxy password.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
If you are using this method on all of your scripts you might want to set it in screen-scraper's external NT proxy settings.
If you are using NTLM (Windows NT) authentication you'll need to designate settings for both the standard external proxy as well as the external NT proxy.
Manually set external NT proxy username.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
If you are using this method on all of your scripts you might want to set it in screen-scraper's external NT proxy settings.
If you are using NTLM (Windows NT) authentication you'll need to designate settings for both the standard external proxy as well as the external NT proxy.
Manually set external proxy host/domain.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
If you are using this method on all of your scripts you might want to set it in screen-scraper's external proxy settings.
Manually set external proxy password.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
If you are using this method on all of your scripts you might want to set it in screen-scraper's external proxy settings.
Manually set external proxy port.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
If you are using this method on all of your scripts you might want to set it in screen-scraper's external proxy settings.
Manually set external proxy username.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
If you are using this method on all of your scripts you might want to set it in screen-scraper's external proxy settings.
Logging is a great tool for ensuring that your scrapes are working correctly as well as for troubleshooting problems that arise. Though logging large amounts of information may slow down a scrape, the best way around this is not to remove log writing requests but rather to change the verbosity of the logging when running the scrape in a production environment. If you do this, know that you make it harder to troubleshoot some problems should they arise.
Several logging methods are provided so that you can log information according to its importance.
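The idea of leaving log calls in place and adjusting verbosity instead can be sketched as follows. `SketchLog` is a hypothetical class, not the screen-scraper logger; it uses the four levels named by the methods in this group, ordered from most to least verbose.

```python
LEVELS = ["debug", "info", "warn", "error"]   # most verbose first

class SketchLog:
    """Hypothetical logger that filters messages by a verbosity threshold."""
    def __init__(self, verbosity):
        self.threshold = LEVELS.index(verbosity)
        self.lines = []

    def write(self, level, message):
        # The call stays in the script; the verbosity setting decides whether it is written.
        if LEVELS.index(level) >= self.threshold:
            self.lines.append(level.upper() + ": " + str(message))

production = SketchLog("warn")
production.write("debug", "Requesting page 3")
production.write("error", "Login failed")

print(production.lines)
```

With the threshold at warn, the debug message is skipped without the script having to change.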
Get the name of the current log file.
This method does not receive any parameters.
Returns the name of the log file, as a string.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
This method can be very helpful when screen-scraper is running in server mode and you are tracking the log where the scrape of a record is located, or for tracking the location of errors in larger scrapes.
Write message to the log.
Returns void.
Version | Description |
---|---|
5.5 | Now accepts any Object as a message |
4.5 | Available for all editions. |
When the workbench is running, this will be found under the log tab for the scraping session. When screen-scraper is running in server mode, the message will get sent to the corresponding .log file found in screen-scraper's log folder. When screen-scraper is invoked from the command line, the message will get sent to standard out.
Write current date and time to log (at most verbose level). It is formatted to be human readable.
This method does not receive any parameters.
Returns void. If an error occurs, an error will be thrown.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
Write current time to log (at most verbose level). The time is formatted to be human readable.
This method does not receive any parameters.
Returns void. If an error occurs, an error will be thrown.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
Write message to the log, at the debug level (most verbose).
Returns void.
Version | Description |
---|---|
5.5 | Now accepts any Object as a message |
4.5 | Available for professional and enterprise editions. |
Write scrape run time to the log (at most verbose level). It is formatted to be human readable, including breaking it into days, hours, minutes, and seconds.
This method does not receive any parameters.
Returns void. If an error occurs, an error will be thrown.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
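Breaking a run time into days, hours, minutes, and seconds is a short arithmetic exercise. The sketch below shows one way to do it; the output wording is an assumption and will not match the exact text screen-scraper writes to the log.

```python
def format_elapsed(total_seconds):
    """Split a duration in seconds into days, hours, minutes, and seconds."""
    days, remainder = divmod(int(total_seconds), 86400)
    hours, remainder = divmod(remainder, 3600)
    minutes, seconds = divmod(remainder, 60)
    return "%d days, %d hours, %d minutes, %d seconds" % (days, hours, minutes, seconds)

print(format_elapsed(93784))   # 1 day + 2 hours + 3 minutes + 4 seconds
```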
Write message to the log, at the error level (least verbose).
Returns void. If an error occurs, an error will be thrown.
Version | Description |
---|---|
5.5 | Now accepts any Object as a message |
4.5 | Available for professional and enterprise editions. |
Write message to the log, at the info level (second most verbose).
Returns void. If an error occurs, an error will be thrown.
Version | Description |
---|---|
5.5 | Now accepts any Object as a message |
4.5 | Available for professional and enterprise editions. |
Write all session variables to log.
This method does not receive any parameters.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
Write message to the log, at the warn level (third most verbose).
Returns void. If an error occurs, an error will be thrown.
Version | Description |
---|---|
5.5 | Now accepts any Object as a message |
4.5 | Available for professional and enterprise editions. |
These methods are used in connection with the web interface of screen-scraper. Their use will provide the interface with more detailed information regarding the state of a running scrape. If you are not running the scrapes using the web interface then these methods are not particularly helpful to you.
As the web interface is an enterprise edition feature, these methods are only available to enterprise edition users.
Add to the value of duplicate records scraped. (As opposed to new or error records.)
Returns void.
Version | Description |
---|---|
7.0 | Available for enterprise edition. |
Add to the value of error records. (As opposed to duplicate or new records.)
Returns void.
Version | Description |
---|---|
7.0 | Available for enterprise edition. |
Add to the value of new records scraped. (As opposed to duplicate or error records.)
Returns void.
Version | Description |
---|---|
7.0 | Available for enterprise edition. |
Add to the value of number of records scraped.
Returns void.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Append an error message to any existing error messages.
Returns void.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Get the current error message.
This method does not receive any parameters.
Returns current error message, as a string.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Determine the fatal error status of the scrape.
This method does not receive any parameters.
Returns whether a fatal error has occurred, as a boolean.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Get the number of records that have been scraped.
This method does not receive any parameters.
Returns number of records scraped, as an integer.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Reset the count on the number of scraped records.
This method does not receive any parameters.
Returns void.
Version | Description |
---|---|
5.0 | Available for all editions. |
Set the current error message.
Returns void.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Set the fatal error status of the scrape.
Returns void.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Set the number of records that have been scraped.
Returns void.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Add a runnable that will be executed at the given time.
Note: session.addEventCallback is automatically executed at a priority of 0.
Returns void.
Version | Description |
---|---|
6.0.55a | Introduced for pro and enterprise editions. |
The EventFireTime is an interface which defines the methods that a fire time must have and so the addEventCallback method can take different types of fire times.
A number of different types of classes based on this interface have been defined for you which call out the various parts of a scrape that you can add event handlers to. Those are defined below.
Version | Description |
---|---|
6.0.55a | Introduced for pro and enterprise editions. |
*Note: When using the Async HTTP client you will have access to the request builder from ScrapeableFileEventData.getRedirectRequestBuilder() which can be used to modify and adjust the request before it is sent. If you use the Apache HTTP client the getRedirectRequestBuilder() method will always return null.
Version | Description |
---|---|
6.0.55a | Introduced for pro and enterprise editions. |
Returns the RedirectToURL value for the object.
This method does not receive any parameters.
Returns the RedirectToURL value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Version | Description |
---|---|
6.0.55a | Introduced for pro and enterprise editions. |
*Note: Calling a setVariable or getVariable method in here WILL trigger the events for those again. Avoid infinite recursion please!
Version | Description |
---|---|
6.0.55a | Introduced for pro and enterprise editions. |
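The recursion warning above can be handled with a simple re-entrancy guard. The sketch below uses a hypothetical `SketchSession` that fires callbacks on every `set_variable` call; it stands in for the real session object only to show the guard pattern and does not use the screen-scraper event API.

```python
class SketchSession:
    """Hypothetical session that fires a callback on every set_variable call."""
    def __init__(self):
        self.variables = {}
        self.callbacks = []

    def set_variable(self, name, value):
        self.variables[name] = value
        for callback in self.callbacks:
            callback(self, name, value)

class AuditHandler:
    """Handler that writes a variable itself without recursing forever."""
    def __init__(self):
        self.active = False
        self.calls = 0

    def __call__(self, session, name, value):
        if self.active:
            return                # already inside this handler: do not recurse
        self.active = True
        try:
            self.calls += 1
            # This write fires the event again; the guard above absorbs it.
            session.set_variable("LAST_CHANGED", name)
        finally:
            self.active = False

session = SketchSession()
handler = AuditHandler()
session.callbacks.append(handler)
session.set_variable("PAGE", 2)

print(handler.calls, session.variables)
```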
Version | Description |
---|---|
6.0.55a | Introduced for pro and enterprise editions. |
Creates an EventHandler callback object which will be called when the event triggers.
Version | Description |
---|---|
6.0.55a | Introduced for pro and enterprise editions. |
Returns the name of the handler. This method doesn't need to be implemented but helps with debugging.
This method does not receive any parameters.
Returns the name of the handler. This method doesn't need to be implemented but helps with debugging.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Processes the event, and potentially returns a useful value modifying something in the internal code as defined by the EventFireTime used to launch this event.
Returns a value based on which AbstractEventData class is used.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
The AbstractEventData class is an abstract class which allows for the accessing of various data values found within screen-scraper.
AbstractEventData is extended by the following classes, and it is those classes that should be used in place of AbstractEventData.
Returns the LastReturnValue for the object. This is the value previously returned by another callback. This can be null, if no callbacks have been fired yet for this event. A null value is also the default return value for the given event.
This method does not receive any parameters.
Returns the LastReturnValue for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
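The way a return value is handed from one callback to the next can be sketched as a simple chain. `SketchEventData`, `fire`, and the two callbacks are hypothetical; the sketch only illustrates the null-until-set behavior described above and says nothing about how screen-scraper orders callbacks.

```python
class SketchEventData:
    """Hypothetical stand-in for an event data object."""
    def __init__(self):
        self.last_return_value = None   # null until a callback has returned something

def fire(callbacks, data):
    """Call each handler in turn, handing it the previous handler's return value."""
    for callback in callbacks:
        data.last_return_value = callback(data)
    return data.last_return_value

def choose_url(data):
    return "http://www.example.com/next"

def add_tracking(data):
    # Builds on what the previous callback returned.
    return data.last_return_value + "?ref=scrape"

result = fire([choose_url, add_tracking], SketchEventData())
print(result)
```

If no callbacks are registered, the chain yields the default null value.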
Sets the LastReturnValue for the object.
Returns void.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
ExtractorPatternEventData extends AbstractEventData
This contains the data for various extractor pattern operations.
Inherits the following methods from AbstractEventData
Returns the status of the extractor pattern timeout. Returns true if and only if the extractor pattern was applied and timed out while doing so. Otherwise it will return false.
This method does not receive any parameters.
Returns a boolean value representing the status of the extractor pattern timeout.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the DataRecord value for the object.
This method does not receive any parameters.
Returns the DataRecord value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the DataSet value for the object.
This method does not receive any parameters.
Returns the DataSet value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the ExtractorPattern value for the object.
This method does not receive any parameters.
Returns the ExtractorPattern value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the ScrapeableFile value for the object.
This method does not receive any parameters.
Returns the ScrapeableFile value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the Session value for the object.
This method does not receive any parameters.
Returns the Session value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
ScrapeableFileEventData extends AbstractEventData
This contains the data for various scrapeable file operations.
Inherits the following methods from AbstractEventData
Returns the HttpResponseData for the object.
This method does not receive any parameters.
Returns the HttpResponseData for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the RedirectRequestBuilder for the object. Use this to add headers and make other adjustments for the redirect. It can be null depending on the HTTP client being used, and whether or not it supports manually modifying the redirect.
This method does not receive any parameters.
Returns the RedirectRequestBuilder for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the ScrapeableFile value for the object.
This method does not receive any parameters.
Returns the ScrapeableFile value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the Session value for the object.
This method does not receive any parameters.
Returns the Session value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
ScriptEventData extends AbstractEventData
This contains the data for various script operations.
Inherits the following methods from AbstractEventData
Returns the DataRecord value for the object.
This method does not receive any parameters.
Returns the DataRecord value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the DataSet value for the object.
This method does not receive any parameters.
Returns the DataSet value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the ScrapeableFile value for the object.
This method does not receive any parameters.
Returns the ScrapeableFile value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the ScriptException for the object.
This method does not receive any parameters.
Returns the ScriptException for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the ScriptName value for the object.
This method does not receive any parameters.
Returns the ScriptName value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the Session value for the object.
This method does not receive any parameters.
Returns the Session value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
SessionEventData extends AbstractEventData
This contains the data for various session operations.
Inherits the following methods from AbstractEventData
Returns the IncrementRecordsAmount value for the object.
This method does not receive any parameters.
Returns the IncrementRecordsAmount value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the Session value for the object.
This method does not receive any parameters.
Returns the Session value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the VariableName value for the object.
This method does not receive any parameters.
Returns the VariableName value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Returns the VariableValue value for the object.
This method does not receive any parameters.
Returns the VariableValue value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
StringEventData extends AbstractEventData
This contains the data for various string operations.
Inherits the following methods from AbstractEventData
Returns the Input value for the object.
This method does not receive any parameters.
Returns the Input value for the object.
Version | Description |
---|---|
6.0.55a | Available for all editions. |
Add to the value of a session variable.
Returns void. If the variable doesn't exist, or is not a string or integer, a message will be added to the log. If it cannot add to the variable for any other reason it will write an error to the log.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
Pause the scrape and display the breakpoint window. If the scrape is running in server mode, logVariables is called in place of breakpoint so that the scrape does not pause.
This method does not receive any parameters.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
Remove all session variables.
This method does not receive any parameters.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
Clear stored cookies.
This method does not receive any parameters.
Returns void.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Clears the value of all session variables that match the keys in the Map. This will ignore a key of DATARECORD.
This method accepts a Map or Collection, rather than a List or Set, so that it works more easily with the setSessionVariables method.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.43a | Changed from session.removeSessionVariablesInMap to session.clearVariables. |
Decode HTML Entities on a session variable.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
Downloads the file to the local file system.
Returns true if the file was downloaded successfully; otherwise, it returns false.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. Lazy scrape only available for enterprise edition. |
If downloading the file requires POST data to be sent, use saveFileOnRequest with a scrapeable file instead.
Using this method in a script takes the place of requesting the target URL as a scrapeable file.
Manually start the execution of a script.
Returns void. If the file doesn't exist a message will be written to the log. If the called script has an error in it a warning will be written to the log.
Version | Description |
---|---|
5.0 | Scripts called using this method are now exported with the scraping session. |
4.5 | Available for professional and enterprise editions. |
Executes the named script, but preserves the current context (dataRecord, scrapeableFile, etc.).
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
Get the general character set being used in page response renderings.
This method does not receive any parameters.
Returns the character set applied to the scraping session's files, as a string. If a character set has not been specified then it will default to the character set specified in the settings dialog box.
Version | Description |
---|---|
4.5 | Available for all editions. |
If you are having trouble with characters displaying incorrectly, we encourage you to read about how to go about finding a solution using one of our FAQs.
Retrieve the timeout value for scrapeable files in the session.
This method does not receive any parameters.
Returns the timeout value in milliseconds, as an integer.
Version | Description |
---|---|
5.0.1a | Introduced for all editions. |
Get the current cookies.
This method does not receive any parameters.
Returns an array of the cookies in the session.
Version | Description |
---|---|
5.0 | Available for all editions. |
Checks whether the scrape is currently set to run in debug mode. This is useful when developing scrapes: enabling debug mode logs a warning message, which makes it easier to notice a scrape that still contains hard-coded values used for development. A warning is also logged in the web interface or log each time monitored variables are logged with the logMonitoredValues or webMessage methods.
This method takes no parameters.
True if debug mode is enabled, false otherwise.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
Gets the default retry policy to be used by each scrapeable file when one wasn't set for it.
This method takes no parameters.
The default retry policy, or null if there isn't one.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
Get how long the current session has been running.
This method does not receive any parameters.
Returns number of milliseconds the scrape has been running, as a long (8-byte integer).
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
If you would like to log the running time of the scraping session you should use logElapsedRunningTime.
Get the logging level of the scrape.
This method does not receive any parameters.
Returns the logging level, as an integer. Currently there are four levels: 1 = Debug, 2 = Info, 3 = Warn, 4 = Error.
Version | Description |
---|---|
5.0.1a | Introduced for all editions. |
Retrieve the maximum number of concurrent file downloads being allowed.
This method does not receive any parameters.
Returns the max number of concurrent file downloads allowed, as an integer.
Version | Description |
---|---|
5.0 | Added for professional and enterprise editions. |
Retrieve the number of attempts that scrapeable files should make to get the requested page.
This method does not receive any parameters.
Returns the number of attempts that will be made, as an integer.
Version | Description |
---|---|
5.0 | Available for all editions. |
Get the total number of scripts allowed on the stack before the scraping session is forcibly stopped.
This method does not receive any parameters.
Returns max number of scripts that can be running at a time, as an integer.
Version | Description |
---|---|
5.0 | Added for all editions. |
Get the name of the current scraping session.
This method does not receive any parameters.
Returns the name of the scraping session, as a string.
Version | Description |
---|---|
4.5 | Available for all editions. |
Get the number of scripts currently running.
This method does not receive any parameters.
Returns number of running scripts, as an integer.
Version | Description |
---|---|
5.0 | Added for all editions. |
Determine whether or not non-tidied HTML is to be retained for all scrapeable files in this scraping session.
This method does not receive any parameters.
Returns whether or not non-tidied HTML is to be retained for all scrapeable files, as a boolean.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Get the unique identifier for the scraping session.
This method does not receive any parameters.
Returns unique session id for the scraping session, as an integer.
Version | Description |
---|---|
5.0 | Added for enterprise edition. |
Retrieve the time at which the scrape started.
This method does not receive any parameters.
Returns the start time of the scrape in milliseconds, as a long.
Version | Description |
---|---|
4.5 | Available for all editions. |
Gets the current time zone of the scraping session.
This method takes no parameters.
The time zone this scrape is set to.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
Retrieve the value of a saved session variable.
Returns the value of the session variable. This will be a string unless you have used setVariable to place something other than a string into a session variable.
Version | Description |
---|---|
4.5 | Available for all editions. |
Retrieve the value of a saved session variable (alias of getVariable).
Returns the value of the session variable. This will be a string unless you have used setVariable to place something other than a string into a session variable.
Version | Description |
---|---|
4.5 | Added for all editions. |
Returns whether or not we are currently running in the command line. This is a convenience method for doing something different in a script when running in the command line as opposed to other modes.
This method does not receive any parameters.
Returns true if and only if the scrape is currently running in the command line.
Version | Description |
---|---|
6.0.37a | Introduced for all editions. |
Returns whether or not we are currently running in the server. This is a convenience method for doing something different in a script when running in the server as opposed to other modes.
This method does not receive any parameters.
Returns true if and only if the scrape is currently running in the server.
Version | Description |
---|---|
6.0.37a | Introduced for all editions. |
Returns whether or not we are currently running in the workbench. This is a convenience method for doing something different in a script when running in the workbench as opposed to other modes.
This method does not receive any parameters.
Returns true if and only if the scrape is currently running in the workbench.
Version | Description |
---|---|
6.0.37a | Introduced for all editions. |
Loads the state that would have been previously saved by invoking the session.saveStateToString method.
None
Version | Description |
---|---|
5.5.30a | Available in Professional and Enterprise editions. |
Load session variables from a file.
Returns void. If there is a problem retrieving the file contents an I/O error will be written to the log.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
See also: saveVariables.
If you want to create your own file of session variables, the format is a hard return-delimited list of name/value pairs. Both the key and value should be URL-encoded.
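As a rough illustration of the file format described above, the following Java sketch builds such a file body from a map of variables. The text above only says that pairs are separated by hard returns and that keys and values are URL-encoded; the `=` separator between name and value, the UTF-8 encoding, and the class and method names are assumptions made for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of screen-scraper.
final class VariablesFileSketch {

    // URL-encodes a single key or value (UTF-8 is an assumption).
    static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Builds one URL-encoded name/value pair per line.
    // ASSUMPTION: "=" separates the name from the value.
    static String toFileBody(Map<String, String> variables) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> entry : variables.entrySet()) {
            body.append(encode(entry.getKey()))
                .append('=')
                .append(encode(entry.getValue()))
                .append('\n');
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> variables = new LinkedHashMap<>();
        variables.put("SEARCH TERM", "web scraping & parsing");
        variables.put("PAGE", "1");
        System.out.print(toFileBody(variables));
    }
}
```

Comparing the output against a file written by the save method is the safest way to confirm the exact separator before relying on a hand-built file.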
Saves the current state of the scraping session to a string. An example use case for this method would be a scraping session that logs in to a site, extracts some information, and then is stopped, saving its state out to a file. A second scraping session could then be run, loading the state back in from the file, which would keep the session logged in so that other information could be obtained without logging in once again. By default the scraping session will save out information such as the URL to use as a referer. More information can be saved using the boolean flags described below.
None
Version | Description |
---|---|
5.5.30a | Available in Professional and Enterprise editions. |
Saves all current string and integer variables to a file.
Returns void. If there is a problem retrieving the file contents an I/O error will be written to the log.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Manually scrape a scrapeable file.
Returns void. If there is a problem accessing the scrapeable file a message will be written to the log.
Version | Description |
---|---|
4.5 | Available for all editions. |
Invokes a scrapeable file using a string of content instead of a web page or local file.
None
Version | Description |
---|---|
5.5.13a | Available in all editions. |
Send data to the external script that initiated the scrape. This isn't currently supported with all drivers (e.g., remote scraping session); check the documentation on the language of the external script for more information.
Returns void.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
Set the general character set used in page response renderings. This can be particularly helpful when the pages render characters incorrectly.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
This method must be invoked before the session starts.
If you are having trouble with characters displaying incorrectly, we encourage you to read about how to go about finding a solution using one of our FAQs.
Set the timeout value for scrapeable files in the session.
Returns void.
Version | Description |
---|---|
5.0.1a | Introduced for all editions. |
Manually set a cookie in the current session state.
Returns void.
Version | Description |
---|---|
4.5 | Available for professional and enterprise editions. |
This method should be rarely used as screen-scraper automatically manages cookies. In cases where cookies are set via JavaScript, this function might be necessary.
Sets the debug state for the scrape. Enabling debug mode simply outputs a warning periodically while running, to help prevent running a production scrape in debug mode.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
Sets a retry policy that will affect all files in the scrape. This policy will be used by all scrapeable files that do not have a retry policy set for them. If a retry policy was manually set for them, this one will not be used.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
Sets the path to the keystore file. Some web sites require a special type of authentication that requires the use of a keystore file. See our blog entry on Using Client Certificates for more detail. Calling this method is the equivalent of setting the corresponding value under the "Advanced" tab for the scraping session in the workbench.
None
Version | Description |
---|---|
5.5.10a | Available in all editions. |
Sets the password for the keystore file. Some web sites require a special type of authentication that requires the use of a keystore file. See our blog entry on Using Client Certificates for more detail. Calling this method is the equivalent of setting the corresponding value under the "Advanced" tab for the scraping session in the workbench.
None
Version | Description |
---|---|
5.5.10a | Available in all editions. |
Set the logging level of the scrape.
Returns void.
Version | Description |
---|---|
5.0.1a | Introduced for all editions. |
Set the maximum number of concurrent file downloads to allow.
Returns void.
Version | Description |
---|---|
5.0 | Added for professional and enterprise editions. |
Set the number of attempts that scrapeable files should make to get the requested page.
Returns void.
Version | Description |
---|---|
5.0 | Available for all editions. |
Set the total number of scripts that can be running concurrently. The default value for maxScriptsOnStack is 50.
Returns void.
Version | Description |
---|---|
5.0 | Added for enterprise edition. |
Before you start raising the number of scripts that can be on the stack, you should make sure that your scrape is not using more of the stack than it should. One thing to check is whether the scrape uses recursion where it could iterate. This is discussed in more detail on our blog or in the Tips, Tricks, and Samples section of this site.
Causes the "User-Agent" header sent by screen-scraper to be randomized. The user agent strings from which screen-scraper will select are found in the "resource\conf\user_agents.txt" file.
None
Version | Description |
---|---|
5.5.34a | Available in Professional and Enterprise editions. |
Set whether or not non-tidied HTML is to be retained for all scrapeable files.
Returns void.
Version | Description |
---|---|
4.5 | Available for enterprise edition. |
If you want to be able to use getNonTidiedHTML after a file is scraped, this method has to be called before that file is scraped.
Sets the value of all session variables that match the keys in the Map to the values in the Map. This will ignore a key of DATARECORD.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
5.5.43a | Changed from session.setSessionVariablesFromMap to session.setSessionVariables. |
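The behavior described above can be pictured with a small Java sketch in which a plain `Map` stands in for the session's variable store. This is only an illustration of the "skip the DATARECORD key" rule; it assumes that every other key in the map is written to the store, and the class and method names are hypothetical rather than part of screen-scraper.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the behavior described above; a plain Map
// plays the role of the session's variable store.
final class SetVariablesSketch {

    static final String IGNORED_KEY = "DATARECORD";

    // Copies every entry into the variable store except the DATARECORD key.
    static void setAll(Map<String, Object> sessionVariables, Map<String, ?> values) {
        for (Map.Entry<String, ?> entry : values.entrySet()) {
            if (IGNORED_KEY.equals(entry.getKey())) {
                continue; // skipped, as the description states
            }
            sessionVariables.put(entry.getKey(), entry.getValue());
        }
    }

    public static void main(String[] args) {
        Map<String, Object> sessionVariables = new HashMap<>();
        Map<String, Object> extracted = new HashMap<>();
        extracted.put("NAME", "Widget");
        extracted.put("PRICE", "9.99");
        extracted.put("DATARECORD", "<tr><td>Widget</td><td>9.99</td></tr>");

        setAll(sessionVariables, extracted);
        System.out.println(sessionVariables.containsKey("DATARECORD")); // prints false
        System.out.println(sessionVariables.get("NAME"));               // prints Widget
    }
}
```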
Sets a status message to be displayed in the web interface.
None
Version | Description |
---|---|
5.5.32a | Available in Enterprise edition. |
If this method is passed the value of true, it will cause screen-scraper to stop the current scraping session if an extractor pattern timeout occurs.
None
Version | Description |
---|---|
5.5.36a | Available in Professional and Enterprise editions. |
If this method is passed the value of true, it will cause screen-scraper to stop the current scraping session if the maximum attempts to request a file is reached.
None
Version | Description |
---|---|
5.5.36a | Available in Professional and Enterprise editions. |
If this method is passed the value of true, it will cause screen-scraper to stop the current scraping session if a script error occurs.
None
Version | Description |
---|---|
5.5.36a | Available in Professional and Enterprise editions. |
Sets the time zone that will be used when using a method that returns a time formatted as a string.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
If this method is passed the value of true, it will cause screen-scraper to utilize whatever character set is specified by the server in its "Content-Type" response header. If no such header exists, screen-scraper will default to either the character set indicated for the scraping session or the global character set (in that order).
None
Version | Description |
---|---|
5.5.11a | Available in all editions. |
Sets the user agent to be used for all requests.
None
Version | Description |
---|---|
5.5.23a | Available in Professional and Enterprise editions. |
Set the value of a session variable.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
Set the value of a session variable (alias of setVariable).
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |
Determine if the scrape has been stopped. This can be done using the stop button in the workbench or the stop scraping button on the web interface (for enterprise users).
This method does not receive any parameters.
Returns true if the scrape has been requested to stop; otherwise, it returns false.
Version | Description |
---|---|
5.0 | Added for enterprise edition. |
Stop the current scraping session.
This method does not receive any parameters.
Returns void.
Version | Description |
---|---|
4.5 | Available for all editions. |
Waits for any file downloads to complete before returning. This should be used in tandem with the session.downloadFile method call that takes the "doLazy" parameter.
None
None
Version | Description |
---|---|
5.5.43a | Available in Enterprise edition. |
The sutil class provides general functions used to manipulate and work with extracted data. It also allows you to get information regarding screen-scraper such as its memory usage or version.
In the course of a scrape you might want to gather images associated with the other information being gathered. These methods are provided not only to download the images but also to gather size information and resize them to your desired size.
These methods are only available to enterprise edition users.
Get the height of an image.
Returns the height in pixels of the image file, as an integer. If the file doesn't exist or is not an image an error will be thrown and -1 will be returned.
Version | Description |
---|---|
5.0 | Moved from session to sutil. |
4.5 | Available for enterprise edition. |
Get the width of an image.
Returns the width in pixels of the image file, as an integer. If the file doesn't exist or is not an image an error will be thrown and -1 will be returned.
Version | Description |
---|---|
5.0 | Moved from session to sutil. |
4.5 | Available for enterprise edition. |
Internally, only one function is used to resize all images; however, to facilitate the resizing of images, we have provided you with three methods. Each method will help you specify what measurement is most important (width or height) and whether the image should retain its aspect ratio.
Resize image, retaining aspect ratio, based on specified height.
Returns void. If an error is encountered it will be thrown.
Version | Description |
---|---|
5.0 | Moved from session to sutil. |
4.5 | Available for enterprise edition. |
Resize image, retaining aspect ratio, based on specified width.
Returns void. If an error is encountered it will be thrown.
Version | Description |
---|---|
5.0 | Moved from session to sutil. |
4.5 | Available for enterprise edition. |
Resize image to a specified size.
Returns void. If an error is encountered it will be thrown.
Version | Description |
---|---|
5.0 | Moved from session to sutil. |
4.5 | Available for enterprise edition. |
This method can distort the image if the aspect ratios of the original and target images differ.
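The difference between the three resize methods comes down to simple arithmetic on the image dimensions. The Java sketch below shows, under the assumption of ordinary proportional scaling with rounding, how the missing dimension could be derived when only a height or only a width is specified. The class and method names are hypothetical, and this is not screen-scraper's own implementation.

```java
// Hypothetical arithmetic helper; not screen-scraper's implementation.
final class ResizeMathSketch {

    // Width that keeps the original aspect ratio for a fixed target height.
    static int widthForHeight(int origWidth, int origHeight, int targetHeight) {
        return (int) Math.round((double) origWidth * targetHeight / origHeight);
    }

    // Height that keeps the original aspect ratio for a fixed target width.
    static int heightForWidth(int origWidth, int origHeight, int targetWidth) {
        return (int) Math.round((double) origHeight * targetWidth / origWidth);
    }

    public static void main(String[] args) {
        // An 800x600 image resized to a height of 300 keeps its 4:3 ratio.
        System.out.println(widthForHeight(800, 600, 300)); // prints 400
        // The same image resized to a width of 200.
        System.out.println(heightForWidth(800, 600, 200)); // prints 150
    }
}
```

Resizing to an explicit width and height skips this calculation, which is why it can distort the image.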
To be used in conjunction with the ImageDecoder class.
This class represents decoded images. The objects can be queried for the text that was in the image, as well as any error that occurred while the image was being decoded. When the returned text is incorrect, there is a method that can be used to report it as bad. This can be used for sites like decaptcher.com, where refunds are given for incorrectly interpreted images.
Gets any error message, or returns null if there was no error.
This method takes no parameters.
The error message returned.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
Gets the result from decoding the image. Most likely this will be a String, but each implementation could return a specific object type.
This method takes no parameters.
The text extracted from the image, or null if there was an error.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
Handles an incorrectly resolved image. Some types of decoders won't have anything here.
This method takes no parameters.
This method returns void.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
Returns true if there was an error, false otherwise. Also returns false if the image has not been resolved yet.
This method takes no parameters.
True if there was an error, false otherwise.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
Class to convert images to text for interacting with CAPTCHA challenges. There are currently two implementations:
When a reference to an image is passed to an instance of this class, it returns a DecodedImage object that can be queried for the resulting text and any errors, and that can report an image as poorly converted.
See example attached.
Requires an account with decaptcher.com.
Type of ImageDecoder in the com.screenscraper.util.images package that uses the decaptcher.com service to convert images to text. The constructor is DecaptcherDecoder(ScrapingSession session, String username, String password) or DecaptcherDecoder(ScrapingSession session, String username, String password, String apiUrl).
Returns void. If it runs into any problems accessing the decaptcher.com service an error will be thrown.
Version | Description |
---|---|
5.5.29a | Available in all editions |
5.5.40a | Added the port parameter. The service now requires the correct port in order to authenticate. |
Initialization script
Type of ImageDecoder in the com.screenscraper.util.images package that uses a popup window prompting the user to enter the text read from an image. Useful for debugging purposes, as the input text should always be correct (so long as it is typed correctly). Helpful during testing to avoid costs associated with paid-for CAPTCHA decoding services such as decaptcher.com.
Returns void. If it runs into any problems decoding an image an error will be thrown.
Version | Description |
---|---|
5.5.29a | Available in all editions |
Initialize script
Converts the image given to a DecodedImage that will handle it. Does not delete the file.
A DecodedImage used to get the text, errors, and possibly report a result as bad.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
Converts the image at the given URL to a DecodedImage that will handle it. Temporarily saves the file in the screen-scraper root folder, but deletes it once it has been decoded. By default, this will use the scraping session's HttpClient to request the URL.
A DecodedImage used to get the text, errors, and possibly report a result as bad.
Version | Description |
---|---|
5.5.29a | Available in all editions. |
Converts the given Date to a string in a specified format, or in the "MM/dd/yyyy HH:mm:ss.SS zzz" format if no format is given.
A String representing the date given
Version | Description |
---|---|
5.5.26a | Available in all editions. |
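The default pattern quoted above is written in the style of Java's `SimpleDateFormat` patterns. Assuming it is interpreted that way (the documentation does not say which formatter is used), the following sketch shows the fallback to the default pattern when no format is given. The class name, method name, and explicit time zone argument are hypothetical.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical illustration of formatting with a default pattern.
final class DateFormatSketch {

    static final String DEFAULT_FORMAT = "MM/dd/yyyy HH:mm:ss.SS zzz";

    // Formats the date with the given pattern, or the default if pattern is null.
    static String format(Date date, String pattern, TimeZone zone) {
        SimpleDateFormat formatter =
            new SimpleDateFormat(pattern == null ? DEFAULT_FORMAT : pattern, Locale.US);
        formatter.setTimeZone(zone);
        return formatter.format(date);
    }

    public static void main(String[] args) {
        TimeZone utc = TimeZone.getTimeZone("UTC");
        Date epoch = new Date(0L);
        // Default pattern: date, time, milliseconds, then the zone name.
        System.out.println(format(epoch, null, utc));
        // Custom pattern.
        System.out.println(format(epoch, "yyyy-MM-dd", utc)); // prints 1970-01-01
    }
}
```

In `SimpleDateFormat`, `S` is the millisecond field, so `SS` pads milliseconds to at least two digits rather than printing hundredths of a second.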
Decode HTML Entities.
Returns string with decoded HTML entities.
Version | Description |
---|---|
5.0 | Added for all editions. |
Converts a String to a Date object using the given format. If null is given as a format, "MM/dd/yyyy HH:mm:ss.SS zzz" is used.
The Date object matching the date given in the String, or null if it couldn't be parsed with the given format
Version | Description |
---|---|
5.5.26a | Available in all editions. |
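The "null if it couldn't be parsed" contract described above can be sketched in plain Java as follows. This assumes `SimpleDateFormat`-style patterns and strict (non-lenient) parsing, neither of which is confirmed by the text; the class and method names are hypothetical.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical illustration of parse-or-null behavior.
final class DateParseSketch {

    static final String DEFAULT_FORMAT = "MM/dd/yyyy HH:mm:ss.SS zzz";

    // Returns the parsed Date, or null if the text does not match the pattern.
    static Date parse(String text, String pattern, TimeZone zone) {
        if (text == null) {
            return null;
        }
        SimpleDateFormat parser =
            new SimpleDateFormat(pattern == null ? DEFAULT_FORMAT : pattern, Locale.US);
        parser.setTimeZone(zone);
        parser.setLenient(false); // ASSUMPTION: reject values such as month 13
        try {
            return parser.parse(text);
        } catch (ParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        TimeZone utc = TimeZone.getTimeZone("UTC");
        System.out.println(parse("1970-01-02", "yyyy-MM-dd", utc).getTime()); // prints 86400000
        System.out.println(parse("not a date", "yyyy-MM-dd", utc));           // prints null
    }
}
```

Because a failed parse yields null rather than an exception, calling code should check the result before using it.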
Replaces the UTF variants on whitespace with a regular space character.
Returns the converted string.
Version | Description |
---|---|
6.0.55a | Available in all editions. |
Checks to see if one date is within a certain number of days of another.
Version | Description |
---|---|
5.5.13a | Available in all editions. |
Compare two strings ignoring case.
Returns true if the values of the two strings are equal when case is not considered; otherwise, it returns false.
Version | Description |
---|---|
5.0 | Added for all editions. |
Returns a number formatted in such a way that it could be parsed as a Float, such as xxxxxxxxx.xxxx. It attempts to determine whether the number is formatted in European or American style; if it cannot tell, it defaults to American. If the number ends with a k, the k is converted to thousands (as 000). It will also try to convert m for million and b for billion. It assumes that you won't have a number like 3.123k or 3.765m, although 3.54m is fine: if you wanted all three of those digits, you would have written 3765k or 3,765k.
Returns a String containing the number in a form that could be parsed as a Float, such as xxxxxxxxx.xxxx.
Version | Description |
---|---|
5.5.26a | Available in all editions. |
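To make the suffix handling above concrete, here is a deliberately simplified Java sketch that covers only the k, m, and b expansion. It assumes the input is already in plain American style with no grouping separators and does not attempt the European/American detection described above; the class and method names are hypothetical.

```java
import java.math.BigDecimal;
import java.util.Locale;

// Hypothetical, simplified illustration of the k/m/b suffix expansion only.
final class NumberSuffixSketch {

    // Expands a trailing k, m, or b into thousands, millions, or billions.
    // ASSUMPTION: the numeric part is plain American style, e.g. "3.54".
    static String expandSuffix(String raw) {
        String text = raw.trim().toLowerCase(Locale.ROOT);
        if (text.isEmpty()) {
            return text;
        }
        BigDecimal multiplier = BigDecimal.ONE;
        switch (text.charAt(text.length() - 1)) {
            case 'k': multiplier = BigDecimal.valueOf(1_000L); break;
            case 'm': multiplier = BigDecimal.valueOf(1_000_000L); break;
            case 'b': multiplier = BigDecimal.valueOf(1_000_000_000L); break;
            default: break;
        }
        if (multiplier.compareTo(BigDecimal.ONE) != 0) {
            // Drop the suffix character before parsing the number.
            text = text.substring(0, text.length() - 1).trim();
        }
        return new BigDecimal(text).multiply(multiplier)
                .stripTrailingZeros().toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(expandSuffix("3.54m"));   // prints 3540000
        System.out.println(expandSuffix("12k"));     // prints 12000
        System.out.println(expandSuffix("1500.25")); // prints 1500.25
    }
}
```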
Converts a String to a US formatted phone number, as +1 (123) 456-7890x2. Expects a 7 digit or 10+ digit phone number. The extension is optional, and will be any digits found after an x. This allows for extensions listed as ext, x, or extension.
Returns a String formatted as a phone number, such as +1 (123) 456-7890x2, or null if the input was null
Version | Description |
---|---|
5.5.26a | Available in all editions. |
Formats and returns a US-style zip code as 12345-6789. If the given zip code isn't 5 or 9 digits, a warning is logged, but the method still puts 5 digits before the - and anything else (if any) after it.
Zip code formatted String, such as 12345-6789 or 12345
Version | Description |
---|---|
5.5.26a | Available in all editions. |
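The "5 digits before the dash, anything else after it" rule can be sketched in a few lines of Java. This version strips non-digit characters first, which is an assumption, and it omits the warning that the description says is logged for unexpected lengths. The class and method names are hypothetical.

```java
// Hypothetical illustration of the zip code formatting rule.
final class ZipCodeSketch {

    // Puts the first five digits before the dash and any remaining digits after it.
    static String format(String raw) {
        String digits = raw.replaceAll("\\D", ""); // ASSUMPTION: non-digits are dropped
        if (digits.length() <= 5) {
            return digits;
        }
        return digits.substring(0, 5) + "-" + digits.substring(5);
    }

    public static void main(String[] args) {
        System.out.println(format("123456789"));  // prints 12345-6789
        System.out.println(format("12345"));      // prints 12345
        System.out.println(format("12345 6789")); // prints 12345-6789
    }
}
```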
Returns the current date in a specified format, or in the "MM/dd/yyyy HH:mm:ss.SS zzz" format if null is given. Uses the session's time zone.
A String representing the date and time this method was invoked
Version | Description |
---|---|
5.5.26a | Available in all editions. |
Retrieve the file path of the screen-scraper installation.
This method does not receive parameters.
Returns the installation directory file path, as a string.
Version | Description |
---|---|
5.0 | Added for all editions. |
Get memory usage of screen-scraper.
This method does not receive any parameters.
Returns the average percentage of memory used by screen-scraper over the past 30 seconds, as an integer.
Version | Description |
---|---|
5.0 | Moved from session to sutil. |
4.5 | Available for enterprise edition. |
For tips on optimizing screen-scraper's memory usage so that it can run faster, see our FAQ on optimization.
Get the mime-type of a local file.
Returns the mime-type of the file, as a string.
Version | Description |
---|---|
5.0 | Added for all editions. |
Get the number of runnable scraping sessions.
This method does not receive any parameters.
Returns the number of scraping sessions in this instance of screen-scraper, as an integer.
Version | Description |
---|---|
5.0 | Added for all editions. |
Gets the number of scraping sessions that are currently being run.
An int representing the number of running scraping sessions.
Version | Description |
---|---|
5.5.42a | Available in Enterprise edition. |
Gets a DataSet containing each of the elements of a <select> tag. The returned DataRecords will contain a key for the text found between the tags (possibly with HTML tags removed), a value indicating whether it was the selected option, and the value to submit for the specific option. Note that this only looks for option tags, so passing in text containing more than a single select tag will produce incorrect output.
A DataSet with one record per option. Values extracted will be stored in
VALUE : The value the browser would submit for this option
TEXT : The text that was between the tags
SELECTED : A boolean that is true if this option was selected by default
Version | Description |
---|---|
5.5.26a | Available in all editions. |
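The shape of the returned records can be illustrated with a small, self-contained Java sketch that uses regular expressions and plain maps in place of DataSet and DataRecord objects. It is a rough approximation, not screen-scraper's implementation: it assumes double-quoted `value` attributes and falls back to the option text when no `value` attribute is present, as browsers do. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; Maps stand in for DataRecords, the List for a DataSet.
final class SelectOptionsSketch {

    private static final Pattern OPTION = Pattern.compile(
        "<option\\b([^>]*)>(.*?)</option>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern VALUE = Pattern.compile(
        "\\bvalue\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern SELECTED = Pattern.compile(
        "\\bselected\\b", Pattern.CASE_INSENSITIVE);

    // Returns one record per <option>, with VALUE, TEXT, and SELECTED keys.
    static List<Map<String, Object>> parse(String selectHtml) {
        List<Map<String, Object>> records = new ArrayList<>();
        Matcher option = OPTION.matcher(selectHtml);
        while (option.find()) {
            String attributes = option.group(1);
            // Remove any nested tags from the visible text.
            String text = option.group(2).replaceAll("<[^>]+>", "").trim();
            Matcher value = VALUE.matcher(attributes);
            boolean hasValue = value.find();
            // Ignore the value attribute when looking for the selected flag.
            String withoutValue = VALUE.matcher(attributes).replaceAll("");

            Map<String, Object> record = new LinkedHashMap<>();
            record.put("VALUE", hasValue ? value.group(1) : text);
            record.put("TEXT", text);
            record.put("SELECTED", SELECTED.matcher(withoutValue).find());
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        String html = "<select name=\"color\">"
            + "<option value=\"r\">Red</option>"
            + "<option value=\"g\" selected>Green</option>"
            + "<option>Blue</option></select>";
        for (Map<String, Object> record : parse(html)) {
            System.out.println(record);
        }
    }
}
```

As the note above warns for the real method, a sketch like this collects every option tag it sees, so input containing more than one select tag would mix their options together.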
Gets all the options from a radio button group. The values are returned as data records, one per button. Any labels that are to be ignored will not be included in the returned set. Not every button will have a label: radio buttons do not require one, and without a label tag it would be difficult for a regular expression to know exactly what to extract as the label.
DataSet containing one record for each of the extracted radio buttons. Values will be stored in
VALUE : The value the browser would submit for this radio button
TEXT : The text that represents this button, or null if no label could be found for it
SELECTED : A boolean that is true if this button was selected by default
ID : The ID of the radio button, or null if no ID was found
Version | Description |
---|---|
5.5.29a | Available in all editions. |
Gets a random referrer page from a list of many different search engine web sites and a few other pages.
This method does not receive any parameters.
Returns a random referrer URL.
Version | Description |
---|---|
6.0.1a | Introduced for all editions. |
Returns a random User Agent. The list isn't closely monitored, so it may not include newer user agents, and may include extremely old ones as well.
This method does not receive any parameters.
Returns a random user agent.
Version | Description |
---|---|
6.0.1a | Introduced for all editions. |
Get edition of screen-scraper instance.
This method does not receive any parameters.
Returns the edition name, as a string.
Version | Description |
---|---|
5.0 | Added for all editions. |
Get version of screen-scraper instance.
This method does not receive any parameters.
Returns the version number, as a string.
Version | Description |
---|---|
5.0 | Added for all editions. |
Determine if the value of a string is an integer.
Returns true if the string is an integer; otherwise, it returns false. If it is passed an object that is not a string, including an integer, an error will be thrown.
Version | Description |
---|---|
5.0 | Added for all editions. |
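As a rough illustration of the check described above, the following plain-Java sketch treats a string as an integer when it parses as a 32-bit signed value. The class IsIntSketch and the method looksLikeInt are hypothetical names, and the accepted range and the handling of surrounding whitespace are assumptions; the built-in method may behave differently at the edges.

```java
public class IsIntSketch {
    // Hypothetical helper: true when the text parses as a 32-bit signed integer.
    public static boolean looksLikeInt(String value) {
        if (value == null) {
            return false;
        }
        try {
            Integer.parseInt(value);
            return true;
        } catch (NumberFormatException e) {
            // Not a whole number, or outside the int range.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikeInt("42"));
        System.out.println(looksLikeInt("4.2"));
        System.out.println(looksLikeInt("forty-two"));
    }
}
```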
Determine if an object's value is null or empty.
Returns true if the value of the object is null or an empty string; otherwise, it returns false.
Version | Description |
---|---|
5.0 | Added for all editions. |
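The rule stated above is small enough to sketch directly. The class NullOrEmptySketch and the method isNullOrEmpty are hypothetical names; the sketch assumes that only null and the empty string count as empty, so a non-string object such as an Integer is never treated as empty.

```java
public class NullOrEmptySketch {
    // Hypothetical helper: true for null or the empty string, false for anything else.
    public static boolean isNullOrEmpty(Object value) {
        return value == null || "".equals(value);
    }

    public static void main(String[] args) {
        System.out.println(isNullOrEmpty(null));
        System.out.println(isNullOrEmpty(""));
        System.out.println(isNullOrEmpty("data"));
    }
}
```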
Determine if operating system is a Linux platform.
This method does not receive parameters.
Returns true if the operating system is Linux; otherwise, it returns false.
Version | Description |
---|---|
5.0 | Added for all editions. |
Determine if operating system is a Mac platform.
This method does not receive parameters.
Returns true if the operating system is Mac; otherwise, it returns false.
Version | Description |
---|---|
5.0 | Added for all editions. |
Determine if operating system is a Windows platform.
This method does not receive parameters.
Returns true if the operating system is Windows; otherwise, it returns false.
Version | Description |
---|---|
5.0 | Added for all editions. |
Retrieve the response contents of a GET request.
Returns contents of the response, as a string.
Version | Description |
---|---|
5.0 | Added for all editions. |
This method will use any proxy settings that have been specified in the Settings dialog box.
Makes a GET request and returns the result as a string. This method will use the proxy settings indicated in the "Settings" dialog box, if any.
This method does not receive any parameters.
Version | Description |
---|---|
6.0.6a | Introduced for all editions. |
Makes a GET request and returns the result as a string. This method will use the proxy settings attached to the current scraping session.
This method does not receive any parameters.
Version | Description |
---|---|
6.0.6a | Introduced for all editions. |
Retrieve the response header contents.
Returns contents of the response, as a two-dimensional array.
Version | Description |
---|---|
5.0 | Added for all editions. |
This method will use any proxy settings that have been specified in the Settings dialog box.
Merges two data records by copying all values from the second record over the values of the first record, and returns a new DataRecord with these values. Neither original record is modified.
Returns a new DataRecord with the merged values.
Version | Description |
---|---|
5.5.26a | Available in all editions. |
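Since DataRecord extends Hashtable, the merge semantics can be sketched with plain Hashtable objects. This is an illustration only, not the library's code: MergeRecordsSketch and merge are hypothetical names, and the key and value types are assumptions.

```java
import java.util.Hashtable;

public class MergeRecordsSketch {
    // Hypothetical helper: copies the first record, then lets the second overwrite shared keys.
    public static Hashtable<String, Object> merge(Hashtable<String, Object> first,
                                                  Hashtable<String, Object> second) {
        Hashtable<String, Object> merged = new Hashtable<>(first);
        merged.putAll(second);
        return merged;
    }

    public static void main(String[] args) {
        Hashtable<String, Object> first = new Hashtable<>();
        first.put("NAME", "Widget");
        first.put("PRICE", "9.99");

        Hashtable<String, Object> second = new Hashtable<>();
        second.put("PRICE", "7.49");
        second.put("SKU", "W-100");

        // PRICE comes from the second record; first and second are left untouched.
        System.out.println(merge(first, second));
    }
}
```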
Get an object in string format.
Returns an empty string if the value of the object is null; otherwise, returns the value of the toString method of the object.
Version | Description |
---|---|
5.0 | Added for all editions. |
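The behaviour described above amounts to a null-safe toString. A minimal sketch, using the hypothetical names NullToEmptySketch and nullToEmpty:

```java
public class NullToEmptySketch {
    // Hypothetical helper: empty string for null, otherwise the object's toString value.
    public static String nullToEmpty(Object value) {
        return value == null ? "" : value.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + nullToEmpty(null) + "]");
        System.out.println("[" + nullToEmpty(Integer.valueOf(15)) + "]");
    }
}
```

This is convenient when writing extracted values to a file or database, where a missing token should appear as a blank field rather than the text "null".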
Attempts to parse a string to a name. The parser is not perfect and works best on English-formatted names (for example, "John Smith Jr." or "Guerrero, Antonio K"). This uses standard settings for the parser. To get more control over how the name is parsed, use the EnglishNameParser class.
Returns the parsed name, as a Name object.
Version | Description |
---|---|
6.0.59a | Available for professional and enterprise editions. |
Attempts to parse a string to a name. The parser is not perfect and works best on English-formatted names (for example, "John Smith Jr." or "Guerrero, Antonio K"). This uses standard settings for the parser. To get more control over how the name is parsed, use the EnglishNameParser class.
Returns the parsed name, as a Name object.
Version | Description |
---|---|
6.0.59a | Available for professional and enterprise editions. |
Attempts to parse a string to an address. The parser is not perfect and works best on US addresses. Most likely other address formats can be parsed with the USAddressParser class by providing different constraints in the builder. This method is here for convenience in working with US addresses.
Returns the parsed address, as an Address object.
Version | Description |
---|---|
6.0.59a | Available for professional and enterprise editions. |
Pause scraping session.
Returns void.
Version | Description |
---|---|
5.0 | Moved from session to sutil. |
4.5 | Available for professional and enterprise editions. |
Pausing the scraping session also pauses the execution of the scripts including the one that initiates the pause.
Pauses for a random amount of time. This is also set up to stop immediately if the stop scrape button is clicked, and to allow breakpoints to be triggered while it is pausing.
Returns void.
Version | Description |
---|---|
5.5.29a | Available in professional and enterprise editions. |
Change a date format.
Returns formatted date according to the specified format, as a string.
Version | Description |
---|---|
5.0 | Moved from session to sutil. |
4.5 | Available for professional and enterprise editions. Unspecified source format available for enterprise edition. |
The date formats are not the same for the two methods. Read carefully.
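For readers who want to see what converting a date between two formats involves, here is a minimal sketch using java.text.SimpleDateFormat. It is an assumption that the patterns look like SimpleDateFormat patterns; check each method's own description for the format syntax it actually accepts. ReformatDateSketch and reformat are hypothetical names.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

public class ReformatDateSketch {
    // Hypothetical helper: parses the date with one pattern and prints it with another.
    public static String reformat(String date, String fromPattern, String toPattern) {
        SimpleDateFormat from = new SimpleDateFormat(fromPattern, Locale.US);
        // Reject impossible dates such as 31/02/2010 instead of rolling them over.
        from.setLenient(false);
        SimpleDateFormat to = new SimpleDateFormat(toPattern, Locale.US);
        try {
            return to.format(from.parse(date));
        } catch (ParseException e) {
            throw new IllegalArgumentException("Date does not match pattern: " + date, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(reformat("01/02/2010", "dd/MM/yyyy", "yyyy-MM-dd"));
    }
}
```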
Send an email using SMTP mail server specified in the settings.
Returns void. If it runs into any problems while attempting to send the email an error will be thrown.
Version | Description |
---|---|
6.0.35a | Now supports alternate content types. |
5.0 | Moved from session to sutil. |
4.5 | Available for enterprise edition. |
Sorts the elements in a set into an ordered list.
This method returns a sorted list of elements that are in the set.
Version | Description |
---|---|
5.5.26a | Available in all editions. |
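Turning a set into an ordered list can be sketched in plain Java as follows. SortSetSketch and sorted are hypothetical names, and the sketch assumes the elements are mutually comparable (for example, all strings); how the built-in method orders other element types is not covered here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SortSetSketch {
    // Hypothetical helper: copies the set into a list and sorts it in natural order.
    public static <T extends Comparable<? super T>> List<T> sorted(Set<T> set) {
        List<T> list = new ArrayList<>(set);
        Collections.sort(list);
        return list;
    }

    public static void main(String[] args) {
        Set<String> keys = new HashSet<>(Arrays.asList("PRICE", "NAME", "SKU"));
        // The original set is left unchanged; only the copy is ordered.
        System.out.println(sorted(keys));
    }
}
```

A typical use is sorting the key set of a data record so that fields are always written out in the same order.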
Determine if one string is the start of another, without regards for case.
Returns true if string starts with start when case is not considered; otherwise, it returns false.
Version | Description |
---|---|
5.0 | Added for all editions. |
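A case-insensitive prefix test can be written in plain Java with String.regionMatches. The sketch below is illustrative only: StartsWithSketch and startsWithIgnoreCase are hypothetical names, and the parameter order is an assumption.

```java
public class StartsWithSketch {
    // Hypothetical helper: true when string begins with start, ignoring case.
    public static boolean startsWithIgnoreCase(String string, String start) {
        // regionMatches returns false when start is longer than string.
        return string.regionMatches(true, 0, start, 0, start.length());
    }

    public static void main(String[] args) {
        System.out.println(startsWithIgnoreCase("Screen-Scraper", "SCREEN"));
        System.out.println(startsWithIgnoreCase("Screen", "Screen-Scraper"));
    }
}
```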
Parse string into a floating point number.
Returns the string's value as a floating point number.
Version | Description |
---|---|
5.0.1a | Introduced for professional and enterprise editions. |
Strips HTML from a string, replacing some tags with corresponding text-only equivalents.
Returns the stripped content.
Version | Description |
---|---|
6.0.20a | Available in only the Enterprise edition. |
Tidies the DataRecord by performing actions based on the values of the settings map given. If no settings are given, the values obtained from sUtil.getDefaultTidySettings() are used. Each value in the record that is a string will be tidied; keys are not modified. The record given is not modified; a new record with the tidied values is returned.
The tidy settings and their default values are given below. If a key is missing in the settings map, that operation will not be performed.
Map Key | Default Value | Description of operation performed |
---|---|---|
trim | true | Trims whitespace from values |
convertNullStringToLiteral | true | Converts the string 'null' (without quotes) to the null literal (unless it has quotes around it, such as "null") |
convertLinks | true | Preserves links by converting <a href="link">text</a> to text (link), will try to resolve urls if scrapeableFile isn't null. Note that if there isn't a start and end <a> tag, this will do nothing |
removeTags | true | Remove html tags, and attempts to convert line break HTML tags such as <br> to a new line in the result |
removeSurroundingQuotes | true | Remove quotes from values surrounded by them -- "value" becomes value |
convertEntities (professional and enterprise editions only) | true | Convert html entities |
removeNewLines | false | Remove all new lines from the text. Replaces them with a space |
removeMultipleSpaces | true | Convert multiple spaces to a single space, and preserve new lines |
convertBlankToNull | false | Convert blank strings to null literal |
Returns a new DataRecord containing all the tidied values and any values that were not Strings in the original record. String values that were not tidied, as well as the DATARECORD value, will not be in the returned record.
Version | Description |
---|---|
5.5.26a | Available in all editions. |
5.5.28a | Now uses a Map for the settings, rather than bit flags. |
Tidies the string by performing actions based on the values of the settings map.
The tidy settings and their default values are given below. If a key is missing in the settings map, that operation will not be performed.
Map Key | Default Value | Description of operation performed |
---|---|---|
trim | true | Trims whitespace from values |
convertNullStringToLiteral | true | Converts the string 'null' (without quotes) to the null literal (unless it has quotes around it, such as "null") |
convertLinks | true | Preserves links by converting <a href="link">text</a> to text (link), will try to resolve urls if scrapeableFile isn't null. Note that if there isn't a start and end <a> tag, this will do nothing |
removeTags | true | Remove html tags, and attempts to convert line break HTML tags such as <br> to a new line in the result |
removeSurroundingQuotes | true | Remove quotes from values surrounded by them -- "value" becomes value |
convertEntities (professional and enterprise editions only) | true | Convert html entities |
removeNewLines | false | Remove all new lines from the text. Replaces them with a space |
removeMultipleSpaces | true | Convert multiple spaces to a single space, and preserve new lines |
convertBlankToNull | false | Convert blank strings to null literal |
Returns the tidied string.
Version | Description |
---|---|
5.5.26a | Available in all editions. |
5.5.28a | Now uses a Map for the settings, rather than bit flags. |
Assuming the extracted text's HTML code was:
<a href="http://www.somelink.com">This</a> was great because of these reasons:<br />
1 - Some reason<br />
2 - Another reason<br />
3 - Final reason
The output text would be:
This (http://www.somelink.com) was great because of these reasons:
1 - Some reason
2 - Another reason
3 - Final reason
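To make the transformation in this example concrete, the following plain-Java sketch applies rough equivalents of the convertLinks, removeTags, removeMultipleSpaces, and trim operations using regular expressions. It is an approximation for illustration, not the library's implementation: TidySketch and tidy are hypothetical names, the regular expressions are assumptions, and the other settings (entity conversion, quote removal, null handling) are left out.

```java
public class TidySketch {
    // Hypothetical helper approximating a few of the tidy operations listed above.
    public static String tidy(String html) {
        String text = html;
        // convertLinks: <a href="link">text</a> becomes text (link).
        text = text.replaceAll(
            "(?is)<a\\b[^>]*\\bhref\\s*=\\s*\"([^\"]*)\"[^>]*>(.*?)</a>", "$2 ($1)");
        // removeTags: a line-break tag, plus any whitespace after it, becomes one new line.
        text = text.replaceAll("(?i)<br\\s*/?>\\s*", "\n");
        // removeTags: every remaining tag is dropped.
        text = text.replaceAll("<[^>]+>", "");
        // removeMultipleSpaces: collapse runs of spaces and tabs, keeping new lines.
        text = text.replaceAll("[ \\t]+", " ");
        // trim: remove leading and trailing whitespace.
        return text.trim();
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.somelink.com\">This</a> was great because of these reasons:<br />\n"
            + "1 - Some reason<br />\n"
            + "2 - Another reason<br />\n"
            + "3 - Final reason";
        System.out.println(tidy(html));
    }
}
```

Run against the HTML from the example, the sketch produces the same four lines of output shown above.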
Unzip a zipped file. Contents will appear in the same directory as the zipped file.
Returns void. If a file input/output error is experienced it will be thrown.
Version | Description |
---|---|
5.0 | Added for all editions. |
Write to a file.
Returns void.
Version | Description |
---|---|
5.0 | Added for all editions. |