Alpha Change Log
Alpha versions are used to fix minor bugs and test feature enhancements before they are added to stable versions. As such, anything in an alpha version is prone to change and instability while it is being improved. This log tracks the changes as they are made, for your convenience.
View Release Notes for public versions.
Alpha Version Logs
- Added sutil.getRandomUserAgent and sutil.getRandomReferer.
- Added IDE-style completions. Two new properties are needed for this to work:
ShowVariableCompletionsAt=2 (the number of characters that must be typed before a completion list appears) and GenericCompletions=true (a flag indicating that generic completions should be used).
- Added session.getCurrentStack (a basic method to get the stack).
- Added scrapeableFile.applyXPathExpression and sutil.applyXPathExpression.
- Added dataSet.size, which is equivalent to dataSet.getNumDataRecords.
- Now nulling session variables for appropriate extractor pattern tokens after each extractor pattern match instead of after the pattern has been applied.
- Fixed a bug where the HTTP connection pool was getting shut down prematurely.
- Fixed a bug related to the previous change to null session variables.
- Fixed a bug such that a scrapeable session ID is now being generated even for scraping sessions that will run in the future.
- Fixed a bug where nodes in the tree weren't being highlighted correctly.
- Scrapeable files can now be added via a URL.
- If the DatabasePort and WebServerShutdownPort properties are omitted from the screen-scraper.properties file they'll now be automatically set to the value of an open port.
- When screen-scraper is running in server mode, the ProxyPort will now only be tested and used if AllowProxyScripting is set to true.
- Added a "Load Response from Clipboard" button to the scrapeable file panel.
- Updated BeanShell to the latest version, disabling unstable Windows scripting in the process (e.g., VBScript).
- sutil.makeGETRequest and sutil.makeHEADRequest now use proxy settings from the corresponding scraping session.
- Temporarily rolled back to the previous version of BeanShell because of a bug.
- Upgraded BeanShell to the latest version.
- Searches within a proxy session now include notes.
- Fixed an issue that would cause the workbench to freeze when the breakpoint window was up.
- Now using global proxy settings if no session proxy settings are found.
- Improved cookie handling in the proxy server.
- Fixed a bug that would cause a proxy session to not be completely saved.
- Added sutil.makeGETRequestNoSessionProxy.
- Fixed a bug that would cause the proxy to misbehave when filtering out less-useful transactions.
- Now decoding parameters when adding a scrapeable file from a URL.
- The request entity for a scrapeable file can now be set in the workbench.
- Fixed a bug where scraping session nodes in the tree were getting collapsed incorrectly.
- Updated web server to use Jetty.
- Fixed a bug related to generating scrapeable files from a proxy transaction.
- Fixed a bug related to adding jar files from the ext folder.
- Fixed a bug related to redirects within sutil.makeGETRequest.
- Fixed a problem with the scraping server not starting up.
- Fixed an issue with the web server on Windows.
- If the request entity text box is blanked, the value is now set to null.
- Errors will no longer be thrown if a scraping session has already been stopped.
- Added sutil.makeGETRequestUseSessionProxy. The sutil.makeGETRequest method will now use no proxy.
- Fixed an issue related to loading external jar files when running in server mode.
- Fixed a bug related to automatic anonymization.
- Extractor patterns invoked manually can now be tested on a sub-set of the HTML page.
- Added scrapeableFile.setForcePOST.
- Upgraded internal GWT libraries.
- Prettied up the web UI.
- Added machine-readable values to REST interface output.
- Now properly handling en-dash characters in URLs.
- Can now handle HTTP responses that send two status lines.
- Improved in-line documentation in the script editor. If a javadoc folder containing API documentation is found inside screen-scraper's doc folder, it will be made available within the script editor.
- Re-enabled SOAP interface.
- Now writing an error message to the log when a scraping session import via the SOAP interface fails.
- Added a global find feature.
- Added RetryPolicy.runOnAllAttemptsFailed()
- Fixed a bug in RetryPolicy related to scraping files recursively.
- The scrapeableFile.addHTTPHeader method is now available in Professional Edition.
- Widened the proxy text boxes.
- Increased the height of the sub-extractor text panel.
- Fixed "When to run" combo box to select the correct value when clicked.
- Fixed a bug related to editing long HTTP parameters.
- Added sutil.stripHTML.
- Responses from the screen-scraper web server are now compressed.
- Automatic internal DB backups can be set with the ShouldBackUpInternalDB property.
- The save button now becomes active when a long HTTP parameter is updated.
- Fixed a bug related to history navigation buttons.
- Fixed a bug related to removing completed scraping sessions via the web UI.
- Added session.clearProxySettings() method.
- Fixed a bug related to editing extractor pattern tokens that have the same identifier.
- Added mouseover row highlights to web UI.
- Improved stability in multi-threaded scrapes.
- Updated the date picker for the web UI.
- Now displaying server time in the "Settings" dialog box in the web UI.
- Added disk space usage to web UI.
- Added String scrapeableFile.getRedirectURLs().
- Added proxy filters.
- Comparing scrapeable file requests and proxy requests now takes into account raw request entities.
- Script error line numbers are now hyperlinked.
- Fixed a naming issue when copying scrapeable files.
- Fixed a bug related to HTML in error messages.
- Fixed a bug related to comparing HTTP requests in the workbench.
- Fixed a bug in rendering session variables in the web UI.
- Fixed disk usage indicator in web UI.
- Fixed a bug in how POST params are rendered when comparing HTTP requests.
- Fixed a bug related to proxy pools.
- Fixed an internal issue related to tracking running scraping sessions.
- Now aborting a running scraping session if unable to find a valid proxy while using the proxy pool.
- Minor fix to the DataManager.
- Error messages are no longer hyperlinked when not running the workbench.
- Fixed an issue where one script producing an error would interrupt a series of scripts.
- Fixed an SSL issue when running on AIX.
- Fixed an issue where scraping some SSL sites would generate an error.
- Fixed a bug related to a recent change to how SSL is initialized. Added the "Use only SSL version 3" checkbox under the "Advanced" tab for a scraping session.
- Fixed a couple of bugs related to a fix in the previous build.
- One more bug fix related to the recent SSL changes.
- Altering external proxy settings for a proxy session now takes effect when restarting the proxy session.
- sutil.sendMail now supports alternate content types.
- Updated password fields to obscure text.
- Updated HttpClient and NTLM authentication.
- Code folding in scripts and last response
- Syntax highlighting in last response
- DataManager updates
- Added a runOnAllAttemptsFailed() method to RetryPolicy
- Added convenience methods: isRunningInWorkbench(), isRunningFromCommandLine(), isRunningInServer() to session
- Improved handling of NTLM proxies
- Fixed a thread blocking issue when invoking a RunnableScrapingSession.
- Fixed an issue related to reusing HTTP connections.
- Fixed a naming issue related to generating multiple scrapeable files from proxy transactions.
- Fixed an issue where imported scripts weren't being properly associated with corresponding objects.
- Added scrapeableFile.getResolvedURL()
- Updated the ss_updater.py file to use the REST interface.
- Fixed a concurrency issue related to running the same scraping session multiple times.
- Use of $ in the regular expression field for an extractor pattern token is now allowed.
- Fixed a bug where invoking a scrapeable file manually was causing the tree in the workbench to malfunction.
- Upgraded HttpClient to version 4.3.
- Downgraded back to HttpClient 4.2.
- Includes experimental code for parsing mailing addresses.
- do_lazy_scrape can now be passed as a parameter when running a scraping session via the REST interface.
- Added finalize_scrapeable_session action to the REST interface.
- Updated a few URLs for remote services.
- Fixed a bug related to running a scraping session via the REST interface.
- Fixed an issue resolving relative URLs beginning with ?.
- Improved handling of connections that remained open when using external proxy servers.
- Fixed a bug where lazy scrapes would halt prematurely when running from the command line.
- Fixed a bug caused by extractor pattern token names containing numbers.
- Fixed two bugs related to searching in the log and script pane.
- Fixed a minor memory leak.
- Fixed two bugs related to finding in text areas.
- Fixed a threading issue with anonymous proxies.
- Added ConvertHTMLEntitiesByDefault and TrimWhiteSpaceByDefault to screen-scraper.properties.
- The KeyManagerFactory algorithm to be used can now be set via the KeyManagerFactory property in the screen-scraper.properties file.
- Changed extractor patterns so they use a thread pool and subextractors can run concurrently. In various tests with subextractors, this yielded an increase in extraction speed of roughly 25% to 50% on average.
- Bug fixes and improvements to the address parser.
- When using a proxy pool, proxies now start cycling from a random offset rather than always starting at the first proxy in the list.
- Modified the extractor pattern name box to have an orange background if the extractor will save the dataset automatically to a session variable (similar to the red sequence number box if it will be run manually).
- Fixed a bug in the form classes that occurred when a form is missing its action attribute.
- Updates and improvements to the completions provider (the dynamic one, not the old one) so class names are completed and constructors also get completions sooner.
- Added the scrapeableFile.setForcedRequestType(ScrapeableFile.RequestType) method, which should be called before the file is scraped. ScrapeableFile.RequestType is an enum with GET, POST, PUT, HEAD, DELETE, and OPTIONS as values. If the method is called to set the request to one of those types, all parameters set as GET in the parameters tab will be appended to the URL (as usual) and all parameters set as POST parameters will be used to build the request entity. If there are POST values on a type that doesn't support a request entity, an exception will be thrown.
- Added a sutil.convertUTFWhitespace method that takes an input string and converts all the different UTF whitespace characters to a standard space character.
- Added a table parser. The table parser takes in a block of HTML representing a table and parses it to a Table object. Cells of the table can be retrieved using the Table.getCell(row, column) method, and this will account for rowspan and colspan cells. Requesting a cell that is spanned by another cell will return an object that shows the same data as the cell that was spanned, but has a flag indicating it is a span cell.
- Added / updated an event callback system in ScrapingSession. This allows callbacks to be added that run at various times, similar to scripts, but declared in an initialization script. One useful scenario is to have an initialization script that sets up a connection to a database, and then adds a callback, run after all scripts have executed, that closes the connection to the database.
- Similarly, the ScrapeProfiler class can be used in conjunction with the event system to hunt down slow spots in a scrape. It shows the runtime of each script, extractor, and scrapeable file, as well as which session variables are not used. Attaching it to a scrape will cause the scrape to run much slower, but can provide insight when things go wrong.
- Added the default_token_config.xml file to resources/conf. This file, if present, allows setting default regular expressions and options for extractor tokens. For example, the included file sets the regular expression to [^<>]* for any all-upper-case token name with either < on the right or > on the left of the token, and also sets the Convert Entities and Trim checkboxes.
- Added two values to the properties file to allow access to the web interface from IP addresses that aren't on the allowed list. These properties are WebInterfaceUser and WebInterfacePassword. Note that the web interface still uses HTTP rather than HTTPS, so be aware that adding these properties makes an instance slightly more vulnerable, since the credentials are not submitted securely to the server.
- Fixed an issue that may cause some instances of screen-scraper to crash on start-up.
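
The table parser entry above says that Table.getCell(row, column) accounts for rowspan and colspan, and that a spanned position returns the same data as the spanning cell plus a flag. The following is a minimal sketch of one way such a grid could be resolved. It is not screen-scraper's Table class: SpanGridSketch, CellSpec, GridCell, and the isSpan field are hypothetical names chosen for illustration, and the input is assumed to be already-parsed cells rather than HTML.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is NOT screen-scraper's Table class.
class SpanGridSketch {

    /** A cell as declared in the markup: its text plus its rowspan and colspan. */
    static final class CellSpec {
        final String text;
        final int rowSpan;
        final int colSpan;

        CellSpec(String text, int rowSpan, int colSpan) {
            this.text = text;
            this.rowSpan = rowSpan;
            this.colSpan = colSpan;
        }
    }

    /** A resolved grid position; isSpan is true when another cell's span covers it. */
    static final class GridCell {
        final String text;
        final boolean isSpan;

        GridCell(String text, boolean isSpan) {
            this.text = text;
            this.isSpan = isSpan;
        }
    }

    private final Map<Long, GridCell> grid = new HashMap<>();

    SpanGridSketch(List<List<CellSpec>> rows) {
        for (int r = 0; r < rows.size(); r++) {
            int c = 0;
            for (CellSpec spec : rows.get(r)) {
                // Skip positions already claimed by a span from an earlier cell or row.
                while (grid.containsKey(key(r, c))) {
                    c++;
                }
                // Fill every position the cell covers; only the top-left one is the origin.
                for (int dr = 0; dr < spec.rowSpan; dr++) {
                    for (int dc = 0; dc < spec.colSpan; dc++) {
                        boolean origin = (dr == 0 && dc == 0);
                        grid.put(key(r + dr, c + dc), new GridCell(spec.text, !origin));
                    }
                }
                c += spec.colSpan;
            }
        }
    }

    /** Returns the cell at the given position, or null if nothing occupies it. */
    GridCell getCell(int row, int column) {
        return grid.get(key(row, column));
    }

    private static long key(int row, int column) {
        return ((long) row << 32) | (column & 0xffffffffL);
    }
}
```

With a first row holding "A" (rowspan 2) and "B" (colspan 2), and a second row holding "C" and "D", position (1, 0) in this sketch resolves to "A" with the span flag set, while "C" lands at (1, 1).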
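
The proxy pool entry above notes that proxies now start cycling from a random offset rather than always from the first proxy in the list. A rough sketch of that idea follows, assuming a simple round-robin rotation. RotatingPoolSketch and nextProxy are hypothetical names, and the actual pool in screen-scraper may be implemented quite differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not the proxy pool used by screen-scraper.
class RotatingPoolSketch {
    private final List<String> proxies;
    private int next;

    RotatingPoolSketch(List<String> proxies, Random random) {
        if (proxies.isEmpty()) {
            throw new IllegalArgumentException("proxy list must not be empty");
        }
        this.proxies = new ArrayList<>(proxies);
        // Start at a random position so concurrent or repeated runs
        // do not all hit the first proxy in the list first.
        this.next = random.nextInt(this.proxies.size());
    }

    /** Returns the next proxy in list order, wrapping around at the end. */
    synchronized String nextProxy() {
        String proxy = proxies.get(next);
        next = (next + 1) % proxies.size();
        return proxy;
    }
}
```

Only the starting point is randomized here; after that the list is walked in order, so every proxy is still used once per full cycle.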
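
The sutil.convertUTFWhitespace entry above describes converting the various UTF whitespace characters to a standard space. The sketch below shows one plausible reading of that behavior using only the standard Java library. WhitespaceSketch and normalizeSpaces are hypothetical names, and the choice of Character.isSpaceChar as the test for "whitespace" is an assumption; the real method may treat a different set of characters.

```java
// Hypothetical illustration only; not the implementation of sutil.convertUTFWhitespace.
class WhitespaceSketch {

    /**
     * Replaces every Unicode space, line, or paragraph separator
     * (for example U+00A0, U+2003, U+3000) with an ASCII space.
     * Control characters such as tab and newline are left untouched,
     * which is an assumption made for this sketch.
     */
    static String normalizeSpaces(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            if (Character.isSpaceChar(codePoint)) {
                out.append(' ');
            } else {
                out.appendCodePoint(codePoint);
            }
            // Advance by one or two chars depending on the code point's width.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }
}
```

This kind of normalization is mostly useful before comparing or trimming extracted text, since a non-breaking space looks like a regular space but does not match one.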