Miscellaneous
Anonymization
Overview
Under certain circumstances you may want to anonymize your scraping so that the target site is unable to trace back your IP address. For example, this might be desirable if you're scraping a competitor's site, or if the web site is blocking too many requests from a given IP address.
There are a few different ways to go about this using screen-scraper. In this section we'll go over each one in detail.
Automatic Anonymization
By far the simplest and most effective way to anonymize scraping in screen-scraper is to use the built-in automatic anonymization feature. Once you've done the initial setup, this can be as simple as checking a box. The anonymization service built in to screen-scraper is a paid service, and you'll need to sign up for it before making use of it. To do so, please send us a support request. See below for the cost of the anonymization service.
The screen-scraper automatic anonymization service works by sending each HTTP request made in a scraping session through a separate high-speed HTTP proxy server. The end effect of this is that the site you're scraping will see any request you make as coming from one of several different IP addresses, rather than your actual IP address. These HTTP proxy servers are actually virtual machines that get spawned and terminated as you need them. You'll use screen-scraper to either manually or automatically spawn and terminate the proxy servers.
Once you've signed up for the anonymization service you'll be given a password that will be associated with your registered email address. Your password will be entered into the "Anonymous Proxy" section of the "Settings" window:

The anonymous proxy servers will be set up in such a way that they only allow connections from your IP address. This way no one else can use any of the proxies without your authorization. In the "Settings" window you'll want to enter a comma-delimited list of allowed IP addresses for the computers that will be utilizing the service. If you'll be running your anonymized scraping sessions on the same machine (or local network) you're currently on, you can click the "Get the IP address for this computer" button to determine your current IP address. We find that between five and 10 proxy servers are adequate for most situations.
As the proxy servers get spawned and terminated, it's a good idea to establish the maximum number of running proxy servers you'd like to allow. This is done via the "Max running servers" setting. Just below this box you can click the "Refresh" button to see how many HTTP proxies are currently running. Because you pay for proxy servers by the hour, if you don't have your scraping session set up to automatically shut them down at the end, you'll want to use the "Terminate all running proxy servers." button in order to do that.
Aside from these global settings, there are a few settings that apply to each scraping session you'd like to anonymize. You can edit these settings under the "Anonymization" tab of your scraping session. The settings should be self-explanatory.
Once you've configured all of the necessary settings, try running your scraping session to test it out. You'll see messages in the log that indicate what proxy servers are being used, how many have been spawned, etc.
As your anonymous scraping session runs, you'll notice that screen-scraper will automatically regulate the pool of proxy servers. For example, if screen-scraper gets a timed-out connection or a 403 response (forbidden), it will terminate the current proxy server and automatically spawn a new one in its place. This way you will likely always have a complete set of proxy servers, regardless of how frequently the target web site might be blocking your requests. You can also manually report a proxy server as blocked by calling session.currentProxyServerIsBad() in a script.
There is a $150 setup fee for the anonymization service. Beyond that, the charge for each running proxy server is 25 cents per proxy per hour. Once again, to enroll in the service please send us a support request.
While the automatic anonymization service provides an excellent way to cloak your IP address, it is still possible that the target web site will block enough of the anonymized IP addresses that the anonymization could fail. Unfortunately we can't make any guarantees that you won't get blocked; however, by using the automatic anonymization service the chances of getting blocked are reduced dramatically.
Users of the automatic anonymization service must first agree by email to be bound by Amazon's Amazon Web Services Customer Agreement (please take special note of section "5.4. Amazon Elastic Compute Cloud (Amazon EC2)" of the agreement, which specifically outlines permitted activities). When using the automatic anonymization method, while the remote web site may not be able to determine your IP address, your activity will still be logged. If you attempt to use the proxy service for any illegal activities, the chances are very good that you will be prosecuted.
Using the anonymization service outside of screen-scraper:
All requests require that you pass your registered email address, which will be determined when you sign up for the anonymization service. This is passed as a URL-encoded string in the URL query string using the key "registered_email". Your password will also be required, which is passed to the server via the "password" parameter.
Each call to the server is done via a GET request. The possible requests are described below.
Note: Expect an average delay of around 20 seconds before receiving a response from the system for each request made.
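As a rough illustration of the request format described above, the following sketch assembles a URL-encoded GET URL and fetches it with a generous timeout to allow for the delay mentioned in the note. The class and method names (ProxyServiceUrls, build, fetch) and the 60-second timeout are assumptions made for this example; only the base URL and the "registered_email" and "password" keys come from this page.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of screen-scraper itself.
class ProxyServiceUrls {
    // Base URL as it appears in the request examples on this page.
    static final String BASE = "https://www.screen-scraper.com/screen-scraper/proxy/";

    // Builds the GET URL for an action such as "get_num_proxies".
    // Credentials go first, followed by any action-specific parameters.
    static String build(String action, String email, String password,
                        Map<String, String> extra) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("registered_email", email);
        params.put("password", password);
        params.putAll(extra);

        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(encode(e.getKey()) + "=" + encode(e.getValue()));
        }
        return BASE + action + "/?" + query;
    }

    // URL-encodes a single key or value (e.g. "@" becomes "%40").
    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // Sends the GET request and returns the response body as text.
    // The 60-second timeout is an assumed value chosen to cover the
    // roughly 20-second average delay noted above.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(60))
                .GET()
                .build();
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
    }
}
```

For instance, ProxyServiceUrls.build("get_num_proxies", "foo@bar.com", "mypass", new LinkedHashMap<>()) yields a URL in the same form as the examples in this section.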
Update anonymization settings:
https://www.screen-scraper.com/screen-scraper/proxy/update_settings/?registered_email=foo%40bar.com&password=mypass&ip_addresses_allowed_to_connect=123.45.67.89%2C98.75.54.321&max_running_proxies=5
Terminate all proxies:
https://www.screen-scraper.com/screen-scraper/proxy/terminate_instances/?registered_email=foo%40bar.com&password=mypass
Spawn new proxies:
https://www.screen-scraper.com/screen-scraper/proxy/spawn_instances/?registered_email=foo%40bar.com&password=mypass&num_instances=5
Get the current number of running proxies:
https://www.screen-scraper.com/screen-scraper/proxy/get_num_proxies/?registered_email=foo%40bar.com&password=mypass
Get the list of currently running proxies:
https://www.screen-scraper.com/screen-scraper/proxy/get_current_proxies/?registered_email=foo%40bar.com&password=mypass
Each proxy gets its own line. The host and port are given first, then a space character, then the instance ID. You'll use the instance ID if you want to report a proxy as bad (so that it will be terminated and one will be spawned in its place).
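The line format just described can be parsed with a few lines of code. The sketch below is only an illustration: it assumes the host and port are joined by a colon (this page doesn't state the separator), and the class name ProxyListing and the sample host names in the usage note are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for the proxy listing; not part of screen-scraper.
class ProxyListing {
    // One running proxy: where to connect, plus the ID used to report it as bad.
    static final class Proxy {
        final String host;
        final int port;
        final String instanceId;

        Proxy(String host, int port, String instanceId) {
            this.host = host;
            this.port = port;
            this.instanceId = instanceId;
        }
    }

    // Parses a response body with one proxy per line, each line being
    // "host:port instanceId". The colon separator is an assumption.
    static List<Proxy> parse(String body) {
        List<Proxy> proxies = new ArrayList<>();
        for (String line : body.split("\\R")) {
            line = line.trim();
            if (line.isEmpty()) {
                continue; // ignore blank lines
            }
            int space = line.indexOf(' ');
            int colon = space < 0 ? -1 : line.lastIndexOf(':', space);
            if (space < 0 || colon < 0) {
                throw new IllegalArgumentException("Unexpected line: " + line);
            }
            String host = line.substring(0, colon);
            int port = Integer.parseInt(line.substring(colon + 1, space));
            String instanceId = line.substring(space + 1).trim();
            proxies.add(new Proxy(host, port, instanceId));
        }
        return proxies;
    }
}
```

Given a line such as "proxy1.example.com:8888 i-2cfc3541", the parser returns a Proxy whose instanceId can then be passed as the "instance_id" parameter when reporting a bad proxy.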
Terminate a single proxy and spawn a new one in its place:
https://www.screen-scraper.com/screen-scraper/proxy/report_bad_proxy/?registered_email=foo%40bar.com&password=mypass&instance_id=i-2cfc3541
Anonymization Via Manual Proxy Pools
If the automatic anonymization method isn't right for you, the next best alternative might be to manually handle working with screen-scraper's built-in proxy pool. The basic approach involves running a script at the beginning of your scraping session that sets up the pool, then calling session.currentProxyServerIsBad() as you find that proxy servers are getting blocked. In order to use a proxy pool, you'll also need to get a list of anonymous proxy servers. Generally you can turn these up by Googling around a bit.
The best way to demonstrate the use of proxy pools is by an example. So here it is:
import com.screenscraper.util.*;
That's about all there is to it. Aside from occasionally calling session.currentProxyServerIsBad(), you may also want to call session.setUseProxyFromPool to turn anonymization on and off within the scraping session.
How HTTP Works
Hypertext transfer protocol provides a way for clients such as web browsers to communicate with web servers. There's quite a bit on the web that's written on the topic, so for the time being we'll just provide some good links for you.
Importing and Exporting Objects
Overview
Scraping sessions and scripts can be exported from screen-scraper to external files. You might consider doing this in order to back up your work, but the principal purpose is to allow for examples to be exported in order to help others learn how to use screen-scraper.
Exporting Objects from screen-scraper
In order to export a scraping session or script to an external file, simply select the object you wish to export, then click on the corresponding "Export" button. You'll be asked to save the file to a location of your choice. You're also free to name the file what you wish, though we recommend you leave the "(scraping session)" or "(script)" portion of the name intact so that you can identify the type of the object later on.
Note that when exporting a scraping session all scripts referenced from that scraping session will be exported along with it. This does not include, however, scripts used to invoke the scraping session.
Importing Objects into screen-scraper
To import a scraping session or script into screen-scraper select the "Import..." option from the "File" menu. Locate the XML file corresponding to the object you wish to import, and select "Open". If you've selected a valid file the objects contained within that file will be imported into the application.
You can also import exported scraping sessions and scripts into screen-scraper by copying them into the "import" folder you'll find in the directory where screen-scraper was installed. This can be especially useful while screen-scraper is running as a server, which allows the objects to be imported on the fly (that is, without stopping the server). screen-scraper will check this directory just before executing a scraping session, and import any files found in it. Note that imported files will be removed from the import folder once they are imported by screen-scraper.
In cases where you want to pack up scraping sessions and scripts along with other files needed to run a scrape, you can compress them all into an "update.zip" file. This file should replicate the directory structure of screen-scraper. For example, you might have a folder called "import" that contains a scraping session. You might also have a CSV file in the root of the zip file that contains parameters needed to run the scraping session. You can zip all of these up into an "update.zip" file, then place that file inside an "update" folder found in screen-scraper's folder. When screen-scraper starts up it will unzip the file, copy all of its contents to the corresponding locations, then delete the "update.zip" file.
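One way to produce such an archive programmatically is sketched below. It writes a zip whose entry names mirror paths relative to screen-scraper's folder, as described above. The class name UpdateZipBuilder and the file names in the usage note are hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of screen-scraper itself.
class UpdateZipBuilder {
    // Writes a zip file in which each map key is the entry's path inside
    // the archive (for example "import/some session file" or "params.csv")
    // and each value is the local file whose contents go into that entry.
    static void write(Path zipFile, Map<String, Path> entries) throws IOException {
        try (OutputStream out = Files.newOutputStream(zipFile);
             ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Map.Entry<String, Path> e : entries.entrySet()) {
                // Zip entry names use forward slashes on every platform.
                zip.putNextEntry(new ZipEntry(e.getKey()));
                Files.copy(e.getValue(), zip);
                zip.closeEntry();
            }
        }
    }
}
```

For example, mapping "import/Shopping Site (scraping session).sss" to an exported scraping session and "params.csv" to a parameter file, then writing the result to "update.zip", would give an archive laid out like the one in the example above. It would still need to be placed in the "update" folder by hand or by a deployment script.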
Generating RSS and Atom Feeds
Overview
screen-scraper has the ability to automatically generate RSS and Atom feeds from extracted data. If you're unfamiliar with RSS and Atom feeds you might take a minute to read up on the topic first. In order to use the RSS/Atom functionality you need to be using the Enterprise Edition of screen-scraper.
The documentation on this page is a bit abstract. If you're interested in building RSS/Atom feeds with screen-scraper it would probably be a good idea for you to go through our Sixth Tutorial, which will walk you through the process in detail.
How it Works
A small web server runs within screen-scraper that interacts with the scraping engine. As such, you can access a URL within a browser or RSS/Atom reader that will cause screen-scraper to invoke a scraping session, then return an RSS or Atom feed.
The basic syntax for the URL you'll use to generate a feed looks like this:
http://(host:port)/ss/xmlfeed?scraping_session=(scraping session name)[&key1=value1&key2=value2...]
For example, if you were running screen-scraper on your local machine, and wanted to generate a feed for the "Shopping Site" example used in our tutorials with the search term "bug" the URL would look like this:
http://localhost/ss/xmlfeed?scraping_session=Shopping+Site&SEARCH=bug
As with any other URL, each of the parameters must be properly URL-encoded. Key/value pairs can also be passed in as POST parameters.
The only required parameter is "scraping_session". screen-scraper will create session variables out of any other parameters that get passed in.
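To make the encoding requirement concrete, here is a small sketch that assembles a feed URL from a scraping session name and a set of extra parameters. The class name FeedUrl and its build method are hypothetical; the "/ss/xmlfeed" path and the "scraping_session" key are taken from the syntax shown above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical helper, not part of screen-scraper itself.
class FeedUrl {
    // Builds a feed URL. hostAndPort is e.g. "localhost" or "localhost:8779";
    // every entry in sessionVariables is passed as an extra key/value pair,
    // which screen-scraper turns into a session variable.
    static String build(String hostAndPort, String scrapingSession,
                        Map<String, String> sessionVariables) {
        StringBuilder url = new StringBuilder("http://")
                .append(hostAndPort)
                .append("/ss/xmlfeed?scraping_session=")
                .append(encode(scrapingSession));
        for (Map.Entry<String, String> e : sessionVariables.entrySet()) {
            url.append('&')
               .append(encode(e.getKey()))
               .append('=')
               .append(encode(e.getValue()));
        }
        return url.toString();
    }

    // URL-encodes a key or value; a space becomes "+".
    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }
}
```

Calling FeedUrl.build("localhost", "Shopping Site", Map.of("SEARCH", "bug")) reproduces the example URL above, with the space in the session name encoded as "+".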
Setting Up the Scraping Session
The scraping session must have certain named elements present in order to generate the feed. They are as follows:
When the XML feed is requested through your browser or reader screen-scraper will invoke the scraping session named by the "scraping_session" parameter. Once the scraping session completes screen-scraper will look for a DataSet called "XML_FEED", iterate over its constituent DataRecord objects, building the feed from them.
The Web Interface
Overview
The screen-scraper web interface allows you to administer aspects of the scraping process. This includes monitoring running scraping sessions, importing and exporting scraping sessions, and scheduling scraping sessions to be run on a periodic basis.
When screen-scraper is running in server mode, you can access the web interface on your local machine at the following URL: http://localhost:8779/web.htm. If you've changed the "Web/SOAP Server" port in the workbench, you'll need to use the port you designated there. Also, depending on the operating system you're running, instead of "localhost", you may need to use "127.0.0.1" or the IP address of the machine.
The same security settings that apply to accessing the screen-scraper server from a remote application apply also to the web interface. That is, you designate IP addresses of machines allowed to connect. For more information on this, see the "IMPORTANT NOTE" under Running screen-scraper as a server.
The web interface makes use of three primary tabs that allow you to manage various aspects of the scraping process. We'll review each of them in detail.

The "Runnable" Tab
This tab displays all scraping sessions loaded into the current instance. It will display basic information on scraping sessions that are currently running, as well as scraping sessions that have run in the past. It also allows you to start and schedule scraping sessions. By checking boxes on the left side you can start multiple scraping sessions simultaneously. Use the "Refresh" button to see the latest information on each scraping session. The various columns in the table are sortable by clicking the appropriate header. A description of each column follows:
The "Run/Running" Tab
This tab displays information on scraping sessions that are either currently running or have run in the past. You can use this table to compare run times, the number of records scraped, and also to monitor scraping session logs. After checking the boxes in the leftmost column you can stop or remove multiple scraping session records. Note that removing records for scraping sessions that have run doesn't remove the scraping sessions themselves, just the records related to the time when they were run. Here's a more detailed description of each of the columns:
The "Scheduled" Tab
On this tab you can manage scraping sessions that have been scheduled to be run. See below for more on scheduling scraping sessions. The columns in the table are described below:
Scheduling Scraping Sessions
Through the web interface, you can schedule scraping sessions to be run at a future time, and set a frequency so that they'll run on an ongoing basis. To schedule a scraping session click the "Schedule" button under the "Runnable" tab. You can also alter the settings for an existing scheduled scraping session under the "Scheduled" tab.
When working with scheduled scraping sessions you'll do so via a separate tabbed dialog box. Each tab and its corresponding parameters are described below.
In the "General" tab, you'll manage the following settings:
The "Schedule" tab allows you to set the following:
The "Thresholds" tab allows you to set percentages that can be checked when a scraping session is running in order to determine whether or not a scraping session differs significantly from the previous time it was run.
Settings
The "Settings" dialog box simply allows you to set default values for scheduled scraping sessions. See the section above for details on each of the values.