Anonymization

Overview

Under certain circumstances you may want to anonymize your scraping so that the target site is unable to trace back your IP address. For example, this might be desirable if you're scraping a competitor's site, or if the web site is blocking too many requests from a given IP address.

There are a few different ways to go about this using screen-scraper. In this section we'll go over each one in detail.

Automatic Anonymization

By far the simplest and most effective way to anonymize scraping in screen-scraper is to use the built-in automatic anonymization feature. Once you've done the initial setup, this can be as simple as checking a box. The anonymization service built into screen-scraper is a paid service, and you'll need to sign up for it before making use of it. To do so, please send us a support request. See below for the cost of the anonymization service.

The screen-scraper automatic anonymization service works by sending each HTTP request made in a scraping session through a separate high-speed HTTP proxy server. The end effect of this is that the site you're scraping will see any request you make as coming from one of several different IP addresses, rather than your actual IP address. These HTTP proxy servers are actually virtual machines that get spawned and terminated as you need them. You'll use screen-scraper to either manually or automatically spawn and terminate the proxy servers.

Once you've signed up for the anonymization service you'll be given a password that will be associated with your registered email address. Your password will be entered into the "Anonymous Proxy" section of the "Settings" window:



The anonymous proxy servers will be set up in such a way that they only allow connections from your IP address. This way no one else can use any of the proxies without your authorization. In the "Settings" window you'll want to enter a comma-delimited list of allowed IP addresses for the computers that will be utilizing the service. If you'll be running your anonymized scraping sessions on the same machine (or local network) you're currently on, you can click the "Get the IP address for this computer" button to determine your current IP address. We find that between five and 10 proxy servers are adequate for most situations.

As the proxy servers get spawned and terminated, it's a good idea to establish the maximum number of running proxy servers you'd like to allow. This is done via the "Max running servers" setting. Just below this box you can click the "Refresh" button to see how many HTTP proxies are currently running. Because you pay for proxy servers by the hour, if you don't have your scraping session set up to automatically shut them down at the end, you'll want to use the "Terminate all running proxy servers." button in order to do that.

Aside from these global settings, there are a few settings that apply to each scraping session you'd like to anonymize. You can edit these settings under the "Anonymization" tab of your scraping session. The settings should be self-explanatory.

Once you've configured all of the necessary settings, try running your scraping session to test it out. You'll see messages in the log that indicate what proxy servers are being used, how many have been spawned, etc.

As your anonymous scraping session runs, you'll notice that screen-scraper will automatically regulate the pool of proxy servers. For example, if screen-scraper gets a timed-out connection or a 403 response (forbidden), it will terminate the current proxy server and automatically spawn a new one in its place. This way you will likely always have a complete set of proxy servers, regardless of how frequently the target web site might be blocking your requests. You can also manually report a proxy server as blocked by calling session.currentProxyServerIsBad() in a script.

There is a $150 setup fee for the anonymization service. Beyond that, the charge for each running proxy server is 25 cents per proxy per hour. Once again, to enroll in the service please send us a support request.
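
As a rough worked example of the pricing above: running 5 proxies for 8 hours comes to 5 x 8 x $0.25 = $10.00 in usage, or $160.00 including the one-time setup fee. The sketch below does the same arithmetic in whole cents. The class and method names are hypothetical, and it assumes whole proxy-hours, since how partial hours are billed isn't described here.

```java
// Hypothetical cost estimator based on the prices quoted above.
// Assumes whole proxy-hours; partial-hour billing is not covered here.
final class AnonymizationCost {
    static final long SETUP_FEE_CENTS = 15_000;   // one-time $150 setup fee
    static final long CENTS_PER_PROXY_HOUR = 25;  // 25 cents per proxy per hour

    // Usage charge only, in cents.
    static long usageCents(int proxies, int hours) {
        if (proxies < 0 || hours < 0) {
            throw new IllegalArgumentException("proxies and hours must not be negative");
        }
        return CENTS_PER_PROXY_HOUR * proxies * hours;
    }

    // Usage charge plus the one-time setup fee, in cents.
    static long firstRunCents(int proxies, int hours) {
        return SETUP_FEE_CENTS + usageCents(proxies, hours);
    }

    public static void main(String[] args) {
        System.out.println("Usage (cents): " + usageCents(5, 8));
        System.out.println("With setup fee (cents): " + firstRunCents(5, 8));
    }
}
```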

While the automatic anonymization service provides an excellent way to cloak your IP address, it is still possible that the target web site will block enough of the anonymized IP addresses that the anonymization could fail. Unfortunately we can't make any guarantees that you won't get blocked; however, using the automatic anonymization service dramatically reduces the chances of getting blocked.

Users of the automatic anonymization service must first agree by email to be bound by the Amazon Web Services Customer Agreement (please take special note of section "5.4. Amazon Elastic Compute Cloud (Amazon EC2)" of the agreement, which specifically outlines permitted activities). When using the automatic anonymization method, while the remote web site may not be able to determine your IP address, your activity will still be logged. If you attempt to use the proxy service for any illegal activities, the chances are very good that you will be prosecuted.

Using the anonymization service outside of screen-scraper:

All requests require that you pass your registered email address, which will be determined when you sign up for the anonymization service. This is passed as a URL-encoded string in the URL query string using the key "registered_email". Your password will also be required, which is passed to the server via the "password" parameter.

Each call to the server is done via a GET request. The possible requests are described below.

Note: Expect an average delay of around 20 seconds before receiving a response from the system for each request made.
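
Because of that delay, a client calling the service from outside screen-scraper will probably want a generous request timeout. Here is a minimal sketch using the standard java.net.http client (Java 11 or later). The class name, method names, and the 60-second timeout are assumptions chosen for illustration, not values prescribed by the service.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper for calling the anonymization service over HTTP.
final class ProxyServiceClient {
    // Assumed value: comfortably above the ~20 second average delay.
    static final Duration REQUEST_TIMEOUT = Duration.ofSeconds(60);

    // Builds a GET request with the generous timeout applied.
    static HttpRequest newGet(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(REQUEST_TIMEOUT)
                .GET()
                .build();
    }

    // Sends the request and returns the raw response body as text.
    // Interpreting the body is left to the caller.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(newGet(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Calling ProxyServiceClient.fetch() with one of the request URLs described in this section would block until the service answers or the timeout elapses.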

Update anonymization settings:

https://www.screen-scraper.com/screen-scraper/proxy/update_settings/?registered_email=foo%40bar.com&password=mypass&ip_addresses_allowed_to_connect=123.45.67.89%2C98.75.54.321&max_running_proxies=5

- ip_addresses_allowed_to_connect: This is a URL-encoded comma-delimited list of IP addresses that should be allowed to connect to the HTTP proxy servers.
- max_running_proxies: The maximum number of proxies that should be allowed. It's important that this be set, as lag between terminating and spawning proxies could otherwise cause more proxies than desired to be spawned.
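
If you're building these URLs in code, the parameter values need to be URL-encoded. The following is a minimal sketch that assembles the update request shown above. The class and method names are hypothetical, and it requires Java 10 or later for the URLEncoder overload used. Note that URLEncoder applies form-style encoding, so a space would become "+".

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that assembles request URLs for the anonymization service.
final class ProxyServiceUrls {
    static final String BASE_URL = "https://www.screen-scraper.com/screen-scraper/proxy/";

    // extraPairs holds alternating keys and values, e.g. "num_instances", "5".
    static String build(String action, String email, String password, String... extraPairs) {
        if (extraPairs.length % 2 != 0) {
            throw new IllegalArgumentException("extraPairs must contain key/value pairs");
        }
        StringBuilder url = new StringBuilder(BASE_URL).append(action).append("/?");
        appendParam(url, "registered_email", email);
        url.append('&');
        appendParam(url, "password", password);
        for (int i = 0; i < extraPairs.length; i += 2) {
            url.append('&');
            appendParam(url, extraPairs[i], extraPairs[i + 1]);
        }
        return url.toString();
    }

    // Appends key=value with the value URL-encoded as UTF-8.
    private static void appendParam(StringBuilder url, String key, String value) {
        url.append(key).append('=').append(URLEncoder.encode(value, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(build("update_settings", "foo@bar.com", "mypass",
                "ip_addresses_allowed_to_connect", "123.45.67.89,98.75.54.321",
                "max_running_proxies", "5"));
    }
}
```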

Terminate all proxies:

https://www.screen-scraper.com/screen-scraper/proxy/terminate_instances/?registered_email=foo%40bar.com&password=mypass

Spawn proxies:

https://www.screen-scraper.com/screen-scraper/proxy/spawn_instances/?registered_email=foo%40bar.com&password=mypass&num_instances=5

- num_instances refers to the number of proxies to be spawned.

Get the current number of running proxies:

https://www.screen-scraper.com/screen-scraper/proxy/get_num_proxies/?registered_email=foo%40bar.com&password=mypass

Get a current list of running proxies:

https://www.screen-scraper.com/screen-scraper/proxy/get_current_proxies/?registered_email=foo%40bar.com&password=mypass

Here's an example of what would be returned from this request:

ec2-75-101-238-93.compute-1.amazonaws.com:3128 i-61955e08
ec2-75-131-250-53.compute-1.amazonaws.com:3128 i-6e955e07

Each proxy gets its own line. The host and port are given first, then a space character, then the instance ID. You'll use the instance ID if you want to report a proxy as bad (so that it will be terminated and one will be spawned in its place).
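
The line format just described can be parsed with a few lines of code. This is a minimal sketch, assuming the response body contains only lines in that format. The class name is hypothetical, and it throws an IllegalArgumentException on any line it doesn't recognize rather than guessing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical value type for one line of the get_current_proxies response.
final class RunningProxy {
    final String host;
    final int port;
    final String instanceId;

    RunningProxy(String host, int port, String instanceId) {
        this.host = host;
        this.port = port;
        this.instanceId = instanceId;
    }

    // Parses "host:port instanceId" lines; blank lines are skipped.
    static List<RunningProxy> parseList(String responseBody) {
        List<RunningProxy> proxies = new ArrayList<>();
        for (String line : responseBody.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] parts = trimmed.split("\\s+");
            int colon = parts[0].lastIndexOf(':');
            if (parts.length != 2 || colon < 1) {
                throw new IllegalArgumentException("Unexpected proxy line: " + trimmed);
            }
            String host = parts[0].substring(0, colon);
            int port = Integer.parseInt(parts[0].substring(colon + 1));
            proxies.add(new RunningProxy(host, port, parts[1]));
        }
        return proxies;
    }

    public static void main(String[] args) {
        String sample = "ec2-75-101-238-93.compute-1.amazonaws.com:3128 i-61955e08\n"
                + "ec2-75-131-250-53.compute-1.amazonaws.com:3128 i-6e955e07\n";
        for (RunningProxy proxy : parseList(sample)) {
            System.out.println(proxy.instanceId + " -> " + proxy.host + ":" + proxy.port);
        }
    }
}
```

The instanceId field is the value you would pass as instance_id when reporting a bad proxy.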

Terminate a single proxy and spawn a new one in its place:

https://www.screen-scraper.com/screen-scraper/proxy/report_bad_proxy/?registered_email=foo%40bar.com&password=mypass&instance_id=i-2cfc3541

After terminating a proxy, it will take a minute or two to spawn one in its place. You'll want to query the server periodically in order to refresh your current pool of proxies.

Anonymization Via Manual Proxy Pools

If the automatic anonymization method isn't right for you, the next best alternative might be to manually handle working with screen-scraper's built-in proxy pool. The basic approach involves running a script at the beginning of your scraping session that sets up the pool, then calling session.currentProxyServerIsBad() as you find that proxy servers are getting blocked. In order to use a proxy pool, you'll also need to get a list of anonymous proxy servers. Generally you can turn these up by Googling around a bit.

The best way to demonstrate the use of proxy pools is by an example. So here it is:

import com.screenscraper.util.*;

// Create a new ProxyServerPool object. This object will
// control how screen-scraper interacts with proxy servers.
proxyServerPool = new ProxyServerPool();

// We give the current scraping session a reference to
// the proxy pool. This step should ideally be done right
// after the object is created (as in the previous step).
session.setProxyServerPool( proxyServerPool );

// This tells the pool to populate itself from a file
// containing a list of proxy servers. The format is very
// simple--you should have a proxy server on each line of
// the file, with the host separated from the port by a colon.
// For example:
// one.proxy.com:8888
// two.proxy.com:3128
// 29.283.928.10:8080
// But obviously without the slashes at the beginning.
proxyServerPool.populateFromFile( "proxies.txt" );

// screen-scraper can iterate through all of the proxies to
// ensure they're responsive. This can be a time-consuming
// process unless it's done in a multi-threaded fashion.
// This method call tells screen-scraper to validate up to
// 25 proxies at a time.
proxyServerPool.setNumProxiesToValidateConcurrently( 25 );

// This method call tells screen-scraper to filter the list of
// proxy servers using 7 seconds as a timeout value. That is,
// if a server doesn't respond within 7 seconds, it's deemed
// to be invalid.
proxyServerPool.filter( 7 );

// Once filtering is done, it's often helpful to write the good
// set of proxies out to a file. That way you may not have to
// filter again the next time.
proxyServerPool.writeProxyPoolToFile( "good_proxies.txt" );

// You might also want to write out the list of proxy servers
// to screen-scraper's log.
proxyServerPool.outputProxyServersToLog();

// This is the switch that tells the scraping session to make
// use of the proxy servers. Note that this can be turned on
// and off during the course of the scrape. You may want to
// anonymize some pages, but not others.
session.setUseProxyFromPool( true );

// As a scraping session runs, screen-scraper will filter out
// proxies that become non-responsive. If the number of proxies
// gets down to a specified level, the pool can repopulate
// itself. That's what this method call controls.
proxyServerPool.setRepopulateThreshold( 5 );

That's about all there is to it. Aside from occasionally calling session.currentProxyServerIsBad(), you may also want to call session.setUseProxyFromPool to turn anonymization on and off within the scraping session.

