Anonymization
Overview

Under certain circumstances you may want to anonymize your scraping so that the target site is unable to trace requests back to your IP address. For example, this might be desirable if you're scraping a competitor's site, or if the web site blocks an IP address once it makes too many requests. There are a few different ways to go about this using screen-scraper. In this section we'll go over each one in detail.

Automatic Anonymization

By far the simplest and most effective way to anonymize scraping in screen-scraper is to use the built-in automatic anonymization feature. Once you've done the initial setup, this can be as simple as checking a box.

The anonymization service built into screen-scraper is a paid service, and you'll need to sign up for it before making use of it. To do so, please send us a support request. See below for the cost of the anonymization service.

The screen-scraper automatic anonymization service works by sending each HTTP request made in a scraping session through a separate high-speed HTTP proxy server. The end effect is that the site you're scraping will see any request you make as coming from one of several different IP addresses, rather than from your actual IP address. These HTTP proxy servers are virtual machines that get spawned and terminated as you need them. You'll use screen-scraper to spawn and terminate the proxy servers, either manually or automatically.

Once you've signed up for the anonymization service you'll be given a password that is associated with your registered email address. You enter this password in the "Anonymous Proxy" section of the "Settings" window:
The anonymous proxy servers are set up so that they only allow connections from your IP address. This way no one else can use any of the proxies without your authorization. In the "Settings" window you'll want to enter a comma-delimited list of allowed IP addresses for the computers that will be using the service. If you'll be running your anonymized scraping sessions on the same machine (or local network) you're currently on, you can click the "Get the IP address for this computer" button to determine your current IP address.

We find that between five and 10 proxy servers are adequate for most situations. Because proxy servers get spawned and terminated as a session runs, it's a good idea to set the maximum number of running proxy servers you'd like to allow. This is done via the "Max running servers" setting. Just below this box you can click the "Refresh" button to see how many HTTP proxies are currently running. Because you pay for proxy servers by the hour, if your scraping session isn't set up to shut them down automatically at the end, you'll want to use the "Terminate all running proxy servers." button to do so.

Aside from these global settings, there are a few settings that apply to each scraping session you'd like to anonymize. You can edit these settings under the "Anonymization" tab of your scraping session. The settings should be self-explanatory.

Once you've configured all of the necessary settings, try running your scraping session to test it out. You'll see messages in the log that indicate which proxy servers are being used, how many have been spawned, and so on.

As your anonymous scraping session runs, you'll notice that screen-scraper automatically regulates the pool of proxy servers. For example, if screen-scraper gets a timed-out connection or a 403 (Forbidden) response, it will terminate the current proxy server and automatically spawn a new one in its place.
This way you will likely always have a complete set of proxy servers, regardless of how frequently the target web site might be blocking your requests. You can also manually report a proxy server as blocked by calling session.currentProxyServerIsBad() in a script.

There is a $150 setup fee for the anonymization service. Beyond that, each running proxy server costs 25 cents per hour. Once again, to enroll in the service please send us a support request.

While the automatic anonymization service provides an excellent way to cloak your IP address, it is still possible that the target web site will block enough of the anonymized IP addresses that the anonymization could fail. Unfortunately we can't make any guarantees that you won't get blocked; however, by using the automatic anonymization service the chances of getting blocked are reduced dramatically.

Users of the automatic anonymization service must first agree by email to be bound by the Amazon Web Services Customer Agreement (please take special note of section "5.4. Amazon Elastic Compute Cloud (Amazon EC2)" of the agreement, which specifically outlines permitted activities).

When using the automatic anonymization method, while the remote web site may not be able to determine your IP address, your activity will still be logged. If you attempt to use the proxy service for any illegal activities, the chances are very good that you will be prosecuted.

Using the anonymization service outside of screen-scraper:

All requests require that you pass your registered email address, which is established when you sign up for the anonymization service. It is passed as a URL-encoded string in the URL query string using the key "registered_email". Your password is also required, and is passed to the server via the "password" parameter. Each call to the server is made via a GET request. The possible requests are described below.
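As a minimal sketch of how such a request URL could be assembled outside of screen-scraper, the Python below URL-encodes the "registered_email" and "password" parameters and appends any extra parameters a request needs. The helper name build_request_url and the BASE_URL constant are hypothetical names chosen for illustration; the base path and the request name used in the example are taken from the URLs listed below.

```python
from urllib.parse import urlencode

# Base path taken from the request URLs listed in this section.
BASE_URL = "https://www.screen-scraper.com/screen-scraper/proxy"


def build_request_url(action, registered_email, password, **params):
    """Return the GET URL for one call (hypothetical helper, not part of screen-scraper).

    `action` is the request name from the URL path, e.g. "get_num_proxies".
    Extra keyword arguments become additional query-string parameters.
    """
    query = {"registered_email": registered_email, "password": password}
    query.update(params)
    # urlencode percent-encodes each value, so "@" becomes %40 and "," becomes %2C.
    return f"{BASE_URL}/{action}/?{urlencode(query)}"


if __name__ == "__main__":
    print(build_request_url("get_num_proxies", "foo@bar.com", "mypass"))
```

The resulting string can be fetched with any HTTP client that issues a GET request. Keep in mind that the password travels in the query string, so it may end up in logs or shell history on the calling machine.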
Note: Expect an average delay of around 20 seconds before receiving a response from the system for each request made.

Update anonymization settings:
https://www.screen-scraper.com/screen-scraper/proxy/update_settings/?registered_email=foo%40bar.com&password=mypass&ip_addresses_allowed_to_connect=123.45.67.89%2C98.75.54.321&max_running_proxies=5
- ip_addresses_allowed_to_connect: A URL-encoded, comma-delimited list of IP addresses that should be allowed to connect to the HTTP proxy servers.
- max_running_proxies: The maximum number of proxies that should be allowed. It's important that this be set, as lag between terminating and spawning proxies could otherwise cause more proxies than desired to be spawned.

Terminate all proxies:
https://www.screen-scraper.com/screen-scraper/proxy/terminate_instances/?registered_email=foo%40bar.com&password=mypass

Spawn proxies:
https://www.screen-scraper.com/screen-scraper/proxy/spawn_instances/?registered_email=foo%40bar.com&password=mypass&num_instances=5
- num_instances: The number of proxies to be spawned.

Get the current number of running proxies:
https://www.screen-scraper.com/screen-scraper/proxy/get_num_proxies/?registered_email=foo%40bar.com&password=mypass

Get a current list of running proxies:
https://www.screen-scraper.com/screen-scraper/proxy/get_current_proxies/?registered_email=foo%40bar.com&password=mypass

Here's an example of what would be returned from this request:
ec2-75-101-238-93.compute-1.amazonaws.com:3128 i-61955e08
ec2-75-131-250-53.compute-1.amazonaws.com:3128 i-6e955e07

Each proxy gets its own line. The host and port are given first, then a space character, then the instance ID. You'll use the instance ID if you want to report a proxy as bad (so that it will be terminated and one will be spawned in its place).
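The response format described above (one proxy per line, host and port, a space, then the instance ID) is simple enough to parse by hand. The following Python is a sketch under that assumption only; the names RunningProxy and parse_proxy_listing are hypothetical, and the sample text is the example response shown above.

```python
from typing import List, NamedTuple


class RunningProxy(NamedTuple):
    """One line of the proxy listing (hypothetical container type)."""
    host: str
    port: int
    instance_id: str


def parse_proxy_listing(body: str) -> List[RunningProxy]:
    """Parse 'host:port instance-id' lines into RunningProxy records."""
    proxies = []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the response
        # Host and port come first, then whitespace, then the instance ID.
        address, instance_id = line.split(None, 1)
        host, port = address.rsplit(":", 1)
        proxies.append(RunningProxy(host, int(port), instance_id.strip()))
    return proxies


SAMPLE = (
    "ec2-75-101-238-93.compute-1.amazonaws.com:3128 i-61955e08\n"
    "ec2-75-131-250-53.compute-1.amazonaws.com:3128 i-6e955e07\n"
)

if __name__ == "__main__":
    for proxy in parse_proxy_listing(SAMPLE):
        print(proxy.host, proxy.port, proxy.instance_id)
```

Keeping the instance ID alongside the host and port means the value needed to report a bad proxy is at hand whenever a request through that proxy fails.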
Terminate a single proxy and spawn a new one in its place:
https://www.screen-scraper.com/screen-scraper/proxy/report_bad_proxy/?registered_email=foo%40bar.com&password=mypass&instance_id=i-2cfc3541

After terminating a proxy, it will take a minute or two to spawn one in its place. You'll want to query the server periodically in order to refresh your current pool of proxies.

Anonymization Via Manual Proxy Pools

If the automatic anonymization method isn't right for you, the next best alternative might be to work with screen-scraper's built-in proxy pool manually. The basic approach involves running a script at the beginning of your scraping session that sets up the pool, then calling session.currentProxyServerIsBad() as you find that proxy servers are getting blocked. In order to use a proxy pool, you'll also need to get a list of anonymous proxy servers. Generally you can turn these up by searching the web a bit.

The best way to demonstrate the use of proxy pools is by an example. So here it is:
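As a rough illustration of the pooling idea only, the Python sketch below keeps a list of proxies, hands them out in rotation, and drops the current one when it is reported as blocked. It is not screen-scraper's scripting API: the ProxyPool class, its method names, the "host:port" entry format, and the example host names are all assumptions made for this sketch, and current_proxy_is_bad() merely mirrors the role that session.currentProxyServerIsBad() plays in a scraping session.

```python
from typing import Iterable, List, Optional


class ProxyPool:
    """Hypothetical round-robin pool of 'host:port' proxy entries."""

    def __init__(self, proxies: Iterable[str]):
        # Strip whitespace, skip blanks, and drop duplicates while keeping order.
        cleaned = (p.strip() for p in proxies)
        self._proxies: List[str] = list(dict.fromkeys(p for p in cleaned if p))
        self._position = -1                  # index of the proxy handed out last
        self._current: Optional[str] = None  # proxy currently in use, if any

    def __len__(self) -> int:
        return len(self._proxies)

    def next_proxy(self) -> str:
        """Advance to the next proxy in rotation and return it."""
        if not self._proxies:
            raise LookupError("proxy pool is empty")
        self._position = (self._position + 1) % len(self._proxies)
        self._current = self._proxies[self._position]
        return self._current

    def current_proxy_is_bad(self) -> None:
        """Remove the proxy in use, e.g. after a timeout or a 403 response."""
        if self._current is None:
            return  # nothing in use, so nothing to remove
        index = self._proxies.index(self._current)
        del self._proxies[index]
        # Step back so the next call lands on the entry that slid into this slot.
        self._position = index - 1
        self._current = None


if __name__ == "__main__":
    pool = ProxyPool(["one.proxy.com:8888", "two.proxy.com:3128", "three.proxy.com:8080"])
    print(pool.next_proxy())     # first proxy in the list
    print(pool.next_proxy())     # second proxy in the list
    pool.current_proxy_is_bad()  # drop the second proxy
    print(pool.next_proxy(), len(pool))
```

A real pool would also need a way to refill itself once too many entries have been removed, which this sketch leaves out.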
That's about all there is to it. Aside from occasionally calling session.currentProxyServerIsBad(), you may also want to call From here: