Miscellaneous

Anonymization

Overview

Under certain circumstances you may want to anonymize your scraping so that the target site is unable to trace back your IP address. For example, this might be desirable if you're scraping a competitor's site, or if the web site is blocking too many requests from a given IP address.

There are a few different ways to go about this using screen-scraper. In this section we'll go over each one in detail.

Automatic Anonymization

By far the simplest and most effective way to anonymize scraping in screen-scraper is to use the built-in automatic anonymization feature. Once you've done the initial setup, this can be as simple as checking a box. The anonymization service built into screen-scraper is a paid service, and you'll need to sign up for it before you can use it. To do so, please send us a support request. See below for the cost of the anonymization service.

The screen-scraper automatic anonymization service works by sending each HTTP request made in a scraping session through a separate high-speed HTTP proxy server. The end effect of this is that the site you're scraping will see any request you make as coming from one of several different IP addresses, rather than your actual IP address. These HTTP proxy servers are actually virtual machines that get spawned and terminated as you need them. You'll use screen-scraper to either manually or automatically spawn and terminate the proxy servers.

Once you've signed up for the anonymization service you'll be given a password that will be associated with your registered email address. Your password will be entered into the "Anonymous Proxy" section of the "Settings" window:



The anonymous proxy servers will be set up in such a way that they only allow connections from your IP address. This way no one else can use any of the proxies without your authorization. In the "Settings" window you'll want to enter a comma-delimited list of allowed IP addresses for the computers that will be utilizing the service. If you'll be running your anonymized scraping sessions on the same machine (or local network) you're currently on, you can click the "Get the IP address for this computer" button to determine your current IP address. We find that between five and 10 proxy servers are adequate for most situations.

As the proxy servers get spawned and terminated, it's a good idea to establish the maximum number of running proxy servers you'd like to allow. This is done via the "Max running servers" setting. Just below this box you can click the "Refresh" button to see how many HTTP proxies are currently running. Because you pay for proxy servers by the hour, if you don't have your scraping session set up to automatically shut them down at the end, you'll want to use the "Terminate all running proxy servers." button in order to do that.

Aside from these global settings, there are a few settings that apply to each scraping session you'd like to anonymize. You can edit these settings under the "Anonymization" tab of your scraping session. The settings should be self-explanatory.

Once you've configured all of the necessary settings, try running your scraping session to test it out. You'll see messages in the log that indicate what proxy servers are being used, how many have been spawned, etc.

As your anonymous scraping session runs, you'll notice that screen-scraper will automatically regulate the pool of proxy servers. For example, if screen-scraper gets a timed-out connection or a 403 response (forbidden), it will terminate the current proxy server and automatically spawn a new one in its place. This way you will likely always have a complete set of proxy servers, regardless of how frequently the target web site might be blocking your requests. You can also manually report a proxy server as blocked by calling session.currentProxyServerIsBad() in a script.

There is a $150 setup fee for the anonymization service. Beyond that, the charge for each running proxy server is 25 cents per proxy per hour. Once again, to enroll in the service please send us a support request.
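To get a feel for what a run might cost, here's a rough worked example based on the prices above. The class and method names are made up for illustration, and the sketch assumes you're billed for exactly the hours you pass in; how partial hours are billed isn't covered here.

```java
final class AnonymizationCost {
    // Prices taken from the paragraph above.
    static final double SETUP_FEE_USD = 150.00;
    static final double RATE_PER_PROXY_HOUR_USD = 0.25;

    /** Hourly charges only: 25 cents per proxy per hour. */
    static double usageCost(int proxies, double hours) {
        return RATE_PER_PROXY_HOUR_USD * proxies * hours;
    }

    /** Hourly charges plus the one-time setup fee. */
    static double firstRunCost(int proxies, double hours) {
        return SETUP_FEE_USD + usageCost(proxies, hours);
    }

    public static void main(String[] args) {
        // Five proxies running for ten hours.
        System.out.println("Usage: $" + usageCost(5, 10));
        System.out.println("Including setup fee: $" + firstRunCost(5, 10));
    }
}
```

Five proxies running for ten hours come to 0.25 x 5 x 10 = $12.50 in hourly charges, or $162.50 once the setup fee is included.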

While the automatic anonymization service provides an excellent way to cloak your IP address it is still possible that the target web site will block enough of the anonymized IP addresses that the anonymization could fail. Unfortunately we can't make any guarantees that you won't get blocked; however, by using the automatic anonymization service the chances of getting blocked are reduced dramatically.

Users of the automatic anonymization service must first agree by email to be bound by Amazon's Amazon Web Services Customer Agreement (please take special note of section "5.4. Amazon Elastic Compute Cloud (Amazon EC2)" of the agreement, which specifically outlines permitted activities). When using the automatic anonymization method, while the remote web site may not be able to determine your IP address, your activity will still be logged. If you attempt to use the proxy service for any illegal activities, the chances are very good that you will be prosecuted.

Using the anonymization service outside of screen-scraper:

All requests require that you pass your registered email address, which will be determined when you sign up for the anonymization service. This is passed as a URL-encoded string in the URL query string using the key "registered_email". Your password will also be required, which is passed to the server via the "password" parameter.

Each call to the server is done via a GET request. The possible requests are described below.

Note: Expect an average delay of around 20 seconds before receiving a response from the system for each request made.

Update anonymization settings:

https://www.screen-scraper.com/screen-scraper/proxy/update_settings/?registered_email=foo%40bar.com&password=mypass&ip_addresses_allowed_to_connect=123.45.67.89%2C98.75.54.321&max_running_proxies=5

- ip_addresses_allowed_to_connect: This is a URL-encoded comma-delimited list of IP addresses that should be allowed to connect to the HTTP proxy servers.
- max_running_proxies: The maximum number of proxies that should be allowed. It's important that this be set, as lag between terminating and spawning proxies could otherwise cause more proxies than desired to be spawned.
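If you're calling the service from your own code, the request URL above can be put together along these lines. This is only a sketch: the ProxyServiceUrls class and its methods are hypothetical names of ours, not part of screen-scraper, and the 60-second timeout is simply a guess that leaves room for the roughly 20-second delay mentioned earlier.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class ProxyServiceUrls {
    static final String BASE = "https://www.screen-scraper.com/screen-scraper/proxy/";

    /** Builds a request URL such as .../update_settings/?key=value&... */
    static String build(String action, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> param : params.entrySet()) {
            query.add(encode(param.getKey()) + "=" + encode(param.getValue()));
        }
        return BASE + action + "/?" + query;
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    /** Sends the GET request. Note that this contacts the live service. */
    static String get(String url) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(60)) // assumed value, see the note above
                .GET()
                .build();
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("registered_email", "foo@bar.com");
        params.put("password", "mypass");
        params.put("ip_addresses_allowed_to_connect", "123.45.67.89,98.75.54.321");
        params.put("max_running_proxies", "5");
        // Only prints the URL; call get(...) to actually send it.
        System.out.println(build("update_settings", params));
    }
}
```

The same build method should work for the other requests described in this section by swapping the action name and parameters.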

Terminate all proxies:

https://www.screen-scraper.com/screen-scraper/proxy/terminate_instances/?registered_email=foo%40bar.com&password=mypass

Spawn proxies:

https://www.screen-scraper.com/screen-scraper/proxy/spawn_instances/?registered_email=foo%40bar.com&password=mypass&num_instances=5

- num_instances: The number of proxies to be spawned.

Get the current number of running proxies:

https://www.screen-scraper.com/screen-scraper/proxy/get_num_proxies/?registered_email=foo%40bar.com&password=mypass

Get a current list of running proxies:

https://www.screen-scraper.com/screen-scraper/proxy/get_current_proxies/?registered_email=foo%40bar.com&password=mypass

Here's an example of what would be returned from this request:
ec2-75-101-238-93.compute-1.amazonaws.com:3128 i-61955e08
ec2-75-131-250-53.compute-1.amazonaws.com:3128 i-6e955e07

Each proxy gets its own line. The host and port are given first, then a space character, then the instance ID. You'll use the instance ID if you want to report a proxy as bad (so that it will be terminated and one will be spawned in its place).
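A response in that format could be parsed with something like the following. It's a minimal sketch that assumes the body looks exactly like the example above; the RunningProxies class and Proxy record are names we've invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class RunningProxies {
    /** One line of the response: host, port, and instance ID. */
    record Proxy(String host, int port, String instanceId) {}

    static List<Proxy> parse(String body) {
        List<Proxy> proxies = new ArrayList<>();
        for (String line : body.split("\\R")) {
            line = line.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            // "host:port", then whitespace, then the instance ID.
            String[] parts = line.split("\\s+");
            int colon = parts[0].lastIndexOf(':');
            if (parts.length != 2 || colon < 1) {
                throw new IllegalArgumentException("Unexpected line: " + line);
            }
            String host = parts[0].substring(0, colon);
            int port = Integer.parseInt(parts[0].substring(colon + 1));
            proxies.add(new Proxy(host, port, parts[1]));
        }
        return proxies;
    }

    public static void main(String[] args) {
        String body = "ec2-75-101-238-93.compute-1.amazonaws.com:3128 i-61955e08\n"
                + "ec2-75-131-250-53.compute-1.amazonaws.com:3128 i-6e955e07\n";
        for (Proxy proxy : parse(body)) {
            System.out.println(proxy.instanceId() + " -> " + proxy.host() + ":" + proxy.port());
        }
    }
}
```

Keeping the instance ID alongside the host and port means you have it on hand if you later need to report that proxy as bad.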

Terminate a single proxy and spawn a new one in its place:

https://www.screen-scraper.com/screen-scraper/proxy/report_bad_proxy/?registered_email=foo%40bar.com&password=mypass&instance_id=i-2cfc3541

After terminating a proxy, it will take a minute or two to spawn one in its place. You'll want to query the server periodically in order to refresh your current pool of proxies.

Anonymization Via Manual Proxy Pools

If the automatic anonymization method isn't right for you, the next best alternative might be to work with screen-scraper's built-in proxy pool manually. The basic approach involves running a script at the beginning of your scraping session that sets up the pool, then calling session.currentProxyServerIsBad() as you find that proxy servers are getting blocked. In order to use a proxy pool, you'll also need to get a list of anonymous proxy servers. Generally you can turn these up by Googling around a bit.

The best way to demonstrate the use of proxy pools is by an example. So here it is:

import com.screenscraper.util.*;

// Create a new ProxyServerPool object. This object will
// control how screen-scraper interacts with proxy servers.
proxyServerPool = new ProxyServerPool();

// We give the current scraping session a reference to
// the proxy pool. This step should ideally be done right
// after the object is created (as in the previous step).
session.setProxyServerPool( proxyServerPool );

// This tells the pool to populate itself from a file
// containing a list of proxy servers. The format is very
// simple--you should have a proxy server on each line of
// the file, with the host separated from the port by a colon.
// For example:
// one.proxy.com:8888
// two.proxy.com:3128
// 29.283.928.10:8080
// But obviously without the slashes at the beginning.
proxyServerPool.populateFromFile( "proxies.txt" );

// screen-scraper can iterate through all of the proxies to
// ensure they're responsive. This can be a time-consuming
// process unless it's done in a multi-threaded fashion.
// This method call tells screen-scraper to validate up to
// 25 proxies at a time.
proxyServerPool.setNumProxiesToValidateConcurrently( 25 );

// This method call tells screen-scraper to filter the list of
// proxy servers using 7 seconds as a timeout value. That is,
// if a server doesn't respond within 7 seconds, it's deemed
// to be invalid.
proxyServerPool.filter( 7 );

// Once filtering is done, it's often helpful to write the good
// set of proxies out to a file. That way you may not have to
// filter again the next time.
proxyServerPool.writeProxyPoolToFile( "good_proxies.txt" );

// You might also want to write out the list of proxy servers
// to screen-scraper's log.
proxyServerPool.outputProxyServersToLog();

// This is the switch that tells the scraping session to make
// use of the proxy servers. Note that this can be turned on
// and off during the course of the scrape. You may want to
// anonymize some pages, but not others.
session.setUseProxyFromPool( true );

// As a scraping session runs, screen-scraper will filter out
// proxies that become non-responsive. If the number of proxies
// gets down to a specified level, screen-scraper can repopulate
// itself. That's what this method call controls.
proxyServerPool.setRepopulateThreshold( 5 );

That's about all there is to it. Aside from occasionally calling session.currentProxyServerIsBad(), you may also want to call session.setUseProxyFromPool() to turn anonymization on and off within the scraping session.


How HTTP Works

Hypertext transfer protocol provides a way for clients such as web browsers to communicate with web servers. There's quite a bit on the web that's written on the topic, so for the time being we'll just provide some good links for you.

Importing and Exporting Objects

Overview

Scraping sessions and scripts can be exported from screen-scraper to external files. You might consider doing this in order to back up your work, but the principal purpose is to allow for examples to be exported in order to help others learn how to use screen-scraper.

Exporting Objects from screen-scraper

In order to export a scraping session or script to an external file, simply select the object you wish to export, then click on the corresponding "Export" button. You'll be asked to save the file to a location of your choice. You're also free to name the file what you wish, though we recommend you leave the "(scraping session)" or "(script)" portion of the name intact so that you can identify the type of the object later on.

Note that when exporting a scraping session all scripts referenced from that scraping session will be exported along with it. This does not include, however, scripts used to invoke the scraping session.

Importing Objects into screen-scraper

To import a scraping session or script into screen-scraper select the "Import..." option from the "File" menu. Locate the XML file corresponding to the object you wish to import, and select "Open". If you've selected a valid file the objects contained within that file will be imported into the application.

You can also import exported scraping sessions and scripts into screen-scraper by copying them into the "import" folder you'll find in the directory where screen-scraper was installed. This can be especially useful while screen-scraper is running as a server, which allows the objects to be imported on the fly (that is, without stopping the server). screen-scraper will check this directory just before executing a scraping session, and import any files found in it. Note that imported files will be removed from the import folder once they are imported by screen-scraper.

In cases where you want to pack up scraping sessions and scripts along with other files needed to run a scrape, you can compress them all into an "update.zip" file. This file should replicate the directory structure of screen-scraper. For example, you might have a folder called "import" that contains a scraping session. You might also have a CSV file in the root of the zip file that contains parameters needed to run the scraping session. You can zip all of these up into an "update.zip" file, then place that file inside an "update" folder found in screen-scraper's folder. When screen-scraper starts up it will unzip the file, copy all of its contents to the corresponding locations, then delete the "update.zip" file.
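If you'd rather build the "update.zip" file from code than by hand, a sketch along these lines shows the layout described above. The file names are hypothetical placeholders, and UpdateZipBuilder is a helper name of ours rather than anything shipped with screen-scraper.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class UpdateZipBuilder {
    /** Zips the given files; keys are paths relative to screen-scraper's folder. */
    static byte[] build(Map<String, byte[]> files) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> file : files.entrySet()) {
                zip.putNextEntry(new ZipEntry(file.getKey()));
                zip.write(file.getValue());
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Lists the entry names in a zip, handy for checking the layout. */
    static List<String> entryNames(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        // Hypothetical exported scraping session, destined for the "import" folder.
        files.put("import/Example (scraping session).xml",
                "<placeholder/>".getBytes(StandardCharsets.UTF_8));
        // Hypothetical parameter file, destined for the root folder.
        files.put("parameters.csv", "SEARCH\nbug\n".getBytes(StandardCharsets.UTF_8));

        byte[] zip = build(files);
        System.out.println(entryNames(zip));
        // Writes to the working directory; move the result into the "update" folder.
        Files.write(Path.of("update.zip"), zip);
    }
}
```

The entry paths mirror where each file should end up relative to screen-scraper's own folder, which is what lets the contents be copied to the corresponding locations at startup.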


Generating RSS and Atom Feeds

Overview

screen-scraper has the ability to automatically generate RSS and Atom feeds from extracted data. If you're unfamiliar with RSS and Atom feeds you might take a minute to read up on the topic first. In order to use the RSS/Atom functionality you need to be using the Enterprise Edition of screen-scraper.

The documentation on this page is a bit abstract. If you're interested in building RSS/Atom feeds with screen-scraper it would probably be a good idea for you to go through our Sixth Tutorial, which will walk you through the process in detail.

How it Works

A small web server runs within screen-scraper that interacts with the scraping engine. As such, you can access a URL within a browser or RSS/Atom reader that will cause screen-scraper to invoke a scraping session, then return an RSS or Atom feed.

The basic syntax for the URL you'll use to generate a feed looks like this:

http://(host:port)/ss/xmlfeed?scraping_session=(scraping session name)[&key1=value1&key2=value2...]

For example, if you were running screen-scraper on your local machine, and wanted to generate a feed for the "Shopping Site" example used in our tutorials with the search term "bug" the URL would look like this:

http://localhost/ss/xmlfeed?scraping_session=Shopping+Site&SEARCH=bug

As with any other URL, each of the parameters must be properly URL-encoded. Key/value pairs can also be passed in as POST parameters.

The only required parameter is "scraping_session". screen-scraper will create session variables out of any other parameters that get passed in.
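Because each parameter has to be URL-encoded, it can be easier to build the feed URL in code than by hand. Here's a small sketch; the FeedUrl class is a hypothetical helper of ours, and the host value is whatever host (and port, if any) your screen-scraper server is listening on.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class FeedUrl {
    /**
     * Builds the xmlfeed URL. Every entry in sessionVariables becomes an
     * extra key/value pair after the required scraping_session parameter.
     */
    static String build(String hostAndPort, String scrapingSession,
                        Map<String, String> sessionVariables) {
        StringBuilder url = new StringBuilder("http://")
                .append(hostAndPort)
                .append("/ss/xmlfeed?scraping_session=")
                .append(encode(scrapingSession));
        for (Map.Entry<String, String> variable : sessionVariables.entrySet()) {
            url.append('&').append(encode(variable.getKey()))
               .append('=').append(encode(variable.getValue()));
        }
        return url.toString();
    }

    private static String encode(String value) {
        // URLEncoder turns spaces into "+", matching the example URL above.
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, String> variables = new LinkedHashMap<>();
        variables.put("SEARCH", "bug");
        System.out.println(build("localhost", "Shopping Site", variables));
    }
}
```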

Setting Up the Scraping Session

The scraping session uses the following named session variables to generate the feed. Only XML_FEED is required:

  • XML_FEED_TITLE (optional): A String session variable containing the name that will be used for the entire feed (e.g., "CNN Headlines").
  • XML_FEED_LINK (optional): A String session variable containing the link associated with the feed (e.g., "http://www.cnn.com/").
  • XML_FEED_DESCRIPTION (optional): A String session variable containing the description of the feed (e.g., "The latest news headlines from CNN.com").
  • XML_FEED_FORMAT (optional): A String session variable indicating the format of the feed. Valid values are atom_0.3, rss_0.9, rss_0.91N, rss_0.91U, rss_0.92, rss_0.93, rss_0.94, rss_1.0, and rss_2.0. If omitted, the default value is rss_1.0.
  • XML_FEED (required): This session variable should hold a DataSet consisting of DataRecords that will make up the various feed items (e.g., each news headline). Each DataRecord should contain values using the following names:
      • TITLE: The title of the feed item.
      • LINK: The link of the feed item.
      • DESCRIPTION: The description of the feed item.
      • PUBLISHED_DATE: The published date of the feed item. This should be a Java Date object.

When the XML feed is requested through your browser or reader, screen-scraper will invoke the scraping session named by the "scraping_session" parameter. Once the scraping session completes, screen-scraper will look for a DataSet called "XML_FEED" and iterate over its constituent DataRecord objects, building the feed from them.
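If you set XML_FEED_FORMAT from a script, it may be worth checking the value before the scraping session runs. The sketch below is our own helper, not screen-scraper's code: it falls back to rss_1.0 when nothing is given, as described above, and rejects anything outside the list of valid values. What screen-scraper itself does with an unrecognized value isn't covered here, so the rejection is an assumption on our part.

```java
import java.util.Set;

final class FeedFormat {
    static final String DEFAULT = "rss_1.0";

    // The valid values listed for XML_FEED_FORMAT.
    static final Set<String> VALID = Set.of(
            "atom_0.3", "rss_0.9", "rss_0.91N", "rss_0.91U",
            "rss_0.92", "rss_0.93", "rss_0.94", "rss_1.0", "rss_2.0");

    /** Returns the format to use, applying the default when none is given. */
    static String resolve(String requested) {
        if (requested == null || requested.isEmpty()) {
            return DEFAULT;
        }
        if (!VALID.contains(requested)) {
            throw new IllegalArgumentException("Unknown feed format: " + requested);
        }
        return requested;
    }

    public static void main(String[] args) {
        System.out.println(resolve(null));
        System.out.println(resolve("rss_2.0"));
    }
}
```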


The Web Interface

Overview

The screen-scraper web interface allows you to administer aspects of the scraping process. This includes monitoring running scraping sessions, importing and exporting scraping sessions, and scheduling scraping sessions to be run on a periodic basis.

When screen-scraper is running in server mode, you can access the web interface on your local machine at the following URL: http://localhost:8779/web.htm. If you've changed the "Web/SOAP Server" port in the workbench, you'll need to use the port you designated there. Also, depending on the operating system you're running, instead of "localhost", you may need to use "127.0.0.1" or the IP address of the machine.

The same security settings that apply to accessing the screen-scraper server from a remote application apply also to the web interface. That is, you designate IP addresses of machines allowed to connect. For more information on this, see the "IMPORTANT NOTE" under Running screen-scraper as a server.

The web interface makes use of three primary tabs that allow you to manage various aspects of the scraping process. We'll review each of them in detail.

Home

The "Runnable" Tab

This tab displays all scraping sessions loaded into the current instance. It will display basic information on scraping sessions that are currently running, as well as scraping sessions that have run in the past. It also allows you to start and schedule scraping sessions. By checking boxes on the left side you can start multiple scraping sessions simultaneously. Use the "Refresh" button to see the latest information on each scraping session. The various columns in the table are sortable by clicking the appropriate header. A description of each column follows:

  • Name: The name of the scraping session.
  • Start Time: The date and time the scraping session was last started.
  • Running Time: The amount of time the scraping session has taken to run. In the case of a scraping session that is currently in process, this number will update as you refresh the table.
  • Previous Running Time: The amount of time the scraping session took the last time it ran.
  • Num Records: The number of records the scraping session has extracted. This number is determined as your scraping session invokes the session.addToNumRecordsScraped method.
  • Previous Num Records: The number of records the scraping session extracted the last time it ran.
  • Status: Indicates the current status of the scraping session. Possibilities include "In Process", "Completed", and "Error".
  • Export: Allows you to export the scraping session, just as you would from the workbench.
  • Run Now: Runs the scraping session.
  • Schedule: Allows you to schedule the scraping session to be run. See below for more on scheduling scraping sessions.

The "Run/Running" Tab

This tab displays information on scraping sessions that are either currently running or have run in the past. You can use this table to compare run times, the number of records scraped, and also to monitor scraping session logs. After checking the boxes in the leftmost column you can stop or remove multiple scraping session records. Note that removing records for scraping sessions that have run doesn't remove the scraping sessions themselves, just the records related to the time when they were run. Here's a more detailed description of each of the columns:

  • Name: The name of the scraping session.
  • Start Time: The date and time the scraping session was last started.
  • Running Time: The amount of time the scraping session has taken to run. In the case of a scraping session that is currently in process, this number will update as you refresh the table.
  • Previous Running Time: The amount of time the scraping session took the last time it ran.
  • Num Records: The number of records the scraping session has extracted. This number is determined as your scraping session invokes the session.addToNumRecordsScraped method.
  • Previous Num Records: The number of records the scraping session extracted the last time it ran.
  • Status: Indicates the current status of the scraping session. Possibilities include "In Process", "Completed", and "Error".
  • Error: Indicates whether or not an error has occurred in the scraping session.
  • Error Message: In the event of an error, displays the corresponding message.
  • Peek Log: Pops up a box that allows you to monitor the scraping session log.
  • Stop Scraping: Stops the scraping session if it's currently running.

The "Scheduled" Tab

On this tab you can manage scraping sessions that have been scheduled to be run. See below for more on scheduling scraping sessions. The columns in the table are described below:

  • Scraping Session: The name of the scheduled scraping session.
  • Timeout: The amount of time in minutes the scraping session should be allowed to run. If this value is 0 or a negative number, the scraping session will not time out.
  • Date/Time: The date and time the scraping session is next scheduled to be run.
  • Session Variables: Any session variables that are to be passed to the scraping session when it runs.
  • Disable/Enable: Allows you to temporarily enable or disable the scheduled scraping session. If the scraping session is disabled, it will not run even if it's scheduled to do so.
  • Edit: Pops up a dialog box that allows you to manage the scheduled scraping session.
  • Remove: Removes the scheduled scraping session.

Scheduling Scraping Sessions

Through the web interface, you can schedule scraping sessions to be run at a future time, and set a frequency so that they'll run on an ongoing basis. To schedule a scraping session click the "Schedule" button under the "Runnable" tab. You can also alter the settings for an existing scheduled scraping session under the "Scheduled" tab.

When working with scheduled scraping sessions you'll do so via a separate tabbed dialog box. Each tab and its corresponding parameters are described below.

In the "General" tab, you'll manage the following settings:

  • Scraping Session: The name of the scheduled scraping session.
  • Timeout: The number of minutes the scraping session should be allowed to run. If this value is blank, 0, or negative, the scraping session will not time out.
  • Session Variables: This is a list of session variables that will be passed to the scraping session when it is run.

The "Schedule" tab allows you to set the following:

  • Date: The calendar date when the scraping session is to run next. Click the box to bring up a graphical calendar from which you can select the desired date.
  • Time: The time of day when the scraping session is to run next. This should be a 24-hour (military) time.
  • Repeat Every: Use this to set the frequency with which the scraping session is to run. For example, if you enter "2" into the "Hours" box, the scraping session will run when it is scheduled, then be re-scheduled to run once again two hours from the time it started. If these boxes are left blank, the scraping session will run once and not be re-scheduled.
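The "Repeat Every" rule can be pictured with a few lines of code. This is only an illustration of the behavior described above (the next run is measured from the time the previous run started); the RepeatEvery class is a made-up name and not how screen-scraper implements its scheduler.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.Optional;

final class RepeatEvery {
    /**
     * Returns the next run time, or an empty Optional when no repeat
     * interval is set (the run-once case).
     */
    static Optional<LocalDateTime> nextRun(LocalDateTime startedAt, Duration repeatEvery) {
        if (repeatEvery == null || repeatEvery.isZero() || repeatEvery.isNegative()) {
            return Optional.empty();
        }
        return Optional.of(startedAt.plus(repeatEvery));
    }

    public static void main(String[] args) {
        LocalDateTime started = LocalDateTime.of(2024, 1, 1, 22, 30);
        // "2" in the "Hours" box: the next run is two hours after the start.
        System.out.println(nextRun(started, Duration.ofHours(2)));
        // Boxes left blank: the session is not re-scheduled.
        System.out.println(nextRun(started, null));
    }
}
```

So a scraping session that starts at 22:30 with "2" in the "Hours" box would next be scheduled for 00:30 the following day, no matter how long the run itself took.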

The "Thresholds" tab allows you to set percentages that can be checked when a scraping session is running in order to determine whether or not a scraping session differs significantly from the previous time it was run.

  • Time: The percentage by which the running times of two runs of a scraping session may differ.
  • Record Count: The percentage by which the number of records scraped in two runs of a scraping session may differ.
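One way to think about these thresholds is as a percent-difference check between the previous run and the current one. The sketch below works under that assumption; the exact formula screen-scraper uses isn't documented here, and the RunThresholds class is a name of ours.

```java
final class RunThresholds {
    /** Percent difference of the current value relative to the previous one. */
    static double percentDifference(double previous, double current) {
        if (previous == 0) {
            // No baseline to compare against: identical only if both are zero.
            return current == 0 ? 0 : Double.POSITIVE_INFINITY;
        }
        return Math.abs(current - previous) * 100.0 / previous;
    }

    /** True when the two runs differ by more than the allowed percentage. */
    static boolean exceeds(double previous, double current, double thresholdPercent) {
        return percentDifference(previous, current) > thresholdPercent;
    }

    public static void main(String[] args) {
        // Previous run scraped 1000 records, this one 800, threshold 10%.
        System.out.println(percentDifference(1000, 800));
        System.out.println(exceeds(1000, 800, 10));
    }
}
```

Under this reading, a drop from 1000 records to 800 is a 20% difference, which would exceed a 10% "Record Count" threshold.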

Settings

The "Settings" dialog box simply allows you to set default values for scheduled scraping sessions. See the section above for details on each of the values.

