The Proxy Server

Proxy Server Overview

Purpose

screen-scraper's proxy server lets you view HTTP requests and responses as they pass between your web browser and remote servers. Scraping a web site requires attention to details you rarely think about when browsing, such as HTTP headers and POST data. The proxy server makes all of these details visible to you.

Description

When running, the proxy server listens on a specified port for incoming HTTP requests from your web browser. Upon receiving a request the proxy server records it, then sends it along to the server it was intended for. When that server responds the response is sent first to the proxy server, which, once again, makes a record of it, then sends it along to your web browser.
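
The record-and-forward cycle described above can be sketched in a few lines. This is a minimal in-memory model, not screen-scraper's implementation: the class RecordingProxySketch, its Transaction type, and the function standing in for the remote server are all hypothetical names chosen for illustration, and no real network traffic is involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical in-memory model of a recording proxy; not screen-scraper's code.
class RecordingProxySketch {

    // One request/response pair, as a proxy might log it.
    static final class Transaction {
        final String request;
        final String response;

        Transaction(String request, String response) {
            this.request = request;
            this.response = response;
        }
    }

    // Stand-in for the remote server: maps a request to a response.
    private final Function<String, String> remoteServer;
    private final List<Transaction> log = new ArrayList<>();

    RecordingProxySketch(Function<String, String> remoteServer) {
        this.remoteServer = remoteServer;
    }

    // Forwards the request to the remote server, logs the
    // request/response pair, and returns the response to the caller.
    String handle(String request) {
        String response = remoteServer.apply(request);
        log.add(new Transaction(request, response));
        return response;
    }

    List<Transaction> transactions() {
        return log;
    }

    public static void main(String[] args) {
        RecordingProxySketch proxy =
                new RecordingProxySketch(req -> "HTTP/1.1 200 OK (reply to: " + req + ")");
        String reply = proxy.handle("GET /index.html HTTP/1.1");
        System.out.println(reply);
        System.out.println("Logged transactions: " + proxy.transactions().size());
    }
}
```

The browser only ever talks to the proxy, which is why every request and response can be recorded on the way through.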

Viewing HTTPS requests

Often one of the headaches of scraping information from sites that use HTTPS is that it's not always easy to tell what's getting passed back and forth in the way of cookies, POST data, etc. Even if you put a proxy server in the way that lets you view the requests and responses, the information is encrypted as it leaves your browser and as it leaves the web server that responds to the request. screen-scraper gets around this problem by using its own temporary certificate to encrypt traffic between itself and the browser, then encrypting each request again before sending it on to the server. As a result, your browser will issue a warning about the certificate that screen-scraper returned. You can safely accept the certificate and be assured that all your traffic is encrypted.

Running the proxy in server mode

screen-scraper can act as a proxy while in server mode. Combined with the ability to execute scripts, this functionality opens up many new possibilities for how you use screen-scraper, including setting up blacklists and application integration. To learn how to configure screen-scraper for this, see the settings documentation and look for "Default proxy session to use when running in server mode".


Using the Proxy Server

Configuring the proxy server

First you will need to create a proxy session, which is really just a way to organize your interactions with specific web sites. Typically you'll have a proxy session for each site you want to scrape. Create a new proxy session by clicking on the New Proxy Session button (looks like a globe) or by selecting "File->New Proxy Session" from the menu.

A proxy session has three settings: its name, its port, and whether the proxy server should log binary files such as images. Typically you would name the session after the site that you are accessing, set the port to 8777, and select "Don't log binary files".

Configuring your web browser

Configuring a web browser to use a proxy server is generally pretty straightforward, but varies somewhat for each browser. For more detailed instructions on setting up your specific browser to use a proxy try one of the links at the bottom of this screen.
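
The same host-and-port setting applies to any HTTP client, not only browsers. As a minimal sketch, assuming the proxy session runs on the same machine and listens on port 8777 (the typical port mentioned above), a Java program could describe the proxy with the standard java.net.Proxy class. The class name ProxyClientSketch and the method localProxy are illustrative, and the host and port are assumptions you would adjust to match your own proxy session.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

// Illustrative only: shows how a Java client could be pointed at a local proxy.
class ProxyClientSketch {

    // Builds a java.net.Proxy describing an HTTP proxy at the given host and port.
    // The address is left unresolved, so no DNS lookup happens here.
    static Proxy localProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    public static void main(String[] args) {
        // Assumed values: a proxy session on this machine, listening on port 8777.
        Proxy proxy = localProxy("localhost", 8777);
        System.out.println("Using proxy: " + proxy);

        // To send a request through it (requires the proxy to be running):
        // new java.net.URL("http://example.com/").openConnection(proxy).getInputStream();
    }
}
```

Requests opened with this Proxy object would pass through the proxy session and show up alongside browser traffic.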

Running the proxy server

Assuming you've configured everything and set up a proxy session, from here you should be able to start up the proxy server by selecting the proxy session in the tree on the left, then clicking on the "Start Proxy Server" button. Now just surf away.

Viewing requests and responses

After you've surfed a bit with your web browser click on the "Progress" tab. From here you can view all of the HTTP and HTTPS requests and responses logged by the proxy server. The upper pane lists all of the transactions (a request/response combination). Clicking on a transaction brings up its details in the lower pane. You can delete transactions by selecting them and either hitting the "Delete" key or by right-clicking them (or Option-clicking on Mac OS X) and selecting "Delete".

The proxy server log

Currently the proxy server just logs very basic information about its activity, and probably isn't of much interest.

Viewing encrypted transactions

In Internet Explorer 7 you have to adjust your security settings. In Tools > Internet Options under the security tab slide the security level to medium. When accessing a site that uses HTTPS encryption you will encounter a browser warning that looks like this:



IE domain mismatch warning

This warning occurs because screen-scraper is using a temporary certificate for encryption that will not match the URL that you are accessing. You can safely ignore this warning by clicking "Continue to this website (not recommended)".

Currently Firefox 3 will not allow you to navigate to a page with a certificate/domain mismatch. We recommend using Opera 9.5 for SSL proxy sessions.

Using an external proxy server

If you normally use an external proxy server when connecting to the Internet (on your local area network, for example), you'll need to set another property within screen-scraper. View the settings screen by selecting Options->Settings from the menu. On the "External Proxy" tab you'll notice a series of boxes toward the bottom that allow you to set parameters related to your proxy server. It should be relatively self-explanatory what needs to be designated. If you happen to be using NTLM (Windows NT) authentication you'll need to designate settings for both the "standard" proxy as well as the NTLM one.


Setting up specific browsers to use a proxy server:

Vista Users

Please see our note on using the Proxy server within Vista.

Using Scripts with the Proxy Server

Overview

screen-scraper can run custom scripts while the proxy server is running. This lets you harness the full power of the scripting environment, just as you can in scraping sessions. It is recommended that you read "Using scripts" before continuing, since many of the concepts apply to invoking scripts in the proxy server environment.

Using the scripts

Scripts are added to a proxy session by selecting the proxy session in the tree view, then selecting the "Scripts" tab and clicking on the "Add Script" button. A script will then be added to the scripts table. Click on the script name and select the script that you want to run. The options "Sequence", "When to Run" and "Enabled" function similarly to other places in screen-scraper where scripts can be invoked. In the proxy server environment the "When to Run" options specify when in the proxy cycle the script will be invoked. Depending on when you decide to run your script, certain built-in objects that are unique to the proxy environment will be in scope.

Built-in objects

screen-scraper offers a few objects that you can work with in a script in the proxy environment. See the "Variable scope" section (following this one) for more details.

  • proxySession. This variable allows for interaction with the currently running proxy session. It has the following methods:
    • getVariable( String identifier ). Retrieves the value of a saved proxy session variable designated by identifier.
      example: cityCode = proxySession.getVariable( "CITY_CODE" );
    • setVariable( String identifier, Object value ). Designates that value should be saved for the duration of the proxy session. It can later be retrieved by passing identifier to the getVariable method.
      example: proxySession.setVariable( "CITY_CODE", dataSet.get( 0, "CITY_CODE" ) );
    • log( String message ). Causes message to be written to the "Log" panel for the currently running proxy session.
      example: proxySession.log( "Inserting request parameters into the database." );
  • request. This variable allows for interaction with the currently received HTTP request. It has the following methods:

    • addPOSTParameter( String key, String value ). Adds a POST parameter to the request.
      example: request.addPOSTParameter( "selectedState" , "Alaska");
    • removePOSTParameter( String key ). Removes a POST parameter from the request designated by key.
      example: request.removePOSTParameter( "selectedState" );
    • getURLAsString(). Returns the requested URL.
      example: url = request.getURLAsString();
    • setRequestLine(String requestMethod, String url, String httpVersion).
      Sets the complete request line for the HTTP request. The url string must be a valid uri.
      example: request.setRequestLine( "GET" , "http://somesite.com/somepage.html", "HTTP/1.1");
    • addHTTPHeader(String key, String value).
      Adds an HTTP header to the request.
      example: request.addHTTPHeader( "Cookie" , "someCookieValue");
    • removeHTTPHeader(String key, String value).
      Removes an HTTP header from the request. A key and value must be supplied since requests can have headers with duplicate keys.
      example: request.removeHTTPHeader( "Cookie" , "someCookieValue");
  • response. This variable allows for interaction with the currently received HTTP response. It has the following methods:

    • getStatusLine(). Gets the status line returned from the server.
      example: statusLine = response.getStatusLine();
    • setStatusLine( String statusLine ). Sets the status line of the response.
      example: response.setStatusLine( "HTTP/1.1 200 OK" );
    • addHTTPHeader(String key, String value).
      Adds an HTTP header to the response.
      example: response.addHTTPHeader( "Set-Cookie" , "someCookieValue");
    • removeHTTPHeader(String key, String value).
      Removes an HTTP header from the response. A key and value must be supplied since responses can have headers with duplicate keys.
      example: response.removeHTTPHeader( "Set-Cookie" , "someCookieValue");
    • getContentAsString().
      Gets the content of the response.
      example: content = response.getContentAsString();
    • setContentAsString(String content).
      Sets the content of the response.
      example: response.setContentAsString( "<html><head ... </html>");
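
To show how these objects might be combined, here is a minimal sketch of what a "Before HTTP request" script could do: log the requested URL, keep a running count in a proxy session variable, and add a header to the outgoing request. The ProxySession and Request classes below are hypothetical stand-ins that exist only so the example compiles and runs outside screen-scraper; inside screen-scraper the built-in proxySession and request objects would be used instead. The variable name REQUEST_COUNT and the header name X-Proxy-Request-Number are made up, and the sketch assumes getVariable returns null for a variable that has not been set.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ProxyScriptSketch {

    // Hypothetical stand-in for the built-in proxySession object.
    static class ProxySession {
        private final Map<String, Object> variables = new HashMap<>();
        final List<String> logLines = new ArrayList<>();

        Object getVariable(String identifier) { return variables.get(identifier); }
        void setVariable(String identifier, Object value) { variables.put(identifier, value); }
        void log(String message) { logLines.add(message); }
    }

    // Hypothetical stand-in for the built-in request object
    // (covers only the methods this example uses).
    static class Request {
        private final String url;
        final List<String[]> headers = new ArrayList<>();

        Request(String url) { this.url = url; }
        String getURLAsString() { return url; }
        void addHTTPHeader(String key, String value) { headers.add(new String[] { key, value }); }
    }

    // The body of this method is what the script itself might contain.
    static void beforeHttpRequest(ProxySession proxySession, Request request) {
        String url = request.getURLAsString();
        proxySession.log("Request for: " + url);

        // Keep a running count of requests in a proxy session variable.
        // Assumption: an unset variable comes back as null.
        Object previous = proxySession.getVariable("REQUEST_COUNT");
        int count = (previous == null) ? 1 : ((Integer) previous) + 1;
        proxySession.setVariable("REQUEST_COUNT", count);

        // Tag the outgoing request with a made-up header.
        request.addHTTPHeader("X-Proxy-Request-Number", String.valueOf(count));
    }

    public static void main(String[] args) {
        ProxySession proxySession = new ProxySession();
        Request request = new Request("http://somesite.com/somepage.html");
        beforeHttpRequest(proxySession, request);
        System.out.println(proxySession.logLines);
        System.out.println("REQUEST_COUNT = " + proxySession.getVariable("REQUEST_COUNT"));
    }
}
```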

Variable scope

Depending on when a script gets run different variables may be in scope. The table that follows specifies what variables will be in scope depending on when a given script is run.

When Script is Run           proxySession in scope   request in scope   response in scope
Beginning of proxy session   X
Before HTTP request          X                       X
After HTTP request           X                       X
Before HTTP response         X                       X                  X
After HTTP response          X                       X                  X
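
The scope rules can also be written down as a small lookup, which may help when deciding where a script belongs. This sketch reads the table as follows: proxySession is always in scope, request is in scope from "Before HTTP request" onward, and response is in scope only for the two response events. The enum and method names are hypothetical and are not part of screen-scraper.

```java
import java.util.EnumSet;
import java.util.Set;

class ScriptScopeSketch {

    // The three built-in objects described above.
    enum BuiltIn { PROXY_SESSION, REQUEST, RESPONSE }

    // Each "When to Run" option paired with the objects assumed to be in scope.
    enum WhenToRun {
        BEGINNING_OF_PROXY_SESSION(EnumSet.of(BuiltIn.PROXY_SESSION)),
        BEFORE_HTTP_REQUEST(EnumSet.of(BuiltIn.PROXY_SESSION, BuiltIn.REQUEST)),
        AFTER_HTTP_REQUEST(EnumSet.of(BuiltIn.PROXY_SESSION, BuiltIn.REQUEST)),
        BEFORE_HTTP_RESPONSE(EnumSet.allOf(BuiltIn.class)),
        AFTER_HTTP_RESPONSE(EnumSet.allOf(BuiltIn.class));

        private final Set<BuiltIn> inScope;

        WhenToRun(Set<BuiltIn> inScope) {
            this.inScope = inScope;
        }

        // True if the given built-in object is available at this point in the cycle.
        boolean hasInScope(BuiltIn object) {
            return inScope.contains(object);
        }
    }

    public static void main(String[] args) {
        // Print the availability of each object for every "When to Run" option.
        for (WhenToRun when : WhenToRun.values()) {
            for (BuiltIn object : BuiltIn.values()) {
                System.out.println(when + " / " + object + ": " + when.hasInScope(object));
            }
        }
    }
}
```

In practice this means a script that needs the response content has to run before or after the HTTP response, not at the request stage.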

Debugging scripts

One of the best ways to fix errors is to simply watch the proxy session log (under the "Log" tab) and the "error.log" file (located in the "log" directory where screen-scraper was installed) for script errors. When a problem arises in executing a script screen-scraper will output a series of error-related statements to the logs. Often a good approach in debugging is to build your script bit by bit, running it frequently to ensure that it runs without errors as you add each piece.

