The Proxy Server
Proxy Server Overview
Purpose
screen-scraper's proxy server allows you to view HTTP requests and responses as they pass between your web browser and remote servers. In scraping files from web sites there are a few more details than you typically worry about when surfing, such as HTTP headers and POST data. The proxy server makes all of these details visible to you.
Description
When running, the proxy server listens on a specified port for incoming HTTP requests from your web browser. Upon receiving a request the proxy server records it, then sends it along to the server it was intended for. When that server responds the response is sent first to the proxy server, which, once again, makes a record of it, then sends it along to your web browser.
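The record-and-forward cycle described above can be sketched as a toy model. This is not screen-scraper's implementation: the class name `RecordingProxySketch`, the `Transaction` type, and the use of a plain function in place of the real network hop are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

/** Toy model of a recording proxy (hypothetical, for illustration only). */
public class RecordingProxySketch {

    /** One request/response pair as the proxy would log it. */
    public static final class Transaction {
        public final String request;
        public String response; // filled in once the remote server has answered

        Transaction(String request) {
            this.request = request;
        }

        @Override
        public String toString() {
            return request + " -> " + response;
        }
    }

    // Stands in for the remote server; a real proxy would open a network connection here.
    private final Function<String, String> remoteServer;
    private final List<Transaction> transactions = new ArrayList<>();

    public RecordingProxySketch(Function<String, String> remoteServer) {
        this.remoteServer = remoteServer;
    }

    /** Handles one browser request and returns the response for the browser. */
    public String handle(String request) {
        Transaction transaction = new Transaction(request);
        transactions.add(transaction);                  // 1. record the request
        String response = remoteServer.apply(request);  // 2. send it on to the intended server
        transaction.response = response;                // 3. record the response
        return response;                                // 4. pass the response back to the browser
    }

    /** Read-only view of everything recorded so far. */
    public List<Transaction> transactions() {
        return Collections.unmodifiableList(transactions);
    }

    public static void main(String[] args) {
        RecordingProxySketch proxy =
                new RecordingProxySketch(request -> "HTTP/1.1 200 OK (reply to: " + request + ")");
        proxy.handle("GET http://somesite.com/somepage.html HTTP/1.1");
        proxy.transactions().forEach(System.out::println);
    }
}
```

Because the proxy sits between browser and server for both halves of the exchange, each logged transaction contains the headers and POST data that the browser normally hides.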
Viewing HTTPS requests
One of the headaches of scraping information from sites that use HTTPS is that it's not always easy to tell what is being passed back and forth in the way of cookies, POST data, etc. Even if you put a proxy server in the way that lets you view the requests and responses, the information is encrypted as it leaves your browser and as it leaves the web server that responds to the request. screen-scraper gets around this problem by using its own temporary certificate to encrypt traffic between itself and the browser, then encrypting each request again before sending it on to the server. As a result, your browser will issue a warning about the certificate that screen-scraper returned. You can safely accept the certificate and be assured that all of your traffic is encrypted.
Running the proxy in server mode
screen-scraper can act as a proxy while running in server mode. Combined with the ability to execute scripts, this opens up many new possibilities for how you use screen-scraper, including blacklists and application integration. To learn how to configure screen-scraper for this, see the settings documentation and look for "Default proxy session to use when running in server mode".
Using the Proxy Server
Configuring the proxy server
First you will need to create a proxy session, which is really just a way to organize your interactions with specific web sites. Typically you'll have one proxy session for each site you want to scrape. Create a new proxy session by clicking on the New Proxy Session button (looks like a globe) or by selecting "File->New Proxy Session" from the menu.
The settings for a proxy session are its name, its port, and whether the proxy server should log binary files such as images. Typically you would name the proxy session after the site that you are accessing, set the port to 8777, and select "Don't log binary files".
Configuring your web browser
Configuring a web browser to use a proxy server is generally pretty straightforward, but varies somewhat for each browser. For more detailed instructions on setting up your specific browser to use a proxy try one of the links at the bottom of this screen.
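The same two values a browser needs, a proxy host and a port, can also be handed to a programmatic HTTP client. The following is a minimal sketch using the JDK's standard `java.net.http.HttpClient` (Java 11 or later), not anything shipped with screen-scraper. It assumes the proxy server runs on the local machine and listens on port 8777; the class and method names are made up for illustration.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.URI;
import java.net.http.HttpClient;
import java.util.List;

/** Sketch: route a Java HTTP client through a locally running proxy server. */
public class ProxyClientSketch {

    // Assumed values: proxy on this machine, listening on the port suggested for proxy sessions.
    static final String PROXY_HOST = "localhost";
    static final int PROXY_PORT = 8777;

    /** Builds a selector that sends every request through the proxy. */
    static ProxySelector proxySelector() {
        return ProxySelector.of(new InetSocketAddress(PROXY_HOST, PROXY_PORT));
    }

    /** Builds an HTTP client that uses the selector above. */
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .proxy(proxySelector())
                .build();
    }

    public static void main(String[] args) {
        // No request is sent here; this only shows which proxy would be chosen for a URL.
        List<Proxy> chosen = proxySelector().select(URI.create("http://somesite.com/somepage.html"));
        System.out.println("Requests would be routed through: " + chosen);
    }
}
```

Requests sent with a client built this way would only reach the proxy while it is running, so a connection error usually means the proxy server has not been started or is listening on a different port.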
Running the proxy server
Assuming you've configured everything and set up a proxy session, from here you should be able to start up the proxy server by selecting the proxy session in the tree on the left, then clicking on the "Start Proxy Server" button. Now just surf away.
Viewing requests and responses
After you've surfed a bit with your web browser, click on the "Progress" tab. From here you can view all of the HTTP and HTTPS requests and responses logged by the proxy server. The upper pane lists all of the transactions (a transaction is a request/response combination). Clicking on a transaction brings up its details in the lower pane. You can delete transactions by selecting them and either hitting the "Delete" key or right-clicking them (or Option-clicking on Mac OS X) and selecting "Delete".
The proxy server log
Currently the proxy server just logs very basic information about its activity, and probably isn't of much interest.
Viewing encrypted transactions
In Internet Explorer 7 you have to adjust your security settings. In Tools > Internet Options, under the "Security" tab, slide the security level to medium. When accessing a site that uses HTTPS encryption you will encounter a browser warning that looks like this:

This warning occurs because screen-scraper is using a temporary certificate for encryption that will not match the URL that you are accessing. You can safely ignore this warning by clicking "Continue to this website (not recommended)".
Currently Firefox 3 will not allow you to navigate to a page with a certificate/domain mismatch. We recommend using Opera 9.5 for SSL proxy sessions.
Using an external proxy server
If you normally use an external proxy server when connecting to the Internet (on your local area network, for example), you'll need to set another property within screen-scraper. View the settings screen by selecting Options->Settings from the menu. On the "External Proxy" tab you'll notice a series of boxes toward the bottom that allow you to set parameters related to your proxy server. It should be relatively self-explanatory what needs to be designated. If you happen to be using NTLM (Windows NT) authentication you'll need to designate settings for both the "standard" proxy as well as the NTLM one.
From here:
Setting up specific browsers to use a proxy server:
Vista Users
Using Scripts with the Proxy Server
Overview
screen-scraper can run custom-made scripts while the proxy server is running. This allows you to harness the full power of the scripting environment, just as you can in scraping sessions. It is recommended that you read the "Using Scripts" documentation before continuing, since many of the concepts apply to invoking scripts in the proxy server environment.
Using the scripts
Scripts are added to a proxy session by selecting the proxy session in the tree view, then selecting the "Scripts" tab and clicking on the "Add Script" button. A new row will be added to the scripts table. Click on the script name in that row and select the script that you want to run. The options "Sequence", "When to Run" and "Enabled" function similarly to other places in screen-scraper where scripts can be invoked. In the proxy server environment the "When to Run" options specify when in the proxy cycle the script will be invoked. Depending on when you decide to run your script, certain built-in objects that are unique to the proxy environment will be in scope.
Built-in objects
screen-scraper offers a few objects that you can work with in a script in the proxy environment. See the "Variable scope" section (following this one) for more details.
cityCode = proxySession.getVariable( "CITY_CODE" );
proxySession.setVariable( "CITY_CODE", dataSet.get( 0, "CITY_CODE" ) );
proxySession.log( "Inserting request parameters into the database." );
request.addPOSTParameter( "selectedState" , "Alaska");
request.removePOSTParameter( "selectedState" );
url = request.getURLAsString();
request.setRequestLine( "GET" , "http://somesite.com/somepage.html", "HTTP/1.1");
request.setHTTPHeader( "Cookie" , "someCookieValue");
request.removeHTTPHeader( "Cookie" , "someCookieValue");
statusLine = response.getStatusLine();
response.setStatusLine( "HTTP/1.1 200 OK" );
response.setHTTPHeader( "Set-Cookie" , "someCookieValue");
response.removeHTTPHeader( "Set-Cookie" , "someCookieValue");
content = response.getContentAsString();
response.setContentAsString( "<html><head ... </html>");
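To show how these calls might fit together, here is a minimal sketch of a script meant to run "Before HTTP request". Inside screen-scraper only the statements in `run` would go into the script, and `proxySession` and `request` would be supplied by the proxy environment. The nested `ProxySession` and `Request` classes are hypothetical stand-ins that mimic only the calls listed above so the example can run on its own; their parameter types are assumptions, and the `cityCode` parameter name is made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a "Before HTTP request" script, with stand-ins for the built-in objects. */
public class BeforeRequestScriptSketch {

    /** Hypothetical stand-in for the built-in proxySession object. */
    public static class ProxySession {
        private final Map<String, Object> variables = new HashMap<>();
        public final List<String> logLines = new ArrayList<>();

        public Object getVariable(String name) { return variables.get(name); }
        public void setVariable(String name, Object value) { variables.put(name, value); }
        public void log(String message) { logLines.add(message); }
    }

    /** Hypothetical stand-in for the built-in request object. */
    public static class Request {
        public final Map<String, String> postParameters = new LinkedHashMap<>();

        public void addPOSTParameter(String key, String value) { postParameters.put(key, value); }
        public void removePOSTParameter(String key) { postParameters.remove(key); }
    }

    /** The script body: swaps a POST parameter based on a session variable. */
    public static void run(ProxySession proxySession, Request request) {
        Object cityCode = proxySession.getVariable("CITY_CODE");
        if (cityCode == null) {
            // Nothing stored yet, so leave the request exactly as the browser sent it.
            proxySession.log("CITY_CODE is not set; leaving the request unchanged.");
            return;
        }
        request.removePOSTParameter("selectedState");
        request.addPOSTParameter("cityCode", cityCode.toString());
        proxySession.log("Added cityCode=" + cityCode + " to the request.");
    }

    public static void main(String[] args) {
        ProxySession proxySession = new ProxySession();
        Request request = new Request();
        request.addPOSTParameter("selectedState", "Alaska");

        proxySession.setVariable("CITY_CODE", "ANC");
        run(proxySession, request);

        System.out.println(request.postParameters);
        proxySession.logLines.forEach(System.out::println);
    }
}
```

Logging on both branches keeps the "Log" tab informative even when the script decides to do nothing.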
Variable scope
Depending on when a script gets run different variables may be in scope. The table that follows specifies what variables will be in scope depending on when a given script is run.
| When Script is Run | proxySession in scope | request in scope | response in scope |
| --- | --- | --- | --- |
| Beginning of proxy session | X | | |
| Before HTTP request | X | X | |
| After HTTP request | X | X | |
| Before HTTP response | X | X | X |
| After HTTP response | X | X | X |
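The scope rules in the table can also be written down as a small lookup, which may be handy when planning which script goes where. This sketch simply restates the table; the enum and method names are invented for illustration and are not part of screen-scraper.

```java
import java.util.EnumSet;
import java.util.Set;

/** Sketch: the variable scope table expressed as a lookup. */
public class ProxyScopeSketch {

    /** The built-in objects a proxy script may see. */
    public enum Var { PROXY_SESSION, REQUEST, RESPONSE }

    /** The "When to Run" options, each with the objects in scope at that point. */
    public enum When {
        BEGINNING_OF_PROXY_SESSION(EnumSet.of(Var.PROXY_SESSION)),
        BEFORE_HTTP_REQUEST(EnumSet.of(Var.PROXY_SESSION, Var.REQUEST)),
        AFTER_HTTP_REQUEST(EnumSet.of(Var.PROXY_SESSION, Var.REQUEST)),
        BEFORE_HTTP_RESPONSE(EnumSet.allOf(Var.class)),
        AFTER_HTTP_RESPONSE(EnumSet.allOf(Var.class));

        private final Set<Var> inScope;

        When(Set<Var> inScope) {
            this.inScope = inScope;
        }

        /** True if the given object is available to a script run at this point. */
        public boolean has(Var variable) {
            return inScope.contains(variable);
        }
    }

    public static void main(String[] args) {
        for (When when : When.values()) {
            StringBuilder line = new StringBuilder(when.name()).append(':');
            for (Var variable : Var.values()) {
                if (when.has(variable)) {
                    line.append(' ').append(variable.name());
                }
            }
            System.out.println(line);
        }
    }
}
```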
Debugging scripts
One of the best ways to fix errors is to simply watch the proxy session log (under the "Log" tab) and the "error.log" file (located in the "log" directory where screen-scraper was installed) for script errors. When a problem arises in executing a script screen-scraper will output a series of error-related statements to the logs. Often a good approach in debugging is to build your script bit by bit, running it frequently to ensure that it runs without errors as you add each piece.