Calling a scrape file during a Proxy Session - how to?

Hi -

SS Pro newbie here currently evaluating Trial version. So far so good!

Bit confused now however.

I am running a Proxy Session. I want to scrape some data from each webpage I visit during the Proxy Session, while the Proxy Session is running. That can be done, yes? (The documentation talks about harnessing the power of the scripting engine while running proxy sessions, so I'm assuming so.)

I have created a simple Extractor page that takes the data I need, called "Product_Data".

I have also created a simple Script called "Get Product Data" that has 1 line in it (VBScript):-
Call session.ScrapeFile( "Product_Data" )

Lastly, under the 'Scripts' tab in the Proxy Session, I have added the script "Get Product Data", Sequence=1, When to run=After HTTP Response, Enabled=Yes.

This doesn't work (an expert will probably see why straight away).

In the log, I get:-

Processing scripts before an HTTP request.
Requesting URL: http://mydomain.com/page1.html
Processing scripts after an HTTP request.
Processing scripts before an HTTP response.
Processing scripts after an HTTP response.
Processing script: "Get Product Data"
In call script
Thread-127: An error occurred while processing the script: Get Product Data
Thread-127: The error message was: Scripting engine failure
Courtesy of Java: method name:ScrapeFile: 1:0
Java Exception: class com.ibm.bsf.BSFException Method:ScrapeFile in class com.screenscraper.httpeek.HTTPSession not found.
(scode=0x80020009 wcode=0x0)

Could somebody explain how I can call the scrape file after each HTTP response? I'm obviously missing some important knowledge about calling scrape files in Proxy Sessions. I need enlightening!

The script is definitely being called after each HTTP response. But do I need to somehow capture the current URL I'm on in the Proxy Session and pass it to the scrape file each time? If so, how? (At the moment, I have left the 'URL' field in the 'Product_Data' scrapefile page blank - is this a mistake?)

If anybody could give any help on the above, that would be greatly appreciated since, as you can see, I am a little bit in the dark at the moment over this!

many thanks

pete

Calling a scrape file during a Proxy Session - how to?

hi Scott,

Thank you once again for your very kind reply.

I understand what you suggest about copying the raw responses into individual .html files, but it would be a lot of work for the quantity of pages I am anticipating scraping.

Therefore, I think I will need to overcome the '500' error problem.

This means I will have to pm you with further information, as the site I would like to scrape from is a commercial site which requires a login, and I shouldn't really divulge my username and password publicly.

pete

Calling a scrape file during a Proxy Session - how to?

Pete,

There's a way around the 500 error you're getting, but if it would be faster for you just to scrape the data from the results of the proxy, click "Display Raw Response" for each of the pages, then select all of the HTML in the pop-up window by holding down Ctrl and pressing A. Copy it by holding down Ctrl and pressing C. Then open up your favorite text editor (e.g. Notepad), paste in the HTML (Ctrl+V), and save each file to your local system (remember where) as an .html file. Then, for each of the pages you want data from, you can replace the URL in the scrapeable files you moved over with the file path on your local system (e.g. c:\myfile.html). If you have all of the files running in sequence, screen-scraper will iterate through each and grab data based on the extractor patterns you use.
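If you'd rather not create a separate scrapeable file for every saved page, something like the following might work. This is just a rough, untested sketch in Interpreted Java, assuming a single scrapeable file named "Product_Data" whose URL field contains the session variable token ~#LOCAL_FILE#~ (the file paths are placeholders):

// Hypothetical list of the pages saved off from the proxy session.
String[] savedPages = {
    "c:/scrapes/page1.html",
    "c:/scrapes/page2.html"
};

for (int i = 0; i < savedPages.length; i++) {
    // "Product_Data"'s URL field would read: ~#LOCAL_FILE#~
    session.setVariable("LOCAL_FILE", savedPages[i]);
    session.log("Scraping local file: " + savedPages[i]);
    session.scrapeFile("Product_Data");
}

Run from inside a scraping session (e.g. set to run "Before scraping session begins"), that would let you maintain one set of extractor patterns rather than one scrapeable file per saved page.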

I hope this makes sense. If you plan to do this more than a few times, it would be best to overcome the 500 error issue. We can help with that if you post the URLs you're having trouble with. If you'd like, you can private message me, but we prefer that everyone be able to benefit from the solutions offered.

Thanks,
Scott

Calling a scrape file during a Proxy Session - how to?

Hi Scott -

Many thanks for your reply and for putting me straight on this (that you don't scrape during a Proxy Session).

What I need to know now is: once the Proxy Session is over and stopped, I have a list of webpages showing in 'Proxy Session', under the 'Progress' tab, under 'HTTP Transactions'. In the bottom pane, under 'Response', the full HTML of each of these visited pages is saved in SS.

Is there some way I can loop through these locally-saved pages and submit each of them to a scrape file, so I can get data off of them?

If so, what kind of script would I need to kick this off? Is there any example of doing this type of thing in the documentation at all?

The problem I'm having with this particular website I want to scrape is that although I can run a Proxy Session and 'record' the exact pages I need, if I try to go straight to them in SS Pro in a 'normal' SS way (like in your Shopping Site tutorial), all I get back from the server is a '500' error for each page. But if I could scrape from the locally-saved files held on my PC, that wouldn't be a problem.

Can you help please?

Many thanks,

pete

P.S. What is the name of the file that holds the 'recording' of the proxy session, where the 'Response' HTML is kept? I can't seem to find it on my PC.

Calling a scrape file during a Proxy Session - how to?

Pete,

It looks like you're getting the general idea, except for one important aspect. The proxy server is, for the most part, meant only to collect data about each page, from which you'll craft your scraping session.

So, rather than building your scraping session in the proxy server, you'll record your browsing with the proxy server and then move the relevant pages over one-by-one as scrapeable files. As you do, you'll notice that the URL for each page stays in place, as does the HTTP data (POSTs/GETs), along with the last request and response. This data is meant to be a guide to what each page needs in order for screen-scraper to successfully navigate to it. The data in the last response, in particular, is useful for creating extractor patterns.
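For example (the HTML and token name here are hypothetical, not from your site), an extractor pattern built from the last response might look like:

<td class="price">~@PRODUCT_PRICE@~</td>

and a script set to run "After each pattern match" could then pick up the captured value in Interpreted Java:

session.log("Price found: " + dataRecord.get("PRODUCT_PRICE"));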

This should be more apparent if you start your education with the tutorials we have on our site.

http://www.screen-scraper.com/support/tutorials/tutorials.php

Please feel free to post again with any questions you may have. There's no such thing as a bad question, but we do encourage you to use Interpreted Java rather than VBScript. screen-scraper is written in Java, and as you may or may not know, Microsoft has not always played nice with Java, making the integration of VBScript with screen-scraper less than elegant.
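For instance, the one-line script from your first post would become the following in Interpreted Java (note the lower-case method name - and, as above, it will only do anything useful from within a scraping session, not a proxy session):

session.scrapeFile( "Product_Data" );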

Thanks,
Scott