Automating page-by-page extraction using POST variables.

I used the information from Tutorial 3 to successfully scrape a site that changes pages using GET, and have now hit a brick wall on another site that uses POST (javascript:__doPostBack), which doesn't pass the variable in the URL. I found the variable that controls which page is displayed in the Parameters tab in Screen Scraper, changed it by hand many times on single scrapes, and got the correct data from the correct page each time.

The problem comes when I try to automate this by auto-incrementing the page number through a session variable in the POST parameters. Starting with page 0, the first page scrapes fine, and I use a script like the one in Tutorial 3 to append the data to a file. However, once my script increments the PAGE session variable and runs the scraping session again, the HTML returned by the second scrape does not contain the text my extractor pattern should match. If I hard-code the page number for each and every page, it runs fine; it only fails when I try to auto-increment.

I start the session from a script that initially sets the variable PAGE to 0, like so:

runnableScrapingSession.setVariable( "PAGE", "0" );

The scraping session begins and passes the PAGE variable in the POST (see here: http://i171.photobucket.com/albums/u288/jeffreydean1/screenie2.jpg )

Page 0 comes through fine and scrapes perfectly. When it attempts to scrape page 1, however, my log shows this (I snipped the viewstate variable, as it was extremely long):

Scraping file: "Relapse cd 2"
Relapse cd 2: Preliminary URL: http://shop.relapse.com/store/product.aspx
Relapse cd 2: POST data: __EVENTTARGET=ctrlProductBrowser%3AdgProducts%3A_ctl1%3A_ctl1&__EVENTARGUMENT=&__VIEWSTATE= (snipped for extreme length) dDwxTc4OtrlLeftNav%3AddlSearchBy=1&ctrlLeftNav%3AtxtSearchString=&ctrlLeftNav%3AtxtEmail=yourname%40isp.com&ddlGenres=0&ddlProductFormats=0
Relapse cd 2: Resolved URL: http://shop.relapse.com/store/product.aspx
Relapse cd 2: Sending request.
Relapse cd 2: Processing scripts before all pattern applications.
Relapse cd 2: Extracting data for pattern "null"
Relapse cd 2: The pattern did not find any matches.

And instead of the data my extractor pattern should have matched, the last response tab of the scrapable file shows this:

"(span id="ctrlProductBrowser_lblNoResults")No results... please select a different genre or format.(/span) (/td)" (HTML changed to ()'s so I can post this message)

Can anybody offer any advice as to what I may be doing wrong and how to fix it? It strikes me as fairly simple, but I cannot figure out what I have done wrong. I COULD hand-change the page number and run a single scraping session over and over again, appending the data each time, but that's a rather inelegant solution. Thanks in advance!


jeffreydean1,

Please have a look at a blog entry I recently completed on the topic. Hopefully, it will give you some tools to work with.

http://blog.screen-scraper.com/2008/06/04/scraping-aspnet-sites/

-Scott


jeffreydean1,

You're the latest unfortunate soul to have waded into the stinking mire that is ASP.NET. Once you get the hang of it, the stench wears off, though.

The trick to ASP.NET sites, with their inconceivably large "__VIEWSTATE" and "__EVENTTARGET" POST values, is to scrape those values from each response and pass them along as POST parameters in the request for the next page. Every page requires that these be passed, and the values change from page to page, so a value copied from an earlier page will not work. If you're not sure whether one of their silly values is required, it's best to err on the side of including it.
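To make the idea concrete outside of screen-scraper, here is a minimal standalone Python sketch of "scrape the hidden fields, then post them back". It is an illustration only, not how screen-scraper does it internally. The sample HTML and its viewstate value are made up; the event target is the one from the log above, and the helper names (HiddenFieldParser, build_postback) are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collects name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback(html, event_target, event_argument=""):
    """Build a POST body from the hidden fields of the page just received."""
    parser = HiddenFieldParser()
    parser.feed(html)
    fields = dict(parser.fields)
    # Overwrite the two fields that __doPostBack would fill in client-side.
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return urlencode(fields)


# Made-up response; a real page's __VIEWSTATE is far longer.
sample_page = """
<form method="post" action="product.aspx">
  <input type="hidden" name="__EVENTTARGET" value="" />
  <input type="hidden" name="__EVENTARGUMENT" value="" />
  <input type="hidden" name="__VIEWSTATE" value="dDwxMjM0NTY3ODk7Oz4=" />
</form>
"""

body = build_postback(sample_page, "ctrlProductBrowser:dgProducts:_ctl1:_ctl1")
print(body)
```

The important part is that build_postback is called again with each new response, so every request carries the viewstate of the page that came just before it.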

That's the main hurdle. The second issue, sometimes, is that each page requires a particular referrer in order to display. If you're requesting all of your pages in the same order that your browser does, you should be OK, but if you access a page out of order or happen to miss a 3xx redirect, you'll need to set the right referrer manually.
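For reference, this is roughly what setting the referrer by hand looks like at the HTTP level, again as a standalone Python sketch rather than screen-scraper code. The URL is the one from the log above; the POST body is a placeholder, and using the page's own URL as the referrer is an assumption (a postback normally comes from the page it posts to). The request is only built here, not sent.

```python
from urllib.request import Request


def make_postback_request(url, post_body, referer):
    """Build (but do not send) a POST request with an explicit Referer header."""
    return Request(
        url,
        data=post_body.encode("ascii"),
        headers={
            # The HTTP header name really is spelled "Referer".
            "Referer": referer,
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


page_url = "http://shop.relapse.com/store/product.aspx"
request = make_postback_request(
    page_url,
    "__EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=",  # placeholder body
    referer=page_url,
)
print(request.get_method(), request.get_header("Referer"))
```

Comparing these headers against what livehttpheaders shows for the same click in your browser is a quick way to spot a missing or wrong referrer.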

[url=http://livehttpheaders.mozdev.org/]livehttpheaders[/url] can be helpful as it can reveal HTTP transactions not easily identified or simply lost with the screen-scraper proxy.

Be sure to consult [url=http://www.screen-scraper.com/support/docs/api_documentation.php]our API[/url] for how to set referrers and, should you ever need to, how to set one of those pesky POST parameters manually. You usually don't. Simply replacing a POST parameter's value with your scraped data under the Parameters tab usually suffices (e.g. ~#SCRAPED_VIEWSTATE#~).
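If it helps to picture why incrementing PAGE alone is not enough, here is a rough Python imitation of the ~#NAME#~ substitution. It is not screen-scraper's implementation, and the "page" parameter name and the viewstate values are made up for illustration.

```python
import re

# Matches tokens of the form ~#NAME#~
TOKEN = re.compile(r"~#(\w+)#~")


def fill_tokens(template, variables):
    """Replace each ~#NAME#~ token with variables[NAME]; unknown names are left as-is."""
    return TOKEN.sub(lambda m: str(variables.get(m.group(1), m.group(0))), template)


template = "page=~#PAGE#~&__VIEWSTATE=~#SCRAPED_VIEWSTATE#~"

# Values as they might stand after loading the initial page.
session = {"PAGE": "0", "SCRAPED_VIEWSTATE": "dDwxMjM0"}
first = fill_tokens(template, session)

# Before the next request, BOTH variables have to change: the page counter
# and the viewstate extracted from the response to the previous request.
session["PAGE"] = str(int(session["PAGE"]) + 1)
session["SCRAPED_VIEWSTATE"] = "dDw5ODc2"
second = fill_tokens(template, session)

print(first)
print(second)
```

If only PAGE were updated, the second request would pair the new page number with a stale viewstate, which is a likely cause of the "No results" response described above.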

I addressed this issue in a previous posting here:

http://www.screen-scraper.com/forum/phpBB2/viewtopic.php?t=1035&highlight=viewstate

I hope this helps. Let us know if you have any further questions.

Thanks,
Scott