AJAX Passing Sessions during scraping not working

I find that more sites are now using AJAX, which presents new challenges when trying to scrape them.

I'm trying to scrape the following site:

http://www.marksandspencer.com/Red-Wine-Food-Wine/b/44097030?ie=UTF8&sor...

The problem is that the main content is loaded in an AJAX area, so I'm struggling to get it to appear in a scrapeable file, even after creating the scrapeable file through the proxy server.

Even requesting the AJAX URL directly, with the parameters entered on the parameters screen, is not working.

Find the scraping files attached.

I'm at a loss. The scraper info reads:

Starting scraper.
Running scraping session: M&S - Wine AJAX Test
Processing scripts before scraping session begins.
Scraping file: "M&S - Wine AJAX Test"
M&S - Wine AJAX Test: Preliminary URL: http://www.marksandspencer.com/gp/santana/portlet/embeddedSearch.html
M&S - Wine AJAX Test: Using strict mode.
M&S - Wine AJAX Test: Resolved URL: http://www.marksandspencer.com/gp/santana/portlet/embeddedSearch.html?_e...
M&S - Wine AJAX Test: Sending request.
Starting scraper.
Running scraping session: M&S - Wine AJAX Test
Processing scripts before scraping session begins.
Processing script: "M&S - Wine AJAX Initialisation"
An error occurred while processing the script: M&S - Wine AJAX Initialisation
The error message was: Class or variable not found:prod_type.length : at Line: 14.
Processing scripts after scraping session has ended.
Scraping session "M&S - Wine AJAX Test" finished.
Starting scraper.
Running scraping session: M&S - Wine AJAX Test
Processing scripts before scraping session begins.
Processing script: "M&S - Wine AJAX Initialisation"
***Beginning STATE: null
Scraping file: "M&S - Wine AJAX Test"
M&S - Wine AJAX Test: Preliminary URL: http://www.marksandspencer.com/gp/santana/portlet/embeddedSearch.html
M&S - Wine AJAX Test: Using strict mode.
M&S - Wine AJAX Test: Resolved URL: http://www.marksandspencer.com/gp/santana/portlet/embeddedSearch.html?_e...
M&S - Wine AJAX Test: Sending request.
M&S - Wine AJAX Test: An input/output error occurred while connecting to 'http://www.marksandspencer.com/gp/santana/portlet/embeddedSearch.html'. The message was connect: Address is invalid on local machine, or port is not valid on remote machine.
Processing scripts after scraping session has ended.
Scraping session "M&S - Wine AJAX Test" finished.
M&S - Wine AJAX Test: An input/output error occurred while connecting to 'http://www.marksandspencer.com/gp/santana/portlet/embeddedSearch.html'. The message was connect: Address is invalid on local machine, or port is not valid on remote machine.
Processing scripts after scraping session has ended.
Scraping session "M&S - Wine AJAX Test" finished.

The Last Request is below. It seems the session ID is being passed and yet the page is not being displayed in the last response.

GET http://www.marksandspencer.com/gp/santana/portlet/embeddedSearch.html?_e... HTTP/1.1
Accept-Encoding: gzip, deflate
Referer: http://www.marksandspencer.com/Red-Wine-Food-Wine/b/44097030#_encoding=U... 1&showAll=&rh=n:44097030&isBrowse=1&page=1
Proxy-Connection: Keep-Alive
Cookie: session-id-time=1268179200l; session-id=279-4993839-8045014; ubid-acbuk=280-2008109-6006421; s_sess=%20s_cc%3Dtrue%3B%20s_v5%3DFood%2520%2526%2520Wine%253EWine%253ERed%3B%20s_sq%3D%3B; s_pers=%20s_nr%3D1267649569183%7C1270241569183%3B%20s_visit%3D1%7C1267654061156%3B%20gpv_p5%3DBrowse%253AFood%2520%2526%2520Wine%253EWine%253ERed%7C1267654061218%3B
Accept-Language: en-gb
Host: www.marksandspencer.com
x-requested-with: XMLHttpRequest
User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; Tablet PC 2.0)
Accept: */*

The Scraper files can be downloaded from: http://www.blue-curve.com/M&S%20-%20Wine%20AJAX%20Test%20(Scraping%20Session).sss

Any advice you can offer would be greatly appreciated.

It looks like you're

It looks like you're requesting the page properly. When that page loads, however, there should be some JavaScript on it making further HTTP requests that populate the rest of the data. If you can proxy those requests you should see responses containing what you need. The responses can come in various types:

  • JSON
  • XML
  • Text/HTML

So you may need to parse what you get, but you should be able to do it.
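
As a rough illustration of "parse what you get", here is a minimal Python sketch that picks a parser based on the Content-Type of a proxied AJAX response. It is not screen-scraper code; the function name `parse_ajax_body` and the sample bodies in the usage note are made up for the example.

```python
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the non-blank text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def parse_ajax_body(content_type, body):
    """Parse a response body according to its Content-Type header (hypothetical helper)."""
    # Drop parameters such as "; charset=UTF-8" before comparing.
    mime = content_type.split(";")[0].strip().lower()
    if mime == "application/json":
        return json.loads(body)
    if mime in ("application/xml", "text/xml"):
        return ET.fromstring(body)
    if mime in ("text/html", "text/plain"):
        collector = TextCollector()
        collector.feed(body)
        collector.close()
        return collector.chunks
    raise ValueError(f"Unhandled content type: {content_type}")
```

For example, `parse_ajax_body("text/html; charset=UTF-8", "<div><span>Rioja</span> <b>5.99</b></div>")` yields the text chunks of the fragment, which you could then match against in the same way an extractor pattern would.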

Problem...

I found something very strange.

Having reimported the proxy files and then used the existing scrapeable file without adding any parameters, I'm finding that halfway through the scrape it loads the file successfully.

More specifically:
- During "M&S - Wine AJAX Test: Sending request."
- "Last Response" displays the correct data.

Once it finishes, however, the data vanishes and all I'm left with is:

HTTP/1.1 200 OK
Content-Encoding: gzip
Server: Server
Set-Cookie: session-id=275-0385924-3163019; path=/; domain=www.marksandspencer.com; expires=Thu Mar 11 00:00:00 2010 GMT
Vary: Accept-Encoding,User-Agent
Transfer-Encoding: chunked
Cneonction: close
x-amz-id-1: 02E5KTFG06NV4FEYJGG7
Content-Type: text/html; charset=UTF-8
Date: Thu, 04 Mar 2010 21:39:49 GMT
x-amz-id-2: Y6xtLpPfEsDMnylIlxA7D5TnRDzoeN9wudrJCP6rTMY=
Set-Cookie: session-id-time=1268265600l; path=/; domain=www.marksandspencer.com; expires=Thu Mar 11 00:00:00 2010 GMT

"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

Any idea why this might be?

I looked through the rest of the transactions and there were no separate files so I'm no closer to extracting the data.

I looked at that site, and I

I looked at that site, and I can't tell what you're trying to do, but it doesn't seem too hard. I whipped up a short little scrape that will loop through some results and it will get some prices.

My scrape is here (right click and "save target as"): http://projects.screen-scraper.com/misc/MarkandSpencer%20(Scraping%20Session).sss

I tried

Hi Jason, I liked the script, but it only gets 12 products per page.

I tried to invoke the "show 60 per page" option, but that's where I came unstuck. Any suggestions?

Looks like there is just one

Looks like there is just one parameter in the search results URL:

show60PerPage=1

I tried this and I keep

I tried this and I keep coming back to the same problem: the main results are not present when I scrape the following page:

http://www.marksandspencer.com/Red-Wine-Food-Wine/b/44097030#_encoding=UTF8&show60PerPage=1

It seems the above page is a container for the AJAX content.

So
Container: http://www.marksandspencer.com/Red-Wine-Food-Wine/b/44097030#_encoding=UTF8&show60PerPage=1
Content: http://www.marksandspencer.com/gp/santana/portlet/embeddedSearch.html?_encoding=UTF8&rs=44097030&fromPS=&sort=salesrank&mnSBrand=core&viewID=leaf&pos=emsrch_pag%5FPage%201&showAll=1&rh=n%3A44097030&isBrowse=1&page=1&browseNodeId=44097030&sessionID=279-4993839-8045014&isAjaxRequest=1

Are you able to scrape the following page?
http://www.marksandspencer.com/gp/santana/portlet/embeddedSearch.html?_encoding=UTF8&rs=44097030&fromPS=&sort=salesrank&mnSBrand=core&viewID=leaf&pos=emsrch_pag%5FPage%201&showAll=1&rh=n%3A44097030&isBrowse=1&page=1&browseNodeId=44097030&sessionID=279-4993839-8045014&isAjaxRequest=1
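
One likely reason the container URL never returns the results: everything after the `#` is a URL fragment, and fragments are not sent to the server in an HTTP request. The Python sketch below only illustrates that split; `request_target` is a hypothetical helper, and the idea that the page's JavaScript reads the fragment to build the AJAX request is an assumption.

```python
from urllib.parse import urlsplit

CONTAINER = (
    "http://www.marksandspencer.com/Red-Wine-Food-Wine/b/44097030"
    "#_encoding=UTF8&show60PerPage=1"
)


def request_target(url):
    """Return the path and query an HTTP client would actually send (hypothetical helper)."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # parts.fragment is deliberately ignored: it stays on the client side.
    return target


if __name__ == "__main__":
    print("Sent to server:", request_target(CONTAINER))
    print("Kept by client:", urlsplit(CONTAINER).fragment)
```

Under that reading, `show60PerPage=1` in the container URL never reaches the server at all, so any paging parameter has to go into the query string of the content URL instead.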

Though the link works fine in

Though the link works fine in my browser, in the scrape I first had to request http://www.marksandspencer.com and extract the session-id it returns. I then substituted that session ID into the URL and turned off HTML Tidy, and I can now get the data from that page.
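
The two steps described here (read the session-id cookie from the home page response, then put it into the AJAX URL) could be sketched in Python roughly as follows. This is not the screen-scraper script itself; `cookie_value` and `with_session_id` are hypothetical helper names, and the content URL in the usage note is abbreviated to a few of its parameters.

```python
import re


def cookie_value(set_cookie_headers, name):
    """Return the value of the named cookie from a list of Set-Cookie header values."""
    for header in set_cookie_headers:
        # The cookie itself is the first name=value pair; attributes follow after ';'.
        pair = header.split(";", 1)[0]
        key, _, value = pair.partition("=")
        if key.strip() == name:
            return value.strip()
    return None


def with_session_id(url, session_id):
    """Replace the value of the sessionID query parameter, leaving the rest of the URL as-is."""
    return re.sub(
        r"([?&]sessionID=)[^&#]*",
        lambda m: m.group(1) + session_id,
        url,
    )
```

Given the `Set-Cookie: session-id=275-0385924-3163019; ...` header shown earlier in the thread, `cookie_value(headers, "session-id")` returns `275-0385924-3163019`, and `with_session_id(".../embeddedSearch.html?_encoding=UTF8&page=1&sessionID=279-4993839-8045014&isAjaxRequest=1", "275-0385924-3163019")` swaps the stale ID for the fresh one.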

Thanks for your help Jason.

Thanks for your help, Jason. I finally got it working, but now I can't get the scraper to stop when it fails to find a match.

It just keeps looping round.

A link to the scraper file is below. It all seems a little strange, as I don't usually have this problem.

http://www.blue-curve.com/M&S%20-%20Wine%20AJAX%20Test100311.sss

Any ideas?

griffen, The reason your

griffen,

The reason your scrape is looping over and over is that you're calling the "M&S - AJAX Scrape Cat Page" script "After pattern is applied". This means the script is called whether or not there is a match when the pattern is applied, and since the script then calls the same scrapeable file, it just loops again and again.

You may want to call the script "After each pattern application" instead. That way the script will only get called when there is a match.

-Scott
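
To make the difference between the two triggers concrete, here is a small Python model of the behaviour Scott describes. It is only a simulation, not screen-scraper's API: the three made-up pages, the `NEXT_LINK` pattern, the trigger names as strings, and the assumption that the page variable keeps its old value when nothing matches are all invented for illustration.

```python
import re

# Hypothetical three-page result set; only the first two link onward.
PAGES = {
    1: '<a class="next" href="?page=2">Next</a>',
    2: '<a class="next" href="?page=3">Next</a>',
    3: "<p>No more results</p>",
}
NEXT_LINK = re.compile(r'class="next" href="\?page=(\d+)"')


def run(trigger, max_requests=10):
    """Return the pages requested under the given trigger, capped at max_requests."""
    requested = []
    current = 1
    while current is not None and len(requested) < max_requests:
        requested.append(current)
        matches = NEXT_LINK.findall(PAGES[current])
        if trigger == "after_pattern_is_applied":
            # The script fires once whether or not anything matched. With no
            # match the page variable is unchanged, so the same file is requested again.
            current = int(matches[0]) if matches else current
        elif trigger == "after_each_pattern_application":
            # The script fires once per match, so no match means no further request.
            current = int(matches[0]) if matches else None
        else:
            raise ValueError(f"Unknown trigger: {trigger}")
    return requested


if __name__ == "__main__":
    print(run("after_pattern_is_applied"))
    print(run("after_each_pattern_application"))
```

In this model the first trigger keeps re-requesting the last page until the safety cap stops it, while the second stops on its own after page 3, which matches the looping described above.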

Thanks Scott. Spot on and

Thanks Scott.

Spot on and thankfully now working.