Session variable dropped from url

Hi, can't run this one down and would love some help. I'm scraping a password protected site to retrieve pricing information. As part of the log-in, the server generates a unique ID (mscssid) which I scrape and store to a session variable. This variable is then passed in the url for each subsequent request.

Here's the catch: things work fine with a single lazy scrape. The string is extracted, stored in a session variable and successfully substituted as part of subsequent URLs. However, when I attempt to lazy scrape two or more products programmatically, it fails. The scraper log indicates that although it has successfully scraped the ID, it is not inserting it as part of the URL. It behaves as though the value were null, even though it has been extracted.

Any thoughts here would be much appreciated. FYI, I've got separate scrapers working for other similar password protected sites and am able to scrape 10+ products simultaneously (i.e., programmatically in rapid sequence, using lazy scrape to instantiate separate scrapers). I've updated to the latest Windows version, 2.6, as well.

Also, it would really, really throw a wrench into our product not to be able to use lazy scrape, as an asynchronous scrape is critical. Thanks in advance for any help,

John

FYI, great product so far - we're hoping to launch a beta version of our service using it in Q1 2006.

Session variable dropped from url

Hi John,

I think you're absolutely right about the redirect causing the problem. In the successful run you get this

THE_SITE_login: Resolved URL: http://www.THE_SITE.com/xt_shopper_lookup.asp
THE_SITE_login: Sending request.
THE_SITE_login: Redirecting to: http://www.THE_SITE.com/orderentry.asp?mscssid=%7BE2482593%2DF492%2D46D3%2D9190%2D9EB314DAD139%7D

Whereas in the unsuccessful run you get this

THE_SITE_login: Resolved URL: http://www.THE_SITE.com/xt_shopper_lookup.asp
THE_SITE_login: Sending request.
THE_SITE_login: Redirecting to: http://www.THE_SITE.com/orderentry.asp?mscssid=%7B98A42298%2DFA1C%2D4C54%2D9D64%2DBE4EADB35C09%7D
THE_SITE_login: Redirecting to: http://www.THE_SITE.com/removeFrames.asp?Page=i%5Fshop%2Easp

Not knowing the site, it's difficult for me to say why it might be doing this. Can you tell why it might be redirecting you to that "removeFrames" page? Also, is the MSCSSID value available on that removeFrames page, so that you could still extract it regardless of which page you end up being redirected to?
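
If the value isn't on the final page, another option might be to pull it out of the redirect URLs themselves, since the first redirect in both logs carries mscssid in its query string. Here is a minimal sketch of that idea in plain Python. It is not screen-scraper code; the function name find_mscssid and the idea of collecting the redirect chain into a list are assumptions made for illustration, and the two URLs are copied from the unsuccessful log.

```python
from urllib.parse import urlsplit, parse_qs

def find_mscssid(redirect_chain):
    """Return the first mscssid found in any URL of the chain, or None.

    Hypothetical helper: redirect_chain is assumed to be the list of URLs
    the login request was redirected through, in order.
    """
    for url in redirect_chain:
        # parse_qs percent-decodes values and drops blank ones by default.
        query = parse_qs(urlsplit(url).query)
        values = query.get("mscssid")
        if values:
            return values[0]
    return None

# Redirects from the unsuccessful run: the ID is in the first hop only.
chain = [
    "http://www.THE_SITE.com/orderentry.asp?mscssid=%7B98A42298%2DFA1C%2D4C54%2D9D64%2DBE4EADB35C09%7D",
    "http://www.THE_SITE.com/removeFrames.asp?Page=i%5Fshop%2Easp",
]
session_id = find_mscssid(chain)
```

Note that the value comes back percent-decoded (braces and hyphens as literal characters), so it would need to be URL-encoded again before being substituted into a later request.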

Best,

Todd

Session variable dropped from url

Thanks Todd, glad to hear you all use lazy scraping extensively. I've tried manually testing by submitting nearly simultaneous sessions in different browsers (clicking submit one after the other) with no problems. I log and store scrape start/stop times and they are executing at the same time with no issues, so it doesn't appear to be an IP restriction.

In a blinding flash of the obvious, the session ID isn't being picked up after the first product is scraped. Based on the log files, it appears to be an issue with an extra redirect that likely has something to do with the fact that I'm requesting one frame from a frame-based site. Not sure where to go from here... Any thoughts or suggestions would be most welcome.

I've posted a copy of an unsuccessful log file, along with a successful one for your review. Thanks!

John

UNSUCCESSFUL:
Starting scraper.
Running scraping session: THE_SITE
Processing scripts before scraping session begins.
Scraping file: "THE_SITE_home_page"
THE_SITE_home_page: Preliminary URL: http://www.THE_SITE.com
THE_SITE_home_page: Resolved URL: http://www.THE_SITE.com
THE_SITE_home_page: Sending request.
Scraping file: "THE_SITE_login"
THE_SITE_login: Preliminary URL: http://www.THE_SITE.com/xt_shopper_lookup.asp
THE_SITE_login: POST data: shopper_username=(INTENTIONALLY REMOVED)&shopper_password=(INTENTIONALLY REMOVED)&cmdSubmit.x=0&cmdSubmit.y=0
THE_SITE_login: Resolved URL: http://www.THE_SITE.com/xt_shopper_lookup.asp
THE_SITE_login: Sending request.
THE_SITE_login: Redirecting to: http://www.THE_SITE.com/orderentry.asp?mscssid=%7B98A42298%2DFA1C%2D4C54%2D9D64%2DBE4EADB35C09%7D
THE_SITE_login: Redirecting to: http://www.THE_SITE.com/removeFrames.asp?Page=i%5Fshop%2Easp
THE_SITE_login: Extracting data for pattern "MSCSSID"
THE_SITE_login: The pattern did not find any matches.
THE_SITE_login: Warning! No matches were made by any of the extractor patterns associated with this scrapeable file.
Scraping file: "THE_SITE_product"
THE_SITE_product: Processing scripts before a file is scraped.
THE_SITE_product: Preliminary URL: http://www.THE_SITE.com/product.asp?productcode=~#VARID#~&quantity=0&mscssid=~#MSCSSID#~&pai=

[NOTE: mscssid does not resolve here, but is stored in a session variable above]

THE_SITE_product: Resolved URL: http://www.THE_SITE.com/product.asp?productcode=015876&quantity=0&mscssid=&pai=
THE_SITE_product: Sending request.
THE_SITE_product: Redirecting to: http://www.THE_SITE.com/Default.asp
THE_SITE_product: Extracting data for pattern "product_price"
THE_SITE_product: The pattern did not find any matches.
THE_SITE_product: Processing scripts after all pattern applications.
THE_SITE_product: Processing scripts after a file is scraped.
PRODQUANT and or PRODPRICE were null: ACTIVE_PRICE_ID: 226Page html: null
Processing scripts after scraping session has ended.
Scraping session finished.

SUCCESSFUL:
Starting scraper.
Running scraping session: THE_SITE
Processing scripts before scraping session begins.
Scraping file: "THE_SITE_home_page"
THE_SITE_home_page: Preliminary URL: http://www.THE_SITE.com
THE_SITE_home_page: Resolved URL: http://www.THE_SITE.com
THE_SITE_home_page: Sending request.
Scraping file: "THE_SITE_login"
THE_SITE_login: Preliminary URL: http://www.THE_SITE.com/xt_shopper_lookup.asp
THE_SITE_login: POST data: shopper_username=(INTENTIONALLY_REMOVED)&shopper_password=(INTENTIONALLY_REMOVED)&cmdSubmit.x=0&cmdSubmit.y=0
THE_SITE_login: Resolved URL: http://www.THE_SITE.com/xt_shopper_lookup.asp
THE_SITE_login: Sending request.
THE_SITE_login: Redirecting to: http://www.THE_SITE.com/orderentry.asp?mscssid=%7BE2482593%2DF492%2D46D3%2D9190%2D9EB314DAD139%7D
THE_SITE_login: Extracting data for pattern "MSCSSID"
THE_SITE_login: The following data elements were found:
MSCSSID--DataRecord 0:
Storing this value in a session variable.
MSCSSID=%7BE2482593%2DF492%2D46D3%2D9190%2D9EB314DAD139%7D
Scraping file: "THE_SITE_product"
THE_SITE_product: Processing scripts before a file is scraped.
THE_SITE_product: Preliminary URL: http://www.THE_SITE.com/product.asp?productcode=~#VARID#~&quantity=0&mscssid=~#MSCSSID#~&pai=

[NOTE: URL resolves properly]

THE_SITE_product: Resolved URL: http://www.THE_SITE.com/product.asp?productcode=015874&quantity=0&mscssid=%7BE2482593%2DF492%2D46D3%2D9190%2D9EB314DAD139%7D&pai=
THE_SITE_product: Sending request.
THE_SITE_product: Extracting data for pattern "product_price"
Processing script: "write_to_db"
Processing script: "write_to_db"
THE_SITE_product: The following data elements were found:
product_price--DataRecord 0:
Storing this value in a session variable.
PRODPRICE=8.22
THE_SITE_product: Processing scripts after a pattern application.
THE_SITE_product: Processing scripts after all pattern applications.
THE_SITE_product: Processing scripts after a file is scraped.
Processing script: "write_to_db"
Processing scripts after scraping session has ended.
Scraping session finished.

Session variable dropped from url

Hi John,

Thanks for posting. Based on your description I'm not completely sure what the issue could be, but one possibility may be that the site you're scraping only allows one login per IP address. That is, if you're scraping the site from two simultaneous scraping sessions but using the same IP address (i.e., on the same machine), it could be that the site is disallowing that. In some ways you can think of scraping sessions as separate web browsers--things like cookies and variables are kept completely separate from one another.

It might help to debug if you could post at least a portion of your log where the point of failure seems to be occurring. For example, you might post the portion that shows the values being extracted, then also the parts where they aren't getting properly inserted into the URL.
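
If the logs are long, a small script can help locate the point of failure. The following is only a rough sketch in plain Python, assuming log lines of the form "FILE_NAME: Resolved URL: http://..." as they appear in the logs posted in this thread. The function name empty_param_lines is made up for illustration, and the sample lines are taken from the unsuccessful log.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Matches lines such as "THE_SITE_product: Resolved URL: http://..."
RESOLVED = re.compile(r"^(?P<file>[^:]+): Resolved URL: (?P<url>\S+)$")

def empty_param_lines(log_text, param="mscssid"):
    """Return (scrapeable file, URL) pairs whose resolved URL carries
    the given query parameter with an empty value."""
    hits = []
    for line in log_text.splitlines():
        match = RESOLVED.match(line.strip())
        if not match:
            continue
        # keep_blank_values=True so that "mscssid=" is reported, not dropped.
        query = parse_qs(urlsplit(match.group("url")).query,
                         keep_blank_values=True)
        if query.get(param) == [""]:
            hits.append((match.group("file"), match.group("url")))
    return hits

sample_log = """\
THE_SITE_login: Resolved URL: http://www.THE_SITE.com/xt_shopper_lookup.asp
THE_SITE_login: Sending request.
THE_SITE_product: Resolved URL: http://www.THE_SITE.com/product.asp?productcode=015876&quantity=0&mscssid=&pai=
"""
hits = empty_param_lines(sample_log)
```

Only the named parameter is checked, because other parameters (such as pai in these logs) are empty even when a scrape succeeds.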

We use lazy scrapes extensively in our own work, so I'm pretty confident they're working as they should. However, it's also certainly possible that there are bugs or issues that we haven't run across.

Kind regards,

Todd Wilson