Redirect failing

Hi there

Any help with this problem would be much appreciated :)

A URL that I am scraping produces a redirect, which SS then attempts to follow:

Vebra Property Details: Resolved URL: http://www.vebra.com/home/search/vdetails.asp?src=vebra&fd=0&bd=1&db=1&c...
Vebra Property Details: Sending request.
Vebra Property Details: Redirecting to: http://212.50.188.107/cgi-win/vebra.cgi?details1?src=vebra&PropertyCode=...
Vebra Property Details: An HTTP error occurred while connecting to 'http://212.50.188.107/cgi-win/vebra.cgi?details1?src=vebra&PropertyCode=1007003/ASHGR/38878/3'. The message was: Unable to parse header: HTTP/1.0 200 OK.

I have manually pasted both the first URL and the redirect URL into my browser, and both load successfully. However, it seems that SS is not able to follow the redirect. Is there anything I can do to check what might be the problem?

Many thanks
cbs7

Redirect failing

Hi Todd,

Thanks a lot for your reply. I did as you suggested and my scraping session works fine now.

Regards,
Hemanth

Redirect failing

This is actually a different issue. If you view the source on this page (just stop it from loading before it redirects):

here

You'll see this snippet of JavaScript, which is handling the redirect:

function redirige(){
strUrl="/infogreffe/listeRegComSimple.do";
window.setTimeout('location.replace(strUrl)', 0);
}

screen-scraper will follow HTTP redirects (e.g., 302 responses), but it won't parse an HTML file looking for JavaScript methods that would cause a redirect. In this case you'll just need to create another scrapeable file with this URL:

http://www.infogreffe.fr/infogreffe/listeRegComSimple.do
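
If it helps to see where that URL comes from, here is a rough Python sketch (not screen-scraper code) that pulls the relative path out of the script and resolves it against the page's address. The function name `find_js_redirect`, the regular expression, and the shortened `page_url` query string are my own assumptions for illustration; the pattern only handles a redirect written the way this page writes it.

```python
import re
from urllib.parse import urljoin


def find_js_redirect(page_url, html):
    """Return the absolute URL assigned to strUrl in the page's script, or None.

    Hypothetical helper: it only recognizes the strUrl="..." form used on this page.
    """
    match = re.search(r'strUrl\s*=\s*["\']([^"\']+)["\']', html)
    if match is None:
        return None
    # Resolve the site-relative path against the URL of the page that contained it.
    return urljoin(page_url, match.group(1))


# The JavaScript snippet quoted above, as it appears in the page source.
html = '''
function redirige(){
strUrl="/infogreffe/listeRegComSimple.do";
window.setTimeout('location.replace(strUrl)', 0);
}
'''

# Assumed page address: the search URL with its query string shortened for readability.
page_url = "http://www.infogreffe.fr/infogreffe/attenteRechercheSimple.xml?step=start"

target = find_js_redirect(page_url, html)
print(target)  # http://www.infogreffe.fr/infogreffe/listeRegComSimple.do
```

The resolved address is the one to put in the second scrapeable file.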

Todd

Redirect failing

Hi,

I am trying to scrape data from http://www.infogreffe.fr/. Here too, I find a similar problem. For example, when I attempt to search for a company named panconsult finance corporation, I can paste the following URL in my browser: http://www.infogreffe.fr/infogreffe/attenteRechercheSimple.xml?step=start&search=rcs_simple&oups=/listeRegComSimple.do&denomination=panconsult+finance+corporation&commune=&departement=&numeroRcs=
and it works. But when I try the same using Screen Scraper, it does not follow the redirect but stops at the page that says "Searching....". Are we talking about the same problem here, or should I post this as a new thread?

Thanks and regards,
Hemanth

Redirect failing

Hi,

The tricky thing is that this isn't a bug directly within screen-scraper--it's in HttpClient, which screen-scraper simply links to. The trouble is that the server is issuing an invalid HTTP response:

HTTP/1.1 200 OK
Server: Microsoft-IIS/4.0
Date: Tue, 08 Aug 2006 16:31:46 GMT
HTTP/1.0 200 OK
Content-type: Text/HTML

Note the repeated "HTTP" status line. I submitted the issue to the HttpClient list, and this was the response I received:

'We have seen something similar a couple of years ago. This kind of problem is not that uncommon, especially in HTTP responses generated by CGI scripts. As far as I remember the argument was all about "common browsers tolerate such protocol violations", which I personally do not find very convincing.'

It doesn't sound like they're very inclined to address the issue. Unfortunately, this leaves only one option I can think of--fork HttpClient and implement the fix ourselves. I really have no desire to do this, however, because we would need to reapply that fix to every subsequent version of HttpClient they release. I'll keep the conversation going with them, though. Maybe they'll see the light and provide a fix. Thanks for your patience in the meantime.
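
For anyone curious what "tolerating" this violation amounts to, here is a minimal Python sketch of a lenient header parser. It is not HttpClient code and not a proposed patch; the function name `parse_response_head` and the regular expression are assumptions made for illustration. It keeps the first status line and simply skips any later line that looks like another status line.

```python
import re

# A line that looks like an HTTP status line, e.g. "HTTP/1.0 200 OK".
STATUS_LINE = re.compile(r"^HTTP/\d\.\d \d{3}")


def parse_response_head(raw_head):
    """Parse a status line and headers, ignoring stray repeated status lines."""
    lines = raw_head.splitlines()
    if not lines or not STATUS_LINE.match(lines[0]):
        raise ValueError("missing status line")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        if not line:
            break  # a blank line ends the header section
        if STATUS_LINE.match(line):
            continue  # tolerate the violation: skip the duplicate status line
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"malformed header line: {line!r}")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers


# The response head quoted above, with the repeated status line in the middle.
raw = (
    "HTTP/1.1 200 OK\r\n"
    "Server: Microsoft-IIS/4.0\r\n"
    "Date: Tue, 08 Aug 2006 16:31:46 GMT\r\n"
    "HTTP/1.0 200 OK\r\n"
    "Content-type: Text/HTML\r\n"
    "\r\n"
)

status, headers = parse_response_head(raw)
print(status)
print(headers["content-type"])
```

A strict parser would stop at the fourth line because it has no colon-separated name and value, which is consistent with the "Unable to parse header: HTTP/1.0 200 OK" message in the log.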

Kind regards,

Todd

Redirect failing

Thanks for the reply Todd. How can I submit a bug for Screen Scraper? Do I just send an email?

Many thanks
cbs7

Redirect failing

Hi,

Well, this is certainly a rare event--the underlying library we use to handle HTTP in screen-scraper (http://jakarta.apache.org/commons/httpclient/) is simply unable to parse the response from this web server. The difficulty is that the code is outside our control. HttpClient is one of the most robust HTTP libraries I'm aware of, so it's surprising that it would have difficulty with this site. Our best bet would be to submit a bug or attempt a patch, but I can't make any guarantees as to when a fix would be available. Sorry we can't be more help on this one.

Kind regards,

Todd Wilson