Robot Blocking

Hi,

I'm trying to scrape hospital data from the national Blue Cross/Blue Shield site (provider.bcbs.com), and it looks like they're using some kind of bot system to prevent automatic extractions. Has anyone run across anything like this before?

Thanks.

John

That sort of thing happens

That sort of thing happens sometimes. There are only so many things they can do, though. Do you have any more info on the techniques they're employing? I might be able to help you avoid them.

Robot Blocking

Not really... just what the HTML shows:

<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />
<meta http-equiv="cache-control" content="max-age=0" />
<meta http-equiv="cache-control" content="no-cache" />
<meta http-equiv="expires" content="0" />
<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />
<meta http-equiv="pragma" content="no-cache" />
<meta http-equiv="refresh" content="10; url=/distil_r_captcha.html?Ref=/Landing/GetSpecialties?alphabet=H&distil_RID=D23AC572-2A54-11E5-872A-DACC5A502F37&distil_TID=20150714181909" />
<script type="text/javascript">
(function(window){
    try {
        if (typeof sessionStorage !== 'undefined'){
            sessionStorage.setItem('distil_referrer', document.referrer);
        }
    } catch (e){}
})(window);
</script>

The tag <meta

The tag

<meta http-equiv="refresh" content="10; url=/distil_r_captcha.html?Ref=/Landing/GetSpecialties?alphabet=H&distil_RID=D23AC572-2A54-11E5-872A-DACC5A502F37&distil_TID=20150714181909" />

is a redirect, but screen-scraper isn't following it because of the delay.
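Since the scraper won't follow a delayed meta refresh on its own, one option is to pull the target URL out of the tag yourself and request it directly. A minimal sketch in Python (not screen-scraper specific; the `html` variable and regex are illustrative, and in a real session you'd join the path to the site's base URL and reuse the session cookies):

```python
# Extract the delayed redirect URL from a meta-refresh tag with a regex,
# so it can be requested directly instead of waiting for the browser refresh.
import re

html = '<meta http-equiv="refresh" content="10; url=/distil_r_captcha.html?Ref=/Landing/GetSpecialties" />'

match = re.search(
    r'http-equiv="refresh"\s+content="\d+;\s*url=([^"]+)"', html, re.IGNORECASE
)
if match:
    redirect_path = match.group(1)
    # In a real scrape, join this with the base URL (urllib.parse.urljoin)
    # and fetch it with the same cookies as the original request.
```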

If you follow that link, there is a CAPTCHA.

There is a way to fill in a CAPTCHA and submit it, but it might be best to try to use some pauses to avoid triggering the CAPTCHA. Would pauses and/or changing the user-agent string be possible?

Pauses

I can certainly add pauses, but I'm not sure where to put them since the redirect URL never appears in the proxy session list. Is it just a matter of adding them at various points and seeing if it changes the response?

Regarding the user agent string, are there specific changes that I should implement?

Thanks for your help thus far.

For pauses, you might want to

For pauses, you might want to pause on every request. You could use sutil.randomPause() with values from 1500 to 5000 milliseconds and see how that does.
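Outside of screen-scraper, the same idea is easy to reproduce. A sketch in Python (the function name and defaults are my own; in screen-scraper itself you'd call sutil.randomPause from a script):

```python
# Randomized pause between requests: sleeping a variable interval looks less
# mechanical than a fixed delay and is harder for bot detection to fingerprint.
import random
import time

def random_pause(min_ms: int = 1500, max_ms: int = 5000) -> float:
    """Sleep for a random interval between min_ms and max_ms milliseconds."""
    delay = random.uniform(min_ms, max_ms) / 1000.0
    time.sleep(delay)
    return delay
```

Calling it once before each request (rather than only between pages) matches the "pause on every request" suggestion above.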

I don't know of a particular user-agent string to use. Just rotating through various ones makes it a little less obvious that you're a scraper.
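Rotating through user-agent strings can be sketched like this in Python (the specific strings below are just examples of common browser UAs, not values the site is known to accept):

```python
# Pick a different User-Agent header for each request so traffic doesn't
# present a single, uniform client signature.
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
]

def pick_headers() -> dict:
    """Build request headers with a randomly chosen User-Agent string."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

In screen-scraper the equivalent would be setting the user-agent HTTP header on the scrapeable file before each request.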