NullPointerException for certain HTML Pages

I have to start with the customary "It's a great product" and IMHO it is ;-)

I have a strange issue with a scraping session. I'm scraping details from a real estate site. The setup is as follows:

A scraping session with one scrapeable file. A script attached to the scraping session executes before the session begins. The script loops through each postcode (zip code) and calls the scrapeable file with the appropriate postcode in the URL. While the scrapeable file runs, I extract the relevant data and call another script to write it to an Access DB (after each pattern application).

The problem is this works 99.9% of the time ;-)

On some pages the scraping session fails with a NullPointerException (see below). The interesting points are:

* It always happens on exactly the same pages (for example postcode 2568 or 811)
* If I modify my start script to begin at one of the problem postcodes, it errors immediately (so it's not a memory or recursion issue).
* If I create a proxy session and watch the HTTP sequence, everything looks fine (the responses look normal and the page is retrieved correctly).
* The stack trace seems to indicate it's something to do with the tidy/clean process. Even when I disable tidying the HTML in the options menu, the problem still happens, albeit quicker ;-) (so I may be wrong about the stack).
* If I look at the Last Request tab, I can see data, but the Last Response tab is empty.
* I updated SS to 3.0 and MS Script to 5.6...same issues.

Basically I'm scratching my head. Looking for some advice from someone with some product experience.

Starting scraper.
Running scraping session: Realestate.com.au
Processing scripts before scraping session begins.
Processing script: "Start RealEstate.com.au - VB"
***##########################################################***
***Starting scrape for 2568
***##########################################################***
Scraping file: "Scan Results"
Scan Results: Preliminary URL: ~#URL#~
Scan Results: Resolved URL: http://www.realestate.com.au/cgi-bin/rsearch?cu=fn-rea&a=qfp&q=Go&t=res&id=2568&o=d&p=50
Scan Results: Sending request.
Realestate.com.au: An error occurred while processing the script: Start RealEstate.com.au - VB
Realestate.com.au: The error message was: Scripting engine failure
Courtesy of Java: method name:scrapeFile: 28:3
Java Exception: class com.ibm.bsf.BSFException Target method exception(java.lang.NullPointerException) message is: nullstack tracejava.lang.NullPointerException
at org.w3c.tidy.Clean.cleanNode(Unknown Source)
at org.w3c.tidy.Clean.createStyleProperties(Unknown Source)
at org.w3c.tidy.Clean.createStyleProperties(Unknown Source)
at org.w3c.tidy.Clean.createStyleProperties(Unknown Source)
at org.w3c.tidy.Clean.createStyleProperties(Unknown Source)
at org.w3c.tidy.Clean.createStyleProperties(Unknown Source)
at org.w3c.tidy.Clean.createStyleProperties(Unknown Source)
at org.w3c.tidy.Clean.createStyleProperties(Unknown Source)
at org.w3c.tidy.Clean.createStyleProperties(Unknown Source)
at org.w3c.tidy.Clean.createStyleProperties(Unknown Source)
at org.w3c.tidy.Clean.cleanTree(Unknown Source)
at org.w3c.tidy.Tidy.parse(Unknown Source)
at com.screenscraper.util.General.tidyHTML(General.java:1618)
at com.screenscraper.scraper.ScrapeableFile.scrapeData(ScrapeableFile.java:3593)
at com.screenscraper.scraper.ScrapeableFile.scrape(ScrapeableFile.java:2126)
at com.screenscraper.scraper.ScrapingSession.scrapeFile(ScrapingSession.java:2262)
at com.screenscraper.scraper.ScrapingSession.scrapeFile(ScrapingSession.java:2299)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at com.ibm.bsf.engines.activescript.JavaBean.callMethod(JavaBean.java:254)
at com.ibm.bsf.engines.activescript.ActiveScriptEngine.callMethod(ActiveScriptEngine.java:868)
at com.ibm.bsf.engines.activescript.ActiveScriptEngine.nativeEval(Native Method)
at com.ibm.bsf.engines.activescript.ActiveScriptEngine.exec(ActiveScriptEngine.java:760)
at com.ibm.bsf.BSFManager.exec(BSFManager.java:479)
at com.screenscraper.scraper.ScriptContext$ScriptRunner.run(ScriptContext.java:319)

(scode=0x80020009 wcode=0x0)

Processing scripts after scraping session has ended.
Scraping session finished.

This is the code that calls the scrapeable file from the scraping session:

' Open the Access database that holds the list of postcodes
Set MyConnPCODE = CreateObject("ADODB.Connection")
MyConnPCODE.Open "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=D:\Documents and Settings\HomePC\My Documents\RealEstate\screenscraper.mdb"
Set rs = MyConnPCODE.Execute("SELECT * FROM Postcodes")

rs.MoveFirst
OldPostcode = ""
NewPostcode = ""

Do Until rs.EOF
    NewPostcode = rs("POSTCODE")
    ' Debugging override: force the postcode to a known problem postcode
    NewPostcode = 2568
    If NewPostcode <> OldPostcode Then
        ' Create the URL
        URL = "http://www.realestate.com.au/cgi-bin/rsearch?cu=fn-rea&a=qfp&q=Go&t=res&id=" & NewPostcode & "&o=d&p=50"

        ' Make the URL and postcode available to the session
        session.setVariable "URL", URL
        session.setVariable "POSTCODE", NewPostcode

        ' These lines just write a banner to the log and are otherwise unneeded.
        session.log(" ***##########################################################***")
        session.log(" ***Starting scrape for " & NewPostcode)
        session.log(" ***##########################################################***")

        ' Finally scrape the file
        Call session.scrapeFile("Scan Results")
        OldPostcode = NewPostcode
    End If
    rs.MoveNext
Loop

' Release the database resources once the loop is done
rs.Close
MyConnPCODE.Close

Appreciate any suggestions anyone can offer.
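
A possible stopgap while the root cause is investigated: the loop in the script above could trap the failure so that one bad postcode does not abort the whole run. The following is a minimal sketch, not a confirmed fix. It assumes the Java exception surfaces in VBScript as an ordinary runtime error (the scode=0x80020009 in the log suggests this but does not prove it), and it reuses the session object, the NewPostcode variable, and the "Scan Results" scrapeable file name from that script. It is a fragment meant to replace the scrapeFile call inside the loop, so it only runs inside screen-scraper.

```vbscript
' Sketch only: replaces the single scrapeFile call inside the Do...Loop.
' Assumes session and NewPostcode are set up as in the original script.

' Trap runtime errors instead of letting them abort the script
On Error Resume Next
Call session.scrapeFile("Scan Results")

If Err.Number <> 0 Then
    ' Record the failing postcode so it can be re-checked later
    session.log("Scrape failed for postcode " & NewPostcode & _
        ": 0x" & Hex(Err.Number) & " " & Err.Description)
    Err.Clear
End If

' Restore normal error handling for the rest of the loop body
On Error GoTo 0
```

Whether the scraping session continues cleanly after a failed scrapeFile call has not been verified here, so the log should be checked on the first run.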

NullPointerException for certain HTML Pages

Hi,

The ability to turn tidying on and off is actually a feature we decided to strip from the Basic Edition.

I've tried replicating the issue on my side, and I'm not having much luck. What version of screen-scraper are you running? If it's not 3.0, would you mind trying that? If it is 3.0, would you mind sending me your scraping session so that we can investigate a bit more closely? My email address is my first name at screen-scraper.com.

Thanks,

Todd

NullPointerException for certain HTML Pages

Todd & fnirt, thanks for taking the time to post some suggestions.

I installed the professional version as suggested. I tested it with tidyHTML enabled and disabled. With it enabled I would still see the exception error. With it disabled it would complete fine, although my extractor patterns didn't hit any matches (for obvious reasons).

So obviously it's a problem with the tidyHTML feature in combination with some specific pages that I'm scraping. Any idea why disabling tidyHTML in the Basic Edition doesn't give me any joy?

Appreciate your time...

MS ~©¿©~

NullPointerException for certain HTML Pages

On a loosely related note:

We had one particular scrape which was just plain barfing, but every other instance of that scrape worked just fine. Turned out the site was putting out some nonstandard characters and turning htmltidy off prevented the problem from happening in the future. I think it's a peculiarity of htmltidy, not screen-scraper.

So as a rule, when I get strange behavior, turning off htmltidy is one of my first diagnostic moves.

Sorry I don't have details on my specific instance, but I wanted to throw our experience in.
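
One way to test the nonstandard-character theory on a problem page is to save the raw response to disk and scan it for characters outside printable ASCII. The script below is a rough sketch under stated assumptions: the file path is hypothetical, the file is read using the system's default ANSI code page (so a multi-byte character may show up as several suspicious characters rather than one), and it is meant to be run standalone with cscript rather than inside screen-scraper.

```vbscript
Option Explicit

' Hypothetical path to a saved copy of the problem page
Const SAVED_PAGE = "C:\temp\postcode-2568.html"
Const ForReading = 1

Dim fso, stream, html, i, code, found

' Read the whole saved page into a string
Set fso = CreateObject("Scripting.FileSystemObject")
Set stream = fso.OpenTextFile(SAVED_PAGE, ForReading)
html = stream.ReadAll
stream.Close

found = 0
For i = 1 To Len(html)
    code = AscW(Mid(html, i, 1))
    ' AscW returns a signed 16-bit value, so normalise negatives
    If code < 0 Then code = code + 65536

    ' Flag control characters (except tab, LF, CR) and anything above "~"
    If (code < 32 And code <> 9 And code <> 10 And code <> 13) Or code > 126 Then
        WScript.Echo "Position " & i & ": character code " & code
        found = found + 1
    End If
Next

WScript.Echo found & " suspicious character(s) found."
```

Finding such characters would not prove that they trigger the tidy failure; it only narrows down where to look. Comparing the output for a failing postcode against a working one is likely to be more informative than either result alone.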

NullPointerException for certain HTML Pages

Hi,

I can't say that I've seen that one before, but let's see if we can't get to the bottom of it. It definitely appears to be related to the tidying process. First, are you using the Basic Edition of screen-scraper? If so, what would you think about temporarily installing the Professional Edition (you can install both the Basic and the Professional editions on the same machine, just don't run them simultaneously)? If you do that, then import the scraping session from your first instance, you'll find under the "Advanced" tab for your scrapeable file a check box that will allow you to disable tidying. What happens if you un-check that box, then try re-running your scraping session?

Kind regards,

Todd Wilson