Test for existence of warning message?

Is there a way within a script to test for the existence of a warning message in a scrape session?

The warning that interests me most at this moment is
[b]Warning! The operation timed out while applying the extractor pattern.[/b]

In an ideal world it would be great to know which extractor/sub-extractor pattern timed out, but here in the pseudo-real-world, just being able to branch when the warning message surfaces would help a lot.

TIA.

Dave Nuttall
San Antonio, TX

Test for existence of warning message?

Dave,

It's possible you're encountering a bug in 3.0. I'm not able to replicate it using a copy of 3.0 but I'm just using a test file. Could you either private message or email me (my name & last initial @ our domain) just the HTML of the page you're having this happen on and if possible the scrapeable file containing your extractor patterns?

Of course, this may very well be fixed in a later version. Have you tried downloading the professional edition and updating the version? If not, I strongly recommend you try. If you have discovered a legitimate bug in the basic version, I will see what I can do so that you won't have to live with the bug.

Thanks,
Scott

Test for existence of warning message?

[quote="swilsonmc"]..SNIPPED
This may seem extreme but I don't see why it wouldn't work. Why not try this for your extractor pattern text
<html~@DATARECORD@~</html>
[/quote]

My pattern is almost that broad, in fact it is
<body><div class="BodyWindowDcCivil">~@DATARECORD@~</div></body>

In my experience, when I have 2 or more sub-extractor patterns with the above as the starting point, if NONE of the tokens match, then it ignores them and continues. But if a sub-extractor has 2 or more tokens and one doesn't match, then the whole thing fails.
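To make that all-or-nothing behavior concrete, here is a minimal sketch in plain Python regular expressions. It is only an analogy for how a multi-token pattern behaves, not screen-scraper's own engine, and the markup, token names, and helper functions are all made up for illustration.

```python
import re

# Hypothetical fragment that has a first address line but no second one.
FRAGMENT = '<td class="addr1">100 Main St</td>'

# One pattern carrying two tokens: both pieces must be present,
# otherwise the pattern as a whole fails and nothing is captured.
COMBINED = re.compile(
    r'<td class="addr1">(?P<ADDR1>.*?)</td>\s*'
    r'<td class="addr2">(?P<ADDR2>.*?)</td>',
    re.DOTALL,
)

# The same two tokens, one per pattern, so each can fail on its own.
SPLIT = [
    re.compile(r'<td class="addr1">(?P<ADDR1>.*?)</td>', re.DOTALL),
    re.compile(r'<td class="addr2">(?P<ADDR2>.*?)</td>', re.DOTALL),
]


def apply_combined(text):
    """Return the captured tokens, or an empty dict if the pattern fails."""
    match = COMBINED.search(text)
    return match.groupdict() if match else {}


def apply_split(text):
    """Apply each single-token pattern independently and merge the hits."""
    found = {}
    for pattern in SPLIT:
        match = pattern.search(text)
        if match:
            found.update(match.groupdict())
    return found


if __name__ == "__main__":
    print(apply_combined(FRAGMENT))
    print(apply_split(FRAGMENT))
```

With one token per pattern, the first address line survives even though the second is absent.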

Test for existence of warning message?

Dave,

Your timing out issue should be solvable. In fact, because you're using sub-extractors that don't always need to match you can be very liberal with how you set up the DATARECORD token. This may seem extreme but I don't see why it wouldn't work. Why not try this for your extractor pattern text:

<html~@DATARECORD@~</html>

Now, the only way this would time out would be because of a connection error, not anything related to your extractor pattern. This would work even if you have other extractor patterns for this scrapeable file (it's OK to overlap HTML between different extractor pattern text). If some of your sub-extractor patterns sometimes don't match, you'll need to look at them individually for how well placed they are in the HTML and how robust your regular expressions are. But if you use this approach, you should never get a time-out error due to this pattern text not matching.
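The idea can be sketched roughly as follows. This is plain Python standing in for the pattern engine, so treat it as a model of the approach rather than what screen-scraper does internally. The sample page, the field names, and the `extract` helper are invented for the example.

```python
import re

# Hypothetical page; the markup and field names are invented for illustration.
SAMPLE_HTML = """<html><body>
<span class="case">2007-CI-01234</span>
<span class="addr1">100 Main St</span>
</body></html>"""

# Outer pattern: a regex analogue of <html~@DATARECORD@~</html>.
OUTER = re.compile(r"<html(.*?)</html>", re.DOTALL)

# Each stand-in "sub-extractor" has one token and is applied independently.
SUB_PATTERNS = {
    "CASE_NUMBER": re.compile(r'<span class="case">(.*?)</span>', re.DOTALL),
    "ADDRESS_1": re.compile(r'<span class="addr1">(.*?)</span>', re.DOTALL),
    "ADDRESS_2": re.compile(r'<span class="addr2">(.*?)</span>', re.DOTALL),
}


def extract(html):
    """Return a dict of field values, with None for fields that did not match.

    Returns None only when the broad outer pattern itself does not match.
    """
    outer = OUTER.search(html)
    if outer is None:
        return None
    region = outer.group(1)
    record = {}
    for name, pattern in SUB_PATTERNS.items():
        match = pattern.search(region)
        record[name] = match.group(1).strip() if match else None
    return record


if __name__ == "__main__":
    print(extract(SAMPLE_HTML))
```

Because the outer pattern is so broad, the only values that come back empty are the ones whose own sub-pattern was missing from the page.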

-Scott

Test for existence of warning message?

[quote="swilsonmc"]Dave,

I'd recommend taking a closer look at the option of using sub-extractors because for one given extractor pattern using the special DATARECORD token you can have any number of sub-extractor patterns that may or may not match - allowing, in theory, to have all four scenarios under one extractor pattern that would store the data that matched and not the data that didn't. Then, you could use the data that didn't match (checking for certain null values) as triggers for behaviors after the extractor pattern has been applied.
[/quote]

That's approximately (or maybe "exactly") as I have it. The real issue is that when one sub-extractor FAILS, anything its sibling sub-extractor patterns found is LOST when it times out.

Test for existence of warning message?

Dave,

I'd recommend taking a closer look at the option of using sub-extractors because for one given extractor pattern using the special DATARECORD token you can have any number of sub-extractor patterns that may or may not match - allowing, in theory, to have all four scenarios under one extractor pattern that would store the data that matched and not the data that didn't. Then, you could use the data that didn't match (checking for certain null values) as triggers for behaviors after the extractor pattern has been applied.
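A minimal sketch of that "missing values as triggers" idea, in plain Python. The field names, the action labels, and both helper functions are hypothetical. In a real session the values would come from whatever variables your extractor saved.

```python
# Hypothetical field names for the litigation example.
REQUIRED = ("PLAINTIFF", "DEFENDANT")
OPTIONAL = ("PLAINTIFF_ATTORNEY", "DEFENDANT_ATTORNEY")


def missing_fields(record):
    """List the fields that are absent, None, or empty in the record."""
    return [name for name in REQUIRED + OPTIONAL if not record.get(name)]


def next_action(record):
    """Pick a follow-up action based on which values did not match."""
    missing = missing_fields(record)
    if any(name in REQUIRED for name in missing):
        return "retry"          # core data absent: queue the page again
    if missing:
        return "write_partial"  # only optional data absent: keep what matched
    return "write_full"         # everything matched


if __name__ == "__main__":
    two_party_case = {"PLAINTIFF": "A. Smith", "DEFENDANT": "B. Jones"}
    print(missing_fields(two_party_case), next_action(two_party_case))
```

The point is that a two-party case and a four-party case go through the same code path, and the empty values decide what happens next.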

Otherwise you may be juggling session variables in a complicated and possibly unnecessary way. However, to answer your last question, yes, your 4 and 5 scripts will have access to the values matched in previous extractors so long as those values are being saved as session variables.

Please let me know if I can answer any questions about my suggestion.

Thanks,
Scott

Test for existence of warning message?

[quote="swilsonmc"]Dave,
Your situation does sound like it could be a common one. I'm having a hard time imagining the specifics of the HTML you're dealing with, though.
[/quote]

Let's try this for grins.

The "page" is the net result of what is likely a query to a multi-table database.

It is litigation data so it has "constants" such as a case-number, court, etc.

The number of "related people" varies from 2 (plaintiff/defendant minimum) to 4 (plaintiff/defendant plus an attorney for each)

The host produces the list apparently from a child table related to the baseline case information, so the HTML (actually it's XHTML) is the result of stepping through the array from the child table. Some cases have two, so it opens and closes two "related people" displays, then starts the next section of the page.

The layout is identical with an


between the individuals.

If it didn't time-out when it wants to find #3 or #4, that would be fine because then the script after extraction would write all the values to the data file.
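For illustration only, this is how a repeating block could be collected when the count varies from 2 to 4. It is plain Python regex, and since I can't post the real page, the markup, class names, and the `extract_parties` helper below are all invented stand-ins.

```python
import re

# Invented markup: one block per "related person".
TWO_PARTIES = """
<div class="party"><span class="role">Plaintiff</span>
<span class="name">A. Smith</span></div>
<div class="party"><span class="role">Defendant</span>
<span class="name">B. Jones</span></div>
"""

FOUR_PARTIES = TWO_PARTIES + """
<div class="party"><span class="role">Plaintiff Attorney</span>
<span class="name">C. Gray</span></div>
<div class="party"><span class="role">Defendant Attorney</span>
<span class="name">D. Brown</span></div>
"""

# One pattern describing a single block; it is applied as often as it matches.
PARTY_BLOCK = re.compile(
    r'<div class="party">\s*<span class="role">(?P<role>.*?)</span>\s*'
    r'<span class="name">(?P<name>.*?)</span>\s*</div>',
    re.DOTALL,
)


def extract_parties(html):
    """Return a (role, name) tuple for every party block found, in page order."""
    return [
        (m.group("role").strip(), m.group("name").strip())
        for m in PARTY_BLOCK.finditer(html)
    ]


if __name__ == "__main__":
    print(len(extract_parties(TWO_PARTIES)), len(extract_parties(FOUR_PARTIES)))
```

Nothing in this version ever goes looking for a #3 or #4 that isn't there. It just takes however many blocks the page provides.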

It almost is starting to sound like I need to have one pattern with the whole page DATARECORD to extract #1 and #2 names, then another "major" extractor of the same page but look for #3, and yet another to look for #4.

In theory I guess that means I should be able to have a fourth major extractor and run a script that accumulates from all four sessions. Since I'm always using session variables and initializing them as null in my initialization script, the only thing I don't know if it will work is if one script attached to number 4 or 5 extractor will be able to collect from the previous patterns.

In my initial thought, I would have simply triggered a reextraction script when the error message was trapped.

To my 3rd generation programming brain, that's weird!

Thanks for your patience!
Dave

Test for existence of warning message?

Dave,

Your situation does sound like it could be a common one. I'm having a hard time imagining the specifics of the HTML you're dealing with, though.

It's OK to cast a wide net when you're using the DATARECORD token and sub-extractors. If you can capture the data you need with the right triggers in place, I would suggest just increasing the amount of HTML that the DATARECORD token captures to include this one area that comes and goes, as well as its more consistently available neighbors.

So, if possible, don't have the one DATARECORD token standing alone waiting to fail. Try to integrate it with others around it because, remember, it's ok if sub-extractors don't match so long as the DATARECORD does.

Needing to speak in such abstracts is why seeing the session would be more helpful.

I hope this brief talk on theoretical sub-extractor scenarios helps.

-Scott

Test for existence of warning message?

[quote="swilsonmc"]...snipped... a method for handling a timeout event on extractor patterns, unfortunately. However, I can offer to take a look at your scraping session to see if there is a solution to the timeout issue you're having.[/quote]

Thanks for the offer, Scott. My preference is to see if I can learn enough to solve the problem.

The extractor pattern fails when the expected section of a page (DATARECORD) does not exist, although it does exist in at least 60-70% of the scraped pages.

When a particular sub-extractor pattern fails, the overall session never hits the script that is geared to "run after extraction", which means whatever data we DID capture is lost for that effort unless we go back and hit it again with a less robust pattern matcher.

I would think this would be relatively common. My experience in life usually points to the "fact" that I'm seldom the first and probably not the last to experience any particular problem.

SO what do SS gurus do when confronted with that situation?

Thanks.
Dave

Test for existence of warning message?

Dave,

Sorry for leaving you in the lurch. There isn't enough demand to add a method for handling a timeout event on extractor patterns, unfortunately. However, I can offer to take a look at your scraping session to see if there is a solution to the timeout issue you're having.

Would you mind sending your scrape to me either by posting it on a server somewhere or emailing it to my name plus last initial w @ our domain?

Let me know what version of screen-scraper you're running and be sure to specifically export any scripts that are not being called from either the scraping session itself or by a scrapeable file as they won't export when using the scraping session export option. This is usually any scripts that are called to start a scrape from a batch file or shell script or any scripts executed from within another script.

Thanks,
Scott

Test for existence of warning message?

Around noon on June 27, 2007, Scott seems to have said
[quote="swilsonmc"]Dave,

You raise a good point that if we're able to log the event of an extractor pattern timing out why can't we create a method for trapping it. I will bring this up here internally and see if we can't include it in a method for later release.

-Scott[/quote]

Did anything come out of the internal discussions?

My scrapes are OK so long as all the expected data is present, but in cases where a page sometimes has four sub-extractor patterns and sometimes only three, it ends up requiring two passes at the scrape to get what's there because when an extractor pattern fails, the script to execute after the scrape never gets executed.

Please advise at your earliest convenience, since I was asked explicitly to use the forums instead of communicating directly.

Thanks.
Dave

Test for existence of warning message?

[quote="swilsonmc"]Dave,

You raise a good point that if we're able to log the event of an extractor pattern timing out why can't we create a method for trapping it. I will bring this up here internally and see if we can't include it in a method for later release.

-Scott[/quote]

I noticed some brief notes that Todd posted a while back regarding [b]scrapeFile.noExtractorPattern()[/b] but I haven't really tried to figure out if that is the way to go or perhaps as you suggest you folks in the "think tank" can do better!

Thanks.
Dave

Test for existence of warning message?

Dave,

You raise a good point that if we're able to log the event of an extractor pattern timing out why can't we create a method for trapping it. I will bring this up here internally and see if we can't include it in a method for later release.

-Scott

Test for existence of warning message?

[quote="swilsonmc"]
Unfortunately there is not a way to trap and respond to that specific error. However, if an extractor pattern is sometimes timing out I would recommend working on refining the regular expressions you're using and/or reducing down any extraneous HTML within your Extractor text. Perhaps consider using Sub-extractor patterns if it makes sense to.

[/quote]

Thanks, Scott.
I know exactly which extractor pattern(s) are timing out because I've been forced to create a secondary scrape to capture what the first one fails to FIND.

The data item is a 2nd street address line. When it is used it appears on the page, but when it is not used the page gives no indication that it could have been there.

When that happens, the extractor/sub-extractor routines time out and the system proceeds to the next target.

When the "comprehensive" scrape is finished and we find there are targets which should have resulted in data, we simply build a new target list and scrape with the reduced extractor-expectations.

At a minimum, I was hoping to be able to "log" the targeted data (a 15-character string) so that such a log would be the input to the second pass at the site.
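What I have in mind is roughly the following, sketched in plain Python outside of screen-scraper. The file name, the sample IDs, and both helper functions are hypothetical. The only thing taken from my setup is that each target is a 15-character string.

```python
import os
import tempfile


def record_for_retry(target_id, retry_path):
    """Append one target ID to the retry file, one ID per line."""
    with open(retry_path, "a", encoding="utf-8") as handle:
        handle.write(target_id + "\n")


def load_retry_targets(retry_path):
    """Read the retry file back as a de-duplicated list, in first-seen order."""
    if not os.path.exists(retry_path):
        return []
    targets = []
    with open(retry_path, encoding="utf-8") as handle:
        for line in handle:
            target_id = line.strip()
            if target_id and target_id not in targets:
                targets.append(target_id)
    return targets


if __name__ == "__main__":
    # Demo in a throwaway directory; the IDs are made-up 15-character strings.
    path = os.path.join(tempfile.mkdtemp(), "retry_targets.txt")
    for target in ("2007CI012345678", "2007CI012345678", "2007CI087654321"):
        record_for_retry(target, path)
    print(load_retry_targets(path))
```

The list that comes back would be the target list for the second pass with the reduced extractor expectations.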

I assumed (incorrectly, obviously) that if the warning message shows up in the LOG of the scrape session, there "surely" would be a way to trap it. Oh well, just send me back to the Social Security Office to draw a paycheck instead of trying to do this "young man's work"!!!

Thanks.
Dave

Test for existence of warning message?

Dave,

Unfortunately there is not a way to trap and respond to that specific error. However, if an extractor pattern is sometimes timing out I would recommend working on refining the regular expressions you're using and/or reducing down any extraneous HTML within your Extractor text. Perhaps consider using Sub-extractor patterns if it makes sense to.

I'm sure folks on the forum would like to show off their Regex prowess if you could challenge us with some finicky code you're dealing with.

If you're not sure what code is causing your extractor patterns to time out you can be sure to capture every stitch of log entry for even the longest scrape jobs by running your scrape from the command line.

See how here.
http://www.screen-scraper.com/support/tutorials/tutorial2/interacting_with_screen-scraper_externally.php
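Once the full log is captured to a file, it can be searched after the fact for the warning text. Here is a minimal sketch in plain Python. It assumes your own scripts write a marker line such as `TARGET: <id>` to the log before each request. That marker, the sample log lines, and the `targets_with_timeouts` helper are hypothetical. Only the warning text itself comes from this thread.

```python
# Warning text as quoted earlier in this thread.
TIMEOUT_WARNING = "The operation timed out while applying the extractor pattern"

# Invented log excerpt; a real log will contain much more between these lines.
SAMPLE_LOG = [
    "TARGET: 2007CI012345678",
    "(other log output)",
    "Warning! The operation timed out while applying the extractor pattern.",
    "TARGET: 2007CI087654321",
    "(other log output)",
]


def targets_with_timeouts(log_lines, marker="TARGET: "):
    """Return the targets whose log section contains the timeout warning.

    A target's section runs from its marker line to the next marker line.
    Warnings that appear before any marker are ignored.
    """
    current = None
    timed_out = []
    for raw in log_lines:
        line = raw.rstrip("\n")
        if marker in line:
            current = line.split(marker, 1)[1].strip()
        elif TIMEOUT_WARNING in line and current is not None:
            if current not in timed_out:
                timed_out.append(current)
    return timed_out


if __name__ == "__main__":
    print(targets_with_timeouts(SAMPLE_LOG))
```

To run it against a real capture, pass an open file object instead of the sample list, since iterating a file yields its lines.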

For other times that you want to capture when there is either a non-200 HTTP response during a session or a connection time out you can utilize the following method (unfortunately, yes, this does not work for extractor pattern time outs).

http://www.screen-scraper.com/support/docs/api_documentation.php#wasErrorOnRequest

Please feel free to share any pesky code issues you're having. We've got a helpful community here.

Thanks,
Scott