more help translating &

I'm working on a scrape that extracts URLs. When I extract these links from a
results page, the & characters are all in &amp; format, which results in a "page not found" problem when I feed them back to scrape the page behind the link.

I found a link to an htmlparser in the forums, which I am calling in a script after each extraction of a URL. Unfortunately, it's not quite clear to me how to apply the parser. Currently, my code looks like this:

import org.htmlparser.util.Translate;
//get yourStr of text from screen-scraper
'&' = Translate.decode( &amp; );

However, I imagine that I need to tell screen-scraper to apply this to
the particular token and then save it again in the translated form? I'm not sure how to do this. Any tips are greatly appreciated!

Roger

more help translating &

Hi Todd,

Both of these solutions make perfect sense, although the first is much easier to implement in my current project because the URLs tend to vary in ways that I cannot yet predict. With this fix, my scrape is working great now.

Thanks. You are great and screen-scraper is a gem!

Roger

more help translating &

Hi,

Just to give you a bit of background first, the reason you see these &amp; entities is screen-scraper's tidier, which cleans up the HTML in order to facilitate extraction. You're correct that they can interfere at times, however.

There are two ways to approach dealing with them. One is to replace them in a script, as you're attempting to do. Assuming you've saved an extracted value in the session variable "MY_DATA" that might contain these entities, you could replace them like so:

session.setVariable( "MY_DATA", session.getVariable( "MY_DATA" ).replaceAll( "&amp;", "&" ) );
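
To see the same replacement outside of screen-scraper, here is a minimal sketch in plain Java. The class and method names (AmpersandFix, decodeAmpersands) are made up for this illustration, and it only handles the &amp; entity, not other HTML entities.

```java
public class AmpersandFix {
    // Hypothetical helper: turns the HTML entity &amp; back into a literal ampersand.
    static String decodeAmpersands(String url) {
        // String.replace performs a literal (non-regex) substitution.
        return url.replace("&amp;", "&");
    }

    public static void main(String[] args) {
        // A link as it might look after the HTML has been tidied.
        String extracted = "http://www.foo.com/mypage.php?foo=bar&amp;one=two";
        System.out.println(decodeAmpersands(extracted));
    }
}
```

In a screen-scraper script the input would come from session.getVariable and the result would be stored back with session.setVariable, as in the one-liner above.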

Alternatively, and this is the option we generally take, you could make your extractor patterns more precise for the links you're working with. For example, given the following link:

http://www.foo.com/mypage.php?foo=bar&amp;one=two

Rather than using an extractor pattern like this:

http://www.foo.com/mypage.php?~@QUERY_STRING@~

You might use a pattern like this:

http://www.foo.com/mypage.php?foo=~@BAR@~&amp;one=~@ONE@~

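As a side-note, it's generally also a good idea to attach regular expressions to those tokens. Usually an expression that matches anything except double quotes, such as [^"]*, works best, because it keeps a token from running past the end of the link's attribute value.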
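
The idea behind the more precise pattern can be sketched in plain Java with java.util.regex. This is only a rough analogue of an extractor pattern, not screen-scraper's own API: the class and method names are invented for the example, and excluding the ampersand as well as the double quote in each token's expression is my own assumption. Because each token captures only a parameter value, the URL can be rebuilt with literal & characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PreciseLinkPattern {
    // Rough regex analogue of foo=~@BAR@~&amp;one=~@ONE@~, with each token
    // limited to characters other than double quotes and ampersands.
    private static final Pattern LINK = Pattern.compile(
        "http://www\\.foo\\.com/mypage\\.php\\?foo=([^\"&]*)&amp;one=([^\"&]*)");

    // Returns a clean URL rebuilt from the captured values, or null if no link matches.
    static String rebuildUrl(String html) {
        Matcher m = LINK.matcher(html);
        if (!m.find()) {
            return null;
        }
        return "http://www.foo.com/mypage.php?foo=" + m.group(1) + "&one=" + m.group(2);
    }

    public static void main(String[] args) {
        // Tidied HTML in which the ampersand appears as an entity.
        String html = "<a href=\"http://www.foo.com/mypage.php?foo=bar&amp;one=two\">next</a>";
        System.out.println(rebuildUrl(html));
    }
}
```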

Kind regards,

Todd Wilson