Sub - extractor patterns

I need a little help - and I'm a real newbie! And coding and scripting is almost a foreign language . . . . so please first principles!

After I've created an extractor pattern from a section of code taken from a scraper session using the Last Response tab - I inserted the token ~@datarecord@~ and generated an Extracted Data list with all the extracted code I want to scrape with a number of sub-extractor patterns (or do I use just one sub-extractor pattern - insert one line of code and populate it with a number of tokens??)

I have then copied one of the value lines of code from the Extracted Data into a sub-extractor text field.

I have then inserted a token into this code - but when I click 'Apply Sub-Pattern to Last Scraped Data' - the Extracted Data window that pops up is the same as the window from the main Extractor function.

If I click the Add Extractor Pattern tab - and paste the code copied from the value line into the Extractor Pattern Text window - insert a token and click Apply to Last Data - I can extract exactly the info I'm after . .

Why can I extract the data values in the Extractor Pattern - but not after creating a datarecord ??? I believe I'm just missing something simple - but obviously vital!!

I'm endeavouring to extract a number of separate pieces of data out of individual contact records that display one below the other on the web page - but want to keep the extracted information together referenced by contact name - so I can write it out to a text file for importing into an excel spreadsheet for sorting.

Rather frustrating . . so any help and guidance would be greatly appreciated.

Sub - extractor patterns

Tamara,

We've deprecated the ~@IGNORE@~ tag because it matches too much. Try replacing the word IGNORE with, say, JUNK. Also, make sure your DATARECORD token picks up something more than just a trailing space as its last character. Include at least the next "<" or whatever non-whitespace character comes next.

- DATARECORD <br /><language id="EN">Centro Acuático Málaga 2008</language><br />C/ Miguel Mérida Nicolich, 2 29004 Málaga &#40;Malaga&#41;<br />Tel. +34 952 244 520 <

I hope this helps in time.

-Scott

I cannot get sub-extractor patterns to work

Hi,

I have been reading this topic over and over again. But still I have not been able to make a sub-extractor pattern work.

I have this Extractor pattern (DATARECORD is saved in a session variable, NONHTML -> reg expression non-html tags):

~@IGNORE@~~@IGNORE@~Useful information~@IGNORE@~~@DATARECORD@~

Results in Apply Pattern to Last Scraped Data:

- Sequence 0

- DATARECORD
Centro Acuático Málaga 2008
C/ Miguel Mérida Nicolich, 2 29004 Málaga (Malaga)
Tel. +34 952 244 520

- NONHTML id="DetailHeaderUserControl_FlagHeaderUserControl_lblUser" class="lblNombreWhite"

Now I want the phone number. So my sub-extractor pattern:

Tel~@IGNORE@~~@PHONE@~~@IGNORE@~

PHONE is saved in a session variable. Regular expression is "[\d|+| ]*". My intention for this expression is that the phone number can only contain numbers, "+" or spaces.
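
A side note on that regular expression: inside square brackets the "|" is an ordinary character rather than "or", and the trailing "*" also allows an empty match. The sketch below uses Python's re module purely as an illustration (it is not screen-scraper itself, though character classes behave the same way in Java regular expressions). The sample string is the tail of the DATARECORD quoted in this thread; the variable names are made up for the example.

```python
import re

record = "<br />Tel. +34 952 244 520 <"

# Inside [...] the "|" is a literal pipe, so this class also accepts pipes,
# and "*" lets the token match nothing at all.
loose = re.compile(r"[\d|+| ]*")

# Digits, "+" and spaces only; the trailing "+" requires at least one character.
strict = re.compile(r"[\d+ ]+")

print(loose.fullmatch("12|34") is not None)   # True: the pipe is accepted
print(strict.fullmatch("12|34") is not None)  # False: the pipe is rejected

# Anchor on "Tel. " and capture the number that follows it.
match = re.search(r"Tel\. ([\d+ ]+)", record)
phone = match.group(1).strip()
print(phone)
```

If the token should never be empty, the stricter form is probably the safer choice.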

I get the same window with "apply pattern to last scraped data" at the extractor pattern and at the sub-extractor pattern.

My task is due tomorrow. :-S

Thanks a lot,
Tamara Vos

Sub - extractor patterns

THANK YOU! I knew it had to be something that simple--and I was reading your instructions over and over again, thinking ~@DATARECORD@~ was just an example name and what made it a special token was the fact that it was separate from the other extractor patterns, and had sub-extractor patterns associated with it. Just a mis-reading of the language. Thank you so much!

Sub - extractor patterns

apm,

The one thing you're missing is the special token ~@DATARECORD@~. This token designates the HTML to which the sub-extractor patterns will be applied. If everything else is as it's supposed to be, you should be able to just change your ~@EXTRA_INFO@~ token to ~@DATARECORD@~ and magically your sub-extractor pattern will match.

http://www.screen-scraper.com/support/docs/using_extractor_patterns.php
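
For anyone who finds the two-stage idea easier to follow in code, here is a rough analogy in Python. It is only a sketch of the concept, not how screen-scraper is implemented: the sample page, the "Oil on canvas" value and the DESCR name are all invented for the illustration.

```python
import re

# Hypothetical page fragment, loosely modelled on the HTML in this thread.
page = (
    "<tr><td>Year 2008</a> </td></tr>"
    "<tr><td class='td1'>Descr.</td><td>Oil on canvas</td></tr>"
    "<tr><td class='td1'>Provenance</td></tr>"
)

# Stage 1: the main pattern isolates one chunk of HTML (the DATARECORD role).
main = re.compile(
    r"Year.*?</tr>(?P<DATARECORD>.*?)<tr><td class='td1'>Provenance"
)
datarecord = main.search(page).group("DATARECORD")

# Stage 2: the sub-pattern is matched against that chunk only, not the page.
sub = re.compile(r"Descr\.</td><td>(?P<DESCR>[^<>]*)</td>")
descr = sub.search(datarecord).group("DESCR")

print(datarecord)
print(descr)
```

The point of the analogy is that the second pattern never sees the whole page, which is why it has nothing to work on unless the first stage captures its chunk under the special DATARECORD name.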

Give that a try and let us know how it goes.

-Scott

same as natR

I am having the same problem with getting the sub-pattern extraction to work. "When I click "Apply Sub-Pattern to Last Scraped Data" - the Extracted Data window that pops up is the same as the window from the main Extractor function."

I have set up a main extractor pattern with all the variable data that I need to extract, called "Extra Info" and I've set up a sub-pattern for it from which I wish to extract "Descr."

The Main Extractor Pattern is:

[color=darkblue]

Year~@IGNORE@~</a> </td>
</tr>

~@EXTRA_INFO@~

<!-- filename&#58; full-body -->
<tr>
<td class="td1" id="bold" width="15%" valign="top" nowrap="nowrap">Provenance

[/color]

This seems to work fine, and results in:

[color=darkblue]

<!-- filename&#58; full-body --><tr><td class="td1" id="bold" width="15%" valign="top" nowrap="nowrap">Descr.

Sub - extractor patterns

natR,

Make the HTML contained in the special ~@DATARECORD@~ token as broad as you can while ensuring it matches only once. Then, given this chunk of HTML you'll be able to match data even as it moves locations by setting sub-extractors that are brief and concise. Don't forget that you're matching against the HTML within the DATARECORD token and not the entire page.

The trick in creating a concise sub-extractor pattern is to identify unique aspects of the surrounding HTML that let you reduce the amount of HTML you use in a given text window. The reason is that if you can reduce the HTML to where each sub-extractor pattern holds only one token rather than, say, two, you'll better ensure a match on the one, whereas if one of the two didn't match you'd lose them both.

Another trick may be to use tokens for content you don't necessarily want to extract but which changes in a predictable way.

For example, given this HTML.

<div class='FPA_Heading_3'>Commonwealth Bank of Australia</div>

If the "3" in "FPA_Heading_3" were to change from being a 3 to either a 1 or a 2 you could encapsulate it in a token and set the regular expression to match any of the three numbers. Your regex would simply look like this.

[1|2|3]*

What this does is allow you to create extractor text using the div tag for more purposes than just when it's a three. This actually makes the HTML less unique but to your advantage.
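
As a rough illustration of that idea outside screen-scraper, the Python sketch below matches the heading div whatever its number is. It uses the tighter class [123], which accepts exactly one of the three digits. The FPA_Heading_1 sample line is hypothetical; only the FPA_Heading_3 form appears in this thread.

```python
import re

headings = [
    "<div class='FPA_Heading_1'>Commonwealth Securities</div>",  # hypothetical
    "<div class='FPA_Heading_3'>Commonwealth Bank of Australia</div>",
]

# The digit is allowed to vary, so one pattern covers every heading level,
# while the company name is captured as a tag-free run of text.
pattern = re.compile(
    r"<div class='FPA_Heading_[123]'>(?P<COMPANY>[^<>]*)</div>"
)

companies = [pattern.search(h).group("COMPANY") for h in headings]
print(companies)
```

A heading number outside 1 to 3 would simply not match, which is usually what you want from a token like this.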

Another point to note here is that unlike main extractor tokens, sub-extractor tokens do not need to match. So, you could have 100 sub-extractor patterns and if only one of them matches it will proceed as if 99 of them matched.

Therefore, the window that shows matching results in the workbench will reflect that. It won't show every column for each token if not all tokens match.
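
To picture the "not every sub-extractor has to match" behaviour, here is a small Python sketch. It is an analogy only, with assumed names: the Website pattern is invented so that one pattern fails, and the colons are written plainly rather than as the entities shown in the forum's copy of the HTML.

```python
import re

# One short pattern per field; each is tried independently.
sub_patterns = {
    "PHONE": re.compile(r"Phone: ([^<>]*)<br />"),
    "MOBILE": re.compile(r"Mobile: ([^<>]*)<br />"),
    "WEBSITE": re.compile(r"Website: ([^<>]*)<br />"),  # hypothetical field
}


def apply_sub_patterns(datarecord):
    """Return only the fields whose pattern matched; the rest are skipped."""
    found = {}
    for name, pattern in sub_patterns.items():
        m = pattern.search(datarecord)
        if m:
            found[name] = m.group(1).strip()
    return found


record = (
    "PO Box 689<br />NEWCASTLE<br />"
    "Phone: 02 4922 2812<br />Mobile: 04 0285 2477<br />"
)
result = apply_sub_patterns(record)
summary = f"{len(result)}/{len(sub_patterns)} tokens matched"
print(result)
print(summary)
```

The missing Website field does not stop the other two from being kept, which mirrors why the results window shows fewer columns than there are tokens.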

I agree it's not ideal and we've determined that showing blank columns isn't best, either - due mainly to the potential number of tokens and available screen size limitations. Perhaps, a numeric readout that says, 10/25 tokens match!

I hope you find something useful here. If you have a specific bit of code to share, please do.

Thanks,
Scott

Sub - extractor patterns

I'm stuck on the same step James was once stuck on.

When I click "Apply Sub-Pattern to Last Scraped Data" - the Extracted Data window that pops up is the same as the window from the main Extractor function.

It's not clear to me from reading the transcript above...how do I view the results of a sub-pattern?

Sub - extractor patterns

Hi Kalyanu

I have run a scraping session using the files you supplied and have been able to extract some of the data I'm after from the session. I can see and understand the logic in designing the sub-extractors where they 'connect' to a particular piece of data in the scraped data - i.e. using the word 'Phone:' as an anchor for the sub-extractor ~@PHONE@~ to extract the phone number.

But how do I extract information from an HTML string when the actual piece of information I'm targeting 'moves around' within the Extracted Data value field - and does not have a recognisable anchor in the HTML string?

Probably the best way to explain my question is by mapping out the information I'm endeavouring to extract.

The information I want to extract is:

1. Company name
2. Salutation
3. First name
4. Last name
5. Address
6. City
7. State
8. Postcode (zip)
9. Phone #
10. Email address

And keep it all in one data record, relating to a particular contact, so I can write it to a text file and then import it into an excel spread sheet.

After running the extractor ~@DATARECORD@~ I extract 19 HTML strings from the last response page, but there are 3 variations of the HTML string in the extracted data value field.

[b]String A[/b] - which includes reference to an image file immediately after the company name and the 'address' information relating to the contact is divided by a line break tag. This is the one I'm having the most trouble with.

<div class='FPA_Heading_3'>RetireInvest Pty Limited</div><img alt='Certified Financial Planner' src='/files/cfplogo_small.gif' /><br />Mr Steven Lippiatt<br />Level 4 RetireInvest Building<br />456-460 Hunter Street<br />NEWCASTLE<br />NSW<br />AUSTRALIA, 2300<br />Phone&#58; 02 4929 7433<br />Facsimile&#58; 02 4929 6775<br />Mobile&#58; 0407 787 786<br />Email&#58; <a href='mailto&#58;[email protected]'>[email protected]</a><br />

[b]String B[/b] - No image file reference and only a single piece of data to be extracted for the 'address'.

<div class='FPA_Heading_3'>Commonwealth Bank of Australia</div>Mr Bradley Rousell - FPA &#40;Aff&#41;<br />PO Box 689<br />NEWCASTLE<br />NSW<br />AUSTRALIA, 2300<br />Phone&#58; 02 4922 2812<br />Facsimile&#58; 02 4922 2899<br />Mobile&#58; 04 0285 2477<br />Email&#58; <a href='mailto&#58;[email protected]'>[email protected]</a><br />

[b]String C[/b] - No image reference, but the 'address' information is again divided by a line break tag.

<div class='FPA_Heading_3'>Commonwealth Securities</div>Mr John Cleary - FPA &#40;Aff&#41;<br />Level 2<br />136-140 Hunter Street<br />NEWCASTLE<br />NSW<br />AUSTRALIA, 2300<br />Phone&#58; 04 0404 4382<br />Facsimile&#58; 02 4922 2899<br />Mobile&#58; 0404 044 382<br />Email&#58; <a href='mailto&#58;[email protected]'>[email protected]</a><br />

How do I code my sub-extractor tokens to extract the information I want from the three variations of the main extractor's value field?

Your guidance will be greatly appreciated - and to avoid any confusion could I ask you to email the extractor and sub-extractor file - as you did this afternoon - that way there is no confusion.

Thanks
James

Sub - extractor patterns

Seems like you will need to specify some regular expressions for extractor patterns to work in this case

IPAC Securities Limited

Certified Financial Planner
Mr Tony Bottaro
Level 3
251 Wharf Road
NEWCASTLE
NSW
AUSTRALIA, 2300
Phone: 02 4927 5600
Facsimile: 02 4927 5464
Mobile: 0418 295 127
Email: [email protected]

Specialises in:

Investments
Superannuation and retirement planning

For the data above that you've extracted there is only one div tag, so for company name use a sub-extractor pattern:
~@companyName@~

In addition, select the companyName token, right-click on it and select edit token. In the edit token menu go to the regular expressions tab. From the drop-down list select non-html tags. This will specify that the ~@companyName@~ token will not contain any HTML tags.

Similarly to grab phone no use :
Phone: ~@phoneNumber@~
Similar to the previous one, try making the phoneNumber token non-html through its regular expression. This will make the token grab just the phone number instead of everything up to the last br tag, i.e. "02 4927 5600
Facsimile: 02 4927 5464
Mobile: 0418 295 127
"

Apply a similar concept and add different sub-extractor patterns to get the different pieces of data.
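
The difference a non-html restriction makes can be sketched with ordinary regular expressions in Python. This is only an illustration under assumptions: the wildcard `.*` stands in for an unrestricted token, `[^<>]*` stands in for the non-html setting, and the colon is written plainly rather than as the entity seen in the raw HTML.

```python
import re

record = (
    "AUSTRALIA, 2300<br />Phone: 02 4927 5600<br />"
    "Facsimile: 02 4927 5464<br />Mobile: 0418 295 127<br />"
)

# Unrestricted token: runs on to the LAST "<br />" in the record.
unrestricted = re.search(r"Phone: (.*)<br />", record).group(1)

# Token that may not contain "<" or ">": stops at the first tag.
no_tags = re.search(r"Phone: ([^<>]*)", record).group(1)

print(unrestricted)
print(no_tags)
```

The first result swallows the fax and mobile lines as well, while the second is just the phone number.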

Hope this helps.

Sub-extractor patterns

Hi Kalyanu

Maybe I'm missing something - but I still can't get the sub-extractor to extract the information I want

Here is the raw data copied from the scraper session:

<tr>
<td>
<hr />
<div class='FPA_Heading_3'>IPAC Securities Limited</div>

<img alt='Certified Financial Planner' src='/files/cfplogo_small.gif' /><br />
Mr Tony Bottaro<br />
Level 3<br />
251 Wharf Road<br />
NEWCASTLE<br />
NSW<br />
AUSTRALIA, 2300<br />
Phone&#58; 02 4927 5600<br />
Facsimile&#58; 02 4927 5464<br />
Mobile&#58; 0418 295 127<br />
Email&#58; <a href='mailto&#58;[email protected]'>[email protected]</a><br />
<br />
<i>Specialises in&#58;</i><br />
<dl>
<dd>Investments</dd>

<dd>Superannuation and retirement planning</dd>
</dl>
</td>
</tr>

Here is the ~@datarecord@~ extraction:

IPAC Securities Limited</div><img alt='Certified Financial Planner' src='/files/cfplogo_small.gif' /><br />Mr Tony Bottaro<br />Level 3<br />251 Wharf Road<br />NEWCASTLE<br />NSW<br />AUSTRALIA, 2300<br />Phone&#58; 02 4927 5600<br />Facsimile&#58; 02 4927 5464<br />Mobile&#58; 0418 295 127<br />Email&#58; <a href='mailto&#58;[email protected]'>[email protected]</a><br /><br /><i>Specialises in&#58;</i><br /><dl><dd>Investments</dd><dd>Superannuation and retirement planning</dd></dl>

From this I'm trying to sub-extract company name, first name etc - and keep them all in one record.

Currently I'm pasting all this extracted code into the Sub-Extractor and creating one sub-extract for one token - say ~@address@~ - but can't get a value to extract?

IPAC Securities Limited</div><img alt='Certified Financial Planner' src='/files/cfplogo_small.gif' /><br />Mr Tony Bottaro<br />~@address@~<br />NEWCASTLE<br />NSW<br />AUSTRALIA, 2300<br />Phone&#58; 02 4927 5600<br />Facsimile&#58; 02 4927 5464<br />Mobile&#58; 0418 295 127<br />Email&#58; <a href='mailto&#58;[email protected]'>[email protected]</a><br /><br /><i>Specialises in&#58;</i><br /><dl><dd>Investments</dd><dd>Superannuation and retirement planning</dd></dl>

Kalyanu - guidance please ???

Re: sub-extractor patterns

James,
To solve your problem of getting the contact information I would suggest the following steps:
1) Create an extractor pattern with ~@datarecord@~, as you have right now, that grabs the contact information of all the individuals, with the contact information of one individual per sequence, i.e. each match of the extractor pattern should cover exactly one contact.
2) Check it by applying the pattern to the last scraped data, and use the data generated by the extractor pattern to create sub-extractor patterns.
3) Create different sub-extractor patterns to grab separate pieces of data from an individual contact's information, for example one sub-extractor pattern for name and another for address.
4) You can then add a script to your extractor pattern that writes the sub-extractor data to your file. Make the script run after each pattern application so that it runs for every individual record.
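
The four steps above can be sketched end to end in plain Python. This is an analogy under stated assumptions, not screen-scraper's scripting API: the sample page is trimmed down from the HTML in this thread, colons are written plainly, and the field names and helper functions are invented for the illustration.

```python
import csv
import io
import re

# Trimmed-down, hypothetical page with two contacts separated by <hr />.
page = (
    "<hr /><div class='FPA_Heading_3'>IPAC Securities Limited</div>"
    "Mr Tony Bottaro<br />Phone: 02 4927 5600<br />"
    "<hr /><div class='FPA_Heading_3'>Commonwealth Securities</div>"
    "Mr John Cleary<br />Phone: 04 0404 4382<br />"
    "<hr />"
)

# Step 1: one match per contact (the DATARECORD role).
record_pattern = re.compile(r"<div class='FPA_Heading_3'>.*?(?=<hr />)")

# Step 3: one short sub-pattern per field.
sub_patterns = {
    "company": re.compile(r"<div class='FPA_Heading_3'>([^<>]*)</div>"),
    "name": re.compile(r"</div>([^<>]*)<br />"),
    "phone": re.compile(r"Phone: ([^<>]*)<br />"),
}


def extract_rows(html):
    """Apply every sub-pattern to each record; unmatched fields stay blank."""
    rows = []
    for record in record_pattern.findall(html):
        row = {}
        for field, pattern in sub_patterns.items():
            m = pattern.search(record)
            row[field] = m.group(1).strip() if m else ""
        rows.append(row)
    return rows


def to_csv(rows):
    """Step 4: one line per contact, ready for a spreadsheet import."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(sub_patterns))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(page)
print(to_csv(rows))
```

To write a real file instead of a string, the same DictWriter can be pointed at a file opened with open(path, "w", newline=""). Because each row is built from a single record, the fields stay together per contact.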

Right now, you might be having a problem getting the value you want from the sub-extractor pattern because extractor patterns have a tendency to match as much of the extracted data as they can.
i.e. if you create some extractor pattern around ~@data@~, the extractor pattern will look for the opening and closing text that are farthest apart from each other instead of the ones closest.

Also, in screen-scraper you can restrict an extractor token to a specific type of value, which makes your extractor pattern more precise. To do that, right-click on the token and click edit token. Then in the edit token window go to the regular expression tab. Here you can either add your own regular expression or select the type of value you want your token to match from the drop-down menu.

Hope this will help you solve the problem.