Pattern doesn't fit, but still extracts?

I have a bit of a problem. I've set up a scraping session and all the necessary scripts and I have it working almost perfectly. The one problem I do run into is that there's an area of the website that varies. There's either 1, 2 or 3 table cells.

So I've made an extractor pattern for each variation, like so:

1 Table Cell:

~@A1@~ (~@B1@~)

~@C1@~yrs, ~@D1@~mths

~@E1@~

2 Table Cells:

~@A1@~ (~@B1@~)

~@C1@~yrs, ~@D1@~mths

~@E1@~

~@A2@~ (~@B2@~)

~@C2@~yrs, ~@D2@~mths

~@E2@~

3 Table Cells:

~@A1@~ (~@B1@~)

~@C1@~yrs, ~@D1@~mths

~@E1@~

~@A2@~ (~@B2@~)

~@C2@~yrs, ~@D2@~mths

~@E2@~

~@A3@~ (~@B3@~)

~@C3@~yrs, ~@D3@~mths

~@E3@~

The problem is that even when there are 3 cells and the 1st and 2nd patterns don't fit (as far as I can tell), they still seem to grab data. The same thing happens when there are two table cells and the 1 table cell pattern is executed.

So let's say I have the following block of HTML that the pattern is being applied to:

Bob (male)

24yrs, 8mths

Programmer

John (male)

22yrs, 6mths

Designer

Mary (female)

35yrs, 11mths

Administrator

Instead of returning a null response because the pattern doesn't fit, it gives me this output:

Designer Mary (female)
35yrs, 11mths

Administrator

The problem is this then messes with the output when I attempt to write the data to a file. Can someone see what I'm doing wrong? I don't see how the 2nd pattern manages to even fit into the block of HTML with 3 table cells. Surely the inclusion of the would mean it doesn't fit?

Is there a way to order the sequence in which sub-extractor patterns are executed? Maybe if the patterns are run in the order of 1 cell, 2 cells and 3 cells, then it'll be okay. But Screen-Scraper seems to be pretty random as to what order the sub-extractor patterns appear in the program. I'm at my wit's end here.

Brendan O on 08/16/2007 at 10:14 pm

screen-scraper public support


Brendan,

I truly apologize for not catching this sooner. Yes, that means that extractData is only available in the professional edition. Because I didn't mention this sooner I would like to offer you a discount on a professional edition license. If you are interested please either private message me or email me (my name, last initial w @ our domain) for details.

To address your other questions...

You should feel free to add to and modify my samples as you need to. Keep in mind that any new tokens you add to the People's extractor pattern text will be subject to the same looping as the other tokens.

I would just encourage you to add to it in a spirit of learning by trial and error. Perhaps back up your session and save it as a benchmark you can revert to if needed.

Please let me know what you would like to do to proceed.

Thank you,
Scott

swilsonmc on 08/29/2007 at 10:58 am


You must be getting sick of me at this point, Scott. Thanks again for your patience.

I've stripped it right down to the basics and I've got my main extractor pattern and the second extractor pattern. I've removed all the sub-extractor patterns entirely (including ones unrelated to this particular block of HTML that I'm currently trying to figure out). I've renamed my extraction pattern and the data record token (People's Data and PEOPLES_DATA). When I run my scraping session, it seems to grab all the values properly, but it seems to run into a problem when writing the data file. Here are a couple of lines from the log:

[quote]Processing script: "Write data to a file"
Details page: Returning because extractData was called in the basic edition.[/quote]

Does this mean this feature is not available in the basic edition?

Also, once I have got this working, is it okay to add unrelated sub-extractor patterns to the People's data extractor pattern? Or will I need to make another "data record" extraction pattern?

Brendan O on 08/28/2007 at 7:30 pm


Brendan,

Firstly, I want to make sure that you've modified your session from where it was when we first started this conversation. In the sample I've provided, notice how I have only two extractor pattern texts. You, too, will only need two for extracting the data we've been talking about. So, when we started you had three sub-extractors, but now you should have only one normal extractor pattern text (no sub-extractors). I used the code you supplied as the structure in my example. Follow my example as closely as you can and hopefully you won't have any issues.

To address your specific questions, you'll want to name your token something other than DATARECORD since that name is considered special by screen-scraper and may cause undesired results for your scenario. In the example I've provided I'm using the name "PEOPLES_DATA".

After running your session, when you click on "Apply Pattern to Last Scraped Data", the results for "PEOPLES_DATA" are what your other extractor pattern will be matched against, not the entire HTML page as you're used to. Try copying and pasting the results into a text editor to see what you'll be matching against. Often, clicking on "Apply Pattern to Last Scraped Data" for your other extractor pattern will still match the rest of the HTML, since the results of "PEOPLES_DATA" came from the main HTML. But when you apply your other extractor pattern (in the example I provided, it is titled "Person") to the HTML results from "PEOPLES_DATA", it should match multiple times. That's where the [b]for[/b] loop meets the write-to-file code. So, yes, they need to be together in the same script.
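To make the two-stage idea concrete outside of screen-scraper, here is a minimal sketch in plain Java that uses java.util.regex as a stand-in for the two extractor patterns. The class name, the table markup and both regular expressions are assumptions made up for illustration; they are not the sample session's actual patterns or screen-scraper's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStageSketch {
    // Stage 1: hypothetical outer pattern, playing the role of PEOPLES_DATA.
    static final Pattern BLOCK = Pattern.compile(
        "<table id=\"people\">(.*?)</table>", Pattern.DOTALL);

    // Stage 2: hypothetical inner pattern, playing the role of "Person".
    static final Pattern PERSON = Pattern.compile(
        "<td>\\s*(\\w+) \\((\\w+)\\)\\s*</td>");

    // Returns the names found inside the people table only.
    static List<String> namesInPeopleTable(String page) {
        List<String> names = new ArrayList<>();
        Matcher block = BLOCK.matcher(page);
        if (!block.find()) {
            return names; // no block matched, so there is nothing to loop over
        }
        // The inner pattern sees only the captured block, not the whole page.
        Matcher person = PERSON.matcher(block.group(1));
        while (person.find()) {
            names.add(person.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String page = "<td>Zed (male)</td>"
            + "<table id=\"people\"><tr><td>Bob (male)</td>"
            + "<td>Mary (female)</td></tr></table>";
        System.out.println(namesInPeopleTable(page));
    }
}
```

Because the inner pattern is applied to the captured block rather than to the page, a cell with the same shape elsewhere on the page (the "Zed" cell here) is never considered.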

Your write to file script could look something like this.

FileWriter out = null;

try
{
    session.log( "Writing data to a file." );

    // Open up the file to be appended to.
    out = new FileWriter( "sitefacts6.csv", true );

    // From the manual extraction example.
    peoplesData = dataRecord.get( "PEOPLES_DATA" );

    people = scrapeableFile.extractData( peoplesData, "Person" );

    for ( int i = 0; i < people.getNumDataRecords(); i++ )
    {
        person = people.getDataRecord(i);

        // Write out the data to the file.
        out.write( person.get( "A" ) + "\t" );
        out.write( person.get( "B" ) + "\t" );
        out.write( person.get( "C" ) + "\t" );
        out.write( person.get( "D" ) + "\t" );
        out.write( person.get( "E" ) + "\t" );
        out.write( person.get( "DRM" ) + "\t" );
        out.write( person.get( "PARTNER" ) );
        out.write( "\n" );
    }

    // Close up the file.
    out.close();
}
catch( Exception e )
{
    session.log( "An error occurred while writing the data to a file: " + e.getMessage() );
}
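The script above writes one line per person. If one line per page is preferred, with unused slots filled with null as in the comma-separated output quoted elsewhere in this thread, the row could be assembled before it is written. This is only a sketch in plain Java, assuming at most three people and five fields each; the class, method and constant names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class PaddedRowSketch {
    static final int MAX_PEOPLE = 3; // assumed upper bound on table cells
    static final int FIELDS = 5;     // A, B, C, D and E

    // Builds one comma-separated row, padding missing people with "null".
    static String toRow(List<String[]> people) {
        List<String> cells = new ArrayList<>();
        for (int i = 0; i < MAX_PEOPLE; i++) {
            for (int f = 0; f < FIELDS; f++) {
                cells.add(i < people.size() ? people.get(i)[f] : "null");
            }
        }
        return String.join(",", cells);
    }

    public static void main(String[] args) {
        List<String[]> people = new ArrayList<>();
        people.add(new String[] {"Brendan", "Male", "24", "0", "Designer"});
        System.out.println(toRow(people));
    }
}
```

Inside a script, each String[] would be filled from person.get( "A" ) through person.get( "E" ) in the loop, and the finished row written with a single out.write call after the loop.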

Reference the sample session liberally and let me know what specific questions you may have.

Thanks,
Scott

swilsonmc on 08/28/2007 at 3:20 pm


Hi Scott,

Still having trouble sorting this out. I don't know how to integrate it with what I already have. Just to make sure I'm doing things properly so far..

This is the script you gave me:

[quote]//With a successful match of the people's HTML table
//assign the block of HTML containing the people's data to
//a local data record variable.

//
//"PEOPLES_DATA" is the name we gave to the extractor pattern token
//in the first extractor pattern.

peoplesData = dataRecord.get( "PEOPLES_DATA" );

//Here is where you are manually applying the pattern for
//an individual's data to the block of HTML containing all of the
//people's data.
//Store the resulting pattern matches as a local data set object.

//
//"Person" is the title we assigned to the second extractor pattern.
//"peoplesData" is the local variable from above which contains the
//HTML we're going to apply the pattern of each individual to.

people = scrapeableFile.extractData( peoplesData, "Person" );

session.log("\n-- Number of people to loop through **" + people.getNumDataRecords() + "** --\n");

//Loop through each person's data record contained in the
//manually applied data set.
for ( int i = 0; i < people.getNumDataRecords(); i++ )
{
//Assign the current data record object to a local variable.
person = people.getDataRecord(i);

session.log("DataRecords for loop #" + i + ":");

//Write out the value to which the specified
//key is mapped. The keys are the names of
//the tokens used in the 2nd extractor pattern.
session.log("A**" + person.get("A") + "**");
session.log("B**" + person.get("B") + "**");
session.log("C**" + person.get("C") + "**");
session.log("D**" + person.get("D") + "**");
session.log("E**" + person.get("E") + "**");

session.log("\n");
}[/quote]

My main extractor pattern uses DATARECORD as the token. As long as I name my second extractor pattern "Person", the only thing I need to change is this...

[quote]peoplesData = dataRecord.get( "DATARECORD" );[/quote]

.. right?

Here's a truncated version of my current write to file script. Pretty much exactly what's found in the third tutorial.

[quote]FileWriter out = null;

try
{
session.log( "Writing data to a file." );

// Open up the file to be appended to.
out = new FileWriter( "sitefacts6.csv", true );

// Write out the data to the file.
out.write( dataRecord.get( "DRM" ) + "\t" );
out.write( dataRecord.get( "PARTNER" ) );
out.write( "\n" );

// Close up the file.
out.close();
}
catch( Exception e )
{
session.log( "An error occurred while writing the data to a file: " + e.getMessage() );
}[/quote]

And I guess this is what it would look like (the tab parts are probably wrong in the A,B,C,D,E part).

[quote]FileWriter out = null;

try
{
session.log( "Writing data to a file." );

// Open up the file to be appended to.
out = new FileWriter( "sitefacts6.csv", true );

// Write out the data to the file.
out.write("A**" + person.get("A") + "**" + "\t" );
out.write("B**" + person.get("B") + "**" + "\t" );
out.write("C**" + person.get("C") + "**" + "\t" );
out.write("D**" + person.get("D") + "**" + "\t" );
out.write("E**" + person.get("E") + "**" + "\t" );
out.write( dataRecord.get( "DRM" ) + "\t" );
out.write( dataRecord.get( "PARTNER" ) );
out.write( "\n" );

// Close up the file.
out.close();
}
catch( Exception e )
{
session.log( "An error occurred while writing the data to a file: " + e.getMessage() );
}[/quote]

So I think I have the write data to file aspect of it covered. But the manually extract data script.. I assume I can put that in with the write data script instead of creating a second script, or do I need to create a second script and make the manually extract data script run first, followed by the write data to file script? Or can I combine the two in one script, and if so, how?

Thanks for being so patient with me so far!

Brendan O on 08/27/2007 at 10:50 pm


Brendan,

It should be easy to modify the sample session to write to a file. My "session.log" entries (lines 37-41) are where you would put your "out.write" calls. And because the script loops through the scraped results, you should not need the variables A2, B2 or A3, B3, etc. In my example session you'll notice that my variables are simply A, B, C, etc.

Also, because of the newly introduced loop, be sure to have your lines:

out = new FileWriter( "sitefacts.csv", true );
and
out.close();

reside outside of the loop.
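As a rough illustration of that placement, here is a minimal sketch in plain Java rather than a screen-scraper script. The class and method names are hypothetical, and try-with-resources is used so that the single close happens automatically after the loop.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class WriteLoopSketch {
    // Appends one tab-separated line per row to the file at the given path.
    static void writeRows(String path, List<String[]> rows) throws IOException {
        // Opened once, in append mode, before the loop starts.
        try (FileWriter out = new FileWriter(path, true)) {
            for (String[] row : rows) {
                out.write(String.join("\t", row));
                out.write("\n");
            }
        } // Closed once here, after the loop has finished.
    }

    public static void main(String[] args) throws IOException {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"Bob", "male", "24", "8", "Programmer"});
        rows.add(new String[] {"John", "male", "22", "6", "Designer"});

        // A temporary file keeps the sketch free of side effects.
        Path tmp = Files.createTempFile("sitefacts", ".csv");
        writeRows(tmp.toString(), rows);
        System.out.print(new String(Files.readAllBytes(tmp)));
    }
}
```

Opening the writer inside the loop would reopen the file for every person, and closing it inside the loop would make the next write fail.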

I'm sorry I didn't realize that the advanced tab with the "This pattern will be invoked manually from a script" setting wasn't available in the basic edition. This should not be a problem for you, though. Just be sure that you don't set the variables in your equivalent of my "Person" extractor pattern as session variables.

Please let me know if you have any additional questions. I hope this works for you. Let us know how it goes.

Thanks,
Scott

swilsonmc on 08/20/2007 at 10:54 am


Hi Scott. Thanks for the reply.. Couple of things.

Firstly, I don't see the advanced tab below the extractor patterns that you're talking about. I'm using screen-scraper basic edition if that helps.

Also, a lot of that code in the script doesn't make a great deal of sense to me but I see what's going on. What I don't know is how I can use that information and have a script write it to a file. For example, what I'd originally attempted before I came to this forum was to write the data out like this...

[quote]FileWriter out = null;

try
{
session.log( "Writing data to a file." );

// Open up the file to be appended to.
out = new FileWriter( "sitefacts.csv", true );

// Write out the data to the file.
out.write( dataRecord.get( "A1" ) + "," );
out.write( dataRecord.get( "B1" ) + "," );
out.write( dataRecord.get( "C1" ) + "," );
out.write( dataRecord.get( "D1" ) + "," );
out.write( dataRecord.get( "E1" ) + "," );
out.write( dataRecord.get( "A2" ) + "," );
out.write( dataRecord.get( "B2" ) + "," );
out.write( dataRecord.get( "C2" ) + "," );
out.write( dataRecord.get( "D2" ) + "," );
out.write( dataRecord.get( "E2" ) + "," );
out.write( dataRecord.get( "A3" ) + "," );
out.write( dataRecord.get( "B3" ) + "," );
out.write( dataRecord.get( "C3" ) + "," );
out.write( dataRecord.get( "D3" ) + "," );
out.write( dataRecord.get( "E3" ) );
out.write( "\n" );

// Close up the file.
out.close();
}
catch( Exception e )
{
session.log( "An error occurred while writing the data to a file: " + e.getMessage() );
}[/quote]

And get an output like this (first line being the results of scraping the page with the HTML block I've used earlier in this thread and the other two being instances of either 1 or 2 table cells instead of 3).

[quote]Bob,Male,24,8,Programmer,John,Male,22,6,Designer,Mary,Female,35,11,Administrator
Brendan,Male,24,0,Designer,null,null,null,null,null,null,null,null,null,null
Harry,Male,26,2,Cleaner,Sally,Female,21,7,Accountant,null,null,null,null,null[/quote]

So how can I use the information that your scrape has gathered and have it output the data into a text file or CSV file the way I have it formatted above? Thanks.

Brendan O on 08/19/2007 at 11:40 pm


Brendan,

I'm assuming the three patterns you listed are sub-extractor patterns applied to a main extractor, perhaps utilizing the DATARECORD special token? If so, you have the right idea; you'll just need to go about it in a different way.

Rather than creating three separate patterns for the three scenarios, you'll end up using just one pattern. But because sub-extractor patterns match once and only once, you'll need to apply the pattern to your HTML and manually loop through the pattern match results.
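The general idea of one pattern applied repeatedly can be sketched in plain Java with java.util.regex. This is only an analogy: the regular expression below is a made-up stand-in for a single-person pattern, not screen-scraper's own matching, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PersonLoopSketch {
    // Hypothetical stand-in for a single-person pattern:
    // name (gender), then years and months, then occupation.
    static final Pattern PERSON = Pattern.compile(
        "(\\w+) \\((\\w+)\\)\\s+(\\d+)yrs, (\\d+)mths\\s+(\\w+)");

    // Applies the one pattern repeatedly and collects one row per match.
    static List<String[]> extractPeople(String block) {
        List<String[]> people = new ArrayList<>();
        Matcher m = PERSON.matcher(block);
        while (m.find()) {
            people.add(new String[] {
                m.group(1), m.group(2), m.group(3), m.group(4), m.group(5)});
        }
        return people;
    }

    public static void main(String[] args) {
        String block = "Bob (male)\n\n24yrs, 8mths\n\nProgrammer\n\n"
            + "John (male)\n\n22yrs, 6mths\n\nDesigner\n\n"
            + "Mary (female)\n\n35yrs, 11mths\n\nAdministrator";
        for (String[] person : extractPeople(block)) {
            System.out.println(String.join("\t", person));
        }
    }
}
```

The same loop handles one, two or three people without separate patterns, because the number of matches simply follows the number of people present in the block.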

Download and import the sample scraping session found on this sample HTML page.

http://www.screen-scraper.com/support/examples/manual-extraction-example.html

First, go ahead and run the scraping session. Then, go to the "Page to scrape" scrapeable file and click on the Extractor Patterns tab. You'll notice two extractor patterns. The first one, "Peoples Data", represents the block of code with three people's data. The second pattern, "Person" represents a pattern to match each of the elements you're wanting to extract.

Click on the Advanced tab under the second extractor pattern and you'll notice an important setting. We have checked the box next to "This extractor pattern will be invoked manually from a script." This essentially tells the scrapeable file to bypass this extractor pattern when going in sequence because the first extractor pattern will be manually calling it instead.

Next, you'll notice the first extractor pattern is set to invoke the script "Manually Extract Data" after each pattern application. Go ahead and click on the Apply Pattern to Last Scraped Data button. Now, inside the DataSet pop-up window, either triple-click inside the "HTML_CONTENT" column or double-click and select all of the content by using ctrl+a on Windows/Linux or Apple+a on Mac.

Open your favorite text editor and paste the contents that you just copied. You'll notice that it extracted the content of all three scenarios. The second extractor pattern is going to be applied to that HTML. Go back to the Extractor Patterns tab and you'll notice that the second extractor pattern is stripped down enough to match each of the scenarios individually.

Now, open the "Manually Extract Data" script. It's heavily commented to explain what each aspect is doing and where the different data elements come from.

Please let me know if you have any questions about how this process works or if you have any ideas for how to improve the instructions.

Thank you,
Scott

swilsonmc on 08/17/2007 at 2:33 pm


Quick follow-up. My thought is that it's not looking at the entire pattern, but rather just what's surrounding each particular token. So even though the rest of the pattern doesn't fit, is it maybe just looking at...

[quote].....
~@E3@~

[/quote]

And running it from there? Is there any way I can prevent this? Thanks.

Brendan O on 08/16/2007 at 10:53 pm
