Using Extractor Patterns

Overview

Extractor patterns allow you to pinpoint select snippets of data that you want extracted from a web page. They're often the most confusing part of screen-scraper, so you'll want to look over this page carefully. An extractor pattern is a block of text (usually HTML) that contains special tokens that will match pieces of data you're interested in extracting. These tokens are text labels surrounded by the delimiters ~@ and @~. The label between the delimiters should contain only alphanumeric characters and underscores.

You can think of an extractor pattern like a stencil. A stencil is an image in cut-out form, often made of thin cardboard. As you place a stencil over a piece of paper, apply paint to it, then remove the stencil, the paint remains only where there were holes in the stencil. Analogously, you can think of placing an extractor pattern over the HTML of a web page where the tokens correspond to the holes where the paint would pass through. After an extractor pattern is applied it reveals the portions of the web page you'd like to extract.

Extractor pattern tokens designate regions where data elements are to be captured. For example, given the following HTML snippet:

<p>This is the <b>piece of text</b> I'm interested in.</p>

you would extract "piece of text" by creating an extractor pattern with a token positioned like so:

<p>This is the <b>~@EXTRACTED_TEXT@~</b> I'm interested in.</p>

The extracted text could then be accessed via the identifier "EXTRACTED_TEXT".
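
To make the stencil idea concrete, here is a minimal Python sketch that emulates how a token could be turned into a capture. It is not screen-scraper's own implementation: the helper name pattern_to_regex and the choice to let a token lazily match any text are assumptions made purely for illustration.

```python
import re

# Tokens look like ~@LABEL@~, where LABEL is alphanumeric plus underscores.
TOKEN = re.compile(r"~@(\w+)@~")

def pattern_to_regex(pattern):
    """Turn an extractor-pattern-like string into a compiled regex.

    Literal text is escaped; each token becomes a named group that lazily
    matches any text (an assumption made for this sketch).
    """
    parts, pos = [], 0
    for m in TOKEN.finditer(pattern):
        parts.append(re.escape(pattern[pos:m.start()]))  # the "cardboard"
        parts.append("(?P<%s>.*?)" % m.group(1))          # the "hole"
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts), re.DOTALL)

page = "<p>This is the <b>piece of text</b> I'm interested in.</p>"
pattern = "<p>This is the <b>~@EXTRACTED_TEXT@~</b> I'm interested in.</p>"

match = pattern_to_regex(pattern).search(page)
extracted = match.groupdict() if match else {}
print(extracted)  # {'EXTRACTED_TEXT': 'piece of text'}
```

The text outside the token must match the page exactly, which is why the HTML you build a pattern from needs to be the same HTML the pattern is later applied to.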

If you haven't done so already, we'd recommend going through our first tutorial to get a better feel for using extractor patterns.

Managing extractor patterns

Extractor patterns are managed by clicking on a scrapeable file, then on the "Extractor Patterns" tab. Add an extractor pattern by clicking on the "Add Extractor Pattern" button. Any number of extractor patterns can be associated with a given scrapeable file, and they will be applied to the file in a designated sequence. Any number of tokens can appear within an extractor pattern.

The recommended way to create a token is to simply select a region of text in an extractor pattern, right-click the selected region and select "Generate extractor pattern token from selected text" in the pop-up menu (note that this can be done either when viewing the extractor pattern as HTML or as plain text). Creating an extractor pattern token in this manner will open a window that allows you to edit the attributes of the token. It's also often helpful to use an external text editor when creating extractor patterns where you can store snippets of HTML you're working with. You can then copy text into screen-scraper, as needed.

If an extractor pattern takes too long to match a block of text it will time out. The timeout setting may be adjusted from the "Settings" window (click on the Options->Settings menu item) under the "General" tab. If you find that your extractor pattern is timing out, you might try adjusting it by using more precise regular expressions. The tips at the bottom of this page might also help.

Note that when creating extractor patterns you should use the HTML that will be found under the "Last Response" tab associated with a scrapeable file. By default, screen-scraper will "tidy" the HTML once it's been scraped, meaning that it will format it in a consistent way that makes it easier to work with. If you instead take the HTML from the page source shown in your web browser, it will likely be different from the HTML that screen-scraper generates, and a pattern built from it may fail to match.

Main tab



The main tab allows you to edit the primary attributes of the pattern, and contains the following elements:

  • Delete Extractor Pattern: Deletes the current extractor pattern.
  • Apply Pattern to Last Scraped Data: It's often helpful to test out your extractor pattern to ensure that it's doing what you expect. To test your extractor pattern just click on the "Apply Pattern to Last Scraped Data" button. This will pop up a new window with the results of the match. Depending on how many times your pattern matches in the text of the last response (the HTML that appears under the "Last Response" tab), you should notice one or more data sets. This is the information that will be available from a script using the dataRecord and dataSet variables (see using scripts for details).
  • Copy Pattern: (enterprise edition only) Copies the extractor pattern so that it can be pasted into a different scrapeable file.
  • Identifier: A name used to identify the pattern. You'll use this when invoking the extractData and extractOneValue methods.
  • Sequence: Determines the order in which the extractor patterns will be applied to the HTML.
  • Pattern text: Used to hold the text for the extractor pattern. This will also include the extractor pattern tokens that are analogous to the holes in the stencil.
  • Scripts: This table allows you to indicate scripts that should be run as the extractor pattern finds matches. Much like other programming languages, screen-scraper can invoke code based on specified events. In this case, you can invoke scripts before the pattern is applied, after each match it finds, or after all matches have been made. For example, if your pattern finds 10 matches, and you designate a script to be run "After each pattern application", that script will get invoked 10 separate times. See using scripts for more details.
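
The timing of the script events described above can be pictured with a small Python simulation. This is only a sketch of the event order; the function apply_with_hooks and its callback parameters are hypothetical names and are not part of screen-scraper.

```python
import re

def apply_with_hooks(regex, text, before=None, after_each=None, after_all=None):
    """Apply a regex to text, calling optional hooks at three points in time."""
    if before:
        before()                      # before the pattern is applied
    records = []
    for m in re.finditer(regex, text):
        record = m.groupdict()
        records.append(record)
        if after_each:
            after_each(record)        # once for every match
    if after_all:
        after_all(records)            # once, after all matches are made
    return records

# Ten list items, so the pattern below matches ten times.
text = "".join("<li>item %d</li>" % i for i in range(10))
calls = []
records = apply_with_hooks(
    r"<li>(?P<ITEM>[^<>]*)</li>",
    text,
    after_each=lambda record: calls.append(record["ITEM"]),
)
print(len(calls))  # 10
```

Saving each record inside the per-match hook, rather than collecting everything and saving at the end, is what keeps memory use low on large scrapes.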

Sub-extractor patterns tab

This tab allows you to add, edit, delete, and test sub-extractor patterns. See the "Using sub-extractor patterns" section below for more on this.

Advanced tab (professional and enterprise editions only)

The advanced tab provides extended control over extractor patterns, described below.

  • Automatically save the data set generated by this extractor pattern in a session variable: If this box is checked screen-scraper will place the dataSet object generated when this extractor pattern is applied into a session variable using the identifier as the key (i.e. session variable name). For example, if your extractor pattern were named "PRODUCTS", and you checked this box, screen-scraper would apply the pattern and place the resulting dataSet into a session variable named "PRODUCTS" to be used later on. We recommend that you generally avoid checking this box unless it's absolutely needed because of memory issues it may cause. If this box is checked, screen-scraper will continue to append data to the dataSet, and all of that data will be kept in memory. The preferred method is to save data as it's being extracted, generally by invoking a script "After each pattern application" that pulls the data from dataRecord objects or session variables.
  • Filter duplicate records: When this box and the "Cache the data set" box are checked screen-scraper will filter duplicates from extracted records. See the "Filtering duplicates" section below for more details.
  • Cache the data set: In some cases you'll want to store extracted data in a session variable, but the data set will potentially grow to be very large. The "Cache the data set" checkbox will cause the extracted data to be written out to the file system as it's being extracted so that it doesn't consume RAM. When you attempt to access the data set from a script or external code it will be read from the disk into RAM temporarily so that it can be used. You'll also need to check this box if you want to filter duplicates.
  • This extractor pattern will be invoked manually from a script: If you check this box the extractor pattern will not be invoked automatically by screen-scraper. Instead, you'll invoke it in a script using the extractData and extractOneValue methods described on the using scripts page.

Extractor pattern tokens

Extractor pattern tokens can be edited by double-clicking them, or by selecting the label between the ~@ @~ delimiters, then right-clicking, and selecting "Edit extractor pattern token". This will display a small dialog box with a tabbed pane. Each pane is described below.

Extractor pattern tokens "General" tab

  • Identifier: This is a string that will be used to identify the piece of data that gets extracted as a result of this token. You should use only alphanumeric characters and underscores here.
  • Save in session variable? Checking this box causes the value extracted by the token to be saved in a session variable using the token's identifier. See using session variables for more information.
  • Regular Expression: Here you can designate a regular expression that will be used to match the text covered by this token. You can either enter one in the text box, or select one from the drop-down list. The regular expressions that appear in the drop-down list can be edited by selecting "Edit regular expressions" from the "Options" menu. In most cases you should designate a regular expression for tokens. This makes the extraction more efficient and helps to guard against future changes that might be made to the target web site.
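
The following Python sketch illustrates why giving tokens a regular expression makes a pattern more precise. It emulates tokens with named regex groups; the helper pattern_to_regex, the sample HTML, and the fallback of matching any text when no expression is set are all assumptions for illustration, not a description of screen-scraper's internals.

```python
import re

TOKEN = re.compile(r"~@(\w+)@~")

def pattern_to_regex(pattern, token_regexes):
    """Compile a pattern; tokens without an entry in token_regexes
    lazily match any text (an assumption made for this sketch)."""
    parts, pos = [], 0
    for m in TOKEN.finditer(pattern):
        name = m.group(1)
        parts.append(re.escape(pattern[pos:m.start()]))
        parts.append("(?P<%s>%s)" % (name, token_regexes.get(name, ".*?")))
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts), re.DOTALL)

page = "<b>Out of stock</b><i>n/a</i> <b>Widget</b><i>9.99</i>"
pattern = "<b>~@NAME@~</b><i>~@PRICE@~</i>"

# Only PRICE is constrained: NAME stretches across tags to reach a price.
loose = pattern_to_regex(pattern, {"PRICE": r"[\d.]+"}).search(page).groupdict()

# NAME may not contain < or >, so it can no longer swallow HTML.
precise = pattern_to_regex(
    pattern, {"NAME": r"[^<>]*", "PRICE": r"[\d.]+"}
).search(page).groupdict()

print(loose["NAME"])  # Out of stock</b><i>n/a</i> <b>Widget
print(precise)        # {'NAME': 'Widget', 'PRICE': '9.99'}
```

In the loose version the unconstrained token absorbs markup in order to make the rest of the pattern fit, which is the "matches too much data" problem mentioned in the tips at the bottom of this page.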

Extractor pattern tokens "Mapping" tab (enterprise edition only)

The mapping tab allows you to alter extracted values. Often once you extract data from a web page you need to put it into a consistent format. For example, you may want products with very similar names to have identical names.

screen-scraper makes use of mapping sets when determining how to map a given extracted value. A mapping set may contain any number of mappings, which screen-scraper will analyze in sequence until it finds a match, or runs out of mappings. As such, you'll often want to put more specific mappings higher in sequence than more general mappings.

The various columns in a mapping are defined below:

  • From: The value screen-scraper should match.
  • To: Once a match is found, indicates the new value the extracted data will assume.
  • Type: Determines the type of match that should be made in working with the value in the "From" field. The "Equals" option will match if an exact match is found, the "Contains" option will match if the value contains the text in the "From" field, and the "Regular Expression" option uses the "From" value as a regular expression to attempt to find a match.
  • Case Sensitive?: Indicates whether or not the match should be case sensitive.
  • Sequence: Determines the sequence in which the particular mapping should be analyzed.

You can create a new mapping set by simply typing a name into the "Set" box. Sets can be deleted via the "Delete Set" button, and an individual mapping can be added by clicking the "Add Mapping" button. Individual mappings can be deleted by selecting them, then right-clicking and selecting "Delete", or by pressing the "Delete" key on your keyboard after selecting them.

Consider the screen-shot of the "Mapping" tab above. If the extracted value were "Widget 123" screen-scraper would first try to match using the "Widget 1" mapping. Because this is an "Equals" match the mapping wouldn't occur, so screen-scraper would proceed to the second mapping. The second mapping would match because a "Contains" type was designated. That is, the text "Widget 123" contains the text "Widget". As such, the extracted data "Widget 123" would become "Product ABC", because that is the "To" value designated for the second mapping.

When using regular expressions in your mapping you can also make use of back references. Back references allow you to preserve values in the original text when mapped to the "To" value. For example, if you were mapping the value "Widget 123" you could use the regular expression "Widget (\d*)". In the "To" column you could then enter the value "Product \1", which, when mapped, would convert "Widget 123" to "Product 123". The value in parentheses in the "From" column gets inserted via the \1 marker found in the "To" column.
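
The mapping behavior described above can be approximated in a few lines of Python. This is a sketch under stated assumptions rather than screen-scraper's actual logic: the dictionary layout, the function names, the "To" value of the first mapping, and the choice to replace the whole value with the expanded "To" text for regular expression mappings are all hypothetical.

```python
import re

def _matches(value, mapping):
    """Return a truthy result if the mapping's "from" matches the value."""
    source = mapping["from"]
    if mapping["type"] == "regex":
        flags = 0 if mapping["case_sensitive"] else re.IGNORECASE
        return re.search(source, value, flags)
    if not mapping["case_sensitive"]:
        value, source = value.lower(), source.lower()
    if mapping["type"] == "equals":
        return value == source
    return source in value  # "contains"

def apply_mappings(value, mappings):
    """Try mappings in sequence; the first one that matches wins."""
    for mapping in sorted(mappings, key=lambda m: m["sequence"]):
        result = _matches(value, mapping)
        if result:
            if mapping["type"] == "regex":
                return result.expand(mapping["to"])  # fills in \1, \2, ...
            return mapping["to"]
    return value  # no mapping matched, so the value is left unchanged

mapping_set = [
    # The "to" value of this first mapping is made up for the example.
    {"from": "Widget 1", "to": "Product XYZ", "type": "equals",
     "case_sensitive": True, "sequence": 1},
    {"from": "Widget", "to": "Product ABC", "type": "contains",
     "case_sensitive": True, "sequence": 2},
]
backref_set = [
    {"from": r"Widget (\d*)", "to": r"Product \1", "type": "regex",
     "case_sensitive": True, "sequence": 1},
]

print(apply_mappings("Widget 123", mapping_set))  # Product ABC
print(apply_mappings("Widget 123", backref_set))  # Product 123
```

Because the first matching mapping wins, swapping the sequence numbers in mapping_set would make the general "Contains" mapping shadow the specific "Equals" one, which is why more specific mappings belong higher in the sequence.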

Extractor pattern tokens "Advanced" tab (enterprise edition only)

  • Use to filter duplicates: Indicates that this token should be used when filtering duplicates. See the "Filtering duplicates" section below for more details.
  • Strip HTML: Check this box if you'd like screen-scraper to pull out any HTML tags from the extracted value.
  • Resolve relative URL to absolute URL: If checked, this will resolve a relative URL (e.g., /myimage.gif) into an absolute URL (e.g., http://www.mysite.com/myimage.gif).
  • Convert HTML entities: This will cause any HTML entities to be converted into plain text (e.g., it will convert &amp; into &).
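
For readers who want to see what these three clean-up options amount to, here is a rough Python equivalent built on the standard library. The function names and the page URL are hypothetical, and screen-scraper's own handling may differ in edge cases (for example, malformed tags).

```python
import html
import re
from urllib.parse import urljoin

def strip_html(value):
    """Remove anything that looks like an HTML tag."""
    return re.sub(r"<[^<>]*>", "", value)

def resolve_url(value, page_url):
    """Resolve a relative URL against the URL of the page it came from."""
    return urljoin(page_url, value)

def convert_entities(value):
    """Convert HTML entities such as &amp; into plain text."""
    return html.unescape(value)

# page_url is an assumed address of the scraped page.
page_url = "http://www.mysite.com/products/list.html"

print(strip_html("<b>Widget</b> 123"))         # Widget 123
print(resolve_url("/myimage.gif", page_url))   # http://www.mysite.com/myimage.gif
print(convert_entities("Fish &amp; Chips"))    # Fish & Chips
```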

Filtering duplicates (professional and enterprise editions only)

When extracting records from web sites you'll often want to filter out duplicates. screen-scraper provides a method whereby this can be done automatically. To filter duplicates for data extracted by a given extractor pattern you'll want to go to the "Advanced" tab, then check the boxes labeled "Automatically save the data set generated by this extractor pattern in a session variable", "Filter duplicate records", and "Cache the data set". This will cause screen-scraper to generate a session variable with the same name as the extractor pattern identifier, and will save any records extracted by the pattern to the file system rather than saving them in memory.

Once you've set up the extractor pattern to cache and save the data set, you'll need to designate the fields that would identify a unique record. That is, when filtering duplicates screen-scraper will compare the values for designated columns in order to determine if a duplicate record already exists (more or less like a database compound key). You designate an extractor pattern token to be used in determining uniqueness by editing it, and checking the "Use to filter duplicates" box found under the "Advanced" tab.

Because screen-scraper filters duplicates as it's scraping you'll want to wait until the end to make use of the data. For example, if you want all of the data written to a .CSV file you would want to invoke the script that does that after the scraping session has ended. That way you can guarantee that all of the data has been extracted and filtered before you save it.
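
The compound-key comparison can be sketched in Python as follows. The function name, the sample records, and the choice to keep the first occurrence of each key are assumptions for illustration; this sketch also keeps everything in memory, whereas screen-scraper caches the data set on disk.

```python
def filter_duplicates(records, key_fields):
    """Keep the first record seen for each combination of key field values."""
    seen = set()
    unique = []
    for record in records:
        # The tuple of designated fields acts like a database compound key.
        key = tuple(record.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

# Illustrative records; NAME and PHONE play the role of the tokens
# that have "Use to filter duplicates" checked.
records = [
    {"NAME": "Juan Ferrero", "PHONE": "111-222-3333", "PAGE": "1"},
    {"NAME": "Sherry Lloyd", "PHONE": "234-5678", "PAGE": "1"},
    {"NAME": "Juan Ferrero", "PHONE": "111-222-3333", "PAGE": "2"},
]

unique = filter_duplicates(records, ["NAME", "PHONE"])
print(len(unique))  # 2
```

Note that the third record is dropped even though its PAGE value differs, because only the designated fields take part in the comparison.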

Using sub-extractor patterns

Sub-extractor patterns allow you to extract data in smaller pieces, providing significantly more flexibility in pinpointing the specific pieces you're after. Consider a search results page consisting of rows and columns of data. Using normal extractor patterns you would use a single pattern to extract the data from all columns for a single row. In many cases this works just fine; however, the process gets more complicated when each row differs significantly. For example, certain cells may be in different colors or their contents may be completely missing. With a normal extractor pattern it would be difficult to account for the variability in the cells. By using sub-extractor patterns you could create a normal extractor pattern to extract an entire row, then use individual sub-extractor patterns to pull out the individual cells.

Consider the following HTML table:

Name         | Phone                            | Address
Juan Ferrero | 111-222-3333                     | 123 Elm St.
Joe Bloggs   | No contact information available
Sherry Lloyd | 234-5678 (needs area code)       | 456 Maple Rd.

Here is the corresponding HTML source:

<table cellpadding="2" border="1">
<tr><th>Name</th><th>Phone</th><th>Address</th></tr>
<tr><td class="Name">Juan Ferrero</td><td class="Phone">111-222-3333</td><td class="Address">123 Elm St.</td></tr>
<tr><td class="Name" bgcolor="red">Joe Bloggs</td><td colspan="2">No contact information available</td></tr>
<tr><td class="Name">Sherry Lloyd</td><td class="Phone" bgcolor="yellow">234-5678 (needs area code)</td><td class="Address">456 Maple Rd.</td></tr>
</table>

It would be difficult (if not impossible) to write a single extractor pattern that would extract the information for each row because the contents of the cells differ so significantly. The different colored cells and the cell spanning two columns make the data too inconsistent to be extracted using a single pattern.

Consider this extractor pattern:

<tr><td~@DATARECORD@~/td></tr>

If applied to the HTML above the extractor pattern would produce the following three matches:

1.  class="Name">Juan Ferrero</td><td class="Phone">111-222-3333</td><td class="Address">123 Elm St.<
2.  class="Name" bgcolor="red">Joe Bloggs</td><td colspan="2">No contact information available<
3.  class="Name">Sherry Lloyd</td><td class="Phone" bgcolor="yellow">234-5678 (needs area code)</td><td class="Address">456 Maple Rd.<

Sub-extractor patterns would allow you to extract individual pieces of information from each row. For example, consider this sub-extractor pattern:

Name">~@NAME@~</td>

If applied to each of the individual extracted rows above the following three pieces of information would be extracted:

1.  Juan Ferrero
2. 
3.  Sherry Lloyd

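Note that "Joe Bloggs" didn't get extracted because the cell his name is in is red. The bgcolor="red" attribute sits between Name" and the closing >, so the literal text Name"> in the sub-extractor pattern doesn't match. Let's adjust the sub-extractor pattern slightly: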

Name"~@nonhtml@~>~@NAME@~</td>

The ~@nonhtml@~ token is an extractor pattern token that uses the "Non-HTML tags" regular expression: [^<>]*. It matches any text from the position it covers up to the next opening or closing HTML bracket. In this particular case the effect is that all three names get extracted. To extract the phone number you'd use this sub-extractor pattern:

<td class="Phone"~@nonhtml@~>~@PHONE@~</td>

The cell in the second row that spans two columns, however, would not get extracted by this sub-extractor pattern. We may still want this information, so we create the following sub-extractor pattern, just in case the cell exists:

<td colspan="2">~@PHONE@~<

If applied to our data we'd get the following results:

1. 
2. No contact information available
3.

Sub-extractor patterns aggregate everything that's extracted into a single data set. Using all of our extractor and sub-extractor patterns together we'd get the following data set:

Data record #  | Name         | Phone
Data record #1 | Juan Ferrero | 111-222-3333
Data record #2 | Joe Bloggs   | No contact information available
Data record #3 | Sherry Lloyd | 234-5678 (needs area code)

There are a few important things to note about sub-extractor patterns:

  • Note that we had two sub-extractor patterns holding a token with the same name (PHONE). If a sub-extractor pattern doesn't match anything it simply has no effect, which allows another sub-extractor pattern to match something instead. Sub-extractor patterns are applied in sequence, and the sub-extractor pattern that matches something will take precedence over those that don't.
  • The ~@DATARECORD@~ extractor pattern token is special in that it defines a block of data that you wish to apply sub-extractor patterns to.
  • When using sub-extractor patterns only the first match will be used. That is, even if a sub-extractor pattern could match multiple times, only the data corresponding to the first match will be extracted.
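
The whole walk-through can be reproduced with a short Python emulation that uses the HTML and patterns from this section. It is a sketch, not screen-scraper's implementation: the helper names, the lazy "match anything" default for tokens without a regular expression, leaving the nonhtml helper token out of the record, and letting the first sub-extractor pattern that sets a field keep it are all assumptions.

```python
import re

TOKEN = re.compile(r"~@(\w+)@~")
TOKEN_REGEXES = {"nonhtml": r"[^<>]*"}  # the "Non-HTML tags" expression

def pattern_to_regex(pattern, token_regexes=None):
    """Compile a pattern; tokens without a regex lazily match any text."""
    token_regexes = token_regexes or {}
    parts, pos = [], 0
    for m in TOKEN.finditer(pattern):
        name = m.group(1)
        parts.append(re.escape(pattern[pos:m.start()]))
        parts.append("(?P<%s>%s)" % (name, token_regexes.get(name, ".*?")))
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts), re.DOTALL)

HTML_SOURCE = """<table cellpadding="2" border="1">
<tr><th>Name</th><th>Phone</th><th>Address</th></tr>
<tr><td class="Name">Juan Ferrero</td><td class="Phone">111-222-3333</td><td class="Address">123 Elm St.</td></tr>
<tr><td class="Name" bgcolor="red">Joe Bloggs</td><td colspan="2">No contact information available</td></tr>
<tr><td class="Name">Sherry Lloyd</td><td class="Phone" bgcolor="yellow">234-5678 (needs area code)</td><td class="Address">456 Maple Rd.</td></tr>
</table>"""

MAIN_PATTERN = "<tr><td~@DATARECORD@~/td></tr>"
SUB_PATTERNS = [
    'Name"~@nonhtml@~>~@NAME@~</td>',
    '<td class="Phone"~@nonhtml@~>~@PHONE@~</td>',
    '<td colspan="2">~@PHONE@~<',
]

def extract_records(source):
    main = pattern_to_regex(MAIN_PATTERN)
    subs = [pattern_to_regex(p, TOKEN_REGEXES) for p in SUB_PATTERNS]
    records = []
    for row in main.finditer(source):
        record = {}
        for sub in subs:  # sub-extractor patterns are applied in sequence
            found = sub.search(row.group("DATARECORD"))  # first match only
            if not found:
                continue  # a sub-pattern that doesn't match has no effect
            for name, value in found.groupdict().items():
                if name != "nonhtml":
                    record.setdefault(name, value)
        records.append(record)
    return records

records = extract_records(HTML_SOURCE)
for number, record in enumerate(records, start=1):
    print(number, record.get("NAME"), "|", record.get("PHONE"))
```

The header row is skipped because it starts with <tr><th rather than <tr><td, and each remaining row ends up as one data record with a NAME and a PHONE value.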

Tips on using extractor patterns

  • Test your patterns frequently. Extractor patterns take some practice. Especially when you're first trying them out you'll want to test them as you're working with them. It often helps to test a pattern after every couple of tokens you insert.
  • Use regular expressions to make your extractor patterns more precise. One of the most common problems encountered occurs when an extractor pattern matches too much data, which usually includes a lot of HTML. There are a couple of ways to address this problem. One is to extend the pattern outward. That is, include HTML that falls before and after the block you're trying to match. The second approach, which is generally the easier of the two, is to include regular expressions. We've included a number of common regular expressions that you can select from the drop-down list. In general, if you can use more precise regular expressions you can reduce the amount of HTML in the extractor pattern. Doing so makes your patterns more resilient to changes that might be made to the web site you're scraping.
  • Ensure that the pattern extracts the number of data records you expect it to. Oftentimes your pattern might not be as flexible as you think it is. Test it out to make sure it matches as many times as you think it should.
  • Try tidying the HTML. This will ensure that white space is handled consistently and will often clean up extraneous characters. The setting that determines whether or not HTML gets tidied by default is adjusted under the "General" tab of the settings window (click on Options->Settings from the menu), and also under the "Advanced" tab for the scrapeable file (which overrides the checkbox in the "Settings" window).