Using Extractor Patterns
Overview
Extractor patterns allow you to pinpoint select snippets of data that you want extracted from a web page. They're often the most confusing part of screen-scraper, so you'll want to look over this page carefully.

An extractor pattern is a block of text (usually HTML) that contains special tokens that will match pieces of data you're interested in extracting. These tokens are text labels surrounded by the delimiters ~@ and @~. The label between the delimiters should contain only alpha-numeric characters and underscores.

You can think of an extractor pattern like a stencil. A stencil is an image in cut-out form, often made of thin cardboard. As you place a stencil over a piece of paper, apply paint to it, then remove the stencil, the paint remains only where there were holes in the stencil. Analogously, you can think of placing an extractor pattern over the HTML of a web page where the tokens correspond to the holes where the paint would pass through. After an extractor pattern is applied it reveals the portions of the web page you'd like to extract.

Extractor pattern tokens designate regions where data elements are to be captured. For example, given the following HTML snippet:

you would extract "piece of text" by creating an extractor pattern with a token positioned like so:

The extracted text could then be accessed via the identifier "EXTRACTED_TEXT".

If you haven't done so already, we'd recommend going through our first tutorial to get a better feel for using extractor patterns.

Managing extractor patterns
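To make the stencil analogy from the overview concrete, here is a minimal Python sketch of how a pattern containing ~@LABEL@~ tokens could be turned into a regular expression and applied to a page. This is an illustration only, not screen-scraper's actual implementation: the function name pattern_to_regex, the sample HTML, and the choice of a non-greedy "match anything" expression for each token are all assumptions made for the example.

```python
import re

# Tokens are labels wrapped in ~@ and @~ (alphanumerics and underscores only).
TOKEN = re.compile(r"~@([A-Za-z0-9_]+)@~")


def pattern_to_regex(extractor_pattern):
    """Hypothetical helper: convert literal text with ~@LABEL@~ tokens into
    a compiled regex plus the list of token labels, in order."""
    parts, labels, pos = [], [], 0
    for tok in TOKEN.finditer(extractor_pattern):
        # Text outside the tokens is the "cardboard" of the stencil: match it literally.
        parts.append(re.escape(extractor_pattern[pos:tok.start()]))
        # Each token is a "hole": capture whatever sits there (assumed non-greedy).
        parts.append("(.*?)")
        labels.append(tok.group(1))
        pos = tok.end()
    parts.append(re.escape(extractor_pattern[pos:]))
    return re.compile("".join(parts), re.DOTALL), labels


# Hypothetical page content and pattern, invented for this sketch.
html = "<p>This is a <b>piece of text</b> I'd like to extract.</p>"
pattern = "<b>~@EXTRACTED_TEXT@~</b>"

regex, labels = pattern_to_regex(pattern)
match = regex.search(html)
extracted = dict(zip(labels, match.groups())) if match else {}
print(extracted)
```

Looking up extracted["EXTRACTED_TEXT"] in this sketch mirrors accessing the extracted value by its token identifier.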
Any number of extractor patterns can be associated with a given scrapeable file, and they are applied to the file in a designated sequence. Manage them by clicking on a scrapeable file, then on the "Extractor Patterns" tab. Add an extractor pattern by clicking on the "Add Extractor Pattern" button.

Any number of tokens can appear within an extractor pattern. The recommended way to create a token is to select a region of text in an extractor pattern, right-click the selected region, and select "Generate extractor pattern token from selected text" in the pop-up menu (note that this can be done either when viewing the extractor pattern as HTML or as plain text). Creating an extractor pattern token in this manner will open a window that allows you to edit the attributes of the token.

It's also often helpful to use an external text editor when creating extractor patterns, where you can store snippets of HTML you're working with. You can then copy text into screen-scraper as needed.

If an extractor pattern takes too long to match a block of text it will time out. The timeout setting may be adjusted from the "Settings" window (click on the Options->Settings menu item) under the "General" tab. If you find that your extractor pattern is timing out you might try adjusting it by using more precise regular expressions. The tips at the bottom of this page might also help.

Note that when creating extractor patterns you should use the HTML that will be found under the "Last Response" tab associated with a scrapeable file. By default, screen-scraper will "tidy" the HTML once it's been scraped, meaning that it will format it in a consistent way that makes it easier to work with. If you instead use the HTML obtained by viewing the source for a page in your web browser, it will likely be different from the HTML that screen-scraper generates.

Main tab
The main tab allows you to edit the primary attributes of the pattern, and contains the following elements:
Sub-extractor patterns tab
This tab allows you to add, edit, delete, and test sub-extractor patterns. See the "Using sub-extractor patterns" section below for more on this.

Advanced tab (professional and enterprise editions only)
The advanced tab provides extended control over extractor patterns, described below.
Extractor pattern tokens
Extractor pattern tokens can be edited by double-clicking them, or by selecting the label between the ~@ @~ delimiters, then right-clicking, and selecting "Edit extractor pattern token". This will display a small dialog box with a tabbed pane. Each pane is described below.

Extractor pattern tokens "General" tab
Extractor pattern tokens "Mapping" tab (enterprise edition only)
The mapping tab allows you to alter extracted values. Often once you extract data from a web page you need to put it into a consistent format. For example, you may want products with very similar names to have identical names.

screen-scraper makes use of mapping sets when determining how to map a given extracted value. A mapping set may contain any number of mappings, which screen-scraper will analyze in sequence until it finds a match, or runs out of mappings. As such, you'll often want to put more specific mappings higher in sequence than more general mappings. The various columns in a mapping are defined below:
You can create a new mapping set by simply typing a name into the "Set" box. Sets can be deleted via the "Delete Set" button, and an individual mapping can be added by clicking the "Add Mapping" button. Individual mappings can be deleted by selecting them, then right-clicking and selecting "Delete", or by pressing the "Delete" key on your keyboard after selecting them.

Consider the screen-shot of the "Mapping" tab above. If the extracted value were "Widget 123" screen-scraper would first try to match using the "Widget 1" mapping. Because this is an "Equals" match the mapping wouldn't occur, so screen-scraper would proceed to the second mapping. The second mapping would match because a "Contains" type was designated. That is, the text "Widget 123" contains the text "Widget". As such, the extracted data "Widget 123" would become "Product ABC", because that is the "To" value designated for the second mapping.

When using regular expressions in your mapping you can also make use of back references. Back references allow you to preserve values in the original text when mapped to the "To" value. For example, if you were mapping the value "Widget 123" you could use the regular expression "Widget (\d*)". In the "To" column you could then enter the value "Product \1", which, when mapped, would convert "Widget 123" to "Product 123". The value in parentheses in the "From" column gets inserted via the \1 marker found in the "To" column.

Extractor pattern tokens advanced tab (enterprise edition only)
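The mapping walkthrough from the "Mapping" tab section (mappings tried in sequence, first match wins, back references in regular-expression mappings) can be sketched in Python as follows. This is a rough model for illustration, not screen-scraper's own code: the Mapping class, the apply_mappings function, the lowercase match-type names, case-sensitive comparison, and returning the value unchanged when nothing matches are all assumptions. The "To" value of the first mapping is not given in the text, so a placeholder is used.

```python
import re
from dataclasses import dataclass


@dataclass
class Mapping:
    """Hypothetical stand-in for one row of a mapping set."""
    from_value: str
    to_value: str
    match_type: str  # "equals", "contains", or "regex" (names assumed)


def apply_mappings(value, mappings):
    """Try each mapping in order and return the "To" value of the first match."""
    for m in mappings:
        if m.match_type == "equals" and value == m.from_value:
            return m.to_value
        if m.match_type == "contains" and m.from_value in value:
            return m.to_value
        if m.match_type == "regex":
            found = re.search(m.from_value, value)
            if found:
                # expand() substitutes back references such as \1 in the "To" value.
                return found.expand(m.to_value)
    # Assumption: a value that matches no mapping is left as it was.
    return value


# The two mappings described in the walkthrough; the first "To" value is a placeholder.
screenshot_set = [
    Mapping("Widget 1", "PLACEHOLDER", "equals"),
    Mapping("Widget", "Product ABC", "contains"),
]
mapped = apply_mappings("Widget 123", screenshot_set)

# The back-reference example from the text.
regex_set = [Mapping(r"Widget (\d*)", r"Product \1", "regex")]
mapped_regex = apply_mappings("Widget 123", regex_set)

print(mapped, "|", mapped_regex)
```

Because the first matching mapping wins, reordering screenshot_set so that the "Contains" mapping came first would shadow the more specific "Equals" mapping, which is why more specific mappings belong higher in the sequence.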
Filtering duplicates (professional and enterprise editions only)
When extracting records from web sites you'll often want to filter out duplicates. screen-scraper provides a method whereby this can be done automatically. To filter duplicates for data extracted by a given extractor pattern you'll want to go to the "Advanced" tab, then check the boxes labeled "Automatically save the data set generated by this extractor pattern in a session variable", "Filter duplicate records", and "Cache the data set". This will cause screen-scraper to generate a session variable with the same name as the extractor pattern identifier, and will save any records extracted by the pattern to the file system rather than saving them in memory.

Once you've set up the extractor pattern to cache and save the data set, you'll need to designate the fields that would identify a unique record. That is, when filtering duplicates screen-scraper will compare the values for designated columns in order to determine if a duplicate record already exists (more or less like a database compound key). You designate an extractor pattern token to be used in determining uniqueness by editing it, and checking the "Use to filter duplicates" box found under the "Advanced" tab.

Because screen-scraper filters duplicates as it's scraping you'll want to wait until the end to make use of the data. For example, if you want all of the data written to a .CSV file you would want to invoke the script that does that after the scraping session has ended. That way you can guarantee that all of the data has been extracted and filtered before you save it.

Using sub-extractor patterns
Sub-extractor patterns allow you to extract data in smaller pieces, providing significantly more flexibility in pinpointing the specific pieces you're after. Consider a search results page consisting of rows and columns of data. Using normal extractor patterns you would use a single pattern to extract the data from all columns for a single row.
In many cases this works just fine; however, the process gets more complicated when each row differs significantly. For example, certain cells may be in different colors or their contents may be completely missing. With a normal extractor pattern it would be difficult to account for the variability in the cells. By using sub-extractor patterns you could create a normal extractor pattern to extract an entire row, then use individual sub-extractor patterns to pull out the individual cells. Consider the following HTML table:
Here is the corresponding HTML source:
It would be difficult (if not impossible) to write a single extractor pattern that would extract the information for each row because the contents of the cells differ so significantly. The different colored cells and the cell spanning two columns make the data too inconsistent to be extracted using a single pattern. Consider this extractor pattern:
<tr><td~@DATARECORD@~/td></tr>

If applied to the HTML above the extractor pattern would produce the following three matches:
Sub-extractor patterns would allow you to extract individual pieces of information from each row. For example, consider this sub-extractor pattern:

If applied to each of the individual extracted rows above the following three pieces of information would be extracted:
1. Juan Ferrero

Note that "Joe Bloggs" didn't get extracted because the cell his name is in is red. Let's adjust the sub-extractor pattern slightly:
<td class="Name"~@nonhtml@~>~@NAME@~</td>
The ~@nonhtml@~ tag represents an extractor pattern token that uses the "Non-HTML tags" regular expression:
<td class="Phone"~@nonhtml@~>~@PHONE@~</td>

We also have the case of the cell in the second row that spans two columns, which would not get extracted by the sub-extractor pattern. We may still want this information, so we create the following sub-extractor pattern, just in case the cell exists:

If applied to our data we'd get the following results:
Sub-extractor patterns aggregate everything that's extracted into a single data set. Using all of our extractor and sub-extractor patterns together we'd get the following data set:
There are a few important things to note about sub-extractor patterns:
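As a rough illustration of the row-then-cell approach described in this section, the Python sketch below first extracts each table row, then applies several small sub-patterns to that row and aggregates whatever matched into one record per row. Everything here is assumed for the example: the sample table (the original is not reproduced on this page), the phone numbers and third name, the NOTE label for the two-column cell, the use of [^<>]* as a stand-in for the "Non-HTML tags" expression, and taking only the first match of each sub-pattern within a row.

```python
import re

# Hypothetical table, invented for this sketch: one red cell and one cell
# spanning two columns, to mimic the irregular rows discussed above.
HTML = """<table>
<tr><td class="Name">Juan Ferrero</td><td class="Phone">555-0101</td></tr>
<tr><td class="Name" bgcolor="red">Joe Bloggs</td><td colspan="2">No phone listed</td></tr>
<tr><td class="Name">Jane Doe</td><td class="Phone" bgcolor="yellow">555-0102</td></tr>
</table>"""

# Main pattern: one match per table row.
ROW = re.compile(r"<tr>(.*?)</tr>", re.DOTALL)

# Sub-patterns, each applied only to the text of a single row.
# [^<>]* plays the role of the ~@nonhtml@~ token (assumed definition),
# so extra attributes such as bgcolor do not prevent a match.
SUB_PATTERNS = {
    "NAME": re.compile(r'<td class="Name"[^<>]*>(.*?)</td>', re.DOTALL),
    "PHONE": re.compile(r'<td class="Phone"[^<>]*>(.*?)</td>', re.DOTALL),
    "NOTE": re.compile(r'<td colspan="2"[^<>]*>(.*?)</td>', re.DOTALL),
}


def extract_records(html):
    """Return one dict per row, holding only the sub-patterns that matched."""
    records = []
    for row in ROW.finditer(html):
        record = {}
        for label, sub in SUB_PATTERNS.items():
            found = sub.search(row.group(1))  # first match within the row only
            if found:
                record[label] = found.group(1)
        records.append(record)
    return records


records = extract_records(HTML)
for record in records:
    print(record)
```

A sub-pattern that fails on a given row simply leaves its column out of that record, which is what lets irregular rows be handled without one all-encompassing pattern.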
Tips on using extractor patterns