Filter Extraction Duplicates

This feature is only available to Professional and Enterprise edition users.

Overview

When extracting records from web sites you'll often want to filter out duplicates. screen-scraper provides a way to do this automatically. To filter duplicates for the data extracted by a given extractor pattern, make sure that the "Automatically save the data set generated by this extractor pattern in a session variable", "Filter duplicate records", and "Cache the data set" boxes are checked in the extractor pattern's advanced tab. This causes screen-scraper to create a session variable with the same name as the extractor pattern identifier, and to save any records extracted by the pattern to the file system rather than holding them in memory.

Once you've set up the extractor pattern to cache and save the data set, you'll need to designate the fields that identify a unique record. That is, when filtering duplicates, screen-scraper compares the values of the designated fields to determine whether a duplicate record already exists (much like a compound key in a database). To specify which extractor tokens to use, check the "Use to filter duplicates" box in each extractor token's advanced tab.
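
The compound-key comparison described above can be illustrated with a short standalone sketch. This is not screen-scraper's own implementation or API. It is plain Python written only to show the idea, and the field names (NAME, SKU, PRICE) and the function filter_duplicates are hypothetical.

```python
def filter_duplicates(records, key_fields):
    """Keep the first record seen for each combination of key field values.

    Illustrative only: this mimics the idea of a compound key and is not
    screen-scraper's internal logic.
    """
    seen = set()
    unique = []
    for record in records:
        # Build the compound key from the designated fields only.
        key = tuple(record.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical extracted records; NAME and SKU play the role of the
# tokens marked for duplicate filtering, PRICE is not part of the key.
records = [
    {"NAME": "Widget", "SKU": "A1", "PRICE": "9.99"},
    {"NAME": "Widget", "SKU": "A1", "PRICE": "10.49"},
    {"NAME": "Widget", "SKU": "B2", "PRICE": "9.99"},
]

unique_records = filter_duplicates(records, ["NAME", "SKU"])
```

In this sketch the second record is dropped even though its PRICE differs, because only the designated key fields are compared. This is why the choice of tokens to use for filtering matters.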

Because screen-scraper filters duplicates as it's scraping, you'll want to wait until matching has finished before you make use of the data. For example, if you want all of the data written to a CSV file, invoke the script that writes the file after the scraping session has ended. That way all of the data has been extracted and filtered before you save it.
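
As a rough illustration of writing the output in a single pass once everything has been collected, here is a minimal standalone Python sketch. It assumes the filtered records are already available as a list of dictionaries. How you retrieve them from the session variable inside screen-scraper is not shown, and the names write_csv, final_records, NAME, and SKU are hypothetical.

```python
import csv
import io


def write_csv(records, fieldnames, out):
    """Write all records to a file-like object in one pass."""
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)


# Hypothetical final, already-filtered records.
final_records = [
    {"NAME": "Widget", "SKU": "A1"},
    {"NAME": "Widget", "SKU": "B2"},
]

# An in-memory buffer keeps the sketch self-contained; a real script
# would open a file instead, e.g. open("output.csv", "w", newline="").
buffer = io.StringIO()
write_csv(final_records, ["NAME", "SKU"], buffer)
csv_text = buffer.getvalue()
```

Writing once at the end, rather than after each match, avoids saving rows that would later turn out to be duplicates.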