Tips and Suggestions

Any recommendations on how to handle projects that involve large numbers of scraping sessions?

When you're dealing with large numbers of scraping sessions, it becomes too cumbersome to retain them all in the workbench. Even if you organize them neatly into folders, there will likely still be too many to work with comfortably. Rather than keep all scraping sessions in the workbench at once, we generally find it useful to export and save them all to a central directory, which ideally is under version control using something like Subversion or CVS. When you need to work with a particular scraping session, you simply import it from the repository. Every once in a while, you export the scraping session back to the central directory. Ideally the directory also gets backed up periodically so that you don't lose any work.

When working with a project where there are a large number of scraping sessions, you'll also often have a series of "general" scripts that get used by most, if not all, of your scraping sessions. For example, you might have one script that gets invoked by every scraping session, which is in charge of opening a database connection or initializing a file to which extracted data will be written. We typically handle these "general" scripts by storing them in a separate folder, alongside the one where all of the scraping sessions are stored. This directory should get versioned and backed up as well. The difference with the "general" scripts is that it's typically a good idea to keep them all in the workbench in their own folder. Usually there aren't very many of them, and they get used often enough that you'll want to retain them in the screen-scraper workbench.

How do I send dynamic POST parameters in screen-scraper?

If you've gone through our first few tutorials, you know that session variables can be embedded in URLs by using a token like this: ~#FOO#~ (see this page for a detailed example of this). The very same technique can be used with POST parameters. When you create a scrapeable file that uses POST parameters, they'll be displayed under the "Parameters" tab for that scrapeable file. In any of those POST parameters you can use the same type of token. For example, if you're logging in to a web site (as described here), instead of hard-coding the username and password, you might substitute the tokens ~#USERNAME#~ and ~#PASSWORD#~ in the "Value" column for the respective parameters. Prior to invoking that scrapeable file, you would set two session variables named USERNAME and PASSWORD; their values would then be substituted for the ~#USERNAME#~ and ~#PASSWORD#~ tokens.
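To make the substitution concrete, here is a minimal stand-alone Java sketch of the idea. It is not screen-scraper's own code: the class TokenDemo, the method substitute, and the sample values are hypothetical, and a plain Map stands in for the session variables.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative stand-in for token substitution; TokenDemo and substitute()
// are hypothetical names, not part of the screen-scraper API.
class TokenDemo {
    // Matches tokens of the form ~#NAME#~ and captures NAME.
    private static final Pattern TOKEN = Pattern.compile("~#([A-Za-z0-9_]+)#~");

    static String substitute(String template, Map<String, String> sessionVariables) {
        Matcher m = TOKEN.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = sessionVariables.get(m.group(1));
            // Leave the token untouched if no variable with that name is set.
            String replacement = (value != null) ? value : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up credentials standing in for the two session variables.
        Map<String, String> vars = Map.of("USERNAME", "jdoe", "PASSWORD", "s3cret");
        System.out.println(substitute("~#USERNAME#~", vars));
        System.out.println(substitute("~#PASSWORD#~", vars));
    }
}
```

In a real scraping session you would not write this yourself; a script that runs before the scrapeable file would simply set the USERNAME and PASSWORD session variables and screen-scraper would fill in the tokens.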

I'm trying to scrape an HTML form that requires the user to type in text shown in an image. Can screen-scraper handle this?

This is known as a CAPTCHA mechanism, and is intended to discourage automated form submissions. There are essentially two ways of working around these:
Oftentimes sites will use a poorly implemented CAPTCHA such that it can be determined up front what the text will read. For example, the site may actually have only four or five images, and it simply cycles through them. By looking at the names of the images one could determine what the corresponding text will be. The text could then be used to populate the appropriate HTML form.
Assuming the CAPTCHA mechanism works as it should (i.e., that a human being would have to type in the text shown in the image), it gets a bit trickier to deal with. The best route would probably be to run a scraping session as you normally would, then, once you arrive at the page containing the CAPTCHA, follow these steps:

  1. Download the CAPTCHA image to the local hard drive (e.g., using the session.downloadFile method).
  2. Using a screen-scraper script, pop up a dialog box using Java code that displays the image, and contains a text box that will accept user input. Within a script you have full access to the Java API, so you could pop up something like a custom JDialog containing the image and text box.
  3. Have a person type into the text box the characters displayed in the image.
  4. Accept the text entered by the user, then drop it into a screen-scraper session variable.
  5. Use the value in the session variable to populate the HTML form element.

This obviously isn't ideal, but, unfortunately, there may not be another way. The CAPTCHA images are designed such that they can't be read by a machine. As such, human intervention is required.
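Steps 2 through 4 above could be sketched roughly as follows. This uses a stock JOptionPane input dialog rather than a custom JDialog, simply because it is shorter. The class name CaptchaPrompt, its methods, and the default file name captcha.png are hypothetical; the image is assumed to have been downloaded already.

```java
import javax.swing.ImageIcon;
import javax.swing.JOptionPane;

// Hypothetical helper: CaptchaPrompt, ask() and clean() are illustrative names.
class CaptchaPrompt {
    // Shows the downloaded image next to a text field and waits for the user.
    // Returns null if the user cancels or enters nothing.
    static String ask(String imagePath) {
        ImageIcon image = new ImageIcon(imagePath);
        Object typed = JOptionPane.showInputDialog(
                null, "Type the characters shown in the image:", "CAPTCHA",
                JOptionPane.QUESTION_MESSAGE, image, null, "");
        return clean(typed == null ? null : typed.toString());
    }

    // Trims surrounding whitespace and treats empty input as "no answer".
    static String clean(String typed) {
        if (typed == null) return null;
        String trimmed = typed.trim();
        return trimmed.isEmpty() ? null : trimmed;
    }

    public static void main(String[] args) {
        // "captcha.png" is a placeholder path for the downloaded image.
        String answer = ask(args.length > 0 ? args[0] : "captcha.png");
        System.out.println(answer == null ? "No text entered" : "Entered: " + answer);
    }
}
```

Inside a screen-scraper script, the returned text would then be stored in a session variable so that it can populate the form parameter, as in step 5.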

How do I extract data from two tables that are basically identical in structure?

This isn't a scenario you'll run into too often, but it's common enough that we decided to include it in the FAQ. At times you may run into a page containing various tables of data. All of the tables are essentially identical in structure, but when you extract the data you want to be able to tell which rows of data came from which tables. For example, consider this page. If you view the HTML from the page you'll notice that the structure of the two tables is basically the same. If you use a normal extractor pattern that matches a row of data, though, you're going to get all four rows of data, and won't be able to tell which row came from which table. That is, your first inclination might be to use an extractor pattern like this:



<tr>
<td class="datacell">~@CELL_DATA1@~</td>
<td class="datacell">~@CELL_DATA2@~</td>
<td class="datacell">~@CELL_DATA3@~</td>
<td class="datacell">~@CELL_DATA4@~</td>
</tr>





It matches the data just fine, but you don't know which table each row came from.
In situations like these there are two possible approaches. The first is to use regular expressions that match the data in such a way that you are able to differentiate between the table rows. For example, download this scraping session and import it into screen-scraper. If you run it, you'll notice that it extracts the data from each table separately. It does this by using regular expressions that differentiate the data in the first table (whose cells all end with the letter "a") from the data in the second table (whose cells all end with the letter "x"). You can see this by opening the "Table 1 row" or "Table 2 row" extractor patterns, and editing the properties on any of the tokens (e.g., ~@CELL_DATA1@~). If you look under the "Regular Expression" tab, you'll see the expression that makes the match.
Unfortunately, it's not always the case that regular expressions will allow you to distinguish between table rows. The alternative is to handle the data extraction in scripts. Note that this approach requires the professional edition of screen-scraper, and makes use of the scrapeableFile.extractData method. Download this scraping session and import it into screen-scraper. Again, if you run it, you'll notice that it extracts the data from the two tables separately. The scripts here provide the key to extracting the data. Take a look at the "Similar tables--extract table 1 data" script. It gets invoked after the "Table 1 data" extractor pattern matches.
If you've encountered a similar situation to the one presented here it's possible you can use these examples to tackle the task. Take a careful look through the extractor patterns and scripts to see how they're set up. If you have questions on them or run into any trouble, don't hesitate to post to our support forum.
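The regular-expression approach can be illustrated outside of screen-scraper with plain Java regular expressions. This is only a sketch of the principle: the sample HTML below is made up (it is not the page referenced above), and TableRows, rowPattern, and rows are hypothetical names. Each cell is required to end with a given suffix, which is what lets the two tables be told apart.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of telling similar rows apart by cell content.
class TableRows {
    // Builds a row pattern whose four cells must all end with the given suffix.
    static Pattern rowPattern(String suffix) {
        String cell = "<td class=\"datacell\">([^<]*" + Pattern.quote(suffix) + ")</td>\\s*";
        return Pattern.compile("<tr>\\s*" + cell + cell + cell + cell + "</tr>");
    }

    // Returns the four cell values of every row that matches the suffix.
    static List<String[]> rows(String html, String suffix) {
        List<String[]> result = new ArrayList<>();
        Matcher m = rowPattern(suffix).matcher(html);
        while (m.find()) {
            result.add(new String[] { m.group(1), m.group(2), m.group(3), m.group(4) });
        }
        return result;
    }

    static String row(String... cells) {
        StringBuilder sb = new StringBuilder("<tr>\n");
        for (String c : cells) {
            sb.append("<td class=\"datacell\">").append(c).append("</td>\n");
        }
        return sb.append("</tr>\n").toString();
    }

    public static void main(String[] args) {
        // Made-up sample: one row per table, distinguished only by cell endings.
        String html = row("1a", "2a", "3a", "4a") + row("1x", "2x", "3x", "4x");
        System.out.println("Table 1 rows: " + rows(html, "a").size());
        System.out.println("Table 2 rows: " + rows(html, "x").size());
    }
}
```

In screen-scraper itself the same effect comes from setting a regular expression on each token of the extractor pattern, as described above.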

Will screen-scraper notify me if the site I'm scraping changes?

Once you've set up screen-scraper to extract data from a web site there's a good chance the web site will change at some point. Oftentimes cosmetic changes such as the addition of a font tag or changing text from bold to italic won't affect anything, but if the site makes more dramatic changes, such as altering its navigation system, then your scraping session will break. This generally causes screen-scraper to either fail to extract records from the site entirely, or scrape significantly fewer records than it had previously. It also usually means that you'll need to update your scraping session to account for the changes in the web site.

There are two approaches we generally take to addressing this issue. The first (and best) approach is to track the number of records screen-scraper extracts each time the scraping session is run. Let's suppose you're extracting records from a site that, on average, will yield about 100 records. If you run the scrape one day and it suddenly only extracts 10 records then something has likely changed with the site, so you'll probably need to adjust your scraping session to account for it. The second approach is to have a special extractor pattern or two that checks for a specific piece of text that you know should be present every time you scrape. This approach is most useful in cases where a site doesn't yield a consistent number of records. If your special extractor pattern doesn't match the text it's looking for then something has likely changed on the site.

Along with all of this you'll likely want some kind of notification system so that you can be made aware when the site changes. To do this you might consider something like screen-scraper's sendMail function. Even better would be to set up an external application that monitors the number of records scraped each time, then logs an error in a database or log file if something comes up.
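The record-count check from the first approach might look something like this in an external monitoring application. It is a sketch under assumptions: RecordCountCheck and its methods are hypothetical names, the history values are invented, and the 50% threshold is an arbitrary choice you would tune for your own site.

```java
import java.util.List;

// Hypothetical monitor: flags a run whose record count drops well below
// the average of earlier runs.
class RecordCountCheck {
    static double average(List<Integer> previousCounts) {
        return previousCounts.stream().mapToInt(Integer::intValue).average().orElse(0.0);
    }

    // True when the latest count is below minimumFraction of the historical average.
    // With no history the average is 0, so nothing is flagged.
    static boolean looksBroken(int latest, List<Integer> previousCounts, double minimumFraction) {
        return latest < average(previousCounts) * minimumFraction;
    }

    public static void main(String[] args) {
        List<Integer> history = List.of(98, 102, 100); // invented counts from earlier runs
        int latest = 10;
        if (looksBroken(latest, history, 0.5)) {
            System.out.println("Only " + latest + " records extracted; the site may have changed.");
        }
    }
}
```

Where the check reports a problem, you would hook in whatever notification you prefer, such as an email or an entry in a log file.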

How do I make a backup of the work I've done in screen-scraper?

As with any work you do on your computer, it's good to back it up once in a while. The preferred method for doing this in screen-scraper is to export your scraping sessions and scripts as XML files (note that you only need to back up the scripts that aren't referenced in scraping sessions--any scripts called from within scraping sessions will be automatically exported along with the scraping session). Once the files have been exported you might also consider storing them in a versioning system such as CVS or Subversion.

screen-scraper will automatically back up your database periodically to ensure that you don't lose any work. You can also manually invoke this backup process by selecting "Backup Database" from the "File" menu. The database backups are stored in the "resource/db/backup" folder. The directories within that folder contain previous versions of your database. If your database has somehow become corrupted, you may be able to simply revert to a previous version. Help on that can be found here.

I'm running screen-scraper on a machine that isn't connected to the Internet. How do I transfer my screen-scraper license to it?

Follow these steps:

  1. On a machine that does have Internet access, install and register screen-scraper.
  2. Install screen-scraper on the machine that isn't connected to the Internet, if necessary.
  3. With both instances of screen-scraper closed (i.e., not running the workbench, server, or in command line mode), copy all files beginning with "ss" from the "resource/db" folder of the licensed instance of screen-scraper on top of the corresponding files of the unlicensed instance of screen-scraper, overwriting them.

Note that in doing this you'll be copying the entire screen-scraper database from one machine to another, so along with the licensing information it will also copy any scraping sessions, proxy sessions, and scripts. This also means overwriting any of those objects found on the unlicensed instance. Before copying the database over, take care to export any objects from the unlicensed instance that you'd like to retain.

I'm running screen-scraper in a GUI-less environment (e.g., not running XWindows). How do I update it to the latest version?

Normally, when screen-scraper updates itself, it downloads a zip file from our server, decompresses it, copies the files it contains on top of the existing files, then updates its version number. In a GUI-less environment you'll need to do this manually. To do so follow these steps:

  1. Download the update file. The URL for the update you'll need can be generated for you using our updater form.
  2. Decompress it.
  3. Copy the contents on top of your existing screen-scraper files.
  4. Edit the "Version" property of your "resource/conf/screen-scraper.properties" file so that it reflects the new version.

The next time you launch screen-scraper you'll have the updated version.
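Step 4 can be done in any text editor, but if you want to script it, a small Java sketch along these lines would work. It assumes the property is stored on a line of the form Version=..., which you should confirm by looking at your own file first; VersionEdit and withVersion are hypothetical names, and the new version string is whatever the update you downloaded reports.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that rewrites only the "Version" line of the properties file.
class VersionEdit {
    // Returns a copy of the lines with the Version property set to newVersion.
    // Assumes a "Version=..." (or "Version: ...") line; other lines are kept as-is.
    static List<String> withVersion(List<String> lines, String newVersion) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            out.add(line.matches("\\s*Version\\s*[=:].*") ? "Version=" + newVersion : line);
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java VersionEdit.java <new-version>");
            return;
        }
        // Path is relative to the screen-scraper install folder.
        Path file = Path.of("resource/conf/screen-scraper.properties");
        Files.write(file, withVersion(Files.readAllLines(file), args[0]));
    }
}
```

It's worth copying the properties file somewhere safe before running anything that rewrites it.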

We're in the process of creating a browser-based interface for screen-scraper that will allow you to update screen-scraper without having to go through this manual process.

I'm running screen-scraper in a GUI-less environment (e.g., not running XWindows). How do I transfer my screen-scraper license to my server?

If you're using the Enterprise Edition of screen-scraper, this can be done via the web interface.

In the Professional or Enterprise Edition of screen-scraper, create a text file named register.txt in screen-scraper's folder. The file should contain a single line with the email address under which you registered screen-scraper. Then either start the screen-scraper server or invoke screen-scraper from the command line. screen-scraper will read in that file, validate the license, then write the result of the validation to a file called register_result.txt. Once the license has been validated, the register_result.txt file can be deleted.
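On a headless server the file can be created with any tool you like. For consistency with the other examples, here is a small Java sketch; RegisterFile and write are hypothetical names, and the install folder and email address are supplied by you on the command line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper that writes register.txt into the screen-scraper folder.
class RegisterFile {
    // Writes a single line containing the registration email address.
    static Path write(Path installDir, String email) throws IOException {
        Path file = installDir.resolve("register.txt");
        Files.writeString(file, email.trim() + System.lineSeparator());
        return file;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java RegisterFile.java <install-dir> <email>");
            return;
        }
        System.out.println("Wrote " + write(Path.of(args[0]), args[1]));
    }
}
```

After starting the server or a command-line run, check register_result.txt in the same folder to see whether validation succeeded.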

How can I optimize screen-scraper's performance?

Here are some tips:

  • Allocate more memory to screen-scraper. This can be done under the "Settings" dialog box (click the wrench icon) via the "Maximum memory allocation" setting.
  • Run long scrapes either from the command line or in server mode. The workbench is really just designed for creating scraping sessions and such; if you try to run long scrapes from it you could encounter memory problems.
  • Only save values in session variables when you have to. This is especially true for data sets extracted by extractor patterns. Each time you save a value in a session variable screen-scraper keeps it in memory for the life of the scraping session unless you explicitly null it out. For an extractor pattern, under the "Advanced" tab, when you click the "Automatically save the data set generated by this extractor pattern in a session variable" checkbox you're telling screen-scraper to retain that entire data set in memory. This is fine for relatively small data sets, but should be avoided for large ones. The performance hit for doing this can be mitigated by also checking the "Cache the data set" checkbox (also found under the "Advanced" tab), but when the value for the variable is requested screen-scraper will still need to read it into memory temporarily.
  • Write data out as it gets extracted. This is a corollary to the previous point. Rather than saving data sets in memory you should instead write scripts that will either write the data out to a file or insert it into a database as it gets extracted. A common way of doing this is to write compiled Java code that takes a DataRecord containing extracted data, and handles inserting it into a database. See "I'd like to insert the data screen-scraper extracts into a database. How do I do that?" for more on this.
  • Don't tidy HTML. This can make working with extractor patterns a bit trickier, but can save a fair amount on CPU usage. You can tell screen-scraper not to tidy HTML by unchecking the "Tidy HTML after scraping?" box found under the "Advanced" tab for a scrapeable file.
  • Reuse objects. This is a general principle of programming, and should be followed when using screen-scraper. For example, if you're connecting to a database within screen-scraper scripts, rather than disconnecting and reconnecting each time you need to issue a SQL statement, you should instead keep a connection object in a session variable so that it can be reused.
  • Use compiled code where possible. This will generally mean writing Java code, compiling it into a jar file, then placing it into screen-scraper's "lib/ext" folder. The jar will then be automatically added to screen-scraper's classpath such that you can refer to it in your scripts (e.g., you can include "import" statements in your scripts in order to use your classes).
  • Reduce the number of scraping sessions you run in parallel. screen-scraper has the ability to run multiple scraping sessions simultaneously. This is often necessary and desirable, but it can also have an impact on memory usage and the performance of each scraping session. You can set the number of scraping sessions you'd like to allow screen-scraper to run simultaneously by opening the "Settings" dialog box (click on the wrench icon), then adjusting the value labeled "Maximum number of concurrent running scraping sessions".
  • Avoid requesting files that are unnecessary. Oftentimes in order to get to the page containing the data you'd like to extract screen-scraper will need to first request a few other pages (e.g., one that handles logging in to the site). It's often worth it to experiment a bit by disabling certain files that you would normally request in your web browser (e.g., frames in a frameset) to see if they're actually required in order to be able to request the page containing the data you want.
  • Fix extractor patterns that are timing out. Extractor patterns that time out can leave threads running which, over time, can consume a fair amount of memory. To see if your extractor patterns are timing out look for a message like this in your log: "Warning! The operation timed out while applying the extractor pattern, so it is being skipped." To fix these extractor patterns you'll typically want to remove any ~@IGNORE@~ tags, replacing them instead with tokens that use specific regular expressions. You should also try to add regular expressions to other tokens so as to make the match more precise. You can also often avoid timeouts by using sub-extractor patterns instead of full extractor patterns. This allows the extraction to be done in a more piecemeal fashion, which is more efficient.
  • Disable logging. This can be done in the "Settings" window (click on the wrench icon) under the "Servers" section, by un-checking the box labeled "Generate log files". You should, of course, only do this once you're satisfied that your scraping sessions are all working as you'd like them to.
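Two of the tips above, writing data out as it gets extracted and reusing objects, can be combined in compiled code along these lines. This is a sketch only: RecordWriter is a hypothetical class, a plain Map stands in for the extracted record, and CSV output is just one possible destination. The point is that one writer is opened once, kept around (for example in a session variable), and each record is written immediately rather than accumulated in memory.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical writer that is created once and reused for every extracted record.
class RecordWriter {
    private final Writer out;

    RecordWriter(Writer out) {
        this.out = out;
    }

    // Writes one record as a quoted CSV row and flushes it right away,
    // so nothing needs to be held in memory after this call returns.
    void write(Map<String, String> record) {
        StringBuilder row = new StringBuilder();
        for (String value : record.values()) {
            if (row.length() > 0) row.append(',');
            row.append('"').append(value.replace("\"", "\"\"")).append('"');
        }
        try {
            out.write(row.append('\n').toString());
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // One writer for the whole run; here it simply targets standard output.
        RecordWriter writer = new RecordWriter(new OutputStreamWriter(System.out));
        Map<String, String> record = new LinkedHashMap<>();
        record.put("CELL_DATA1", "1a"); // invented sample values
        record.put("CELL_DATA2", "2a");
        writer.write(record);
    }
}
```

The same shape applies to a database: keep a single connection open and issue one insert per record as it arrives.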