Tips and Suggestions

Any recommendations on how to handle projects that involve large numbers of scraping sessions?
When you're dealing with a large number of scraping sessions, it becomes too cumbersome to retain them all in the workbench. Even if you organize them neatly into folders, there will likely still be too many to work with comfortably. Rather than keep all scraping sessions in the workbench at once, we generally find it useful to export and save them all to a central directory, which ideally is under version control using something like Subversion or CVS. When you need to work with a particular scraping session, you simply import it from the repository. Every once in a while, you export the scraping session back to the central directory. Ideally the directory also gets backed up once in a while so that you don't lose any work.

When working with a project that has a large number of scraping sessions, you'll also often have a series of "general" scripts that get used by most, if not all, of your scraping sessions. For example, you might have one script that gets invoked by every scraping session and is in charge of opening a database connection or initializing a file to which extracted data will be written. We typically handle these "general" scripts by storing them in a separate folder, alongside the folder where all of the scraping sessions are stored. This directory should get versioned and backed up as well. The difference with the "general" scripts is that it's typically a good idea to keep them all in the workbench in their own folder. Usually there aren't very many of them, and they get used often enough that you'll want to just retain them in the screen-scraper workbench.
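As a rough sketch of the "back up the central directory once in a while" step, the following Python function zips a directory into a timestamped archive. It runs outside screen-scraper and does not replace version control. The function name and the directory names in the usage note are hypothetical, and the backup folder is assumed to live outside the central directory.

```python
import shutil
from datetime import datetime
from pathlib import Path


def back_up_directory(central_dir, backup_dir):
    """Zip the contents of central_dir into backup_dir under a timestamped name.

    Assumes backup_dir is NOT located inside central_dir.
    Returns the path of the archive that was written.
    """
    central_dir = Path(central_dir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    # e.g. "central-20240131-235959" (the ".zip" suffix is added by make_archive)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    archive_base = backup_dir / f"{central_dir.name}-{stamp}"

    archive_path = shutil.make_archive(str(archive_base), "zip", root_dir=str(central_dir))
    return Path(archive_path)
```

Calling `back_up_directory("scraping-repo", "scraping-repo-backups")` (both names made up for illustration) would archive the exported scraping sessions and the "general" scripts folder together, provided both live under the same central directory.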
How do I send dynamic POST parameters in screen-scraper?
If you've gone through our first few tutorials, you know that session variables can be embedded in URLs by using a token like this: ~#FOO#~ (see this page for a detailed example of this). The very same technique can be used with POST variables. When you create a scrapeable file that uses POST parameters, they'll be displayed under the "Parameters" tab for that scrapeable file. In any of those POST parameters you can use the same type of token mentioned before. For example, if you're logging in to a web site (as described here), instead of hard-coding the username and password, you might substitute the tokens ~#USERNAME#~ and ~#PASSWORD#~ in the "Value" column for the respective parameters. Prior to invoking that scrapeable file, you would then set two session variables corresponding to the username and password, whose values would be substituted for the ~#USERNAME#~ and ~#PASSWORD#~ tokens.
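To make the substitution concrete, here is a small Python illustration of what happens to the tokens. It is not screen-scraper's implementation or API; it only mimics the effect described above. The function name, parameter names, and sample values are made up, and leaving unknown tokens untouched is a choice made for this sketch.

```python
import re

# Matches tokens of the form ~#NAME#~.
TOKEN = re.compile(r"~#(\w+)#~")


def substitute_tokens(text, session_variables):
    """Replace each ~#NAME#~ token in text with the matching session variable."""
    def lookup(match):
        name = match.group(1)
        # In this sketch, unknown tokens are left as-is so a missing
        # variable is easy to spot.
        return str(session_variables.get(name, match.group(0)))

    return TOKEN.sub(lookup, text)


# Hypothetical POST parameters as they might appear in the "Value" column.
post_parameters = {"username": "~#USERNAME#~", "password": "~#PASSWORD#~"}

# Hypothetical session variables set before the scrapeable file is invoked.
session_variables = {"USERNAME": "alice", "PASSWORD": "s3cret"}

resolved = {
    key: substitute_tokens(value, session_variables)
    for key, value in post_parameters.items()
}
print(resolved)  # {'username': 'alice', 'password': 's3cret'}
```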
I'm trying to scrape an HTML form that requires the user to type in text shown in an image. Can screen-scraper handle this?
This is known as a CAPTCHA mechanism, and is intended to discourage automated form submissions. There are essentially two ways of working around these:
This obviously isn't ideal, but, unfortunately, there may not be another way. The CAPTCHA images are designed such that they can't be read by a machine. As such, human intervention is required.
How do I extract data from two tables that are basically identical in structure?
This isn't a scenario you'll run into too often, but it's common enough that we decided to include it in the FAQ. At times you may run into a page containing various tables of data. All of the tables are essentially identical in structure, but when you extract the data you want to be able to tell which rows of data came from which tables. For example, consider this page. If you view the HTML from the page you'll notice that the structure of the two tables is basically the same. If you use a normal extractor pattern that matches a row of data, though, you're going to get all four rows of data, and won't be able to tell which row came from which table. That is, your first inclination might be to use an extractor pattern like this:
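In plain Python terms (this is not screen-scraper's extractor pattern syntax, and the HTML is made up for illustration), a single row-level pattern behaves as sketched below: it returns all four rows with no hint of their origin. One possible way around that, shown here only as an assumption and not necessarily the approach this FAQ recommends, is to match each table first and then match rows inside it.

```python
import re

# Made-up HTML: two tables with the same row structure, two rows each.
PAGE = """
<table id="first">
<tr><td>Widget</td><td>1.00</td></tr>
<tr><td>Gadget</td><td>2.00</td></tr>
</table>
<table id="second">
<tr><td>Sprocket</td><td>3.00</td></tr>
<tr><td>Gizmo</td><td>4.00</td></tr>
</table>
"""

ROW = re.compile(r"<tr><td>(.*?)</td><td>(.*?)</td></tr>")
TABLE = re.compile(r"<table[^>]*>(.*?)</table>", re.DOTALL)

# A single row-level pattern matches every row on the page, so the
# table each row came from is lost.
all_rows = ROW.findall(PAGE)

# Matching each table first, then the rows inside it, keeps the origin.
rows_by_table = [
    (table_number, name, price)
    for table_number, table_html in enumerate(TABLE.findall(PAGE), start=1)
    for name, price in ROW.findall(table_html)
]

print(len(all_rows))     # 4
print(rows_by_table[2])  # (2, 'Sprocket', '3.00')
```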
Will screen-scraper notify me if the site I'm scraping changes?
Once you've set up screen-scraper to extract data from a web site there's a good chance the web site will change at some point. Oftentimes cosmetic changes, such as the addition of a font tag or changing text from bold to italic, won't affect anything, but if the site makes more dramatic changes, such as altering its navigation system, then your scraping session will break. This generally causes screen-scraper either to fail to extract records from the site entirely, or to scrape significantly fewer records than it had previously. It also usually means that you'll need to update your scraping session to account for the changes in the web site.

There are two approaches we generally take to addressing this issue. The first (and best) approach is to track the number of records screen-scraper extracts each time the scraping session is run. Let's suppose you're extracting records from a site that, on average, will yield about 100 records. If you run the scrape one day and it suddenly only extracts 10 records, then something has likely changed with the site, so you'll probably need to adjust your scraping session to account for it. The second approach is to have a special extractor pattern or two that checks for a specific piece of text that you know should be present every time you scrape. This approach is most useful in cases where a site doesn't yield a consistent number of records. If your special extractor pattern doesn't match the text it's looking for, then something has likely changed on the site.

Along with all of this you'll likely want some kind of notification system so that you can be made aware when the site changes. To do this you might consider something like screen-scraper's sendMail function. Even better would be to set up an external application that monitors the number of records scraped each time, then logs an error in a database or log file if something comes up.
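The two checks can be sketched as a small external monitor in Python. This is only an illustration under stated assumptions: the function names, the 50% threshold, and the sample data are made up, and how the record counts and page text get out of screen-scraper (a database, a log file, and so on) is left open.

```python
def record_count_dropped(current_count, previous_counts, min_ratio=0.5):
    """Return True when the current run yields far fewer records than usual.

    min_ratio is an arbitrary threshold chosen for this sketch: a run is
    flagged when it falls below that fraction of the historical average.
    """
    if not previous_counts:
        return False  # nothing to compare against yet
    average = sum(previous_counts) / len(previous_counts)
    return current_count < average * min_ratio


def sentinel_missing(page_text, sentinel_text):
    """Return True when text that should always be present is absent."""
    return sentinel_text not in page_text


def check_run(current_count, previous_counts, page_text, sentinel_text):
    """Collect human-readable warnings for one scraping run."""
    problems = []
    if record_count_dropped(current_count, previous_counts):
        problems.append(f"record count dropped to {current_count}")
    if sentinel_missing(page_text, sentinel_text):
        problems.append(f"expected text not found: {sentinel_text!r}")
    return problems


# Example: a site that usually yields about 100 records suddenly yields 10.
warnings = check_run(10, [100, 98, 103], "<h1>Product Listing</h1>", "Product Listing")
print(warnings)  # ['record count dropped to 10']
```

A non-empty list is the point at which a notification (an email, a database row, a log entry) would be sent.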
How do I make a backup of the work I've done in screen-scraper?
As with any work you do on your computer, it's good to back it up once in a while. The preferred method for doing this in screen-scraper is to export your scraping sessions and scripts as XML files. Note that you only need to back up the scripts that aren't referenced in scraping sessions; any scripts called from within scraping sessions will be automatically exported along with the scraping session. Once the files have been exported you might also consider storing them in a versioning system such as CVS or Subversion.

screen-scraper will automatically back up your database periodically to ensure that you don't lose any work. You can also manually invoke this backup process by selecting "Backup Database" from the "File" menu. The database backups are stored in the "resourcedbbackup" folder. The directories within that folder contain previous versions of your database. If your database has somehow become corrupted, you may be able to simply revert to a previous version. Help on that can be found here.
I'm running screen-scraper on a machine that isn't connected to the Internet. How do I transfer my screen-scraper license to it?
Follow these steps:
Note that in doing this you'll be copying the entire screen-scraper database from one machine to another, so along with the licensing information it will also copy any scraping sessions, proxy sessions, and scripts. This also means overwriting any of those objects found on the unlicensed instance. Before copying the database over, take care to export any objects from the unlicensed instance that you'd like to retain.
I'm running screen-scraper in a GUI-less environment (e.g., not running XWindows). How do I update it to the latest version?
Normally, when screen-scraper updates itself, it downloads a zip file from our server, decompresses it, copies the files it contains on top of the existing files, then updates its version number. In a GUI-less environment you'll need to do this manually. To do so follow these steps:
The next time you launch screen-scraper you'll have the updated version. We're in the process of creating a browser-based interface for screen-scraper that will allow you to update screen-scraper without having to go through this manual process.
I'm running screen-scraper in a GUI-less environment (e.g., not running XWindows). How do I transfer my screen-scraper license to my server?
If you're using the Enterprise Edition of screen-scraper, this can be done via the web interface. In the Professional or Enterprise Editions of screen-scraper, create a text file named register.txt in screen-scraper's folder that contains a single line with the email address under which you registered screen-scraper. Then either start up the screen-scraper server or invoke screen-scraper from the command line. screen-scraper will read in that file, validate the license, then write the result of the validation to a file called register_result.txt. Once the license has been validated, the register_result.txt file can be deleted.
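On a headless server the file can be created with any editor; as a minimal sketch, the Python helpers below write register.txt and later read back register_result.txt. The function names and the install path in the usage note are hypothetical. The format of register_result.txt is not described here, so the second helper simply returns its raw text.

```python
from pathlib import Path


def write_register_file(install_dir, email):
    """Create register.txt in the screen-scraper folder with a single email line."""
    path = Path(install_dir) / "register.txt"
    path.write_text(email.strip() + "\n", encoding="utf-8")
    return path


def read_register_result(install_dir):
    """Return the raw contents of register_result.txt, or None if it is not there yet."""
    path = Path(install_dir) / "register_result.txt"
    if not path.exists():
        return None
    return path.read_text(encoding="utf-8").strip()
```

For example, `write_register_file("/opt/screen-scraper", "you@example.com")` (path and address made up) would be run before starting the server, and `read_register_result("/opt/screen-scraper")` afterwards to inspect the validation result.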
How can I optimize screen-scraper's performance?
Here are some tips: