Input

Overview

The basic idea of an Initialize script is discussed in the second and third tutorials. Such a script serves one of two purposes:

  1. Prepare Objects: If you are saving the scraped information to a database, CSV, or XML file then you will likely want to initialize these objects before you start. Also, if you will be iterating over pages, you might need to start your iterator before the scrape begins.
  2. Debug Script: In this form the script only supplies values for variables that an external script will provide later, so that you can run the scrape on its own while you are developing it.

As you might guess, you may have both of these needs in a single script or in two different scripts. Either way, this section presents several kinds of Initialize scripts, which differ mainly in where the values of your variables come from.
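The "Debug Script" purpose can be sketched outside of screen-scraper as follows. This is only an illustration: the Map stands in for the session's variable store, and the class and method names (DebugDefaults, setIfMissing) are hypothetical, not part of the screen-scraper API. The idea is that a value is supplied only when nothing else has provided one.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a scraping session's variable store.
class DebugDefaults {
    // Store a fallback value only when the variable has not been set yet.
    static void setIfMissing(Map<String, Object> vars, String name, Object fallback) {
        if (vars.get(name) == null) {
            vars.put(name, fallback);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> vars = new HashMap<>();
        vars.put("STATE", "RI");                // pretend an external script set this
        setIfMissing(vars, "STATE", "UT");      // already set, so the fallback is ignored
        setIfMissing(vars, "CITY", "BRISTOL");  // not set, so the fallback is used
        System.out.println(vars.get("STATE") + " " + vars.get("CITY"));
    }
}
```

Written this way, the same script can stay in place when the external script takes over, because it never overwrites a value that was already supplied.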

Input from CSV

This script is extremely useful because its purpose is to let you read inputs from a CSV list. For example, if you wanted to use all 50 state abbreviations as input parameters for a scrape, this script would cycle through them all. It also begins to show the power of an Initialize script as a looping mechanism.

This particular example uses a CSV of streets in Bristol, RI, with one street per line. The "while" loop at the bottom of the example retrieves streets one by one until the buffered reader runs out of lines. Each street is stored in a session variable named STREET and used as an input later on. Each time the buffered reader brings in a new street, it overwrites the previous value of the STREET session variable.

import java.io.*;

//point the input file at the right location. This is a relative path to an input folder in the location where you installed screen-scraper.
session.setVariable("INPUT_FILE", "input/BRISTOL-STREETS.csv");

//this buffered reader reads the csv one line at a time. Your csv needs to be separated into lines as well, with one entity per line.
BufferedReader buffer = new BufferedReader(new FileReader(session.getVariable("INPUT_FILE")));

//because for this scrape my city was BRISTOL and my state was RI I set these as session variables to be used later as inputs.
session.setVariable("CITY", "BRISTOL");
session.setVariable("STATE", "RI");

//this is the loop that I was referring to earlier. As long as the line from the buffered reader is not null,
//it sets the line as the STREET session variable and calls the "Search Results" scrapeable file.
String line;
while ( (line = buffer.readLine()) != null ){
    session.setVariable("STREET", line);
    session.log("***Beginning street " + session.getVariable("STREET"));

    session.scrapeFile("Search Results");
}

buffer.close();
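
The line-by-line reading pattern can also be tried outside of screen-scraper. The following is a minimal sketch in plain Java, assuming one entry per line; the class and method names (LineInput, readEntries) are made up for illustration. It skips blank lines and trims stray whitespace, and it uses try-with-resources so the reader is closed even if reading fails.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LineInput {
    // Collect the non-blank lines of the source, trimmed, in order.
    static List<String> readEntries(Reader source) {
        List<String> entries = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String entry = line.trim();
                if (!entry.isEmpty()) {
                    entries.add(entry);
                }
            }
        } catch (IOException e) {
            // Rethrow unchecked so callers do not need a throws clause.
            throw new UncheckedIOException(e);
        }
        return entries;
    }

    public static void main(String[] args) {
        // Sample data only; a real script would pass a FileReader instead.
        for (String street : readEntries(new StringReader("HOPE ST\n\n  HIGH ST \n"))) {
            System.out.println("Beginning street " + street);
        }
    }
}
```

In an Initialize script, the body of the loop in main would be where the session variable is set and the scrapeable file is called.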

Reading in from a CSV is incredibly powerful; however, it is not the only way to use a loop. For information on how to use an array for inputs please see the "Moderate Initialize -- Input from Array".

The next script (below) deals with input CSV files that have more than one piece of information per row (more than one column).

import java.io.*;

////////////////////////////////////////////
session.setVariable("INPUT_FILE", "input/streets_towns.csv");
////////////////////////////////////////////

BufferedReader buffer = new BufferedReader(new FileReader(session.getVariable("INPUT_FILE")));
String line = "";

while (( line = buffer.readLine()) != null ){
    String[] lineParts = line.split(",");

     // Set the variables with the parts from the line
    session.setVariable("CITY", lineParts[1]);
    session.setVariable("STREET", lineParts[0]);

    // Output to the log
    session.log("Now scraping city: " + session.getVariable("CITY") + " and street: " + session.getVariable("STREET"));

    // Scrape next scrapeable file
    session.scrapeFile("MyScrape--2 Search Results");
}

buffer.close();
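
A row with a missing column would make lineParts[1] fail with an array index error. One possible guard is sketched below in plain Java; the class and method names (StreetTownLine, parse) are hypothetical, and the column order (street first, city second) is taken from the script above. Rows that do not have both values are reported as null so the caller can skip them.

```java
class StreetTownLine {
    // Returns {street, city}, or null when the line does not contain both values.
    static String[] parse(String line) {
        // The -1 limit keeps trailing empty columns instead of dropping them.
        String[] parts = line.split(",", -1);
        if (parts.length < 2) {
            return null;
        }
        String street = parts[0].trim();
        String city = parts[1].trim();
        if (street.isEmpty() || city.isEmpty()) {
            return null;
        }
        return new String[] { street, city };
    }

    public static void main(String[] args) {
        String[] ok = parse("HOPE ST, BRISTOL");
        System.out.println(ok[1] + " / " + ok[0]);
        System.out.println(parse("HOPE ST") == null ? "skipped" : "kept");
    }
}
```

Note that a plain split on commas does not handle quoted fields that contain commas.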

Read CSV

Sometimes a CSV file will use quotes to wrap data (in case that data contains a comma that does not signify a new field). Since this is common, a script that reads a CSV should anticipate and deal with that eventuality. The main workhorse of this script is the parseCSVLine function: pass it a CSV line and it parses the fields into an array.

String[] parseCSVLine(String line, int index, int columnsToGet){
    int START_STATE = 0;
    int FIRST_QUOTE = 1;
    int SECOND_QUOTE = 2;
    int IN_WORD = 3;
    int IN_WORD_WITHOUT_QUOTES = 4;
    int state = START_STATE;
    String word = "";
    ArrayList lines = new ArrayList();
    char[] chars = line.toCharArray();

     for (int i = 0; i < chars.length; i++){
        char c = chars[i];

        if (c == '"'){
            if (state == START_STATE){
                state = FIRST_QUOTE;
            }
            else if ((state == FIRST_QUOTE) || (state == IN_WORD)){
                state = SECOND_QUOTE;
            }
            else if (state == SECOND_QUOTE){
                word += ("" + c);
                state = IN_WORD;
            }
        }
        else if (c == ','){
            if ((state == SECOND_QUOTE) || (state == IN_WORD_WITHOUT_QUOTES)){
                state = START_STATE;

                lines.add(word);
                if (lines.size() == columnsToGet) break;
                word = "";
            }
            else if (state == START_STATE){
                state = START_STATE;
                lines.add(word.replaceAll("\"\"", "\""));
            }
            else{
                word += ("" + c);
                state = IN_WORD;
            }
        }
        else{
            // Any other character belongs to the current field
            if (state == START_STATE) state = IN_WORD_WITHOUT_QUOTES;
            else if (state != IN_WORD_WITHOUT_QUOTES) state = IN_WORD;
            word += ("" + c);
        }
    }
    if (lines.size() < columnsToGet){
        if ((state == SECOND_QUOTE) || (state == IN_WORD_WITHOUT_QUOTES))
             lines.add(word.replaceAll("\"\"", "\""));
    }
    String[] linesArray = new String[lines.size()];

    for (int i = 0; i < lines.size(); i++){
        linesArray[i] = (String) lines.get(i);
    }

    return linesArray;
}

// File from which to read.
File inputFile = new File( "test_input.csv" );

FileReader in = new FileReader( inputFile );
BufferedReader buffRead = new BufferedReader( in );

// Read the file in line-by-line.
int index = 0;
while( ( searchTerm = buffRead.readLine() )!=null){
    // Don't read header row
    if (index>0){
        // Parse the line into an array (six columns: name through zip)
        line = parseCSVLine(searchTerm, index, 6);

        // Get the values
        name = line[0];
        date = line[1];
        address = line[2];
        city = line[3];
        state = line[4];
        zip = line[5];

        // Set the needed values as session variables
        session.setVariable("NAME", name);
        session.setVariable("ZIP", zip);

        // Scrape for those values
        session.scrapeFile("Search results");
    }
    index++;
}

// Close up the file. Closing the buffered reader also closes the underlying FileReader.
buffRead.close();
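
To check how quoted fields should come out, it can help to have a small reference parser to compare against. The sketch below is not the script above; it is a compact alternative in plain Java (the class name QuotedCsv is made up), assuming the common CSV conventions that a field may be wrapped in double quotes and that a doubled quote inside a quoted field stands for one literal quote.

```java
import java.util.ArrayList;
import java.util.List;

class QuotedCsv {
    // Split one CSV line into fields, honoring double-quoted fields.
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        field.append('"'); // doubled quote is a literal quote
                        i++;               // skip the second quote of the pair
                    } else {
                        inQuotes = false;  // closing quote
                    }
                } else {
                    field.append(c);       // commas are ordinary text here
                }
            } else if (c == '"') {
                inQuotes = true;           // opening quote
            } else if (c == ',') {
                fields.add(field.toString());
                field.setLength(0);        // start the next field
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());      // the last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        // Sample row only, not taken from a real input file.
        System.out.println(parseLine("\"Smith, John\",2008-05-08,,02809"));
    }
}
```

Fields are kept as strings throughout, which matters for values such as zip codes that may begin with a zero.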

Alternatively, you can read the CSV with the opencsv package that is included with screen-scraper. This may be more robust across different CSV formats.

import au.com.bytecode.opencsv.CSVReader;

//initialize the reader
File f = new File("input/AK.csv");
CSVReader reader = new CSVReader(new FileReader(f));

//read the file saving it into a List of Maps
String[] headers = reader.readNext();
List lines = new ArrayList();
String[] line;
while((line = reader.readNext())!=null)
{
        Map m = new HashMap();
        for(int i=0;i<headers.length;i++)
        {
                m.put(headers[i],line[i]);
        }
        lines.add(m);
}
reader.close();

//print out what we read
for(int i=0;i<lines.size();i++)
{
        session.log(String.valueOf(lines.get(i)));
}
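
If a data row has fewer columns than the header row, line[i] in the loop above would fail with an array index error. The sketch below shows one way to build the same list of maps defensively. It is plain Java with no opencsv dependency; the class and method names (CsvRecords, toRecords) are hypothetical, and filling missing columns with an empty string is an assumption about what you would want.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CsvRecords {
    // Pair each row with the headers; missing trailing columns become "".
    static List<Map<String, String>> toRecords(String[] headers, List<String[]> rows) {
        List<Map<String, String>> records = new ArrayList<>();
        for (String[] row : rows) {
            // LinkedHashMap keeps the columns in header order when printed.
            Map<String, String> record = new LinkedHashMap<>();
            for (int i = 0; i < headers.length; i++) {
                record.put(headers[i], i < row.length ? row[i] : "");
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        // Sample rows only; in a script these would come from the CSV reader.
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] { "99501", "ANCHORAGE" });
        rows.add(new String[] { "99502" }); // short row
        for (Map<String, String> record : toRecords(new String[] { "zip", "city" }, rows)) {
            System.out.println(record);
        }
    }
}
```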

Input from array

The following script is useful when you need to loop through a short series of input parameters. An array lets you put together a group of inputs quickly, but you need to know every input parameter in advance. For example, if you wanted to use the state abbreviations [UT, NY, AZ, MO] as inputs, building an array would be quick; if you needed all 50 states, it would probably be easier to read them from a CSV (see "Moderate Initialize -- Input from CSV").

import java.io.*;

String[] states = {"DE", "FL", "GA", "MD", "NH", "NC", "PA", "RI", "SC", "TN", "VT", "VA", "MS"};
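
for (int i = 0; i < states.length; i++)
{
    // Leave the loop if the scraping session has been stopped;
    // otherwise the loop would never advance and never end
    if (session.shouldStopScraping())
    {
        break;
    }

    session.setVariable("STATE", states[i]);
    session.log("***Beginning STATE: " + session.getVariable("STATE"));

    session.scrapeFile("Search Results");
}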
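
The interaction between the loop and the stop check is easy to get wrong, so here is a sketch of the pattern in isolation. It is plain Java: the BooleanSupplier stands in for session.shouldStopScraping() and the Consumer stands in for the per-state work, and the class and method names (StoppableLoop, forEachUntilStopped) are invented for this example. The key point is that the loop exits as soon as a stop is requested rather than waiting on an index that no longer advances.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;
import java.util.function.Consumer;

class StoppableLoop {
    // Run the action for each input until a stop is requested.
    // Returns the number of inputs that were actually processed.
    static int forEachUntilStopped(String[] inputs, BooleanSupplier shouldStop, Consumer<String> action) {
        int processed = 0;
        for (String input : inputs) {
            if (shouldStop.getAsBoolean()) {
                break; // stop requested: leave the loop immediately
            }
            action.accept(input);
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        // Simulate a stop request arriving after two states have been handled.
        int count = forEachUntilStopped(new String[] { "DE", "FL", "GA" }, () -> seen.size() >= 2, seen::add);
        System.out.println(count + " processed: " + seen);
    }
}
```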

Input from multiple files

Many sites require the user to input a zip code when performing a search. For example, when searching for car listings, a site will ask for the zip code where you would like to find a car (and perhaps the acceptable distance from that zip code). The following script is designed to iterate through a set of input files, each of which contains the list of zip codes for one state. The input files in this case are located in a folder named "input" in the screen-scraper directory. The files are named in the format "zips_CA.csv", for example, which would contain California's zip codes.

import java.io.*;

String[] states =  {"AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "PR", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"};

int i = 0;

// Iterate through each state abbreviation in the array above
while (i < states.length){
    ////////////////////////////////////////////
    // The file changes depending on what state we are scraping
    session.setVariable("INPUT_FILE", "input/zips_"+ states[i] + ".csv");
    ////////////////////////////////////////////

    BufferedReader buffer = new BufferedReader(new FileReader(session.getVariable("INPUT_FILE")));
    String line = "";

    while ((line = buffer.readLine()) != null){
        // The input file in this case will have one zip code per line
        session.setVariable("ZIPCODE", line);

        session.log("***Beginning zip code " + session.getVariable("ZIPCODE"));

        // Scrape the "Search Results" with the new zip code retrieved from the
        // current state's file
        session.scrapeFile("Search Results");
    }

    // Close this state's file before moving on to the next one
    buffer.close();
    i++;
}
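
If a file is missing for one of the states in the array, the FileReader constructor throws an exception and the whole run ends. A possible pre-check is sketched below in plain Java. The class and method names (ZipFileNames, forStates, existing) are hypothetical; the "zips_XX.csv" naming follows the script above.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class ZipFileNames {
    // Build the expected file path for each state abbreviation.
    static List<String> forStates(String folder, String[] states) {
        List<String> paths = new ArrayList<>();
        for (String state : states) {
            String abbreviation = state.trim().toUpperCase(Locale.ROOT);
            paths.add(folder + "/zips_" + abbreviation + ".csv");
        }
        return paths;
    }

    // Keep only the paths that point at a regular file on disk.
    static List<String> existing(List<String> paths) {
        List<String> found = new ArrayList<>();
        for (String path : paths) {
            if (new File(path).isFile()) {
                found.add(path);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> wanted = forStates("input", new String[] { "CA", "RI" });
        System.out.println("Expected: " + wanted);
        System.out.println("Present:  " + existing(wanted));
    }
}
```

Logging the difference between the expected and present lists before the scrape starts makes a missing input file obvious up front.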


Simply Set Variables

When a scraping session is started, it can be a good idea to feed certain pieces of information to the session before it begins resolving URLs. This simple version of the Initialize script demonstrates how you might start on a certain page. While basic, understanding when a script like this would be used is pivotal in making screen-scraper work for you.

session.setVariable( "PAGE", 0);
session.scrapeFile( "Your First Page Goes Here!" );

The above code is useful where "PAGE" is an input parameter in the first page you would like to scrape.

Occasionally a site will be structured so that instead of page numbers the site displays records 1-10 or 20-29. If this is the case your Initialize script could look something like this:

session.setVariable( "DISPLAY_RECORD_MIN", 1 );
session.setVariable( "DISPLAY_RECORD_MAX", 10 );
session.scrapeFile( "Your First Page Goes Here!" );

Once again "DISPLAY_RECORD_MIN" and "DISPLAY_RECORD_MAX" are input parameters on the first page you would like to scrape.
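
For later pages the two values have to be advanced together. The arithmetic can be sketched as follows, assuming records are numbered from 1 and every page shows the same number of records; the class and method names (RecordWindow, forPage) are hypothetical. Sites that number records from 0 would need the result shifted down by one.

```java
class RecordWindow {
    // Returns {min, max} record numbers for a 1-based page number.
    static int[] forPage(int page, int pageSize) {
        if (page < 1 || pageSize < 1) {
            throw new IllegalArgumentException("page and pageSize must be at least 1");
        }
        int min = (page - 1) * pageSize + 1; // first record shown on this page
        int max = page * pageSize;           // last record shown on this page
        return new int[] { min, max };
    }

    public static void main(String[] args) {
        int[] window = forPage(3, 10);
        System.out.println(window[0] + "-" + window[1]); // prints "21-30"
    }
}
```

The two numbers would then be stored in "DISPLAY_RECORD_MIN" and "DISPLAY_RECORD_MAX" before the next page is requested.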

If you feel you understand this one, I'd encourage you to check out the other Initialize scripts in this code repository.

U.S. Zip codes (CSV Files)

Each of the following files contains the zip codes for one state. The file "zips_US.csv" contains all US zip codes in a single file. If you wish to download all of the CSVs at once, you can download the file "zips_all_states.zip".

Note: If you've forgotten the state abbreviations please visit http://www.usps.com/ncsc/lookups/usps_abbreviations.html

Last updated 5/8/2008

Attachment Size
zips_AL.csv 5.73 KB
zips_AR.csv 4.16 KB
zips_AZ.csv 3.03 KB
zips_CA.csv 20.7 KB
zips_CO.csv 4.53 KB
zips_CT.csv 2.58 KB
zips_DE.csv 686 bytes
zips_FL.csv 10.1 KB
zips_GA.csv 5.92 KB
zips_IA.csv 6.25 KB
zips_ID.csv 1.94 KB
zips_IL.csv 9.31 KB
zips_IN.csv 5.79 KB
zips_KY.csv 6.87 KB
zips_LA.csv 4.21 KB
zips_MA.csv 4.17 KB
zips_MD.csv 4.23 KB
zips_ME.csv 2.98 KB
zips_MI.csv 6.84 KB
zips_MN.csv 6.05 KB
zips_MO.csv 6.98 KB
zips_NC.csv 7.43 KB
zips_ND.csv 2.41 KB
zips_NE.csv 3.65 KB
zips_NH.csv 1.65 KB
zips_NJ.csv 4.33 KB
zips_NM.csv 2.5 KB
zips_NV.csv 1.47 KB
zips_NY.csv 13.04 KB
zips_OH.csv 8.54 KB
zips_OK.csv 4.55 KB
zips_OR.csv 2.82 KB
zips_PA.csv 15.06 KB
zips_RI.csv 546 bytes
zips_SC.csv 3.68 KB
zips_SD.csv 2.36 KB
zips_TN.csv 5.43 KB
zips_TX.csv 18.09 KB
zips_UT.csv 2 KB
zips_VA.csv 8.51 KB
zips_VT.csv 1.8 KB
zips_WA.csv 4.21 KB
zips_WI.csv 5.31 KB
zips_WV.csv 5.89 KB
zips_WY.csv 1.14 KB
zips_all_states.zip 178.54 KB
zips_US.csv 295.08 KB

Forms

The form class can be a lifesaver when dealing with sites that use forms for their inputs and have a lot of dynamic parameters.

There are really only two cases in which using the form class is preferable to handling the parameters any other way. Those cases are:

  1. The page uses many dynamic parameters (the number of keys and/or the names of keys change).
  2. Related to the first: the page arrives with data already filled in and you just want to submit it as-is, but the data won't always be the same.

In general, though, debugging will be easier if you can stick with the regular parameters tab.
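
To see why dynamic parameters are awkward to handle by hand, it may help to picture a form as an ordered set of name/value pairs that gets edited and then encoded for submission. The sketch below is a conceptual stand-in written in plain Java. It is not the screen-scraper Form class, and the SimpleForm class and its toQueryString method are invented for this illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Conceptual stand-in only; not the screen-scraper Form class.
class SimpleForm {
    // LinkedHashMap preserves the order in which inputs were added.
    private final Map<String, String> inputs = new LinkedHashMap<>();

    // Add the input if it is new, or overwrite its current value.
    void setValue(String name, String value) {
        inputs.put(name, value);
    }

    // Drop an input, for example a submit button that was not clicked.
    void removeInput(String name) {
        inputs.remove(name);
    }

    // Encode whatever inputs are present as name=value pairs joined by '&'.
    String toQueryString() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> input : inputs.entrySet()) {
            if (out.length() > 0) {
                out.append('&');
            }
            out.append(encode(input.getKey())).append('=').append(encode(input.getValue()));
        }
        return out.toString();
    }

    private static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        SimpleForm form = new SimpleForm();
        form.setValue("zip", "12345");
        form.setValue("Update", "Update");
        form.removeInput("Update");
        System.out.println(form.toQueryString());
    }
}
```

Because the encoding step walks over whatever inputs happen to be present, the code does not need to know the parameter names in advance, which is the situation described in the two cases above.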

Form Creation

import com.screenscraper.util.form.*;

// The form text being built should include the form open and close tag.
// Any inputs are used, not just what is inside the form tags, so
// limit the input text to the form area.  If there is only one
// form on the page you can use scrapeableFile.getContentBodyOnly()
// as this doesn't care what additional text is included.
Form form = scrapeableFile.buildForm(dataRecord.get("TEXT"));

// Be sure to save the form in a session variable so it can be used
// by the scrapeable file which will use the form data
session.setVariable("_FORM", form);

// The form object is now ready to be used to submit what is currently
// on the page, or can be manipulated with input values being set

// Set a value on the form.  If the form didn't contain that input key,
// one will be added for it
form.setValue("zip", "12345");

// Set a value on the form, but validate that it can be set to that.  This isn't
// foolproof, but does some checking.  For instance, if the input was
// a select type, it will throw an exception if there wasn't an option
// with the given value.  It also handles some other error checking based
// on the input type, but any checks done in Javascript won't be performed
form.setValueChecked("selector", "op1");

// Remove the specified input from the form.  This is useful if there are
// multiple submit buttons, for instance.  In that case only the button that
// is clicked is sent to the server, so the others should be removed.
form.removeInput("Update");

Form Use

import com.screenscraper.util.form.*;

// To use the form data, it needs to be set in a script run
// "Before file is scraped"

// Get the form from the session (or wherever it is stored)
Form form = (Form) session.getVariable("_FORM");

// Call this method to set the values.  This includes the URL
// if a URL was found in the form tag when building the form
form.setScrapeableFileParameters(scrapeableFile);