Changing Tor's identity through screen-scraper possible?

Hello,

A site I have been scraping has started blocking me once a certain request rate is reached. I imagine the rule is something like: if the same IP address makes more than x requests in y seconds, assume it's a machine and block it.

I am investigating using Tor and its identity-switching feature. I know I can change the identity by pressing a button in Tor's GUI, but I have also read somewhere that I can interface with Tor's controller and send it a command to switch identity.

In the screen-scraper forum there is a post that mentions a screen-scraper Tor Java library, with a link to download it (SSTor.jar). I have downloaded it and put it in /screen-scraper professional/lib/ext (I'm on a Mac), but I'm not sure whether this library lets me send the switch-identity command to Tor's controller, or how.

Thank you very much for any help,
bogavante
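
For readers wondering what "send a command to switch identity" looks like on the wire, here is a minimal sketch in Python, independent of SSTor.jar. It speaks Tor's plain-text control protocol: authenticate, then send SIGNAL NEWNYM, and treat any reply that does not start with 250 as a failure. The function names are made up for illustration, and the port 9051 default is an assumption; use whatever ControlPort your Tor instance was actually started with.

```python
import socket


def quote_password(password):
    """Wrap a plain-text password in double quotes, escaping backslash and quote."""
    escaped = password.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def build_newnym_commands(password=""):
    """Return the two control-port lines that request a new identity."""
    # With no password configured, a bare AUTHENTICATE is sent.
    auth = "AUTHENTICATE"
    if password:
        auth += " " + quote_password(password)
    return [(line + "\r\n").encode("ascii") for line in (auth, "SIGNAL NEWNYM")]


def request_new_identity(host="127.0.0.1", port=9051, password="", timeout=10.0):
    """Send the commands; return True only if Tor answers 250 to each one.

    The port default is an assumption, not something the thread specifies.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        reader = conn.makefile("rb")
        for command in build_newnym_commands(password):
            conn.sendall(command)
            reply = reader.readline().decode("ascii", "replace")
            if not reply.startswith("250"):
                return False
        conn.sendall(b"QUIT\r\n")
    return True
```

Calling request_new_identity(port=36737), for example, would target a control port like the one shown in a log later in this thread. Tor may delay acting on repeated NEWNYM requests, so calling it in a tight loop is unlikely to help.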

If you already have the

If you already have the SSTor.jar in the lib/ext directory, then you can import the attached scraping session to see how to start and stop Tor. The script that checks for a blocked node shows how to get a new identity.

Thanks, that could help me

Thanks, that could help me figure out how to do it, yes.
However, I downloaded the scraping session you attached, and when I run it I get:

.
.
.
IP: Processing scripts before a file is scraped.
IP: Resolved URL: http://www.icanhazip.com
Using proxy server: 127.0.0.1:0
IP: Sending request.
IP: An input/output error occurred while connecting to 'http://www.icanhazip.com'. The message was Connection to http://127.0.0.1:0 refused.
.
.

Also, I noticed that in the Advanced tab of this scraping session, the External Proxy is set to 127.0.0.1 on port 31042.
I'm not sure whether I should leave it like that or change it to something else.

Thanks,
bogavante

P.S.: I have both Tor and Privoxy running properly on my Mac OS X system. I have chained them through port 9050, and in the OS X Network configuration
I have the HTTP and HTTPS proxies set to 127.0.0.1, port 8118.
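
A "connection refused" error like the one above usually means nothing is listening on the port being tried. As a quick sanity check, this small Python sketch tests whether a local port accepts TCP connections. The helper name is hypothetical; the ports to try (9050 and 8118) are the ones mentioned in this post.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if something accepts a TCP connection on host:port."""
    # Port 0 (as seen in the log) is never a valid port to connect to.
    if not 0 < port < 65536:
        return False
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, is_port_open("127.0.0.1", 9050) checks the Tor SOCKS listener and is_port_open("127.0.0.1", 8118) checks the HTTP proxy.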

I'm having the same issues as

I'm having the same issues as above. We are running this on our web server, so 127.0.0.1 resolves to our default website in IIS. Here's the log of the example Tor scraping session:

Starting scraper.
Running scraping session: tor
Processing scripts before scraping session begins.
Processing script: "tor startup"
Scraping file: "IP"
IP: Processing scripts before a file is scraped.
IP: Resolved URL: http://www.icanhazip.com
Using proxy server: 127.0.0.1:0
IP: Sending request.
IP: Redirecting to: http://www.icanhazip.com/login.aspx?ReturnUrl=%2fdefault.aspx
IP: Extracting data for pattern "Untitled Extractor Pattern"
IP: The pattern did not find any matches.
IP: Warning! No matches were made by any of the extractor patterns associated with this scrapeable file.
IP: Processing scripts after a file is scraped.
Processing script: "tor check for blocked node"
tor: Successfully added 1 to session variable TOR_RETRY_COUNT.
Error on request. Retrying 1 of 50

Any ideas why the scraper can't connect on the right socket?

Do you have the SSTor.jar in

Do you have the SSTor.jar in the screen-scraper/lib/ext directory, and have you restarted screen-scraper?

The script should modify the

The script should modify the settings you see on the advanced tab.

You also don't need to have Tor running, just installed. You should also use Polipo instead of Privoxy.

struggling with this one. So

Struggling with this one.

So I have uninstalled Privoxy and now have a fully working Polipo connected to Tor through Polipo's socksParentProxy setting, set to port 9050, which is the one where Tor always opens its SOCKS listener.

I execute the example scrape you attached with:

Tor stopped & Polipo running & settings empty in SS advanced tab
or
Tor stopped & Polipo stopped & settings empty in SS advanced tab

Either way, tor.startup() returns 0 in screen-scraper, so the HTTP port assigned by setExternalProxyPort is 0 and the rest of the script doesn't work.

Any idea why?
thank you,
Boga.

Download and install Vidalia.

  1. Download and install Vidalia. Install it to the default location, which should be in your "Applications" folder.
  2. Put the SSTor.jar into the "screen-scraper x edition/lib/ext" directory
  3. Create a sub-directory in screen-scraper named "tor" and copy into it the files: torrc and polipo.conf
  4. Add the path "/Applications/Vidalia.app/Contents/MacOS" to your Mac's system PATH variable. Instructions on doing that can be found here: http://overwatering.org/blog/2012/08/setting-path-osx-mountain-lion/.
  5. Once the PATH variable is set screen-scraper should be able to find the tor and polipo binary files it needs to interact with tor.

    If you're unable to set the PATH system variable using the "launchd.conf" file you can also do everything from a terminal. You'll first set the path with this command:

    export PATH=$PATH:/Applications/Vidalia.app/Contents/MacOS

    Once that is set you'll need to launch screen-scraper from the command line (so that it can use the PATH variable you just set). To launch screen-scraper from the command line you'll first want to cd into the directory where screen-scraper is installed. To launch the workbench you can issue this command:
    java -Xmx128m -jar screen-scraper.jar

    where "128" is the maximum amount of memory you'd like to allocate to screen-scraper. The screen-scraper server can be launched as it normally is from the command line.
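
To confirm that the PATH change took effect before launching screen-scraper, a small check like the following can help. It is only a sketch in Python: the function name is hypothetical, and it simply asks whether executables named tor and polipo can be found on the current PATH. Run it from the same terminal session in which you set PATH, since an export only affects that session.

```python
import shutil


def missing_binaries(names=("tor", "polipo")):
    """Return the names from `names` that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_binaries()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("tor and polipo are both on PATH")
```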

Thank you! I followed the

Thank you!

I followed the instructions, and at step 5 I had to set the path with the export PATH=$PATH... command.

And because I have Polipo installed in my c:/ folder (it didn't come with Tor, so I had to download and install it separately), I had to add that folder to the export PATH command so that it finds Polipo.

This is the output that I get in the workbench running your example scraping session:

Starting scraper.
Running scraping session: tor
Processing scripts before scraping session begins.
Processing script: "tor startup"
Scraping file: "IP"
IP: Processing scripts before a file is scraped.
IP: Resolved URL: http://www.icanhazip.com
Using proxy server: 127.0.0.1:16540
IP: Sending request.
IP: Extracting data for pattern "Untitled Extractor Pattern"
IP: The following data elements were found:
Untitled Extractor Pattern--DataRecord 0:  
IPADDRESS=31.172.30.1
Storing this value in a session variable.
IP: Processing scripts after a file is scraped.
Processing script: "tor check for blocked node"
Processing scripts after scraping session has ended.
Processing script: "tor shutdown"
Processing scripts always to be run at the end.
Scraping session "tor" finished.

and this is the output I got in the Terminal window:

TESTING PORT 50519
TESTING PORT 36737
STARTING TOR
tor -f tor/torrc --SocksPort 50519 --ControlPort 36737 --DataDirectory tor/tor1350317452223931000 > tor/tor1350317452223931000/tor.log
TESTING PORT 16540
STARTING HTTPPROXY POLIPO
polipo -c tor/polipo.conf socksParentProxy=127.0.0.1:50519 proxyPort=16540 logFile=tor/tor1350317452223931000/polipo.log
shutting down polipo
STOPPING HTTPPROXY POLIPO
shutting down tor
STOPPING TOR
AUTHENTICATING
515 Authentication failed: Password did not match HashedControlPassword value from configuration. Maybe you tried a plain text password? If so, the standard requires that you put it in double quotes.
SENDING SHUTDOWN COMMAND
null
everything shut down
/Applications/ScreenScraperProfessional

What am I doing wrong?
cheers,
Boga

I figured it out. In the Tor

I figured it out. In the tor folder in screen-scraper, the torrc configuration file contains the HashedControlPassword option with a hashed password, so I removed it and now it works.
I'll play with it a little to see if I can understand it and work out how to call it from my own script to avoid being blocked.
cheers,
boga
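
If you hit the same 515 authentication error, it may be worth checking whether your torrc still has an active HashedControlPassword line. The sketch below is a hypothetical Python helper that scans the text of a torrc file and reports any uncommented line setting that option. It assumes option names should be matched case-insensitively and that lines starting with # are comments.

```python
def active_hashed_password_lines(torrc_text):
    """Return (line_number, line) pairs for uncommented HashedControlPassword lines."""
    hits = []
    for number, line in enumerate(torrc_text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines and comment lines.
        if not stripped or stripped.startswith("#"):
            continue
        option = stripped.split(None, 1)[0]
        if option.lower() == "hashedcontrolpassword":
            hits.append((number, stripped))
    return hits
```

With the layout from the log above, you would read tor/torrc inside the screen-scraper directory and pass its contents to the function. An empty result means no hashed password is set there.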

difference in Linux?

Hi again,

If I wanted to run it on an Ubuntu Linux installation of screen-scraper and Vidalia, how would those instructions differ?

thank you very much,
boga

No different at all, really.

No different at all, really. If you install Vidalia through a package manager it will be added to your PATH for you, so you won't need to set it. Aside from that, it's just the same.

TOR not working on Google network

I recently installed TOR to try to anonymize a scraping session I was running against Google. Sadly, it looks like they have totally locked down all the exit ports. Even using the retry and request-new-identity option 50 times hasn't resulted in a single successful page extraction. While it was a good idea to try, we've found it easier to be a little more relaxed with the page requests and run the scrape directly from our servers. We find that a delay of 10-15 seconds per page request gives very few 503 errors.
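
As an illustration of the delay approach, here is a rough Python sketch that waits a random 10 to 15 seconds between page requests. The function name and the fetch callback are hypothetical; plug in whatever actually performs the request. Randomizing within the range is my own choice and not something we tested against a fixed delay.

```python
import random
import time


def polite_fetch_all(urls, fetch, min_delay=10.0, max_delay=15.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        # No pause before the first request, only between requests.
        if index > 0:
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

The sleep parameter exists only so the pause can be swapped out when trying the function without waiting.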

Google is good at blocking

Google is good at blocking anything. I have the same issues.