SS 3.0: Tidy error log file is being used by another process

Hello:

I'm trying to run down the cause of missing output records due to skipped input files. It appears that because the tidy error log file is being used, the scraping engine skips the file. I get the following error in the error.log file:

An error occurred while generating the Tidy error log file: E:\Program Files\screen-scraper pro\log\tidy.log (The process cannot access the file because it is being used by another process)

On 12,334 input files, about 71 files are missing from the output. On another test of this input data file, 26 output records were missing, and 24 instances of the error line above were in the tidy error.log file. But inspection of the session log file (captured to disk) reveals all 26 tidy error messages, so 2 of the errors weren't captured to the tidy error.log file.

On a run last night (3/20/2007), there were 119 skipped files, with 117 noted in the tidy error.log. They are random, with no obvious pattern between subsequent runs.

On 196 input files, 1 file is missing from the output, and I receive 1 instance of the error message above in error.log.

I'm running this with the maximum number of concurrent scraping sessions set to one (1). But this happens with either a maximum of 5 or a maximum of 1 simultaneous sessions. I extended the timeout under Settings | Data Extractor Timeout from 1 sec to 10 secs, but still get the error.

Any help with this would be appreciated.

-- Roy Zider

SS 3.0: Tidy error log file is being used by another process

Todd:

GOT IT -- four perfect runs in a row.

(ddt read dpfname)   2132 seconds elapsed time
(ddt read dpfname)   12334 records read, 347.1 per min.
(ddt read dpfname)   12334 records completed.

(ddt read dpfname) Closed and done. (3/28/07 9:06 PM)

I changed some of the FileWriter code, principally leaving the file open throughout the run, flushing after each write, and closing it at the end.

I'll update with details if I can.

Thanks for your help -- this has been a terribly vexing problem for some time, and now appears to be solved.

-- Roy

SS 3.0: Tidy error log file is being used by another process

Thanks for the clarification, Roy. Just to summarize, it sounds like the main issue you're now dealing with is missing records, which seems to be due to this error:

(ddt lotdetail) Writing data to a file.
lotDetailscrape: An error occurred while processing the script: Sothebys
- write lotDetail data
lotDetailscrape: The error message was: The application script threw an
exception: java.io.FileNotFoundException: zzerrout.txt (The process
cannot access the file because it is being used by another process) BSF
info: null at line: 0 column: columnNo

Is that correct? Or are you suspicious that it may be missing records because of some other problem outside of that particular error?

In regard to this error, it's going to be a bit tricky because it's happening outside of screen-scraper's control. That is, as with the tidying issue, the Java Virtual Machine is apparently having trouble locking and unlocking files in a timely manner, such that it's interfering with your program flow. Previously we were able to resolve this by simply ceasing to write to the tidy.log file. In this case, however, you need to write to your file.

Here are a few ideas that come to mind:

- Write your data to separate files, then merge them afterward. If you can find a way to write to multiple files such that you can pretty well guarantee they won't be locked when they're to be written to, you may be able to resolve the issue.
- Read up a bit more on methods to write to files. There may be alternative Java classes and such you could use for this purpose that may work better than your current method.
- Similar to the previous suggestion, you might simply alter the way you're handling writing to the file. For example, an alternative approach may be to open the FileWriter at the very beginning of your scraping session, then store it in a session variable. In any script where you need to write to the file, you would pull the FileWriter out of the session variable, write the data, flush it (so that you can ensure it actually gets written and not buffered in memory), then leave it open. At the very end of your scraping session you'd want to close the FileWriter.
- I think you already tried this, and it's not a very elegant solution, but you may just insert pauses at key points to see if that allows the JVM sufficient time to close up the file.
- Write the data to a database instead of a file. This would probably be the most robust solution, but would also be a fair amount of work. You might take a look at our fifth tutorial to see if this seems viable for you ([url]http://www.screen-scraper.com/support/tutorials/tutorial5/tutorial_overview.php[/url]).
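The third suggestion above (open the FileWriter once, flush after each write, close at the very end) can be sketched as a small helper class. This is a minimal, self-contained illustration in plain Java, not screen-scraper's API; in an actual scraping session you would presumably construct it in a startup script, stash the instance in a session variable, and close it in a shutdown script, as described above.

```java
import java.io.FileWriter;
import java.io.IOException;

// A writer that is opened once and shared for the whole session,
// so the file is never repeatedly locked and unlocked per record.
class SharedLog {
    private final FileWriter writer;

    SharedLog(String path) throws IOException {
        writer = new FileWriter(path, true); // append mode, opened exactly once
    }

    // synchronized in case more than one scraping session writes concurrently
    synchronized void append(String line) throws IOException {
        writer.write(line + System.lineSeparator());
        writer.flush(); // push the line to disk rather than leaving it buffered
    }

    void close() throws IOException {
        writer.close(); // called once, at the very end of the session
    }
}
```

Because the file handle lives for the whole run, there is no per-record open that can collide with a lock the OS hasn't released yet; the flush guards against losing buffered lines if the session dies.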

Hopefully those suggestions help. Feel free to reply back if I can help further.

Todd

SS 3.0: Tidy error log file is being used by another process

Hi Todd:

Sorry this is so confusing. Let me clarify things.

1. I'm running on Windows XP SP2, on top of a dual CPU platform (Tyan MPX with 2x AMD MP 2000+ CPUs).

2. I'm generally invoking the program from the command line, with driver scripts inside SS running the session and reading and writing files.

3. SS is sometimes open, sometimes not, while the program is running.

4. The program processes 12,334 input files, which are HTML detail pages I've downloaded previously (and separately) from Sotheby's web site in order to analyze them offline. The program writes data to a tab-delimited output file, and (now) also emits error messages, as they occur, to an error file. I capture the session output by redirecting the output at the DOS level to a file tt.txt. The input files are about 40KB each, the data output file runs 12,334 lines and is about 11.5MB in size, and the session log captured to tt.txt is about 69MB and runs about 1.1 million lines. The run typically takes 33 minutes (2,000 seconds).

5. Sometimes programs are running in the background, sometimes not. I've tested both ways with no obvious differences.

6. The first source of error cropped up when I found that output records were being skipped (output not written). This was traced to some sort of locking problem with tidy.log, as noted earlier.

7. I also noted at the time that there were some missing records that weren't showing up in error.log. That is, I would find that my output file would be, say, five records short, but there would only be three error records relating to tidy.log being busy in your error.log file. (Also noted earlier.)

8. You kindly modified SS to stop using tidy.log late last week. I haven't gotten any tidy.log errors since then. I haven't had any errors in your ..\log\error.log file at all in the past couple of days, though I have made complete runs of the program at least five times and many other runs as well.

9. Notwithstanding this improvement (no more tidy.log locking), I still have missing records. There is no error in ..\log\error.log, to be sure, but I've found them by counting the lines in my output file. There is no pattern to the missing output, which runs from two to four records at most. A couple of times I've gotten perfect output (no missing records), and have saved it.

10. The record locking/output file busy problem hasn't gone away, as now it appears to be in one of my files. Here is the error message from the session data, where my program tries to open my own error logging file, but fails:

(ddt lotdetail) Writing data to a file.
lotDetailscrape: An error occurred while processing the script: Sothebys
- write lotDetail data
lotDetailscrape: The error message was: The application script threw an
exception: java.io.FileNotFoundException: zzerrout.txt (The process
cannot access the file because it is being used by another process) BSF
info: null at line: 0 column: columnNo

The statement it's trying to execute and failing at is this one, I think:

zzerrout = new FileWriter ( "zzerrout.txt", true ); // append mode

I added a statement to null it out before using it, but that still didn't affect the missing output lines:

FileWriter zzerrout = null;
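Since the FileNotFoundException here is transient (the file is momentarily locked, not actually missing), one possible workaround, purely a sketch of my own rather than anything from Roy's script, is to retry the open a few times with a short pause before giving up:

```java
import java.io.FileWriter;
import java.io.IOException;

class RetryOpen {
    // Try to open a file in append mode; if the OS still holds a lock on it
    // ("being used by another process"), wait briefly and try again before
    // finally rethrowing the last error.
    static FileWriter openWithRetry(String path, int attempts, long delayMs)
            throws IOException, InterruptedException {
        IOException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return new FileWriter(path, true); // append mode
            } catch (IOException e) {
                last = e;              // remember the failure
                Thread.sleep(delayMs); // give the OS time to release the lock
            }
        }
        throw last;
    }
}
```

The script would then call something like RetryOpen.openWithRetry("zzerrout.txt", 5, 200) instead of constructing the FileWriter directly; a handful of 200 ms retries, paid only when a lock is actually hit, costs far less than pausing on every scrape.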

11. This has been a long-standing problem, going back at least two or three years. At that time I was scraping a site with 5000+ records per session, and we always had one or two MIA files. I had to move on before this problem was resolved.

But now with a completely different scraping program I'm still getting MIA records.

12. My workaround for this situation is to 1) abandon tidy and rewrite the patterns to operate on the native HTML, and 2) make multiple runs and patch together a complete output file.

* * *

Hope these comments are helpful to you, Todd.

-- Roy

SS 3.0: Tidy error log file is being used by another process

Hi Roy,

Sorry, I guess I'm still not clear what's happening. Let me take a shot at inferring, and hopefully you can clarify.

You're obviously still getting errors. I'm guessing that the errors are different from the previous ones, which dealt with file locking on the tidy.log file. Now that we're no longer writing to that file, those errors should have gone away. Have you found that this is the case?

Assuming the errors are different from the previous ones, it sounds like you're able to find them in the log file for your scraping session, but that file is unwieldy to deal with because it's so large. Is this true? If it is, it sounds like you may want to use the error.log file instead, except that error.log only shows the error message, and not what scrapeable file it corresponds to. Is this correct?

Thanks,

Todd

SS 3.0: Tidy error log file is being used by another process

Todd:

Yes, I am running things from the command line and piping the output to a text file. I have to search through the output for warnings or errors, since there is no way in error.log to identify the file being scraped. The piped session log is over a million lines of output. The warnings and errors can be found, obviously, but error.log itself does not identify the file where the error occurred.

-- Roy

SS 3.0: Tidy error log file is being used by another process

Hi Roy,

The error.log file captures errors that occur internal to screen-scraper. I think what you need to do is check over your scraping log. Are you running it from the command line? If so, I would recommend piping the output to a file, that way you can examine it later. Or maybe I'm misunderstanding. Just let us know how we can help.

Todd

SS 3.0: Tidy error log file is being used by another process

Hi, Todd:

If it continues to put out messages to error.log, it would be helpful if it identified which file it is scraping. Otherwise you have to capture the session log and read through that to find the error -- which sort of makes error.log pointless.

-- Roy

SS 3.0: Tidy error log file is being used by another process

Hi,

It will continue to create error.log, which is definitely desirable. I'm not sure why you're still missing records, but hopefully perusing your logs turns something up.

Best,

Todd

SS 3.0: Tidy error log file is being used by another process

Todd:

Got it -- thanks.

Quick test -- it doesn't create tidy.log, but it does create error.log. error.log is empty, but there are still two missing output records in a reduced 68-file set I've been trying to get down to zero. Earlier I noted a two-record difference between the missing records and the lines in error.log, so that problem is evidently still here. I'll get back to you about this discrepancy.

-- Roy

SS 3.0: Tidy error log file is being used by another process

Hi,

See this FAQ if you're having trouble upgrading:

http://www.screen-scraper.com/support/faq/faq.php#NoUpdates

Todd

SS 3.0: Tidy error log file is being used by another process

Todd:

Where do I get the pre-release version for upgrade? "No updates" from my side, and nothing in your blog about it.

On the tidy.log situation, it has never identified by name any of the files it tagged, so it's never been possible to go back and match an error with its source file. On that basis alone it's been pretty lame. And it's a version from 2000, so perhaps it's time to upgrade that module.

FWIW, I'm running these data extractions on a dual CPU system (Tyan S2466N-4M, AMD MP2000+ CPUs). This may have some effect with program execution, even if they're supposedly thread-safe.

-- Roy

SS 3.0: Tidy error log file is being used by another process

Hi Roy,

After a bit of deliberation, we've decided that the best solution is to simply suppress the tidy logging. It's only rarely useful, and if it's causing file locking issues, then it's probably doing more harm than good.

If you'll update to version 3.0.12a (the very latest pre-release) you'll find that screen-scraper will no longer log to the tidy.log file, which should take care of the issue. Please let us know if you find otherwise.

Thanks,

Todd

SS 3.0: Tidy error log file is being used by another process

Todd:

Just ran test on 196 records using:

session.pause(1000);

Still got one missing record, and one error in error.log for tidy.log.

Later, I ran on the full file set; it took almost four hours and produced 68 missing records:

(ddt read dpfname)   14156 seconds elapsed time
(ddt read dpfname)   12334 records read, 52.3 per min.
(ddt read dpfname)   12266 records completed.

So the pause isn't going to be the answer, I'm afraid.

-- Roy

SS 3.0: Tidy error log file is being used by another process

Unfortunately, the tidying is handled by a third-party library, so we don't have direct access to it. I'll see about pursuing an alternative solution, however, that may achieve the same effect.

Todd

SS 3.0: Tidy error log file is being used by another process

Todd:

Isn't there some way to test whether tidy.log is free before it scrapes? I will test your suggestion, although you realize of course that adding a second to each scrape adds 12,334/3,600 = 3.4 hours to the run, rather than the 36 minutes it is taking at the moment.

-- Roy

SS 3.0: Tidy error log file is being used by another process

Hi Roy,

I have to admit, this isn't one I've seen. I'm guessing it's simply because the files are being processed in such rapid succession that the operating system doesn't release its hold on the tidy.log file.

I know you want these to process quickly, but what would happen if you were to call session.pause for maybe 1 second in between each? If that works, is that a viable solution?

Todd