Power of Curl and OCR

2015-01-15 12:38

Ever needed to download hundreds of images and run them through OCR to get plain text out of them?

I just did, by combining the power of Curl, Bash and Tesseract OCR.

Curl has a nice feature that lets you express ranges in URLs. The images I needed to fetch were numbered, so instead of writing a loop in Bash and iterating I could just do something like this:

$ curl -O "http://example.net/img[1-800].png"
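
To preview which URLs a range like that stands for before fetching anything, here is a minimal sketch. The helper name expand_range is made up for illustration and is not part of Curl; it only prints URLs and downloads nothing.

```shell
# Hypothetical helper: print the URLs that a numeric range such as
# img[1-800].png stands for, one per line. Nothing is downloaded.
expand_range() (
  prefix=$1 first=$2 last=$3 suffix=$4
  i=$first
  while [ "$i" -le "$last" ]; do
    printf '%s%s%s\n' "$prefix" "$i" "$suffix"
    i=$((i + 1))
  done
)

# Show the first three URLs of the range.
expand_range "http://example.net/img" 1 3 ".png"
```

Quoting the URL on the Curl command line keeps the shell from trying to interpret the square brackets as a filename pattern.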

Then I simply iterated over each received file, feeding it to Tesseract:

$ for f in img*.png; do tesseract "$f" "${f%.png}" -l swe; done

This produced a nice bunch of text files, each with the same name as the scanned image but with the suffix .txt.
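
Tesseract adds the .txt suffix itself; the second argument is only the output base name. A small sketch of that naming follows, where ocr_output_name is a made-up helper used purely for illustration:

```shell
# Hypothetical helper: given an image file name, print the name of the
# text file Tesseract would write when the extension is stripped to
# form the output base.
ocr_output_name() {
  printf '%s.txt\n' "${1%.*}"   # ${1%.*} drops the last ".ext" part
}

ocr_output_name img42.png   # prints img42.txt
```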

Of course I could have done it all in the shell loop, but it's nice to know the tools you use, and for just fetching a bunch of URLs Curl's range syntax is easier and more readable than creating a tiny shell script.
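
For comparison, doing it all in one shell loop could look roughly like the sketch below. The function name fetch_and_ocr and the CURL and TESSERACT override variables are assumptions of mine, added so the loop can be dry-run without touching the network; the URL and the swe language code are the ones used above.

```shell
# Sketch: fetch each numbered image and OCR it in the same loop.
# fetch_and_ocr, CURL and TESSERACT are hypothetical names.
fetch_and_ocr() (
  first=$1
  last=$2
  i=$first
  while [ "$i" -le "$last" ]; do
    f="img${i}.png"
    # -f: fail on HTTP errors instead of saving the error page
    # -sS: no progress meter, but still show errors
    # Only run OCR when the download succeeded.
    ${CURL:-curl} -fsS -O "http://example.net/$f" &&
      ${TESSERACT:-tesseract} "$f" "img${i}" -l swe
    i=$((i + 1))
  done
)

# Usage: fetch_and_ocr 1 800
# Dry run: CURL=true TESSERACT=echo fetch_and_ocr 1 3
```

It works, but it is a lot more to read than the Curl one-liner for the fetching part.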