Power of Curl and OCR
2015-01-15 12:38
Ever needed to download hundreds of images and run them through OCR to get plain text out of them?
I just did, by combining the power of Curl, Bash and Tesseract OCR.
Curl has a nice feature: it can expand numeric ranges in URLs. The images I needed to fetch were numbered, so instead of writing a loop in Bash and iterating, I could just do something like this:
$ curl -O "http://example.net/img[1-800].png"
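A side note on the range syntax: quoting the URL keeps the shell from trying to treat the brackets as a glob pattern, and with -o curl can put the current range value into the output name via #1. Here is a minimal sketch of that variation, assuming the same example.net URL; the function name fetch_images and the scans directory are made up for illustration.

```shell
# Hypothetical helper: fetch img1.png .. img800.png using curl's URL range syntax.
fetch_images() {
    # The quoted URL reaches curl with its brackets intact.
    # "#1" in the -o template is replaced by the current value of the first range,
    # and --create-dirs creates the (made-up) scans/ directory if it is missing.
    curl --fail --silent --show-error --create-dirs \
        -o "scans/img#1.png" \
        "http://example.net/img[1-800].png"
}
```

Calling fetch_images would then leave the files in scans/ rather than in the current directory, so any later loop over them would have to look there.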
Then I simply iterated over each received file, feeding it to Tesseract:
$ for f in img*.png; do tesseract "$f" "${f%.png}" -l swe; done
This produced a nice bunch of text files, each with the same name as the scanned image but with the suffix txt.
Of course I could have done it all in the shell loop, but it's nice to know the tools you use, and for just fetching a bunch of URLs, Curl's range syntax is easier and more readable than a tiny shell script.
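For comparison, doing everything in one shell loop might look roughly like this. It is only a sketch, not the commands that were actually run: the function name fetch_and_ocr and its two arguments are made up, and skipping the OCR step when a download fails is an added assumption.

```shell
# Hypothetical all-in-one variant: download and OCR each image in a single loop.
# Usage: fetch_and_ocr FIRST LAST
fetch_and_ocr() {
    i=$1
    end=$2
    while [ "$i" -le "$end" ]; do
        # Same URL pattern as the range example, built by hand this time.
        if curl --fail --silent --show-error -O "http://example.net/img${i}.png"; then
            # tesseract adds the .txt suffix to the output base name itself.
            tesseract "img${i}.png" "img${i}" -l swe
        fi
        i=$((i + 1))
    done
}
```

Something like fetch_and_ocr 1 800 would cover the same range. It works, but the counter and the test around curl are exactly the bookkeeping the range syntax makes unnecessary.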