Extract text from an image or a PDF
How can you effectively extract text from a PDF or an image? This is commonly called OCR (optical character recognition). I found two extremely powerful tools based on the open-source engine Tesseract (official website).
I am using Windows, and both can be used on this OS. One converts a scanned PDF into a searchable PDF (whose text can also be copied). The other takes a screenshot of an area of your screen, converts it to text, and stores it in your clipboard.
- OCRmyPDF
  - You need to use Ubuntu on Windows (more info here).
  - Update your apt package index:
    sudo apt-get update
  - Install it:
    sudo apt install ocrmypdf
  - Check the documentation for the available commands and options.
  - Here is a simple example for a French PDF:
    ocrmypdf -l fra "input.pdf" "output.pdf"
  - To install new languages (on Ubuntu):
    - Check which language packs exist:
      apt-cache search tesseract-ocr
    - Install the one you need (French here):
      sudo apt-get install tesseract-ocr-fra
- NormCap
  - Easy to install: just use the exe.
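Coming back to OCRmyPDF: the single-file command shown earlier can be looped over a whole folder. The snippet below is only a minimal sketch of that idea. The function names `output_name` and `ocr_folder` and the `_ocr` file suffix are my own illustrative choices, not something OCRmyPDF provides; the only real call is the same `ocrmypdf -l <lang> input output` form used above.

```shell
#!/usr/bin/env bash
# Sketch: OCR every PDF in a folder with ocrmypdf.
# output_name, ocr_folder and the "_ocr" suffix are hypothetical helpers
# chosen for this example; they are not part of ocrmypdf.

# Build the output path: "scan.pdf" -> "scan_ocr.pdf"
output_name() {
    local input="$1"
    printf '%s_ocr.pdf\n' "${input%.pdf}"
}

# Run ocrmypdf on each PDF of a folder.
# $1 = folder (default: current folder)
# $2 = Tesseract language code (default: fra, as in the example above)
ocr_folder() {
    local dir="${1:-.}"
    local lang="${2:-fra}"
    local pdf
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue                    # the folder contains no PDF
        case "$pdf" in *_ocr.pdf) continue ;; esac   # skip files this script produced
        ocrmypdf -l "$lang" "$pdf" "$(output_name "$pdf")"
    done
}
```

If you save this in a file and load it with `source`, then `ocr_folder ./scans fra` should write a `*_ocr.pdf` file next to each PDF in `./scans`, provided the matching language pack is installed.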
Give it a try :)