
[FAQ] @Voice "does not support" Hindi PDF files (or PDFs in other Indic languages)

Author
Admin
Administrator
2019/04/10 16:20:28

@Voice can extract text from any PDF file that meets two conditions: it contains actual text (not just images of scanned pages, which would need OCR to recognize the text), AND it contains a valid translation table from the font codes used in the PDF to Unicode. Unfortunately, there is a very bad "tradition" that PDF files created in Hindi and other Indic languages do not provide such translation tables at all. To verify this, try the following experiment:
 
Open your Hindi PDF file in Adobe Acrobat Reader (Adobe is the company that invented the PDF format), select and copy some text in it, then switch to any other app or program that accepts text, and paste it. With such files, instead of valid Hindi text you will see only random junk characters. Without a translation table, it is simply not possible to know which characters are hiding behind the numbers the file uses to represent each letter.
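To make the problem concrete, here is a small Python sketch. It is only a toy illustration, not how @Voice or Acrobat actually works: the glyph codes and the table are invented for the example, and the names `GLYPH_CODES`, `TO_UNICODE`, `decode` and `devanagari_ratio` are hypothetical. In a real PDF the table would be the font's ToUnicode map. The second function is a rough way to check text you pasted from the experiment above, assuming the file is in Devanagari script (Hindi).

```python
# Toy illustration only. The glyph codes and the table are invented for
# this sketch; they do not come from any real font or PDF file.

# Hypothetical font codes for one Hindi word, as a PDF might store them.
GLYPH_CODES = [0x41, 0x42, 0x43, 0x44, 0x45, 0x46]

# Hypothetical code-to-Unicode table (the job of a ToUnicode map in a PDF).
TO_UNICODE = {
    0x41: "\u0928",  # DEVANAGARI LETTER NA
    0x42: "\u092e",  # DEVANAGARI LETTER MA
    0x43: "\u0938",  # DEVANAGARI LETTER SA
    0x44: "\u094d",  # DEVANAGARI SIGN VIRAMA
    0x45: "\u0924",  # DEVANAGARI LETTER TA
    0x46: "\u0947",  # DEVANAGARI VOWEL SIGN E
}


def decode(codes, table=None):
    """Turn font codes into text, using the table when one is available."""
    if table is None:
        # No table: the codes can only be read as plain character numbers,
        # which has nothing to do with the letters the font actually draws.
        return "".join(chr(code) for code in codes)
    return "".join(table[code] for code in codes)


def devanagari_ratio(text):
    """Fraction of non-whitespace characters in the Devanagari block
    (U+0900 to U+097F). Returns 0.0 for empty input."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for ch in chars if "\u0900" <= ch <= "\u097f")
    return hits / len(chars)


if __name__ == "__main__":
    with_table = decode(GLYPH_CODES, TO_UNICODE)
    without_table = decode(GLYPH_CODES)
    print("with table:   ", with_table, devanagari_ratio(with_table))
    print("without table:", without_table, devanagari_ratio(without_table))
```

If you paste the text you copied from your own PDF into `devanagari_ratio` and get a value near zero, the file most likely lacks a usable translation table. Note that this check only covers Devanagari; other Indic scripts live in other Unicode blocks and would need their own ranges.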
 
If you don't like this fact (we don't like it either), please direct your complaints to Adobe, for permitting this stupidity in their own PDF format. And if Adobe's own PDF reader cannot extract text correctly from such a file, do not expect a miracle from me. I don't have any divine or psychic powers.
 
Greg
