
[FAQ] In text extracted from PDF, some characters are represented like {#46}

Author
Admin
Administrator
2019/04/19 07:19:23


Update Dec. 2020: if this happens to you and you need to read the PDF file, open it in the @Voice app using the Open button at the top (folder icon), then on the "PDF Text Import Settings" screen turn on the OCR option, select the correct language and proceed. It will take much longer, but will work well.
 
This problem is caused by errors in the PDF file's encoding. Such files have an incorrect font-to-Unicode conversion table that lacks entries for the characters which @Voice shows as {#XX}, where XX can be any number. When extracting text from such a PDF file, I have no way of knowing what the file means when it says that character number 46 from its font should be drawn at this place. I could skip these characters or replace them with the standard “unknown character” mark (I believe it’s a diamond shape with a question mark inside).
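
To illustrate the idea, here is a minimal Python sketch of how a missing table entry turns into a {#XX} mark. It is not @Voice's actual code: the `TO_UNICODE` table, its contents, and the `decode_glyphs` function are hypothetical names and values made up for this example.

```python
# Hypothetical, incomplete font-to-Unicode table: glyph code -> character.
# Code 46 is deliberately missing, as it would be in a badly encoded PDF.
TO_UNICODE = {43: "H", 44: "i"}


def decode_glyphs(codes, to_unicode):
    """Turn glyph codes into text, marking unmapped codes as {#code}."""
    return "".join(to_unicode.get(code, "{#%d}" % code) for code in codes)


print(decode_glyphs([43, 44, 46], TO_UNICODE))  # prints: Hi{#46}
```

The extractor only sees the number 46; nothing in the file says which character it stands for, so the number is all that can be kept.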
 
There is no automatic fix for this. That’s why I write {#46} in such cases: a smart user who knows that there should be, for example, an exclamation point “!” where {#46} appears can save the extracted text to a TXT file in the @Voice app, then use any good text editor's global search-and-replace function to change “{#46}” into “!”.
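
If you prefer a script to a text editor, the same search and replace can be sketched in Python. This is only an illustration under assumptions: the `replace_placeholders` function, the sample text, and the mapping of 46 to “!” are made up here, and you have to work out the right character for each number yourself by looking at the original PDF.

```python
import re

# Matches marks such as {#46} and captures the number inside.
PLACEHOLDER = re.compile(r"\{#(\d+)\}")


def replace_placeholders(text, fixes):
    """Replace each {#NN} with fixes[NN]; leave unknown numbers untouched."""
    def substitute(match):
        number = int(match.group(1))
        return fixes.get(number, match.group(0))

    return PLACEHOLDER.sub(substitute, text)


# Assumed mapping for one particular file; other files will differ.
sample = "Hello{#46} How are you{#63}"
print(replace_placeholders(sample, {46: "!"}))  # prints: Hello! How are you{#63}
```

Marks whose number is not in the mapping are left as they are, so you can fix a file step by step as you identify more characters.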
 
If you don’t believe me, open such a PDF in, for example, Adobe Acrobat Reader (Adobe invented PDF and originally defined its standard), select and copy some text, and paste it into another app (e.g. an email message). You’ll probably get “unknown character” marks in these places, or the characters will be dropped, or some other random characters will be substituted for them. Whenever possible, do not download your reading material in PDF format. Any other format (TXT, HTML, DOC, EPUB…) is better for reading aloud.
 
Greg
post edited by Admin - 2020/12/15 05:09:55
