
[FAQ] In text extracted from PDF, some characters are represented like {#46}

2019/04/19 07:19:23

This problem is caused by errors in the encoding of the PDF file. Such files have an incorrect font-to-Unicode conversion table that lacks entries for the characters which @Voice shows as {#XX}, where XX can be any number. When extracting text from such a PDF file, I have no way of knowing which character is meant when the file says that character number 46 from its font should be drawn in this place. I could skip these characters or replace them with the standard “unknown character” mark (I believe it is a diamond shape with a question mark inside), but then the character number would be lost.
There is no automatic fix for this. That is why I write {#46} in such cases: a smart user who knows that there should be, for example, an exclamation point “!” where {#46} appears can save the extracted text to a TXT file in the @Voice app, then use any good text editor's global search-and-replace function to change “{#46}” into “!”.
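For users comfortable with scripting, the same search and replace can also be done with a few lines of Python on the saved TXT file. This is only a sketch, not part of @Voice: the names `REPLACEMENTS`, `fix_placeholders`, and `list_placeholders` are made up for illustration, and the mapping of 46 to “!” is just the example from above. The correct character for each number differs from one PDF to another and has to be found by looking at the original page.

```python
import re

# Hypothetical mapping from placeholder number to the intended character.
# These values are assumptions; work them out for your own PDF.
REPLACEMENTS = {46: "!"}

# Matches placeholders such as {#46} and captures the number.
PLACEHOLDER = re.compile(r"\{#(\d+)\}")


def list_placeholders(text):
    """Return the sorted set of placeholder numbers found in the text."""
    return sorted({int(number) for number in PLACEHOLDER.findall(text)})


def fix_placeholders(text, replacements=REPLACEMENTS):
    """Replace known placeholders; leave unknown ones untouched."""
    def substitute(match):
        code = int(match.group(1))
        return replacements.get(code, match.group(0))
    return PLACEHOLDER.sub(substitute, text)


if __name__ == "__main__":
    sample = "Stop{#46} Who goes there{#31}"
    print(list_placeholders(sample))
    print(fix_placeholders(sample))
```

Running `list_placeholders` first shows which numbers occur in the text, so you know which entries the mapping still needs. Placeholders without an entry are left as they are rather than guessed.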
If you don’t believe me, please open such a PDF in Adobe Acrobat Reader, for example (Adobe invented PDF and published its original specification), select and copy some text, and paste it into another app (e.g. an email message). You will probably get “unknown character” marks in these places, or the characters will be dropped, or some other random characters will be substituted for them. Whenever possible, do not download your reading material in PDF format. Any other format (TXT, HTML, DOC, EPUB…) is better for reading aloud.
post edited by Admin - 2019/04/19 07:24:03
