[FAQ] In text extracted from PDF, some characters are represented like {#46}



Admin (Administrator)
2019/04/19 07:19:23


Update Dec. 2020: if this happens to you and you need to read the PDF file, open it in the @Voice app using the Open button at the top (folder icon). Then, on the "PDF Text Import Settings" screen, turn on the OCR option, select the correct language and proceed. It will take much longer, but will work well.

This problem is due to errors in the PDF file's encoding. Such files have an incorrect font-to-Unicode conversion table, which lacks entries for the characters that @Voice shows as {#XX}, where XX can be any number. When extracting text from such a PDF file, I have no way of knowing which character is meant when the file says that character number 46 from its font should be drawn in this place. I could skip such characters or replace them with the standard “unknown character” mark (I believe it’s a diamond shape with a question mark inside).
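To illustrate the idea, here is a minimal sketch of how a text extractor might fall back to a {#XX} placeholder when the conversion table has no entry for a character code. This is not @Voice's actual code; the function name `decode_glyphs` and the sample table are hypothetical, made up for the illustration.

```python
def decode_glyphs(codes, to_unicode):
    """Turn font character codes into text.

    `to_unicode` stands in for the PDF's font-to-Unicode table
    (hypothetical sample data, not read from a real PDF).
    Codes missing from the table become {#XX} placeholders.
    """
    pieces = []
    for code in codes:
        char = to_unicode.get(code)
        if char is None:
            # No entry in the table: keep the raw code visible to the user.
            pieces.append("{#%d}" % code)
        else:
            pieces.append(char)
    return "".join(pieces)


# A table that knows codes 72 and 105 but lacks an entry for code 46.
sample_table = {72: "H", 105: "i"}
text = decode_glyphs([72, 105, 46], sample_table)
print(text)
```

Keeping the number in the placeholder, rather than printing a generic "unknown character" mark, is what makes a later manual repair possible.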

There is no automatic fix for this. That’s why I write {#46} in such cases: a smart user who knows that there should be, for example, an exclamation point “!” wherever {#46} appears can save the extracted text to a TXT file in the @Voice app, then use any good text editor's global search-and-replace function to change “{#46}” into “!”.
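The same search and replace can be scripted once you have worked out what each number stands for. The following is a minimal sketch, assuming the extracted text was saved to a TXT file and that you build the mapping yourself by reading the text; the function name `replace_placeholders` and the sample mapping are hypothetical.

```python
import re

# Matches placeholders such as {#46}; the number is captured in group 1.
PLACEHOLDER = re.compile(r"\{#(\d+)\}")


def replace_placeholders(text, mapping):
    """Replace {#XX} placeholders using a user-supplied mapping.

    `mapping` maps the number XX to the character it should become.
    Placeholders whose number is not in the mapping are left unchanged.
    """
    def substitute(match):
        code = int(match.group(1))
        return mapping.get(code, match.group(0))

    return PLACEHOLDER.sub(substitute, text)


# Example: the reader decided that {#46} should be an exclamation point.
fixed = replace_placeholders("Hello{#46} What{#63}", {46: "!"})
print(fixed)
```

The mapping is only valid for the one PDF file it was worked out for, since another file's font may use the same number for a different character.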

If you don’t believe me, please open such a PDF in, for example, Adobe Acrobat Reader (Adobe invented PDF and published its original specification), select and copy some text, and paste that text into another app (e.g. an email message). You’ll probably get “unknown char” marks in these places, or the characters will be dropped, or some other random characters will be substituted for them. Whenever possible, do not download your reading material in PDF format. Other formats (TXT, HTML, DOC, EPUB…) are better for reading aloud.

Greg

post edited by Admin - 2020/12/15 05:09:55
