
[FAQ] @Voice "does not support" Hindi PDF files (or other Indic languages PDFs)


Posted by Admin (Administrator) on 2019/04/10 16:20:28

@Voice can extract text from any PDF file that contains valid text (not just images of scanned pages, where OCR is needed to recognize the text from the images) AND contains a valid translation table from the font codes used in the PDF to the Unicode standard. Somehow it's a very bad "tradition" that PDF files created in Hindi and other Indic languages do not provide such translation tables at all. If you open such a PDF in @Voice and see gibberish or random characters instead of valid text, read below.
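To make the role of the translation table concrete, here is a toy Python sketch. It is not how @Voice or any PDF library works internally: the font code values, the table contents, and the function name `extract_text` are all invented for illustration. It only shows why the same stored codes come out as readable Hindi with a table and as gibberish without one.

```python
# Toy model: a PDF stores text as font-specific codes, not as Unicode.
# These code values and this table are hypothetical, chosen for the example.
CODES = [0x4E, 0x6D, 0x73, 0x5F, 0x74, 0x65]

# Hypothetical font-code-to-Unicode table for the Hindi word "namaste".
TO_UNICODE = {
    0x4E: "\u0928",  # DEVANAGARI LETTER NA
    0x6D: "\u092E",  # DEVANAGARI LETTER MA
    0x73: "\u0938",  # DEVANAGARI LETTER SA
    0x5F: "\u094D",  # DEVANAGARI SIGN VIRAMA
    0x74: "\u0924",  # DEVANAGARI LETTER TA
    0x65: "\u0947",  # DEVANAGARI VOWEL SIGN E
}


def extract_text(codes, to_unicode=None):
    """Decode font codes with the table; with no table, guess from raw values."""
    if to_unicode is None:
        # No table available: treating each code as a character value
        # produces unrelated Latin characters, i.e. gibberish.
        return "".join(chr(code) for code in codes)
    # Codes missing from the table become the Unicode replacement character.
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


with_table = extract_text(CODES, TO_UNICODE)   # readable Devanagari text
without_table = extract_text(CODES)            # "Nms_te"
```

When the table is missing, nothing in the file says which Unicode characters the codes stand for, which is why recognizing the rendered page with OCR is the practical workaround.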

You can now open and read aloud such files in the @Voice app as follows: open the original PDF file again in @Voice, using the "Open" button (folder icon) at the top of the screen. On the next screen, titled "PDF Text Import Settings", turn on the OCR option and choose the correct language for your file. Then proceed to extract the text. OCR processing takes much longer than normally opening a correctly encoded PDF, but the text will read aloud fine if the text quality on the PDF pages is good.

If it's a long file that you may need to open several times to continue reading, it's best to open the file with the extracted text the next time, instead of opening the original PDF again. The extracted text will be in one of the following folders under the @Voice home folder, depending on which extraction format you selected:

- for Plain Text extraction - in the PdfText folder; the file name will be the same as the original PDF, with a .pdf.txt extension
- for HTML extraction - in the eBooks folder; the file name will be the same as the original PDF, with a .pdf.epub extension
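The naming rule above can be sketched in a few lines of Python, for example to work out which file to look for after an extraction. This is only an illustration of the folder and extension rules described in this post; the function name `extracted_file_path`, the format keys, and the `voice_home` value are hypothetical, and the actual location of the @Voice home folder depends on your device.

```python
from pathlib import PurePosixPath

# Folder and added extension per extraction format, as described in the post.
# The dictionary keys are hypothetical labels, not @Voice setting names.
EXTRACTION_OUTPUT = {
    "plain_text": ("PdfText", ".txt"),
    "html": ("eBooks", ".epub"),
}


def extracted_file_path(voice_home, pdf_name, extraction_format):
    """Return where the extracted text for pdf_name is expected to be."""
    folder, added_ext = EXTRACTION_OUTPUT[extraction_format]
    # Keep the original file name, including ".pdf", and append the extension.
    original_name = PurePosixPath(pdf_name).name
    return PurePosixPath(voice_home) / folder / (original_name + added_ext)


# "voice_home" is a placeholder for the real @Voice home folder.
txt_path = extracted_file_path("voice_home", "book.pdf", "plain_text")
epub_path = extracted_file_path("voice_home", "book.pdf", "html")
```

With these placeholder inputs, `txt_path` is `voice_home/PdfText/book.pdf.txt` and `epub_path` is `voice_home/eBooks/book.pdf.epub`.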

Greg

post edited by Admin - 2021/05/13 13:48:53
