Need Help with RegEX to Correct OCR Mistakes

Admin
Administrator
  • Total Posts : 275
  • Reward points: 0
  • Joined: 2010/11/22 00:00:00
  • Location: USA
  • Status: offline
2020/12/16 05:31:38 (permalink)

Need Help with RegEX to Correct OCR Mistakes

The user gece posted the following question, but I clicked something incorrectly and accidentally deleted it - sorry! First here is the question, then I'll post my answer:
 
Hi,
I'm trying to have Voice Aloud pronounce some words that were not correctly OCR'd in a scanned book. I have tried quite a few different formulas but couldn't succeed; any help would be very much appreciated! Obviously, I need to learn some basics of RegEx, so I'd also like to ask which site you would suggest for that.
 
E.g.: I need to correct:
"1 am" to "I am"
"1 have" to "I have"
"1 analysed" to "I analysed" and so on...
 
Similarly,
"[ am" to "I am"... and so on
 
I suspect there would be a RegEx formula with "OR" options that would comprise all the variations above -- yet couldn't make one operate successfully...
I couldn't even make the non-RegEx options under "Edit Speech" correct the simple phrase "1 am" to "I am"; apparently, they don't work when there is more than one word (a phrase) to modify...
 
Many thanks in advance!
Admin
Administrator
  • Total Posts : 275
  • Reward points: 0
  • Joined: 2010/11/22 00:00:00
  • Location: USA
  • Status: offline
Re: Need Help with RegEX to Correct OCR Mistakes 2020/12/16 05:32:20 (permalink)
Usually the OCR engine in the @Voice app (Tesseract) does not make such errors with the pronoun I. Did you select English as the language when processing this file with OCR? Or was it a really bad quality page scan? Such replacements may introduce other errors, for example replacing the number 1 with I where 1 was really meant.
 
To replace an isolated 1 followed by any lowercase word with I, you could use:
 
Replacement type: RegEx
Pattern: \b1\s+([a-z]+)\b
Replace: I $1
 
For "[ anything" (problematic if you encounter [any text] in square brackets):
 
Replacement type: RegEx
Pattern: (^|\s)\[\s+([a-z]+)\b
Replace: $1I $2
 
Note that in both cases the "Pattern" field data should be entered _exactly_ as shown above, with no spaces added in front, between the characters, or at the end. In the Replace fields there is one space after I.
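
If you want to try patterns like these outside the app before entering them, a short Python sketch can help. This is only an illustration under a few assumptions: the function name fix_ocr_pronoun and the sample sentences are made up, Python's re module writes back-references as \1 where the Replace field uses $1, and the app's regex engine may differ from Python's in small ways. The sketch keeps the whitespace outside the captured word and puts the leading-whitespace group back in the bracket replacement, so spacing around the match is preserved. The last pattern is one possible way to combine both cases with a character class, which is the "OR" the question asked about.

```python
import re

# Isolated "1" followed by a lowercase word, e.g. "1 am" -> "I am".
ONE_AS_I = re.compile(r"\b1\s+([a-z]+)\b")

# Isolated "[" followed by a lowercase word, e.g. "[ am" -> "I am".
# Group 1 is the line start or the whitespace before "[" and is restored.
BRACKET_AS_I = re.compile(r"(^|\s)\[\s+([a-z]+)\b")

# Hypothetical combined form: the class [1\[] means "either 1 or [".
COMBINED = re.compile(r"(^|\s)[1\[]\s+([a-z]+)\b")


def fix_ocr_pronoun(text: str) -> str:
    """Apply the two separate replacements one after the other."""
    text = ONE_AS_I.sub(r"I \1", text)
    text = BRACKET_AS_I.sub(r"\1I \2", text)
    return text


print(fix_ocr_pronoun("1 am sure that 1 have seen it"))
# -> I am sure that I have seen it
print(fix_ocr_pronoun("and then [ analysed the data"))
# -> and then I analysed the data
print(COMBINED.sub(r"\1I \2", "1 am here and [ have it"))
# -> I am here and I have it

# The caveat about real numbers applies: this is a false positive.
print(fix_ocr_pronoun("add 1 cup of sugar"))
# -> add I cup of sugar
```

The last line shows why these rules are best kept only for books where the OCR really did confuse I with 1 or [.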
 
Greg
post edited by Admin - 2020/12/16 05:33:21
gece
User
  • Total Posts : 0
  • Reward points: 0
  • Joined: 2020/12/09 03:32:57
  • Status: offline
Re: Need Help with RegEX to Correct OCR Mistakes 2020/12/18 01:11:35 (permalink)
Thank you so much for the RegEx patterns! 
These will be very useful! 
 
I actually didn't use the Voice OCR engine for that file, and it indeed was a poor scan; I used a professional OCR app.
 
Admin
Administrator
  • Total Posts : 275
  • Reward points: 0
  • Joined: 2010/11/22 00:00:00
  • Location: USA
  • Status: offline
Re: Need Help with RegEX to Correct OCR Mistakes 2020/12/19 13:51:53 (permalink)
Well, maybe you should give the OCR included in the @Voice app a try, it may give you better results... And if it does, the credit does not go to me, but to the many volunteer developers of the free, open source Tesseract OCR package.
post edited by Admin - 2020/12/19 14:09:10
gece
User
  • Total Posts : 0
  • Reward points: 0
  • Joined: 2020/12/09 03:32:57
  • Status: offline
Re: Need Help with RegEX to Correct OCR Mistakes 2020/12/23 03:31:52 (permalink)
Thank you, I didn't know that you did implement Tesseract in an Android app -- which would deserve at least "some credit" going your way! :) 
 
By the way, I'm afraid to ask too many questions in a short time but may I, finally(!), inquire here about two problems I've been experiencing in the app? 
 
a. That's most probably due to the insufficient hardware on my phone, but when I try to record the read-aloud audio to an .ogg file, the app becomes unresponsive if the resulting file is, I guess, more than 10 MB... 
b. I couldn't find a way to navigate through large chunks of .epub files (apart from clicking on chapter titles, one can only go back/forward one page at a time, it seems) -- which is why I wanted to record audio in the first place: Sometimes I absent-mindedly continue to listen to a book while doing something else, only to realize that I have lost my place completely. 
 
As for deleting messages: No worries, indeed that's why I wrote those last questions here: Probably not being very helpful for others, they can be deleted easily from the thread. 
 
Cheers
 
 
Admin
Administrator
  • Total Posts : 275
  • Reward points: 0
  • Joined: 2010/11/22 00:00:00
  • Location: USA
  • Status: offline
Re: Need Help with RegEX to Correct OCR Mistakes 2020/12/23 05:02:46 (permalink)
a. I'm not sure why this happens; so far no other users have reported a similar problem. Maybe I should add a setting to pause recording and finish one OGG file when it reaches a pre-determined size or audio length, and then start a new one... If you only have an old, slow phone but also a big computer (Windows, Mac, Linux... or even a Chromebook), you could use @Voice there to make your recordings, under a good Android emulator (e.g. BlueStacks or Nox Player on Windows, Mac or Linux; Chromebooks run Android apps natively now).
 
b. Scrolling in @Voice to any part of the text is easier than in any audio player. Simply scroll to the place you want, then double-tap a sentence there to read from that place. If you use horizontal scrolling (text divided into pages, horizontal swipes switch pages), use a pinch gesture, as if you wanted to make a picture smaller. The pages then become smaller and scroll continuously, and there is a horizontal slider at the bottom for even faster scrolling. When you find the page you want, tap it; it returns to normal size and the app goes back to the normal page-by-page scrolling mode. Double-tap any sentence to read from there.
 
Greg
gece
User
  • Total Posts : 0
  • Reward points: 0
  • Joined: 2020/12/09 03:32:57
  • Status: offline
Re: Need Help with RegEX to Correct OCR Mistakes 2020/12/23 09:26:17 (permalink)
Many thanks for both your answers! I should give the computer a try. I don't know how I hadn't noticed that before: it makes scrolling through whole texts quite easy and smooth. 
 