Author
j001
User
  • Total Posts : 0
  • Reward points: 0
  • Joined: 2019/07/31 03:32:38
  • Status: offline
2022/08/18 05:11:56 (permalink)

Sentence splitting

Hi,
Is there a way to prevent sentence splitting in certain situations? E.g. "v 80. rokoch 19. storočia...", which means "in the 80s of the 19th century".
It is read incorrectly in my language. Thank you.
Admin
Administrator
  • Total Posts : 275
  • Reward points: 0
  • Joined: 2010/11/22 00:00:00
  • Location: USA
  • Status: offline
Re: Sentence splitting 2022/08/18 09:40:59 (permalink) (marked helpful by CorbettMD 2022/09/19 15:35:00)
Yes, @Voice would have to know that a number with a dot after it is an "abbreviation", so that it does not end the sentence there. However, sentence splitting would then not work correctly where sentences end with a number, like "He was born in 1980." It's pretty stupid that our writing systems use the same character, the dot, both to end sentences and to mark other things...
 
To teach @Voice abbreviations for any language, create a text file named abbrev-XXX.txt, where XXX is the 3-letter ISO code of the language. For example, if your language is Slovak, the file name should be abbrev-slo.txt. Then enter common abbreviations in that language, one per line. They may be just regular abbreviations, like:
 
Dr.
Gen.
 
and they may also be regular expressions, if the line starts with the * character. To treat a number followed by a dot as an "abbreviation" (so that the sentence does not end on it), enter the line:
 
*\b\d+\.\s
 
Then save your abbrev-slo.txt file and copy it to the .config folder under the @Voice home directory. You can do this e.g. using the "File browser" function on @Voice's Settings menu.
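
To see which dots the number pattern above would protect, here is a minimal Python sketch. It is only an illustration: it assumes the pattern behaves like a standard regular expression once the leading * marker is dropped, and it does not reproduce how @Voice applies the file internally. The function name protected_spans and the sample sentence are made up for the example.

```python
import re

# Pattern from the post, without the leading "*" marker that flags
# a regular expression line in the abbreviations file.
NUMBER_DOT = re.compile(r"\b\d+\.\s")

def protected_spans(text):
    """Return every 'number + dot + whitespace' span found in text."""
    return [m.group(0) for m in NUMBER_DOT.finditer(text)]

sample = "v 80. rokoch 19. storočia. He was born in 1980. Next sentence."
print(protected_spans(sample))  # ['80. ', '19. ', '1980. ']
```

Note that "1980. " is matched too, which is exactly the trade-off mentioned above: with this pattern, a sentence that really ends with a number is no longer split there.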
j001
Re: Sentence splitting 2022/08/19 05:51:46 (permalink)
Admin
Then save your abbrev-slo.txt file and copy it to the .config folder under @Voice home directory. You could do this e.g. using the "File browser" function on @Voice's Settings menu.



Well, I did it and it doesn't work. There's also a file named replace-slk.txt, so I also tried abbrev-slk.txt. Still no change. Any idea what might be wrong? Thank you.
Admin
Re: Sentence splitting 2022/08/19 08:28:22 (permalink)
One thing is that after adding an abbreviations file, you need to reload the article or ebook in @Voice or, better still, exit and restart the app. Send me your abbrev-slk.txt file as an email attachment, so that I can test it.
j001
Re: Sentence splitting 2022/08/19 13:56:51 (permalink)
Admin
One thing is that after adding an abbreviations file, you would need to reload the article or ebook in @Voice, or best exit and restart the app. Send me your abbrev-slk.txt file by email attachment, so that I could test it. 



It didn't work when I restarted the app, but it does work after going back to the table of contents and then reloading the chapter of the book. Everything works now. Thanks again.
I've added some common abbreviations and the regex "*\b.+\.\s[a-z]", so it should keep a sentence together when there is a lower-case letter after a period. Most cases should be solved with this.
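
For anyone who wants to check what this lower-case pattern catches before putting it in the file, here is a small Python sketch. It assumes standard regex semantics with the leading * marker removed; @Voice's own regex engine may differ in details. The helper name keeps_together and the test strings are only for illustration.

```python
import re

# The regex from this post, minus the leading "*" marker used in the abbreviations file.
LOWER_AFTER_DOT = re.compile(r"\b.+\.\s[a-z]")

def keeps_together(text):
    """True if some dot is followed by whitespace and an ASCII lower-case letter."""
    return LOWER_AFTER_DOT.search(text) is not None

print(keeps_together("v 80. rokoch 19. storočia"))   # True: dots followed by "r" and "s"
print(keeps_together("He was born in 1980. Next."))  # False: dot followed by upper-case "N"
print(keeps_together("v 2. štvrťroku"))              # False: "š" is outside [a-z]
```

The last line shows one limitation, at least in Python: [a-z] covers only unaccented letters, so a word starting with a letter like "š" or "č" after the dot is not recognized as a continuation.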
CorbettMD
User
  • Total Posts : 0
  • Reward points: 0
  • Joined: 2021/08/28 10:59:32
  • Location: Toronto, Canada
  • Status: offline
Re: Sentence splitting 2022/09/19 15:40:41 (permalink)
This would be a good candidate for default app parsing behavior.  Until I found this post, I was quite puzzled by sentence parsing in scientific articles where numbers with decimals are very common (e.g. p-values).
 
It would be very helpful to have some control over the size of the chunks that are processed in one go. Some larger constructs (e.g. long table legends) can easily contain multiple sentences, so I cannot reliably filter them out: a regex that relies on the text "chunk" containing both the start and end tags breaks. One solution would be to run the regex on the whole file before it is tokenized, rather than on the tokens themselves.
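
To make the difference concrete, here is a rough Python sketch of the two orders of operation. Everything in it is an assumption for illustration: the [LEGEND] and [/LEGEND] markers are hypothetical, the splitter is deliberately naive, and none of it reflects how @Voice actually tokenizes text.

```python
import re

# Hypothetical start/end markers; real documents would need their own patterns.
BLOCK = re.compile(r"\[LEGEND\].*?\[/LEGEND\]\s*", re.DOTALL)

def naive_split(text):
    """Very rough sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def filter_then_split(text):
    """Remove whole blocks from the full text first, then split what is left."""
    return naive_split(BLOCK.sub("", text))

def split_then_filter(text):
    """Split first, then apply the same regex to each sentence separately."""
    return [BLOCK.sub("", s) for s in naive_split(text)]

text = ("Results follow. [LEGEND]Table 1. Values are means. n = 12.[/LEGEND] "
        "The effect was large.")

print(filter_then_split(text))  # ['Results follow.', 'The effect was large.']
print(split_then_filter(text))  # legend survives, spread over several sentences
```

When the regex runs per sentence, no single sentence contains both markers, so nothing is removed. Running it on the whole text first removes the legend in one go.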
Admin
Re: Sentence splitting 2022/09/19 20:10:34 (permalink)
@CorbettMD - if you have a problem with sentence splitting in some specific text, please email me the file in which it happens, or a link to a web page where the problem occurs. And remember to explain exactly what goes wrong and how it should work. Only then can I rack my brain and come up with a useful suggestion, or even modify the app code to work better.
 
Greg