Sentence splitting | Hyperionics Support Forum

#1 j001 (User), 2022/08/18 05:11:56
Sentence splitting

Hi,
Is there a way to prevent sentence splitting in certain situations? E.g. "v 80. rokoch 19. storočia...", which means "in the 1880s" (literally "in the 80s of the 19th century"). It is read incorrectly in my language.
Thank you.

#2 Admin (Administrator), 2022/08/18 09:40:59
Re: Sentence splitting

Yes, @Voice would have to know that a number with a dot after it is an "abbreviation", so that it does not end the sentence there. However, sentence splitting would then not work correctly where a sentence ends with a number, like "He was born in 1980." It's pretty stupid that our writing systems use the same character, the dot, both to end sentences and to mark other things...

To teach @Voice abbreviations for any language, create a text file named abbrev-XXX.txt, where XXX is the 3-letter ISO code of the language. For example, if your language is Slovak, the file name should be abbrev-slo.txt. Then enter common abbreviations in that language, one per line. They may be just regular abbreviations, like:

Dr.
Gen.

and they may be regular expressions, when the line starts with the * character. To treat a number followed by a dot as an "abbreviation" (so that the sentence does not end on it), enter the line:

*\b\d+\.\s

Then save your abbrev-slo.txt file and copy it to the .config folder under the @Voice home directory. You could do this e.g. using the "File browser" function on @Voice's Settings menu.
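
To see what the number-with-dot pattern from the post above catches, here is a minimal Python sketch. It only exercises the regular expression itself (without the leading * that marks a regex line in the abbreviations file); how @Voice applies the pattern internally is not described in this thread, and the helper name protected_dots is made up for the illustration.

```python
import re

# Pattern from the post, minus the leading "*" that flags a regex line
# in the abbreviations file. Python's "re" is used here only as a stand-in;
# the regex engine @Voice actually uses is an unknown.
NUMBER_DOT = re.compile(r"\b\d+\.\s")

def protected_dots(text):
    """Hypothetical helper: list the 'number + dot' spots the pattern matches."""
    return [m.group().strip() for m in NUMBER_DOT.finditer(text)]

slovak = "v 80. rokoch 19. storočia"
english = "He was born in 1980. Then he moved."

print(protected_dots(slovak))   # ['80.', '19.']
print(protected_dots(english))  # ['1980.']
```

The second result shows the trade-off mentioned in the post: the dot after "1980" is a real sentence end, yet the pattern matches it just like an ordinal number.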

#3 j001 (User), 2022/08/19 05:51:46
Re: Sentence splitting

> Admin: Then save your abbrev-slo.txt file and copy it to the .config folder under @Voice home directory. You could do this e.g. using the "File browser" function on @Voice's Settings menu.

Well, I did that and it doesn't work. There's also a file named replace-slk.txt, so I also tried abbrev-slk.txt. Still no change. Any idea what might be wrong?
Thank you.

#4 Admin (Administrator), 2022/08/19 08:28:22
Re: Sentence splitting

One thing is that after adding an abbreviations file, you need to reload the article or ebook in @Voice, or better yet, exit and restart the app. Send me your abbrev-slk.txt file as an email attachment so that I can test it.

#5 j001 (User), 2022/08/19 13:56:51
Re: Sentence splitting

> Admin: One thing is that after adding an abbreviations file, you would need to reload the article or ebook in @Voice, or best exit and restart the app. Send me your abbrev-slk.txt file by email attachment, so that I could test it.

It didn't work when I restarted the app, but it does work after going back to the table of contents and then reloading the chapter of the book. Everything works now. Thanks again.

I've added some common abbreviations and the regex "*\b.+\.\s[a-z]", so it should keep a sentence together when there is a lower-case letter after a period. Most cases should be solved with this.
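
A quick way to check the lower-case rule from the post above is to run the pattern against a few fragments. This is a rough Python sketch of the regular expression only (without the leading *); the function name continues_sentence and the third sample fragment are made up for illustration, and @Voice's own regex engine and flags may behave differently.

```python
import re

# The regex from the post, minus the leading "*" used in the abbreviations file.
# Python's "re" is a stand-in here, not the engine @Voice is known to use.
LOWERCASE_AFTER_DOT = re.compile(r"\b.+\.\s[a-z]")

def continues_sentence(text):
    """Hypothetical helper: True if a dot is followed by whitespace and a-z."""
    return LOWERCASE_AFTER_DOT.search(text) is not None

print(continues_sentence("v 80. rokoch"))   # True
print(continues_sentence("in 1980. Then"))  # False
print(continues_sentence("2. časť"))        # False
```

The last case is worth noting for Slovak text: in this engine the class [a-z] covers only the 26 unaccented letters, so a word starting with "č", "š" or "ž" would not be recognized as lower-case. Whether that also applies inside @Voice is an assumption that would need testing in the app.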

#6 CorbettMD (User), 2022/09/19 15:40:41
Re: Sentence splitting

This would be a good candidate for default app parsing behavior. Until I found this post, I was quite puzzled by sentence parsing in scientific articles where numbers with decimals are very common (e.g. p-values).

It would be very helpful to have some degree of control over the chunk sizes that are processed in one go, as there are some larger constructs (e.g. long table legends) that I cannot reliably filter out, as they can easily contain multiple sentences, which breaks a regex that would rely on the text "chunk" containing both start and end tags. One solution would be to run regex on the total file before it is tokenized, rather than on tokens themselves.
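
The chunking problem described in the post above can be reproduced with a small Python sketch. Everything in it is an assumption made for illustration: the [legend] ... [/legend] markers, the sample text, and the naive split after "dot plus whitespace" that stands in for whatever sentence chunking @Voice really does.

```python
import re

# Hypothetical start/end markers; the thread does not say what the real ones are.
LEGEND = re.compile(r"\[legend\].*?\[/legend\]", re.DOTALL)

text = ("Results follow. [legend]Table 1. Values are means. "
        "n = 5.[/legend] Next section.")

# Naive stand-in for sentence chunking: split after a dot followed by whitespace.
chunks = re.split(r"(?<=\.)\s+", text)

# Filtering chunk by chunk: no single chunk holds both markers, so nothing is removed.
per_chunk = " ".join(LEGEND.sub("", chunk) for chunk in chunks)

# Filtering the whole text first: the legend is removed in one piece.
whole = LEGEND.sub("", text)

print(len(chunks))              # 4
print("[legend]" in per_chunk)  # True
print("[legend]" in whole)      # False
```

Under these assumptions the legend spans three chunks, so a start-to-end pattern only succeeds when it runs before the text is split.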

#7 Admin (Administrator), 2022/09/19 20:10:34
Re: Sentence splitting

@CorbettMD - if you have a problem with sentence splitting in some specific text, please email me the file in which it happens, or a link to a web page where the problem occurs. And remember to explain what exactly goes wrong and how it should work. Only then can I rack my brain and come up with some useful suggestion, or even modify the app code to work better.

Greg
