How to replace page numbers?

Author
Narayan
User
  • Total Posts : 0
  • Reward points: 0
  • Joined: 2021/08/22 23:54:15
  • Status: offline
2021/09/06 04:12:00 (permalink)

How to replace page numbers?

I am reading a badly formatted pdf article that has page numbers embedded in the text.
 
This occurs inside the text like this:
 
<blank line>
(p.x)
<blank line> 
 
Where I do not know the composition of the blank lines: They may be white spaces, or genuinely blank lines.
Also, the page numbers are 1-3 digits long.
 
I have set a pause of 2 seconds for "end of paragraph". So the reading pauses for 2 seconds, spells out the page number, and again pauses for 2 seconds. As a result, any sentence that overflows a page is broken awkwardly.
 
I used the edit speech feature, and entered \(\.p\d+\) as a pattern, and left the replacement blank. But it does not work at all. How do I correct this?
 
Also, how do I match the blank lines, given that they may actually contain spaces, newline character or new paragraph character? I tried to use " *" (a space followed by asterisk) to match these lines, but that did not work, either.  
(I used a separate Edit speech entry to match the blank line.)
 
Finally, I would like to combine these patterns into a single pattern that is to be ignored.
 
BTW the entries are enabled (tick appears in the checkbox), so that's not the issue...
 
 
 
post edited by Narayan - 2021/09/06 04:19:01

4 Replies

    Admin
    Administrator
    • Total Posts : 275
    • Reward points: 0
    • Joined: 2010/11/22 00:00:00
    • Location: USA
    • Status: offline
    Re: How to replace page numbers? 2021/09/06 05:22:17 (permalink)
    It would be best if you sent me this PDF article as an email attachment (email: atVoice@hyperionics.com), with a brief explanation of the problem. Then I could suggest the best way of dealing with it. Most probably that would be to use the "Manually crop..." option on the "PDF Text Import Settings" screen and shade all page numbers, so that they are not part of text extraction; @Voice would then combine sentences and paragraphs across page breaks.
    Narayan
    User
    • Total Posts : 0
    • Reward points: 0
    • Joined: 2021/08/22 23:54:15
    • Status: offline
    Re: How to replace page numbers? 2021/09/06 09:18:21 (permalink)
    Thanks for the prompt response!
     
    Unfortunately, the content is sensitive, and so I am unable to send you the pdf.
     
    But I can definitely send you a screenshot of the problematic part. Or, if you would like to examine the code of the problematic text, please let me know how to extract the code.
     
    The page break appears in random places, not on the edge of all pages. I guess that the person who sent me the pdf copied a pdf file into a Word file, edited some text out and converted the remaining text back to pdf. Thus the page breaks appear at random places in the text, along with page numbers...
     
    The result is weird: The reader seems to be gasping for breath mid-sentence!  
    Narayan
    User
    • Total Posts : 0
    • Reward points: 0
    • Joined: 2021/08/22 23:54:15
    • Status: offline
    Re: How to replace page numbers? 2021/09/06 09:53:01 (permalink)
    I got the corrected version from that person. So this case may please be closed.
    That said, this thread can still be used to guide the users about how to set the edit text correctly to omit the page numbers.
     
    Thanks!
    Admin
    Administrator
    • Total Posts : 275
    • Reward points: 0
    • Joined: 2010/11/22 00:00:00
    • Location: USA
    • Status: offline
    Re: How to replace page numbers? 2021/09/06 16:35:41 (permalink)
    Omitting these page numbers will just silence the reading of them, but the sentences will still be broken. It's best to open the original PDF in the @Voice app and adjust the margins as I described above.
     
    As for the regular expression you used to silence things like "(p.123)", you wrote:
     
    \(\.p\d+\)
     
    This is not correct, because the dot is before the p and not after it, so your regex would match text like "(.p123)", but not "(p.123)". With the dot moved after the p, the pattern becomes \(p\.\d+\). However, in the original file that you've got there could be some invisible characters, maybe font settings etc., so it's impossible to advise precisely without having that file.
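     
    For anyone who wants to try the patterns outside the app, here is a minimal sketch in Python. It is only an illustration: the sample text is invented to imitate the layout described in the question, the helper name strip_page_markers is hypothetical, and it assumes the regex engine behind "Edit speech" treats \s, \d and {1,3} the same way Python does. Whether an Edit speech pattern can actually match across line or paragraph breaks in @Voice is not confirmed here.

```python
import re

# Invented sample: a sentence interrupted by a whitespace-only line,
# a page marker, and another whitespace-only line.
sample = "the sentence runs on\n \n(p.12)\n\t\nand continues on the next page."

# Pattern from the question: the dot sits before "p", so it only matches "(.p12)".
wrong = re.compile(r"\(\.p\d+\)")

# Dot moved after "p"; {1,3} reflects the stated 1-3 digit page numbers.
fixed = re.compile(r"\(p\.\d{1,3}\)")

# Combined pattern: \s* absorbs spaces, tabs and line breaks on both sides,
# so it does not matter whether the "blank" lines are empty or contain spaces.
combined = re.compile(r"\s*\(p\.\d{1,3}\)\s*")


def strip_page_markers(text):
    """Replace each page marker and its surrounding whitespace with one space."""
    return combined.sub(" ", text)


print(wrong.search(sample))          # no match for the original pattern
print(fixed.search(sample).group())  # the page marker itself
print(strip_page_markers(sample))    # sentence rejoined across the marker
```

    The replacement here is a single space rather than an empty string, so that the words on either side of the marker are not glued together. Note also that " *" on its own matches zero or more spaces, including the empty string, and never a line break, which is one plausible reason it did not behave as a blank-line matcher.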