
Academic PDFs with footnotes / multiple font sizes

Author
cognifloyd
User
  • Total Posts : 0
  • Reward points: 0
  • Joined: 2016/03/17 21:07:19
  • Status: offline
2016/03/17 21:29:46 (permalink)


@Voice Aloud Reader has been a lifesaver in getting through some academic articles fairly quickly.
 
Is there anything that can be done about the footnotes in academic PDFs? Typically there is a visual line dividing the body text from the footnotes, but @Voice doesn't recognize the line (I'm not sure that detecting it would be very easy). So the reading continues mid-sentence from the body into a footnote, which can be very confusing. Note also that explanatory footnotes very commonly take up half of a page, so it's very difficult to know whether I'm in the footnotes or not without looking at the PDF in another app or on a different device while listening to it.
 
Many of the journals I've been reading have footnotes in a smaller font size than the body text. Sometimes large quotations are also in a different font size. Because the font is a slightly different size, @Voice has a hard time knowing when to add spaces and line breaks. Playing with the initial spacing (changing 0.1 to 0.065, for example) can reduce the number of line breaks and split the words more intelligently; however, it is far from perfect. Often I get only one to five words or so per line. With the slight pause between lines to signal a new paragraph, that makes for very choppy reading. I am using the "join lines" option below encoding in the text regeneration menu. "Separate lines" makes things even worse. I also tried setting "Preserve styles" in font options, but that seems to have no effect in PDFs.
 
Is there some way to more elegantly handle footnotes and multiple font sizes?


    Admin
    Administrator
    • Total Posts : 275
    • Reward points: 0
    • Joined: 2010/11/22 00:00:00
    • Location: USA
    • Status: offline
    Re: Academic PDFs with footnotes / multiple font sizes 2016/03/18 07:38:05 (permalink)
    Please email me a few PDF articles with the issues you describe, so that I can have a better look and think of possible solutions. No promises though; the PDF format is just horrible for text extraction. All I get from it are single characters with X, Y coordinates on the page, plus font info (size, font face, etc.). From this I have to re-create words, guess where words must be separated, how to combine them into lines and paragraphs, etc. I'm not analyzing any graphics on the page (like horizontal lines, which could be drawn in many different ways, with different PDF graphics commands, etc.); the code is already so complicated... Whenever possible, try to get the same articles in another format, e.g. HTML, DOC, TXT...
     
    Greg
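To make the extraction problem described above concrete, here is a minimal sketch of rebuilding words and lines from positioned characters. It is not @Voice's actual code. The `Glyph` type, the `join_glyphs` function, and the `gap_ratio` and `line_ratio` parameters are all hypothetical names introduced for illustration, and the sketch assumes the characters already arrive in reading order. The space threshold is scaled by font size, which is one plausible reason a single fixed spacing value behaves differently on body text and on smaller footnote text.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One extracted character (hypothetical structure, for illustration)."""
    char: str
    x: float      # left edge on the page
    y: float      # baseline position
    width: float  # advance width of the character
    size: float   # font size


def join_glyphs(glyphs, gap_ratio=0.1, line_ratio=0.5):
    """Rebuild text lines from glyphs that are assumed to be in reading order.

    gap_ratio:  horizontal gap, as a fraction of the font size, above which
                a space is inserted between two characters (assumed default).
    line_ratio: baseline shift, as a fraction of the font size, above which
                a new line is started (assumed default).
    """
    lines = []
    current = ""
    prev = None
    for g in glyphs:
        if prev is not None:
            if abs(g.y - prev.y) > line_ratio * max(g.size, prev.size):
                # The baseline moved noticeably: close the current line.
                lines.append(current)
                current = ""
            else:
                gap = g.x - (prev.x + prev.width)
                if gap > gap_ratio * min(g.size, prev.size):
                    # The gap is wide relative to the font: treat as a space.
                    current += " "
        current += g.char
        prev = g
    if current:
        lines.append(current)
    return lines
```

With this sketch, the same absolute gap can count as a word break in small footnote text but not in larger body text, so mixed font sizes on one page need a size-relative threshold rather than one fixed number.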
    david.dupont
    User
    • Total Posts : 20
    • Reward points: 0
    • Joined: 2015/04/10 03:37:04
    • Status: offline
    Re: Academic PDFs with footnotes / multiple font sizes 2016/03/18 08:13:55 (permalink)
    If I can give a hint:
    Sometimes the footer text can be removed with a regex pattern.
    But you have to create one for each document.
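As a rough illustration of the regex idea, the sketch below removes footnote lines from already-extracted text. It is plain Python, not @Voice's text filter syntax, and the pattern is a made-up example that assumes footnotes in one particular document start with a one- to three-digit number followed by "See", "Cf." or "Ibid."; `FOOTNOTE_LINE` and `strip_footnotes` are hypothetical names. A different document would need a different pattern.

```python
import re

# Hypothetical pattern for ONE document: a line that starts with a short
# footnote number, then whitespace, then a typical citation opener.
# The optional trailing newline is consumed so no blank line is left behind.
FOOTNOTE_LINE = re.compile(
    r"^\d{1,3}\s+(?:See|Cf\.|Ibid\.).*\n?",
    re.MULTILINE,
)


def strip_footnotes(text, pattern=FOOTNOTE_LINE):
    """Return text with every line matching the footnote pattern removed."""
    return pattern.sub("", text)


if __name__ == "__main__":
    page = (
        "The argument follows earlier work.\n"
        "12 See Smith, p. 4.\n"
        "It continues here.\n"
    )
    print(strip_footnotes(page))
```

This only catches single-line footnotes that match the assumed opener; footnotes that wrap over several lines would need a pattern written against the real extracted text.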
    Admin
    Re: Academic PDFs with footnotes / multiple font sizes 2016/03/20 09:37:10 (permalink)
    Right, given an example of such a PDF file, I could maybe suggest the RegEx patterns and maybe create an @Voice text filter file for it.
     
    Greg