
Scientific articles in PDF - Jumbled sentences and 1-2 word fragments per line

Author
astrae_research
User
  • Total Posts : 0
  • Reward points: 0
  • Joined: 2016/10/30 20:56:26
  • Status: offline
2018/01/22 00:43:17 (permalink)


Hi,
 
The PDF below exemplifies the problem: on several pages the text comes out as one word fragment per line, and each fragment is read separately.
I've seen this with many scientific papers from Elsevier and other publishers. Other PDFs read OK.
 
https://www.dropbox.com/s/8uhz5waszl97cgl/article_1.pdf?dl=0
 
Any help would be appreciated!


    Admin
    Administrator
    • Total Posts : 275
    • Reward points: 0
    • Joined: 2010/11/22 00:00:00
    • Location: USA
    • Status: offline
    Re: Scientific articles in PDF - Jumbled sentences and 1-2 word fragments per line 2018/01/22 07:52:48 (permalink)
    Maybe you mean the rotated text that is on the left margin of each page and says something like "Annu. Rev Financ. ... blah blah". Open this PDF in @Voice again, and when you see "PDF Text Import Settings", select the "Manually crop pages..." option.
     
    When it opens in the @Voice PDF Crop Plugin, grab the left edge of the white area and move it to the right to exclude that marginal text. While you are at it, also grab the bottom edge and move it up to exclude the page numbers and the footer they have on each page. Then press the "hamburger menu" at the top left and, under the "Apply crop to pages:" header, press "Current + following".
     
    Return to the main screen of the PDF Crop Plugin, press the > button at the top to check that the other pages are cropped correctly, then press the Back button to exit. @Voice then extracts the text minus the unwanted parts.
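
The geometry of the crop described above can be sketched in a few lines of Python. This is only an illustration, not part of @Voice or its Crop Plugin: the names `crop_box` and `crop_from` are hypothetical, boxes are assumed to be in default PDF units (points, origin at the lower left), and applying the result to a real file would need a PDF library, which is not shown here.

```python
from typing import List, Tuple

# (left, bottom, right, top) in PDF points, origin at the lower-left corner
Box = Tuple[float, float, float, float]


def crop_box(page: Box, trim_left: float = 0.0, trim_bottom: float = 0.0) -> Box:
    """Return the page box with the left and bottom margins trimmed off."""
    left, bottom, right, top = page
    new_left = left + trim_left
    new_bottom = bottom + trim_bottom
    if new_left >= right or new_bottom >= top:
        raise ValueError("trim is larger than the page")
    return (new_left, new_bottom, right, top)


def crop_from(pages: List[Box], start: int,
              trim_left: float, trim_bottom: float) -> List[Box]:
    """Apply the same trim to pages[start:] ("current + following").

    Pages before `start` are returned unchanged.
    """
    return [
        crop_box(p, trim_left, trim_bottom) if i >= start else p
        for i, p in enumerate(pages)
    ]


if __name__ == "__main__":
    letter = (0.0, 0.0, 612.0, 792.0)  # US Letter page size in points
    # Trim 40 pt of rotated margin text and 50 pt of footer from page 2 onward.
    for box in crop_from([letter] * 3, start=1, trim_left=40.0, trim_bottom=50.0):
        print(box)
```

The trim values would have to be chosen per journal layout, since the width of the rotated margin text and the height of the footer differ between publishers.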
    astrae_research
    User
    Re: Scientific articles in PDF - Jumbled sentences and 1-2 word fragments per line 2018/01/22 10:41:44 (permalink)
    Interesting! I guess the lateral text messes it up. I have about a thousand of these PDFs; is there any way to automate this? The lateral copyright text is quite common in published scientific articles, so maybe @Voice could have an option to do this automatically?
     
    Here is another one: https://www.dropbox.com/s/htgosu81hj3i8a3/article_2.pdf?dl=0. It seems that any landscape text messes things up. Maybe there could be an option to skip it or convert it to portrait (horizontal) mode?
     
    Thank you!
    Admin
    Administrator
    Re: Scientific articles in PDF - Jumbled sentences and 1-2 word fragments per line 2018/01/22 19:55:26 (permalink)
    Do you mean a completely rotated page, like the one with a table on page 1690 (and some others later) in the article_2.pdf file? Right now you can only exclude such pages manually with the PDF Crop Plugin (it has an Exclude command in the left-hand "hamburger" menu). I would have to study the PDF code and enhance my text extraction code to handle it better. I already automatically reject any rotated text item in a PDF, but I guess they use some other PDF constructs here, like rotating entire blocks of text or something.
     
    Actually, I never imagined that my humble app would be used for reading such complex texts, with tables, rotations, graphs... Again, with the crop plugin you can exclude entire pages, or fragments of pages, e.g. where the graphs are. It's manual work; if you have one or a few articles like this to listen to, I guess it's still doable. If you wanted to do dozens or hundreds of them, I wouldn't want to. It's a lot of work. It would also be a lot of work for me to implement all these automatic text conversions, exceptions, etc.
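
To make the idea of rejecting rotated text items concrete, here is a minimal Python sketch. It is not @Voice's actual extraction code. It assumes an extractor can supply each text item's PDF text matrix and the current transformation matrix (CTM) in effect when the item is drawn, both in the usual `[a b c d e f]` form; the function names are hypothetical. One possible reason (an assumption, not confirmed in this thread) why whole rotated blocks slip through is that the rotation is set on the CTM rather than on the text matrix, so the sketch checks the combined matrix.

```python
import math
from typing import Tuple

# PDF transformation matrix [a b c d e f]
Matrix = Tuple[float, float, float, float, float, float]

IDENTITY: Matrix = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)


def multiply(m1: Matrix, m2: Matrix) -> Matrix:
    """Return m1 x m2 using the PDF row-vector convention (m1 is applied first)."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2,
    )


def rotation_degrees(m: Matrix) -> float:
    """Angle of the text baseline in degrees, counterclockwise, in (-180, 180]."""
    a, b = m[0], m[1]
    return math.degrees(math.atan2(b, a))


def is_rotated(text_matrix: Matrix, ctm: Matrix = IDENTITY,
               tolerance_deg: float = 1.0) -> bool:
    """True if the item's baseline is not horizontal after applying the CTM."""
    combined = multiply(text_matrix, ctm)
    return abs(rotation_degrees(combined)) > tolerance_deg


if __name__ == "__main__":
    upright = (12.0, 0.0, 0.0, 12.0, 72.0, 700.0)    # scaled, not rotated
    sideways = (0.0, 1.0, -1.0, 0.0, 30.0, 100.0)    # rotated 90 degrees
    print(is_rotated(upright))                        # item kept
    print(is_rotated(sideways))                       # item rejected
    print(is_rotated(upright, ctm=sideways))          # rotation set via the CTM
```

A page can also be turned as a whole through its rotation attribute, which this sketch does not look at; handling that case would need a separate per-page check.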