Re: Scientific articles in PDF - Jumbled sentences and 1-2 word fragments per line
2018/01/22 19:55:26
Do you mean a completely rotated page, like the one with a table on page 1690 (and some others later) in the article_2.pdf file? Right now you can only exclude such pages manually with the PDF Crop Plugin (it has an Exclude command in the left-hand "hamburger" menu). I would have to study the PDF code and enhance my text extraction code to handle it better. I already automatically reject any rotated text item in a PDF, but I guess they use some other PDF construct here, such as rotating entire blocks of text.
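To illustrate why a per-item rotation check can miss a rotated block, here is a minimal Python sketch. It is not the app's actual extraction code, and the function names (`multiply`, `rotation_degrees`, `is_rotated`) are hypothetical. It assumes the usual PDF convention of six-number affine matrices (a, b, c, d, e, f), where the rotation a text item is drawn with comes from its own text matrix combined with the current transformation matrix (CTM) of the enclosing block. If only the text matrix is inspected, text inside a block rotated through the CTM looks upright.

```python
import math

def multiply(m, n):
    """Multiply two PDF affine matrices given as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )

def rotation_degrees(matrix):
    """Angle of the matrix's x-axis relative to the page's x-axis."""
    a, b = matrix[0], matrix[1]
    return math.degrees(math.atan2(b, a))

def is_rotated(text_matrix, ctm, tolerance_degrees=1.0):
    """Check rotation on the combined matrix, not on the text matrix alone."""
    combined = multiply(text_matrix, ctm)
    return abs(rotation_degrees(combined)) > tolerance_degrees

# Hypothetical values: an upright text matrix inside a block rotated by 90 degrees.
upright_text = (12, 0, 0, 12, 100, 700)
identity_ctm = (1, 0, 0, 1, 0, 0)
rotated_ctm = (0, 1, -1, 0, 0, 0)

text_alone_looks_upright = abs(rotation_degrees(upright_text)) <= 1.0
plain_block_rotated = is_rotated(upright_text, identity_ctm)
rotated_block_rotated = is_rotated(upright_text, rotated_ctm)
```

In this sketch the text item on its own reports no rotation, while the combined check flags it once the enclosing block's matrix is taken into account. Whether article_2.pdf actually rotates its tables this way is a guess.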
Actually, I never imagined that my humble app would be used for reading such complex texts, with tables, rotations, graphs... Again, with the crop plugin you can exclude entire pages or fragments of pages, e.g. where the graphs are. It's manual work; if you have one or a few articles like this to listen to, I guess it's still doable. If you wanted to do dozens or hundreds of them, I wouldn't want to: that's a lot of work. It would also be a lot of work for me to implement all these automatic text conversions, exceptions, etc.