Re: Academic PDFs with footnotes / multiple font sizes
2016/03/18 07:38:05
Please email me a few PDF articles that show the issues you describe, so that I can take a closer look and think about possible solutions. No promises, though: PDF is just a horrible format for text extraction. All I get from it are single characters with X, Y coordinates on the page, plus font info (size, font face, etc.). From this I have to re-create the words, guess where words must be separated, work out how to combine them into lines and paragraphs, and so on. I'm not analyzing any graphics on the page (such as horizontal lines, which can be drawn in many different ways with different PDF graphics commands); the code is already so complicated... Whenever possible, try to get the same articles in another format, e.g. HTML, DOC, TXT...
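To give a rough idea of what that reconstruction involves, here is a minimal Python sketch. It is not the program's actual code: the `Glyph` record, the `glyphs_to_lines` function and both thresholds are hypothetical, made up only for illustration. It assumes each character comes with a left edge, a baseline, a width and a font size, and that Y grows upward as it does on a PDF page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    """Hypothetical record for one extracted character."""
    char: str
    x: float      # left edge of the character
    y: float      # baseline; larger means higher on the page
    width: float  # advance width of the character
    size: float   # font size


def glyphs_to_lines(glyphs: List[Glyph],
                    line_tol: float = 0.5,
                    gap_ratio: float = 0.25) -> List[str]:
    """Group characters into lines, then guess the word breaks.

    line_tol and gap_ratio are illustrative guesses, expressed as
    fractions of the font size.
    """
    # Top of the page first, then left to right.
    ordered = sorted(glyphs, key=lambda g: (-g.y, g.x))

    # Characters whose baselines are close enough share a line.
    lines: List[List[Glyph]] = []
    for g in ordered:
        if lines and abs(lines[-1][0].y - g.y) <= line_tol * g.size:
            lines[-1].append(g)
        else:
            lines.append([g])

    result = []
    for line in lines:
        line.sort(key=lambda g: g.x)
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            # A wide horizontal gap is taken to be a space.
            gap = cur.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.size:
                text += " "
            text += cur.char
        result.append(text)
    return result


# Characters arrive in no particular order.
sample = [
    Glyph(c, x, y, 5.0, 10.0)
    for c, x, y in [
        ("k", 5, 688), ("t", 15, 700), ("H", 0, 700), ("e", 35, 700),
        ("i", 5, 700), ("o", 0, 688), ("h", 20, 700), ("r", 30, 700),
        ("e", 25, 700),
    ]
]
print(glyphs_to_lines(sample))  # ['Hi there', 'ok']
```

Even this toy version shows where footnotes and mixed font sizes cause trouble: a superscript footnote marker sits on a different baseline and uses a smaller size, so fixed thresholds like these can put it on the wrong line or split words in the wrong place.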
Greg