java - Is it possible to read a PDF line by line?
Is there a way in Java to read a PDF line by line and convert it to text? I have used itextpdfparser, but it reads page by page rather than line by line, and it has a few drawbacks. Please let me know if there is a way to read PDFs line by line.
Before you start on this, you should ask a few more abstract questions. The first is "what is a line of text in a document?" The problem, you see, is that PDF represents a large set of all printable documents (I won't say all, but it's pretty close).
Text is placed on the page with a number of operators: Tj, ', ", and TJ. For example, (a string) Tj places "a string" on the page in the current font with the current text transformations (word/char spacing/scaling, transformation matrix). And this is oversimplified, because the 8-bit characters in the string may be interpreted in all kinds of screwy ways depending on the encoding used for this instance of the font.
So let's look at it this way - if you place text on the page in lines, the generating application might use the ' operator, which moves to the next line and places a line of text. Great, then extracting line by line is easy. But if the application decides to place all the plain text on a page, then all the italic text, and then all the bold text (I'm looking at you, troff), then you don't get things in the order you expect. In fact, the application can place text on the page in any possible order it wants.
OK, you say, take all the text and sort it in reading order. That's easy. Just get bounding boxes for each piece of text and sort top to bottom, left to right. What about columns? Inset boxes? Small caps or initial drop caps? Sub and superscript? Text on a map that follows the contours of a road or river? What is reading order anyway? What if the text is Kanji? What if it's a mix of Kanji and English? What if it's Hebrew with numbers? What about ligatures? What are word boundaries anyway? What if a word is placed a glyph at a time? How do you know when a glyph is part of a word and when you should put in a space? What if there are no spaces placed on the page? What about discretionary hyphens?
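To make the naive "sort top to bottom, left to right" idea concrete, here is a minimal sketch in plain Java. It assumes some extraction library has already handed you text fragments with baseline coordinates; the Fragment class, the LineGrouper name, and the tolerance value are all made up for illustration and are not part of iText or any other library. It handles none of the hard cases listed above (columns, rotated text, right-to-left scripts, superscripts).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class LineGrouper {
    /** Hypothetical text fragment: its string plus the start of its baseline. */
    static final class Fragment {
        final String text;
        final double x; // left edge in PDF user space
        final double y; // baseline; PDF y grows upward
        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Sorts fragments top to bottom, starts a new line whenever a baseline is more
     * than `tolerance` away from the first fragment of the current line, and orders
     * each line left to right.
     */
    static List<String> groupIntoLines(List<Fragment> fragments, double tolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        // Larger y is higher on the page, so sort by descending y.
        sorted.sort(Comparator.comparingDouble((Fragment f) -> -f.y));

        List<String> lines = new ArrayList<>();
        List<Fragment> current = new ArrayList<>();
        for (Fragment f : sorted) {
            if (!current.isEmpty() && Math.abs(current.get(0).y - f.y) > tolerance) {
                lines.add(joinLeftToRight(current));
                current = new ArrayList<>();
            }
            current.add(f);
        }
        if (!current.isEmpty()) {
            lines.add(joinLeftToRight(current));
        }
        return lines;
    }

    private static String joinLeftToRight(List<Fragment> line) {
        line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
        StringBuilder sb = new StringBuilder();
        for (Fragment f : line) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Fragments in "stream order": plain text first, the italic word last.
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment("The", 72, 700));
        page.add(new Fragment("fox", 150, 700));
        page.add(new Fragment("jumps.", 72, 686));
        page.add(new Fragment("quick", 100, 700.5));
        for (String line : groupIntoLines(page, 2.0)) {
            System.out.println(line);
        }
    }
}
```

The tolerance is there because fragments on the same visual line rarely share an identical baseline. Picking it is guesswork: too small and one line splits in two, too large and a superscript or an adjacent column gets merged in.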
This should give you an idea of the scope of the problem and the things you need to consider when interpreting the output of a typical text extraction. Most PDF text extraction tools go as far as pulling out the text, undoing the encoding, annealing words, and sorting.
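As a rough illustration of what "annealing words" can mean when text was placed a glyph at a time, the sketch below joins glyphs on one line and inserts a space wherever the horizontal gap looks too wide. The Glyph class, the WordAnnealer name, and the gap threshold are assumptions made for this example; a real extractor would lean on font metrics and the actual width of the space character rather than a fixed fraction.

```java
import java.util.ArrayList;
import java.util.List;

class WordAnnealer {
    /** Hypothetical positioned glyph: one character, its left edge, and its advance width. */
    static final class Glyph {
        final char ch;
        final double x;
        final double width;
        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins glyphs (assumed to be on one line and already sorted left to right) into
     * text, inserting a space when the gap after the previous glyph exceeds
     * gapFraction times that glyph's width.
     */
    static String anneal(List<Glyph> glyphs, double gapFraction) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > gapFraction * prev.width) {
                    sb.append(' ');
                }
            }
            sb.append(g.ch);
            prev = g;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "to be" placed one glyph at a time, with no space character on the page.
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph('t', 0, 3));
        glyphs.add(new Glyph('o', 3, 5));
        glyphs.add(new Glyph('b', 11, 5));
        glyphs.add(new Glyph('e', 16, 5));
        System.out.println(anneal(glyphs, 0.3));
    }
}
```

Kerned or justified text breaks this kind of heuristic quickly, which is part of why the problem takes so long to get right.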
I worked on the text extraction tools in Acrobat 1.0 and 2.0 and hit almost all of that list. We had one engineer/researcher working full time on the text extraction code in the 2.0 product, and he started during the middle of the 1.0 product - that's close to 2 years just to get it right(ish).
So you want it line by line? Roll up your sleeves.