pdf tables seem easy until they fail in real life! bank statements can be super messy: scanned pages, layouts that change from one document to the next. i tackled this by combining stream parsing and ocr so extraction holds up better on the fly.
i found we could use stream parsing, lattice/ocr (ocr = optical character recognition) for tricky cells w/ merged rows/columns, and validation checks. basically, a mix-and-match approach. this way, even if one part fails or needs tweaking down the line, the rest of the system keeps working.
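here's a rough sketch of that mix-and-match idea in python, not the actual implementation from the article. everything in it is an assumption for illustration: the `is_valid` check, the `extract_with_fallback` helper, and the three stub extractors are hypothetical stand-ins for whatever stream, lattice, and ocr tools you really use.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Table = List[List[str]]
Extractor = Callable[[str], Table]


def is_valid(table: Table, expected_cols: int) -> bool:
    """Hypothetical validation: non-empty table, fixed column count, no blank cells."""
    return bool(table) and all(
        len(row) == expected_cols and all(cell.strip() for cell in row)
        for row in table
    )


def extract_with_fallback(
    page: str,
    extractors: Sequence[Tuple[str, Extractor]],
    expected_cols: int,
) -> Tuple[Optional[str], Table]:
    """Try each extractor in order; return the first result that passes validation."""
    for name, extract in extractors:
        try:
            table = extract(page)
        except Exception:
            # One strategy failing should not take down the whole pipeline.
            continue
        if is_valid(table, expected_cols):
            return name, table
    # Nothing passed validation: hand back an empty result for manual review.
    return None, []


# Stub extractors (assumptions) that mimic typical outcomes on a scanned page.
def stream_stub(page: str) -> Table:
    return [["2024-01-02", "COFFEE"]]  # amount column got lost


def lattice_stub(page: str) -> Table:
    raise ValueError("no ruling lines found")


def ocr_stub(page: str) -> Table:
    return [["2024-01-02", "COFFEE", "-3.50"]]


EXTRACTORS = [("stream", stream_stub), ("lattice", lattice_stub), ("ocr", ocr_stub)]

if __name__ == "__main__":
    print(extract_with_fallback("page-1", EXTRACTORS, expected_cols=3))
```

the point of ordering the extractors is that the cheap one runs first and the expensive one (ocr) only kicks in when validation rejects the earlier results. swapping or tweaking one strategy later doesn't touch the others.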
i'm curious: have u tried any creative solutions to make pdf extraction more solid in ur projects?
article:
https://www.infoq.com/articles/redesign-pdf-table-extraction/?utm_campaign=infoq_content&utm_source=infoq&utm_medium=feed&utm_term=global