ADAPTIVE DOCUMENT CONTENT EXTRACTION VIA ENTROPY-GUIDED GLOBAL ALIGNMENT
Assignee
Richard Hermann
Inventors
Richard Hermann
Abstract
Disclosed are a system and method for extracting content from electronic documents that address the limitations of rigid, template-based approaches and the overfitting issues of machine learning approaches. The method begins by identifying content features and ranking them by Shannon entropy. The highest-ranked feature or features are used to identify and match “Landmarks”: content that serves as distinct anchor points for establishing global alignment between documents. With these Landmarks as a foundation, an adaptive, stepwise global alignment process matches the remaining content. This process uses a two-stage technique: deterministic features first identify a set of candidate matches, and non-deterministic spatial features then select the single best match from those candidates based on its geometric coherence with already-aligned items. In the final stage, large language models (LLMs) are selectively employed to generalize the discovered features and relationships into reusable, abstracted prompts. This allows the system to adapt to unseen document formats with higher accuracy than brute-force prompting.
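The entropy ranking, Landmark matching, and two-stage selection described in the abstract can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the item layout (dictionaries with `text`, `kind`, `x`, `y`), the function names, the sample data, and the offset-consistency cost used in the spatial stage are all assumptions made for illustration.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def rank_features(items, feature_names):
    """Return feature names ordered from highest to lowest entropy."""
    scores = {name: shannon_entropy([item[name] for item in items])
              for name in feature_names}
    return sorted(feature_names, key=lambda name: scores[name], reverse=True)

def match_landmarks(doc_a, doc_b, feature):
    """Pair items whose `feature` value occurs exactly once in each document.

    Treating "unique in both documents" as the Landmark test is an assumption.
    """
    count_a = Counter(item[feature] for item in doc_a)
    count_b = Counter(item[feature] for item in doc_b)
    index_b = {item[feature]: item for item in doc_b}
    return [(a, index_b[a[feature]]) for a in doc_a
            if count_a[a[feature]] == 1 and count_b[a[feature]] == 1]

def select_match(item, pool, aligned, feature):
    """Two-stage selection of the counterpart of `item` within `pool`.

    Stage 1 (deterministic): keep pool items with the same `feature` value.
    Stage 2 (spatial): keep the candidate whose offsets to the already-aligned
    pairs deviate least from the offsets of `item` (hypothetical cost).
    """
    candidates = [c for c in pool if c[feature] == item[feature]]
    if not candidates:
        return None

    def cost(candidate):
        return sum(
            math.hypot((item["x"] - a["x"]) - (candidate["x"] - b["x"]),
                       (item["y"] - a["y"]) - (candidate["y"] - b["y"]))
            for a, b in aligned)

    return min(candidates, key=cost)

# Hypothetical sample data: two layouts of the same form, shifted by (10, 20).
DOC_A = [
    {"text": "Invoice No.", "kind": "label", "x": 10, "y": 10},
    {"text": "Total", "kind": "label", "x": 10, "y": 200},
    {"text": "Amount", "kind": "label", "x": 100, "y": 80},
    {"text": "Amount", "kind": "label", "x": 100, "y": 200},
]
DOC_B = [
    {"text": "Invoice No.", "kind": "label", "x": 20, "y": 30},
    {"text": "Total", "kind": "label", "x": 20, "y": 220},
    {"text": "Amount", "kind": "label", "x": 110, "y": 100},
    {"text": "Amount", "kind": "label", "x": 110, "y": 220},
]

ranked = rank_features(DOC_A, ["kind", "text"])
landmarks = match_landmarks(DOC_A, DOC_B, ranked[0])
resolved = select_match(DOC_A[3], DOC_B, landmarks, ranked[0])
```

In this sample, `text` outranks `kind` because every item shares the same `kind`, so that feature has zero entropy and cannot distinguish items. The two `Amount` entries are ambiguous on `text` alone, which is where the spatial stage, anchored on the two Landmark pairs, selects between them.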
CPC Classifications
Filing Date
2025-09-15
Application No.
19328817