Started working on a parser for Office Open XML (OOXML). Specifically, I am working on a docx import.
The job is to extract the payload (a regulatory text) and shed all of the word processor formatting information.
The OOXML format is slightly unwieldy: the spec runs to some 5000 pages (plus annexes).
Towards the end of this week I should have something to report about taming the beast.
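For readers who have not opened a docx before: it is a zip archive, and the main text lives in the part word/document.xml. The following is only a minimal sketch of the "shed the formatting" idea, not the parser described in this thread. It assumes the text of interest sits in <w:t> runs inside <w:p> paragraphs and ignores tabs, line breaks, footnotes, headers and text boxes. The function name docx_paragraphs is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the plain text of each <w:p> paragraph in word/document.xml.

    `source` is a path or a binary file-like object holding a .docx file.
    All run and paragraph properties (fonts, bold, spacing, ...) are skipped;
    only the text of the <w:t> elements is kept.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split over many runs; join the pieces back together.
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs
```

Calling docx_paragraphs("regulation.docx") (a hypothetical file name) would give one string per paragraph, in document order.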
I have a buddy who may have worked on something similar, but I am not certain.
I have pinged him with a link to this thread, so that he can take a look and see if he has anything to share.
I also seem to recall that @anon45965781 once posted about using a microservice to extract text content from various document file formats. I do not recall the details, however. If your endeavor involves extracting the text sans formatting, there may be some overlap between what @anon45965781 was doing and the work you are doing.
Right. Don’t write a parser. It’s already done and free. It works with just about every binary file format.
Check out the Apache Tika library. It is easy to implement in a microservice. Then just drag any binary file into a container field and you can extract the text content to a regular text field.
You don’t have to use container fields. That is just how I implement Tika in FMP for my clients.
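As a rough illustration of the microservice route, here is a sketch that sends a file to a running Tika server and asks for plain text back. It assumes a tika-server instance is reachable at http://localhost:9998 (its default port) and that a PUT to the /tika endpoint with an Accept: text/plain header returns the extracted text. The function names are invented for this example, and this is not the FMP setup mentioned above.

```python
import urllib.request

# Assumption: a Tika server is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data, url=TIKA_URL):
    """Build the HTTP request that asks Tika for a plain-text rendering of `data`."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path, url=TIKA_URL):
    """Send the file at `path` to the Tika server and return the extracted text."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read(), url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

With the server running, extract_text("regulation.docx") (a hypothetical file name) would return the document text as one string, with no structure beyond line breaks.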
Universal XSLT provides a structured import with the end tags taken off. The import is used for further processing (parsing). All word processor-specific formatting information is skipped, except the list information in <w:pStyle>.
OOXML covers the needs of the word processor and is not a business-specific XML. Business-specific XML can be added through injected custom XML. The custom XML section is the only part of the payload that is not ‘diluted’ by a multitude of Word formatting tags. For semantic parsing this is the starting point.
Content parts of the body section (<w:sdtContent> inside <w:body>) are linked to custom XML nodes through <w:id> references.
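To make the linkage more concrete, here is a small sketch that walks the body, picks up each block-level content control (<w:sdt>), reads its <w:id> value, and collects the paragraph text under <w:sdtContent> while keeping only the <w:pStyle> name. It assumes the id sits at <w:sdtPr>/<w:id w:val="...">, which is where Word normally puts it; inline (run-level) controls are not handled. The function name content_controls is hypothetical, and this is not the XSLT-based import described above.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def q(tag):
    """Qualify a local tag or attribute name with the WordprocessingML namespace."""
    return f"{{{W_NS}}}{tag}"


def content_controls(document_xml):
    """Return [(sdt_id, [(style_or_None, text), ...]), ...] for each <w:sdt>."""
    root = ET.fromstring(document_xml)
    result = []
    for sdt in root.iter(q("sdt")):
        id_el = sdt.find(f"{q('sdtPr')}/{q('id')}")
        content = sdt.find(q("sdtContent"))
        if id_el is None or content is None:
            continue  # no reference to link on, or an empty control

        paragraphs = []
        for p in content.iter(q("p")):
            # Keep the paragraph style (it carries the list information) ...
            style_el = p.find(f"{q('pPr')}/{q('pStyle')}")
            style = style_el.get(q("val")) if style_el is not None else None
            # ... and drop every other formatting tag by reading only <w:t>.
            text = "".join(t.text or "" for t in p.iter(q("t")))
            paragraphs.append((style, text))
        result.append((id_el.get(q("val")), paragraphs))
    return result
```

The returned ids can then be matched against whatever the custom XML part references. How that match is made depends on the injected custom XML and is not covered by this sketch.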
After collection of information in the custom xml section, related content in the body section is parsed, skipping over all non-relevant formatting tags. The result is stored in two tables:
customxml 1 <----> n content
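The one-to-many relationship can be pictured with an in-memory SQLite database standing in for the two tables. Only the table names customxml and content come from the post; every column name below is a guess made for illustration, and the actual solution may well use different fields.

```python
import sqlite3


def create_schema(conn):
    """Create the two tables; one customxml row owns many content rows."""
    conn.executescript(
        """
        CREATE TABLE customxml (
            id     INTEGER PRIMARY KEY,
            sdt_id TEXT NOT NULL UNIQUE,   -- hypothetical: the <w:id> reference
            label  TEXT                    -- hypothetical: business meaning of the node
        );
        CREATE TABLE content (
            id           INTEGER PRIMARY KEY,
            customxml_id INTEGER NOT NULL REFERENCES customxml(id),
            position     INTEGER NOT NULL, -- order of the paragraph in the body
            style        TEXT,             -- <w:pStyle> value, if any
            text         TEXT NOT NULL
        );
        """
    )


conn = sqlite3.connect(":memory:")
create_schema(conn)

# One custom XML node ...
cur = conn.execute(
    "INSERT INTO customxml (sdt_id, label) VALUES (?, ?)", ("42", "article")
)
parent_id = cur.lastrowid

# ... linked to n content rows.
conn.executemany(
    "INSERT INTO content (customxml_id, position, style, text) VALUES (?, ?, ?, ?)",
    [(parent_id, 1, "ListParagraph", "Item one"), (parent_id, 2, None, "Plain")],
)

rows = conn.execute(
    "SELECT c.sdt_id, t.text FROM customxml AS c "
    "JOIN content AS t ON t.customxml_id = c.id ORDER BY t.position"
).fetchall()
```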
Parsing OOXML with injected custom XML depends entirely on the custom XML. OOXML itself is not a business XML; it is a page description language. The semantic interpretation of content in <w:body> also cannot be derived completely from the custom XML. The missing semantic knowledge has to be built into the parsing routine.
In addition to the above, I am using MBS XML functions for parsing specific nodes and for formatted output during development.
The custom XML is a work in progress and may warrant exploration of other parsers (MBS, or as suggested by @steve_ssh and @anon45965781).
The attached file shows the node list of a typical MS Word XML: OOXML node list.txt (3.2 KB)
Did you parse an OOXML file with it? I would be interested in the outcome. OOXML has the potential to thwart a regular parser with its formatting gibberish.