Parsing Office Open (OOXML) files

Torsten · October 7, 2019, 4:58pm

I have started working on a parser for Office Open XML (OOXML). Specifically, I am working on a docx import.
The job is to extract the payload (a regulatory text) and shed all of the word processor's formatting information.
The OOXML format is slightly unwieldy: the spec has some 5000 pages (plus annexes).
Towards the end of this week I should have something to tell about taming the beast.
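
For readers who want to see what "shedding the formatting" amounts to, here is a minimal sketch in Python (standard library only). It is not the parser discussed in this thread. It assumes the usual docx layout, a ZIP archive whose main part is word/document.xml, and the function name docx_paragraphs is made up for this illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, as used in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the plain text of each <w:p> paragraph in a .docx file.

    `source` is a file path or a binary file-like object (hypothetical
    helper, not part of any library). Only <w:t> text runs are read, so
    styles, fonts and other formatting elements are ignored.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs; join the <w:t> pieces.
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs
```

This ignores tabs, line breaks, footnotes, headers and tracked changes, and a paragraph nested inside another one (for example in a text box) would be reported twice, so it is a starting point rather than a complete extractor.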

Has anyone worked on a similar FM/ooxml project?

steve_ssh · October 25, 2019, 4:22pm

Hi Torsten,

I have a buddy who may have worked on something similar, but I am not certain.

I have pinged him with a link to this thread, so that he can take a look and see if he has anything to share.

I also kind of recall that @anon45965781 might have once posted about using a micro-service to extract text content from various document file-formats. I do not recall the details of this, however. If your endeavor involves extracting the text sans formatting, maybe there is some similarity in what @anon45965781 was doing, and the work you are doing.

Kind regards,

-steve

anon45965781 · October 25, 2019, 5:15pm

Right. Don’t write a parser. It’s already done and free. It works with just about every binary file format.

Check out the Apache Tika library. It is easy to implement in a microservice. Then just drag any binary file into a container field and you can extract the text content to a regular text field.
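
One possible way to put Tika behind a service is Tika's own server mode. The sketch below is an assumption about such a setup, not a description of the implementation mentioned above: it assumes a Tika server running locally on its default port 9998, and the helper names build_tika_request and extract_text are invented for this example.

```python
import urllib.request

# Assumption: a Tika server is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data, url=TIKA_URL):
    """Build a PUT request that asks the Tika server for plain text."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(data, url=TIKA_URL):
    """Send the raw file bytes to the server and return the extracted text."""
    with urllib.request.urlopen(build_tika_request(data, url)) as response:
        return response.read().decode("utf-8")
```

Calling extract_text(open("some.docx", "rb").read()) would then return the document text, provided the server is reachable at that address.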

You don’t have to use container fields. That was just an FMP implementation way I use tika with my clients.

Pls write if you have any questions.

Thanks Steve for thinking of me!

Torsten · October 27, 2019, 6:30am

@steve_ssh, @anon45965781, thanks for the heads-up. I will look into this as I need to expand my XML toolset.

anon45965781 · October 27, 2019, 9:25am

Glad to help. I should also have mentioned that you can run the Tika code standalone.

A micro-service would be used if you wanted to directly call the tika code from FMP passing a Base64-encoded binary file of some kind, for example.
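
As a rough illustration of the Base64 hand-off described above, here is a sketch of both ends of such a call. The JSON layout and the field names filename and content_base64 are assumptions made up for this example. They are not part of Tika or of FileMaker.

```python
import base64
import json


def build_request_body(filename, data):
    """Caller side: wrap raw file bytes in a JSON body as Base64 text."""
    return json.dumps({
        "filename": filename,  # hypothetical field name
        "content_base64": base64.b64encode(data).decode("ascii"),
    })


def read_request_body(body):
    """Service side: recover the filename and the original bytes."""
    payload = json.loads(body)
    return payload["filename"], base64.b64decode(payload["content_base64"])
```

The service would pass the decoded bytes on to the text extractor and return the plain text in its response.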

Tons of examples online.

Let me know if you have any coding questions.

Enjoy!!

Torsten · November 10, 2019, 12:49am

After working on 2 iterations of provided xml, I settled on two ways of extracting payload data from OOXML:

Jens Teich’s ‘UNIVERSAL XSLT’
the MBS plugin XML functions

Universal XSLT provides a structured import with the end tags stripped. The import is used for further processing (parsing). All word processor-specific formatting information is skipped, except the list information in <w:pStyle>.

OOXML covers the needs of the word processor and is not a business-specific XML. Business-specific XML can be added through injected custom XML. The custom XML section is the only part of the payload that is not ‘diluted’ by a multitude of Word formatting tags, so it is the starting point for semantic parsing. Content parts of the body section (<w:body>, <w:sdtContent>) are linked to custom XML nodes through <w:id> references.

Once the information in the custom XML section has been collected, the related content in the body section is parsed, skipping over all non-relevant formatting tags. The result is stored in two tables:

customxml 1 <----> n content

Parsing OOXML with injected custom XML depends entirely on the custom XML. OOXML itself is not a business XML, it is a page description language. The semantic interpretation of the content in <w:body> cannot be derived entirely from the custom XML either. The missing semantic knowledge has to be built into the parsing routine.
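
To make the 1 : n relation concrete, here is a small Python sketch (standard library only) of the body-side half. It is not the XSLT or MBS routine used here. It assumes each content control is a <w:sdt> whose <w:sdtPr> holds a <w:id w:val="..."/> and whose text sits in <w:sdtContent>, and the function name content_by_sdt_id is invented for this illustration.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(tag):
    """Return `tag` qualified with the WordprocessingML namespace."""
    return f"{{{W_NS}}}{tag}"


def content_by_sdt_id(document_xml):
    """Map each content control's <w:id> value to its paragraph texts."""
    root = ET.fromstring(document_xml)
    result = {}
    for sdt in root.iter(w("sdt")):
        id_el = sdt.find(f"{w('sdtPr')}/{w('id')}")
        content = sdt.find(w("sdtContent"))
        if id_el is None or content is None:
            continue  # no id or no content: nothing to link
        paragraphs = [
            "".join(t.text or "" for t in p.iter(w("t")))
            for p in content.iter(w("p"))
        ]
        if not paragraphs:
            # Inline control: runs without an enclosing paragraph.
            paragraphs = ["".join(t.text or "" for t in content.iter(w("t")))]
        # One id (the "1" side) collects n content rows.
        result.setdefault(id_el.get(w("val")), []).extend(paragraphs)
    return result
```

Each key would correspond to one row on the customxml side and each list entry to one row in the content table. Nested content controls are not treated specially, so an outer control also collects the text of the controls inside it.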

In addition to the above, I am using MBS xml functions for parsing specific nodes and formatted output for development.

The custom XML is a work in progress and may warrant exploring other parsers (MBS, or those suggested by @steve_ssh and @anon45965781).

The attached file shows the node list of a typical MS Word XML file.

Attachment: OOXML node list.txt (3.2 KB)

anon45965781 · November 12, 2019, 6:56pm

You can also use the free API by Apache (POI). That’s what I do.

Torsten · November 12, 2019, 7:05pm

Did you parse an OOXML file with it? I would be interested in the outcome. OOXML has the potential to thwart a regular parser with its formatting gibberish.

anon45965781 · November 12, 2019, 7:17pm

Sure, that’s why you use an API like this one from Apache.

https://poi.apache.org/apidocs/dev/org/apache/poi/ooxml/package-summary.html

Lots of examples online.

–

Also, check out Tika from Apache, it will extract text from just about any binary data format – hundreds of them – all for free.

Torsten · November 12, 2019, 7:19pm

Ok, I see. That’s definitely interesting. Thanks!

anon45965781 · November 12, 2019, 7:40pm

Give it a try. Post some code and I’ll try to help.

Torsten · November 12, 2019, 7:45pm

Great, thanks. I think I’ll work on it in December, when things calm down…

anon45965781 · November 12, 2019, 8:05pm

Sounds good. The offer stands.
