Parsing Office Open XML (OOXML) files

Started working on a parser for Office Open XML (OOXML). Specifically, I am working on a docx import.
The job is to extract the payload (a regulatory text) and shed all of the word processor formatting information.
The OOXML format is slightly unwieldy: the spec runs to some 5000 pages (plus annexes).
Towards the end of this week I should have something to tell about taming the beast.
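
For readers following along, here is a minimal Python sketch of what "extracting the payload and shedding the formatting" can look like. It is not the parser discussed in this thread. It assumes the docx is a standard zip package whose main part is `word/document.xml`, and the function name `docx_to_paragraphs` is made up for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_paragraphs(docx_bytes: bytes) -> list:
    """Return the text of each non-empty <w:p>, ignoring all formatting."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # Join every text run (<w:t>) of the paragraph; properties such as
        # <w:rPr> or <w:pPr> carry no text and are skipped implicitly.
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(text)
    return paragraphs
```

This only covers the main document body. Headers, footnotes, tracked changes and text boxes live in other parts or elements and would need extra handling.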

Has anyone worked on a similar FM/ooxml project?

Hi Torsten,

I have a buddy who may have worked on something similar, but I am not certain.

I have pinged him with a link to this thread, so that he can take a look and see if he has anything to share.

I also kind of recall that @anon45965781 might have once posted about using a micro-service to extract text content from various document file-formats. I do not recall the details of this, however. If your endeavor involves extracting the text sans formatting, maybe there is some similarity in what @anon45965781 was doing, and the work you are doing.

Kind regards,

-steve

Right. Don’t write a parser. It’s already done and free. It works with just about every binary file format.

Check out the Apache Tika library. It is easy to run as a microservice. Then just drag any binary file into a container field and you can extract its text content into a regular text field.

You don’t have to use container fields. That is just the way I implement Tika in FMP for my clients.
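
As a rough illustration of the microservice route, the sketch below sends a file to a Tika server over HTTP and asks for plain text. It assumes a Tika server is already running locally on its default port 9998 and uses its `/tika` endpoint. The function names are hypothetical, and the sketch is in Python rather than FMP script steps.

```python
import urllib.request

# Assumption: a Tika server is reachable on localhost at its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Prepare a PUT request asking Tika for the plain text of `data`."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(data: bytes, url: str = TIKA_URL) -> str:
    """Send the file to the Tika server and return the extracted text."""
    with urllib.request.urlopen(build_tika_request(data, url)) as response:
        return response.read().decode("utf-8")
```

Typical use would be `extract_text(open("some.docx", "rb").read())`, with the returned string stored in a regular text field.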

Pls write if you have any questions.

Thanks Steve for thinking of me! :grinning:

@steve_ssh, @anon45965781, thanks for the heads-up. I will look into this as I need to expand my xml toolset.

Glad to help. I should also have mentioned that you can run the Tika code standalone.

A micro-service would be used if you wanted to directly call the tika code from FMP passing a Base64-encoded binary file of some kind, for example.
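
To make the Base64 idea concrete, here is a small sketch of the receiving side of such a micro-service, written with the Python standard library only. The JSON shape (`{"file": ...}` in, `{"text": ...}` out), the port and all function names are assumptions for illustration. The `extract` callable stands in for whatever does the real work, for example a call into Tika.

```python
import base64
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable


def handle_body(body: bytes, extract: Callable[[bytes], str]) -> bytes:
    """Decode {"file": "<base64>"} and return {"text": "..."} as JSON bytes."""
    payload = json.loads(body)
    # b64decode skips line breaks by default, so wrapped Base64 text is fine.
    binary = base64.b64decode(payload["file"])
    return json.dumps({"text": extract(binary)}).encode("utf-8")


def make_handler(extract: Callable[[bytes], str]):
    """Build a request handler class that answers POSTs via handle_body."""

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            reply = handle_body(self.rfile.read(length), extract)
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(reply)))
            self.end_headers()
            self.wfile.write(reply)

    return Handler


def serve(extract: Callable[[bytes], str], port: int = 8080) -> None:
    """Run the service on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), make_handler(extract)).serve_forever()
```

Keeping the decoding in `handle_body` separates it from the HTTP plumbing, so it can be tested without starting a server.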

Tons of examples online.

Let me know if you have any coding questions.

Enjoy!!:nerd_face:

After working on two iterations of the provided xml, I settled on two ways of extracting payload data from OOXML:

  1. Jens Teich’s ‘UNIVERSAL XSLT’
  2. the MBS plugin XML functions

Universal XSLT provides a structured import with the end tags taken off. The import is used for further processing (parsing). All word processor-specific formatting information is skipped, except list information <w:pStyle>.

OOXML covers the needs of the word processor and is not a business-specific xml. Business-specific xml can be added through injected custom xml. The custom xml section is the only part of the payload that is not ‘diluted’ by a multitude of word formatting tags, so it is the starting point for semantic parsing. Content parts of the body section <w:body> <w:sdtContent> are linked to custom xml nodes through references <w:id>.

After the information in the custom xml section has been collected, the related content in the body section is parsed, skipping over all non-relevant formatting tags. The result is stored in two tables:

customxml 1 <----> n content
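
The body side of that relation can be sketched in Python as follows. This is not the XSLT/MBS routine described above. It assumes each content control is a `<w:sdt>` with a `<w:sdtPr>/<w:id w:val="...">` and a `<w:sdtContent>`, and it groups the plain text found there by id. How the ids map to the custom xml nodes depends on the custom xml itself and is left out. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def content_by_sdt_id(document_xml: str) -> dict:
    """Map each <w:id> value to the list of text pieces in its <w:sdtContent>."""
    root = ET.fromstring(document_xml)
    rows = defaultdict(list)
    for sdt in root.iter(f"{W}sdt"):
        id_el = sdt.find(f"{W}sdtPr/{W}id")
        content = sdt.find(f"{W}sdtContent")
        if id_el is None or content is None:
            continue
        sdt_id = id_el.get(f"{W}val")
        # Block-level controls hold paragraphs; inline controls hold runs only,
        # in which case the whole content is treated as one unit.
        units = list(content.iter(f"{W}p")) or [content]
        for unit in units:
            text = "".join(t.text or "" for t in unit.iter(f"{W}t"))
            if text:
                rows[sdt_id].append(text)
    return dict(rows)
```

Each key corresponds to one row on the "1" side and each list entry to a row on the "n" side. Note that nested content controls would be reported under the outer id as well as their own.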

The process of parsing OOXML with injected custom xml depends entirely on the custom xml. OOXML itself is not a business xml; it is a page description language. The semantic interpretation of the content in <w:body> also cannot be derived completely from the custom xml. The missing semantic knowledge has to be incorporated in the parsing routine.
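
One way to picture "semantic knowledge in the parsing routine" is a lookup from paragraph style to business meaning, keeping only <w:pStyle> and dropping every other formatting tag. The Python sketch below is purely illustrative: the style ids and role names in `STYLE_ROLES` are invented, and real documents will use whatever styles their template defines.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Hypothetical mapping from Word style ids to business roles.
STYLE_ROLES = {"Heading1": "section_title", "ListParagraph": "list_item"}


def classify_paragraphs(document_xml: str, roles: dict = STYLE_ROLES) -> list:
    """Return (role, text) pairs; paragraphs without a known style are 'body'."""
    root = ET.fromstring(document_xml)
    result = []
    for p in root.iter(f"{W}p"):
        style = p.find(f"{W}pPr/{W}pStyle")
        style_id = style.get(f"{W}val") if style is not None else None
        # All other formatting elements are ignored; only text runs are read.
        text = "".join(t.text or "" for t in p.iter(f"{W}t"))
        if text:
            result.append((roles.get(style_id, "body"), text))
    return result
```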

In addition to the above, I am using MBS xml functions for parsing specific nodes and formatted output for development.

The custom xml is a work in progress and may warrant exploration of other parsers (MBS or as suggested by @steve_ssh and @anon45965781).

The attached file shows the node list of a typical MS Word xml: OOXML node list.txt (3.2 KB)
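
For anyone who wants to produce a comparable node list from their own document, here is a short Python sketch that counts every distinct element path in an XML string. It is not how the attached file was generated. The `w:` prefix is applied only to the main WordprocessingML namespace and other namespaces are reduced to their local names, which is a simplification.

```python
import xml.etree.ElementTree as ET
from collections import Counter

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def short_name(tag: str) -> str:
    """Turn '{namespace}name' into 'w:name' for the main namespace."""
    if tag.startswith("{"):
        ns, _, local = tag[1:].partition("}")
        return f"w:{local}" if ns == W_NS else local
    return tag


def node_list(xml_text: str) -> Counter:
    """Count how often each element path occurs in the document."""
    counts = Counter()

    def walk(element, parent_path):
        path = f"{parent_path}/{short_name(element.tag)}"
        counts[path] += 1
        for child in element:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return counts
```

Printing `sorted(node_list(xml).items())` gives a quick overview of which formatting tags dominate a given file.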

You can also use the free API by Apache (POI). That’s what I do.

Did you parse an OOXML file with it? I would be interested in the outcome. OOXML has the potential to thwart a regular parser with its formatting gibberish.

Sure, that’s why you use an API like this one from Apache.

https://poi.apache.org/apidocs/dev/org/apache/poi/ooxml/package-summary.html

Lots of examples online.

Also, check out Tika from Apache, it will extract text from just about any binary data format – hundreds of them – all for free.

Ok, I see. That’s definitely interesting. Thanks!

Give it a try. Post some code and I’ll try to help. :nerd_face:

Great, thanks. I think I’ll work on it in December, when things calm down…

Sounds good. The offer stands. :heavy_check_mark:
