Problems addressed / Applications
- Adding linguistic analysis information to .docx documents
- Tagging a PDF’s line and page breaks in the TEI source document
- Inserting links into tagged documents
… with a single, customizable, open-source XProc/XSLT library
Linguistic analysis of .docx files
[Image: Word file about Celts]
[Image: linguistically analyzed file]
Tagging a PDF’s line breaks in the TEI source document
Taken from: Uwe Johnson Werkausgabe. Ein Vorhaben der
Berlin-Brandenburgischen Akademie der Wissenschaften an der Universität
Rostock
Inserting links into tagged documents
See the synthetic ttt-linking-demo repo on GitHub.
It links occurrences (in the paragraphs) of chapter titles to the respective chapters.
A similar real-life application (for Deutscher Apotheker Verlag) linked occurrences of an
entry title to the primary entry in major reference works (thousands of pages).
→ see demo source
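The candidate search described above can be pictured with a minimal Python sketch. It is not the library's XSLT implementation: the function names, the `targets` mapping, and the chapter ids are hypothetical, and it only shows the general idea of compiling the target titles into one regex and recording the character offsets of each occurrence in a paragraph's flattened text.

```python
import re


def build_title_pattern(titles):
    """Compile one alternation that matches any title as a whole word."""
    # Longest titles first, so a longer title wins over a shorter one
    # that happens to be its prefix.
    ordered = sorted(titles, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")


def find_candidates(paragraph_text, targets):
    """Return (start, end, target_id) for each title occurrence."""
    pattern = build_title_pattern(targets)
    return [
        (m.start(), m.end(), targets[m.group(0)])
        for m in pattern.finditer(paragraph_text)
    ]


# Hypothetical target list: chapter title -> chapter id.
targets = {"Tokenization": "ch2", "Merging Results": "ch3"}
text = "As shown in Merging Results, Tokenization comes first."
candidates = find_candidates(text, targets)

for start, end, target_id in candidates:
    print(text[start:end], "->", target_id)
```

The offsets refer to the flattened paragraph string, so a later step still has to map them back onto the nested markup.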
Inserting links into tagged documents
Step-by-step inspection of the linking scenario. Macroscopic steps:
- p:xslt with custom prepare-target-list.xsl
- ttt:prepare-input (library step with 3 XSLT passes)
- p:xslt with custom find-candidates.xsl
- ttt:process-paras (library step with 3 XSLT passes)
- ttt:merge-results (library step with 3 XSLT passes)
→ demo (diff between input and linked output is on next slide)
Diff
[Image: the diff between sample input and linked output]
Output of the bespoke find-candidates.xsl pass
[Image: A ttt:para element with the normalized input para and the regex-matching-based tokenization result (indented for better fit into this slide)]
Performance Detail: Converting token start/end milestones into spanning elements
Occurs as last XSLT pass in ttt:process-paras
- “Upward projection” method for pulling up milestones to immediately beneath the paragraph
- Only then may they be connected into spanning elements
- Why is it a good idea to split up the input into paragraph-like units?
  - Performance scales roughly with node count times splitting point count
  - 10 paras with 20 elements and 4 milestones each ⇒ (10×20)×(10×4) = 8000 when processing as a single chunk, 10×20×4 = 800 individually
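The arithmetic above can be checked with a small Python sketch. It assumes the rough cost model stated on the slide (cost proportional to node count times splitting point count); `merge_cost` is a hypothetical helper for illustration, not a measurement of the library.

```python
def merge_cost(node_count, split_point_count):
    """Rough cost model from the slide: nodes times splitting points."""
    return node_count * split_point_count


paras = 10
nodes_per_para = 20
milestones_per_para = 4

# All paragraphs processed as one chunk: every milestone "sees" every node.
single_chunk = merge_cost(paras * nodes_per_para, paras * milestones_per_para)

# Each paragraph processed on its own, costs summed over all paragraphs.
per_paragraph = paras * merge_cost(nodes_per_para, milestones_per_para)

print(single_chunk, per_paragraph, single_chunk // per_paragraph)
```

Under this model, the saving factor equals the number of chunks, which is why splitting into paragraph-like units pays off more the longer the document gets.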
Common properties of the three tasks
- Deeply nested XML
- Flat-string-based tokenization, often by regex matching
- Enriching the tokens with analysis results:
  - part of speech information
  - page/line numbers from a PDF
  - link targets
- Merging the new token structure with the source XML
⇒ Overlapping markup
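To see why merging tokens back into nested XML leads to overlapping markup, here is a minimal Python sketch. It is an illustration under assumptions, not the library's XSLT: the sample paragraph, the `hi` element, and the helper names are made up. The paragraph is flattened into a string while remembering which element owns each text segment, and a regex match is then mapped back onto those segments.

```python
import re
import xml.etree.ElementTree as ET


def text_segments(elem):
    """Yield (text, tag of the enclosing element) in document order."""
    if elem.text:
        yield elem.text, elem.tag
    for child in elem:
        yield from text_segments(child)
        if child.tail:
            # Tail text follows the child but belongs to the parent.
            yield child.tail, elem.tag


def flatten(elem):
    """Return the flat string and (start, end, tag) offsets per segment."""
    parts, spans, pos = [], [], 0
    for text, tag in text_segments(elem):
        spans.append((pos, pos + len(text), tag))
        parts.append(text)
        pos += len(text)
    return "".join(parts), spans


def segments_touched(spans, start, end):
    """Segments that intersect the half-open character range [start, end)."""
    return [s for s in spans if s[0] < end and start < s[1]]


# Hypothetical input: the token "Chapter One" straddles an element boundary.
para = ET.fromstring("<p>See <hi>Chap</hi>ter One for details.</p>")
flat, spans = flatten(para)
match = re.search(r"Chapter One", flat)
touched = segments_touched(spans, match.start(), match.end())

print(flat)
print(touched)
```

Here the token starts inside `hi` and ends in the parent's text, so an element wrapping the whole token cannot be inserted without splitting `hi` or the token itself. Start/end milestones avoid that problem until the final pass.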
Interface of mark-linebreaks.xpl
(The PDF scenario)
- Pulling up milestones is optional; not used here
- Good showcase for multiple output ports in XProc
Details of mark-linebreaks.xpl
Advantages of XProc in orchestrating these pipelines
- Functional language
- Excellent encapsulation
- Thereby great customizability (config files or overriding XSLT on input ports)
- In contrast to other functional languages: multiple “return values” (output ports) whose consumption can be deferred until they are needed (ttt:prepare-input on the previous slide)
- Wealth of other libraries, in particular for dealing with zipped XML formats such as .docx, IDML, or EPUB (e.g. docx2hub, hub2docx)