[TYPO3-doc] ReST migration how-to: Join the journey - from OpenOffice to HTML, ReST and HTML again

Martin Bless m.bless at gmx.de
Fri Mar 2 16:30:49 CET 2012


Good news today:

We have a working build chain to convert OpenOffice documents to ReST
:)))

I've written a parser "ooxhtml2rst" that makes this possible. It reads
xhtml files created by OpenOffice and transforms them to ReST text
format.

If you are an expert and want to know how it's done, visit these links:
http://srv123.typo3.org/~mbless/2012-03-02/official_template_openoffice/README.html
http://srv123.typo3.org/~mbless/2012-03-02/official_template_openoffice/

If you are just interested: Here are four examples with direct links
to 
[1] the folder,
[2] the OpenOffice source as xhtml,
[3] the ReST version 
[4] and the RESULT when rendered as HTML again.

TSref:

[1]
http://srv123.typo3.org/~mbless/2012-03-02/doc_core_tsref/
[2]
http://srv123.typo3.org/~mbless/2012-03-02/doc_core_tsref/300-manual-as-xhtml.html
[3]
http://srv123.typo3.org/~mbless/2012-03-02/doc_core_tsref/400-manual-parsed-by-html2rst.rst.txt
[4] 
http://srv123.typo3.org/~mbless/2012-03-02/doc_core_tsref/500-manual-made-by-rst2html.html



TypoScript in 45 Minuten

[1] http://srv123.typo3.org/%7Embless/2012-03-02/doc_tut_ts45_de/
[2]
http://srv123.typo3.org/%7Embless/2012-03-02/doc_tut_ts45_de/300-manual-as-xhtml.html
[3]
http://srv123.typo3.org/%7Embless/2012-03-02/doc_tut_ts45_de/400-manual-parsed-by-html2rst.rst.txt
[4]
http://srv123.typo3.org/%7Embless/2012-03-02/doc_tut_ts45_de/500-manual-made-by-rst2html.html


TypoScript in 45 Minutes (Russian!)

[1] http://srv123.typo3.org/%7Embless/2012-03-02/doc_tut_ts45_ru/
[2]
http://srv123.typo3.org/%7Embless/2012-03-02/doc_tut_ts45_ru/300-manual-as-xhtml.html
[3]
http://srv123.typo3.org/%7Embless/2012-03-02/doc_tut_ts45_ru/400-manual-parsed-by-html2rst.rst.txt
[4]
http://srv123.typo3.org/%7Embless/2012-03-02/doc_tut_ts45_ru/500-manual-made-by-rst2html.html


Official Documentation Template (OpenOffice)

[1]
http://srv123.typo3.org/%7Embless/2012-03-02/official_template_openoffice/
[2]
http://srv123.typo3.org/%7Embless/2012-03-02/official_template_openoffice/300-manual-as-xhtml.html
[3]
http://srv123.typo3.org/%7Embless/2012-03-02/official_template_openoffice/400-manual-parsed-by-html2rst.rst.txt
[4]
http://srv123.typo3.org/%7Embless/2012-03-02/official_template_openoffice/500-manual-made-by-rst2html.html

What we get as a result depends, as expected, very much on how the
original OpenOffice document is coded. While structural information is
kept, we currently lose the style information at the character
level. In HTML terms: we lose the information carried by <font>,
<span> and similar tags.
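
To illustrate what "structure is kept, character styling is lost"
means, here is a minimal sketch. It is NOT the code of ooxhtml2rst;
the class TinyRstWriter and the function to_rst are hypothetical
names, and it only knows <h1>, <h2> and <p>. Inline tags such as
<font> and <span> are ignored, so only their text survives.

```python
from html.parser import HTMLParser


class TinyRstWriter(HTMLParser):
    """Toy converter (hypothetical): keeps headings and paragraphs only."""

    UNDERLINES = {"h1": "=", "h2": "-"}

    def __init__(self):
        super().__init__()
        self.blocks = []     # finished ReST blocks
        self.buffer = []     # text pieces of the block being collected
        self.current = None  # structural tag we are inside, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "p"):
            self.current = tag
            self.buffer = []
        # <font>, <span> and friends are ignored: their styling is lost

    def handle_data(self, data):
        if self.current:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self.current:
            return
        # Collapse whitespace, then underline the text if it is a heading
        text = " ".join("".join(self.buffer).split())
        if tag in self.UNDERLINES:
            text += "\n" + self.UNDERLINES[tag] * len(text)
        self.blocks.append(text)
        self.current = None


def to_rst(xhtml: str) -> str:
    """Return the ReST text for a small xhtml fragment."""
    writer = TinyRstWriter()
    writer.feed(xhtml)
    writer.close()
    return "\n\n".join(writer.blocks) + "\n"
```

Feeding it a heading followed by a paragraph that contains a colored
<span> gives an underlined title and a plain paragraph; the color is
gone.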

State of the parser: There are some minor issues where a blank or an
empty line is missing or superfluous in some places, but overall it
works very well. I
don't expect the parsing result to be much different or dramatically
better in the future. So I wouldn't hesitate to use it for real world
conversions. Of course I have some ideas to improve it. We may have to
teach the parser about more tags when we start processing more
documents. And maybe we can use some style information from <font> tags
to turn them into structural information. But I don't have much hope
that this approach is worth the effort.

At the end of [4] the parser adds some statistics about what it has
seen while working. Especially if it talks about unhandled tags we
should give it a look.
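
A sketch of how such statistics could be collected, under the
assumption that the parser simply counts every start tag it meets.
The HANDLED set, the class TagStats and the function unhandled_tags
are made up for this example and do not reflect what ooxhtml2rst
really handles.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed set of tags the converter knows about (illustration only)
HANDLED = {"html", "head", "body", "h1", "h2", "p", "ul", "ol", "li"}


class TagStats(HTMLParser):
    """Counts every start tag seen in the input."""

    def __init__(self):
        super().__init__()
        self.seen = Counter()

    def handle_starttag(self, tag, attrs):
        self.seen[tag] += 1


def unhandled_tags(xhtml: str) -> dict:
    """Return {tag: count} for all tags that are not in HANDLED."""
    stats = TagStats()
    stats.feed(xhtml)
    stats.close()
    return {tag: n for tag, n in stats.seen.items() if tag not in HANDLED}
```

A non-empty result is the hint that the parser should be taught
about more tags.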

ReST source files should be UTF-8 if not plain ASCII anyway. I even
use and recommend UTF-8 WITH BOM (byte order mark).
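
For those who work with Python: the standard codec "utf-8-sig" writes
the BOM when encoding and strips it, if present, when decoding. The
two helper names below are just for this example.

```python
import codecs


def encode_rst(text: str) -> bytes:
    """Encode ReST text as UTF-8 with a leading byte order mark."""
    return text.encode("utf-8-sig")


def decode_rst(data: bytes) -> str:
    """Decode UTF-8 bytes; a leading BOM is removed if there is one."""
    return data.decode("utf-8-sig")


# The first three bytes of the encoded text are the UTF-8 BOM
assert encode_rst("Titel").startswith(codecs.BOM_UTF8)
```

Decoding with "utf-8-sig" also works for files without a BOM, so it
is a safe choice for reading either variant.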

The parser "ooxhtml2rst" first looks into the XML input file and
tries to determine the encoding all by itself by extracting it from
the <?xml ...> declaration. Otherwise it expects ASCII.
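
Roughly, the idea can be sketched like this. This is an illustration
and not the actual code of ooxhtml2rst; the function name
detect_encoding, the regular expression and the 200 byte limit are
assumptions made for the example.

```python
import re

# Matches the encoding="..." part of a leading <?xml ...?> declaration
_XML_DECL = re.compile(
    rb'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']'
)


def detect_encoding(raw: bytes, default: str = "ascii") -> str:
    """Return the encoding named in the XML declaration, else the default."""
    # Skip a UTF-8 byte order mark if the file starts with one
    if raw.startswith(b"\xef\xbb\xbf"):
        raw = raw[3:]
    # The declaration has to be at the very start, so a short prefix suffices
    match = _XML_DECL.match(raw[:200])
    if match:
        return match.group(1).decode("ascii").lower()
    return default
```

The returned name can then be passed to bytes.decode() to get the
text of the input file.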

Have a nice weekend - I certainly will :-)

Martin

-- 
http://mbless.de

