[TYPO3-content-rendering] Illegal SGML characters in output

Ernesto Baschny [cron IT] ernst at cron-it.de
Thu Dec 15 21:01:42 CET 2005


Hi,

the "non SGML character number 128" error is probably the most annoying
validation error that TYPO3 sites hit when users from the Windows world
(especially European ones) copy & paste text into some field whose
content goes straight through to the frontend.

THE PROBLEM
---------------

The problem originates from the fact that the ISO-Latin-1 character
table specifies every character in the decimal range 32 up to 255, but
has a gap in the range from 128 to 159 (see [1]). This range is
(mis?)used by Microsoft in the so-called "Windows-Latin-1" for various
characters. The most frequent ones are the euro sign, the em dash (the
"langer Gedankenstrich" or long dash in German, which MS Word creates
automatically if you type a hyphen with spaces around it) and the
opening double quotes (bottom), which Word also creates in German when
you start a quotation.

So outputting these characters for the Web in "charset=iso-8859-1" mode
is not "valid", because they are not part of this charset (which is also
why the W3C validator chokes on them). The very good article in [2]
presents several alternatives for outputting them in a generic way.
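
To make the gap concrete, here is a small Python sketch (not TYPO3 code;
the helper name describe() is made up for this mail). It decodes the
same byte once as Windows-1252 and once as ISO-8859-1, and reports the
Unicode category of the ISO-8859-1 result. "Cc" means control character,
which is what the validator complains about.

```python
import unicodedata


def describe(byte_value):
    """Return (Windows-1252 character, Unicode category under ISO-8859-1)."""
    raw = bytes([byte_value])
    as_cp1252 = raw.decode("cp1252")       # what the Windows user meant
    as_latin1 = raw.decode("iso-8859-1")   # what the page actually declares
    return as_cp1252, unicodedata.category(as_latin1)


# Three bytes from the 128-159 range that are commonly pasted from Word.
for value in (0x80, 0x84, 0x97):
    char, category = describe(value)
    print(value, repr(char), category)
```

Byte 128 is the euro sign in Windows-1252 but only a control character
in ISO-8859-1, which matches the "character number 128" in the error.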

SOME TYPO3 SOLUTIONS
------------------------

Some time ago I wrote "cron_rte_cleanenc", which remaps those characters
coming from the RTE into proper numerical entities (which is what the
article [2] suggests as the most widely used method). This is nice, but
later I figured out that these characters can also be pasted into fields
that are not RTE-enabled (e.g. Title, Subtitle, etc.), so my processing
only covers some of the cases.

Later versions of qcom_htmlcleaner include the switch "Remap illegal
chars" (clean_chars), which translates any "high ASCII" character into a
proper entity. I see two problems with the current approach:

1. it only applies to XHTML_clean(), while the problem also exists in
   HTML mode.
2. it translates *all* characters >127 into entities, which is not
   needed. The range 128-159 is sufficient here, as Ä can be
   represented by a proper ISO-Latin-1 character already.
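
A narrower remapping along the lines of point 2 could look roughly like
the following Python sketch. It is not the code of cron_rte_cleanenc or
qcom_htmlcleaner; the function name remap_windows_chars() is
hypothetical, and silently dropping the five byte values that
Windows-1252 leaves unassigned is my own assumption.

```python
def remap_windows_chars(data: bytes) -> bytes:
    """Replace bytes 128-159 with numeric entities, leave the rest alone."""
    out = bytearray()
    for b in data:
        if 0x80 <= b <= 0x9F:
            try:
                char = bytes([b]).decode("cp1252")
            except UnicodeDecodeError:
                # Unassigned in Windows-1252 (assumption: drop the byte).
                continue
            out += "&#{};".format(ord(char)).encode("ascii")
        else:
            # Plain ASCII and real ISO-Latin-1 characters pass through.
            out.append(b)
    return bytes(out)


print(remap_windows_chars(b"Preis: 5 \x80 \x97 \xc4rger"))
```

The byte for the A-umlaut (196) is left untouched, while the euro sign
and the dash are turned into numeric entities.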

MY GOAL/AIM
--------------

I want this translation to happen in the TYPO3 core, without needing any
extension. Our goal has been XHTML validity, and this is a major
obstacle to that commitment. This is not an "xhtml_cleaning" problem,
but a generic charset problem. We have proven solutions to the problem;
we just need to see whether they are generic enough not to hurt, and
then add them to the core in a meaningful way.

HOW TO PROCEED
-----------------

We need to find out in which character sets this is a problem. If I set
my site to "forceCharSet=utf-8", the problem doesn't exist, because all
pasted characters have corresponding valid UTF-8 encodings. So maybe
some charset expert around could tell us a bit about it, and if no one
is available, I will do some research on it myself. I suspect every
ISO-Latin-x variant has this problem.
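
As a starting point for that research, the following Python sketch asks
Python's own codec tables whether the range 128-159 contains anything
other than control characters. This only reflects how Python maps these
charsets, not TYPO3's conversion tables, and the names CANDIDATES and
has_gap() are made up for this mail.

```python
import unicodedata

# The ISO-8859-x codecs that ship with Python (there is no part 12).
CANDIDATES = ["iso8859_%d" % n
              for n in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16)]


def has_gap(codec):
    """True if bytes 128-159 hold no printable character in this codec."""
    for b in range(0x80, 0xA0):
        try:
            char = bytes([b]).decode(codec)
        except UnicodeDecodeError:
            continue  # unassigned position, so nothing printable here
        if unicodedata.category(char) != "Cc":
            return False
    return True


report = {codec: has_gap(codec) for codec in CANDIDATES}
for codec, gap in report.items():
    print(codec, "has the gap" if gap else "uses the range")
```

Running the same check against "cp1252" gives the opposite answer, since
Windows-1252 puts printable characters into that range.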

Then we need to create some patches to correct the situation.

I've just filed this text as http://bugs.typo3.org/view.php?id=2048,
so we can track the progress on this. :)


[1] http://www.htmlhelp.com/reference/charset/
[2] http://www.cs.tut.fi/~jkorpela/www/windows-chars.html


Cheers,
Ernesto


