[TYPO3-content-rendering] Illegal SGML characters in output

Michael Stucki michael at typo3.org
Mon Feb 13 12:03:28 CET 2006


Hi Ernesto,

I am willing to solve this problem, but need some more assistance from your
side.

If anyone else is interested in that topic, please join the discussion at
http://bugs.typo3.org/view.php?id=2048

Regards, michael

> Hi,
> 
> the "non SGML character number 128" is probably the most annoying
> validation error that TYPO3 sites hit when users from the Windows world
> (especially European ones) copy & paste input into some field that
> goes straight through to the frontend.
> 
> THE PROBLEM
> ---------------
> 
> The problem originates from the fact that the ISO-Latin-1 character
> table specifies every character in the decimal range 32 up to 255,
> but has a gap in the range from 128 to 159 (see [1]). This range
> is (mis?)used by Microsoft in the so-called "Windows-Latin-1" for
> various characters. The most frequent ones are the EURO sign, the
> emdash ("langer Gedankenstrich", which MS-Word creates automatically if
> you type a hyphen with spaces around it) and opening-double-quotes
> (bottom) (also created by Word in German if you start some quotation).
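
To make the gap concrete, here is a minimal Python sketch (not TYPO3
code) that decodes the same four bytes once as Windows-1252 and once as
ISO-8859-1. The byte values chosen (euro sign, en dash, em dash, low
double quote) and the helper name code_points are assumptions picked
for illustration.

```python
# Bytes a Windows user might paste: euro sign, en dash, em dash,
# and the low opening double quote used in German.
pasted = bytes([0x80, 0x96, 0x97, 0x84])

# Read as Windows-1252, the bytes are printable characters.
as_cp1252 = pasted.decode("cp1252")

# Read as ISO-8859-1, the very same bytes land in the 128-159 gap.
as_latin1 = pasted.decode("iso-8859-1")


def code_points(text):
    """Return the Unicode code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points(as_cp1252))  # ['U+20AC', 'U+2013', 'U+2014', 'U+201E']
print(code_points(as_latin1))  # ['U+0080', 'U+0096', 'U+0097', 'U+0084']
```

The second line of output shows only code points between 128 and 159,
which is the range the validator complains about.
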
> 
> So outputting these characters for the Web in "charset=iso-8859-1" mode
> is not "valid", because they are not part of this charset (which is also
> why the W3C validator chokes on them). The very good article in [2]
> presents some alternatives on how to output them in a generic way.
> 
> SOME TYPO3 SOLUTIONS
> ------------------------
> 
> Some time ago I wrote "cron_rte_cleanenc", which remaps those
> characters from the RTE into proper numerical entities (which is
> what the article [2] suggests as the most widely used method). This is
> nice, but later I figured out that these characters can also be pasted
> into fields that are not RTE-enabled (e.g. Title, Subtitle, etc.), so my
> processing only works in some cases.
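
The remapping idea can be sketched in a few lines of Python. This is
not the code of cron_rte_cleanenc; the function name
remap_windows_chars is hypothetical, and the sketch assumes the input
was decoded as ISO-8859-1, so that pasted Windows characters appear as
code points 128-159.

```python
def remap_windows_chars(text):
    """Replace characters in the range 128-159 with numeric entities.

    Assumes text was decoded as ISO-8859-1, so pasted Windows-1252
    characters show up as code points 128-159.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if 128 <= code <= 159:
            try:
                # Look up what this byte means in Windows-1252.
                real = bytes([code]).decode("cp1252")
                out.append(f"&#{ord(real)};")
            except UnicodeDecodeError:
                # A few positions (e.g. 129) are unassigned in
                # Windows-1252; leave those untouched.
                out.append(ch)
        else:
            out.append(ch)
    return "".join(out)


sample = b"5 \x80 \x96 \xc4".decode("iso-8859-1")
print(remap_windows_chars(sample))  # 5 &#8364; &#8211; Ä
```

Because the function works on any string, it could be applied to plain
fields such as a title just as well as to RTE content.
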
> 
> Later versions of qcom_htmlcleaner include the switch "Remap illegal
> chars" (clean_chars), which translates any "high ASCII" character
> to a proper entity. I see two problems with the current approach:
> 
> 1. it only applies to XHTML_clean(), while the problem also exists in
>    HTML mode.
> 2. it translates *all* characters >127 into entities, which is not
>    needed. The range 128-159 is sufficient here, as Ä can be
>    represented by a proper ISO-Latin-1 character already.
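
The difference between the two strategies in point 2 can be counted on
a sample string. This is only an illustrative Python sketch; the
function needs_entity and the sample text are made up for the example
and are not part of qcom_htmlcleaner.

```python
def needs_entity(ch, only_gap=True):
    """Decide whether a character would be rewritten as an entity.

    only_gap=True  -> only the range 128-159 is rewritten
    only_gap=False -> every character above 127 is rewritten
    """
    code = ord(ch)
    if only_gap:
        return 128 <= code <= 159
    return code > 127


# German text with three umlaut/eszett characters plus a pasted
# Windows dash (0x96) and euro sign (0x80), decoded as ISO-8859-1.
text = b"Gr\xfc\xdfe aus K\xf6ln \x96 5 \x80".decode("iso-8859-1")

all_high = [ch for ch in text if needs_entity(ch, only_gap=False)]
gap_only = [ch for ch in text if needs_entity(ch, only_gap=True)]

print(len(all_high), len(gap_only))  # 5 2
```

Only the two pasted Windows characters fall into the gap; the other
three are ordinary ISO-Latin-1 characters and can stay as they are.
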
> 
> MY GOAL/AIM
> --------------
> 
> I want this translation to happen in the TYPO3 core, without needing
> any extension. Our goal has been XHTML validity, and this is a major
> issue for that commitment. This is not an "xhtml_cleaning" problem, but
> a generic charset problem. We have proven solutions to the problem, we
> just need to see if they are generic enough not to hurt and add them in
> a meaningful way to the core.
> 
> HOW TO PROCEED
> -----------------
> 
> We need to find out in which character sets this is a problem. If I set
> my site to "forceCharSet=utf-8", the problem doesn't exist, because all
> pasted input will have corresponding UTF-8 encodings which are valid. So
> maybe some charset expert around could tell us a bit about it, and if
> no one is available, I would do some research on it. I suspect every
> ISO-Latin-x variant has this problem.
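
One rough way to check a charset is to decode the bytes 128-159 and see
whether they all come out as control characters. The Python sketch
below does that using Python's own codec tables; the function name
gap_is_control_only and the selection of codecs are assumptions for
illustration, and the result says nothing about how TYPO3 itself
handles these charsets.

```python
import unicodedata


def gap_is_control_only(codec):
    """Return True if bytes 128-159 all decode to control characters."""
    raw = bytes(range(128, 160))
    # Unassigned bytes become U+FFFD, which is not a control character.
    decoded = raw.decode(codec, errors="replace")
    return all(unicodedata.category(ch) == "Cc" for ch in decoded)


for codec in ("iso8859_1", "iso8859_2", "iso8859_15", "cp1252"):
    print(codec, gap_is_control_only(codec))

# In UTF-8 the euro sign is a regular multi-byte sequence.
euro_utf8 = "\u20ac".encode("utf-8")
print(euro_utf8)  # b'\xe2\x82\xac'
```

A charset for which the function returns True has no printable
characters in that range, so pasted Windows characters cannot be
represented in it directly.
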
> 
> Then we need to create some patches to correct the situation.
> 
> I've just committed this text as http://bugs.typo3.org/view.php?id=2048,
> so we can track the progress on this. :)
> 
> 
> [1] http://www.htmlhelp.com/reference/charset/
> [2] http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
> 
> 
> Cheers,
> Ernesto

-- 
Use a newsreader! Check out
http://typo3.org/community/mailing-lists/use-a-news-reader/


