[TYPO3-RTE] Cleaning pasted content

Robert Markula robert.markula at gmx.net
Thu Dec 29 10:08:52 CET 2005


Hi Stanislas & List,
There are many settings for rtehtmlarea, and for the RTE API in general, 
that let the admin control the input to a certain degree for the sake 
of consistent output in the FE (like removing attributes from certain 
tags or removing certain tags altogether). That is very good, since 
consistent output is very important.

However, when pasting content from other sources (websites, Word, 
OpenOffice.org etc.), the current input control may not be sufficient, 
especially when the source is not well-formed from the perspective of 
the RTE (even more when tables are disabled with 'removeTags = table, 
tbody, td, th, thead, tr').

Here are a few things I came across:
(1) There can be constructs like </p><br /><p>, which are usually not 
intended and mess up paragraphs.
(2) The same applies to tabs within sentences.
(3) Some editors have the annoying habit of positioning text with a lot 
of spaces (tabs would be better, but are still a pain when pasting such 
content into the RTE - see (2)). This can also happen to any editor who 
accidentally presses the space key more than once.
(4) When pasting lists from OpenOffice, there are <p> tags within lists: 
<li><p>Some text here</p></li>.
(5) Some admins may even want to remove empty paragraphs (<p>&nbsp;</p>).
(6) When pasting text from Acrobat Reader, there are soft line breaks 
(<br />) within sentences because the text is copied exactly as you see 
it in Acrobat Reader. The same happens to text copied from e-mails, 
because in Plain-Text E-Mails text is usually wrapped after 76 
characters. See below for an example.

Points (1) and (2) can be reproduced by opening the rtehtmlarea manual 
in OpenOffice.org and pasting the content into htmlarea. Then scroll 
down to the bottom of the text.
(3) is well known to admins who have this kind of editor.
(4) can be reproduced by pasting any list from OpenOffice.org.
To reproduce (5), just add an empty paragraph.
(6): Open a PDF document in Acrobat Reader, select the 'select text' 
tool and copy text to the RTE.

Solutions to these "problems" might be (the numbers apply to the list 
above):
(1) Introduce an option to remove <br /> tags when they occur outside 
paragraphs.
(2) Introduce an option to remove tabs.
(3) Do the same with multiple 'space' characters.
(4) Remove P tags when they occur inside list tags.
(5) Introduce an option to completely remove empty paragraphs 
(<p>&nbsp;</p>).
(6) I don't see how this could be easily solved (how can the RTE 
distinguish between a real sentence and just some words that are 
deliberately placed on a separate line?). Perhaps by introducing an 
option to remove line breaks which occur inside a sentence _only_ when 
the sentence is inside a paragraph.
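
To make (1) to (5) more concrete, here is a rough sketch in Python. It 
is not rtehtmlarea code and not a proposal for the actual API: the 
function name and the option names are made up for illustration, and it 
uses plain regular expressions where a real implementation would more 
likely walk the DOM. It also ignores <pre> blocks, where tabs and 
multiple spaces would have to be kept.

```python
import re


def clean_pasted_html(
    html,
    *,
    strip_tabs=True,       # (2) hypothetical option name
    collapse_spaces=True,  # (3) hypothetical option name
    strip_stray_br=True,   # (1) hypothetical option name
    unwrap_li_p=True,      # (4) hypothetical option name
    drop_empty_p=True,     # (5) hypothetical option name
):
    """Apply cleanup rules (1)-(5) to a pasted HTML fragment (regex sketch)."""
    if strip_tabs:
        # (2) Turn every tab into a single space.
        html = html.replace("\t", " ")
    if collapse_spaces:
        # (3) Collapse runs of two or more spaces into one space.
        html = re.sub(r" {2,}", " ", html)
    if strip_stray_br:
        # (1) Drop <br /> tags sitting between </p> and the next <p>.
        html = re.sub(
            r"</p>\s*(?:<br\s*/?>\s*)+(?=<p\b)", "</p>", html, flags=re.I
        )
    if unwrap_li_p:
        # (4) Unwrap a <p> only when it is the single paragraph in its <li>;
        # list items holding several paragraphs are left untouched.
        html = re.sub(
            r"<li>\s*<p>((?:(?!</?p\b).)*)</p>\s*</li>",
            r"<li>\1</li>",
            html,
            flags=re.I | re.S,
        )
    if drop_empty_p:
        # (5) Remove paragraphs containing only whitespace and/or &nbsp;.
        html = re.sub(r"<p>(?:\s|&nbsp;)*</p>", "", html, flags=re.I)
    return html
```

Each rule has its own switch, so an admin who wants to keep empty 
paragraphs, for example, could call it with drop_empty_p=False.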

What do you think about the whole thing? Is this important to you, or 
do you see it differently?
I'd like to hear your opinions on this,
Ro

----
The following example text is copied from a multi-column document opened 
in Acrobat Reader:
----
This text is messed
up because of line
breaks which occur
inside sentences.
Imagine pasting this
text in RTE. It won't
look very good in the
FE.
----
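
The idea from (6) could look roughly like the following Python sketch. 
It is only a heuristic under my own assumptions: the function name is 
made up, "end of sentence" simply means the text before the break ends 
with one of . ! ? :, and only <br /> tags inside <p>...</p> are touched.

```python
import re

BR = re.compile(r"\s*<br\s*/?>\s*", re.I)
PARAGRAPH = re.compile(r"(<p\b[^>]*>)(.*?)(</p>)", re.I | re.S)
SENTENCE_END = (".", "!", "?", ":")  # assumed sentence terminators


def join_soft_breaks(html):
    """Replace <br /> inside paragraphs with a space when it splits a sentence."""

    def fix(match):
        open_tag, body, close_tag = match.groups()
        parts = BR.split(body)
        out = parts[0]
        for part in parts[1:]:
            before = out.rstrip()
            if not before or before.endswith(SENTENCE_END):
                # Break at the start or after a finished sentence: keep it.
                out += "<br />" + part
            else:
                # Break in the middle of a sentence: join with a space.
                out += " " + part
        return open_tag + out + close_tag

    return PARAGRAPH.sub(fix, html)
```

Applied to the example above (as one paragraph with a <br /> at the end 
of every line), this joins the lines within each sentence but keeps the 
break after "inside sentences.", because the heuristic cannot tell 
whether a break that follows a complete sentence was intended.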


