[TYPO3-RTE] Cleaning pasted content
Robert Markula
robert.markula at gmx.net
Thu Dec 29 10:08:52 CET 2005
Hi Stanislas & List,
There are many settings for rtehtmlarea or the RTE API in general which
enable the admin to control the input to a certain degree for the sake
of a consistent output in the FE (like removing attributes from certain
tags or removing certain tags in general). That is very good, since
consistent output is very important.
However, when pasting content from other sources (websites, Word,
OpenOffice.org etc.), the current input control may not be sufficient,
especially when the source is not well-formed from the perspective of
the RTE (even more so when tables are disabled with 'removeTags = table,
tbody, td, th, thead, tr').
Here are a few things I came across:
(1) There can be constructs like </p><br /><p>, which are usually not
intended and mess up paragraphs.
(2) The same applies to tabs within sentences.
(3) Some editors have the annoying habit of positioning text with a lot
of empty spaces (tabs would be better, but are still a pain when pasting
such content into the RTE - see (2)). This can also happen to any editor
who accidentally presses the 'space' key more than once.
(4) When pasting lists from OpenOffice, there are <p> tags within lists:
<li><p>Some text here</p></li>.
(5) Some admins may even want to remove empty paragraphs (<p> </p>).
(6) When pasting text from Acrobat Reader, there are soft line breaks
(<br />) within sentences because the text is copied exactly as you see
it in Acrobat Reader. The same happens to text copied from e-mails,
because in plain-text e-mails text is usually wrapped after 76
characters. See below for an example.
Points (1) and (2) can be reproduced by opening the rtehtmlarea manual
in OpenOffice.org and pasting the content in htmlarea. Then scroll down
to the bottom of the text.
(3) is well known to admins who have this kind of editor.
(4) can be reproduced by pasting any list from OpenOffice.org.
To reproduce (5), just add an empty paragraph.
(6): Open a PDF document in Acrobat Reader, select the 'select text'
tool and copy text to the RTE.
Solutions to these "problems" might be (the numbers apply to the list
above):
(1) Introduce an option to remove <br /> tags when they occur outside
paragraphs.
(2) Introduce an option to remove tabs.
(3) Do the same with multiple 'space' characters.
(4) Remove P tags when they occur inside list tags.
(5) Introduce an option to completely remove empty paragraphs
(<p> </p>).
(6) I don't see how this could be easily solved (how can the rte
distinguish between a real sentence and just some words that are
deliberately placed on a separate line?). Perhaps by introducing an
option to remove line breaks which occur inside a sentence _only_ when
the sentence is inside a paragraph.
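To make options (1) to (5) more concrete, here is a rough sketch of the
clean-up they describe. It is written in Python purely for illustration
and is not rtehtmlarea code; the function name clean_pasted_html is
hypothetical, and the sketch assumes the pasted tags carry no attributes
and that there is no preformatted text whose whitespace must survive.

```python
import re


def clean_pasted_html(html: str) -> str:
    """Illustrative clean-up for points (1) to (5); not part of any RTE API."""
    # (2) Replace tabs with a single space.
    html = html.replace("\t", " ")
    # (3) Collapse runs of two or more spaces into one.
    html = re.sub(r" {2,}", " ", html)
    # (1) Drop <br /> tags that sit between two paragraphs.
    html = re.sub(r"</p>\s*(?:<br\s*/?>\s*)+<p>", "</p><p>", html, flags=re.I)
    # (4) Unwrap a <p> inside <li>, but only when it is the sole paragraph
    #     of that list item (assumption: attribute-less tags).
    html = re.sub(
        r"<li>\s*<p>((?:(?!</?p[\s>]).)*)</p>\s*</li>",
        r"<li>\1</li>",
        html,
        flags=re.I | re.S,
    )
    # (5) Remove paragraphs that contain only whitespace or &nbsp;.
    html = re.sub(r"<p>(?:\s|&nbsp;)*</p>", "", html, flags=re.I)
    return html


if __name__ == "__main__":
    pasted = "<p>One\ttwo   three</p><br /><p> </p><ul><li><p>Item</p></li></ul>"
    print(clean_pasted_html(pasted))
```

Each step would ideally map to its own switch, so that an admin can
enable only the rules that suit his editors.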
What do you think about the whole thing? Is this important to you, do
you have other opinions?
Like to hear your opinions on this,
Ro
----
The following example text is copied from a multi-column document opened
in Acrobat Reader:
----
This text is messed
up because of line
breaks which occur
inside sentences.
Imagine pasting this
text in RTE. It won't
look very good in the
FE.
----
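One possible heuristic for (6), again only as a Python sketch and not as
existing RTE behaviour: inside a paragraph, turn a <br /> into a space
unless the text before it ends with sentence punctuation. The name
unwrap_soft_breaks and the choice of punctuation characters are
assumptions made for this example.

```python
import re

_PARAGRAPH = re.compile(r"(<p>)(.*?)(</p>)", re.I | re.S)
_BREAK = re.compile(r"\s*<br\s*/?>\s*", re.I)
_SENTENCE_END = (".", "!", "?", ":")


def unwrap_soft_breaks(html: str) -> str:
    """Join lines split by <br /> inside <p>, keeping breaks after a sentence end."""

    def fix(match: re.Match) -> str:
        parts = _BREAK.split(match.group(2))
        out = parts[0]
        for part in parts[1:]:
            # Keep the break only if the preceding text ends a sentence.
            keep = out.rstrip().endswith(_SENTENCE_END)
            out += ("<br />" if keep else " ") + part
        return match.group(1) + out + match.group(3)

    # Text outside <p>...</p> is left untouched.
    return _PARAGRAPH.sub(fix, html)


if __name__ == "__main__":
    sample = (
        "<p>This text is messed<br />up because of line<br />breaks which occur"
        "<br />inside sentences.<br />Imagine pasting this<br />text in RTE. "
        "It won't<br />look very good in the<br />FE.</p>"
    )
    print(unwrap_soft_breaks(sample))
```

Run on the example above, this joins the broken lines but keeps the
break after "inside sentences.", because it cannot tell whether a break
that follows a full stop was intended. That is exactly the ambiguity
described in (6).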