[TYPO3-core] RFC #7984: Bug: stdWrap.crop now closes opened tags and counts chars correctly

Sat Apr 5 13:42:41 CEST 2008

Hello Martin,

> [...] you cannot simply change the behaviour of 
> crop as it might be used on plain text. So this will have to be 
> stdWrap.cropHTML (or stdWrap.cropXML), which makes it a new feature, which 
> makes it TYPO3_4-2 only, that is if you get the ok from the release 
> managers. RC1 is already out!

You can still crop plain text with the patched stdWrap.crop. So I don't 
think we need a new feature like stdWrap.cropXML. The patch is just a 
bugfix IMO.

> Not terribly thrilling I'm afraid. It's simple enough to use either 
> strpos() or preg_match() * to walk through the html without having to 
> split the entire content.

I didn't intend to thrill you ;-). Before I posted the patch I tested 
three solutions:
1) crawling through the content only with strpos() and preg_match()
2) utilizing a stack and a recursive function
3) splitting the content using preg_split and crawling over the array

IMO the last solution is the best because the code is fast, readable and 
(relatively) easy to maintain and debug (the code blocks that split, 
crop and close tags are clearly separated).
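To illustrate the third approach, here is a minimal sketch in Python rather than PHP. It is not the code from the patch: the tag pattern, the list of void tags and the function name crop_html are assumptions made for this example. It ignores entities, negative crop values and word boundaries. The three steps (split, crop, close tags) stay separate:

```python
import re

# Assumed tag pattern: "<", a tag name, then any mix of quoted strings
# and other characters, then ">". The single capture group makes
# re.split() keep the tags in the result list.
TAG = re.compile(r'(<(?=/?[A-Za-z])(?:"[^"]*"|\'[^\']*\'|[^\'">])*>)')
NAME = re.compile(r'</?([^\s/>]+)')
VOID = {"br", "hr", "img", "input"}  # assumed: tags that never get closed


def crop_html(html, limit, suffix="..."):
    """Crop to `limit` visible characters and close any tags left open."""
    # Step 1: split. Even indexes hold text, odd indexes hold tags.
    parts = TAG.split(html)
    out, open_tags, count = [], [], 0
    for i, part in enumerate(parts):
        if i % 2:  # a tag: track it, but do not count its characters
            name = NAME.match(part).group(1).lower()
            if part.startswith("</"):
                if name in open_tags:
                    # remove the most recently opened tag of that name
                    idx = len(open_tags) - 1 - open_tags[::-1].index(name)
                    del open_tags[idx]
            elif not part.endswith("/>") and name not in VOID:
                open_tags.append(name)
            out.append(part)
            continue
        # Step 2: crop once the visible text would exceed the limit.
        if count + len(part) > limit:
            out.append(part[:limit - count] + suffix)
            break
        count += len(part)
        out.append(part)
    # Step 3: close whatever is still open, innermost tag first.
    out.extend("</%s>" % name for name in reversed(open_tags))
    return "".join(out)


print(crop_html('Lorem <strong>ipsum dolor</strong> sit', 9))
print(crop_html('plain text only', 5))
```

The second call shows that plain text without any tags is cropped as before, because the split then yields a single text element.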

> A more serious problem is that you use html_entity_decode() which 
> supports only a subset of the charset TYPO3 supports.

The most important charsets are supported by html_entity_decode() (such as 
UTF-8, ISO-8859-1, ISO-8859-15, GB2312 and EUC-JP). It also falls back 
to ISO-8859-1 and throws a warning if a charset is not supported. 
Furthermore, using entities is not recommended (see the PHP documentation) 
if the charset is multibyte.
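Why entities have to be decoded before counting can be shown with a small Python sketch. It is an illustration only, not the patch's PHP code. The helper name visible_length is hypothetical, and Python's html.unescape() works on Unicode strings, so it has no charset argument like html_entity_decode():

```python
import html


def visible_length(text):
    """Count characters the way a reader sees them: one per entity."""
    return len(html.unescape(text))


raw = "Tom &amp; Jerry"
print(len(raw))             # 15: "&amp;" counted as five characters
print(visible_length(raw))  # 11: "&amp;" counted as one character
```

Without the decoding step, a crop position could also fall in the middle of an entity and leave a broken fragment such as "&am" in the output.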

> I also do not understand why you have a list of tags to split on. It's 
> much easier and forward compatible to work on all tags that come along.

You are right. I originally intended to restrict the parsing to a subset 
of tags. That is no longer necessary. The new regex (see below) should 
match all tags now.

> Thanx for trying, but there is IMHO much room for improvement.

Thank you, too. I hope I'm on the right track ;-)

> * eg
> preg_match('/<([^ >]+)([^>]*)>/', $html, $match, PREG_OFFSET_CAPTURE)

Your suggested regex (and my former one) does not handle all valid HTML 
tags, as the following example shows:

<img title="next >" src="next.gif">   // valid HTML

<img title="next >  // this is what your and my regex matched

The new regex should match all valid (X)HTML and typolink tags (like 
<link john.doe@mydomain.org>).
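The difference can be reproduced with a short Python sketch. NAIVE is the regex quoted above. QUOTE_AWARE is an assumed pattern written for this example, not the regex from the attached patch. It treats quoted attribute values as units, so a ">" inside quotes does not end the tag:

```python
import re

# The regex suggested in the quoted mail: stops at the first ">".
NAIVE = re.compile(r'<([^ >]+)([^>]*)>')

# Assumed alternative: inside a tag, consume whole quoted strings first,
# and otherwise any character that is neither a quote nor ">".
QUOTE_AWARE = re.compile(r'<(?:"[^"]*"|\'[^\']*\'|[^\'">])+>')

tag = '<img title="next >" src="next.gif">'
naive_match = NAIVE.search(tag).group(0)
aware_match = QUOTE_AWARE.search(tag).group(0)
print(naive_match)   # the tag cut off inside the title attribute
print(aware_match)   # the complete tag

# A typolink tag has no quoted attributes and is matched as a whole, too.
link_match = QUOTE_AWARE.search('<link john.doe@mydomain.org>mail</link>').group(0)
print(link_match)
```

This sketch only covers well-formed quoting. A tag with an unbalanced quote is not matched as a complete tag.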

If you want to test the output of the patched stdWrap.crop you can take 
the snippet below as cObj TEXT (disable RTE) and test it with

tt_content.text.20.crop = 20| ... |0

-----
Lorem <strong>ipsum</strong> dolor <strong>amet</STRONG>, consectetur 
<img title="next" src="clear.gif" /> adipisicing elit, sed do 
<STRONG>eiusmod tempor</strong> incididunt ut labore et dolore <link 
john.doe@mydomain.com>sit</link> magna aliqua. Ut enim ad minim veniam, 
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo 
consequat. Duis aute irure dolor in reprehenderit in voluptate velit 
esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat 
cupidatat non proident, sunt in culpa qui officia deserunt mollit anim 
id est laborum.
-----

Greetings from Laax
Jochen
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: bug_7984.diff
Url: http://lists.netfielders.de/pipermail/typo3-team-core/attachments/20080405/4eb3f892/attachment.txt