[TYPO3-core] RFC #7984: Bug: stdWrap.crop now closes opened tags and counts chars correctly
Martin Kutschker
martin.kutschker-n0spam at no5pam-blackbox.net
Sat Apr 5 14:53:44 CEST 2008
Jochen Rau schrieb:
> Hello Martin,
>
>> [...] you cannot simply change the behaviour of crop as it might me be
>> used on plain text. So this will have to be stdWrap.cropHTML (or
>> stdWrap.cropXML) which makes at new feature. Which makes it TYPO3_4-2
>> only, that is if you get the ok from the release managers. RC1 is
>> already out!
>
> You can still crop plain text with the patched stdWrap.crop. So I don't
> think we need a new feature like stdWrap.cropXML. The patch is just a
> bugfix IMO.
How come? If my content was "blah <foobar> whatever" the patch will not
count the content of <foobar> and will add </foobar>, so this is a
change of the behaviour.
>> Not terribly thrilling I'm afraid. It's simple enough to use either
>> strpos() or preg_match() * to walk through the html without having to
>> split the entire content.
> I didn't intend to thrill you ;-).
My apologize for my wording. Fortunately you seem to have a good humour.
> Before I posted the patch I tested
> three solutions:
> 1) crawling through the content only with strpos() and preg_match()
> 2) utilizing a stack and a recursive function
> 3) splitting the content using preg_split and crawling over the array
>
> IMO the last solution is the best because the code is fast, readable and
> (relatively) easy to maintain and debug (the code blocks that split,
> crop and close tags are clearly separated).
I'm amazed that splitting the whole content is faster, but the reason is
probably the size of the content. I agree that is is unlike that you
crop 500Kb of text, so it will maybe not matter much in real world
situations.
As for readability: tastes vary ;-) I was rather confused when I saw the
multiple for-loop that process the text/arrays.
BTW, can you not build up a simple stack of open tags, so you easily
close the tags without having to parse the text again.
>> A more serious problem is that you use html_entity_decode() which
>> supports only a subset of the charset TYPO3 supports.
>
> The most important charsets are supported by html_entity_decode() (like
> utf-8, ISO-8859-1, ISO-8859-15, GB2312 or EUC-JP. It has also a fallback
> to ISO-8859-1 and throws a warning if a charset is not supported.
Sorry, but this is not enough. The code must work for all charsets.
> Furthermore it is not recommended (see PHP-doc) to use entities if the
> charset is multibyte.
Where? And with multibyte you mean utf-8? There are other multi-byte
encodings that are not of the Unicode charset. Also you cannot assume
the content passed to the function contains no entities just becasue it
is "not recommended".
>> I also do not understand why you have a list of tags to split on. It's
>> much easier and forward compatible to work on all tags that come along.
>
> You are right. I first intended to restrict the parsing to a subset of
> tags. It's not necessary anymore. The new regex (see below) should do
> that now.
Thanx.
>
>> Thanx for trying, but there is IMHO much room for improvement.
>
> Thank you, too. I hope I'm on the right way ;-)
>
>> * eg
>> preg_match('/<([^ >]+)([^>]*)>/', $html, $match, PREG_OFFSET_CAPTURE)
>
> Your suggested (and my former) regex don't take all valid HTML-tags into
> account like the following example shows:
>
> <img title="next >" src="next.gif"> // valid HTML
Hm, I'm used to use entities here but, you're right the HTML 4.01 spec
only advises HTML authors to use > within attributes.
I fear that I am not alone with that. At least TYPO3 generated content
will have all attributes values processed by htmlspecialchars().
Nitpicking: does the code cope with attributes enclosed in single quotes
or unenclosed content like <img src='image.gif'> or <img src=image.gif>?
They are valid too (though only as HTML and not as XHTML).
Masi
More information about the TYPO3-team-core
mailing list