[TYPO3-Solr] SOLR_CONTENT and HTML output

Sun Aug 10 13:54:01 CEST 2014

Hi,

SOLR_CONTENT is meant to clean up content; it removes tags, entities, 
incorrect UTF-8 characters and so on. There is however a small problem 
with the resulting text:

If it's used in a field in a Solr document and result highlighting is
on, you may end up with a piece of text that is not valid HTML:

Original: [...] the department R&amp;D; HRM is [...]
SOLR_CONTENT: [...] the department R&D; HRM is [...]
Match "department": [...] the <span class="highlight">department</span> 
R&D; HRM is [...]

The validator says that &D; is not a valid entity. htmlSpecialChars
cannot be applied to the highlighted result, because it would also
encode the highlighting tags themselves.
The same problem might occur for other characters which should be
encoded for use in HTML.

Solution?

At the moment the workaround could be to use the SOLR_CONTENT object 
inside a COA and apply htmlSpecialChars to it.
Maybe it would be useful for SOLR_CONTENT to get a property to set the
target context (HTML / JS / Text / ...) and apply the proper encoding
before sending it off to the Solr index.
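
To make the ordering issue concrete, here is a minimal sketch in Python
(not EXT:solr code). The highlight() helper is a hypothetical stand-in
for Solr's highlighter, and html.escape() plays the role of
htmlSpecialChars. It only shows that encoding has to happen before the
highlighting tags are added, not after:

```python
import html
import re


def highlight(text, term):
    """Hypothetical stand-in for Solr highlighting: wrap matches in a span."""
    pattern = re.compile(re.escape(term))
    return pattern.sub(
        lambda m: '<span class="highlight">' + m.group(0) + "</span>", text
    )


# What SOLR_CONTENT hands to the index in the example above.
cleaned = "the department R&D; HRM is"

# Encoding after highlighting also encodes the <span> tags.
broken = html.escape(highlight(cleaned, "department"), quote=False)

# Encoding before indexing keeps the tags intact and the entity valid.
fixed = highlight(html.escape(cleaned, quote=False), "department")

print(broken)
print(fixed)
# fixed: the <span class="highlight">department</span> R&amp;D; HRM is
```

One caveat with this approach: once the indexed text contains entities,
a search term could in principle match inside one of them (for example
"amp"), so the target context should probably be chosen per field.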

What do you think?

-- 
Jigal van Hemert
TYPO3 CMS Active Contributor