[Typo3] pdf generator output encoded with utf-8
Boris Senker
typo3 at dvotocka.hr
Wed Apr 20 23:44:41 CEST 2005
"Sacha Vorbeck" <sachav at gmx.net> wrote in message
news:mailman.1.1113986195.2425.typo3-english at lists.netfielders.de...
> Hi,
>
> I switched a site to utf-8 and the only problem remaining is that the
> chars contained in the PDFs created by the pdfgenerator extension are not
> encoded correctly.
No solution from me, but I can at least provide some insight into this issue.
Yes, the same problem faces applications that use the FPDF.org library,
although one of the developers of another CMS made UFPDF, a Unicode/UTF-8
extension for FPDF, just for this purpose. Looking at its code will show you
the needed steps - basically you would need to convert the font encoding and
remap the characters to the appropriate positions. Which is, to say the
least, a lot of work.
Read http://www.synop.com/Weblogs/Richard/ here, the paragraph 'The charset
within the machine':
Quote:
'we're using HTMLDOC to generate PDFs from microcontent encoded as UTF-8,
but HTMLDOC doesn't yet support UTF-16BE, so we pre-convert our incoming
UTF-8 data to CP1252, before passing it to HTMLDOC to remap inside the PDF.
The roundtrip is interesting, because the data was originally stored in a
Microsoft Access database, and was converted from CP1252 to UTF-8 when it
was exported for use in Sytadel.'
> On this page: http://www.easysw.com/htmldoc/faq.php?27 I found a note that
> utf-8 is not supported by htmldoc.
Obviously the HTMLDOC guys are working on a workaround.
> Do you have any ideas or workarounds for this?
Well, basically what I wrote above. Which, unless someone posts a solution
found elsewhere, would require a lot of PHP work and a lot of font
exploration and character remapping. Every solution I have seen for PDF
creator apps to support UTF-8 was based on re-encoding and remapping.
I am so angry about this. UTF-8, as an encoding form of Unicode which
actually encodes Unicode characters, is a de facto ISO standard, and I have
taken it as the basis of our development of sites in the Croatian language,
while the CP-125x encodings are Windows code pages and not standards of any
kind. The FPDF class patched with the above-mentioned UFPDF can work its way
around this, so perhaps someone will do the same for HTMLDOC or make a TYPO3
extension that uses FPDF/UFPDF, which is also very good.
To illustrate, from the Adobe PDF 1.4 Reference:
'3.8.1 Text Strings
Certain strings contain information that is intended to be human-readable,
such as text annotations, bookmark names, article names, document
information, and so forth. Such strings are referred to as text strings.
Text strings are encoded in either PDFDocEncoding or Unicode character
encoding. PDFDocEncoding is a superset of the ISO Latin 1 encoding and is
documented in Appendix D. Unicode is described in the Unicode Standard by
the Unicode Consortium (see the Bibliography). For text strings encoded in
Unicode, the first two bytes must be 254 followed by 255, representing the
Unicode byte order marker, U+FEFF. (This sequence conflicts with the
PDFDocEncoding character sequence thorn ydieresis, which is unlikely to be a
meaningful beginning of a word or phrase.) The remainder of the string
consists of Unicode character codes, according to the UTF-16 encoding
specified in the Unicode standard, version 2.0. Commonly used Unicode values
are represented as 2 bytes per character, with the high-order byte appearing
first in the string.'
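So a text string (bookmarks, document info and the like - not the page
content itself, which goes through the font encoding) can carry any Unicode
text once it is converted to UTF-16BE and prefixed with that byte order
marker. A minimal PHP sketch of just that conversion step, assuming iconv
is available; the function name is mine:

<?php
// Turn a UTF-8 string into a PDF "text string" as described above:
// the U+FEFF byte order marker (bytes 254, 255) followed by UTF-16BE data.
function pdf_text_string($utf8)
{
    return "\xFE\xFF" . iconv('UTF-8', 'UTF-16BE', $utf8);
}

// Croatian sample; the result can go wherever the PDF expects a text string.
$bookmark = pdf_text_string('Sadržaj dokumenta');
?>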
UTF-8 actually transforms Unicode:
'UTF-8 stands for Unicode Transformation Format-8. It is an octet (8-bit)
lossless encoding of Unicode characters.
UTF-8 encodes each Unicode character as a variable number of 1 to 4 octets,
where the number of octets depends on the integer value assigned to the
Unicode character. It is an efficient encoding of Unicode documents that use
mostly US-ASCII characters because it represents each character in the range
U+0000 through U+007F as a single octet. UTF-8 is the default encoding for
XML.'
http://www.utf-8.com/
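You can see the variable octet count directly in PHP, since strlen() counts
bytes rather than characters (assuming the script file itself is saved as
UTF-8):

<?php
// strlen() returns octets, so it shows how many bytes UTF-8 spends per character.
echo strlen('a') . "\n";   // 1 octet  - U+0061, US-ASCII range
echo strlen('ž') . "\n";   // 2 octets - U+017E, Latin Extended-A
echo strlen('€') . "\n";   // 3 octets - U+20AC, euro sign
?>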
However, I wonder... how could iconv be used as a middleman to make the
conversion?
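Something along these lines, I imagine - a sketch only, assuming iconv is
compiled in and that the fonts used by the pdf generator are in an 8-bit
encoding that covers Croatian (ISO-8859-2 here):

<?php
// Keep the site in UTF-8 and only re-encode text on its way into the PDF class.
// The target charset must match whatever encoding the generator's fonts use.
$utf8   = 'Čestitke i pozdravi iz Zagreba';
$latin2 = iconv('UTF-8', 'ISO-8859-2//TRANSLIT', $utf8);

// ...then pass $latin2 to the PDF generator instead of the raw UTF-8 string.
?>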
Additionally, a quote from http://sourceforge.net/forum/forum.php?forum_id=456543 :
BEGIN QUOTE
'What about UTF-8?
UTF8 is the standard character set used by WordPress, and if the user
doesn't change it (which is probably true for most WP installations), all
entries submitted with WordPress use UTF8. This is not a problem until the
user installs WP2PDF. WP2PDF does not support UTF8. It uses the font's
format, which is ISO-8859-1 (also known as Latin1) for all supplied fonts.
Since UTF-8 uses the basic ASCII character set and each of these characters
have the same position as in ISO-8859-1, some characters will look right,
while some other characters (for example Umlauts: ü, ö, ä) have a different
position in UTF-8. Most users with a blog in English language will not even
note the difference - all characters look right to them. But if you've typed
text in French, German, Italian, Swedish or any other non-English language,
some characters will look garbled.
The temporary solution
The solution sounds pretty obvious: WP2PDF does not support UTF-8 but
ISO-8859-1, so we have to convert the UTF-8 strings to ISO-8859-1. WP2PDF
already does this, it uses iconv to convert the strings. However, iconv is
not installed on every webserver and in that case, WP2PDF will just skip the
conversion and output the string as it gets it.
Another problem is that this kind of conversion will only cover languages
that are (at least partly) supported by ISO-8859-1, namely: Albanian,
Basque, Catalan, Danish, Dutch, English, Faroese, French, Finnish, German,
Icelandic, Irish, Italian, Norwegian, Portuguese, Rhaeto-Romanic, Scottish,
Spanish, Swedish. All other languages, including (but not limited to)
Russian, Japanese, Chinese, Indian (Hindi, Tamil ...) and many more, are
just not supported. This is bad, and thus conversion is a bad idea.
The real solution
The real solution is even more obvious, but not easy at all: build in UTF-8
support into WP2PDF. WP2PDF uses FPDF, a free PDF library that does most of
the PDF conversion. FPDF has no support for UTF8 at all, which means I have
to rewrite it. This is the first problem. The second and even bigger problem
is, that all included fonts use ISO-8859-1 and I have no idea where I could
get freeware TTF-Fonts which use Unicode (if you know where, drop me a
line). I can't use the ones supplied with Windows, for example, because they
are copyrighted.'
END QUOTE
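The 'temporary solution' above boils down to a check like this (my sketch,
not the actual WP2PDF code) - and the pass-through branch is exactly where
the garbled output comes from on servers without iconv:

<?php
// Convert to the fonts' encoding when iconv is available, otherwise pass the
// string through untouched - which is when non-Latin-1 characters break.
function to_latin1($utf8)
{
    if (function_exists('iconv')) {
        return iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $utf8);
    }
    return $utf8; // no conversion possible on this server
}
?>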
Ok, some fonts are here:
http://www.alanwood.net/unicode/fonts.html
But the conversion issue remains. :(
Boris Senker
: dvotocka design
________________________________________________________________
Graphic Design for Print and Web, Prepress, Website Production
J. Laurencica 8, 10000 Zagreb, Croatia
http://www.dvotocka.hr