[Typo3] pdf generator output encoded with utf-8
Boris Senker
typo3 at dvotocka.hr
Wed Apr 20 23:44:41 CEST 2005
"Sacha Vorbeck" <sachav at gmx.net> wrote in message
news:mailman.1.1113986195.2425.typo3-english at lists.netfielders.de...
> Hi,
>
> I switched a site to utf-8 and the only problem remaining is that the
> chars contained in the PDFs created by the pdfgenerator extension are not
> encoded correctly.
No solution from me, but I can at least provide some insight into this issue.
Yes, the same problem faces applications that use the FPDF.org library,
although one of the developers of another CMS made UFPDF, a Unicode/UTF-8
extension for FPDF, just for this purpose. Looking at its code will show you
the needed steps - basically you would need to convert the font encoding and
remap the characters to the appropriate positions. Which is, to say the
least, a lot of work.
Read http://www.synop.com/Weblogs/Richard/ here, the paragraph 'The charset
within the machine':
Quote:
'we're using HTMLDOC to generate PDFs from microcontent encoded as UTF-8,
but HTMLDOC doesn't yet support UTF-16BE, so we pre-convert our incoming
UTF-8 data to CP1252, before passing it to HTMLDOC to remap inside the PDF.
The roundtrip is interesting, because the data was originally stored in a
Microsoft Access database, and was converted from CP1252 to UTF-8 when it
was exported for use in Sytadel.'
> On this page: http://www.easysw.com/htmldoc/faq.php?27 I found a note that
> utf-8 is not supported by htmldoc.
Obviously the HTMLDOC guys are working on a workaround.
> Do you have any ideas or workarounds for this?
Well, basically what I wrote above. Which, unless someone posts a solution
found elsewhere, would require a lot of PHP work and a lot of font
exploration and character remapping. Every solution I have seen for PDF
creator apps to support UTF-8 was based on re-encoding and remapping.
I am so angry about this. UTF-8, as an encoding form of Unicode which
actually encodes Unicode characters, is a de facto ISO standard, and I have
taken it as the basis of our development of sites in the Croatian language,
while the CP-125x encodings are Windows code pages and not standards of any
kind. The FPDF class patched with the above-mentioned UFPDF can work its way
around this, so perhaps someone will do the same for HTMLDOC or make a TYPO3
extension that uses FPDF/UFPDF, which is also very good.
To illustrate, from the Adobe PDF 1.4 Reference:
'3.8.1 Text Strings
Certain strings contain information that is intended to be human-readable,
such as text annotations, bookmark names, article names, document
information, and so forth. Such strings are referred to as text strings.
Text strings are encoded in either PDFDocEncoding or Unicode character
encoding. PDFDocEncoding is a superset of the ISO Latin 1 encoding and is
documented in Appendix D. Unicode is described in the Unicode Standard by
the Unicode Consortium (see the Bibliography). For text strings encoded in
Unicode, the first two bytes must be 254 followed by 255, representing the
Unicode byte order marker, U+FEFF. (This sequence conflicts with the
PDFDocEncoding character sequence thorn ydieresis, which is unlikely to be a
meaningful beginning of a word or phrase.) The remainder of the string
consists of Unicode character codes, according to the UTF-16 encoding
specified in the Unicode standard, version 2.0. Commonly used Unicode values
are represented as 2 bytes per character, with the high-order byte appearing
first in the string.'
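So a text string (bookmarks, document info and the like - not the page
content itself, which goes through the font encoding) can carry any Unicode
text once it is converted to UTF-16BE and prefixed with that byte order
marker. A minimal PHP sketch of just that conversion step, assuming iconv
is available; the function name is mine:

<?php
// Turn a UTF-8 string into a PDF "text string" as described above:
// the U+FEFF byte order marker (bytes 254, 255) followed by UTF-16BE data.
function pdf_text_string($utf8)
{
    return "\xFE\xFF" . iconv('UTF-8', 'UTF-16BE', $utf8);
}

// Croatian sample; the result can go wherever the PDF expects a text string.
$bookmark = pdf_text_string('Sadržaj dokumenta');
?>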
UTF-8 actually transforms Unicode:
'UTF-8 stands for Unicode Transformation Format-8. It is an octet (8-bit)
lossless encoding of Unicode characters.
UTF-8 encodes each Unicode character as a variable number of 1 to 4 octets,
where the number of octets depends on the integer value assigned to the
Unicode character. It is an efficient encoding of Unicode documents that use
mostly US-ASCII characters because it represents each character in the range
U+0000 through U+007F as a single octet. UTF-8 is the default encoding for
XML.'
http://www.utf-8.com/
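You can see the variable octet count directly in PHP, since strlen() counts
bytes rather than characters (assuming the script file itself is saved as
UTF-8):

<?php
// strlen() returns octets, so it shows how many bytes UTF-8 spends per character.
echo strlen('a') . "\n";   // 1 octet  - U+0061, US-ASCII range
echo strlen('ž') . "\n";   // 2 octets - U+017E, Latin Extended-A
echo strlen('€') . "\n";   // 3 octets - U+20AC, euro sign
?>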
However, I wonder... how could iconv be used as a middleman to make the
conversion?
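Something along these lines, I imagine - a sketch only, assuming iconv is
compiled in and that the fonts used by the pdf generator are in an 8-bit
encoding that covers Croatian (ISO-8859-2 here):

<?php
// Keep the site in UTF-8 and only re-encode text on its way into the PDF class.
// The target charset must match whatever encoding the generator's fonts use.
$utf8   = 'Čestitke i pozdravi iz Zagreba';
$latin2 = iconv('UTF-8', 'ISO-8859-2//TRANSLIT', $utf8);

// ...then pass $latin2 to the PDF generator instead of the raw UTF-8 string.
?>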
Additionally, a quote from http://sourceforge.net/forum/forum.php?forum_id=456543 :
BEGIN QUOTE
'What about UTF-8?
UTF8 is the standard character set used by WordPress, and if the user
doesn't change it (which is probably true for most WP installations), all
entries submitted with WordPress use UTF8. This is not a problem until the
user installs WP2PDF. WP2PDF does not support UTF8. It uses the font's
format, which is ISO-8859-1 (also known as Latin1) for all supplied fonts.
Since UTF-8 uses the basic ASCII character set and each of these characters
have the same position as in ISO-8859-1, some characters will look right,
while some other characters (for example Umlauts: ü, ö, ä) have a different
position in UTF-8. Most users with a blog in English language will not even
note the difference - all characters look right to them. But if you've typed
text in French, German, Italian, Swedish or any other non-English language,
some characters will look garbled.
The temporary solution
The solution sounds pretty obvious: WP2PDF does not support UTF-8 but
ISO-8859-1, so we have to convert the UTF-8 strings to ISO-8859-1. WP2PDF
already does this, it uses iconv to convert the strings. However, iconv is
not installed on every webserver and in that case, WP2PDF will just skip the
conversion and output the string as it gets it.
Another problem is that this kind of conversion will only cover languages
that are (at least partly) supported by ISO-8859-1, namely: Albanian,
Basque, Catalan, Danish, Dutch, English, Faroese, French, Finnish, German,
Icelandic, Irish, Italian, Norwegian, Portuguese, Rhaeto-Romanic, Scottish,
Spanish, Swedish. All other languages, including (but not limited to)
Russian, Japanese, Chinese, Indian (Hindi, Tamil ...) and many more, are
just not supported. This is bad, and thus conversion is a bad idea.
The real solution
The real solution is even more obvious, but not easy at all: build in UTF-8
support into WP2PDF. WP2PDF uses FPDF, a free PDF library that does most of
the PDF conversion. FPDF has no support for UTF8 at all, which means I have
to rewrite it. This is the first problem. The second and even bigger problem
is, that all included fonts use ISO-8859-1 and I have no idea where I could
get freeware TTF-Fonts which use Unicode (if you know where, drop me a
line). I can't use the ones supplied with Windows, for example, because they
are copyrighted.'
END QUOTE
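The 'temporary solution' above boils down to a check like this (my sketch,
not the actual WP2PDF code) - and the pass-through branch is exactly where
the garbled output comes from on servers without iconv:

<?php
// Convert to the fonts' encoding when iconv is available, otherwise pass the
// string through untouched - which is when non-Latin-1 characters break.
function to_latin1($utf8)
{
    if (function_exists('iconv')) {
        return iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $utf8);
    }
    return $utf8; // no conversion possible on this server
}
?>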
Ok, some fonts are here:
http://www.alanwood.net/unicode/fonts.html
But the conversion issue remains. :(
Boris Senker
: dvotocka design
________________________________________________________________
Graphic Design for Print and Web, Prepress, Website Production
J. Laurencica 8, 10000 Zagreb, Croatia
http://www.dvotocka.hr