Limited Unicode Support in LibreOffice 4.1.4.2? Character insertion, non-zero (SMP, SIP) planes, and multi-language documents.

As a newcomer I am trying out LibreOffice 4.1.4.2 and am quite satisfied with most of its features, especially the experimental sidebar panel. However, several issues occurred when I was playing around with Unicode characters in LO.

Environment: MS Win7 64bit Ultimate English with French and Chinese language pack, non-Unicode code page: Chinese (CP936), LibreOffice 4.1.4.2, JRE 1.7.0_09 (both 32-bit and 64-bit).

THE "SPECIAL CHARACTERS" DIALOG

Hi :slight_smile:
I am not sure about the Unicode issue. Might it be something the
international translators have some ideas about? It might be
worth posting some of the questions on their mailing list
L10n@Global.LibreOffice.Org
but try to keep it short, as not everyone there is as advanced with
English as you. Most are, but make it easy for them to help by keeping
it short! It's a fairly low-traffic list (unlike this one!) and not
really meant for user support.

On a side issue ...
The biggest boost in productivity seems to happen when people read up
a bit about styles
https://wiki.documentfoundation.org/Documentation/Publications
and they also make the biggest impact on the quality of the output =
finished documents.

Sadly the way MS Office uses styles tends to introduce weirdnesses, so
a lot of people seem unwilling to try to understand them at all. I
was. When I did start I used them as little as possible, but found
such huge improvements that it was worth experimenting with them more.

I've finally got to the state where producing 'my' company's
newsletter has dropped from taking about 2 weeks to about 1 or 2
afternoons, and the result is far better.
Regards from
Tom :slight_smile:

Hi Neil:

You don't mention whether you are solely using LibreOffice's language
support features or whether you are also using one of the OS-wide "input
method" utilities that are available.

I've found that, with Linux at least, Writer sometimes "helps" too much by
silently substituting fonts for you and doesn't always make good choices -
leading to it looking like one font (which you know has the character you
want) is in use, when it really isn't.

Take a look at whatever paragraph style is in use where you're having the
difficulty. Edit the style and, on the "Font" tab - assuming Writer knows
you're fiddling with multiple languages - you will see sections called
"Asian Text Font" and "CTL Font."

Whatever these are set to, Writer attempts to use them when that particular
"Western Text Font" is in use and you choose another language. Go to one of
the places where you have the problems and see if changing those alternate
fonts solves your problem. If so, you need to hit the various style
definitions to get things working correctly.

I use the iBus input method (on Linux), and ran into similar difficulties
with Thai until I figured out how to make iBus and Writer play together
nicely (which they can and now do). I like using iBus (or an external input
method) because it works with any application I use in a consistent manner,
and I seldom need to use "Insert Special Character": activating a
different keyboard is done with a simple keystroke, and when you type
characters that ought to be combined (e.g. accents, diacritics and so
forth), this happens automagically so long as you have a font that knows
about such things.

I hope this helps you figure out how to get things working.

Frank

Neil,

I was interested to read your note and have a few thoughts.

Environment: MS Win7 64bit Ultimate English with French and Chinese
language pack, non-Unicode code page: Chinese (CP936), LibreOffice
4.1.4.2, JRE 1.7.0_09 (both 32-bit and 64-bit).

Is there any reason you are still using MS CP936 on your system rather
than Unicode? You should normally expect the system code settings to be
equal to or higher than those of your applications; otherwise you can have
problems, for example when cutting and pasting. LibO is Unicode-based,
and I note that MS would also suggest Unicode.

http://msdn.microsoft.com/en-US/goglobal/cc305153.aspx

... Using Unicode is recommended in preference to any code page because it
has better language support and is less ambiguous than any of the code
pages.

2. wrong detection of unicode ranges

A more complicated problem with the Special Character dialog is the
mis-identification of supported code points in a font—LO doesn't seem
to handle this correctly at all. Instead, it displays blocks or squared
question marks or blanks for the unsupported glyphs in a Unicode range
partially supported by the font, and glyphs from fallback fonts assigned
by the OS for a Unicode range the font does not support at all—with
almost no sign of suppressing these unsupported glyphs or ranges from
display. Only very few fonts are correctly identified (Cardo, Code2000
and Quivira, for example).

There are two issues here: entering text, and displaying the text with a
chosen font. LibO does not know what font you may eventually choose to use
to display some text and therefore follows the Unicode rules as specified
in The Unicode Standard manual. My copy, "The Unicode Standard 5.0" (a
useful addition to any college library), deals with this area in a
section called "Unknown and Missing Characters". LibO appears to follow
the international standards in this area.

� (the replacement character) is a different matter, and is often found
when you cut and paste on a Windows system when the character sets do not
match.
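The mechanism behind those replacement characters can be sketched in a few lines of Python (a minimal illustration of mismatched code pages, not anything specific to LibreOffice): bytes written under one encoding and read under another turn into U+FFFD wherever the byte sequence is invalid.

```python
# Minimal illustration: bytes produced under CP936/GBK (the legacy
# Chinese code page from this thread) and read back as UTF-8.
text = "雷聲"                       # the first two characters from the thread
gbk_bytes = text.encode("cp936")    # encode under the legacy code page

# Reading those bytes as if they were UTF-8 fails; errors="replace"
# substitutes U+FFFD for each undecodable sequence.
garbled = gbk_bytes.decode("utf-8", errors="replace")
print(repr(garbled))                # contains '\ufffd'

# Round-tripping with the matching codec is lossless:
assert gbk_bytes.decode("cp936") == text
```
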

9. wrong font information for mixed language documents

LO Writer failed to display font information correctly for mixed
language documents. Although a number of glyphs from non-Latin writing
systems have been identified and correctly displayed, the font
information is not set correspondingly. One example is Ogham. To
correctly display Ogham glyphs, one needs to use fonts like Code2000 or
Segoe UI Symbol. However, Times New Roman, a Latin/Greek/Cyrillic/Arabic
font, is displayed in the “font name” drop-down or the “character”
dialog when you select Ogham text.

When I type Ogham, which is very seldom, I just enter the Unicode code
point directly using the keyboard (I am using Ubuntu). LibO is set up to
substitute for missing glyphs. For example, Times New Roman does not
contain Ogham, but LibO shows ᚁᚁᚗᚖᚙ taken from a font that has these
characters. Whether that is a good thing has been a subject of discussion.

Anyway, good luck. You may find the LibreOffice manuals useful: either
download the PDFs from the site or purchase paper versions from Lulu.com.
Peter

TomD wrote

Sadly the way MS Office uses styles tends to introduce weirdnesses so
a lot of people seem unwilling to try to understand them at all.

I'll second that. One of the weirdest things with styling in MS Word is that
'format' and 'style' can easily be mixed up, thanks to Word's native support
for creating styles out of formatting. Many users confuse the two concepts,
and end up with documents having dozens of unnecessary styles.

CVAlkan wrote

with Linux at least, Writer sometimes "helps" too much by silently
substituting fonts for you and doesn't always make good choices - leading
to it looking like one font (which you know has the character you want) is
in use, when it really isn't.

Hi, sadly (and interestingly) I feel like my Writer is "helping" too little
by not accepting my font settings at all. Take the "fork and knife"
character (U+1F374) as an example again. Following your advice, I tried
three things:
1. changing font settings in Format→Character
2. changing font settings for the "Default" style which is in use
3. changing language settings to "None", as this character is a sign, not
part of any writing system.
But I still get a square ...

Hi, Peter!
Peter Maunder wrote

Is there any reason you are still using MS CP936 on your system rather
than Unicode?

This is for backward compatibility with older programs not supporting
Unicode. Every Windows distribution has this option; in Win7 it's in Control
Panel→Region and Language→Administrative→Language for non-Unicode programs.
It defaults to the corresponding code page for your OS language. Unicode
programs use UTF-16 and non-Unicode ones use CP936; Windows does the
translation for API calls. I point this out because I have experienced
language encoding issues with programs originally in English, notably with
command-line applications like Perl. Windows is notorious for having Unicode
bugs in its command-line interface.
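That Unicode-to-ANSI translation is where data loss happens. As a rough sketch in plain Python (an illustration, not the actual Windows API): any character outside the code page's repertoire simply cannot be represented.

```python
# Hedged sketch of the Unicode-to-ANSI conversion: characters outside
# CP936's repertoire have no representation at all. The sample
# characters are taken from this thread.
samples = {"CJK": "雷", "Ogham": "\u1681", "fork and knife": "\U0001F374"}
for label, ch in samples.items():
    try:
        ch.encode("cp936")
        result = "survives CP936"
    except UnicodeEncodeError:
        result = "cannot be represented in CP936"
    print(f"{label} U+{ord(ch):04X}: {result}")
```

Only the CJK character survives; the Ogham letter and the supplementary-plane symbol are exactly the kind of text that gets mangled when a non-Unicode program is in the pipeline.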
Peter Maunder wrote

� (the replacement character) is a different matter, and is often found
when you cut and paste on a Windows system when the character sets do
not match.

Thanks for pointing out this difference. I agree with your opinion that LO
conforms to the Unicode Standard. But the problem remains that LO could not
display a glyph even though the right font had been set. This is not an
isolated case: actually, all glyphs not exposed in the Special Characters
dialog for a certain font cannot be displayed, despite the font supporting
them. You can test this using FreeSerif, Code2002 and Segoe UI Symbol.
Peter Maunder wrote

When I type Ogham, which is very seldom, I just enter the Unicode directly
(I am using Ubuntu) using the keys. LibO is set up to substitute for the
missing glyphs.

Is direct Unicode input a feature of LO or of the Linux system?
Windows has only limited support for direct Unicode input, for code points
<=U+FFFF.
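The <=U+FFFF limit comes from UTF-16: a code point above that range is carried as a surrogate pair of two 16-bit units, which a 4-hex-digit input method cannot express. A small Python sketch of the split, using the "fork and knife" character from this thread:

```python
import struct

# Split a supplementary-plane code point into its UTF-16 surrogate pair.
cp = 0x1F374                        # "fork and knife"
v = cp - 0x10000
high = 0xD800 + (v >> 10)           # high (lead) surrogate
low = 0xDC00 + (v & 0x3FF)          # low (trail) surrogate
print(f"U+{cp:04X} -> U+{high:04X} U+{low:04X}")   # U+1F374 -> U+D83C U+DF74

# Cross-check against Python's UTF-16 codec (little-endian, no BOM):
units = struct.unpack("<2H", chr(cp).encode("utf-16-le"))
assert units == (high, low)
```
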


Well, there are three main issues that meet here, and it helps me to keep
them separate. I hope I am not confusing matters further by describing them.

1) Unicode, which your system supports: the standard characters that the
system understands.
2) The fonts you are using and have installed on your system: the
characters they support.
3) LibreOffice, which uses and depends on Unicode and the fonts.

A little history:
Unicode 5.0 was released in July 2006. It encoded 99,089 characters.
Unicode 5.1 was released in March 2008. It encoded 100,713 characters.
Unicode 5.2 was released in October 2009. It encoded 107,361 characters.
Unicode 6.0 was released in October 2010. It encoded 109,449 characters.
As of today, 6.3 is the current version.
There is plenty of space left before the million-plus code points are
filled. Unicode 5 has a gap between U+1D7FF and U+2F800.
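For what it's worth, you can ask a system's own Unicode tables which characters they know about. A Python sketch; note the table version here depends on the Python build, much as it depends on the OS version and fonts elsewhere in this discussion:

```python
import unicodedata

# Which Unicode table version this interpreter ships with:
print(unicodedata.unidata_version)

# Assigned characters have names:
print(unicodedata.name("\U0001F374"))   # FORK AND KNIFE (added in 6.0)
print(unicodedata.name("\u1681"))       # OGHAM LETTER BEITH

# Code points with no character assigned have no name at all:
print(unicodedata.name("\U0002FFFF", "<no character assigned>"))
```
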

On my system, Ubuntu 12.04 with LibreOffice 4.2, using the DejaVu Sans
font: U+1F637 displays in DejaVu Sans as 😷, but the "fork and knife"
character (U+1F374), which is part of Unicode 6.0 and is not in the font,
shows up on my system as 🍴 (from a substitute font) or a blank box. I can
insert the face as a special character, and both characters using the
keyboard (U+1Fxxx).

So it appears the system recognises Unicode 6.3. The font recognises the
face but not the knife and fork, and LibreOffice uses the two. If I had a
font with the correct Unicode knife and fork, I assume that would also show
up. But you will need a font that is new enough to contain the character.
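One way to check what a font actually maps, independent of what any dialog displays, is to read its cmap table. A hedged sketch: the lookup itself is just a dict membership test; reading a real font file uses the third-party fontTools library, and both the font path and the toy cmap below are illustrative assumptions, not measurements of any particular DejaVu release.

```python
def supports(cmap, *codepoints):
    """Report which of the given code points the font's cmap covers."""
    return {cp: cp in cmap for cp in codepoints}

# With a real font (uncomment; requires `pip install fonttools`; the
# path below is a placeholder for wherever your fonts live):
# from fontTools.ttLib import TTFont
# cmap = TTFont("/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf").getBestCmap()

# Toy stand-in mirroring the observation above: a font that has the
# face (U+1F637) but not the fork and knife (U+1F374).
cmap = {0x0041: "A", 0x1F637: "uni1F637"}
print(supports(cmap, 0x1F637, 0x1F374))   # {128567: True, 127860: False}
```
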

Hope this helps.. Peter

Thanks for the suggestion. After reading your reply, I suspect these might be
issues of LibreOffice on Windows (or more precisely, Windows with an Asian
code page configured for non-Unicode applications) but not on Linux.

Peter Maunder wrote

I can insert the face as a special character, and both the characters
using the keyboard U+1Fxxx

I also have DejaVu Sans installed on my Windows machine. In Microsoft Word
2010, I can insert the face as a special character (and paste it into
Windows 7 Notepad/WordPad). In LibreOffice 4.1.4.2, I am not given any
glyphs from this font beyond U+FFEF. I see a number of readable glyphs not
in the typeface of DejaVu Sans but resembling Chinese font outlines.

Besides, at first attempt, I could see some glyphs in the SMP from
DejaVu *Serif*, but after fiddling with the Special Characters dialog for
a minute or two, I am no longer able to see these glyphs in this font.

As for direct keyboard input, Windows appears to be somewhat lagging
behind, allowing no code points outside the BMP.

Peter Maunder wrote

So it appears the system recognises Unicode 6.3. The font recognises the
face but not the knife and fork and LibreOffice uses the two.

In my case, and possibly in Windows' case, the system recognises Unicode
6.3. The font recognises the face but not the knife and fork, but
LibreOffice uses neither of the two.

♪͡♪♪͡♪Neil Ren♪͡♪♪͡♪ wrote

Thanks for the suggestion. After reading your reply, I suspect this might
be issues of LibreOffice on Windows (or more precisely Windows with asian
code page configured for non-Unicode applications) but not on Linux.

As for keyboard direct input, Windows appear to be somehow lagging behind,
allowing no non-zero plane code points.

I agree that this is probably not a LibreOffice issue. If you are using a
multitude of "exotic" code blocks and fonts you need to be very careful to
keep to the international code point standards. You also need to check that
you are using the same copy of the fonts and that you do not have various
versions confusing the issue. It is certainly easier to use Unicode
throughout, which is why it is the default for the Internet.

♪͡♪♪͡♪Neil Ren♪͡♪♪͡♪ wrote

In my case, and possible in Windows' case, the system recognises Unicode
6.3. The font recognises the face but not the knife and fork, but
LibreOffice uses neither of the two.

You say that your system is not defined as Unicode, so it doesn't
recognise Unicode 6.3. It will then depend on whether the fonts you are
using actually have the same characters at the same code points, and are
not "variations". LibreOffice uses Unicode as far as your system supports
the code points.
I think that 6.3 is rather a red herring, as it is the font that needs the
character. The � character that you sometimes see in emails replacing an
apostrophe is a good example of "mixing" character sets and overwriting
Unicode.

Anyway, that gives you something to think about. Good luck. Peter

OT:

-----
"雷聲 靐䨻": if you can see four Chinese characters instead of blobs or question marks, then your browser/OS has good support for Unicode and CJK characters! (P.S. the four characters put together mean "loud thunder")

In Chrome, I see 3 characters and a blank box. In Firefox and
SeaMonkey, I see three characters and a box with "4A 3B" on two lines.

I'm running Xubuntu 12.04 with the 3.8 kernel. Just interesting....

MR

Hi,

Relative to

"雷聲 靐䨻": if you can see four Chinese characters instead of blobs or question marks, then your browser/OS has good support for Unicode and CJK characters! (P.S. the four characters put together mean "loud thunder")

In Ubuntu 13.10 with Unity, in Thunderbird e-mail the 4 characters are present. If I paste the same into LO Writer from The Document Foundation 4.1.4.2, they also show up as 4 characters.

Don

MR ZenWiz wrote

OT:

in Chrome, I see 3 characters and a blank box. In Firefox and
Seamonkey, I see three characters and a box with "4A 3B" on two lines.

Your browsers support Unicode CJK characters, but either they do not
associate a proper display font with the "CJK Unified Ideographs Extension A"
range, or your system does not come with a font that supports glyphs
(particularly this glyph) in this range.
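The "4A 3B" shown inside the box is the missing character's own code point, U+4A3B (䨻). A quick Python check that it is the only one of the four that falls in the Extension A block:

```python
# U+4A3B (䨻) sits in "CJK Unified Ideographs Extension A"
# (U+3400..U+4DBF), a block many default font sets do not cover; the
# other three characters are in the original CJK Unified block, which
# nearly every CJK font supports.
def in_cjk_ext_a(cp):
    return 0x3400 <= cp <= 0x4DBF

for ch in "雷聲靐䨻":
    cp = ord(ch)
    block = "CJK Ext A" if in_cjk_ext_a(cp) else "CJK Unified (URO)"
    print(f"U+{cp:04X} {ch}: {block}")
```
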

Tractor wrote

in Ubuntu 13.10 with Unity, in Thunderbird e-mail the 4 characters are
present. If I paste the same in LO Write from the Document Foundation
4.1.4.2 they also show up a 4 characters.

Your browser probably has full support for CJK characters in the Basic
Multilingual Plane. I'm curious what font LO uses to display these 4
characters, as I am gradually shifting to the Linux platform as a beginner.

Thanks again. I am planning to check these issues in a virtual machine with
Windows or Linux installed, which I think would yield some more helpful
results.

I'm afraid I did not make myself clear regarding the non-Unicode support in
Windows. I am confident that my OS is set up with Unicode support; the
code-page set-up just runs in parallel. In fact every running Windows
system has the two encoding systems running simultaneously, typically
using CP437 for English environments.

I will come back with test results on these issues.

Hi Neil,

At work, my e-mail is automatically downloaded to Thunderbird and comes off the server. At home it is automatically downloaded but stays on the server. So I checked the characters here at home tonight in both Firefox and Chrome. Here is what I have (I've enlarged them for clarity):

Firefox:
[screenshot: the four characters as rendered in Firefox]

Chrome:
[screenshot: the four characters as rendered in Chrome]

I have a standard US version of Ubuntu 13.10 with Unity. Firefox and Thunderbird come with it, and are always updated to the current versions. Chrome is always the current release, and it is kept updated; I download it from Google. I always completely delete the LibreOffice version from Ubuntu and install the version (presently 4.1.4.2) from The Document Foundation. Hopefully the images I put in this e-mail will show up in your e-mail. I've not added any extra fonts or anything to the system or LibreOffice.

Don

Tractor wrote
> in Ubuntu 13.10 with Unity, in Thunderbird e-mail the 4 characters
> are present. If I paste the same into LO Writer from The Document
> Foundation 4.1.4.2, they also show up as 4 characters.

Your browser probably has full support for CJK characters in the Basic
Multilingual Plane. I'm curious what font LO uses to display
these 4 characters, as I am gradually shifting to the Linux platform as a
beginner.

http://www.latouche.info/admin/user_guides/chinese_support_gentoo.html
This is for Gentoo, so the names of the packages may be a little
different.

Also, it is a little outdated. With the latest Xorg versions, you don't
need a font section in /etc/X11/xorg.conf at all. The most important
thing now is to install the fonts. After that, each distribution has its
own way to prioritize them. On Gentoo, this is via eselect. All
distributions should have documentation about this.

As an example, this one for Arch is more up to date:
https://wiki.archlinux.org/index.php/Fonts

Dominique

Hi! I wasn't able to view the pictures in your message, in either Nabble or
the mailing list. I only get plain text.

I have been trying out LibreOffice on Ubuntu 12.04 LTS in VMware Player
today. Everything is working fine. None of these Windows bugs.

This reply goes to everyone who has offered their help. I just want to say
thank you. I have tried out LO on Ubuntu 12.04 and on Windows 7 in a VM, and
these issues are almost entirely gone on the Linux platform but remain the
same on the virtual Windows.

I will briefly answer all my questions in the following, with a short
summary.

Environment: Ubuntu 12.04 64-bit in VMware, clean install; LibreOffice
4.1.4.2, deb package; extra fonts copied from Windows to support CJK
characters in plane 2 (though the AR UL font also supports this), Tibetan
script (using Microsoft Himalaya), Burmese (not sure if Ubuntu has a
supporting font, so I just copied one), Gothic (using Code2000 or Segoe UI
Symbol) and more signs (using Segoe UI Symbol).

♪͡♪♪͡♪Neil Ren♪͡♪♪͡♪ wrote

1. no unicode code point entry field in the dialog

This is UNNECESSARY on Ubuntu/Linux, because I have learned it has
built-in support for direct keyboard input of Unicode characters, without
the limitation found in Windows (that you can only enter code points
lower than U+10000 using direct keyboard input).

♪͡♪♪͡♪Neil Ren♪͡♪♪͡♪ wrote

2. wrong detection of unicode ranges

This is NOT observed in LO on Ubuntu. The supported Unicode ranges of all
fonts are properly detected. And yes, LO follows the Unicode Standard
in this.

♪͡♪♪͡♪Neil Ren♪͡♪♪͡♪ wrote

3. limited support for fonts with non-zero plane glyphs

This is NOT observed in LO on Ubuntu. There is full range support for
these fonts, including those whose non-zero-plane glyphs LO doesn't show
on Windows.

♪͡♪♪͡♪Neil Ren♪͡♪♪͡♪ wrote

4. no built-in mechanism to produce glyphs from typed code points

This is UNNECESSARY for Ubuntu/Linux. See issue 1.

♪͡♪♪͡♪Neil Ren♪͡♪♪͡♪ wrote

5. unstable detection of unicode ranges: weird behaviors in the Special
Characters dialog

As of this writing, I haven't experienced any such behavior.
Seeing that the Unicode ranges are also correctly detected and fully
supported here, I am confident this won't happen on Ubuntu/Linux.

♪͡♪♪͡♪Neil Ren♪͡♪♪͡♪ wrote

6. broken bi-di display (on same line ) in mixed-script document

This is NOT observed in LO on Ubuntu if you copy text from a web browser
or type the text in. However, if you try to open an HTML page straight
from LO, which I do only for testing, the display does get screwed up.

♪͡♪♪͡♪Neil Ren♪͡♪♪͡♪ wrote

7. Broken glyph display in mixed-script document

This is NOT observed in LO on Ubuntu if you assign a proper language to
the text. If you don't, the glyphs may show up as blobs, but once you set
it right, the glyphs are shown properly. Try it with Tibetan and Gothic.

♪͡♪♪͡♪Neil Ren♪͡♪♪͡♪ wrote

8. problematic character spacing

This is NOT observed in LO on Ubuntu if you assign a proper language to
the text. If you don't, the glyphs may display erroneously, but once you
set it right, the glyphs are shown properly. Try it with Thai.

♪͡♪♪͡♪Neil Ren♪͡♪♪͡♪ wrote

9. wrong font information for mixed language documents

There is ongoing discussion of the issue of automatic font selection, but
at least the glyphs don't go wrong through use of a wrong font.

Verdict: Unicode support in LO is much better on Ubuntu/Linux than on
Windows, with respect to non-zero-plane code points and complex scripts.
This is possibly due to LO's origins as a Linux application later ported
to the Windows platform, and also due to Windows running Unicode and ANSI
modes in parallel, the latter for backward compatibility.

I'll perhaps try LO on a Windows 8.1 system in a VM if possible, and if I
do, I'll continue to update this post.