Struggling with Hebrew in LO

Dear List,

I'm struggling with using mixed English and Hebrew text in LO. This is
a fully up-to-date LO 5.1.4.2 in a new installation of Linux Mint.

When I type English text, the letters come out in the right order, but the
punctuation goes to the beginning of the line until the next letter is typed.
That is strange but sort-of-OK mid-sentence, yet no good at the end of a
paragraph.

Attempting to combine Hebrew and English text in the same sentence, as it
were to say 'shalom' in flight, assembles the language blocks the wrong
way round. Using Alt-Ctrl-8 and Alt-Ctrl-9 doesn't seem to fix this, and
the Shift-Ctrl-D and Alt-Shift combinations are also dysfunctional. The
font name (selected as SBL Hebrew) switches to DejaVu Sans once characters
are typed.

This is probably all very familiar to someone (if not everyone), so can
anyone help me to get this working correctly, please?

Jonathan

I have never been able to get RTL and LTR languages to mix in the same
sentence properly with any application in Windows or Linux when you
start adding punctuation. In LO you can set the language for a paragraph
and, if I remember correctly, that fixes the punctuation placement
problem. I am sorry for not being certain as it has been a few years
since I wrote any significant quantity of Hebrew.

I have never had any problems with the order of actual words between the
two languages.

In English, I say hello, in Hebrew I say שלום.

שלום, Jonathan.

I am not sure what mailer you are using or how it will render, but LO
and Thunderbird in Linux render the above the same. The period may be
considered out of order in the first; the comma in the second definitely
is. This is with leaving the language at the default (en_US.UTF-8 for
me). Try selecting your paragraphs and changing the language to
Hebrew and see if that helps any.

I cannot comment on the keyboard shortcuts you mentioned as I don't know
what they are supposed to do or where they do work.

Trever

Sorry, the option is Format/Character/Font/Language or right click and
character/font/language. It isn't paragraph. As I said, it has been a
few years.

Trever

Use language specific character styles, and language specific paragraph
styles. Hebrew only, English only. You can't mix English and Hebrew in
the same style. Things will get messed up.

I usually use a different colour for each paragraph style and each
character style. Then, when the document is completed, proof-read,
copy-edited, grammar-checked, etc., change the style colours to black.

To avoid the misplaced punctuation issue, don't use the non-dominant
language at either the beginning, or end of the sentence. Ideally, it
won't be the first, or last word, in the correctly punctuated phrase.

jonathon

Hi Jonathan:

Welcome to the multi-language war veterans club! Your purple heart is in the
mail.

There IS actually a logic as to how characters are laid out when typing
mixed LTR and RTL text in a line or sentence, but you need to understand
that most text rendering mechanisms continue to make some unwarranted
assumptions. First off, some punctuation in "foreign" (i.e. non-Latin)
scripts is taken from what used to be called the lower ASCII characters.
This makes some sense, as there is no point in duplicating identical
characters that are used for identical purposes. BUT: when the text
rendering engines encounter such characters while typing in another script,
they decide that the typist is back to using Latin script which, as you've
seen, can be disastrous.

With Hebrew Script (used in Hebrew, Yiddish and perhaps others I'm not aware
of), there is an additional problem:

Like most languages, Hebrew uses the same set of parentheses as English, and
treats "opening" and "closing" as meaning the curve is towards the innards
of the set. Since Hebrew is written from right-to-left, however, what is an
"opening" paren in Latin scripts is a "closing" paren in Hebrew and
vice-versa. This is the reason that the opening paren is above the 9 key for
both keyboard layouts, but they face the opposite direction (the "(" is
above the "9" key on English keyboards and above the "0" key on many Hebrew
keyboards). Because of the flaw in the way rendering engines recognize
these characters (both of which are in the "Latin" set) as indicating a
return to English, you lose! Well, you get the idea.
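The dual role of the brackets is visible in the Unicode character database itself. As a quick check with Python's standard unicodedata module (a stdlib sketch, nothing LibreOffice-specific), the paired brackets are classified as direction-neutral and flagged as "mirrored", which is exactly why the rendering engine has to guess:

```python
import unicodedata

# Brackets are direction-neutral ('ON' = Other Neutral) and carry the
# "mirrored" flag, telling the renderer to flip them visually inside
# right-to-left runs.
for ch in "()[]{}":
    print(ch, unicodedata.bidirectional(ch), unicodedata.mirrored(ch))
# Each line prints: <char> ON 1
```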

Arabic - another RTL script used in far more languages than Hebrew - leaves
the "(" and ")" characters in the same positions they are on English
keyboards.

If you go to LibreOffice Bug #92655
(https://bugs.documentfoundation.org/show_bug.cgi?id=92655) I attached a pdf
document there titled "General Discussion of Complex Text Attributes" which
you can download; this (particularly page 14 and following) describes some of the
details of the issue you're running into, and Hebrew is one of the specific
scripts used as an example.

Jonathon(toki)'s advice given above is spot on! If you understand what's
actually going on, and understand the characters to avoid having in certain
places and why (that's where I think my document may help you), you can
intermingle multiple directions within single lines successfully almost all
the time. Like him, I've done this for long enough to be able to get it
done, but I also have my own "tricks" to keep my head screwed on straight
when composing.

LibreOffice, which like most apps relies on an external rendering engine (I
believe it's HarfBuzz now, but am not certain) is affected by this rendering
assumption, as you have seen. You may also run into LibreOffice sometimes
substituting fonts unnecessarily when you switch, even if you have
specifically selected a font containing both scripts/languages you are using.
This results from some fonts not properly reporting which scripts and
languages they support. So the rendering engines dutifully find a substitute
font to use. It's messy. The other document attached to the same bug report
discusses many other side effects you'll need to become aware of.

You might also take a look at bug 32357 which deals with auto-completion
quirks when using multiple languages.

But it can be done: Best of Luck, and if you have other questions, or
discover new tricks, please post them.

Frank

Does the bad font substitution go away, if you manually add foundry data
to the font?

jonathon

A very good question but, depending on what you mean specifically by "foundry
data" I'm afraid I don't know that it will. My best guess is that it will
**improve the chances that it will**, but there are many other factors
involved. My understanding of all this is murky at best given the unclear
and often conflicting information I could find on the web.

In Linux at least, there is a "thing" (utility? service?) called fc-match
that seems to actually decide which fonts are the closest match to the one
that doesn't meet the immediate needs of the calling app - whether fc-match
is called directly by an app or indirectly through Harfbuzz or other
rendering mechanism (it isn't clear to me if, or to what extent, Windows
uses this, although Gimp under Windows certainly does).

fc-match is part of Behdad Esfahbod's fontconfig package (see
https://en.wikipedia.org/wiki/Behdad_Esfahbod), and it determines the
matches (and ranking) according to a multitude of factors exposed/reported
by the fonts themselves. Since a number of perfectly lovely fonts are either
missing some things, or define them incorrectly or inconsistently, the
answer to your question might be: if you were to fix all of those things,
you probably wouldn't encounter the unwarranted and unexpected font
substitutions.
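fontconfig's matching can be poked at directly. As a hedged sketch (it assumes fc-list is installed, as it is on most Linux desktops; the parsing is demonstrated on a canned sample line so it runs even without fontconfig), here is one way to list the fonts that claim Hebrew support:

```python
import shutil
import subprocess

def parse_fc_list_line(line: str) -> tuple:
    """Split one line of `fc-list` output into (file, family).

    fc-list's default output format is '/path/to/font.ttf: Family:style=...'.
    """
    file_part, _, rest = line.partition(": ")
    family = rest.split(":")[0]
    return file_part, family

# Only run the real query when fontconfig is actually installed.
if shutil.which("fc-list"):
    # List every installed font claiming Hebrew ('he') language support.
    out = subprocess.run(["fc-list", ":lang=he"],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        print(parse_fc_list_line(line))

# The parser itself can be checked on a canned sample line:
sample = "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf: DejaVu Sans:style=Book"
print(parse_fc_list_line(sample))
# ('/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf', 'DejaVu Sans')
```

Comparing `fc-list :lang=he` against what a font actually covers is one way to spot fonts that under-report their capabilities.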

There are of course two significant "gotchas" in expecting good results from
fc-match. The first is that, while the font itself is actually suitable, it
doesn't report some of its capabilities correctly, causing an unnecessary
quest for a substitute. The second is that, while looking for a good
substitute, the fonts being examined don't correctly or consistently report
their capabilities. The acronym we used to use in the early days was GIGO
(Garbage In = Garbage Out).

I've long been annoyed that LibreOffice (among other apps, but this is an LO
list) doesn't report that such substitutions were made, but I've since
discovered the possibility that it might not even have been given that
information as feedback from the rendering engine in the first place. The
way I confirm these stealth substitutions by the way is to either generate a
.pdf or save the file as an .fodt; in either case the actual font being used
can be determined from those files.
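That check can be partly automated: a flat .fodt is plain XML, so the fonts a document references can be pulled out with the standard library. This is only a sketch over a trimmed-down sample document (the element and attribute names follow the ODF format; a real file is far larger):

```python
import xml.etree.ElementTree as ET

# A trimmed-down fragment of a flat-ODF (.fodt) file, for illustration;
# real files are much larger but declare fonts the same way.
sample_fodt = """<?xml version="1.0"?>
<office:document xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
                 xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
                 xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0">
  <office:font-face-decls>
    <style:font-face style:name="SBL Hebrew" svg:font-family="'SBL Hebrew'"/>
    <style:font-face style:name="DejaVu Sans" svg:font-family="'DejaVu Sans'"/>
  </office:font-face-decls>
</office:document>"""

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"

def declared_fonts(fodt_xml: str) -> list:
    """Return the font names declared in a flat-ODF document."""
    root = ET.fromstring(fodt_xml)
    return [el.get(f"{{{STYLE_NS}}}name")
            for el in root.iter(f"{{{STYLE_NS}}}font-face")]

print(declared_fonts(sample_fodt))  # ['SBL Hebrew', 'DejaVu Sans']
```

If a font appears here that you never selected, a stealth substitution has taken place.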

So: Trying to answer the question you pose has been part of my goal for a
while, but I first wanted to come up with a list of fonts that would be
suitable for experimentation. There are a number of ways of "looking into" a
font to see what's in it, but they are all fairly tedious if you want to
compare some arbitrary set of fonts, and involve looking at one font at a
time.

I've recently been playing with a shell script I wrote; if I'd known how
deep the water was that I was stepping into, the choice of bash would likely
have been different, but that's water (pun intended) under the bridge. I
give the script multiple command line arguments for the particular scripts
I'm interested in combining in a single document and it gives me back a list
of all the fonts that are "potentials." Along with each font name, there is
a short list of some of the things it has to say about itself. The results
so far are rather fascinating, and tend to confirm your suspicion/hope/guess
that fixing the fonts may fix the problem.

In no particular order, here are some representative tidbits I've
discovered:

1) When looking at fonts containing both Greek and Armenian characters,
there are 31 of them installed on my machine, and all of those 31 (all from
the FreeFont and DejaVu families) include the appropriate language codes
('el' and 'hy') for this example. BUT: DejaVuSans-ExtraLight.ttf is missing
the 0x0559 character from the available character bit map. Not knowing
Armenian, I don't know what to make of that, but it's interesting.
FreeSerifItalic doesn't report the ISO 15924 script tag 'grek' and FreeMono
fails to report the code 'armn'; DejaVuSansMono-Oblique and FreeMonoOblique
don't report either 'el' or 'hy'. You can see that answering your question
would require some serious experimentation: are there valid reasons some
members of these families are slightly different from others, etc? There are
more examples like this.

2) I have found two versions of Garamond on my system using this script (I
don't believe I would have done that, so I suspect that different apps may
have added them not realizing the other was there). Appearance-wise, their
glyphs seem at least superficially identical, but what they report is quite
different. The base font for one reports it's in a 'Normal' style, while the
other says it's 'Regular.' One of these families is clearly superior in what
it is reporting as capabilities, so I'll soon be purging the other, but the
questions remain: how the heck would I ever have stumbled across this? and
what effect(s) might this have had on unexpected font substitution?

3) For coverage of "upper" Unicode planes (i.e. scripts that begin beyond
0xffff and containing such things as ligatures, box drawing characters,
complete musical symbols and so forth), none of the utilities I've used
seems to report anything. An examination (using FontForge) of some fonts
that provide these seem to be constructed correctly, leading me to believe
that the underlying utilities may never have been updated to handle extended
values, but that's just a guess.

4) Despite the fact that the Thai and Laotian character sets, while
different and assigned to different Unicode blocks, are similar enough that
they can be read (though possibly not understood) on either side of the
border, I have found no fonts whatever that contain both of them. Since my
collection of Thai fonts is rather extensive, I find this odd. If I were
ever to mix Thai and Laotian in the same document (which I haven't) my guess
is that substitution problems would pop up immediately.
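For what it's worth, the two scripts sit in adjacent blocks of the Basic Multilingual Plane, which a quick stdlib check confirms (this is unrelated to any particular font, just the character assignments):

```python
import unicodedata

# Thai occupies the U+0E00..U+0E7F block and Lao the adjacent
# U+0E80..U+0EFF block; both are in the Basic Multilingual Plane.
print(hex(ord("\u0e01")), unicodedata.name("\u0e01"))  # Thai "ko kai"
print(hex(ord("\u0e81")), unicodedata.name("\u0e81"))  # Lao "ko"
```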

5) The Droid family of fonts, by the way, does NOT contain any Thai
characters. Fair enough, as they provide supplemental fonts for other
Scripts. For Thai, I have DroidSerifThai-Regular, DroidSerifThai-Bold, and
DroidSansThai installed. You would think therefore, that when using
DroidSerif-Regular as the font, DroidSerifThai-Regular would be the perfect
substitute for text passages containing Thai. For reasons I've yet to track
down, however, it isn't even on the list of fonts considered as substitutes;
I suspect that since none of these Thai variants report support for the ISO
639-1 Language Code 'th' that's probably a good clue but, since I don't
actually use Droid, I haven't pursued that further. (To the OP's original
question, there are equivalent Droid Hebrew fonts as well.)

Finally, I have some hesitation in modifying any particular font, since as
far as I know they could be overwritten at any time by a helpful app or OS.
It seems preferable that if any font errors are found, that they be vetted,
confirmed and corrected by the original creating entity. Unfortunately, I'm
not sure how all the existing "faulty" versions could ever be rounded up and
destroyed.

But - that's another problem. If enough definitive examples are found,
perhaps there will be some recognition that there is still work to be done.
I know from past postings that you're familiar with this situation, so if
there is a way for me to pass my bash script along - assuming you have
access to a linux machine (or the Windows 10 bash shell experiment???), let
me know; I'd be happy to hear any comments or corrections you might have
...

Time to stop here: I have a tendency to run on when goaded.

Regards - Frank

> answer to your question might be: if you were to fix all of those things,
> you probably wouldn't encounter the unwarranted and unexpected font
> substitutions.

If it wasn't so big, Unifont would make a good test case.
Maybe I can figure out how to do bulk editing of the meta-data.

> .pdf or save the file as an .fodt; in either case the actual font being used

I'll add that (.fodt) to the list of formats to save documents in.

For one project I'm working on, getting the characters to behave
themselves is a huge problem. So much so that I've wished for a macro
that could change the character style of a glyph according to its
Unicode Code Point.
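The core decision such a macro would need (which character style to apply, based on a glyph's code point) can be sketched in plain Python with the standard unicodedata module. The style names here are invented for illustration; a real LibreOffice macro would apply them through the UNO API:

```python
import unicodedata

def style_for(ch: str) -> str:
    """Pick a character-style name from a glyph's bidi class.

    This only mirrors the per-codepoint decision such a macro would
    make; the style names are hypothetical.
    """
    bidi = unicodedata.bidirectional(ch)
    if bidi in ("R", "AL"):   # strong right-to-left: Hebrew, Arabic letters
        return "RTL Text"
    if bidi == "L":           # strong left-to-right: Latin and most others
        return "LTR Text"
    return "Neutral"          # spaces, digits, punctuation, etc.

print([style_for(c) for c in "a\u05d0,"])
# ['LTR Text', 'RTL Text', 'Neutral']
```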

> font to see what's in it, but they are all fairly tedious if you want to
> compare some arbitrary set of fonts, and involve looking at one font at a
> time.

Running a script, or set of scripts, that exports the data to a text file,
then looking at the results, is probably the simplest and fastest way to
examine them.

> deep the water was that I was stepping into, the choice of bash would likely

Donald Knuth said that the only way to write software was to write the
program three times, changing programming languages at least once.

> a short list of some of the things it has to say about itself. The results
> so far are rather fascinating, and tend to confirm your suspicion/hope/guess
> that fixing the fonts may fix the problem.

> ('el' and 'hy') for this example. BUT: DejaVuSans-ExtraLight.ttf is missing
> the 0x0559 character from the available character bit map. Not knowing
> Armenian, I don't know what to make of that, but it's interesting.

The most probable reason for an otherwise complete weight, with other
weights in the same typeface including the missing glyphs, is designer
oversight.

> FreeSerifItalic doesn't report the ISO 15924 script tag 'grek' and FreeMono
> fails to report the code 'armn'; DejaVuSansMono-Oblique and FreeMonoOblique
> don't report either 'el' or 'hy'. You can see that answering your question
> would require some serious experimentation: are there valid reasons some
> members of these families are slightly different from others, etc?

The Unicode Standard allows for some designer leeway, in both how the
glyphs are constructed, and how they interact with each other. That
might be what is happening here.

> 2) I have found two versions of Garamond on my system using this script (I
> don't believe I would have done that, so I suspect that different apps may
> have added them not realizing the other was there).

That gets into font metadata; a program might look for something specific
and, failing to find it, automatically download and install the font that
contains what it was specifically looking for.

> The base font for one reports it's in a 'Normal' style, while the other
> says it's 'Regular.'

Those are two different weights. Without looking at them, my guess is
that "Regular" is the inferior looking one.

> questions remain: how the heck would I ever have stumbled across this?

This is where one uses a utility that generates a chart of all installed
fonts in one's system, and then looks at which fonts it claims are
installed, and what they look like.

I've forgotten which font management utility for Linux includes that
functionality.

> that the underlying utilities may never have been updated to handle extended
> values, but that's just a guess.

Look at how they handle the multi-coloured emoji of Unicode 9.0.

OTOH, that is perhaps unfair, because most font utilities for *Nix are
fixated on Unicode 5.0, or earlier.

> If I were ever to mix Thai and Laotian in the same document

They require different IMEs, so most people who put fonts together won't
combine them in the same typeface.

> my guess is that substitution problems will pop up immediately.

No doubt.

> For reasons I've yet to track down, however, it isn't even on the list of
> fonts considered as substitutes;

One probably needs to update a database somewhere.

> Finally, I have some hesitation in modifying any particular font, since as
> far as I know they could be overwritten at any time by a helpful app or OS.

This is where running

    sudo rm -R /usr/share/fonts/*
    sudo cp -R /media/theme/fonts/default/* /usr/share/fonts/

ensures that only the fonts that one wants are installed.

> It seems preferable that if any font errors are found, that they be vetted,
> confirmed and corrected by the original creating entity.

This depends upon who/what created the font in question.

For some of those US$5K fonts I've seen, even looking for "errors" in
the font is a breach of the license.

> Unfortunately, I'm not sure how all the existing "faulty" versions could
> ever be rounded up and destroyed.

You give the corrected font a new, higher version number, and hope that
users will start using the updated font. Trying to destroy faulty versions
is playing whack-a-mole. There are better things to do with one's time and
energy.

> there is a way for me to pass my bash script along - assuming you have
> access to a linux machine (or the Windows 10 bash shell experiment???)

Is it on GitHub? If so, that might be the easiest. Otherwise, send it as an
email attachment; Box or Dropbox might be easier than email.

I use Linux. Windows is way too frustrating for me to use.

jonathon

I've attempted to upload my shell script using the More button; it is very
heavily commented (as much for my own benefit as anything). As a quick trial
you can make it executable and then type, for instance:

FindFont thai greek
or
FindFont hindi persian hebrew

The first will list all the fonts that contain characters for both Thai and
Greek, and list enough information to see which of those fonts correctly
report the needed support. That helps when choosing a font that Writer
(hopefully) won't mysteriously replace. Even with the ability to set a CTL
language and font, it will be ignored and replaced if the font isn't
reporting what it has correctly. When trying to intermingle more than one
language/script in a document, CTL hurts far more than it helps.

The second command (giving three arguments) probably won't list anything
unless you have some really beefy fonts, although both FreeSerif and its
bold counterpart support all three.

FindFont.FindFont
<http://nabble.documentfoundation.org/file/n4198555/FindFont.FindFont>

The bad news is that the list of languages/scripts is not comprehensive
(just ones I've happened to look into); the good news is that they're
defined in a "case" statement, so adding some others should be relatively
simple.

Frank

> I've attempted to upload my shell script using the More button;

Thanks.

> heavily commented (as much for my own benefit as anything).

If a script is not commented, then nobody knows what it is supposed to do.

> That helps when choosing a font that Writer (hopefully) won't
> mysteriously replace. Even with the ability to set a CTL language and
> font, it will be ignored and replaced if the font isn't reporting what it
> has correctly.

If the font does correctly report what it has, then it shouldn't be
replaced.

> The bad news is that the list of languages/scripts is not comprehensive
> (just ones I've happened to look into); the good news is that they're
> defined in a "case" statement, so adding some others should be relatively
> simple.

Give me a week or so to play with it. I'll probably expand it to include
most, if not all, writing systems included in Unicode 9.0.

Would love to see it when you have a prototype ...

As I said in the comments, I started with a quick and dirty shell script,
having no idea how big it would get; by the time I realized I should have
begun with a "real" language, it was a little late.

If I can do any testing and can find the time, I'd be happy to ...

Good Luck.

Frank

Jonathon (and Jonathan): Here is a pdf containing a better explanation of
how to HELP avoid unnecessary font substitutions in LibreOffice Writer and
other applications. The primary cause of this seems to be using fonts that
either don't have or don't correctly report their coverage and other
capabilities. This is particularly annoying when intermingling multiple
scripts/languages within a single sentence, paragraph, or document. The pdf
also contains an updated bash shell script for helping to select fonts that
are appropriate for a given combination of scripts/languages, as well as
identifying fonts that should perhaps be retired to that proverbial foundry
in the sky and replaced with better ones.

Frank

Evaluating-fonts-for-multilingual-use.pdf
<http://nabble.documentfoundation.org/file/n4199007/Evaluating-fonts-for-multilingual-use.pdf>

I apologize for not having seen this thread sooner. Here is a document
which explains why you see what you do, and how to work with it rather
than against it. The concepts are really quite simple, but not
intuitive:
http://dotancohen.com/howto/rtl_right_to_left.html

You are invited to contact me at any time with questions.

Some examples of proper mixed Hebrew and English:

Hello, יהונתן, how are you?‎
‫שלום, Jonathan, מה שלומך?

English at the beginning, עברית בסוף.‎
‫עברית בהתחלה, English at the end.

Because plain-text email does not even have a concept of alignment,
the alignment of the Hebrew sentences depends on your renderer (email
client or web browser). Most likely, they will all be left-aligned.
Note however that alignment and directionality are different concepts.
In all cases, the punctuation should be at the proper end of the
sentence. In order to have Hebrew texts right-aligned in email, I
would have to have sent an HTML email. In LibreOffice you shouldn't
have such an issue. LibreOffice, unlike email, has a concept of
alignment.

Dotan:

I'm sorry I never stumbled across your essay before, but thanks for an
excellent explanation of how the Unicode® Standard Annex #9/Unicode
Bidirectional Algorithm works *in actual practice*!

So far as I can see, your description is still valid for even the recently
updated version of that algorithm
(http://www.unicode.org/reports/tr9/tr9-35.html). As published, the Unicode
Consortium's algorithm really doesn't explain what's happening in a way that
would help an average user - one who is just trying to "type" while mixing
multiple scripts with opposing directionality - it's more intended for
developers.

Unfortunately (in my view anyway), the algorithm itself makes some
assumptions that I find unjustifiable. A primary example is the
categorization of certain "shared characters" (spaces, punctuation and so
forth) as neutral, and accompanying that with the idea that they should
therefore take on the directionality of the paragraph unless and until
surrounded by characters that clearly define them as one directionality or
another.
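The bidi classes behind that "neutral" categorization are exposed by Python's standard unicodedata module; a quick sketch, independent of any particular renderer:

```python
import unicodedata

# The bidi classes behind the "neutral" behaviour:
for ch in ("A", "\u05d0", " ", ",", "("):
    print(repr(ch), unicodedata.bidirectional(ch))
# 'A'      -> 'L'   (strong left-to-right)
# '\u05d0' -> 'R'   (strong right-to-left: Hebrew alef)
# ' '      -> 'WS'  (whitespace: neutral)
# ','      -> 'CS'  (common separator: neutral)
# '('      -> 'ON'  (other neutral)
```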

This seems to be why, for instance, the cursor jumps around mysteriously
when entering a multi-word segment of Hebrew or Arabic scripts (regardless
of the actual language they are used for) each time a space is encountered
(you said "In LibreOffice you shouldn't have such an issue" - true enough,
but several remain). It would seem to me that - from a user-interface
perspective at least - such characters should keep the directionality of the
most recently typed character, leaving the cursor where it was before the
space (most common example, but occurs with other such characters) was
entered. If the next character is indeed one of the opposite directionality,
then make the correction accordingly.

As a matter of principle, assumptions in algorithms always seem risky and/or
dangerous. In this case, the whole idea that one needs to set the
directionality of characters or phrases ahead of time seems particularly
problematic. The obvious counter-argument is the case of beginning a
paragraph with a character that isn't in the direction the writer intended;
that would need to be treated as a special circumstance.

The ultimate objective would seem to be completely removing any barriers to
freely typing in whatever language or script is desired, without needing to
know a lot of special tricks; both Unicode (with UTF-8) and OpenType font
technology are huge steps towards this goal, but we're not quite there yet.

Again, thanks for pointing out your essay!

Instead of creating a program from scratch, I'm going to modify an
existing one. The downside is that I have to learn another programming
language. Along the way, I'll also have to fix some existing bugs in
that program.

The reason for modifying the existing program is that I then only have to
write one routine (^1) to display the metadata that it currently reads.

Once I've finished that modification, I'll modify a different tool, to
write valid metadata to each glyph.

I don't have a time frame by which this will be done.

I'm in the midst of two major projects, and one minor project. However,
one of those projects requires LibO to recognize, and correctly display,
different writing systems in the same phrase.
(So this, in effect, becomes a sub-project of one of those projects.)

^1: That is the theory. In practice, I expect to have to fix half a
dozen bugs, and, maybe implement one or two other functions.

jonathon

A couple questions ...

What's the program and language you're looking at? Even though I doubt I
could assist, I'd still like to be able to read through what already exists,
in a never-ending quest to get a handle on this subject (which is far more
complex than I originally envisioned, although I'm very impressed with what
has been done since the days of 8 bits).

But consider this my encouragement, for whatever that's worth; I'm also
willing to experiment with whatever you come up with if you need another
pair of eyes.

Regards ...

Frank

> What's the program and language you're looking at?

FontMatrix, and C++, respectively.

> But consider this my encouragement for whatever that's worth; I'm also
> willing to experiment with whatever you come up with if you need another

I cloned it to my GitHub space, and then cloned it onto my system here.
I'll merge my changes into the version in my GitHub space. That way the
original stays "pure", but people who want to risk their systems with my
programming skills can try it out.

jonathon

A bit late to the party…

> Unfortunately (in my view anyway), the algorithm itself makes some
> assumptions that I find unjustifiable. A primary example is the
> categorization of certain "shared characters" (spaces, punctuation and so
> forth) as neutral, and accompanying that with the idea that they should
> therefore take on the directionality of the paragraph unless and until
> surrounded by characters that clearly define them as one directionality
> or another.

> This seems to be why, for instance, the cursor jumps around mysteriously
> when entering a multi-word segment of Hebrew or Arabic scripts (regardless
> of the actual language they are used for) each time a space is
> encountered.

This usually means you didn’t set the paragraph direction and just aligned
the paragraph to the right while leaving its direction LTR. No jumping would
happen if the paragraph has RTL direction.

It looks to me like most of the directionality issues in this thread come
from this confusion; but again, paragraph direction and alignment are two
different things.

Khaled:

Re: "paragraph direction and alignment are two different things." Certainly
true (and, from many discussions here and elsewhere, it's obvious to me that
anyone who cares about such things already knows this as well) but, I think,
misses the primary point, which is to eliminate a bizarre and confusing user
interface.

So: Two responses to your comment: "This usually means you didn’t set the
paragraph direction and just aligned the paragraph to the right while
leaving its direction LTR. No jumping would happen if the paragraph has RTL
direction."

First off, I'm basing my own opinions on the idea that following the Unicode
Standard in this regard *should be* the objective, since a) it is well
thought out and b) results in an interface that is both more intuitive and
far easier to use in practice. The reference for that, by the way, is
http://www.unicode.org/reports/tr9/tr9-35.html for the official “Unicode®
Standard Annex #9: UNICODE BIDIRECTIONAL ALGORITHM.”

So here goes.

To your point about setting the paragraph direction, you are correct. But
why should a user need to do so if it is unnecessary? Annex 9 clearly
recommends that the *default* paragraph direction should be set to the
directionality of the first strongly directional character entered into
that paragraph. This is just my interpretation, of course, but it's
bolstered by the fact that it makes life easier. More on the *default* in a
bit ...
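That first-strong heuristic is easy to state precisely; here is a small Python sketch of it (stdlib unicodedata, not any LibreOffice code):

```python
import unicodedata

def first_strong_direction(text: str, default: str = "ltr") -> str:
    """First-strong heuristic: take the paragraph direction from the
    first character that has a strong bidi class."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":
            return "ltr"
        if bidi in ("R", "AL"):   # Hebrew / Arabic letters
            return "rtl"
    return default  # no strong character at all: fall back to a default

print(first_strong_direction("shalom \u05e9\u05dc\u05d5\u05dd"))   # 'ltr'
print(first_strong_direction("\u05e9\u05dc\u05d5\u05dd, shalom"))  # 'rtl'
print(first_strong_direction("123..."))  # 'ltr' (digits are not strong)
```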

The Calligra Words and FocusWriter word processors, as well as the gEdit
and Kate text editors, all act in this manner, so it's not unheard of. Of
course, neither word processor has the feature set of Writer, but that's
not the scope of this discussion. I will say that, for some complex or
extensive entry where intermingling of bidirectional text is required, I
will switch to one of those to do the actual typing, and then copy the
block to Writer to make use of its other features. In such situations,
Writer's behavior is actually annoying.

Secondly, you seem to be assuming that paragraphs run in just one direction
or another. For certain use cases, that's reasonable, of course, but as a
general rule, that is entirely too limiting. (Think of translators,
literary, morphological, and etymological analyses, and so forth).

Far and away the most annoying aspect of this is when initially entering an
RTL phrase of more than one word in an otherwise LTR paragraph. Having the
cursor jump to the right as each space (a non-directional or "neutral" as
Annex 9 calls it) between each RTL word is entered is fun to watch, but
certainly not what a typical user would expect.

Annex 9 does not specify this (although I've read some postings suggesting
it does). The relevant section says “Generally, NIs [i.e. neutral and
isolate formatting characters] take on the direction of the surrounding
text. In case of a conflict, they take on the embedding direction.” But, if
the user hasn't yet entered any character beyond the space, there is no
SURROUNDING text - there is only PRECEDING text. The cursor should stay just
where it is unless and until the user enters another LTR character. Of
course this doesn't take into account very unusual needs (where the isolate
formatting characters are needed), but for typical text entry, this is the
most common use case for mixing bidirectional text in a single paragraph.
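To make the "surrounding vs. preceding" distinction concrete, here is a
deliberately simplified sketch (my own illustration in Python, using only
the stdlib unicodedata module - not Writer's actual implementation) of how
a single neutral between two strong characters resolves under rules N1/N2:

```python
import unicodedata

def strong_class(ch):
    """Map a character to "L", "R", or None (not strongly directional)."""
    b = unicodedata.bidirectional(ch)
    if b == "L":
        return "L"
    if b in ("R", "AL"):
        return "R"
    return None

def resolve_neutral(prev_ch, next_ch, base):
    """UAX #9 rules N1/N2 in miniature: a neutral takes the direction of
    the surrounding strong text if both sides agree, otherwise it takes
    the base (embedding) direction."""
    p, n = strong_class(prev_ch), strong_class(next_ch)
    if p is not None and p == n:
        return p
    return base

print(resolve_neutral("\u05d0", "\u05d1", base="L"))  # R: Hebrew on both sides
print(resolve_neutral("\u05d0", "x", base="L"))       # L: conflict, base wins
```

The crux of my complaint is the second case: while the user is still
typing, there is no "next" character at all, yet the space is resolved as
if the paragraph had already ended.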

As a further comment on "No jumping would happen if the paragraph has RTL
direction": the same distracting behavior will occur in the opposite
direction if an LTR segment is entered into an RTL paragraph. (Except when
numeric digits are entered, which are mostly LTR even within RTL languages;
they seem to be handled independently of other characters in most
implementations - but again, that's a distraction from this particular
thread.)
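A quick way to see why digits behave differently (again just an
illustration with Python's stdlib unicodedata, which exposes the UAX #9
character classes): digits belong to a weak class, not to strong LTR:

```python
import unicodedata

# Bidi character classes per UAX #9: Latin letters are strong LTR ("L"),
# Hebrew letters strong RTL ("R"), but European digits are "EN" - a weak
# class - and the space is "WS", a neutral.
print(unicodedata.bidirectional("a"))       # L
print(unicodedata.bidirectional("\u05d0"))  # R  (Hebrew alef)
print(unicodedata.bidirectional("7"))       # EN (European number)
print(unicodedata.bidirectional(" "))       # WS (whitespace, neutral)
```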

The original poster also mentioned his struggle with placing the period at
the end of a sentence in a normally LTR paragraph containing bidirectional
text that ENDS WITH the non-default directionality. I could almost hear him
screaming, as it took me ages to figure out how to overcome that in Writer,
and it seems that interpreting "surrounding" is the culprit here as well.
I'm not sure if you've ever explained to a non-technical translator how to
insert a zero-width character before, but it can turn into a fascinating
conversation - I'd like to see Writer (and, to be fair, many other apps) be
a bit more intuitive to use in such cases.

Editing text in bidirectional paragraphs is a bit different, of course,
since the settled layout needs to be disrupted, but the issues there are a
bit more involved than entry, so I'll leave well enough alone for the
moment.

But thanks for responding; it's good to know that some attention is being
bestowed on those (apparently very) few of us who type such things.

Are you, by the way, the "HarfBuzz" Khaled Hosny, or is that a different
person?

-Frank

Re: "paragraph direction and alignment are two different things." Certainly
true (and, from many discussions here and elsewhere, it's obvious to me
that anyone who cares about such things already knows this as well) but, I
think, misses the primary point, which is to eliminate a bizarre and
confusing user interface.

I don’t think there is anything bizarre or confusing about this user
interface (speaking as an RTL user), but your mileage may vary.

So: Two responses to your comment: "This usually means you didn’t set the
paragraph direction and just aligned the paragraph to the right while
leaving its direction LTR. No jumping would happen if the paragraph has RTL
direction."

First off, I'm basing my own opinions on the idea that following the
Unicode Standard in this regard *should be* the objective, since a) it is
well thought out and b) it results in an interface that is both more
intuitive and far easier to use in practice. The reference for that, by the
way, is http://www.unicode.org/reports/tr9/tr9-35.html for the official
“Unicode® Standard Annex #9: UNICODE BIDIRECTIONAL ALGORITHM.”

We do follow it.

So here goes.

To your point about setting the paragraph direction, you are correct. But
why should a user need to do so if it is unnecessary? Annex 9 clearly
recommends that the *default* paragraph direction should be set to the
directionality of the first strongly directional character entered into
that paragraph. This is just my interpretation, of course, but it's
bolstered by the fact that it makes life easier. More on the *default* in a
bit ...

True, but it also allows for higher-level protocols to control it, see
http://unicode.org/reports/tr9/#HL1, which is what we do here, as does
pretty much any system that allows text formatting. The problem with
setting paragraph direction based on the first strong character is that it
is a heuristic, and it often fails; if your RTL paragraph starts with an
LTR word it will get the wrong paragraph direction, and if it does not
contain any strong direction characters (e.g. numbers only) then you can’t
even determine the direction.
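For concreteness, here is a minimal sketch of that first-strong heuristic
(rules P2/P3 of UAX #9) in Python with the stdlib unicodedata module; the
function name is just for illustration, but it shows exactly the failure
modes described above:

```python
import unicodedata

def first_strong_direction(text):
    """Return the direction of the first strongly directional character
    (UAX #9 rules P2/P3), or None if no strong character is present."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":
            return "ltr"
        if bidi in ("R", "AL"):
            return "rtl"
    return None

# A Hebrew paragraph that happens to start with an English word is
# misdetected, and a digits-only paragraph cannot be detected at all:
print(first_strong_direction("\u05e9\u05dc\u05d5\u05dd world"))        # rtl
print(first_strong_direction("LibreOffice \u05e9\u05dc\u05d5\u05dd"))  # ltr
print(first_strong_direction("123 456"))                               # None
```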

In Writer you simply need to select the paragraph direction just once when
you start a new document and it will be used for all paragraphs until you
change it. This has the benefit of being explicit rather than implicit; if
there were a 100% certain way to auto-detect paragraph direction it would
have been the default.

The Calligra Words and FocusWriter word processors, as well as the gEdit
and Kate text editors, all act in this manner, so it's not unheard of. Of
course, none of them has the feature set of Writer, but that's not the
scope of this discussion - I will say that, for some complex or extensive
entry where intermingling of bidirectional text is required, I will switch
to one of those to do the actual typing, and then copy the block to Writer
to make use of its other features. In such situations, Writer's behavior is
actually annoying.

Most of these are plain text editors, and they have no option but to use
the heuristic, since plain text has no way to store paragraph direction.

Calligra Words is not very different from Writer; it has an explicit
paragraph direction setting, but if you didn’t explicitly set it, it will
use the heuristic. I’m not sure what I feel about this - it seems a bit
surprising, but I’m no UX expert - and it sounds like a valid improvement
request if someone wants to open an issue on the bug tracker.

Secondly, you seem to be assuming that paragraphs run in just one direction
or another. For certain use cases, that's reasonable, of course, but as a
general rule, that is entirely too limiting. (Think of translators, and of
literary, morphological, and etymological analyses, and so forth.)

That is how the Unicode Bidirectional Algorithm works; there needs to be a
paragraph (base) direction, whether it is set explicitly or auto-detected
using one heuristic or another. For embedded text that needs a different
base direction than the parent paragraph, you have to resort to control
characters (LRE, RLE, FSI, etc).
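For example (a sketch in Python; the code points are the real UAX #9
isolate controls, while the surrounding sentence is invented), an embedded
run can be isolated with FSI ... PDI so that its direction is detected
independently of the paragraph around it:

```python
# UAX #9 directional isolate controls: FSI (U+2068) opens an isolated run
# whose direction is auto-detected from its first strong character, and
# PDI (U+2069) closes it.
FSI, PDI = "\u2068", "\u2069"

shalom = "\u05e9\u05dc\u05d5\u05dd"  # the Hebrew word "shalom"

# The isolate keeps the embedded Hebrew run (and anything attached to it)
# from being reordered against the surrounding LTR sentence.
sentence = "Saved as " + FSI + shalom + ".txt" + PDI + " just now."
print(sentence)
```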

Far and away the most annoying aspect of this is when initially entering an
RTL phrase of more than one word in an otherwise LTR paragraph. Having the
cursor jump to the right as each space (a non-directional or "neutral" as
Annex 9 calls it) between each RTL word is entered is fun to watch, but
certainly not what a typical user would expect.

That is how bidi works: the space at the end of the paragraph will take the
paragraph direction and be LTR until you insert another RTL character. If
you know any application that does better here, or have a concrete
suggestion for how to improve the situation, please share it (preferably on
the bug tracker).

Annex 9 does not specify this (although I've read some postings suggesting
it does). The relevant section says “Generally, NIs [i.e. neutral and
isolate formatting characters] take on the direction of the surrounding
text. In case of a conflict, they take on the embedding direction.” But, if
the user hasn't yet entered any character beyond the space, there is no
SURROUNDING text - there is only PRECEDING text. The cursor should stay
just where it is unless and until the user enters another LTR character. Of
course this doesn't take into account very unusual needs (where the isolate
formatting characters are needed), but for typical text entry, this is the
most common use case for mixing bidirectional text in a single paragraph.

If there is no surrounding text, it takes the paragraph direction, which is
LTR here. How do you know that the user is going to insert new text? Maybe
his paragraph just ended.

The original poster also mentioned his struggle with placing the period at
the end of a sentence in a normally LTR paragraph containing bidirectional
text that ENDS WITH the non-default directionality. I could almost hear him
screaming, as it took me ages to figure out how to overcome that in Writer,
and it seems that interpreting "surrounding" is the culprit here as well.

Please give a concrete example; I’m unable to follow you here.

I'm not sure if you've ever explained to a non-technical translator how to
insert a zero-width character before, but it can turn into a fascinating
conversation - I'd like to see Writer (and, to be fair, many other apps) be
a bit more intuitive to use in such cases.

As a matter of fact I did, plenty of times. Being a software localizer
myself, I use bidi control characters all the time, and have explained them
to many localizers over the years; they seem to grasp the concept quickly
and appreciate it.

Are you, by the way, the "HarfBuzz" Khaled Hosny, or is that a different
person?

I do contribute occasionally to HarfBuzz.

Regards,
Khaled