Develop hyphenation extension

Dear list members,

I'd like to create a hyphenation dictionary for Church Slavonic
(Church Slavic), ISO code CU. I have ported the existing TeX patterns
using the instructions here:
https://wiki.openoffice.org/wiki/Documentation/SL/Using_TeX_hyphenation_patterns_in_OpenOffice.org

(Are these the correct instructions to use? I have not found anything newer.)

Now, it's not clear to me what I need to do with the resulting .dic
file. The documentation says I need to bundle it as a Dictionary
Extension. But is there documentation on how that needs to be done?
After some digging around, I found this:
https://wiki.openoffice.org/wiki/Extension_Dictionaries

But it doesn't really say what to do. It just gives some XML code.
Some guidance would be appreciated!

Second question. The character commonly used in Church Slavonic for
hyphenation is the underscore, not the hyphen (e.g., hyphe_nation). In
TeX, I can simply set the hyphenchar to be _. Is this possible in
LibreOffice? If yes, where do I specify it?

Thank you for your help!

Aleksandr

Dear list members,

Now, it's not clear to me what I need to do with the resulting .dic
file. The documentation says I need to bundle it as a Dictionary
Extension. But is there documentation on how that needs to be done?

To update, after some hacking around, I was able to bundle and
install the hyphenation patterns. It involved creating a
dictionaries.xcu file and a description.xml file. The simplest way
appears to be to crack open an existing extension with an archive
manager and take a look at how it is organized.
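
A minimal sketch of such a bundle, modeled on the layout that existing
dictionary extensions appear to use. Everything specific here is an
assumption for illustration: the file name hyph_cu.dic, the node name
HyphDic_cu, the locale tag cu-RU, the identifier and the version. The
META-INF/manifest.xml entry is not mentioned above; it is included on
the assumption that the extension needs one to register the .xcu file.

```python
import io
import zipfile

# NOTE: hyph_cu.dic, HyphDic_cu, cu-RU and org.example.hyph.cu are
# hypothetical placeholders, not values from a published extension.
DICTIONARIES_XCU = """<?xml version="1.0" encoding="UTF-8"?>
<oor:component-data xmlns:oor="http://openoffice.org/2001/registry"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    oor:name="Linguistic" oor:package="org.openoffice.Office">
  <node oor:name="ServiceManager">
    <node oor:name="Dictionaries">
      <node oor:name="HyphDic_cu" oor:op="fuse">
        <prop oor:name="Locations" oor:type="oor:string-list">
          <value>%origin%/hyph_cu.dic</value>
        </prop>
        <prop oor:name="Format" oor:type="xs:string">
          <value>DICT_HYPH</value>
        </prop>
        <prop oor:name="Locales" oor:type="oor:string-list">
          <value>cu-RU</value>
        </prop>
      </node>
    </node>
  </node>
</oor:component-data>
"""

DESCRIPTION_XML = """<?xml version="1.0" encoding="UTF-8"?>
<description xmlns="http://openoffice.org/extensions/description/2006">
  <identifier value="org.example.hyph.cu"/>
  <version value="0.1"/>
  <display-name>
    <name lang="en">Church Slavonic hyphenation patterns</name>
  </display-name>
</description>
"""

MANIFEST_XML = """<?xml version="1.0" encoding="UTF-8"?>
<manifest:manifest xmlns:manifest="http://openoffice.org/2001/manifest">
  <manifest:file-entry
      manifest:media-type="application/vnd.sun.star.configuration-data"
      manifest:full-path="dictionaries.xcu"/>
</manifest:manifest>
"""


def build_oxt(dic_bytes: bytes) -> bytes:
    """Return the bytes of an .oxt archive, which is an ordinary zip file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as oxt:
        oxt.writestr("META-INF/manifest.xml", MANIFEST_XML)
        oxt.writestr("description.xml", DESCRIPTION_XML)
        oxt.writestr("dictionaries.xcu", DICTIONARIES_XCU)
        oxt.writestr("hyph_cu.dic", dic_bytes)
    return buf.getvalue()
```

Writing the returned bytes to a file with an .oxt suffix should give
something the Extension Manager can try to install; comparing the result
against an extension that is known to work remains the safest check.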

Second question. The character commonly used in Church Slavonic for
hyphenation is the underscore, not the hyphen (e.g., hyphe_nation). In
TeX, I can simply set the hyphenchar to be _. Is this possible in
LibreOffice? If yes, where do I specify it?

Does anyone know the answer to this question? Can I set the
hyphenation character to be _ instead of -? Maybe in the locale data
files? If not, is this a bug against LO or a bug against Hunspell?

Thanks,

Aleksandr

Hello Aleksandr,

2016-04-03 12:16, Aleksandr Andreev wrote:

Does anyone know the answer to this question? Can I set the
hyphenation character to be _ instead of -? Maybe in the locale data
files? If not, is this a bug against LO or a bug against Hunspell?

I think this belongs to either locale data, as you are suggesting, or
perhaps even to the actual fonts you're using.

The reason why I suspect it might belong to fonts is because there is
only one Unicode codepoint I know of serving this exact purpose (U+00AD
SOFT HYPHEN), and OpenType has a feature called "Localized forms", which
is designed exactly for cases like this (where glyph representation in
particular language is supposed to be different than usual). In
combination, these features seem to provide means to solve your problem.

Rimas

Dear Rimas,

The reason why I suspect it might belong to fonts is because there is
only one Unicode codepoint I know of serving this exact purpose (U+00AD
SOFT HYPHEN), and OpenType has a feature called "Localized forms", which
is designed exactly for cases like this (where glyph representation in
particular language is supposed to be different than usual). In
combination, these features seem to provide means to solve your problem.

You may be right that it belongs to the realm of Font Features
(although that sounds like a terrible design flaw IMHO, given that LO
has no mechanism currently to turn simple OpenType features on and off
IIUC). But it certainly has nothing to do with the Soft Hyphen.

According to the Unicode documentation (p. 268),

Despite its name, U+00AD soft hyphen is not a hyphen, but rather an
invisible format character used to indicate optional intraword breaks.

And on p. 812 of the Standard:

U+00AD soft hyphen (SHY) indicates an intraword break point, where a
line break is preferred if a word must be hyphenated or otherwise
broken across lines. Such break points are generally determined by an
automatic hyphenator. SHY can be used with any script, but its use is
generally limited to situations where users need to override the
behavior of such a hyphenator.

So, the SHY:
* has no visible glyph, despite what some font manufacturers are doing;
* is not a graphic character, but rather a format control character;
* is not supposed to be used by an automatic hyphenator for hyphenation;
* is supposed to be used by a user to *override* the behavior of an
automatic hyphenator.
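
The format-control status can be checked against the Unicode Character
Database. A small sketch using Python's standard unicodedata module (the
helper name is_format_control is made up for this example):

```python
import unicodedata

SHY = "\u00ad"           # U+00AD SOFT HYPHEN
HYPHEN_MINUS = "\u002d"  # U+002D HYPHEN-MINUS


def is_format_control(ch: str) -> bool:
    """True if the character has General Category Cf (format)."""
    return unicodedata.category(ch) == "Cf"


for ch in (SHY, HYPHEN_MINUS):
    # Print the code point, its Unicode name and its General Category.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
```

The soft hyphen is reported with category Cf (format), while the
hyphen-minus is reported as Pd (dash punctuation).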

Cordially,

Aleksandr

Hi Aleksandr,

2016-04-03 16:17, Aleksandr Andreev wrote:

You may be right that it belongs to the realm of Font Features
(although that sounds like a terrible design flaw IMHO, given that LO
has no mechanism currently to turn simple OpenType features on and off
IIUC). But it certainly has nothing to do with the Soft Hyphen.

According to the Unicode documentation (p. 268),

Despite its name, U+00AD soft hyphen is not a hyphen, but rather an
invisible format character used to indicate optional intraword breaks.

And on p. 812 of the Standard:

U+00AD soft hyphen (SHY) indicates an intraword break point, where a
line break is preferred if a word must be hyphenated or otherwise
broken across lines. Such break points are generally determined by an
automatic hyphenator. SHY can be used with any script, but its use is
generally limited to situations where users need to override the
behavior of such a hyphenator.

So, the SHY:
* has no visible glyph, despite what some font manufacturers are doing;
* is not a graphic character, but rather a format control character;
* is not supposed to be used by an automatic hyphenator for hyphenation;
* is supposed to be used by a user to *override* the behavior of an
automatic hyphenator.

I see you've done your homework and a bit more research than I did.
Great! :)

With all the data you shared, I'm even more certain that this belongs to
the locale data, much like quotation characters and number formatting
characters. I'm not sure if this locale property is readily available
for inclusion in locale data though. It might be that Slavonic is a very
rare exception to the common rule of using hyphens for that, and that
this hasn't been accounted for anywhere. At least I couldn't find
anything about this in either the LDML standard or our DTD for
locale definition files
(https://cgit.freedesktop.org/libreoffice/core/plain/i18npool/source/localedata/data/locale.dtd).

Regards,
Rimas

Dear Rimas,

With all the data you shared, I'm even more certain that this belongs to
the locale data, much like quotation characters and number formatting
characters. I'm not sure if this locale property is readily available
for inclusion in locale data though.

A while ago, I created the LO locale XML files for Church Slavonic
("Church Slavic"). Please see here:
https://gerrit.libreoffice.org/#/c/15540/

There was nothing there about hyphenation characters, AFAICT.

It might be that Slavonic is a very
rare exception to the common rule of using hyphens for that, and that
this hasn't been accounted for anywhere.

I don't think it's that uncommon. Unicode includes a number of
script-specific hyphenation characters, for example U+058A Armenian
Hyphen, U+1400 Canadian Syllabics Hyphen, etc. How are users supposed
to use those? Also, some Indic scripts, IIUC, do not use a hyphen
character at all; they just split a word across lines. What if users
are using a legacy codepage where the hyphen is encoded somewhere
other than U+002D? (BTW, strictly speaking U+002D is *not* a hyphen,
and LO should really be using U+2010 for hyphenation). Or what if the
user wants to set some decorative character to be a hyphen?
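
To make the list of candidates concrete, here is a short sketch that
prints the Unicode name and General Category of the characters mentioned
in this thread, again using the standard unicodedata module. The
CANDIDATES list and the describe helper are only illustrative.

```python
import unicodedata

# Characters discussed in this thread as possible hyphenation marks.
CANDIDATES = ["\u002d", "\u2010", "\u058a", "\u1400", "\u005f"]


def describe(ch: str) -> tuple:
    """Return (code point label, Unicode name, General Category)."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


for ch in CANDIDATES:
    print(*describe(ch))
```

The four hyphens all come back with category Pd (dash punctuation); the
underscore, U+005F LOW LINE, is Pc (connector punctuation), so a
hyphenator that only accepts dash-like characters would not offer it.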

IMHO, a hyphenation character should be settable from the user
interface, for example, together with the "Characters at line end" and
"Characters at line begin" in Format -> Paragraph -> Text Flow. It
should not involve having to hack an XML file and rebuild LO from
source.

BTW, despite setting LEFTHYPHMIN and RIGHTHYPHMIN in the hyphenation
dictionary, "Characters at line end" and "Characters at line begin"
cannot be set lower than 2. But Church Slavic uses LEFTHYPHMIN = 1
(Ancient Greek uses both LEFTHYPHMIN and RIGHTHYPHMIN = 1). Is this a
bug? Or a feature?
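
To illustrate what is lost when the minimum is forced to 2, here is a
rough sketch of how left and right minima restrict break positions. It
is not how LO or the hyphenation library implements this; the function
name and parameters are made up, and it counts code points, so combining
diacritics would need extra handling.

```python
def allowed_breaks(word: str, left_min: int, right_min: int) -> list:
    """Return indices i where word[:i] | word[i:] respects both minima.

    A real hyphenator would intersect these positions with the break
    points produced by the patterns; this only shows the minima.
    """
    return [
        i
        for i in range(1, len(word))
        if i >= left_min and len(word) - i >= right_min
    ]


print(allowed_breaks("slovo", 2, 2))  # [2, 3]
print(allowed_breaks("slovo", 1, 2))  # [1, 2, 3]
print(allowed_breaks("slovo", 1, 1))  # [1, 2, 3, 4]
```

With a left minimum of 2, the break after the first letter is never
considered, whatever the patterns say.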

anything about this in either the LDML standard or our DTD for
locale definition files
(https://cgit.freedesktop.org/libreoffice/core/plain/i18npool/source/localedata/data/locale.dtd).

So, I guess as a first step, LO should support changing the hyphen
character in the XML locale files. Or, is this really a Hunspell
issue, and it should be specified from the hyphenation dictionary
extension? Could someone confirm this?

Cordially,

Aleksandr