BASIC StrComp and tricky Unicode (Turkish, for example)

pitonyak · January 1, 2015, 5:51pm

I am updating some of my documentation on using BASIC with LO, and I wondered whether case-insensitive comparisons use locale-specific information. I am in the US with an English locale, and I am not certain how easily I can test this sort of thing. Consider a simple compare such as:

Sub StrCompTest
  ' Third argument 0 requests a case-insensitive compare
  Print StrComp("ı", "I", 0) ' compare Turkish dotless i with upper case i
  Print StrComp("i", "I", 0) ' compare lower case i to upper case i
End Sub

In my locale, this returns -1 and 0. With a Turkish locale, I expect that the first compare will return a 0, but I am unsure how to test this. Any insights into this?

Luuk · January 1, 2015, 7:17pm

I'm not an expert in this, but this page:
http://en.wikipedia.org/wiki/Dotted_and_dotless_I

shows that there are both small and capital versions of the 'Turkish dotless i'.

StrComp("i","I",0) returns 0 only for the small and the capital i,
and not for i's with diacritics (í, ì, ï, ...).

I think the 'Turkish dotless i' should be treated as an i with a diacritic.

pitonyak · January 1, 2015, 10:03pm

The code contains the Turkish dotless i, but the comparison fails (as it should) on my computer. I am hesitant (perhaps lazy) to mess with my computer's locale, since I don't have a good handle on the implications of changing it, or on whether I just need to change it in an LO configuration somewhere. I also don't want to try installing a Turkish version of things, since I would not be able to read anything. Hopefully someone who is sufficiently familiar with these issues (say, because they are Turkish or are simply aware of other examples that will fail in my locale and work in theirs) can speak up. I do appreciate the wiki link, and it does confirm that this should work if my locale were Turkish and if LO deals with it correctly.

Andrew

Villeroy · January 1, 2015, 10:42pm

Hi,

The locale (Tools>Options>LanguageSettings>Languages>Locale) can be
changed at will without changing a single byte in your documents. This
is a default setting only for numerals. Numerals with no explicitly
set locale will be displayed differently. Same in Basic, where the locale
is used for type conversions between strings and decimals/dates/times. I
cannot see any effect on Basic StrComp.

Urmas · January 1, 2015, 10:41pm

"Andrew Douglas Pitonyak":

With a Turkish locale, I expect that the first compare will return a 0, but I am unsure how to test

With a virtual machine? I could try it sometime tomorrow.

Tom_Davies1 · January 1, 2015, 10:50pm

Hi
How about sending the question to the

L10n@Global.LibreOffice.Org

mailing list to try to reach the international translators. It is very
quiet on that list at the moment and has been for over a week.
Regards from
Tom

doug11 · January 2, 2015, 3:48am

If you have the Compose key set up, you can make a Turkish capital I with a dot on it
easily: type Compose .I (that's Compose, then period, then capital I): İ. It's a little
hard to see here, but it's correct. If you don't have a Compose key, look for your
keyboard setup in your system, and set up a Compose key. If you have a modern
keyboard with a right Windows key, that would be a good one to choose. I have an old
IBM Model M, no Win keys, so I chose right Alt.

--doug

pitonyak · January 2, 2015, 4:00pm

Zeki, I think that I managed to sort this out!

I resorted to reading the source code... The short answer is that the application's language setting determines how the compare is done for a case-insensitive compare!

Also, based on what I read in the source code, I figured out how the locale is determined and used. I created a new section in my documentation that shows a few examples and indicates how to change the settings for testing. The actual snippet of code that does this is as follows:

  LanguageType eLangType = Application::GetSettings().GetLanguageTag().getLanguageType();
  pTransliterationWrapper->loadModuleIfNeeded( eLangType );
  nRetValue = pTransliterationWrapper->compareString( rStr1, rStr2 );

Yep, it pulls the value from the application language settings, so that if I set that to Turkish for the application, I can test in Turkish.

Print StrComp("ı", "I", 0) ' compare Turkish dotless i with upper case i
Print StrComp("i", "I", 0) ' compare lower case i to upper case i

With the application language set to Turkish, this returns 0 and 1. If I change back to English (US), it returns -1 and 0.

I will be pushing a change to my web site shortly, complete with updated code examples (in section 7.3 of OOME) and an explanation.

Thanks to all for the pointers and thanks to Jim Caughran for the exchange that then had me thinking about this.

Mark_Bourne · January 3, 2015, 6:01pm

Andrew Douglas Pitonyak wrote:

Yep, it pulls the value from the application language settings, so that
if I set that to Turkish for the application, I can test in Turkish.

Print StrComp("ı", "I", 0) ' compare Turkish dotless i with upper case i
Print StrComp("i", "I", 0) ' compare lower case i to upper case i

returns 0 and 1. If I change back to English US, this evaluates as -1, 0.

I think in Turkish "I" is used as the upper-case dotless ı, and there's a separate character for the upper-case dotted i (which presumably should compare equal to the lower-case dotted i in that locale). Perhaps U+0130 "İ".
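
As a rough illustration of that pairing, here is a minimal C++ sketch of a case-insensitive equality test whose case folding depends on the language. This is not LibreOffice's implementation: foldCase, equalIgnoreCase and checkReportedResults are hypothetical names made up for this example, and the folding covers only ASCII letters plus the four i letters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a LibreOffice function): fold one code point to
// lower case. Only ASCII letters and the Turkish i letters are handled.
char32_t foldCase(char32_t c, bool turkish)
{
    if (turkish) {
        if (c == U'I')      return U'\u0131'; // I folds to dotless ı
        if (c == U'\u0130') return U'i';      // İ folds to dotted i
    }
    if (c >= U'A' && c <= U'Z')
        return static_cast<char32_t>(c - U'A' + U'a');
    return c; // everything else is left unchanged
}

// Hypothetical stand-in for a case-insensitive equality test. It only answers
// "equal or not"; the sign StrComp returns for unequal strings depends on
// ordering rules that this sketch does not model.
bool equalIgnoreCase(const std::u32string& a, const std::u32string& b,
                     bool turkish)
{
    if (a.size() != b.size())
        return false;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (foldCase(a[i], turkish) != foldCase(b[i], turkish))
            return false;
    return true;
}

// Checks matching the equal/unequal results reported in this thread, plus
// the i/İ pairing suggested above.
void checkReportedResults()
{
    const std::u32string dotlessLower = U"\u0131"; // ı
    const std::u32string dottedUpper  = U"\u0130"; // İ

    // English-style folding: i equals I, but ı does not.
    assert(equalIgnoreCase(U"i", U"I", false));
    assert(!equalIgnoreCase(dotlessLower, U"I", false));

    // Turkish folding: ı equals I, i does not, and i pairs with İ.
    assert(equalIgnoreCase(dotlessLower, U"I", true));
    assert(!equalIgnoreCase(U"i", U"I", true));
    assert(equalIgnoreCase(U"i", dottedUpper, true));
}
```

Calling checkReportedResults() from a main() runs the checks. In this model, "equal" corresponds to StrComp returning 0 and "not equal" to a non-zero result.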

Mark.

pitonyak · January 4, 2015, 1:08am

You are correct, and it seems that LO handles these just fine. I am told that it will not handle all cases well (it seems that there are a few cases in German that may not be correctly handled; for example, comparing characters with accents to characters without). I expect some clarification on this in the next week so that I can add more detail to a new section, where I am attempting to add clarity so that others do not need to read the code and rely on trial and error to know what to expect. Oh, and perhaps file a bug report if the behavior is just wrong or inconsistent.