Help text for MIDB

Jesper_Hertel · January 19, 2015, 1:16pm

The help text for MIDB (
https://help.libreoffice.org/Calc/Text_Functions#MIDB) says this:

"

MIDB

Returns a text string of a DBCS text. The parameters specify the starting
position and the number of characters.

Syntax

MIDB("Text"; Start; Number_bytes)

Text is the text containing the characters to extract.

Start is the position of the first character in the text to extract.

Number_bytes specifies the number of characters MIDB will return from text,
in bytes.

Example

=MIDB("office";2;2) returns ff.

"

But "office" is not a string written in a double byte character set (DBCS,
https://en.wikipedia.org/wiki/DBCS), so the example is not helpful for the
main use of the function.

There should primarily be an example with a string in a language that
actually uses a double byte character set (DBCS), like Chinese. And then
the example would show that only 1 character is returned when asking for 2
bytes, i.e. the number of characters returned will be *half* of the number
of bytes asked for. It should also be noted that if you ask for 3 bytes,
you get 1 character, etc.

The given example only shows the rather special case when you are *not*
giving the function a DBCS string; in this case the number of characters
returned is the *same* as the number of bytes.

Secondly, the sentence "Number_bytes specifies the number of characters
MIDB will return from text, in bytes" is not very clear, because the number
does *not* specify the number of characters. The fact is that if you feed
the function a string in a single byte character set (SBCS,
https://en.wikipedia.org/wiki/SBCS), such as "office", the number of bytes
is the *same* as the number of characters. If you feed the function a
string in a double byte character set (DBCS), such as a string of Chinese
characters, the number of bytes is *double* the amount of characters.
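To make the byte/character relationship concrete, here is a rough Python sketch of the counting rule described above. It is only a model, not LibreOffice's actual implementation: the helper name `dbcs_len` is made up, and using `unicodedata.east_asian_width` to decide which characters count as double-byte is an assumption for illustration.

```python
import unicodedata


def dbcs_len(text: str) -> int:
    """Count 2 bytes per wide (DBCS-style) character and 1 byte otherwise."""
    total = 0
    for ch in text:
        # Assumption: East Asian "Wide"/"Fullwidth" characters are the
        # double-byte ones; everything else counts as a single byte.
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        total += 2 if wide else 1
    return total


print(dbcs_len("office"))  # 6 bytes for 6 characters
print(dbcs_len("中国"))     # 4 bytes for 2 characters
```

Under that assumption, the byte count equals the character count for "office" and is double the character count for "中国".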

I don't know if this is the right place to report this problem.

The problem is probably also known already, and I kind of expect to get the
reply "oh yeah, we know, the help is a big mess and really needs
improvement.". But maybe I am wrong, so I am reporting it anyway.

Jesper

suokunlong · January 19, 2015, 2:33pm

A1 = "中国"
B1 = MIDB(A1,1,1) returns ""
B1 = MIDB(A1,1,2) returns "中"
B1 = MIDB(A1,1,3) returns "中"
B1 = MIDB(A1,1,4) returns "中国"

I think it is better up to the localizer to translate this help text according to their needs, for example Japanese team may show how this works with Japanese chars.

Kevin Suo

On January 19, 2015 at 9:16:23 PM GMT+08:00, Jesper Hertel <jesper.hertel@gmail.com> wrote:

Jesper_Hertel · January 19, 2015, 3:12pm

> A1 = "中国"
> B1 = MIDB(A1,1,1) returns ""
> B1 = MIDB(A1,1,2) returns "中"
> B1 = MIDB(A1,1,3) returns "中"
> B1 = MIDB(A1,1,4) returns "中国"

Thanks for the examples, Kevin! I was afraid they wouldn't go through the
mailing list system, so that was why I didn't supply any. But yours are even
better than the ones I would have thought of providing.

> I think it is better up to the localizer to translate this help text
> according to their needs, for example Japanese team may show how this works
> with Japanese chars.

I agree that the specific translation is up to the localizers. But even
people using a non-DBCS user interface language, such as English or Danish,
could want to use that function and could want to know what it is about and
how to use it; they could work with Japanese or another DBCS language
without having the user interface in that language. So I still believe the
English text could be improved. Both regarding the earlier mentioned
sentence and regarding the addition of several actual DBCS examples similar
to the good ones you provided. Maybe just worded and expanded like this to
show that the position argument is also counted in bytes and not in
character positions:

MIDB("中国",1,1) returns "" (1 byte is only half a character and it is
therefore discarded).
MIDB("中国",1,2) returns "中" (2 bytes are one complete character).
MIDB("中国",1,3) returns "中" (3 bytes are one character and a half; the last
byte is discarded).
MIDB("中国",1,4) returns "中国" (4 bytes are two complete characters).
MIDB("中国",2,1) returns "" (byte position 2 is not at the beginning of a
character).
MIDB("中国",2,2) returns "" (byte position 2 is not at the beginning of a
character).
MIDB("中国",3,1) returns "" (byte position 3 is at the beginning of a
character, but 1 byte is only half a character and is therefore discarded).
MIDB("中国",3,2) returns "国".

And yes, I do believe that this rather large number of examples is
necessary to make it completely clear how this rather technical function
works, and that the Help should be the place for such an explanation.

Whether my explanations in parentheses are understandable or relevant I
don't know. They are an attempt to explain what is happening to
not-so-technical users, but also to technical users who want to be
sure they have understood it correctly.

Jesper

stanislav.horacek · January 19, 2015, 8:31pm

Hi,

I agree that these examples are really useful. Could you also provide some examples for the other functions dealing with DBCS (LEFTB, RIGHTB, LENB)?
If so, I will add them to the Help text.

Thanks!
Stanislav

On 19.1.2015 at 16:11, Jesper Hertel wrote:

Jesper_Hertel · January 20, 2015, 1:26am

Hi Stanislav and others,

Here are my suggestions for examples for MIDB, LEFTB, RIGHTB and LENB.

I actually made a spreadsheet in LibreOffice Calc and tested each
expression to be absolutely sure of the results. The spreadsheet I made can
be found at [1]. I made it using the English (US) user interface and locale.

[1]: http://www49.zippyshare.com/v/YbkWBbkZ/file.html

It turned out that invalid requests (half DBCS characters) actually do
*not* result in empty strings but rather in a *space character*.

Therefore these suggested examples and explanations.

The return values are the *actual* return values using the actual mentioned
expressions and were therefore *not* typed by hand (check the spreadsheet
if you want to see how). Note the rather subtle spaces returned.

MIDB("中国",1,0) returns "" (0 bytes is always an empty string).
MIDB("中国",1,1) returns " " (1 byte is only half a DBCS character and
therefore the result is a space character).
MIDB("中国",1,2) returns "中" (2 bytes constitute one complete DBCS
character).
MIDB("中国",1,3) returns "中 " (3 bytes constitute one and a half DBCS
characters; the last byte results in a space character).
MIDB("中国",1,4) returns "中国" (4 bytes constitute two complete DBCS
characters).
MIDB("中国",2,1) returns " " (byte position 2 is not at the beginning of a
character in a DBCS string; 1 space character is returned).
MIDB("中国",2,2) returns "  " (byte position 2 points to the last half of
the first character in the DBCS string; the 2 bytes asked for therefore
constitute the last half of the first character and the first half of the
second character in the string; 2 space characters are therefore returned).
MIDB("中国",2,3) returns " 国" (byte position 2 is not at the beginning of a
character in a DBCS string; a space character is returned for byte
position 2).
MIDB("中国",3,1) returns " " (byte position 3 is at the beginning of a
character in a DBCS string, but 1 byte is only half a DBCS character and a
space character is therefore returned instead).
MIDB("中国",3,2) returns "国" (byte position 3 is at the beginning of a
character in a DBCS string, and 2 bytes constitute one DBCS character).
MIDB("office",2,3) returns "ffi" (byte position 2 is at the beginning of a
character in a non-DBCS string, and 3 bytes of a non-DBCS string constitute
3 characters).

LEFTB("中国",1) returns " " (1 byte is only half a DBCS character and a
space character is returned instead).
LEFTB("中国",2) returns "中" (2 bytes constitute one complete DBCS
character).
LEFTB("中国",3) returns "中 " (3 bytes constitute one and a half DBCS
characters; the last character returned is therefore a space character).
LEFTB("中国",4) returns "中国" (4 bytes constitute two complete DBCS
characters).
LEFTB("office",3) returns "off" (3 non-DBCS characters each consisting of
1 byte).

RIGHTB("中国",1) returns " " (1 byte is only half a DBCS character and a
space character is returned instead).
RIGHTB("中国",2) returns "国" (2 bytes constitute one complete DBCS
character).
RIGHTB("中国",3) returns " 国" (3 bytes constitute one half DBCS character
and one whole DBCS character; a space is returned for the first half).
RIGHTB("中国",4) returns "中国" (4 bytes constitute two complete DBCS
characters).
RIGHTB("office",3) returns "ice" (3 non-DBCS characters each consisting of
1 byte).

LENB("中") returns "2" (1 DBCS character consisting of 2 bytes).
LENB("中国") returns "4" (2 DBCS characters each consisting of 2 bytes).
LENB("office") returns "6" (6 non-DBCS characters each consisting of
1 byte).
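For anyone who wants to play with these rules outside Calc, the behaviour listed above can be approximated with a small Python model. This is only a sketch that reproduces the tested results, not the way LibreOffice implements the functions: the names `midb`, `leftb`, `rightb`, `lenb` and the helpers are made up here, and treating East Asian "Wide"/"Fullwidth" characters as the double-byte ones is an assumption.

```python
import unicodedata


def _cells(text):
    """Expand text into one (char, role) cell per byte position."""
    cells = []
    for ch in text:
        # Assumption: wide/fullwidth characters occupy two byte positions.
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            cells += [(ch, "lead"), (ch, "trail")]
        else:
            cells.append((ch, "single"))
    return cells


def _render(cells):
    """Turn a slice of byte cells back into text; half characters become spaces."""
    out = []
    i = 0
    while i < len(cells):
        ch, role = cells[i]
        if role == "single":
            out.append(ch)
        elif role == "lead" and i + 1 < len(cells) and cells[i + 1][1] == "trail":
            out.append(ch)  # both halves are present: emit the whole character
            i += 1          # skip the trailing half
        else:
            out.append(" ")  # orphaned half of a double-byte character
        i += 1
    return "".join(out)


def midb(text, start, number_bytes):
    """Model of MIDB: start is a 1-based byte position (assumed >= 1)."""
    return _render(_cells(text)[start - 1:start - 1 + number_bytes])


def leftb(text, number_bytes):
    return midb(text, 1, number_bytes)


def rightb(text, number_bytes):
    cells = _cells(text)
    return _render(cells[max(len(cells) - number_bytes, 0):])


def lenb(text):
    return len(_cells(text))


print(repr(midb("中国", 1, 3)))   # '中 '
print(repr(rightb("中国", 3)))    # ' 国'
print(lenb("office"))            # 6
```

The model slices by byte position first and only then decides what to print, which is why a request that cuts a character in half yields a space rather than an empty string.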

If anyone else is curious, "中国" means China in Chinese – according to
Google Translate :-).

Jesper

suokunlong · January 20, 2015, 12:19pm

On 2015-01-20 at 09:25, "Jesper Hertel" <jesper.hertel@gmail.com> wrote:

> Here are my suggestions for examples for MIDB, LEFTB, RIGHTB and LENB.

Good job!

> If anyone else is curious, "中国" means China in Chinese – according to
> Google Translate :-).

Google Translate is 100% right.

Kevin Suo

naruoga · January 20, 2015, 1:32pm

Hi, Kevin, Jesper, *

Sorry, I haven't followed this whole discussion, so just a short comment.

Basically, Japanese characters can be expressed in two bytes, like
Chinese ones, but some Japanese characters take 4 bytes (a so-called
"surrogate pair") rather than two, such as "𠀋" (U+2000B).

I know it's a trivial example:

A1 = "𠀋𠀋"
B1 = MIDB(A1,1,1) returns ""
B1 = MIDB(A1,1,2) returns "(*)"
B1 = MIDB(A1,1,3) returns "(*)"
B1 = MIDB(A1,1,4) returns "𠀋"
B1 = MIDB(A1,1,5) returns "𠀋(*)"
B1 = MIDB(A1,1,6) returns "𠀋(*)"
B1 = MIDB(A1,1,7) returns "𠀋(*)"
B1 = MIDB(A1,1,8) returns "𠀋𠀋"

(*) stands for a special character indicating that the font has no glyph for that code point.
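For readers unfamiliar with surrogate pairs, this small Python check shows why "𠀋" takes 4 bytes in UTF-16 while "中" takes 2. It only illustrates the encoding; that Calc's *B functions count bytes this way is an assumption suggested by the results above, and the helper names are made up.

```python
def utf16_bytes(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-16."""
    return len(text.encode("utf-16-le"))


def utf16_units(text: str) -> int:
    """Number of 16-bit code units (a surrogate pair counts as two)."""
    return utf16_bytes(text) // 2


print(utf16_bytes("中"), utf16_units("中"))                    # 2 1
print(utf16_bytes("\U0002000B"), utf16_units("\U0002000B"))  # 4 2
print(utf16_bytes("\U0002000B" * 2))                         # 8
```

Cutting such a character after 2 bytes leaves a lone surrogate, which would explain the "no glyph" placeholder in the results above.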

I wonder if HELP should describe such a detail, though.

Regards,

Yury_Tarasievich · January 20, 2015, 1:42pm

I won't pretend I understood the Chinese and Japanese cases; however, it seems to me that ALL of this, or at least the most representative parts, should go into the help, in all languages, possibly not under the specific Basic function but in some separate subsection ("Handling multi-byte encodings"?).
This shouldn't be considered a "duplicate" of the relevant standards, but an explanation of what is actually implemented in LO.

...

> I wonder if HELP should describe such a detail, though.

-Yury

Jesper_Hertel · January 20, 2015, 3:46pm

> On 2015-01-20 at 09:25, "Jesper Hertel" <jesper.hertel@gmail.com> wrote:
>> Here are my suggestions for examples for MIDB, LEFTB, RIGHTB and LENB.
> Good job!

Thanks!

>> If anyone else is curious, "中国" means China in Chinese – according to
>> Google Translate :-).
>
> Google Translate is 100% right.

stanislav.horacek · January 21, 2015, 5:16pm

Thanks a lot!

I've submitted the suggestions to Gerrit, where anyone is welcome to comment on them:
https://gerrit.libreoffice.org/#/c/14092/

Stanislav

On 20.1.2015 at 02:25, Jesper Hertel wrote: