Pootle doesn't recognize the difference between c, z, s and the Croatian diacritics č, ž, š

Hi,

I'm facing a rather strange bug (?) in Pootle. If I put 'citat' in the search box, Pootle returns words like 'čitati', and for 'čitati' it also returns 'citat'. It happens with 'š', 'ž', 'č' and 'ć'. For the non-existent word 'moze' it returns 'može', which is an actual word but not what I searched for.

Seems like it's converting the diacritics to 'c', 'z', 's' internally.

I hadn't noticed this before, but it might not be new. Usually I search for two or three words, so I wasn't able to notice it because of the additional context.

It's not a big deal, but instead of two or three results I'm getting fifty.

Does it happen with other languages?

I guess it's not easy to make Pootle cope well with every existing language, as it might get resource intensive?

Thanks,

Kruno

And it seems that 'đ' is recognized correctly. Maybe because that letter is used in other languages?

Kruno

On 08.04.2017 at 18:16, Krunose wrote:

Michael Wolf wrote:

Yes, it's ASCII. It exists in Icelandic and Faroese. \u00D0 and \u00F0 (hexadecimal).
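One possible explanation for why 'č', 'š', 'ž' and 'ć' get folded while 'đ' survives can be sketched in Python. This is only an assumption for illustration; nothing in this thread confirms what Pootle or its database actually does. Accent-insensitive matching often decomposes characters and drops the combining marks, and Croatian 'đ' has no such decomposition. The helper names `strip_accents` and `describe` are made up for this sketch. It also prints the code points, which shows that Croatian 'đ' (U+0111) and Icelandic 'ð' (U+00F0) are different characters and that neither is ASCII.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Hypothetical folding step: NFD-decompose, then drop combining marks.

    This is NOT Pootle's code, just a common way accents get lost in search.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    # Letters with a caron or acute decompose into base letter + mark,
    # so they collapse to plain c, s, z; 'đ' does not decompose.
    for word in ("čitati", "može", "šećer", "đak"):
        print(word, "->", strip_accents(word))

    # Code points of the letters discussed in this thread.
    for ch in "čćšžđð":
        kind = "ASCII" if ord(ch) < 128 else "non-ASCII"
        print(ch, describe(ch), kind)
```

If a search backend applied a step like this to both the query and the indexed text, 'citat' and 'čitati' would match each other while 'đ' would still be matched exactly, which is the behaviour described above.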

On 08.04.2017 at 19:42, Michael Wolf wrote:

Krunose wrote:

Seems like it's converting the diacritics to 'c', 'z', 's' internally.

Yes, it's true. I translate into Upper and Lower Sorbian; they are Slavic languages as well.

Thanks for confirming this.

Those characters are passed to the URL as 'c', 's' and 'z'. Maybe it can be fixed with percent-encoding or something?

Kruno

Then it affects Serbian, Bosnian, Montenegrin and Slovenian, and possibly some other Slavic languages.

Kruno

On 08.04.2017 at 19:56, Michael Wolf wrote:

Krunose wrote:

No, HTML entities probably wouldn't work. The letter 'đ' is not passed like that. I don't know if they can fix that easily.

This letter works for me using the Alt+numeric 240 method (208 is upper case) on Windows 10 in three Pootle projects: Mozilla, LO and Pootle 2.8.0. I tested it with Icelandic.

I don't understand character encodings at all, but if you're referring to 'đ', it appears to be passed plainly as 'đ' to the URL from Pootle. I was wondering if 'č', 'š', 'ć' and 'ž' could be passed percent-encoded to improve Pootle's search functionality?

Hope someone will give us an answer.

Kruno

On 08.04.2017 at 20:11, Michael Wolf wrote:

Krunose wrote:

Then it affects Serbian, Bosnian, Montenegrin and Slovenian, and possibly some other Slavic languages.

Yes, e.g. the Sorbian languages, Polish, Czech, Slovak.

Michael

Let's wait and see what happens :smiley:

Kruno

On 08.04.2017 at 21:21, Michael Wolf wrote:

I filed a bug:

https://github.com/translate/pootle/issues/6238

Michael

On 08.04.2017 at 21:54, Krunose wrote:

I'll probably leave a comment later to bring the heat.

Now that I think about it, it's not just about strings being passed incorrectly to the URL from the search; it's more complicated than that, so I'll stop playing Sherlock Holmes here.

But I kinda doubt it's easy to fix. We'll see...

And thanks for the quick reaction! :slight_smile:

Kruno

As I suspected, it seems that the search query _is_ percent-encoded for 'čitati', as Firebug in Firefox shows

...search=%C4%8Ditati

as what is passed to GET, and %C4%8D should be the percent-encoding of 'č', so something else is wrong.
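That encoding can be double-checked with a few lines of Python. This is just a minimal sketch using the standard library's `urllib.parse`, assuming the browser encodes the query as UTF-8; the helper names `encode_query` and `decode_query` are made up for the example. It only confirms what the URL carries and says nothing about what Pootle does with the term afterwards.

```python
from urllib.parse import quote, unquote


def encode_query(term: str) -> str:
    """Percent-encode a search term as UTF-8 bytes (no characters left 'safe')."""
    return quote(term, safe="")


def decode_query(encoded: str) -> str:
    """Reverse the percent-encoding, interpreting the bytes as UTF-8."""
    return unquote(encoded)


if __name__ == "__main__":
    for term in ("čitati", "citat", "može"):
        encoded = encode_query(term)
        # Round trip: the diacritic survives encoding and decoding intact.
        print(term, "->", encoded, "->", decode_query(encoded))
```

Since 'č' round-trips through %C4%8D unchanged, the URL itself keeps 'čitati' and 'citati' distinct, which fits the conclusion that the folding happens somewhere later.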

I'll definitely leave a comment on that bug report.

Kruno