Chopping UTF-8
While researching a very weird bug0 in Koha I had to figure out a way to chop a string to a specific maximum length. In bytes and not in characters, because in this case the horrible format USMARC is used, whose spec starts with two red flags: It's from January 2000, and it's an "implementation of the American national standard", so you can bet that it only works (well) with ASCII and will be ... interesting when handling Unicode. And it's generally broken for longer strings1.
Bytes vs. Characters
I assume you know the difference between bytes and characters (or code points). If not: read this!
As you probably haven't read the link, here's a very short summary:
- UTF-8 uses a variable number of bytes to store letters as zeros and ones. Older formats (like ASCII) used a fixed length (e.g. one byte = 8 bits), but could therefore only represent a limited number of letters.
- "Regular" letters (if you're talking about English) take 1 byte.
- But we can also handle letters beyond English ones. And we have "letters" like 🚲, which takes 4 bytes. See a lot of details here.
- Usually, if a regular person talks about the length of a string, they count letters: "Hello" has 5 letters. "Hello🚲" has 6 letters.
- Unfortunately, computers sometimes need to know the number of bytes (e.g. to store the data somewhere). "Hello" takes 5 bytes. "Hello🚲" takes 9 bytes. (The snippet right after this list lets you check this.)
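If you want to verify those numbers yourself, here's one way to do it, using length() and encode_utf8 from Encode, both of which we'll meet again below (the command line flags are explained in footnote 2):
~$ perl -C -Mutf8 -MEncode -E 'for my $s ("Hello", "Hello🚲") { printf "%s: %d characters, %d bytes\n", $s, length($s), length(encode_utf8($s)) }'
Hello: 5 characters, 5 bytes
Hello🚲: 6 characters, 9 bytes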
How long is my string?
So let's compare "I love to ride my bicycle" with "I ♥ to ride my 🚲".
~$ perl -C -Mutf8 -MEncode -E 'say q{"I love to ride my bicycle" vs. "I ♥ to ride my 🚲"}'
"I love to ride my bicycle" vs. "I ♥ to ride my 🚲"
The possibly weird command line flags are explained in this footnote2.
By default, Perl counts characters, using length():
~$ perl -C -Mutf8 -MEncode -E 'say length("I love to ride my bicycle")'
25
So 25 characters.
~$ perl -C -Mutf8 -MEncode -E 'say length("I ♥ to ride my 🚲")'
16
And now only 16.
If we want to count bytes (which is something you usually should not do, because the computer should handle it for you), we need to use bytes::length():
~$ perl -C -Mutf8 -MEncode -E 'say bytes::length("I love to ride my bicycle")'
25
Also 25 bytes, because each of the basic English letters takes one byte.
~$ perl -C -Mutf8 -MEncode -E 'say bytes::length("I ♥ to ride my 🚲")'
21
But the fancy Unicode version takes 21 bytes, because ♥ takes 3 bytes and 🚲 takes 4, so (3 - 1) + (4 - 1) = 5 bytes more than characters.
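To double-check that math, we can ask for the byte count of each interesting character on its own:
~$ perl -C -Mutf8 -MEncode -E 'printf "%s needs %d byte(s)\n", $_, length(encode_utf8($_)) for "e", "♥", "🚲"'
e needs 1 byte(s)
♥ needs 3 byte(s)
🚲 needs 4 byte(s)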
Stuff it into limited space
Now say we only have 20 bytes to store this string, so we need to chop off some of the text, using substr():
~$ perl -C -Mutf8 -MEncode -E 'say substr("I love to ride my bicycle", 0, 20)'
I love to ride my bi
Works!
If we do that on the fancy Unicode version:
~$ perl -C -Mutf8 -MEncode -E 'say substr("I ♥ to ride my 🚲", 0, 20)'
I ♥ to ride my 🚲
We get the same string back! Because, by default, Perl counts characters. And the string only has 16 characters. But 21 bytes. Which is 1 byte too many. So we need to use bytes::substr():
~$ perl -C -Mutf8 -MEncode -E 'say bytes::substr("I ♥ to ride my 🚲", 0, 20)'
I ⥠to ride my ð
What's that garbage?
As we're leaving the safe and padded space of regular Perl string handling by calling into bytes, we unfortunately have to be a bit more careful and know about some Perl internals (which I will handwave a bit): Perl tracks whether a string contains UTF-8. If you call bytes, it sort of forgets that the string contains UTF-8 and outputs raw bytes. But we want Perl to interpret the string returned from bytes::substr as UTF-8, so we explicitly have to tell it via decode_utf8 (imported from Encode). I find the name decode_utf8 a bit confusing, but it helps (me) to think about it like this: decode this string of bytes (or octets, as the docs say), which we know to contain UTF-8, into a Perl string of characters. So:
~$ perl -C -Mutf8 -MEncode -E 'say decode_utf8(bytes::substr("I ♥ to ride my 🚲", 0, 20))'
I ♥ to ride my �
Yay, we have the ♥ back!
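If you want to peek behind the handwaving: utf8::is_utf8() reports whether Perl has its internal "this string contains UTF-8" flag set. (Note that this is an implementation detail you normally shouldn't base program logic on; the "flag on" / "flag off" labels here are just mine.)
~$ perl -C -Mutf8 -MEncode -E 'my $s = "I ♥ to ride my 🚲"; say utf8::is_utf8($s) ? "flag on" : "flag off"; say utf8::is_utf8(encode_utf8($s)) ? "flag on" : "flag off"'
flag on
flag off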
Another way to do this (so we don't have to use bytes) is to pass substr the byte representation of the string, which we can generate using encode_utf8. Which is again a bit confusing, unless you think about it like this: take this Perl string and encode it into a bunch of bytes using UTF-8:
~$ perl -C -Mutf8 -MEncode -E 'say decode_utf8(substr(encode_utf8("I ♥ to ride my 🚲"), 0, 20))'
I ♥ to ride my �
Same, but a tiny bit nicer!
Anyway, we have the ♥ back. But the resulting string is a bit ugly, because it ends with "�". As we've chopped off a few bytes of the 🚲, we do indeed end up with an invalid symbol, which is rendered as �.
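If you're curious what exactly got chopped: 🚲 is encoded as the four bytes f0 9f 9a b2, and our 20-byte cut keeps only the first three of them. A hex dump of the chopped byte string shows the dangling f0 9f 9a at the end:
~$ perl -C -Mutf8 -MEncode -E 'say unpack "H*", substr(encode_utf8("I ♥ to ride my 🚲"), 0, 20)'
4920e299a520746f2072696465206d7920f09f9a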
But how long is this mangled string?
~$ perl -C -Mutf8 -MEncode -E 'say bytes::length(bytes::substr("I ♥ to ride my 🚲", 0, 20))'
20
Yay, 20 bytes, now it fits!
Stuff it into limited space, but without trailing garbage
Now maybe we don't like that � at the end. There's a nice flag you can pass to decode_utf8() that I learned about while working with/against the original bug: Encode::FB_QUIET.
~$ perl -C -Mutf8 -MEncode -E 'say decode_utf8(substr(encode_utf8("I ♥ to ride my 🚲"),0,20),Encode::FB_QUIET)'
I ♥ to ride my
No more �! Because FB_QUIET tells decode_utf8 to stop quietly at the first invalid byte and return only what could be decoded so far, instead of inserting a replacement character.
And how long is it?
~$ perl -C -Mutf8 -MEncode -E 'say bytes::length(decode_utf8(substr(encode_utf8("I ♥ to ride my 🚲"),0,20),Encode::FB_QUIET))'
17
Even shorter, because now the string does not contain any bicycle parts :-)
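If you need this more than once, the whole dance can be wrapped into a little helper. Here's a minimal sketch (the function name is mine, and this is not what Koha does):
use strict;
use warnings;
use utf8;
use feature qw(say);
use Encode qw(encode_utf8 decode_utf8);

# Chop $string so that its UTF-8 encoding fits into $max_bytes bytes,
# dropping any partial character at the cut.
sub chop_to_bytes {
    my ($string, $max_bytes) = @_;
    my $bytes = encode_utf8($string);    # characters -> octets
    return $string if length($bytes) <= $max_bytes;
    # FB_QUIET: stop at the first invalid byte instead of inserting �
    return decode_utf8(substr($bytes, 0, $max_bytes), Encode::FB_QUIET);
}

binmode STDOUT, ':encoding(UTF-8)';
say chop_to_bytes("I ♥ to ride my 🚲", 20);    # I ♥ to ride my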
All the strings
For fun, here are all the substrings from 1 byte up to the full length of the original string, without any "half" / invalid characters. You can see quite nicely that 🚲 takes a few steps before it can be rendered completely:
~$ perl -C -Mutf8 -MEncode -E 'my $s= "I ♥ to ride my 🚲"; for my $l (1 .. bytes::length($s)) { say decode_utf8(substr(encode_utf8($s),0,$l),Encode::FB_QUIET)}'
I
I
I
I
I ♥
I ♥
I ♥ t
I ♥ to
I ♥ to
I ♥ to r
I ♥ to ri
I ♥ to rid
I ♥ to ride
I ♥ to ride
I ♥ to ride m
I ♥ to ride my
I ♥ to ride my
I ♥ to ride my
I ♥ to ride my
I ♥ to ride my
I ♥ to ride my 🚲
Footnotes
0 When searching for one specific word, the request crashed with a 500 error. This was / is caused (I think) by Koha storing a rather long string in a byte-limited field and chopping off some data. And one document that could be found by that specific word happened to have an umlaut (which needs two bytes to store) in a place where those two bytes would be chopped in half so the long string fits into the bytes. Resulting in a broken UTF-8 character. Which caused the site to explode when it tried to render the search result. The quick fix was to add one space to the data, so the umlaut was completely outside the shortened string. Very ugly, but it worked. Not sure if I should be proud or ashamed of this "fix".
1 See this Koha bug. The real problem is that USMARC uses an int with 4 digits to store the size of a field, followed by 5 digits for the offset. So the max size for one record is 99999 bytes and the max size for one field in that record is 9999 bytes. But MARC::Record does not check the actual size, so if you store data in a field that's 10009 bytes long starting at offset 00100, the matching part of the metadata/header will be 1000900100. If you later try to parse this record, the first 4 digits will be used as the length to read, i.e. 1000 bytes (instead of 10009 bytes), and the offset will be 90010 instead of 00100, thus reading in complete crap (or nothing, if the whole record is smaller).
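To make the overflow concrete, here's a toy illustration (not actual MARC::Record code) of what happens when the 4-digit length field overflows:
use strict;
use warnings;
use feature qw(say);

# "%04d" is a *minimum* width, so a 5-digit length simply overflows
my $entry = sprintf "%04d%05d", 10009, 100;
say $entry;                   # 1000900100
say substr($entry, 0, 4);     # 1000  - read back as the length, should be 10009
say substr($entry, 4, 5);     # 90010 - read back as the offset, should be 00100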
2 Command line flags explained:
- -C is short for -CSDL, which turns on UTF-8 for various input and output streams (STDOUT etc). See perldoc perlrun.
- -Mutf8 loads the module utf8 and is the same as saying use utf8; in the source code. This tells Perl that the source code itself contains UTF-8 characters. Like 🚲...
- -MEncode loads Encode.
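For example, here's what happens if you drop the -C: Perl warns about printing a character outside Latin-1 (on a UTF-8 terminal the output may still happen to look right, because Perl falls back to printing the raw UTF-8 bytes):
~$ perl -Mutf8 -E 'say "I ♥ to ride my 🚲"'
Wide character in say at -e line 1.
I ♥ to ride my 🚲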