/ domm

I hack Perl for fun and profit

20.09.2018: Longest unicode name

At our last Vienna.pm meeting at Cafe Else we talked about brian's recent blog post on the new Unicode 10 emojis in Perl 5.28. I noticed that some of them have ridiculously long names:

🤭 (U+1F92D) SMILING FACE WITH SMILING EYES AND HAND COVERING MOUTH

So I wondered which code point has the longest description.

Spoiler

It's this:

ﯹ (U+FBF9) ARABIC LIGATURE UIGHUR KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA ISOLATED FORM

The first try

Getting the longest name is easy if you have a list of all code points. But getting this list is hard(ish).

Daxim suggested a script named unichars which is included in Unicode::Tussle and I hacked up this:

cpanm Unicode::Tussle
perl unichars | \
  perl -n -E 'chomp; @a = split(/\s+/,$_,6); say length($a[-1])." ".$a[-1]' | \
  sort -n

This is in fact buggy, but it returns the right result by accident. And it's a bit too shell-y for my taste.

So we got our answer, and continued talking about other stuff. But the next day I wanted to find some better solutions:

Brute force

perl -Mcharnames=viacode -E 'foreach(0..0x10FFFF){$n=charnames::viacode($_);
say $f=$n if length($n)>length($f)}'

Just iterate through the whole Unicode code space (which, like real space, is huge and mostly empty), get each character name, and print it if it's the longest we've seen so far. This script runs for a long time, but it reaches the correct answer fairly early (at 0xFBF9, and it's a long way from there to 0x10FFFF).
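For readers who prefer a script over a one-liner, here is a minimal sketch of the same brute-force idea. The helper name longest_name_in_range is made up for this example, and the demo call only scans the Arabic Presentation Forms-A block (0xFB50 to 0xFDFF) so it finishes quickly. The result of a full scan depends on the Unicode version bundled with your perl.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use charnames ();    # load charnames for viacode() without importing anything

# Hypothetical helper (not part of any module): return the code point and
# name of the longest character name in the inclusive range $from..$to.
sub longest_name_in_range {
    my ($from, $to) = @_;
    my ($best_cp, $best_name) = (undef, '');
    for my $cp ($from .. $to) {
        my $name = charnames::viacode($cp);
        next unless defined $name;    # viacode returns undef for unnamed code points
        ($best_cp, $best_name) = ($cp, $name)
            if length($name) > length($best_name);
    }
    return ($best_cp, $best_name);
}

# Demo: scan one block only. Pass (0, 0x10FFFF) for the full brute-force run.
my ($cp, $name) = longest_name_in_range(0xFB50, 0xFDFF);
printf "U+%04X %d %s\n", $cp, length($name), $name;
```

Because the comparison is a strict "longer than", the first code point with the maximum length wins, just like in the one-liner.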

So here's a smarter (but longer) solution:

Use the metadata

Unicode::UCD provides access to some Unicode metadata. Of interest to me is charblocks, which returns the names and the code point ranges of all blocks:

'Mahajani' => [
   [
     69968,
     70015,
     'Mahajani'
   ]
 ]
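To get a feeling for that structure, here is a small sketch that flattens it into one sorted list of ranges and prints them. The helper block_ranges is a name invented for this example; only charblocks itself comes from Unicode::UCD.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charblocks);

# Hypothetical helper: charblocks() returns a hash ref mapping each block
# name to a list of [start, end, name] ranges. Flatten that into a plain
# list of ranges, sorted by start code point.
sub block_ranges {
    my ($blocks) = @_;
    my @ranges = map { @$_ } values %$blocks;
    return sort { $a->[0] <=> $b->[0] } @ranges;
}

for my $range (block_ranges(charblocks())) {
    my ($start, $end, $name) = @$range;
    printf "U+%04X..U+%04X %s\n", $start, $end, $name;
}
```

Sorting by start code point is optional, but it makes the listing independent of hash ordering.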

Here's a slightly golfed version that does not take too long (but skips the various Private Use blocks):

perl -MUnicode::UCD=charblocks -E 'for(values charblocks->%*){next if $_->[0][2]=~/Private/;for $c($_->[0][0]..$_->[0][1]){$n=charnames::viacode($c);my $l=length($n);if($l>$lf){$f=$n;$lf=$l;say "$l $n $c"}}}'

Here it is again with some line breaks:

perl -MUnicode::UCD=charblocks -E '
  for (values charblocks->%*) {
    next if $_->[0][2]=~/Private/;
    for $c($_->[0][0]..$_->[0][1]) {
      $n=charnames::viacode($c);
      my $l=length($n);
      if ($l>$lf) {
        $f=$n;
        $lf=$l;
        say "$l $n $c"
      }
    }
  }'

I'm sure this can be golfed quite a bit more (feel free to post your version).

Due to hash randomization, the exact output will change each time, but might look something like this:

32 KANNADA SIGN SPACING CANDRABINDU 3200
35 JAPANESE INDUSTRIAL STANDARD SYMBOL 12292
36 REVERSED DOUBLE PRIME QUOTATION MARK 12317
43 VERTICAL KANA REPEAT WITH VOICED SOUND MARK 12338
54 VERTICAL KANA REPEAT WITH VOICED SOUND MARK UPPER HALF 12340
58 PRESENTATION FORM FOR VERTICAL LEFT TORTOISE SHELL BRACKET 65081
59 PRESENTATION FORM FOR VERTICAL RIGHT TORTOISE SHELL BRACKET 65082
60 PRESENTATION FORM FOR VERTICAL LEFT BLACK LENTICULAR BRACKET 65083
61 PRESENTATION FORM FOR VERTICAL RIGHT BLACK LENTICULAR BRACKET 65084
75 ARABIC LETTER BEH WITH THREE DOTS POINTING UPWARDS BELOW AND TWO DOTS ABOVE 1875
78 CLOCKWISE RIGHTWARDS AND LEFTWARDS OPEN CIRCLE ARROWS WITH CIRCLED ONE OVERLAY 128258
83 ARABIC LIGATURE UIGHUR KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA ISOLATED FORM 64505
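If you want the same list of records on every run, one option is to visit the blocks in code point order instead of hash order. The following is only a sketch: name_length_records is a made-up helper, and the demo restricts itself to blocks with "Arabic" in the name to stay fast. Passing qr/./ would scan every block except the Private Use ones.

```perl
use strict;
use warnings;
use charnames ();
use Unicode::UCD qw(charblocks);

# Hypothetical helper: walk all blocks whose name matches $pattern in code
# point order and return one [length, name, code point] entry for every
# name that is longer than all names seen before it.
sub name_length_records {
    my ($pattern) = @_;
    my @ranges = sort { $a->[0] <=> $b->[0] }
                 grep { $_->[2] =~ $pattern && $_->[2] !~ /Private/ }
                 map  { @$_ } values %{ charblocks() };

    my @records;
    my $longest = 0;
    for my $range (@ranges) {
        for my $cp ($range->[0] .. $range->[1]) {
            my $name = charnames::viacode($cp);
            next unless defined $name;    # skip unnamed code points
            next unless length($name) > $longest;
            $longest = length($name);
            push @records, [ $longest, $name, $cp ];
        }
    }
    return @records;
}

# Demo: same "length name codepoint" format as the one-liner above.
print join(' ', @$_), "\n" for name_length_records(qr/Arabic/);
```

Unlike the golfed version, this also handles a block that consists of more than one range, because it flattens all ranges before scanning.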

Happy golfing!
