and or AND and ElasticSearch, or: Case matters

Today I spent a few hours hunting a weird bug. I got a report that people entering data for the Steiermärkische Landesbibliothek were not able to find duplicate entries when adding new books. In theory, Koha should search through all the existing data and, if it finds a match, present the user with a dialog asking whether they want to reuse the existing entry.

This worked when an ISBN was entered, but not when "only" title, author and some other data were available. To make things a bit more complicated, we're using the non-default ElasticSearch backend for searching, as opposed to the old-school Zebra [0] index.

So I dug through the source code, added some strategic Data::Dumper statements to get the actual query sent to ElasticSearch, and then played a bit with that query, sending it directly to ElasticSearch via curl. The best way to do this (IMO) is to store the query in a file and pass it with -d @filename. The @ tells curl to read the request body from that file; without it, curl sends the literal file name as the body:

curl 'http://localhost:9200/biblios/_search?pretty' -X GET -H 'Content-Type: application/json' -d @query.json

Here's the query.json (which was not returning any results):

{
  "query": {
    "query_string": {
      "query": "(author:'Schwartz, Randal J' and title:'Einführung in Perl')",
      "default_operator":"AND",
      "type": "cross_fields",
      "analyze_wildcard": true, "fuzziness": "auto", "lenient": true
    }
  }
}

Note that this search uses the query_string query, where you pass a semi-complex query as a string instead of composing a very deeply nested data structure [1]:

(author:'Schwartz, Randal J' and title:'Einführung in Perl')
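If you'd rather script this kind of experiment than juggle curl and a file, here is a minimal Python sketch using only the standard library. The helper names build_body and search are made up for illustration. The URL and index name are taken from the curl call, and the request is sent as a POST, which the _search endpoint also accepts. The op parameter is the word placed between the two clauses, so you can try different spellings of it.

```python
import json
import urllib.request

# URL and index name as used in the curl call; adjust for your own setup.
ES_URL = "http://localhost:9200/biblios/_search?pretty"


def build_body(author, title, op):
    """Build a query_string request body shaped like query.json.

    build_body is a hypothetical helper, not part of Koha.
    """
    query = f"(author:'{author}' {op} title:'{title}')"
    return {
        "query": {
            "query_string": {
                "query": query,
                "default_operator": "AND",
                "type": "cross_fields",
                "analyze_wildcard": True,
                "fuzziness": "auto",
                "lenient": True,
            }
        }
    }


def search(body, url=ES_URL):
    """Send the body to ElasticSearch and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Reproduces the query string from query.json.
    body = build_body("Schwartz, Randal J", "Einführung in Perl", "and")
    print(json.dumps(body, ensure_ascii=False, indent=2))
    # search(body) needs a running ElasticSearch instance, so it is not called here.
```

Printing the body before sending it makes it easy to compare against what Data::Dumper shows on the Koha side.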

After some fiddling and testing I found that removing "default_operator":"AND" yielded results (though not very good ones).

So I took my problem to the Koha IRC channel, where kidclamp provided the needed clue after some back and forth:

Case matters

When using query_string, ElasticSearch interprets the string "AND" as a boolean operator linking the literal values in the query. But it interprets the string "and" as a literal value!

So:

  • "foo AND bar" finds documents that contain foo and bar.
  • "foo and bar" treats and as a third search term, next to foo and bar!
  • To combine those three terms, ElasticSearch checks default_operator, which in our case was set to AND, thus only finding documents that contained the string "and" in addition to what we were actually searching for.
  • Removing default_operator lets ElasticSearch fall back to its default, OR, so we now found stuff, but very crappy stuff (anything with "and" in any field, not only in title or author).

None of these results were really usable!
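To see the effect without a running cluster, here is a toy model in Python. It is emphatically not ElasticSearch's parser: it only encodes the rule described above (uppercase AND and OR are operators, everything else is a search term) and makes the simplifying assumption that one operator applies to the whole query. The documents and the names matches and search are invented for illustration.

```python
# Toy model only: this is NOT how ElasticSearch parses queries internally.
OPERATORS = {"AND", "OR"}  # case-sensitive on purpose


def matches(query, doc_words, default_operator="OR"):
    """Return True if the set doc_words satisfies the query under the toy rules."""
    tokens = query.split()
    terms = [t for t in tokens if t not in OPERATORS]
    explicit = [t for t in tokens if t in OPERATORS]
    # Simplification: the first explicit operator applies to the whole query;
    # without one, default_operator joins all terms.
    operator = explicit[0] if explicit else default_operator
    hits = [term in doc_words for term in terms]
    return all(hits) if operator == "AND" else any(hits)


# Made-up documents, each reduced to a set of words.
docs = {
    "both": {"foo", "bar"},
    "all_three": {"foo", "and", "bar"},
    "only_and": {"and", "baz"},
}


def search(query, default_operator="OR"):
    """Return the sorted names of all toy documents matching the query."""
    return sorted(
        name for name, words in docs.items()
        if matches(query, words, default_operator)
    )


print(search("foo AND bar"))                          # ['all_three', 'both']
print(search("foo and bar", default_operator="AND"))  # ['all_three']
print(search("foo and bar"))                          # ['all_three', 'both', 'only_and']
```

The second call mirrors our bug: the document that contains just foo and bar is missed because it lacks the word "and". The third call mirrors what happened after removing default_operator: anything containing "and" shows up.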

So the real fix was to patch Koha to use an uppercase AND to construct this query. See Koha Bug #30153 for the gory details [2]. Thanks to kidclamp for helping me locate the problem and to the Koha devs for quickly applying and testing my patch!

Footnotes

0 Don't ask!

1 I used to joke that ElasticSearch devs get paid by the amount of tab indentation needed to express a search query.

2 Or not so gory:

-        my $op = 'and';
+        my $op = 'AND';