A small DB change and a lot of problems

A short story about how a small DB change made me question Google Cloud, Postgres, and my sanity.

SPOILER: It was all my fault!

A simple change

Recently we had a very strange problem with a Postgres database running in Google Cloud: a rather trivial change to the DB schema, which took less than a second on my laptop, did not finish even after several minutes. Worse, a service using this database stopped working, which is why I aborted the change.

You see, all I wanted to do was to add a new column domain to a table called app, which is where we store info about brands/apps/customers (naming things is hard..) that access a white-labeled service:

ALTER TABLE winxle.app ADD domain text;
UPDATE winxle.app SET domain = 'aaa.example.com' WHERE name = 'aaa';
UPDATE winxle.app SET domain = 'bbb.example.com' WHERE name = 'new-customer';
ALTER TABLE winxle.app ALTER domain SET NOT NULL;
CREATE UNIQUE INDEX unique_app_domain ON winxle.app USING btree (upper(domain));

While we had always planned to white-label this product, and thus put some preliminary white-labeling capabilities into our database, we only recently got a second customer. So now we have to make sure all parts of the product are actually white-label-enabled, and we also need to update our white-labeling metadata, hence the change above.

Microservices & Cloud Hosting make everything so much easier, they said...

Before splitting the product into microservices (or maybe medium-services?) and moving the whole application into Google Cloud, such changes would be applied in less than a second on the live system. Now they generally take longer, or do not work at all, as in this case. This made me assume that something was wrong with how Google hosts Postgres, which, by the way, is not very transparent and has some weird corner cases (e.g. try importing your data...).

I tried the change a few times over the course of a few days; it never worked. It's a tiny table (containing three rows), but a lot of other database objects depend on it. I finally tried to run the five statements not in one transaction but separately. Still no success, but a service crashed again, and this time quite hard. So I did the classic IT thing: I shut down the DB cluster and turned it back on again. And suddenly the change could be applied. Well, at least the ADD COLUMN and the UPDATEs; the SET NOT NULL again took ages, so I aborted it.

I was starting to get nervous and assumed that the table might somehow have gotten corrupted. So we decided to move the cluster from Postgres 9.6 to 11 (which is available as a 'beta' in Google Cloud; no idea what that means, but well...). The migration went smoothly (yay!), but the change still could not be applied. So I again let the issue rest for a few days and started to think about maybe moving away from the Google-hosted Postgres to hosting it ourselves (in Kubernetes, uff...).

Meanwhile, in a related bug

A few days later, during a deployment to our Kubernetes cluster, I did a kubectl get pods to see if the new version was already up. And there I spotted something strange:

winxle-api-765d5fb759-qsrr5                       Running   0      3h
winxle-frontend-cd7d648bd-dj8mv                   Running   0      3h
winxle-scripts-daily-stats-cd7d648bd-dj8mv        Running   0      2d
winxle-merchant-api-759c57b8b5-5dmfb              Running   0      3h

winxle-scripts-daily-stats has been running for 2 days! WTF? This pod contains a cronjob that runs once a day for ~2 minutes to collect some global stats and report them to InfluxDB. Or should run for ~2 minutes...

So I ran git log lib/Winxle/Scripts/DailyStats.pm, which returned:

commit 1f64d9f688abb518790b08c65ec7db4273bb68c0
Author: Thomas Klausner <domm@plix.at>
Date:   Mon Aug 26 16:22:47 2019 +0200

add app_id to daily_stats, index_campaigns and import_awin

Let's diff that: git diff 1f64d9f688abb518790b08c65ec7db4273bb68c0^ -- lib/Winxle/Scripts/DailyStats.pm

-WITH participators AS (
+WITH
+appid AS (SELECT id FROM winxle.app WHERE id = ?),
+participators AS (
     SELECT distinct usr
-    FROM winxle.usr_campaign_participation
-    WHERE created > now() - interval '3 years'
+    FROM winxle.usr_campaign_participation p, winxle.usr u, appid
+    WHERE
+            p.created > now() - interval '3 years'
+        AND p.usr = u.id
+        AND u.app = appid.id
 ),

It's a bit hard to see in this diff, but I changed this:

WITH participators AS (
    SELECT distinct usr
    FROM winxle.usr_campaign_participation
    WHERE created > now() - interval '3 years'
),

to this:

WITH
appid AS (SELECT id FROM winxle.app WHERE id = ?),
participators AS (
    SELECT distinct usr
    FROM winxle.usr_campaign_participation p, winxle.usr u, appid
    WHERE p.created > now() - interval '3 years'
          AND
          u.app = appid.id
),

Now, this SQL fragment is part of a larger query using CTEs (but that's not the problem). The real problem is the commit date: Mon Aug 26 16:22:47 2019. That week I was on holiday in Latvia after PerlCon.

Don't deploy on Friday (before you leave for a conference and 2 weeks of holidays)

You see, during PerlCon (while I was in Latvia) we had a presentation in Switzerland. For this presentation we wanted to show the customer a prototype of the now white-label-enabled application. So I sort of rushed out a quick prototype release and went on holiday. The presentation worked, yay (cf. the schema change, if you still remember the start of this post). But after the presentation (and during my holiday) we (well, users...) discovered that the two test campaigns had somehow seeped into the website of the other customer. Ouch!

This happened because we had not added the app check to all our code. So I quickly (yeah, bad idea) added the missing checks to make sure that each customer (app) only sees data belonging to itself. And I also altered the above query so it reports the stats per app.

SQL 101

But (I blame my "working" during the holidays) I made the most basic SQL mistake there is:

    SELECT distinct usr
    FROM winxle.usr_campaign_participation p, winxle.usr u, appid
    WHERE p.created > now() - interval '3 years'
          AND
          u.app = appid.id

I SELECT from three tables, but only provide join conditions for two of them!

So now Postgres calculates the cross product of ALL users (quite a lot) with ALL campaign participations (a very big lot!), resulting in a hell of a lot of rows. This seems to take a while (a few days, or maybe until the heat death of the universe). Hence the long-running cron job. Of course this is quickly and easily fixed:

    SELECT distinct usr
    FROM winxle.usr_campaign_participation p, winxle.usr u, appid
    WHERE
            p.created > now() - interval '3 years'
        AND p.usr = u.id
        AND u.app = appid.id
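
For what it's worth, a plain EXPLAIN would have exposed the mistake immediately: without the p.usr = u.id condition the planner has no choice but to build a cross join, and the estimated row count explodes. A minimal sketch (the exact plan of course depends on your schema and data):

EXPLAIN
SELECT DISTINCT p.usr
FROM winxle.usr_campaign_participation p, winxle.usr u
WHERE p.created > now() - interval '3 years';
-- Look for a join node without any join condition (a "Nested Loop" with no
-- "Join Filter") and a row estimate of roughly rows(p) * rows(u).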

And then it hits me!

The ALTER TABLE command needs an exclusive lock (ACCESS EXCLUSIVE) on the app table. But the broken query (running for days!) holds a read lock (ACCESS SHARE) on the table (with a bit of hand-waving; see the Postgres documentation on explicit locking for details). That's why the ALTER TABLE never finished; in fact, it never started. And because Postgres is awesome, it queues lock requests in order, so the waiting ALTER TABLE blocks all further reads on the table (until it is done), which is why the service stopped working.
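
In a situation like this, Postgres lets you inspect the lock queue directly. A rough sketch using pg_blocking_pids() (available since 9.6), showing which sessions are stuck waiting and who is blocking them:

SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       wait_event_type,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;

Had I run this back then, it would presumably have shown my ALTER TABLE waiting on the PID of the runaway stats query.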

And this also explains why the schema change nearly worked after the cluster reboot: the broken query had not been started yet, so the first part of the change went through. But then Kubernetes helpfully noticed that the cron job was long overdue (because it had never finished...) and started it again, which again took the table lock, and so on.

A massive facepalm later...

After being amazed at my own stupidity, at the complexity of the whole system, and at this nice interaction between Postgres table locks and Kubernetes (and also very happy that I had figured it out), I started cleaning up. The SQL was easily fixed and deployed, but the schema change still did not work, because Postgres is very thorough (or stubborn?).

When you abort a query (e.g. by killing the pod that contains the job that issued it), Postgres only realizes that the client has gone away when it tries to send some data to it. Unfortunately, in my case the server never got as far as sending any data (because it was still busy multiplying two very large sets), so the query was not stopped. Luckily, Postgres provides a lot of metadata and helpful functions to handle such situations.

pg_cancel_backend

You just need to call pg_cancel_backend( PID ) to cancel the query currently running on the connection with the given PID. So how do we get this PID?

From pg_stat_activity:

SELECT pid,age(now(),query_start) FROM pg_stat_activity;
   pid   |          age           
---------+------------------------
 1134889 | 00:00:00
 1134797 | 00:00:00.009823
 1134804 | 00:00:00.015204
 1134794 | 00:00:00.110826
 1134565 | 00:00:00.138034

We can use this to cancel the old queries:

SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE age(now(), query_start) > '1 day';
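
One caveat: pg_cancel_backend only sends a cancel request for the current query. If a backend does not react (or you also want to get rid of the connection itself), pg_terminate_backend is the bigger hammer and terminates the whole backend process. Something like this, restricted to active queries so idle connections are left alone:

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND age(now(), query_start) > '1 day';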

And we're done!

Finally I could apply my schema change!
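
One possible safeguard for the next such migration: running the DDL with a lock_timeout, so it fails fast instead of silently queuing behind a runaway query (and blocking everyone else while it waits). A minimal sketch, assuming the change runs in one transaction:

BEGIN;
SET LOCAL lock_timeout = '5s';
ALTER TABLE winxle.app ADD domain text;
-- ... rest of the change ...
COMMIT;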

Lessons learned

  • It's never Lupus^w the library!
  • Don't rush deployments!
  • Don't hack during holidays!
  • While we have automatic preview branches via gitlab, our DB setup is not up to this yet. We should really get this done..
  • Postgres is awesome, and I'm too stupid (from time to time..) to use it correctly.
  • Complex systems have complex problems. Unfortunately, sometimes we do need complex systems...