The Advisory Boar

By Abhijit Menon-Sen <>

Managing schema changes

Yesterday, I happened upon this video from PGCon 2012 of David Wheeler talking about his schema management tool named Sqitch, and thence also discovered depesz's Versioning scripts.

I have wrestled with schema migrations for many years, so I found David's presentation very interesting. Sqitch (without a "u") has many compelling features. For example, you can "sqitch deploy --untracked" to test a change you haven't committed, then revert to the last committed revision before you edit or commit the change. depesz's scripts are less magical, but offer similar capabilitites.

In particular, one thing is common to both systems: the schema exists in the repository as a number of interdependent changes, each of which must have a name (whether the names are artificial or assigned by the user is immaterial; some kind of tag is required for dependency resolution). To create the whole schema, you have to assemble the pieces in order, and to see the whole schema, you have to look at the database. The database is the canonical representation of the data model.

I prefer to think of my schema as a part of my source code, so I keep a complete version in a text file (or files), presented in the order that I want to explain it, with comments in the right places, and not leave that responsibility to "pg_dump -s".

What difference does it make?

  • ✓ Anyone can look at the source code for the schema and understand it in the "preferred form for modifications".
  • ✗ Making changes means writing an upgrade snippet, and downgrade snippet, and changing the main file.
  • ✓ Creating a new instance of the database always means feeding a small number of files to psql. No need to build a big schema up step by step.
  • ✗ Testing changes becomes harder—the upgrade/downgrade scripts are tested immediately, but in practice the main files is tested only on the next from-scratch deployment.
  • ✓ There is no need for dependency management or complex ordering between changes. Deploying needs no cleverness, only psql. Maybe a little shell script.
  • ✗ The natural way to represent a series of changes is with the numbered files that David so hates, and every number is an incipient merge conflict.

How these points stack up against each other depends on the situation. For example, a single web service may care less about deploying from scratch than an installable package. If the schema changes frequently, the testing overhead may outweigh other considerations. A project with one or two developers may not have to worry about numbering conflicts, and so on. Being able to read through the schema ranks highly for me.

I'll write about our approach to schema management in Archiveopteryx later.

Reading table data into an array

Suppose you have a table where each row represents a widget, and another table in which zero or more rows may refer to each widget, representing a set of values for some property of that widget. For example:

create table widgets (
    widget_id serial primary key,
    widget_name text unique

create table widget_flags (
    widget_id integer not null references widgets,
    flag_name text unique not null,
    flag_value integer not null,
    flag_colour text not null

Given these tables, it's easy to fetch all of a widget's flags given its id. Joining the two tables using widget_id would allow the widget's own data to be fetched with its flags, but it would result in as many rows for a widget as it had flags. I've often wished I could fetch all the flag data within a single row.

For various (well-known) reasons, it's a bad idea to store data in array columns, but using an array aggregate to read from the widget_flags table would be a neat way to stuff many rows' worth of data into a single column. For example, one could do something like:

> select widget_id,array_agg(flag_colour) from widget_flags
> group by widget_id;
 widget_id |    array_agg     
         1 | {blue,red,green}
(1 row)

But that's just one column, and what I really wanted was an array of arrays representing the values in each row. Because of various quirks of PostgreSQL's array implementation, I haven't found a truly satisfactory solution, but the definition of array_accum in the documentation semeed a reasonable starting point:

CREATE AGGREGATE array_accum (anyelement)
    sfunc = array_append,
    stype = anyarray,
    initcond = '{}'

To create an array of arrays, I changed anyelement argument to anyarray, and changed the sfunc from array_append to array_cat. The stype remained unaltered. Unfortunately, that wasn't quite enough. A query like "select aaa(array[a,b,c])" would return an array of values, not arrays. The only way to fix that was to change the initcond to '{{0,0,0}}', which meant that the first dummy element had to be removed afterwards.

CREATE FUNCTION allbutdummy(anyarray) RETURNS anyarray
AS $$select $1[2:array_upper($1,1)]$$ LANGUAGE 'SQL';

CREATE AGGREGATE aaa (anyarray)
    sfunc = array_cat,
    stype = anyarray,
    finalfunc = allbutdummy,
    initcond = '{{0,0,0}}'

With these definitions, I can do something like this:

> select widget_id,widget_name,
> (select aa(array[flag_name,flag_colour,flag_value::text])
>  from widget_flags where widget_id=w.widget_id) as flags
> from widgets w;
 widget_id | widget_name |               flags                
         1 | foo         | {{a,blue,3},{b,red,4},{c,green,5}}
(1 row)

Unfortunately, this method has two major limitations. First, arrays in Postgres must contain values of the same type, so one is forced to cast some values to maintain compatibility (flag_value has been cast to text in the example above). A more serious problem is that the three-element array specified in the initcond forces every accumulated array to have exactly three elements. I have found no better way to work around this than to create a family of aggregate functions: aaa4, aaa5, ….

Thanks to RhodiumToad on IRC for helping me work out the details of this hack.

Returning records from PL/PgSQL

I needed to write a plpgsql function that inserted rows into some tables and returned the generated primary key values to the caller. I had never needed to return multiple values from a function before, and I ran into a couple of new problems along the way.

This was my first attempt:

create function create_widget(name text, colour int) returns record as $$
    wid int;
    cid int;
    ids record;
    insert into widgets (name) values (name)
        returning widget_id into wid;
    insert into widget_colours (widget_id, colour)
        values (wid, colour)
        returning colour_id into cid;
    select wid::int, cid::int into ids;
    return ids;
$$ language plpgsql;

When I tried this, I got a syntax error which I eventually understood to mean that my parameters could not have the same names as any columns in the table I was inserting into. Not very surprising, since the inserts look a bit ambiguous even when reading the code. Since parameter names occur only in the function definition and aren't exposed to the caller, I didn't mind working around this by renaming my parameters "p_name" and "p_colour".

What did surprise me was that I had to call the function like this:

select * from create_widget('x', 3) as cw(widget_id int, colour_id int);

In other words, the caller always had to supply a column definition list, and I couldn't (see any way to) specify what I was going to return in the function definition.

I was given two suggestions about how to work around this problem. The first was to use OUT parameters:

create function create_widget(name text, colour int,
                              widget_id OUT int, colour_id OUT int)

But this approach, predictably enough, had the same problem with the names of parameters conflicting with the names of columns, and I was much less willing to force the caller to retrieve and have to use columns named "r_widget_id" or something similar.

With Postgres 8.4 and above, I learned that I could also declare the function as returning a table, which looked exactly like what I had hoped for:

create function create_widget(name text, colour int)
    returns table (widget_id int, colour_id int)

But to my great disappointment, the conflict with the column names persists even in this form.

In the end, I decided to use "p_" prefixed parameter names, return a record, and name the columns in the caller.