Category: SQL

I recently spent a bit of time messing around with the Postgres wire protocol. I’d like to build a library that lets any application accept connections as if it were a Postgres database; many database clients and databases speak the Postgres wire protocol, and it seems like a practical way to expose tabular data over the network.

Implementing the protocol isn’t too bad; the docs aren’t great but the protocol itself is relatively straightforward. Unfortunately, client behaviour seems harder to accommodate. Clients can issue arbitrarily complex SQL queries to the database to inspect its schema, and sometimes they expect accurate answers. For example, Npgsql issues this beast and then disconnects if it gets an invalid response.

So I’m stuck with a question: how far should I go to support legitimate client behaviour outside of the protocol? The easiest approach for now is to test with a few popular clients and hardcode responses for any queries that they need implemented, but that will leave a long tail of unsupported clients. On the other end of the complexity scale, I could try running schema queries on an in-memory SQLite instance with database objects that mimic Postgres (certainly a massive yak shave, but tempting nonetheless).

I think my next step will be to dig through the source of a few Postgres-compatible databases to see how they solve this.

Relational Database History

1983 called. They said SQL needs a makeover

I recently read 2 papers that I’d highly recommend if you ever wonder about the underpinnings of modern relational databases:

What Goes Around Comes Around (Stonebraker and Hellerstein, 2005)

This is an opinionated history of database architectures from roughly 1970 to 2005, and it’s got some serious credentials behind it – Michael Stonebraker was the driving force behind Postgres and many other things. It’s an easy read, and it’s a good explanation of how relational DBs and SQL came to (mostly) rule the world. It was written amidst a lot of hype over XML databases, and it (rightly, in retrospect) critiques XML DBs and suggests that they won’t see much commercial adoption.

One interesting bit: Stonebraker and Hellerstein do not like SQL, and they claim that it became dominant largely because of IBM’s dominance in the marketplace during the 1980’s. Which leads us to the next item…

A Critique of the SQL Database Language (Chris Date, 1983)

This was cited approvingly in the 1st paper as a “scathing critique of the semantics of SQL”, which caught my eye - virtually every database supports SQL, how bad can it be?

I won’t summarize the entire thing, it’s a laundry list of complaints, but this resonated with me:

Notice that it is just table-names that appear in the FROM clause. Completeness suggests that it should be table-expressions (as Gray puts it, “anything in computer science that is not recursive is no good”).

…a simple table-reference (i.e. a table-name) ought to be just a special case of a general table-expression

To give a (trivial) example, say we want to union 2 tables and extract a column. We’d write that like this in SQL:

SELECT EMPLOYEE_ID FROM EMPLOYEES
UNION
SELECT EMPLOYEE_ID FROM CONTRACTORS

But if table names were just special cases of table expressions, we could write it like this instead:

SELECT EMPLOYEE_ID FROM (EMPLOYEES UNION CONTRACTORS)

SQL gets the job done, and it could be a lot worse - but it is a little sad that many of the issues academics have known about for 35+ years still exist today. It’s a reminder of how entrenched something that’s only “good enough” can become.

Quick T-SQL Regression Testing

Using CHECKSUM() and CHECKSUM_AGG()

Here’s something I’ve found useful in SQL Server, but it should apply in any DBMS with checksum functions. Next time you’re refactoring some database objects and you want to do some regression testing, give the CHECKSUM (MSDN) and CHECKSUM_AGG (MSDN) functions a try.

CHECKSUM and CHECKSUM_AGG

They behave pretty much as you would expect; CHECKSUM returns 1 checksum given 1 row, and CHECKSUM_AGG is an aggregate function that returns 1 checksum given multiple checksum rows. Between the two of them, you can get a checksum for any arbitrary collection of data:

-- Returns 2 rows with 2 columns
SELECT Field1, Field2
FROM (VALUES ('foo','bar'),('baz','foo')) TempTable(Field1,Field2)

-- Returns 2 rows with 1 checksum column
SELECT CHECKSUM(Field1,Field2)
FROM (VALUES ('foo','bar'),('baz','foo')) TempTable(Field1,Field2)
 
-- Returns 1 row with 1 checksum column
SELECT CHECKSUM_AGG(CHECKSUM(Field1,Field2))
FROM (VALUES ('foo','bar'),('baz','foo')) TempTable(Field1,Field2)

Regression testing

These help with a common problem: when you’re refactoring a database object, how can you be sure that you haven’t changed the existing functionality? Sometimes it’s enough to spot check a few inputs, but often the logic is complex enough that this does not provide a high level of confidence. With CHECKSUM and CHECKSUM_AGG, you can quickly check large numbers of test cases.

For example, say we have a TVF called MyDateTVF that takes in a date parameter and returns columns Column1 and Column2. I want to refactor this function and then test that the functionality is unchanged for every date in 2018. If we have a table called DateList which contains every date (more useful than you might expect in a DB with a lot of temporal aspects), I can simply run the following before and after my change:

SELECT CHECKSUM_AGG(CHECKSUM(MDT.Column1, MDT.Column2))
FROM DateList DL
CROSS APPLY (
    SELECT Column1, Column2 from dbo.MyDateTVF(DL.Date)
) MDT
WHERE DL.Date BETWEEN '2018-01-01' AND '2018-12-31'

If that query returns the same checksum before and after my change, we can rule out any regressions for those inputs. Nice and easy!

headshot

Cities & Code

Top Categories

View all categories