Writing Legible SQL

There are a lot of ways to write SQL, and everyone has their own style. This post describes how I write SQL, and why. I think that as devolopers gain more experience writing SQL, they gradually shift towards a similar style.

As far as a guide as to best practices for naming conventions, I agree with this article by Sehrope Sarkuni. I think everything in it should be treated as a rule, with two exceptions:

  1. PascalCase vs snake_case: I think it’s better to keep it consistant with your environment. nd tooling. So that usually means snake_case for Postgres with Python, but PascalCase for SQL Server and .NET.
  2. Using ‘id’ as the primary key instead of ‘tablename_id’ or ‘TableNameId’: If every table has an ‘id’ column, joins get just a little bit more confusing, and you run the risk of accidentally making the wrong join, and it will not be obvious. a.person_id = b.person_id looks clean and is easy to understand. Also, when you do joins and want to output the ids, you don’t have to manually rename each column to describe it and prevent column name collisions.

A Simple Query

I’m stealing a query from another one of my articles. The goal of it is to return all employees who have a higher salary than their boss. The next few queries will all have the exact same logic, produce the same result set, and run at the same speed, but which is easiest to read?

Select name, Salary, a.Name, a.Salary from Employee join Employee a 
on a.EmployeeId = Employee.BossId where Employee.Salary > a.Salary;
SELECT a.name EmployeeName, a.Salary EmployeeSalary, b.Name BossName, b.Salary BossSalary
FROM Employee a JOIN Employee b ON b.EmployeeId = a.BossId
WHERE a.Salary > b.Salary;
	emp.name as EmployeeName,
	emp.Salary as EmployeeSalary,
	boss.Name as BossName,
	boss.Salary as BossSalary
	Employee emp
	join Employee boss
		on boss.EmployeeId = emp.BossId
	emp.Salary > boss.Salary;

I could easily add three more variations between a messy one liner and the last query. There are no rules to formatting SQL.

At first glance, some people might think that there are a lot of line breaks. And it’s true. But all the line breaks do is turn horizontal space into vertical space. They make it way easier to parse the SQL with your eyes, and each part of the query is clearly separated.

In the last query, at a glance, it’s clear how many columns we’re selecting, how many tables we’re using, what the table names are, what their aliases are, and what we’re filtering for. Modifying the query is also easy.

Properly format any query that will be used in more than one session as you write it

It’s ok to have 6 varations of select foo from bar where baz = 'blah' at the top of your file. It’s easy to understand what it does, and it’s probably not that important. But if you plan on storing that query somewhere or using it again, I would still recommend spliting that into three lines, at the select, from, and where keywords.

You should get used to adding the line breaks and indentations as you write any query you are not going to throw away. This will make it way easier to analyze, modify, and comment the query as you write it, and in the future.

Try to keep it to one logical item per line.

What this means is a bit muddy, but notice I had every field that is being selected from on it’s own line, each table and it’s alias on it’s own line, each join condition on it’s own line, and each filter own it’s own line. That’s because it makes it easy to read and alter. SQL is terse, and a lot can happen in a little bit of code. If you don’t want to select a column anymore, just using a single line comment. If you need to add something, and it on a new line. There is no cost to using vertical space, but it makes it much easier to read and maintain.

I feel that it’s fine to put keywords and their related logic on the same line, if there is only one logical item on that line. For example:

select e.Name
from Employee e
where r.Salary > 100000

There is not much reason to split that into new lines. And you might see something like this in a subselect. But if you were to add another field to the select, or another clause to the where filter, I would leave the keywords on their own line.

Alias your tables

You should pretty much always give every table an alias, and use the alias whenever you reference a column. There almost no reasons not to, and they’re so rare that I won’t even mention them. A one letter alias is fine if you are only referencing the table once, and the query is pretty simple. If you reference the same table multiple times (like we did in the above query) it’s good to give each table a more meaningful alias. This will save you time in the future so you don’t have to go back and add the aliases after you realise you need them.

Use indentation (not alignment) like you would in C or Python

Indentation is logical. You indent in to a certain scope in the query, and step back out when you exit that scope. It makes it easy to follow the query and it draws the eye to where it needs to be. Alignment is better than nothing, but it scatters the query based on how long your object names are, introduces unnessecary whitespace, and adds gaps in the queries. In my opinion, it is ugly, meaning it acheives the opposite of what it is attempting to accomplish. There a reason we opted for indentation over alignment in almost all languages. SQL is no different.

Indent your case statements

It’s ok to have simple, one-line case statements. For example case when IsActive = 1 then 'Active' else 'Inactive' end is fine. But that is about as complicated as a one-line case statement should get. Past that, try to ident them like so:

		when e.Salary < 50000 then e.Salary + 3000
		when e.Salary between 50000 and 100000 then e.Salary * 1.05
		when e.Salary > 100000 then 0  -- Fire them
	end as NewSalary
from Employees e

Comment the ‘why’ of what you did if you do something complex or strange

Everyone is happy when code is self-documenting. But that isn’t always the case. When you write a complex query, leave a comment describing what it does. And whenever you do something strange, explain what it is doing and why.

For example, in Postgres, there is predicate called ‘is not distinct from’ that is used to make a comparison where evalutating two nulls will return true.

null = null                     -- false
null is not distinct from null  -- true

But if you wanted to write that in a database that doesn’t support this keyword (which is a lot of them), you would have to write

(not (a != b or a is null or b is null) or (a is null and b is null))

If you saw that in a query without comments, how long would it take you to figure out why it’s there, and why it isn’t just a = b? If you write that, leave a comment explaining why.

Put unions on their own line

The union and union all keyword merge two result sets. Put them alone on their own line so that it’s clear what is being merged.

Keep operators on the left

Knowing whether there is an ‘and’ or ‘or’ in a filter is very important, and should come before what’s being filtered, not after. Again, alignment isn’t important. Logic is. That said…

Put ‘or’ statements on their own line

‘Or’ is a simple logically operator that most people are familiar with, but often make mistakes due to forgetting about operator precedence. ‘Or’ has lower precedence than ‘and’, so an ‘or’ without parenthesis will essentially split a filter into two distinct filters which will return rows whether either of the sides is true. That’s why it should have it’s own line, like so:

-- Assume the business wants a query to get all active accounts 
-- who owe money, or all closed account that owe a lot of money.
	accounts a
	a.balance > 0
	and a.status = 'Active'
	a.balance > 1000 
	and a.status = 'Inactive'

However, it ok to have simple one-liners with the ‘or’ keyword, like (a.status = 'Active' or a.inactive_date < current_Date + interval '1 day'). But notice the parenthesis around the clause, so that this doesn’t affect anything else in the filter.

Use ANSI standard features over database specific ones

This is save some time and context switching. For example, instead of using isnull() or nvl(), use coalesce(). It is standard and accepts multiple values so you don’t need to nest the other methods when you do have to deal with multiple potential nulls.

Stuff that I don’t think really matters

  • Alignment. Use indentation instead.
  • What case you write your keywords in. Eg. SELECT, Select, or select. Try to keep it consistant, but it doesn’t matter which you choose. Yelling WHERE in your code might help if you write it like you were writing a sentence, but if you use line breaks and indentation properly, it’s obvious where you are in the query. I write lowercase because I don’t want to hold shift.
  • Adding ‘inner’ to your inner joins. It’s optional, so I usually don’t do it. It works everwhere I use it, so I tend not to type it. It might be a good practice to do it though, so I won’t discourage it.

A more complex query

with salaryByDepartment as (
		coalesce(sum(e.Salary), 0) as TotalSalary
		Department d
		full join Employee e
			on e.DepartmentId = d.DepartmentId
	group by d.Name

totalSalary as (
	select sum(e.Salary) as TotalSalary
	from Employee e

	'All Employees' as Department,
	100 as PercentOfTotal
from totalSalary ts
union all
	(cast(sbd.TotalSalary as float) / ts.TotalSalary) * 100
	salaryByDepartment sbd
	cross apply totalSalary ts;

Imagine if this query was written without formatting. How long would it take to understand what it is doing?