Properly reference your columns! - Jon Shaulis

If you don’t read the rest of this setup, I want you to take away one thing.

Always reference your tables with your columns when more than one table is involved in the query!

This post is made primarily with SQL Server in mind, but this behavior is actually ANSI SQL and can be replicated in PostgreSQL, MySQL, and Oracle.

-- Example of table reference
SELECT
    table.ColumnA
FROM table 
INNER JOIN tabletwo
ON table.ColumnID = tabletwo.ColumnID;

I recently experienced an odd issue where my UPDATE statement was updating every row in my table. The problem specifically stemmed from the fact that I was referencing an invalid column from a table in a subquery. Now you may be thinking, if you are referring to an invalid column in a subquery, surely your SQL will just fail and rollback the transaction. Well, the exact opposite is happening.

Let’s set the stage here, if I’m referring to an invalid column in a query, I expect to get something like this and my transaction should rollback:

Msg 207, Level 16, State 1, Line 45 Invalid column name ‘FakeTest1’

Now instead of this expected error message, EVERY SINGLE ROW IS UPDATED! What kind of sorcery is this? Let’s start by creating our environment to test this.

IF OBJECT_ID('dbo.FakeTableSource', 'U') IS NOT NULL
BEGIN
    DROP TABLE [dbo].[FakeTableSource];
END;

CREATE TABLE [dbo].[FakeTableSource]
( 
             [FakeID] INT PRIMARY KEY, [FakeVarchar] VARCHAR(10)
);

GO

IF OBJECT_ID('tempdb.dbo.#T', 'U') IS NOT NULL
BEGIN
    DROP TABLE [#T];
END;

IF OBJECT_ID('dbo.FakeTableDestination', 'U') IS NOT NULL
BEGIN
    DROP TABLE [dbo].[FakeTableDestination];
END;

CREATE TABLE [dbo].[FakeTableDestination]
( 
             [ID] INT PRIMARY KEY, [FakeVarchar] VARCHAR(10)
);

GO

INSERT INTO [dbo].[FakeTableSource]( [FakeID], [FakeVarchar] )
VALUES( 1, 'One' ), ( 2, 'Two' ), ( 3, 'Three' ), ( 4, 'Four' );

INSERT INTO [dbo].[FakeTableDestination]( [ID], [FakeVarchar] )
VALUES( 1, 'One' ), ( 2, 'Two' ), ( 3, 'Five' ), ( 5, 'Four' );

Great! We have our tables and some sample data. Let’s pretend to try and update FakeTableDestination. Let’s say we wanted to set FakeVarchar equal to the other table where the ID’s match. To replicate the issue I saw, I’m going to wrap this up in a CTE and perform the update statement although this is not necessary to replicate the problem at hand.

;WITH CTE
     AS (SELECT [FakeID], [FakeVarchar]
         FROM [dbo].[FakeTableSource])
     SELECT  [FakeID], [FakeVarchar]
     INTO [#T]
     FROM [CTE];

UPDATE [dbo].[FakeTableDestination]
  SET [FakeVarchar] = [src].[FakeVarchar]
FROM [dbo].[FakeTableSource] AS [SRC]
WHERE [FakeTableDestination].[ID] IN
(
    SELECT [ID]
    FROM [#T]
);

Now if you ran that statement, you’re going to see something really weird. You’re going to see that there is no syntax error and every row is updated! This is strange because the subquery is selecting ID when only the column FakeID exists in the #T table. Let’s check out those tables. (Note, you’ll have to re-run all of the code up to this point to check out the temp table.)

SELECT *
FROM [dbo].[FakeTableSource];

SELECT *
FROM [dbo].[FakeTableDestination];

SELECT *
FROM [#t];

Do you have any guesses yet as to why this behavior is happening? Is this a bug? If you take SELECT ID FROM #T out of the subquery, you get the error I showed above initially.

When I first ran into this, I thought it may be because I was using a temp table. It turns out table variables give you the same problem. You can test with this code below:

IF OBJECT_ID('dbo.FakeTableSource', 'U') IS NOT NULL
BEGIN
DROP TABLE [dbo].[FakeTableSource];
END;

CREATE TABLE [dbo].[FakeTableSource]
([FakeID]      INT
 PRIMARY KEY, 
 [FakeVarchar] VARCHAR(10)
);

GO

IF OBJECT_ID('dbo.FakeTableDestination', 'U') IS NOT NULL
BEGIN
DROP TABLE [dbo].[FakeTableDestination];
END;

CREATE TABLE [dbo].[FakeTableDestination]
([ID]          INT
 PRIMARY KEY, 
 [FakeVarchar] VARCHAR(10)
);

GO

DECLARE @T TABLE
([FakeID]      INT
 PRIMARY KEY, 
 [FakeVarchar] VARCHAR(10)
);

INSERT INTO [dbo].[FakeTableSource]
(
       [FakeID], 
       [FakeVarchar]
)
VALUES
       (1, 
        'One'),
       (2, 
        'Two'),
       (3, 
        'Three'),
       (4, 
        'Four');

INSERT INTO [dbo].[FakeTableDestination]
(
       [ID], 
       [FakeVarchar]
)
VALUES
       (1, 
        'One'),
       (2, 
        'Two'),
       (3, 
        'Five'),
       (5, 
        'Four');

WITH CTE
     AS (SELECT 
                [FakeID], 
                [FakeVarchar]
         FROM [dbo].[FakeTableSource])

     INSERT INTO @T
     (
            [FakeID], 
            [FakeVarchar]
     )
     SELECT 
            [FakeID], 
            [FakeVarchar]
     FROM [CTE];

UPDATE [dbo].[FakeTableDestination]
  SET 
      [FakeVarchar] = [src].[FakeVarchar]
FROM [dbo].[FakeTableSource] AS [SRC]
WHERE 
      [FakeTableDestination].[ID] IN
(
    SELECT 
           [ID]
    FROM @T
);

Ok, we get this with table variables… how about a physical table? You can modify the above code with this extra snippet to test that theory. Spoiler, the same problem exists still!

SELECT 
     [FakeID], 
     [FakeVarchar]
INTO dbo.T
FROM [CTE];

UPDATE [dbo].[FakeTableDestination]
  SET 
      [FakeVarchar] = [src].[FakeVarchar]
FROM [dbo].[FakeTableSource] AS [SRC]
WHERE 
      [FakeTableDestination].[ID] IN
(
    SELECT 
           [ID]
    FROM T
);

What if I change the UPDATE WHERE clause to NOT IN instead of IN? Well, NOTHING updates then! How crazy! The opposite behavior happens. Well… what about the version of SQL Server? I tested this on SQL Server 2012, 2014, and 2016, the same thing happens. In fact, if you test this on any major SQL platform, the exact same behavior occurs.

So this is not specific to SQL Server, what could it possibly be? Some testing got me close to what was happening, but ultimately Paul White was able to help shed some light on the issue and I later found a great definition by Microsoft.

Here is how Microsoft describes the behavior:

If a column is referenced in a subquery that does not exist in the table referenced by the subquery’s FROM clause, but exists in a table referenced by the outer query’s FROM clause, the query executes without error. SQL Server implicitly qualifies the column in the subquery with the table name in the outer query.
https://docs.microsoft.com/en-us/sql/relational-databases/performance/subqueries?view=sql-server-2017

This is an ANSI standard and while I don’t agree with it, this behavior is probably here to stay. So what does this mean for you? Well, here’s a version of the query that WOULD have errored out and not allowed the update to continue. (Which is why you should always be explicit about where columns are coming from specific tables.)

SELECT 
     [FakeID], 
     [FakeVarchar]
INTO dbo.T
FROM [CTE];

UPDATE [dbo].[FakeTableDestination]
  SET [FakeVarchar] = [src].[FakeVarchar]
FROM [dbo].[FakeTableSource] AS [SRC]
WHERE [FakeTableDestination].[ID] IN
(
    SELECT #T.[ID]
    FROM [#T]
);

You can see on line twelve that I’m referencing my table name prior to the column. I’m specifically stating where I expect the value to come from since there is more than one table in this script.

Honestly, doing this is just good practice anyway. If you’ve ever had to fix a script or work on an old query where you have no columns being referenced by tables, you’ll understand where I’m coming from on this. It makes it significantly easier to troubleshoot scripts and it prevents ambiguous errors in the future. It’s just good defensive scripting. So always reference your columns with the tables they come from when more than one table is in use!

Note: No production databases or tables were harmed in the discovery of this ANSI standard.

Where I got my answer.

Additional reading by Aaron Bertrand on SentryOne’s website.

Share this:

Leave a Reply Cancel reply