ELI5 how are join statements semantically structured?

There are some folks just starting out to SQL Server, maybe they come from another flavor of database or perhaps they have no history with SQL at all. The terminology we use to talk about the work we do can get a little confusing, and I was lost when I began learning about schema structures. This is my “Explain it like I’m five – on how join statements are semantically structured.”. This is geared to new users who are in the first 1-6 months of learning SQL Server.

What we call “schema”, “databases”, “instances”, and “tables” are words that we use to describe objects inside of a relational database management system (RDBMS). Oracle, MySQL, SQL Server, etc all use similar but different terminology.

In SQL Server however, I’ll illustrate for you what we are referring to regarding schema, databases, instances, and tables.

SELECT Orders.OrderID, 
       Customers.CustomerName, 
       Orders.OrderDate
FROM Orders
     INNER JOIN Customers ON Orders.CustomerID = Customers.CustomerID;

-> This is our command piece

SELECT 

-> Orders is the table we are retrieving the column OrderID from

  Orders.OrderID, 

-> Customers is the table we are retrieving the column CustomerName from

  Customers.CustomerName, 

-> Orders is the table we are retrieving the column OrderDate from

  Orders.OrderDate 

-> A table we are selecting columns to see

FROM Orders 

-> Another table we are selecting columns to see. The inner join says that when the two columns are matched, they must be equal. Only results found that are common between both tables will be returned.

INNER JOIN Customers 

-> You are looking at the column CustomerID in the Orders table and looking at the column CustomerID in the Customers table. When you find a CustomerID common to both, you are asking for the data as illustrated below.

  ON Orders.CustomerID=Customers.CustomerID;

In SQL Server, the hierachy goes as :

Server -> SQL Server Instance -> Databases -> Schemas -> Tables.

So I connect to my server that runs my software. I then have SQL Server installed on the server (it is possible to run multiple installations or instances on a single server, this is not recommended typically), when I sign in, I can see the databases associated to that instance. Each database has a schema which is typically used to help group tables together or limit access for security. Many shops just keep this as DBO but if you are looking at Adventureworks or other data sets, you may see something like Customers.Orders. If you were to look at the whole syntax for that table, it may look like : Adventureworks.Customers.Orders. Adventureworks is my database, Customers is my schema, and Orders is my table.

Now here’s where I think people may get confused or have been confused. You can have the same named column in multiple tables. How do you know which column belongs to which table? In this example, we specify the CustomerID from the Orders table. Here is another way to write all of this:

SELECT 
  Adventureworks.Customers.Orders.OrderID, 
  Adventureworks.Customers.Customers.CustomerName, 
  Adventureworks.Customers.Orders.OrderDate 
FROM Adventureworks.Customers.Orders 
INNER JOIN Adventureworks.Customers.Customers 
  ON Adventureworks.Customers.Orders.CustomerID=Adventureworks.Customers.Customers.CustomerID;

The last line:

ON Adventureworks.Customers.Orders.CustomerID=Adventureworks.Customers.Customers.CustomerID;

This can be read as:

[Database].[Schema].[Table].[Column] = [Database].[Schema].[Table].[Column]

That kind of sucks to read doesn’t it? It seems a little redundant too. As long as SQL Server understands the scope of your SQL and where you are running things, a lot of how you have your statement written out is for the benefit of the developer. In your example provided, I would have personally appended the schema to the tables.

Appending the schema helps resolve any ambiguity, as does referencing the table before the column name if you are joining tables. This is also seen as best practice as it helps (ever so slightly) with performance and is sometimes required if both tables have the same column name.

Specifying Orders or Customer before your column name allows me to know where it is coming from. If you joined 100 tables and none prefixed the table name, I’m going to have to search through all 100 tables to know which column you are talking about.

Consider tables as data sets when thinking about how they look. There is a more advanced level of knowledge here regarding B-Tree index structures when working with clustered and non-clustered indexes, but we could have heaps or even clustered column indexes or OLTP tables that change how the table is physically and logically structured. If you are just starting out, I recommend to make a note of this to come back to it later. It gets advanced quickly but should be learned sooner than later.

Inner joins are where two tables are matched on a column(s) to limit the results between tables that are the same. There are more in depth explanations and more complicated explanations, but I would stick to basic for now if you are new to SQL.

SQL is Structured Query Language, it’s based mostly off of English. For example:

Bring me the red book from the living room that is the same book you had on your bed earlier.

SELECT
  LivingRoom.Book
FROM LivingRoom
INNER JOIN BedRoom
ON BedRoom.BookID = LivingRoom.BookID
WHERE LivingRoom.BookColor = 'Red'

When you read the top sentence and match it to the code, it makes sense logically and in English. What if I typed the code this way:

SELECT
  Book
FROM LivingRoom
INNER JOIN BedRoom
ON BookID = BookID
WHERE BookColor = 'Red'

Reading that code as a human, I would say the English version turns into:

Bring me the red book from the living room or the bedroom or both where the book is the same book in the bedroom or where the book is the same book in the living room.

If the column exists in both tables, you need to specify the table name before the column. The column may exist in both tables but it may not be the same values for the same row. Even if the values are the same for the same row, SQL Server doesn’t understand what table you want the column from, it’s ambiguous. Likewise, it’s not sure if you want to join the BookID onto the same table you are selecting from (there are reasons to do this and you can do this, just know it’s possible and move on for now) or if it’s supposed to join on another table. Again, it’s ambiguous and you need to resolve any clarification for the SQL engine.

Now for the last piece, you can alias things. Sometimes table names can be long, so you can rename them for the length of the transaction. Here’s an example:

SELECT 
  o.OrderID, 
  c.CustomerName, 
  o.OrderDate 
FROM Orders as [o]
INNER JOIN Customers as [c]
  ON o.CustomerID=c.CustomerID;

I aliased Orders as [o] and customers as [c], so when I refer to those tables henceforth, I can refer to them by their alias shortening how much I have to type but maintaining readability.

Leave a Reply

Your email address will not be published. Required fields are marked *