Week 9: Advanced Database Topics

In an increasingly data-driven world, efficiently configuring and managing a database are crucial to creating robust and scalable web applications. This week, we'll learn some tools and techniques to help us make the most of our Postgres database.

Agenda

Database Constraints
- Primary Keys
- Foreign Keys
- Other Constraints
  - UNIQUE and NOT NULL
  - CHECK
Advanced Queries
- DISTINCT
- UNION
- Aggregate Functions
  - COUNT
  - MIN/MAX
  - AVG
  - SUM
  - GROUP BY & HAVING
- ORDER BY
- CASE
- LIKE
- JOIN
  - Different Types of Joins
- SUBQUERY

In this lesson, we'll cover some more advanced database topics that will help you make the most out of your Postgres database by creating and enforcing predictable and reliable data structures.

Database Constraints

Databases allow us to define constraints, which are just rules that we can ask our columns to follow. Now, we'll take a look at some of the most common constraints that you'll use when working with databases.

Primary Keys

By this point, you've heard plenty about ids, and you've probably worked with them in some capacity. It's important that we take the time to understand exactly what an id is and why we use them.

Let's back up a little bit and begin with primary keys. As the name suggests, a table can only have one primary key. The primary key is the key which uniquely identifies each entry in the table. Tables do not need a primary key, nor does the primary key need to be a column called id. However, both are standard practice, and it's important to understand why.

It's straightforward to see why primary keys are so helpful. The most important benefit is that they allow us to quickly identify exactly one row in a table. Whether we are programming a React component on our client or business logic in our backend, ids provide a way of easily identifying entries in our database. Standard practice is to create a column id of type SERIAL to serve as our primary key. id is a short, descriptive and unambiguous name. The SERIAL type stores an integer, with the added benefit of auto increment: whenever we make a new entry, Postgres will automatically increment the id for us. This way, we don't need to worry about setting the id values ourselves. PRIMARY KEY is technically just a constraint (more on constraints later) that mandates that all values be UNIQUE and NOT NULL. Thus, we are guaranteed that each row has an id, and that no other row shares that id.

To designate a column as a primary key when we create a table, we use this syntax:

CREATE TABLE products (
    id serial PRIMARY KEY,
    name text,
    price numeric
);

Foreign Keys

What gives databases their power and versatility is their ability to establish complex relationships between entries in different tables. For example, let's say we are building a simple e-commerce application that has a list of products and allows users to place orders of those products.

We start with our products table, which lists all of our products:

CREATE TABLE products (
    id serial PRIMARY KEY,
    name text,
    price numeric
);

Next, we want to create an orders table that lists all orders made in the application. However, before we define our table, there is an important observation that we can make. Every order must be of a particular product; in other words, every order must contain exactly one product. It doesn't make sense to have an order with no products, and we can't store multiple products in one row (since we'd need an arbitrary number of columns to store each product). So, we need a column in our orders table that references an entry in the products table. It makes sense, then, to use its id. That gives us this definition:

CREATE TABLE orders (
    id serial PRIMARY KEY,
    product_id integer
);

As you can see, every order has a product_id. But this makes no guarantee that this id actually points to a product. For example, say you had only two products, with ids of 1 and 2, but you created an order with a product_id of 3. To prevent unpredictable errors from happening, we might want to add a constraint that guarantees that product_id contains a valid product id. That's exactly what the FOREIGN KEY constraint does. To use it, we can slightly alter our table definition:

CREATE TABLE orders (
    id serial PRIMARY KEY,
    product_id integer REFERENCES products
);

REFERENCES tells Postgres to add a foreign key constraint, and the table name (products, in this case) immediately follows, telling Postgres that the product_id column refers to a a primary key in the products table. Now, if we try to create an order with an invalid product_id, we'll get an error. Similarly, if we try to delete a product that is referenced to by an order, we will also get an error.

With this foreign key in place, we can now make an order for any product, and keep order information separate from product info. But we'd like to take this one step further: a single order should be able to contain multiple products. As we briefly mentioned earlier, this is impossible with the current two-table setup, because our orders table would need to have an arbitrary number of product_ids. The solution is to create another table called a join table. It has one purpose: to join, or connect, two other tables, so that we can establish what we call a many-to-many relationship -- put simply, this means that an order can contain an arbitrary many products.

Our new table only needs two columns: an order id and a product id. Every entry in this table represents something like a single item in a shopping basket. That item belongs to a particular basket (the order), and points to a product. So, we might create a table like this:

CREATE TABLE order_items (
    product_id integer REFERENCES products,
    order_id integer REFERENCES orders
);

This is an example of a table that needs no primary key. We might add one if we want, but it may not be useful because the only purpose this table serves is to join two existing tables. If we need to refer to a specific entry in this table, we can use the combination of order_id and product_id, since no two rows should have the same value for both. We'll see how to do this in the next section.

Other Constraints

`UNIQUE` & `NOT NULL`

As we mentioned earlier, PRIMARY KEY is technically just a combination of the UNIQUE and NOT NULL constraints. These constraints are self-explanatory; UNIQUE requires that a column's value is unique with respect to all other rows in the table, and NOT NULL requires that a column's value is not NULL (like in Javascript, NULL just represents the absence of a value ). In general, most columns should be NOT NULL. Only allow columns be NULL if it makes sense for there to be no data.

Note: when we say "require" in terms of constraints, we mean that any attempt to create, update, or delete an entry in the table that will violate the constraint will result in an error . This enables us to assume that all entries in a table satisfy the table's constraints.

Here is an example of how we might define these constraints:

CREATE TABLE products (
    id serial UNIQUE NOT NULL,
    name text NOT NULL,
    price numeric
);

What if no columns in a table are unique by themselves, but the combination of two or more is unique? For example, recall order_items, our join table from earlier:

CREATE TABLE order_items (
    product_id integer REFERENCES products,
    order_id integer REFERENCES orders
);

We can't make product_id unique because we may have multiple order items that refer to the same product. Similarly, we can't make order_id unique because the entire point of creating this join table was to allow multiple entries to belong to the same order. However, we may want to guarantee that there are no duplicate entries -- i.e., the combination of the product id and order id is unique. More formally, no entry in the table contains both the same product id and the same order id.

Fortunately, Postgres allows us to define a UNIQUE constraint over multiple columns:

CREATE TABLE order_items (
    product_id integer REFERENCES products,
    order_id integer REFERENCES orders,
    UNIQUE (product_id, order_id)
);

`CHECK`

There is one final important constraint that we will cover today called CHECK. CHECK allows us to guarantee that some arbitrary expression is true for every entry in our table. For example, let's say we wanted to make sure that all of our product's prices are positive (greater than zero). We might use this definition:

CREATE TABLE products (
    id serial PRIMARY KEY,
    name text,
    price numeric CHECK (price > 0)
);

Here, whatever follows CHECK is an expression that Postgres will automatically check any time we create or update an entry in our table. This again allows us to assume that every entry in the table satisfies this constraint, so we can be absolutely sure that every product has a positive price.

We also might define a constraint that involves two columns. For example, say that we also had a discounted_price column, which we want to constrain to be strictly less than the price. We can define our updated table like this:

CREATE TABLE products (
    id serial PRIMARY KEY,
    name text,
    price numeric CHECK (price > 0),
    discounted_price numeric CHECK (discounted_price > 0),
    CHECK (price > discounted_price)
);

Note that the final CHECK is on its own row, separate from any column. This is because it is a table constraint rather than a column constraint. Table constraints are defined with respect to the entire table, whereas column constraints are only defined with regards to the column whose line they are defined on. Any column constraint can be written as a table constraint, but the reverse is not true. (Column constraints cannot reference other columns.)

Advanced Queries

Last week, we learned how to construct basic SELECT queries that can read rows and columns from our database. In the real world, though, data is much more complex than just a list of rows and columns. Now, we'll learn some advanced querying techniques that will help us get exactly the data we want using only SQL.

Distinct

Normally, we can use the SELECT keyword to select all relevant rows from a table. But what if we want to only select distinct rows? That is, we want to select all relevant rows that are not identical (by the given criteria).

For example, let's say we wanted a list of all distinct product names from our products table defined above. Rather than query all products and remove duplicates, we can use SQL to query only distinct values:

SELECT DISTINCT name FROM products

We can also apply DISTINCT to multiple columns. This works how you would think:

SELECT DISTINCT name, price FROM products

selects all rows whose combination of name and price is unique. In other words, the resulting table will have two columns, name and price, and no two rows will have the same value for both.

Union

Some of you may already be familiar with the mathematical notion of a union -- the combination of two sets. Similarly, we can use SQL's UNION keyword to combine the results of two queries . Note that the queries must reference the same columns: a union of two sets of rows with different columns doesn't make sense.

Like union in math, UNION does not include any duplicates (unless you use UNION ALL).

Here's an example of a union query:

SELECT column1, column2 FROM table WHERE column1 > 50
UNION
SELECT column1, column2 FROM table WHERE column1 < 5

Aggregate Functions

If you've used Excel before, you probably appreciate the statistics operations that allow you to quickly get info about data entries, like average() and max(). As you might expect, Postgres includes this functionality out of the box.

Count

For example, let's say we want to know how many entries are in a table. This is trivial:

SELECT COUNT(*) FROM table

The asterisk tells Postgres to count all rows, including NULLs and duplicates. We can also add WHERE clauses to count only entries that satisfy some condition, like this:

SELECT COUNT(*) FROM table WHERE column3 = false

This query returns a table with only a single column count and one entry containing the count.

Min/Max

To get the minimum or maximum values of a numeric column in a table, we can use the MIN() and MAX () functions, like this:

SELECT MIN(price) FROM products;

Like COUNT, this returns a table with a single column min and one entry containing the minimum value of the queried column.

Avg

Average returns the average of a numeric column:

SELECT AVG(price) FROM products

Sum

Sum returns the sum of a numeric column:

SELECT SUM(price) FROM products

Group By & Having

Let's say we had a payments table defined like this:

CREATE TABLE payments (
    id serial PRIMARY KEY,
    customer_id integer,
    order_id integer,
    amount integer
)

What if we want to know how much each customer has paid over all their orders? We can use SQL's GROUP BY clause to group payments by their customer_id, and then use SUM() to compute the sum.

SELECT customer_id, SUM(amount) FROM payments GROUP BY customer_id

Now, imagine we only cared about our most loyal customers -- the ones who've spent more than $200 . You might think that we could use a WHERE clause to only get the groups of payments that sum to more than 200. But WHERE only works for rows, not groups in the result. Instead, we use the HAVING keyword, like this:

SELECT customer_id, SUM(amount) FROM payments GROUP BY customer_id HAVING SUM(amount) > 200

Order By

Often, we'll want to sort the data that we receive from our database. We could do this in our application logic, but it's almost always faster to ask the database to do it directly for us . Fortunately, SQL makes this trivially easy for us.

For example, let's say we wanted to get a list of products sorted from cheapest to most expensive . This simple query does that for us:

SELECT * FROM products ORDER BY price ASC

Of course ASC indicates that the resulting rows will have an ascending price. DESC would be the opposite.

Sorting works on more than just numeric values. Ordering textual values in ASC order sorts by alphabetical order, for example.

You can also sort on multiple columns as a tie-breaker by separating each term with a comma:

SELECT * FROM products ORDER BY column1 ASC, column2 DESC, column3 DESC;

Case

The CASE statement, similar to if-then-else statement or a switch statement, goes through different conditions and returns a value when the first condition is met. If the first condition is true, a value will be returned and , if not, the rest of the conditions will be evaluated . In case non of the given conditions are true, it returns the value in the ELSE clause (just like the default case in switch statements). Don't forget to tell the program that your CASE statement is done by an END clause. Here is what the syntax looks like:

SELECT ...
...
CASE
    WHEN condition1 THEN result1
    WHEN condition2 THEN result2
    ...
    WHEN conditionN THEN resultN
    ELSE result
END;

and here is an example with a SELECT statement:

SELECT order_id, amount,
CASE
    WHEN amount > 10 THEN 'Payment due is more that $10'
    WHEN amount < 10 THEN 'Payment due is not much!'
    ELSE 'You gotta pay exactly $10'
END AS PaymentDueFeedback
FROM payments;

Note: AS is used for renaming a column in the table that we want to be displayed after running a query.*

Like

This operation is used with the WHERE clause. LIKE allows you to find a specific pattern in a column of a table. You can use % and _ in conjuction with LIKE if there are other characters before or after the wanted pattern. The percent sign and underscore can be used in combinations as well. - % represents zero, one, or multiple characters. - _ represents a single character.

Imagine for some reason you wanted to find all products that have s as their second letter; you can perform the following query:

SELECT * FROM products
WHERE name LIKE '_s%';

You can also have as many conditions as you'd like by using AND or OR operators. So now imagine you wanted every product in your database whose name either started or ended with s:

SELECT * FROM products
WHERE name LIKE 's%' OR '%s';

Join

So far, we've seen how to design tables to help us store related data. For example, we created an e-commerce database with three tables: orders, products, and order_items. Every row in orders represents one shopping basket. Every row in order_items belongs to one order and refers to a products, so we can have as many products as we want in one shopping basket.

But storing our data is only one part of our application. Equally important is accessing it. Let's say we want to list all the products in an order. The first step is to obtain all the product_ids associated with that order. So, we might run:

SELECT * FROM order_items WHERE order_id = 20

to find all the order_items associated with order 20. Then, for each order item, we can find its product by running separate queries:

SELECT * FROM products WHERE id = ORDER_ITEM_ID

but this is verbose. Instead, we can use the JOIN keyword, which allows us to join two tables based on a matching column, like an id. So, we could write:

SELECT p.id AS Product_ID, p.name AS Product_Name, o.id AS Order_ID
    -- the trailing 'o' below lets you set a table alias, so you can refer to it in other parts of the query
    FROM order_items o 
    JOIN products p
    ON Product_ID = o.product_id;

This table will give us 3 colums representing the Product id in our order, the name of the product, and finally our order id in case we wanted to double check that we've got the right products. The rows will contain the list of products in our order.

Different types of joins

Based on what we want to get from two tables, we can do different types of joins. As you can see in the picture below, some of these different joins are: INNER JOIN (or JOIN as default in SQL), LEFT JOIN, RIGHT JOIN, and FULL JOIN.

SQL Join Comparison

INNER JOIN: returns rows when there is a match in both tables (default if no qualifier is specified).

LEFT JOIN: returns all rows from the left table, even if there are no matches in the right table.

RIGHT JOIN: returns all rows from the right table, even if there are no matches in the left table.

FULL JOIN: It combines the results of both left and right outer joins. The joined table will contain all records from both the tables and fill in NULLs for missing matches on either side.

Subquery

Subquery's definition is within its name: a query within a query! Subqueries could be nested inside a SELECT, INSERT, UPDATE, DELETE, or another subquery. They can output a list of data or individual values.

We use subqueries as a condition (therefore it's usually used within a WHERE clause) and will return data that will be used in the main query and the point of it is to restrict the information that's being retrieved from the database. Make sure you surround your subqueries with paranthesis.

Note: Operators could be <, =, >, etc.

SELECT column1, column2, ...
FROM table
WHERE clause/expressions OPERATOR 
    (SELECT column
    FROM table
    WHERE condition);

Here's an example:

SELECT * FROM products
WHERE id IN (SELECT product_id AS id FROM order_items
            GROUP BY product_id
            HAVING COUNT(product_id) > 5) s;