Chapter 4 Database Management and Data Systems (SQL)

SQL (Structured Query Language) is used to answer questions / extract information both within and across relational database tables —> high-level / parallel programming language

Database schemas show data types for each field (column) in all tables (data frame), and they also show relationships between tables.

  • cardinality of relation —> # of tuples (rows / records)

  • arity of the relation —> # of columns (attributes)

  • relation = table

A key is a minimal subset of columns that acts as a unique identifier for tuples in a relation.

SQL Style Guide by Simon Holywell Keywords are reserved words for operations.

4.1 CRUD (create, read, update, delete) Operations

4.1.1 CREATE —> Databases | Tables | Views | Users | Permissions | Security Groups

/* create a new table w/ column names, data type, size */
CREATE TABLE test_table( -- unique table name
  test_date date,
  test_name varchar(20),
  test_int int,
  PRIMARY KEY (test_date, test_name)
)

/* data types: */
Dates: date (YYYY-MM-DD) | datetime (YYYY-MM-DD hh:mm:ss) | time
Numeric: int | decimal | float | bit (1=TRUE, 0=FALSE. Also accepts NULL)
Strings: char | varchar | nvarchar

/* create a copy of an existing table */
CREATE TABLE new_table_name AS
    SELECT column1, column2, ...
    FROM existing_table_name,
    WEHRE conditions;

4.1.2 INSERT, — insert new records into existing database tables

/* insert new columns & values */
INSERT INTO table_name (col1, col2, col3)
VALUES
  ('value1', 'value2', 'value3')
  
/* insert + select columns & values from another existing table */
INSERT INTO table_name (col1, col2, col3)
SELECT
  column1,
  column2,
  column3
FROM other_table
WHERE condition(s);

4.1.3 UPDATE — Amend existing database records

UPDATE table_name
SET 
  column1 = value1,
  column2 = value2
WHERE condition(s)

4.1.4 DELETE — delete existing records from tables

DELETE 
FROM table_name
WHERE condition(s)

/* another method: TRUNCATE --> remove all data from ALL columns at once */
TRUNCATE TABLE table_name

4.1.5 Declare Variables —> so it will be easier to use in later conditions without repetitively calling the same values

-- declare your variables
DECLARE @start DATE
DECLARE @stop DATE
DECLARE @affected INT;

-- set relevant values for each variable
SET @start = '2022-01-01'
SET @stop = '2022-12-31'
SET @affected = 5000; -- threshold for # of affected customers

SELECT 
  column1,
  column2,
  affected_customers
FROM 
  table_name
WHERE date BETWEEN @start AND @stop
AND affected_customers >= @affected;

4.1.6 Temporary tables

SELECT
  col1,
  col2,
  col3 INTO #my_temp_table -- #my_temp_table exists until connection or session ends
FROM my_existing_table
WHERE condition(s);

-- Remove table manually
DROP TABLE #my_temp_table

4.1.7 READ | VIEW

Example: SFW query —> SELECT FROM WHERE statements

  • Selection is the operation of filtering a relation’s tuples on some condition.

  • Projection is the operation of producing an output table with tuples that have a subset of their prior attributes.

    • SELECT indicates which fields should be selected.

    • FROM indicates where these fields are located.

    • SELECT DISTINCT allows you to select field content.

SELECT DISTINCT var_name1 AS new_var_name1, var_name2
FROM table_name
WHERE condition(s);

DROP DATABASE drops an existing SQL database

ALTER DATABASE modifies a database

ALTER TABLE modifies a table DROP TABLE table_name deletes a table CREATE INDEX creates an index (search key) DROP INDEX deletes an index BACKUP DATABASE is used in SQL Server to create a full back up of an existing SQL database TO DIST = 'filepath'; Use aliasing to rename columns:

View is a virtual table that is the result of a saved SELECT statement. When accessed, views automatically update in response to updates in the underlying data.

CREATE VIEW new_table_name AS
SELECT id, var_name1, var_name2
FROM table_name;

4.2 Exploratory Data Analysis in SQL (T-SQL)

4.2.1 GROUP BY | HAVING | WHERE

  • GROUP BY splits data up into combinations of one or more values

  • WHERE filters on row values

  • HAVING appears after the GROUP BY clause and filters on groups or aggregates

/* List the number of customers in each country, ordered by the country with the most customers first.*/
SELECT COUNT(customerID),
country,
FROM customers
GROUP BY country
ORDER BY COUNT(customerID) DESC; 

/* List the # of customers in each country, sorted high to low (but only include countries with more than 5 customers) */
SELECT COUNT(CustomerID), Country
FROM Customers
GROUP BY Country
HAVING COUNT(CustomerID) > 5
ORDER BY COUNT(CustomerID) DESC;

/* List if the employees "Davolio" or "Fuller" have registered more than 25 orders */
SELECT Employees.LastName, COUNT(Orders.OrderID) AS NumberOfOrders
FROM Orders
INNER JOIN Employees ON Orders.EmployeeID = Employees.EmployeeID
WHERE LastName = 'Davolio' OR LastName = 'Fuller'
GROUP BY LastName 
HAVING COUNT(Orders.OrderID) > 25;

4.2.2 JOIN examples

/* INNER JOIN selects records that have matching values in BOTH tables */
SELECT column_name
FROM table1
INNER JOIN table2 ON table1.column_name = table2.column_name;

/* LEFT JOIN selects all records from the left table (table1) and the matching records from the right table (table2) */
SELECT column_name
FROM table1
LEFT JOIN table2 ON table1.column_name = table2.column_name;

/* RIGHT JOIN selects all records from the right table (table2) and the matching records from the left table (table1) */
SELECT column_name
FROM table1
RIGHT JOIN table2
ON table1.column_name = table2.column_name;

/* FULL OUTER JOIN returns all records when there is a match in left (table1) OR right (table2) table records -- FULL OUTER JOIN == FULL JOIN */
SELECT column_name
FROM table1
FULL OUTER JOIN table2
ON table1.column_name = table2.column_name
WHERE condition;

/* a self join is a regular join, but the table is joined with itself */
SELECT column_name
FROM table1 T1, table2 T2 -- T1, T2 are two aliases of table 1 and table 2
WHERE condition

/* SQL Self Join Example: matching customers from the same city */
SELECT A.CustomerName AS CustomerName1, B.CustomerName AS CustomerName2, A.City
FROM Customers A, Customers B
WHERE A.CustomerID <> B.CustomerID -- <> stands for not equal 
AND A.City = B.City
ORDER BY A.City;

4.2.3 UNION Operator

  • UNION discards duplicates while UNION ALL does not

  • the UNION operator is used to combine the result-set of two or more SELECT statements

    • Every SELECT statement within UNION must have the same number of columns

    • The columns must also have similar data types

    • The columns in every SELECT statement must also be in the same order

    /* UNION : selects only distinct values by default */
    SELECT column_name(s) FROM table1
    UNION
    SELECT column_name(s) FROM table2;
    
    /* UNION ALL : allows duplicate values */
    SELECT column_name(s) FROM table1
    UNION ALL
    SELECT column_name(s) FROM table2;
    
    /* example1: returns the German cities (duplicate values also) from both the customers and the suppliers table */
    SELECT City, Country FROM Customers
    WHERE Country='Germany'
    UNION ALL
    SELECT City, Country FROM Suppliers
    WHERE Country='Germany'
    ORDER BY City;

4.2.4 CASE statements can be used to create columns (new variables) for

  • categorizing data

  • filtering data

  • aggregating data based on results of a logical test

    • e.g. COUNT(CASE statement)–>returns the number of rows returned by case statements instead of string / text

    • similar aggregations include SUM/AVG/ROUND(AVG(…), digits)

    • Percentages with CASE and AVG

SELECT 
  column1,
  ROUND(AVG(CASE WHEN condition1 AND condition2 THEN 0
                 WHEN condition1 AND condition3 THEN 1 
                 END), 2) AS pct_column,
FROM table1
GROUP BY column1
SELECT
  id,
  home_goal,
  away_goal,
  CASE WHEN home_goal > away_goal THEN 'Home Team Win'
       WHEN home_goal < away_goal THEN 'Away Team Win'
       ELSE 'Tie' END AS outcome
    /* ELSE NULL AS outcome --> if everything else should be N.A.
       END AS outcome --> if we want to exclude every other conditions 
       END IS NOT NULL --> keep all clauses excluding missing values*/
FROM match
WHERE season = '2013/2014';

/* Example 1: CASE WHEN + COUNT GROUP BY */
-- Identify the home team as Bayern Munich, Schalke 04, or neither
SELECT 
    CASE WHEN hometeam_id = 10189 THEN 'FC Schalke 04'
         WHEN hometeam_id = 9823 THEN 'FC Bayern Munich'
         ELSE 'Other' END AS home_team,
    COUNT(id) AS total_matches
FROM matches_germany
-- Group by the CASE statement alias
GROUP BY home_team;

/* Example 2: CASE WHEN + LEFT JOIN + WHERE */
SELECT 
    m.date,
    t.team_long_name AS opponent,
    -- Complete the CASE statement with an alias
    CASE WHEN m.home_goal > away_goal THEN 'Barcelona win!'
       WHEN m.home_goal < away_goal THEN 'Barcelona loss :(' 
       ELSE 'Tie' END AS outcome 
FROM matches_spain AS m
LEFT JOIN teams_spain AS t 
ON m.awayteam_id = t.team_api_id
-- Filter for Barcelona as the home team
WHERE m.hometeam_id = 8634; 

4.2.5 TEXT operations

  • Select names starting with vowels
SELECT column
FROM table
-- start with vowels
WHERE column LIKE '[aeiou]%'; 
-- end with vowels
WHERE column LIKE '%[aeiou]'
-- both start and end w/ vowels
WHERE column LIKE '[aeiou]%[aeiou]'
 -- contains vowels in any positions
WHERE column LIKE '%[aeiou]%';

-- if NOT containing such characters then simiply do
NOT LIKE ...
  • LIKE: Simple string pattern matching
LIKE operators with ‘%’ and ‘_’
LIKE Operator Description
WHERE column LIKE 'a%' Finds any values that start with “a”
WHERE column LIKE '%a' Finds any values that end with “a”
WHERE column LIKE '%or%' Finds any values that have “or” in any position
WHERE column LIKE '_r%' Finds any values that have “r” in the second position
WHERE column LIKE 'a_%' Finds any values that start with “a” and are at least 2 characters in length
WHERE column LIKE 'a__%' Finds any values that start with “a” and are at least 3 characters in length
WHERE column LIKE 'a%o' Finds any values that start with “a” and ends with “o”
WHERE column LIKE '[aeiou]%' Finds any values that start with any characters in “a”, “e”, “i”, “o”, or “u”
SELECT 
  colunm,
  LEN(column) AS column_length -- returns # of chars
  LEFT(column, 20) AS first_20_left_column -- returns first 20 chars from the LEFT
  RIGHT(column, 20) AS last_20_column -- returns last 20 chars from the RIGHT
  CHARINDEX('_', column) AS char_location -- returns the index of a char/string in column
  SUBSTRING(column, start, length) AS target_section 
        -- returns substring starting at location 12 and has length of 12
FROM table;

/* REPLACE */
SELECT
  TOP(5) REPLACE(column, 'a','b') AS replaced_column_with_b
        -- replaces char 'a' in column with char 'b' for all first 5 rows
FROM table;

4.2.6 Substituting NULL values using COALESCE in T-SQL

COALESCE(value_1, value_2, value_3, ..., value_n)
-- if value_n is NULL and value_2 is not NULL, return value_2
-- ... value_n can also be some column

SELECT column1, column2
COALESCE(column1, column2, 'N/A') AS new_column
FROM table1

-- using ISNULL() function to replace all mising values in one column with values in another column or some other specified value
SELECT column1, ISNULL(column1, column2) AS new_column1
FROM table1

4.2.7 DATE

  • DATEADD (DATEPART, number, date): Add or subtract datetime values —> always returns a date

    • DATEPART: Unit of measurement (DD, MM, etc.)

    • number: an integer value to add

    • date: a datetime value

  • DATEDIFF (datepart, startdate, enddate): Obtain the difference between two datetime values —> always returns a number

    • datepart: unit of measurement (DD, MM, etc,)

    • startdate: the starting date value

    • enddate: the ending datetime value

    /* date math w/ DATEDIFF */
    SELECT 
      date1, 
      date2, 
      DATEDIFF(DD, date1, date2) AS diff_day,
      DATEDIFF(MM, date1, date2) AS diff_month,
      DATEDIFF(YYYY, date1, date2) AS diff_year
    FROM table1
    
    /* date math w/ DATEADD */
    SELECT
      date1,
      DATEADD(DD, 7, date1) AS date_after_a_week
    FROM table1

4.2.8 ROUND and TRUNCATE

/* IN T-SQL */
-- round(17.8, 0) --> 18 while truncate rounds 17.8 to 17
SELECT column1,
ROUND(column2, 0) AS RoundingtoWhole,
ROUND(column2, 0, 1) AS Truncating -- adding a third parameter 1 to round for truncating
FROM table1

OTHER MATH functions

ABS(): absolute value

SQRT(): square root | SQUARE(): square

LOG(column, digit): returns the LOG with base # of digit

4.3 Advanced SQL - loops/CTE/Windows

4.3.1 WHILE Loops & DECLARE

Assigning values to variables and write loops

-- Declare the variable
DECLARE @my_var INT
-- Use SET a value to the variable
SET @my_var = 1
-- Show the value
SELECT @my_var

-- Specify the condition of the WHILE loop
WHILE @my_var < 10
  -- Begin the code to execude insde the WHILE loop
  BEGIN
    -- Keep incrementing the value of @my_var
    SET @my_var = @my_var + 1
    -- Check if @my_var is equal to 4
    IF @my_var = 4
      -- When my_var is 4 then break the loop
      BREAK
    -- End WHILE lopp
  END
-- View the value after the loop
SELECT @my_var

4.3.2 Derived Tables

  • another name for a query acting as a table and are commonly used to do aggregations in T-SQL

  • use derived tables when we want to break down a complex query into smaller steps

  • great solution if we want to create intermediate calculations needed to be used in a larger query

SELECT a.* FROM Kidney a
-- This derived table computes the Average age are joined to the actual table
JOIN (SELECT AVG(age) AS AverageAge FROM kidney) b
ON a.Age = b.AverageAge
-- a & b are aliases for table Kidney and our derived table

4.3.3 CTE (Common Table Expressions)

  • another type of derived table

  • can be used multiple times in a query and are defined like a table

-- CTE definitions start with the keyword WITH
-- Followed by the CTE names and the columns it contains
WITH CTEName (Co1, Col2)
AS
-- Define the CTE query
(
-- the two columns from the definition above
    SELECT Col1, Col2
    FROM TableName
)

/* CTEs in T-SQL example */
-- Create a CTE to get the Maximum BloodPressure by Type
WITH BloodPressureAge(Age, MaxBloodPressure)
AS
(SELECT Age, MAX(BloodPressure) AS MaxBloodPressure
 FROM Kidney
 GROUP BY Age)
 
-- Create a query to use the CTE as a table
SELECT a.Age, MIN(a.BloodPressure), b.MaxBloodPressure
FROM Kidney a
-- join the CTE with the table
JOIN BloodPressureAge b
     ON a.Age = b.Age
GROUP BY a.Age, b.MaxBloodPressure

4.3.4 Window Functions in SQL

  • Create the window with OVER clause

  • PARTITION BY creates the frame

  • If you do not include PARTITION BY the frame is the entire table

  • Allows aggregations to be created at the same time as the window

-- Create a Window data grouping
SELECT SalesPerson, SalesYear, CurrentQuota,
       SUM(CurrentQuota)
       OVER (PARTITION BY SalesYear ORDER BY SalesYear) AS YearlyTotal,
       ModifiedDate AS ModDate
FROM SaleGoal

/* the above example partitions the table by SalesYear and uses the windowing function SUM to add up every row of the CurrentQuota column in the window to provide a total for the entire window in the YearTotal column.

When the year changes, the value in the YearTotal column changes showing the total for the next year becuz the window is partitioned by sales year.
*/

4.3.5 Windows functions Cheatsheet

Col1 Col2
ROW_NUMBER() Displays the unique number/index of a given row (no duplicates as compared to rank/dense_rank); starts are 1 and numbers the rows according to the ORDER BY part of window statements.
RANK() Gives the identical rows a rank of 2, then skip ranks 3 and 4, so next rank would be 5.
DENSE_RANK() Also gives identical rows a rank of 2, but the following rank would be 3 – no ranks would be skipped.
NTILE(*# of buckets*) Identify what percentile (or quantile, or any other subdivison) a given row falls into.
LAG(column, *# of rows*) Creates columns that pull previous values from other rows (second parameter specifies how many rows we need to pull)
LEAD(column, *# of rows*) Creates columns that pull following/next values from other rows (second parameter specifies how many rows we need to pull)
SUM() | COUNT() | AVG() | STDEV() Creates columns that calculate corresponding values based on mathematical operations

4.3.5.1 Getting first/last and next/previous values

  • FIRST_VALUE() returns the first values in the window

  • LAST_VALUE() returns the last values in the window

--> becuz we are trying to find a specific record in a window, the position of the records must be specified using the ORDER BY command.

-- Select the columns
SELECT SalesPerson, SalesYear, CurrentQuota,
  -- First value from every window
     FIRST_VALUE(CurrentQuota)
     OVER (PARTITION BY SalesYear ORDER BY ModifiedDate) AS StartQuota,
  -- Last value from every window
     LAST_VALUE(CurrentQuota)
     OVER (PARTITION BY SalesYear ORDER BY ModifiedDate) AS EndQuota,
     ModifiedDate as ModDate
FROM SaleGoal
  • Getting the next value with LEAD() and previous value with LAG()

    • provides the ability to query the value from NEXT or PREVIOUS row

    • also requires the use of ORDER BY to order the rows

    /* getting next value */
    SELECT SalesPerson, SalesYear, CurrentQuota,
    -- Create a window function to get values from next row
           LEAD(CurrentQuota) -- use LAG(...) if need previous value
           OVER (PARTITION BY SalesYear ORDER BY ModifiedDate) AS NextQuota,
           ModifiedDate as ModDate
    FROM SaleGoal

4.3.5.2 Adding row numbers

  • ROW_NUMBER(): sequentially numbers the rows in the window

  • ORDER BY is required when using ROW_NUMBER()

/* example of using ROW_NUMBER() to assign a new row number for each row in the window */
SELECT SalesPerson, SalesYear, CurrentQuota,
  ROW_NUMBER()
  OVER (PARTITION BY SalesPerson ORDER BY SalesYear) AS QuotabySalesPerson
FROM SaleGoal

4.3.5.3 Using windows for Statistical Functions

  • STDEV(): calculates the standard deviation

    SELECT SalesPerson, SalesYear, CurrentQuota,
           STDEV(CurrentQuota)
           OVER () AS StandardDev, -- do not include partition by if we are calculating for the entire table
           ModifiedDate AS ModDate
    FROM SaleGoal
  • Calculating the mode: 1) create a CTE containing an ordered count of values using ROW_NUMBER(); 2) write a query using the CTE to pick the value with the highest row number

    -- Create a CTE called QuotaCount 
    WITH QuotaCount AS 
    (
    SELECT SalesPerson, SalesYear, CurrentQuota,
           ROW_NUMBER() -- creating ordered count of values
           OVER (PARTITION BY CurrentQuota ORDER BY CurrentQuota) AS QuotaList
    FROM SaleGoal
    )
    
    -- select everything from CTE
    SELECT * FROM QuotaCount
    
    -- write a query using CTE to pick the value with highest count (i.e. row number)
    SELECT CurrentQuota, QuotaList AS Mode
    FROM QuotaCount
    WHERE QuataList IN (SELECT MAX(QuotaList) FROM QuotaCount) -- select value in maximum count

4.3.6 Defining a window alias

Example:

SELECT column1,
       column2,
       NTILE(4) OVER
        (PARTITION BY column1 ORDER BY column2)
        AS quartile,
       NTILE(5) OVER
        (PARTITION BY column1 ORDER BY column2)
        AS quintile,
       NTILE(100) OVER
        (PARTITION BY column1 ORDER BY column2)
        AS percentile
FROM table_name
WHERE condition(s)
ORDER BY column1, column2

Re-written example using window alias

SELECT column1,
       column2
       NTILE(4) OVER ntile_window AS quartile,
       NTILE(5) OVER ntile_window AS quintile,
       NTILE(100) OVER ntile_window AS percentile
FROM table_name
WHERE condition(s)
/* NOTE if include the WINDOW clause, should always come AFTER the WHERE clause */
WINDOW ntile_window AS
        (PARTITION BY column1 ORDER BY column2)
ORDER BY column1, column2