Chapter 4 Database Management and Data Systems (SQL)
SQL (Structured Query Language)
is used to answer questions / extract information both within and across relational database tables —> high-level / parallel
programming language
Database schemas
show data types
for each field (column) in all tables (data frame), and they also show relationships
between tables.
cardinality
of relation —> # of tuples (rows / records)arity
of the relation —> # of columns (attributes)relation
= table
A key
is a minimal subset of columns that acts as a unique identifier for tuples in a relation.
SQL Style Guide by Simon Holywell Keywords are reserved words for operations.
4.1 CRUD (create, read, update, delete) Operations
4.1.1 CREATE —> Databases | Tables | Views | Users | Permissions | Security Groups
/* create a new table w/ column names, data type, size */
CREATE TABLE test_table( -- unique table name
date,
test_date varchar(20),
test_name int,
test_int PRIMARY KEY (test_date, test_name)
)
/* data types: */
date (YYYY-MM-DD) | datetime (YYYY-MM-DD hh:mm:ss) | time
Dates: Numeric: int | decimal | float | bit (1=TRUE, 0=FALSE. Also accepts NULL)
char | varchar | nvarchar
Strings:
/* create a copy of an existing table */
CREATE TABLE new_table_name AS
SELECT column1, column2, ...
FROM existing_table_name,
WEHRE conditions;
4.1.2 INSERT, — insert new records into existing database tables
/* insert new columns & values */
INSERT INTO table_name (col1, col2, col3)
VALUES
'value1', 'value2', 'value3')
(
/* insert + select columns & values from another existing table */
INSERT INTO table_name (col1, col2, col3)
SELECT
column1,
column2,
column3FROM other_table
WHERE condition(s);
4.1.3 UPDATE — Amend existing database records
UPDATE table_name
SET
= value1,
column1 = value2
column2 WHERE condition(s)
4.1.4 DELETE — delete existing records from tables
DELETE
FROM table_name
WHERE condition(s)
/* another method: TRUNCATE --> remove all data from ALL columns at once */
TRUNCATE TABLE table_name
4.1.5 Declare Variables —> so it will be easier to use in later conditions without repetitively calling the same values
-- declare your variables
DECLARE @start DATE
DECLARE @stop DATE
DECLARE @affected INT;
-- set relevant values for each variable
SET @start = '2022-01-01'
SET @stop = '2022-12-31'
SET @affected = 5000; -- threshold for # of affected customers
SELECT
column1,
column2,
affected_customersFROM
table_nameWHERE date BETWEEN @start AND @stop
AND affected_customers >= @affected;
4.1.6 Temporary tables
SELECT
col1,
col2,INTO #my_temp_table -- #my_temp_table exists until connection or session ends
col3 FROM my_existing_table
WHERE condition(s);
-- Remove table manually
DROP TABLE #my_temp_table
4.1.7 READ | VIEW
Example: SFW
query —> SELECT FROM WHERE
statements
Selection is the operation of filtering a relation’s tuples on some condition.
Projection is the operation of producing an output table with tuples that have a subset of their prior attributes.
SELECT
indicates which fields should be selected.FROM
indicates where these fields are located.SELECT DISTINCT
allows you to select field content.
SELECT DISTINCT var_name1 AS new_var_name1, var_name2
FROM table_name
WHERE condition(s);
DROP DATABASE
drops an existing SQL database
ALTER DATABASE
modifies a database
ALTER TABLE
modifies a table DROP TABLE table_name
deletes a table CREATE INDEX
creates an index (search key) DROP INDEX
deletes an index BACKUP DATABASE
is used in SQL Server to create a full back up of an existing SQL database TO DIST = 'filepath';
Use aliasing to rename columns:
View is a virtual table that is the result of a saved SELECT
statement. When accessed, views automatically update in response to updates in the underlying data.
CREATE VIEW new_table_name AS
SELECT id, var_name1, var_name2
FROM table_name;
4.2 Exploratory Data Analysis in SQL (T-SQL)
4.2.1 GROUP BY | HAVING | WHERE
GROUP BY
splits data up into combinations of one or more valuesWHERE
filters on row valuesHAVING
appears after theGROUP BY
clause and filters on groups or aggregates
/* List the number of customers in each country, ordered by the country with the most customers first.*/
SELECT COUNT(customerID),
country,FROM customers
GROUP BY country
ORDER BY COUNT(customerID) DESC;
/* List the # of customers in each country, sorted high to low (but only include countries with more than 5 customers) */
SELECT COUNT(CustomerID), Country
FROM Customers
GROUP BY Country
HAVING COUNT(CustomerID) > 5
ORDER BY COUNT(CustomerID) DESC;
/* List if the employees "Davolio" or "Fuller" have registered more than 25 orders */
SELECT Employees.LastName, COUNT(Orders.OrderID) AS NumberOfOrders
FROM Orders
INNER JOIN Employees ON Orders.EmployeeID = Employees.EmployeeID
WHERE LastName = 'Davolio' OR LastName = 'Fuller'
GROUP BY LastName
HAVING COUNT(Orders.OrderID) > 25;
4.2.2 JOIN examples
/* INNER JOIN selects records that have matching values in BOTH tables */
SELECT column_name
FROM table1
INNER JOIN table2 ON table1.column_name = table2.column_name;
/* LEFT JOIN selects all records from the left table (table1) and the matching records from the right table (table2) */
SELECT column_name
FROM table1
LEFT JOIN table2 ON table1.column_name = table2.column_name;
/* RIGHT JOIN selects all records from the right table (table2) and the matching records from the left table (table1) */
SELECT column_name
FROM table1
RIGHT JOIN table2
ON table1.column_name = table2.column_name;
/* FULL OUTER JOIN returns all records when there is a match in left (table1) OR right (table2) table records -- FULL OUTER JOIN == FULL JOIN */
SELECT column_name
FROM table1
FULL OUTER JOIN table2
ON table1.column_name = table2.column_name
WHERE condition;
/* a self join is a regular join, but the table is joined with itself */
SELECT column_name
FROM table1 T1, table2 T2 -- T1, T2 are two aliases of table 1 and table 2
WHERE condition
/* SQL Self Join Example: matching customers from the same city */
SELECT A.CustomerName AS CustomerName1, B.CustomerName AS CustomerName2, A.City
FROM Customers A, Customers B
WHERE A.CustomerID <> B.CustomerID -- <> stands for not equal
AND A.City = B.City
ORDER BY A.City;
4.2.3 UNION Operator
UNION
discards duplicates whileUNION ALL
does notthe
UNION
operator is used to combine the result-set of two or moreSELECT
statementsEvery
SELECT
statement within UNION must have the same number of columnsThe columns must also have similar data types
The columns in every
SELECT
statement must also be in the same order
/* UNION : selects only distinct values by default */ SELECT column_name(s) FROM table1 UNION SELECT column_name(s) FROM table2; /* UNION ALL : allows duplicate values */ SELECT column_name(s) FROM table1 UNION ALL SELECT column_name(s) FROM table2; /* example1: returns the German cities (duplicate values also) from both the customers and the suppliers table */ SELECT City, Country FROM Customers WHERE Country='Germany' UNION ALL SELECT City, Country FROM Suppliers WHERE Country='Germany' ORDER BY City;
4.2.4 CASE statements can be used to create columns (new variables) for
categorizing data
filtering data
aggregating data based on results of a logical test
e.g.
COUNT(CASE statement)
–>returns the number of rows returned by case statements instead of string / textsimilar aggregations include SUM/AVG/ROUND(AVG(…), digits)
Percentages with CASE and AVG
SELECT
column1,ROUND(AVG(CASE WHEN condition1 AND condition2 THEN 0
WHEN condition1 AND condition3 THEN 1
END), 2) AS pct_column,
FROM table1
GROUP BY column1
SELECT
id,
home_goal,
away_goal,CASE WHEN home_goal > away_goal THEN 'Home Team Win'
WHEN home_goal < away_goal THEN 'Away Team Win'
ELSE 'Tie' END AS outcome
/* ELSE NULL AS outcome --> if everything else should be N.A.
END AS outcome --> if we want to exclude every other conditions
END IS NOT NULL --> keep all clauses excluding missing values*/
FROM match
WHERE season = '2013/2014';
/* Example 1: CASE WHEN + COUNT GROUP BY */
-- Identify the home team as Bayern Munich, Schalke 04, or neither
SELECT
CASE WHEN hometeam_id = 10189 THEN 'FC Schalke 04'
WHEN hometeam_id = 9823 THEN 'FC Bayern Munich'
ELSE 'Other' END AS home_team,
COUNT(id) AS total_matches
FROM matches_germany
-- Group by the CASE statement alias
GROUP BY home_team;
/* Example 2: CASE WHEN + LEFT JOIN + WHERE */
SELECT
date,
m.AS opponent,
t.team_long_name -- Complete the CASE statement with an alias
CASE WHEN m.home_goal > away_goal THEN 'Barcelona win!'
WHEN m.home_goal < away_goal THEN 'Barcelona loss :('
ELSE 'Tie' END AS outcome
FROM matches_spain AS m
LEFT JOIN teams_spain AS t
ON m.awayteam_id = t.team_api_id
-- Filter for Barcelona as the home team
WHERE m.hometeam_id = 8634;
4.2.5 TEXT operations
- Select names starting with vowels
SELECT column
FROM table
-- start with vowels
WHERE column LIKE '[aeiou]%';
-- end with vowels
WHERE column LIKE '%[aeiou]'
-- both start and end w/ vowels
WHERE column LIKE '[aeiou]%[aeiou]'
-- contains vowels in any positions
WHERE column LIKE '%[aeiou]%';
-- if NOT containing such characters then simiply do
NOT LIKE ...
LIKE:
Simple string pattern matching
LIKE Operator | Description |
---|---|
WHERE column LIKE 'a%' |
Finds any values that start with “a” |
WHERE column LIKE '%a' |
Finds any values that end with “a” |
WHERE column LIKE '%or%' |
Finds any values that have “or” in any position |
WHERE column LIKE '_r%' |
Finds any values that have “r” in the second position |
WHERE column LIKE 'a_%' |
Finds any values that start with “a” and are at least 2 characters in length |
WHERE column LIKE 'a__%' |
Finds any values that start with “a” and are at least 3 characters in length |
WHERE column LIKE 'a%o' |
Finds any values that start with “a” and ends with “o” |
WHERE column LIKE '[aeiou]%' |
Finds any values that start with any characters in “a”, “e”, “i”, “o”, or “u” |
SELECT
colunm,column) AS column_length -- returns # of chars
LEN(LEFT(column, 20) AS first_20_left_column -- returns first 20 chars from the LEFT
RIGHT(column, 20) AS last_20_column -- returns last 20 chars from the RIGHT
'_', column) AS char_location -- returns the index of a char/string in column
CHARINDEX(column, start, length) AS target_section
SUBSTRING(-- returns substring starting at location 12 and has length of 12
FROM table;
/* REPLACE */
SELECT
5) REPLACE(column, 'a','b') AS replaced_column_with_b
TOP(-- replaces char 'a' in column with char 'b' for all first 5 rows
FROM table;
4.2.6 Substituting NULL values using COALESCE in T-SQL
COALESCE(value_1, value_2, value_3, ..., value_n)
-- if value_n is NULL and value_2 is not NULL, return value_2
-- ... value_n can also be some column
SELECT column1, column2
COALESCE(column1, column2, 'N/A') AS new_column
FROM table1
-- using ISNULL() function to replace all mising values in one column with values in another column or some other specified value
SELECT column1, ISNULL(column1, column2) AS new_column1
FROM table1
4.2.7 DATE
DATEADD (DATEPART, number, date):
Add or subtract datetime values —> always returns a dateDATEPART:
Unit of measurement (DD, MM, etc.)number:
an integer value to adddate:
a datetime value
DATEDIFF (datepart, startdate, enddate):
Obtain the difference between two datetime values —> always returns a numberdatepart:
unit of measurement (DD, MM, etc,)startdate:
the starting date valueenddate:
the ending datetime value
/* date math w/ DATEDIFF */ SELECT date1, date2, AS diff_day, DATEDIFF(DD, date1, date2) AS diff_month, DATEDIFF(MM, date1, date2) AS diff_year DATEDIFF(YYYY, date1, date2) FROM table1 /* date math w/ DATEADD */ SELECT date1,7, date1) AS date_after_a_week DATEADD(DD, FROM table1
4.2.8 ROUND and TRUNCATE
/* IN T-SQL */
-- round(17.8, 0) --> 18 while truncate rounds 17.8 to 17
SELECT column1,
ROUND(column2, 0) AS RoundingtoWhole,
ROUND(column2, 0, 1) AS Truncating -- adding a third parameter 1 to round for truncating
FROM table1
OTHER MATH functions
ABS():
absolute value
SQRT():
square root | SQUARE():
square
LOG(column, digit):
returns the LOG with base # of digit
4.3 Advanced SQL - loops/CTE/Windows
4.3.1 WHILE Loops & DECLARE
Assigning values to variables and write loops
-- Declare the variable
DECLARE @my_var INT
-- Use SET a value to the variable
SET @my_var = 1
-- Show the value
SELECT @my_var
-- Specify the condition of the WHILE loop
WHILE @my_var < 10
-- Begin the code to execude insde the WHILE loop
BEGIN
-- Keep incrementing the value of @my_var
SET @my_var = @my_var + 1
-- Check if @my_var is equal to 4
IF @my_var = 4
-- When my_var is 4 then break the loop
BREAK
-- End WHILE lopp
END
-- View the value after the loop
SELECT @my_var
4.3.2 Derived Tables
another name for a query acting as a table and are commonly used to do aggregations in T-SQL
use derived tables when we want to break down a complex query into smaller steps
great solution if we want to create intermediate calculations needed to be used in a larger query
SELECT a.* FROM Kidney a
-- This derived table computes the Average age are joined to the actual table
JOIN (SELECT AVG(age) AS AverageAge FROM kidney) b
ON a.Age = b.AverageAge
-- a & b are aliases for table Kidney and our derived table
4.3.3 CTE (Common Table Expressions)
another type of derived table
can be used multiple times in a query and are defined like a table
-- CTE definitions start with the keyword WITH
-- Followed by the CTE names and the columns it contains
WITH CTEName (Co1, Col2)
AS
-- Define the CTE query
(-- the two columns from the definition above
SELECT Col1, Col2
FROM TableName
)
/* CTEs in T-SQL example */
-- Create a CTE to get the Maximum BloodPressure by Type
WITH BloodPressureAge(Age, MaxBloodPressure)
AS
SELECT Age, MAX(BloodPressure) AS MaxBloodPressure
(FROM Kidney
GROUP BY Age)
-- Create a query to use the CTE as a table
SELECT a.Age, MIN(a.BloodPressure), b.MaxBloodPressure
FROM Kidney a
-- join the CTE with the table
JOIN BloodPressureAge b
ON a.Age = b.Age
GROUP BY a.Age, b.MaxBloodPressure
4.3.4 Window Functions in SQL
Create the window with
OVER
clausePARTITION BY
creates the frameIf you do not include
PARTITION BY
the frame is the entire tableAllows aggregations to be created at the same time as the window
-- Create a Window data grouping
SELECT SalesPerson, SalesYear, CurrentQuota,
SUM(CurrentQuota)
OVER (PARTITION BY SalesYear ORDER BY SalesYear) AS YearlyTotal,
AS ModDate
ModifiedDate FROM SaleGoal
/* the above example partitions the table by SalesYear and uses the windowing function SUM to add up every row of the CurrentQuota column in the window to provide a total for the entire window in the YearTotal column.
When the year changes, the value in the YearTotal column changes showing the total for the next year becuz the window is partitioned by sales year.
*/
4.3.5 Windows functions Cheatsheet
Col1 | Col2 |
---|---|
ROW_NUMBER() |
Displays the unique number/index of a given row (no duplicates as compared to rank/dense_rank); starts are 1 and numbers the rows according to the ORDER BY part of window statements. |
RANK() |
Gives the identical rows a rank of 2, then skip ranks 3 and 4, so next rank would be 5. |
DENSE_RANK() |
Also gives identical rows a rank of 2, but the following rank would be 3 – no ranks would be skipped. |
NTILE(*# of buckets*) |
Identify what percentile (or quantile, or any other subdivison) a given row falls into. |
LAG(column, *# of rows*) |
Creates columns that pull previous values from other rows (second parameter specifies how many rows we need to pull) |
LEAD(column, *# of rows*) |
Creates columns that pull following/next values from other rows (second parameter specifies how many rows we need to pull) |
SUM() | COUNT() | AVG() | STDEV() |
Creates columns that calculate corresponding values based on mathematical operations |
4.3.5.1 Getting first/last and next/previous values
FIRST_VALUE()
returns the first values in the windowLAST_VALUE()
returns the last values in the window
--> becuz we are trying to find a specific record in a window, the position of the records must be specified using the ORDER BY
command.
-- Select the columns
SELECT SalesPerson, SalesYear, CurrentQuota,
-- First value from every window
FIRST_VALUE(CurrentQuota)
OVER (PARTITION BY SalesYear ORDER BY ModifiedDate) AS StartQuota,
-- Last value from every window
LAST_VALUE(CurrentQuota)
OVER (PARTITION BY SalesYear ORDER BY ModifiedDate) AS EndQuota,
as ModDate
ModifiedDate FROM SaleGoal
Getting the next value with
LEAD()
and previous value withLAG()
provides the ability to query the value from NEXT or PREVIOUS row
also requires the use of
ORDER BY
to order the rows
/* getting next value */ SELECT SalesPerson, SalesYear, CurrentQuota, -- Create a window function to get values from next row LEAD(CurrentQuota) -- use LAG(...) if need previous value OVER (PARTITION BY SalesYear ORDER BY ModifiedDate) AS NextQuota, as ModDate ModifiedDate FROM SaleGoal
4.3.5.2 Adding row numbers
ROW_NUMBER():
sequentially numbers the rows in the windowORDER BY
is required when usingROW_NUMBER()
/* example of using ROW_NUMBER() to assign a new row number for each row in the window */
SELECT SalesPerson, SalesYear, CurrentQuota,
ROW_NUMBER()
OVER (PARTITION BY SalesPerson ORDER BY SalesYear) AS QuotabySalesPerson
FROM SaleGoal
4.3.5.3 Using windows for Statistical Functions
STDEV(): calculates the standard deviation
SELECT SalesPerson, SalesYear, CurrentQuota, STDEV(CurrentQuota)OVER () AS StandardDev, -- do not include partition by if we are calculating for the entire table AS ModDate ModifiedDate FROM SaleGoal
Calculating the mode: 1) create a CTE containing an ordered count of values using ROW_NUMBER(); 2) write a query using the CTE to pick the value with the highest row number
-- Create a CTE called QuotaCount WITH QuotaCount AS (SELECT SalesPerson, SalesYear, CurrentQuota, ROW_NUMBER() -- creating ordered count of values OVER (PARTITION BY CurrentQuota ORDER BY CurrentQuota) AS QuotaList FROM SaleGoal ) -- select everything from CTE SELECT * FROM QuotaCount -- write a query using CTE to pick the value with highest count (i.e. row number) SELECT CurrentQuota, QuotaList AS Mode FROM QuotaCount WHERE QuataList IN (SELECT MAX(QuotaList) FROM QuotaCount) -- select value in maximum count
4.3.6 Defining a window alias
Example:
SELECT column1,
column2,NTILE(4) OVER
PARTITION BY column1 ORDER BY column2)
(AS quartile,
NTILE(5) OVER
PARTITION BY column1 ORDER BY column2)
(AS quintile,
NTILE(100) OVER
PARTITION BY column1 ORDER BY column2)
(AS percentile
FROM table_name
WHERE condition(s)
ORDER BY column1, column2
Re-written example using window alias
SELECT column1,
column2NTILE(4) OVER ntile_window AS quartile,
NTILE(5) OVER ntile_window AS quintile,
NTILE(100) OVER ntile_window AS percentile
FROM table_name
WHERE condition(s)
/* NOTE if include the WINDOW clause, should always come AFTER the WHERE clause */
AS
WINDOW ntile_window PARTITION BY column1 ORDER BY column2)
(ORDER BY column1, column2