Chapter 5 CS145 — DBs & Data Systems
5.1 Introduction — DB overview
Applications of DBs and Data Systems
Properties of
general DBs
,special-purpose DBs
,data lakes
Unpack a DB: Example of a mobile game using a DB
For Whom and Why?
Sample data architectures
Goal of
standard databases
- Platform to store, manage data (read/learn/modify) —> supporting scale/speed/stability/evolution…
Goals of
special databases
DBs are often optimized for key use cases
store current data (e.g. lot of leads)
optimize historical data (e.g. logs)
run batch workloads (training)
Data System “v1” on cloud
log user actions
store in DB, after ETL
run queries in a peta scale analytcis system (BigQuery)
visualize query results
5.1.1 Project goals
Run queries on public datasets
Explore/Visualize public datasets
Predict using Machine Learning
Full Query-Visualize-Learn data cycle on cloud stack
5.1.2 IO Blocks for Efficiency
KEY CONCEPTS
Data is stored in Blocks (aka “partition”)
Sequential IO is 10x-100x+ faster than “many” random IO
- e.g. 1 MB (located sequentially) versus 1 Million bytes in random locations
HDDs/SSDs copy sequential “big” blocks of bytes to/from RAM
5.1.3 Basic System Numbers
Access Latency (secs)
= Time to access Block’s start locationScan Throughput (GB/sec)
= Speed to Scan + Copy data to RAM
*Note: WHY are they different?
Digital (e.g. SSDs and RAM) vs Analog (e.g. HDD seeks)
Distance from CPU
Access Latency (sec) | Scan Throughput (GB/sec) | What you get for ~100 | |
---|---|---|---|
RAM | ~100 nanosec | ~100 GB/sec | |
High-end SSD | |||
HDD Seek (hard disk) | |||
Machines M1 TO M2 (network) |
5.3
5.3.1 Data Independence
Logical data independence
—> we can add a new column or attribute w/o rewriting the application.Physical data independence
—> you should NOT care which disks/machines the data are stored on.
5.3.2 Data Model
Relational model (aka Tables)
simple and most popular
elegant algebra (E.F. Codd et al)
Structured data (e.g. a typed Schema)
Hierarchical model (aka JSON-like Tree)
- semi-structured data
Example on Tradeoffs
Key “CS” ideas:
structured, or semi-structured w/ lots of optional fields?
how deep is the tree structure?
what kinds of queries and updates do you want to run? (e.g. customer-oriented queries? Purchase-oriented queries?)
Rough rule of thumb
Relational: Google Ads
Hierarchical: document search. Lots of optional fields? many levels deep. Mostly search oriented around the top level of doc.