Chapter 5 CS145 — DBs & Data Systems

5.1 Introduction — DB overview

Applications of DBs and Data Systems
Properties of general DBs, special-purpose DBs, data lakes
Unpack a DB: Example of a mobile game using a DB
- For Whom and Why?
- Sample data architectures
Goal of standard databases
- Platform to store and manage data (read/learn/modify) while supporting scale, speed, stability, evolution…
Goals of special databases
- DBs are often optimized for key use cases
  - store current data (e.g. lots of leads)
  - optimize for historical data (e.g. logs)
  - run batch workloads (e.g. training)
Data System “v1” on cloud
1. log user actions
2. store in DB, after ETL
3. run queries in a petascale analytics system (BigQuery)
4. visualize query results
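
The four steps above can be sketched end to end on a toy scale. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the cloud analytics system, and the event fields (player, action, ts) and the function names are hypothetical, not part of the course material.

```python
import json
import sqlite3

# Step 1: raw logged user actions (hypothetical events from a game client).
RAW_LOG = [
    '{"player": "ann", "action": "level_up", "ts": 1}',
    '{"player": "bob", "action": "purchase", "ts": 2}',
    '{"player": "ann", "action": "purchase", "ts": 3}',
    'not valid json',  # malformed line that the ETL step should drop
]


def etl(lines):
    """Step 2a: parse raw log lines into tuples, skipping malformed ones."""
    rows = []
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        rows.append((event["player"], event["action"], event["ts"]))
    return rows


def load(rows):
    """Step 2b: store cleaned rows in a database (in-memory SQLite here)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE actions (player TEXT, action TEXT, ts INTEGER)")
    conn.executemany("INSERT INTO actions VALUES (?, ?, ?)", rows)
    return conn


def actions_per_player(conn):
    """Step 3: an analytic query (number of actions per player)."""
    cur = conn.execute(
        "SELECT player, COUNT(*) FROM actions GROUP BY player ORDER BY player"
    )
    return cur.fetchall()


def render(results):
    """Step 4: a text-only 'visualization' of the query results."""
    return [f"{player}: {'#' * count}" for player, count in results]


conn = load(etl(RAW_LOG))
results = actions_per_player(conn)
chart = render(results)
print("\n".join(chart))
```

In a real deployment each step would likely be a separate service; the point here is only the shape of the flow: log, clean and load, query, visualize.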

KEY CONCEPTS
- Data is stored in blocks (aka “partitions”)
- Sequential IO is 10x-100x+ faster than “many” random IOs
  - e.g. 1 MB (located sequentially) versus 1 Million bytes in random locations
- HDDs/SSDs copy sequential “big” blocks of bytes to/from RAM

*Note: WHY are they different?

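Total time to ReadData = Access Latency * M + DataSize / ScanThroughput
(M = number of separate, non-contiguous accesses. A sequential read pays the access latency once, M = 1; reading the same bytes from many random locations pays it M times.)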
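
The cost formula above can be turned into a small calculator. This is a minimal sketch, assuming M means the number of separate (non-contiguous) accesses. The latency and throughput values are made-up round numbers for illustration, not measurements of any real device.

```python
def read_time_seconds(access_latency_s, num_accesses, data_size_bytes,
                      scan_throughput_bytes_per_s):
    """Total time = access latency * M + data size / scan throughput."""
    return (access_latency_s * num_accesses
            + data_size_bytes / scan_throughput_bytes_per_s)


# Illustrative device parameters (assumptions, not measurements).
LATENCY_S = 1e-4       # 100 microseconds per access
THROUGHPUT_BPS = 1e9   # 1 GB/s scan throughput
DATA_BYTES = 1_000_000  # 1 MB in total

# Case 1: the 1 MB is laid out sequentially, so one access (M = 1).
sequential = read_time_seconds(LATENCY_S, 1, DATA_BYTES, THROUGHPUT_BPS)

# Case 2: the same bytes sit in 1 million random locations (M = 1,000,000).
scattered = read_time_seconds(LATENCY_S, 1_000_000, DATA_BYTES, THROUGHPUT_BPS)

print(f"sequential: {sequential:.4f} s")
print(f"scattered:  {scattered:.4f} s")
```

With these assumed numbers the latency term dominates the scattered case, while the transfer term is identical in both. The actual ratio depends entirely on the device and on how many accesses are needed.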

Clouds of Machines

Logical data independence: we can add a new column or attribute without rewriting the application.
Physical data independence: you should NOT need to care which disks/machines the data are stored on.
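
Logical data independence can be demonstrated in a few lines. The sketch below uses Python's built-in sqlite3 module. The players table and the player_names function are hypothetical stand-ins for an application and its schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO players (name) VALUES ('ann')")


def player_names(conn):
    """Application code: it names the columns it needs, nothing more."""
    rows = conn.execute("SELECT name FROM players ORDER BY id")
    return [row[0] for row in rows]


before = player_names(conn)

# The schema evolves: a new attribute is added to the table.
conn.execute("ALTER TABLE players ADD COLUMN country TEXT")
conn.execute("INSERT INTO players (name, country) VALUES ('bob', 'US')")

# The same application function still works, unchanged.
after = player_names(conn)
print(before, after)
```

The application keeps working because it asks for columns by name rather than relying on the table's full layout. Code that used SELECT * and positional indexing would be more fragile under the same change.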

Relational model (aka Tables)
- simple and most popular
- elegant algebra (E.F. Codd et al.)
- Structured data (e.g. a typed Schema)
Hierarchical model (aka JSON-like Tree)
- semi-structured data
Example on Tradeoffs
- Key “CS” ideas:
  - structured, or semi-structured w/ lots of optional fields?
  - how deep is the tree structure?
  - what kinds of queries and updates do you want to run? (e.g. customer-oriented queries? Purchase-oriented queries?)
  - Rough rule of thumb
    - Relational: Google Ads
    - Hierarchical: document search. Lots of optional fields, many levels deep, and queries mostly oriented around the top level of the doc.
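
To make the tradeoff concrete, here is the same toy data in both shapes, using plain Python structures. This is a sketch only: the customers, items, and function names are invented for illustration, and real systems would add indexes and a query language on top.

```python
# Hierarchical (JSON-like tree): purchases are nested under each customer,
# and optional fields (e.g. "nickname") appear only where present.
customers_tree = [
    {"name": "ann",
     "purchases": [{"item": "sword", "price": 5},
                   {"item": "shield", "price": 7}]},
    {"name": "bob", "nickname": "b0b",
     "purchases": [{"item": "sword", "price": 5}]},
]

# Relational (tables): flat rows with a fixed set of columns,
# linked by customer_id.
customers = [(1, "ann"), (2, "bob")]                  # (customer_id, name)
purchases = [(1, "sword", 5), (1, "shield", 7),       # (customer_id, item, price)
             (2, "sword", 5)]


def items_for_customer_tree(tree, name):
    """Customer-oriented query: natural on the tree, one subtree is read."""
    for customer in tree:
        if customer["name"] == name:
            return [p["item"] for p in customer["purchases"]]
    return []


def buyers_of_item_tree(tree, item):
    """Purchase-oriented query on the tree: every subtree must be walked."""
    return [c["name"] for c in tree
            if any(p["item"] == item for p in c["purchases"])]


def buyers_of_item_rel(customers, purchases, item):
    """Purchase-oriented query on tables: filter purchases, then join."""
    names = dict(customers)
    return sorted({names[cid] for cid, it, _ in purchases if it == item})


print(items_for_customer_tree(customers_tree, "ann"))
print(buyers_of_item_tree(customers_tree, "sword"))
print(buyers_of_item_rel(customers, purchases, "sword"))
```

The tree answers top-level, customer-oriented questions directly and tolerates optional fields. The flat tables treat customer-oriented and purchase-oriented questions more symmetrically, at the cost of a join.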