Chapter 5 CS145 — DBs & Data Systems

5.1 Introduction — DB overview

  • Applications of DBs and Data Systems

  • Properties of general DBs, special-purpose DBs, data lakes

  • Unpack a DB: Example of a mobile game using a DB

    • For Whom and Why?

    • Sample data architectures

  • Goal of standard databases

    • Platform to store, manage data (read/learn/modify) —> supporting scale/speed/stability/evolution…
  • Goals of special databases

    • DBs are often optimized for key use cases

      • store current data (e.g. lot of leads)

      • optimize historical data (e.g. logs)

      • run batch workloads (training)

  • Data System “v1” on cloud

    1. log user actions

    2. store in DB, after ETL

    3. run queries in a peta scale analytcis system (BigQuery)

    4. visualize query results

5.1.1 Project goals

  • Run queries on public datasets

  • Explore/Visualize public datasets

  • Predict using Machine Learning

  • Full Query-Visualize-Learn data cycle on cloud stack

5.1.2 IO Blocks for Efficiency

  • KEY CONCEPTS

    • Data is stored in Blocks (aka “partition”)

    • Sequential IO is 10x-100x+ faster than “many” random IO

      • e.g. 1 MB (located sequentially) versus 1 Million bytes in random locations
    • HDDs/SSDs copy sequential “big” blocks of bytes to/from RAM

5.1.3 Basic System Numbers

  • Access Latency (secs) = Time to access Block’s start location

  • Scan Throughput (GB/sec) = Speed to Scan + Copy data to RAM

*Note: WHY are they different?

  • Digital (e.g. SSDs and RAM) vs Analog (e.g. HDD seeks)

  • Distance from CPU

Access Latency (sec) Scan Throughput (GB/sec) What you get for ~100
RAM ~100 nanosec ~100 GB/sec
High-end SSD
HDD Seek (hard disk)
Machines M1 TO M2 (network)

5.1.4 IO Cost Model

Total time to ReadData = Access Latency * M + DataSize / ScanThroughput

  • DataSize = data size

  • M = # of non-contiguous Blocks

  • AccessLatency = Time to access Block

  • ScanThroughput = Speed to copy/scan to RAM

5.2 System Primer

Clouds of Machine

5.3

5.3.1 Data Independence

  • Logical data independence —> we can add a new column or attribute w/o rewriting the application.

  • Physical data independence —> you should NOT care which disks/machines the data are stored on.

5.3.2 Data Model

  1. Relational model (aka Tables)

    • simple and most popular

    • elegant algebra (E.F. Codd et al)

    • Structured data (e.g. a typed Schema)

  2. Hierarchical model (aka JSON-like Tree)

    • semi-structured data
  3. Example on Tradeoffs

    • Key “CS” ideas:

      • structured, or semi-structured w/ lots of optional fields?

      • how deep is the tree structure?

      • what kinds of queries and updates do you want to run? (e.g. customer-oriented queries? Purchase-oriented queries?)

      • Rough rule of thumb

        • Relational: Google Ads

        • Hierarchical: document search. Lots of optional fields? many levels deep. Mostly search oriented around the top level of doc.