Introduction to Databases

A database is an organized collection of data that supports efficient storage, retrieval, and modification. A Database Management System (DBMS) is the software layer that manages the database and provides a query interface.

Why Use a Database?

Without a DBMS: store data in files. Problems: data redundancy (same data in multiple files), inconsistency (updates in one file not reflected in others), difficulty querying, no concurrency control (two programs writing simultaneously can corrupt data), no crash recovery.

With a DBMS: data abstraction, declarative queries, consistency enforcement, concurrent access, crash recovery, access control.

Core Concepts

Data model: a conceptual framework for representing data and relationships. Relational, document, key-value, graph, column-family, time-series.

Schema: the structure of the database: tables, columns, data types, constraints.

Instance: the actual data stored at a given point in time.

Query language: a way to retrieve and manipulate data. SQL (relational), MongoDB Query Language, Cypher (graph), DynamoDB API.

Transaction: a unit of work that must execute atomically (all or nothing).

DBMS Architecture

Application
  -> Query Interface (SQL)
  -> Query Processor
       -> Parser
       -> Optimizer
       -> Execution Engine
  -> Storage Engine
       -> Buffer Pool (page cache)
       -> Storage Manager (files, pages)
  -> Transaction Manager
       -> Lock Manager
       -> Log Manager

Query processing pipeline:

Parse: validate SQL syntax; build a parse tree.
Rewrite: apply logical transformations (view substitution, predicate pushdown).
Optimize: enumerate physical plans; estimate costs; choose the best plan.
Execute: run the plan and return results.

Data Abstraction Levels

Physical level: how data is stored on disk (pages, files, indexes).

Logical level: what data is stored (tables, columns, types, constraints).

View level: a customized view of data for a specific user or application.

This separation enables changes at one level without affecting others.

ACID Properties

Atomicity: a transaction either commits (all changes applied) or aborts (all changes rolled back). No partial updates.

Consistency: a transaction takes the database from one consistent state to another. All constraints hold before and after.

Isolation: concurrent transactions execute as if they were serial. One transaction’s intermediate state is not visible to others.

Durability: once a transaction commits, its changes persist even after a crash.

Types of Databases

Type	Model	Examples	Best For
Relational	Tables, SQL	PostgreSQL, MySQL, SQLite, Oracle	Structured data, complex queries
Document	JSON documents	MongoDB, Firestore, Couchbase	Nested, schema-flexible data
Key-value	Simple lookup	Redis, DynamoDB, Memcached	Caching, session data
Column-family	Wide rows	Cassandra, HBase	Write-heavy, analytics
Graph	Nodes, edges	Neo4j, Amazon Neptune	Relationships, social networks
Time-series	Timestamped events	InfluxDB, TimescaleDB	IoT, metrics, logs
Search	Inverted index	Elasticsearch, Solr	Full-text search

CAP Theorem

A distributed database can guarantee at most two of three properties:

Consistency: all nodes see the same data at the same time.

Availability: every request receives a response (not necessarily the latest data).

Partition tolerance: the system continues to operate despite network partitions.

Since network partitions are unavoidable in distributed systems, the real choice is between consistency and availability during a partition.

CP systems (consistency + partition tolerance): HBase, Zookeeper. May be unavailable during partitions.

AP systems (availability + partition tolerance): Cassandra, CouchDB. May return stale data during partitions.

OLTP vs. OLAP

OLTP (Online Transaction Processing): many short read/write transactions; normalized schema; row-oriented storage. Examples: PostgreSQL, MySQL. Use case: applications (banking, e-commerce).

OLAP (Online Analytical Processing): complex analytical queries over large datasets; star/snowflake schema; column-oriented storage. Examples: Redshift, BigQuery, Snowflake, ClickHouse. Use case: business intelligence, data warehousing.

HTAP (Hybrid Transactional/Analytical Processing): handle both workloads. Examples: TiDB, SingleStore.