Normalization
Normalization is the process of structuring a relational database to reduce data redundancy and improve data integrity. It decomposes tables using functional dependencies to achieve desirable properties, defined as normal forms.
Anomalies in Unnormalized Schemas
Update anomaly: a fact is stored redundantly; updating one copy but not others leads to inconsistency.
Insertion anomaly: cannot insert a row without also supplying unrelated data.
Deletion anomaly: deleting a row unintentionally removes unrelated information.
Example of a problematic table:
StudentCourse(student_id, student_name, dept, course_id, course_name, grade)
Redundancies:
student_nameanddeptare repeated for every course a student takes.course_nameis repeated for every student taking the course.
Anomalies:
- If a student changes their name, must update all their enrollment rows.
- Cannot add a new course without a student enrollment.
- Dropping a student’s last course deletes the course name.
Functional Dependencies (Recap)
$X \to Y$: knowing the value of $X$ is enough to determine $Y$ uniquely.
student_id -> student_name, deptcourse_id -> course_namestudent_id, course_id -> grade
Primary key determination: $K$ is a key if $K \to \text{(all attributes)}$.
First Normal Form (1NF)
Rule: all attribute values are atomic (no multi-valued attributes, no repeating groups).
Violation example:
Student(id, name, phones) -- phones is a list: {555-1234, 555-5678}
Fix: separate table StudentPhone(student_id, phone).
1NF is assumed by standard relational theory (all modern SQL tables are 1NF by definition).
Second Normal Form (2NF)
Rule: the table is in 1NF, and every non-key attribute is fully functionally dependent on the entire primary key (no partial dependencies).
Applies only when: the primary key is composite.
Example: Enrollment(student_id, course_id, student_name, course_name, grade).
Primary key: (student_id, course_id).
Partial dependencies:
student_id -> student_name(depends on part of the key).course_id -> course_name(depends on part of the key).
Fix: decompose.
Student(student_id PK, student_name)
Course(course_id PK, course_name)
Enrollment(student_id FK, course_id FK, grade) -- PK = (student_id, course_id)
Third Normal Form (3NF)
Rule: the table is in 2NF, and no non-key attribute is transitively dependent on the primary key.
Transitive dependency: $A \to B$ and $B \to C$ implies $A \to C$ transitively.
Example: Employee(emp_id, dept_id, dept_name, dept_location).
emp_id -> dept_id and dept_id -> dept_name, dept_location, so emp_id -> dept_name, dept_location transitively.
Fix: decompose.
Employee(emp_id PK, dept_id FK)
Dept(dept_id PK, dept_name, dept_location)
3NF sufficient for most practical schemas. It guarantees a lossless-join decomposition that preserves all functional dependencies.
Boyce-Codd Normal Form (BCNF)
Rule: for every non-trivial FD $X \to Y$, $X$ is a superkey.
BCNF is stronger than 3NF. Every BCNF relation is in 3NF, but not vice versa.
Example: CourseInstructor(course, instructor, room).
FDs: {course, instructor} -> room and room -> instructor.
The FD room -> instructor violates BCNF (room is not a superkey).
Fix:
CourseRoom(course PK, room)
RoomInstructor(room PK, instructor)
Trade-off: BCNF decomposition may not preserve all FDs (unlike 3NF). Sometimes 3NF is preferred to maintain dependency preservation.
Fourth Normal Form (4NF)
Rule: the table is in BCNF, and there are no non-trivial multi-valued dependencies.
Multi-valued dependency (MVD): $X \twoheadrightarrow Y$ means: for a given $X$, the set of $Y$ values is independent of the other attributes.
Example: EmployeeSkillHobby(emp, skill, hobby). If employees have multiple skills and multiple hobbies independently, this has MVDs emp ->-> skill and emp ->-> hobby.
Fix: split into EmployeeSkill(emp, skill) and EmployeeHobby(emp, hobby).
Denormalization
Deliberate introduction of redundancy for performance.
When to denormalize:
- Join-heavy queries are too slow.
- Read performance is critical and write consistency can be managed.
- The data is analytics-oriented (OLAP).
Techniques:
- Duplicate attributes from frequently joined tables.
- Pre-compute aggregations (summary tables, materialized views).
- Use JSON columns for flexible nested data.
Trade-off: faster reads; more complex writes (must keep redundant data consistent); risk of anomalies.
Normal Form Summary
| NF | Eliminates |
|---|---|
| 1NF | Multi-valued attributes, repeating groups |
| 2NF | Partial dependencies on composite key |
| 3NF | Transitive dependencies |
| BCNF | All FD violations where LHS is not a superkey |
| 4NF | Multi-valued dependencies |