Tuesday, November 6, 2007

FIRST NORMAL FORM

FIRST NORMAL FORM (1NF or Minimal Form) is a normal form used in database normalization. A relational database table that adheres to 1NF is one that meets a certain minimum set of criteria. These criteria are basically concerned with ensuring that the table is a faithful representation of a relation[1] and that it is free of repeating groups[2].


 

The concept of a "repeating group" is, however, understood in different ways by different theorists. As a consequence, there is not universal agreement as to which features would disqualify a table from being in 1NF. Most notably, 1NF as defined by some authors (for example, Ramez Elmasri and Shamkant B. Navathe[3], following the precedent established by E.F. Codd) excludes relation-valued attributes (tables within tables); whereas 1NF as defined by other authors (for example, Chris Date) permits them.


 

According to Date's definition of 1NF, a table is in 1NF if and only if it is "isomorphic to some relation", which means, specifically, that it satisfies the following five conditions:1. There's no top-to-bottom ordering to the rows.

2. There's no left-to-right ordering to the columns.

3. There are no duplicate rows.

4. Every row-and-column intersection contains exactly one value from the applicable domain (and nothing else).

5. All columns are regular [i.e. rows have no hidden components such as row IDs, object IDs, or hidden timestamps].

—Chris Date, "What First Normal Form Really Means", pp. 127-8[4]


 


 

Violation of any of these conditions would mean that the table is not strictly relational, and therefore that it is not in 1NF.


 

Examples of tables (or views) that would not meet this definition of 1NF are:

A table that lacks a unique key. Such a table would be able to accommodate duplicate rows, in violation of condition 3.

A view whose definition mandates that results be returned in a particular order, so that the row-ordering is an intrinsic and meaningful aspect of the view.[5] This violates condition 1. The tuples in true relations are not ordered with respect to each other.

A table with at least one nullable attribute. A nullable attribute would be in violation of condition 4, which requires every field to contain exactly one value from its column's domain. It should be noted, however, that this aspect of condition 4 is controversial. It marks an important departure from Codd's original vision of the relational model, which made explicit provision for nulls.[6]


 

Repeating groups


 

Date's fourth condition, which expresses "what most people think of as the defining feature of 1NF"[7], is concerned with repeating groups. The following example illustrates how a database design might incorporate repeating groups, in violation of 1NF.

Example 1: Domains and values


 

Suppose a novice designer wishes to record the names and telephone numbers of customers. He defines a customer table which looks like this:

CustomerCustomer ID    First Name    Surname    Telephone Number

123    Rachel    Ingram    555-861-2025

456    James    Wright    555-403-1659

789    Maria    Fernandez    555-808-9633


 

The designer then becomes aware of a requirement to record multiple telephone numbers for some customers. He reasons that the simplest way of doing this is to allow the "Telephone Number" field in any given record to contain more than one value:

CustomerCustomer ID    First Name    Surname    Telephone Number

123    Rachel    Ingram    555-861-2025

456    James    Wright    555-403-1659

555-776-4100

789    Maria    Fernandez    555-808-9633


 

Assuming, however, that the Telephone Number column is defined on some Telephone Number-like domain (e.g. the domain of strings 12 characters in length), the representation above is not in 1NF. 1NF (and, for that matter, the RDBMS) prohibits a field from containing more than one value from its column's domain.

Example 2: Repeating groups across columns


 

The designer might attempt to get around this restriction by defining multiple Telephone Number columns:

CustomerCustomer ID    First Name    Surname    Tel. No. 1    Tel. No. 2    Tel. No. 3

123    Rachel    Ingram    555-861-2025        

456    James    Wright    555-403-1659    555-776-4100    

789    Maria    Fernandez    555-808-9633        


 

This representation, however, makes use of nullable columns, and therefore does not conform to Date's definition of 1NF. Even if the view is taken that nullable columns are allowed, the design is not in keeping with the spirit of 1NF. Tel. No. 1, Tel. No. 2., and Tel. No. 3. share exactly the same domain and exactly the same meaning; the splitting of Telephone Number into three headings is artificial and causes logical problems. These problems include:

Difficulty in querying the table. Answering such questions as "Which customers have telephone number X?" and "Which pairs of customers share a telephone number?" is awkward.

Inability to enforce uniqueness of Customer-to-Telephone Number links through the RDBMS. Customer 789 might mistakenly be given a Tel. No. 2 value that is exactly the same as her Tel. No. 1 value.

Restriction of the number of telephone numbers per customer to three. If a customer with four telephone numbers comes along, we are constrained to record only three and leave the fourth unrecorded. This means that the database design is imposing constraints on the business process, rather than (as should ideally be the case) vice-versa.

Example 3: Repeating groups within columns


 

The designer might, alternatively, retain the single Telephone Number column but alter its domain, making it a string of sufficient length to accommodate multiple telephone numbers:

CustomerCustomer ID    First Name    Surname    Telephone Number

123    Rachel    Ingram    555-861-2025

456    James    Wright    555-403-1659, 555-776-4100

789    Maria    Fernandez    555-808-9633


 


 

This is arguably the worst design of all, and again not in keeping with the spirit of 1NF. The Telephone Number heading becomes semantically woolly, as it can now represent either a telephone number, a list of telephone numbers, or indeed anything at all. A query such as "Which pairs of customers share a telephone number?" is virtually impossible to formulate, given the necessity to cater for lists of telephone numbers as well as individual telephone numbers. Meaningful constraints on telephone numbers are also impossible to define in the RDBMS with this design.

A design that complies with 1NF

A design that is unambiguously in 1NF makes use of two tables: a Customer table and a Customer Telephone Number table.

CustomerCustomer ID    First Name    Surname

123    Rachel    Ingram

456    James    Wright

789    Maria    Fernandez


 

Customer Telephone NumberCustomer ID    Telephone Number

123    555-861-2025

456    555-403-1659

456    555-776-4100

789    555-808-9633


 


 

Repeating groups of telephone numbers do not occur in this design. Instead, each Customer-to-Telephone Number link appears on its own record.

Atomicity

Some definitions of 1NF, most notably that of E.F. Codd, make reference to the concept of atomicity. Codd states that the "values in the domains on which each relation is defined are required to be atomic with respect to the DBMS."[8] Codd defines an atomic value as one that "cannot be decomposed into smaller pieces by the DBMS (excluding certain special functions)."

Hugh Darwen and Chris Date have suggested that Codd's concept of an "atomic value" is ambiguous, and that this ambiguity has led to widespread confusion about how 1NF should be understood. In particular, the notion of a "value that cannot be decomposed" is problematic, as it would seem to imply that few, if any, data types are atomic:

A character string would seem not be atomic, as the RDBMS typically provides operators to decompose it into substrings.

A date would seem not to be atomic, as the RDBMS typically provides operators to decompose it into day, month, and year components.

A fixed-point number would seem not to be atomic, as the RDBMS typically provides operators to decompose it into integer and fractional components.


 

Date suggests that "the notion of atomicity has no absolute meaning": a value may be considered atomic for some purposes, but may be considered an assemblage of more basic elements for other purposes. If this position is accepted, 1NF cannot be defined with reference to atomicity. Columns of any conceivable data type (from string types and numeric types to array types and table types) are then acceptable in a 1NF table—although perhaps not always desirable. Date argues that relation-valued attributes, by means of which a field within a table can contain a table, are useful in rare cases.


 


 


 

SECOND NORMAL FORM (2NF)

Second normal form (2NF) is a normal form used in database normalization. 2NF was originally defined by E.F. Codd[1] in 1971. A table that is in first normal form (1NF) must meet additional criteria if it is to qualify for second normal form. Specifically: a 1NF table is in 2NF if and only if, given any candidate key and any attribute that is not a constituent of a candidate key, the non-key attribute depends upon the whole of the candidate key rather than just a part of it.


 

In slightly more formal terms: a 1NF table is in 2NF if and only if none of its non-prime attributes are functionally dependent on a part (proper subset) of a candidate key. (A non-prime attribute is one that does not belong to any candidate key.)


 

Note that when a 1NF table has no composite candidate keys (candidate keys consisting of more than one attribute), the table is automatically in 2NF.


 


 

Example


 

Consider a table describing employees' skills:

Employees' SkillsEmployee    Skill    Current Work Location

Jones    Typing    114 Main Street

Jones    Shorthand    114 Main Street

Jones    Whittling    114 Main Street

Roberts    Light Cleaning    73 Industrial Way

Ellis    Alchemy    73 Industrial Way

Ellis    Juggling    73 Industrial Way

Harrison    Light Cleaning    73 Industrial Way


 


 

The table's only candidate key is {Employee, Skill}.


 

The remaining attribute, Current Work Location, is dependent on only part of the candidate key, namely Employee. Therefore the table is not in 2NF. Note the redundancy in the way Current Work Locations are represented: we are told three times that Jones works at 114 Main Street, and twice that Ellis works at 73 Industrial Way. This redundancy makes the table vulnerable to update anomalies: it is, for example, possible to update Jones' work location on his "Typing" and "Shorthand" records and not update his "Whittling" record. The resulting data would imply contradictory answers to the question "What is Jones' current work location?"


 

A 2NF alternative to this design would represent the same information in two tables:

EmployeesEmployee    Current Work Location

Jones    114 Main Street

Roberts    73 Industrial Way

Ellis    73 Industrial Way

Harrison    73 Industrial Way


 

Employees' SkillsEmployee    Skill

Jones    Typing

Jones    Shorthand

Jones    Whittling

Roberts    Light Cleaning

Ellis    Alchemy

Ellis    Juggling

Harrison    Light Cleaning


 


 

Update anomalies cannot occur in these tables, which are both in 2NF.


 

Not all 2NF tables are free from update anomalies, however. An example of a 2NF table which suffers from update anomalies is:

Tournament WinnersTournament    Year    Winner    Winner Date of Birth

Des Moines Masters    1998    Chip Masterson    14 March 1977

Indiana Invitational    1998    Al Fredrickson    21 July 1975

Cleveland Open    1999    Bob Albertson    28 September 1968

Des Moines Masters    1999    Al Fredrickson    21 July 1975

Indiana Invitational    1999    Chip Masterson    14 March 1977


 


 

Even though Winner and Winner Date of Birth are determined by the whole key {Tournament, Year} and not part of it, particular Winner / Winner Date of Birth combinations are shown redundantly on multiple records. This problem is addressed by third normal form (3NF).


 

2NF and candidate keys


 

A table for which there are no partial functional dependencies on the primary key is typically, but not always, in 2NF. In addition to the primary key, the table may contain other candidate keys; it is necessary to establish that no non-prime attributes have part-key dependencies on any of these candidate keys.


 

Multiple candidate keys occur in the following table:

Electric Toothbrush ModelsManufacturer    Model    Model Full Name    Manufacturer Country

Forte    X-Prime    Forte X-Prime    Italy

Forte    Ultraclean    Forte Ultraclean    Italy

Dent-o-Fresh    EZBrush    Dent-o-Fresh EZBrush    USA

Kobayashi    ST-60    Kobayashi ST-60    Japan

Hoch    Toothmaster    Hoch Toothmaster    Germany

Hoch    Contender    Hoch Contender    Germany


 


 

Even if the designer has specified the primary key as {Model Full Name}, the table is not in 2NF. {Manufacturer, Model} is also a candidate key, and Manufacturer Country is dependent on a proper subset of it: Manufacturer.


 


 


 

THE THIRD NORMAL FORM (3NF)

The third normal form (3NF) is a normal form used in database normalization. 3NF was originally defined by E.F. Codd[1] in 1971. Codd's definition states that a table is in 3NF if and only if both of the following conditions hold:

The table is in second normal form (2NF)

No non-prime attribute of the table is transitively dependent on a candidate key


 

A non-prime attribute is an attribute that does not belong to any candidate key. A transitive dependency is a functional dependency X → Z in which Z is not immediately dependent on X, but rather on a third set of attributes Y which depends on X. That is, X → Z by virtue of X → Y and Y → Z.


 

An alternative formulation of Codd's definition, given by Carlo Zaniolo[2] in 1982, is this: a table is in 3NF if and only if, for each of its functional dependencies X → A, at least one of the following conditions holds:

X contains A, or

X is a superkey, or

A is a prime attribute (i.e., A is contained within a candidate key)


 

Zaniolo's definition has the advantage of giving a clear sense of the difference between 3NF and the more stringent Boyce-Codd normal form (BCNF). BCNF simply eliminates the third alternative ("A is a prime attribute").


 

Example


 

An example of a 2NF table that fails to meet the requirements of 3NF is:

Tournament WinnersTournament    Year    Winner    Winner Date of Birth

Indiana Invitational    1998    Al Fredrickson    21 July 1975

Cleveland Open    1999    Bob Albertson    28 September 1968

Des Moines Masters    1999    Al Fredrickson    21 July 1975

Indiana Invitational    1999    Chip Masterson    14 March 1977


 


 

The only candidate key is {Tournament, Year}.


 

The breach of 3NF occurs because the non-prime attribute Winner Date of Birth is transitively dependent on {Tournament, Year} via the non-prime attribute Winner. The fact that Winner Date of Birth is functionally dependent on Winner makes the table vulnerable to logical inconsistencies, as there is nothing to stop the same person from being shown with different dates of birth on different records.


 

In order to express the same facts without violating 3NF, it is necessary to split the table into two:

Tournament WinnersTournament    Year    Winner

Indiana Invitational    1998    Al Fredrickson

Cleveland Open    1999    Bob Albertson

Des Moines Masters    1999    Al Fredrickson

Indiana Invitational    1999    Chip Masterson


 

Player Dates of BirthPlayer    Date of Birth

Chip Masterson    14 March 1977

Al Fredrickson    21 July 1975

Bob Albertson    28 September 1968


 


 

Update anomalies cannot occur in these tables, which are both in 3NF


 


 


 

BOYCE-CODD NORMAL FORM (OR BCNF)

Boyce-Codd normal form (or BCNF) is a normal form used in database normalization. It is a slightly stronger version of the third normal form (3NF). A table is in Boyce-Codd normal form if and only if, for every one of its non-trivial functional dependencies X → Y, X is a superkey—that is, X is either a candidate key or a superset thereof.


 

BCNF was developed in 1974 by Raymond F. Boyce and Edgar F. Codd[1] to address certain types of anomaly not dealt with by 3NF as originally defined. Chris Date has pointed out that a definition of what we now know as BCNF appeared in a paper[2] by Ian Heath in 1971. "Since that definition predated Boyce and Codd's own definition by some three years," writes Date, "it seems to me that BCNF ought by rights to be called Heath normal form. But it isn't."[3]


 

Example


 

Only in rare cases does a 3NF table not meet the requirements of BCNF. An example of such a table is:

Tutor/Student Cross-ReferenceTutor ID    Tutor Soc. Security Num.    Student ID

1078    088-51-0074    31850

1078    088-51-0074    37921

1293    096-77-4146    46224

1480    072-21-2223    31850


 


 

The purpose of the table is to show which tutors are assigned to which students. The table's candidate keys are:

{Tutor ID, Student ID}

{Tutor Social Security Number, Student ID}


 

Therefore all three attributes of the table are prime attributes: that is, all three attributes belong to candidate keys.


 

Recall that 2NF prohibits partial functional dependencies of non-prime attributes on candidate keys, and that 3NF prohibits transitive functional dependencies of non-prime attributes on candidate keys. Since the table above lacks any non-prime attributes, it adheres to both 2NF and 3NF.


 

BCNF is more stringent than 3NF in that it does not permit any functional dependency in which the determining set of attributes is not a candidate key (or superset thereof). The dependency of Tutor ID on Tutor Social Security Number is such a dependency. Accordingly, the table above is not in BCNF.


 

Any table that falls short of BCNF will be vulnerable to logical inconsistencies. In the table above, an inconsistent combination of Tutor ID and Tutor Social Security Number could be represented.


 

Correcting the problem in this case would be a simple matter of using only one scheme of identifiers for Tutors: either IDs or Social Security Numbers, but not both.