| 
          
          
         NormalizationWhen users ask for advice about  their database applications, one of the first things I try to help them with is  the normalization of their table structure. Normalization is the process of  removing redundant data from your tables in order to improve storage efficiency,  data integrity and scalability. This improvement is balanced against an  increase in complexity and potential performance losses from the joining of the  normalized tables at query-time. Let's say you were looking to  start an online bookstore. You would need to track certain information about  the books available to your site viewers, such as: 
      
        | 
          
            Title Author Author Biography ISBN Price Subject  | 
          
            Number of Pages Publisher Description Review Reviewer Name  |  Let's start by adding a couple of  books written by Luke Welling and Laura Thomson. Because this book has two  authors, we are going to need to accommodate both in our table. Lets take a  look at a typical approach I encounter: 
      
        | Title | Author1 | Author2 | ISBN | Subject | Pages | Publisher |  
        | PHP and MySQL Web Development | Luke Welling | Laura Thomson | 0672317842 | PHP, MySQL | 867 | Sams |  
        | MySQL Tutorial | Luke Welling | Laura Thomson | 0672325845 | MySQL | 300 | Sams |  Let's take a look at some issues  involved in this table design:  First, this table is not very  efficient with storage. Lets imagine for a second that Luke and Laura were  extremely busy writers and managed to produce 500 books for our database. The  combination of their two names is 25 characters long, and since we will repeat  their two names in 500 rows we are wasting 25 × 500 = 12,500 bytes of storage  space unnecessarily.       Second, this design does not  protect data integrity. Lets once again imagine that Luke and Laura have  written 500 books. Someone has had to type their names into the database 500  times, and it is very likely that one of their names will be misspelled at  least once (i.e.. Thompson instead of Thomson). Our data is now corrupt, and  anyone searching for book by author name will find some of the results missing.  The same thing could happen with publisher name. Sams publishes hundreds of  titles and if the publisher's name were misspelled even once the list of books  by publisher would be missing titles. Third, this table does not scale well. First of all, we have limited  ourselves to only two authors, yet some books are written by over a dozen  people. This kind of limitation is often exhibited in personal info tables  where the typical design includes Phone1, Phone2, and Phone3 columns. In both  cases we are limiting future growth of our data. Another limitation to  scalability is that fact that with all our data in one table our one table file  will grow faster and we will have more trouble with file size limitations of  our underlying operating system. First Normal FormThe normalization process involves  getting our data to conform to three progressive normal forms, and a higher  level of normalization cannot be achieved until the previous levels have been  achieved (there are actually five normal forms, but the last two are mainly  academic and will not be discussed). The First Normal Form (or 1NF) involves  removal of redundant data from horizontal rows. We want to ensure that there is  no duplication of data in a given row, and that every column stores the least  amount of information possible (making the field atomic).In our table above we have two  violations of First Normal Form:. First, we have more than one author field,  and our subject field contains more than one piece of information. With more  than one value in a single field, it would be very difficult to search for all  books on a given subject. In addition, with two author fields we have two  fields to search in order to look for a book by a specific author
 We could get around these  problems by modifying our table to have only one author field, with two entries  for a book with two authors, as in the following example:
 
      
        | Title
 | Author | ISBN | Subject | Pages | Publisher |  
        | PHP and MySQL Web Development | Luke Welling | 0672317842 | MySQL | 867 | Sams |  
        | PHP and MySQL Web Development | Laura Thomson | 0672317842 | PHP | 867 | Sams |  
        | MySQL Tutorial | Laura Thomson | 0672325845 | MySQL | 300 | Sams |  
        | MySQL Tutorial | Luke Welling | 0672325845 | MySQL | 300 | Sams |  While this approach has no  redundant columns and the subject column has only one piece of information, we  do have a problem that we now have two rows for a single book. Also, to ensure  that we can do a search of author and subject (i.e. books on PHP by Luke  Welling), we would need four rows to ensure that we had each combination of  author and subject. Additionally, we would be violating the Second Normal Form,  which will be described below.Author Table:A better solution to our problem  would be to separate the redundant data into separate tables, with the tables  related to each other. In this case we will create an Author table and a  Subject table to store our information, removing that information from the Book  table:
 
      
        | Author_ID | Last Name | First Name | Bio |  
        | 1 | Welling | Luke | Author of books on PHP and MySQL |  
        | 2 | Thomson | Laura | Another author of books on PHP and MySQL |  Subject Table:
 
      
        | Subject_ID | Subject |  
        | 1 | PHP |  
        | 2 | MySQL |  Book Table:
 
      
        | ISBN | Title | Pages | Publisher |  
        | 0672317842 | PHP and MySQL Web Development | 867 | Sams |  
        | 0672325845 | MySQL Tutorial | 300 | Sams |  While creating the Author table,  I also split author name down into separate first name and last name columns,  in order to follow the requirement for storing as little information as  possible in a given column. Each table has a primary key value which can be  used for joining tables together when querying the data. A primary key value  must be unique with in the table (no two books can have the same ISBN number),  and a primary key is also an INDEX, which speeds up data retrieval based on the  primary key.  Defining RelationshipsAs you can see, while our data is  now split up, relationships between the table have not been defined. There are  various types of relationships that can exist between two tables: 
      
        One to (Zero or) One 
        One to (Zero or) Many 
        Many to Many  The relationship between the Book  table and the Author table is a many-to-many relationship: A book can have more  than one author, and an author can write more than one book. To represent a  many-to-many relationship in a relational database we need a third table to  serve as a link between the two:Book_Author Table: 
      
        | ISBN | Author_ID |  
        | 0672317842 | 1 |  
        | 0672317842 | 2 |  
        | 0672325845 | 1 |  
        | 0672325845 | 2 |  Similarly, the Subject table also  has a many-to-many relationship with the Book table, as a book can cover  multiple subjects, and a subject can be explained by multiple books:Book_Subject Table: 
      
        | ISBN | Subject_ID |  
        | 0672317842 | 1 |  
        | 0672317842 | 2 |  
        | 0672325845 | 2 |  The one-to-many relationship will  be covered in the next section. As you can see, we now have established the  relationships between the Book, Author, and Subject tables. A book can have an  unlimited number of authors, and can refer to an unlimited number of subjects.  We can also easily search for books by a given author or referring to a given  subject.       In the linking tables above the  values stored refer to primary key values from the Book, Author, and Subject  tables. Columns in a table that refer to primary keys from another table are  known as foreign keys, and serve the purpose of defining data relationships. In  database systems which support referential integrity, such as the InnoDB  storage engine for MySQL, defining a column as a foreign key will also allow to  database software to enforce the relationships you define. For example, with  foreign keys defined, the InnoDB storage engine will not allow you to insert a  row into the Book_Subject table unless the book and subject in question already  exist in the book and subject tables. Such systems will also prevent the  deletion of books from the book table that have 'child' entries in the  book_subject or book_author tables. Second Normal FormWhere the First Normal Form deals  with redundancy of data across a horizontal row, Second Normal Form (or 2NF)  deals with redundancy of data in vertical columns. As stated earlier, the  normal forms are progressive, so to achieve Second Normal Form, your tables  must already be in First Normal Form. Lets take another look at our new Book  table:Book Table: 
      
        | ISBN | Title | Pages | Publisher |  
        | 0672317842 | PHP and MySQL Web Development | 867 | Sams |  
        | 0672325845 | MySQL Tutorial | 300 | Sams |  This table is in First Normal  Form; we do not repeat the same data in any two columns, and each column holds  only the minimum amount of data. However, in the Publisher column we have the  same publisher repeating in each row. This is a violation of Second Normal  Form. As I stated above, this leaves the chance for spelling errors to occur.  The user needs to type in the publisher name each time and such an approach is  also inefficient in regards to storage usage, and the more data a table has the  more IO requests are needed to scan the table, resulting in slower queries. Publisher Table:To normalize this table to Second  Normal Form, we will once again break data off to a separate table. In this  case we will move publisher information to a separate table as follows:
 
      
        | Publisher_ID | Publisher Name |  
        | 1 | Sams |  In this case we have a  one-to-many relationship between the book table and the publisher. A given book  has only one publisher (for our purposes), and a publisher will publish many  books. When we have a one-to-many relationship, we place a foreign key in the  many, pointing to the primary key of the one. Here is the new Book table:Book Table: 
      
        | ISBN | Title | Pages | Publisher_ID |  
        | 0672317842 | PHP and MySQL Web Development | 867 | 1 |  
        | 0672325845 | MySQL Tutorial | 300 | 1 |  Since the Book table is the  'many' portion of our one-to-many relationship, we have placed the primary key  value of the 'one' portion in the Publisher_ID column as a foreign key.Review Table:The other requirement for Second  Normal Form is that you cannot have any data in a table with a composite key  that does not relate to all portions of the composite key. A composite key is a  primary key that incorporates more than one column, as with our Book_Author  table:
 Let's say we wanted to add in  tracking of reviews and started with the following table:
 
      
        | ISBN | Reviewer_ID | Review_Text | Reviewer_Name |  
        | 0672325845 | 1 | What a great book! I learned a lot! | Mike Hillyer |  In this table the combination of  Reviewer_ID and ISBN form the primary key (a composite key), ensuring that no  reviewer writes more than one review for the same book. In this case the  reviewer name is only related to the Reviewer_ID, and not the ISBN, and is  therefore in violation of Second Normal Form.Reviewer Table:In this case we solve the problem  by having a separate table for the data that relates to only one part of the  key. If the portion of the primary key that the field relates to is a foreign  key to another table, move the data there (as in this situation where the  Reviewer_ID will be a separate table):
 
      
        | Reviewer_ID | Reviewer_Name | Reviewer_House | Reviewer_Street | Reviewer_City | Reviewer_Province | Reviewer_PostalCode |  
        | 1 | Mike Hillyer | 123 | Main Street | Calgary | Alberta | T1S-2N2 |  Review Table:
 
      
        | ISBN | Reviewer_ID | Review_Text |  
        | 0672325845 | 1 | What a great book! I learned a lot! |  Third Normal FormI have a confession to make; I do  not often use Third Normal Form. In Third Normal Form we are looking for data  in our tables that is not fully dependant on the primary key, but dependant on  another value in the table. In the reviewer table above the Reviewer_Street and  Reviewer_City fields are really dependant on the Reviewer_PostalCode and not  the Reviewer_ID. To bring this table into compliance with Third Normal Form, we  would need a table based on Postal Code:PostalCode Table: 
      
        | Postal_Code | Street | City |  
        | T1S-2N2 | Main Street | Calgary |  Now of course this new table  violates Second Normal Form as the Street and City will be vertically  redundant, and must be broken off to separate Street and City tables, and  Province will need to be in it's own table which the City table will refer to  as a foreign key.The point of the last example is  that Normalization is a tradeoff. In fact, there are two additional normal  forms that are generally recognized, but are mainly academic. Additional  normalization results in more complex JOIN queries, and this can cause a  decrease in performance. In some cases performance can even be increased by  de-normalizing table data, but de-normalization is beyond the scope of this  article.
 ConclusionIt is generally a good idea to  make sure that your data at least conforms to Second Normal Form. Our goal is  to avoid data redundancy to prevent corruption and make the best possible use  of storage. Make sure that the same value is not stored in more than one place.  With data in multiple locations we have to perform multiple updates when data  needs to be changed, and corruption can begin to creep into the database. Third Normal Form removed even more data redundancy, but at the cost of  simplicity and performance. In our example above, do we really expect the City  and Street names to change very regularly? In this situation Third Normal Form  still prevents misspelling of City and Street names, but it is up to you to  find a balance between Normalization and Speed/Simplicity. In an upcoming  article we will look at how to form queries that join our normalized data  together.     |