a database-
a system which allows for the quick storage, retrieval, and manipulation of data
there are several different types of database systems available, but we will be concentrating on Relational Databases, using the technology MySQL
OK, what is a relational database then? How does it differ from other databases?
A relational database (RDBMS, Relational Database Management System) is a database system which uses a set of tables to hold related data.
A table is a collection of data records in which all records contain the same fields.
A field can be described as a container for a basic unit of data.
An entry in an RDBMS table which defines one set of related fields is referred to as a row.
While the RDBMS model is not hierarchical, the description from set to unit of the system is as follows:
DATABASE -> TABLE -> ROW -> COLUMN -> FIELD
The above example is somewhat misleading, as there is no distinct hierarchy between column and row.
Each table by definition is composed of rows and columns, and field values may be accessed by referring to either
a column or a row or both. Furthermore, the order or columns and rows in the definition of the database is insignificant:
data is retrieved, stored, and manipulated based on field values as particular insertion points as opposed to in any sequential order.
What is an insertion point?
An insertion point is defined as a record within a database at which a particular row/column intersection occurs,
or at which a particular field value is returned. Insertion points are used to specify where to begin read/write operations
for a particular database query, and upon which records in the database to perform the given query’s operation.
What is a query?
A query is the term used to define a command given to a database.
For our purposes, a query is a command given via Structured Query Language (SQL) syntax.
Each of the entities in the relationship described above has its own set of properties, such as name,
value, data type, data length, etc.
Here’s an example:
DATABASE = movies
TABLE = movie_filename
ROW = 3
COLUMN = filename
In the above example, suppose we are dealing with an application which is meant to store and retrieve video files for playback in an online system.
The database storing the information which describes the video files used in the application is called “movies”.
Within the movie database are various tables which describe each aspect of the video files.
One such table is reserved for the filenames of the video files to be accessed.
Given the conditions above, which calls the “filename” field of the third entry in the table “movie_filename”, the following information might be returned:
- Field Name = “filename”
- Field Value = “dinosaurs.mov”
- Field Type = VARCHAR
- Field Length = 255
- NULL = NOT NULL
- Field Index = FALSE
What does the above tell us?
- The field we have asked for belongs to the column described as “filename”.
The designer of the database is letting us know that data appropriate to the description and function of “filename” should be placed in this field.
- The value of the field “filename” in the third row of the table “movie_filename” is “dinosaurs.mov”.
- The type of data expected by the particular column to which this field belongs. MySQL data is strongly typed- that is, each column is defined as capable of storing only one kind of data. Attempts to store data that do not conform to the columns defined data type will fail.
In the above example the data type is VARCHAR, or variable character, indicating that a string composed of any character set is permissible.
- The maximum length of data to be stored by a given column.
Each field entry for this column must be no greater than this length; otherwise the entry will be truncated to the maximum column length or fail.
- Flags whether NULL values are allowed in the given field’s column.
In this case the column is defined to not accept NULL values. NULL indicates the absence of data and type. It is recommended to always use real data types as opposed to NULL values to conform to our strongly typed data model. This allows us to test for data type when retrieving information from the database in order to assure the accuracy of our data.
- Indicates that the column in question is not an index; i.e. this column is not used to keep track of the order and number of occurrences of all entries within this particular table.
An index is a close analog to what we might imagine as the index of a cookbook: it is used to look up related categories of data on a granular level. So while a higher level table of contents might give us an arbitrary category for main dishes, an index will give as a granular listing of all recipes which include garbanzo beans.
An index is an organizational tool for atomic values.
What do you mean by atomic values?
“Atomicity” is an arbitrary term regarding the composition of a particular data field. An atomic value is defined as one that “cannot be decomposed into smaller pieces by the DBMS (excluding certain special functions).” (Codd, E.F. The Relational Model for Database Management Version 2 (Addison-Wesley, 1990), p. 6.)
In practice there is no such thing as an atomic value, as almost any entity can be broken down into component entities (a sentence into words, a word into letters, a letter into a byte-code sequence, etc). The theory of atomic values is best applied by limiting each field entry to only one value of the data type and semantic type expected by the column. For example, in a database for an address book application, a table could be set up to hold information as such:
TABLE contacts
Column: contact_name (VARCHAR(255), NOT NULL)
Column: phone_number (TEXT, NOT NULL)
Column: email (VARCHAR(55), NOT NULL)
The above table has one column used as a primary index for quickly searching and organizing the data in the table, phone_numberID.
The second column is used to store phone numbers.
It has a data type of text, which will allow for the storage of an arbitrarily long entry of mixed characters.
Suppose one of the entries in the address book contained a home number, a work number, and cell number.
A storage method ignoring the concept of atomicity using the above table would place all three numbers into one entry as such:
contact_name: “john smith”
phone_number: “773-555-1212, 773-555-1234, 773-555-5678″
email: jsmith@null.net
The above fails the atomicity test as three distinct values for phone number are stored in the phone_number field.
This example also violates another principle of database design, Normalization.
Normalization is a database design technique that emphasizes the use of multiple tables to reduce the amount of duplicate information in a database.
The above example disregards the concept of normalization by storing the contact’s name in the same table as the phone number and email, and by storing more than one phone number in the phone_number field. This makes retrieving and parsing the data difficult, as one is more likely to retrieve more information than required to perform a specific task. Failure to adhere to this practice also makes updating the database more difficult, as updating a contact’s email address in the above requires modifications to the row containing all of the user’s information.
A normalized version of the above address book database would look something like this:
TABLE contacts
Column: contact_ID (INT(11), NOT NULL, PRIMARY KEY, AUTO INCREMENT)
Column: contact_fname (VARCHAR(255), NOT NULL)
Column: contact_lname (VARCHAR(255), NOT NULL)
TABLE phone_number
Column: phone_number_ID (INT(11), NOT NULL, PRIMARY KEY, AUTO INCREMENT)
Column: contact_ID (INT(11), NOT NULL, FOREIGN KEY)
Column: phone_number (VARCHAR(12), NOT NULL)
TABLE email
Column: email_ID (INT(11), NOT NULL, PRIMARY KEY, AUTO INCREMENT)
Column: contact_ID (INT(11), NOT NULL, FOREIGN KEY)
Column: email (VARCHAR(55), NOT NULL)
And the above entry would exist in the normalized database as follows:
TABLE contacts
Column: contact_ID: 1
Column: contact_fname: john
Column: contact_lname: smith
TABLE phone_number
Column: phone_number_ID: 1
Column: contact_ID: 1
Column: phone_number: “773-555-1212″
Column: phone_number_ID: 2
Column: contact_ID: 1
Column: phone_number: “773-555-5678″
Column: phone_number_ID: 3
Column: contact_ID: 1
Column: phone_number: “773-555-1234″
TABLE email
Column: email_ID: 1
Column: contact_ID: 1
Column: email: jsmith@null.net
The above design simplifies the data access methods necessary to retrieve the contacts information.
If we want to grab only the users first name and the second instance of their phone number, we can do so by simply
asking for those two fields as opposed to asking for an entire row from our table and then parsing it.
We can also more easily update and refine the information in our database.
If our contact gets a second email address, it can be easily inserted into the email table.
If the contact shuts of their LAN line we can simply remove that entry from the phone numbers table. All without ever touching our main entry for the contact.
If we want to know whether a phone number is a home or cell or business number, we can add the following table, with the following entries:
phone_number_type
column: phone_number_type_ID (INT(11), NOT NULL, PRIMARY KEY, AUTO INCREMENT)
column: phone_number_type (VARCHAR(4),NOT NULL)
phone_number_type_ID: 1
phone_number_type: home
phone_number_type_ID: 2
phone_number_type: cell
phone_number_type_ID: 3
phone_number_type: office
We can then update our phone_number table as follows:
1. add a new field, phone_number_type_ID
2. update our entries to include values for phone_number_type_ID
Again, all of this can be done without modifying any other tables or entries in the database. A similar procedure could be followed for email.
Other tables could be added for addresses, IMs, titles, associations, etc, again never modifying the structure of any of the existing tables and never compromising the integrity of our data.
The ability to relate and process information across multiple tables as in the above example is at the heart of what make relational databases tick.
Thoughtful, pragmatic database design begins with the concepts of atomicity and normalization.
Following these methods from the beginning of your project will save you more than a few grey hairs when you are in the midst of building your application.
links to online projects
Heat and the Heartbeat of the City, Andrea Polli
http://turbulence.org/Works/heat/index2.html
Earthquake Simulations:
http://memento.ieor.berkeley.edu/seismo.html?
http://www.ncedc.org/ftp/outgoing/userdata/quicklook/BKS.BK.LHZ.current.gif
http://memento.ieor.berkeley.edu/earthquake3.html
http://earthquake.usgs.gov/regional/nca/1906/18april/got_seismogram.php
http://www.abag.ca.gov/bayarea/eqmaps/eqtrans/result.html
http://gis.abag.ca.gov/?go=liq
http://americahurrah.com/images/EQmap.jpg
http://www.lib.berkeley.edu/EART/1906/g4364_s5_1906_w3_1.jpg
http://earthquake.usgs.gov/regional/nca/virtualtour/earthquake.php
http://www.ieor.berkeley.edu/~goldberg/art/Ballet-Mori/videos/Ballet-Mori-AVI.avi
http://allshookup.org/mpgvideo/mpgvideo.htm