Three-Day RDBMS Day 1, Session 1

part of http://philip.greenspun.com/teaching/three-day-rdbms/

Updated: January 24, 2015

Why can't we use the file system and/or RAM with a battery? [conversation with MIT CS prof regarding SAP/Oracle]
What is a database?
A database management system? (ACID, typically a declarative language to query (SQL))
Bureaucracy: Work approximately 10-12:30, lunch, then 1:15-4; as we are training to proficiency, people can stay late until they get it
More background: http://philip.greenspun.com/sql/introduction

Setting up Virtual Machine and Image

See the Virtual Machine Instructions document for information on downloading and setting up the course environment.

Connecting to MySQL

Easy: Click on the "MySQL" icon on the left navbar.

Harder:

Open a terminal (Select Terminal from the left navbar)

To launch the mysql client and connect to the course database:

$ mysql threedayrdbms -p

When prompted for the password type:

> dev

You now have a prompt from which you can begin entering SQL statements. Let's start with a simple one:

> select 2+2;

In MySQL and some other ANSI SQL databases, you can use the SELECT statement simply to run a function, such as this arithmetic example. In Oracle you can only use SELECT to pull data from a table, so they supply you with a one-row system table called "dual" (the glorious history and an explanation of the name may be found in http://en.wikipedia.org/wiki/DUAL_table ) and the corresponding command is

> select 2+2 from dual;

Note that the semicolon is not part of the SQL statement and need not be used when sending SQL commands from a C or Java program, for example. When typing at a shell client, the semicolon instructs the client to send the SQL to the database.

In the virtual machine, we already created a database for you using the following commands:

create database threedayrdbms;

use threedayrdbms;

-- since we already created the database and specified it

-- when starting the mysql client we can skip the explicit

-- selection of a database with the use command

We also created a table using the following command:

create table dog_breeds (

breed_name varchar(50),

characteristics varchar(100)

);

Note that we configured the MySQL installation to use the InnoDB storage engine by default, so you don't need to specify the InnoDB storage engine at table creation time in order to get some rudimentary support for transactions. After we created the table, we inserted two rows:

insert into dog_breeds (breed_name, characteristics)

values ('border collie', 'ankle-biter'),

('golden retriever', 'vicious killer');

-- note that the MySQL shell client runs, by default in autocommit mode

-- and therefore these rows are instantly available for all sessions

-- to see

You can experiment with the table and the shell client using the following commands:

select * from dog_breeds;

delete from dog_breeds where breed_name like 'b%';

select * from dog_breeds;

Open a Web browser inside your virtual machine by clicking on the Mozilla icon at left. Then see what you can do using standard database admin forms from a Web server:

visit http://localhost/phpmyadmin/
log in as "dev", "dev"
click on "Databases"
click on "threedayrdbms"
click on "dog_breeds"
click on "insert" (above)
fill in the form with "toy poodle" and "man's dog" or similar and click "Go"

What if you had a spreadsheet of data on all dog breeds? You could use "LOAD DATA INFILE" to bring it into a MySQL table. See

http://dev.mysql.com/doc/refman/5.5/en/load-data.html
http://dev.mysql.com/doc/refman/5.5/en/load-xml.html
SQL*Loader database utility for Oracle: http://docs.oracle.com/database/121/SUTIL/part_ldr.htm#i436326

Hooking it up to the web server

The virtual machine image you've been provided with comes with Linux, Apache, MySQL, PHP, and phpMyAdmin pre-installed. If you wanted to recreate this environment from a basic Linux image you'd need to install the packages for each of these application along with any applicable connectors or libraries. Many guides have been written this subject which can be readily found by searching the Internet for the name of your distribution and the keywords LAMP install. For now we'll assume you're working in the VM or have a similar configuration already installed.

Open your text editor of choice (e.g. start another terminal shell and run "gedit" or click the emacs icon on the menu bar [bottom])

Open the /home/dev/public_html/php/rdbmsapp/index.php file.

By default the file contains code to select the entire dog_breeds table and display it. The mysql_query line can be edited to try out different queries and the loop edited to change how the output is displayed.

Save the file and view the results by opening a web browser (click the FIrefox icon in the menu bar) and pointing it to http://localhost/~dev/php/rdbmsapp.

When you are done you can remove the dog_breeds table:

drop table dog_breeds;

ACID: Bank Data Model

We've preloaded the database with the following tables:

create table bank_customers (

customer_id int not null primary key auto_increment,

customer_name varchar(100) not null,

-- defaults in MySQL must be constant, except for timestamp

date_joined timestamp not null default current_timestamp

);

create table bank_accounts (

account_id int not null auto_increment,

-- normally we put a REFERENCES constraint with the column def

-- but MySQL ignores those, so it is at the end of the statement

customer_id int not null,

account_type enum('checking','savings') not null,

balance decimal(20,2) not null default 0,

primary key (account_id),

foreign key (customer_id) references bank_customers(customer_id)

);

-- note that we need to say how much precision we want, unlike in

-- Oracle. A simple "numeric" will result in 3.75 being rounded up

-- to 4 upon insertion.

We've preloaded the tables with the following customers and accounts:

insert into bank_customers (customer_id, customer_name, date_joined)

VALUES

(1, "Joe Hedge Fund", "2006-12-24"),

(2, "Susie Average", "1997-03-14");

-- note standard YYYY-MM-DD format above

insert into bank_accounts (customer_id, account_type, balance)

VALUES

(1, 'checking', 200000000),

(1, 'savings', 175000000),

(2, 'checking', 4000),

(2, 'savings', 30000);

Extra Credit: See Thomas Piketty's book Capital for why these two customers have such different amounts on deposit. Come up with a way to tax Customer #1 before he flees to Singapore.

Task: Use the show command to display a listing of tables currently in the database:

show tables;

Note that this and DESCRIBE, used below, are special commands of the MySQL DBMS. They are not SQL language statements. You can achieve similar results to the show command by manually selecting the information from the MySQL server's information_schema view:

select table_name from information_schema.tables

where table_schema='threedayrdbms';

Task: For more information on the structure of each table use the describe command:

describe bank_customers;

describe bank_accounts;

Reference:

http://dev.mysql.com/doc/refman/5.5/en/information-schema.html
in Oracle, this is called the "data dictionary" and is documented in the Concepts and Database Reference manuals

ACID: Atomicity

Let's look at a client program's attempt to transfer $1000 from a customer's savings account to checking. The transfer will be done in two steps: (1) add $1000 to checking, (2) subtract $1000 from savings. A software or power failure occurs between steps 1 and 2. Let's see if the RDBMS can protect us.

-- get the pre-transfer sum of Joe's checking and savings balances

select sum(balance) from bank_accounts where customer_id=1;

-- start transfer by adding the amount to checking

update bank_accounts

set balance=balance+1000

where customer_id=1

and account_type='checking';

-- and next we'll debit the savings account...

Type "exit;" or Control+C to exit the client (simulating the crash of the client program or someone unplugging the RDBMS server). At this point Joe's checking account has been credited but his savings account has not been debited. Let's reconnect to see what state the database has been left in:

mysql threedayrdbms -p

When prompted for the password type: dev

Now back at the mysql console, query for Joe's and the bank's total deposits:

select sum(balance) from bank_accounts where customer_id=1;

select sum(balance) from bank_accounts;

In attempting the transfer, our bank just gave away $1,000 to Joe, certainly not a sustainable way to do business. Let's try this again but wrapping the credit and debit together into a single atomic transaction:

start transaction;

update bank_accounts

set balance=balance+1000

where customer_id=1

and account_type='checking';

Type Control+C to exit the client (once again, simulating a failure) before we can debit the savings account. Let's reconnect to see what state the database has been left in:

mysql threedayrdbms -p

When prompted for the password type: dev

Now back at the mysql console, query for total deposits:

select sum(balance) from bank_accounts where customer_id=1;

select sum(balance) from bank_accounts;

This time the transaction could not be completed so it was rolled back and our bank didn't give away any more free money.

Discussion: How do we build this? (Original MySQL answer: Don't bother. See http://dev.mysql.com/doc/refman/5.5/en/storage-engines.html)

ACID: Isolation

Open a second terminal (shift-click the Terminal icon in the menu bar). The next steps are presented in a table to make it clear where each command should be run. The steps should be executed left-to-right and then top-to-bottom.

Terminal 1

Terminal 2

Launch the mysql client and connect to the course database:

mysql threedayrdbms -p

When prompted for the password type: dev

Back in the first terminal start another balance transfer transaction:

start transaction;

update bank_accounts

set balance=balance+1000

where customer_id=1 and

account_type='checking';

Now, before debiting the savings account, check Joe's total balance:

select balance from bank_accounts

where customer_id=1;

You will find that his total balance has increased by $1,000.

Now switch over to your other terminal and execute that same select:

select balance from bank_accounts

where customer_id=1;

This session was isolated from the changes made by an uncommitted transaction start by another session. Until a successful commit, the extra $1,000 is confined to the session attempting the updates.

Switch back to the first terminal and finish the transfer:

update bank_accounts set

balance=balance-1000

where customer_id=1 and

account_type='savings';

commit;

Now that the transaction has been committed the balance returned by our select statement should be the same in each terminal:

select balance from bank_accounts

where customer_id=1;

Go ahead and verify this by running the same select command in the second terminal window:

select balance from bank_accounts

where customer_id=1;

Discussion: How do we build this? (Two answers: multi-versioning or locks.)

Reference:

http://dev.mysql.com/doc/refman/5.5/en/innodb-multi-versioning.html
Oracle Concepts manual from http://www.oracle.com/pls/db111/homepage (figure 13-1 below)

Oracle version 12 calls this the "Undo Segment". From the Concepts manual at https://docs.oracle.com/database/121/CNCPT/consist.htm#CNCPT221, Figure 9-1:

ACID: Consistency

Now lets try inserting some data that violates one of the constraints we defined.

insert into bank_customers (customer_id, customer_name, date_joined)

VALUES (1, "New Guy", "2011-01-08");

This insert will fail because we've attempted to give New Guy and Joe Hedge Fund the same customer_id (which we defined as a primary key).

Similarly, we can't create a bank account that doesn't belong to anyone:

insert into bank_accounts (customer_id, account_type, balance)

VALUES (37, 'checking', 2000);

We don't have a bank customer with a customer_id of 37 so this insert will fail (we defined customer_id as a foreign key constraint).

Discussion: How do we build this? (Let's talk about indexing, performance, and the original MySQL answer: Don't bother.)

The main enemy of software reliability: programmers. The last line of defense: integrity constraints in the RDBMS.

Sadly, MySQL fails to deliver much of the value that companies get when they buy Oracle or similar systems because MySQL does not enforce the CHECK constraint. Here's an example:

-- no need to cut and paste this; just for discussion

-- (it won't work properly without a bank_vice_presidents table)

create table bank_wire_transfers (

-- in real life we'd have a wire_transfer_id column so we could -- refer to one row easily

account_id int not null

references bank_accounts(account_id),

-- all of the routing info

destination varchar(100),

dollar_amount decimal(20,2),

wire_date datetime,

-- may be NULL for small wires

approved_by int,

foreign key (account_id) references bank_accounts(account_id),

foreign key (approved_by)

references bank_vice_presidents(employee_id),

CHECK (dollar_amount < 10000 OR approved_by is not null)

);

insert into bank_wire_transfers

(account_id, destination, dollar_amount)

values

(1, "Embezzler Bank of Brazil", 10000000)

MySQL is perfectly happy to accept a $10 million wire transfer with no approval. Any of the commercial RDBMSes would generate an error if a transaction included the INSERT statement above and leave the database unmodified. The folks at MySQL don't seem to be concerned about this deficiency, having taken the trouble to document it: http://dev.mysql.com/doc/refman/5.5/en/create-table.html says "The CHECK clause is parsed but ignored by all storage engines".

ACID: Durability

The transactions above, once committed, should survive a power failure as they've been written to a transaction log on disk. Rip the hard drive or SSD out of your laptop and see if you can recover the transactions, though! If not, your installation falls short in the D of ACID.

Discussion: How do we build a system that delivers "durability"?

Start work on Facebooklet

Our application lets a user register, create a profile with “likes”, join one or more groups, and designate one or more other users as friends.

Sample data:

name (1), group (1 for now; more later), likes (0 or more), friends (0 or more)

George W. Bush|Yale|starting wars, Gitmo|Pamela Anderson, Borat

Barack Hussein Obama II|Harvard|health insurance, unemployment|Bruno, Paula Abdul

Pamela Anderson|Hollywood|plastic surgery, home video|George W. Bush

Borat|Kazakstan|farming, travel, singing|Pamela Anderson, Obama

Exercise: 1-table data model

Create a single table, named "facebook_simple" that can store the above data (suggest four columns). Load in the first row (George W. Bush), either using http://localhost/phpmyadmin/ or an INSERT statement.

Don't worry about keys, uniqueness, integrity constraints, or taste. The purpose of this exercise is going to be showing you how painful it is to represent everything in one table with multiple values in a column.

Possibly helpful: http://dev.mysql.com/doc/refman/5.5/en/data-types.html

Exercise: King Bush II's friends

Write a query to determine if "Bruno" is among George W. Bush's friends (if so, should return one row with "1" or "t")

Write a query to determine if "Borat" is among George W. Bush's friends

Write a query to print out a list, one row per friend, of George W. Bush's friends (remember that we suggested loading only the first row, so you're not worried about friend symmetry or using information from more than one row of the table)

Write a query to retrieve the table, sorted by last name.

Hints:

Each query will almost surely include something like WHERE NAME = 'George W. Bush'
The only SQLish way to turn a 1-row set into a multi-row set is by JOIN; you might need a helper table stuffed with integers
Documentation on available MySQL string functions is available at: http://dev.mysql.com/doc/refman/5.5/en/string-functions.html

Questions:

What was the limit on how many friends a user could have? (hint: http://dev.mysql.com/doc/refman/5.5/en/innodb-restrictions.html)
Why was this so hard?

Let's talk about First Normal Form.

Discussions: Anomalies with Unnormalized relations

create table fbt_all_in_one (

name varchar(100),

group_name varchar(100),

group_is_university_p boolean,

group_main_city varchar(100),

likes text,

friends text

);

Can we insert a new group, e.g., Olin College, without inserting a person at the same time? If not, that's an insert anomaly, according to E.F. Codd (dead).

What if we delete George W. Bush? Suppose that he was the only Yalie. Have we also deleted the facts that Yale is a university and that Yale is in New Haven, Connecticut? If so, that's a delete anomaly, according to Codd.

What if Yale decided to spin off its university and concentrate on its core business of investment management. We issue the following database command

update fbt_all_in_one

set group_is_university_p = false

where group_name = 'Yale';

Assuming both John Kerry and George W. Bush had both been users of the service, would more than one row have been updated? If so, that's the update anomaly.

Extra credit: Compare your grades so far to those of the current Secretary of State and former president by visiting http://www.cbsnews.com/stories/2005/06/07/politics/main700170.shtml

Exercise: multi-table data model

Create a multi-table data model to hold the Facebooklet data. If it doesn't suffer from any of the anomalies listed above, it is probably correct.

Before we start, let's look at and discuss a real-world data model: http://philip.greenspun.com/teaching/three-day-rdbms/education.sql.txt

Representing friend requests and friendship: Like Facebook, we need to model that User A requested to become friends with User B, but User B has not yet accepted. Once User B accepts, the friendship is symmetric and everything should work just as if User B had initiated the friendship. Also think about a way to represented defriended relationships. If you need to make a design choice, push for cleanliness of representation and don't worry if the queries might get a little ugly or complex. It is always possible to create a view to hide the complexity of a query.

Hint: Add an extra mapping table to represent a many-to-many relation, e.g., that a user could belong to more than one group and a group can contain more than one user.

Our solution is posted at http://philip.greenspun.com/teaching/three-day-rdbms/facebooklet and also in the VM at ~dev/sql/facebooklet. It can be loaded using the directions in facebooklet-loader.sql if you need it in order to complete the exercises below.

Exercise: JOINs

Let's have queries for the following reports:

the names of the users next to the names of the groups to which they belong

consider ORDER BY upper(last_name), upper(first_names)
concat(first_names, ' ', last_name) as complete_name may give you a cleaner report

the names of users paired up with the names of their friends

hint1: a table may be joined to itself and aliased, e.g., from users u1, users u2
hint2: you can do two SELECT statements and use a UNION to combine the results, with duplicates eliminated
hint3: you can JOIN a table with a more flexible condition than simply t1.x = t2.x

the names of users who all like the same thing, e.g., "starting wars" (pick just one thing)

This would be a good time to read http://philip.greenspun.com/sql/style so that you get into the habit of indenting SQL readably. Remember that adding a "LIMIT 20" at the end of a query will keep the report length manageable.

Reference:

http://www.w3schools.com/sql/sql_quickref.asp (SQL Quick Reference)
http://dev.mysql.com/doc/refman/5.5/en/ (MySQL documentation)
http://philip.greenspun.com/sql/ (SQL for Web nerds)

Exercise: OUTER JOINs

How about a report showing all of the groups in the system and their associated members? It seems simple, but what about groups that don't have any members yet? We want them in our report.

If all of the groups in your table have members, consider that Harvard has received a $5 billion donation that came with the condition for the school to change its name to "Hogwarts" and for Drew Gilpin Faust to change her name to "Professor Dumbledore". Add a new group with no members:

INSERT INTO groups (title) VALUES ('Hogwarts');

After you've added this row, try to create the report showing the complete membership list for all groups, including one row with some NULLs for groups that don't have any members.

Hint: something like

select *

from groups g left outer join group_memberships gm on (g.group_id = gm.group_id);

is a good starting point. add a WHERE clause to keep the results to something manageable, e.g.,

select *

from groups g left outer join group_memberships gm on (g.group_id = gm.group_id)

where title like 'H%';

If you want to print user names rather than user IDs, you'll need to bring in the USERS table. Something like

SELECT stuff_you_want_in_your_report

FROM table_you_want_every_row_from

LEFT OUTER JOIN table_with_mapping_relation ON (join_column_constraint)

LEFT OUTER JOIN table_with_extra_info ON (join constraint)

LIMIT 20;

Exercise: GROUP BY

Please write a query giving a report of all of the names of the groups in the system that have at least one member next to the number of members in that group. You'll be using GROUP BY and probably count(*).

Write a second query that also includes the names of groups with no members (and a "0" for number of members in the second column of the report).

Hint: COUNT(column_name) will return 0 if all of the column values are NULL.

References: http://philip.greenspun.com/sql/complex-queries

Exercise: find the mode

Use your new GROUP BY skills to write a query that will return a single row containing the most-liked "like" (human-readable topic, not the like_id) and the number of users who liked it. Assume that there is one clear winner among the "likes".

Hint: GROUP BY, ORDER BY, and the MySQL extension "limit" may be useful. Remember that ORDER BY can do ascending (ASC) or descending (DESC).

Extra credit: What if there are two or three "likes", each with the same number of users mapped to it? Can you generate a report showing all of the likes whose popularity is at the mode? You might need a subquery and/or a view.

Exercise: HAVING

Please write a query giving a report of all of the names of the groups in the system that have at least 30 members.

Discussion: Query Optimization

Take some of the queries above and feed them to MySQL once again but prefixed with "EXPLAIN" or "EXPLAIN EXTENDED".

Reference: http://dev.mysql.com/doc/refman/5.5/en/execution-plan-information.html