part of http://philip.greenspun.com/teaching/three-day-rdbms/
Updated: January 24, 2015
Setting up Virtual Machine and Image
See the Virtual Machine Instructions document for information on downloading and setting up the course environment.
Connecting to MySQL
Easy: Click on the "MySQL" icon on the left navbar.
Harder:
Open a terminal (Select Terminal from the left navbar)
To launch the mysql client and connect to the course database:
$ mysql threedayrdbms -p
When prompted for the password type:
> dev
You now have a prompt from which you can begin entering SQL statements. Let's start with a simple one:
> select 2+2;
In MySQL and some other ANSI SQL databases, you can use the SELECT statement simply to run a function, such as this arithmetic example. In Oracle you can only use SELECT to pull data from a table, so they supply you with a one-row system table called "dual" (the glorious history and an explanation of the name may be found in http://en.wikipedia.org/wiki/DUAL_table ) and the corresponding command is
> select 2+2 from dual;
Note that the semicolon is not part of the SQL statement and need not be used when sending SQL commands from a C or Java program, for example. When typing at a shell client, the semicolon instructs the client to send the SQL to the database.
In the virtual machine, we already created a database for you using the following commands:
create database threedayrdbms;
use threedayrdbms;
-- since we already created the database and specified it
-- when starting the mysql client we can skip the explicit
-- selection of a database with the use command
We also created a table using the following command:
create table dog_breeds (
breed_name varchar(50),
characteristics varchar(100)
);
Note that we configured the MySQL installation to use the InnoDB storage engine by default, so you don't need to specify the InnoDB storage engine at table creation time in order to get some rudimentary support for transactions. After we created the table, we inserted two rows:
insert into dog_breeds (breed_name, characteristics)
values ('border collie', 'ankle-biter'),
('golden retriever', 'vicious killer');
-- note that the MySQL shell client runs, by default in autocommit mode
-- and therefore these rows are instantly available for all sessions
-- to see
You can experiment with the table and the shell client using the following commands:
select * from dog_breeds;
delete from dog_breeds where breed_name like 'b%';
select * from dog_breeds;
Open a Web browser inside your virtual machine by clicking on the Mozilla icon at left. Then see what you can do using standard database admin forms from a Web server:
What if you had a spreadsheet of data on all dog breeds? You could use "LOAD DATA INFILE" to bring it into a MySQL table. See
Hooking it up to the web server
The virtual machine image you've been provided with comes with Linux, Apache, MySQL, PHP, and phpMyAdmin pre-installed. If you wanted to recreate this environment from a basic Linux image you'd need to install the packages for each of these application along with any applicable connectors or libraries. Many guides have been written this subject which can be readily found by searching the Internet for the name of your distribution and the keywords LAMP install. For now we'll assume you're working in the VM or have a similar configuration already installed.
Open your text editor of choice (e.g. start another terminal shell and run "gedit" or click the emacs icon on the menu bar [bottom])
Open the /home/dev/public_html/php/rdbmsapp/index.php file.
By default the file contains code to select the entire dog_breeds table and display it. The mysql_query line can be edited to try out different queries and the loop edited to change how the output is displayed.
Save the file and view the results by opening a web browser (click the FIrefox icon in the menu bar) and pointing it to http://localhost/~dev/php/rdbmsapp.
When you are done you can remove the dog_breeds table:
drop table dog_breeds;
We've preloaded the database with the following tables:
create table bank_customers (
customer_id int not null primary key auto_increment,
customer_name varchar(100) not null,
-- defaults in MySQL must be constant, except for timestamp
date_joined timestamp not null default current_timestamp
);
create table bank_accounts (
account_id int not null auto_increment,
-- normally we put a REFERENCES constraint with the column def
-- but MySQL ignores those, so it is at the end of the statement
customer_id int not null,
account_type enum('checking','savings') not null,
balance decimal(20,2) not null default 0,
primary key (account_id),
foreign key (customer_id) references bank_customers(customer_id)
);
-- note that we need to say how much precision we want, unlike in
-- Oracle. A simple "numeric" will result in 3.75 being rounded up
-- to 4 upon insertion.
We've preloaded the tables with the following customers and accounts:
insert into bank_customers (customer_id, customer_name, date_joined)
VALUES
(1, "Joe Hedge Fund", "2006-12-24"),
(2, "Susie Average", "1997-03-14");
-- note standard YYYY-MM-DD format above
insert into bank_accounts (customer_id, account_type, balance)
VALUES
(1, 'checking', 200000000),
(1, 'savings', 175000000),
(2, 'checking', 4000),
(2, 'savings', 30000);
Extra Credit: See Thomas Piketty's book Capital for why these two customers have such different amounts on deposit. Come up with a way to tax Customer #1 before he flees to Singapore.
Task: Use the show command to display a listing of tables currently in the database:
show tables;
Note that this and DESCRIBE, used below, are special commands of the MySQL DBMS. They are not SQL language statements. You can achieve similar results to the show command by manually selecting the information from the MySQL server's information_schema view:
select table_name from information_schema.tables
where table_schema='threedayrdbms';
Task: For more information on the structure of each table use the describe command:
describe bank_customers;
describe bank_accounts;
Reference:
Let's look at a client program's attempt to transfer $1000 from a customer's savings account to checking. The transfer will be done in two steps: (1) add $1000 to checking, (2) subtract $1000 from savings. A software or power failure occurs between steps 1 and 2. Let's see if the RDBMS can protect us.
-- get the pre-transfer sum of Joe's checking and savings balances
select sum(balance) from bank_accounts where customer_id=1;
-- start transfer by adding the amount to checking
update bank_accounts
set balance=balance+1000
where customer_id=1
and account_type='checking';
-- and next we'll debit the savings account...
Type "exit;" or Control+C to exit the client (simulating the crash of the client program or someone unplugging the RDBMS server). At this point Joe's checking account has been credited but his savings account has not been debited. Let's reconnect to see what state the database has been left in:
mysql threedayrdbms -p
When prompted for the password type: dev
Now back at the mysql console, query for Joe's and the bank's total deposits:
select sum(balance) from bank_accounts where customer_id=1;
select sum(balance) from bank_accounts;
In attempting the transfer, our bank just gave away $1,000 to Joe, certainly not a sustainable way to do business. Let's try this again but wrapping the credit and debit together into a single atomic transaction:
start transaction;
update bank_accounts
set balance=balance+1000
where customer_id=1
and account_type='checking';
Type Control+C to exit the client (once again, simulating a failure) before we can debit the savings account. Let's reconnect to see what state the database has been left in:
mysql threedayrdbms -p
When prompted for the password type: dev
Now back at the mysql console, query for total deposits:
select sum(balance) from bank_accounts where customer_id=1;
select sum(balance) from bank_accounts;
This time the transaction could not be completed so it was rolled back and our bank didn't give away any more free money.
Discussion: How do we build this? (Original MySQL answer: Don't bother. See http://dev.mysql.com/doc/refman/5.5/en/storage-engines.html)
Open a second terminal (shift-click the Terminal icon in the menu bar). The next steps are presented in a table to make it clear where each command should be run. The steps should be executed left-to-right and then top-to-bottom.
Terminal 1 | Terminal 2 |
Launch the mysql client and connect to the course database: mysql threedayrdbms -p When prompted for the password type: dev | |
Back in the first terminal start another balance transfer transaction: start transaction; update bank_accounts set balance=balance+1000 where customer_id=1 and account_type='checking'; | |
Now, before debiting the savings account, check Joe's total balance: select balance from bank_accounts where customer_id=1; You will find that his total balance has increased by $1,000. | Now switch over to your other terminal and execute that same select: select balance from bank_accounts where customer_id=1; This session was isolated from the changes made by an uncommitted transaction start by another session. Until a successful commit, the extra $1,000 is confined to the session attempting the updates. |
Switch back to the first terminal and finish the transfer: update bank_accounts set balance=balance-1000 where customer_id=1 and account_type='savings'; commit; | |
Now that the transaction has been committed the balance returned by our select statement should be the same in each terminal: select balance from bank_accounts where customer_id=1; | Go ahead and verify this by running the same select command in the second terminal window: select balance from bank_accounts where customer_id=1; |
Discussion: How do we build this? (Two answers: multi-versioning or locks.)
Reference:
Now lets try inserting some data that violates one of the constraints we defined.
insert into bank_customers (customer_id, customer_name, date_joined)
VALUES (1, "New Guy", "2011-01-08");
This insert will fail because we've attempted to give New Guy and Joe Hedge Fund the same customer_id (which we defined as a primary key).
Similarly, we can't create a bank account that doesn't belong to anyone:
insert into bank_accounts (customer_id, account_type, balance)
VALUES (37, 'checking', 2000);
We don't have a bank customer with a customer_id of 37 so this insert will fail (we defined customer_id as a foreign key constraint).
Discussion: How do we build this? (Let's talk about indexing, performance, and the original MySQL answer: Don't bother.)
The main enemy of software reliability: programmers. The last line of defense: integrity constraints in the RDBMS.
Sadly, MySQL fails to deliver much of the value that companies get when they buy Oracle or similar systems because MySQL does not enforce the CHECK constraint. Here's an example:
-- no need to cut and paste this; just for discussion
-- (it won't work properly without a bank_vice_presidents table)
create table bank_wire_transfers (
-- in real life we'd have a wire_transfer_id column so we could -- refer to one row easily
account_id int not null
references bank_accounts(account_id),
-- all of the routing info
destination varchar(100),
dollar_amount decimal(20,2),
wire_date datetime,
-- may be NULL for small wires
approved_by int,
foreign key (account_id) references bank_accounts(account_id),
foreign key (approved_by)
references bank_vice_presidents(employee_id),
CHECK (dollar_amount < 10000 OR approved_by is not null)
);
insert into bank_wire_transfers
(account_id, destination, dollar_amount)
values
(1, "Embezzler Bank of Brazil", 10000000)
MySQL is perfectly happy to accept a $10 million wire transfer with no approval. Any of the commercial RDBMSes would generate an error if a transaction included the INSERT statement above and leave the database unmodified. The folks at MySQL don't seem to be concerned about this deficiency, having taken the trouble to document it: http://dev.mysql.com/doc/refman/5.5/en/create-table.html says "The CHECK clause is parsed but ignored by all storage engines".
ACID: Durability
The transactions above, once committed, should survive a power failure as they've been written to a transaction log on disk. Rip the hard drive or SSD out of your laptop and see if you can recover the transactions, though! If not, your installation falls short in the D of ACID.
Discussion: How do we build a system that delivers "durability"?
Our application lets a user register, create a profile with “likes”, join one or more groups, and designate one or more other users as friends.
Sample data:
name (1), group (1 for now; more later), likes (0 or more), friends (0 or more)
George W. Bush|Yale|starting wars, Gitmo|Pamela Anderson, Borat
Barack Hussein Obama II|Harvard|health insurance, unemployment|Bruno, Paula Abdul
Pamela Anderson|Hollywood|plastic surgery, home video|George W. Bush
Borat|Kazakstan|farming, travel, singing|Pamela Anderson, Obama
Create a single table, named "facebook_simple" that can store the above data (suggest four columns). Load in the first row (George W. Bush), either using http://localhost/phpmyadmin/ or an INSERT statement.
Don't worry about keys, uniqueness, integrity constraints, or taste. The purpose of this exercise is going to be showing you how painful it is to represent everything in one table with multiple values in a column.
Possibly helpful: http://dev.mysql.com/doc/refman/5.5/en/data-types.html
Write a query to determine if "Bruno" is among George W. Bush's friends (if so, should return one row with "1" or "t")
Write a query to determine if "Borat" is among George W. Bush's friends
Write a query to print out a list, one row per friend, of George W. Bush's friends (remember that we suggested loading only the first row, so you're not worried about friend symmetry or using information from more than one row of the table)
Write a query to retrieve the table, sorted by last name.
Hints:
Questions:
Let's talk about First Normal Form.
Discussions: Anomalies with Unnormalized relations
create table fbt_all_in_one (
name varchar(100),
group_name varchar(100),
group_is_university_p boolean,
group_main_city varchar(100),
likes text,
friends text
);
Can we insert a new group, e.g., Olin College, without inserting a person at the same time? If not, that's an insert anomaly, according to E.F. Codd (dead).
What if we delete George W. Bush? Suppose that he was the only Yalie. Have we also deleted the facts that Yale is a university and that Yale is in New Haven, Connecticut? If so, that's a delete anomaly, according to Codd.
What if Yale decided to spin off its university and concentrate on its core business of investment management. We issue the following database command
update fbt_all_in_one
set group_is_university_p = false
where group_name = 'Yale';
Assuming both John Kerry and George W. Bush had both been users of the service, would more than one row have been updated? If so, that's the update anomaly.
Extra credit: Compare your grades so far to those of the current Secretary of State and former president by visiting http://www.cbsnews.com/stories/2005/06/07/politics/main700170.shtml
Create a multi-table data model to hold the Facebooklet data. If it doesn't suffer from any of the anomalies listed above, it is probably correct.
Before we start, let's look at and discuss a real-world data model: http://philip.greenspun.com/teaching/three-day-rdbms/education.sql.txt
Representing friend requests and friendship: Like Facebook, we need to model that User A requested to become friends with User B, but User B has not yet accepted. Once User B accepts, the friendship is symmetric and everything should work just as if User B had initiated the friendship. Also think about a way to represented defriended relationships. If you need to make a design choice, push for cleanliness of representation and don't worry if the queries might get a little ugly or complex. It is always possible to create a view to hide the complexity of a query.
Hint: Add an extra mapping table to represent a many-to-many relation, e.g., that a user could belong to more than one group and a group can contain more than one user.
Our solution is posted at http://philip.greenspun.com/teaching/three-day-rdbms/facebooklet and also in the VM at ~dev/sql/facebooklet. It can be loaded using the directions in facebooklet-loader.sql if you need it in order to complete the exercises below.
Let's have queries for the following reports:
This would be a good time to read http://philip.greenspun.com/sql/style so that you get into the habit of indenting SQL readably. Remember that adding a "LIMIT 20" at the end of a query will keep the report length manageable.
Reference:
How about a report showing all of the groups in the system and their associated members? It seems simple, but what about groups that don't have any members yet? We want them in our report.
If all of the groups in your table have members, consider that Harvard has received a $5 billion donation that came with the condition for the school to change its name to "Hogwarts" and for Drew Gilpin Faust to change her name to "Professor Dumbledore". Add a new group with no members:
INSERT INTO groups (title) VALUES ('Hogwarts');
After you've added this row, try to create the report showing the complete membership list for all groups, including one row with some NULLs for groups that don't have any members.
Hint: something like
select *
from groups g left outer join group_memberships gm on (g.group_id = gm.group_id);
select *
from groups g left outer join group_memberships gm on (g.group_id = gm.group_id)
where title like 'H%';
If you want to print user names rather than user IDs, you'll need to bring in the USERS table. Something like
SELECT stuff_you_want_in_your_report
FROM table_you_want_every_row_from
LEFT OUTER JOIN table_with_mapping_relation ON (join_column_constraint)
LEFT OUTER JOIN table_with_extra_info ON (join constraint)
LIMIT 20;
Please write a query giving a report of all of the names of the groups in the system that have at least one member next to the number of members in that group. You'll be using GROUP BY and probably count(*).
Write a second query that also includes the names of groups with no members (and a "0" for number of members in the second column of the report).
Hint: COUNT(column_name) will return 0 if all of the column values are NULL.
References: http://philip.greenspun.com/sql/complex-queries
Use your new GROUP BY skills to write a query that will return a single row containing the most-liked "like" (human-readable topic, not the like_id) and the number of users who liked it. Assume that there is one clear winner among the "likes".
Hint: GROUP BY, ORDER BY, and the MySQL extension "limit" may be useful. Remember that ORDER BY can do ascending (ASC) or descending (DESC).
Extra credit: What if there are two or three "likes", each with the same number of users mapped to it? Can you generate a report showing all of the likes whose popularity is at the mode? You might need a subquery and/or a view.
Please write a query giving a report of all of the names of the groups in the system that have at least 30 members.
Take some of the queries above and feed them to MySQL once again but prefixed with "EXPLAIN" or "EXPLAIN EXTENDED".
Reference: http://dev.mysql.com/doc/refman/5.5/en/execution-plan-information.html