UNIT II
Relational
Model Structure of
Relational Databases Basic
structure
Attribute Types:
Each attribute of a relation has a name.
The set of allowed values for each attribute is called the domain of the
attribute.
Attribute values are (normally) required to be atomic; that is,
indivisible.
E.g. the value of an attribute can be an account number,
but cannot be a set of account numbers
but cannot be a set of account numbers
Domain is said to be atomic if all its members are atomic.
The special value null is a member of every domain.
The null value causes complications in the definition of many operations.
We shall ignore the effect of null values in our main presentation and
consider their effect later.
Relation Schema
Formally, given domains D1,
D2, …. Dn a relation r is a subset of
D1 x D2 x … x Dn
Thus, a relation is a set of n-tuples (a1, a2, …, an) where each ai Î Di
D1 x D2 x … x Dn
Thus, a relation is a set of n-tuples (a1, a2, …, an) where each ai Î Di
Schema of a relation consists of:
attribute definitions
name
type/domain
integrity constraints
Relation Instance
The current values (relation
instance) of a relation are specified by a table
An element t of r is a tuple, represented by a row
in a table
Order of tuples is irrelevant (tuples may be stored in an arbitrary
order)
Database
A database consists of multiple relations
Information about an enterprise is broken up into parts, with each relation storing one part of the
informationE.g.
account : information about accounts
depositor : which customer owns which account
customer : information about customers
depositor : which customer owns which account
customer : information about customers
Why Split Information Across Relations?
Storing all information as a single relation such as
bank(account_number, balance, customer_name, ..)
results in repetition of information
bank(account_number, balance, customer_name, ..)
results in repetition of information
e.g.,if two customers own an account (What
gets repeated?)
The need for null values
e.g., to represent a customer
without an account
Normalization theory deals with
how to design relational schemas
Keys
Let K Í R
K is a superkey of R if values for K
are sufficient to identify a unique tuple of each possible relation r(R)
by “possible r ” we mean
a relation r that could exist
in the enterprise we are modeling.
Example: {customer_name,
customer_street} and
{customer_name}
are both superkeys of Customer, if no two customers can possibly have the same name
{customer_name}
are both superkeys of Customer, if no two customers can possibly have the same name
In real life, an attribute such as customer_id
would be used instead of customer_name
to uniquely identify customers, but we omit it to keep our examples
small, and instead assume customer names are unique.
K is a candidate key if K is minimal
Example: {customer_name} is a candidate key for Customer, since it is a superkey and no subset of it is a superkey.
Example: {customer_name} is a candidate key for Customer, since it is a superkey and no subset of it is a superkey.
Primary key: a candidate key
chosen as the principal means of identifying tuples within a relation Should
choose an attribute whose value never, or very rarely, changes.
E.g. email address is unique, but may change
Foreign Keys:A relation schema
may have an attribute that corresponds to the primary key of another
relation. The attribute is called a
foreign key.
E.g. customer_name and account_number attributes of depositor are foreign keys to customer and account respectively.
Only values occurring in the primary key attribute of the referenced
relation may occur in the foreign key attribute of the referencing relation.
Query
Languages
Language in which user requests information from the database.
Categories of languages
Procedural
Non-procedural, or declarative
“Pure” languages:
Relational algebra
Tuple relational calculus
Domain relational calculus
Pure languages form underlying basis of query languages that people use.
Six basic operators
of Relational algebra
select: s
project: Õ
union: È
set difference: –
Cartesian product: x
rename: r
The operators take one or two relations
as inputs and produce a new relation as a result.
select, project and rename operators are called unary operators, because
they are operating on one relation.
The other three operators are called binary operators, because they are
operating on pairs of relationship. Rename
Operation
Allows us to name, and
therefore to refer to, the results of relational-algebra expressions.
Allows us to refer to a relation by more than one name
Example:
r x (E)
returns the expression E under the name X
If a relational-algebra expression E
has arity n, then
returns the result of
expression E under the name X, and with the
attributes renamed to A1 , A2 , …., An .
Other Relational Languages
Tuple Relational
Calculus
Domain Relational
Calculus
Query-by-Example
(QBE)
Datalog
Tuple Relational Calculus
A nonprocedural query language, where each query is of
the form
{t | P (t ) }
It is the set of all
tuples t such that predicate P is true for t
t is a tuple variable, t [A ] denotes the value of tuple t on attribute A
t Î r denotes that tuple t is in relation r
P is a formula similar to that of the
predicate calculus
Predicate Calculus
Formula
1 Set of attributes and constants
2. Set of comparison operators: (e.g., <, £, =, ¹, >, ³)
3. Set of connectives: and (Ù), or (v)‚ not (Ø)
4. Implication (Þ): x Þ y, if x if true, then y is true
x Þ y
º Øx v y
5. Set of quantifiers:
$ t Î r (Q (t )) º ”there exists” a tuple in t in relation r
such that predicate Q (t ) is true
such that predicate Q (t ) is true
"t Î r
(Q (t )) º Q
is true “for all” tuples t in
relation r
Banking Example
branch (branch_name, branch_city, assets )
customer (customer_name, customer_street,
customer_city )
account (account_number, branch_name, balance )
loan (loan_number, branch_name, amount )
depositor (customer_name, account_number )
borrower (customer_name, loan_number )
Example Queries
Find the loan_number, branch_name, and amount for loans of over $1200
{t | t Î loan Ù t [amount ] > 1200}
Find the loan number
for each loan of an amount greater than $1200
{t | $ s Î loan (t [loan_number ] = s [loan_number ] Ù s
[amount ] > 1200)}
Notice that a
relation on schema [loan_number ]
is implicitly defined by
the query
Safety of Expressions
It is possible to write tuple calculus expressions that
generate infinite relations.
For example, { t | Ø t Î r }
results in an infinite relation if the domain of any attribute of relation r is infinite
To guard against the
problem, we restrict the set of allowable expressions to safe expressions.
An expression {t | P (t )} in the tuple relational calculus is safe if every component of t appears in one of the relations,
tuples, or constants that appear in P
NOTE: this is more
than just a syntax condition.
E.g. { t | t [A] = 5 Ú true } is not
safe --- it defines an infinite set with attribute values that do not appear in
any relation or tuples or constants in P.
Domain Relational Calculus
A nonprocedural query language equivalent in power to the
tuple relational calculus
Each query is an
expression of the form:
{ < x1, x2, …, xn
> | P (x1, x2, …, xn)}
x1, x2, …, xn represent domain variables
P represents a formula
similar to that of the predicate calculus
Example Queries
Safety of Expressions
for all values The expression:
{ < x1, x2, …, xn
> | P (x1, x2, …, xn )}
is safe if all of the
following hold:
All values that
appear in tuples of the expression are values from dom (P ) (that
is, the values appear either in P
or in a tuple of a relation mentioned
in P ).
For every “there
exists” subformula of the form $ x
(P1(x )), the subformula is true if and only if there is a value of x in dom (P1) such that P1(x )
is true.
For every “for all”
subformula of the form "x (P1
(x )), the subformula is true
if and only if P1(x ) is true es x
from dom (P1).
Relational database design
The goal of a
relational database design is to generate a set of relation schemas that
allows us to store information without unneccassary redundancy and allow us to retrieve information easily.
Pitfalls in
Relational database design
Among the undesirable
properties that a bad design may have are.
1 Pitfalls in
Relational-Database Design
Dependency
Preservation:
There is another goal in
relational-database design: dependency preservation.
Let F be a set of functional
dependencies on a schema R, and let R1,R2……Rn be a decomposition of R. The
restriction of F to Ri is the set Fi of all functional dependencies in F+
that include only attributes of Ri
.
Let F’=F1∪ F2∪…… ∪ Fn. F’ is a set of
functional dependencies on schema R. if F’+= F+ is true, then every dependency in F is
logically implied by F’, and, if we verify that F’ is satisfied, we
have verified that F is satisfied. We say that a decomposition having the
property F’+= F+ is a dependency-preserving
decomposition.
Boyce/codd normal
form(BCNF):
A relvar is in BCNF if and only if every
nontrivial, left-irreducible FD has a candidate key as its determinant.
A relation schema R is in BCNF
with respect to a set F of functional dependencies if for all functional
dependencies in F+ of the form a→b, where aÍR and bÍR, at least one of
the following holds:
ⅰ a→b is a trivial
functional dependency (that is bÍ a)
ⅱ a is a superkey for
schema R
Examples:
①
customer-schema=(customer-name,customer- street, customer-city)
customer-name→{customer-street,customer-city}
customer-name is a candidate key
customer-schema is in BCNF
②
branch-schema=(branch-name,assets,branch-city)
branch-name→ {assets,branch-city}
branch-schema is a candidate key
branch-schema is in BCNF
③ loan-info-schema=(branch-name,
customer-name,loan-number, amount) loan-number→{amount,branch-name}
loan-number is not a candidate key
(Downtown,John Bell,L-44,1000)
(Downtown,Jane Bell,L-44,1000)
loan-number → {amount,branch-name} is not a trivial dependency.
loan-info-schema is not in BCNF
loan-info-schema=(branch-name, customer-name,loan-number, amount)
loan-info-schema is not in
BCNF. Redundancy exist, so we should decompose it.
loan-schema=(loan-number, branch-name, amount)
borrower-schema=(customer-name,loan-number)
loan-number→{amount,branch-name}
loan-number is a candidate key of loan-schema
loan-schema and borrower-schema are both in BCNF
If R is not in BCNF, we can
decompose R into a collection of BCNF schemas R1,R2……Rn. Which not only a BCNF
decomposition, but also a lossless-join decomposition.
Algorithm:
result:={R};
done:=false;
compute F+;
while (not done) do
if (there is a schema Ri in result that is not in BCNF)
then begin
let aÒ b be a nontrivial
functional dependency that
holds on Ri such that aÒ Ri is not in F+
,and a∩b =F ;
result:=(result-Ri) ∪(Ri- b) ∪(a ,b);
end
else done:=true;
Example:
Lending-schema={branch-name,branch-city,
assets,customer-name,loan-number,amount}
A candidate key for this schema
is:
{loan-number,customer-name}
BCNF decomposition :
Example:
Banker-schema={branch-name,customer-name,banker-name}
banker-name is not a key for
Banker-schema.
Thus, Banker-schema is not in BCNF. Decompose it.
Banker-branch-schema={banker-name,branch-name}
Customer-banker-schema={customer-name,banker-name}
F1={banker-name→branch-name}
F2= F
{branch-name,customer-name}→banker-name in F+
{branch-name,customer-name}→banker-name not in {F1∪F2}+
{F1∪F2}+<>F+
and the decomposition is not dependency preserving
This example demonstrates that
not every BCNF decomposition is dependency preserving. Moreover, it
demonstrates that we cannot always satisfy all three design goals:
① BCNF
② Lossless join
③ Dependency preservation
3NF decomposition A relation schema R is in 3NF with respect to a set F
of functional dependencies if, for all functional dependencies in F+
of the form a→b, where aÍR and bÍR, at least one of
the following holds:
① a→b is a trivial
functional dependency (that is bÍ a)
② a is a superkey for
schema R
③ each attribute A in
b-a is contained in a
candidate key for R.
Example:
Banker-schema={branch-name,customer-name,banker-name}
banker-name is not a key for
Banker-schema.
Thus, Banker-schema is not in BCNF.
{branch-name,customer-name} is a candidate key for Banker-schema.
Banker-schema is in 3NF.
If R is not in 3NF, we can decompose R into a collection of 3NF schemas
R1,R2……Rn, which is not only a lossless-join decomposition, but also a
dependency-preserving decomposition.
Example:
Relvar R=(O,I,S,Q,B,D) satisfies the following FDs.
F={S→D, I→B, IS →Q, B →O}
Give a lossless-join, dependency-preserving decomposition into 3NF of the
schema R.
F=FC
L: S,I R: D,Q,O LR: B
(IS)+=ISDBQO=U have
only one candidate key.
r={SD,IB,ISQ,BO}
comparison of BCNF and 3NF
Advantage of 3NF:
It is always possible to obtain
a 3NF design without sacrificing a lossless join or dependency preservation.
Disadvantage of 3NF:
If we do not eliminate all transitive dependencies, we may have to use
null values to represent some of the possible meaningful relationships among
data items, and there is the problem of repetition of information.
If a schema is in 3NF but not in BCNF,then redundancy will occur.uniquely the value for another set of
attributes.
Distributed Databases:
A distributed database system consists of loosely coupled
sites that share no physical component
Database systems that run on each site are independent of
each other
Transactions may
access data at one or more sites
Distributed Transactions
Transaction may access data at several sites. Each site
has a local transaction manager responsible for Maintaining a log for recovery
purposes Participating in coordinating the concurrent execution of the
transactions executing at that site.
Each site has a
transaction coordinator, which is responsible for Starting the execution of
transactions that originate at the site.
Distributing sub trans actions at appropriate sites for execution.
Coordinating the termination of each transaction that originates at the site,
which may result in the transaction being committed at all sites or aborted at
all sites.
System Failure Modes
Failures unique to
distributed systems:
Failure of a site.
Loss of massages
Handled by network
transmission control protocols such as TCP-IP
Failure of a
communication link
Handled by network
protocols, by routing messages via alternative links
Network partition
A network is said to
be partitioned when it has been
split into two or more subsystems that lack any connection between them
Note: a subsystem may
consist of a single node
Network partitioning
and site failures are generally indistinguishable.
Commit Protocols
Commit protocols are used to ensure atomicity across
sites
a transaction which
executes at multiple sites must either be committed at all the sites, or
aborted at all the sites.
Not acceptable to have a transaction committed at one
site and aborted at another
The two-phase commit (2PC) protocol is
widely used
The three-phase commit (3PC) protocol is
more complicated and more expensive, but avoids some drawbacks of two-phase
commit protocol. This protocol is not
used in practice.
Two Phase Commit Protocol (2PC)
Assumes fail-stop model – failed sites simply stop
working, and do not cause any other harm, such as sending incorrect messages to
other sites.
Execution of the
protocol is initiated by the coordinator after the last step of the transaction
has been reached.
The protocol involves
all the local sites at which the transaction executed
Let T be a transaction initiated at site Si, and let the
transaction coordinator at Si
be Ci
Phase 1: Obtaining a Decision
Coordinator asks all participants to prepare to commit transaction Ti.
Ci adds the records <prepare T> to the log and forces log to stable storage
sends prepare T messages to all sites at which T executed
Upon receiving
message, transaction manager at site determines if it can commit the
transaction if not, add a record <no
T> to the log and
send abort T message to Ci
if the transaction can be committed, then:
add the record <ready T> to the
log
force all
records for T to stable
storage
send ready T message to Ci
Phase 2: Recording the Decision
T can be committed of Ci received a ready
T message from all the
participating sites: otherwise T must
be aborted.
Coordinator adds a decision record, <commit T> or <abort T>, to the log and forces record
onto stable storage. Once the record stable storage it is irrevocable (even if
failures occur) Coordinator sends a message to each participant informing it of
the decision (commit or abort) Participants take appropriate action locally.
Concurrency Control
Modify concurrency control schemes for use in distributed
environment.
We assume that each
site participates in the execution of a commit protocol to ensure global
transaction automicity.
We assume all replicas of any item are updated
Single-Lock-Manager Approach
System maintains a single
lock manager that resides in a single
chosen site, say Si
When a transaction needs to lock a data item, it sends a lock request to Si and lock manager determines
whether the lock can be granted immediately
If yes, lock manager sends a message to the site which
initiated the request
If no, request is delayed until it can be granted, at
which time a message is sent to the initiating site
The transaction can read the data item from any one of the sites at which a
replica of the data item resides.
Writes must be performed on all replicas of a data item
Advantages of scheme:
Simple implementation
Simple deadlock handling
Disadvantages of scheme are:
Bottleneck: lock manager site becomes a bottleneck
Vulnerability: system is vulnerable to lock manager site
failure.
Distributed Lock Manager
In this approach,
functionality of locking is implemented by lock managers at each site.Lock
managers control access to local data items.But special protocols may be used
for replicas
Advantage: work is distributed and can be made robust to failure
Disadvantage:
deadlock detection is more complicated
Lock managers
cooperate for deadlock detection
Several variants of this approach
Primary copy
Majority protocol
Biased protocol
Quorum consensus
Primary Copy
Choose one replica of data item to be the primary copy.
Site containing the replica is called the primary
site for that data item
Different data items can have different primary sites
When a transaction
needs to lock a data item Q, it
requests a lock at the primary site of Q.
Implicitly gets lock on all replicas of the data item
Benefit
Concurrency control for replicated data handled similarly
to unreplicated data - simple implementation.
Drawback
If the primary site of
Q fails, Q is inaccessible even though
other sites containing a replica may be
accessible.
Majority Protocol
Local lock manager at each site administers lock and unlock
requests for data items stored at that site.
When a transaction
wishes to lock an unreplicated data item Q
residing at site Si,
a message is sent to Si
‘s lock manager.
If Q is
locked in an incompatible mode, then the request is delayed until it can be
granted.
When
the lock request can be granted, the lock manager sends a message back to the
initiator indicating that the lock request has been granted.
In case of replicated
data.If Q is replicated at n
sites, then a lock request message must be sent to more than half of the n
sites in which Q is stored.The
transaction does not operate on Q
until it has obtained a lock on a majority of the replicas of Q.
When
writing the data item, transaction performs writes on all replicas.
Benefit
Can be used even when some sites are unavailable
details on how handle writes in the presence of site
failure later
Drawback
Requires 2(n/2
+ 1) messages for handling lock requests, and (n/2 + 1) messages for handling unlock requests.Potential for
deadlock even with single item - e.g., each of 3 transactions may have locks on
1/3rd of the replicas of a data.
Biased Protocol
Local lock manager at each site as in majority protocol,
however, requests for shared locks are handled differently than requests for
exclusive locks.
Shared locks. When a
transaction needs to lock data item Q,
it simply requests a lock on Q
from the lock manager at one site containing a replica of Q.
Exclusive locks. When
transaction needs to lock data item Q,
it requests a lock on Q from
the lock manager at all sites containing a replica of Q.
Advantage - imposes less overhead on read
operations.
Disadvantage - additional overhead on writes
Timestamping
Timestamp based concurrency-control protocols can be used
in distributed systems. Each transaction must be given a unique timestamp
Main problem: how to generate a
timestamp in a distributed fashion
Each site generates a unique local timestamp using either
a logical counter or the local clock.
Global unique timestamp is obtained by concatenating the unique
local timestamp with the unique identifier.
A site with a slow
clock will assign smaller timestamps
Still logically correct: serializability not affected
But: “disadvantages” transactions
To fix this problem
Define within each site Si a logical clock (LCi), which generates the unique local timestamp
Require that Si
advance its logical clock whenever a request is received from a
transaction Ti with timestamp < x,y>
and x is greater that the current value of LCi.
In this case, site Si advances its
logical clock to the value x +
1.
No comments:
Post a Comment