Monday, 27 October 2014

RDBMS

UNIT II


Relational Model Structure of Relational Databases Basic structure
Attribute Types:
Each attribute of a relation has a name.
The set of allowed values for each attribute is called the domain of the attribute.
Attribute values are (normally) required to be atomic; that is, indivisible.
E.g. the value of an attribute can be an account number,
but cannot be a set of account numbers
Domain is said to be atomic if all its members are atomic.
The special value null  is a member of every domain.
The null value causes complications in the definition of many operations.
We shall ignore the effect of null values in our main presentation and consider their effect later.

Relation Schema
Formally, given domains D1, D2, …. Dn a relation r is a subset of
        D1 x  D2  x … x Dn
Thus, a relation is a set of n-tuples (a1, a2, …, an) where each ai 
Î Di
Schema of a relation consists of:
attribute definitions
name
type/domain
integrity constraints
Relation Instance
The current values (relation instance) of a relation are specified by a table
An element t of r is a tuple, represented by a row in a table
Order of tuples is irrelevant (tuples may be stored in an arbitrary order)
Database
A database consists of multiple relations
Information about an enterprise is broken up into parts, with  each relation storing one part of the informationE.g.
account :    information about accounts
depositor :   which customer owns which account
customer :   information about customers
Why Split Information Across Relations?
Storing all information as a single relation such as
   bank(account_number, balance, customer_name, ..)
  results in repetition of information
 e.g.,if two customers own an account (What gets repeated?)
The need for null values
e.g., to represent a customer without an account
Normalization theory  deals with how to design relational schemas
Keys
Let K Í R
K is a superkey of R if values for K are sufficient to identify a unique tuple of each possible relation r(R)
by “possible r ” we mean a relation r that could exist in the enterprise we are modeling.
Example:  {customer_name, customer_street} and
                 {customer_name}
are both superkeys of Customer, if no two customers can possibly have the same name
In real life, an attribute such as customer_id would be used instead of customer_name to uniquely identify customers, but we omit it to keep our examples small, and instead assume customer names are unique.
K is a candidate key if K is minimal
Example:  {customer_name} is a candidate key for Customer, since it is a superkey and no subset of it is a superkey.

Primary key: a candidate key chosen as the principal means of identifying tuples within a relation Should choose an attribute whose value never, or very rarely, changes.
E.g. email address is unique, but may change

Foreign Keys:A relation schema may have an attribute that corresponds to the primary key of another relation.  The attribute is called a foreign key.
E.g. customer_name and account_number attributes of depositor are foreign keys to customer and account respectively.

Only values occurring in the primary key attribute of the referenced relation may occur in the foreign key attribute of the referencing relation.
 Query Languages
Language in which user requests information from the database.
Categories of languages
Procedural
Non-procedural, or declarative
“Pure” languages:
Relational algebra
Tuple relational calculus
Domain relational calculus
Pure languages form underlying basis of query languages that people use.
Six basic operators  of Relational algebra
select: s
project: Õ
union: È
set difference: –
Cartesian product: x
rename: r

The operators take one or  two relations as inputs and produce a new relation as a result.
select, project and rename operators are called unary operators, because they are operating on one relation.
The other three operators are called binary operators, because they are operating on pairs of  relationship.Rename Operation

Allows us to name, and therefore to refer to, the results of relational-algebra expressions.
Allows us to refer to a relation by more than one name
Example:
                                                r x (E)
            returns the expression E under the name X
If a relational-algebra expression E has arity n, then                                          
            returns the result of expression E under the name X, and with the
            attributes renamed to A1 , A2 , …., An .
  Other Relational Languages
Tuple Relational Calculus
Domain Relational Calculus
Query-by-Example (QBE)
Datalog

Tuple Relational Calculus

A nonprocedural query language, where each query is of the form
                        {t | P (t ) }
It is the set of all tuples t such that predicate P is true for t
t is a tuple variable, t [A ] denotes the value of tuple t on attribute A
t Î r denotes that tuple t is in relation r
P is a formula similar to that of the predicate calculus
Predicate Calculus Formula
1          Set of attributes and constants
2.         Set of comparison operators:  (e.g., <, £, =, ¹, >, ³)
3.         Set of connectives:  and (Ù), or (v)‚ not (Ø)
4.         Implication (Þ): x Þ y, if x if true, then y is true
                                                x Þ y º Øx v y
5.         Set of quantifiers:
$ t Î r (Q (t )) º ”there exists” a tuple in t in relation r
                        such that predicate Q (t ) is true
"t Î r (Q (t )) º Q is true “for all” tuples t in relation r
Banking Example
branch (branch_name, branch_city, assets )
customer (customer_name, customer_street, customer_city )
account (account_number, branch_name, balance )
loan (loan_number, branch_name, amount )
depositor (customer_name, account_number )
borrower (customer_name, loan_number )
Example Queries
Find the loan_number, branch_name, and amount for loans of over $1200
     {t | t Î loan Ù t [amount ] > 1200}
Find the loan number for each loan of an amount greater than $1200
  {t | $ s Î loan (t [loan_number ] = s [loan_number ] Ù s [amount ] > 1200)}
  Notice that a relation on schema [loan_number ] is implicitly defined by            
     the query

Safety of Expressions

It is possible to write tuple calculus expressions that generate infinite relations.
For example, { t | Ø t Î r } results in an infinite relation if the domain of any attribute of relation r is infinite
To guard against the problem, we restrict the set of allowable expressions to safe expressions.
An expression {t | P (t )} in the tuple relational calculus is safe if every component of t appears in one of the relations, tuples, or constants that appear in P
NOTE: this is more than just a syntax condition.
E.g. { t | t [A] = 5 Ú true } is not safe --- it defines an infinite set with attribute values that do not appear in any relation or tuples or constants in P.
  
Domain Relational Calculus

A nonprocedural query language equivalent in power to the tuple relational calculus
Each query is an expression of the form:
                                    { < x1, x2, …, xn > | P (x1, x2, …, xn)}
x1, x2, …, xn  represent domain variables
P represents a formula similar to that of the predicate calculus
Example Queries
Safety of Expressions

for all values The expression:
                                    { < x1, x2, …, xn > | P (x1, x2, …, xn )}
is safe if all of the following hold:
All values that appear in tuples of the expression are values from dom (P ) (that is, the values appear either in P or in a tuple of a      relation mentioned in P ).
For every “there exists” subformula of the form $ x (P1(x )), the      subformula is true if and only if there is a value of x in dom (P1)          such that P1(x ) is true.
For every “for all” subformula of the form "x (P1 (x )), the subformula is true if and only if P1(x ) is true es x  from dom (P1).

Relational database design

The goal of a  relational database design is to generate a set of relation schemas that allows us to store information without unneccassary redundancy  and allow us to retrieve information easily.
Pitfalls in Relational database design
    Among the undesirable properties that a bad design may have are.
1 Pitfalls in Relational-Database Design
Dependency Preservation:

There is another goal in relational-database design: dependency preservation.
Let F be a set of functional dependencies on a schema R, and let R1,R2……Rn be a decomposition of R. The restriction of F to Ri is the set Fi of all functional dependencies in F+ that include only attributes of Ri
.
         Let F=F1 F2…… Fn. F is a set of functional dependencies on schema R. if F’+= F+  is true, then every dependency in F is logically implied by F’, and, if we verify that Fis satisfied, we have verified that F is satisfied. We say that a decomposition having the property F’+= F+ is a dependency-preserving decomposition.

Boyce/codd normal form(BCNF):

 A relvar is in BCNF if and only if every nontrivial, left-irreducible FD has a candidate key as its determinant.
A relation schema R is in BCNF with respect to a set F of functional dependencies if for all functional dependencies in F+ of the form ab, where aÍR and bÍR, at least one of the following holds:
ab is a trivial functional dependency (that is bÍ a)
a is a superkey for schema R

Examples:

customer-schema=(customer-name,customer- street,                             customer-city)
customer-name→{customer-street,customer-city}                            
customer-name is a candidate key
customer-schema is in BCNF
branch-schema=(branch-name,assets,branch-city)
branch-name→ {assets,branch-city}
branch-schema is a candidate key
branch-schema is in BCNF
loan-info-schema=(branch-name, customer-name,loan-number, amount) loan-number→{amount,branch-name}
loan-number is not a candidate key
(Downtown,John Bell,L-44,1000)
(Downtown,Jane Bell,L-44,1000)
loan-number → {amount,branch-name} is not a trivial dependency.
loan-info-schema is not in BCNF
loan-info-schema=(branch-name, customer-name,loan-number, amount)
       loan-info-schema is not in BCNF. Redundancy exist, so we should decompose it.
loan-schema=(loan-number, branch-name, amount)
borrower-schema=(customer-name,loan-number)
loan-number→{amount,branch-name}
loan-number is a candidate key of loan-schema
loan-schema and borrower-schema are both in BCNF
       If R is not in BCNF, we can decompose R into a collection of BCNF schemas R1,R2……Rn. Which not only a BCNF decomposition, but also a lossless-join decomposition.
Algorithm:
result:={R};
done:=false;
compute F+;
while (not done) do
if (there is a schema Ri in result that is not in BCNF)
then begin
let aÒ b be a nontrivial functional dependency that
holds on Ri such that aÒ Ri is not in F+ ,and ab =F ;
result:=(result-Ri) (Ri- b) (a ,b);
end
else done:=true;
Example:
Lending-schema={branch-name,branch-city, assets,customer-name,loan-number,amount}
A candidate key  for this schema is:
{loan-number,customer-name}
BCNF decomposition :
Example:
Banker-schema={branch-name,customer-name,banker-name}
banker-name is not a  key for Banker-schema.
Thus, Banker-schema is not in BCNF. Decompose it.
Banker-branch-schema={banker-name,branch-name}
Customer-banker-schema={customer-name,banker-name}
F1={banker-name→branch-name}
F2= F
{branch-name,customer-name}→banker-name in F+
{branch-name,customer-name}→banker-name not in {F1F2}+
{F1F2}+<>F+ and the decomposition is not dependency preserving
This example demonstrates that not every BCNF decomposition is dependency preserving. Moreover, it demonstrates that we cannot always satisfy all three design goals:
BCNF
Lossless join
Dependency preservation
3NF decomposition A relation schema R is in 3NF with respect to a set F of functional dependencies if, for all functional dependencies in F+ of the form ab, where aÍR and bÍR, at least one of the following holds:
ab is a trivial functional dependency (that is bÍ a)
a is a superkey for schema R
each attribute A in b-a is contained in a candidate key for R.
Example:
Banker-schema={branch-name,customer-name,banker-name}
banker-name is not a  key for Banker-schema.
Thus, Banker-schema is not in BCNF.
{branch-name,customer-name} is a candidate key for Banker-schema. Banker-schema is in 3NF.
If R is not in 3NF, we can decompose R into a collection of 3NF schemas R1,R2……Rn, which is not only a lossless-join decomposition, but also a dependency-preserving decomposition.
Example:
Relvar R=(O,I,S,Q,B,D) satisfies the following FDs.
F={S→D, I→B, IS →Q, B →O}
Give a lossless-join, dependency-preserving decomposition into 3NF of the schema R.
F=FC
            L: S,I        R: D,Q,O        LR: B
(IS)+=ISDBQO=U   have only one candidate key.
r={SD,IB,ISQ,BO}
comparison of BCNF and 3NF

Advantage of 3NF:
It is always possible to obtain a 3NF design without sacrificing a lossless join or dependency preservation.
Disadvantage of 3NF:
If we do not eliminate all transitive dependencies, we may have to use null values to represent some of the possible meaningful relationships among data items, and there is the problem of repetition of information.
If a schema is in 3NF but not in BCNF,then redundancy will occur.uniquely the value for another set of attributes.
                         

Distributed Databases:

A distributed database system consists of loosely coupled sites that share no physical component
Database systems that run on each site are independent of each other
Transactions may access data at one or more sites


Distributed Transactions

Transaction may access data at several sites. Each site has a local transaction manager responsible for Maintaining a log for recovery purposes Participating in coordinating the concurrent execution of the transactions executing at that site.
Each site has a transaction coordinator, which is responsible for Starting the execution of transactions that originate at the site.

Distributing sub trans actions at appropriate sites for execution. Coordinating the termination of each transaction that originates at the site, which may result in the transaction being committed at all sites or aborted at all sites.
System Failure Modes

Failures unique to distributed systems:
Failure of a site.
Loss of massages
Handled by network transmission control protocols such as TCP-IP
Failure of a communication link
Handled by network protocols, by routing messages via alternative links
Network partition
A network is said to be partitioned when it has been split into two or more subsystems that lack any connection between them
Note: a subsystem may consist of a single node
Network partitioning and site failures are generally indistinguishable.

Commit Protocols

Commit protocols are used to ensure atomicity across sites
a transaction which executes at multiple sites must either be committed at all the sites, or aborted at all the sites.
Not acceptable to have a transaction committed at one site and aborted at another
The two-phase commit (2PC) protocol is widely used
The three-phase commit (3PC) protocol is more complicated and more expensive, but avoids some drawbacks of two-phase commit protocol.  This protocol is not used in practice.

Two Phase Commit Protocol (2PC)

Assumes fail-stop model – failed sites simply stop working, and do not cause any other harm, such as sending incorrect messages to other sites.
Execution of the protocol is initiated by the coordinator after the last step of the transaction has been reached.
The protocol involves all the local sites at which the transaction executed
Let T be a transaction initiated at site Si, and let the transaction coordinator at Si be Ci

Phase 1: Obtaining a Decision

Coordinator asks all participants to prepare to commit transaction Ti.
Ci adds the records <prepare T> to the log and forces log to stable storage
sends prepare T messages to all sites at which T executed
Upon receiving message, transaction manager at site determines if it can commit the transaction if not, add a record <no T> to the log and send abort T message to Ci

if the transaction can be committed, then:
add the record <ready T> to the log
force all records for T to stable storage
send ready T message to Ci

Phase 2: Recording the Decision

T can be committed of Ci received a ready T message from all the participating sites: otherwise T must be aborted.
Coordinator adds a decision record, <commit T> or <abort T>, to the log and forces record onto stable storage. Once the record stable storage it is irrevocable (even if failures occur) Coordinator sends a message to each participant informing it of the decision (commit or abort) Participants take appropriate action locally.

Concurrency Control

Modify concurrency control schemes for use in distributed environment.
We assume that each site participates in the execution of a commit protocol to ensure global transaction automicity.
We assume all replicas of any item are updated

Single-Lock-Manager Approach

System maintains a single lock manager that resides in a single chosen site, say Si When a transaction needs to lock a data item, it sends a lock request to Si and lock manager determines whether the lock can be granted immediately
If yes, lock manager sends a message to the site which initiated the request
If no, request is delayed until it can be granted, at which time a message is sent to the initiating site
The transaction can read the data item from any one of the sites at which a replica of the data item resides.
Writes must be performed on all replicas of a data item


Advantages of scheme:
Simple implementation
Simple deadlock handling
Disadvantages of scheme are:

Bottleneck: lock manager site becomes a bottleneck
Vulnerability: system is vulnerable to lock manager site failure.

Distributed Lock Manager

In this approach, functionality of locking is implemented by lock managers at each site.Lock managers control access to local data items.But special protocols may be used for replicas
Advantage: work is distributed and can be made robust to failure
Disadvantage:   deadlock detection is more complicated
Lock managers cooperate for deadlock detection

Several variants of this approach

Primary copy
Majority protocol
Biased protocol
Quorum consensus

Primary Copy

Choose one replica of data item to be the primary copy.
Site containing the replica is called  the primary site for that data item
Different data items can have different primary sites
When a transaction needs to lock a data item Q, it requests a lock at the primary site of Q.
Implicitly gets lock on all replicas of the data item
Benefit
Concurrency control for replicated data handled similarly to unreplicated data - simple implementation.
Drawback
If the primary site of  Q fails, Q is inaccessible even though other  sites containing a replica may be accessible.

Majority Protocol
Local lock manager at each site administers lock and unlock requests for data items stored at that site.
When a transaction wishes to lock an unreplicated data item Q residing at site Si, a message is sent to Si ‘s lock manager.
If Q is locked in an incompatible mode, then the request is delayed until it can be granted.
                        When the lock request can be granted, the lock manager sends a message back to the initiator indicating that the lock request has been granted.
In case of replicated data.If Q is replicated at n sites, then a lock request message must be sent to more than half of the n sites in which Q is stored.The transaction does not operate on Q until it has obtained a lock on a majority of the replicas of Q.

                        When writing the data item, transaction performs writes on all replicas.
Benefit
Can be used even when some sites are unavailable
details on how handle writes in the presence of site failure later
Drawback
Requires 2(n/2 + 1) messages for handling lock requests, and (n/2 + 1) messages for handling unlock requests.Potential for deadlock even with single item - e.g., each of 3 transactions may have locks on 1/3rd of the replicas of a data.

Biased Protocol

Local lock manager at each site as in majority protocol, however, requests for shared locks are handled differently than requests for exclusive locks.
Shared locks. When a transaction needs to lock data item Q, it simply requests a lock on Q from the lock manager at one site containing a replica of Q.
Exclusive locks. When transaction needs to lock data item Q, it requests a lock on Q from the lock manager at all sites containing a replica of Q.
Advantage - imposes less overhead on read operations.
Disadvantage - additional overhead on writes

Timestamping

Timestamp based concurrency-control protocols can be used in distributed systems. Each transaction must be given a unique timestamp

Main problem:  how to generate a timestamp in a distributed fashion
Each site generates a unique local timestamp using either a logical counter or the local clock.
Global unique timestamp is obtained by concatenating the unique local timestamp with the unique identifier.
A site with a slow clock will assign smaller timestamps
Still logically correct: serializability not affected
But: “disadvantages” transactions
To fix this problem
Define within each site Sa logical clock  (LCi), which generates the unique local timestamp
Require that Si advance its logical clock whenever a request is received from a transaction Ti with timestamp < x,y> and x is greater that the current value of LCi.
In this case, site Sadvances its logical clock to the value x + 1.


No comments:

Post a Comment