ABOUT SNS BCA: RDBMS

UNIT II

Relational Model Structure of Relational Databases Basic structure

Attribute Types:

Each attribute of a relation has a name.

The set of allowed values for each attribute is called the domain of the attribute.

Attribute values are (normally) required to be atomic; that is, indivisible.

E.g. the value of an attribute can be an account number,
but cannot be a set of account numbers

Domain is said to be atomic if all its members are atomic.

The special value null is a member of every domain.

The null value causes complications in the definition of many operations.

We shall ignore the effect of null values in our main presentation and consider their effect later.

Relation Schema

Formally, given domains D₁, D₂, …. D_n a relation r is a subset of
D₁ x D₂ x … x D_n
Thus, a relation is a set of n-tuples (a₁, a₂, …, a_n) where each a_i Î D_i

Schema of a relation consists of:

attribute definitions

name

type/domain

integrity constraints

Relation Instance

The current values (relation instance) of a relation are specified by a table

An element t of r is a tuple, represented by a row in a table

Order of tuples is irrelevant (tuples may be stored in an arbitrary order)

Database

A database consists of multiple relations

Information about an enterprise is broken up into parts, with each relation storing one part of the informationE.g.

account :    information about accounts
depositor :   which customer owns which account
customer :   information about customers

Why Split Information Across Relations?

Storing all information as a single relation such as
bank(account_number, balance, customer_name, ..)
results in repetition of information

e.g.,if two customers own an account (What gets repeated?)

The need for null values

e.g., to represent a customer without an account

Normalization theory deals with how to design relational schemas

Keys

Let K Í R

K is a superkey of R if values for K are sufficient to identify a unique tuple of each possible relation r(R)

by “possible r ” we mean a relation r that could exist in the enterprise we are modeling.

Example: {customer_name, customer_street} and
{customer_name}
are both superkeys of Customer, if no two customers can possibly have the same name

In real life, an attribute such as customer_id would be used instead of customer_name to uniquely identify customers, but we omit it to keep our examples small, and instead assume customer names are unique.

K is a candidate key if K is minimal
Example: {customer_name} is a candidate key for Customer, since it is a superkey and no subset of it is a superkey.

Primary key: a candidate key chosen as the principal means of identifying tuples within a relation Should choose an attribute whose value never, or very rarely, changes.

E.g. email address is unique, but may change

Foreign Keys:A relation schema may have an attribute that corresponds to the primary key of another relation. The attribute is called a foreign key.

E.g. customer_name and account_number attributes of depositor are foreign keys to customer and account respectively.

Only values occurring in the primary key attribute of the referenced relation may occur in the foreign key attribute of the referencing relation.

Query Languages

Language in which user requests information from the database.

Categories of languages

Procedural

Non-procedural, or declarative

“Pure” languages:

Relational algebra

Tuple relational calculus

Domain relational calculus

Pure languages form underlying basis of query languages that people use.

Six basic operators of Relational algebra

select: s

project: Õ

union: È

set difference: –

Cartesian product: x

rename: r

The operators take one or two relations as inputs and produce a new relation as a result.

select, project and rename operators are called unary operators, because they are operating on one relation.

The other three operators are called binary operators, because they are operating on pairs of relationship.Rename Operation

Allows us to name, and therefore to refer to, the results of relational-algebra expressions.

Allows us to refer to a relation by more than one name

Example:

r _x (E)

returns the expression E under the name X

If a relational-algebra expression E has arity n, then

returns the result of expression E under the name X, and with the

attributes renamed to A₁, A₂, …., A_n.

Other Relational Languages

Tuple Relational Calculus

Domain Relational Calculus

Query-by-Example (QBE)

Datalog

Tuple Relational Calculus

A nonprocedural query language, where each query is of the form

{t | P (t ) }

It is the set of all tuples t such that predicate P is true for t

t is a tuple variable, t [A ] denotes the value of tuple t on attribute A

t Î r denotes that tuple t is in relation r

P is a formula similar to that of the predicate calculus

Predicate Calculus Formula

1 Set of attributes and constants

2. Set of comparison operators: (e.g., <, £, =, ¹, >, ³)

3. Set of connectives: and (Ù), or (v)‚ not (Ø)

4. Implication (Þ): x Þ y, if x if true, then y is true

x Þ y º Øx v y

5. Set of quantifiers:

$ t Î r (Q (t )) º ”there exists” a tuple in t in relation r
such that predicate Q (t ) is true

"t Î r (Q (t )) º Q is true “for all” tuples t in relation r

Banking Example

branch (branch_name, branch_city, assets )

customer (customer_name, customer_street, customer_city )

account (account_number, branch_name, balance )

loan (loan_number, branch_name, amount )

depositor (customer_name, account_number )

borrower (customer_name, loan_number )

Example Queries

Find the loan_number, branch_name, and amount for loans of over $1200

{t | t Î loan Ù t [amount ] > 1200}

Find the loan number for each loan of an amount greater than $1200

{t | $ s Î loan (t [loan_number ] = s [loan_number ] Ù s [amount ] > 1200)}

Notice that a relation on schema [loan_number ] is implicitly defined by

the query

Safety of Expressions

It is possible to write tuple calculus expressions that generate infinite relations.

For example, { t | Ø t Î r } results in an infinite relation if the domain of any attribute of relation r is infinite

To guard against the problem, we restrict the set of allowable expressions to safe expressions.

An expression {t | P (t )} in the tuple relational calculus is safe if every component of t appears in one of the relations, tuples, or constants that appear in P

NOTE: this is more than just a syntax condition.

E.g. { t | t [A] = 5 Ú true } is not safe --- it defines an infinite set with attribute values that do not appear in any relation or tuples or constants in P.

Domain Relational Calculus

A nonprocedural query language equivalent in power to the tuple relational calculus

Each query is an expression of the form:

{ < x₁, x₂, …, x_n > | P (x₁, x₂, …, x_n)}

x₁, x₂, …, x_n represent domain variables

P represents a formula similar to that of the predicate calculus

Example Queries

Safety of Expressions

for all values The expression:

{ < x₁, x₂, …, x_n > | P (x₁, x₂, …, x_n)}

is safe if all of the following hold:

All values that appear in tuples of the expression are values from dom (P ) (that is, the values appear either in P or in a tuple of a relation mentioned in P ).

For every “there exists” subformula of the form $ x (P₁(x )), the subformula is true if and only if there is a value of x in dom (P₁) such that P₁(x ) is true.

For every “for all” subformula of the form "_x (P₁ (x )), the subformula is true if and only if P₁(x ) is true es x from dom (P₁).

Relational database design

The goal of a relational database design is to generate a set of relation schemas that allows us to store information without unneccassary redundancy and allow us to retrieve information easily.

Pitfalls in Relational database design

Among the undesirable properties that a bad design may have are.

1 Pitfalls in Relational-Database Design

Dependency Preservation:

There is another goal in relational-database design: dependency preservation.

Let F be a set of functional dependencies on a schema R, and let R1,R2……Rn be a decomposition of R. The restriction of F to Ri is the set Fi of all functional dependencies in F⁺ that include only attributes of Ri

Let F^’=F1∪ F2∪…… ∪ Fn. F^’ is a set of functional dependencies on schema R. if F^’+= F⁺ is true, then every dependency in F is logically implied by F’, and, if we verify that F^’is satisfied, we have verified that F is satisfied. We say that a decomposition having the property F^’+= F⁺ is a dependency-preserving decomposition.

Boyce/codd normal form(BCNF):

A relvar is in BCNF if and only if every nontrivial, left-irreducible FD has a candidate key as its determinant.

A relation schema R is in BCNF with respect to a set F of functional dependencies if for all functional dependencies in F⁺ of the form a→b, where aÍR and bÍR, at least one of the following holds:

ⅰ a→b is a trivial functional dependency (that is bÍ a)

ⅱ a is a superkey for schema R

Examples:

① customer-schema=(customer-name,customer- street, customer-city)

customer-name→{customer-street,customer-city}

customer-name is a candidate key

customer-schema is in BCNF

② branch-schema=(branch-name,assets,branch-city)

branch-name→ {assets,branch-city}

branch-schema is a candidate key

branch-schema is in BCNF

③ loan-info-schema=(branch-name, customer-name,loan-number, amount) loan-number→{amount,branch-name}

loan-number is not a candidate key

(Downtown,John Bell,L-44,1000)

(Downtown,Jane Bell,L-44,1000)

loan-number → {amount,branch-name} is not a trivial dependency.

loan-info-schema is not in BCNF

loan-info-schema=(branch-name, customer-name,loan-number, amount)

loan-info-schema is not in BCNF. Redundancy exist, so we should decompose it.

loan-schema=(loan-number, branch-name, amount)

borrower-schema=(customer-name,loan-number)

loan-number→{amount,branch-name}

loan-number is a candidate key of loan-schema

loan-schema and borrower-schema are both in BCNF

If R is not in BCNF, we can decompose R into a collection of BCNF schemas R1,R2……Rn. Which not only a BCNF decomposition, but also a lossless-join decomposition.

Algorithm:

result:={R};

done:=false;

compute F⁺;

while (not done) do

if (there is a schema Ri in result that is not in BCNF)

then begin

let aÒ b be a nontrivial functional dependency that

holds on Ri such that aÒ Ri is not in F⁺ ,and a∩b =F ;

result:=(result-Ri) ∪(Ri- b) ∪(a ,b);

end

else done:=true;

Example:

Lending-schema={branch-name,branch-city, assets,customer-name,loan-number,amount}

A candidate key for this schema is:

{loan-number,customer-name}

BCNF decomposition :

Example:

Banker-schema={branch-name,customer-name,banker-name}

banker-name is not a key for Banker-schema.

Thus, Banker-schema is not in BCNF. Decompose it.

Banker-branch-schema={banker-name,branch-name}

Customer-banker-schema={customer-name,banker-name}

F1={banker-name→branch-name}

F2= F

{branch-name,customer-name}→banker-name in F⁺

{branch-name,customer-name}→banker-name not in {F1∪F2}⁺

{F1∪F2}⁺<>F⁺ and the decomposition is not dependency preserving

This example demonstrates that not every BCNF decomposition is dependency preserving. Moreover, it demonstrates that we cannot always satisfy all three design goals:

① BCNF

② Lossless join

③ Dependency preservation

3NF decomposition A relation schema R is in 3NF with respect to a set F of functional dependencies if, for all functional dependencies in F⁺ of the form a→b, where aÍR and bÍR, at least one of the following holds:

① a→b is a trivial functional dependency (that is bÍ a)

② a is a superkey for schema R

③ each attribute A in b-a is contained in a candidate key for R.

Example:

Banker-schema={branch-name,customer-name,banker-name}

banker-name is not a key for Banker-schema.

Thus, Banker-schema is not in BCNF.

{branch-name,customer-name} is a candidate key for Banker-schema. Banker-schema is in 3NF.

If R is not in 3NF, we can decompose R into a collection of 3NF schemas R1,R2……Rn, which is not only a lossless-join decomposition, but also a dependency-preserving decomposition.

Example:

Relvar R=(O,I,S,Q,B,D) satisfies the following FDs.

F={S→D, I→B, IS →Q, B →O}

Give a lossless-join, dependency-preserving decomposition into 3NF of the schema R.

F=FC

L: S,I R: D,Q,O LR: B

(IS)⁺=ISDBQO=U have only one candidate key.

r={SD,IB,ISQ,BO}

comparison of BCNF and 3NF

Advantage of 3NF:

It is always possible to obtain a 3NF design without sacrificing a lossless join or dependency preservation.

Disadvantage of 3NF:

If we do not eliminate all transitive dependencies, we may have to use null values to represent some of the possible meaningful relationships among data items, and there is the problem of repetition of information.

If a schema is in 3NF but not in BCNF,then redundancy will occur.uniquely the value for another set of attributes.

Distributed Databases:

A distributed database system consists of loosely coupled sites that share no physical component

Database systems that run on each site are independent of each other

Transactions may access data at one or more sites

Distributed Transactions

Transaction may access data at several sites. Each site has a local transaction manager responsible for Maintaining a log for recovery purposes Participating in coordinating the concurrent execution of the transactions executing at that site.

Each site has a transaction coordinator, which is responsible for Starting the execution of transactions that originate at the site.

Distributing sub trans actions at appropriate sites for execution. Coordinating the termination of each transaction that originates at the site, which may result in the transaction being committed at all sites or aborted at all sites.

System Failure Modes

Failures unique to distributed systems:

Failure of a site.

Loss of massages

Handled by network transmission control protocols such as TCP-IP

Failure of a communication link

Handled by network protocols, by routing messages via alternative links

Network partition

A network is said to be partitioned when it has been split into two or more subsystems that lack any connection between them

Note: a subsystem may consist of a single node

Network partitioning and site failures are generally indistinguishable.

Commit Protocols

Commit protocols are used to ensure atomicity across sites

a transaction which executes at multiple sites must either be committed at all the sites, or aborted at all the sites.

Not acceptable to have a transaction committed at one site and aborted at another

The two-phase commit (2PC) protocol is widely used

The three-phase commit (3PC) protocol is more complicated and more expensive, but avoids some drawbacks of two-phase commit protocol. This protocol is not used in practice.

Two Phase Commit Protocol (2PC)

Assumes fail-stop model – failed sites simply stop working, and do not cause any other harm, such as sending incorrect messages to other sites.

Execution of the protocol is initiated by the coordinator after the last step of the transaction has been reached.

The protocol involves all the local sites at which the transaction executed

Let T be a transaction initiated at site S_i, and let the transaction coordinator at S_i be C_i

Phase 1: Obtaining a Decision

Coordinator asks all participants to prepare to commit transaction T_i.

C_i adds the records <prepare T> to the log and forces log to stable storage

sends prepare T messages to all sites at which T executed

Upon receiving message, transaction manager at site determines if it can commit the transaction if not, add a record <no T> to the log and send abort T message to C_i

if the transaction can be committed, then:

add the record <ready T> to the log

force all records for T to stable storage

send ready T message to C_i

Phase 2: Recording the Decision

T can be committed of C_i received a ready T message from all the participating sites: otherwise T must be aborted.

Coordinator adds a decision record, <commit T> or <abort T>, to the log and forces record onto stable storage. Once the record stable storage it is irrevocable (even if failures occur) Coordinator sends a message to each participant informing it of the decision (commit or abort) Participants take appropriate action locally.

Concurrency Control

Modify concurrency control schemes for use in distributed environment.

We assume that each site participates in the execution of a commit protocol to ensure global transaction automicity.

We assume all replicas of any item are updated

Single-Lock-Manager Approach

System maintains a single lock manager that resides in a single chosen site, say S_i When a transaction needs to lock a data item, it sends a lock request to S_i and lock manager determines whether the lock can be granted immediately

If yes, lock manager sends a message to the site which initiated the request

If no, request is delayed until it can be granted, at which time a message is sent to the initiating site

The transaction can read the data item from any one of the sites at which a replica of the data item resides.

Writes must be performed on all replicas of a data item

Advantages of scheme:

Simple implementation

Simple deadlock handling

Disadvantages of scheme are:

Bottleneck: lock manager site becomes a bottleneck

Vulnerability: system is vulnerable to lock manager site failure.

Distributed Lock Manager

In this approach, functionality of locking is implemented by lock managers at each site.Lock managers control access to local data items.But special protocols may be used for replicas

Advantage: work is distributed and can be made robust to failure

Disadvantage: deadlock detection is more complicated

Lock managers cooperate for deadlock detection

Several variants of this approach

Primary copy

Majority protocol

Biased protocol

Quorum consensus

Primary Copy

Choose one replica of data item to be the primary copy.

Site containing the replica is called the primary site for that data item

Different data items can have different primary sites

When a transaction needs to lock a data item Q, it requests a lock at the primary site of Q.

Implicitly gets lock on all replicas of the data item

Benefit

Concurrency control for replicated data handled similarly to unreplicated data - simple implementation.

Drawback

If the primary site of Q fails, Q is inaccessible even though other sites containing a replica may be accessible.

Majority Protocol

Local lock manager at each site administers lock and unlock requests for data items stored at that site.

When a transaction wishes to lock an unreplicated data item Q residing at site S_i, a message is sent to S_i ‘s lock manager.

If Q is locked in an incompatible mode, then the request is delayed until it can be granted.

When the lock request can be granted, the lock manager sends a message back to the initiator indicating that the lock request has been granted.

In case of replicated data.If Q is replicated at n sites, then a lock request message must be sent to more than half of the n sites in which Q is stored.The transaction does not operate on Q until it has obtained a lock on a majority of the replicas of Q.

When writing the data item, transaction performs writes on all replicas.

Benefit

Can be used even when some sites are unavailable

details on how handle writes in the presence of site failure later

Drawback

Requires 2(n/2 + 1) messages for handling lock requests, and (n/2 + 1) messages for handling unlock requests.Potential for deadlock even with single item - e.g., each of 3 transactions may have locks on 1/3rd of the replicas of a data.

Biased Protocol

Local lock manager at each site as in majority protocol, however, requests for shared locks are handled differently than requests for exclusive locks.

Shared locks. When a transaction needs to lock data item Q, it simply requests a lock on Q from the lock manager at one site containing a replica of Q.

Exclusive locks. When transaction needs to lock data item Q, it requests a lock on Q from the lock manager at all sites containing a replica of Q.

Advantage - imposes less overhead on read operations.

Disadvantage - additional overhead on writes

Timestamping

Timestamp based concurrency-control protocols can be used in distributed systems. Each transaction must be given a unique timestamp

Main problem: how to generate a timestamp in a distributed fashion

Each site generates a unique local timestamp using either a logical counter or the local clock.

Global unique timestamp is obtained by concatenating the unique local timestamp with the unique identifier.

A site with a slow clock will assign smaller timestamps

Still logically correct: serializability not affected

But: “disadvantages” transactions

To fix this problem

Define within each site S_ia logical clock (LC_i), which generates the unique local timestamp

Require that S_iadvance its logical clock whenever a request is received from a transaction Ti with timestamp < x,y> and x is greater that the current value of LC_i.

In this case, site S_iadvances its logical clock to the value x + 1.

Monday, 27 October 2014

RDBMS

No comments:

Post a Comment

Total Pageviews

Pages

About