< Data Management in Bioinformatics
Chapter Navigation
Top E/R Theory - Normalization - Data Querying - Integrating SQL and Programming Languages

Functional Dependency (FD)

Introduction of FD

gene_idnameannotation
.........
g23hsp39XX protein
g24hsp39XX protein

Functional Dependency: X->Y if two tuples agree on X, then they agree on Y. e.g. "gene_id -> name" and "gid-> annotation" are FDs, but "name->gene_id" and "name->annotation" are NOT FDs.

Relationship between key and FD:

A set of attribute{A1, A2, ...At} is a key for relation R if

(1) {A1, A2, ...At} functionally determines all other attributes;

(2) when no proper subset of {A1, A2, ...At} functionally determin all other attributes, it is called superkey (or Candidate key?)

rules of FDs:

Observation:

        Correct: 
               X->Y, A->B:   XA->BY;   
               X->AY:   X->A, X->Y;
               X->Y, A->Y: XA->Y;
        Incorrect:
               XY->A: X->A, Y->A;

Formal inference rules for functional dependencies:

        Armstrong’s axioms:
               1. Reflexivity: If Y  X then X -> Y (called trivial FD);
               2. Transitivity: If X -> Y and Y -> Z then X -> Z;
               3. Augmentation: If X -> Y then XZ -> YZ where Z is a set of attributes.

Where do FDs come from?

looking either at

              (1) Domain Knowledge (the meaning of the attributes); 
              (2) the data; 
abc
123
426

b->a is not FD, but a->b is a valid FD.

BCNF

Formal way to break tables:

genenameannotationexperiment idexperiment descriptionexperiment level
..................

         R1 (gid, name, annotation);    gid->name annotation;
         R2 (exp id, exp desc);         exp id -> exp desc;
         R3 (gid, expid, exp level);   gid, expid -> exp level;

A relation R is in BCNF if for all dependencies X -> Y , at least one of the following holds:

• X -> Y is a trivial FD (Y X)

• X is a superkey for R

3NF

Formal Definition: a relation R is in 3NF if for all dependencies X® Y in F+, at least one of the following holds:

• X -> Y is a trivial FD (Y X)

• X is a superkey for R

• Y a candidate key for R

Comparison of BCNF and 3NF

Removes
Redundancy
Preserves
FDs
Lossless
Decomposition
BCNF YesNoYes
3NF NoYesYes

Lossy Decomposition

Suppose we have a table such as the one below.

abc
123
426

into the following "normalized" tables

ab
12
42
bc
23
26

This means we followed

b → a or
b → c

as our FD for the decomposition of the original table. (Remember, BCNF decomposes X → Y to the sets {X, Y} and {everything except Y}). However, notice that column b has non-unique values (two of digit 2). When we recombine these two tables in a join, we wind up with a table in which the original relationships are lost:

abc
123
126
423
426

We declare this as a lossy decomposition, because the FDs assumed by the decomposed tables do not hold true (both b → a and b → c are false). Thus, BCNF can not remove all forms of redundancy. We will need to use another model, Multivalued Dependency (MD), to aid us in decomposing tables properly for normalization.

Multivalued Dependency (MD)

Let's explore a concrete scenario. We'd like to represent belongings of affluent students. We create a multivalued dependency (MD) of

SID →→ car

read as, "A given student ID can have many cars."

Compare this to an FD like

SID → name

which we read as, "A given SID can have at most one name."

Suppose we also had the following MD:

SID →→ clothes

We can get the following table:

SIDcarclothes
1Hondajeans
1Toyotakhakis
1Toyotajeans
1Hondakhakis

Perhaps we want to decompose this table further. We decompose along the MD of

SID →→ car

This leaves us with {SID, clothes}, leaving us with the following tables

SID →→ car {SID, clothes}
SIDcar
1Honda
1Toyota
sidclothes
1jeans
1khakis

We can't actually do this, because these are not FDs! Neither SID →→ car or SID →→ clothes have a superkey. The key must be all three attributes. We need new rules for decomposing MDs.

Rules for MDs

Spotting trivial MDs

We say that an MD X Y is trivial in R if X Y = {all attributes} .

For example, consider the following relation R1 where we have the MD a b:

R1
ab
12
13
22

Because we can have many bs to each a, we can not generate data to violate the MD.

Now consider the following relation R2:

R3
abc
126
137
226

In this case, a b is non-trivial, because the relation is three-way.

Non-trivial MDs come in pairs

In fact, a b has a partner MD which is also non-trivial: a c

Here are the listings of all non-trivial pairs of MDs in R2

a b
a c
b a
b c
c a
c b

Compare this to examples of trivial MDs for R2:

a bc
ab c
b ac
bc a
c ab
ca b

Implications of FDs do not hold for MDs

In FDs, we know

However, this same implication does not exist for MDs. That is,

MDs are actually about independence

Given the MDs

a b
and
a c

We know that, for each a

  • there can be many a
  • there can be many a
  • and, bs are independent of cs with respect to a, which we represent symbolically as [NOTE: the actual first binary relation symbol should be \Perp instead of \perp, but this symbol is not supported by Mediawiki.]

Every FD is an MD

By definition, X Y X Y. If X has at most one Y, then it satisfies the condition of having many Y.

For example, given the following relation

ab

and an FD of a b, then we also have a trivial MD of a b.

Now, consider the FD a b with following relation:

abcd

In this case, we actually have two non-trivial MDs:

a b
a cd

(Remember, non-trivial MDs come in pairs.)

This article is issued from Wikibooks. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.