<-- Return to the Lecture Guide

 

Distributed Databases

What is a DDBMS?

  • A Distributed DB is a related collection of data just like a normal DB, but physically distributed over a network.

  • The data is split into “fragments” on separate machines, each running a local DBMS.

  • A Distributed DBMS is the software that manages distribution of data and processing in a fashion that is invisible to users.

  • Local DBMSs can access local data autonomously, or access remote data through the DDBMS and other local DBMSs.

  • Apps that only use local data are called “Local Applications”

  • Apps that access remote data are called “Global Applications”

  • To be a DDBMS, every local DBMS must participate in at least one Global App.

  • Homogeneous DDBMSs use the same DBMS on the same platform on each site

  • Heterogeneous DDBMSs use different DBMSs/Platforms and require gateways or other middle-ware to convert queries/data models between sites

Advantages of a DDBMS:

  • Realistically reflects decentralised organisational structures

  • If well designed can strike an optimal balance between local speed and global access

  • More failure-proof than a single DBMS

  • Shared processing and I/O leads to better performance

  • Scalability

Disadvantages of a DDBMS:

  • Increased complexity and higher cost

  • Networks compromises security

  • Validity, consistency and integrity control becomes complicated

  • It’s a new technology with few standards/best practices

Note:

  • A DDBMS is not the same as “distributed processing” which is centralised DB accessible through a network (rather than the data itself being distributed)

  • A DDBMS is not always the same as a “parallel DBMS” which is a single DBMS using multiple processors/multiple disks

DDBMS Components:

  • Global External Schema
    User Views. Provides logical data independence

  • Global Conceptual Schema
    Logical description of entire DB, including entities and relationships, constraints, domains, security etc. Provides physical data independence

  • Fragmentation and Allocation Schema
    Describes how the data is partitioned and where it is stored

  • 3 Tier Schemas for each Local DBMS
    As normal, but instead of external schema, the top level is a mapping schema designed to be used to communicate with the frag/alloc schema above.

  • Local DBMS (LDBMS) – normal DBMS controlling local data

  • Data Communications (DC) – network software

  • Global System Catalogue – same as a normal systems catalogue, plus frag/alloc information

  • Distributed DBMS (DDBMS) – main functional unit – transaction management, backup/recovery etc.

structure

 

 

When designing a DDBMS we seek to maximise:

  • Locality of reference (letting local apps hitting local data)

  • Reliability and Accessibility (by strategically replicating the data)

  • System Performance (by avoiding over/under utilisation of resources)

When designing a DDBMS we seek to minimise:

  • Cost vs. Storage (affects replication strategy)

  • Communication Costs (taking into account consistency maintenance)

Design Strategies with which to approach the problem:

  • Centralised Data Storage – this is not a DDB at all. No replication so no additional storage costs.

  • Fragmented Data Storage – no replication but data is split up and distributed. If done correctly, locality of reference will be very high as is performance and storage and communications costs are low. Reliability and accessibility are only ok.

Design Strategies with which to approach the problem:

  • Fully Replicated Storage – every site hosts a full copy of the DB. Very high storage costs, improved performance for read trans/low comm costs. Write trans low performance/high comms cost.

  • Partially Replicated Storage – combination of above methods. It makes the most sense – if done right. You have high locality, high reliability/access, good all-round performance, ok storage costs and low comm. costs.

Fragmentation:

  • There are a few reasons why it makes sense to fragment data:

    • By keeping data not required by local apps separate, we improve security

    • By maximising locality of reference, we improve efficiency

    • Since most apps only use subsets of relations, it makes sense to break the relations into subsets for storage across the network.

    • Done right, we open the door for parallel processing

  • There are drawbacks, like increased integrity administration and performance hits for poorly fragmented data sets

  • For a fragmentation effort to be viable: it must be complete (all items in a relations must appear in at least one fragment of the relation), functional dependencies must be preserved, and (other than primary keys) fragments should be disjoint.

  • Types of Fragmentation:

    • Horizontal – break up relation into subsets of the tuples in that relation, based on a restriction on one or more attributes. E.g. – we could break up a table with student info into one subset for undergrads and one subset for postgrads.

    • Vertical – breaking up a relation into subsets of attributes. E.g. – breaking up a hypothetical student table into grade/course related columns and contact/personal related columns.

    • Mixed – fragments the data multiple times, in different ways. We could do our postgrad/undergrad split and then our grades/course split to each of the fragments

    • Derived – fragmenting a relation to correspond with the fragmentation of another relation upon which the 1st relation is dependent in some way.

Transparency:

  • Distribution Transparency allows users to ignore the physical fragmentation of data, to varying degrees:

    • Frag. Transparency is high level transparency where a user could write “SELECT * FROM Student WHERE year = 2” without needing to specify what fragment of the Student relation contains the data, nor where that fragment is stored.

    • Location Transparency – mid level transparency where a user would need to write “SELECT * FROM S14 WHERE year = 2” where S14 is the relevant fragment of the Student relation, but still wouldn’t need to say where the fragment is stored.

    • Local Mapping Transparency – low level transparency where a user would need to write “SELECT * FROM S14 AT SITE 7 WHERE year = 2” where S14 is the relevant fragment of the Student relation and SITE 7 is where the fragment is physically located.

    • Distribution transparency is supported by a database name server which aliases unique database object identifiers with user friendly names.

    • Transaction transparency ensures integrity and consistency vis-ŕ-vis multi-site transactions, concurrent users and DB failure.

    • Local transactions and remote single site transactions are handled without additional difficulty, but multi-site transactions must be broken into sub-transactions (for each site) and the independence, atomicity and durability of a centralised DBMS.

    • Performance Transparency simply means DDBMS perform at the same level as a normal DBMS. 

    • This puts a lot of burden on the distributed query processor, which decides what fragment to hit, which copy (if replicated), and which location to use, as well as calculating I/O time, CPU time and communication costs.
      DBMS Transparency means that a heterogeneous DDBMS will behave like a homogenous DDBMS.

Replication:

  • Advantages – better access and reliability, improved performance

  • Disadvantages – storage and consistency

  • Replication Options:

    • Synchronous Updates – all copy updates are part of one transactions commit phase - a lot of admin overhead, comm. Costs and opportunity for failure but often necessary

    • Asynchronous Updates – periodic updates of all copies based on one mater copy – violates the idea of data independence but can be useful in situations where the cost of synch updates are unwarranted.

  • What we expect from replication management:

    • Copy data, either synch or asynch from one db to another

    • Scalability and mappability (in heterogeneous environments)

    • Replication of procedures, indexes, schema etc.

    • tools for DBAs to manage replication

  • Replicated data ownership models:

    • Master/slave: publish and subscribe model – authoritive changes are made only to the master site and published to the slaves asynchronously

      • •Asynchronously replicated data is owned by one (master) site, and can be updated by only that site. •Using ‘publish-and-subscribe’ metaphor, master site makes data available. •Other sites ‘subscribe’ to data owned by master site, receiving read-only copies. •Potentially, each site can be master site for non-overlapping data sets, but update conflicts cannot occur.

    • Workflow: like M/S, but Master status moves from site to site depending on the task at hand

      • •Avoids update conflicts, while providing more dynamic ownership model. •Allows right to update replicated data to move from site to site. •However, at any one moment, only ever one site that may update that particular data set. •Example is order processing system, which follows series of steps, such as order entry, credit approval, invoicing, shipping, and so on.

    • Update-Anywhere: Symmetric, shared write authority for all replicas. 

      • •Creates peer-to-peer environment where multiple sites have equal rights to update replicated data. •Allows local sites to function autonomously, even when other sites are not available. •Shared ownership can lead to conflict scenarios and have to employ methodology for conflict detection and resolution.

    • Synchronous: We can synch up our replicas using regularly scheduled “snapshots” of the master data, or database triggers (when X happens, do Y)

Dates Twelve Rules for a DDBMS

(0) Fundamental Principle To the user a distributed system should look exactly like a non-distributed system

(1) Local Autonomy In this context, autonomy means:

Local data is locally owned and managed

Local operations remain purely local

All operations at a given site are controlled by that site

(2) No reliance on a central site

(3)  Continuous Operation   there should never be a need for a planned   system shutdown

(4)  Location Independence -  access anywhere

(5)  Fragmentation Independence – access the   data no matter how fragmented

(6) Replication Independence – user unaware

(7) Distributed Query Processing – data from more   than one site

(8) Distributed Transaction Processing

(9) Hardware Independence

(10) Operating System Independence

(11) Network Independence

(12) Database Independence – support different underlying data models

<-- Return to the Lecture Guide