Data Warehousing - Backup



A data warehouse holds a large volume of data and is a very complex system, so it is important to keep a backup of all the data that can be used for recovery when required. This chapter discusses the issues involved in designing a backup strategy.

Backup Terminologies

Before proceeding further, we should know some of the backup terminology discussed below.

  • Complete backup - In a complete backup, the entire database is backed up at the same time. This backup includes all the database files, control files, and journal files.

  • Partial backup - A partial backup is any backup that is not a complete backup of the database. Partial backups are very useful in large databases because they allow a strategy whereby various parts of the database are backed up in a round-robin fashion on a day-by-day basis, so that the whole database is effectively backed up once a week.

  • Cold backup - A cold backup is taken while the database is completely shut down. In a multi-instance environment, all the instances should be shut down.

  • Hot backup - A hot backup is taken while the database engine is up and running. The requirements for hot backups vary from RDBMS to RDBMS. Hot backups are extremely useful because the database stays available while the backup runs.

  • Online backup - This is the same as a hot backup.
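The round-robin strategy described under partial backup can be sketched in a few lines of Python. This is only an illustration: the function name `round_robin_schedule` and the tablespace names are hypothetical, and a real schedule would also weigh the size of each part.

```python
from itertools import cycle

def round_robin_schedule(parts, days):
    """Assign each database part to a backup day in round-robin order."""
    schedule = {day: [] for day in days}
    for part, day in zip(parts, cycle(days)):
        schedule[day].append(part)
    return schedule

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
# Hypothetical tablespace names, used only for illustration.
PARTS = [f"tablespace_{i:02d}" for i in range(1, 11)]

schedule = round_robin_schedule(PARTS, DAYS)
for day in DAYS:
    print(day, schedule[day])
```

Each part is backed up on exactly one day, so after seven days every part of the database has been covered once.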

Hardware Backup

It is important to decide which hardware to use for the backup. The choice of hardware places an upper bound on the speed at which a backup can be processed. The speed of backup and restore depends not only on the hardware being used but also on how the hardware is connected, the bandwidth of the network, the backup software, and the speed of the server's I/O system. This section discusses some of the available hardware choices and their pros and cons. These choices are as follows.

  • Tape Technology

  • Disk Backups
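The point that backup speed depends on every stage in the path, not only on the drives, can be illustrated with a rough calculation. The sketch below assumes the slowest stage sets the overall rate; the function names and all figures are illustrative assumptions, and the effect of the backup software is not modelled.

```python
def effective_backup_rate(drive_rate_mb_s, drive_count, network_mb_s, server_io_mb_s):
    """Upper bound on backup throughput: the slowest stage in the path wins."""
    aggregate_drive_rate = drive_rate_mb_s * drive_count
    return min(aggregate_drive_rate, network_mb_s, server_io_mb_s)

def backup_hours(data_gb, rate_mb_s):
    """Hours needed to move data_gb at rate_mb_s (1 GB taken as 1024 MB)."""
    return data_gb * 1024 / rate_mb_s / 3600

# All figures below are illustrative assumptions, not measurements.
rate = effective_backup_rate(drive_rate_mb_s=3, drive_count=8,
                             network_mb_s=10, server_io_mb_s=40)
print(f"effective rate: {rate} MB/s")
print(f"500 GB backup: {backup_hours(500, rate):.1f} hours")
```

In this example the eight drives could absorb 24 MB/s between them, but the network limits the backup to 10 MB/s, so adding more drives would not help.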

Tape Technology

Tape choices can be categorized as follows.

  • Tape media

  • Standalone tape drives

  • Tape stackers

  • Tape silos

Tape Media

There are several varieties of tape media. Some tape media standards are listed in the table below:

Tape Media    Capacity    I/O rates
DLT           40 GB       3 MB/s
3490e         1.6 GB      3 MB/s
8 mm          14 GB       1 MB/s
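To see what the capacities and I/O rates in the table mean in practice, the following sketch estimates how many tapes and how many hours a backup would take on a single drive. The 200 GB data size and the function name `tapes_and_hours` are assumptions made for illustration; compression and tape-change time are ignored.

```python
import math

# Capacity (GB) and I/O rate (MB/s) taken from the table above.
TAPE_MEDIA = {
    "DLT": (40, 3),
    "3490e": (1.6, 3),
    "8 mm": (14, 1),
}

def tapes_and_hours(data_gb, capacity_gb, rate_mb_s):
    """Return (tapes needed, hours on a single drive) for data_gb."""
    tapes = math.ceil(data_gb / capacity_gb)
    hours = data_gb * 1024 / rate_mb_s / 3600
    return tapes, hours

DATA_GB = 200  # hypothetical warehouse size
for name, (capacity_gb, rate_mb_s) in TAPE_MEDIA.items():
    tapes, hours = tapes_and_hours(DATA_GB, capacity_gb, rate_mb_s)
    print(f"{name:6} {tapes:4d} tapes {hours:6.1f} hours")
```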

Other factors that need to be considered are the following:

  • Reliability of the tape medium.

  • Cost of tape medium per unit.

  • Scalability.

  • Cost of upgrades to tape system.

  • Shelf life of tape medium.

Standalone Tape Drives

Tape drives can be connected in the following ways.

  • Direct to the server.

  • As network-available devices.

  • Remotely to another machine.

Issues in Connecting Tape Drives

  • Suppose the server is a 48-node MPP machine. To which nodes do you connect the tape drives? How do you spread them over the server nodes to get optimal performance with the least disruption to the server and the least internal I/O latency?

  • Connecting a tape drive as a network-available device requires the network to be up to the job of the huge data transfer rates needed. Make sure that sufficient bandwidth is available during the time you require it.

  • Connecting tape drives remotely also requires high bandwidth.
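One way to check whether a network or remote link is "up to the job" is to compare the rate the backup needs with the rate the link can realistically deliver. The sketch below is a rough estimate only: the function names, the 70% usable fraction, and the sample figures are all assumptions.

```python
def required_rate_mb_s(data_gb, window_hours):
    """Sustained transfer rate needed to move data_gb within window_hours."""
    return data_gb * 1024 / (window_hours * 3600)

def link_is_sufficient(data_gb, window_hours, link_mb_s, usable_fraction=0.7):
    """True if the usable share of the link covers the required rate.

    usable_fraction is an assumed allowance for protocol overhead and
    other traffic; it is not a measured value.
    """
    return link_mb_s * usable_fraction >= required_rate_mb_s(data_gb, window_hours)

# Hypothetical figures: 300 GB to move in a 6-hour overnight window.
needed = required_rate_mb_s(300, 6)
print(f"required: {needed:.1f} MB/s")
print("12 MB/s link sufficient:", link_is_sufficient(300, 6, link_mb_s=12))
print("25 MB/s link sufficient:", link_is_sufficient(300, 6, link_mb_s=25))
```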

Tape Stackers

A tape stacker is a device that loads multiple tapes into a single tape drive. The stacker dismounts the current tape when it has finished with it and loads the next tape, so only one tape is available to be accessed at a time. Price and capabilities vary, but the common ability is that stackers can perform unattended backups.
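The behaviour of a stacker can be pictured with a small toy model: a queue of tapes, a single drive, and an automatic swap whenever the mounted tape fills up. The class `TapeStacker` and its attributes are hypothetical names for this sketch and do not correspond to any vendor's interface.

```python
from collections import deque

class TapeStacker:
    """Toy model of a stacker: many tapes queued, one drive, one tape mounted."""

    def __init__(self, tape_count, capacity_gb):
        self.magazine = deque(range(1, tape_count + 1))  # tapes waiting in order
        self.capacity_gb = capacity_gb
        self.mounted = None      # label of the tape currently in the drive
        self.free_gb = 0         # space left on the mounted tape
        self.used = []           # labels of tapes written so far

    def _load_next(self):
        if not self.magazine:
            raise RuntimeError("stacker is out of tapes")
        self.mounted = self.magazine.popleft()  # previous tape is dismounted
        self.free_gb = self.capacity_gb
        self.used.append(self.mounted)

    def write(self, size_gb):
        """Write size_gb, swapping tapes automatically as each one fills."""
        remaining = size_gb
        while remaining > 0:
            if self.free_gb == 0:
                self._load_next()
            chunk = min(remaining, self.free_gb)
            self.free_gb -= chunk
            remaining -= chunk

stacker = TapeStacker(tape_count=10, capacity_gb=40)
stacker.write(130)
print("tapes used:", stacker.used)
print("tapes left in magazine:", len(stacker.magazine))
```

No operator action is needed between tapes, which is what makes unattended backups possible; the backup fails only if the magazine runs out.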

Tape Silos

Tape silos provide large storage capacities. They can store and manage thousands of tapes and can integrate multiple tape drives. They have the software and hardware to label and store the tapes they hold. It is very common for a silo to be connected remotely over a network or a dedicated link, so we should ensure that the bandwidth of that connection is up to the job.

Other Technologies

Technologies other than tape are listed below.

  • Disk Backups

  • Optical jukeboxes

Disk Backups

Methods of disk backups are listed below.

  • Disk-to-disk backups

  • Mirror breaking

These methods are used in OLTP systems. They minimize database downtime and maximize availability.

Disk-to-Disk Backups

In this kind of backup, the backup is taken onto disk rather than tape. The reasons for doing disk-to-disk backups are:

  • Speed of initial backups

  • Speed of restore

Backing up data from disk to disk is much faster than backing up to tape. However, it is an intermediate step: the data is later backed up to tape. The other advantage of disk-to-disk backups is that they give you an online copy of the latest backup.
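The two-stage flow described above, a fast copy to disk followed by a later copy to tape, can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: a tar file stands in for the tape device, and the function names, directory layout, and file name are hypothetical.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path

def disk_to_disk_backup(source_dir, staging_dir):
    """Stage 1: fast copy of the data files to a backup area on disk."""
    target = Path(staging_dir) / "latest"
    if target.exists():
        shutil.rmtree(target)           # keep only the latest online copy
    shutil.copytree(source_dir, target)
    return target

def archive_to_tape(staged_dir, tape_path):
    """Stage 2: stream the staged copy to the slower archive medium.

    A tar file stands in for the tape device in this sketch.
    """
    with tarfile.open(tape_path, "w") as tar:
        tar.add(staged_dir, arcname="backup")
    return Path(tape_path)

# Demonstration with throwaway directories and a dummy data file.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    source = tmp / "db"
    source.mkdir()
    (source / "fact_sales.dat").write_text("dummy rows")  # hypothetical file

    staged = disk_to_disk_backup(source, tmp / "staging")
    tape = archive_to_tape(staged, tmp / "tape0.tar")

    with tarfile.open(tape) as tar:
        archived_names = tar.getnames()
    staged_files = sorted(p.name for p in staged.iterdir())

print("staged files:", staged_files)
print("archived:", archived_names)
```

The staged copy remains on disk after the archive step, which corresponds to the online copy of the latest backup mentioned above.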

Mirror Breaking

The idea is to have disks mirrored for resilience during the working day. When a backup is required, one of the mirror sets can be broken out. This technique is a variant of disk-to-disk backups.

Note: The database may need to be shut down to guarantee the consistency of the backup.
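The sequence of mirroring, breaking the mirror, and bringing it back can be shown with a toy in-memory model. The class `MirroredVolume` and its methods are hypothetical names for illustration; real volume managers work on disk blocks and usually resynchronize only the changed regions.

```python
class MirroredVolume:
    """Toy model of a two-way mirror whose second side can be split off."""

    def __init__(self):
        self.primary = {}        # block id -> data
        self.mirror = {}         # kept identical while the mirror is attached
        self.attached = True

    def write(self, block, data):
        self.primary[block] = data
        if self.attached:
            self.mirror[block] = data

    def break_mirror(self):
        """Detach the mirror; it becomes a frozen copy that can be backed up."""
        self.attached = False
        return dict(self.mirror)

    def resync(self):
        """Reattach the mirror and bring it up to date with the primary."""
        self.mirror = dict(self.primary)
        self.attached = True

volume = MirroredVolume()
volume.write(1, "a")
frozen = volume.break_mirror()   # backup reads this frozen copy
volume.write(2, "b")             # production writes continue on the primary
volume.resync()
print("backed-up copy:", frozen)
print("mirror after resync:", volume.mirror)
```

In this model the primary runs without its mirror between `break_mirror` and `resync`, so the backup should be finished and the mirror reattached as quickly as possible.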

Optical Jukeboxes

Optical jukeboxes allow data to be stored near-line. This technique allows a large number of optical disks to be managed in the same way as a tape stacker or tape silo. The drawback is that optical disks have slower write speeds than magnetic disks. However, the long life and reliability of optical media make them a good choice of medium for archiving.

Software Backups

Software tools are available to help with the backup process. These tools come as packages. They not only take backups but also manage and control the backup strategy. Many software packages are available on the market. Some of them are listed in the following table.

Package Name    Vendor
Networker       Legato
Epoch           Epoch Systems
Omniback II     HP
Alexandria      Sequent

Criteria For Choosing Software Packages

The criteria for choosing the best software package are listed below:

  • How scalable is the product as tape drives are added?

  • Does the package have a client-server option, or must it run on the database server itself?

  • Will it work in cluster and MPP environments?

  • What degree of parallelism is required?

  • What platforms are supported by the package?

  • Does the package support easy access to information about tape contents?

  • Is the package database-aware?

  • What tape drives and tape media are supported by the package?