It’s funny how a small comment made in a blog post strikes a chord with people in different ways. In this post on the potential acquisition of Sun by IBM, I made the comment that “tape doesn’t have a long-term strategic future in anyone’s business”. D_Ced picked up on this and questioned me about it (see comment). Let me explain…
I’ve been involved in managing tape environments for over 20 years and have used everything from reel-to-reel 3420 tape drives right up to today’s fastest LTO4 drives. What’s obvious from that experience is that tape gets used for two primary needs: recovery from data loss, and archive.
When disk was expensive, tape was the only way to restore lost data. By today’s standards, data quantities were tiny; restoring from backup was part of the routine operational process and (certainly in mainframe environments) was integrated into the I/O architecture, so datasets on tape were effectively accessed the same way as datasets on disk. Over time, however, we’ve started storing massive quantities of data on tape. The technology has improved to help manage that growth: LTO4 tapes offer incredible capacities today, and robotics and tape library automation mean thousands of individual cartridges can be stored and accessed in a completely automated fashion.
As regulatory regimes have changed and organisations have become highly dependent on electronic forms of data, the need to retain that data as it stood at specific points in time has become paramount. For years, backups have been sequestered for this purpose, retained long after the need to keep the data on tape for restore has passed.
But tape, whilst portable and compact, has problems. Here are some that can be found in probably every large organisation today:
- Legacy Tape. All companies will have tape data across multiple device types, including DAT, LTO, DLT, DDS, 3480, 3490 and many, many more.
- Large Historical Span of Data. Data on tape will go back years; in some cases it is being retained indefinitely.
- Large Volumes of Replicated Data. The same full backup will have been taken on servers week in, week out, even though a large proportion of the files remain unchanged from one week to the next.
- Unidentifiable Data. Most enterprises hold lots of tapes that have lost their labels, or that don’t have sufficient documentation to identify their contents.
- Lack of Hardware Support. Many tapes are still being retained for which no tape drive or backup environment exists.
- Multiple Backup Software Products. These can be standalone or network-based packages; most are incompatible with one another.
The historical nature of the backup process means that most data on tape represents an image or snapshot of a server, or of its data, at a specific point in time. The sequential nature of tape means each image is kept separately; duplicate data isn’t removed or simply re-referenced, as it would be in a disk-based system. As data growth continues and the rigour demanded in retaining archive copies increases, something has to change. Writing the same data to a sequential medium again and again doesn’t scale over the long term, especially when that data is rarely, if ever, refreshed onto newer technology platforms.
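To make the re-referencing point concrete, here’s a minimal sketch of how a disk-based store can keep one copy of each unique piece of data and let repeated full backups simply point at it. It’s Python, it uses naive fixed-size chunks and SHA-256 hashing, and all the names are mine, purely for illustration; real products use far more sophisticated variable-length chunking.

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # naive fixed-size chunks, for simplicity

class DedupStore:
    """Toy content-addressed store: each unique chunk is written once;
    repeated backups of unchanged data just add references to it."""

    def __init__(self):
        self.chunks = {}    # SHA-256 digest -> chunk bytes (stored once)
        self.backups = {}   # backup name -> ordered list of digests

    def backup(self, name, data):
        digests = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            d = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only if its content hasn't been seen before;
            # otherwise the existing copy is simply re-referenced.
            self.chunks.setdefault(d, chunk)
            digests.append(d)
        self.backups[name] = digests

    def restore(self, name):
        return b"".join(self.chunks[d] for d in self.backups[name])

if __name__ == "__main__":
    store = DedupStore()
    # Week 1: a "server" of 100 distinct chunks; week 2: one chunk changed
    week1 = b"".join(bytes([i]) * CHUNK_SIZE for i in range(100))
    week2 = week1[:-CHUNK_SIZE] + b"\xff" * CHUNK_SIZE
    store.backup("server1-week1", week1)
    store.backup("server1-week2", week2)
    print(len(store.chunks))  # 101 unique chunks held, not 200
    assert store.restore("server1-week2") == week2
```

Two full weekly backups of the same server end up costing little more than one; on tape, both images would have been written out in full.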
I think we need a number of things:
- Tools which can interrogate existing tape media and backup software databases, and transfer those backups into either a more current version of the same backup product or, ideally, into any backup product. This would help deal with the legacy backlog.
- A consistent methodology for referencing backup data. This needs to operate at multiple levels – server, file and block – and the schema needs to cope with point-in-time images at each level and to identify accurately when two or more objects being stored are the same and therefore don’t need storing again (there’s a sketch of one possible shape for this after the list).
- The splitting of backup and archive into separate functions. Archive should become part of application design; backup is retained as an operational need, but should be tied to the recovery requirements of the application.
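As a sketch of what such a referencing scheme might look like (and only a sketch; the `Catalogue` class and everything in it are hypothetical names of mine), the identity-by-content idea from the earlier example can be layered: blocks are identified by their hash, a file by the hash of its block list, and a server image is just a dated manifest of file identities. Two identical objects at any level resolve to the same identifier and so are stored only once.

```python
import hashlib, json, time

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class Catalogue:
    """Hypothetical multi-level backup catalogue: point-in-time images
    reference files, files reference blocks, and identical objects at
    any level share one identifier, so each is stored only once."""

    def __init__(self):
        self.blocks = {}   # block digest -> block bytes
        self.files = {}    # file digest  -> ordered list of block digests
        self.images = {}   # (server, timestamp) -> {path: file digest}

    def store_file(self, data: bytes, block_size: int = 64 * 1024) -> str:
        block_ids = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            bid = digest(block)
            self.blocks.setdefault(bid, block)        # block-level dedup
            block_ids.append(bid)
        fid = digest(json.dumps(block_ids).encode())  # a file *is* its blocks
        self.files.setdefault(fid, block_ids)         # file-level dedup
        return fid

    def store_image(self, server: str, files: dict) -> tuple:
        """Record a point-in-time image of a server; files maps path -> bytes.
        Unchanged files add nothing but a manifest entry."""
        manifest = {path: self.store_file(data) for path, data in files.items()}
        key = (server, time.time())
        self.images[key] = manifest                   # server level: references only
        return key
```

A production schema would need variable-length chunking, a hash-collision policy, encryption and an index that scales, but the principle – identify objects by what they contain, not by where or when they were written – is what lets the duplicates drop away.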
If we start storing only the backup and archive data we actually need, then far more of it can be retained on disk (or, dare I say it, in the cloud). After all, having it on a random-access medium is always going to be superior.