The challenges of archiving structured and unstructured data
Traditionally, organizations had two electronic storage technologies: disk and tape. While disk became the primary storage medium, tape offered a cost-effective medium for storing infrequently accessed content.
This led organizations to treat tape not just as a backup medium but as the organization’s archive, which in turn resulted in monthly full system backups being retained over extended periods to meet archiving requirements.
Over time, legislative and regulatory bodies came to accept the extended delays that tape restore limitations imposed on inquiries and investigations.
Since the beginning of this century, the following trends have impacted the IT industry:
- Single disk drive capacity has grown exponentially into the multi-TB range while delivering cost-effective performance.
- Unstructured data has grown exponentially with the introduction of social media networks, the Internet of Things, etc., exceeding all planned growth.
- The introduction of cloud storage (storage as a service) offers an easy way to acquire storage with incremental investment that fits any organization’s financial planning, at virtually infinite scalability.
All of the above have contributed to unprecedented growth of unstructured data that is straining organizations’ IT budgets and is not offset by declining storage costs. Today, organizations commonly experience double-digit and, in some instances, triple-digit year-over-year growth in unstructured storage. The response is often simply to buy or license more capacity to accommodate the growing data; however, this creates an additional burden on IT to protect it.
What are some of the typical scenarios seen by business today?
- Given the dynamic nature of the IT industry, employee churn affects IT organizations. Typically, when an employee moves on, there is no process to retire their personal folders. A simple scan can often reveal a large number of files that have not been accessed for fifteen or twenty years, yet they continue to consume storage, backup licenses, and administrative time and money. Even when the time comes for a storage infrastructure refresh, these unused files become part of the migration simply because there is no decision process to retire data.
- In many cases, some of the data must be kept for many years to support industry compliance (e.g. healthcare records, banking records, checks, etc.), which complicates retiring any data even further.
- Even with the low cost per GB offered by cloud providers, unmanaged storage of non-business-related data can drive storage costs higher, and the cost is compounded by the need for multi-region and/or multi-zone repositories to support resilient access to stored data.
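The "simple scan" mentioned in the first scenario can be sketched in a few lines of Python. This is a minimal illustration, not a production tool: the function names and the idle threshold are hypothetical, and it assumes file timestamps are trustworthy. Many filesystems relax or disable access-time updates, so the sketch uses the later of the access and modification times.

```python
import os
import time
from pathlib import Path


def find_stale_files(root, max_idle_days, now=None):
    """Yield (path, size_bytes, idle_days) for files idle longer than max_idle_days.

    Idle time is measured from the later of the last-access and last-modified
    timestamps, because access-time updates are often disabled or relaxed.
    """
    now = time.time() if now is None else now
    cutoff_seconds = max_idle_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                info = path.stat()
            except OSError:
                continue  # file vanished or is unreadable; skip it
            last_used = max(info.st_atime, info.st_mtime)
            idle_seconds = now - last_used
            if idle_seconds > cutoff_seconds:
                yield path, info.st_size, idle_seconds / 86400


def summarize(root, max_idle_days, now=None):
    """Return (file_count, total_bytes) of stale files under root."""
    count = 0
    total = 0
    for _path, size, _idle in find_stale_files(root, max_idle_days, now=now):
        count += 1
        total += size
    return count, total
```

Calling `summarize("/path/to/share", 15 * 365)` would report how many files, and how many bytes, have sat untouched for roughly fifteen years. The report is only a starting point: compliance retention requirements still decide what may actually be retired.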
Industry response to managing unstructured data
Unstructured data growth is recognized by end-user organizations, storage vendors and cloud providers alike. The responses range from extremely low-cost, virtually infinitely scalable storage platforms (e.g. object storage) to cloud object storage as a service with multi-tier pricing, coupled with varying performance and protection capabilities.
Tape vendors have also jumped on the bandwagon, trying to revive tape as an even lower cost-per-GB archive storage, echoing the historical use of tape as an archive.
Can tape backup be the ultimate archive platform?
- The LTO (Linear Tape-Open, also known as Ultrium) standard was developed in the late 1990s by a consortium of storage vendors as an open alternative to DLT (Digital Linear Tape), a technology owned by a single vendor.
- Since then LTO has gone through multiple generations. The current generation is LTO 8 (12TB native capacity per cartridge and up to 30TB compressed) at an approximate price of $100 per cartridge (under one cent per GB), with effectively unlimited scalability since tapes can be removed after writing and kept on the shelf as an archive.
- According to planned LTO roadmap, LTO 9 will be shipping Q4 2020.
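- Through LTO 7, the roadmap maintained read compatibility two generations back and write compatibility one generation back; from LTO 8 onward, drives are compatible with only the immediately preceding generation (e.g. LTO 8 can read and write LTO 7).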
Hence, IT shops need to plan a tape migration roughly every five years just to maintain access to archive contents. Failure to refresh tape technology for more than two generations may leave the archive unusable.
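The per-GB media cost follows directly from cartridge price and capacity. Below is a minimal sketch of that arithmetic, assuming decimal units (1 TB = 1,000 GB) and, as illustrative inputs, a $100 cartridge with the 12 TB native capacity published for LTO 8. The function name is hypothetical, and the capacity is a parameter so other generations can be substituted.

```python
def cost_per_gb(cartridge_price_usd, capacity_tb):
    """Return media cost in US dollars per GB, using decimal units (1 TB = 1000 GB)."""
    if capacity_tb <= 0:
        raise ValueError("capacity_tb must be positive")
    return cartridge_price_usd / (capacity_tb * 1000)


# Assumed inputs: about $100 per cartridge, 12 TB native capacity.
native = cost_per_gb(100, 12)
print(f"${native:.4f} per GB ({native * 100:.2f} cents per GB)")
# prints "$0.0083 per GB (0.83 cents per GB)"
```

This covers media only. Drives, libraries, backup software licenses and the periodic migrations described above all add to the real cost of a tape archive.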
With LTO 8 reaching 12TB native capacity per cartridge at an approximate price of $100 per cartridge, it is clear that tape wins the price war hands down. However, there is a general trend away from tape, since it comes with the following burdens:
- Data is typically written to tape through backup software. Like LTO media, backup software undergoes regular release updates to benefit from new processors, operating system releases, databases, libraries and capabilities, and backup vendors maintain only limited backward compatibility.
- There is no simple way to use data stored on tape, since it is not in a format usable by applications or search engines until it is restored by the same backup mechanism that wrote it.
- Managing tapes always involves extensive manual processes, which are prone to human error and can lead to restore challenges.
- The impact of bit decay (bit rot): like all storage media, tape experiences gradual decay of stored bits over time. Isolated bit errors are handled by the error detection and correction built into disk controllers and tape formats (e.g. Cyclic Redundancy Check (CRC)), but as media ages to 15 years or more, multiple bit errors may accumulate and in many cases render the tape data unrestorable.
Based on the above challenges, tape is nearly always disqualified as a retention medium, and the problem becomes more acute for data retention periods exceeding 15 years.
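Whatever the medium, silent corruption can only be caught if content is periodically checked against a known-good fingerprint. The sketch below illustrates one common approach, a checksum manifest, under stated assumptions: the archive is reachable as a directory tree, SHA-256 is an acceptable digest, and the function names are hypothetical. It detects damage; it does not repair it.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its digest at archive time."""
    root = Path(root)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(root, manifest):
    """Return a sorted list of relative paths that are missing or whose content changed."""
    root = Path(root)
    damaged = []
    for relative, expected in manifest.items():
        path = root / relative
        if not path.is_file() or sha256_of(path) != expected:
            damaged.append(relative)
    return sorted(damaged)
```

The manifest would be built when data enters the archive and stored separately from it. Running `verify_manifest` on a schedule then shows which items need to be restored from a second copy.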
Additionally, managing billions of unstructured files through simple filesystem structures is a daunting task that strains IT resources and sharply increases storage TCO.
Is the issue the same with structured data growth?
Structured data (e.g. ERP applications, HR, databases, etc.) is not growing at rates comparable to unstructured data; however, it is still subject to the following:
- Compliance mandates: almost all ERP systems contain data that is subject to one or more compliance requirements (e.g. financial transactions, HR records, etc.).
- There is no simple way to retire or archive structured data, since there is no granular way to deal with individual records within a database.
- Mergers and acquisitions: there is always a need to consolidate and migrate databases and ERP systems. While this may help control the spiraling cost of structured data storage, compliance and the need to access old financial records mean organizations end up keeping the retired ERP system alive just to provide access to historical financial records. This comes with the hefty cost of ERP and database licensing and the need to run the aged ERP system on hardware that is out of maintenance support, with the potential for permanent data loss once backups are stopped.
- Without an effective process to control, retire, and access aged records within ERP systems, databases will continue to grow year over year, which in turn degrades overall system performance and dramatically escalates ERP license costs along with backup licensing and disaster recovery planning costs.
- Coupled with storage growth, failure to retire ERP data in line with regulatory mandates introduces severe compliance gaps and may result in financial penalties, in addition to unnecessary searches in response to an outdated investigation or data query.
- Almost all IT shops have experienced an ERP refresh where the newly deployed system is not backward compatible with the retired one; however, the need to access compliance records forces the organization to keep the retired application running until the end of the retention period, which in turn multiplies the issues with managing storage growth and licenses.
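To make the idea of retiring aged records concrete, the sketch below moves rows older than a cutoff date from a live table into an archive table within a single transaction. It is deliberately simplified and rests on assumptions: the `ledger` and `ledger_archive` tables and their columns are hypothetical, an in-memory SQLite database stands in for a real system, and dates are stored as ISO strings. Real ERP schemas carry referential constraints and application logic that make record-level archiving far harder, which is the point made above.

```python
import sqlite3


def archive_aged_records(conn, cutoff_date):
    """Move rows older than cutoff_date (ISO 'YYYY-MM-DD') from ledger to ledger_archive.

    Both statements run in one transaction, so a failure rolls back the copy
    and the delete together. Returns the number of rows moved.
    """
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO ledger_archive (id, posted_on, amount) "
            "SELECT id, posted_on, amount FROM ledger WHERE posted_on < ?",
            (cutoff_date,),
        )
        cursor = conn.execute("DELETE FROM ledger WHERE posted_on < ?", (cutoff_date,))
        return cursor.rowcount


# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ledger (id INTEGER PRIMARY KEY, posted_on TEXT, amount REAL)")
conn.execute("CREATE TABLE ledger_archive (id INTEGER PRIMARY KEY, posted_on TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO ledger (id, posted_on, amount) VALUES (?, ?, ?)",
    [(1, "2009-03-31", 120.0), (2, "2015-06-30", 75.5), (3, "2023-01-15", 42.0)],
)
conn.commit()

moved = archive_aged_records(conn, "2016-01-01")
print(moved)  # prints 2
```

In practice the archive target would sit on a separate, lower-cost platform and remain queryable for the rest of the retention period, so that the retired production system no longer has to be kept alive.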
Specific Recommendations:
Organizations should consider the following:
- Create a corporate governance practice to enforce and manage data retention rules. Ideally, this practice should be sponsored by the CIO and must include representatives of all divisions subject to compliance (e.g. legal, HR, finance, etc.) in addition to IT.
- Enforce data retention and retirement policies in strict alignment with regulations and/or corporate governance.
- Create a process and automated workflow to migrate data from live production to an archive platform, and from archive to purge, based on defined retention values and with consideration for access frequency and the criticality of the contents.
- Treat structured and unstructured data archiving separately, and challenge incumbent vendors about their structured data archiving capabilities.
- Investigate cost effective storage technologies that offer minimally managed, self-healing, resilient storage platforms where available.
- Consider tape as a viable option for as long as it continues to offer a cost-effective storage medium; however, many processes and strict rules will be required to avoid the well-known tape pitfalls described previously.
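The production-to-archive-to-purge workflow recommended above comes down to a per-item decision driven by retention values and access frequency. The following is a minimal sketch of that decision only. The `RetentionPolicy` fields, thresholds and action names are hypothetical, and real values would come from the governance practice rather than from IT alone.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RetentionPolicy:
    archive_after_idle_days: int  # move to archive once untouched this long
    retain_days: int              # purge once the item is older than this
    legal_hold: bool = False      # a hold blocks purging regardless of age


def next_action(created, last_accessed, policy, today):
    """Return 'purge', 'archive', or 'keep' for one item."""
    age = (today - created).days
    idle = (today - last_accessed).days
    if age > policy.retain_days and not policy.legal_hold:
        return "purge"
    if idle > policy.archive_after_idle_days:
        return "archive"
    return "keep"


# Hypothetical policy: archive after one idle year, purge after seven years.
policy = RetentionPolicy(archive_after_idle_days=365, retain_days=7 * 365)
today = date(2024, 1, 1)
print(next_action(date(2010, 1, 1), date(2012, 1, 1), policy, today))   # prints purge
print(next_action(date(2020, 1, 1), date(2021, 1, 1), policy, today))   # prints archive
print(next_action(date(2023, 6, 1), date(2023, 12, 1), policy, today))  # prints keep
```

Keeping the decision in one small, testable function makes it straightforward for legal, HR and finance representatives to review the rules, while the surrounding automation handles the actual data movement.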
About the SNIA Data Protection & Privacy Committee
The SNIA Data Protection & Privacy Committee (DPPC) exists to further the awareness and adoption of data protection technology, and to provide education, best practices and technology guidance on all matters related to the protection and privacy of data.
Within SNIA, data protection is defined as the assurance that data is usable and accessible for authorized purposes only, with acceptable performance and in compliance with applicable requirements. The technology behind data protection will remain a primary focus for the DPPC. However, there is now also a wider context, driven by increasing legislation to keep personal data private. The term data protection also extends into areas of resilience to cyber attacks and to threat management. To join the DPPC, log in to SNIA and request to join the DPPC Governing Committee.
Mounir Elmously, Governing Committee Member, SNIA Data Protection & Privacy Committee and Executive, Advisory Services Ernst & Young, LLP
Why did I join SNIA DPPC?
With my long history with storage and data protection technologies, along with my current job roles and responsibilities, I can bring my expertise to influence the storage industry technology education and drive industry awareness. During my days with storage vendors, I did not have the freedom to critique specific storage technologies or products. With SNIA I enjoy the freedom of independence to critique and use my knowledge and expertise to help others improve their understanding.