
Hyper-scalable Cloud Services: the Benefits are Accessible to the Public

Over the past decade, data processing systems have undergone fundamental changes. According to IDC (International Data Corporation), the rapid development of mobile and web applications, coupled with the growing commoditisation of content creation tools, has driven at least a 30-fold increase in the production and consumption of information. Against this background, companies are striving to extract the maximum benefit from the petabytes of data they are now forced to keep. Fully automated clouds providing SaaS and IaaS services have become a flourishing business with multi-billion-dollar revenues, and the emergence of wireless sensor networks and other machine-to-machine technologies promises another huge leap in data storage and transfer.

Nevertheless, despite the monumental changes in how information is handled, conventional data storage systems have changed little over the past 20 years beyond growth in storage capacity and processor performance. Systems designed for terabytes are struggling under the pressure of multi-petabyte data volumes.

Software-defined storage (SDS) offers a more flexible storage model at a time when data storage is turning into just another IT service. Decoupling storage from the hardware platform and treating it as a separate component of the IT architecture allows data flows and storage services to adapt quickly to current challenges and makes it easier to scale storage in both directions. In SDS platforms, management functions are separated from hardware rather than tied to specialised proprietary equipment. This lets the user focus management directly on the data and, in effect, pushes back the limits imposed by hardware. The same separation has also exposed the high retail margins (around 60%) typical of today's storage hardware, just as happened earlier in the industry when, for example, softswitches appeared.

Because SDS systems are designed to operate with petabytes of data, they must provide an extremely high level of availability and remain resistant to typical failure scenarios. The application interface should be compatible both with well-established software and with new mobile and web applications. The system must offer high performance and linear scalability and be suitable for mixed workloads. Data protection and recovery functions should be designed to meet the requirements of dynamic resource allocation and scalability. All of this stands in stark contrast to traditional storage architectures, which are rigidly tied to hardware whose capabilities limit accessibility, availability, performance, robustness and manageability, and whose specialised equipment has a limited lifetime. Those architectures were designed for conditions with far less stringent requirements in every one of these respects.

SANs are still valued for their low latency, but they are not suited to big data

SAN (Storage Area Network) is a basic architectural solution in which storage resources are accessed over a dedicated network. The SAN operates on data blocks arranged in small logical volumes, regardless of content type, and depends entirely on system software to organise the volumes, catalogue them and structure the data. SANs have hardware-limited scalability, interfaces and coverage, and they are usually more expensive because they require an isolated network infrastructure.

The main advantages of SAN are low data access latency and distributed storage resources. The list of shortcomings is much longer: data blocks are opaque and depend on the application, database or file system layered on top of them; improving performance requires building a new system (except in top-tier arrays); increasing capacity by even a few hundred terabytes also requires a new system; shared access from multiple servers is hard to implement without additional software; a fault-tolerant configuration (redundant arrays, controllers, distribution across multiple sites) requires a new system plus replication software; robustness can be improved only with recovery software and a separate system; data recovery operations degrade performance because of the limited number of controllers; and the choice of hardware configurations is limited.

Files are still the basic unit of information, and NAS is a reliable workhorse, but such systems are difficult to scale

NAS (Network Attached Storage) architecture also uses the LAN to reach storage, but the core of this structure is implemented at the system and file level. File systems have a number of intrinsic constraints that stem from the local internal structures responsible for managing the file hierarchy and access to file data. Thanks to the information contained in the file hierarchy, awareness of content type is higher here than in SAN, but that information is stored locally on a physical storage controller. Like SANs, NAS systems have hardware limits on scalability and coverage. Clustered NAS systems offer somewhat better scalability, but they remain constrained by the controllers (their number cannot exceed ten) and by the central database that tracks the integrity of the file hierarchy and the files themselves.

The main advantage of NAS is an easy-to-operate storage system that can handle hundreds of terabytes of data. Its shortcomings include: the need to build a new system to improve performance or to grow capacity by more than a few hundred terabytes; the need for a new system and replication software to create a fault-tolerant configuration (redundant arrays, controllers, access from multiple sites); the need for recovery software and a separate system to improve robustness; data recovery operations that degrade performance because of the limited number of controllers; and a limited choice of hardware configurations.

Clustered NAS combines some of the advantages of standard NAS with a usable capacity of about one petabyte. Its disadvantages include: improving performance requires building a new system once the limit of roughly 100 nodes is reached; growing capacity to several petabytes also requires a new system; a fault-tolerant configuration (redundant arrays, controllers, access from multiple sites) requires a new system plus replication software; robustness can be improved only with recovery software and a separate system; and the choice of hardware configurations is limited.

Object storage scales easily, but limited peak workload is its bottleneck

Object storage is built on an additional abstraction layer, often implemented on top of, and in parallel with, local file systems. In such a system, data are presented as objects (rather than blocks or files) in a global namespace, and each object is assigned its own unique identifier. This namespace can span hundreds of servers, which makes it far easier to grow the capacity of an object store than that of a SAN or NAS system.
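To make the idea concrete, here is a minimal Python sketch of a flat object namespace: each payload is stored under a generated identifier rather than under a path or LUN. The ObjectStore class and the UUID-based identifiers are illustrative assumptions, not the API of any particular product.

```python
import uuid

class ObjectStore:
    """Minimal in-memory model of a flat object namespace."""

    def __init__(self):
        # object ID -> payload; a real system shards this map across many servers
        self._objects = {}

    def put(self, data: bytes) -> str:
        """Store a payload and return its globally unique identifier."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = data
        return object_id

    def get(self, object_id: str) -> bytes:
        """Retrieve a payload by its identifier; no directories or paths are involved."""
        return self._objects[object_id]

store = ObjectStore()
oid = store.put(b"sensor reading #42")
print(oid, store.get(oid))
```

Because the identifier carries no location or hierarchy, capacity can grow simply by spreading the identifier-to-payload mapping over more servers.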

The main advantages are a unified namespace and low cost. The shortcomings are the following: system availability depends on the integration of API or file-system gateways, which complicates the system; applications see a flat (if scalable) namespace instead of familiar file paths or LUNs; performance depends on the metadata architecture, which is limited by the master nodes; replication of data between multiple sites happens only after a delay (asynchronous replication); and truly high reliability is achieved only by using erasure-coding algorithms. Moreover, object storage has fundamental limitations in its support for network applications, which have to be rewritten against specific HTTP API requirements. Application functionality is usually limited to write-once (with no read access) or write-once-read-many (WORM) scenarios. This limited functionality stems from an architecture in which traffic passes through a small set of metadata nodes, which are often further loaded by additional services such as erasure coding.
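As a rough illustration of the erasure-coding idea mentioned above, the sketch below uses the simplest possible code, a single XOR parity block over three data blocks, to rebuild one lost block. Production object stores typically use more elaborate codes (for example Reed-Solomon), so this is purely a toy example.

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

# Three fixed-size data blocks plus one XOR parity block (a 3+1 scheme).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing one data block and rebuild it from the survivors plus parity.
lost_index = 1
survivors = [blk for i, blk in enumerate(data) if i != lost_index]
recovered = xor_blocks(survivors + [parity])
assert recovered == data[lost_index]
print("recovered:", recovered)
```

The recovery step is pure computation over the surviving blocks, which is exactly the extra load the metadata and storage nodes must absorb when drives fail.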

Software-defined storage is designed to meet all the requirements of massive scalability

Software-Defined Storage (SDS) is an entirely new concept for data storage systems, in which storage functionality is fully separated from specific hardware. This approach has produced systems with more flexible deployment, scaling and management, as well as a new level of availability.

The main advantages of software-defined storage include: standard file access, object access or access through virtualisation interfaces; performance limited only by the number of nodes; performance and capacity scaled out on standard servers; synchronous or asynchronous replication of data across multiple sites without a shared architecture; and high robustness with fast response to failures. The shortcomings are the following: raw performance takes a back seat to running many applications in parallel; data access latency is higher than in systems designed specifically for low latency; and the price of large deployments is rather high.

Thanks to the separation from hardware described above, SDS software can interact with individual hardware components and scale capacity, performance and availability independently of one another, depending on the task at hand. Such flexibility is unattainable in conventional storage systems, except at the very high end, where these functions are implemented with specialised hardware components that ultimately limit the system's flexibility and scalability anyway. Separating storage functionality from hardware also makes it easier to pinpoint problems, because the whole system is visible, instead of hunting for failures that could equally well be caused by hardware or software, a labour-intensive process that always costs far more time than it should. Beyond the basic separation of software from hardware, the separate logic of storage services in SDS allows the services that control capacity, availability, robustness and data access to move past the limits of physical resources.

Another characteristic feature of SDS is that most implementations use an object store to create a virtually unlimited namespace of unique object names, instead of drive logical unit numbers (LUNs) and file-system paths, which have fundamental scaling limits. Such an unlimited namespace allows real capacity to be built up quickly, without the structural constraints of physical units.

Software-defined storage systems are also impressive in terms of availability, making effective use of their own dedicated network between nodes. In contrast to the fixed active/passive controller pairs typical of most SAN and NAS systems, SDS systems can expand to many thousands of addresses within the same network. In addition, they can use advanced routing algorithms to keep responding even in large-scale topologies and multiple-failure scenarios. This goes far beyond the switched-fabric or daisy-chain topologies of traditional storage systems, where an entire array of devices may become unavailable because of a broken cable or a connection error.

In conventional storage systems, robustness usually means the ability to keep operating after the unexpected failure of one or two drives, which also requires immediate replacement of the failed components. In petabyte-scale systems, the number of drives starts in the hundreds and often reaches the thousands. Even with a generous margin in the mean time between failures (MTBF), at that scale some drives will be failing at any given moment. SDS systems are designed with the expectation of massive failures and multiple hardware points of failure.
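A back-of-the-envelope calculation makes the point about scale. Assuming, hypothetically, a vendor-quoted MTBF of 1.2 million hours and a constant failure rate, the expected number of drive failures per year grows linearly with the size of the fleet:

```python
HOURS_PER_YEAR = 24 * 365

def expected_annual_failures(drive_count: int, mtbf_hours: float) -> float:
    """Expected drive failures per year, assuming a constant failure rate of 1/MTBF."""
    return drive_count * HOURS_PER_YEAR / mtbf_hours

# Hypothetical fleet sizes against an assumed MTBF of 1.2 million hours.
for drives in (100, 1_000, 10_000):
    rate = expected_annual_failures(drives, 1_200_000)
    print(f"{drives:>6} drives -> ~{rate:.1f} failures per year")
```

Under these assumptions a thousand-drive system sees roughly seven failures a year and a ten-thousand-drive system more than one a week, which is why recovery must be a routine, continuous process rather than an exceptional event.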
Software-defined infrastructures use the advantages of distributed storage and processing to implement layered protection and to restore normal operation as quickly as possible. At large scale this model is far more appropriate than piling redundant controllers onto architectures in which system availability during the recovery of drives and other services is the bottleneck.

Accessibility was never a primary concern in traditional storage systems: access for application servers or mainframes was provided over specialised networks using just a few mature protocols. Today, public Ethernet networks with mixed public and private access have become the norm, and software-defined storage has to meet a far wider range of requirements. From web access to Ethernet access, and from networked storage resources to resources deployed locally in an application server, SDS systems should support them all.

As noted above, conventional storage systems are highly specialised, which leads to the decentralisation of control and data and affects every business aspect of a large company. This approach is not only extremely counterproductive operationally, it also sharply reduces the scope for expansion by imposing strict limits on data sharing and reuse. SDS meets most application integration requirements, working with a wide variety of protocols, from stateful to stateless, from the simplest to the most interactive and semantically rich. This makes it possible to create a universal environment in which the storage serves as the foundation for running applications, whatever file sizes, security requirements and protocols are involved. It breaks down the boundaries between NAS, object and tape storage, relieves the pressure that the major storage vendors have gladly applied for many years, and improves storage services in a world where the network has grown to billions of endpoints.

To summarise, the requirements placed on data storage systems have changed radically, as have the scenarios in which they are used. It is enough to note that 90% of all data in existence today was created in just the past two years. The era of petabytes is already here, and the era of exabytes is on the doorstep. Enormous volumes of data and the pursuit of new economic gains have catalysed fundamental change in a storage industry that still relies on old approaches, some of which appeared several decades ago and have by now reached the technical limits of their capabilities.
