Scalable Performance of the Panasas Parallel File System




















OSDFS is concerned with traditional block-level file system issues such as efficient disk arm utilization and media management (i.e., handling media errors). The cluster manager is concerned with membership in the storage cluster, fault detection, configuration management, and overall control for operations like software upgrade and system restart [Welch07]; it is a user level application that runs on every manager node. The metadata manager is concerned with distributed file system issues such as secure multi-user access, maintaining consistent file- and object-level metadata, client cache coherency, and recovery from client, storage node, and metadata server crashes. The metadata managers are user level applications that run on a subset of the manager nodes.

Fault tolerance is based on a local transaction log that is replicated to a backup on a different manager node. The NFS and CIFS gateway services, in turn, use a local instance of the file system client, which runs inside the FreeBSD kernel. The storage cluster nodes are implemented as blades: very compact computer systems made from commodity parts.

The blades are clustered together to provide a scalable platform. Up to 11 blades fit into a 4U (7 inch high) shelf chassis that provides dual power supplies, a high capacity battery, and one or two GE switches. The switches aggregate the GE ports from the blades into a 4-GE trunk; the second switch provides redundancy and is connected to a second GE port on each blade. The battery serves as a UPS and powers the shelf for a brief period (about five minutes) to provide an orderly system shutdown in the event of a power failure.

Any number of blades can be combined to create very large storage systems. The OSD StorageBlade module and metadata manager DirectorBlade module use the same form factor blade and fit into the same chassis slots. Details of the different blades used in the performance experiments are given in Appendix I. Any number of shelf chassis can be grouped into the same storage cluster.

A shelf typically has one or two DirectorBlade modules and 9 or 10 StorageBlade modules. Customer installations range in size from 1 shelf to around 50 shelves, although there is no enforced limit on system size. While the hardware is essentially a commodity PC with no NVRAM, the battery-backed shelf lets the software treat main memory as safe for buffering: the metadata managers do fast logging to memory and reflect that to a backup with low latency network protocols, and OSDFS buffers write data so it can efficiently manage block allocation.

The UPS powers the system for several minutes so that it can shut down cleanly after a power failure. The logging mechanism is described and measured in detail later in the paper. The system monitors the battery charge level and will not allow a shelf chassis to enter service without an adequately charged battery, to avoid data loss during back-to-back power failures.

If anything goes wrong with the hardware, the whole blade is replaced rather than repaired. We settled on a two-drive storage blade as a compromise between cost, performance, and reliability. Having the blade as a failure domain simplifies our fault tolerance mechanisms, and it provides a simple maintenance model for system administrators. Reliability and data reconstruction are described and measured in detail later in the paper.

Traditional storage management tasks involve partitioning available storage space into LUNs (logical units consisting of one or more disks or a partition of a RAID array), assigning LUN ownership to different hosts, configuring RAID parameters, and connecting clients to the correct server for their storage, which can be labor-intensive. We sought to provide a simplified model for storage management that would shield the storage administrator from these kinds of details and allow a single, part-time administrator to manage systems that are hundreds of terabytes in size.

The Panasas storage system presents itself as a file system with a POSIX interface, and hides most of the complexities of storage management. Clients have a single mount point for the entire system. The administrator can add storage while the system is online, and new resources are automatically discovered.

To manage available storage, we introduced two basic storage concepts: a physical storage pool called a BladeSet, and a logical quota tree called a Volume. The BladeSet is a hard physical boundary for the volumes it contains. A BladeSet can be grown at any time, either by adding more StorageBlade modules or by merging two existing BladeSets together. We mitigate the risk of large fault domains with the scalable rebuild performance described in Section 4.

The Volume is a directory hierarchy that has a quota constraint and is assigned to a particular BladeSet. The quota can be changed at any time, and capacity is not allocated to the Volume until it is used, so multiple volumes compete for space within their BladeSet and grow on demand.
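
To make the relationship concrete, the following sketch (hypothetical Python, not the Panasas implementation; class and method names are illustrative only) models a BladeSet as a shared physical pool and Volumes as quota trees that draw capacity from it only as data is written.

    # Hypothetical sketch of the BladeSet/Volume capacity model described above.
    class BladeSet:
        def __init__(self, name, raw_capacity):
            self.name = name
            self.raw_capacity = raw_capacity   # total space on its StorageBlade modules
            self.used = 0

        def free(self):
            return self.raw_capacity - self.used

    class Volume:
        def __init__(self, path, bladeset, quota):
            self.path = path                   # directory in the single namespace
            self.bladeset = bladeset           # hard physical boundary
            self.quota = quota                 # logical limit, changeable at any time
            self.used = 0

        def write(self, nbytes):
            # A write must fit under the volume quota *and* in the BladeSet's free space.
            if self.used + nbytes > self.quota:
                raise IOError("quota exceeded for %s" % self.path)
            if nbytes > self.bladeset.free():
                raise IOError("BladeSet %s is out of space" % self.bladeset.name)
            self.used += nbytes
            self.bladeset.used += nbytes

    # Two volumes drawing from the same physical pool; quotas may oversubscribe it.
    bs = BladeSet("set1", raw_capacity=100 * 2**40)
    home = Volume("/panfs/home", bs, quota=40 * 2**40)
    scratch = Volume("/panfs/scratch", bs, quota=80 * 2**40)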

The files in those volumes are distributed among all the StorageBlade modules in the BladeSet. Volumes appear in the file system name space as directories. Clients have a single mount point for the whole storage system, and volumes are simply directories below the mount point. There is no need to update client mounts when the admin creates, deletes, or renames volumes.

Each Volume is managed by a single metadata manager. Dividing metadata management responsibility along volume boundaries (i.e., subtrees of the namespace) simplified our implementation. We expected that administrators would introduce volumes for their own reasons, such as quota management, and the system exploits those boundaries to partition the metadata workload. We were able to delay solving the multi-manager coordination problems created when a parent directory is controlled by a different metadata manager than a file being created, deleted, or renamed within it. We also had a reasonable availability model for metadata manager crashes: well-defined subtrees would go offline rather than a random sampling of files.

This reduces the importance of fine-grain metadata load balancing. That said, uneven volume utilization can result in uneven metadata manager utilization. Our protocol allows the metadata manager to redirect the client to another manager to distribute load, and we plan to exploit this feature in the future to provide finer-grained load balancing. While it is possible to have a very large system with one BladeSet and one Volume, and we have customers that take this approach, we felt it was important for administrators to be able to configure multiple storage pools and manage quota within them.

Our initial model only had a single storage pool: a file would be partitioned into component objects, and those objects would be distributed uniformly over all available storage nodes. Similarly, metadata management would be distributed by randomly assigning ownership of new files to available metadata managers.

This is similar to the Ceph model [Weil06]. The attraction of this model is smooth load balancing among available resources. There would be just one big file system, and capacity and metadata load would automatically balance. There are two problems with a single storage pool: the fault and availability model, and performance isolation between different users.

If there are ever enough faults to disable access to some files, then the result would be that a random sample of files throughout the storage system would be unavailable. Even if the faults were transient, such as a node or service crash and restart, there will be periods of unavailability. Instead of having the entire storage system in one big fault domain, we wanted the administrator to have the option of dividing a large system into multiple fault domains, and of having a well defined availability model in the face of faults.

In addition, with large installations the administrator can assign different projects or user groups to different storage pools. This isolates the performance and capacity utilization among different groups. Our storage management design reflects a compromise between the performance and capacity management benefits of a large storage pool, the backup and restore requirements of the administrator, and the complexity of the implementation.

In practice, our customers use BladeSets that range in size from a single shelf to more than 20 shelves, with the largest production BladeSet being about 50 shelves, or on the order of 500 StorageBlade modules and 50 DirectorBlade modules.

The most common sizes, however, range from 5 to 10 shelves. While we encourage customers to introduce Volumes so the system can better exploit the DirectorBlade modules, we have customers that run large systems with only a single volume. Capacity imbalance occurs when a BladeSet is expanded (e.g., by adding new, empty StorageBlade modules) or when a failed StorageBlade module is replaced with an empty one. Keeping capacity evenly distributed also provides better throughput during rebuild (see Section 4). Our system automatically balances used capacity across storage nodes in a BladeSet using two mechanisms: passive balancing and active balancing.

Passive balancing changes the probability that a storage node will be used for a new component of a file, based on its available capacity. This takes effect when files are created, and when their stripe size is increased to include more storage nodes. Active balancing is done by moving an existing component object from one storage node to another, and updating the storage map for the affected file. During the transfer, the file is transparently marked read-only by the storage management layer, and the capacity balancer skips files that are being actively written.
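
A minimal sketch of the two mechanisms, in hypothetical Python (the real placement logic is internal to the system; all names here are illustrative): passive balancing weights the random choice of storage nodes for new components by free capacity, and active balancing re-homes an existing component object by updating the file's storage map.

    import random

    # Passive balancing (sketch): bias the random choice of storage nodes for a new
    # component object toward nodes with more free capacity.
    def choose_nodes(free_bytes, width):
        # free_bytes maps node id -> available capacity; emptier nodes are favored.
        candidates = list(free_bytes)
        chosen = []
        for _ in range(width):
            weights = [free_bytes[n] for n in candidates]
            pick = random.choices(candidates, weights=weights, k=1)[0]
            chosen.append(pick)
            candidates.remove(pick)   # components of one file go on distinct nodes
        return chosen

    # Active balancing (sketch): move one component object to an emptier node and
    # update the file's storage map so clients find it again.
    def move_component(storage_map, component_id, src, dst, busy):
        # Skip components of files that are being actively written.
        if component_id in busy or storage_map[component_id] != src:
            return False
        # (the object's data transfer between storage nodes is omitted here)
        storage_map[component_id] = dst
        return True

    # Example: place a 4-wide parity group on six storage nodes of varying fullness.
    free = {"sb%d" % i: (i + 1) * 10 * 2**30 for i in range(6)}
    print(choose_nodes(free, width=4))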

Capacity balancing is thus transparent to file system clients. It is important to avoid long-term hot spots, and we did learn from some mistakes. The approach we take is to use a uniform random placement algorithm for initial data placement and then preserve that uniformity during capacity balancing; the system must strive for a uniform distribution of both objects and capacity. This is more subtle than it may appear, and we learned that biases in data migration and placement can cause hot spots. We have validated the approach in large production systems, although of course there can always be transient hot spots based on workload.

Initial data placement is uniform random, with the components of a file landing on a subset of available storage nodes. Each new file gets a new, randomized storage map. However, the uniform random distribution is altered by passive balancing that biases the creation of new data onto emptier blades.

On the surface, this seems reasonable. Unfortunately, if a single node in a large system has a large bias as the result of being replaced recently, then it can end up with a piece of every file created over a span of hours or a few days. In some workloads, recently created files may be hotter than files created several weeks or months ago. Our initial implementation allowed large biases, and we occasionally found this led to a long-term hot spot on a particular storage node.

Our current system bounds the effect of passive balancing to be within a few percent of uniform random, which helps the system fine-tune capacity when all nodes are nearly full, but does not cause a large bias that can lead to a hot spot. Another bias we had was favoring large objects for active balancing because moving them is more efficient: there is per-file overhead to update the storage map, so it is cheaper to move a single 1 GB component object than to move the same capacity as many 1 MB component objects.

However, consider a system that has relatively few large files that are widely striped, and lots of other small files. If the large files are hot, it is a mistake to bias toward them because the new storage nodes can get a disproportionate number of hot objects. We found that selecting a uniform random sample of objects from the source blades was the best way to avoid bias and inadvertent hot spots, even if it means moving lots of small objects to balance capacity.
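
Both fixes can be sketched in hypothetical Python (illustrative only; the max_bias value is an assumption standing in for "a few percent"): the passive-balancing weight is clamped near uniform, and active balancing samples migration candidates uniformly at random rather than preferring large objects.

    import random

    # Bound the passive-balancing bias: no storage node's selection weight may
    # deviate from uniform by more than a few percent.
    def clamped_weights(free_bytes, max_bias=0.05):
        nodes = list(free_bytes)
        uniform = 1.0 / len(nodes)
        total_free = float(sum(free_bytes.values()))
        weights = {}
        for n in nodes:
            raw = free_bytes[n] / total_free
            weights[n] = min(max(raw, uniform * (1 - max_bias)), uniform * (1 + max_bias))
        return weights

    # For active balancing, pick migration candidates uniformly at random from the
    # source node, even though moving many small objects costs more map updates.
    def pick_migration_candidates(objects, bytes_to_move):
        # objects maps object id -> size in bytes
        ids = list(objects)
        random.shuffle(ids)
        selected, moved = [], 0
        for oid in ids:
            if moved >= bytes_to_move:
                break
            selected.append(oid)
            moved += objects[oid]
        return selected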

We protect against loss of a data object or an entire storage node by striping files across objects stored on different storage nodes, using a fault-tolerant striping algorithm such as RAID-1 or RAID-5. Small files are mirrored on two objects, and larger files are striped more widely to provide higher bandwidth and less capacity overhead from parity information. The per-file RAID layout means that parity information for different files is not mixed together, and it easily allows different files to use different RAID schemes alongside each other.
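
As a hedged illustration of per-file layout selection (the 64 KB threshold and the width cap below are assumed values for the sketch, not the exact product policy), a small file gets a two-way mirror and a large file gets a wider RAID-5 stripe:

    # Per-file layout selection (sketch); threshold and widths are illustrative.
    SMALL_FILE_BYTES = 64 * 1024

    def choose_layout(file_size, nodes_in_bladeset):
        if file_size <= SMALL_FILE_BYTES:
            # RAID-1: the file is mirrored on two component objects, each on a
            # different storage node.
            return ("RAID-1", 2)
        # RAID-5: stripe more widely for bandwidth and lower parity overhead;
        # the width is capped by the storage nodes available in the BladeSet.
        return ("RAID-5", min(nodes_in_bladeset, 11))

    print(choose_layout(4 * 1024, 20))       # -> ('RAID-1', 2)
    print(choose_layout(10 * 2**20, 20))     # -> ('RAID-5', 11)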

This property and the security mechanisms of the OSD protocol [Gobioff97] let us enforce access control over files even as clients access storage nodes directly. It also enables what is perhaps the most novel aspect of our system, client-driven RAID. That is, the clients are responsible for computing and writing parity. The OSD security mechanism also allows multiple metadata managers to manage objects on the same storage device without heavyweight coordination or interference from each other.

Client-driven, per-file RAID has four advantages for large-scale storage systems. First, by having clients compute parity for their own data, the XOR power of the system scales up as the number of clients increases. Moving XOR computation out of the storage system into the client requires some additional work to handle failures. Clients are responsible for generating good data and good parity for it. Because the RAID equation is per-file, an errant client can only damage its own data.

However, if a client fails during a write, the metadata manager will scrub parity to ensure the parity equation is correct. The second advantage of client-driven RAID is that clients can perform an end-to-end data integrity check. Data has to go through the disk subsystem, through the network interface on the storage nodes, through the network and routers, through the NIC on the client, and all of these transits can introduce errors with a very low probability.

Clients can choose to read parity as well as data, and verify parity as part of a read operation. If errors are detected, the operation is retried. If the error is persistent, an alert is raised and the read operation fails.
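
The following hypothetical Python sketch shows the idea behind client-driven, per-file RAID-5 and the end-to-end check (the actual OSD transport and retry policy are not shown; function names are illustrative): the client XORs the stripe units it writes to produce parity, and on read it can recompute that XOR across data and parity to detect corruption anywhere along the path.

    from functools import reduce

    def xor_blocks(blocks):
        # Byte-wise XOR of equally sized stripe units.
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    def write_stripe(data_units):
        # The client, not the storage system, computes parity for its own data and
        # returns the units destined for the stripe's storage nodes.
        return data_units + [xor_blocks(data_units)]

    def read_stripe(units, verify=False):
        # Optionally recompute the XOR across data and parity as an end-to-end check.
        data_units, parity = units[:-1], units[-1]
        if verify and xor_blocks(data_units) != parity:
            raise IOError("end-to-end parity check failed; retry, then raise an alert")
        return b"".join(data_units)

    # Example: a three-unit stripe plus parity survives a round trip and verifies.
    stripe = write_stripe([b"aaaa", b"bbbb", b"cccc"])
    print(read_stripe(stripe, verify=True))   # -> b'aaaabbbbcccc'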

We have used this facility to track down flaky hardware components; we have found errors introduced by bad NICs, bad drive caches, and bad customer switch infrastructure. File systems like ZFS [ZFS] maintain block checksums within a local file system, but that does not address errors introduced during the transit of information to a network client.

By checking parity across storage nodes within the client, the system can ensure end-to-end data integrity. This is another novel property of per-file, client-driven RAID. Third, per-file RAID protection lets the metadata managers rebuild files in parallel. Although parallel rebuild is theoretically possible in block-based RAID, it is rarely implemented. This is due to the fact that the disks are owned by a single RAID controller, even in dual-ported configurations.

Large storage systems have multiple RAID controllers that are not interconnected. Since the SCSI Block command set does not provide fine-grained synchronization operations, it is difficult for multiple RAID controllers to coordinate a complicated operation such as an online rebuild without external communication. Even if they could, without connectivity to the disks in the affected parity group, other RAID controllers would be unable to assist.

Even in a high-availability configuration, each disk is typically only attached to two different RAID controllers, which limits the potential speedup to 2x.

When a StorageBlade module fails, the metadata managers that own Volumes within that BladeSet determine what files are affected, and then they farm out file reconstruction work to every other metadata manager in the system.

Metadata managers rebuild their own files first, but if they finish early or do not own any Volumes in the affected Bladeset, they are free to aid other metadata managers.

The result is that larger storage clusters reconstruct lost data more quickly. Scalable reconstruction performance is presented later in this paper. The fourth advantage of per-file RAID is that unrecoverable faults can be constrained to individual files.

The most commonly encountered double-failure scenario with RAID-5 is an unrecoverable read error (i.e., a grown media defect) encountered while reconstructing a failed storage device. The second storage device is still healthy, but it has been unable to read a sector, which prevents rebuild of the sector lost from the first drive and potentially of the entire stripe or LUN, depending on the design of the RAID controller. With block-based RAID, it is difficult or impossible to directly map any lost sectors back to higher-level file system data structures, so a full file system check and media scan would be required to locate and repair the damage.

A more typical response is to fail the rebuild entirely. RAID controllers monitor drives in an effort to scrub out media defects and avoid this bad scenario, and the Panasas system does media scrubbing, too.

However, with high capacity SATA drives, the chance of encountering a media defect on drive B while rebuilding drive A is still significant. With per-file RAID-5, this sort of double failure means that only a single file is lost, and the specific file can be easily identified and reported to the administrator.

Figure 2 charts iozone [Iozone] streaming bandwidth performance from a cluster of clients against storage clusters of 1, 2, 4 and 8 shelves. Each client ran two instances of iozone, writing and reading a 4 GB file with a 64 KB record size. Appendix I summarizes the details of the hardware used in the experiments. This is a complicated figure, but there are two basic results. The first is that performance increases linearly as the size of the storage system increases.

The second is that write performance scales up and stays flat as the number of clients increases, while the read performance tails off as the number of clients increases. The write performance curves demonstrate the performance scalability. These kinds of results depend on adequate network bandwidth between clients and the storage nodes. They also require a 2-level RAID striping pattern for large files to avoid network congestion [Nagle04].

For a large file, the system allocates parity groups of 8 to 11 storage nodes until all available storage nodes have been used. Approximately 1 GB of data is striped across each parity group before the file rotates to the next one; when all parity groups have been used, the file wraps around to the first group again. The system automatically selects the size of the parity group so that an integral number of groups fits onto the available storage nodes with the smallest unused remainder.
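
A rough sketch of that two-level map in hypothetical Python (parameters follow the description above; helper names are illustrative): parity groups are laid out over a per-file shuffle of the storage nodes, roughly 1 GB of file data goes to each group, and offsets wrap back to the first group after the last one.

    import random

    GROUP_DATA_BYTES = 1 * 2**30   # roughly 1 GB of file data per parity group

    def build_storage_map(storage_nodes, group_width):
        # Each file gets its own randomized assignment of storage nodes to groups.
        nodes = list(storage_nodes)
        random.shuffle(nodes)
        ngroups = len(nodes) // group_width
        return [nodes[i * group_width:(i + 1) * group_width] for i in range(ngroups)]

    def group_for_offset(storage_map, file_offset):
        # Groups are filled round-robin in ~1 GB units; the file wraps back to the
        # first group after the last one has been used.
        return storage_map[(file_offset // GROUP_DATA_BYTES) % len(storage_map)]

    # Example: 40 storage nodes with width-10 parity groups gives 4 groups; the
    # byte at offset 5 GB lands in the second group (5 mod 4 = 1).
    smap = build_storage_map(["sb%02d" % i for i in range(40)], group_width=10)
    print(len(smap), group_for_offset(smap, 5 * 2**30) == smap[1])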

Each file has its own mapping of parity groups to storage nodes, which diffuses load and reduces hot-spotting. OSDFS performs delayed block allocation for new data so writes can be batched and issued efficiently; thus new data and its associated metadata (e.g., block pointers and object descriptors) are written together efficiently. Read operations, in contrast, must seek to get their data because the data sets are created to be too large to fit in any cache.

While OSDFS does object-aware read ahead, as the number of concurrent read streams increases it becomes more difficult to optimize the workload, because the amount of read-ahead buffering available for each stream shrinks. Figure 3 charts iozone performance for a mixed workload; each client has its own file, so the working set size increases with more clients.

The vertical axis shows the throughput of the storage system, and the chart compares the different configurations as the number of clients increases from 1 to 6. In each case the system had 9 StorageBlade modules, so the aggregate cache on the StorageBlade modules was nine times the per-blade memory.

Obviously, the larger memory configuration is able to cache most or all of the working set with small numbers of clients. As the number of clients increases so that the working set size greatly exceeds the cache, the difference in cache size matters less.

RAID rebuild performance determines how quickly the system can recover data when a storage node is lost. Short rebuild times reduce the window in which a second failure can cause data loss.

There are three techniques to reduce rebuild times: reducing the size of the RAID parity group, declustering the placement of parity group elements, and rebuilding files in parallel using multiple RAID engines.

The rebuild bandwidth is the rate at which reconstructed data is written to the system when a storage node is being reconstructed. The system must read N times as much as it writes, depending on the width of the RAID parity group, so the overall throughput of the storage system is several times higher than the rebuild rate.

Thus, selection of the RAID parity group size is a tradeoff between capacity overhead, on-line performance, and rebuild performance. Understanding declustering is easier with a picture.

In Figure 4, each parity group has 4 elements, which are indicated by letters placed in each storage device; the groups are distributed among 8 storage devices. In the picture, capital letters represent those parity groups that all share the second storage node. If the second storage device were to fail, the system would have to read the surviving members of its parity groups to rebuild the lost elements.

Figure 4: Declustered parity groups. For this simple example you can assume each parity element is the same size so all the devices are filled equally.

In a real system, the component objects will have various sizes depending on the overall file size, although each member of a parity group will be very close in size. There will be thousands or millions of objects on each device, and the Panasas system uses active balancing to move component objects between storage nodes to level capacity. Declustering means that rebuild requires reading a subset of each device, with the proportion being approximately the same as the declustering ratio.
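
The arithmetic behind declustered rebuild can be sketched in a few lines of Python (illustrative only): for a parity group of width w, each lost element requires reading its w-1 surviving elements, and with a declustering ratio of w/d only about that fraction of each surviving device needs to be read.

    def rebuild_work(group_width, num_devices, lost_bytes):
        # RAID-5 must read the width-1 surviving elements to recompute each lost one.
        total_read = (group_width - 1) * lost_bytes
        # Declustering spreads those reads over the surviving devices, so only
        # roughly the declustering ratio (width / devices) of each device is read.
        per_device_read = total_read / float(num_devices - 1)
        declustering_ratio = group_width / float(num_devices)
        return total_read, per_device_read, declustering_ratio

    # Example in the spirit of Figure 4: width-4 groups on 8 devices. Rebuilding
    # 100 GB of lost data reads 300 GB in total, spread over the 7 survivors, or
    # roughly half of each device's content (declustering ratio 4/8 = 0.5).
    print(rebuild_work(4, 8, lost_bytes=100 * 2**30))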

The total amount of data read is the same with and without declustering, but with declustering it is spread out over more devices. When writing the reconstructed elements, two elements of the same parity group cannot be located on the same storage node. Having per-file RAID allows the Panasas system to divide the rebuild work among the available DirectorBlade modules by assigning different files to different DirectorBlade modules.

System Size. Figure 5 plots rebuild performance as the size of the storage cluster grows from 1 DirectorBlade module and 10 StorageBlade modules up to 12 DirectorBlade modules and 120 StorageBlade modules.

Each shelf has 1 DirectorBlade module and 10 StorageBlade modules. In this experiment, the system was populated with smaller (MB-scale) files or 1 GB files, and each glyph in the chart represents an individual test. The declustering ratio therefore shrinks as the system grows. Declustering and parallel rebuild give a nearly linear increase in rebuild performance as the system gets larger. The reduced performance at 8 and 10 shelves stems from a wider stripe size: the system automatically picks a stripe width from 8 to 11, maximizing the number of storage nodes used while leaving at least one spare location.

For example, in a single-shelf system with 10 StorageBlade modules and 1 distributed spare, the system will use a stripe width of 9.
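
A plausible reconstruction of that selection rule in Python (a sketch under the stated constraints, not the shipped algorithm): try widths from 8 to 11, require at least one unused node for the distributed spare, and keep the width that uses the most storage nodes.

    def default_group_width(num_storage_nodes, min_width=8, max_width=11):
        # Try each candidate width, keep at least one node free for the distributed
        # spare, and pick the width that uses the most storage nodes.
        best_width, best_used = None, -1
        for width in range(min_width, max_width + 1):
            if width > num_storage_nodes - 1:
                break
            groups = (num_storage_nodes - 1) // width   # whole groups, 1 node spared
            if groups * width > best_used:
                best_width, best_used = width, groups * width
        if best_width is None:
            # Very small BladeSets: use every node but the spare.
            best_width = max(2, num_storage_nodes - 1)
        return best_width

    # Reproduces the examples in the text: 10 nodes -> width 9, and 80 nodes ->
    # width 11 (7 groups cover 77 nodes, leaving 3; width 10 would leave no spare).
    print(default_group_width(10), default_group_width(80))   # -> 9 11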

Each file has its own spare location, which distributes the spares across the Bladeset. The system reserves capacity on each storage node to allow reconstruction.

With 80 storage nodes and 1 distributed spare, the system chooses a stripe width of 11 so that 7 parity groups fit, leaving 3 unused storage nodes; a width of 10 cannot be used because there would be no unused storage nodes left for the spare. Table 1 lists the default parity group width (i.e., the number of storage nodes spanned by a parity group) for different numbers of storage nodes in the BladeSet.

Table 1: Default Parity Group Size (columns: Storage Nodes, Group Width).

The variability in some of the results came from runs that used 1 GB files and multiple Volumes.

In this test, the number of files impacted by the storage node failure varied substantially among Volumes because only 30 GB of space was used on each storage node, and each metadata manager only had to rebuild between 25 and 40 files.

There is a small delay between the time a metadata manager completes its own Volume and the time it starts working for other metadata managers; as a result, not every metadata manager is fully utilized towards the end of the rebuild. When rebuilding a nearly full StorageBlade, this delay is insignificant, but in our tests it was large enough to affect the results.

Since we compute bandwidth by measuring the total rebuild time and dividing by the amount of data rebuilt, this uneven utilization skewed the results lower. We obtained higher throughput with less variability by filling the system with ten times as many smaller (MB-scale) files, which results in a more even distribution of files among Volume owners, or by using just a single Volume to avoid the scheduling issue.

Figure 6 shows the effect of RAID parity group width on the rebuild rate. If a parity stripe is 6-wide, then the 5 surviving elements are read to recompute the missing sixth element.

If a parity stripe is only 3-wide, then only 2 surviving elements are read to recompute the missing element. Even though the reads can be issued in parallel, a wider stripe consumes more memory bandwidth for the reads and requires more XOR work.

Therefore narrower parity stripes are rebuilt more quickly, and the experiment confirms this. We measured two systems of different sizes: in one configuration the maximum stripe width is 7 to allow for the spare, and in the other the maximum stripe width is 8. Rebuild bandwidth increases with narrower stripes because the system has to read less data to reconstruct the same amount of lost data.

The results also show that having more DirectorBlade modules increases the rebuild rate. They indicate that the rebuild performance of the large systems shown in Figure 5 could be much higher with 2 DirectorBlade modules per shelf, more than twice the performance shown, since those results used older, first-generation DirectorBlade modules.

There are several kinds of metadata in our system. These include the mapping from object IDs to sets of block addresses, the mapping from files to sets of objects, file system attributes such as ACLs and owners, and file system namespace information (i.e., directories).

One approach might be to pick a common mechanism, perhaps a relational database, and store all of this information using that facility. This shifts the issues of scalability, reliability, and performance from the storage system over to the database system. However, this makes it more difficult to optimize the metadata store for the unique requirements of each type of metadata. In contrast, we have provided specific implementations for each kind of metadata.

Our approach distributes the metadata management among the object storage devices and metadata managers to provide scalable metadata management performance, and allows selecting the best mechanism for each metadata type.

Block-level metadata is managed internally by OSDFS, our file system that is optimized to store objects. OSDFS uses a floating block allocation scheme where data, block pointers, and object descriptors are batched into large write operations. The write buffer is protected by the integrated UPS, and it is flushed to disk on power failure or system panics. Fragmentation was an issue in early versions of OSDFS that used a first-fit block allocator, but this has been significantly mitigated in later versions that use a modified best-fit allocator.
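
The allocator change can be illustrated with a toy free-extent allocator in Python (purely illustrative; OSDFS's actual data structures are not shown): first-fit takes the first extent large enough and tends to splinter the low end of the disk, while best-fit picks the smallest extent that satisfies the request and leaves large extents intact for the big, batched writes OSDFS prefers.

    # Toy free-extent allocators (illustrative only; not OSDFS code).
    # Each free extent is a (start_block, length) pair.

    def first_fit(free_extents, nblocks):
        for i, (start, length) in enumerate(free_extents):
            if length >= nblocks:
                _split(free_extents, i, start, length, nblocks)
                return start
        return None                        # no single extent is large enough

    def best_fit(free_extents, nblocks):
        # Choose the smallest extent that still satisfies the request.
        best = None
        for i, (start, length) in enumerate(free_extents):
            if length >= nblocks and (best is None or length < free_extents[best][1]):
                best = i
        if best is None:
            return None
        start, length = free_extents[best]
        _split(free_extents, best, start, length, nblocks)
        return start

    def _split(free_extents, i, start, length, nblocks):
        if length == nblocks:
            free_extents.pop(i)
        else:
            free_extents[i] = (start + nblocks, length - nblocks)

    # Example: best-fit leaves the 1000-block extent intact for a later large write.
    extents = [(0, 64), (100, 1000), (5000, 80)]
    print(best_fit(list(extents), 70), first_fit(list(extents), 70))   # -> 5000 100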

OSDFS stores higher level file system data structures, such as the partition and object tables, in a modified BTree data structure. Block-level metadata management consumes most of the cycles in file system implementations [Gibson97].

By delegating storage management to OSDFS, the Panasas metadata managers have an order of magnitude less work to do than the equivalent SAN file system metadata manager that must track all the blocks in the system. Above the block layer is the metadata about files.



