Tuesday, August 9, 2011

Managing Resources - Part II

This is a followup to my previous article, "Managing Resources - Part I". It's been half a year since I wrote that article, and the intervening months have been something of a mixed blessing. So, here are some thoughts on my attempt at managing my storage through complexity, and my thoughts going forward.

How Did It Go?

If you read part 1, you'll note I went with a big box: lots of CPU cores, RAM, and storage. I opted for complexity and multiple layers of abstraction to get all the features I wanted: ZFS by employing Nexenta and other Solaris-derived solutions, virtual machines to get the best of both worlds, and PCI passthrough to get a desktop experience on a virtualized server platform.

In short, this did not work out very well. :(

What Went Wrong

From the start, there were problems:

  • PCI passthrough only worked with a small subset of devices, and the devices it did work with were primarily storage controllers; and only specific, expensive ones at that. Display controllers were more or less out of the question, except perhaps as CUDA compute devices.
  • Without PCI passthrough for the storage controllers I actually had, the option of using Nexenta in an ESXi VM with direct access to the underlying hardware was a bust as well. Tests with Nexenta running on top of VMFS as presented through ESXi resulted in horrible performance numbers: think something along the lines of 5MB/second read/write.
  • Juggling 10-12 1TB hard drives proved to be a nightmare in a conventional tower case. Likewise, the heat and noise generated by the drives alone made running the server 24/7 a problematic proposition.
  • The issues were made worse by the fact that the OpenSolaris instance I was using had an advanced zpool version that nothing else could import at the time. That made importing and accessing the original data in my array from any other system a non-starter.

Where Are Things Now

In the end, I opted to get the OWC Elite Quad Pro and load it up with 4 x 1TB drives, set it up for RAID 5, and hook it up to the Airport Extreme to use as a Time Machine target for laptops and as personal dumping space while the big server is being debugged.

The main server is currently a mass of SATA and power cables. I've added 2 quad-port PCIe x4 SATA cards to the server to handle more drives. I'm planning on re-imaging the box, loading it up with Linux (Ubuntu or CentOS), and employing ZFSonLinux to implement ZFS storage pools and RAIDZ redundancy. This would remove the need for virtual machines, as I would get ZFS on Linux and would be able to run the applications I want natively.

I've also acquired two 60GB SSDs to serve as L2ARC cache and ZIL log devices for ZFS, to improve the performance of the system. The intent is to mirror the two SSDs and use them for both the caching and the logging.
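One wrinkle: as far as I know, ZFS only lets you mirror log devices; cache (L2ARC) devices are always added individually. So the likely setup is to partition each SSD, mirror the ZIL partitions, and add the L2ARC partitions separately. A rough sketch, assuming the pool ends up named pool1 and the SSDs show up as /dev/sdm and /dev/sdn (placeholder names):

    # Assumes each SSD has a small partition (1) for the ZIL and a larger
    # partition (2) for the L2ARC -- device names are placeholders.

    # Mirrored ZIL (separate intent log) across the two SSDs
    zpool add pool1 log mirror /dev/sdm1 /dev/sdn1

    # L2ARC cache devices are added individually (they cannot be mirrored)
    zpool add pool1 cache /dev/sdm2 /dev/sdn2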

What Needs To Get Done

The remaining roadblock is getting the data off the existing zpool (Solaris) and onto the new zpool (Linux). I'm thinking it will take the following form (a rough command sketch follows the list):
  1. Build the new Linux environment and its zpool (pool1).
  2. Export pool1 from the Linux environment.
  3. Boot into the old OpenSolaris environment and import all pools (pool0, pool1).
  4. While booted into OpenSolaris, zfs send from pool0 to pool1.
  5. Once completed, export pool1 from the OpenSolaris environment.
  6. Boot into the Linux environment and import pool1.
  7. Confirm the data is intact, then repurpose all disks from the original pool0 OpenSolaris environment.
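Roughly, the commands involved would look something like this (a sketch only; the snapshot name and exact receive flags are my assumptions):

    ## On Linux: after building pool1, hand it off
    zpool export pool1

    ## Booted into the old OpenSolaris environment
    zpool import pool0
    zpool import pool1

    # Snapshot everything in pool0 and replicate it into pool1
    zfs snapshot -r pool0@migrate
    zfs send -R pool0@migrate | zfs receive -d -F pool1

    # Hand pool1 back
    zpool export pool1

    ## Booted back into Linux
    zpool import pool1
    zfs list -r pool1    # sanity-check that the datasets came across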
The server currently has 14 SATA ports:
  • 6 on the motherboard :: mb0
  • 4 on quad SATA card 1 :: qs1
  • 4 on quad SATA card 2 :: qs2
After taking into account the 8 disks in the existing array, that leaves 6 ports available to build the new environment on. Building small RAIDZ groups, I'm planning on grouping the space as follows (a creation sketch follows the list):
  • RAIDZ :: mb0:2 + qs1:0 + qs2:0    [ + 2TB ]
  • RAIDZ :: mb0:3 + qs1:1 + qs2:1    [ + 2TB ]
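Something along these lines for the initial pool creation (the /dev/disk/by-id names below are placeholders for whatever disks end up on those ports):

    zpool create pool1 \
        raidz /dev/disk/by-id/ata-DISK_MB0_2 /dev/disk/by-id/ata-DISK_QS1_0 /dev/disk/by-id/ata-DISK_QS2_0 \
        raidz /dev/disk/by-id/ata-DISK_MB0_3 /dev/disk/by-id/ata-DISK_QS1_1 /dev/disk/by-id/ata-DISK_QS2_1

    zpool status pool1    # should show two 3-disk raidz1 vdevs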
This way, the six ports are spread across the motherboard and each of the 2 additional controllers. When the data is transferred over and the original 8 disks are freed up, they will be added to pool1 in a similar fashion (sketched after the list):
  • RAIDZ :: mb0:4 + qs1:2 + qs2:2   [ + 2TB ]
  • RAIDZ :: mb0:5 + qs1:3 + qs2:3   [ + 2TB ]
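Growing the pool with the freed-up disks should just be a matter of adding two more RAIDZ vdevs (again, placeholder device names):

    zpool add pool1 \
        raidz /dev/disk/by-id/ata-DISK_MB0_4 /dev/disk/by-id/ata-DISK_QS1_2 /dev/disk/by-id/ata-DISK_QS2_2
    zpool add pool1 \
        raidz /dev/disk/by-id/ata-DISK_MB0_5 /dev/disk/by-id/ata-DISK_QS1_3 /dev/disk/by-id/ata-DISK_QS2_3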
The remaining motherboard ports, mb0:0 and mb0:1, will be used for the SSDs (L2ARC and ZIL) as well as for booting the Linux OS.

The effective space will be 8TB of usable RAIDZ data: four 3-disk RAIDZ groups at 2TB usable each. However, it is 8TB spread out over 12 x 1TB drives vs the original 7TB spread out over 8 x 1TB drives. The arrangement also ensures that should one controller card fail, each RAIDZ group loses at most one disk, so the arrays remain recoverable.

Upgrading storage and/or resilvering a replacement disk would also be significantly faster, since each RAIDZ group comprises only 3 disks.

If/when I decide to upgrade to 2TB disks, I can do so 3 at a time. Ideally they would all be RAIDZ2 groups, perhaps with more disks per group, but funds are not infinite. :( What this configuration buys me is more IOPS across the disks, and hopefully better performance.
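The upgrade path for a single group would look something like this: replace the three 1TB disks one at a time, waiting for each resilver to finish, and let the vdev expand once all three are the larger size (device names are placeholders, and I'm assuming the autoexpand pool property here):

    zpool set autoexpand=on pool1

    zpool replace pool1 ata-OLD_1TB_A ata-NEW_2TB_A
    zpool status pool1               # wait for the resilver to complete
    zpool replace pool1 ata-OLD_1TB_B ata-NEW_2TB_B
    zpool replace pool1 ata-OLD_1TB_C ata-NEW_2TB_C
    # once all three members are 2TB, the RAIDZ group grows to ~4TB usable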

Future Upgrades & Potential Issues

Once this is up and running, I may bring the OWC Elite unit into the mix via an eSATA controller, with the 4 drives in the unit presented as individual drives. However, because of the way ZFS works, I would not be able to add a drive to each of the RAIDZs: ZFS cannot reshape the geometry of an existing RAIDZ vdev, nor can it shrink a zpool by removing a device from it. So adding storage in the future would require ever more disks/space to carry out reconfigurations, which makes using ZFS a bit problematic.

One Avenue To Upgrade Via Adding More Disks, While Retaining Redundancy

I suppose I _could_ swap out a disk from each RAIDZ group and replace it with a disk from the external quad enclosure, like so:

From:
  • RAIDZ :: mb0:2 + qs1:0 + qs2:0   [ + 2TB ]
  • RAIDZ :: mb0:3 + qs1:1 + qs2:1   [ + 2TB ]
  • RAIDZ :: mb0:4 + qs1:2 + qs2:2   [ + 2TB ]
  • RAIDZ :: mb0:5 + qs1:3 + qs2:3   [ + 2TB ]
To:
  • RAIDZ :: mb0:2 + es0:1 + qs2:0   [ + 2TB ]
  • RAIDZ :: mb0:3 + qs1:1 + es0:2   [ + 2TB ]
  • RAIDZ :: mb0:4 + qs1:2 + qs2:2   [ + 2TB ]
  • RAIDZ :: mb0:5 + qs1:3 + qs2:3   [ + 2TB ]
  • RAIDZ :: es0:0 + qs1:0 + qs2:1   [ + 2TB ]
  • + 1 x 1TB hot spare : es0:3
In this fashion, another 2TB of RAIDZ storage is added to the zpool, making it 10TB usable across 5 RAIDZ groups. The extra slot in the external SATA enclosure can house a hot spare, to allow for automatic rebuilding.
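The reshuffle itself would just be a series of zpool replace operations plus a hot spare at the end. A rough sketch (placeholder device names, and assuming the autoreplace pool property for the auto-rebuild behaviour):

    # Swap qs1:0 out of the first group for a disk in the external enclosure
    zpool replace pool1 ata-DISK_QS1_0 ata-DISK_ES0_1
    # ...wait for the resilver, then do the same for qs2:1 -> es0:2...

    # The two freed disks plus one enclosure bay become the fifth RAIDZ group
    zpool add pool1 raidz ata-DISK_ES0_0 ata-DISK_QS1_0 ata-DISK_QS2_1

    # The last enclosure bay becomes a hot spare, with automatic replacement
    zpool add pool1 spare ata-DISK_ES0_3
    zpool set autoreplace=on pool1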

Basically, if you want to add more physical storage devices, you need to re-juggle the disks so that the failure of any one controller or enclosure takes out no more than one disk from any RAIDZ group in the zpool.

In any case, storage and data management continues to be a pain on the home front. :)
