Designed to Fail

March 26th, 2014
Alan Shepard in capsule aboard Freedom 7 before launch, 1961. Alan Shepard became the first American in space on May 5, 1961. He launched aboard his Mercury-Redstone 3 rocket named Freedom 7; the suborbital flight lasted 15 minutes. (Photo credit: Wikipedia)

I’m writing this a little bit for my wife and non-technical friends. As I was thinking through this idea it occurred to me that it’s not particularly technical, and it’s a great way of explaining what I spend a lot of my time doing when I’m actually designing solutions.

Recently both my wife’s phone and mine have broken in various ways. She dropped hers and it’s never been the same since; mine had some dodgy hardware failure, and trying to fix it made it worse. The technology we buy is generally put together by the cheapest hardware provider, using the lowest-cost commodity technology (not always the case, but largely true). Mass production lowers costs, and so does cutting corners and reducing redundancy (multiple components that are there solely to take over when one fails).

In general home life most of us don’t bother with redundancy. Some of us may have more than one desktop PC or laptop, but most of us will only have one personal phone, one big-screen TV, one Blu-ray player, etc. We buy commodity and we accept that it is a commodity and prone to failure. Some of us don’t realise this and get upset when technology fails (perhaps a lot of us don’t realise this!), but when you remember that the technology you have isn’t necessarily the best (the best is expensive) but the cheapest, you’ll realise why it’s so prone to failure. This is why the home technology insurance industry is so strong!

General

DevOps vs IT-Ops

March 17th, 2014

I’ve spent the last few months closely monitoring the job boards, and because of my web development background I get flagged for development jobs. Interestingly, the vast majority of development roles seem to be pitched as DevOps roles. Initially this got me interested, as I’d be very keen to do a DevOps role (my dev skills are rusty, but I can do the Ops side pretty well). But it seems the majority of DevOps roles are simply Development roles with a bit of config management included, and the config management is code related, not infrastructure related.

If you look at the IT Operations side of things, these guys are getting more involved in automated builds, infrastructure configuration management and the ubiquitous immutable server concept. The problem is that there is significant cross-over in the tooling for DevOps and IT-Ops. If you’re looking at something like Chef, Puppet, Ansible or Salt, one of the key decision factors is whether you are a developer or an infrastructure person. Developers are more likely to understand GitHub repositories and workflows, while infrastructure guys are more likely to understand scripting an automated build. With the major infrastructure virtualisation vendors coming to the party with things like VMware’s Application Director and Data Director, as well as Microsoft’s App-Controller, this market is quickly becoming busy.

But the key question is still: are you a developer or an infrastructure person? Either an infrastructure person is building a template to hand over to development, or a developer is taking a pre-built template and automating their code over the top of it. What about DevOps then? At what point will the infrastructure operations team actually work closely with the development team? Maybe the question is closer to: at what point will the infrastructure team let the development team get closer to the infrastructure, and at what point will the development team let the infrastructure team get closer to their code? There are still too many arguments one way or the other (your code isn’t optimised for a virtual stack, your infrastructure isn’t dynamic enough for our code, etc.).

General

Explain Snapshots

March 11th, 2014

This seems to be a popular search term, so I think it’s worth covering. It is covered in my old top post about Fractional Reservation, but I’ll cover the alternatives here as well.


NetApp snapshots used to be pretty unique in the industry, but the industry term for this technology is now generally Append-on-Write / Redirect-on-Write (new writes are appended to the “end”, or redirected to free blocks, depending on how you look at it) and quite a few vendors do it this way. Put very simply, all new data is written to new (zeroed) blocks on disk. This does mean that snapshot space has to be logically in the same location as the production data, but that really shouldn’t be a problem with wide-striping / aggregates / storage pools (pick preferred vendor term).

When a snapshot is taken, the inode table is frozen and copied. The inode table points to the data blocks, and these data blocks now become fixed. As the active filesystem “changes” blocks, the changes actually get written to new locations on disk, so there is no overhead to the write (the new blocks are already zeroed). In other technologies (not NetApp) this also forms the basis of automated tiering: once data is “locked” by a snapshot it will never be over-written, so it can safely be tiered out of SSD or even SAS, as read performance is rarely an issue. NetApp use FlashPools to augment this, and a snapshot is a trigger for data to be tiered out of FlashPools as it’ll never be “overwritten”.
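To make the mechanism above a bit more concrete, here is a minimal Python sketch of redirect-on-write. It is a toy model and not how any vendor actually implements it: the class name `RedirectOnWriteVolume`, the dictionary standing in for the inode table and the list standing in for disk blocks are all assumptions made purely for illustration.

```python
class RedirectOnWriteVolume:
    """Toy redirect-on-write volume (illustrative only, not a vendor design)."""

    def __init__(self):
        self.blocks = []      # "disk": blocks are only ever appended, never overwritten
        self.active = {}      # stand-in inode table: logical block -> index into self.blocks
        self.snapshots = {}   # snapshot name -> frozen copy of the inode table

    def write(self, logical_block, data):
        # Every write lands in a new block; the active table is simply re-pointed.
        self.blocks.append(data)
        self.active[logical_block] = len(self.blocks) - 1

    def snapshot(self, name):
        # Taking a snapshot copies the pointer table only, never the data blocks.
        self.snapshots[name] = dict(self.active)

    def read(self, logical_block, snapshot=None):
        table = self.active if snapshot is None else self.snapshots[snapshot]
        return self.blocks[table[logical_block]]


vol = RedirectOnWriteVolume()
vol.write(0, "original data")
vol.snapshot("before-change")      # hypothetical snapshot name
vol.write(0, "changed data")       # redirected to a new block
print(vol.read(0))                              # active filesystem view
print(vol.read(0, snapshot="before-change"))    # frozen snapshot view
```

The point of the sketch is that the snapshot costs one copy of the pointer table, and the old block stays exactly where it was because nothing ever writes over it.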

Web Searches

VMware CPU Ready Time

March 11th, 2014

I have been surprised that recently this has come back to haunt me as an issue, and a major one at that.

So what’s the issue? Well, long story short, if you starve your virtual estate of CPU resources you’ll get CPU ready-state issues. Broadly this has one of 2 causes: you’ve over-committed your CPU resources (consolidation ratio is too high), or your virtual machines are sized too big (and their workload is too high).

VMware vSphere is very clever with its CPU virtualisation. In order to allow multiple virtual machines to share the same CPU space, it schedules them in and out. Needless to say this happens very quickly, and generally speaking the only thing you’ll notice is that you consume very little CPU and have a very high consolidation ratio. The problem really occurs with large VMs (4+ vCPUs). vSphere needs to be a lot more intelligent about these, as all vCPUs need to be scheduled at the same time, or skewed slightly (part of the relaxed co-scheduling in 5.0+). The window of opportunity to schedule these gets narrower the more vCPUs you assign, so a 4 vCPU machine needs to wait for 4 logical cores to be available (hyper-threaded cores count as individual logical cores), and an 8 vCPU machine needs to wait for 8. The busier a vSphere host is, the longer the queue for CPU resources may be and the harder it is to schedule all the vCPUs. While a machine is waiting for CPU resources to be available, it is in a ready state (meaning it has CPU transactions to process, but can’t as no resources are available). Relaxed co-scheduling means it doesn’t always have to wait for all vCPUs to be scheduled at the same time on logical cores, but it’s a useful rule of thumb when sizing.
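The rule of thumb above can be illustrated with a small Python sketch. To be clear, this is a deliberately naive model of strict co-scheduling and not the real vSphere scheduler (which uses relaxed co-scheduling, shares, priorities and more). The function name `simulate_ready_ticks` and the rotating start order are my own assumptions: every VM wants to run on every tick, and a VM only runs if enough logical cores are still free for all of its vCPUs.

```python
def simulate_ready_ticks(logical_cores, vm_vcpus, ticks):
    """Count, per VM, the ticks spent waiting for enough free logical cores.

    Toy model only: every VM wants to run on every tick, and a VM is
    scheduled only if all of its vCPUs fit in the cores still free.
    """
    ready = [0] * len(vm_vcpus)
    count = len(vm_vcpus)
    for tick in range(ticks):
        free = logical_cores
        start = tick % count              # rotate who gets first pick each tick
        for offset in range(count):
            vm = (start + offset) % count
            if vm_vcpus[vm] <= free:
                free -= vm_vcpus[vm]      # all vCPUs scheduled together
            else:
                ready[vm] += 1            # VM sits in a ready state this tick
    return ready


# One 8 vCPU VM and four 2 vCPU VMs on a host with 8 logical cores (2:1 over-commit).
print(simulate_ready_ticks(8, [8, 2, 2, 2, 2], 5))   # -> [4, 1, 1, 1, 1]
```

Even in this crude model the wide VM spends four ticks out of five waiting, while each small VM waits only once, which is the behaviour you tend to see when an over-sized VM lands on a busy host.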

General

New beginnings

March 11th, 2014

First new post in what, 2-3 years? The site still performs admirably, and now that I’ve moved roles I figure it’s time to re-visit my old flames. My home lab needs a re-build and upgrade and that should get documented!

So I’ve moved into the big scary world of contract-based work. I started my first role a couple of weeks ago, and so far it’s going great. I want to keep a record of my exploits and the challenges real customers are facing, and share some of my generic musings. The site will be less NetApp centric, but I still have my roots in storage!

My first role involved a lot of interesting challenges, but there’s some great technology available here too. A strong DevOps team (that need help integrating the Ops bit, doesn’t everyone?), lots of Big Data challenges, and an immediate project to look at creating a much more responsive infrastructure, including where cloud services fit in. I started life as a web developer, and it’s great being back at a dotcom company and seeing how the challenges have evolved.

General

NetApp Debuts OnCommand Performance Manager

March 3rd, 2014

NetApp last week released OnCommand Performance Manager 1.0 Release Candidate 1 (RC1) to all NetApp customers and partners. This new software provides performance management, troubleshooting, and event notification for systems running clustered Data ONTAP 8.2 and 8.2.1.

Performance Manager 1.0 RC1 is deployed, not installed, as a virtual appliance within VMware ESX or ESXi. A virtual appliance is a prebuilt software bundle, containing an operating system and software applications that are integrated, managed, and updated as a package. This software distribution method simplifies what would be an otherwise complex installation process.

Upon deployment, the Linux 2.6.32-based virtual appliance creates a virtual machine containing the user software, third-party applications, and all configuration information pre-installed on the virtual machine. Much of the virtual appliance middleware is built primarily with Java and includes several open-source components – most notably from (but not limited to) the Apache Software Foundation, the Debian Project, and the Free Software Foundation.

Sizing Performance Manager is based upon a number of factors: the number of clustered Data ONTAP clusters, maximum number of nodes in each cluster, and maximum number of volumes on any node in a cluster.

In order to meet the official supportability status from NetApp, Performance Manager 1.0 RC1 requires 12GB of (reserved) memory, 4 virtual CPUs, and a total of 9572 MHz of (reserved) CPU. This qualified configuration meets minimum levels of acceptable performance and configuring these settings smaller than specified is not supported. Interestingly, increasing any of these resources is permitted – but not recommended – as doing so provides little additional value.
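As a rough illustration of those minimums, the following Python sketch checks a proposed virtual appliance configuration against the figures quoted above. The function name `sizing_problems` and its parameters are hypothetical; this is not a NetApp tool or API, just the three numbers from this post turned into a check.

```python
# Minimums quoted above for Performance Manager 1.0 RC1.
MIN_RESERVED_MEMORY_GB = 12
MIN_VCPUS = 4
MIN_RESERVED_CPU_MHZ = 9572   # works out to 2393 MHz per vCPU across 4 vCPUs


def sizing_problems(reserved_memory_gb, vcpus, reserved_cpu_mhz):
    """Return a list of reasons the configuration falls below the minimums."""
    problems = []
    if reserved_memory_gb < MIN_RESERVED_MEMORY_GB:
        problems.append(f"reserved memory {reserved_memory_gb}GB < {MIN_RESERVED_MEMORY_GB}GB")
    if vcpus < MIN_VCPUS:
        problems.append(f"{vcpus} vCPUs < {MIN_VCPUS}")
    if reserved_cpu_mhz < MIN_RESERVED_CPU_MHZ:
        problems.append(f"reserved CPU {reserved_cpu_mhz} MHz < {MIN_RESERVED_CPU_MHZ} MHz")
    return problems


print(sizing_problems(12, 4, 9572))   # meets the minimums, so no problems
print(sizing_problems(8, 2, 4000))    # undersized on all three counts
```

Anything above the minimums passes the check, which matches the statement that larger settings are permitted even if they add little value.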

In fact, according to December 2013 AutoSupport data from NetApp, most customers should expect to deploy a single Performance Manager virtual appliance, as one instance will be suitable for 95% of all currently deployed clustered Data ONTAP systems.

Geek ONTAP

NetApp Unveils FAS8000

February 19th, 2014

NetApp today launched the FAS8000 Series, its latest enterprise platform for shared infrastructure, with three new models: FAS8020, FAS8040, and FAS8060, which replace the FAS/V3220, FAS/V3250, and FAS/V6220, respectively. This new line will initially ship with Data ONTAP 8.2.1 RC2, supporting either 7-Mode or clustered Data ONTAP.

All systems are available in either standalone or HA configurations within a single chassis. All standalone FAS8000 controller configurations can have a second controller (of the same model) added to the chassis to become HA.

The new FAS8000 has been qualified with the DS2246, DS4246, DS4486, DS4243, DS14mk4, and the DS14mk2-AT disk shelves with IOM6, IOM3, ESH4, and AT-FCX shelf modules. Virtualized storage from multiple vendors can also be added to the FAS8000 — without a dedicated V-Series “gateway” system — with the new “FlexArray” software feature.

NetApp will not offer a separate FlexCache model for the FAS8000 Series.

Let’s explore the technical details of each one of these new storage systems.

The 3U form factor FAS8020 (codenamed: “Buell”) is targeted towards mid-size enterprise customers with mixed workloads. Each Processor Control Module (PCM) includes a single-socket, 2.0 GHz Intel E5-2620 “Sandy Bridge-EP” processor with 6 cores (12 per HA pair), an Intel Patsburg-J SouthBridge, and 24GB of DDR3 physical memory (48GB per HA pair).

NetApp supports single and dual controller configurations in one chassis, but unlike previous systems, I/O Expansion Module (IOXM) configurations are not supported. The increased mix of high-performance on-board ports and the flexibility offered by the new Unified Target Adapter 2 (UTA2) ports reduce the need for higher slot counts on the FAS8000 series.

Geek ONTAP, General

Super Storage: A Fan’s View of the NFL’s Data Storage

February 11th, 2014

Like most Americans, I recently watched the biggest, boldest, and coldest event in American football: Super Bowl XLVIII with 112.2 million of my closest “friends”.

But even if you didn’t get excited about the big game, you might still be interested to learn about the role of data storage for the most-watched television program in American history.

During the week leading up to the Super Bowl, I had the privilege to help ring the opening bell at the NASDAQ MarketSite in New York City — and what an experience! I also had the opportunity to chat with the NFL’s Director of Information Technology, Aaron Amendolia, to explore how they leverage NetApp storage systems for data management.

It starts with 40 NetApp FAS2200 Series storage systems that store, protect, and serve data to all 32 NFL teams, thousands of personnel, and millions of fans. For example:

Want player stats during the game? All game play raw data is instantly available and served by NetApp storage systems.

Like those action shots? Television and newspaper photographers take hundreds of thousands of photos and videographers capture high-definition video of regular-season games, the playoffs, and the Super Bowl – all stored on NetApp storage systems.

See someone wearing a badge? NetApp provides the infrastructure that supports security credentialing for everyone from hot dog vendors to the NFL commissioner.

I also learned that the NFL leverages the entire protocol stack (both SAN and NAS), with over 90% of their infrastructure running virtual machines on NetApp storage systems.

Yet, every Super Bowl is unique.

The NFL’s end-users are often located in hotels with high-latency connections; hardware is subjected to harsh environments usually not found within most datacenters (soda can spills, dirt, grit, etc.). The good news is that SnapMirror, the replication software built into Data ONTAP, allows the NFL to failover in the event of a problem.

Geek ONTAP, General

NetApp Releases Flash Accel 1.3.0

January 30th, 2014

NetApp today announced the availability of Flash Accel 1.3.0, its server-side software that turns supported server flash into a cache for the backend Data ONTAP storage systems.

Coherency, Persistence
As with previous releases, Flash Accel 1.3.0 detects and corrects for coherency at the block level, rather than flushing the entire cache. Flushing the entire cache may be good in that it leaves no data coherency issues, but it is terrible for performance. Flash Accel cache invalidation corrects the cache while keeping it persistent.

In addition to intelligent data coherency, Flash Accel also provides persistence across VM / server reboots.

The benefit of both intelligent data coherency and persistence is to ensure both that the cache delivers peak performance (i.e. when the cache is warm) and that peak performance lasts as long as possible (by keeping the cache warm for as long as possible).

Side note: Flash Accel code manages server cache, accelerating access to data that is stored and managed by Data ONTAP on the storage system.  Flash Accel is NOT Data ONTAP code.
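The difference between block-level invalidation and a full flush can be sketched in a few lines of Python. This is a hypothetical illustration only: the class `ServerSideCache`, the per-block version counters and the method names are assumptions, and none of it reflects how Flash Accel is actually implemented.

```python
class ServerSideCache:
    """Toy server-side block cache (illustrative only, not Flash Accel code)."""

    def __init__(self):
        self._entries = {}   # block number -> (version, data)

    def put(self, block, version, data):
        self._entries[block] = (version, data)

    def get(self, block):
        entry = self._entries.get(block)
        return None if entry is None else entry[1]

    def invalidate_stale(self, backend_versions):
        """Drop only blocks whose cached version no longer matches the backend."""
        stale = [block for block, (version, _) in self._entries.items()
                 if backend_versions.get(block) != version]
        for block in stale:
            del self._entries[block]
        return sorted(stale)

    def flush(self):
        """Drop everything: trivially coherent, but the cache goes cold."""
        dropped = len(self._entries)
        self._entries.clear()
        return dropped

    def __len__(self):
        return len(self._entries)


cache = ServerSideCache()
for block in (1, 2, 3):
    cache.put(block, version=1, data=f"block-{block}")

# Assume block 2 was rewritten on the storage system, so its backend version moved on.
print(cache.invalidate_stale({1: 1, 2: 2, 3: 1}))   # -> [2]
print(len(cache))                                   # -> 2, the rest of the cache stays warm
```

With a full flush all three blocks would have to be re-read from the backend; with block-level invalidation only the one that actually changed does.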

What’s New
Flash Accel 1.3.0 adds the following features and functionalities:

  • Support for Windows Server bare metal caching:
    • Windows 2008 R2, Windows 2012, and Windows 2012 R2
    • FC and iSCSI support for bare metal
    • Clustered apps supported (cold cache on failover)
  • Support for Windows 2012 and 2012 R2 VMs and vSphere 5.5
    • Note: for use of Flash Accel with Flash Accel Management Console (FAMC), vSphere 5.5 support will be added within weeks of general availability of Flash Accel 1.3
  • Up to 4TB of cache per server
  • Support for sTEC PCI-e Accelerator
    • Note: For VMware environments, Flash Accel 1.3.0 is initially only available for use with FAMC. 1.3.0 support for use with NetApp Virtual Storage Console (VSC) will be available when VSC 5.0 releases

Geek ONTAP, General

Snap Creator Deep Dive

December 30th, 2013

NetApp Snap Creator Framework is data protection software that integrates NetApp features with third-party applications, databases, hypervisors, and operating systems.

Snap Creator was originally developed in October 2007 by NetApp Professional Services and Rapid Response Engineering to reduce (or even eliminate) scripting. Nowadays, Snap Creator is a fully supported software distribution available from the NetApp Support Site.

The Snap Creator Team provides two versions of Snap Creator: a community version and an official NetApp release. The community version includes the latest plug-ins, enhancements, and features but is not supported by NetApp Support. The NetApp Version is fully tested and supported, but does not include the latest plug-ins, features, and enhancements.

Let’s explore the architecture of the Snap Creator 4.1 Community Release, which came out in November 2013.

Snap Creator Server Architecture
The Snap Creator Server is normally installed on a centralized server. It includes a Workflow Engine, which is a multi-threaded, XML-driven component that executes all Snap Creator commands.

Both the Snap Creator GUI and CLI, as well as third-party solutions (such as PowerShell Cmdlets), leverage the Snap Creator API. For example, NetApp Workflow Automation can leverage PowerShell Cmdlets to communicate with Snap Creator.

To store its configurations, Snap Creator includes configuration files and profiles in its Repository; this includes global configs and profile-level global configs. If you’re familiar with previous versions of Snap Creator Server, one of the new components is the Extended Repository. This extension provides a database location for every job, imported information about jobs, and even plug-in metadata.

For persistence, the Snap Creator Database stores details on Snap Creator schedules and jobs, as well as RBAC users and roles.

Geek ONTAP, General

This site is not affiliated with or sponsored in any way by NetApp or any other company mentioned within.