ZFS
Zettabyte File System
So, what is ZFS?
It's a filesystem developed by Sun Microsystems before the company was acquired by Oracle. It's not just another enhancement of UFS or SVM; it was developed from the ground up with ease of administration, data integrity and high scalability in mind.
Boy, did these guys deliver! They have created a filesystem that's really easy to use, with so many features it would make your head spin. I will quickly run through some of them before I show you some easy examples of how to create filesystems with this software.
ZFS uses a pooled storage model where filesystems are created from a pool. You can literally create thousands of filesystems from a single pool. A pool is basically a set of disks configured with or without redundancy. It's similar to raid 5 and mirroring, but more robust and easier to manage.
Data is written in a copy-on-write fashion, so snapshots are instant and the on-disk state is always valid.
It also check-sums the data to prevent silent data corruption. Traditional volume managers do not check the data written to the disks. If you mirror data with a traditional volume manager, you are protected if a disk or disks fail. But what about the data itself? It might be corrupt on one side of the mirror, and when you eventually read from that side, you get corrupted data and have a bad day.
If the data is check-summed and corruption is found on one side of the mirror, ZFS knows that it has another copy and reads the data from that copy instead. Nice.
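You can even ask ZFS to walk through all the data in a pool and verify these checksums on demand. A minimal sketch, assuming a pool named mypool already exists (we create one further down this page):
bash-3.00# zpool scrub mypool
bash-3.00# zpool status mypool
The status output shows when the scrub finished and whether any errors were found and repaired.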
It also uses a data protection model called raid-z. It's similar to raid 5, but it uses variable stripe widths so that every write is a full-stripe write. This eliminates the read-modify-write overhead associated with raid 5, which means better performance on plain disks that don't have the battery-backed cache of a hardware raid controller.
You can create snapshots and clones. Snapshots are read-only whereas clones are writable. You can also create incremental snapshots.
There is built-in compression that you can enable or disable per filesystem, plus de-duplication.
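Both are simple property changes on a filesystem. A hedged sketch, assuming a pool called mypool with a filesystem vol1, and assuming your ZFS version is new enough for de-duplication (it only appeared in later pool versions, so plain Solaris 10 may not have it):
bash-3.00# zfs set compression=on mypool/vol1
bash-3.00# zfs set dedup=on mypool/vol1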
It provides file sharing capabilities like NFS, CIFS and iSCSI to mention just a few. These can be easily configured via the command line.
There are no complicated vfstab files to update when you create a filesystem. In fact, you don't have to update any files when creating a filesystem.
It does not use complicated disk partitions that take hours of your time to plan and create, after which you still need to run newfs on the partition to make it usable. You just create the pool and then create your filesystems, thousands of them if you like.
You can also easily add additional capacity to your pools and filesystems.
You can also easily export and import pools between systems if the storage is shared between them. This is great for clusters and for moving pools between systems.
These are just a few of the great features you get from ZFS. The best part is, it's free! Yep, you heard right, free. After you have installed Solaris or OpenSolaris, you can start creating pools without needing a license. Other vendors charge a lot of money for some of these features, and ZFS is open source.
I will now show you some examples of how to create pools and filesystems. I will keep it simple and to the point. I don't want to make it complicated, because that's not the idea of this site or of the filesystem.
Let me explain the setup. I have Solaris 10 running in a VirtualBox VM. The cool thing about ZFS is that you can use normal files as storage disks in a pool. This is nice if you just want to quickly test something to see if it works.
If you use physical disks, it's recommended that you use the entire disk when creating the zpool.
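On real hardware that would look something like the line below. The device name c1t1d0 is just a made-up example; use the disks that the format command reports on your own system. ZFS labels the whole disk for you:
bash-3.00# zpool create mypool c1t1d0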
I'll create 4 files of 200Mbyte each and use them for my pools. Later I will add some more "disks".
For now 4 is all we need.
bash-3.00# mkfile 200m /disk1
bash-3.00# mkfile 200m /disk2
bash-3.00# mkfile 200m /disk3
bash-3.00# mkfile 200m /disk4
Let's start by creating a pool containing only one disk. This is just to show you how easy it is to create a pool.
bash-3.00# zpool create mypool /disk1
bash-3.00# zpool list
NAME SIZE ALLOC FREE CAP HEALTH ALTROOT
mypool 195M 95.5K 195M 0% ONLINE -
bash-3.00# zpool status
pool: mypool
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
/disk1 ONLINE 0 0 0
errors: No known data errors
I used the zpool create command to create the pool. You specify the pool name, in this example mypool, followed by the devices to use for the pool. I used the file /disk1.
To list the pools, the zpool list command was used. It gives us information about the pools created on the system.
The zpool status command gives us a more detailed display of the pools. It shows the actual disks that each pool consists of.
We created the pool with one disk, so this is basically a simple volume with no redundancy in ZFS.
Let's make it redundant by attaching another disk to the pool mypool. Notice the size of mypool in the above zpool list output. It is 195Mbyte. Let's see what happens if we attach another disk.
bash-3.00# zpool attach mypool /disk1 /disk2
bash-3.00# zpool list
NAME SIZE ALLOC FREE CAP HEALTH ALTROOT
mypool 195M 165K 195M 0% ONLINE -
bash-3.00#
Hm, it's still 195Mbyte. Let's do a zpool status to see what happened.
bash-3.00# zpool status
pool: mypool
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Tue Jun 28 22:04:00 2011
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
/disk1 ONLINE 0 0 0
/disk2 ONLINE 0 0 0 75K resilvered
errors: No known data errors
bash-3.00#
Ah, it created a two-way mirror; the two disks are now mirrored. When you attach a disk, you must tell the zpool command which existing device you want to attach the new one to.
Before I forget: did you also notice that we did not run newfs or edit any vfstab files? The question is, can we use the pool? Of course we can! When you create a pool, ZFS also creates a filesystem that we can use immediately.
Let's do a df -h and zfs list and look at the output.
bash-3.00# df -h
Filesystem size used avail capacity Mounted on
/dev/dsk/c0d0s0 14G 5.2G 8.4G 39% /
/devices 0K 0K 0K 0% /devices
ctfs 0K 0K 0K 0% /system/contract
proc 0K 0K 0K 0% /proc
mnttab 0K 0K 0K 0% /etc/mnttab
swap 2.3G 1004K 2.3G 1% /etc/svc/volatile
objfs 0K 0K 0K 0% /system
swap 2.3G 80K 2.3G 1% /tmp
swap 2.3G 32K 2.3G 1% /var/run
mypool 163M 21K 163M 1% /mypool
bash-3.00# zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 84K 163M 21K /mypool
Yep, it's there and mounted under /mypool. We didn't have to edit any files or run any newfs commands. By default zpool create creates a filesystem and mounts it under the pool name. Later we will look at the zfs command in a bit more detail.
Remember, the whole idea behind ZFS is to create pools and then use the pools for filesystems.
Let's attach another disk to mypool.
bash-3.00# zpool attach mypool /disk1 /disk3
bash-3.00# zpool list
NAME SIZE ALLOC FREE CAP HEALTH ALTROOT
mypool 195M 176K 195M 0% ONLINE -
bash-3.00# zpool status
pool: mypool
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Tue Jun 28 22:14:31 2011
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
/disk1 ONLINE 0 0 0
/disk2 ONLINE 0 0 0
/disk3 ONLINE 0 0 0 85.5K resilvered
errors: No known data errors
The size stays 195Mbyte, but we now basically have a three-way mirror.
Let's detach a disk.
bash-3.00# zpool detach mypool /disk3
bash-3.00# zpool status
pool: mypool
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Tue Jun 28 22:14:31 2011
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
/disk1 ONLINE 0 0 0
/disk2 ONLINE 0 0 0
errors: No known data errors
All that happened is that we removed one of the mirror devices, and the pool is now a two-way mirror again.
I will now detach /disk2 and then use the add command to add a disk to the pool.
bash-3.00# zpool detach mypool /disk2
bash-3.00# zpool status
pool: mypool
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Tue Jun 28 22:14:31 2011
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
/disk1 ONLINE 0 0 0
errors: No known data errors
The mirror disappeared and now we have a normal simple volume.
Let's use the add command to add another disk to the pool and see what happens.
bash-3.00# zpool add mypool /disk2
bash-3.00# zpool list
NAME SIZE ALLOC FREE CAP HEALTH ALTROOT
mypool 390M 168K 390M 0% ONLINE -
bash-3.00# zpool status
pool: mypool
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Tue Jun 28 22:14:31 2011
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
/disk1 ONLINE 0 0 0
/disk2 ONLINE 0 0 0
errors: No known data errors
Look at the size: it's now 390Mbyte. So the add command added the disk to the pool and the size doubled in this case.
If we now try to detach a disk it should not work, because we are now sitting with a stripe. Let's try it.
bash-3.00# zpool detach mypool /disk2
cannot detach /disk2: only applicable to mirror and replacing vdevs
bash-3.00#
Yep, as expected. You cannot detach a disk from a pool that is not mirrored.
We will now destroy the pool and create mirrors and raid-z pools.
bash-3.00# zpool destroy mypool
We used the zpool destroy command to destroy the pool.
In the previous examples we created a zpool and then attached a disk to create a mirror. You can also create a mirrored pool directly from the command line; you don't have to use the attach method like we did previously.
Be careful when you create these mirrors, because specifying the devices incorrectly might not have the desired result.
Below I will use the mirror keyword when I create the pool. Compare the results of the two methods I used.
bash-3.00# zpool create mypool mirror /disk1 /disk2 /disk3 /disk4
bash-3.00# zpool list
NAME SIZE ALLOC FREE CAP HEALTH ALTROOT
mypool 195M 97K 195M 0% ONLINE -
bash-3.00# zpool status
pool: mypool
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
/disk1 ONLINE 0 0 0
/disk2 ONLINE 0 0 0
/disk3 ONLINE 0 0 0
/disk4 ONLINE 0 0 0
errors: No known data errors
Well, the mirror keyword did create a mirror, but it created a four-way mirror. That's not what I wanted. I wanted two two-way mirrors, which gives me more capacity along with redundancy. In other words, I wanted a raid 1+0.
Let me try again. First I will destroy the pool and recreate it.
bash-3.00# zpool destroy mypool
bash-3.00# zpool create mypool mirror /disk1 /disk2 mirror /disk3 /disk4
bash-3.00# zpool list
NAME SIZE ALLOC FREE CAP HEALTH ALTROOT
mypool 390M 97K 390M 0% ONLINE -
bash-3.00# zpool status
pool: mypool
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
/disk1 ONLINE 0 0 0
/disk2 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
/disk3 ONLINE 0 0 0
/disk4 ONLINE 0 0 0
errors: No known data errors
That's more like it. I now have something similar to a raid 1+0 setup.
How about a raid-z configuration? Raid-z is similar to raid 5. Again, I will destroy the current pool and recreate it as a raid-z.
bash-3.00# zpool destroy mypool
bash-3.00# zpool create mypool raidz /disk1 /disk2 /disk3
bash-3.00# zpool list
NAME SIZE ALLOC FREE CAP HEALTH ALTROOT
mypool 584M 176K 584M 0% ONLINE -
bash-3.00# zpool status
pool: mypool
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
/disk1 ONLINE 0 0 0
/disk2 ONLINE 0 0 0
/disk3 ONLINE 0 0 0
errors: No known data errors
bash-3.00#
The above command created a single parity raid-z pool. You can also create double and triple parity pools.
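The keywords for those are raidz2 and raidz3. I won't run it here, because my disks are already in use by the current pool, but the command has the same form; just note that raidz3 only exists in newer pool versions. A hedged sketch using our four file-backed disks:
bash-3.00# zpool create mypool raidz2 /disk1 /disk2 /disk3 /disk4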
You can also grow a raid-z pool by adding another raid-z set of disks, ideally with the same number of disks as the existing set.
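For example, if I had three more files, /disk5, /disk6 and /disk7 (I haven't created these, this is just to show the form of the command), a second raid-z set could be added like this, and the pool would then stripe across the two raid-z sets:
bash-3.00# zpool add mypool raidz /disk5 /disk6 /disk7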
Let's add a hotspare to the pool.
bash-3.00# zpool add mypool spare /disk4
bash-3.00# zpool status
pool: mypool
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
/disk1 ONLINE 0 0 0
/disk2 ONLINE 0 0 0
/disk3 ONLINE 0 0 0
spares
/disk4 AVAIL
errors: No known data errors
We can also remove a hotspare from the pool.
bash-3.00# zpool remove mypool /disk4
bash-3.00# zpool status
pool: mypool
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
/disk1 ONLINE 0 0 0
/disk2 ONLINE 0 0 0
/disk3 ONLINE 0 0 0
errors: No known data errors
Let's export and import the pool. You might want to do this if you have shared storage and you want to export the pool on one server and import it on another. Solaris Cluster uses this export and import mechanism.
It's also useful if you want to change the pool name.
bash-3.00# zpool export mypool
bash-3.00# zpool list
no pools available
bash-3.00# zpool import -d / mypool
bash-3.00# zpool list
NAME SIZE ALLOC FREE CAP HEALTH ALTROOT
mypool 585M 416K 585M 0% ONLINE -
bash-3.00#
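Renaming a pool is just an import under a new name. I won't do it here, because the rest of the examples use the name mypool, but a hedged sketch would look like this (newpool is just a name I made up):
bash-3.00# zpool export mypool
bash-3.00# zpool import -d / mypool newpool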
You can also get statistics on how a pool is being utilized with the zpool iostat command. Below is an example.
bash-3.00# zpool iostat -v mypool
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
mypool 370K 585M 0 2 20.5K 25.0K
//disk1 114K 195M 0 0 6.53K 8.34K
//disk2 122K 195M 0 0 6.53K 8.34K
//disk3 134K 195M 0 0 7.49K 8.34K
---------- ----- ----- ----- ----- ----- -----
As you can see, there is a lot you can do with pools. We looked at creating and managing pools; we still need to look at the filesystem side of ZFS, and this is where the real strength of the Zettabyte File System lies.
To display the current filesystems in the pools, we will use the zfs command.
bash-3.00# zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 79.5K 553M 21K /mypool
First of all, I want to create two filesystems just to show you what the command looks like.
bash-3.00# zfs create mypool/vol1
bash-3.00# zfs create mypool/vol2
bash-3.00# zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 138K 553M 23K /mypool
mypool/vol1 21K 553M 21K /mypool/vol1
mypool/vol2 21K 553M 21K /mypool/vol2
bash-3.00# df -h
Filesystem size used avail capacity Mounted on
/dev/dsk/c0d0s0 14G 5.6G 8.0G 42% /
/devices 0K 0K 0K 0% /devices
.
.
.
.
mypool 553M 23K 553M 1% /mypool
mypool/vol1 553M 21K 553M 1% /mypool/vol1
mypool/vol2 553M 21K 553M 1% /mypool/vol2
bash-3.00#
Very simple, and I did not have to edit any files to make the mounts permanent.
I'm not happy with the mountpoints. I want to change them to /data1 and /data2.
bash-3.00# zfs set mountpoint=/data1 mypool/vol1
bash-3.00# zfs set mountpoint=/data2 mypool/vol2
bash-3.00# zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 158K 553M 23K /mypool
mypool/vol1 21K 553M 21K /data1
mypool/vol2 21K 553M 21K /data2
bash-3.00# df -h
Filesystem size used avail capacity Mounted on
/dev/dsk/c0d0s0 14G 5.6G 8.0G 42% /
/devices 0K 0K 0K 0% /devices
.
.
.
.
mypool 553M 23K 553M 1% /mypool
mypool/vol1 553M 21K 553M 1% /data1
mypool/vol2 553M 21K 553M 1% /data2
bash-3.00#
This is just getting easier and easier. A simple command changed the mountpoint, and I did not even have to unmount and then remount.
If you look at the list output, both filesystems have all of the space in the pool, mypool, available to them. This could be a potential problem, because users might think they have all this space to themselves and could easily fill up the pool.
We will use quotas to limit the space that a filesystem may consume. Again, it's a very straightforward command.
bash-3.00# zfs set quota=200m mypool/vol1
bash-3.00# zfs set quota=100m mypool/vol2
bash-3.00# zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 156K 553M 21K /mypool
mypool/vol1 21K 200M 21K /data1
mypool/vol2 21K 100M 21K /data2
bash-3.00#
Now the available space is shown as 100Mbyte for /data2 and 200Mbyte for /data1. You can also use the reservation property to guarantee space for a filesystem.
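A reservation is the opposite of a quota: it guarantees space instead of capping it. A quick sketch, with an arbitrary value of 50 Mbyte:
bash-3.00# zfs set reservation=50m mypool/vol2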
We can use the get subcommand to check which options have been set locally on a filesystem.
bash-3.00# zfs get -r -s local -o name,property,value all mypool
NAME PROPERTY VALUE
mypool/vol1 quota 200M
mypool/vol1 mountpoint /data1
mypool/vol2 quota 100M
mypool/vol2 mountpoint /data2
bash-3.00#
I used the -o name,property,value option to output only these columns. If you don't use the -o option, all columns for all properties are displayed, which can be a bit overwhelming because there are a lot of properties on a filesystem.
You can also use compression on a filesystem. It's very easy to do; below is the command.
bash-3.00# zfs set compression=on mypool/vol1
bash-3.00# zfs get -r -s local -o name,property,value all mypool
NAME PROPERTY VALUE
mypool/vol1 quota 200M
mypool/vol1 mountpoint /data1
mypool/vol1 compression on
mypool/vol2 quota 100M
mypool/vol2 mountpoint /data2
bash-3.00#
Another cool feature is that you can share filesystems in many ways with ZFS. Some of the more popular types include NFS, CIFS and iSCSI. You only have to set sharenfs=on for NFS, sharesmb=on for CIFS or shareiscsi=on for iSCSI.
bash-3.00# zfs set sharenfs=on mypool/vol2
Very simple. You could, of course, specify all the options that go with normal NFS shares, such as permissions and user access.
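The sharenfs property accepts the same option string you would normally pass to the share command. A hedged sketch, where the hostname adminhost is made up:
bash-3.00# zfs set sharenfs='rw,root=adminhost' mypool/vol2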
The last feature I want to show you is the snapshot. Again, it's very simple to create and manage. I will use the vol1 filesystem as an example: I will create a file in it, snapshot the filesystem, and show you where the snapshot is kept.
bash-3.00# mkfile 100m /data1/myfile
bash-3.00# ls -l /data1
total 1
-rw------T 1 root root 104857600 Jun 29 21:13 myfile
bash-3.00# zfs snapshot mypool/vol1@snapit
bash-3.00# zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 164K 553M 21K /mypool
mypool/vol1 22K 200M 22K /data1
mypool/vol1@snapit 0 - 22K -
mypool/vol2 21K 100M 21K /data2
bash-3.00#
The syntax for a snapshot is as follows: you use the subcommand snapshot to say you want to create a snapshot, then the filesystem you want to snap, an @ character, and finally the name of the snapshot.
Hm, now what? How can I access the data in the snapshot? Snapshots live in the filesystem's .zfs/snapshot directory. Let's change directory into the snapshot.
bash-3.00# cd /data1/.zfs/snapshot/snapit
bash-3.00# ls -la
total 4
drwxr-xr-x 2 root root 3 Jun 29 21:13 .
dr-xr-xr-x 3 root root 3 Jun 29 21:14 ..
-rw------T 1 root root 104857600 Jun 29 21:13 myfile
Yep, there it is. Let's delete the file from the /data1 directory and see if it is still available in the snapshot.
bash-3.00# rm /data1/myfile
bash-3.00# pwd
/data1/.zfs/snapshot/snapit
bash-3.00# ls -la
total 4
drwxr-xr-x 2 root root 3 Jun 29 21:13 .
dr-xr-xr-x 3 root root 3 Jun 29 21:14 ..
-rw------T 1 root root 104857600 Jun 29 21:13 myfile
There it is! The file is still available in the snapshot. Let's use the rollback option and roll the filesystem back to its original state using the snapshot.
bash-3.00# zfs rollback mypool/vol1@snapit
bash-3.00# ls -l /data1
total 1
-rw------T 1 root root 104857600 Jun 29 21:13 myfile
bash-3.00#
Magic! Can you start to see the use for snapshots? You can now create a pool and use it for users' home directories, so every user has their own filesystem with quotas, compression, reservations, shares, etc...
The users can create their own snapshots and restore them as they like. For example, users can take snapshots of their home directories during the week and roll back to whatever day they desire. They can test software or patches for an application, or whatever they want to do.
They don't need to bug the administrator to make backups and then restore them if something goes wrong; they can do all of it themselves. This is a really cool feature of the Zettabyte filesystem. It's like the users' own incremental backup.
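If you also want an off-system copy, the difference between two snapshots can be sent as an incremental stream. A hedged sketch, where the snapshot names monday and tuesday and the path /backup/vol1.inc are made up for illustration:
bash-3.00# zfs send -i mypool/vol1@monday mypool/vol1@tuesday > /backup/vol1.inc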
Instead of snapshots, you can also create clones, which you can actually read and write to.
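A clone is always created from an existing snapshot. A minimal sketch using the snapshot from above (mypool/vol1_clone is just a name I chose):
bash-3.00# zfs clone mypool/vol1@snapit mypool/vol1_clone
The clone shows up in zfs list like any other filesystem, and you can read and write to it independently of the original.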
As you can see, the creators of the Zettabyte filesystem have really put in the effort to make it as simple as possible to create and maintain these filesystems.
I only showed you some of the things you can do; this page would just be too long if I showed you all of the features of the Zettabyte filesystem.
These examples should give you a good understanding of how the Zettabyte filesystem works. Use the man pages if you are unsure.
The best way to learn this is to play with it. Get Oracle VM VirtualBox, install Solaris or OpenSolaris, create your disks and experiment. It's not difficult, as you have seen from my examples above. Don't be scared, just go for it.