ZFS
Zettabyte File System
So, what is ZFS?
It's a filesystem developed by Sun Microsystems before the company was acquired by Oracle. It's not just another enhancement of UFS or SVM; it was developed from the ground up with ease of administration, data integrity and high scalability in mind.
Boy, did these guys deliver! They have created a filesystem that's really easy to use, with so many features it would make your head spin. I will quickly run through some of them before I show you some easy examples of how to create filesystems with this software.
ZFS uses a pooled storage model where filesystems are created from a pool. You can literally create thousands of filesystems from a single pool. A pool is basically a set of disks configured with or without redundancy. It's similar to raid 5 and mirroring, but more robust and easier to manage.
Data is written in a copy-on-write fashion, so snapshots are instant and the on-disk state is always valid.
It also check-sums the data to prevent silent data corruption. Traditional volume managers do not check the data written to the disks. If you mirror data with a traditional volume manager, you are protected if a disk or disks fail. But what about the data itself? It might be corrupt on one side of the mirror, and when you eventually read from that side, you get corrupted data and have a bad day.
If the data is check-summed and corruption is found on one side of the mirror, ZFS knows that it has another copy and reads the data from that copy instead. Nice.
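You can even ask ZFS to walk through all the data in a pool and verify these checksums on demand. A minimal sketch, assuming a pool named mypool already exists (we create one further down this page):
bash-3.00# zpool scrub mypool
bash-3.00# zpool status mypool
The status output shows when the scrub finished and whether any errors were found and repaired.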
It also uses a data protection model called raid-z. It's similar to raid 5, but it uses variable stripe widths so that every write is a full-stripe write. This eliminates the read-modify-write overhead associated with raid 5, which means better performance on plain disks that don't have the battery-backed cache of a hardware raid controller.
You can create snapshots and clones. Snapshots are read-only whereas clones are writable. You can also create incremental snapshots.
There is built-in compression that you can enable or disable per filesystem, plus de-duplication.
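Both are simple property changes on a filesystem. A hedged sketch, assuming a pool called mypool with a filesystem vol1, and assuming your ZFS version is new enough for de-duplication (it only appeared in later pool versions, so plain Solaris 10 may not have it):
bash-3.00# zfs set compression=on mypool/vol1
bash-3.00# zfs set dedup=on mypool/vol1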
It provides file sharing capabilities like NFS, CIFS and iSCSI to mention just a few. These can be easily configured via the command line.
There are no complicated vfstab files to update when you create a filesystem. In fact, you don't have to update any files when creating a filesystem.
It does not use complicated disk partitions that take hours of your time to plan and create, after which you still need to run newfs on the partition to make it usable. You just create the pool and then create your filesystems, thousands of them if you like.
You can also easily add additional capacity to your pools and filesystems.
You can also easily export and import pools between systems if the storage is shared between them. This is great for clusters and for moving pools between systems.
These are just a few of the great features you get from ZFS. The best part is, it's free! Yep, you heard right, free. After you have installed Solaris or OpenSolaris, you can start creating pools without needing a license. Other vendors charge a lot of money for some of these features, and ZFS is open source.
I will now show you some examples of how to create pools and filesystems. I will keep it simple and to the point. I don't want to make it complicated, because that's not the idea of this site or of the filesystem.
Let me explain the setup. I have Solaris 10 running in a VirtualBox VM. The cool thing about ZFS is that you can use normal files as storage disks in a pool. This is nice if you just want to quickly test something to see if it works.
If you use physical disks, it's recommended that you use the entire disk when creating the zpool.
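On real hardware that would look something like the line below. The device name c1t1d0 is just a made-up example; use the disks that the format command reports on your own system. ZFS labels the whole disk for you:
bash-3.00# zpool create mypool c1t1d0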
I'll create 4 files of 200Mbyte each and use them for my pools. Later I will add some more "disks".
For now 4 is all we need.
bash-3.00# mkfile 200m /disk1
bash-3.00# mkfile 200m /disk2
bash-3.00# mkfile 200m /disk3
bash-3.00# mkfile 200m /disk4
Let's start by creating a pool containing only one disk. This is just to show you how easy it is to create a pool.
bash-3.00# zpool create mypool /disk1
bash-3.00# zpool list
NAME SIZE ALLOC FREE CAP HEALTH ALTROOT
mypool 195M 95.5K 195M 0% ONLINE -
bash-3.00# zpool status
pool: mypool
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
/disk1 ONLINE 0 0 0
errors: No known data errors
I used the zpool create command to create the pool. You specify the pool name, in this example mypool, followed by the devices to use for the pool. I used the file /disk1.
To list the pools, the zpool list command was used. It gives us information about the pools created on the system.
The zpool status command gives us a more detailed display of the pools. It shows the actual disks that each pool consists of.
We created the pool with one disk, so this is basically a simple volume with no redundancy in ZFS.
Let's make it redundant by attaching another disk to the pool mypool. Notice the size of mypool in the above zpool list output. It is 195Mbyte. Let's see what happens if we attach another disk.
bash-3.00# zpool attach mypool /disk1 /disk2
bash-3.00# zpool list
NAME SIZE ALLOC FREE CAP HEALTH ALTROOT
mypool 195M 165K 195M 0% ONLINE -
bash-3.00#
Hm, it's still 195Mbyte. Let's do a zpool status to see what happened.
bash-3.00# zpool status
pool: mypool
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Tue Jun 28 22:04:00 2011
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
/disk1 ONLINE 0 0 0
/disk2 ONLINE 0 0 0 75K resilvered
errors: No known data errors
bash-3.00#
Ah, it created a two-way mirror; the two disks are now mirrored. When you attach a disk, you must tell the zpool command which existing device you want to attach the new one to.
Before I forget: did you also notice that we did not run newfs or edit any vfstab files? The question is, can we use the pool? Of course we can! When you create a pool, ZFS also creates a filesystem that we can use immediately.
Let's do a df -h and zfs list and look at the output.
bash-3.00# df -h
Filesystem size used avail capacity Mounted on
/dev/dsk/c0d0s0 14G 5.2G 8.4G 39% /
/devices 0K 0K 0K 0% /devices
ctfs 0K 0K 0K 0% /system/contract
proc 0K 0K 0K 0% /proc
mnttab 0K 0K 0K 0% /etc/mnttab
swap 2.3G 1004K 2.3G 1% /etc/svc/volatile
objfs 0K 0K 0K 0% /system
swap 2.3G 80K 2.3G 1% /tmp
swap 2.3G 32K 2.3G 1% /var/run
mypool 163M 21K 163M 1% /mypool
bash-3.00# zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 84K 163M 21K /mypool
Yep, it's there and mounted under /mypool. We didn't have to edit any files or run any newfs commands. By default zpool create creates a filesystem and mounts it under the pool name. Later we will look at the zfs command in a bit more detail.
Remember, the whole idea behind ZFS is to create pools and then use the pools for filesystems.
Let's attach another disk to mypool.
bash-3.00# zpool attach mypool /disk1 /disk3
bash-3.00# zpool list
NAME SIZE ALLOC FREE CAP HEALTH ALTROOT
mypool 195M 176K 195M 0% ONLINE -
bash-3.00# zpool status
pool: mypool
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Tue Jun 28 22:14:31 2011
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
/disk1 ONLINE 0 0 0
/disk2 ONLINE 0 0 0
/disk3 ONLINE 0 0 0 85.5K resilvered
errors: No known data errors
The size stays 195Mbyte, but we now basically have a three-way mirror.
Let's detach a disk.
bash-3.00# zpool detach mypool /disk3
bash-3.00# zpool status
pool: mypool
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Tue Jun 28 22:14:31 2011
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
/disk1 ONLINE 0 0 0
/disk2 ONLINE 0 0 0
errors: No known data errors
All that happened is that we removed one of the mirror devices, and the pool is now a two-way mirror again.
I will now detach /disk2 and then use the add command to add a disk to the pool.
bash-3.00# zpool detach mypool /disk2
bash-3.00# zpool status
pool: mypool
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Tue Jun 28 22:14:31 2011
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
/disk1 ONLINE 0 0 0
errors: No known data errors
The mirror disappeared and now we have a normal simple volume.
Let's use the add command to add another disk to the pool and see what happens.
bash-3.00# zpool add mypool /disk2
bash-3.00# zpool list
NAME SIZE ALLOC FREE CAP HEALTH ALTROOT
mypool 390M 168K 390M 0% ONLINE -
bash-3.00# zpool status
pool: mypool
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Tue Jun 28 22:14:31 2011
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
/disk1 ONLINE 0 0 0
/disk2 ONLINE 0 0 0
errors: No known data errors
Look at the size: it's now 390Mbyte. So the add command added the disk to the pool and the size doubled in this case.
If we now try to detach a disk it should not work, because we are now sitting with a stripe. Let's try it.
bash-3.00# zpool detach mypool /disk2
cannot detach /disk2: only applicable to mirror and replacing vdevs
bash-3.00#
Yep, as expected. You cannot detach a disk from a pool that is not mirrored.
We will now destroy the pool and create mirrors and raid-z pools.
bash-3.00# zpool destroy mypool
We used the zpool destroy command to destroy the pool.
In the previous examples we created a zpool and then attached a disk to create a mirror. You can also create a mirrored pool directly from the command line; you don't have to use the attach method like we did previously.
Be careful when you create these mirrors, because specifying the devices incorrectly might not have the desired result.
Below I will use the mirror keyword when I create the pool. Compare the results of the two methods I used.
bash-3.00# zpool create mypool mirror /disk1 /disk2 /disk3 /disk4
bash-3.00# zpool list
NAME SIZE ALLOC FREE CAP HEALTH ALTROOT
mypool 195M 97K 195M 0% ONLINE -
bash-3.00# zpool status
pool: mypool
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
/disk1 ONLINE 0 0 0
/disk2 ONLINE 0 0 0
/disk3 ONLINE 0 0 0
/disk4 ONLINE 0 0 0
errors: No known data errors
Well, the mirror keyword did create a mirror, but it created a four-way mirror. That's not what I wanted. I wanted two two-way mirrors, which gives me more capacity along with redundancy. In other words, I wanted a raid 1+0.
Let me try again. First I will destroy the pool and recreate it.
bash-3.00# zpool destroy mypool
bash-3.00# zpool create mypool mirror /disk1 /disk2 mirror /disk3 /disk4
bash-3.00# zpool list
NAME SIZE ALLOC FREE CAP HEALTH ALTROOT
mypool 390M 97K 390M 0% ONLINE -
bash-3.00# zpool status
pool: mypool
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
/disk1 ONLINE 0 0 0
/disk2 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
/disk3 ONLINE 0 0 0
/disk4 ONLINE 0 0 0
errors: No known data errors
That's more like it. I now have something similar to a raid 1+0 setup.
How about a raid-z configuration? Raid-z is similar to raid 5. Again, I will destroy the current pool and recreate it as a raid-z.
bash-3.00# zpool destroy mypool
bash-3.00# zpool create mypool raidz /disk1 /disk2 /disk3
bash-3.00# zpool list
NAME SIZE ALLOC FREE CAP HEALTH ALTROOT
mypool 584M 176K 584M 0% ONLINE -
bash-3.00# zpool status
pool: mypool
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
/disk1 ONLINE 0 0 0
/disk2 ONLINE 0 0 0
/disk3 ONLINE 0 0 0
errors: No known data errors
bash-3.00#
The above command created a single parity raid-z pool. You can also create double and triple parity pools.
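The keywords for those are raidz2 and raidz3. I won't run it here, because my disks are already in use by the current pool, but the command has the same form; just note that raidz3 only exists in newer pool versions. A hedged sketch using our four file-backed disks:
bash-3.00# zpool create mypool raidz2 /disk1 /disk2 /disk3 /disk4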
You can also grow a raid-z pool by adding another raid-z set of disks, ideally with the same number of disks as the existing set.
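For example, if I had three more files, /disk5, /disk6 and /disk7 (I haven't created these, this is just to show the form of the command), a second raid-z set could be added like this, and the pool would then stripe across the two raid-z sets:
bash-3.00# zpool add mypool raidz /disk5 /disk6 /disk7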
Let's add a hotspare to the pool.
bash-3.00# zpool add mypool spare /disk4
bash-3.00# zpool status
pool: mypool
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
/disk1 ONLINE 0 0 0
/disk2 ONLINE 0 0 0
/disk3 ONLINE 0 0 0
spares
/disk4 AVAIL
errors: No known data errors
We can also remove a hotspare from the pool.
bash-3.00# zpool remove mypool /disk4
bash-3.00# zpool status
pool: mypool
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
/disk1 ONLINE 0 0 0
/disk2 ONLINE 0 0 0
/disk3 ONLINE 0 0 0
errors: No known data errors
Let's export and import the pool. You might want to do this if you have shared storage and you want to export the pool on one server and import it on another. Solaris Cluster uses this export and import mechanism.
It's also useful if you want to change the pool name.
bash-3.00# zpool export mypool
bash-3.00# zpool list
no pools available
bash-3.00# zpool import -d / mypool
bash-3.00# zpool list
NAME SIZE ALLOC FREE CAP HEALTH ALTROOT
mypool 585M 416K 585M 0% ONLINE -
bash-3.00#
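Renaming a pool is just an import under a new name. I won't do it here, because the rest of the examples use the name mypool, but a hedged sketch would look like this (newpool is just a name I made up):
bash-3.00# zpool export mypool
bash-3.00# zpool import -d / mypool newpool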
You can also get statistics on how a pool is being utilized with the zpool iostat command. Below is an example.
bash-3.00# zpool iostat -v mypool
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
mypool 370K 585M 0 2 20.5K 25.0K
//disk1 114K 195M 0 0 6.53K 8.34K
//disk2 122K 195M 0 0 6.53K 8.34K
//disk3 134K 195M 0 0 7.49K 8.34K
---------- ----- ----- ----- ----- ----- -----
As you can see, there is a lot you can do with pools. We looked at creating and managing pools; we still need to look at the filesystem side of ZFS, and this is where the real strength of the Zettabyte File System lies.
To display the current filesystems in the pools, we will use the zfs command.
bash-3.00# zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 79.5K 553M 21K /mypool
First of all, I want to create two filesystems just to show you what the command looks like.
bash-3.00# zfs create mypool/vol1
bash-3.00# zfs create mypool/vol2
bash-3.00# zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 138K 553M 23K /mypool
mypool/vol1 21K 553M 21K /mypool/vol1
mypool/vol2 21K 553M 21K /mypool/vol2
bash-3.00# df -h
Filesystem size used avail capacity Mounted on
/dev/dsk/c0d0s0 14G 5.6G 8.0G 42% /
/devices 0K 0K 0K 0% /devices
.
.
.
.
mypool 553M 23K 553M 1% /mypool
mypool/vol1 553M 21K 553M 1% /mypool/vol1
mypool/vol2 553M 21K 553M 1% /mypool/vol2
bash-3.00#
Very simple, and I did not have to edit any files to make the mounts permanent.
I'm not happy with the mountpoints. I want to change them to /data1 and /data2.
bash-3.00# zfs set mountpoint=/data1 mypool/vol1
bash-3.00# zfs set mountpoint=/data2 mypool/vol2
bash-3.00# zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 158K 553M 23K /mypool
mypool/vol1 21K 553M 21K /data1
mypool/vol2 21K 553M 21K /data2
bash-3.00# df -h
Filesystem size used avail capacity Mounted on
/dev/dsk/c0d0s0 14G 5.6G 8.0G 42% /
/devices 0K 0K 0K 0% /devices
.
.
.
.
mypool 553M 23K 553M 1% /mypool
mypool/vol1 553M 21K 553M 1% /data1
mypool/vol2 553M 21K 553M 1% /data2
bash-3.00#
This is just getting easier and easier. A simple command changed the mountpoint, and I did not even have to unmount and then remount.
If you look at the list output, both filesystems have all of the space in the pool, mypool, available to them. This could be a potential problem, because users might think they have all this space to themselves and could easily fill up the pool.
We will use quotas to limit the space that a filesystem may consume. Again, it's a very straightforward command.
bash-3.00# zfs set quota=200m mypool/vol1
bash-3.00# zfs set quota=100m mypool/vol2
bash-3.00# zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 156K 553M 21K /mypool
mypool/vol1 21K 200M 21K /data1
mypool/vol2 21K 100M 21K /data2
bash-3.00#
Now the available space is shown as 100Mbyte for /data2 and 200Mbyte for /data1. You can also use the reservation property to guarantee space for a filesystem.
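A reservation is the opposite of a quota: it guarantees space instead of capping it. A quick sketch, with an arbitrary value of 50 Mbyte:
bash-3.00# zfs set reservation=50m mypool/vol2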
We can use the get subcommand to check which options have been set locally on a filesystem.
bash-3.00# zfs get -r -s local -o name,property,value all mypool
NAME PROPERTY VALUE
mypool/vol1 quota 200M
mypool/vol1 mountpoint /data1
mypool/vol2 quota 100M
mypool/vol2 mountpoint /data2
bash-3.00#
I used the -o name,property,value option to output only these columns. If you don't use the -o option, all columns for all properties are displayed, which can be a bit overwhelming because there are a lot of properties on a filesystem.
You can also use compression on a filesystem. It's very easy to do; below is the command.
bash-3.00# zfs set compression=on mypool/vol1
bash-3.00# zfs get -r -s local -o name,property,value all mypool
NAME PROPERTY VALUE
mypool/vol1 quota 200M
mypool/vol1 mountpoint /data1
mypool/vol1 compression on
mypool/vol2 quota 100M
mypool/vol2 mountpoint /data2
bash-3.00#
Another cool feature is that you can share filesystems in many ways with ZFS. Some of the more popular types include NFS, CIFS and iSCSI. You only have to set sharenfs=on for NFS, sharesmb=on for CIFS or shareiscsi=on for iSCSI.
bash-3.00# zfs set sharenfs=on mypool/vol2
Very simple. You could, of course, specify all the options that go with normal NFS shares, such as permissions and user access.
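The sharenfs property accepts the same option string you would normally pass to the share command. A hedged sketch, where the hostname adminhost is made up:
bash-3.00# zfs set sharenfs='rw,root=adminhost' mypool/vol2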
The last feature I want to show you is the snapshot. Again, it's very simple to create and manage. I will use the vol1 filesystem as an example: I will create a file in it, snapshot the filesystem, and show you where the snapshot is kept.
bash-3.00# mkfile 100m /data1/myfile
bash-3.00# ls -l /data1
total 1
-rw------T 1 root root 104857600 Jun 29 21:13 myfile
bash-3.00# zfs snapshot mypool/vol1@snapit
bash-3.00# zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 164K 553M 21K /mypool
mypool/vol1 22K 200M 22K /data1
mypool/vol1@snapit 0 - 22K -
mypool/vol2 21K 100M 21K /data2
bash-3.00#
The syntax for a snapshot is as follows: you use the subcommand snapshot to say you want to create a snapshot, then the filesystem you want to snap, an @ character, and finally the name of the snapshot.
Hm, now what? How can I access the data in the snapshot? Snapshots live in the filesystem's .zfs/snapshot directory. Let's change directory into the snapshot.
bash-3.00# cd /data1/.zfs/snapshot/snapit
bash-3.00# ls -la
total 4
drwxr-xr-x 2 root root 3 Jun 29 21:13 .
dr-xr-xr-x 3 root root 3 Jun 29 21:14 ..
-rw------T 1 root root 104857600 Jun 29 21:13 myfile
Yep, there it is. Let's delete the file from the /data1 directory and see if it is still available in the snapshot.
bash-3.00# rm /data1/myfile
bash-3.00# pwd
/data1/.zfs/snapshot/snapit
bash-3.00# ls -la
total 4
drwxr-xr-x 2 root root 3 Jun 29 21:13 .
dr-xr-xr-x 3 root root 3 Jun 29 21:14 ..
-rw------T 1 root root 104857600 Jun 29 21:13 myfile
There it is! The file is still available in the snapshot. Let's use the rollback option and roll the filesystem back to its original state using the snapshot.
bash-3.00# zfs rollback mypool/vol1@snapit
bash-3.00# ls -l /data1
total 1
-rw------T 1 root root 104857600 Jun 29 21:13 myfile
bash-3.00#
Magic! Can you start to see the use for snapshots? You can now create a pool and use it for users' home directories, so every user has their own filesystem with quotas, compression, reservations, shares, etc...
The users can create their own snapshots and restore them as they like. For example, users can take snapshots of their home directories during the week and roll back to whatever day they desire. They can test software or patches for an application, or whatever they want to do.
They don't need to bug the administrator to make backups and then restore them if something goes wrong; they can do all of it themselves. This is a really cool feature of the Zettabyte filesystem. It's like the users' own incremental backup.
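If you also want an off-system copy, the difference between two snapshots can be sent as an incremental stream. A hedged sketch, where the snapshot names monday and tuesday and the path /backup/vol1.inc are made up for illustration:
bash-3.00# zfs send -i mypool/vol1@monday mypool/vol1@tuesday > /backup/vol1.inc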
Instead of snapshots, you can also create clones, which you can actually read and write to.
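A clone is always created from an existing snapshot. A minimal sketch using the snapshot from above (mypool/vol1_clone is just a name I chose):
bash-3.00# zfs clone mypool/vol1@snapit mypool/vol1_clone
The clone shows up in zfs list like any other filesystem, and you can read and write to it independently of the original.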
As you can see, the creators of the Zettabyte filesystem have really put in the effort to make it as simple as possible to create and maintain these filesystems.
I only showed you some of the things you can do; this page would just be too long if I showed you all of the features of the Zettabyte filesystem.
These examples should give you a good understanding of how the Zettabyte filesystem works. Use the man pages if you are unsure.
The best way to learn this is to play with it. Get Oracle VM VirtualBox, install Solaris or OpenSolaris, create your disks and experiment. It's not difficult, as you have seen from my examples above. Don't be scared, just go for it.