Tue, 28 Mar 2006

After filling a CornFS volume for a couple of days now, I found a few problems that really begged for another release.

I'm still building cornfs with debug flags and running it under gdb to catch any segfaults in the new caching code. Sure enough, it turned up a segfault or two, so I needed to clean up my pointer handling a bit. cache_insert() now works through a rather huge cache_inventory() run without incident.

There was also a bug with the statfs() setup in the cache upload function. Instead of statfs()ing the /data/cornfs/import/{servername} directory, it was handily using /data/cornfs/import. All of the servers appeared to have the same remaining free space, which caused the last two servers to fill to the brim.
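The fix amounts to measuring each server's own mount point rather than the shared parent directory. Here is a minimal shell sketch of that idea, not the actual C code in cornfs. It assumes POSIX `df -P` output, and the `free_kb` and `report_free` helper names are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print the available space (in KB) of the filesystem
# that holds the given directory. Not part of cornfs itself.
free_kb() {
    # -P forces one output line per filesystem; column 4 is "Available".
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Report free space per server mount point, not for the parent directory.
# The default root follows the layout used in this post.
report_free() {
    import_root=${1:-/data/cornfs/import}
    for dir in "$import_root"/*/; do
        [ -d "$dir" ] || continue
        printf '%s %s\n' "$(basename "$dir")" "$(free_kb "$dir")"
    done
}
```

Calling `report_free` with no arguments prints one "servername free_kb" line per directory under /data/cornfs/import. Running `free_kb /data/cornfs/import` instead would report a single number for every server, which is the bug described above.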

Like I said, there will likely be some rapid releases this week as I stumble upon more nits to pick.

For now you can download cornfs-v0.0.5.1.tar.bz2 and have at it.

Sun, 26 Mar 2006

I've been working on cornfs this weekend a bit to speed things up.

With the help of gprof and gcc -pg, I found that the caching routines were causing a huge performance hit. Every read() and write() was doing a linear linked list search through every cached entry. This is fixed.

Along the way, I found it difficult to debug things with one huge cornfs.c source file. So I've split that up into numerous .c source files to fix this.

I also updated the Makefile so cornfs builds on its own, rather than under fuse/examples as before. It is now FUSE 2.5.2 friendly, and compiles with 22 ABI compatibility. I'll see about adding the 25 ABI functions shortly.

So, download cornfs-v0.0.5.0.tar.bz2, extract, and build with make.

A few folks have mentioned via private email that they were playing with cornfs. With this latest version, and NFS over ssh, NKS is finally running this in a production environment.

Look forward to some rapid updates here in the near future.

Thu, 23 Mar 2006

I've had serious problems using shfs, sshfs, and sfs. The first two fall apart under load, and the latter is a nightmare to get working in our environment (a PAM nightmare, that is).

Rather than dealing with something crazy, I decided to go back to a faithful old standard: NFS. As the remote storage nodes are accessible only via ssh, ssh was the ideal transport for the NFS mounts.

How do you do this? With a little port trickery and some inittab craziness to hold the tunnels up.

NFS v3 and newer have a TCP transport mode that makes it possible to tunnel using ssh. Older versions of NFS use a UDP based ONC RPC transport. Make sure you have kernel support for TCP and NFS v3 before you continue.

On the remote nodes, install NFS:

 # apt-get install nfs-kernel-server nfs-common portmap

Then set up an exports file sharing something to localhost:

 # echo "/export localhost(rw,async,insecure,no_root_squash)" >> /etc/exports

We need to have mountd start on a known port to set up the ssh tunnel from the master. The "-p" flag is used for this. Debian keeps the RPCMOUNTDOPTS flags in /etc/default/nfs-kernel-server, easily updated with this perl one-liner:

 # perl -pi -e 's/^(RPCMOUNTDOPTS)=.*$/$1="-p 32767"/' /etc/default/nfs-kernel-server

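It's also a good idea to block portmap requests from anything but localhost with tcpwrappers, just in case your firewall rules happen to be down for some reason. An allow rule only has an effect alongside a matching deny rule, and portmap does not do hostname lookups, so use the loopback address:

 # echo "portmap: 127.0.0.1" >> /etc/hosts.allow
 # echo "portmap: ALL" >> /etc/hosts.deny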

Now restart things and make sure the mountpoint is being exported:

 # /etc/init.d/nfs-kernel-server stop
 # /etc/init.d/nfs-common stop
 # /etc/init.d/portmap stop
 # /etc/init.d/portmap start
 # /etc/init.d/nfs-common start
 # /etc/init.d/nfs-kernel-server start
 # rpcinfo -p localhost
 # showmount -e localhost

The remote server is now ready to mount. Return to the central master cornfs server, which will act as the client, and set up an ssh tunnel.

Step 1: Install nfs-client

 # apt-get install nfs-client

Step 2: Set up key trust with the remote server:

 # ssh-keygen -f ~/.ssh/id_dsa-cornfs -P'' -t dsa -b 1024
 # cat ~/.ssh/ | ssh remoteserver 'mkdir ~/.ssh; cat - >> ~/.ssh/authorized_keys'

Step 3: Set up the SSH tunnel with an inittab respawn:

 # echo 'N0:23:respawn:/usr/bin/ssh -c blowfish -L 10000:localhost:2049 -L 11000:localhost:32767 remoteserver vmstat 300' >> /etc/inittab
 # telinit q

Now you should see an ssh tunnel running in a process listing. Check your system logs to see if there are any problems.

Step 4: Add fstab entries for NFS:

 # echo 'localhost:/export /data/cornfs/import/remoteserver nfs rw,bg,soft,port=10000,mountport=11000,tcp 0 0' >> /etc/fstab
 # mount /data/cornfs/import/remoteserver

You should now see your remote server /export filesystem mounted under /data/cornfs/import.

Each remote server will need to have a unique nfs and mountd port assignment. Repeat steps 3 and 4 for each.

I started at 10000 and 11000 and worked my way up from there. The next server's port assignments are 10001 and 11001, etc.
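Repeating steps 3 and 4 by hand gets tedious with more than a few servers. The sketch below prints the inittab and fstab lines for each server using the same 10000+N and 11000+N port scheme. The function name, the "N<index>" inittab ids, and the storage1/storage2 host names are assumptions for illustration. It only prints the lines; appending them to /etc/inittab and /etc/fstab is left as a manual step.

```shell
#!/bin/sh
# Sketch: print the inittab and fstab lines for the Nth remote server
# (N starts at 0), following the 10000+N / 11000+N port scheme above.
# The function name and the "N<index>" inittab id are illustrative only.
tunnel_lines() {
    index=$1
    server=$2
    nfs_port=$((10000 + index))
    mount_port=$((11000 + index))
    # inittab respawn entry that holds the tunnel open
    printf 'N%s:23:respawn:/usr/bin/ssh -c blowfish -L %s:localhost:2049 -L %s:localhost:32767 %s vmstat 300\n' \
        "$index" "$nfs_port" "$mount_port" "$server"
    # matching fstab entry pointing at the local end of the tunnel
    printf 'localhost:/export /data/cornfs/import/%s nfs rw,bg,soft,port=%s,mountport=%s,tcp 0 0\n' \
        "$server" "$nfs_port" "$mount_port"
}

# Print (not install) the lines for two hypothetical servers.
i=0
for server in storage1 storage2; do
    tunnel_lines "$i" "$server"
    i=$((i + 1))
done
```

Review the output before adding it to the real files, then run `telinit q` and mount each import directory as in steps 3 and 4.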

This works surprisingly well, and appears to be quite stable (far more stable than the other alternatives).

That's not to say things are as fast as they could possibly be, but it works.

Mon, 22 Aug 2005

Latest version: v0.0.5.0

The braindump for CORNFS explains many things about this project.

CORNFS is an attempt at creating a distributed filesystem that mirrors N copies of files across a group of M number of servers. Everything in CORNFS is stored as a file.

At any time, it is possible to reconstruct the entire filesystem via a simple overlay rsync from the remote filesystems - there is no "special database" to worry about.

Rather than mirroring at the volume or block level, CORNFS mirrors at the file level, tracking what servers a file is mirrored on. CORNFS works with locally cached copies of files and a central metadata state directory.

Extended attributes are used to mark metadata state files with information CORNFS uses to track the mirrors for a particular file, as well as cached files that are marked as "dirty" (for copying back to remote servers when a cached file is modified).

As files are written, the servers with the most available disk space are used for new files (braindead simple algorithm for the moment). When a cached file is modified, the file is copied back to its mirrors (or new mirrors should a server be unavailable). CORNFS keeps metadata centrally to keep a sane filesystem state. Every remote server's metadata state is known by the central server. The central server's metadata state is authoritative. Remote servers may go offline; when they come back online, any files that were updated while they were unavailable will have been removed from that server in the central metadata and will no longer be referred to (such "orphaned files" will need to be pruned periodically).

As a last resort, the master's cached copy is authoritative. If mirrors cannot be written to, the cache file will remain dirty, and will not be expired.
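The "most available disk space" placement rule is simple enough to show in a few lines of shell. This is only a sketch of the selection step, not the C code in cornfs. `pick_mirrors` is a made-up name, and the input is assumed to be "servername free_kb" pairs, one per line:

```shell
#!/bin/sh
# Sketch: read "server free_kb" pairs on stdin and print the names of the
# COUNT servers with the most free space. pick_mirrors is a made-up name.
pick_mirrors() {
    count=${1:-2}
    # sort numerically on the second field, largest first
    sort -k2,2nr | head -n "$count" | awk '{ print $1 }'
}

# Example with invented numbers:
printf '%s\n' 'alpha 500' 'beta 9000' 'gamma 7000' | pick_mirrors 2
```

With the invented numbers above, this prints beta and then gamma, the two servers with the most free space.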

This is a production running release, as used by my employer today.

The history of development so far:

cornfs.c v0.0.0.1

The first (broken) release.

cornfs.c v0.0.0.2

A number of fixes make this version _usable_. There are most definitely corner cases
that have not been dealt with yet, though it seems to suffer an rsync/rm well now.

cornfs.c v0.0.0.3

Adds partial read()s while the copy is underway during an open() (until I figure out
how to spawn a pthread() for the copy, this does not really do much yet).

cornfs.c v0.0.0.4

Added pthread_mutex_lock(&corn_copy_lock) to copy_file. Added corn_magic and
USE_MAGIC wrappers for magic file identification.

cornfs.c v0.0.1.0

The dynamic expiring cache code is now present. Added cache_inventory(), 
cache_insert(), cache_update(), cache_expire_to_limit(), cache_rename(),
and cache_remove().

cornfs.c v0.0.1.1

Added S_ISREG check to read() and write(). Any non-regular file read/write calls 
are now mapped correctly to state files. Also fixed cache_insert.

cornfs.c v0.0.1.2

Turned off debugging, removed hardcoded size limit.

cornfs.c v0.0.1.3

Remove stat()ing of cached files, replace with cache_exists(), particularly in read()
and write(). Move as many dirty checks to corn_cache as possible. cache_mark_dirty(),
cache_mark_clean(),  cache_is_clean(). Fixed some more mallocs.

cornfs.c v0.0.1.4

Add dirty check to cache_expire
(do not expire something from cache if it does not have a good mirror!!!)

cornfs.c v0.0.2.0

Add copy_file_thread, copy_file_wait, and copy_file_nowait. 
Copying is now threadable!

cornfs.c v0.0.3.0

Add fsck_thread and xmp_init/xmp_destroy. The fsck_handler_* 
functions are not done yet, but are ready to fill in.

cornfs.c v0.0.3.1

Relabeled all xmp_ functions to cornfs_;
Reworked open() function quite a bit: moved much
of the copy to cache logic to download_to_cache();
Defined corn_file_info struct, used to pass open()
file descriptor to read() and write();
Moved code from release() to upload_from_cache(),
added to fsck_cache_handler()

cornfs.c v0.0.3.2

Filled in fsck_meta_handler() and fsck_state_handler()
Fixed some logic errors in fsck_import_handler()
Filesystem appears to fsck correctly now.


Add control_file_read()/write() and corn_file_info structure updates to handle control file IO.
Profiled code, found cache_update()/cache_insert() biggest culprit
Made corn_cache a two-way linked list to remedy above.
Split up cornfs.c into numerous .c source files to simplify coding
Now in a tarball because of above.

Or grab the latest cornfs.tar.bz2 with everything you need.

Things to fix:

  • Hardlinks don't work right.
  • Add userspace tool to monitor live filesystem in action.
  • Working toward MetaFS style searchable metadata. The libmagic stuff is just a beginning. I'm looking into id3lib integration now. The storage backend for the searchable data will likely be a BerkeleyDB database.

The easiest way to build this is to grab fuse-2.5.2.tar.gz and extract it:

$ tar xvzf fuse-2.5.2.tar.gz

Then extract the cornfs.tar.bz2 somewhere and build it:

$ cd /tmp
$ wget
$ cd /tmp ; tar xvjf cornfs.tar.bz2
$ make -C /tmp/cornfs

You should now have a "cornfs" runtime. If not, drop me an email.

The directory tree to make this usable is hardcoded at the moment into the runtime (constants toward the top of the source file).

$ mkdir -p /data/cornfs/cfgs/servers
$ echo /remote/path > /data/cornfs/cfgs/servers/SERVERNAME
$ mkdir -p /data/cornfs/metadata/state
$ mkdir -p /data/cornfs/metadata/cache
$ mkdir -p /data/cornfs/metadata/SERVERNAME
$ mkdir -p /data/cornfs/import/SERVERNAME

The only missing bits are mounting the import/SERVERNAME directories for each filesystem configured in cfgs/servers/. You can use SHFS, NFS, DAVFS2, or whatever the heck your Linux kernel has support for. CORNFS strives to be filesystem agnostic.

$ cd /data/cornfs/cfgs/servers
$ for server in * ; do mkdir -p /data/cornfs/import/$server ; shfsmount $server:`cat $server` /data/cornfs/import/$server ; done

Now start the cornfs server with a reference path:

$ cd /tmp/cornfs
$ mkdir /mnt/cornfs
$ ./cornfs /mnt/cornfs -d

The "-d" flag adds FUSE debugging.

The lower you set the DEBUG level when building cornfs, the more debugging info will appear. It's an enum, so that can easily be reversed (Verbosity).

By default, the DEBUG level isn't set at all. In that mode, all debugging is macroed away to oblivion to speed things up.

CornFS is being used in production with NFS over SSH instead of SHFS for stability. If you plan on using CornFS in a production role, please let me know.


Mon, 25 Jul 2005

Please excuse this brain dump. As ideas come up, I continue to edit this node. Eventually, some structure will be enforced.

Inspired by SSHFS and SHFS, what would it take to make a filesystem that spans a cluster of servers and exposes aggregate diskspace while still mirroring data?

Exposing a filesystem with FUSE on a master node would be ideal, with some form of WebDAV network access (using something as simple as Apache mod_dav) for client access.

Most distributed filesystems have the idea of a "master" for metadata:

  • Google's Filesystem has a master model with distributed "chunk servers" for the data. Not OpenSource. Also not POSIX, it's a programming API interface, you can't "mount" it AFAIK. They could probably throw a FUSE filesystem together in short order if they really wanted to.
  • HDFS (previously NDFS), or the Hadoop (Nutch) Distributed Filesystem, is a Java knockoff of the Google Filesystem. As a backend for the Apache Lucene Nutch project, it is a programmatic API interface filesystem. While you can't mount it, writing a FUSE frontend wouldn't be hard.
  • PVFS v1 has one master, v2 has multiple masters, but no mirroring - meant for high-IO scientific clusters.
  • OpenAFS has many servers, and mirrors at the volume level, but requires a complex kerberos infrastructure and much manual volume creation to balance the layout. There is only one read/write volume; the rest of the volume replicas are read-only. Don't think I'm not tempted by OpenAFS, it just doesn't solve the need we have at the moment (long story).
  • CODA (sometimes referred to as AFSv3) offers disconnected roaming, but mirrors at the server level - not at a volume level.
  • Lustre has a master model, but mirrors on a volume level.
  • Intermezzo was Peter J. Braam's predecessor to Lustre. Ideal for straight mirroring, not distributing files throughout a cluster.
  • Both GFS and OpenGFS use a DLM cluster arrangement with shared storage to present a shared filesystem. CLVM mirroring is very young (lvcreate -m is undocumented at best, allocation is impossible to specify, and you can't have more than one mirror log volume yet). Boy was this fun to play with.
  • CXFS is SGI's Clustered XFS. Very similar to GFS, only cross platform and very scalable.
  • OpenSSI's CFS is little more than network mirroring across whatever underlying filesystem to present a unified root image for the OpenSSI cluster. Not what we're looking for.
  • MFS and DFSA are from Mosix / openMosix. MFS is the feature of openMosix that gives you access to remote filesystems as if those filesystems were locally mounted. With DFSA enabled, system calls will be executed on the remote node without migrating the process back to its home node.

There are others, but these are the "big boys" that I can think of.

There are a couple of distributed filesystems that run without a master server. This isn't trivial to implement:

  • GPFS is IBM's General Parallel File System. What it claims is downright nirvana. I've not had the time (or money) to play with it. Seriously, read this page. I want a copy. Not OpenSource. ;)
  • xFS is Berkeley's Serverless Network File Service. Basically, a log based network striped filesystem with metadata "map" servers that trade "write tokens" to update files between each other.

Storage servers in the cluster might each have some space set aside for this purpose. The easiest way would be to create and mount a loopback filesystem, backed by a file, with the space to be shared:

storage-node$ mkdir -p /data/cornfs/spool/ /data/cornfs/export/storage
storage-node$ dd if=/dev/zero of=/data/cornfs/spool/storage_fs bs=1M count=5k
storage-node$ mke2fs -F /data/cornfs/spool/storage_fs
storage-node$ mount -o loop /data/cornfs/spool/storage_fs /data/cornfs/export/storage

On the Master, each storage server's remote filesystem would be mounted based on the master's config (which is modeled likewise in a filesystem tree):

master-node$ mkdir -p /data/cornfs/cfgs/nodes
master-node$ cd /data/cornfs/cfgs/nodes
master-node$ echo /data/cornfs/export/storage > storage-node1
master-node$ echo /data/cornfs/export/storage > storage-node2

master-node$ mkdir -p /data/cornfs/import
master-node$ for node in * ; do mkdir -p /data/cornfs/import/$node ; shfsmount $node:`cat $node` /data/cornfs/import/$node ; done

The beauty of this is that shfs caches files and works with pretty much any host you can ssh into (including Windows via Cygwin). There are some shortcomings to shfs: "df -i" doesn't work, extended attributes aren't maintained, and it only works from linux kernels (were there only a Mac port ;)

Each file in the master tree will have a FILE pathname, including the filename.

Ideally, each file would have at least two copies. For our purposes, I'll suggest that this filesystem should endeavor to track two mirrors for every file, and clean up any "extra" copies.

The Master itself should have a few trees for the metadata. This leaves us with a few directory trees:

- the FILE has the same owner, group, permissions, ctime/atime/mtime, and size as the actual FILE (as a sparse file). 
- Extended attributes make a great storage for things like the primary and secondary mirror server names (setxattr/getxattr).

- contains the actual file, if SERVER is one of the FILE mirrors.

- this is a sparse version of the above file, used as a sanity check and for regenerating a SERVER from scratch. 
- This local metadata replica of a remote server is the master's opinion of what the server actually holds. 
- If something does not exist in this copy, but exists on the server, it should be removed from that server. 
- If something exists in this copy but not on the server, corruption has occurred.

- a directory tree containing the past N days worth of accessed FILEs (pruned via cron)

This ends up requiring more than twice the number of actual file inodes to represent the full filesystem on the master: one full copy of the entire metadata state, one copy spread across the per-server metadata state replicas kept on the master, and some fraction of the filesystem in cache for frequent and/or recent file access.

The Master filesystem would be mounted somewhere handy to be filled, like /master:

master$ mkdir /master
master$ /opt/cornfs/current/bin/cornfs /master

Any new files created under /master would be written to the cache until the user closes the file. On file close, the Master needs to:

  1. Lock the file in the metadata state tree so that no two close operations can occur in parallel. Run a "df" on all of the /data/cornfs/import/ filesystems to see which two have the most available space, then fork off a copy to each of those filesystems.
  2. Create a /data/cornfs/metadata/state/ sparse file.
  3. Tag the /data/cornfs/metadata/state/ file with a "mirror1" extended attribute when the first copy completes (setxattr). Update the /data/cornfs/metadata/SERVER/ file to mark that the copy was successful.
  4. Tag the /data/cornfs/metadata/state/ file with a "mirror2" extended attribute when the second copy completes (setxattr). Update the /data/cornfs/metadata/SERVER/ file to mark that the copy was successful.
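
The "which two have the most available space" step could look roughly like this in Python. This is only a sketch: pick_mirrors, free_bytes, and the probe parameter are names I made up for illustration, and it assumes statvfs on the shfs mounts reports usable numbers, the same way "df" would.

```python
import os
from typing import Callable, Dict, List


def free_bytes(path: str) -> int:
    """Bytes available to unprivileged users on the filesystem holding path."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize


def pick_mirrors(mounts: Dict[str, str], count: int = 2,
                 probe: Callable[[str], int] = free_bytes) -> List[str]:
    """Return the `count` server names whose mounts report the most free space.

    `mounts` maps a server name to its mount point under import/.
    Mounts that cannot be probed are skipped as unreachable.
    """
    usable = {}
    for server, path in mounts.items():
        try:
            usable[server] = probe(path)
        except OSError:
            continue  # unreachable or unmounted: not a candidate
    if len(usable) < count:
        raise RuntimeError("not enough reachable servers for %d mirrors" % count)
    # Most free space first; ties broken by name so the result is stable.
    ranked = sorted(usable, key=lambda name: (-usable[name], name))
    return ranked[:count]
```

The probe is injectable so the selection logic can be exercised without real mounts; in the daemon it would simply default to statvfs on each /data/cornfs/import/SERVER directory.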

When release() is called for a file, if any write() calls were used on the file, it should have been flagged as "dirty" (by an associative array in memory, along with an extended attribute just in case the running daemon is killed). If a file is dirty, it needs to be written out to the mirrors on release(). If a file is clean, don't do anything at all! The file is handily in the cache for the next access.
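
Here is a small sketch of that dirty-flag bookkeeping. The class name, the "user.cornfs.dirty" attribute name, and the persist/flush callbacks are all my own placeholders, and an in-memory set stands in for the associative array; os.setxattr and os.removexattr are the Linux-only Python calls for extended attributes.

```python
import os
from typing import Callable, Set

DIRTY_XATTR = "user.cornfs.dirty"  # hypothetical attribute name


def mark_on_disk(path: str, dirty: bool) -> None:
    """Persist the dirty flag as an extended attribute (Linux only)."""
    if dirty:
        os.setxattr(path, DIRTY_XATTR, b"1")
    else:
        try:
            os.removexattr(path, DIRTY_XATTR)
        except OSError:
            pass  # the flag was never set on this file


class DirtyTracker:
    """Remember which cached files were written to since they were opened."""

    def __init__(self, flush: Callable[[str], None],
                 persist: Callable[[str, bool], None] = mark_on_disk) -> None:
        self._dirty: Set[str] = set()
        self._flush = flush      # copies the cached file out to its mirrors
        self._persist = persist  # survives a killed daemon

    def on_write(self, path: str) -> None:
        if path not in self._dirty:
            self._dirty.add(path)
            self._persist(path, True)

    def on_release(self, path: str) -> bool:
        """Flush a dirty file to the mirrors; return True if a flush happened."""
        if path not in self._dirty:
            return False  # clean: leave it in the cache and do nothing
        self._flush(path)
        self._dirty.discard(path)
        self._persist(path, False)
        return True
```

If the flush raises, the file stays flagged as dirty, so a later release() (or a restart that scans for the attribute) can retry the copy.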

When reading a file:

  1. Check /data/cornfs/metadata/cache/ for the file. Open it if it exists.
  2. If the file is not in the cache, select one of its mirrors.
  3. Copy the file to the cache. There is nothing wrong with letting the client read while the copy is in progress, as long as it doesn't try to read more data than has been streamed from the mirror server so far (a seek or read() past the EOF of the still-growing cache file). In that case, the read or seek should block until the entire file is in the cache.
  4. If no mirrors are accessible, return an error.
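
A simplified version of that read path, as a sketch only: open_for_read is a made-up name, paths are assumed to be relative to each tree's root, and the copy here completes before the file is handed back, so the "read while still streaming" refinement is not modeled.

```python
import errno
import os
import shutil
from typing import BinaryIO, List


def open_for_read(relpath: str, cache_root: str,
                  mirror_roots: List[str]) -> BinaryIO:
    """Open relpath from the cache, pulling it from a mirror on a miss.

    relpath must be relative (no leading slash). mirror_roots are the
    import/SERVER directories that hold a copy, in order of preference.
    """
    cached = os.path.join(cache_root, relpath)
    if os.path.exists(cached):
        return open(cached, "rb")  # cache hit

    os.makedirs(os.path.dirname(cached), exist_ok=True)
    partial = cached + ".partial"
    for root in mirror_roots:
        try:
            shutil.copyfile(os.path.join(root, relpath), partial)
        except OSError:
            continue  # this mirror is unreachable or lacks the file
        os.replace(partial, cached)  # publish only complete copies
        return open(cached, "rb")

    # Step 4: no mirror could supply the file.
    raise OSError(errno.EIO, "no reachable mirror", relpath)
```

Copying to a ".partial" name and renaming keeps a half-copied file from ever being mistaken for a cache hit.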

When moving a file/directory:

  1. Move the state/ copy of the file, if it exists. If this fails for any reason, pass the error code up.
  2. Move the cache/ copy of the file, if it exists.
  3. Iterate through the local metadata/SERVER, moving the file, if it exists.
  4. Iterate through the remote import/SERVER, moving the file, if it exists.
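
The move sequence boils down to one rename applied across several roots in a fixed order. A sketch, with move_path as a hypothetical name; it lets any OSError propagate to the caller, which covers the "pass the error code up" rule for the state/ tree (what to do when a later tree fails is left open here).

```python
import os
from typing import List


def _move_if_present(root: str, old: str, new: str) -> bool:
    """Rename root/old to root/new if root/old exists; report whether it did."""
    src = os.path.join(root, old)
    if not os.path.lexists(src):
        return False
    dst = os.path.join(root, new)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    os.rename(src, dst)
    return True


def move_path(old: str, new: str, state_root: str, cache_root: str,
              metadata_roots: List[str], import_roots: List[str]) -> List[str]:
    """Apply the move to state/, cache/, each metadata/SERVER, each import/SERVER.

    old and new are relative paths. Returns the roots where something moved.
    """
    moved = []
    for root in [state_root, cache_root, *metadata_roots, *import_roots]:
        if _move_if_present(root, old, new):
            moved.append(root)
    return moved
```

Because state/ comes first, a failure there stops the operation before any cache or mirror copy has been touched.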

When unlinking (removing) a file/directory:

  1. Remove the state/ copy of the file, if it exists. If this fails for any reason, pass the error code up.
  2. Remove any cache/ copy of the file, if it exists.
  3. Iterate through the local metadata/SERVER, removing the file/dir, if it exists.
  4. Iterate through the remote import/SERVER, removing the file/dir, if it exists.

Changing permissions, access times, or ownership would really only affect the /data/cornfs/metadata/state/ sparse file.

Most other metadata lookups would be answered from the state/ sparse file as well, without touching the storage servers.
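
To make the sparse file idea concrete, here is a rough sketch of creating one from a real file. write_placeholder is a name I invented; the sketch assumes the parent directory already exists, it cannot reproduce ctime (the kernel sets that), and changing the owner only works when the daemon has the privilege to do so.

```python
import os
import stat


def write_placeholder(real_path: str, state_path: str) -> None:
    """Create a sparse file at state_path that mimics real_path's metadata."""
    st = os.stat(real_path)
    with open(state_path, "wb") as fh:
        # Extending an empty file with truncate() allocates no data blocks
        # on filesystems that support holes, yet stat() reports the full size.
        fh.truncate(st.st_size)
    os.chmod(state_path, stat.S_IMODE(st.st_mode))
    try:
        os.chown(state_path, st.st_uid, st.st_gid)
    except PermissionError:
        pass  # not running with enough privilege to change ownership
    # Set times last so the earlier calls do not disturb them.
    os.utime(state_path, ns=(st.st_atime_ns, st.st_mtime_ns))
```

After this, a stat() of the placeholder returns the same size, mode, and mtime as the real file, which is all that most metadata operations need.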

A "helper daemon" needs to run periodically to make sure that servers are accessible.

  1. If a server becomes unreachable but has not yet timed out as "dead", read()s fail over to the other mirror (or fail if both mirrors are unreachable; such operations should probably trigger a mirror copy() as well), and write()s move the unreachable mirror of a file over to another reachable server.
  2. If a server stays inaccessible long enough to be marked "dead", the helper daemon needs to walk the /data/cornfs/metadata/SERVER/ tree and create a new mirrored copy of each file elsewhere in the farm. In the process, the metadata/SERVER tree will be pruned.
  3. A "sanity" script must be run periodically against each metadata/SERVER tree to see if a copy of a file exists on the server that does NOT exist in the metadata/SERVER tree. If so, that's an orphaned mirror, and it should be deleted. Orphans would happen when the master's metadata state for a server says something shouldn't be there, but the server was down at the time the mirror would have been removed.
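
The "sanity" pass is essentially a set comparison between what the master believes and what the server holds. A sketch under that assumption, with list_files and audit_server as made-up names; it only compares regular file paths and does not delete or repair anything itself.

```python
import os
from typing import Set, Tuple


def list_files(root: str) -> Set[str]:
    """Relative paths of every regular file found under root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def audit_server(metadata_root: str, import_root: str) -> Tuple[Set[str], Set[str]]:
    """Compare metadata/SERVER (expected) with import/SERVER (actual).

    Returns (orphans, missing): orphans exist on the server but not in the
    master's replica; missing exist in the replica but not on the server.
    """
    expected = list_files(metadata_root)
    actual = list_files(import_root)
    return actual - expected, expected - actual
```

Orphans are candidates for deletion from the server, while anything in the missing set is the "corruption has occurred" case and calls for a new mirror.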

As metadata state is updated, locking must be used to ensure atomic operations on the metadata tree. We would not want multiple updates to a file to occur out of order due to a delay in a copy operation to a server in the field.
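
One way to get that locking is an advisory flock() on the state/ file itself. This is just one option, sketched with a hypothetical metadata_lock helper; it assumes every writer goes through the same helper, since advisory locks do nothing against processes that ignore them.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def metadata_lock(state_path: str, blocking: bool = True):
    """Hold an exclusive advisory lock on a state/ file for the with-block.

    With blocking=False, raises OSError (BlockingIOError) if already locked.
    """
    fd = os.open(state_path, os.O_RDONLY)
    try:
        flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
        fcntl.flock(fd, flags)
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A close() handler would wrap its whole "pick mirrors, copy, tag" sequence in "with metadata_lock(path):" so that a second close on the same file waits for the first to finish.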

Speed and availability should be monitored continuously, both to prefer faster-responding mirrors (where possible) and to notice when a node is unreachable for file operations, which should trigger a new mirror for any file whose mirror is broken.
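
A sketch of what that bookkeeping might look like. Everything here is an assumption of mine rather than part of the design: the ServerHealth name, the one-hour default before a server counts as "dead", and the 80/20 smoothing of response times.

```python
import time
from typing import Callable, Dict, List


class ServerHealth:
    """Track response times and outages to rank mirrors and spot dead servers."""

    def __init__(self, dead_after: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.dead_after = dead_after  # seconds unreachable before "dead"
        self._clock = clock
        self._latency: Dict[str, float] = {}
        self._down_since: Dict[str, float] = {}

    def record_success(self, server: str, seconds: float) -> None:
        old = self._latency.get(server)
        # Smoothed average so one slow request does not dominate.
        self._latency[server] = seconds if old is None else 0.8 * old + 0.2 * seconds
        self._down_since.pop(server, None)

    def record_failure(self, server: str) -> None:
        self._down_since.setdefault(server, self._clock())

    def status(self, server: str) -> str:
        since = self._down_since.get(server)
        if since is None:
            return "up"
        if self._clock() - since >= self.dead_after:
            return "dead"
        return "unreachable"

    def preferred(self, mirrors: List[str]) -> List[str]:
        """Reachable mirrors, fastest first; untested servers go last."""
        up = [m for m in mirrors if self.status(m) == "up"]
        return sorted(up, key=lambda m: self._latency.get(m, float("inf")))
```

The helper daemon would call record_success/record_failure from its periodic probes, and the read path would ask preferred() which mirror to try first.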

Symlinks, block/character devices, and other non-files are stored in the metadata state/ tree alongside the sparse files that represent the actual files that are being distributed.

There is no "inode" construct per se, outside of the metadata state/ tree. That is the "master metadata" that most filesystem operations use. Only when reading/writing, opening/closing, moving, or unlinking, do the mounted server filesystems under import/ get involved to hold the data.

Making this a single instance store (ideal for backups) would require just a bit more logic to include an SHA1/MD5 hash encoded as a directory tree (broken up by octet to a path tree structure); something like:
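
As a rough illustration of the octet-per-directory idea, here is a minimal Python sketch. hash_path is a hypothetical helper, and using every octet of the digest as its own directory level is my assumption; a real layout might stop after the first few levels.

```python
import hashlib


def hash_path(data: bytes, algorithm: str = "sha1") -> str:
    """Turn a content hash into a relative path, one directory per octet."""
    digest = hashlib.new(algorithm, data).hexdigest()
    # Two hex characters are one octet: "da39a3..." becomes "da/39/a3/..."
    return "/".join(digest[i:i + 2] for i in range(0, len(digest), 2))
```

Two files with identical contents map to the same path, so the second copy can be stored as a reference to the first instead of taking up space again.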


Another neat extension would be to build a "revision history" of documents in the filesystem by:

  1. On close(), if a file has changed, it should be archived.
  2. Move original version of files into a revision/ metadata tree by hash ID.
  3. Copy in the new version of file from the cache to the mirrors.
  4. Tag the new file in the state/ tree with an extended attribute recording the SHA1/MD5 hash of the previous revision in the revision/ metadata tree.
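
The archive-on-change steps could be sketched like this. archive_revision and file_sha1 are invented names, the revision/ tree is kept flat (one file per hash) for brevity, and the extended attribute tagging is left to the caller, which gets the old hash back as the return value.

```python
import hashlib
import os
import shutil
from typing import Optional


def file_sha1(path: str) -> str:
    """SHA1 of a file's contents, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_revision(current: str, new_version: str,
                     revision_root: str) -> Optional[str]:
    """Replace current with new_version, keeping the old contents by hash.

    Returns the hash of the archived revision, or None if nothing changed.
    """
    old_hash = file_sha1(current)
    if old_hash == file_sha1(new_version):
        return None  # unchanged: nothing to archive
    os.makedirs(revision_root, exist_ok=True)
    shutil.move(current, os.path.join(revision_root, old_hash))
    shutil.copyfile(new_version, current)
    return old_hash
```

The returned hash is what would be stored as the "previous revision" attribute on the new file in state/.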

This would address files that change, but would not save us from directory trees that are removed. For this, we would want an archive/ metadata tree by datestamp:

  1. On unlink(), create an archive/TIMESTAMP/ metadata tree and move the file there.

Moving files and/or directory trees around in state/ would maintain the extended attributes, effectively retaining the revisionist history FOR FREE! When files are moved, the mirrors must be moved as well.

Reconstructing things from the revision/ and archive/ trees would be interesting, but well beyond the initial scope of this endeavor.

The quickest way to throw this together would be with the perl module. I'm actively writing code now.

The eventual goal would be to write a thread-aware C version based on the above prototype, primarily for speed reasons.

More to come.. SOON..