
named pipelines in unix

Tuesday, August 28th, 2007

while using a collection of filters and mungers and sorters to extract statistics from the apache logfiles in the traditional manner, it occurred to me that we really need a shell that’s able to create a thing called a ‘named pipeline’.

this is similar to, but not the same as, a named pipe. to wit: suppose my webserver’s filesystem is 80% full due to immense logfiles, and i am sifting those files for data. first, i have to combine them:

cat file1 file2 file3 | sort | grep | munge > output

.. except actually file2 has got a bunch of crud from a previous egrep in it, and needs special treatment, so:

cut -d: -f2- file2 > file2.fixed
cat file1 file2.fixed file3 | sort | grep | munge > output

… only there’s not enough space on my filesystem for a second copy of most of the contents of file2, and anyway i’m only looking for a tiny percentage of that file.

what to do? well, named pipes are a handy unix-ism for this situation:

mknod file2.fixed p
cut -d: -f2- file2 > file2.fixed &
cat file1 file2.fixed file3 | sort | grep | munge > output
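for anyone who wants to try the trick without touching real logs, here is a minimal self-contained sketch. the scratch directory, the file contents and the function name demo_fifo are all made up for illustration, and mkfifo is used as the more common spelling of mknod ... p. note that the writer has to run in the background, because opening a named pipe for writing blocks until something opens it for reading.

```shell
# sketch only: sample data and names are invented for this demo
demo_fifo() {
    dir=$(mktemp -d) || return 1
    (
        cd "$dir" || exit 1
        printf 'b\n' > file1
        printf 'junk:a\n' > file2            # "crud" in front of the first colon
        printf 'c\n' > file3
        mkfifo file2.fixed                   # same effect as: mknod file2.fixed p
        cut -d: -f2- file2 > file2.fixed &   # blocks until a reader opens the pipe
        cat file1 file2.fixed file3 | sort   # cat opening the pipe wakes up cut
        wait                                 # reap the background cut
    )
    rm -rf "$dir"
}
demo_fifo
```

the cleaned-up copy of file2 never lands on disk; it only ever exists inside the pipe.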

that solves my space problem. only, i’m still re-running my sort over and over — kind of figuring it out as i go — and whenever i need to run it again, i have to re-start that stupid cut command. feh. what i need is the power of bash:

mknod file2.fixed p
(while true; do cut -d: -f2- file2 > file2.fixed; done) &
cat file1 file2.fixed file3 | sort | grep | munge > output

… which will re-start the command every time the pipe is opened for reading. that’s about what i want. only, instead of typing all this mess:
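a small sketch of that refill loop, runnable in a scratch directory. the data, the function name demo_refill and the one-second sleep are my assumptions, not part of the recipe above. the sleep is there because without it the loop can reopen the pipe before the previous reader has seen end-of-file, and that reader may then get the output twice.

```shell
# sketch only: a pipe that refills itself every time it is read
demo_refill() {
    dir=$(mktemp -d) || return 1
    printf 'junk:a\n' > "$dir/file2"
    mkfifo "$dir/file2.fixed"
    (
        while true; do
            # each pass blocks in the redirection until a reader shows up
            cut -d: -f2- "$dir/file2" > "$dir/file2.fixed"
            sleep 1   # assumption: let the last reader see end-of-file first
        done
    ) >/dev/null 2>&1 &
    loop=$!
    cat "$dir/file2.fixed"    # first read
    cat "$dir/file2.fixed"    # second read, nothing restarted by hand
    kill "$loop"
    wait "$loop" 2>/dev/null || true
    rm -rf "$dir"
}
demo_refill
```

both reads come back with the same cleaned-up line, and the loop costs nothing but a blocked process in between.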

mknod p p
(while true; do cmd | cmd | cmd | cmd > p; done) &

i’d much rather just type this:

cmd | cmd | cmd | cmd |% p

when i type that special |% notation, the shell should first create the named pipe, then spawn a process to execute the pipeline into the named pipe, restarting it as necessary for as long as the shell is running and the named pipe exists.

in addition, it ought to have a mechanism for quietly halting when the named pipe is deleted, and maybe for automatically restarting such pipelines when they are found abandoned by a previous shell. and someone ought to be able to figure out what’s coming down that pipe without having to suck on it; i.e. there needs to be a mechanism to inspect which commands are hooked to what pipes.

in short, the resulting named pipe ‘p’ is more than just a pipe, it’s a predefined pipeline that can be started and read from any time, but that consumes minimal resources until it is called upon. hence ‘named pipeline’.

(of course we’re consuming some memory here, with all those commands standing around blocking for output. but perhaps our clever implementation will avoid executing the commands until such time as their output is requested.)

if i was smarter or less lazy i might hack that |% syntax into the source of bash. but instead, i’m going to write a utility called ‘plumb’, maybe like so:

plumb p 'cmd|cmd|sort|grep|etc' # to create and start a pipeline
plumb p # to inspect
replumb p # to restart

sadly, the commands must be single-quoted, or the pipes escaped somehow, in order for this to work in the shell. but it’ll do for now.
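here is one rough way ‘plumb’ might look as a pair of shell functions. this is a sketch under assumptions, not a finished tool: the sidecar file p.plumb (which holds the loop’s pid and the command text for inspection), the unplumb helper, and the one-second pause between refills are all things i made up for illustration, and replumb is left out. the pipe is opened by the loop itself on descriptor 3, so that killing the loop also works while it sits idle waiting for a reader.

```shell
# hypothetical sketch of 'plumb'; sidecar file and 'unplumb' are assumptions
plumb() {
    pipe=$1
    if [ $# -lt 2 ]; then
        sed -n '2p' "$pipe.plumb"      # inspect: show the recorded pipeline
        return
    fi
    cmd=$2
    [ -p "$pipe" ] || mkfifo "$pipe" || return 1
    (
        # keep refilling until the named pipe is deleted
        while [ -p "$pipe" ]; do
            exec 3> "$pipe"            # blocks here until a reader opens the pipe
            sh -c "$cmd" >&3
            exec 3>&-                  # closing lets the reader see end-of-file
            sleep 1                    # pause so that reader can finish first
        done
    ) >/dev/null 2>&1 &                # a real tool would log errors somewhere
    # line 1: pid of the refill loop, line 2: the pipeline (one line only)
    printf '%s\n%s\n' "$!" "$cmd" > "$pipe.plumb"
}

unplumb() {
    # stop the refill loop, then remove the pipe and its sidecar file
    kill "$(sed -n '1p' "$1.plumb")" 2>/dev/null
    rm -f "$1" "$1.plumb"
}

# usage:
#   plumb p 'cmd | cmd | sort'   # create and start
#   plumb p                      # inspect
#   cat p                        # read; the loop refills it for next time
#   unplumb p                    # tear down
```

nothing in the pipeline runs until something reads the pipe, which is roughly the lazy behaviour wished for above; the price is a tiny sidecar file per pipe.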

making named-pipeline creation trivial allows me to construct what would otherwise be an unwieldy-long unix pipeline as a series of small, easily joined & inspected fittings. i can set up a large pile of sub-processes, each ready to filter the data another step, none of them consuming file space in the process of doing so, and with these craft my ultimate super-grep one step at a time.

plumb file2.fixed 'cut -f2- -d: file2'
plumb sorted 'cat file1 file2.fixed file3 | sort -u'
plumb sifted_a 'egrep a sorted'
plumb sifted_b 'egrep b sorted'
plumb formatted 'prettyfy sifted_a sifted_b'
plumb beer 'mail -s "today'\''s traffic report" myboss@myjob < formatted'

what’s brilliant is, if i inspect ‘formatted’ and find an error introduced by ‘file2’, i can edit ‘file2’ to fix it, and ‘formatted’ will reflect the change without additional work or resource-consumption — an implementation of functional programming in plumbing, basically — thereby enabling:

00 17 * * * cat beer > /dev/mykle