Configuring logging for Clojure applications

Banks have thousands of applications, each with large complex code-bases that must be supported long after the original application authors have moved onto something else. When something goes wrong a support engineer will invariably check the log file first. Few things are as common to every application as the humble log file.

It’s quite surprising to think that Java didn’t have an official logging API in the JDK until its fifth major release (1.4).

The sad history of Java logging

In the beginning there was System.out.println which writes out to the console. Many still use that today.

For those who needed more sophisticated logging, a package called log4j emerged in 1999 from IBM ‘alphaworks’. This steadily grew in popularity until it became the de-facto standard. Then Sun decided to write their own logging package and include it in the JDK, as java.util.logging. Not that there was anything significantly wrong with log4j, it was simply an example of the ‘not-invented-here’ syndrome that Sun was suffering from at the time. Despite a hard fought campaign to persuade Sun to incorporate log4j into the platform. Soon after, Sun realised that this sort of thing was damaging to Java’s reputation and created the JSR process to incorporate the wider Java community in decision making.

But the fact that Java had 2 near-identical logging packages led to further complexity. Apache created a facade called commons-logging that attempted to wrap both packages. By deciding to use a Java interface, Apache commons-logging negated the carefully crafted performance of the log4j package it was wrapping.

There have been attempts in the Java world to improve matters. One notable example is slf4j.

Logging in Clojure

Clojure logging neatly solves the problems of Java logging.

First, there’s a single API to use: clojure.tools.logging. Thank Alex Taggart and others for this. Second, it integrates with all the important logging frameworks: log4j, java.util.logging, Apache’s commons-logging and now slf4j. Finally, it side-steps many of the performance penalties associated with logging.

If you have a lot of logging statements in your code-base, you only want to pay the performance penalty for logging statements that are actually being written to the log file. However, too often CPU is wasted in creating a logging message only for that message to be filtered out later on. The conventional workaround in Java is to wrap every complex logging statement with a conditional.

if (logger.isDebugEnabled()) {
    logger.debug(x + " plus " + y + " is " + (x + y));
}

Without the conditional wrapper, the String concatenation would be quite expensive if it was ultimately unnecessary (for example, if the logger in question was configured to output warnings only). But the workaround is obviously cumbersome, implemented inconsistently and adds noise to the code’s readability. (Being the new(est) kid on the block, slf4j is able to make use of the JDK 1.5 varargs feature to avoid common string concatenation until after the message has made it through the filters).

Fortunately for Clojure programmer, the clojure.tools.logging takes care of this for them by using macros. As a Clojure programmer you get the benefit of optimized performance without the unsightly code-bloat.

Configuration

Unfortunately the Clojure logging facility doesn’t offer anything to help configure the logging package being used underneath.

Typically, Java logging is configured using configuration files. As you’d expect from Java, every configuration file is different, each one balancing flexibility with complexity. This is a horrible Java legacy to subject Clojure programmers to. An alternative approach is needed, so I’ve written a library: clj-logging-config.

You start by using requiring the library with :use (or :require) in your namespace declaration. In these examples I assume you are using log4j configuration, but you can configure java.util.logging in a similar way.

(ns my-ns
  (:use clojure.tools.logging 
        clj-logging-config.log4j))

Then you configure the logging system declaring a logger with set-logger!

(set-logger!)

and just start logging :-

(info "Just a plain logging message")

Check your console (not the REPL).

console> INFO - Just a plain logging message

In log4j it is common to set a pattern against the logger. Full syntax can be found in org.apache.log4j.EnhancedPatternLayout. For example, to set a pattern which just displays the message, do this:

(set-logger! :pattern "%m%n")
(info "Just the message this time")

console> Just the message this time

Or this.

(set-logger! :pattern "%d - %m%n")
(info "A logging message with the date in front")

console> 2011-08-07 10:58:21,084 - A logging message with the date in front

In the same way you can set the level of the logger. For example, so that the logger only prints warnings, you can do this :-

(set-logger! :level :warn) 

Or if you want to use the log4j Java class, you can do that as well :-

(set-logger! :level org.apache.log4j.Level/WARN) 

You can also set the appender to one of the appenders provided by the logging package you are using. For example, you can do this :-

(set-logger! :out (org.apache.log4j.DailyRollingFileAppender.))

Or even provide your own appender function.

(set-logger! :out (fn [ev] (println (:message ev))))

You can also set filters in the same way. Filters are useful but rarely used in log4j because they’re fundamentally a function, and log4j configurations formats don’t support writing functions – perhaps they should have used LISP!

(set-logger! :filter (fn [ev] (not (re-seq #"password" (:message ev)))))

For filters and appenders the underlying logging event is converted to a Clojure map (with the original event bound to the :event key, should you need it)

You can also set the additivity of the logger – the default is true but you can set it to false. Addivity is described in the Log4J user-guide – so we won’t repeat it here.

(set-logger! :additivity false)

It (almost) goes without saying that you can combine all these options together :-

(set-logger! :level :warn 
             :additivity false
             :pattern "%p - %m%n"
             :filter (constantly true))

There are some combinations that doesn’t make sense (such as using :pattern and :layout together). In these cases an exception will be thrown.

So much for getting started. In a real system you’ll want to configure multiple loggers in the same place :-

(set-loggers! 

    "com.malcolmsparks.foo" 
    {:level :info :pattern "%m"}

    ["com.malcolmsparks.bar" "com.malcolmsparks.zip"]
    {:level :debug})

You’ll find a lot more in this package, including support for per-thread (per-agent) logging which is useful in certain situations. One of the design goals of this library is that it ‘does the right thing’ most of the time. The aim is to prevent the common experience in log4j where you can’t figure out why you can’t see your logging statements and revert to println in frustration.

I do hope that Clojure programmers find this library useful. The only way to find out is to ask people to use it. It definitely needs more exposure to real projects but I wanted to release it now to get more feedback. The library is licensed under the EPL (same as Clojure’s) and available here.

Processing large files with Clojure.

By day, I work for a large investment bank in a department that calculates and distributes ‘Risk and PnL’. Simply put, for every transaction that is made by the bank a large amount of data needs to be re-computed periodically to help the bank understand its exposure to various things that can happen in the financial markets. Like many large organisations, banks produce large amounts of data files that must be processed. However, the data files are often so large that they’re impossible to load into the memory of a single 64-bit JVM.

Processing files as a stream is a good strategy. It minimizes ‘out of memory’ errors by utilizing a plentiful supply of disk space. How do you process a large file as a stream in Clojure?

We start by breaking the file up into a lazy sequence of tokens. The Clojure core library contains a function called line-seq that breaks a file into individual lines. Or you can use the input-stream or reader functions in the clojure.java.io namespace to turn a file into a lazy sequence of bytes or characters if you need finer granularity to form your own tokens depending on the nature of the data you are reading.

(ns example
  (:require ))

(line-seq (io/reader "bigfile"))

This is simplified for the purposes of this article. In real-world scenarios you may choose to improve performance by using NIO memory-mapped files. These techniques utilize main memory when it’s available without breaking the system when it isn’t.

Tokenizing

Invaribly you will form a lazy sequence of tokens in some way. How should we go about processing these tokens? I’ve found that using a state machine is a useful approach. We first define some initial state. Our tokens are fed to a function (which we’ll call f) along with the existing state. The function returns a new state, possibly changed.

Pacman

There are lots of other uses for this idea. Imagine we are asked to re-create the game of Pacman. We can list all the elements of the Pacman world :-

  • Our hero, Pacman
  • The positions of the ghosts
  • The maze
  • The pills
  • The magic pills
  • The existence of fruit
  • Whether the ghosts are afraid of Pacman or not
  • The player’s score
  • The number of lives the player has left
  • The state of the player’s joystick

We can store all this data in a single data-structure we call ‘the state’. We define a function, f, which is given the current world as an argument and which computes the new world we would expect to see on the screen one frame later. The job of programming the game of Pacman is then greatly simplified – we just call f iteratively and letting the game run as a ‘feedback loop’. All we have to do is implement the function f. (That job is left to the reader!)

How to apply the state-machine pattern in Clojure

There are a number of ways you can apply the state-machine method in Clojure. One simple way is to use the ‘second form’ of the reduce function in the Clojure core library (that’s the one that takes some initial state).

 (reduce f {} coll)

This first calls the function f with two arguments- an initial state (we used an empty map here) and the first element of the collection. The result of this becomes the first argument of another invocation of the same function, while the second argument is the next element of the collection. This repeated function application continues until the collection is exhausted.

Let’s pretend our tokens are ‘H’, ‘A’, ‘L’. This would be the result of applying reduce :-

 (f (f (f {} 'H') 'A') 'L')

The reduce function is a good starting point but it usually the wrong choice of function for parsing token streams :-

  • It is not possible to look-ahead to know what subsequent tokens will be
  • The function must always consume each token and cannot push the token back onto the stream.

Many parsers need both these features. So we need a refinement.

There’s also another subtler problem with reduce that we’ll come on to :-

  • We can’t inspect the operation of reduce while it is running, we have to wait for it to complete.

Using iterate

Fortunately the Clojure core library contains a more general method for the repetitive application of a function: iterate.

Here’s how we can use iterate to replace reduce.

(iterate f {:remaining coll})

Our state is initialized with the stream itself. The function has the responsibility for taking the first element and processing it. It returns a new state with a new entry representing the remaining elements. Usually the function will replace the :remaining element with remaining elements of the collection once the first element has been processed. However, it can do anything it chooses with this element, including looking ahead for future tokens and pushing back old ones.

The function is repeatedly applied with the map. The map can contain any state that is necessary for the processing of the stream. Often this includes stacks (vectors) to retain state at each level of context.

For the purposes of the rest of this article we’ll rewrite this using the ->> threading macro. We use this macro a lot when manpulating sequences.

(->> {:remaining coll}
     (iterate f))

Preventing infinite application

The problem with iterate is that it tends to go on forever. We can avoid this by adding an :end into our map when we detect the end of the stream. We can then make the resulting sequence finite by using take-while.

(->> {:remaining coll}
     (iterate f)
     (take-while #(not (contains? % :end)))

Since iterate returns a sequence of states we can choose to inspect any single state

Yielding

Often we want to parse a stream of tokens into a stream of something else, usually something more concrete and coarse-grained.

For example, huge XML files are often the result of repeated structures. In my case, this could be the calculated risk for every trade in a book. The structure itself for each trade isn’t complicated, but a large number of trades can make the resulting file very large. We have an algorithm for turning a stream of XML events into a stream of ‘virtual’ XML documents, each representing the entire XML file as it would look like if it contained just a single trade. The ‘stream’ is made up of these virtual XML documents, one of each trade. This allows us to tame the size of the file and avoid reading the entire file into memory.

The way to solve this problem is to allow our function to ‘yield’ whenever it has enough information to return a new data structure. We use the key :yield in the returned map to indicate this.

We use filter to filter out interrim states.

(->> {:remaining coll}
     (iterate f)
     (take-while #(not (contains? % :end)))
     (map :yield)
     (filter #(not (nil? %))))

Factoring out the pattern

It’s time to put all this together in its own function.

(defn parse [f coll]
  (->> {:remaining coll}
       (iterate f)
       (take-while #(not (contains? % :end)))
       (map :yield)
       (filter #(not (nil? %)))))

Destructuring

We now discuss the function f. Before we provide examples, let’s explain the structure of these functions a little more. The function f is given a map as its only parameter.

This example function does nothing except consume the first token. It does this by applying next to the :remaining entry in the map.

(fn [m] 
    {:remaining (next (:remaining m))})

This is less convenient than if we had named arguments for everything in the map, but we can use destructuring to get close to this.
So here’s our function we some destructuring :-

(fn [{remaining :remaining}] 
    {:remaining (next remaining)})

But usually we want to do something with each token in the stream. Let’s destructure a little more.

(fn [{[token & remaining :as stream] :remaining}] 
    {:remaining remaining})

This sets token to the token in the stream we shall process, remaining as the collection of tokens that remain in stream after the token and stream as the collection of remaining tokens includinng the token. This last setting is achieved using the :as keyword. This is useful when we want to return a new state but without consuming the token, equivalent to a pushback. There are certain cases where this is useful.

Notice also how destructuring allows us to balance the complexity of the function body with the function signature. Let’s digress while we’re discussing destructuring to show how far we can go.

Extreme destructuring

Jay Fields provides a good overview of the destructuring ‘mini-language’ inside Clojure. At the time of writing there aren’t many other good web resources so I’ll show how useful destructuring can be for our example.

[jf]: http://blog.jayfields.com/2010/07/clojure-destructuring.html “Jay Fields”

In a real-world situation our function will need to make use of other entries in the map. Let’s assume our map also contains entries for :stack, :indent and :count. We can destructure these in one go using the :keys keyword :-

(fn [{[token & remaining :as stream] :remaining 
                 :keys [stack indent count] 
                 :or {indent 0}}] 
    {:remaining remaining})

In the case of indent we have also indicated a default value of 0. Variables that do not have corresponding keys in the map argument and which do not have defaults get the value of nil.

Footnotes

The pattern described involes a lot of map manipulation. Of course, thanks to Clojure’s STM (Software Transactional Memory) we can mutate the map without worrying about where else we’ve referenced the map, since those values won’t change. However, since we often want to change multiple values in the map it would be useful to have a helper function.

(defn 
  ^{:doc "Apply functions to values in a map."}
  mapply
  [m & kfs]
  (apply hash-map
         (mapcat (fn [[k f]] [k (f (get m k))])
                 (partition 2 kfs))))

In some situations mapply has certain advantages over assoc. With assoc we have to get the data value out of the map, apply a function to it and then put it back in the map. We have to be careful to use the same key for the access and overwrite operations. By contrast, with mapply we send the function straight to the map entry.

With mapply, keys that are not specified are removed from the map. This is usually the safest thing to do to prevent temporary entries from bleeding across multiple state transitions. But it is easy to retain existing values in the map by specifying the key with the value of identity.

;; Increment :count and retain the value of :foo.
(mapply {:count 1 :foo "foo"} :count inc :foo identity) 

Finally, mapply can insert a new value, just like assoc does, by applying the constantly function.

;; Remove :count and set the :foo to "bar".
(mapply {:count 1 :foo "foo"} :foo (constantly "bar"))