<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Clojure in a bank</title>
	<atom:link href="http://blog.malcolmsparks.com/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://blog.malcolmsparks.com</link>
	<description>Applying a functional language to a large organisation</description>
	<lastBuildDate>Thu, 19 Apr 2012 16:17:10 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Coding a transactional database using Clojure&#8217;s &#8216;reader literals&#8217;.</title>
		<link>http://blog.malcolmsparks.com/?p=67</link>
		<comments>http://blog.malcolmsparks.com/?p=67#comments</comments>
		<pubDate>Thu, 19 Apr 2012 00:01:14 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.malcolmsparks.com/?p=67</guid>
		<description><![CDATA[I want to show how you can create a simple persistent database by utilising reader literals in Clojure 1.4. You can copy these lines into a Clojure program if you like &#8211; the only dependency you require is clojure.java.io. Let&#8217;s start by defining a transaction as an update to some state using a protocol :- [...]]]></description>
			<content:encoded><![CDATA[<p>I want to show how you can create a simple persistent database by utilising reader literals in Clojure 1.4. You can copy these lines into a Clojure program if you like &#8211; the only dependency you require is <code>clojure.java.io</code>.</p>
<p>Let&#8217;s start by defining a transaction as an update to some state using a protocol :-</p>
<pre><code>(defprotocol Transaction (update [_ state]))
</code></pre>
<p>Here&#8217;s some example transactions we want to perform. For this article let&#8217;s imagine we&#8217;re creating a database of user accounts. We&#8217;ll just create two transaction types- creating users and deleting them. Let&#8217;s record the name and email address, plus the date they joined or left.</p>
<pre><code>(defrecord CreateUser [name email start-date]
  Transaction
  (update [this db]
    (update-in db [:users] conj
      {:name name :email email :start-date start-date})))

(defrecord DeleteUser [email remove-date]
  Transaction
  (update [this db]
    (update-in db [:users]
      (fn [coll] (remove (fn [r] (= (:email r) email)) coll)))))
</code></pre>
<p>Data is important. We&#8217;ll want to keep records of these transactions.</p>
<pre><code>(defprotocol TransactionLog (record [_ tx]))
</code></pre>
<p>Here&#8217;s an implementation that uses an agent to write the records to a print-stream.</p>
<pre><code>(defrecord DefaultTransactionLog [ag]
  TransactionLog
  (record [this tx]
    (send-off ag
      (fn [out]
        (binding [*out* out *flush-on-newline* true]
          (io! (prn tx)) out)))))
</code></pre>
<p>The agent will hold the print-stream. Agents are good for I/O because messages to them only get delivered if the transaction completes its update succesfully. If a retry is needed the message doesn&#8217;t get delivered. We let the Clojure prn function figure out how to serialize it. It&#8217;s optional, but I&#8217;ve wrapped the actual print statement in an <code>io!</code> wrapper as a safety check.</p>
<p>Having written the records we&#8217;ll need some way of reading them back. We&#8217;ll just use the standard Clojure reader for this. This function returns a list of all the transactions in a file.</p>
<pre><code>(defn read-transactions [f]
  (if-not (.exists f) []
    (let [rdr (java.io.PushbackReader. (clojure.java.io/reader f))]
      (take-while identity
        (repeatedly
          (fn [] (read rdr false nil)))))))
</code></pre>
<p>Now we move on to the database state. Let&#8217;s decorate the transaction log with something that will hold this state. As transactions are added to it they will update the in-memory state and be recorded on disk. This can support the TransactionLog protocol as well. The state will be a Clojure ref. The delegate will be a backing transaction log.</p>
<pre><code>(defrecord Database [state delegate]
  TransactionLog
  (record [this tx]
    (dosync
     (alter state (partial update tx))
     (record delegate tx))))
</code></pre>
<p>Now we&#8217;re ready to create our database with the Clojure ref and backing transaction log. We&#8217;ll initialize the ref with a result of updating an empty map with a series of transactions read from our file.</p>
<pre><code>(def db
   (Database.
     (ref (reduce (fn [db tx] (update tx db)) {}
                    (read-transactions
                      (clojure.java.io/file "my.db.clj"))))
     (DefaultTransactionLog.
       (agent (clojure.java.io/writer
                (clojure.java.io/file "my.db.clj")
                :append true)))))
</code></pre>
<p>Remember those transactions we coded earlier? Let&#8217;s add some convenience functions that will create and apply these transactions :-</p>
<pre><code>(defn create-user [name email]
    (record db (CreateUser. name email (java.util.Date.))))

(defn delete-user [email]
    (record db (DeleteUser. email (java.util.Date.))))
</code></pre>
<p>Notice how we&#8217;re creating <code>java.util.Date</code> instances.</p>
<p>Now, let&#8217;s test it.</p>
<pre><code>(create-user "Bob" "bob@example.org")
(create-user "Alice" "alice@example.org")
(create-user "Carol" "carol@example.org")
(delete-user "alice@example.org")
</code></pre>
<p>What&#8217;s the state of the database?</p>
<pre><code>(deref (:state db))
</code></pre>
<p>You should see something like this :-</p>
<pre><code>{:users
  ({:name "Carol",
    :email "carol@example.org",
    :start-date #inst "2012-04-19T16:00:35.903-00:00"}
   {:name "Bob",
    :email "bob@example.org",
    :start-date #inst "2012-04-19T16:00:35.901-00:00"})}
</code></pre>
<p>What&#8217;s in our database file? Below is the raw content. Unlike some database files it&#8217;s quite easy to read and even possible to edit by hand if the need arises.</p>
<pre><code>#blog.CreateUser{:name "Bob", :email
"bob@example.org", :start-date #inst
"2012-04-19T16:00:35.901-00:00"}
#blog.CreateUser{:name "Alice", :email
"alice@example.org", :start-date #inst
"2012-04-19T16:00:35.903-00:00"}
#blog.CreateUser{:name "Carol", :email
"carol@example.org", :start-date #inst
"2012-04-19T16:00:35.903-00:00"}
#blog.DeleteUser{:email "alice@example.org",
:remove-date #inst
"2012-04-19T16:00:35.904-00:00"}
</code></pre>
<p>Note, unless you aren&#8217;t using Clojure 1.4 or above you won&#8217;t see the <code>#inst</code> tags!</p>
<p>Notice how Clojure is writing out and reading back the <code>java.util.Date</code> instance. You can also register your own Java types to it through the <code>*data-readers*</code> dynamic binding (more details in the Clojure 1.4 release notes). This makes things really easy to create some fairly complex atomic transactions.</p>
<p>Now try recompiling &#8211; you should find the database is restored. You can also try removing the user. The transaction log will store both the create-user event and the delete-user event, both with timings.</p>
<p>That&#8217;s around 30 lines of code to code a transactional database, and a few more lines to code some custom transactions.</p>
<p>You don&#8217;t have to limit yourself to a map as your state. At Deutsche Bank we&#8217;ve used a similar design but with RDF triple-stores that are queryable in a datalog syntax. A key benefit with using Clojure&#8217;s persistent data-structures for simple in-memory datastores is that you don&#8217;t have to worry about locks or concurrent updates.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malcolmsparks.com/?feed=rss2&#038;p=67</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Reflections on a real-world Clojure application (take 2)</title>
		<link>http://blog.malcolmsparks.com/?p=56</link>
		<comments>http://blog.malcolmsparks.com/?p=56#comments</comments>
		<pubDate>Wed, 18 Jan 2012 23:01:34 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.malcolmsparks.com/?p=56</guid>
		<description><![CDATA[Last night I gave a talk at the London Clojure Users Group (LCUG) about a &#8216;real-world&#8217; (16K lines-of-code) application we built in less than a year with Clojure at Deutsche Bank. I really enjoyed the event and thanks to SkillsMatter who were fantastic hosts. There were a lot of questions during the Q&#38;A at the [...]]]></description>
			<content:encoded><![CDATA[<p>Last night I gave a talk at the London Clojure Users Group (LCUG) about a &#8216;real-world&#8217; (16K lines-of-code) application we built in less than a year with Clojure at Deutsche Bank. I really enjoyed the event and thanks to SkillsMatter who were fantastic hosts.</p>
<p>There were a lot of questions during the Q&amp;A at the end which I did my best to answer at the time. Now I&#8217;ve had some more thinking time I&#8217;d like to add a few extra comments.</p>
<p>If you couldn&#8217;t attend the talk you can catch it <a href="http://skillsmatter.com/podcast/scala/real-world-clojure">here</a>.</p>
<p>Below is the original presentation in blog form (thank you Markdown!). My extra comments can be found in the epilogue &#8211; feel free to ask further questions in the comments area.</p>
<h1>Reflections on a real-world Clojure application-</h1>
<h2>Background</h2>
<ul>
<li>Java background, especially early J2EE circa 1999-2002</li>
<li>Test Driven Development &#8211; ran 20 courses</li>
<li>Mastering TDD helped me to write Java using <em>values</em> rather than <em>objects</em></li>
<li>Began to write Java in a more functional way &#8211; but much more verbose!!</li>
<li>Started using Clojure at work for user web interfaces in November 2009</li>
<li>Began to attend Clojure Dojos in London</li>
<li>February 2011 &#8211; Clojure used extensively on a new application, now 16K LOCs!</li>
</ul>
<h1>The &#8216;main&#8217; function</h1>
<h2>Developer bootstrap</h2>
<p>For developers</p>
<pre><code>$ mvn dependency:copy-dependencies
$ ./run
</code></pre>
<p>which does this :-</p>
<pre><code>#!/bin/sh
echo "Starting Fandango run script..."

export PATH=$PATH:target/bin

# Set debug to nil to disable JVM debugging.
classpath='src/main/clojure:target/dependency/*'
main=src/main/clojure/com/db/mis4/fandango/main.clj

java -cp ${classpath} clojure.main ${main}
</code></pre>
<p>Then slime in with Emacs!</p>
<p>(Let&#8217;s look at configuration in more detail)</p>
<h1>Configuration</h1>
<h2>Requirements of a configuration system</h2>
<ul>
<li>Flexibility &#8211; we should be able to add configuration where we need it</li>
<li>Distributed ownership &#8211; we shouldn&#8217;t have to know the live passwords</li>
<li>Source agnostic &#8211; we&#8217;d like to be able to use local files <em>and</em> centralised storage.</li>
</ul>
<h2>Candidates?</h2>
<ul>
<li>Java properties files</li>
<li>JSON/YAML</li>
<li>XML &#8211; tree based, schemas enforces structure rather than value</li>
<li>Databases &#8211; records for configuration are too diverse</li>
<li>RDF &#8211; graph based, queryable</li>
</ul>
<h2>Clojure as configuration?</h2>
<blockquote><p>
  &#8220;Protocols and file formats that are Turing-complete input languages are the worst offenders, because for them, recognizing valid or expected inputs is UNDECIDABLE: no amount of programming or testing will get it right&#8230; A Turing-complete input language destroys security for generations of users. Avoid Turing-complete input languages! &#8221; &#8212; Corey Doctorow
</p></blockquote>
<h4>So&#8230;</h4>
<p>Be careful if you choose Clojure as your configuration format!!</p>
<h2>&#8216;Open Data&#8217;</h2>
<p>All our data (application &amp; environment configuration, report definitions, user details &amp; entitlements, etc.) are stored as RDF statements</p>
<ul>
<li>&#8220;<em>The cat sat on the mat</em>&#8221;
<ul>
<li>Subject: <em>the cat</em></li>
<li>Predicate (also known as <em>property</em>): <em>sat on</em></li>
<li>Object: <em>the mat</em></li>
</ul>
</li>
<li>
<p>Relations are at an <em>individual</em> level rather than at a <em>set</em> (ie. table) level.</p>
</li>
<li>More intro to RDF here:
<ul>
<li>http://www.bbc.co.uk/blogs/radiolabs/s5/linked-data/s5.html</li>
<li>http://linkeddatabook.com</li>
</ul>
</li>
</ul>
<h2>Our configuration system</h2>
<ul>
<li>RDF files (mostly Turtle format)</li>
<li>SPARQL queries</li>
<li>Uses a dynamic var: <code>(with-config ...)</code></li>
<li>Delays to avoid unnecessary queries</li>
</ul>
<h2>Example</h2>
<p>create-assocations :-</p>
<pre><code>(defn create-associations [model]
  {::directories
   (delay
    (sparql/select1-map
     model
     [:proc cmdb/host :host]
     [:proc cmdb/install-dir (as-uri (format "file://%s" (or (System/getenv "FANDANGO_INSTALL_DIR")
                                                             (System/getProperty "user.dir"))))]
     [:host a cmdb/Host]
     [:host cmdb/hostname (get-hostname)]
     [:proc cmdb/userid (System/getProperty "user.name")]
     [:proc ["http://mis4.gto.intranet.db.com/fandango/" "dataDirectory"] :data-dir]
     [:proc ["http://mis4.gto.intranet.db.com/fandango/" "logDirectory"] :log-dir]
     [:proc ["http://mis4.gto.intranet.db.com/fandango/" "workDirectory"] :work-dir]
     [:proc ["http://mis4.gto.intranet.db.com/fandango/" "pidDirectory"] :pid-dir]
     [:optional [:proc cmdb/source-dir :source-dir]]
</code></pre>
<h1>Security</h1>
<h2>Entitlements</h2>
<p>All users are given FOAF &#8216;profiles&#8217;, with added VCARD and other statements.</p>
<p>Given these prefixes</p>
<pre><code>@prefix foaf: &amp;lt;http://xmlns.com/foaf/0.1/&gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
</code></pre>
<p>This statement (in the configuration) gives all users a &#8216;Guest&#8217; role.</p>
<pre><code>foaf:Person rdfs:subClassOf &amp;lt;Guest&gt; .
</code></pre>
<h2>N-triples</h2>
<p>Statements are then added to create users, request roles, approve or reject roles</p>
<p>Creating a user</p>
<pre><code>&amp;lt;events/5afcf604-16c0-4cab-a6d1-656ed3f3420c&gt; &amp;lt;time&gt; "2011-12-25T12:00Z"^^&amp;lt;http://www.w3.org/2001/XMLSchema#dateTime&gt; .
&amp;lt;events/5afcf604-16c0-4cab-a6d1-656ed3f3420c&gt; rdfs:type &amp;lt;CreateUser&gt; .
&amp;lt;events/5afcf604-16c0-4cab-a6d1-656ed3f3420c&gt; &amp;lt;eventfor&gt; &amp;lt;users/malcolm.sparks%40db.com&gt; .
</code></pre>
<p>Request a role</p>
<pre><code>&amp;lt;events/b5bed531-a324-4aec-9ace-2785c65a19b7&gt; &amp;lt;time&gt; "2011-12-25T14:00Z"^^&amp;lt;http://www.w3.org/2001/XMLSchema#dateTime&gt; .
&amp;lt;events/b5bed531-a324-4aec-9ace-2785c65a19b7&gt; rdfs:type &amp;lt;RequestRole&gt; .
&amp;lt;events/b5bed531-a324-4aec-9ace-2785c65a19b7&gt; &amp;lt;role&gt; &amp;lt;Administrator&gt; .
&amp;lt;events/b5bed531-a324-4aec-9ace-2785c65a19b7&gt; &amp;lt;eventfor&gt; &amp;lt;users/malcolm.sparks%40db.com&gt; .
</code></pre>
<h2>Language integrated query</h2>
<p>Data can be queried directly from Clojure</p>
<pre><code>(defn get-approved-roles-for-user [user]
  (sparql/select-map
   [(get-combined-model) (config/get-config-model)]
   [:approval a events-ns/RoleApproved]
   [:approval events-ns/time :approval-time]
   [:approval roles/approver :approver]
   [:approver foaf/name :approver-name]
   [:optional [:approver foaf/homepage :approver-homepage]]
   [:approval roles/cause :request]
   [:request roles/requester user]
   [:request events-ns/time :request-time]
   [:request roles/role :role]
   [:role rdfs/label :role-name]))
</code></pre>
<h1>Deployment</h1>
<h2>Releasing to production</h2>
<pre><code>$ git clone http://github....db.com/.../fandango.git
$ git verify-tag 4.5.0
$ git checkout tags/4.5.0
$ make release
</code></pre>
<p>Derive the version from git!</p>
<p>GNU Make incantation &#8230;</p>
<pre><code>describe := $(subst -, ,$(shell git describe --tags --long HEAD))
version := $(word 1,$(subst -, ,$(describe)))
release := $(shell expr 1 + $(word 2,$(describe)))
</code></pre>
<p>And generate the <code>pom.xml</code> &#8211; ie. in Make :-</p>
<pre><code>pom.xml:    pom.template.xml
            cat $&lt; | sed -e "s/@VERSION@/$(version)/g" &gt;$@

mvn dependency:copy-dependencies
cp -r src/ dest/
</code></pre>
<p>We use RPM but the principle of copying the source and dependency jars over is the same.</p>
<h2>Installation</h2>
<p>Installation is easy</p>
<pre><code>$ rpm --dbpath /opt/privatedb -Uvh fandango-4.5.0-1-x86_64.rpm
</code></pre>
<h2>Production bootstrap</h2>
<pre><code>$ fandango start
</code></pre>
<p>A lot more complex than the developer bootstrap.</p>
<ul>
<li>Init script (from Java Service Wrapper &#8211; enhanced with <code>roqet</code> to read environment variables from configuration)</li>
<li>Init script generates the <code>wrapper.conf</code>, then calls Java Service Wrapper native executable</li>
<li>Native binary spawns JVM with 2 args <code>clojure.main boot.clj</code></li>
<li>boot.clj sets up a classloader which pulls in the dependency jars</li>
<li>boot.clj hands off to main.clj, rest is as the developer bootstrap.</li>
</ul>
<p>But source code is still copied onto the system as is.</p>
<h1>Logging</h1>
<h2>Getting started</h2>
<p>Logging is important because it&#8217;s what everyone expects to find.</p>
<p>These will get you started :-</p>
<pre><code>(clojure.core/println)
(clojure.pprint/pprint)
</code></pre>
<p>However, as your application grows you will eventually need a more sophisticated logging system. We use Log4J and configure it with clj-logging-config.</p>
<p>You&#8217;ll need the following packages to do this :-&#8221;</p>
<pre><code>(use 'clojure.tools.logging)
(use 'clj-logging-config/log4j)
</code></pre>
<h2>(with-logging-config)</h2>
<pre><code>(with-logging-config
  [:root {:level :debug
          :out (io/file workdir "job.log")}]
  ...
</code></pre>
<h2>(with-logging-context)</h2>
<p>For using the NDC and MDC of Log4J.</p>
<pre><code>(with-logging-config
  [:root {:pattern "%d [%p] (for Customer %X{customer}) %m%n"}]
   ...

   (with-logging-context {"customer" "John Smith"}
     ...
</code></pre>
<h1>Reflections</h1>
<h2>The Good</h2>
<ul>
<li>Retain the JVM</li>
<li>No class files, yippee!</li>
<li>Sliming in! EDD: Eval Driven Development!</li>
<li>Separation of value, identity, state: State is a timeline of changing values.</li>
<li>Learning time &#8211; even our DBA is now comfortable with Clojure.</li>
</ul>
<h2>The Bad</h2>
<ul>
<li>People are justifiably afraid of new things</li>
<li>Tooling (for those not comfortable with Emacs)</li>
<li>Java interop can bite you</li>
</ul>
<h2>The Ugly</h2>
<ul>
<li>Stack traces</li>
<li>Debugging</li>
</ul>
<h2>Quality versus value</h2>
<blockquote><p>
  &#8220;Value is what you are trying to produce, and quality is only one aspect of it, intermixed with cost, features, and other factors.&#8221;   &#8212; John Carmack, http://altdevblogaday.com/2011/12/24/static-code-analysis/
</p></blockquote>
<h4>cf. &#8216;Agile&#8217; absolutes</h4>
<ul>
<li>Always write the tests first</li>
<li>Tests should always pass</li>
<li>Always fix the build before working on new features</li>
<li>Integrate continuously</li>
<li>Refactor prior to adding new features</li>
<li>Consistent code style</li>
</ul>
<para class="action">Our experience of Git + Clojure is prompting us to question certain assumptions.</para>
<h2>More info</h2>
<p>http://blog.malcolmsparks.com</p>
<h2>Q &amp; A</h2>
<p>Over to you&#8230;</p>
<h2>Epilogue</h2>
<p>Many of the questions related to the RDF portion of my presentation. There were a lot of others, I can&#8217;t remember all now.</p>
<h3>How big is your team and how did it grow?</h3>
<p>We started with 2 developers and grew to 4. Forcing Clojure on developers is unwise. I know that was tried somewhere else and most developers only used the Java interop!</p>
<h3>Why do you use RDF for configuration rather than XML or JSON or even Clojure itself?</h3>
<p>JSON is certainly more conventional as a configuration format (or XML in the Java world)<br />
There isn&#8217;t a strong reason not to use Clojure itself (I had a slide warning of the dangers of Turing complete input languages but the point stands nevertheless). I don&#8217;t think my answer was very good last night so here are some advantages of RDF :-</p>
<ul>
<li>Meaning &#8211; RDF allows you to make logic set-based statements to classes of what are otherwise straight name/values pairs.</li>
<li>Metadata &#8211; RDF allows you to make statements about statements. You can use metadata to label configuration values, add annotations (in multiple languages if you like), or constrain the values to some valid range or set, or say something about the nature of the property. You can do this in a very limited way with XML (perhaps with attributes) but with JSON there&#8217;s nothing built-in or idiomatic.</li>
<li>Mergeability &#8211; RDF allows you to source statements from a wide variety of sources and merge the models together, whereas there&#8217;s nothing built-in or idiomatic in XML or JSON. In tree formats config statements have to group inside each other in a single hierarchy &#8211; designing this hierarchy is a job in itself. Graphs are more flexible since nodes can exist in multiple hierarchies if needs be.</li>
<li>Inference &#8211; in RDF, having some data allows you to infer other data which you would otherwise have to make explicitly. This has the potential to reduce data discrepancies. For example, given a database name, listener host and port you can &#8216;infer&#8217; a database connection string. </li>
</ul>
<p>That said, I&#8217;m not really pushing RDF as a config format. We took a gamble on it and it paid off in our case. Other projects are different. JSON is a great format that enables fast and simple data exchange (when you control both ends).</p>
<p>I also suggested that a domain model is more valuable for <em>persistent</em> data than for <em>transient</em> data structures. Object oriented languages encourage you to design the domain model internal to a program. But in my view there is <em>more value</em> in a domain model you can communicate <em>between</em> systems, and keep for longer periods, than in a domain model that you can only use <em>privately</em> (ie. in a single memory address space) and only while your application is running. This is the exact opposite of designing domain models in Java/C# classes and serializing out to a database or JSON/XML files, hence the need to illustrate with a real-world example (in this case, configuration).</p>
<h3>What other Clojure frameworks do you use?</h3>
<ul>
<li>Compojure/Ring/Hiccup/Clout for web pages.</li>
<li>Plugboard for REST but the intention is to move towards something like compojure-rest</li>
<li>Swank &#8211; couldn&#8217;t manage without it!</li>
</ul>
<p>It&#8217;s a surprise to me how much we manage to do with just the standard Clojure libraries.</p>
<h3>Do you think functionality rises linearly or exponentially with lines of code?</h3>
<p>I thought this was a great question because it points to the huge amount of algorithmic re-use that we enjoy in Clojure.</p>
<h3>Did you have a specific business problem that led you to Clojure?</h3>
<p>Honestly, no. In my case it was a growing frustration with large Java systems. But since we&#8217;ve been using Clojure in our team there have been a number of business problems that have cropped up that are ideally suited to Clojure. Certainly in my industry (banking) the business is built on mathematical functions and data transformations for which functional languages like Clojure are ideal.</p>
<h3>Do you think Clojure be around in 5 years time?</h3>
<p>This final question was asked by someone sitting in the front row. I don&#8217;t think they would have asked this if they&#8217;d seen how many people were in the room! Clojure is building momentum, at least in London, and as I said in my talk I think it&#8217;s beyond the point of critical mass now.</p>
<p>But on reflection I think it&#8217;s an important question. Why should anyone invest a lot of time in learning something that isn&#8217;t going to be around in a few years? However, technology is always about betting on certain horses (VHS or Betamax?) and you can never be 100% certain. LISP is a good bet though, it&#8217;s survived over 50 years and people keep rediscovering it. So even if Clojure doesn&#8217;t survive, I&#8217;m confident the knowledge you get from learning it will remain relevant.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malcolmsparks.com/?feed=rss2&#038;p=56</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Transitive relations in core.logic</title>
		<link>http://blog.malcolmsparks.com/?p=49</link>
		<comments>http://blog.malcolmsparks.com/?p=49#comments</comments>
		<pubDate>Tue, 27 Dec 2011 19:26:53 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.malcolmsparks.com/?p=49</guid>
		<description><![CDATA[Inspired by some recent blogs I&#8217;ve been playing with core.logic. In our use RDF at work we have been developing on Jena, a Java library for processing and querying RDF triples. Overall, Jena is an excellent library and has powerful inference capabilities. While Clojure has great interop support for Java sometimes you find some mismatches [...]]]></description>
			<content:encoded><![CDATA[<p>Inspired by some recent blogs I&#8217;ve been playing with core.logic.</p>
<p>In our use RDF at work we have been developing on Jena, a Java library for processing and querying RDF triples. Overall, Jena is an excellent library and has powerful inference capabilities. While Clojure has great interop support for Java sometimes you find some mismatches between the two paradigms.</p>
<p>For example, Jena being a Java library supports concurrency via read/write locks rather than Clojure&#8217;s Software Transactional Memory. In practise this mismatch disrupts Clojure&#8217;s preference for lazy sequences &#8211; results have to be read in entirety to avoid concurrent modification errors. Jena&#8217;s concurrency model is also confusing to use, and we&#8217;ve had to fix a number of deadlocks in our code which have been hard to diagnose. Since we only use a small subset of OWL logical predicates in our dataset it may be possible to replace Jena with something that worked better with Clojure.</p>
<p>Both the RDFS and OWL inference engines in Jena work by extending the graph with additional links called &#8216;entailments&#8217; which reflect the possible inferences that can be made, using both forward and backward chaining strategies. Creating an inference graph can be expensive and not something you want to do on every query. However, if your data model isn&#8217;t static you have to re-create the inference graph every time it changes. While there are inference engines that support such incremental updates efficiently, I was intrigued by core.logic to see if it was possible extending the data model and rather to put inferencing in the relations themselves.</p>
<p>Here&#8217;s an example. RDFS declares a subclass predicate which declares one class to be a subclass of another. Any instances of the subclass are, by inference, also instances of the superclass.</p>
<p>We can declare this relation in core.logic like so.</p>
<pre><code>(defrel subclass-of p1 p2)
</code></pre>
<p>Here are some facts to illustrate the example.</p>
<pre><code>(fact subclass-of "Royal Bengal Tiger" "Tiger")
(fact subclass-of "Tiger" "Cat")
(fact subclass-of "Cat" "Mammal")
(fact subclass-of "Mammal" "Animal")
</code></pre>
<p>Now we can get all the graph links with this query.</p>
<pre><code>(run* [q]
      (fresh [a b]
             (subclass-of a b)
             (== q [a b])))
</code></pre>
<p>But rdfs:subclass-of is a owl:transitive. Never mind, we can transform the relation to act transitively. Here&#8217;s the transformer function :-</p>
<pre><code>(defn transitive [r]
  (letfn [(t [p1 p2]
            (fresh [between]
                   (conde
                    ((r p1 p2))
                    ((r p1 between)
                     (t between p2)))))]
    t))
</code></pre>
<p>Now we can use it to get all the entailments as well.</p>
<pre><code>;; This transforms the relation into a transitive relation
(run* [q]
      (fresh [a b]
             ((transitive subclass-of) a b)
             (== q [a b])))
</code></pre>
]]></content:encoded>
			<wfw:commentRss>http://blog.malcolmsparks.com/?feed=rss2&#038;p=49</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Scalable sorting</title>
		<link>http://blog.malcolmsparks.com/?p=42</link>
		<comments>http://blog.malcolmsparks.com/?p=42#comments</comments>
		<pubDate>Thu, 03 Nov 2011 07:50:07 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[sorting]]></category>

		<guid isPermaLink="false">http://blog.malcolmsparks.com/?p=42</guid>
		<description><![CDATA[Sorting data is an important step in gathering data together for aggregation and reporting purposes. Most Clojure sort algorithms you&#8217;ll find on the web assume you can fit all the items to sort into the memory of a single JVM instance. For example, many merge sorts will use partition to split the data set into [...]]]></description>
			<content:encoded><![CDATA[<p>Sorting data is an important step in gathering data together for aggregation and reporting purposes.</p>
<p>Most Clojure sort algorithms you&#8217;ll find on the web assume you can fit all the items to sort into the memory of a single JVM instance. For example, many merge sorts will use <code>partition</code> to split the data set into 2 halves.</p>
<p>Where I work we frequently have to deal with large volumes of data where the data sets we deal with are far too large for these algos. Instead we have to find alternatives.</p>
<p>Here&#8217;s our approach to sorting large data. Although we did spend some time researching the problem, there&#8217;s bound to be more better algorithms out there. Please let me know if you find one!</p>
<h2>Divide and conquer</h2>
<p>To sort a large dataset we break it into smaller &#8216;bite-sized&#8217; chunks which we <em>can</em> sort in memory, and write these sorted chunks to disk. Then we can run a lazy merge sort against these chunks. Although we currently run this on one node the nature of merge sort is that it can scale-out recursively and is compatible with map-reduce. In production use we&#8217;ve found that even on a single node huge datasets need a small degree of recursion just to avoid having too many chunks and running out of file handles. However, we&#8217;ll ignore recursion for the purposes of this discussion (we can always add it later).</p>
<h2>A simple merge-sort function</h2>
<p>First, let me print our merge-sort function minus the requisite doc-strings and comments.</p>
<pre><code>(defn merge-sort
  ([^java.util.Comparator comp colls]
     (letfn [(next-item [[_ colls]]
               (if (nil? colls)
                 [:end nil]
                 (let [[[yield &amp; p] &amp; q]
                       (sort-by first comp colls)]
                   [yield (if p (cons p q) q)])))]
           (-&gt;&gt; colls
                (vector :begin)
                (iterate next-item)
                (drop 1)
                (map first)
                (take-while (partial not= :end)))))
  ([colls]
     (merge-sort compare colls)))
</code></pre>
<p>Firstly, note the main function overload takes 2 arguments. The first is the Java comparator we&#8217;ll use to sort the data, the second is a <em>collection of sorted sequences</em>. These sequences don&#8217;t have to be in memory, they are lazy and their elements can be read in as we need them (from disk, network, etc.). However, we do assume the sequences have <em>already been sorted</em> with the same Java comparator given in the first argument. At the lowest level of recursion these sequences will be small enough to be sortable in memory and written out to disk &#8211; this is fairly trivial so we won&#8217;t cover it here.</p>
<p>We define a function called <code>next-item</code> which takes the collection sequences and returns a pair:</p>
<ul>
<li>the first value is the next result in the merge-sorted sequence. This is called the <em>yield</em>.</li>
<li>the second value is the same collection of sequences we started with <em>minus the yielded value</em>. This is called the <em>remainder</em> and is used to calculate the next result.</li>
</ul>
<p>Here&#8217;s the <em>next-item</em> function (first draft) :-</p>
<pre><code>(fn [colls]
    (let [[[yield &amp; p] &amp; q]
          (sort-by first comp colls)]
        [yield (cons p q))]))
</code></pre>
<p>The first job of the function is to sort the collections of sequences by their first values.</p>
<pre><code>(sort-by first comp colls)
</code></pre>
<p>Here, <code>comp</code> is just the Java comparator that dictates the order of the sort.</p>
<p>Now we destructure. Since the item we want is the first item of the first sequence, we destructure like this :-</p>
<pre><code>[[yield &amp; p] &amp; q]
</code></pre>
<ul>
<li><code>yield</code> is the item we are saying is the next item in the merged sorted sequence.</li>
<li><code>p</code> is the rest of the sequence which contained <code>yield</code>.</li>
<li><code>q</code> is the collection of all the other sequences.</li>
</ul>
<p>(A nice &#8216;by-product&#8217; of the <code>sort-by</code> is that <code>q</code> is already sorted. Even if the contents of <code>p</code> demand that we sort again, it won&#8217;t be an expensive operation.  In fact, my own tests have shown that little or no performance improvement can be made by avoiding the use of Java&#8217;s sort by moving the head collection down the list &#8211; Java&#8217;s sort is very efficient for sets that are either &#8216;already sorted&#8217; or &#8216;almost sorted&#8217;.)</p>
<p>Now we need to form our pair&#8230;</p>
<p>The first item is <code>yield</code>. That&#8217;s easy.</p>
<p>The second item is the collection of sequences with <code>yield</code> removed. That&#8217;s <code>p</code> joined onto the head of <code>q</code>, so we use <code>cons</code>.</p>
<pre><code>(cons p q)
</code></pre>
<p>Remember that <code>q</code> is a collection of sequences sorted by &#8216;first item&#8217;. We don&#8217;t know that <code>(cons p q)</code> is sorted according to first item. It&#8217;s likely that the next result is in <code>p</code> or in <code>(first q)</code>. However, we defer this sort operation to the next iteration, because re-sorting is the first thing we do in each iteration.</p>
<p>By the way, if <code>p</code> is nil then <code>p</code> is exhausted and we&#8217;ll just take <code>q</code>, hence this tweak :-</p>
<pre><code>(if p (cons p q) q)
</code></pre>
<p>Putting this altogether, our final <code>next-item</code> function is as follows :-</p>
<pre><code>(fn [colls]
    (let [[[yield &amp; p] &amp; q]
          (sort-by first comp colls)]
        [yield (if p (cons p q) q)]]))
</code></pre>
<h2>Repeat as necessary</h2>
<p>Each time we run the <code>next-item</code> function we&#8217;ll get the next result in our merged sorted sequence. Sounds like another job for Clojure&#8217;s <code>iterate</code> function!</p>
<p>Our <code>next-item</code> function needs to be amend slightly so that it can work iteratively &#8211; it needs to accept the same kind of structure that it produces. We amend the function signature so it can take a pair as the argument, and we destructure to ignore the first item and label the second.</p>
<pre><code>(fn [[_ colls]]
    (let [[[yield &amp; p] &amp; q]
          (sort-by first comp colls)]
        [yield (if p (cons p q) q)]])
</code></pre>
<p>We&#8217;ll just have to initiate the iteration with a fake pair.</p>
<pre><code>[:begin colls]
</code></pre>
<p>And we&#8217;ll have to ensure the iteration can tell the caller when to stop.</p>
<pre><code>[:end nil]
</code></pre>
<p>Let&#8217;s build that &#8216;stop&#8217; into our function :-</p>
<pre><code>(next-item [[_ colls]]
  (if (nil? colls) [:end nil]
    (let [[[yield &amp; p] &amp; q]
          (sort-by first comp colls)]
      [yield (if p (cons p q) q)])))
</code></pre>
<p>Now the inner function is finished let&#8217;s move onto the main body of the merge-sort function.</p>
<pre><code> (-&gt;&gt; colls
     (vector :begin)
     (iterate next-item)
     (drop 1)
     (map first)
     (take-while (partial not= :end))
     (lazy-seq))
</code></pre>
<p>We start by taking the <code>colls</code> argument and apply the <code>-&gt;&gt;</code> threading operator to it. This means that the result of each step will be added as the <em>last</em> argument of the next step.</p>
<pre><code>(-&gt;&gt; colls
    ... )
</code></pre>
<p>As we discussed earlier, we have to compose the first &#8216;fake&#8217; pair to get the first iteration to work.</p>
<pre><code>(vector :begin)
</code></pre>
<p>This creates <code>[:begin colls]</code>. We should remember this is also the first item that <code>iterate</code> will yield so we&#8217;ll have to drop it later.</p>
<p>Now let&#8217;s run our next-item function (infinitely)</p>
<pre><code>(iterate next-item)
</code></pre>
<p>We don&#8217;t want the first &#8216;fake&#8217; result &#8211; it was just to help the iterate function, let&#8217;s drop it :-</p>
<pre><code>(drop 1)
</code></pre>
<p>We&#8217;re only interested in the yield values (the first in each pair) :-</p>
<pre><code>(map first)
</code></pre>
<p>And we need to continue until the stop marker :-</p>
<pre><code>(take-while (partial not= :end))
</code></pre>
<p>Finally we overload the merge-sort function to use a default comparator where it isn&#8217;t specified.</p>
<h2>Testing</h2>
<p>Let&#8217;s define a function that can create <em>n</em> collections of <em>m</em> size populated with random numbers.</p>
<pre><code>user&gt; (defn make-colls [n m]
  (for [i (range n)]
    (sort
     (for [j (range m)]
       (rand-int 1000000)))))
</code></pre>
<p>Let&#8217;s evaluate now so this structure in in memory and its creation doesn&#8217;t affect our merge-sort performance measurements.</p>
<pre><code>user&gt; (def colls (make-colls 100 20000))
</code></pre>
<p>Let&#8217;s test it to ensure that it&#8217;s working. We expect the first collection to be sorted.</p>
<pre><code>user&gt; (take 10 (first colls))

(15 33 88 94 114 119 148 164 189 190)
</code></pre>
<p>However, when we merge sort, the numbers from all the collections are merged so we get a less rapidly incrementing sequence of numbers.</p>
<pre><code>user&gt; (take 10 (merge-sort colls))    

(1 1 1 2 2 5 6 7 7 8)
</code></pre>
<h2>Actually merge-sorting is quite useful</h2>
<p>Sort algorithms are often used as a learning guide, especially for functional languages. But how useful is sorting for tackling business problems?</p>
<p>It turns out we use merge-sort in multiple places in our Clojure system. The first use was for grouping of risk data so it can be collated on a trade-by-trade basis. Then we decided to rewrite an old Java-based scheduler that was expensive to maintain. We now have a lazy scheduler (we call it &#8216;Chime&#8217;) that uses this lazy merge-sort to merge together multiple schedules of repeating tasks into a single schedule. I&#8217;ll discuss more about Chime in a future post.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malcolmsparks.com/?feed=rss2&#038;p=42</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Configuring logging for Clojure applications</title>
		<link>http://blog.malcolmsparks.com/?p=37</link>
		<comments>http://blog.malcolmsparks.com/?p=37#comments</comments>
		<pubDate>Sun, 07 Aug 2011 10:06:02 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[logging]]></category>

		<guid isPermaLink="false">http://blog.malcolmsparks.com/?p=37</guid>
		<description><![CDATA[Banks have thousands of applications, each with large complex code-bases that must be supported long after the original application authors have moved onto something else. When something goes wrong a support engineer will invariably check the log file first. Few things are as common to every application as the humble log file. It&#8217;s quite surprising [...]]]></description>
			<content:encoded><![CDATA[<p>Banks have thousands of applications, each with large complex code-bases that must be supported long after the original application authors have moved onto something else. When something goes wrong a support engineer will invariably check the log file first. Few things are as common to every application as the humble log file.</p>
<p>It&#8217;s quite surprising to think that Java didn&#8217;t have an official logging API in the JDK until its fifth major release (1.4).</p>
<h2>The sad history of Java logging</h2>
<p>In the beginning there was <code>System.out.println</code> which writes out to the console. Many still use that today.</p>
<p>For those who needed more sophisticated logging, a package called <em>log4j</em> emerged in 1999 from IBM &#8216;alphaworks&#8217;. This steadily grew in popularity until it became the <em>de-facto</em> standard. Then Sun decided to write their own logging package and include it in the JDK, as <em>java.util.logging</em>. Not that there was anything significantly wrong with <em>log4j</em>, it was simply an example of the &#8216;not-invented-here&#8217; syndrome that Sun was suffering from at the time. Despite a hard fought campaign to persuade Sun to incorporate <em>log4j</em> into the platform. Soon after, Sun realised that this sort of thing was damaging to Java&#8217;s reputation and created the JSR process to incorporate the wider Java community in decision making.</p>
<p>But the fact that Java had 2 near-identical logging packages led to further complexity. Apache created a facade called <a href="http://commons.apache.org/logging/">commons-logging</a> that attempted to wrap both packages. By deciding to use a Java interface, Apache commons-logging negated the carefully crafted performance of the log4j package it was wrapping.</p>
<p>There have been attempts in the Java world to improve matters. One notable example is <a href="http://slf4j.org/"><em>slf4j</em></a>.</p>
<h2>Logging in Clojure</h2>
<p>Clojure logging neatly solves the problems of Java logging.</p>
<p>First, there&#8217;s a single API to use: <a href="https://github.com/clojure/tools.logging">clojure.tools.logging</a>. Thank Alex Taggart and others for this. Second, it integrates with all the important logging frameworks: <em>log4j</em>, <code>java.util.logging</code>, Apache&#8217;s commons-logging and now <em>slf4j</em>. Finally, it side-steps many of the performance penalties associated with logging.</p>
<p>If you have a lot of logging statements in your code-base, you only want to pay the performance penalty for logging statements that are actually being written to the log file. However, too often CPU is wasted in creating a logging message only for that message to be filtered out later on. The conventional workaround in Java is to wrap every complex logging statement with a conditional.</p>
<pre><code>if (logger.isDebugEnabled()) {
    logger.debug(x + " plus " + y + " is " + (x + y));
}
</code></pre>
<p>Without the conditional wrapper, the String concatenation would be quite expensive if it was ultimately unnecessary (for example, if the logger in question was configured to output warnings only). But the workaround is obviously cumbersome, implemented inconsistently and adds noise to the code&#8217;s readability. (Being the new(est) kid on the block, <em>slf4j</em> is able to make use of the JDK 1.5 varargs feature to avoid common string concatenation until after the message has made it through the filters).</p>
<p>Fortunately for Clojure programmer, the <code>clojure.tools.logging</code> takes care of this for them by using macros. As a Clojure programmer you get the benefit of optimized performance without the unsightly code-bloat.</p>
<h2>Configuration</h2>
<p>Unfortunately the Clojure logging facility doesn&#8217;t offer anything to help configure the logging package being used underneath.</p>
<p>Typically, Java logging is configured using configuration files. As you&#8217;d expect from Java, every configuration file is different, each one balancing flexibility with complexity. This is a horrible Java legacy to subject Clojure programmers to. An alternative approach is needed, so I&#8217;ve written a library: <a href="http://www.github.com/malcolmsparks/clj-logging-config">clj-logging-config</a>.</p>
<p>You start by using requiring the library with <code>:use</code> (or <code>:require</code>) in your namespace declaration. In these examples I assume you are using log4j configuration, but you can configure <code>java.util.logging</code> in a similar way.</p>
<pre><code>(ns my-ns
  (:use clojure.tools.logging
        clj-logging-config.log4j))
</code></pre>
<p>Then you configure the logging system declaring a logger with <code>set-logger!</code></p>
<pre><code>(set-logger!)
</code></pre>
<p>and just start logging :-</p>
<pre><code>(info "Just a plain logging message")
</code></pre>
<p>Check your console (not the REPL).</p>
<pre><code>console&gt; INFO - Just a plain logging message
</code></pre>
<p>In <em>log4j</em> it is common to set a pattern against the logger. Full syntax can be found in <a href="http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/EnhancedPatternLayout.html"><code>org.apache.log4j.EnhancedPatternLayout</code></a>. For example, to set a pattern which just displays the message, do this:</p>
<pre><code>(set-logger! :pattern "%m%n")
(info "Just the message this time")

console&gt; Just the message this time
</code></pre>
<p>Or this.</p>
<pre><code>(set-logger! :pattern "%d - %m%n")
(info "A logging message with the date in front")

console&gt; 2011-08-07 10:58:21,084 - A logging message with the date in front
</code></pre>
<p>In the same way you can set the level of the logger. For example, so that the logger only prints warnings, you can do this :-</p>
<pre><code>(set-logger! :level :warn)
</code></pre>
<p>Or if you want to use the <em>log4j</em> Java class, you can do that as well :-</p>
<pre><code>(set-logger! :level org.apache.log4j.Level/WARN)
</code></pre>
<p>You can also set the appender to one of the appenders provided by the logging package you are using. For example, you can do this :-</p>
<pre><code>(set-logger! :out (org.apache.log4j.DailyRollingFileAppender.))
</code></pre>
<p>Or even provide your own appender function.</p>
<pre><code>(set-logger! :out (fn [ev] (println (:message ev))))
</code></pre>
<p>You can also set filters in the same way. Filters are useful but rarely used in <em>log4j</em> because they&#8217;re fundamentally a function, and <em>log4j</em> configurations formats don&#8217;t support writing functions &#8211; perhaps they should have used LISP!</p>
<pre><code>(set-logger! :filter (fn [ev] (not (re-seq #"password" (:message ev)))))
</code></pre>
<p>For filters and appenders the underlying logging event is converted to a Clojure map (with the original event bound to the <code>:event</code> key, should you need it)</p>
<p>You can also set the additivity of the logger &#8211; the default is true but you can set it to false. Addivity is described in the Log4J user-guide &#8211; so we won&#8217;t repeat it here.</p>
<pre><code>(set-logger! :additivity false)
</code></pre>
<p>It (almost) goes without saying that you can combine all these options together :-</p>
<pre><code>(set-logger! :level :warn
             :additivity false
             :pattern "%p - %m%n"
             :filter (constantly true))
</code></pre>
<p>There are some combinations that doesn&#8217;t make sense (such as using <code>:pattern</code> and <code>:layout</code> together). In these cases an exception will be thrown.</p>
<p>So much for getting started. In a real system you&#8217;ll want to configure multiple loggers in the same place :-</p>
<pre><code>(set-loggers! 

    "com.malcolmsparks.foo"
    {:level :info :pattern "%m"}

    ["com.malcolmsparks.bar" "com.malcolmsparks.zip"]
    {:level :debug})
</code></pre>
<p>You&#8217;ll find a lot more in this package, including support for per-thread (per-agent) logging which is useful in certain situations. One of the design goals of this library is that it &#8216;does the right thing&#8217; most of the time. The aim is to prevent the common experience in <em>log4j</em> where you can&#8217;t figure out why you can&#8217;t see your logging statements and revert to <code>println</code> in frustration.</p>
<p>I do hope that Clojure programmers find this library useful. The only way to find out is to ask people to use it. It definitely needs more exposure to real projects but I wanted to release it now to get more feedback. The library is licensed under the EPL (same as Clojure&#8217;s) and available <a href="http://www.github.com/malcolmsparks/clj-logging-config">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.malcolmsparks.com/?feed=rss2&#038;p=37</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Processing large files with Clojure.</title>
		<link>http://blog.malcolmsparks.com/?p=17</link>
		<comments>http://blog.malcolmsparks.com/?p=17#comments</comments>
		<pubDate>Sun, 10 Jul 2011 13:38:58 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[streams]]></category>
		<category><![CDATA[iterate]]></category>

		<guid isPermaLink="false">http://malc.webfactional.com/?p=17</guid>
		<description><![CDATA[By day, I work for a large investment bank in a department that calculates and distributes &#8216;Risk and PnL&#8217;. Simply put, for every transaction that is made by the bank a large amount of data needs to be re-computed periodically to help the bank understand its exposure to various things that can happen in the [...]]]></description>
			<content:encoded><![CDATA[<p>By day, I work for a large investment bank in a department that calculates and distributes &#8216;Risk and PnL&#8217;. Simply put, for every transaction that is made by the bank a large amount of data needs to be re-computed periodically to help the bank understand its exposure to various things that can happen in the financial markets. Like many large organisations, banks produce large amounts of data files that must be processed. However, the data files are often so large that they&#8217;re impossible to load into the memory of a single 64-bit JVM.</p>
<p>Processing files as a stream is a good strategy. It minimizes &#8216;out of memory&#8217; errors by utilizing a plentiful supply of disk space. How do you process a large file as a stream in Clojure?</p>
<p>We start by breaking the file up into a lazy sequence of tokens. The Clojure core library contains a function called <code>line-seq</code> that breaks a file into individual lines. Or you can use the <code>input-stream</code> or <code>reader</code> functions in the <code>clojure.java.io</code> namespace to turn a file into a lazy sequence of bytes or characters if you need finer granularity to form your own tokens depending on the nature of the data you are reading.</p>
<pre><code>(ns example
  (:require [clojure.java.io :as io]))

(line-seq (io/reader "bigfile"))
</code></pre>
<p>This is simplified for the purposes of this article. In real-world scenarios you may choose to improve performance by using NIO memory-mapped files. These techniques utilize main memory when it&#8217;s available without breaking the system when it isn&#8217;t.</p>
<h2>Tokenizing</h2>
<p>Invaribly you will form a lazy sequence of tokens in some way. How should we go about processing these tokens? I&#8217;ve found that using a <em>state machine</em> is a useful approach. We first define some initial state. Our tokens are fed to a function (which we&#8217;ll call <code>f</code>) along with the existing state. The function returns a new state, possibly changed.</p>
<h2>Pacman</h2>
<p>There are lots of other uses for this idea. Imagine we are asked to re-create the game of Pacman. We can list all the elements of the Pacman world :-</p>
<ul>
<li>Our hero, Pacman</li>
<li>The positions of the ghosts</li>
<li>The maze</li>
<li>The pills</li>
<li>The magic pills</li>
<li>The existence of fruit</li>
<li>Whether the ghosts are afraid of Pacman or not</li>
<li>The player&#8217;s score</li>
<li>The number of lives the player has left</li>
<li>The state of the player&#8217;s joystick</li>
</ul>
<p>We can store all this data in a single data-structure we call &#8216;the state&#8217;. We define a function, <code>f</code>, which is given the current world as an argument and which computes the new world we would expect to see on the screen one frame later. The job of programming the game of Pacman is then greatly simplified &#8211; we just call <code>f</code> iteratively and letting the game run as a &#8216;feedback loop&#8217;. All we have to do is implement the function f. (That job is left to the reader!)</p>
<h2>How to apply the state-machine pattern in Clojure</h2>
<p>There are a number of ways you can apply the state-machine method in Clojure. One simple way is to use the &#8216;second form&#8217; of the <code>reduce</code> function in the Clojure core library (that&#8217;s the one that takes some initial state).</p>
<pre><code> (reduce f {} coll)
</code></pre>
<p>This first calls the function f with two arguments- an initial state (we used an empty map here) and the first element of the collection. The result of this becomes the first argument of another invocation of the same function, while the second argument is the next element of the collection. This <em>repeated function application</em> continues until the collection is exhausted.</p>
<p>Let&#8217;s pretend our tokens are &#8216;H&#8217;, &#8216;A&#8217;, &#8216;L&#8217;. This would be the result of applying <tt>reduce</tt> :-</p>
<pre><code> (f (f (f {} 'H') 'A') 'L')
</code></pre>
<p>The <code>reduce</code> function is a good starting point but it usually the wrong choice of function for parsing token streams :-</p>
<ul>
<li>It is not possible to look-ahead to know what subsequent tokens will be</li>
<li>The function must always consume each token and cannot push the token back onto the stream.</li>
</ul>
<p>Many parsers need both these features. So we need a refinement.</p>
<p>There&#8217;s also another subtler problem with <code>reduce</code> that we&#8217;ll come on to :-</p>
<ul>
<li>We can&#8217;t inspect the operation of reduce while it is running, we have to wait for it to complete.</li>
</ul>
<h2>Using <code>iterate</code></h2>
<p>Fortunately the Clojure core library contains a more general method for the repetitive application of a function: <code>iterate</code>.</p>
<p>Here&#8217;s how we can use <code>iterate</code> to replace <code>reduce</code>.</p>
<pre><code>(iterate f {:remaining coll})
</code></pre>
<p>Our state is initialized with the stream itself. The function has the responsibility for taking the first element and processing it. It returns a new state with a new entry representing the remaining elements. Usually the function will replace the <code>:remaining</code> element with remaining elements of the collection once the first element has been processed. However, it can do anything it chooses with this element, including looking ahead for future tokens and pushing back old ones.</p>
<p>The function is repeatedly applied with the map. The map can contain any state that is necessary for the processing of the stream. Often this includes stacks (vectors) to retain state at each level of context.</p>
<p>For the purposes of the rest of this article we&#8217;ll rewrite this using the <code>->></code> threading macro. We use this macro a lot when manpulating sequences.</p>
<pre><code>(->> {:remaining coll}
     (iterate f))
</code></pre>
<h2>Preventing infinite application</h2>
<p>The problem with <code>iterate</code> is that it tends to go on forever. We can avoid this by adding an <code>:end</code> into our map when we detect the end of the stream. We can then make the resulting sequence finite by using <code>take-while</code>.</p>
<pre><code>(->> {:remaining coll}
     (iterate f)
     (take-while #(not (contains? :end %))))
</code></pre>
<p>Since <code>iterate</code> returns a sequence of states we can choose to inspect any single state</p>
<h2>Yielding</h2>
<p>Often we want to parse a stream of tokens into a stream of something else, usually something more concrete and coarse-grained.</p>
<p>For example, huge XML files are often the result of repeated structures. In my case, this could be the calculated risk for every trade in a book. The structure itself for each trade isn&#8217;t complicated, but a large number of trades can make the resulting file very large. We have an algorithm for turning a stream of XML events into a stream of &#8216;virtual&#8217; XML documents, each representing the entire XML file as it would look like if it contained just a single trade. The &#8216;stream&#8217; is made up of these virtual XML documents, one of each trade. This allows us to tame the size of the file and avoid reading the entire file into memory.</p>
<p>The way to solve this problem is to allow our function to &#8216;yield&#8217; whenever it has enough information to return a new data structure. We use the key <code>:yield</code> in the returned map to indicate this.</p>
<p>We use <code>filter</code> to filter out interrim states.</p>
<pre><code>(->> {:remaining coll}
     (iterate f)
     (take-while #(not (contains? :end %)))
     (map :yield)
     (filter #(not (nil? %))))
</code></pre>
<h2>Factoring out the pattern</h2>
<p>It&#8217;s time to put all this together in its own function.</p>
<pre><code>(defn parse [f coll]
  (->> {:remaining coll}
       (iterate f)
       (take-while #(not (contains? :end %)))
       (map :yield)
       (filter #(not (nil? %)))))
</code></pre>
<h2>Destructuring</h2>
<p>We now discuss the function <code>f</code>. Before we provide examples, let&#8217;s explain the structure of these functions a little more. The function <code>f</code> is given a map as its only parameter.</p>
<p>This example function does nothing except consume the first token. It does this by applying <code>next</code> to the <code>:remaining</code> entry in the map.</p>
<pre><code>(fn [m]
    {:remaining (next (:remaining m))})
</code></pre>
<p>This is less convenient than if we had named arguments for everything in the map, but we can use <em>destructuring</em> to get close to this.<br />
So here&#8217;s our function we some destructuring :-</p>
<pre><code>(fn [{remaining :remaining}]
    {:remaining (next remaining)})
</code></pre>
<p>But usually we want to do something with each token in the stream. Let&#8217;s destructure a little more.</p>
<pre><code>(fn [{[token &#038; remaining :as stream] :remaining}]
    {:remaining remaining})
</code></pre>
<p>This sets <code>token</code> to the token in the stream we shall process, <code>remaining</code> as the collection of tokens that remain in stream after the token and <code>stream</code> as the collection of remaining tokens includinng the token. This last setting is achieved using the <code>:as</code> keyword. This is useful when we want to return a new state but without consuming the token, equivalent to a pushback. There are certain cases where this is useful.</p>
<p>Notice also how destructuring allows us to balance the complexity of the function body with the function signature. Let&#8217;s digress while we&#8217;re discussing destructuring to show how far we can go.</p>
<h3>Extreme destructuring</h3>
<p><a href="http://blog.jayfields.com/2010/07/clojure-destructuring.html">Jay Fields</a> provides a good overview of the destructuring &#8216;mini-language&#8217; inside Clojure. At the time of writing there aren&#8217;t many other good web resources so I&#8217;ll show how useful destructuring can be for our example.</p>
<p>[jf]: http://blog.jayfields.com/2010/07/clojure-destructuring.html  &#8220;Jay Fields&#8221;</p>
<p>In a real-world situation our function will need to make use of other entries in the map. Let&#8217;s assume our map also contains entries for <code>:stack</code>, <code>:indent</code> and <code>:count</code>. We can destructure these in one go using the <code>:keys</code> keyword :-</p>
<pre><code>(fn [{[token &#038; remaining :as stream] :remaining
                 :keys [stack indent count]
                 :or {indent 0}}]
    {:remaining remaining})
</code></pre>
<p>In the case of indent we have also indicated a default value of 0. Variables that do not have corresponding keys in the map argument and which do not have defaults get the value of <code>nil</code>.</p>
<h2>Footnotes</h2>
<p>The pattern described involes a lot of map manipulation. Of course, thanks to Clojure&#8217;s STM (Software Transactional Memory) we can mutate the map without worrying about where else we&#8217;ve referenced the map, since those values won&#8217;t change. However, since we often want to change multiple values in the map it would be useful to have a helper function.</p>
<pre><code>(defn
  ^{:doc "Apply functions to values in a map."}
  mapply
  [m &#038; kfs]
  (apply hash-map
         (mapcat (fn [[k f]] [k (f (get m k))])
                 (partition 2 kfs))))
</code></pre>
<p>In some situations <code>mapply</code> has certain advantages over <code>assoc</code>. With <code>assoc</code> we have to get the data value out of the map, apply a function to it and then put it back in the map. We have to be careful to use the same key for the access and overwrite operations. By contrast, with <code>mapply</code> we send the function straight to the map entry.</p>
<p>With <code>mapply</code>, keys that are not specified are removed from the map. This is usually the safest thing to do to prevent temporary entries from bleeding across multiple state transitions. But it is easy to retain existing values in the map by specifying the key with the value of <code>identity</code>.</p>
<pre><code>;; Increment :count and retain the value of :foo.
(mapply {:count 1 :foo "foo"} :count inc :foo identity)
</code></pre>
<p>Finally, <code>mapply</code> can insert a new value, just like <code>assoc</code> does, by applying the <code>constantly</code> function.</p>
<pre><code>;; Remove :count and set the :foo to "bar".
(mapply {:count 1 :foo "foo"} :foo (constantly "bar"))
</code></pre>
]]></content:encoded>
			<wfw:commentRss>http://blog.malcolmsparks.com/?feed=rss2&#038;p=17</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>

