Saturday, June 20, 2015

Announcing the 'extra' library

Summary: I wrote an extra library, which contains lots of my commonly used functions.

When starting to write Bake, my first step was to copy a lot of utility functions from Shake - things like fst3 (select the first element of a triple), concatMapM (monadic version of concatMap), withCurrentDirectory (setCurrentDirectory under bracket). None of these functions are specific to either Bake or Shake, and indeed, many of the ones in Shake had originally came from HLint. Copy and pasting code is horrible, so I extracted the best functions into a common library which I named extra. Unlike the copy/paste versions in each package, I then wrote plenty of tests, made sure the functions worked in the presence of exceptions, did basic performance optimisation and filled in some obvious gaps in functionality.

I'm now using the extra library in all the packages above, plus things like ghcid and Hoogle. Interestingly, I'm finding my one-off scripts are making particularly heavy use of the extra functions. I wrote this package to reduce my maintenance burden, but welcome other users of extra.

My goal for the extra library is simple additions to the standard Haskell libraries, just filling out missing functionality, not inventing new concepts. In some cases, later versions of the standard libraries provide the functions I want, so there extra makes them available all the way back to GHC 7.2, reducing the amount of CPP in my projects. A few examples:

The module Extra documents all functions provided by the library, so is a good place to go to see what is on offer. Modules such as Data.List.Extra provide extra functions over Data.List and also reexport Data.List. Users are recommended to replace Data.List imports with Data.List.Extra if they need the extra functionality.

Which functions?

When selecting functions I have been guided by a few principles.

  • I have been using most of these functions in my packages - they have proved useful enough to be worth copying/pasting into each project.
  • The functions follow the spirit of the original Prelude/base libraries. I am happy to provide partial functions (e.g. fromRight), and functions which are specialisations of more generic functions (whenJust).
  • Most of the functions have trivial implementations. If a beginner couldn't write the function, it probably doesn't belong here.
  • I have defined only a few new data types or type aliases. It's a package for defining new utilities on existing types, not new types or concepts.


One feature I particularly like about this library is that the documentation comments are tests. A few examples:

Just True ||^ undefined  == Just True
retry 3 (fail "die") == fail "die"
whenJust (Just 1) print == print 1
\x -> fromRight (Right x) == x
\x -> fromRight (Left  x) == undefined

These equalities are more semantic equality than Haskell's value equality. Things with lambda's are run through QuickCheck. Things which print to stdout are captured, so the print 1 test really does a print, which is scraped and compared to the LHS. I run these tests by passing them through a preprocessor, which spits out this code, which I then run with some specialised testing functions.

Friday, June 19, 2015

Maximum Patches, Minimum Time

Summary: Writing a fast continuous integration server is tricky.

When I started writing the continuous integration server Bake, I thought I had a good idea of what would go fast. It turned out I was wrong. The problem Bake is trying to solve is:

  • You have a set of tests that must pass, which take several hours to run.
  • You have a current state of the world, which passes all the tests.
  • You have a stream of hundreds of incoming patches per day that can be applied to the current state.
  • You want to advance the state by applying patches, ensuring the current state always passes all tests.
  • You want to reject bad patches and apply good patches as fast as possible.

I assume that tests can express dependencies, such as you must compile before running any tests. The test that performs compilation is special because it can go much faster if only a few things have changed, benefiting from incremental compilation.

Both my wrong solution, and my subsequent better solution, are based on the idea of a candidate - a sequence of patches applied to the current state that is the focus of testing. The solutions differ in when patches are added/removed from the candidate and how the candidates are compiled.

A Wrong Solution

My initial solution compiled and ran each candidate in a separate directory. When the directory was first created, it copied a nearby candidate to try and benefit from incremental compilation.

Each incoming patch was immediately included in the candidate, compiled, and run on all tests. I would always run the test that had not passed for the longest time, to increase confidence in more patches. Concretely, if I have run test T1 on patch P1, and P2 comes in, I start testing T2 on the combination of P1+P2. After that passes I can be somewhat confident that P1 passes both T1 and T2, despite not having run T2 on just P1.

If a test fails, I bisect to find the patch that broke it, reject the patch, and immediately throw it out of the candidate.

The Problems

There are three main problems with this approach:

  • Every compilation starts with a copy of a nearby candidate. Copying a directory of lots of small files (the typical output of a compiler) is fantastically expensive on Windows.
  • When bisecting, I have to compile at lots of prefixes of the candidate, the cost of which varies significantly based on the directory it starts from.
  • I'm regularly throwing patches out of the candidate, which requires a significant amount of compilation, as it has to recompile all patches that were after the rejected patch.
  • I'm regularly adding patches to the candidate, each of which requires an incremental compilation, but tends to be dominated by the directory copy.

This solution spent all the time copying and compiling, and relatively little time testing.

A Better Solution

To benefit from incremental compilation and avoid copying costs, I always compile in the same directory. Given a candidate, I compile each patch in the series one after another, and after each compilation I zip up the interesting files (the test executables and test data). To bisect, I unzip the relevant files to a different directory. On Windows, unzipping is much more expensive than zipping, but that only needs to be done when bisecting is required. I also only need to zip the stuff required for testing, not for building, which is often much smaller.

When testing a candidate, I run all tests without extending the candidate. If all the tests pass I update the state and create a new candidate containing all the new patches.

If any test fails I bisect to figure out who should be rejected, but don't reject until I've completed all tests. After identifying all failing tests, and the patch that caused each of them to fail, I throw those patches out of the candidate. I then rebuild with the revised candidate and run only those tests that failed last time around, trying to seek out tests where two patches in a candidate both broke them. I keep repeating with only the tests that failed last time, until no tests fail. Once there are no failing tests, I extend the candidate with all new patches, but do not update the state.

As a small tweak, if there are two patches in the queue from the same person, where one is a superset of the other, I ignore the subset. The idea is that if the base commit has an error I don't want to track it down twice, once to the first failing commit and then again to the second one.

Using this approach in Bake

First, the standard disclaimer: Bake may not meet your needs - it is a lot less developed than other continuous integration systems. If you do decide to use Bake, you should run from the git repo, as the Hackage release is far behind. That said, Bake is now in a reasonable shape, and might be suitable for early adopters.

In Bake this approach is implemented in the StepGit module, with the ovenStepGit function. Since Bake doesn't have the notion of building patches in series it pretends (to the rest of Bake) that it's building the final result, but secretly caches the intermediate steps. If there is a failure when compiling, it caches that failure, and reports it to each step in the bisection, so Bake tracks down the correct root cause.

I am currently recommending ovenStepGit as the "best" approach for combining git and an incremental build system with Bake. While any incremental build system works, I can't help but plug Shake, because its the best build system I've ever written.

Sunday, June 07, 2015

ghcid 0.4 released

Summary: I've released a new version of ghcid, which is faster and works better for Vim users.

I've just released version 0.4 of ghcid. For those who haven't tried it, it's the simplest/dumbest "IDE" that could possibly exist. It presents a small fixed-height console window that shows the active warnings/errors from ghci, which is updated on every save. While that might not seem like much, there are a number of improvements over a standard workflow:

  • You don't have to switch to a different window and hit :reload, just saving in your editor is enough.
  • It includes warnings from modules that successfully compiled previously, as opposed to ghci which omits warnings from modules that didn't reload.
  • It reformats the messages to the height of your console, so you can see several messages at once.

I find it gives me a significant productivity boost, and since there is no editor integration, it adds to my existing tools, rather than trying to replace them (unlike a real IDE).

Version 0.4 offers a few additional improvements:

  • We use fsnotify to watch for changes, rather than polling as in older versions. The result is a significant decrease in battery usage, combined with faster responses to changes.
  • If you change the .cabal or .ghci file then ghcid will restart the ghci session to pick up the changes.
  • If you are a Vim user, then the sequence of temporary files Vim uses on save could upset ghcid. I believe these issues are now all eliminated.
  • The number of errors/warnings is displayed in the titlebar of the console.
  • There is a feature to run a test after each successful save (using --test). I am expecting this to be a niche feature, but feel free to prove me wrong.

Tempted to give it a try? See the README for details.

Thursday, May 21, 2015

Handling Control-C in Haskell

Summary: The development version of ghcid seemed to have some problems with terminating when Control-C was hit, so I investigated and learnt some things.

Given a long-running/interactive console program (e.g. ghcid), when the user hits Control-C/Ctrl-C the program should abort. In this post I'll describe how that works in Haskell, how it can fail, and what asynchronous exceptions have to do with it.

What happens when the user hits Ctrl-C?

When the user hits Ctrl-C, GHC raises an async exception of type UserInterrupt on the main thread. This happens because GHC installs an interrupt handler which raises that exception, sending it to the main thread with throwTo. If you install your own interrupt handler you won't see this behaviour and will have to handle Ctrl-C yourself.

There are reports that if the user hits Ctrl-C twice the runtime will abort the program. In my tests, that seems to be a feature of the shell rather than GHC itself - in the Windows Command Prompt no amount of Ctrl-C stops an errant program, in Cygwin a single Ctrl-C works.

What happens when the main thread receives UserInterrupt?

There are a few options:

  • If you are not masked and there is no exception handler, the thread will abort, which causes the whole program to finish. This behaviour is the desirable outcome if the user hits Ctrl-C.
  • If you are running inside an exception handler (e.g. catch or try) which is capable of catching UserInterrupt then the UserInterrupt exception will be returned. The program can then take whatever action it wishes, including rethrowing UserInterrupt or exiting the program.
  • If you are running with exceptions masked, then the exception will be delayed until you stop being masked. The most common way of running while masked is if the code is the second argument to finally or one of the first two arguments to bracket. Since Ctrl-C will be delayed while the program is masked, you should only do quick things while masked.

How might I lose UserInterrupt?

The easiest way to "lose" a UserInterrupt is to catch it and not rethrow it. Taking a real example from ghcid, I sometimes want to check if two paths refer to the same file, and to make that check more robust I call canonicalizePath first. This function raises errors in some circumstances (e.g. the directory containing the file does not exist), but is inconsistent about error conditions between OS's, and doesn't document its exceptions, so the safest thing is to write:

canonicalizePathSafe :: FilePath -> IO FilePath
canonicalizePathSafe x = canonicalizePath x `catch`
    \(_ :: SomeException) -> return x

If there is any exception, just return the original path. Unfortunately, the catch will also catch and discard UserInterrupt. If the user hits Ctrl-C while canonicalizePath is running the program won't abort. The problem is that UserInterrupt is not thrown in response to the code inside the catch, so ignoring UserInterrupt is the wrong thing to do.

What is an async exception?

In Haskell there are two distinct ways to throw exceptions, synchronously and asynchronously.

  • Synchronous exceptions are raised on the calling thread, using functions such as throw and error. The point at which a synchronous exception is raised is explicit and can be relied upon.
  • Asynchronous exceptions are raised by a different thread, using throwTo and a different thread id. The exact point at which the exception occurs can vary.

How is the type AsyncException related?

In Haskell, there is a type called AsyncException, containing four exceptions - each special in their own way:

  • StackOverflow - the current thread has exceeded its stack limit.
  • HeapOverflow - never actually raised.
  • ThreadKilled - raised by calling killThread on this thread. Used when a programmer wants to kill a thread.
  • UserInterrupt - the one we've been talking about so far, raised on the main thread by the user hitting Ctrl-C.

While these have a type AsyncException, that's only a hint as to their intended purpose. You can throw any exception either synchronously or asynchronously. In our particular case of caonicalizePathSafe, if canonicalizePath causes a StackOverflow, we probably are happy to take the fallback case, but likely the stack was already close to the limit and will occur again soon. If the programmer calls killThread that thread should terminate, but in ghcid we know this thread won't be killed.

How can I catch avoid catching async exceptions?

There are several ways to avoid catching async exceptions. Firstly, since we expect canonicalizePath to complete quickly, we can just mask all async exceptions:

canonicalizePathSafe x = mask_ $
    canonicalizePath x `catch` \(_ :: SomeException) -> return x

We are now guaranteed that catch will not receive an async exception. Unfortunately, if canonicalizePath takes a long time, we might delay Ctrl-C unnecessarily.

Alternatively, we can catch only non-async exceptions:

canonicalizePathSafe x = catchJust
    (\e -> if async e then Nothing else Just e)
    (canonicalizePath x)
    (\_ -> return x)

async e = isJust (fromException e :: Maybe AsyncException)

We use catchJust to only catch exceptions which aren't of type AsyncException, so UserInterrupt will not be caught. Of course, this actually avoids catching exceptions of type AsyncException, which is only related to async exceptions by a partial convention not enforced by the type system.

Finally, we can catch only the relevant exceptions:

canonicalizePathSafe x = canonicalizePath x `catch`
    \(_ :: IOException) -> return x

Unfortunately, I don't know what the relevant exceptions are - on Windows canonicalizePath never seems to throw an exception. However, IOException seems like a reasonable guess.

How to robustly deal with UserInterrupt?

I've showed how to make canonicalizePathSafe not interfere with UserInterrupt, but now I need to audit every piece of code (including library functions I use) that runs on the main thread to ensure it doesn't catch UserInterrupt. That is fragile. A simpler alternative is to push all computation off the main thread:

import Control.Concurrent.Extra
import Control.Exception.Extra

ctrlC :: IO () -> IO ()
ctrlC act = do
    bar <- newBarrier
    forkFinally act $ signalBarrier bar
    either throwIO return =<< waitBarrier bar

main :: IO ()
main = ctrlC $ ... as before ...

We are using the Barrier type from my previous blog post, which is available from the extra package. We create a Barrier, run the main action on a forked thread, then marshal completion/exceptions back to the main thread. Since the main thread has no catch operations and only a few (audited) functions on it, we can be sure that Ctrl-C will quickly abort the program.

Using version 1.1.1 of the extra package we can simplify the code to ctrlC = join . onceFork.

What about cleanup?

Now we've pushed most actions off the main thread, any finally sections are on other threads, and will be skipped if the user hits Ctrl-C. Typically this isn't a problem, as program shutdown automatically cleans all non-persistent resources. As an example, ghcid spawns a copy of ghci, but on shutdown the pipes are closed and the ghci process exits on its own. If we do want robust cleanup of resources such as temporary files we would need to run the cleanup from the main thread, likely using finally.

Should async exceptions be treated differently?

At the moment, Haskell defines many exceptions, any of which can be thrown either synchronously or asynchronously, but then hints that some are probably async exceptions. That's not a very Haskell-like thing to do. Perhaps there should be a catch which ignores exceptions thrown asynchronously? Perhaps the sync and async exceptions should be of different types? It seems unfortunate that functions have to care about async exceptions as much as they do.

Combining mask and StackOverflow

As a curiosity, I tried to combine a function that stack overflows (using -O0) and mask. Specifically:

main = mask_ $ print $ foldl (+) 0 [1..1000000]

I then ran that with +RTS -K1k. That prints out the value computed by the foldl three times (seemingly just a buffering issue), then fails with a StackOverflow exception. If I remove the mask, it just fails with StackOverflow. It seems that by disabling StackOverflow I'm allowed to increase my stack size arbitrarily. Changing print to appendFile causes the file to be created but not written to, so it seems there are oddities about combining these features.


I'm certainly not an expert on async exceptions, so corrections welcome. All the above assumes compiling with -threaded, but most applies without -threaded.

Wednesday, May 06, 2015

Announcing js-jquery Haskell Library

Summary: The library js-jquery makes it easy to get at the jQuery Javascript code from Haskell. I've just released a new version.

I've just released the Haskell library js-jquery 1.11.3, following the announcement of jQuery 1.11.3. This package bundles the minified jQuery code into a Haskell package, so it can be depended upon by Cabal packages. The version number matches the upstream jQuery version. It's easy to grab the jQuery code from Haskell using this library, as an example:

import qualified Language.Javascript.JQuery as JQuery

main = do
    putStrLn $ "jQuery version " ++ show JQuery.version ++ " source:"
    putStrLn =<< readFile =<< JQuery.file

There are two goals behind this library:

  • Make it easier for jQuery users to use and upgrade jQuery in Haskell packages. You can upgrade jQuery without huge diffs and use it without messing around with extra-source-files.
  • Make it easier for upstream packagers like Debian. The addition of a jQuery file into a Haskell package means you are mixing licenses, authors, and distributions like Debian also require the source (unminified) version of jQuery to be distributed alongside. By having one package provide jQuery they only have to do that work once, and the package has been designed to meet their needs.

It's pretty easy to convert something that has bundled jQuery to use the library, as some examples:

The library only depends on the base library so it shouldn't cause any version hassles, although (as per all Cabal packages) you can't mix and match libraries with incompatible js-jquery version constraints in one project.

As a companion, there's also js-flot, which follows the same ideas for the Flot library.

Saturday, April 25, 2015

Cleaning stale files with Shake

Summary: Sometimes source files get deleted, and build products become stale. Using Shake, you can automatically delete them.

Imagine you have a build system that compiles Markdown files into HTML files for your blog. Sometimes you rename a Markdown file, which means the corresponding HTML will change name too. Typically, this will result in a stale HTML file being left, one that was previously produced by the build system, but will never be updated again. You can remove that file by cleaning all outputs and running the build again, but with the Shake build system you can do better. You can ask for a list of all live files, and delete the build products not on that list.

A basic Markdown to HTML converter

Let's start with a simple website generator. For each Markdown file, with the extension .md, we generate an HTML file. We can write that as:

import Development.Shake
import Development.Shake.FilePath

main :: IO ()
main = shakeArgs shakeOptions $ do
    action $ do
        mds <- getDirectoryFiles "." ["//*.md"]
        need ["output" </> x -<.> "html" | x <- mds]

    "output//*.html" %> \out -> do
        let src = dropDirectory1 out -<.> "md"
        need [src]
        cmd "pandoc -s -o" [out, src]

    phony "clean" $ do
        removeFilesAfter "output" ["//*.html"]

Nothing too interesting here. There are three parts:

  • Search for all .md files, and for each file foo/ require output/foo/bar.html.
  • To generate an .html file, depend on the source file then run pandoc.
  • To clean everything, delete all .html files in output.

Using a new feature in Shake 0.15, we can name save this script as Shakefile.hs and then:

  • shake will build all the HTML files.
  • shake -j0 will build all the files, using one thread for each processor on our system.
  • shake output/foo.html will build just that one HTML file.
  • shake clean will delete all the HTML files.

Removing stale files

Now let's imagine we've added a blog post Before publishing we decide to rename our post to If we've already run shake then there will be a stale file output/using-pipes.html. Since there is no source .md file, Shake will not attempt to rebuild the file, and it won't be automatically deleted. We can do shake clean to get rid of it, but that will also wipe all the other HTML files.

We can run shake --live=live.txt to produce a file live.txt listing all the live files - those that Shake knows about, and has built. If we run that after deleting it will tell us that and output/ are both "live". If we delete all files in output that are not mentioned as being live, that will clean away all our stale files.

Using Shake 0.15.1 (released in the last hour) you can write:

import Development.Shake
import Development.Shake.FilePath
import Development.Shake.Util
import System.Directory.Extra
import Data.List
import System.IO

pruner :: [FilePath] -> IO ()
pruner live = do
    present <- listFilesRecursive "output"
    mapM_ removeFile $ map toStandard present \\ map toStandard live

main :: IO ()
main = shakeArgsPrune shakeOptions pruner $ do
     ... as before ...

Now when running shake --prune it will build all files, then delete all stale files, such as output/using-pipes.html. We are using the shakePrune function (just sugar over --live) which lets us pass a pruner function. This function gets called after the build completes with a list of all the live files. We use listFilesRecursive from the extra package to get a list of all files in output, then do list difference (\\) to delete all the files which are present but not live. To deal with the / vs \ path separator issue on Windows, we apply toStandard to all files to ensure they match.

A few words of warning:

  • If you run shake output/foo.html --prune then it will only pass output/foo.html and as live files, since they are the only ones that are live as you have asked for a subset of the files to be built. Generally, you want to enable all sensible targets (typically no file arguments) when passing --prune.
  • Sometimes a rule will generate something you care about, and a few files you don't really bother tracking. As an example, building a GHC DLL on Windows generates a .dll and a .dll.a file. While the .dll.a file may not be known to Shake, it probably doesn't want to get pruned. The pruning function may need a few special cases, like not deleting the .dll.a file if the .dll is live.

Thursday, March 26, 2015

New website with new talks

My website is now located at and has a few new talks on it:

My old website at is going to stop being updated, and I'll be putting in redirections shortly. That server is going to stop hosting websites, so I bought myself a domain name and setup a GitHub pages website. The repo is here, including all the data, metadata, templates and scripts.