I have been spending quite a lot of time lately looking at A simple TCP server written by the elusive and somewhat anonymous mrd. It is, as the name suggests a simple TCP server which acts as a rudimentary chat server. Any input given is output to all clients. All very straightforward. Or so I thought.
First steps
My problems began many months ago when I began building a kind of generic network server on top of this basic framework. Everything seemed to be going well until I wondered to myself how it would cope under load. So I fired up ab and set about stressing my server. Sadly the stress was too much and my server exited cleanly.
Cleanly? Yes, it seemed that my server just shut down without a murmur. I found that a little odd – I was expecting it to become unresponsive and churn away to itself, or to crash with some error about too many sockets open or the like. What I was not expecting was what appeared to all intents and purposes to be a nice normal exit as though everything was normal.
Much head-scratching, hand-wringing, nail-biting and teeth-gnashing later, I was a bit of a wreck but my server would still merrily shutdown under load. Clearly this was a time to go back to first principles. Unfortunately it wasn’t so clear to me at the time, so I continued ever onwards, changing bits and pieces here and there to try and tease the answer out of the rather convoluted server that was evolving before my very eyes. Unsurprisingly, the answer was not forthcoming.
Time to get serious
Fast-forward a few months and I had another go. More fool me. Once again after a few days work when I performed the stress test, the graceful shutdown again occurred. This time I stopped myself from beginning to create large convoluted forking mechanisms and instead decided to try testing the original building block itself.
Lo and behold – the same behaviour was present in the original. Now we were (finally) getting somewhere. Even better, the shutdowns weren’t my fault – they must be either present in the basic server or else are some artefact of the interaction with the operating system.
Time constraints got in the way however, and I didn’t get a chance to look at what exactly was the problem. And so the issue sat in the back of my mind waiting for the opportunity to taunt me once more. And that time came a few days ago.
I sat down with the code and promised myself that I would resolve the issue. First, I made sure the problem was still present – which it was. My next step was to try the Haskell IRC channel but there were no clues to be found there. So I resorted to poking the code myself with surgical precision looking for any tumour that could be causing the symptoms.
Specifics
At this point I should probably start providing some specifics about what I was actually doing, since I suspect there are actually several issues present. The test run is
ab -n 1000 -c 100 http://localhost:9999/
with the following behaviours:
- left to run
- interrupted with ctrl-c
- interrupted with ctrl-c, rerun, interrupted, rerun etc.
There are few things to be aware of here. First, ab is sending HTTP requests and expecting HTTP responses. It doesn’t get these, instead it gets a whole bunch of HTTP requests sent back if it is lucky and they are likely to be all mixed up. Here is a typical log of the console when running:
new client data: GET / HTTP/1.0 data: Host: localhost:9999 data: User-Agent: ApacheBench/2.3 data: Accept: */* data: data: GET / HTTP/1.0 data: Host: localhost:9999 data: User-Agent: ApacheBench/2.3 data: Accept: */* data: data: Host: localhost:9999 data: User-Agent: ApacheBench/2.3 data: Accept: */* new client new client data: GET / HTTP/1.0
As a result of these malformed HTTP responses, the individual responses will, if all is well, time out. So, there are potentially 1000 requests waiting for responses.
The second thing to be aware of is that the behaviour is non-deterministic. This is not unexpected, as we are at the mercy of threading. Sometimes the first test (left to run) will not cause the server to shutdown, other times it will. Sometimes the second test (interrupted) will not cause the server to shutdown, other times it also will. However, the third test has always resulted in the server shutting down, almost always on the second interruption – although not always.
The investigation begins
The first exploratory incision was made by removing the
hPutStrLn h line
line. This does appear to have the effect of making the server reliable once more. Unfortunately it also has the effect of making the char server very quiet, since nothing is ever output to the clients. Rather an extreme solution, but it begins to suggest where the problem might lie.
The next attempt was to reintroduce the client communication but to make it small – by simply outputting a single character. This relatively small change retrieved the shutdown once more. So even a single character seems to cause problem. Unless, it is not the character itself, but the flushing of the character that is the issue?
Attempt number three reintroduced the original client communication but removed the
hFlush h
line trailing it. Now, I fully expected that this would again change the client behaviour most likely by preventing any output, but I was curious as to whether it was the flush rather than the write that was the problem. Interestingly, the server still shutdown but it seemed to be a little more reliable as it was only the third (repeated interrupts and reruns) which cause the shutdown.
At this point, I sat back a little and considered what (little) I knew. I knew that the server was shutting down cleanly. I knew that the process of writing to the clients was causing the problem. I knew precious little else.
Resorting to logic
Having exhausted my surgical probing for now, I resorted to trying to reason logically about the problem. The main issue was the fact that the shutdown was controlled – this suggested by looking at the code that there must be an exception being called.
start servSock `finally` sClose servSock
Commenting out the finally clause I was confident I would see some nice error pointing me to the cause of all my problems. But it was not to be. Commenting out the finally had no apparent effect whatsoever. Deflated, I replaced it and thought some more.
My next attempt to track the problem was, I thought, inspired. Since I had already reasoned that there was clearly some sort of I/O issue, I decided to remove all the internal putStrLns in case they were somehow overloading something. Namely, these likely culprits
putStrLn "new client"
putStrLn $ "data: " ++ line
when (dropped > 0) $
putStrLn ("clients lost: " ++ show dropped)
Once again though, my hopes were dashed – the console was much quieter, and then provided me with the by-now all-too-familiar bash prompt. Still it shutdown without a murmur.
The next plan of attack was to try and find if there was some uncaught exception somewhere which was somehow causing the server to exit. A quick look on Hoogle led to setUncaughtExceptionHandler. Feeling confident again, I liberally sprinkled some trivial handlers to do nothing more that output something so I could see where the uncaught exception might be. But as was becoming an apparent trend, nothing was output.
From despair to documentation
Around this time I was beginning to despair of actually finding the problem. Instead, I thought that perhaps the way to go was to accept the shortcomings of my little server and instead make the server talk HTTP to the client so that ab would get meaningful responses. However I felt that this was akin to sweeping the problem under the carpet.
By this stage I was desperate and had resorted to reading documentation. While doing do, I stumbled upon GHC’s implementation of concurrency. Up to this point had been simply compiling my server with the command
ghc --make original
but this suggested to me that perhaps a -threaded would be worth a try. Try I did, and while it seemed to take a bit more effort to get the server to quit, quit it nevertheless did. Not to be daunted I decided that while I was messing around with threads, I might as well change the calls to forkIO to forkOS to see if that made a difference. Make a difference it did – the server was now again much easier to overload. Back to the drawing board once more.
It was time to start poking at the operating system. I set up a script to try and start 2000 instances of my server to see what happened.
for i in `seq 3000 4999` do ./original $i & echo $i done
I kept an eye on the number of sockets being used with socklist | wc -l, but I needn’t have worried. The OS seemed quite happy to start up that many instances. It even seemed to manage when I then ran ab against one of them. However, the server still died under load. Still, I thought, it might be worth increasing the ulimits just in case. In particular, the maximum number of open files might be worth increasing in case that is having an effect. After quite a lot of tinkering with making the number of open files ludicrously large in both the server shell and the testing shell, I came to the probably inevitable conclusion that this was not causing the problem either.
Blame the code, not the OS!
By now, I had exhausted my efforts at trying to blame the operating system, so I was back to poking at the server code.
It seemed that it was time to try to sweep the problem under the carpet for now, do I modified the code to transform it from being a simple chat server to effectively be a very dull HTTP server by simply returning a trivial response to each request, and then closing the connection. To do this, I removed the forM clients loop and instead added the following lines to just after the “new client” output:
hPutStrLn h "HTTP/1.1 200 Ok\r\nContent Length: 2\r\n\r\nHi" hClose h
Now, upon running the first (uninterrupted test, everything works much better. Which is to say, the server still shuts down, but it does it now with a proper error, namely
hPutStr: illegal operation (handle is closed)
This is progress! In addition, ab reports that there 1000 requests made (as expected) and that 64 failed with exceptions. There is also only one place in the code where hPutStr is used, so the search is narrowed down quite considerably.
The question is now this: how can the handle be closed when it has only just connected? Also, upon repeating the test, I found that sometimes, the additional error
accept: invalid argument (Bad file descriptor)
is returned. This however, seems more than likely due to my over-zealous testing before sufficient sockets have cleaned themselves up, so I don’t worry too much about it. As expected, if I curb my impatience and wait until
socklist | wc -l
returns a nice small number like 6 rather than a silly number like 1214, the test behaves better. Unfortunately, repeating the test yet more times discovers that the number of sockets seems irrelevant and that sometimes the test fails spectacularly badly and sometimes it almost works well – but the server always shuts down with the illegal operation error, occasionally coupled with the invalid argument error.
Now, the accept error looks like a problem with the underlying socket which could be resolved with a suitable catch clause, so I add one in:
start servSock `catch` (\e -> error "TROUBLE!") `finally` sClose servSock
Voila! Now we get the server exiting with the single word “TROUBLE”. The plan now is to try drilling back in the callstack to that point to isolate the call that is causing the problem.
Distractions
As an aside, I notice at this point that the network package that I am using is two versions out of date – 2.1.0.0 vs 2.2.0.1. Updating doesn’t appear to have any effect though, but it is probably better to keep up-to-date.
Another aside I noticed while looking at the code is that the tselect function appears to favour newer connections – they get added to the front of the list and then have their input dealt with first. This means that the first client to connect may just sit there, never getting its input read while the last client to connect is dealt with immediately. In order to see what effect this had, I changed the line
mainLoop servSock acceptChan $ (ch,h):clients
to
mainLoop servSock acceptChan $ clients ++ [(ch,h)]
which, while admittedly will be slower to perform might have a beneficial effect. So did it? Well, with fast, original, pre-pending version, the server seems to quit with the double error described above. However with the slow, new, append version, we get a single error. Unfortunately the new version is reluctant to actually return any response – the console gets a few “new client” messages, but not the full 1000.
Revelations coming thick and fast
More revelations came to me now – due to the alteration of the server to return a response immediately upon connection, a lot of the code was now redundant. To see just how redundant, I removed the bits that weren’t needed. This left me with the following code:
module Main where import Prelude hiding (catch) import Network (listenOn, accept, sClose, Socket, withSocketsDo, PortID(..)) import System.IO import System.Environment (getArgs) import Control.Exception (finally, catch) import Control.Concurrent main = withSocketsDo $ do [portStr] <- getArgs let port = fromIntegral (read portStr :: Int) servSock <- listenOn $ PortNumber port putStrLn $ "listening on: " ++ show port acceptLoop servSock `catch` (\e -> error $ "TROUBLE" ++ (show e)) `finally` sClose servSock acceptLoop :: Socket -> IO () acceptLoop servSock = do (cHandle, host, port) <- accept servSock hPutStrLn cHandle "HTTP/1.1 200 Ok\r\nContent Length: 2\r\n\r\nHi" hFlush cHandle hClose cHandle acceptLoop servSock
The result when attempting to test this were surprising. The server was now fine, from what I could tell, but ab would now give errors of the form
apr_socket_recv: Connection reset by peer (104)
after a number of successful requests. Investigation via Google led to a discussion on an Apache dev mailing list] which threw up the -r option. Using this allowed the test to complete but with a large number of errors.
If thing aren’t working, swap tools
Beginning to wonder if this will all ever end, I decide that ab may be at fault, and try out httperf] instead. Initial runs were reasonably successful, but were all giving connreset errors which looked odd. Unfortunately all my efforts to fix this by using HTTP/1.0 and/or Connection: close were to no avail and then httperf decided to become rather unuseful and consistently fail with an assertion failure.
Noone said this was going to be easy!
Inspiration strikes
And then there is an inspiration. Because of course, behind a lot of the madness there is the simple fact of human error. To be specific, the use of putStrLn rather than putStr. Oops. This was adding some extra characters to the response, which was no doubt confusing things somewhat. With this change, httperf works once more, but still announces connreset errors which I have now decided to ignore.
The good news is that the server itself now seems to be solid. Which it really should be, since it doesn’t do any real work any more. Sadly, the more complex version does not benefit so greatly and still experiences the illegal operation errors.
The home straight
But then there was more good news, accompanied, almost inevitably by more human error. In face the call I was using to httperf was invalid since I was trying to use the --num-calls option which I thought meant make a certain number of calls to the server. In fact it means the number of calls to issue on each connection – and requires that the server support persistent connections: which mine does not. Removing this option makes the httperf test pass with flying colours – hurrah! This valuable information was found via Theodore Bullock’s httperf QuickStart Guide.
While ab still complains with its apr_socket_recv it looks as though this might just be because the server is simply too fast. The documentation for ab does recommend that the client and server are not on the same machine. So for now, I am content that I have a solid platform to build upon and will now use `httperf` as my stress testing mechanism.
Conclusion
So what am I left with after all this investigation? Several lessons learnt and a very minimal but nonetheless robust Haskell HTTP server.
- When basing work on others’ work, test the original thoroughly first.
- When using command line tools, make sure you know what the options do.
- When implementing a protocol, be careful that you follow it.
- Writing down what you investigate helps organise your thoughts.
The code
> module Main where > import Prelude hiding (catch) > import Network (listenOn, accept, sClose, Socket, withSocketsDo, PortID(..)) > import System.IO > import System.Environment (getArgs) > import Control.Exception (finally, catch) > main = withSocketsDo $ do > [portStr] <- getArgs > let port = fromIntegral (read portStr :: Int) > servSock <- listenOn $ PortNumber port > putStrLn $ "listening on: " ++ show port > acceptLoop servSock > `catch` (\e -> error $ "===EXCEPTION: " ++ (show e) ++ "===") > `finally` sClose servSock > acceptLoop :: Socket -> IO () > acceptLoop servSock = do > (cHandle, host, port) <- accept servSock > hPutStr cHandle $ "HTTP/1.1 200 Ok\r\n" ++ > "Content-Length: 2\r\n" ++ > "\r\n" ++ > "Hi" > hClose cHandle > acceptLoop servSock
Related posts:
- Testing a chat server After my f
- XML parsing with Haskell Although I

{ 9 comments… read them below or add one }
Thanks for the “play-by-play” of your debugging exercise! It’s good to hear how others go about figuring out the issues in their code are, and how they come down to the simple solutions.
Good luck with your progress! I probably am going to be attempting to write some network code in Haskell soon myself.
Good to hear it is of use. I found that pouring my thoughts and rants into a text file as I was going along helped to ease some of the frustration!
The saga is not over yet though, as I have revisited the code again, with some more success, but also more confusion.
Stay tuned!
Nice. I tried the following:
import Prelude hiding (catch)
import Network (listenOn, accept, sClose, Socket, withSocketsDo, PortID(..))
import System.IO
import System.Environment (getArgs)
import Control.OldException (finally, catch)
import Text.Printf
import Control.Concurrent
import qualified Data.ByteString.Char8 as C
main = withSocketsDo $ do
[port] <- map read `fmap` getArgs
servSock error $ printf "===EXCEPTION: %s ===" (show e))
`finally` sClose servSock
acceptLoop :: Socket -> IO ()
acceptLoop servSock = do
(cHandle, _, _) <- accept servSock
C.hPutStr cHandle msg
hClose cHandle
acceptLoop servSock
msg = C.pack $ concat
["HTTP/1.1 200 Ok\r\n"
,"Content-Length: 13\r\n"
,"\r\n" ++ "Hello, world!" ]
When compiled with:
ghc -O2 -fvia-C -optc-O3 -funbox-strict-fields
And run with:
httperf --client=0/1 --server=localhost --port=5002 --uri=/ --send-buffer=4096 --recv-buffer=16384 --num-conns=10000 --num-calls=1
Maximum connect burst length: 1
Total: connections 10000 requests 10000 replies 10000 test-duration 1.351 s
Connection rate: 7403.3 conn/s (0.1 ms/conn, <=1 concurrent connections)
Connection time [ms]: min 0.1 avg 0.1 max 10.7 median 0.5 stddev 0.4
Connection time [ms]: connect 0.1
Connection length [replies/conn]: 1.000
7.4k conn/s seems pretty good for a first attempt. Nice work.
At olsner’s suggestion, adding -threaded tips it over 10k conn/s:
Request rate: 11427.3 req/s (0.1 ms/req)
Using: ghc -O2 -fvia-C -optc-O2 -funbox-strict-fields -threaded –make
Oh, nice… I keep meaning to get around to putting ByteStrings in my code and using more optimisation. For now though, it is baby steps and happiness if it doesn’t crash on me!
Thanks for the tips though – I will be sure to try them out as I try to take the server forwards. May the connection rate move onwards and especially upwards!
And switching to the new network-bytestring saves one layer of copying:
http://github.com/bos/network-bytestring/tree/master
Code from Bryan O’Sullivan:
http://hpaste.org/13719#a2
Gives me:
$ httperf –client=0/1 –server=localhost –port=5002 –uri=/ –send-buffer=4096 –recv-buffer=16384 –num-conns=10000 –num-calls=1 | grep ‘Request rate:’
Request rate: 14424.9 req/s (0.1 ms/req)
Too fast! I spend days trying to get it to work at all and then you breeze in and double its speed in the blink of an eye
network-bytestring does certainly seem to be doing the business in this case.
Please feel free to carry on optimising!
It probably doesn’t matter too much for this trivial example, but when you get the server doing “real work” you should probably spin off a handler thread using
forkIOafter you get the response fromaccept.Otherwise, long-running handlers will cause connections to pile up on your accept queue — if more than a handful arrive, you’ll get bounced connections.
@Gregory
Thanks for the tip – I shall add it to my ever-growing list of things to remember to add as the server evolves.