The Net.MSMQ Binding in Windows Communication Foundation has some built-in support
for detecting and handling "Poison Messages".
One of the most useful ways of handling Poison Messages in queuing systems is to get
them out of the way so that subsequent messages can still be processed (if you're not
doing in-order processing), while keeping the messages around so that they are not lost.
Unfortunately, WCF only supports this behavior on MSMQ 4.0, through a poison
sub-queue, which is new in the Windows Vista and Windows Server 2008 versions of MSMQ.
It's really a shame because, honestly, this could've been implemented on MSMQ 3.0 just
by using a separate queue instead of sub-queues.
Instead, we're stuck with the PoisonErrorHandler sample included in the SDK. The
sample mimics this behavior by introducing an IErrorHandler implementation that
detects the poison message and moves it to another queue. For this to work, you need
to set the binding properties so that the ServiceHost listening to the queue faults
when the poison message is detected. The error handler then moves the message to a
new queue and starts a new service host instance to resume processing of subsequent
messages.
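As a rough sketch of the binding setup just described, assuming the binding is built in code rather than in app.config: the class and method names are made up for illustration, and the retry count is an arbitrary value, not the one the SDK sample uses.

```csharp
using System.ServiceModel;

static class QueueBindingSketch
{
    // Builds a NetMsmqBinding whose listener faults once a message is
    // declared poison, so that an IErrorHandler gets a chance to react.
    public static NetMsmqBinding CreateFaultingBinding()
    {
        NetMsmqBinding binding = new NetMsmqBinding();
        // Set explicitly so the intent is visible to whoever reads the code.
        binding.ReceiveErrorHandling = ReceiveErrorHandling.Fault;
        // Arbitrary illustrative value for the receive retry count.
        binding.ReceiveRetryCount = 2;
        return binding;
    }
}
```

The same two settings can be expressed on the binding element in the configuration file if you prefer to keep them out of code.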
The sample itself isn't all that bad, but it does expose a rather significant gap in
the way the Net.MSMQ Binding reports poison messages: all it gives the error handler
is the MSMQ LookupId of the poison message. Unfortunately, this LookupId is
queue-specific, and the exception does not say which queue, or at least which
endpoint, received the poison message. That is a gaping hole in this feature.
Because of that, the LookupId is useless without more information provided out-of-band
[1]. The sample works around this limitation by storing, in the <appSettings/>
section of the app.config file, the path to the queue where messages are being
received and the path to the queue to move poison messages to.
This works OK until you need a single service host listening on multiple endpoints.
At that point the sample no longer works, because you don't know which of the
endpoints caused the error.
Working around the limitation
I spent some time recently playing around with WCF and ran into this same problem,
so I tried to find a way of working around this limitation and still use the basic
functionality the sample provided.
I found one possible workaround, which seems to work so far in my limited tests. I'm
not sure yet how well it will hold up, as it seems to me there's always the possibility
of race conditions here, but let me illustrate at least the basic mechanics.
First we keep the basic code of the PoisonErrorHandler class. However, instead of
using appSettings to keep track of the queue we're listening to, we just try to find
out dynamically which endpoint configured for the service is causing the error, and
extract the path to the queue from there based on the URI of the endpoint address.
To do this, in my implementation of IErrorHandler.HandleError() I grab the current
OperationContext, reach out to the ServiceHost associated with it, and then iterate
over the ChannelDispatchers attached to it. The one that's in the Faulted state is
very likely the one that caused the MsmqPoisonMessageException in the first place.
Code could be something like this:
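// Requires: using System; using System.ServiceModel;
//           using System.ServiceModel.Dispatcher; using System.Text;
private String FindFaultedQueue()
{
    ServiceHostBase host = OperationContext.Current.Host;
    Uri faultedQueue = null;
    // Remember the listener URI of the dispatcher that is in the Faulted state.
    foreach ( ChannelDispatcher chd in host.ChannelDispatchers ) {
        if ( chd.State == CommunicationState.Faulted ) {
            faultedQueue = chd.Listener.Uri;
        }
    }
    if ( faultedQueue == null ) {
        throw new InvalidOperationException("No faulted channel dispatcher found.");
    }
    // Turn net.msmq://host/private/queue into FormatName:DIRECT=OS:host\private$\queue
    StringBuilder fn = new StringBuilder("FormatName:DIRECT=OS:");
    fn.Append(faultedQueue.Host).Append("\\");
    string path = faultedQueue.PathAndQuery;
    if ( path.StartsWith("/private/") ) {
        fn.Append("private$\\");
        path = path.Substring("/private".Length);
    }
    path = path.Substring(1); // drop the leading '/'
    fn.Append(path);
    return fn.ToString();
}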
This lacks some error handling and a few other things, but it illustrates the point.
Now that we have the FormatName of the queue where the poison message came from, we
can try to move it to the poison message queue. We could handle this by convention:
for queue Q1, the corresponding poison message queue could be Q1Poison. The rest of
the sample is pretty much the same.
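A minimal sketch of that move, under a few assumptions of mine: the "Poison" suffix convention mentioned above, a local transactional source queue, and a poison queue that already exists. The class and method names are hypothetical and this is not the SDK sample's code. The lookupId argument would be the value of MsmqPoisonMessageException.MessageLookupId.

```csharp
using System;
using System.Messaging;
using System.Transactions;

static class PoisonQueueMover
{
    // Convention assumed here: queue "Q1" pairs with poison queue "Q1Poison".
    public static string PoisonQueueFor(string sourceFormatName)
    {
        if (string.IsNullOrEmpty(sourceFormatName))
            throw new ArgumentException("A source queue format name is required.", "sourceFormatName");
        return sourceFormatName + "Poison";
    }

    // Moves the message with the given LookupId from the source queue to its
    // poison queue. Assumes both queues are transactional, the source queue is
    // local, and the poison queue has already been created.
    public static void Move(string sourceFormatName, long lookupId)
    {
        using (MessageQueue source = new MessageQueue(sourceFormatName))
        using (MessageQueue poison = new MessageQueue(PoisonQueueFor(sourceFormatName)))
        using (TransactionScope scope = new TransactionScope(TransactionScopeOption.RequiresNew))
        {
            // Receive and send enlist in the ambient transaction, so the
            // message either ends up in the poison queue or stays where it was.
            Message message = source.ReceiveByLookupId(
                MessageLookupAction.Current, lookupId, MessageQueueTransactionType.Automatic);
            poison.Send(message, MessageQueueTransactionType.Automatic);
            scope.Complete();
        }
    }
}
```

From the error handler, the call would then look something like PoisonQueueMover.Move(FindFaultedQueue(), poisonException.MessageLookupId), where poisonException is the MsmqPoisonMessageException that was handed to HandleError().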
I probably wouldn't use this in production, as it doesn't feel very reliable
given the basic limitations imposed by the WCF implementation, but it was interesting
to take a stab at. YMMV.
[1] I'll leave it to you, dear reader, to wonder about the sense of
providing a sample that so clearly points out such a giant hole in a product feature
rather than actually fixing it. Oh wait, I guess they did fix it by changing
MSMQ for Windows Vista and Windows Server 2008.