You've developed this gorgeous, well-factored, service-oriented architecture. And you've deployed it to production. You have tons of users hitting your system every day and throwing at it all of the unexpected things users do.
Now something goes wrong: exceptions start flowing from every system, all at once. How do you pin down the cause? How many people is it affecting? Is it localized to one component of your architecture?
As systems grow in complexity, these questions become harder to answer. I'll talk about some good techniques to help reduce the time it takes to find the answers when things are going wrong.