Putting I.T. All Together
I recently had a problem with my mail servers where messages received by my MX were bounced, rather than deferred, when the internal mail server the messages were destined for went offline.
This turned out to be because of my DNS configuration. The cause of the problem surprised me, because it was so decoupled from the mail setup.
My mail setup consists of three sites. My primary site is a colocation facility, where my primary MX (farmx.thogan.com) and my main mail server (mail1.thogan.com) reside. The second site is a collection of virtual servers in the Rackspace Cloud. My backup MX resides here (cloudmx.thogan.com). The third site is my home, where my backup mail server resides (mail2.near.lan).
The three sites all have their own private networks in addition to their public addresses. These private networks are all joined via VPN. For the purposes of this problem, the type of link doesn't really matter. In my case it is a VPN, but this applies to any multi-site mail configuration.
So, back to my current situation. My primary site is offline while I move the server to a new colocation facility. Mail is being delivered through my backup MX at the cloud site, and from there is being relayed to my backup mail server at my home site. Two days after taking my primary site offline for the move, my router at home died due to hardware failure. Now the only operating site was the cloud site, with my backup MX. I noticed the failure of the home site while I was at work, and expected that it would not be a problem. Mail should have just queued up at the backup MX until I fixed the router at home, re-established the VPN connection, and flushed the mail queue.
Once I fixed the router, I logged into the backup MX and ran `postqueue -p`. Surprise! Nothing. I examined the mail logs and saw that mail was not being deferred, but rather was bounced all day long! But, that's not how it was supposed to work!
Turns out the problem was DNS. Each site had a DNS server that would resolve names for the internal domain names (near.lan, far.lan, cloud.lan). When I configured the near.lan zone on the cloud site's DNS server, I set it up to forward the queries to the home site's DNS server. It was not a slave for that domain, just a forwarder. And there was my mistake.
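To make the mistake concrete, here is a rough sketch of what a forwarded zone like the one described above can look like in the cloud site's `named.conf`. This is not my exact configuration: the forwarder address `10.0.1.1` is a made-up placeholder for the home site's DNS server.

```
// Hypothetical sketch of a forwarded zone on the cloud site's DNS server.
// 10.0.1.1 is a placeholder for the home site's DNS server, not a real address.
zone "near.lan" {
    type forward;
    // No local copy of the zone is kept; every query is passed along
    // to the forwarder over the VPN.
    forwarders { 10.0.1.1; };
    // With BIND's default "forward first" behaviour, the server falls back
    // to ordinary resolution if the forwarder does not answer.
};
```

One plausible explanation for the NXDOMAIN answers is that fallback: a private name such as near.lan does not exist in public DNS, so once the forwarder is unreachable, ordinary resolution comes back with "no such domain".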
When mail came into the backup MX at the cloud site, Postfix would attempt to deliver it to mail2.near.lan. When Postfix went to look up mail2.near.lan in DNS, BIND would attempt to query the home site's DNS server for the name. When the VPN link went down, the cloud site's DNS server responded to the queries for mail2.near.lan with NXDOMAIN. Postfix treats a host name that does not exist as a permanent error, so it bounced the message instead of deferring it, as it would have if it had resolved the name and then received a connection timeout trying to contact the home site mail server.
The moral of the story is, in a multi-site setup, internal domains should be slave zones on the remote DNS servers, NOT forwarded zones! If I had set up near.lan as a slave zone on the cloud site's DNS server, then Postfix would have been able to resolve the name, would have failed to connect to that address, and would have deferred the message.
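As a minimal sketch of the fix, a slave zone for near.lan on the cloud site's DNS server could look something like the following. The master address `10.0.1.1` and the file path are assumptions for illustration only; substitute the home site's real DNS server address and whatever path your distribution uses for slave zone files.

```
// Hypothetical sketch of near.lan as a slave zone on the cloud site's DNS server.
// 10.0.1.1 and the file path are placeholders, not values from my setup.
zone "near.lan" {
    type slave;
    // The home site's DNS server, which holds the master copy of the zone.
    masters { 10.0.1.1; };
    // Local copy of the zone data, kept on disk between zone transfers.
    file "slaves/near.lan.db";
};
```

The home site's DNS server also has to allow zone transfers to the cloud site's server (for example with an `allow-transfer` statement in its own zone definition). Because the slave keeps a local copy, it can keep answering for mail2.near.lan while the VPN is down, at least until the expire timer in the zone's SOA record runs out.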