Bug 693 - SNAT is failing to masquerade some TCP RST packets
Status: RESOLVED INVALID
Product: netfilter/iptables
Classification: Unclassified
Component: NAT
Version: linux-2.6.x
Hardware: All All
Importance: P3 normal
Assigned To: netfilter buglog mailinglist
Depends on:
Blocks:
Reported: 2011-01-13 20:49 CET by Ludovico Cavedon
Modified: 2012-12-06 18:37 CET

Attachments
pcap dump on the vbox0 interface (i.e. before masquerading) (1.60 KB, application/octet-stream)
2011-01-13 21:00 CET, Ludovico Cavedon
A test case (1.01 KB, text/plain)
2011-12-05 01:26 CET, www

Description Ludovico Cavedon 2011-01-13 20:49:40 CET
We are running a Windows VM as a VirtualBox guest, hosted on an Ubuntu lucid machine (kernel 2.6.32-27-server x86_64).

The VirtualBox guest is configured in host-only mode (i.e. all of its traffic reaches the host kernel via the virtual interface vboxnet0).

The Ubuntu host is then configured to masquerade all outgoing traffic.
However, the source IP address of some packets is not rewritten, and those packets are sent out on eth0 with an unroutable source IP address.

We identified a precise pattern of packets that triggers such behavior, as described below.

Interface configuration:

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:25:90:0d:0c:fa brd ff:ff:ff:ff:ff:ff
    inet 192.35.222.101/24 brd 192.35.222.255 scope global eth0
5: vboxnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
    link/ether 0a:00:27:00:00:00 brd ff:ff:ff:ff:ff:ff
    inet 192.168.56.1/24 brd 192.168.56.255 scope global vboxnet0

Output of iptables-save:
*nat
:PREROUTING ACCEPT [7006363:375972010]
:POSTROUTING ACCEPT [129:36813]
:OUTPUT ACCEPT [2745927:167688047]
-A PREROUTING -d 128.130.56.3/32 -j DNAT --to-destination 128.111.48.45 
-A POSTROUTING -o eth0 -j SNAT --to-source 192.35.222.101 
COMMIT


This is the dump of a connection on vboxnet0:
11:13:37.855631 IP 192.168.56.124.2687 > 216.163.120.2.25: Flags [S], seq 3487144353, win 65535, options [mss 1460,nop,nop,sackOK], length 0
11:13:37.925142 IP 216.163.120.2.25 > 192.168.56.124.2687: Flags [S.], seq 2898804835, ack 3487144354, win 5840, options [mss 1460,nop,nop,sackOK], length 0
11:13:37.925297 IP 192.168.56.124.2687 > 216.163.120.2.25: Flags [.], ack 1, win 65535, length 0
11:13:38.004286 IP 216.163.120.2.25 > 192.168.56.124.2687: Flags [P.], seq 1:21, ack 1, win 5840, length 20
11:13:38.004597 IP 192.168.56.124.2687 > 216.163.120.2.25: Flags [P.], seq 1:24, ack 21, win 65515, length 23
11:13:38.074047 IP 216.163.120.2.25 > 192.168.56.124.2687: Flags [.], ack 24, win 5840, length 0
11:13:38.074075 IP 216.163.120.2.25 > 192.168.56.124.2687: Flags [P.], seq 21:41, ack 24, win 5840, length 20
11:13:38.075049 IP 192.168.56.124.2687 > 216.163.120.2.25: Flags [P.], seq 24:71, ack 41, win 65495, length 47
11:13:38.144756 IP 216.163.120.2.25 > 192.168.56.124.2687: Flags [P.], seq 41:61, ack 71, win 5840, length 20
11:13:38.145456 IP 192.168.56.124.2687 > 216.163.120.2.25: Flags [P.], seq 71:98, ack 61, win 65475, length 27
11:13:38.214822 IP 216.163.120.2.25 > 192.168.56.124.2687: Flags [P.], seq 61:117, ack 98, win 5840, length 56
11:13:38.215269 IP 192.168.56.124.2687 > 216.163.120.2.25: Flags [P.], seq 98:104, ack 117, win 65419, length 6
11:13:38.215311 IP 192.168.56.124.2687 > 216.163.120.2.25: Flags [F.], seq 104, ack 117, win 65419, length 0
11:13:38.284414 IP 216.163.120.2.25 > 192.168.56.124.2687: Flags [P.], seq 117:137, ack 104, win 5840, length 20
11:13:38.284442 IP 216.163.120.2.25 > 192.168.56.124.2687: Flags [F.], seq 137, ack 104, win 5840, length 0
11:13:38.284464 IP 216.163.120.2.25 > 192.168.56.124.2687: Flags [.], ack 105, win 5840, length 0
11:13:38.284692 IP 192.168.56.124.2687 > 216.163.120.2.25: Flags [R.], seq 105, ack 137, win 0, length 0
11:13:38.284731 IP 192.168.56.124.2687 > 216.163.120.2.25: Flags [R], seq 3487144457, win 0, length 0
11:13:38.284768 IP 192.168.56.124.2687 > 216.163.120.2.25: Flags [R], seq 3487144458, win 0, length 0


This is what is sent on eth0 (notice the last 2 packets):

11:13:37.855671 IP 192.35.222.101.2687 > 216.163.120.2.25: Flags [S], seq 3487144353, win 65535, options [mss 1460,nop,nop,sackOK], length 0
11:13:37.925126 IP 216.163.120.2.25 > 192.35.222.101.2687: Flags [S.], seq 2898804835, ack 3487144354, win 5840, options [mss 1460,nop,nop,sackOK], length 0
11:13:37.925316 IP 192.35.222.101.2687 > 216.163.120.2.25: Flags [.], ack 1, win 65535, length 0
11:13:38.004277 IP 216.163.120.2.25 > 192.35.222.101.2687: Flags [P.], seq 1:21, ack 1, win 5840, length 20
11:13:38.004618 IP 192.35.222.101.2687 > 216.163.120.2.25: Flags [P.], seq 1:24, ack 21, win 65515, length 23
11:13:38.074039 IP 216.163.120.2.25 > 192.35.222.101.2687: Flags [.], ack 24, win 5840, length 0
11:13:38.074070 IP 216.163.120.2.25 > 192.35.222.101.2687: Flags [P.], seq 21:41, ack 24, win 5840, length 20
11:13:38.075067 IP 192.35.222.101.2687 > 216.163.120.2.25: Flags [P.], seq 24:71, ack 41, win 65495, length 47
11:13:38.144748 IP 216.163.120.2.25 > 192.35.222.101.2687: Flags [P.], seq 41:61, ack 71, win 5840, length 20
11:13:38.145474 IP 192.35.222.101.2687 > 216.163.120.2.25: Flags [P.], seq 71:98, ack 61, win 65475, length 27
11:13:38.214815 IP 216.163.120.2.25 > 192.35.222.101.2687: Flags [P.], seq 61:117, ack 98, win 5840, length 56
11:13:38.215291 IP 192.35.222.101.2687 > 216.163.120.2.25: Flags [P.], seq 98:104, ack 117, win 65419, length 6
11:13:38.215329 IP 192.35.222.101.2687 > 216.163.120.2.25: Flags [F.], seq 104, ack 117, win 65419, length 0
11:13:38.284405 IP 216.163.120.2.25 > 192.35.222.101.2687: Flags [P.], seq 117:137, ack 104, win 5840, length 20
11:13:38.284438 IP 216.163.120.2.25 > 192.35.222.101.2687: Flags [F.], seq 137, ack 104, win 5840, length 0
11:13:38.284461 IP 216.163.120.2.25 > 192.35.222.101.2687: Flags [.], ack 105, win 5840, length 0
11:13:38.284715 IP 192.35.222.101.2687 > 216.163.120.2.25: Flags [R.], seq 105, ack 137, win 0, length 0
*** 11:13:38.284746 IP 192.168.56.124.2687 > 216.163.120.2.25: Flags [R], seq 3487144457, win 0, length 0
11:13:38.284783 IP 192.35.222.101.2687 > 216.163.120.2.25: Flags [R], seq 3487144458, win 0, length 0

The packet before the last one has the wrong (i.e. not rewritten) source IP address.
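
A quick way to spot this kind of leak in a capture is to filter for private source addresses. The sketch below is not part of the original report: `find_rfc1918_leaks` is a hypothetical helper name, and it assumes the one-line text format that `tcpdump -n` prints, as in the dumps above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original report).
# Reads "tcpdump -n" text output on stdin and prints only the lines whose
# SOURCE address lies in RFC 1918 space (10/8, 172.16/12, 192.168/16).
# On the external interface, every line printed is a packet that left
# without being source-NATed.
find_rfc1918_leaks() {
    grep -E ' IP (10\.|172\.(1[6-9]|2[0-9]|3[01])\.|192\.168\.)[0-9.]+ > '
}

# Example (needs root, so it is left commented out):
#   tcpdump -n -l -i eth0 tcp | find_rfc1918_leaks
```

The same check can also be done at capture time with a pcap filter such as `src net 192.168.56.0/24` on the external interface.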

Thanks in advance!
Comment 1 Ludovico Cavedon 2011-01-13 21:00:19 CET
Created attachment 342
pcap dump on the vbox0 interface (i.e. before masquerading)
Comment 2 Doug Smythies 2011-02-19 17:02:29 CET
(In reply to comment #1)
> Created an attachment (id=342)
> pcap dump on the vbox0 interface (i.e. before masquerading)

I had the same issue; however, it is not limited to TCP RST packets. I also saw it with the second side of a FIN sequence when it occurs after conntrack has forgotten about the connection because the CLOSE_WAIT period has expired.
In my case, I was able to work around the problem (I called it packet leakage) by adding one iptables rule at the start of the FORWARD chain:

$IPTABLES -A FORWARD -i $INTIF -p tcp -m state --state INVALID -j DROP

Without the above rule, I was also able to drastically reduce the number of occurrences of the problem by increasing the close-wait timeout period from 60 seconds to 3600 seconds (it is my understanding that the timeout used to be 3 days).

net.netfilter.nf_conntrack_tcp_timeout_close_wait = 3600
net.ipv4.netfilter.ip_conntrack_tcp_timeout_close_wait = 3600

However, even though there is a workaround, I still think it is a bug that packets go through NAT untranslated.

My complete web notes write up (which I am still improving) including wireshark screen captures and my entire iptables script can be found at:

http://www.smythies.com/~doug/network/iptables_notes/index.html


Comment 3 David Davidson 2011-08-10 01:26:47 CEST
I have the exact same problem.

Many thanks to Ludovico and Doug for discovering and posting about this issue. I am so grateful that somebody else noticed this and was able to comment about it and even be able to provide a workaround.

This is very weird and adverse behavior! I noticed this by accident yesterday afternoon. Because I have never had a problem, I never really scrutinized the masquerading interface for RFC1918 addresses. But I was testing something else, and while looking at this interface with a packet capture, I accidentally noticed that a public internet interface was occasionally sending packets sourced from internal private machines, using an RFC1918 address as the source address. These were FIN and RSTs, only, just like you say.

Unlike some of you, I use a front-end for IPtables that makes use of perl scripting (which in the end, scripts IPtables similarly in the way most of you are also scripting your policies).
So then I composed a huge bug report that I was going to submit to the developer of this front-end, and I probably spent 2-3 hours describing the bug and what I noticed was happening. I was just about to post it and I said to myself "well you better have one more re-check of the IPtables/netfilter bugs and make certain that this isn't a netfilter problem." Then I noticed this post. I am relieved to know that I wasn't the only one affected by this really obscure and strange problem.

Thank you Doug for the workaround! I have inserted a drop rule in the forward chain and I can say that I have NOT seen any packets traversing the internet interface [masqueraded interface] with RFC1918 addresses for TCP packets that are marked with the FIN or RST flags. The new workaround has survived 24 hours without a leak; again, my hat's off to you Doug for your work on this. I am so thankful I found this post and this workaround because I believe that NO RFC1918 addressed packets should ever be able to escape to the internet, and if I worked for an ISP, I would deprecate this kind of thing (the RFC1918 leakage).

You're absolutely right - it seems to affect only RSTs and FINs. I agree with Doug - this seems like a bug. Certainly adverse behavior too, because RFC1918 addresses should NEVER escape the internet interface, as they wouldn't be routed by anyone's ISP anyway. Leakage is the absolute best and most appropriate word for the behavior.

It also seems that this is a connection tracking related problem, so now I am wondering if this has anything to do with iptables/netfilter at all, since conntrack is usually implemented in the Linux kernel (sometimes as a loadable module, but still running in kernel space).

The other related bug that was mentioned (http://bugzilla.netfilter.org/show_bug.cgi?id=627) also appears to be something related to conntrack. That user downgraded the Linux kernel and the problem seemed to go away (or so I recall).

So what do you guys think? Is this something netfilter might consider fixing in iptables, or should patches be submitted to the Linux kernel source to fix the conntrack code?
Comment 4 Leonid Egorov 2011-09-19 00:37:45 CEST
Hi, I also have the same problem. At my workplace we have three ISPs, and we need to switch internet traffic between them in case of an outage. This is done with simple shell scripts that switch the default route and make some changes to the routing table. One ISP (the main provider) is reached via a PPPoE connection, another has a direct connection, and the last is reached via a remote gateway. When the PPPoE side fails, switching to another provider works, but after the PPPoE connection is restored, switching back does not. Internet traffic stops working because all packets from our internal network go out to the internet with internal source addresses (no SNAT translation is done). I can catch these packets with my FORWARD rule, but the POSTROUTING nat rule never sees them. I have to reboot the whole machine to restore proper operation.
# uname -a
Linux ubuntu-gw 2.6.38-11-server #48-Ubuntu SMP Fri Jul 29 19:20:32 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

I tried net.ipv4.conf.all.arp_ignore=1 (http://www.spinics.net/lists/netfilter/msg51016.html) and added a FORWARD rule for INVALID packets, but without success.
Comment 5 www 2011-12-05 01:26:07 CET
Created attachment 370
A test case

SNAT fails to masquerade some TCP CWR, TCP ECN, TCP URG, TCP ACK, and TCP PSH packets
Comment 6 Jozsef Kadlecsik 2011-12-05 12:39:24 CET
The NAT engine ignores any packet with state INVALID, because there's no reliable way to determine what kind of NAT should be performed. So the proper way to prevent the leakage of private address space is to drop INVALID packets.

It's not a well documented feature, unfortunately.

If the conntrack engine fails to properly identify a packet and thus assigns it to the INVALID state, that's a bug. But packets that arrive too late do not fall into that category.
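
The drop rule recommended here can be sketched as follows. This is a minimal illustration rather than an official netfilter recipe: `drop_invalid_forward` is a hypothetical helper, the `$IPTABLES` variable follows the convention used in the scripts quoted in this thread, and the argument is whichever interface faces the internal network.

```shell
#!/bin/sh
# Hypothetical helper (name and structure are assumptions for illustration).
# Inserts a rule at the top of FORWARD that drops TCP packets conntrack has
# classified as INVALID, so they are discarded before they can leave
# untranslated.
# $1: internal (LAN-facing) interface, e.g. vboxnet0.
IPTABLES=${IPTABLES:-iptables}

drop_invalid_forward() {
    # -I FORWARD 1 puts the rule first, ahead of any ACCEPT rules.
    # "-m conntrack --ctstate INVALID" is the newer spelling of the same match.
    $IPTABLES -I FORWARD 1 -i "$1" -p tcp -m state --state INVALID -j DROP
}

# Dry run: print the command instead of executing it.
#   IPTABLES="echo iptables"; drop_invalid_forward vboxnet0
```

Setting `IPTABLES` to an `echo` prefix, as in the commented dry run, makes it possible to review the generated rule before applying it as root.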
Comment 7 Doug Smythies 2011-12-05 16:59:00 CET
(In reply to comment #6)
Thank you for your reply, Jozsef. Since my posting of February 19th, I have continued to study iptables, NAT, conntrack and such. Yes, I had come to realize that the NAT stage simply could not know what to do with an INVALID packet. I also agree that this "feature" is not well documented or well known.
 
Comment 8 Doug Smythies 2011-12-05 21:08:44 CET
(In reply to comment #5)
www@applejelly.org: If I understand your example correctly, you are trying to make new TCP sessions in violation of the protocol. That scenario is, in my opinion, well documented (or at least better documented). The related segment of my iptables script follows:

# A NEW TCP connection requires SYN bit set and FIN,RST,ACK reset.
# Un-NAT'ed packets go out to internet without this rule.
# Sending RFC1918 packets to internet is considered poor form, by me anyhow.
$IPTABLES -A INPUT -p tcp ! --syn -m state --state NEW -j LOG --log-prefix "NEW TCP no SYN:" --log-level info
$IPTABLES -A INPUT -p tcp ! --syn -m state --state NEW -j DROP
Comment 9 www 2011-12-06 19:30:11 CET
(In reply to comment #8)
> (In reply to comment #5)
> www@applejelly.org: If I understand your example correctly, you are trying to
> make new TCP sessions in violation of the protocol. That scenario is, in my
> opinion, well documented (or at least better documented). Following is the
> related segment of my iptables script:
> 
> # A NEW TCP connection requires SYN bit set and FIN,RST,ACK reset.
> # Un-NAT'ed packets go out to internet without this rule.
> # Sending RFC1918 packets to internet is considered poor form, by me anyhow.
> $IPTABLES -A INPUT -p tcp ! --syn -m state --state NEW -j LOG --log-prefix "NEW
> TCP no SYN:" --log-level info
> $IPTABLES -A INPUT -p tcp ! --syn -m state --state NEW -j DROP
> 

That would be what I was doing while I was investigating why I saw an internal IP on PPP0 as in the initial report. Sorry for wasting time. Summary: me too
Thanks for comment #6 and informing me.
Comment 10 Myroslav Opyr 2012-07-04 16:10:55 CEST
We're experiencing this bug in Fedora 16 with kernel-3.2.9-2.fc16.x86_64 and kernel-3.3.4-3.fc16.x86_64. Adding the following rule helped get rid of packets with an "internal" IP on the "external" interface:

$IPTABLES -A FORWARD -i $INTIF -p tcp -m state --state INVALID -j DROP


Additional information for anybody hit by this issue (so that this comment can be found via Google) follows:

We were running Nagios' check_http with --no-body (don't wait for the document body: close the socket after receiving the headers). Closing the socket caused a TCP RST packet to be sent in response to each HTTP response body packet that arrived on the closed socket. NAT of these RST packets didn't work due to this bug. Our server was effectively disabled by our datacenter provider (Hetzner) because of the unroutable packets it emitted.

This bug was not present in kernel-2.6.21.7-5.fc8xen from Fedora 8 (that we'd routed through for the test).
Comment 11 Jozsef Kadlecsik 2012-12-06 18:37:43 CET
I am closing the bug as invalid because this is how the system works: in a NATed environment, INVALID packets must be dropped.