This section should describe everything from a user's perspective, ie. how to use the tools provided to create the network special effects you are after.
This section generally won't cover the merits of various approaches, or why you would want to do these things. That is the domain of larger documents (such as a book on networking). Sure, there are places where this document will tell you how to do terrible, terrible things which you shouldn't ever do. Ever.
On the other hand, if I make it easier for you to do them, then all those Linux consultants get paid the really big bucks to clean up after you. (Insert evil laugh here).
iptables is the name given to the latest packet filtering system in the ipfwadm and ipchains line of Linux packet filtering methods. Like those two, it has both a kernel component and a userspace component. Unlike the others, it is usually built as a kernel module, and was designed to run on top of the netfilter framework.
The first thing you'll notice about iptables is the similarity to ipchains. This is partially to ease transition, and partially because I wrote both of them. The main benefit of iptables (for user, programmer and feature-lover) is extensibility.
So let's start with a quick guide to the differences between iptables and ipchains from a user's point of view, then go into a blow-by-blow description of the features.
[* Replaced by individual modules which provide this facility]
iptables may look a little like ipchains, but under the skin, iptables shows itself to be a simple framework for simple IP filtering, with the possibility of adding in new and funky features.
As an example, ipchains understands TCP, UDP and ICMP, so it can filter them. iptables does not; it loads up extra modules to help with these when specified.
There are many niche needs which ipchains didn't meet: this is my way of having my cake and eating it, too. It also means that distributions shipping pre-built kernels don't need to agonize over the growing kb wasted on packet filtering: it's modular.
Also, this allowed me to get the core code smaller than the old ipfwadm kernel code (unlike, *cough*, the ipchains kernel code), and I don't have to be embarrassed next time I see Jos Vos.
The kernel part of iptables is a module, called "iptables": you simply insmod it. It takes one optional argument: "forward=n", where n is 0 or 1, which sets the default policy for forwarding: DROP or ACCEPT. If it's not specified, the default is DROP.
Sometimes, when a rule asks for a specific (non-builtin) target or test, such as the "REJECT" target, or "tcp" test, it will require additional modules. If placed in the right directory (usually /lib/modules/`uname -r`/net/), they should auto-load. Otherwise these modules (with names like "ipt_REJECT.o" and "ipt_tcp.o") can be manually insmod'd after the core iptables module.
Once this has been done, the iptables program can be run. Using `iptables --help' will give reasonable usage information.
The `-L' or `--list' argument allows you to list the chains. If it is followed by a chain name, only that chain is listed (if it exists).
The `-v' or `--verbose' argument causes all information to be printed. The `-n' or `--numeric' option causes information to be printed numerically; useful for suppressing name lookup attempts. Finally, the `-x' or `--exact' option, when used with the -v option, causes the exact packet and byte counters (not the `324k'-style abbreviations) to be printed.
The `-v' or `--verbose' argument can also be used with any other command, to see exactly what is happening.
The `-A' or `--append' argument, followed by a chain name, is used to append a rule to the end of a chain. Similarly, `-I' or `--insert' inserts a rule at the beginning of a chain (or, followed by a rule number, inserts it at a given position in the chain, starting with the head at number 1).
The `-R' or `--replace' argument (followed by a chain name and a number) is used for atomically replacing one rule with another.
A rule can be deleted (`-D' or `--delete' followed by the chain name) with a similar syntax to appending; the first rule in the chain which matches the specification will be deleted. The other method is to delete by number, where the rule number simply follows the chain name.
New (user-defined) chains can be created using `-N' or `--new', followed by a chain name. An empty chain of that name is created if one does not already exist.
Empty user-defined chains can be deleted, provided no rule has them as a target, using the `-X' or `--delete-chain' argument followed by the victim chain name.
Many fields in the standard IP header can be filtered on:
Source address: Followed by an optional `!' (meaning not), then an IP address or name. If an IP address, it can be followed by a mask, such as `/8' or `/255.0.0.0', both of which mean that only the first 8 bits of the address should be compared.
Destination address: Used identically to the source address specification.
Input interface: Followed by an optional `!', and an interface name: the name of the input interface to match. A packet passing through the OUTPUT chain has an input interface of "". If the interface name ends in a `+', then any interface name starting with those characters will be matched, e.g. `ppp+' will match `ppp0' and `ppp10'.
Output interface: Used identically to the input interface. A packet passing through the INPUT chain has an output interface of "".
Type of Service: Followed by an optional `!' and a number (usually a hexadecimal number prefixed with 0x), this matches only IP packets with the given Type of Service field.
Fragments: Only match non-first fragments, i.e. IP packets with a non-zero offset field. This argument can be preceded by a `!' to indicate that it is to be inverted (i.e. match IP packets with a zero offset field only).
Protocol: Followed by an optional `!' and a protocol name or number, this matches only IP packets of the given protocol. It also has a side effect of loading the per-protocol module, if any, as we will see below.
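The address-mask and interface-wildcard rules above are easy to get subtly wrong, so here is a minimal Python sketch of their semantics. It is only a model for checking your understanding, not the iptables implementation; the function names `address_matches' and `interface_matches' are made up for this illustration, and the `!' inversion is left out.

```python
import ipaddress

def address_matches(spec: str, packet_addr: str) -> bool:
    """Model of an address test: spec is 'a.b.c.d' or 'a.b.c.d/mask'.

    Both '/8' and '/255.0.0.0' mean: compare only the first 8 bits.
    """
    # strict=False tolerates host bits in the spec, e.g. '10.1.2.3/8'.
    network = ipaddress.ip_network(spec, strict=False)
    return ipaddress.ip_address(packet_addr) in network

def interface_matches(spec: str, ifname: str) -> bool:
    """Model of an interface test: a trailing '+' is a prefix wildcard."""
    if spec.endswith("+"):
        return ifname.startswith(spec[:-1])
    return spec == ifname
```

With this model, `address_matches("10.0.0.0/8", "10.200.3.4")' and `address_matches("10.0.0.0/255.0.0.0", "10.200.3.4")' agree, and `interface_matches("ppp+", "ppp10")' is true while `interface_matches("ppp+", "eth0")' is not.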
In addition to the standard fields above, extended options are available in two cases. Firstly, the `-m' or `--match' argument can be used to load up a match module, which may provide extra options. Secondly, if no `--match' is specified but a `--protocol' argument is, and that protocol provides a match module, then that module may provide extra options. If you want the help message to include help on a particular set of extended packet matching options, just use the `--protocol' or `--match' before the `--help' argument.
These match modules have two parts: the kernel part is described above, and the userspace part lives in a shared library. By default, iptables tries to load the library from the `/usr/local/lib/iptables/' directory.
Man, I don't get paid enough to write documentation.
Anyway, if you get an "Unknown arg `--syn'" error, or similar, it could be that you didn't specify the match or protocol argument first.
There are four extended packet matching modules included in the base package at the moment. These are:
This module is automatically loaded if `--protocol tcp' is specified, and no other match is specified. It provides the following options:
`--tcp-flags' is followed by an optional `!', then two strings of flags, and allows you to filter on specific TCP flags. The first string of flags is the mask: a list of flags you want to examine. The second string of flags tells which one(s) should be set. For example,
# iptables -A INPUT --protocol tcp --tcp-flags ALL SYN,ACK -j DROP
This indicates that all flags should be examined (`ALL' is synonymous with `SYN,ACK,FIN,RST,URG,PSH'), but only SYN and ACK should be set. There is also an argument `NONE' meaning no flags.
`--syn', optionally preceded by a `!', is shorthand for `--tcp-flags SYN,RST,ACK SYN'.
`--source-port' is followed by an optional `!', then either a single TCP port, or a range of ports. Ports can be port names, as listed in /etc/services, or numeric. Ranges are either two port names separated by a `:', or (to specify greater than or equal to a given port) a port with a `:' appended, or (to specify less than or equal to a given port), a port preceded by a `:'.
`--sport' is synonymous with `--source-port'.
`--destination-port' and `--dport' are the same as above, only they specify the destination, rather than source, port to match.
followed by an optional `!' and a number, matches a packet with a TCP option equalling that number.
This module is automatically loaded if `--protocol udp' is specified, and no other match is specified. It provides the options `--source-port', `--sport', `--destination-port' and `--dport' as detailed for TCP above.
This module is automatically loaded if `--protocol icmp' is specified, and no other match is specified. It provides only one option:
followed by an optional `!', then either an icmp type name (eg `host-unreachable'), or a numeric type (eg. `3'), or a numeric type and code separated by a `/' (eg. `3/3'). A list of available icmp type names is given using `-p icmp --help'.
This module must be explicitly specified with `-m mac' or `--match mac'. It is used for matching an incoming packet's source ethernet (MAC) address, and is thus only useful for packets traversing the INPUT and FORWARD chains. It provides only one option:
`--mac-source' is followed by an optional `!', then an ethernet address in colon-separated hexbyte notation, e.g. `--mac-source 00:60:08:91:CC:B7'.
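Going back to the TCP options described above: the mask-then-comparison logic of `--tcp-flags', and the three shapes of port range, can be modelled in a few lines of Python. This is a sketch of the semantics as the text describes them, not code from iptables; the function names are invented for the illustration, port names from /etc/services are not handled, and the open ends of a range are assumed to run to 0 and 65535.

```python
ALL_FLAGS = {"SYN", "ACK", "FIN", "RST", "URG", "PSH"}

def parse_flags(text):
    """Turn 'SYN,ACK', 'ALL' or 'NONE' into a set of flag names."""
    if text == "ALL":
        return set(ALL_FLAGS)
    if text == "NONE":
        return set()
    return set(text.split(","))

def tcp_flags_match(mask, comp, packet_flags):
    """True when, of the flags listed in mask, exactly those in comp are set."""
    examined = parse_flags(mask)
    return (set(packet_flags) & examined) == parse_flags(comp)

def syn_match(packet_flags):
    """The `--syn' shorthand: `--tcp-flags SYN,RST,ACK SYN'."""
    return tcp_flags_match("SYN,RST,ACK", "SYN", packet_flags)

def parse_port_range(text):
    """Numeric ports only: '80', '1024:2000', '1024:' or ':1023'."""
    if ":" not in text:
        port = int(text)
        return (port, port)
    low, high = text.split(":")
    # Assumed bounds for the open-ended forms.
    return (int(low) if low else 0, int(high) if high else 65535)
```

Note that `syn_match({"SYN", "PSH"})' is true in this model: PSH is not in the mask, so it is never examined.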
Once you've specified what packets to match, you have to specify what to do with the matched packets. This is done using the `-j' option. If the end of a user-defined chain is reached, then the packet traversal resumes at the chain which called it. If the end of a built-in chain is reached, then the chain's policy is consulted: this is an unconditional rule at the end of the chain which says what to do in this case (often, DROP the packet).
The standard targets are:
ACCEPT: Pass the packet on.
DROP: Drop the packet; eat it.
RETURN: Act as if this rule was the last in its chain; if it's a user-defined chain, this returns to the calling chain. If it's a built-in chain, the chain's policy is consulted.
QUEUE: Queue the packet for userspace handling. If there is no program waiting to handle the packet, or there are too many packets queued, this has the same effect as DROP.
If the option after `-j' is the name of a user-defined chain, then any packet which matches the rule will begin traversing that chain.
If no `-j' option is specified, then the next rule in that chain will be consulted. As each rule has a packet and a byte counter, this is useful for counting types of packets.
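The traversal rules above (jumping into a user-defined chain, falling off its end, RETURN, the built-in chain's policy, and count-only rules with no `-j') can be sketched as a small Python model. This is an illustration of the behaviour described in the text, not the kernel's algorithm; the dictionary layout and the names `traverse' and `run_chain' are invented here, QUEUE is left out, and only a packet counter is kept.

```python
def run_chain(chains, chain_name, packet):
    """Walk one chain; return 'ACCEPT'/'DROP', or None if the end is reached."""
    for rule in chains[chain_name]:
        if not rule["match"](packet):
            continue
        rule["packets"] = rule.get("packets", 0) + 1  # per-rule counter
        target = rule.get("target")
        if target is None:
            continue              # no -j: just count, consult the next rule
        if target == "RETURN":
            return None           # act as if this were the last rule
        if target in ("ACCEPT", "DROP"):
            return target
        # Anything else is taken to be the name of a user-defined chain.
        verdict = run_chain(chains, target, packet)
        if verdict is not None:
            return verdict
        # Fell off the end of the user-defined chain: resume in this one.
    return None

def traverse(chains, policies, builtin_chain, packet):
    """Run a built-in chain, consulting its policy if no rule decides."""
    verdict = run_chain(chains, builtin_chain, packet)
    return verdict if verdict is not None else policies[builtin_chain]
```

A rule here is a dictionary with a `match' function and an optional `target'; a rule with no target never ends the traversal, which is what makes it useful for counting.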
New target modules can be written for iptables, which add options in a similar manner to the way new packet-matching modules add options.
Like packet-matching modules, target modules have two parts: the kernel part is described above, and the userspace part lives in a shared library. By default, iptables tries to load the library from the `/usr/local/lib/iptables/' directory, same as extended match modules.
There are three extended target modules included in the default distribution. These are:
This module provides kernel logging of matching packets. It provides these additional options:
Followed by a level number or name. Valid names are (case-insensitive) `debug', `info', `notice', `warning', `err', `crit', `alert' and `emerg', corresponding to numbers 7 through 0. See the man page for syslog.conf for an explanation of these levels.
This option specifies that the messages should be limited; bursts are allowed, but the average rate can never exceed one message every 5 seconds. This avoids severe log-flooding or overloading.
Followed by a string of up to 14 characters, this message is sent at the start of the log message, to allow it to be uniquely identified.
REJECT: This module has the same effect as `DROP', except that the sender is sent an ICMP `port unreachable' error message. Note that the ICMP error message is not sent if (see RFC 1122):
This module provides mark facility: setting the `nfmark' field in an skbuff. It provides two options:
Followed by a number: the value to set nfmark to. Note that this value starts at 0 for fresh packets.
Followed by `FOR_ROUTING' or `FOR_CLS_FW', which indicates to the routing code or firewall classifier that the mark value was set for it. If the reason was set to something else previously (e.g. by the conntrack code), the mark will NOT be altered. The mark takes effect only if no one has set it yet, or if the existing reason matches the reason you are giving.
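Returning to the logging module's rate limit described above ("bursts are allowed, but the average rate can never exceed one message every 5 seconds"): one common way to get that behaviour is a token bucket. The Python sketch below is an assumption about how such a limit can work, not a description of the logging module's code; the class name `LogLimiter' and the burst size of 3 are hypothetical, and only the 5-second interval comes from the text.

```python
class LogLimiter:
    """Token bucket: one token is earned per `interval' seconds, up to `burst'."""

    def __init__(self, interval=5.0, burst=3):
        self.interval = interval    # seconds per message on average (from the text)
        self.burst = burst          # hypothetical: how many messages may go back to back
        self.tokens = float(burst)  # start with a full bucket
        self.last = None            # time of the previous call

    def allow(self, now):
        """Return True if a message arriving at time `now' may be logged."""
        if self.last is not None:
            earned = (now - self.last) / self.interval
            self.tokens = min(self.burst, self.tokens + earned)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Passing the time in explicitly keeps the sketch deterministic: four messages at time 0 give three logged and one suppressed, and the next message is only let through once 5 seconds' worth of credit has built up.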
Forced obsolescence is a Bad Thing. You should be backwards compatible; you shouldn't optimize for it, but it should be painless. In other words, it may be slow, but it must be simple.
Hence there is a compatibility layer in the compat/ directory, which contains two modules: `ipchains.o' and `ipfwadm.o'. These modules are incompatible with almost ALL the other netfilter modules, and hence are not installed in the module directory by `make install'.
The effect is that of a 2.2/2.0 kernel compiled with the following:
CONFIG_FIREWALL=y
CONFIG_IP_FIREWALL=y
CONFIG_IP_ALWAYS_DEFRAG=y
CONFIG_IP_TRANSPARENT_PROXY=y
CONFIG_IP_MASQUERADE=y
Once you've done an insmod on these modules, you should be able to use ipfwadm or ipchains as normal. There are, however, the following caveats:
There is a new Network Address Translation system which works on top of the netfilter framework. Like iptables, it has a kernel part and a userspace part, and it is extensible to cover new protocols, and other weird cases.
Network Address Translation is funky (that's a technical term). The idea is to mangle packets on the fly as they pass through one way, and hope you can recognize the reply packets passing through the other way so you can unmangle them.
Note that this requires that both the original and reply packets pass through the Network Address Translation box (this is important to realize if you're getting really tricky).
On one level, this is simple. For example, if you have a network with IP addresses 1.2.0.0/16 behind your Linux box, and you want them to have addresses 1.3.0.0/16 instead, you can get the Linux box to alter all the source IP addresses on the way out of your network from 1.2 to 1.3, and the destination IP addresses on the way into your network from 1.3 to 1.2.
That is called static NAT, or (as implemented by Alexey Kuznetsov in Linux) "Fast NAT". It's actually quite easy to do, and is controlled by the routing code.
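To make the static mapping concrete, here is a small Python sketch of the 1.2.0.0/16 to 1.3.0.0/16 example: the network prefix is swapped and the host bits are kept. It models the address arithmetic only and has nothing to do with how the routing code implements "Fast NAT"; the function names are made up for the illustration.

```python
import ipaddress

INSIDE = ipaddress.ip_network("1.2.0.0/16")   # addresses behind the Linux box
OUTSIDE = ipaddress.ip_network("1.3.0.0/16")  # addresses the world should see

def remap(addr, old_net, new_net):
    """Swap the network prefix of addr from old_net to new_net, keeping host bits."""
    ip = ipaddress.ip_address(addr)
    if ip not in old_net:
        return addr  # not ours: leave untouched
    host_bits = int(ip) & int(old_net.hostmask)
    return str(ipaddress.ip_address(int(new_net.network_address) | host_bits))

def translate_outbound(src, dst):
    """On the way out of the network, rewrite the source address."""
    return remap(src, INSIDE, OUTSIDE), dst

def translate_inbound(src, dst):
    """On the way in, rewrite the destination address back."""
    return src, remap(dst, OUTSIDE, INSIDE)
```

Because the mapping is one-to-one, no per-connection state is needed: `translate_inbound' is simply the inverse of `translate_outbound'.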
Unfortunately, life isn't always that easy. Sometimes you want to map a range of addresses onto a smaller range. The most frequent use of NAT in the world at the moment is Linux 2.0 and 2.2's "masquerading" feature, in which an entire network is mapped onto a single IP address (the IP address of the masquerading box's external interface), which is also used by the masquerading box itself!
On top of that, some standard protocols (ftp) don't like (ftp) being masqueraded (ftp), but I won't (ftp) mention any (ftp) names just yet. Proprietary protocols are even worse in this regard.
Another use of NAT is what I call RNAT (Reverse NAT), or load-sharing NAT. In this case it is the destination, not the source, which is altered: frequently this is used to map a single IP address onto a farm of servers, such as for a heavy porn... err... Web server.
The most common form of RNAT today is Linux 2.2's "port-forwarding" feature, which is usually used to direct connections to a single TCP port to another server. This is frequently used in combination with masquerading, where the server in question doesn't have a valid IP address, and so cannot be connected to directly.
One of the good things about Free Software is the cool people involved, such as Freshmeat's Patrick Lenz, who provides me with Freshmeat stats using ipchains, to count the total number of connections, and the number which are probably from masqueraded connections. Here is a recent snapshot of those stats:
Chain input (policy ACCEPT: 228314119 packets, 21789959697 bytes):
    pkts     bytes target prot opt    tosa tosx ifname mark outsize source   destination ports
 1083067  55439939 -      tcp  -y---- 0xFF 0x00 any                 anywhere anywhere    61000:65095 -> any
12363685 621294605 -      tcp  -y---- 0xFF 0x00 any                 anywhere anywhere    any -> any
That is, 8.76% of connections to Freshmeat are masqueraded.
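The figure comes straight from the two packet counters in the listing: the first rule counts SYN packets whose source port lies in 61000:65095, the second counts all SYN packets. A quick check of the arithmetic in Python (the variable names are just for this illustration):

```python
masqueraded_syns = 1083067   # pkts column of the 61000:65095 rule
all_syns = 12363685          # pkts column of the "any -> any" rule

share = 100.0 * masqueraded_syns / all_syns
print(f"{share:.2f}% of connection attempts look masqueraded")
```

This prints a share of 8.76%.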
Another common form used is "transparent proxying", where connections which would ordinarily pass through the masquerading box are RNAT'ed to the box itself, enabling much deviousness.
To do Network Address Translation meaningfully, you have to recognize replies so you can translate them back. Recognizing related packets is called "connection tracking", because rather than treating each packet as an individual, it treats it as either part of an existing connection or an attempt to start a new one.
Connection tracking is a separate kernel module to NAT, but must be loaded before the NAT module can be loaded. (If you put the modules in the right directory, modprobe will figure this out and load the connection tracking module for you when the NAT module is loaded. Describing this minor magic is on my TODO list).
You can load the connection tracking module by itself using `insmod ip_conntrack.o' in the conntrack/ subdirectory of the source. This creates a file /proc/net/ip_conntrack which details the states of various connections it is currently aware of.
Like iptables, connection tracking is extensible; new protocol modules can be written to increase its understanding. It understands ICMP, UDP and TCP by default; an example module to understand FTP traffic is included.
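The core idea of connection tracking, deciding whether a packet belongs to a connection already seen or starts a new one, can be sketched in Python as a table keyed on the packet's addresses and ports, with the reversed tuple registered as the expected reply. This is a toy model under that assumption, not the ip_conntrack module's data structures; it has no timeouts or per-protocol state, and the class name `ConnTracker' is invented.

```python
class ConnTracker:
    """Toy connection table keyed on (proto, src, sport, dst, dport)."""

    def __init__(self):
        self.table = {}    # tuple -> connection id
        self.next_id = 1

    def classify(self, proto, src, sport, dst, dport):
        """Return ('existing', id) or ('new', id) for this packet."""
        key = (proto, src, sport, dst, dport)
        if key in self.table:
            return "existing", self.table[key]
        conn = self.next_id
        self.next_id += 1
        self.table[key] = conn
        # Register the reply direction so replies map to the same connection.
        self.table[(proto, dst, dport, src, sport)] = conn
        return "new", conn
```

The first packet from 10.0.0.1:1025 to 1.2.3.4:80 is classified as new; the answer coming back from 1.2.3.4:80 is recognized as part of the same connection.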
Let's look at a simple example:
This gets a little more complicated when Network Address Translation gets involved.
Your gateway to the wonderful world of Linux NAT is the tool ipnatctl. You can think of this as a handy tool for screwing your network over worse than you ever imagined was possible.
ipnatctl allows you to insert (`-I'), delete (`-D') and list (`-L') rules. Rules are implicitly ordered, like the way routing information is implicitly ordered: more specific rules take precedence over less specific rules.
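As with routing, "more specific" can be pictured as "longest matching prefix". The Python sketch below picks the rule whose source network has the longest mask among those that match. Treating specificity as the source prefix length alone is an assumption made for this illustration; the text does not spell out how ipnatctl ranks rules, and the rule layout and the name `pick_rule' are invented.

```python
import ipaddress

def pick_rule(rules, src):
    """Return the matching rule with the longest source prefix, or None."""
    ip = ipaddress.ip_address(src)
    candidates = [r for r in rules if ip in ipaddress.ip_network(r["source"])]
    if not candidates:
        return None
    # Assumption: a longer prefix means a more specific rule.
    return max(candidates, key=lambda r: ipaddress.ip_network(r["source"]).prefixlen)

rules = [
    {"source": "10.0.0.0/8", "to": "1.2.3.4"},
    {"source": "10.1.0.0/16", "to": "1.2.3.5"},
]
```

With these two rules, a packet from 10.1.2.3 is handled by the /16 rule, whatever order the rules were inserted in, while 10.2.0.1 falls through to the /8 rule.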
Each rule has three parts:
When we see a packet which creates a new connection, we look up the rules the user specified to see if we should modify packets in this connection. If we do, we record the modifications required to packets (and their replies), and alter the connection tracking code's expectations so it recognizes the reply packets.
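The two halves of that process, recording the modification when a connection is created and undoing it on the replies, can be sketched in Python for a simple source mapping. This is a model of the idea only, assuming the source port is left unchanged; the class `SourceNat' and its methods are hypothetical and do not correspond to the kernel code.

```python
class SourceNat:
    """Map every outgoing source address onto new_src; unmangle the replies."""

    def __init__(self, new_src):
        self.new_src = new_src
        self.bindings = {}  # expected reply tuple -> original source address

    def outbound(self, src, sport, dst, dport):
        """Mangle a packet on its way out, remembering how to undo it."""
        # The reply will come from (dst, dport) to (new_src, sport).
        self.bindings[(dst, dport, self.new_src, sport)] = src
        return (self.new_src, sport, dst, dport)

    def inbound(self, src, sport, dst, dport):
        """Unmangle a reply; packets with no binding pass through unchanged."""
        original = self.bindings.get((src, sport, dst, dport))
        if original is None:
            return (src, sport, dst, dport)
        return (src, sport, original, dport)
```

For example, with `SourceNat("1.2.3.4")', a packet from 192.168.1.2 port 5000 leaves with source 1.2.3.4, and the reply addressed to 1.2.3.4 port 5000 is delivered to 192.168.1.2.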
Let's look at a simplified example. We have created a rule which says all UDP packets going out ppp0 should be mapped onto the source IP address 1.2.3.4:
ipnatctl is also extensible: new protocols and new mapping types can be created, and several are included in the base distribution.
There are several standard options for matching packets: each protocol can provide extra options, as we will see below.
`-s' or `--source': Followed by an IP address or range, this option allows the specification of a particular source IP address, or a range of source addresses (using the `/' mask notation, such as `192.168.1.0/24' or `192.168.1.0/255.255.255.0').
`-d': Followed by an IP address or range, this is used to specify a particular destination IP address or range, similar to the above.
`-b': Followed by part or all of the words "source" or "destination", to indicate whether a source (NAT) or destination (RNAT) mapping is desired.
`-p' or `--protocol': Followed by a protocol number or name, means that only packets of the given protocol will match the rule. If the protocol has special support, this causes the loading of extra options, as we'll see below.
A protocol can provide extended options: currently TCP and UDP do. Each extension has two parts: a kernel module (eg. "ip_nat_tcp.o"), and a shared library (eg. "libnatctl_proto_tcp.so"). The shared libraries should reside in the "/usr/local/lib/ipnatctl/" directory.
Protocol-specifics are enabled by using the `--protocol' option to ipnatctl: if the kernel modules are placed in the right directory (usually /lib/modules/`uname -r`/net/), they should auto-load. Otherwise these modules can be manually insmod'd after the core ip_nat module. The TCP, UDP and ICMP modules are built-in already.
The protocols which have specific options are:
TCP: This provides the following options:
Source port: Followed by a port number, indicates that only packets from this TCP port should match the rule.
Destination port: Followed by a port number, indicates that only packets to this TCP port should match the rule.
UDP: This provides the following options:
Source port: Followed by a port number, indicates that only packets from this UDP port should match the rule.
Destination port: Followed by a port number, indicates that only packets to this UDP port should match the rule.
There is only one standard output option:
`-t' or `--to': Followed by either a single IP address or an address range, indicates the IP range onto which the source or destination IP address of the packet is to be mapped.
The range can be an address and a mask, like the `--source' option, or a `-' separated inclusive IP range, like `192.168.1.1-192.168.1.3'.
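The two ways of writing a range, address-and-mask or a `-' separated inclusive pair, boil down to a first and a last address. The following Python sketch parses both forms; it is only a model of the notation, the name `parse_range' is invented, and it says nothing about how ipnatctl itself parses its arguments.

```python
import ipaddress

def parse_range(text):
    """Return (first, last) for 'a.b.c.d', 'a.b.c.d/mask' or 'first-last'."""
    if "-" in text:
        first_text, last_text = text.split("-")
        first = ipaddress.ip_address(first_text)
        last = ipaddress.ip_address(last_text)
    else:
        # A bare address is treated as a /32; strict=False tolerates host bits.
        network = ipaddress.ip_network(text, strict=False)
        first, last = network[0], network[-1]
    if first > last:
        raise ValueError(f"range is backwards: {text}")
    return first, last

def range_size(text):
    """Number of addresses in the inclusive range."""
    first, last = parse_range(text)
    return int(last) - int(first) + 1
```

In this model `192.168.1.1-192.168.1.3' covers three addresses and `192.168.1.0/24' covers 256.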
The `--protocol' option not only can add extra match options, but also extra output options, which add protocol-specific restrictions on how the packet source or destination address can be mapped.
TCP: This provides the following options:
Followed by a port number, or a `-' separated port range, indicates that packets must be mapped onto this TCP port or port range.
UDP: This provides the following options:
Followed by a port number, or a `-' separated port range, indicates that packets must be mapped onto this UDP port or port range.
Once we've specified what packets the rule matches, and the range into which they should be mangled, we need to specify exactly how the packets are to be mapped onto that range.
This is the reason for the optional `--mapping-type' option: it is followed by a mapping type which will handle creation of the binding.
If no `--mapping-type' option is specified, the `generic' mapping type is used. This binding searches for an unused mapping in the given range, as follows:
Some mapping types in the standard package are:
masquerade: Instead of mapping the source onto a fixed IP address, this maps the source onto the IP address of the interface the packet is heading out, making the packet seem to come from the box itself. Thus, with this mapping, the `--to' option is ignored (but the protocol-specific options, such as TCP's `--to-port' still have effect). This only works as a source manipulation.
This should only be used for dynamically-assigned IP addresses: when the interface goes down, all masqueraded connections will be forgotten. This prevents old connections from sending out packets with the wrong source address when the interface comes up again (in particular, you should get an ICMP error or TCP RST when the interface comes back, alerting you that you lost it).
If you have a statically-assigned IP address, I recommend you don't use masquerade, but simply use `--to X.X.X.X'. These connections will not be forgotten when an interface goes down.
redirect: Instead of mapping the destination onto a fixed IP address, this maps the destination onto the IP address of the interface the packet came in on, making the packet head to the box itself. Thus, with this mapping, the `--to' option is ignored (but the protocol-specific options, such as TCP's `--to-port' still have effect). This only works as a destination manipulation.
If you're trying to masquerade your 100,000 node network onto three TCP ports, you'll eventually have more than three of them trying to connect to the same server, and creating the binding fails. (This is because the NAT code will never create two connections which look identical). The packet which evoked the binding will be dropped.
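The failure mode is easy to reproduce on paper. In the Python sketch below, a pool of three source ports is shared by everyone; because no two connections may look identical, the fourth connection to the same server and port finds nothing free. This is a toy model of the uniqueness constraint, with an invented class `PortPool' and a first-free search order chosen for simplicity, not the NAT code's actual allocation strategy.

```python
class PortPool:
    """Hand out (ip, port) bindings so that no two connections look identical."""

    def __init__(self, ip, ports):
        self.ip = ip
        self.ports = list(ports)
        self.in_use = set()  # (our ip, our port, server ip, server port)

    def bind(self, dst, dport):
        """Return a free source port for this destination, or None on failure."""
        for port in self.ports:
            key = (self.ip, port, dst, dport)
            if key not in self.in_use:
                self.in_use.add(key)
                return port
        return None  # binding fails: the packet would be dropped

pool = PortPool("1.2.3.4", range(1024, 1027))
results = [pool.bind("5.6.7.8", 80) for _ in range(4)]
```

Here `results' ends up as three ports followed by a failure, while a connection to a different server can still reuse port 1024, since it does not collide with any existing binding.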
Normally, dropping the packet is the right thing: IP is designed with the assumption that in case of congestion, packets will be dropped. If those three connections are web connections, and thus short lived, you're in with a good chance when your TCP stack retransmits. If, however, those connections are all long lived, you're SOL.
The classic problem at the moment is TCP connections which don't close properly (usually a Windows 95 machine or a Mac got the plug pulled; such machines have no place on a network). Established TCP connections take 6 hours to time out from the NAT code.
To quote a post to Linux Kernel:
> Hi,
>
> We have a problem with ip_masqurading set up as a firewall. When someone
> runs a stealth scan from the masquraded net to the outside net, it will
> very fast consume all available masqurade ports. The result is a nasty
> DoS for all adresses on the masquraded net.
Take a baseball bat to the stealth-scanning motherfucker, and the
problem will be resolved.
There are several possible DOS attacks from INSIDE a NAT host. Fixing
this one doesn't win much.
Trust me on the baseball bat,
Rusty.
I included this for two reasons: firstly, we don't have enough vulgarities in HOWTOs (cf. Linux kernel code). Secondly, it illustrates a problem, especially if you are running a big site.
There is a solution for TCP; if the oldest connection has been idle for longer than some arbitrary time, and you haven't done this in the last few seconds, fake up a TCP ACK packet (keepalive). You'll either get back another ACK, which means the connection will not have been idle, or a TCP RST if the connection is really dead, which will free the slot. Darren Reed is against this, and in his experience it simply hasn't been a serious problem. If I find that it is for my users, I'll code it up.
You have to understand ICMP, UDP and TCP. ICMP, because it's used by the other protocols to report errors, and UDP and TCP because they're misdesigned, such that their internal checksum includes the IP source and destination addresses, so NAT breaks them.
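To see why rewriting an address breaks the transport checksum, here is a Python sketch that computes a UDP checksum over the standard pseudo-header (source address, destination address, zero byte, protocol number, length) plus the UDP segment, once with the original source address and once with a translated one. The addresses, ports and payload are made-up sample values, and the code illustrates the checksum rule only; it is not taken from the NAT code.

```python
import socket
import struct

UDP_PROTOCOL = 17

def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum, as used by the Internet checksum."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:   # fold the carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total

def pseudo_header(src: str, dst: str, length: int) -> bytes:
    """The IP addresses, protocol and length that UDP folds into its checksum."""
    return (socket.inet_aton(src) + socket.inet_aton(dst)
            + struct.pack("!BBH", 0, UDP_PROTOCOL, length))

def udp_checksum(src: str, dst: str, segment: bytes) -> int:
    """Checksum of a UDP segment whose checksum field is currently zero."""
    total = ones_complement_sum(pseudo_header(src, dst, len(segment)) + segment)
    return (~total & 0xFFFF) or 0xFFFF  # zero is transmitted as 0xFFFF

# Sample segment: ports 5000 -> 53, length 13, checksum 0, payload "hello".
segment = struct.pack("!HHHH", 5000, 53, 13, 0) + b"hello"
before = udp_checksum("192.168.1.2", "9.9.9.9", segment)
after = udp_checksum("1.2.3.4", "9.9.9.9", segment)
```

The two values differ even though not a single byte of the UDP segment changed, which is why a NAT box has to fix up the TCP or UDP checksum as well as the IP header.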
For other protocols, we give it a damn good shot. If they're NAT-friendly, they should "just work", although congestion (see above) gets worse. Of course, since we know nothing about them, we don't know when a stream is finished, so we just keep the binding around for an hour since the last packet, making congestion an even bigger possibility.
The kernel will merge these into a single rule, providing seamless integration. Birds will sing, the sun will shine, and your breath will smell sweeter.
Consider the case of the following two rules:
# ipnatctl -I -s 192.168.1.2 -b source -t 1.2.3.2
# ipnatctl -I -s 192.168.1.2 -b source -t 1.2.3.3
#
Internally, these get rolled into one rule with a combined range of `1.2.3.2 and 1.2.3.3'. This works ad infinitum. Rules which cannot be combined (e.g. identical conditions, but one a `-b source' and one `-b dest', or different mapping types) are rejected with a funky ``Directory not empty'' error message. You get the idea.
Deleting a rule deducts from the range, but you can't add a rule then delete part of it.
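The merging behaviour can be pictured with a small Python model: rules with identical conditions are stored once, each insert adds to the combined range, and an insert that cannot be combined is refused. The data layout and the class `RuleTable' are invented for this sketch, only the mapping direction is checked for compatibility, and the real kernel logic is certainly more involved; ENOTEMPTY is used because it is the errno behind the ``Directory not empty'' message.

```python
import errno
import os

class RuleTable:
    """Toy rule table that merges rules sharing the same conditions."""

    def __init__(self):
        self.rules = {}  # conditions -> (direction, list of targets)

    def insert(self, conditions, direction, target):
        existing = self.rules.get(conditions)
        if existing is None:
            self.rules[conditions] = (direction, [target])
            return
        if existing[0] != direction:
            # Same conditions but a different kind of mapping: cannot combine.
            raise OSError(errno.ENOTEMPTY, os.strerror(errno.ENOTEMPTY))
        existing[1].append(target)  # extend the combined range

table = RuleTable()
table.insert("src=192.168.1.2", "source", "1.2.3.2")
table.insert("src=192.168.1.2", "source", "1.2.3.3")
```

After the two inserts the table holds a single rule whose range is 1.2.3.2 and 1.2.3.3; inserting a `dest' mapping with the same conditions raises the error.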
One use for this kind of thing is to map over several port ranges:
# ipnatctl -I -s 192.168.1.2 -p tcp -b source -t 1.2.3.2 --to-ports 1024-5999
# ipnatctl -I -s 192.168.1.2 -p tcp -b source -t 1.2.3.2 --to-ports 6000-65535
#
../NAT/userspace/ipnatctl -I -s $TAP0NET.2 -d $TAP1NET.0/24 -b source -t $TAP0NET.3 || exit 1
../NAT/userspace/ipnatctl -I -s $TAP0NET.2 -d $TAP1NET.0/24 -b source -t $TAP0NET.4 || exit 1
I have grand plans for dealing with NAT of fragments, simply because it is possible. I want to see how messy it gets, though, and I can't think of a good reason for doing it. There are plenty of bad reasons though: it would open the doors to implementing parallel NAT machines, and it would piss off the authors of the draft NAT RFC, who said it couldn't be done.
Since you were insulting, I'm not going to give a comprehensive set of examples. But here are a few to get you started:
Try these:
# ipnatctl --help
# ipnatctl -p tcp --help
# ipnatctl -p udp --help
Try this:
# insmod netfilter/NAT/protocols/ip_nat_tcp.o
# insmod netfilter/NAT/protocols/ip_nat_udp.o
# insmod netfilter/NAT/mapping-types/ipnat_bind_masquerade.o
# ipnatctl -I -s 192.168.0.0/16 -b source --mapping-type masquerade
# ipnatctl -I -b source -s 10.0.0.0/8 --to 1.2.3.4
# insmod netfilter/NAT/mapping-types/ipnat_bind_redirect.o
# ipnatctl -I -d 10.0.0.0/8 -b dest --mapping-type redirect
# ipnatctl -I -p tcp -b dest --dport 80 --to-port 25
# ipnatctl -I -p tcp --dport 80 -d 127.0.0.1 --to-port 8080
In the compat/ directory, there are two modules "ipfwadm.o" and "ipchains.o". These provide "backwards compatibility" with ipfwadm (from Linux 2.0) and ipchains (2.2); they both act as if CONFIG_IP_ALWAYS_DEFRAG was compiled in. The accounting chains (ipfwadm) don't see foreign packets on the wire, even if the interface is in promiscuous mode, nor do they see malformed packets.
I use "backwards compatibility" in quotes, because I don't want users of these treated like second-class citizens. I prefer not to leave behind a trail of disappointed, bitter, out-for-blood Rusty haters. Let's leave that for my love-life.
This doesn't yet work with "backwards compatibility" modules. I'm working on it.
Should work fine. You should always see "end-to-end" IP addresses in iptables (as far as such things exist in the real world).
So whether you're masquerading or not, the packet filter rules won't see it; it looks like your private network is directly connecting to the outside world, and vice versa.
If you're redirecting to a server farm, it looks to your packet filter as if external machines are connecting straight to each individual machine.