Owen Gage

Network interfaces for developers

If you look for information about network interfaces in Linux, you tend to get a lot of results from the perspective of system or network administrators. But there are things a software developer would find useful to know about day-to-day. I attempt to explain these here. First I'll explain at a high level, then in more gritty detail with Linux source code to back me up.

This is going to focus on running servers, in particular HTTP servers. I'll mostly stick to Linux and IPv4 for simplicity. Other operating systems have close equivalents, and IPv6 is similar.

So what are these useful things to know?

Sockets

To create a server you create a socket for communication. You bind this to an IP address and port, then listen on that socket to accept incoming connections, e.g. HTTP requests.

Binding

You can only bind to addresses the kernel thinks correspond to itself. You can't just bind to the IP address of some random website.
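
For example, an attempt to bind to a non-local address typically fails with EADDRNOTAVAIL. A minimal sketch in C, assuming the usual default of net.ipv4.ip_nonlocal_bind being off, and using 203.0.113.1 (a documentation-only address that won't belong to your machine):

/* Binding to an address the kernel doesn't consider local fails. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8080);
    inet_pton(AF_INET, "203.0.113.1", &addr.sin_addr);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        perror("bind"); /* typically prints: bind: Cannot assign requested address */
    return 0;
}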

Network interfaces

Network interfaces are Linux's way of representing where network packets can enter and leave the system. You almost certainly have several interfaces. They can correspond to physical hardware like your network card, or virtual software-defined interfaces.

There is a loopback interface which handles 127.0.0.0/8 addresses (this is CIDR notation), meaning 127.*.*.* like 127.0.0.1. These are private to the current host. No packets on these addresses leave the host or can enter from outside it [1].

Linux has network namespaces which isolate sets of network interfaces from each other. Docker leverages these namespaces. This means that a Docker container has its own network interfaces independent of the host, including its own loopback. This is why a server bound to 127.0.0.1 in your Docker container isn't accessible from the host.

0.0.0.0

When binding, there is a special address that can be used, 0.0.0.0, which can receive packets arriving on any available network interface. This is typically used when you want the server accessible outside of the host.

In more detail

Let's talk about these topics in more detail. It isn't a one-to-one mapping with the titles above but covers the same ground.

Creating a server

If you want to create a server without higher-level libraries or frameworks, you need to interact with the kernel via system calls (syscalls). Beej's Guide to Network Programming is an excellent guide on doing just that.

There are only a few system calls required. First you need to make a socket with the socket syscall. A socket is a very generic object in Linux for communication between things. It looks a lot like a file in how you interact with it. You can read from it, write to it, and much more. For example, a socket can be created to make HTTP requests, send UDP packets, communicate with the Docker daemon, and even configure network interfaces. It can also be a way to receive TCP connections, and therefore, act as a server.
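
As a quick illustration of how generic sockets are, the sketch below (Linux-specific because of the netlink socket) creates four very different sockets with the same syscall; only the family, type and protocol arguments change:

/* The same socket() syscall covers very different kinds of communication. */
#include <linux/netlink.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    int tcp = socket(AF_INET, SOCK_STREAM, 0);             /* TCP, e.g. an HTTP server */
    int udp = socket(AF_INET, SOCK_DGRAM, 0);              /* UDP packets */
    int unx = socket(AF_UNIX, SOCK_STREAM, 0);             /* e.g. talking to a local daemon */
    int nl  = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE); /* configuring interfaces/routes */
    printf("%d %d %d %d\n", tcp, udp, unx, nl);
    return 0;
}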

To make our socket into a server, we need to tell the kernel the IP address and port we want it to have. We do this with bind. For example, we can bind 127.0.0.1:8080 to our socket.

Finally we need to tell the kernel we'd like to accept incoming connections, making it a server. We do this with listen. We can then call accept to accept connections. There's a lot more to making a functional server, for which you can check out Beej's guide.
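
To make the sequence concrete, here is a minimal sketch of those syscalls in C, binding to 127.0.0.1:8080 as in the example above. It skips the read/write loop and most of the error handling a real server would need:

/* Minimal TCP server sketch: socket -> bind -> listen -> accept. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* A TCP/IPv4 socket. */
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    /* Bind it to 127.0.0.1:8080. */
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8080);
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }

    /* Tell the kernel to accept incoming connections, with a backlog of 16. */
    if (listen(fd, 16) < 0) {
        perror("listen");
        return 1;
    }

    /* Accept a single connection, then close everything. */
    int conn = accept(fd, NULL, NULL);
    if (conn >= 0)
        close(conn);
    close(fd);
    return 0;
}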

Most languages will have higher-level libraries for doing all of this. For example we have Express for Node, where making a server is quite simple:

const express = require('express')
const app = express()

app.get('/', (req, res) => {
  res.send('Hello World!')
})

app.listen(8080, "127.0.0.1", () => {})

That handy listen function does all of the work for us.

Python's FastAPI has a very similar example:

from fastapi import FastAPI
import uvicorn

app = FastAPI()

@app.get("/")
def hello():
    return "Hello World!"

uvicorn.run(app, host="127.0.0.1", port=8080)

Who can talk to it?

Should connections from outside the host be possible?

If the answer is no, and the outside world should not be able to access the server, you should bind to a loopback address such as 127.0.0.1. This is what is generally referred to as localhost. Be aware that localhost may resolve to the IPv6 equivalent, ::1, which can cause issues if you bind to localhost but try to connect to 127.0.0.1. There are many loopback IP addresses available, but in general people stick to 127.0.0.1 and bind to different ports if they wish to run multiple servers. You can bind to 127.0.0.2:8080 and 127.0.0.3:8080 to no ill effect, other than localhost no longer referring to them.
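
You can see this localhost ambiguity for yourself by asking the resolver what localhost means. A small C sketch using getaddrinfo; on many systems it prints both ::1 and 127.0.0.1, depending on /etc/hosts and resolver configuration:

/* Print every address the resolver returns for "localhost". */
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
    struct addrinfo hints, *res, *p;
    memset(&hints, 0, sizeof(hints));
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo("localhost", "8080", &hints, &res) != 0)
        return 1;

    for (p = res; p != NULL; p = p->ai_next) {
        char buf[INET6_ADDRSTRLEN];
        void *addr;
        if (p->ai_family == AF_INET)
            addr = &((struct sockaddr_in *)p->ai_addr)->sin_addr;
        else
            addr = &((struct sockaddr_in6 *)p->ai_addr)->sin6_addr;
        inet_ntop(p->ai_family, addr, buf, sizeof(buf));
        printf("%s\n", buf);
    }
    freeaddrinfo(res);
    return 0;
}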

If the answer is yes, and the outside world should be able to access the server, you'll generally want to bind to the special address 0.0.0.0. As we will see later, the kernel will deliver connections arriving on any network interface to your socket, as long as everything else matches, such as the port.
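
In the C socket API, 0.0.0.0 is the INADDR_ANY constant, and binding to it differs from the earlier loopback example only in how sin_addr is filled in. A sketch (bind_any is just an illustrative helper, not a standard function):

/* Bind a socket to 0.0.0.0 on the given port, i.e. all interfaces. */
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

int bind_any(int fd, unsigned short port)
{
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_ANY); /* INADDR_ANY is 0.0.0.0 */
    return bind(fd, (struct sockaddr *)&addr, sizeof(addr));
}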

More complicated networking set-ups can require more nuance, where you may want to bind to a given network card's IP address explicitly.

Servers in Docker containers

Docker requires some special attention here. Since a Docker container operates in its own network namespace, its localhost is different to the localhost of the Docker host. Even if you bind to 0.0.0.0:8080 inside a container, it is still not available outside the host, since it only applies to the network interfaces inside the container.

However, Docker will set up a network interface on the host and on containers to link the two worlds. The ip command can be used to see various information about network interfaces in Linux. If we use ip route both on the host and in a container we get the following:

#
# On the host
#
host% ip route
default via 192.168.1.254 dev enp2s0 proto dhcp src 192.168.1.190 metric 100
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
(trimmed output)

#
# In the container
#
container% ip route
default via 172.17.0.1 dev eth0
172.17.0.0/16 dev eth0 proto kernel scope link src 172.17.0.2

In the host's routes, we can see one for docker0. This is the network interface Docker created on our host to talk to containers. It tells us that the addresses 172.17.0.0/16 should be sent to this interface. In the container's routes we can see the eth0 interface. It also tells us our container's IP address for this interface is 172.17.0.2.

If we bind to 0.0.0.0:8080 in our container, or even bind to 172.17.0.2:8080 directly, then we can access our server from the host:

host% curl 172.17.0.2:8080
Hello World!

But if we bind to a loopback address like 127.0.0.1 then we would not be able to do this cURL request successfully, since the server would only be bound to the Docker container's loopback interface.

This doesn't provide us a way to expose the server running in a Docker container to the world outside of the host, since to do that our server would need to be bound to 0.0.0.0 on the host or one of the host's specific addresses.

Docker provides us a way to connect these through the -p or --publish flags, such as

docker run -p 127.0.0.1:80:8080 nginx:alpine

When trying to parse these arguments it's good to remember the host is on the left of the colon. In 127.0.0.1:80:8080, the 127.0.0.1:80 is referring to the host's network and 8080 is referring to the container's network.

This command causes Docker to bind to the host's 127.0.0.1:80 and forward any packets to/from the container's 8080 port. Inside the container these packets travel via the eth0 interface we saw earlier. Because they're on eth0, they still won't be seen/sent by a server bound to localhost inside the container. The container's server would need to be bound to 0.0.0.0 for this forwarding to/from the host to work.

The shorthand version of the flag, -p 80:8080, will bind to 0.0.0.0 on the host, and will therefore expose the container's server to outside the host. This can easily be an accidental security hole. Hopefully you have some firewall external to your host anyway, but either way it is good practice to only bind to localhost if that is sufficient. You can also be explicit with -p 0.0.0.0:80:8080.

How does 0.0.0.0 work?

At this point, I'm going to start digging up Linux source code to show how this works. If you're happy that you understand the behaviour of 0.0.0.0 and that's all you need, this section isn't going to add much for you. But if you want a better idea of how this occurs in Linux, read on.

Our sockets, funnily enough, correspond to a struct socket in the kernel. If you're not familiar with C, this is a bit like a class:

/**
 *  struct socket - general BSD socket
 *  @state: socket state (%SS_CONNECTED, etc)
 *  @type: socket type (%SOCK_STREAM, etc)
 *  @flags: socket flags (%SOCK_NOSPACE, etc)
 *  @ops: protocol specific socket operations
 *  @file: File back pointer for gc
 *  @sk: internal networking protocol agnostic socket representation
 *  @wq: wait queue for several uses
 */
struct socket {
	socket_state state;
	short type;
	unsigned long flags;
	struct file *file;
	struct sock *sk;
	const struct proto_ops *ops; /* Might change with IPV6_ADDRFORM or MPTCP. */
	struct socket_wq wq;
};

C doesn't have classes, but it can mimic them manually by storing function pointers to act as the 'methods' on that class. You can see this in the struct proto_ops field. This stores a bunch of function pointers for things like bind for that particular protocol.
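
If that pattern is unfamiliar, here's a tiny standalone C sketch of the same idea, unrelated to the kernel's actual types:

/* A struct of function pointers acting like a "methods" table. */
#include <stdio.h>

struct shape_ops {
    double (*area)(double size);
};

static double square_area(double size) { return size * size; }

static const struct shape_ops square_ops = { .area = square_area };

int main(void)
{
    /* Calling "a method" is just calling through the pointer. */
    printf("%f\n", square_ops.area(3.0));
    return 0;
}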

For IPv4 sockets, this ends up calling the __inet_bind function in af_inet.c. This sets up a bunch of stuff for the socket object, but not too much relevant to us now. The listen syscall ends up calling inet_csk_listen_start:

int inet_csk_listen_start(struct sock *sk)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct inet_sock *inet = inet_sk(sk);
	int err;

	err = inet_ulp_can_listen(sk);
	if (unlikely(err))
		return err;

	reqsk_queue_alloc(&icsk->icsk_accept_queue);

	sk->sk_ack_backlog = 0;
	inet_csk_delack_init(sk);

	/* There is race window here: we announce ourselves listening,
	 * but this transition is still not validated by get_port().
	 * It is OK, because this socket enters to hash table only
	 * after validation is complete.
	 */
	inet_sk_state_store(sk, TCP_LISTEN);
	err = sk->sk_prot->get_port(sk, inet->inet_num);
	if (!err) {
		inet->inet_sport = htons(inet->inet_num);

		sk_dst_reset(sk);
		err = sk->sk_prot->hash(sk);

		if (likely(!err))
			return 0;
	}

	inet_sk_set_state(sk, TCP_CLOSE);
	return err;
}

The important line here is err = sk->sk_prot->hash(sk). This hash function takes aspects of our bound socket and stores them in a global kernel data structure (a hash table) for fast lookup later. It actually does the storing rather than just returning a hash of the socket. For IPv4, the underlying hash is computed by ipv4_portaddr_hash:

static inline u32 ipv4_portaddr_hash(
	const struct net *net,
	__be32 saddr,
	unsigned int port)
{
	return jhash_1word((__force u32)saddr, net_hash_mix(net)) ^ port;
}

You can see that the hash is produced mostly from the IP address saddr and the port. This means the kernel can look up our listening socket from just this information. When the kernel receives a packet, the destination IP and destination port can be used to look up our socket.

When the kernel receives a packet, it uses __inet_lookup_listener to look up the listener socket:

struct sock *__inet_lookup_listener(const struct net *net,
	struct inet_hashinfo *hashinfo,
	struct sk_buff *skb, int doff,
	const __be32 saddr, __be16 sport,
	const __be32 daddr, const unsigned short hnum,
	const int dif, const int sdif)
{
	struct inet_listen_hashbucket *ilb2;
	struct sock *result = NULL;
	unsigned int hash2;

	/* Lookup redirect from BPF */
	if (static_branch_unlikely(&bpf_sk_lookup_enabled) &&
	    hashinfo == net->ipv4.tcp_death_row.hashinfo) {
		result = inet_lookup_run_sk_lookup(net, IPPROTO_TCP, skb, doff,
						   saddr, sport, daddr, hnum, dif,
						   inet_ehashfn);
		if (result)
			goto done;
	}

	hash2 = ipv4_portaddr_hash(net, daddr, hnum);
	ilb2 = inet_lhash2_bucket(hashinfo, hash2);

	result = inet_lhash2_lookup(net, ilb2, skb, doff,
				    saddr, sport, daddr, hnum,
				    dif, sdif);
	if (result)
		goto done;

	/* Lookup lhash2 with INADDR_ANY */
	hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum);
	ilb2 = inet_lhash2_bucket(hashinfo, hash2);

	result = inet_lhash2_lookup(net, ilb2, skb, doff,
				    saddr, sport, htonl(INADDR_ANY), hnum,
				    dif, sdif);
done:
	if (IS_ERR(result))
		return NULL;
	return result;
}

We can see the very same hash function ipv4_portaddr_hash being used, with its result fed through inet_lhash2_bucket and inet_lhash2_lookup to find the socket. You can think of this as just a hash table lookup.

This hashing and lookup happens up to twice. The first time, the destination address and port are used in the lookup. This means that if we bound our server to an explicit address such as 127.0.0.1:8080, this first lookup will succeed if that is the packet's destination.

But this first lookup would fail if we bound our server to 0.0.0.0:8080, since the socket was stored under 0.0.0.0 rather than the packet's destination address. A second lookup is performed, this time with the address portion set to INADDR_ANY, which is a constant referring to 0.0.0.0, and the destination port. This time, our server bound to 0.0.0.0:8080 will be found, and the socket will receive the packet assuming all else is good.
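
Putting that together, a heavily simplified sketch of the two lookups might look like this (hash and find_in_table are hypothetical stand-ins for ipv4_portaddr_hash, inet_lhash2_bucket and inet_lhash2_lookup):

/* A simplified picture of the two-stage listener lookup. */
#include <netinet/in.h> /* INADDR_ANY, i.e. 0.0.0.0 */
#include <stdint.h>

struct sock; /* stand-in for the kernel's socket representation */

uint32_t hash(uint32_t addr, uint16_t port);
struct sock *find_in_table(uint32_t bucket, uint32_t addr, uint16_t port);

struct sock *lookup_listener(uint32_t dst_addr, uint16_t dst_port)
{
    /* Stage 1: a socket bound to the exact destination, e.g. 127.0.0.1:8080. */
    struct sock *sk = find_in_table(hash(dst_addr, dst_port), dst_addr, dst_port);
    if (sk)
        return sk;

    /* Stage 2: fall back to a socket bound to 0.0.0.0:8080 (INADDR_ANY). */
    return find_in_table(hash(INADDR_ANY, dst_port), INADDR_ANY, dst_port);
}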

This captures much of the core logic for deciding which socket a packet is delivered to. There isn't much reference to network interfaces here, although the lookup function does consider them via the dif and sdif arguments, scoring candidate sockets with them in mind.

Conclusion

This covers a bunch of information about network interfaces and servers that I think is helpful for a software developer. When I set out to write this my understanding was very implicit. I understood the behaviour to some extent, but had no idea of the actual mechanisms.

Researching this made these concepts more concrete and I learned a lot along the way. There are bits I did not include, such as exactly what circumstances need to be met for the kernel to allow you to bind to a given address, or how DHCP ultimately causes your device to learn its IP address on the local network and automatically configures that network interface.

I hope you found it useful.

Footnotes

  1. Not without further shenanigans such as port forwarding at least.