I will be talking about the fakeARP driver I wrote which you can access from GitHub [source]. I wrote this code under Linux kernel version 3.4.9 but it should be valid for some earlier (maybe 2 years) versions.
What you need to know and target audience
Network drivers are quite different from the char drivers we considered in the last post. Interestingly, you don't need to know any socket programming to write a network device driver. I'll be talking about an ethernet device driver I wrote, so you need to know how layer 2 frames are transmitted in the ethernet protocol. You also need to know how ARP works, since our driver is about ARP. It is best if you know the IP stack up to layer 4. You may want to recall encapsulation and header structures. A reference book (I read Can Okan Dirican's network book many years ago, I recommend it for Turkish readers) or Wireshark may be useful for that.
The intended audience is also different from the previous post. You need to understand interrupt handling and concurrency on the kernel side to be able to follow this one (and as I repeated countless times in the previous post, you need to read an actual book on kernel development to understand those). It will be easier for you if you have written a char device driver before. I may also compare some things to how they are handled in char devices.
The driver we are considering is as crappy as the "Hello, world!" char driver. Again it lacks so many things it should be doing and again this is not the correct way to write network device drivers. I am writing a proper version of this driver though, I will publish it with name "FakeARP Unleashed" or something (hopefully) in upcoming weeks.
This post will be mostly about receiving and giving network packets (or more precisely layer 2 frames) to kernel. As I said I read LDD3 to learn how to write drivers and unfortunately most of the info about network drivers is outdated there. I wasted so much time trying to find some sources on NAPI and polling. So I am trying to help people who are struggling with the same topics.
Documentation scarcity and disclaimer
It is really hard to find up-to-date and useful information about Linux kernel network API on web. One time I felt like I was back in nineties, when search engines were new and you needed to follow links from web pages to other web pages. There exists some good documentation but it is scattered therefore it feels like solving a puzzle sometimes.
The outdated resources also make it harder. Interestingly, both the inner workings of Linux's networking subsystem and the networking API have undergone many changes in the last few years. I guess those were hard times for driver programmers. It also left behind layers of documentation written for different versions of the subsystem and the API, which makes finding useful documentation harder for us.
I will share whatever up-to-date resources I could find as I write about specific topics below. Other than that, I read some of the related function and structure definitions in the code. There are good and explanatory comments in those source files. I also read the sources of the bridge driver (controlled by brctl in user-space) and Intel's e1000e Gbit PCIe ethernet card driver, because I was familiar with them from my previous job as a sysadmin. I suggest you do the same: read the struct and function definitions in the source and check out some working driver code you might be familiar with. See the API at work. Don't forget: the most accurate source for the Linux kernel is the Linux kernel source.
Just as I said in the char driver tutorial, things I write here may be wrong or misleading. I may have picked up some outdated info or I might have misunderstood some things. I am still a newbie. Please warn me in the comments if you see something wrong.
HOW EVERYTHING WORKS
Let's talk about network drivers a little, then about what this module does and how it works specifically.
What is a network driver?
I said there are different kinds of drivers in my post about writing a char driver. Network drivers are one of the kinds which don't work with system calls. They take and give information (frames, datagrams, packets etc.) with interrupts or special kernel mechanisms. There is a special struct called sk_buff (abbreviated as skb in variable names) which you need to know well. It represents a packet (or a datagram or a frame, whatever the unit is called in your layer) and all data transmission between the kernel's network sub-system and your device is done with sk_buff structs. Network drivers don't interact with user-space in any direct way (they do with rtnetlink but we won't cover it here), they take packets from and give packets to the network sub-system.
By the way, I will be referring to all of these units as packets from now on, regardless of their layer.
You can think of most network device drivers as a pipe which has two sides. One side looks to the kernel. The other side looks to the physical port of the device, preferably with a cable (or another medium) connecting your device to another network device outside your computer. Network drivers take each packet arriving at one side and pass it to the other side after making some adjustments to it if necessary. In this context, rx is when you take a packet from the cable (via hardware) and give it to the kernel. Tx is when you take a packet from the kernel and give it to the hardware (and it sends the packet over its cable).
What does our module do?
Our device is again a virtual one and unfortunately cannot enjoy a physical port which connects it to the outside world. Still we will act as if it has a cable and trick the kernel into thinking that we are giving it packets coming from the outside world. Our target is ARP packets because their length (for ipv4) and ingredients are constant. You can find lots of resources on ARP and the structure of ARP packets. We will intercept the ARP requests given to us by the kernel to send over our cable. Then we will change a few octets (network people call bytes "octets" FYI), create a valid ARP response to the request and give it back to the kernel as if it was coming from outside. We'll drop any other packets. By the way, I'm planning to write a driver which is capable of faking ping requests and TCP handshakes too. It is possible for a driver to fake a whole network behind it.
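To make "change a few octets" concrete, here is a minimal user-space sketch of turning an ARP request frame into an ARP reply in place. It is not the code from the fakeARP module: the function name fake_arp_reply, the offset names and the fake MAC address passed in are all made up for illustration. The only facts it relies on are the standard layout of an ethernet header (14 octets) followed by an IPv4-over-ethernet ARP payload (28 octets).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN   6
#define ETH_HLEN   14
#define ARP_LEN    28                     /* ARP payload, IPv4 over ethernet */
#define FRAME_LEN  (ETH_HLEN + ARP_LEN)   /* 42 octets in total */

/* Octet offsets of the fields we touch, counted from the start of the frame. */
enum {
    OFF_ETH_DST  = 0,  OFF_ETH_SRC = 6,  OFF_ETH_TYPE = 12,
    OFF_ARP_OPER = 20, OFF_ARP_SHA = 22, OFF_ARP_SPA  = 28,
    OFF_ARP_THA  = 32, OFF_ARP_TPA = 38
};

/*
 * Hypothetical helper: if f holds an ARP request, rewrite it in place into
 * a reply claiming that fake_mac owns the requested IP. Returns 1 on
 * success and 0 if the frame is not an ARP request (such frames are the
 * ones the post says we drop).
 */
static int fake_arp_reply(uint8_t *f, size_t len, const uint8_t fake_mac[ETH_ALEN])
{
    uint8_t asked_ip[4];

    if (len < FRAME_LEN)
        return 0;
    if (f[OFF_ETH_TYPE] != 0x08 || f[OFF_ETH_TYPE + 1] != 0x06)
        return 0;                                   /* EtherType is not ARP */
    if (f[OFF_ARP_OPER] != 0x00 || f[OFF_ARP_OPER + 1] != 0x01)
        return 0;                                   /* not a request */

    memcpy(asked_ip, f + OFF_ARP_TPA, 4);           /* the IP being asked about */

    /* Ethernet header: send it back to whoever asked. */
    memcpy(f + OFF_ETH_DST, f + OFF_ETH_SRC, ETH_ALEN);
    memcpy(f + OFF_ETH_SRC, fake_mac, ETH_ALEN);

    /* ARP payload: opcode 2 (reply), then swap sender and target. */
    f[OFF_ARP_OPER + 1] = 0x02;
    memcpy(f + OFF_ARP_THA, f + OFF_ARP_SHA, ETH_ALEN);
    memcpy(f + OFF_ARP_TPA, f + OFF_ARP_SPA, 4);
    memcpy(f + OFF_ARP_SHA, fake_mac, ETH_ALEN);
    memcpy(f + OFF_ARP_SPA, asked_ip, 4);
    return 1;
}
```

In a real driver the same octets would live in the data area of an skb instead of a plain array, but the edits themselves would look much the same.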
I guess we can move onto the data structures and module code now.
Data structures and associating functions
The first data structure we will consider is struct net_device. It is defined in include/linux/netdevice.h in the source tree. It has certain registration and initialization functions, just like struct cdev for char devices. It also holds pointers to other structs associated with it.
Here is a web reference that might be useful. [Link] It explains some functions. Might be outdated a little.
struct net_device_ops is also defined in include/linux/netdevice.h and is similar to the file operations struct in char devices. It contains function hooks for various network operations (most of which you are familiar with from ifconfig). Just like the file operations struct, this struct provides an interface to the net_device. As I said earlier, it doesn't contain any methods to interact with user-space directly. It has hooks (function pointers) for functions which interact with the kernel's internal network sub-system. It also has hooks to change device configuration through ifconfig or more flexible applications (like iproute2) which use RTNETLINK sockets. Finally, it has methods to return information about the device, like the number of rx/tx packets, the number of dropped packets etc.
Info like what is expected from the functions and where they are used is provided just before the net_device_ops struct definition in source code. I will write about the ones we will use but read the comments in source code too, they are a good resource.
Another piece associated with net_device is the private data area, which is a custom struct provided by us, the module developers. It holds device-specific information and is allocated automatically together with the net_device struct. We just define the struct and pass its size to the allocation function; the rest is handled per device by the net_device functions. When we need to access it we call netdev_priv(dev), where dev is a struct net_device *, to get a pointer to our struct.
We will use the old struct net_device_stats, which is available as the stats member of net_device, to count received and transmitted packets. It is read automatically by ifconfig, but you can also write your own function that fills in and returns a net_device_stats struct. To do that, assign your function to the ndo_get_stats hook in the net_device_ops struct. You can use the newer rtnl_link_stats64 struct instead, through the ndo_get_stats64 hook.
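How can one allocation hold both the device struct and our private struct? Here is a small user-space model of the idea, under my assumption that the kernel simply places the private area right after the (aligned) net_device and that netdev_priv() is plain pointer arithmetic. All model_* names and struct fakearp_priv are made up for this sketch; the real structs are far bigger.

```c
#include <stddef.h>
#include <stdlib.h>

#define MODEL_ALIGN 32   /* stand-in for the kernel's alignment constant */

/* Round x up to the next multiple of a (a must be a power of two). */
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((size_t)(a) - 1))

/* Tiny stand-in for struct net_device. */
struct model_net_device {
    char name[16];
    unsigned long rx_packets;
    unsigned long tx_packets;
};

/* Hypothetical private struct a module like fakeARP might define. */
struct fakearp_priv {
    unsigned long faked_replies;
};

/* One zeroed allocation: the device struct, padding, then the private area. */
static struct model_net_device *model_alloc_netdev(size_t sizeof_priv)
{
    size_t total = ALIGN_UP(sizeof(struct model_net_device), MODEL_ALIGN)
                   + sizeof_priv;

    return calloc(1, total);
}

/* The private area starts at the first aligned offset past the device struct. */
static void *model_netdev_priv(struct model_net_device *dev)
{
    return (char *)dev + ALIGN_UP(sizeof(struct model_net_device), MODEL_ALIGN);
}
```

Usage mirrors the kernel API: allocate with model_alloc_netdev(sizeof(struct fakearp_priv)), then cast the result of model_netdev_priv(dev) to struct fakearp_priv *. Freeing the device frees the private area with it, which is why no separate cleanup is needed for it.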
ether_setup and destructor
Let's talk about registration and initialization functions a little bit. We first allocate the net_device struct along with the private section using alloc_etherdev(). It is a thin wrapper declared in include/linux/etherdevice.h that ends up in alloc_etherdev_mqs() (defined in net/ethernet/eth.c) with tx and rx queue counts equal to 1. That in turn calls alloc_netdev_mqs(), which is the generic function for allocating network devices.
alloc_etherdev() names our device "eth%d" and we change it after allocation. %d means a number will be assigned when the device is registered, and the kernel picks the lowest number which is not in use. For example our device will be named eth0 if there is no eth device yet, and eth3 if eth0, eth1 and eth2 already exist.
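The numbering rule can be sketched in a few lines of user-space C. This is only my model of the behaviour (lowest unused number wins), not the kernel's implementation; lowest_free_index and MAX_IDX are names and limits I invented.

```c
#include <stdio.h>
#include <string.h>

#define MAX_IDX 32   /* arbitrary cap for this sketch */

/*
 * Return the lowest N for which "<prefix>N" does not appear in existing[],
 * or -1 if every index below MAX_IDX is taken.
 */
static int lowest_free_index(const char *prefix,
                             const char *const existing[], int count)
{
    char candidate[32];
    int idx, i, found;

    for (idx = 0; idx < MAX_IDX; idx++) {
        snprintf(candidate, sizeof(candidate), "%s%d", prefix, idx);
        found = 0;
        for (i = 0; i < count; i++)
            if (strcmp(candidate, existing[i]) == 0)
                found = 1;
        if (!found)
            return idx;
    }
    return -1;
}
```

Note that a gap is reused: with eth0 and eth2 present, the next device would become eth1, not eth3.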
It runs the ether_setup() initialization function right after allocating memory. That function sets up the basic properties of the device according to the requirements of ethernet. You can read what it changes in the ether_setup() definition in net/ethernet/eth.c.
As you can see, allocation and initialization take just one function call for ethernet devices. The device becomes visible to the rest of the system once we register it with register_netdev().
We remove our device from the system using unregister_netdev function. We can also set net_device->destructor function hook to take care of freeing our device automatically when the unregister function is called. If we set it to free_netdev function it will free the memory allocated to net_device struct and the private part. You can of course add any other jobs that should be taken care of before removing the device using custom functions.
SK_BUFF STRUCT (SKB)
sk_buff is a fundamental struct in the network subsystem of the Linux kernel. It represents basically all packets/fragments/datagrams. IP messages consisting of multiple packets are also stored as a linked list of sk_buff structs. There are some good references for struct sk_buff online. I can't guarantee their validity because some functions or data structures might be out-of-date. One reference that helped me tremendously is this .pdf file [Link]. sk_buff functions and structures are defined in net/core/skbuff.c and include/linux/skbuff.h.
I will only cover the parts of sk_buff we will use for our driver but you can be sure they are up-to-date. For any other function and data structure reference, check out the source code. As I said sk_buff plays a central role in network subsystem and apparently network subsystem has changed many times during last years.
I will talk about the journey of a sk_buff in later sections, let's investigate some fields and functions we are going to use in our driver.
Data container of a sk_buff is some kind of a double ended queue made of chars. You can reserve space in both the beginning and the end. You can also add new data to both ends. One can see why and how this kind of a data structure makes life easier when working on network packets.
- head/end: The beginning and end of the whole reserved area.
- data/tail: The beginning and end of the area in use, i.e. the packet's payload as seen by the current layer. There may be written data before skb->data and after skb->tail but it should not be relied on. For example, when a packet is first read from a device, skb->data points to the beginning of the ethernet header (or a few octets back due to alignment). When the sk_buff is processed by the driver, skb->data is set to point to the end of the ethernet header before it is fed to the network sub-system (by using eth_type_trans()). When it is transferred to the IP layer, skb->data is set to point to the end of the IP header and so on. What I mean is, there may be actual written data before skb->data and after skb->tail but it is most likely useless in the layer where the sk_buff is being processed.
- len/size: len = tail - data, which gives the number of useful bytes in the sk_buff (for the simple, unfragmented skbs we deal with). size = end - head, which gives the total reserved space in the sk_buff (you can't use more than this unless you reserve/allocate more space by using the related functions). Note that size is my own shorthand here, sk_buff has no field with that name; the nearest real field is truesize, which also counts the bookkeeping structs. Also note that the packets will be memory aligned according to the network device's specification. For example, in my case received packets are always aligned to 32 bits (i.e. size is always a multiple of 4).
- There are some other fields concerning fragmented packets. For example skb->data_len gives the number of bytes which are stored outside the main data area (in page fragments or chained skbs), while skb->len counts everything. That is one of the reasons why we are working with ARP packets, they are not fragmented in any case. Putting fragmented IP packets back together is handled in the upper layers by the kernel and it can be quite complicated.
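The relationship between the four pointers is easier to see in code. Below is a toy user-space model of the data container, assuming a plain unfragmented buffer. struct model_skb and the model_* helpers are invented for this sketch and only mimic the pointer arithmetic described above.

```c
#include <stddef.h>

/* Toy model of the four pointers; the real struct sk_buff has many more fields. */
struct model_skb {
    unsigned char *head;   /* start of the reserved area */
    unsigned char *data;   /* start of the bytes in use  */
    unsigned char *tail;   /* end of the bytes in use    */
    unsigned char *end;    /* end of the reserved area   */
};

/* Lay out a buffer of size bytes holding len used bytes after headroom spare bytes. */
static void model_layout(struct model_skb *s, unsigned char *buf, size_t size,
                         size_t headroom, size_t len)
{
    s->head = buf;
    s->data = buf + headroom;
    s->tail = s->data + len;
    s->end  = buf + size;
}

static size_t model_len(const struct model_skb *s)      { return (size_t)(s->tail - s->data); }
static size_t model_headroom(const struct model_skb *s) { return (size_t)(s->data - s->head); }
static size_t model_tailroom(const struct model_skb *s) { return (size_t)(s->end - s->tail); }
static size_t model_capacity(const struct model_skb *s) { return (size_t)(s->end - s->head); }
```

For any valid layout, headroom + len + tailroom adds up to the capacity. A 42-octet ARP frame stored after 16 spare octets in a 128-octet buffer leaves 70 octets of tailroom.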
A network device takes an incoming packet (from another network device outside the host) in raw binary form and the driver writes the whole thing to the data section of the sk_buff we mentioned above. The information is in layer 1 encapsulation (in other words maximum encapsulation) in a sense. Therefore we cannot know which protocols the packet has passed through, how many headers it contains nested in each other or where the real payload is. However, as the sk_buff is processed by the various network layers inside the kernel, the headers are stripped off one by one. Pointers to where each header is located are recorded in some fields of the sk_buff. The pointers to the headers concerning TCP/IP are listed below.
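- transport_header (called h in older kernels) = tcp/udp header (layer 4 header)
- network_header (called nh in older kernels) = ip header (layer 3 header)
- mac_header (called mac in older kernels) = ethernet header (layer 2 header)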
Also note that these fields are empty in fresh skbs, they are filled as the packets move up and down the network sub-system.
There are two other fields which are filled as the packet is interpreted inside network subsystem, namely pkt_type and ip_summed. But they are not relevant to our driver therefore I am skipping them. I will discuss them if I write a longer post about skbs.
Below is a diagram for an HTTP packet.
There are lots of functions acting on skbs but I will only talk about the ones useful for our driver.
Alloc/free and copy functions
Allocation and copy functions can be quite confusing because you have the option to clone skbs. Also there are functions to copy/clone and allocate header space. I don't know exactly how everything works but I will share what I learned. Again I have to warn you that some of this may be wrong.
Copying means making a deep copy. It allocates new memory for the reserved and data sections and copies them from head to end from the original skb. All pointers like skb->head, skb->data, skb->tail, skb->end and the header pointers are re-arranged to point into the newly allocated space. The comments in the source code state that fragmented skbs are united into a whole as a side effect of this process.
Cloning means copying all skb members, including the pointers to the original skb's reserved space and data area. No new allocation is made for the reserved and data spaces and those sections are not copied. Both the original and the clone sk_buff structs share the same data. Therefore you should call copy if you are going to write to or modify the data or header of the packet. Cloning is useful since most of the time you just need to read stuff. When an skb is cloned, the reference count of the shared data area is incremented and skb_cloned(skb) returns true.
- alloc_skb(size, gfp_mask) allocates an skb with data section size bytes long. Of course full size of the skb is larger. Initially it has head = data = tail and end = head + size. So basically it consists of tailroom = size. If you use dev_alloc_skb(size) it always allocates with GFP_ATOMIC.
- dev_kfree_skb_any(skb) frees the skb. _any means the function may be called in either normal (process) context or interrupt context. There are separate variants for each case: dev_kfree_skb() and dev_kfree_skb_irq(). As far as I can tell from the source, freeing an skb which has clones only drops one reference, and the shared data area is released when the last skb pointing at it is freed.
- skb_clone(skb, gfp_mask) clones the skb as I described above. Allocation of the new sk_buff structure is done according to gfp_mask.
- skb_copy(skb, gfp_mask) copies the skb and its data section as I described above. Allocation is done according to gfp_mask.
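The difference between cloning and copying can be modelled in user-space with a reference-counted data block. This is a rough sketch of the idea only: struct model_data, struct model_skb and the model_* functions are my inventions, and the real kernel code keeps its counter elsewhere and uses atomic operations.

```c
#include <stdlib.h>
#include <string.h>

/* Shared data block with a reference count. */
struct model_data {
    int dataref;
    size_t size;
    unsigned char bytes[];
};

/* The descriptor: cheap to duplicate, points at the data block. */
struct model_skb {
    struct model_data *d;
    size_t len;
};

static struct model_skb *model_alloc(size_t size)
{
    struct model_skb *s = malloc(sizeof(*s));

    if (!s)
        return NULL;
    s->d = calloc(1, sizeof(*s->d) + size);
    if (!s->d) {
        free(s);
        return NULL;
    }
    s->d->dataref = 1;
    s->d->size = size;
    s->len = 0;
    return s;
}

/* Clone: new descriptor, same data block, one more reference. */
static struct model_skb *model_clone(const struct model_skb *orig)
{
    struct model_skb *s = malloc(sizeof(*s));

    if (!s)
        return NULL;
    *s = *orig;
    s->d->dataref++;
    return s;
}

/* Copy: new descriptor and a private duplicate of the data block. */
static struct model_skb *model_copy(const struct model_skb *orig)
{
    struct model_skb *s = model_alloc(orig->d->size);

    if (!s)
        return NULL;
    memcpy(s->d->bytes, orig->d->bytes, orig->d->size);
    s->len = orig->len;
    return s;
}

static int model_cloned(const struct model_skb *s)
{
    return s->d->dataref > 1;
}

/* Free the descriptor; the data block goes away with its last user. */
static void model_free(struct model_skb *s)
{
    if (--s->d->dataref == 0)
        free(s->d);
    free(s);
}
```

A write through a clone is visible through the original, while a write through a copy is not. That is the whole reason to copy before modifying a packet that someone else may still be reading.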
data field functions
These functions act on the reserved space and data area. One should be careful not to overflow data area over reserved space. If skb->data pointer gets in front of skb->head or skb->tail pointer goes beyond skb->end then you can get a kernel panic. It seems like a harsh punishment for an error in just one packet but developers explain in the source code how it can be critical.
- skb_headroom(): returns the length of headroom space (reserved place in the front) which is equal to skb->data - skb->head.
- skb_tailroom(): returns the length of tailroom space (reserved place in the back) which is equal to skb->end - skb->tail.
- skb_push(): pulls back the skb->data pointer and opens space in the front of the packet for writing. It can be used to write headers in front of packets when the packet is to be given to a lower layer. You should be careful not to take skb->data in front of skb->head.
- skb_pull(): pushes the skb->data pointer forward, which grows the headroom and shrinks the packet. It is usually used to skip the current layer's header before giving the packet to an upper layer, so that the upper layer can access its own payload directly using skb->data. For example the kernel pulls ETH_HLEN octets from layer 2 packets before giving them to the IP layer.
- skb_put(): pushes skb->tail forward, creating free space to write in the back of the packet. It causes tailroom to shrink of course. You can get kernel panic if you push skb->tail beyond skb->end.
- skb_trim(): Similar to skb_put() in that it moves skb->tail, but instead of pushing skb->tail forward by size bytes it sets skb->tail so that len = size. It is used to shrink the packet's data section to size bytes, hence the name skb_trim. If the packet is already that short or shorter it does nothing, so it cannot be used to enlarge the data section; use skb_put() for that.
- skb_reserve(): pushes both the skb->data and skb->tail pointers forward but does not copy the contents. For empty skbs it moves data section forward while preserving skb->len. It should not be used in skbs which contain data inside.
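The pointer moves in the list above can be tried out with a small user-space model. It mirrors the behaviour as I understand it from the descriptions and the kernel source, but struct model_skb and the model_* functions are stand-ins I wrote for this sketch, and assert() plays the role of the kernel panic.

```c
#include <assert.h>
#include <stddef.h>

struct model_skb {
    unsigned char *head, *data, *tail, *end;
    size_t len;
};

/* Fresh buffer: head = data = tail, so everything is tailroom. */
static void model_init(struct model_skb *s, unsigned char *buf, size_t size)
{
    s->head = s->data = s->tail = buf;
    s->end = buf + size;
    s->len = 0;
}

/* Make headroom in an empty buffer; len stays the same. */
static void model_reserve(struct model_skb *s, size_t n)
{
    s->data += n;
    s->tail += n;
}

/* Grow at the back; returns where the new bytes start. */
static unsigned char *model_put(struct model_skb *s, size_t n)
{
    unsigned char *old_tail = s->tail;

    assert((size_t)(s->end - s->tail) >= n);   /* the kernel would panic here */
    s->tail += n;
    s->len += n;
    return old_tail;
}

/* Grow at the front (e.g. to prepend a header); returns the new data pointer. */
static unsigned char *model_push(struct model_skb *s, size_t n)
{
    assert((size_t)(s->data - s->head) >= n);  /* the kernel would panic here */
    s->data -= n;
    s->len += n;
    return s->data;
}

/* Drop bytes from the front (e.g. to skip a header); NULL if n is too large. */
static unsigned char *model_pull(struct model_skb *s, size_t n)
{
    if (n > s->len)
        return NULL;
    s->len -= n;
    s->data += n;
    return s->data;
}

/* Cut the used area down to len bytes; never grows it. */
static void model_trim(struct model_skb *s, size_t len)
{
    if (s->len > len) {
        s->len = len;
        s->tail = s->data + len;
    }
}
```

A typical tx-style sequence would be model_reserve(&s, 14) to leave room for an ethernet header, model_put(&s, 28) to append an ARP payload, then model_push(&s, 14) to write the header in front of it. The rx direction does the opposite with model_pull(&s, 14).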
eth_type_trans()
I learned about this function from Intel's e1000e driver. It moves the skb->data pointer past the ethernet header and records where the ethernet header begins in the skb's mac header pointer. It returns the protocol ID of the packet, taken from the ethernet header, which tells the kernel which upper layer protocol should handle it. It must be doing lots of other stuff too but I didn't read the source code.
You need to use this function before giving the packets you received over the cable to the kernel. It is used like: skb->protocol = eth_type_trans(skb, dev); where dev is a pointer to your net_device struct.
Note that you won't be able to use the skb as a raw frame (with skb->data at the ethernet header) after calling the function. So it is best if you call this function just before giving the packet to the network subsystem. If you need the data inside the packet after giving it to the kernel you'd better take a copy. You can of course use the skb after calling the function but it is hard work to rearrange everything that was changed by eth_type_trans().
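Here is a stripped-down user-space model of the part of eth_type_trans() described above: remember where the ethernet header is, step over it, and report the EtherType. It is a sketch under simplifying assumptions. The real function returns the value in network byte order, takes the net_device as a second argument and does more work that I leave out; struct model_skb and model_eth_type_trans are made-up names.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_HLEN 14   /* two 6-octet addresses plus the 2-octet EtherType */

struct model_skb {
    unsigned char *data;        /* start of the bytes in use */
    size_t len;                 /* number of bytes in use    */
    unsigned char *mac_header;  /* where the ethernet header begins */
    uint16_t protocol;          /* filled in by the caller   */
};

/*
 * Simplified model: returns the EtherType as a host-order number
 * (0x0806 for ARP, 0x0800 for IPv4), or 0 if the frame is too short.
 */
static uint16_t model_eth_type_trans(struct model_skb *s)
{
    if (s->len < ETH_HLEN)
        return 0;

    s->mac_header = s->data;    /* remember the layer 2 header */
    s->data += ETH_HLEN;        /* upper layers start at their own header */
    s->len  -= ETH_HLEN;

    return (uint16_t)((s->mac_header[12] << 8) | s->mac_header[13]);
}
```

Usage follows the same pattern as in the driver: s.protocol = model_eth_type_trans(&s);. After the call, s.data points at the ARP payload and the ethernet header is only reachable through s.mac_header, which is why the skb can no longer be treated as a raw frame.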
I will post the second part tomorrow (this time for real). I will talk about how a network driver can work integrated with NAPI and how our driver works.