Tuesday, November 19, 2013

Interview Questions at Narus

1. What is virtual memory? What are the advantages and disadvantages of virtual memory?
2. How does the compiler manage scoping of a static variable?
3. What is the difference between a macro and a const variable?
4. Network buffers in Linux
5. How does a watchdog run? How do you find out that some CPU is locked up?
6. What restricts the size of virtual memory?
7. What is the kernel memory size? Can it be virtual?
8. Is a static global and a static local variable with the same name possible in a function?
9. What is volatile?
10. fork: what all does it do?

Linux kernel flow: very important

kernel_flow


 This article describes the control flow (and the associated data buffering) of the Linux networking kernel. The picture on the left gives an overview of the flow. Open it in a separate window and use it as a reference for the explanation below.
This article is based on the 2.6.20 kernel. Please feel free to update for newer kernels.
Another article gives a similar description based on a 2.4.20 kernel. Unfortunately, that one is not on a Wiki so it can't be updated...

Preliminaries



Refer to Net:Network Overview for an overview of all aspects of the networking kernel: routing, neighbour discovery, NAPI, filtering, ...
The network data (including headers etc.) is managed through the sk_buff data structure. This minimizes copying overhead when going through the networking layers. A basic understanding of sk_buff is required to understand the networking kernel.
The kernel as a whole makes heavy use of virtual methods. These are recorded as function pointers in data structures. In the figure these are indicated with diamonds. This article never shows all possible implementations for these virtual methods, just the main ones.
This article only discusses TCP over IPv4 over Ethernet connections. Of course, many combinations of the different networking layers are possible, as well as tunnelling, bridging, etc.

Transmission path




Layer 5: Session layer (sockets and files)

There are three system calls that can send data over the network:
  • write (memory data to a file descriptor)
  • sendto (memory data to a socket)
  • sendmsg (a composite message to a socket)
All of these eventually end up in __sock_sendmsg(), which does security_sock_sendmsg() to check permissions and then forwards the message to the next layer using the socket's sendmsg virtual method.
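To make the convergence concrete, here is a minimal user-space sketch (not part of the original article) showing all three call styles on a single connected TCP socket. The 127.0.0.1:7777 address and port are made-up values and assume a peer is already listening there:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in sa;
    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(7777);                    /* hypothetical port */
    sa.sin_addr.s_addr = inet_addr("127.0.0.1");
    if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
        perror("connect");
        return 1;
    }

    /* 1. write(): plain file-descriptor interface */
    write(fd, "via write\n", 10);

    /* 2. send()/sendto(): socket interface */
    send(fd, "via send\n", 9, 0);

    /* 3. sendmsg(): composite (scatter/gather) message */
    char p1[] = "via ", p2[] = "sendmsg\n";
    struct iovec iov[2] = { { p1, sizeof(p1) - 1 }, { p2, sizeof(p2) - 1 } };
    struct msghdr msg;
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = iov;
    msg.msg_iovlen = 2;
    sendmsg(fd, &msg, 0);

    close(fd);
    return 0;
}

All three eventually enter the kernel path described above.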

Layer 4: Transport layer (TCP)

tcp_sendmsg: for each segment in the message
  1. find an sk_buff with space available (use the one at the end if space left, otherwise allocate and append a new one)
  2. copy data from user space to sk_buff data space (kernel space, probably DMA-able space) using skb_add_data().
    • The buffer space is pre-allocated for each socket. If the buffer runs out of space, communication stalls: the data remains in user space until buffer space becomes available again (or the call returns with an error immediately if it was non-blocking).
    • The size of allocated sk_buff space is equal to the MSS (Maximum Segment Size) plus headroom (the MSS may change during the connection and can be modified by user options; see the sketch after this list).
    • Segmentation (or coalescing of individual writes) happens at this level. Whatever ends up in the same sk_buff will become a single TCP segment. Still, the segments can be fragmented further at IP level.
  3. The TCP queue is activated; packets are sent with tcp_transmit_skb() (called multiple times if there are more active buffers).
  4. tcp_transmit_skb() builds the TCP header (the allocation of the sk_buff has left space for it). It clones the skb in order to pass control to the network layer. The network layer is called through the queue_xmit virtual function of the socket's address family (inet_connection_sock->icsk_af_ops).
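The MSS mentioned in step 2 can be inspected from user space. A minimal sketch (not from the original article), assuming fd is an already connected SOCK_STREAM socket:

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

static void print_mss(int fd)
{
    int mss = 0;
    socklen_t len = sizeof(mss);
    /* ask TCP for the maximum segment size it is currently using */
    if (getsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, &len) == 0)
        printf("current MSS for this connection: %d bytes\n", mss);
    else
        perror("getsockopt(TCP_MAXSEG)");
}

Each sk_buff filled by tcp_sendmsg() will carry at most this much payload in a single TCP segment.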

Layer 3: Network layer (IPv4)

  1. ip_queue_xmit() does routing (if necessary), creates the IPv4 header
  2. nf_hook() is called in several places to perform network filtering (firewall, NAT, ...). This hook may modify the datagram or discard it.
  3. The routing decision results in a destination (dst_entry) object. This destination models the receiving IP address of the datagram. The dst_entry's output virtual method is called to perform actual output.
  4. The sk_buff is passed on to ip_output() (or another output mechanism, e.g. in case of tunneling).
  5. ip_output() does post-routing filtering, re-outputs the packet on a new destination if necessary due to netfiltering, fragments the datagram into packets if necessary, and finally sends it to the output device.
    • Fragmentation tries to reuse existing fragment buffers, if possible. This happens when forwarding an already fragmented incoming IP packet. The fragment buffers are special sk_buff objects, pointing in the same data space (no copy required).
    • If no fragment buffers are available, new sk_buff objects with new data space are allocated, and the data is copied.
    • Note that TCP already makes sure the packets are smaller than the MTU, so normally fragmentation is not required (see the sketch after this list).
  6. Device-specific output is again through a virtual method call, to output of the dst_entry's neighbour data structure. This usually is dev_queue_xmit. There is some optimisation for packets with a known destination (hh_cache).
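As noted in the bullets above, TCP keeps segments below the MTU so IP fragmentation is normally avoided. The path MTU the kernel is using can be read from user space; a minimal sketch (not from the original article), assuming fd is a connected socket and that glibc's <netinet/in.h> exposes IP_MTU:

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

static void print_path_mtu(int fd)
{
    int mtu = 0;
    socklen_t len = sizeof(mtu);
    /* IP_MTU is only meaningful on a connected socket */
    if (getsockopt(fd, IPPROTO_IP, IP_MTU, &mtu, &len) == 0)
        printf("current path MTU: %d bytes\n", mtu);
    else
        perror("getsockopt(IP_MTU)");
}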

Layer 2: Link layer (e.g. Ethernet)

The main function of the kernel at the link layer is scheduling the packets to be sent out. For this purpose, Linux uses the queueing discipline (struct Qdisc) abstraction. For detailed information, see Chapter 9 (Queueing Disciplines for Bandwidth Management) of the Linux Advanced Routing & Traffic Control HOWTO and Documentation/networking/multiqueue.txt.
dev_queue_xmit puts the sk_buff on the device queue using the qdisc->enqueue virtual method.
  • If necessary (when the device doesn't support scattered data) the data is linearised into the sk_buff. This requires copying.
  • Devices which don't have a Qdisc (e.g. loopback) go directly to dev_hard_start_xmit().
  • Several Qdisc scheduling policies exist. The basic and most used one is pfifo_fast, which has three priorities.
The device output queue is immediately triggered with qdisc_run(). It calls qdisc_restart(), which takes an skb from the queue using the qdisc->dequeue virtual method. Specific queueing disciplines may delay sending by not returning any skb, and setting up a qdisc_watchdog_timer() instead. When the timer expires, netif_schedule() is called to start transmission.
Eventually, the sk_buff is sent with dev_hard_start_xmit() and removed from the Qdisc. If sending fails, the skb is re-queued. netif_schedule() is called to schedule a retry.
netif_schedule() raises a software interrupt, which causes net_tx_action() to be called when the NET_TX_SOFTIRQ is run by ksoftirqd. net_tx_action() calls qdisc_run() for each device with an active queue.
dev_hard_start_xmit() calls the hard_start_xmit virtual method for the net_device. But first, it calls dev_queue_xmit_nit(), which checks if a packet handler has been registered for the ETH_P_ALL protocol. This is used for tcpdump.
The device driver's hard_start_xmit function will generate one or more commands to the network device for scheduling transfer of the buffer. After a while, the network device replies that it's done. This triggers freeing of the sk_buff. If the sk_buff is freed from interrupt context, dev_kfree_skb_irq() is used. This delays the actual freeing until the next NET_TX_SOFTIRQ run, by putting the skb on the softnet_data completion_queue. This avoids doing frees from interrupt context.

Receive flow




Layer 2: Link layer (e.g. Ethernet)

The network device pre-allocates a number of sk_buffs for reception. How many is configured per device. Usually, the addresses to the data space in these sk_buffs are configured directly as DMA area for the device. The device interrupt handler takes the sk_buff and performs reception handling on it. Before NAPI, this was done using netif_rx(). In NAPI, it is done in two phases.
  1. From the interrupt handler, the device driver just calls netif_rx_schedule() and returns from interrupt. netif_rx_schedule() adds the device to softnet_data's poll_list and raises the NET_RX_SOFTIRQ software interrupt.
  2. ksoftirqd runs net_rx_action(), which calls the device's poll virtual method. The poll method does device-specific buffer management, calls netif_receive_skb() for each sk_buff, allocates new sk_buffs as required, and terminates with netif_rx_complete().
netif_receive_skb() finds out how to pass the sk_buff to upper layers.
  1. netpoll_rx() is called, to support the Netpoll API
  2. Call packet handlers for the ETH_P_ALL protocol (for tcpdump; see the sketch below)
  3. Call handle_ing() for ingress queueing
  4. Call handle_bridge() for bridging
  5. Call handle_macvlan() for virtual LAN
  6. Call the packet handler registered for the L3 protocol specified by the packet.
The packet handlers are called with the deliver_skb() function, which calls the protocol's func virtual method to handle the packet.
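The ETH_P_ALL handler mentioned in step 2 (and in dev_queue_xmit_nit() on the transmit side) is what tcpdump relies on. From user space the same hook is reached with an AF_PACKET socket; a minimal sketch (not from the original article, requires root / CAP_NET_RAW):

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <arpa/inet.h>        /* htons */
#include <linux/if_ether.h>   /* ETH_P_ALL */

int main(void)
{
    /* receive a copy of every frame seen by the link layer */
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) {
        perror("socket(AF_PACKET)");
        return 1;
    }
    unsigned char frame[2048];
    int i;
    for (i = 0; i < 5; i++) {                 /* grab a few frames, then exit */
        ssize_t n = recv(fd, frame, sizeof(frame), 0);
        if (n > 0)
            printf("received a frame of %zd bytes\n", n);
    }
    close(fd);
    return 0;
}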

Layer 3: Network layer (IPv4, ARP)


ARP



ARP packets are handled with arp_rcv(). It processes the ARP information, stores it in the neighbour cache, and sends a reply if required. In the latter case, a new sk_buff (with new data space) is allocated for the reply.

IPv4



IPv4 packets are handled with ip_rcv(). It parses headers, checks for validity, sends an ICMP reply or error message if required. It also looks up the destination address using ip_route_input(). The destination's input virtual method is called with the sk_buff.
  • ip_mr_input() is called for multicast addresses. The packet may be forwarded using ip_mr_forward(), and it may be delivered locally using ip_local_delivery().
  • ip_forward() is called for packets with a different destination for which we have a route. It directly calls the neighbour's output virtual method.
  • ip_local_deliver() is called if this machine is the destination of the packet. Datagram fragments are collected here.
ip_local_deliver() delivers to any raw sockets for this connection first, using raw_local_deliver(). Then, it calls the L4 protocol handler for the protocol specified in the datagram. The L4 protocol is called even if a raw socket exists.
Throughout, xfrm4_policy_check calls are included to support IPSec.

Layer 4: Transport layer (TCP)

The net_protocol handler for TCP is tcp_v4_rcv(). Most of the code here deals with the protocol processing in TCP, for setting up connections, performing flow control, etc.
A received TCP packet may include an acknowledgement of a previously sent packet, which may trigger further sending of packets (tcp_data_snd_check()) or of acknowledgements (tcp_ack_snd_check()).
Passing the incoming packet to an upper layer is done in tcp_rcv_established() and tcp_data_queue(). These functions maintain the tcp connection's out_of_order_queue, and the socket's sk_receive_queue and sk_async_wait_queue. If a user process is already waiting for data to arrive, the data is immediately copied to user space using skb_copy_datagram_iovec(). Otherwise, the sk_buff is appended to one of the socket's queues and will be copied later.
Finally, the receive functions call the socket's sk_data_ready virtual method to signal that data is available. This wakes up waiting processes.

Layer 5: Session layer (sockets and files)

There are three system calls that can receive data from the network:
  • read (memory data from a file descriptor)
  • recvfrom (memory data from a socket)
  • recvmsg (a composite message from a socket)
All of these eventually end up in __sock_recvmsg(), which does security_sock_recvmsg() to check permissions and then requests the message from the next layer using the socket's recvmsg virtual method. This is often sock_common_recvmsg(), which calls the recvmsg virtual method of the socket's protocol.
tcp_recvmsg() either copies data from the socket's queue using skb_copy_datagram_iovec(), or waits for data to arrive using sk_wait_data(). The latter blocks and is woken up by the layer 4 processing.
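For symmetry with the transmit side, here is a minimal user-space sketch (not from the original article) of the three receive calls on an already connected TCP socket fd; all of them end up in the receive path described above:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/uio.h>

static void receive_three_ways(int fd)
{
    char buf[256];
    ssize_t n;

    /* 1. read(): plain file-descriptor interface */
    n = read(fd, buf, sizeof(buf));

    /* 2. recv()/recvfrom(): socket interface */
    n = recv(fd, buf, sizeof(buf), 0);

    /* 3. recvmsg(): composite (scatter/gather) message */
    struct iovec iov = { buf, sizeof(buf) };
    struct msghdr msg;
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    n = recvmsg(fd, &msg, 0);

    printf("last call returned %zd bytes\n", n);
}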

Networking in the Linux Kernel


http://wiki.openwrt.org/doc/networking/praxis

Above we read merely about the theory of networking, about the basic ideas, about communication protocols and standards. Now let us see how all of this is handled by the Linux kernel 2.6:
Everything related is found under /net/. But the drivers for the network devices are, of course, found under /drivers/.
NOTE:
  1. The Linux kernel is only one component of the operating system
    1. it does require libraries itself (we at OpenWrt use the µCLibC, see →links.software.libraries, Section 3: C library functions)
    2. it is very modular and there are many modules
    3. it does require applications to provide features to end users (these run in userspace)
The main interface between the kernel and userspace is the set of system calls. There are about FIXME system calls. Network related system calls include:
  • write() (writes to a socket)

Network Data Flow through the Linux Kernel

Packet Handling

TX Transmission

  1. Queue No.1: The application process does a write() on a socket and all the data is copied from the process space into the send socket buffer
  2. Queue No.2: The data goes through the TCP/IP stack and the packets are put (by reference) into the NIC's egress buffer (this is where the packet scheduler works)
  3. Queue No.3: After a packet gets dequeued, the transmission procedure of the driver is called, and it is copied into the tx_ring, a ring buffer the driver shares with the NIC

RX Reception

  1. Queue No.1: The hardware (NIC) puts all incoming network packets into the rx_ring, a ring buffer the driver shares with the NIC
  2. Queue No.2: The IRQ handler of the driver takes the packet from the rx_ring, puts it (by reference) in the ingress buffer (aka backlog queue) and schedules a SoftIRQ (in former days, every incoming packet triggered an IRQ; since kernel FIXME this is solved by polling instead)
  3. Queue No.3: is the receive socket buffer

Typical queue lengths

  • The socket buffer sizes can be set by the application with setsockopt() (see the sketch after this list)
    • cat /proc/sys/net/core/rmem_default or cat /proc/sys/net/core/wmem_default
  • The default queuing discipline is a FIFO queue. Default length is 100 packets (ether_setup(): dev→queue_len, drivers/net/net_init.c)
  • The tx_ring and rx_ring are driver dependent (e.g. the e1000 driver set these lengths to 80 packets)
  • The backlog queue is 300 packets in size (/proc/sys/net/core/netdev_max_backlog). Once it is full, it has to drain completely before enqueue() is allowed again (netif_rx(), net/core/dev.c).
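As a sketch of how an application overrides the default socket buffer sizes (not part of the original text; the 256 KB value is an arbitrary example, the defaults come from /proc/sys/net/core/rmem_default and wmem_default):

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int size = 256 * 1024;
    socklen_t len = sizeof(size);

    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size)) < 0)
        perror("setsockopt(SO_SNDBUF)");
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size)) < 0)
        perror("setsockopt(SO_RCVBUF)");

    /* the kernel typically doubles the requested value for bookkeeping */
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, &len);
    printf("effective receive buffer: %d bytes\n", size);
    return 0;
}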
/proc
/proc is the POSIX-compliant mount point for the virtual filesystem for processes.
  • /proc/cpuinfo: processor information
  • /proc/meminfo: memory status
  • /proc/version: kernel version and build information
  • /proc/cmdline: kernel command line
  • /proc/<pid>/environ: calling environment
  • /proc/<pid>/cmdline: process command line
Transmitting
So you can install hardware capable of Ethernet (usually a network card or more precisely an Ethernet card) on two hosts, connect them with a standardized cable, like a Category 5 cable and communicate with one another over Ethernet as far as your software supports Ethernet ;-) Sooner or later the sausage will get to the Ethernet thingy of the network stack, this will prepare the data conforming to the Ethernet standard, then will deliver the frames to the network card drivers and this will make the hardware, the network card, transmit the data.
Receiving
The NIC on the other side will receive the signal, relay it to the Ethernet thingy of the network stack, this will create one huge data out of the Ethernet frames and relay it to the software.
When a packet is enqueued on an interface with dev_queue_xmit() (in net/core/dev.c), the enqueue operation of the packet scheduler is triggered and qdisc_wakeup() is called (in net/pkt_sched.h) to send the packet on that device.
A transmit queue is associated with each device. When a network packet is ready for transmission, the "networking code" will call the driver's hard_start_xmit()-function to let it know, a packet is waiting. The driver will then put that packet into the transmit queue of the hardware.
You can find the sources for the whole TCP/IP protocol suite implementation under /net/.

Monday, November 18, 2013

Zero Copy I: User-Mode Perspective

Explaining what is zero-copy functionality for Linux, why it's useful and where it needs work.
By now almost everyone has heard of so-called zero-copy functionality under Linux, but I often run into people who don't have a full understanding of the subject. Because of this, I decided to write a few articles that dig into the matter a bit deeper, in the hope of unraveling this useful feature. In this article, we take a look at zero copy from a user-mode application point of view, so gory kernel-level details are omitted intentionally.
What Is Zero-Copy?
To better understand the solution to a problem, we first need to understand the problem itself. Let's look at what is involved in the simple procedure of a network server dæmon serving data stored in a file to a client over the network. Here's some sample code:
read(file, tmp_buf, len);
write(socket, tmp_buf, len);
Looks simple enough; you would think there is not much overhead with only those two system calls. In reality, this couldn't be further from the truth. Behind those two calls, the data has been copied at least four times, and almost as many user/kernel context switches have been performed. (Actually this process is much more complicated, but I wanted to keep it simple). To get a better idea of the process involved, take a look at Figure 1. The top side shows context switches, and the bottom side shows copy operations.
Figure 1. Copying in Two Sample System Calls

Step one: the read system call causes a context switch from user mode to kernel mode. The first copy is performed by the DMA engine, which reads file contents from the disk and stores them into a kernel address space buffer.
Step two: data is copied from the kernel buffer into the user buffer, and the read system call returns. The return from the call caused a context switch from kernel back to user mode. Now the data is stored in the user address space buffer, and it can begin its way down again.
Step three: the write system call causes a context switch from user mode to kernel mode. A third copy is performed to put the data into a kernel address space buffer again. This time, though, the data is put into a different buffer, a buffer that is associated with sockets specifically.
Step four: the write system call returns, creating our fourth context switch. Independently and asynchronously, a fourth copy happens as the DMA engine passes the data from the kernel buffer to the protocol engine. You are probably asking yourself, “What do you mean independently and asynchronously? Wasn't the data transmitted before the call returned?” Call return, in fact, doesn't guarantee transmission; it doesn't even guarantee the start of the transmission. It simply means the Ethernet driver had free descriptors in its queue and has accepted our data for transmission. There could be numerous packets queued before ours. Unless the driver/hardware implements priority rings or queues, data is transmitted on a first-in-first-out basis. (The forked DMA copy in Figure 1 illustrates the fact that the last copy can be delayed).
As you can see, a lot of data duplication is not really necessary to hold things up. Some of the duplication could be eliminated to decrease overhead and increase performance. As a driver developer, I work with hardware that has some pretty advanced features. Some hardware can bypass the main memory altogether and transmit data directly to another device. This feature eliminates a copy in the system memory and is a nice thing to have, but not all hardware supports it. There is also the issue of the data from the disk having to be repackaged for the network, which introduces some complications. To eliminate overhead, we could start by eliminating some of the copying between the kernel and user buffers.
One way to eliminate a copy is to skip calling read and instead call mmap. For example:
tmp_buf = mmap(file, len);
write(socket, tmp_buf, len);
To get a better idea of the process involved, take a look at Figure 2. Context switches remain the same.
Figure 2. Calling mmap

Step one: the mmap system call causes the file contents to be copied into a kernel buffer by the DMA engine. The buffer is shared then with the user process, without any copy being performed between the kernel and user memory spaces.
Step two: the write system call causes the kernel to copy the data from the original kernel buffers into the kernel buffers associated with sockets.
Step three: the third copy happens as the DMA engine passes the data from the kernel socket buffers to the protocol engine.
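The two-line example above uses a simplified signature for readability. With the real Linux calls, the same flow looks roughly like this (a sketch of my own, with error handling trimmed; sock is assumed to be a connected socket):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static int send_file_mmap(int sock, const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    fstat(fd, &st);

    void *buf = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* the kernel copies page-cache data into the socket buffers here */
    ssize_t sent = write(sock, buf, st.st_size);

    munmap(buf, st.st_size);
    close(fd);
    return sent < 0 ? -1 : 0;
}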
By using mmap instead of read, we've cut in half the amount of data the kernel has to copy. This yields reasonably good results when a lot of data is being transmitted. However, this improvement doesn't come without a price; there are hidden pitfalls when using the mmap+write method. You will fall into one of them when you memory map a file and then call write while another process truncates the same file. Your write system call will be interrupted by the bus error signal SIGBUS, because you performed a bad memory access. The default behavior for that signal is to kill the process and dump core—not the most desirable operation for a network server. There are two ways to get around this problem.
The first way is to install a signal handler for the SIGBUS signal, and then simply call return in the handler. By doing this the write system call returns with the number of bytes it wrote before it got interrupted and the errno set to success. Let me point out that this would be a bad solution, one that treats the symptoms and not the cause of the problem. Because SIGBUS signals that something has gone seriously wrong with the process, I would discourage using this as a solution.
The second solution involves file leasing (which is called “opportunistic locking” in Microsoft Windows) from the kernel. This is the correct way to fix this problem. By using leasing on the file descriptor, you take a lease with the kernel on a particular file. You then can request a read/write lease from the kernel. When another process tries to truncate the file you are transmitting, the kernel sends you a real-time signal, the RT_SIGNAL_LEASE signal. It tells you the kernel is breaking your write or read lease on that file. Your write call is interrupted before your program accesses an invalid address and gets killed by the SIGBUS signal. The return value of the write call is the number of bytes written before the interruption, and the errno will be set to success. Here is some sample code that shows how to get a lease from the kernel:
if(fcntl(fd, F_SETSIG, RT_SIGNAL_LEASE) == -1) {
    perror("kernel lease set signal");
    return -1;
}
/* l_type can be F_RDLCK F_WRLCK */
if(fcntl(fd, F_SETLEASE, l_type)){
    perror("kernel lease set type");
    return -1;
}
You should get your lease before mmaping the file, and break your lease after you are done. This is achieved by calling fcntl() with F_SETLEASE and the lease type F_UNLCK.

Sendfile
In kernel version 2.1, the sendfile system call was introduced to simplify the transmission of data over the network and between two local files. Introduction of sendfile not only reduces data copying, it also reduces context switches. Use it like this:
sendfile(socket, file, len);
To get a better idea of the process involved, take a look at Figure 3.
Figure 3. Replacing Read and Write with Sendfile

Step one: the sendfile system call causes the file contents to be copied into a kernel buffer by the DMA engine. Then the data is copied by the kernel into the kernel buffer associated with sockets.
Step two: the third copy happens as the DMA engine passes the data from the kernel socket buffers to the protocol engine.
You are probably wondering what happens if another process truncates the file we are transmitting with the sendfile system call. If we don't register any signal handlers, the sendfile call simply returns with the number of bytes it transferred before it got interrupted, and the errno will be set to success.
If we get a lease from the kernel on the file before we call sendfile, however, the behavior and the return status are exactly the same. We also get the RT_SIGNAL_LEASE signal before the sendfile call returns.
So far, we have been able to avoid having the kernel make several copies, but we are still left with one copy. Can that be avoided too? Absolutely, with a little help from the hardware. To eliminate all the data duplication done by the kernel, we need a network interface that supports gather operations. This simply means that data awaiting transmission doesn't need to be in consecutive memory; it can be scattered through various memory locations. In kernel version 2.4, the socket buffer descriptor was modified to accommodate those requirements—what is known as zero copy under Linux. This approach not only reduces multiple context switches, it also eliminates data duplication done by the processor. For user-level applications nothing has changed, so the code still looks like this:
sendfile(socket, file, len);
To get a better idea of the process involved, take a look at Figure 4.
Figure 4. Hardware that supports gather can assemble data from multiple memory locations, eliminating another copy.

Step one: the sendfile system call causes the file contents to be copied into a kernel buffer by the DMA engine.
Step two: no data is copied into the socket buffer. Instead, only descriptors with information about the whereabouts and length of the data are appended to the socket buffer. The DMA engine passes data directly from the kernel buffer to the protocol engine, thus eliminating the remaining final copy.
Because data still is actually copied from the disk to the memory and from the memory to the wire, some might argue this is not a true zero copy. This is zero copy from the operating system standpoint, though, because the data is not duplicated between kernel buffers. When using zero copy, other performance benefits can be had besides copy avoidance, such as fewer context switches, less CPU data cache pollution and no CPU checksum calculations.
Now that we know what zero copy is, let's put theory into practice and write some code. You can download the full source code from www.xalien.org/articles/source/sfl-src.tgz. To unpack the source code, type tar -zxvf sfl-src.tgz at the prompt. To compile the code and create the random data file data.bin, run make.
Looking at the code starting with header files:
/* sfl.c sendfile example program
Dragan Stancevic <
header name                 function / variable
-------------------------------------------------*/
#include <stdio.h>          /* printf, perror */
#include <fcntl.h>          /* open */
#include <unistd.h>         /* close */
#include <errno.h>          /* errno */
#include <string.h>         /* memset */
#include <sys/socket.h>     /* socket */
#include <netinet/in.h>     /* sockaddr_in */
#include <sys/sendfile.h>   /* sendfile */
#include <arpa/inet.h>      /* inet_addr */
#define BUFF_SIZE (10*1024) /* size of the tmp buffer */
Besides the regular headers required for basic socket operation, we need a prototype definition of the sendfile system call. This can be found in the <sys/sendfile.h> header. The server flag:
/* are we sending or receiving */
if(argv[1][0] == 's') is_server++;
/* open descriptors */
sd = socket(PF_INET, SOCK_STREAM, 0);
if(is_server) fd = open("data.bin", O_RDONLY);
The same program can act as either a server/sender or a client/receiver. We have to check one of the command-prompt parameters, and then set the flag is_server to run in sender mode. We also open a stream socket of the INET protocol family. As part of running in server mode we need some type of data to transmit to a client, so we open our data file. We are using the system call sendfile to transmit data, so we don't have to read the actual contents of the file and store it in our program memory buffer. Here's the server address:
/* clear the memory */
memset(&sa, 0, sizeof(struct sockaddr_in));
/* initialize structure */
sa.sin_family = PF_INET;
sa.sin_port = htons(1033);
sa.sin_addr.s_addr = inet_addr(argv[2]);
We clear the server address structure and assign the protocol family, port and IP address of the server. The address of the server is passed as a command-line parameter. The port number is hard coded to unassigned port 1033. This port number was chosen because it is above the port range requiring root access to the system.
Here is the server execution branch:
if(is_server){
    int client; /* new client socket */
    printf("Server binding to [%s]\n", argv[2]);
    if(bind(sd, (struct sockaddr *)&sa,
                      sizeof(sa)) < 0){
        perror("bind");
        exit(errno);
    }
As a server, we need to assign an address to our socket descriptor. This is achieved by the system call bind, which assigns the socket descriptor (sd) a server address (sa):
if(listen(sd,1) < 0){
    perror("listen");
    exit(errno);
}
Because we are using a stream socket, we have to advertise our willingness to accept incoming connections and set the connection queue size. I've set the backlog queue to 1, but it is common to set the backlog a bit higher for established connections waiting to be accepted. In older versions of the kernel, the backlog queue was used to prevent syn flood attacks. Because the system call listen changed to set parameters for only established connections, the backlog queue feature has been deprecated for this call. The kernel parameter tcp_max_syn_backlog has taken over the role of protecting the system from syn flood attacks:
if((client = accept(sd, NULL, NULL)) < 0){
    perror("accept");
    exit(errno);
}
The system call accept creates a new connected socket from the first connection request on the pending connections queue. The return value from the call is a descriptor for a newly created connection; the socket is now ready for read, write or poll/select system calls:
if((cnt = sendfile(client,fd,&off,
                          BUFF_SIZE)) < 0){
    perror("sendfile");
    exit(errno);
}
printf("Server sent %d bytes.\n", cnt);
close(client);
A connection is established on the client socket descriptor, so we can start transmitting data to the remote system. We do this by calling the sendfile system call, which is prototyped under Linux in the following manner:
extern ssize_t
sendfile (int __out_fd, int __in_fd, off_t *offset,
          size_t __count) __THROW; 
The first two parameters are file descriptors. The third parameter points to an offset from which sendfile should start sending data. The fourth parameter is the number of bytes we want to transmit. In order for the sendfile transmit to use zero-copy functionality, you need memory gather operation support from your networking card. You also need checksum capabilities for protocols that implement checksums, such as TCP or UDP. If your NIC is outdated and doesn't support those features, you still can use sendfile to transmit files. The difference is the kernel will merge the buffers before transmitting them.

Portability Issues
One of the problems with the sendfile system call, in general, is the lack of a standard implementation, as there is for the open system call. Sendfile implementations in Linux, Solaris or HP-UX are quite different. This poses a problem for developers who wish to use zero copy in their network data transmission code.
One of the implementation differences is that Linux provides a sendfile that defines an interface for transmitting data between two file descriptors (file-to-file or file-to-socket). HP-UX and Solaris, on the other hand, can be used only for file-to-socket submissions.
The second difference is that Linux doesn't implement vectored transfers. Solaris sendfile and HP-UX sendfile have extra parameters that eliminate the overhead associated with prepending headers to the data being transmitted.
Looking Ahead
The implementation of zero copy under Linux is far from finished and is likely to change in the near future. More functionality should be added. For example, the sendfile call doesn't support vectored transfers, and servers such as Samba and Apache have to use multiple sendfile calls with the TCP_CORK flag set. This flag tells the system more data is coming through in the next sendfile calls. TCP_CORK also is incompatible with TCP_NODELAY and is used when we want to prepend or append headers to the data. This is a perfect example of where a vectored call would eliminate the need for multiple sendfile calls and delays mandated by the current implementation.
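The corking pattern described above looks roughly like this (a sketch of my own; sock is assumed to be a connected TCP socket, fd an open file, and hdr/hdr_len the header to prepend):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <unistd.h>

static void send_with_header(int sock, int fd, const char *hdr,
                             size_t hdr_len, size_t file_len)
{
    int on = 1, off_flag = 0;
    off_t off = 0;

    /* cork: tell TCP more data follows, so it can build full segments */
    setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));

    write(sock, hdr, hdr_len);            /* the prepended header */
    sendfile(sock, fd, &off, file_len);   /* the file payload     */

    /* uncork: flush whatever is still buffered */
    setsockopt(sock, IPPROTO_TCP, TCP_CORK, &off_flag, sizeof(off_flag));
}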
One rather unpleasant limitation in the current sendfile is it cannot be used when transferring files greater than 2GB. Files of such size are not all that uncommon today, and it's rather disappointing having to duplicate all that data on its way out. Because both sendfile and mmap methods are unusable in this case, a sendfile64 would be really handy in a future kernel version.
Conclusion
Despite some drawbacks, zero-copy sendfile is a useful feature, and I hope you have found this article informative enough to start using it in your programs. If you have a more in-depth interest in the subject, keep an eye out for my second article, titled “Zero Copy II: Kernel Perspective”, where I will dig a bit more into the kernel internals of zero copy.

Friday, November 15, 2013

What are high memory and low memory on Linux?

On a 32-bit architecture, the address space range for addressing RAM is:
0x00000000 - 0xffffffff
or 4'294'967'295 (4 GB).
The Linux kernel splits that up 3/1 (it could also be 2/2 or 1/3) into user space (high memory) and kernel space (low memory) respectively.
The user space range:
0x00000000 - 0xbfffffff
Every newly spawned user process gets an address (range) inside this area. User processes are generally untrusted and therefore are forbidden to access the kernel space. Further, they are considered non-urgent; as a general rule, the kernel tries to defer the allocation of memory to those processes.
The kernel space range:
0xc0000000 - 0xffffffff
A kernel process gets its address (range) here. The kernel can directly access this 1 GB of addresses (well, not the full 1 GB; there are 128 MB reserved for high memory access).
Processes spawned in kernel space are trusted, urgent and assumed error-free; the memory request gets processed immediately.
Every kernel process can also access the user space range if it wishes to. To achieve this, the kernel maps an address from the user space (the high memory) into its kernel space (the low memory); the 128 MB mentioned above are reserved especially for this.

What is difference between User space and Kernel space

Is Kernel space used when Kernel is executing on the behalf of the user program i.e. System Call? Or is it the address space for all the Kernel threads (for example scheduler)?
Yes and yes.
Before we go any further, we should state this about memory.
Memory gets divided into two distinct areas:
  • The user space, which is the set of locations where normal user processes run (i.e. everything other than the kernel). The role of the kernel is to keep the applications running in this space from messing with each other, and with the machine, and
  • The kernel space, which is the location where the code of the kernel is stored and executes.
Processes running in user space have access only to a limited part of memory, whereas the kernel has access to all of the memory. Processes running in user space also don't have access to the kernel space. User space processes can only access a small part of the kernel via an interface exposed by the kernel: the system calls. If a process performs a system call, a software interrupt is sent to the kernel, which then dispatches the appropriate interrupt handler and continues its work after the handler has finished.
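To make that boundary concrete, here is a tiny sketch (my own addition): the libc write() wrapper and a direct syscall(2) invocation enter the kernel through exactly the same system-call mechanism:

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    const char msg[] = "hello from user space\n";

    write(STDOUT_FILENO, msg, sizeof(msg) - 1);              /* libc wrapper  */
    syscall(SYS_write, STDOUT_FILENO, msg, sizeof(msg) - 1); /* direct entry  */

    printf("my pid (via SYS_getpid): %ld\n", syscall(SYS_getpid));
    return 0;
}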
Kernel space code has the property of running in "kernel mode", which (on your typical desktop x86 computer) is the code that executes under ring 0. Typically, in the x86 architecture there are 4 rings of protection: Ring 0 (kernel mode), Ring 1 (may be used by virtual machine hypervisors or drivers), Ring 2 (may be used by drivers, though I am not so sure about that) and Ring 3. Ring 3 is what typical applications run under. It is the least privileged ring, and applications running on it have access to a subset of the processor's instructions. Ring 0 (kernel space) is the most privileged ring, and has access to all of the machine's instructions. For example, a "plain" application (like a browser) cannot use the x86 assembly instruction lgdt to load the global descriptor table, or hlt to halt the processor.
If it is the first one, then does it mean that a normal user program cannot have more than 3 GB of memory (if the division is 3 GB + 1 GB)? Also, in that case, how can the kernel use high memory, because to what virtual memory address will the pages from high memory be mapped, as 1 GB of kernel space will be logically mapped?
For an answer to this, please refer to the excellent answer by wag here

Thursday, November 14, 2013

NAT DSlite 6rd NAT64


NAT

Network components





1. DS lite
DS-Lite enables service providers to natively allocate IPv6 addresses to new customers while continuing to support IPv4 customers. The main functional components involved in DS-Lite are the B4 (Basic Bridging BroadBand) element and the AFTR (Address Family Transition Router), as shown in the figure below:
The following sequence describes the connection establishment process using DS-Lite:
  1. Host with private IPv4 address initiates a connection to a resource on the public internet
  2. Traffic is sent to B4, which is the default gateway
  3. B4, using its service-provider-network-facing IPv6 address, establishes the tunnel with the AFTR. The address of the AFTR can be pre-configured or can be discovered using DHCPv6
  4. B4 encapsulates the IPv4 packets in IPv6 transport and sends them across to the AFTR
  5. The AFTR terminates the tunnel and decapsulates the IPv4 packet
  6. The AFTR device performs IPv4-IPv4 NAT before sending traffic to the destination IPv4 network

There are many benefits that DS Lite provides:
  1. A lightweight solution to allow IPv4 connectivity over IPv6 network
  2. Avoids the need of multiple levels of NAT as in case of LSN
  3. Allows service providers to move their core and access networks to IPv6 thus enabling them to benefit from IPv6 advantages
  4. Allows coexistence of IPv4 and IPv6
  5. Helps resolve IPv4 address scarcity issue
  6. Allows incremental migration to a native IPv6 environment
But as always is the case, benefits don’t come without its own set of challenges:
  1. DS Lite does not enable IPv6 and IPv4 hosts to talk to each other
  2. Increases the size of traffic due to tunnel headers – requires MTU management to avoid fragmentation
  3. Need to manage and maintain bindings between customer addresses and public addresses used for translation in the AFTR device
  4. Brings in additional challenges for DPI in service provider network
2. 6rd
6rd is a mechanism to facilitate IPv6 rapid deployment across IPv4 infrastructures of Internet service providers (ISPs).

6rd is essentially the opposite of DS-Lite (just swap the roles of IPv4 and IPv6).

It is derived from 6to4, a preexisting mechanism to transfer IPv6 packets over the IPv4 network, with the significant change that it operates entirely within the end-user's ISP's network, thus avoiding the major architectural problems inherent in the original design of 6to4. The name 6rd is a reference to both the rapid deployments of IPv6 it makes possible and, informally, the initials (RD) of its inventor, Rémi Després. A description of 6rd principles and of how they were first used by Free is published in RFC 5569.[1]

6rd allows Service Providers to provision IPv6 addresses to end customers without upgrading their core infrastructure to IPv6. 6rd enables IPv6 hosts separated by IPv4 networks to communicate with each other by establishing an IPv4 tunnel. The tunnel origination point on the sender's side of the tunnel encapsulates the IPv6 traffic within IPv4 packets, and sends them over IPv4 to the device at the remote end of the tunnel. The device on the other end of the tunnel decapsulates the packets and sends them over the IPv6 network to their destination.

Though 6rd helps ISPs provision IPv6 connectivity to end users, it does not allow IPv6 clients to talk with IPv4 servers. For that to work, solutions like NAT64 / SLB64 are required.

3. NAT64

This brings in the challenge of ensuring that new IPv6-enabled devices are able to access any content – hosted on IPv4 or IPv6. NAT64 is a technology that provides the bridge between IPv6 and IPv4 by transforming the traffic into the protocol that the other side understands.


NAT64 does the translation from IPv6 to IPv4 based on a pre-assigned /96 prefix that is carried in the destination IPv6 address of packets. The last 32 bits of the IPv6 address carry the IPv4 address of the destination IPv4 host. As the traffic passes through the NAT64 device, it looks for the prefix match – if a match is found, the device knows that the destination is an IPv4 host and needs translation. Unlike the good old NAT devices, NAT64 devices need to perform protocol transformation – creating an IPv4 header based on the information in the IPv6 header. There are additional requirements to make sure that two disparate networks can talk to each other, like ICMP translation along with translation of traditional applications that embed Layer 3/4 information in the packets (e.g. FTP, SIP etc.).
Now the question is – how does the connection-initiating IPv6 host know the destination address? This gap is filled by the DNS64 device. Whenever the DNS64 device gets a query (AAAA) to resolve a name, it first tries to fetch the IPv6 address. If the address is found, it is returned to the host, but if there is no IPv6 address, the DNS64 device gets the IPv4 address, prepends it with the preconfigured 96-bit prefix and returns the result to the host. The following diagram shows the sequence of events when an IPv6 host tries to connect with an IPv4 server.
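As a concrete illustration of the address synthesis (not part of the original text; the well-known prefix 64:ff9b::/96 from RFC 6052 is used as an example, a deployment may use its own /96 prefix):

#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>

int main(void)
{
    struct in_addr  v4;
    struct in6_addr v6;
    char out[INET6_ADDRSTRLEN];

    inet_pton(AF_INET,  "192.0.2.33", &v4);   /* IPv4-only server           */
    inet_pton(AF_INET6, "64:ff9b::",  &v6);   /* NAT64 /96 prefix (example) */

    /* embed the IPv4 address in the last 32 bits of the IPv6 address */
    memcpy(&v6.s6_addr[12], &v4, 4);

    inet_ntop(AF_INET6, &v6, out, sizeof(out));
    printf("synthesized AAAA answer: %s\n", out);   /* 64:ff9b::c000:221 */
    return 0;
}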


In the early stages of the IPv6 transition, content providers are primarily providing access to web content hosted on IPv4 servers. In such scenarios SLB64 offers advantages over pure NAT64 technology. SLB64 is server load balancing that exposes IPv6 connection points for IPv4 servers – so SLB64 devices not only provide translation but also the other benefits that come along with advanced ADCs like NetScaler.
There are many detractors of NAT, yet there is no denying that NAT has been in use for years and is not going away any time soon. For the IPv6 transition it is emerging as a very strong enabling technology.