Ray Of Hope

Interrupt Path on Xen and on Baremetal system.

I was exploring an issue where the interrupt vector of a QLogic device corresponding to a particular Tx ring remained masked after the device came up or was reset. As a result, no packet transfer was possible on the port using that Tx queue, and after some time of inactivity the port would reset.

The device behaved properly on a baremetal system, however: there the interrupt vectors corresponding to all the Tx rings were properly unmasked.

The following is what I learned while tracking down the issue. It explores how interrupt handling works on baremetal versus dom0 under the Xen hypervisor.

Device used: qlcnic
Baremetal OS: Linux
Hypervisor: Xen
Non-Baremetal: Dom0

On Baremetal system:
ql82xx_setup_intr() qlcnic_main.c --> qlc_enable_msix() qlc_main.c --> pci_enable_msix() drivers/pci/msi.c --> msix_capability_init() drivers/pci/msi.c --> msix_program_entries() --> msix_mask_irq() --> which leads to entry->masked = 1;

As part of qlcnic driver initialization, the following calls are made. They set entry->masked to 0, and thus the interrupts are enabled.
qlcnic_open() --> qlcnic_attach() --> qlcnic_request_irq() qlcnic_main.c --> request_irq() include/linux/interrupt.h --> request_threaded_irq() kernel/irq/manage.c --> __setup_irq() kernel/irq/manage.c --> irq_startup() kernel/irq/chip.c --> irq_enable() --> chip->irq_unmask, set in io_apic.c as ".irq_unmask = unmask_msi_irq" --> unmask_msi_irq() drivers/pci/msi.c --> msix_mask_irq() --> entry->masked = 0.
So the mask for these interrupts is cleared, the device is in working condition, and the QLogic card can receive MSI/MSI-X interrupts for packet transfers.
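
To make the "entry->masked = 1, later entry->masked = 0" transition concrete, here is a minimal user-space sketch of what masking an MSI-X vector amounts to. The table layout (16 bytes per entry, Vector Control at byte offset 12, mask in bit 0) comes from the PCI MSI-X specification; everything named fake_* is hypothetical and only stands in for the kernel's real structures, and the table is a plain array rather than device MMIO.

```c
#include <assert.h>
#include <stdint.h>

/* Layout facts from the PCI MSI-X specification. */
#define MSIX_ENTRY_SIZE        16u /* bytes per MSI-X table entry */
#define MSIX_ENTRY_VECTOR_CTRL 12u /* byte offset of the Vector Control word */
#define MSIX_CTRL_MASKBIT      1u  /* bit 0 set means the vector is masked */

/* Hypothetical stand-in for the kernel's struct msi_desc. */
struct fake_msi_desc {
    uint32_t *table;   /* simulated MSI-X table (really device MMIO) */
    unsigned entry_nr; /* index of this vector in the table */
    uint32_t masked;   /* cached copy of the Vector Control word */
};

/* Set (flag != 0) or clear (flag == 0) the mask bit of one table entry. */
static uint32_t fake_msix_mask_irq(struct fake_msi_desc *desc, uint32_t flag)
{
    unsigned word = (desc->entry_nr * MSIX_ENTRY_SIZE +
                     MSIX_ENTRY_VECTOR_CTRL) / 4;
    uint32_t bits = desc->masked & ~MSIX_CTRL_MASKBIT;

    assert(desc->table != 0);
    if (flag)
        bits |= MSIX_CTRL_MASKBIT;
    desc->table[word] = bits; /* the kernel would use writel() here */
    return bits;
}

/* What happens while the MSI-X capability is being set up. */
static void fake_mask_irq(struct fake_msi_desc *desc)
{
    desc->masked = fake_msix_mask_irq(desc, 1);
}

/* What happens at the end of the request_irq() path on baremetal. */
static void fake_unmask_irq(struct fake_msi_desc *desc)
{
    desc->masked = fake_msix_mask_irq(desc, 0);
}
```

In this model the bug described at the top would correspond to fake_unmask_irq() never taking effect for one entry, so its Vector Control word stays at 1.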

On a system with dom0 and the Xen hypervisor:
The initial calls remain the same, as this code path is common to the baremetal kernel and dom0.
ql82xx_setup_intr() qlcnic_main.c --> qlc_enable_msix() qlc_main.c --> pci_enable_msix() drivers/pci/msi.c --> msix_capability_init() drivers/pci/msi.c --> msix_program_entries() --> msix_mask_irq() --> which leads to entry->masked = 1;

qlcnic_open() --> qlcnic_attach() --> qlcnic_request_irq() qlcnic_main.c --> request_irq() include/linux/interrupt.h --> request_threaded_irq() kernel/irq/manage.c --> __setup_irq() kernel/irq/manage.c --> irq_startup() kernel/irq/chip.c --> which calls chip->irq_startup, set in events_base.c as ".irq_startup = startup_pirq" --> startup_pirq() --> HYPERVISOR_event_channel_op(EVTCHNOP_bind_pirq, &bind_pirq) // hypercall traps into the Xen hypervisor.

Xen hypervisor
do_event_channel_op() { case EVTCHNOP_bind_pirq: { evtchn_bind_pirq() } } --> evtchn_bind_pirq() event_channel.c --> pirq_guest_bind() --> startup_msi_irq() --> unmask_msi_irq() xen/msi.c --> msi_set_mask_bit() xen/msi.c --> entry->msi_attrib.masked = 0.

The Xen hypervisor also has request_irq(), setup_irq() and irq_startup() functions, but they are only used for PCI passthrough devices. Ordinary devices driven by a Linux dom0 or guest kernel use the kernel's own request_irq() to request an irq.
Once the device is operating and raises an interrupt, Xen uses event channels to deliver that interrupt to the guest. The vector originally generated by the device may not be the one Xen uses to forward the interrupt to the guest, and it may not have an entry in the MSI-X table. Xen maintains the mapping between the vector generated by the device and the vector forwarded to the guest, and it also maintains the MSI-X table.
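
The mapping Xen maintains can be pictured as a lookup table from the physical interrupt to the event channel port the guest bound to it. The sketch below is a toy model under that assumption only; Xen's real bookkeeping is richer, and every name here (fake_pirq_map, fake_bind_pirq, fake_route_interrupt) is hypothetical.

```c
#include <assert.h>

#define FAKE_NR_PIRQS 16
#define FAKE_NO_PORT  (-1)

/* Hypothetical per-guest table: physical irq -> event channel port. */
struct fake_pirq_map {
    int port_of_pirq[FAKE_NR_PIRQS];
};

static void fake_map_init(struct fake_pirq_map *m)
{
    for (int i = 0; i < FAKE_NR_PIRQS; i++)
        m->port_of_pirq[i] = FAKE_NO_PORT;
}

/* Models the effect of the bind hypercall: remember which port the guest
 * wants notified for this pirq. Returns 0 on success, -1 on error. */
static int fake_bind_pirq(struct fake_pirq_map *m, int pirq, int port)
{
    if (pirq < 0 || pirq >= FAKE_NR_PIRQS || port < 0)
        return -1;
    if (m->port_of_pirq[pirq] != FAKE_NO_PORT)
        return -1; /* already bound */
    m->port_of_pirq[pirq] = port;
    return 0;
}

/* Models the hypervisor receiving the device interrupt: the guest never
 * sees the hardware vector, only the port number looked up here. */
static int fake_route_interrupt(const struct fake_pirq_map *m, int pirq)
{
    if (pirq < 0 || pirq >= FAKE_NR_PIRQS)
        return FAKE_NO_PORT;
    return m->port_of_pirq[pirq];
}
```

The point of the model is that the number the guest is notified with (the port) is independent of whatever vector the device actually raised.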

Another question that arises is how the path diverges after irq_startup() between a baremetal kernel and dom0. Answer: during kernel init the pv_ops data structure is filled in, and it assigns a different function pointer to irq_startup in each of the two cases.
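
One way to picture the "different pointer values" is a per-interrupt table of callbacks, where the generic startup code prefers an irq_startup callback and otherwise falls back to unmasking. The sketch below assumes exactly that fallback rule and nothing more; the fake_* types are hypothetical simplifications, and only the callback names irq_startup and irq_unmask and the targets startup_pirq and unmask_msi_irq are taken from the call chains above.

```c
#include <assert.h>
#include <stddef.h>

enum fake_path { PATH_NONE, PATH_NATIVE_UNMASK, PATH_XEN_BIND_PIRQ };

struct fake_irq_desc;

/* Hypothetical, cut-down callback table attached to each interrupt. */
struct fake_irq_chip {
    void (*irq_startup)(struct fake_irq_desc *desc);
    void (*irq_unmask)(struct fake_irq_desc *desc);
};

struct fake_irq_desc {
    const struct fake_irq_chip *chip;
    enum fake_path taken; /* records which path ran, for illustration */
};

/* Baremetal: the guest kernel clears the MSI-X mask bit itself. */
static void fake_unmask_msi_irq(struct fake_irq_desc *desc)
{
    desc->taken = PATH_NATIVE_UNMASK;
}

/* dom0: ask the hypervisor to bind the pirq; Xen does the unmasking. */
static void fake_startup_pirq(struct fake_irq_desc *desc)
{
    desc->taken = PATH_XEN_BIND_PIRQ;
}

static const struct fake_irq_chip fake_native_msi_chip = {
    .irq_startup = NULL,
    .irq_unmask  = fake_unmask_msi_irq,
};

static const struct fake_irq_chip fake_xen_pirq_chip = {
    .irq_startup = fake_startup_pirq,
    .irq_unmask  = NULL,
};

/* Common code: identical on both systems, behavior differs via pointers. */
static void fake_irq_startup(struct fake_irq_desc *desc)
{
    if (desc->chip->irq_startup)
        desc->chip->irq_startup(desc);
    else if (desc->chip->irq_unmask)
        desc->chip->irq_unmask(desc);
}
```

The driver-facing call sequence is therefore the same in both environments; only the table installed at init time decides whether the kernel touches the mask or issues a hypercall.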

The basic problem is as described in http://marc.info/?l=linux-kernel&m=138366196704831&w=4

“Part of this problem is that all of the interrupt vector setting (either
be it GSI, MSI or MSI-X) is handled by the hypervisor. That means the kernel
consults the hypervisor for the right ‘vector’ value for all of the different
types of interrupts. And that ‘vector’ value is not even used – the interrupts
first hit the hypervisor – which dispatches them to the guest via a software
event channel mechanism (a bitmap of ‘active’ events – and an event can be
tied to a physical interrupt or an IPI, etc).

Even more recently we have been clamping down – so that the kernel pagetables
for the MSI-X tables for example are R/O – so it can’t write (or over-write)
with a different vector value (or the same one). The hypervisor is the one
that does this change.

Perhaps a different way of fixing this is making the ‘__msi_mask_irq’ and
‘__msix_mask_irq’ be part of the x86.msi function ops? And then the platform
can over-write it with its own mechanism for masking/unmasking? (and in case
of Xen it would be a nop as that has already been done by the hypervisor?)

The ‘write_msi_msg’ we don’t have to worry about as it is only used by
default_restore_msi_irqs (which is part of the x86.msi and can be over-written).”
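
The "bitmap of 'active' events" mentioned in the quoted mail can be pictured roughly as follows. This is a toy model with a single 64-bit word per bitmap, not Xen's actual shared-memory layout, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define FAKE_NR_PORTS 64u

/* Hypothetical state shared between hypervisor and guest. */
struct fake_evtchn_state {
    uint64_t pending; /* bit n set: an event on port n is waiting */
    uint64_t masked;  /* bit n set: do not deliver port n right now */
    unsigned delivered[FAKE_NR_PORTS]; /* per-port delivery counter */
};

/* Hypervisor side: a physical interrupt or IPI arrived for this port. */
static void fake_send_event(struct fake_evtchn_state *s, unsigned port)
{
    assert(port < FAKE_NR_PORTS);
    s->pending |= (uint64_t)1 << port;
}

/* Guest side: consume every pending, unmasked port.
 * Returns the number of events handled. */
static unsigned fake_handle_events(struct fake_evtchn_state *s)
{
    unsigned handled = 0;

    for (unsigned port = 0; port < FAKE_NR_PORTS; port++) {
        uint64_t bit = (uint64_t)1 << port;

        if ((s->pending & bit) && !(s->masked & bit)) {
            s->pending &= ~bit;   /* acknowledge the event */
            s->delivered[port]++; /* stand-in for calling the handler */
            handled++;
        }
    }
    return handled;
}
```

A masked port keeps its pending bit set, so the event is not lost; it is delivered once the port is unmasked and the bitmap is scanned again.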

What we aim to achieve is to prevent the guest kernel / pvops kernel (dom0 or domU) from modifying the MSI-X table and from masking or unmasking interrupts. It is the sole responsibility of the hypervisor to make these changes; the guest kernel shouldn't do that.

Relevant patch sections, with explanation.

/******* If it is baremetal, then run this function and allow mask/unmask of the irq */
--- a/arch/x86/include/asm/pci.h
+++ b/arch/x86/include/asm/pci.h
@@ -123,6 +123,8 @@ static inline void x86_restore_msi_irqs(struct pci_dev *dev, int irq)
#define arch_teardown_msi_irqs x86_teardown_msi_irqs
#define arch_teardown_msi_irq x86_teardown_msi_irq
#define arch_restore_msi_irqs x86_restore_msi_irqs
+#define arch_msi_mask_irq x86_msi_mask_irq
+#define arch_msix_mask_irq x86_msix_mask_irq
/* implemented in arch/x86/kernel/apic/io_apic. */
struct msi_desc;
int native_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
@@ -135,6 +137,14 @@ int setup_msi_irq(struct pci_dev *dev, struct msi_desc *msidesc,
void default_teardown_msi_irqs(struct pci_dev *dev);
void default_restore_msi_irqs(struct pci_dev *dev, int irq);

+static inline u32 x86_msi_mask_irq(struct msi_desc *desc, u32 mask, u32 flag)
+{
+	return x86_msi.msi_mask_irq(desc, mask, flag);
+}
+static inline u32 x86_msix_mask_irq(struct msi_desc *desc, u32 flag)
+{
+	return x86_msi.msix_mask_irq(desc, flag);
+}

diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index 48e8461..8cfc525 100644
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -383,6 +383,14 @@ static void xen_teardown_msi_irq(unsigned int irq)

/*** If it is a guest kernel / pv_ops kernel (dom0 or domU), then don't modify the MSI/MSI-X mask value. Remember, only the Xen hypervisor is allowed to mask or unmask interrupts, be they MSI/MSI-X, NMIs or any other interrupts. If a guest is running in a Xen environment and is aware of it (a PV kernel, whether domU or dom0), it has to pass every interrupt, and any request to mask or unmask one, to Xen. ***/
+static u32 xen_nop_msi_mask_irq(struct msi_desc *desc, u32 mask, u32 flag)
+{
+	return 0;
+}
+static u32 xen_nop_msix_mask_irq(struct msi_desc *desc, u32 flag)
+{
+	return 0;
+}

int __init pci_xen_init(void)
@@ -406,6 +414,8 @@ int __init pci_xen_init(void)
x86_msi.setup_msi_irqs = xen_setup_msi_irqs;
x86_msi.teardown_msi_irq = xen_teardown_msi_irq;
x86_msi.teardown_msi_irqs = xen_teardown_msi_irqs;
+ x86_msi.msi_mask_irq = xen_nop_msi_mask_irq;
+ x86_msi.msix_mask_irq = xen_nop_msix_mask_irq;
return 0;
@@ -485,6 +495,8 @@ int __init pci_xen_initial_domain(void)
x86_msi.setup_msi_irqs = xen_initdom_setup_msi_irqs;
x86_msi.teardown_msi_irq = xen_teardown_msi_irq;
x86_msi.restore_msi_irqs = xen_initdom_restore_msi_irqs;
+ x86_msi.msi_mask_irq = xen_nop_msi_mask_irq;
+ x86_msi.msix_mask_irq = xen_nop_msix_mask_irq;
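
The effect of the patch can be imitated in user space with a small table of function pointers that a platform overrides at init time. This is only a sketch of the idea: the fake_* names are hypothetical stand-ins for x86_msi, pci_xen_init() and arch_msix_mask_irq(), and the "default" implementation here just flips a field instead of writing to the device.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct msi_desc. */
struct fake_msi_desc {
    uint32_t masked; /* 1 = vector masked, 0 = unmasked */
};

/* Hypothetical stand-in for the x86_msi ops table. */
struct fake_x86_msi_ops {
    uint32_t (*msix_mask_irq)(struct fake_msi_desc *desc, uint32_t flag);
};

/* Baremetal behavior: the kernel really changes the mask state. */
static uint32_t fake_default_msix_mask_irq(struct fake_msi_desc *desc,
                                           uint32_t flag)
{
    desc->masked = flag ? 1u : 0u;
    return desc->masked;
}

/* Xen behavior: do nothing, the hypervisor owns the mask bit. */
static uint32_t fake_xen_nop_msix_mask_irq(struct fake_msi_desc *desc,
                                           uint32_t flag)
{
    (void)desc;
    (void)flag;
    return 0;
}

static struct fake_x86_msi_ops fake_x86_msi = {
    .msix_mask_irq = fake_default_msix_mask_irq,
};

/* Models the two added assignments in the Xen init functions. */
static void fake_pci_xen_init(void)
{
    fake_x86_msi.msix_mask_irq = fake_xen_nop_msix_mask_irq;
}

/* Models the arch_msix_mask_irq() indirection used by generic code. */
static uint32_t fake_arch_msix_mask_irq(struct fake_msi_desc *desc,
                                        uint32_t flag)
{
    return fake_x86_msi.msix_mask_irq(desc, flag);
}
```

Generic code keeps calling the same entry point; after the Xen init hook has run, that call no longer changes any mask state in the guest.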

queries: anshul_makkar@justkernel.com

