Then netmap came along: it implements a high-performance network I/O framework, yet its code base is small, which makes it well suited to study and research.

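A brief introduction to netmap

First, thanks to netmap's author for creating netmap and generously sharing its design and code. The netmap documentation is well written, so here I only summarize why netmap achieves high performance.

It uses mmap to map the NIC driver's ring memory into user space, so user-space code can access the raw packets directly and the copies between kernel and user space are avoided. (I had been thinking of writing something like this just a couple of days ago.)
It stores packets in pre-allocated, fixed-size buffers, which removes the kernel's usual per-packet dynamic allocation. For network devices, a fixed-size memory pool is far more efficient than the buddy allocator; I had mentioned this to Bean_lee before.
It processes packets in batches, which reduces the number of system calls.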
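
The batching point can be illustrated with a toy ring in plain user-space C. This is only a sketch under my own assumptions: toy_ring, toy_produce and toy_drain are hypothetical names, not netmap's API, and the ring lives in ordinary memory rather than in an mmap'ed region. What it shows is that the consumer handles every available slot in a single pass instead of paying one call per packet.

```c
#define TOY_SLOTS 8            /* hypothetical ring size */

struct toy_slot { unsigned len; };

struct toy_ring {
    unsigned cur;              /* next slot the consumer will read */
    unsigned avail;            /* slots ready for the consumer */
    struct toy_slot slot[TOY_SLOTS];
};

/* Advance an index with wraparound. */
static unsigned toy_next(unsigned i)
{
    return (i + 1 == TOY_SLOTS) ? 0 : i + 1;
}

/* Producer side: publish up to n packets of the given length. */
static void toy_produce(struct toy_ring *r, unsigned n, unsigned len)
{
    unsigned i = (r->cur + r->avail) % TOY_SLOTS;

    while (n-- > 0 && r->avail < TOY_SLOTS) {
        r->slot[i].len = len;
        i = toy_next(i);
        r->avail++;
    }
}

/* Consumer side: drain everything that is available in one pass and
 * return the total number of bytes seen. */
static unsigned toy_drain(struct toy_ring *r)
{
    unsigned bytes = 0;

    for (; r->avail > 0; r->avail--) {
        bytes += r->slot[r->cur].len;
        r->cur = toy_next(r->cur);
    }
    return bytes;
}
```

In a real setup the producer would be the NIC and the kernel, and a single poll or sync call would make a whole batch of slots visible before a loop like toy_drain runs.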

For more detail, go straight to the official netmap website, which is very thorough. It is in English, but it is worth reading patiently; there is a lot to gain.

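netmap source code analysis

As the introduction above suggests, netmap unavoidably has to modify the NIC driver, but the changes are small.

Driver changes

I use e1000.c as the example. netmap was first implemented on FreeBSD, and to keep the Linux changes minimal it relies heavily on macros, which makes the code somewhat harder to read.

Changes to e1000_probe

I don't write drivers, and there is a lot in e1000_probe that I don't understand, but that doesn't get in the way of analyzing netmap. The netmap patch shows that e1000_netmap_attach is called once e1000 has finished its hardware initialization and registered the device successfully.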

@@ -1175,6 +1183,10 @@ static int __devinit e1000_probe(struct
if (err)
goto err_register;

+#ifdef DEV_NETMAP
+   e1000_netmap_attach(adapter);
+#endif /* DEV_NETMAP */
+ 

/* print bus type/speed/width info */
e_info(probe, "(PCI%s:%dMHz:%d-bit) %pM\n",
((hw->bus_type == e1000_bus_type_pcix) ? "-X" : ""),

Here is the code of e1000_netmap_attach:

static void
e1000_netmap_attach(struct SOFTC_T *adapter)
{
    struct netmap_adapter na;
    bzero(&na, sizeof(na));

    na.ifp = adapter->netdev;
    na.separate_locks = 0;
    na.num_tx_desc = adapter->tx_ring[0].count;
    na.num_rx_desc = adapter->rx_ring[0].count;
    na.nm_register = e1000_netmap_reg;
    na.nm_txsync = e1000_netmap_txsync;
    na.nm_rxsync = e1000_netmap_rxsync;
    netmap_attach(&na, 1);
}

SOFTC_T is a macro; for e1000 it is actually e1000_adapter, the private data of the e1000 NIC driver. Below is the definition of struct netmap_adapter:

/*
* This struct extends the 'struct adapter' (or
* equivalent) device descriptor. It contains all fields needed to
* support netmap operation.
*/
struct netmap_adapter {
/*
* On linux we do not have a good way to tell if an interface
* is netmap-capable. So we use the following trick:
* NA(ifp) points here, and the first entry (which hopefully
* always exists and is at least 32 bits) contains a magic
* value which we can use to detect that the interface is good.
*/
uint32_t magic;
uint32_t na_flags;  /* future place for IFCAP_NETMAP */
int refcount; /* number of user-space descriptors using this
interface, which is equal to the number of
struct netmap_if objs in the mapped region. */
/*
* The selwakeup in the interrupt thread can use per-ring
* and/or global wait queues. We track how many clients
* of each type we have so we can optimize the drivers,
* and especially avoid huge contention on the locks.
*/
int na_single;  /* threads attached to a single hw queue */
int na_multi;   /* threads attached to multiple hw queues */

int separate_locks; /* set if the interface suports different
locks for rx, tx and core. */

u_int num_rx_rings; /* number of adapter receive rings */
u_int num_tx_rings; /* number of adapter transmit rings */

u_int num_tx_desc; /* number of descriptor in each queue */
u_int num_rx_desc;


/* tx_rings and rx_rings are private but allocated
* as a contiguous chunk of memory. Each array has
* N+1 entries, for the adapter queues and for the host queue.
*/
struct netmap_kring *tx_rings; /* array of TX rings. */
struct netmap_kring *rx_rings; /* array of RX rings. */

NM_SELINFO_T tx_si, rx_si;  /* global wait queues */

/* copy of if_qflush and if_transmit pointers, to intercept
* packets from the network stack when netmap is active.
*/
int     (*if_transmit)(struct ifnet *, struct mbuf *);

/* references to the ifnet and device routines, used by
* the generic netmap functions.
*/
struct ifnet *ifp; /* adapter is ifp->if_softc */

NM_LOCK_T core_lock;    /* used if no device lock available */

int (*nm_register)(struct ifnet *, int onoff);
void (*nm_lock)(struct ifnet *, int what, u_int ringid);
int (*nm_txsync)(struct ifnet *, u_int ring, int lock);
int (*nm_rxsync)(struct ifnet *, u_int ring, int lock);

int bdg_port;
#ifdef linux
struct net_device_ops nm_ndo;
int if_refcount;    // XXX additions for bridge
#endif /* linux */
};

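As struct netmap_adapter shows, netmap's comments are quite detailed. From here on I won't list netmap's struct definitions; you can look them up yourself, which keeps the article from being wall-to-wall code. How many companies manage comments like these?

After e1000_netmap_attach finishes its simple initialization, it calls netmap_attach to do the real attach work. The former handles the driver-specific part, or the preparation; the latter is the actual attach.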

int
netmap_attach(struct netmap_adapter *na, int num_queues)
{
    int n, size;
    void *buf;
    /* ifnet is again a macro; on Linux it is actually net_device */
    struct ifnet *ifp = na->ifp;

    if (ifp == NULL) {
        D("ifp not set, giving up");
        return EINVAL;
    }
    /* clear other fields ? */
    na->refcount = 0;
    /* initialize the number of receive and transmit rings */
    if (na->num_tx_rings == 0)
        na->num_tx_rings = num_queues;
    na->num_rx_rings = num_queues;
    /* on each direction we have N+1 resources
     * 0..n-1   are the hardware rings
     * n        is the ring attached to the stack.
     */
    /*
     * With comments this detailed there is little for me to add.
     * Rings 0..n-1 are the hardware rings and ring n is the queue of
     * the local protocol stack. The "+ 2" below is one such host ring
     * for tx plus one for rx, not a sentinel.
     */
    n = na->num_rx_rings + na->num_tx_rings + 2;
    /* the netmap_adapter and its rings are allocated in one chunk */
    size = sizeof(*na) + n * sizeof(struct netmap_kring);

    /*
     * malloc here is really kmalloc.
     * There is a small trick as well: M_DEVBUF, M_NOWAIT and M_ZERO are
     * FreeBSD definitions, so how are they used on Linux?
     * At first I assumed they were defined as the matching Linux flags,
     * such as GFP_ATOMIC and __GFP_ZERO, but grepping for M_NOWAIT
     * turned up no macro definition at all.
     * The answer is in the definition of the malloc macro, which carries
     * the comment "use volatile to fix a probable compiler error on 2.6.25":
     *
     *   #define malloc(_size, type, flags)                      \
     *           ({ volatile int _v = _size; kmalloc(_v, GFP_ATOMIC | __GFP_ZERO); })
     *
     * type and flags are never referenced in the expansion, so on Linux
     * M_DEVBUF and the flags above are simply discarded.
     */
    buf = malloc(size, M_DEVBUF, M_NOWAIT | M_ZERO);
    if (buf) {
        /* on Linux, struct net_device->ax25_ptr is reused to hold the address of buf */
        WNA(ifp) = buf;
        /* set up tx_rings and rx_rings; the extra entry between them is the
         * host-stack tx ring (each array has N+1 entries, see struct netmap_adapter) */
        na->tx_rings = (void *)((char *)buf + sizeof(*na));
        na->rx_rings = na->tx_rings + na->num_tx_rings + 1;
        /* copy the netmap_adapter and flag the interface as netmap-capable */
        bcopy(na, buf, sizeof(*na));
        NETMAP_SET_CAPABLE(ifp);

        na = buf;
        /* Core lock initialized here.  Others are initialized after
         * netmap_if_new.
         */
        mtx_init(&na->core_lock, "netmap core lock", MTX_NETWORK_LOCK,
            MTX_DEF);
        if (na->nm_lock == NULL) {
            ND("using default locks for %s", ifp->if_xname);
            na->nm_lock = netmap_lock_wrapper;
        }
    }
    /* these Linux-only lines prepare for the Linux NIC driver framework; they are used later */
#ifdef linux
    if (ifp->netdev_ops) {
        D("netdev_ops %p", ifp->netdev_ops);
        /* prepare a clone of the netdev ops */
        na->nm_ndo = *ifp->netdev_ops;
    }
    na->nm_ndo.ndo_start_xmit = linux_netmap_start;
#endif
    D("%s for %s", buf ? "ok" : "failed", ifp->if_xname);

    return (buf ? 0 : ENOMEM);
}
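
Two ideas in netmap_attach are easy to try outside the kernel: the single allocation that holds the adapter followed by N+1 tx rings and N+1 rx rings, and a malloc-style macro that silently drops arguments it never expands. The sketch below is a user-space imitation under my own assumptions; toy_adapter, toy_kring, toy_malloc and the TOY_* flag names are hypothetical, and calloc stands in for kmalloc.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel types; only the layout matters. */
struct toy_kring { unsigned id; };
struct toy_adapter {
    unsigned num_tx_rings, num_rx_rings;
    struct toy_kring *tx_rings, *rx_rings;
};

/* `type` and `flags` never appear in the expansion, so the names passed
 * for them do not even need to be defined anywhere. */
#define toy_malloc(size, type, flags) calloc(1, (size))

/* Copy *tmpl into one zeroed chunk laid out as
 *   [adapter][tx 0..N-1][tx host][rx 0..N-1][rx host]
 * and return the copy, or NULL on allocation failure. */
static struct toy_adapter *toy_attach(const struct toy_adapter *tmpl)
{
    size_t n = tmpl->num_tx_rings + tmpl->num_rx_rings + 2;
    size_t size = sizeof(*tmpl) + n * sizeof(struct toy_kring);
    char *buf = toy_malloc(size, TOY_DEVBUF, TOY_NOWAIT | TOY_ZERO);
    struct toy_adapter *na;

    if (buf == NULL)
        return NULL;
    na = (struct toy_adapter *)buf;
    memcpy(na, tmpl, sizeof(*na));
    na->tx_rings = (struct toy_kring *)(buf + sizeof(*na));
    na->rx_rings = na->tx_rings + na->num_tx_rings + 1;
    return na;
}
```

Because everything sits in one chunk, a single free(na) releases the adapter and all of its rings together.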

With netmap_attach done, e1000's probe function e1000_probe is finished.

After e1000_probe, the Linux driver framework next calls e1000_open. netmap does not touch e1000_open; it modifies e1000_configure instead, which is called by both e1000_open and e1000_up.

Changes to e1000_configure

As usual, start with the diff:

@@ -393,6 +397,10 @@ static void e1000_configure(struct e1000
    e1000_configure_tx(adapter);
    e1000_setup_rctl(adapter);
    e1000_configure_rx(adapter);
+#ifdef DEV_NETMAP
+   if (e1000_netmap_init_buffers(adapter))
+       return;
+#endif /* DEV_NETMAP */
    /* call E1000_DESC_UNUSED which always leaves
    * at least 1 descriptor unused to make sure
    * next_to_use != next_to_clean */

The diff shows that netmap replaces e1000's own ring-buffer allocation code: if e1000_netmap_init_buffers returns nonzero (it returns 1 on success), e1000_configure returns right away.

Next, e1000_netmap_init_buffers:

/*
* Make the tx and rx rings point to the netmap buffers.
*/
static int e1000_netmap_init_buffers(struct SOFTC_T *adapter)
{
    struct e1000_hw *hw = &adapter->hw;
    struct ifnet *ifp = adapter->netdev;
    struct netmap_adapter* na = NA(ifp);
    struct netmap_slot* slot;
    struct e1000_tx_ring* txr = &adapter->tx_ring[0];
    unsigned int i, r, si;
    uint64_t paddr;

    /*
     * Remember netmap_attach above? Attaching meant allocating a
     * netmap_adapter, storing its pointer in net_device->ax25_ptr and
     * calling NETMAP_SET_CAPABLE. So this is a sanity check that keeps
     * the regular NIC driver path unaffected.
     */
    if (!na || !(na->ifp->if_capenable & IFCAP_NETMAP))
        return 0;
    /* as its name says, e1000_no_rx_alloc should never be called; it only logs an error */
    adapter->alloc_rx_buf = e1000_no_rx_alloc;
    for (r = 0; r < na->num_rx_rings; r++) {
        struct e1000_rx_ring *rxr;
        /* reset the corresponding netmap ring */
        slot = netmap_reset(na, NR_RX, r, 0);
        if (!slot) {
            D("strange, null netmap ring %d", r);
            return 0;
        }
        /* get the matching e1000 ring */
        rxr = &adapter->rx_ring[r];

        for (i = 0; i < rxr->count; i++) {
            // XXX the skb check and cleanup can go away
            struct e1000_buffer *bi = &rxr->buffer_info[i];
            /* convert the NIC buffer index into the netmap buffer index */
            si = netmap_idx_n2k(&na->rx_rings[r], i);
            /* get the physical address of the netmap buffer */
            PNMB(slot + si, &paddr);
            if (bi->skb)
                D("rx buf %d was set", i);
            bi->skb = NULL;
            // netmap_load_map(...)
            /* this NIC descriptor now points at the buffer netmap allocated */
            E1000_RX_DESC(*rxr, i)->buffer_addr = htole64(paddr);
        }

        rxr->next_to_use = 0;

        /*
         * I did not fully understand the next few lines; pointers are
         * welcome. Going by the original comment and the write to rdt,
         * they appear to set the receive tail so that buffers already
         * handed to clients (nr_hwavail) are not given back to the NIC.
         */
        /* preserve buffers already made available to clients */
        i = rxr->count - 1 - na->rx_rings[0].nr_hwavail;
        if (i < 0)
            i += rxr->count;
        D("i now is %d", i);
        wmb(); /* Force memory writes to complete */
        writel(i, hw->hw_addr + rxr->rdt);
    }

    /*
     * Initialize the transmit ring, much like the receive side.
     * The difference is that multiple tx queues are not handled. Is
     * that because e1000 can only have multiple rx queues and a single
     * tx queue? It does not affect the rest of the reading, so for now
     * assume e1000 has a single transmit queue.
     */
    /* now initialize the tx ring(s) */
    slot = netmap_reset(na, NR_TX, 0, 0);
    for (i = 0; i < na->num_tx_desc; i++) {
        si = netmap_idx_n2k(&na->tx_rings[0], i);
        PNMB(slot + si, &paddr);
        // netmap_load_map(...)
        E1000_TX_DESC(*txr, i)->buffer_addr = htole64(paddr);
    }
    return 1;
}
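
To make the two steps of e1000_netmap_init_buffers concrete, here is a small user-space sketch. It rests on my own assumptions: toy_desc, toy_buf, toy_init_ring and toy_rx_tail are hypothetical names, byte-order conversion is left out, and the tail computation uses signed integers so that the wraparound branch can actually run (in the driver listing i is declared unsigned int).

```c
#include <stdint.h>

#define TOY_RING_SIZE 4   /* hypothetical descriptor count */

struct toy_desc { uint64_t buffer_addr; };   /* what the NIC would read */
struct toy_buf  { uint64_t paddr; };         /* one pre-allocated buffer */

/* Point every descriptor at its pre-allocated buffer; nothing is
 * allocated per packet. */
static void toy_init_ring(struct toy_desc *ring, const struct toy_buf *bufs)
{
    for (unsigned i = 0; i < TOY_RING_SIZE; i++)
        ring[i].buffer_addr = bufs[i].paddr;
}

/* Tail index to hand to the NIC: keep one descriptor unused and hold
 * back the hwavail buffers that clients have not consumed yet. */
static int toy_rx_tail(int count, int hwavail)
{
    int i = count - 1 - hwavail;

    if (i < 0)
        i += count;
    return i;
}
```

With 256 descriptors and nothing pending, the tail lands on 255, which matches the "leave one descriptor unused" rule mentioned in the e1000_configure diff.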

Changes to e1000_clean_rx_irq

@@ -3952,6 +3973,11 @@ static bool e1000_clean_rx_irq(struct e1
    bool cleaned = false;
    unsigned int total_rx_bytes=0, total_rx_packets=0;

+#ifdef DEV_NETMAP
+   ND("calling netmap_rx_irq");
+   if (netmap_rx_irq(netdev, 0, work_done))
+       return 1; /* seems to be ignored */
+#endif /* DEV_NETMAP */
    i = rx_ring->next_to_clean;
    rx_desc = E1000_RX_DESC(*rx_ring, i);
    buffer_info = &rx_ring->buffer_info[i];

Now into netmap_rx_irq:

int
netmap_rx_irq(struct ifnet *ifp, int q, int *work_done)
{
    struct netmap_adapter *na;
    struct netmap_kring *r;
    NM_SELINFO_T *main_wq;

    if (!(ifp->if_capenable & IFCAP_NETMAP))
        return 0;

    na = NA(ifp);

    /*
     * Despite the "rx" in its name, this function serves both the rx
     * and the tx path; work_done tells them apart.
     */
    if (work_done) { /* RX path */
        r = na->rx_rings + q;
        r->nr_kflags |= NKR_PENDINTR;
        main_wq = (na->num_rx_rings > 1) ? &na->rx_si : NULL;
    } else { /* tx path */
        r = na->tx_rings + q;
        main_wq = (na->num_tx_rings > 1) ? &na->tx_si : NULL;
        work_done = &q; /* dummy */
    }


    /*
     * na->separate_locks is set to 1 only in ixgbe and the bridge.
     * From the code below, separate_locks means that with multiple
     * queues each queue has its own lock, which improves performance.
     * Otherwise the two branches are essentially the same: both wake
     * up the processes waiting for data.
     */
    if (na->separate_locks) {
        mtx_lock(&r->q_lock);
        selwakeuppri(&r->si, PI_NET);
        mtx_unlock(&r->q_lock);
        if (main_wq) {
            mtx_lock(&na->core_lock);
            selwakeuppri(main_wq, PI_NET);
            mtx_unlock(&na->core_lock);
        }
    } else {
        mtx_lock(&na->core_lock);
        selwakeuppri(&r->si, PI_NET);
        if (main_wq)
            selwakeuppri(main_wq, PI_NET);
        mtx_unlock(&na->core_lock);
    }
    *work_done = 1; /* do not fire napi again */
    return 1;
}
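
The wait-queue choice in netmap_rx_irq boils down to a small rule: the ring's own queue is always woken, and the global queue only when the adapter has more than one ring. The sketch below models just that rule in user space; toy_waitq and toy_adapter_q are hypothetical types that count wake-ups instead of waking real threads, and locking is left out.

```c
#include <stddef.h>

/* Hypothetical wait queue: it only counts wake-ups. */
struct toy_waitq { unsigned wakeups; };

struct toy_adapter_q {
    unsigned num_rx_rings;
    struct toy_waitq ring_q[4];   /* one queue per ring */
    struct toy_waitq global_q;    /* shared queue for multi-ring waiters */
};

/* Wake the per-ring queue, and the global queue only if the adapter
 * has more than one rx ring. */
static void toy_rx_irq(struct toy_adapter_q *na, unsigned q)
{
    struct toy_waitq *main_wq = (na->num_rx_rings > 1) ? &na->global_q : NULL;

    na->ring_q[q].wakeups++;
    if (main_wq != NULL)
        main_wq->wakeups++;
}
```

With a single ring, a thread waiting on that ring and a thread waiting on "all rings" are waiting for the same thing, so the global wake-up would be redundant.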

The transmit-side changes are similar to the receive side, so I won't repeat them.

Now for the core netmap code. Everything starts from init.

netmap_init

On Linux, netmap is loaded as a kernel module, and linux_netmap_init calls netmap_init.

static int
netmap_init(void)
{
    int error;

    /*
     * Allocate netmap's memory pools: netmap_if, netmap_ring and
     * netmap_buf, plus the structures that manage the pools.
     */
    error = netmap_memory_init();
    if (error != 0) {
        printf("netmap: unable to initialize the memory allocator.\n");
        return (error);
    }
    printf("netmap: loaded module with %d Mbytes\n",
        (int)(nm_mem->nm_totalsize >> 20));

    /*
     * On Linux this actually calls misc_register; make_dev is a macro.
     * It creates a misc device named netmap that serves as the
     * interface between user space and the kernel.
     */
    netmap_dev = make_dev(&netmap_cdevsw, 0, UID_ROOT, GID_WHEEL, 0660,
                  "netmap");

#ifdef NM_BRIDGE
    {
        int i;
        for (i = 0; i < NM_BRIDGES; i++)
            mtx_init(&nm_bridges[i].bdg_lock, "bdg lock", "bdg_lock", MTX_DEF);
    }
#endif
    return (error);
}

netmap_memory_init

netmap currently has two sets of memory allocator code: netmap_mem1.c and netmap_mem2.c. The latter is the default.

static int
netmap_memory_init(void)
{
    struct netmap_obj_pool *p;

    /* first allocate netmap's memory management structure */
    nm_mem = malloc(sizeof(struct netmap_mem_d), M_NETMAP,
                  M_WAITOK | M_ZERO);
    if (nm_mem == NULL)
        goto clean;

    /* pool for netmap_if */
    p = netmap_new_obj_allocator("netmap_if",
        NETMAP_IF_MAX_NUM, NETMAP_IF_MAX_SIZE);
    if (p == NULL)
        goto clean;
    nm_mem->nm_if_pool = p;

    /* pool for netmap_ring */
    p = netmap_new_obj_allocator("netmap_ring",
        NETMAP_RING_MAX_NUM, NETMAP_RING_MAX_SIZE);
    if (p == NULL)
        goto clean;
    nm_mem->nm_ring_pool = p;

    /* pool for netmap_buf */
    p = netmap_new_obj_allocator("netmap_buf",
        NETMAP_BUF_MAX_NUM, NETMAP_BUF_SIZE);
    if (p == NULL)
        goto clean;

    /* for netmap_buf, some of its fields are also saved in dedicated globals for later convenience */
    netmap_total_buffers = p->objtotal;
    netmap_buffer_lut = p->lut;
    nm_mem->nm_buf_pool = p;
    netmap_buffer_base = p->lut[0].vaddr;


    mtx_init(&nm_mem->nm_mtx, "netmap memory allocator lock", NULL,
         MTX_DEF);

    nm_mem->nm_totalsize =
        nm_mem->nm_if_pool->_memtotal +
        nm_mem->nm_ring_pool->_memtotal +
        nm_mem->nm_buf_pool->_memtotal;

    D("Have %d KB for interfaces, %d KB for rings and %d MB for buffers",
        nm_mem->nm_if_pool->_memtotal >> 10,
        nm_mem->nm_ring_pool->_memtotal >> 10,
        nm_mem->nm_buf_pool->_memtotal >> 20);
    return 0;

clean:
    if (nm_mem) {
        netmap_destroy_obj_allocator(nm_mem->nm_ring_pool);
        netmap_destroy_obj_allocator(nm_mem->nm_if_pool);
        free(nm_mem, M_NETMAP);
    }
    return ENOMEM;
}

netmap_new_obj_allocator

Now the pool allocation function, one of the longer functions in netmap.

static struct netmap_obj_pool *
netmap_new_obj_allocator(const char *name, u_int objtotal, u_int objsize)
{
    struct netmap_obj_pool *p;
    int i, n;
    u_int clustsize;    /* the cluster size, multiple of page size */
    u_int clustentries; /* how many objects per entry */

#define MAX_CLUSTSIZE   (1< 
#define LINE_ROUND  64
    /* presumably this check keeps netmap from creating pools of overly large objects */
    if (objsize >= MAX_CLUSTSIZE) {
        /* we could do it but there is no point */
        D("unsupported allocation for %d bytes", objsize);
        return NULL;
    }

    /*
     * Round the object size up to a multiple of 64 bytes. Why?
     * Because a CPU cache line is typically 64 bytes, so objects whose
     * size is aligned to the cache line perform better.
     * For more on how cache lines affect performance, see my earlier
     * post "Multicore programming: choosing the right struct size to
     * improve multicore concurrency".
     */
    /* make sure objsize is a multiple of LINE_ROUND */
    i = (objsize & (LINE_ROUND - 1));
    if (i) {
        D("XXX aligning object by %d bytes", LINE_ROUND - i);
        objsize += LINE_ROUND - i;
    }
    /*
     * Compute number of objects using a brute-force approach:
     * given a max cluster size,
     * we try to fill it with objects keeping track of the
     * wasted space to the next page boundary.
     */
    /*
     * A new concept appears here: the cluster.
     * I have not found any documentation describing it, so what follows
     * is my reading of the code below.
     * A cluster is a group of the objects the pool hands out. Why group
     * them? Linux memory management is page based, and an object is
     * either smaller or larger than a page, so allocating memory object
     * by object would waste space.
     * A cluster therefore occupies one or more contiguous pages whose
     * total size is either an exact multiple of the object size or
     * wastes as little space as possible.
     * The loop below is a brute-force search for that: it keeps trying
     * larger clusters until the size would exceed the configured
     * maximum, MAX_CLUSTSIZE.
     */
    for (clustentries = 0, i = 1;; i++) {
        u_int delta, used = i * objsize;
        /* the cluster cannot grow forever; it is capped at MAX_CLUSTSIZE */
        if (used > MAX_CLUSTSIZE)
            break;
        /* bytes used in the last page */
        delta = used % PAGE_SIZE;
        if (delta == 0) { // exact solution
            clustentries = i;
            break;
        }
        /* this candidate fills its last page better than the previous best,
         * so update clustentries, the number of objects per cluster */
        if (delta > ( (clustentries*objsize) % PAGE_SIZE) )
            clustentries = i;
    }
    // D("XXX --- ouch, delta %d (bad for buffers)", delta);
    /* compute clustsize and round to the next page */
    /* compute the cluster size and align it to PAGE_SIZE */
    clustsize = clustentries * objsize;
    i =  (clustsize & (PAGE_SIZE - 1));
    if (i)
        clustsize += PAGE_SIZE - i;
    D("objsize %d clustsize %d objects %d",
        objsize, clustsize, clustentries);

    /* allocate the pool management structure */
    p = malloc(sizeof(struct netmap_obj_pool), M_NETMAP,
        M_WAITOK | M_ZERO);
    if (p == NULL) {
        D("Unable to create '%s' allocator", name);
        return NULL;
    }
    /*
     * Allocate and initialize the lookup table.
     *
     * The number of clusters is n = ceil(objtotal/clustentries)
     * objtotal' = n * clustentries
     */
    /* initialize the pool management structure */
    strncpy(p->name, name, sizeof(p->name));
    p->clustentries = clustentries;
    p->_clustsize = clustsize;
    /* derive the number of clusters from the requested number of objects */
    n = (objtotal + clustentries - 1) / clustentries;
    p->_numclusters = n;
    /* the real number of objects in the pool, usually a bit more than the objtotal argument */
    p->objtotal = n * clustentries;
    /* why 0 and 1 are reserved is unclear for now; left for later :) */
    p->objfree = p->objtotal - 2; /* obj 0 and 1 are reserved */
    p->_objsize = objsize;
    p->_memtotal = p->_numclusters * p->_clustsize;

    /* lookup table mapping each object to its virtual and physical address */
    p->lut = malloc(sizeof(struct lut_entry) * p->objtotal,
        M_NETMAP, M_WAITOK | M_ZERO);
    if (p->lut == NULL) {
        D("Unable to create lookup table for '%s' allocator", name);
        goto clean;
    }

    /* Allocate the bitmap */
    /* allocate the pool bitmap that records which objects are allocated */
    n = (p->objtotal + 31) / 32;
    p->bitmap = malloc(sizeof(uint32_t) * n, M_NETMAP, M_WAITOK | M_ZERO);
    if (p->bitmap == NULL) {
        D("Unable to create bitmap (%d entries) for allocator '%s'", n,
            name);
        goto clean;
    }
    /*
     * Allocate clusters, init pointers and bitmap
     */
    for (i = 0; i < p->objtotal;) {
        int lim = i + clustentries;
        char *clust;

        clust = contigmalloc(clustsize, M_NETMAP, M_WAITOK | M_ZERO,
            0, -1UL, PAGE_SIZE, 0);
        if (clust == NULL) {
            /*
             * If we get here, there is a severe memory shortage,
             * so halve the allocated memory to reclaim some.
             */
            D("Unable to create cluster at %d for '%s' allocator",
                i, name);
            lim = i / 2;
            for (; i >= lim; i--) {
                p->bitmap[ (i>>5) ] &=  ~( 1 << (i & 31) );
                if (i % clustentries == 0 && p->lut[i].vaddr)
                    contigfree(p->lut[i].vaddr,
                        p->_clustsize, M_NETMAP);
            }
            p->objtotal = i;
            p->objfree = p->objtotal - 2;
            p->_numclusters = i / clustentries;
            p->_memtotal = p->_numclusters * p->_clustsize;
            break;
        }
        /* initialize the bitmap and the virtual/physical address lookup table */
        for (; i < lim; i++, clust += objsize) {
            /*
             * 1. each bitmap word holds 32 bits, hence i >> 5 to pick the word;
             * 2. (i & 31) picks the bit within that word, for the same reason.
             *    Careful, robust code.
             */
            p->bitmap[ (i>>5) ] |=  ( 1 << (i & 31) );
            p->lut[i].vaddr = clust;
            p->lut[i].paddr = vtophys(clust);
        }
    }

    /* as before, bits 0 and 1 are reserved; the question stays open for now */
    p->bitmap[0] = ~3; /* objs 0 and 1 is always busy */
    D("Pre-allocated %d clusters (%d/%dKB) for '%s'",
        p->_numclusters, p->_clustsize >> 10,
        p->_memtotal >> 10, name);

    return p;

clean:
    netmap_destroy_obj_allocator(p);
    return NULL;
}
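
The cluster sizing is easier to follow when it is pulled out and run on a few numbers. The sketch below copies the rounding and the search loop into two user-space functions. The names are hypothetical, the 4096-byte page size is an assumption, and so is the 1 << 17 cap, since the MAX_CLUSTSIZE definition is cut off in the listing above.

```c
#define TOY_PAGE_SIZE     4096u        /* assumed page size */
#define TOY_MAX_CLUSTSIZE (1u << 17)   /* assumed cap, not taken from the listing */
#define TOY_LINE_ROUND    64u          /* cache-line rounding, as in the listing */

/* Round objsize up to a multiple of TOY_LINE_ROUND. */
static unsigned toy_round_obj(unsigned objsize)
{
    unsigned rem = objsize & (TOY_LINE_ROUND - 1);

    return rem ? objsize + TOY_LINE_ROUND - rem : objsize;
}

/* Same search as the listing: try 1, 2, 3... objects per cluster, stop
 * at an exact page multiple or at the size cap, and otherwise keep the
 * candidate picked by the listing's delta comparison.
 * Returns objects per cluster; *clustsize gets the page-rounded size. */
static unsigned toy_cluster(unsigned objsize, unsigned *clustsize)
{
    unsigned entries = 0, i, rem;

    for (i = 1;; i++) {
        unsigned used = i * objsize, delta;

        if (used > TOY_MAX_CLUSTSIZE)
            break;
        delta = used % TOY_PAGE_SIZE;
        if (delta == 0) {              /* exact fit */
            entries = i;
            break;
        }
        if (delta > (entries * objsize) % TOY_PAGE_SIZE)
            entries = i;
    }
    *clustsize = entries * objsize;
    rem = *clustsize & (TOY_PAGE_SIZE - 1);
    if (rem)
        *clustsize += TOY_PAGE_SIZE - rem;
    return entries;
}
```

For 2048-byte objects the loop stops at two objects per 4096-byte cluster; for 1536-byte objects it runs until eight objects fill exactly three pages.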

That concludes netmap_new_obj_allocator. For netmap's memory management I will keep following the main flow of events rather than finishing that whole part in one go.
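
Since the allocation and free routines that use the bitmap have not been shown yet, here is my own minimal sketch of how such a bitmap pool can work, using the same i >> 5 and i & 31 indexing and the same reservation of objects 0 and 1. toy_pool and its three functions are hypothetical and are not netmap's implementation.

```c
#include <stdint.h>
#include <string.h>

#define TOY_OBJS 64u                         /* hypothetical pool size */

struct toy_pool {
    uint32_t bitmap[(TOY_OBJS + 31) / 32];   /* bit set = object free */
    unsigned objfree;
};

/* Mark every object free, then reserve objects 0 and 1. */
static void toy_pool_init(struct toy_pool *p)
{
    memset(p->bitmap, 0, sizeof(p->bitmap));
    for (unsigned i = 0; i < TOY_OBJS; i++)
        p->bitmap[i >> 5] |= (uint32_t)1 << (i & 31);
    p->bitmap[0] &= ~(uint32_t)3;
    p->objfree = TOY_OBJS - 2;
}

/* Take the lowest free object; returns its index, or -1 if none is left. */
static int toy_pool_alloc(struct toy_pool *p)
{
    for (unsigned i = 0; i < TOY_OBJS; i++) {
        uint32_t mask = (uint32_t)1 << (i & 31);

        if (p->bitmap[i >> 5] & mask) {
            p->bitmap[i >> 5] &= ~mask;
            p->objfree--;
            return (int)i;
        }
    }
    return -1;
}

/* Return object i to the pool. */
static void toy_pool_free(struct toy_pool *p, unsigned i)
{
    p->bitmap[i >> 5] |= (uint32_t)1 << (i & 31);
    p->objfree++;
}
```

The linear scan keeps the sketch short; a real allocator would more likely skip whole zero words before testing individual bits.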

Next, let's study netmap's code top-down, starting from how netmap is used.

A netmap usage example

The netmap website gives a simple example; simple as it is, it covers the calls that make up the netmap framework.

struct netmap_if *nifp;
struct nmreq req;
int i, len;
char *buf;

fd = open("/dev/netmap", 0);
strcpy(req.nr_name, "ix0"); // register the interface
ioctl(fd, NIOCREG, &req); // offset of the structure
mem = mmap(NULL, req.nr_memsize, PROT_READ|PROT_WRITE, 0, fd, 0);
nifp = NETMAP_IF(mem, req.nr_offset);
for (;;) {
    struct pollfd x[1];
    struct netmap_ring *ring = NETMAP_RX_RING(nifp, 0);

    x[0].fd = fd;
    x[0].events = POLLIN;
    poll(x, 1, 1000);
    for ( ; ring->avail > 0 ; ring->avail--) {
        i = ring->cur;
        buf = NETMAP_BUF(ring, i);
        use_data(buf, ring->slot[i].len);
        ring->cur = NETMAP_NEXT(ring, i);
    }
}

Let's keep walking through it and look at each piece as we reach it.

The open operation

This part has little to do with netmap itself. Remember that netmap registered a misc device, netmap_cdevsw, earlier?

static struct file_operations netmap_fops = {
    .mmap = linux_netmap_mmap,
    LIN_IOCTL_NAME = linux_netmap_ioctl,
    .poll = linux_netmap_poll,
    .release = netmap_release,
};

static struct miscdevice netmap_cdevsw = {  /* same name as FreeBSD */
    MISC_DYNAMIC_MINOR,
    "netmap",
    &netmap_fops,
};

netmap_cdevsw is the device structure and netmap_fops holds its operations. There is no custom open function here, so the Linux kernel's default open is presumably used; that is my guess, and I won't dig into the Linux source for now.

The NIOCREG ioctl

ioctl is the kernel's junk drawer: anything gets thrown into it, and it can do anything.

netmap's ioctl

long
linux_netmap_ioctl(struct file *file, u_int cmd, u_long data /* arg */)
{
    int ret;
    struct nmreq nmr;
    bzero(&nmr, sizeof(nmr));

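    /*
     * As the example above and this code show, struct nmreq is the
     * message structure exchanged between the netmap kernel code and
     * user space; all their interaction goes through it.
     */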
    if (data && copy_from_user(&nmr, (void *)data, sizeof(nmr) ) != 0)
        return -EFAULT;
    ret = netmap_ioctl(NULL, cmd, (caddr_t)&nmr, 0, (void *)file);
    if (data && copy_to_user((void*)data, &nmr, sizeof(nmr) ) != 0)
        return -EFAULT;
    return -ret;
}

Now into netmap_ioctl, the real netmap ioctl handler:

static int
netmap_ioctl(struct cdev *dev, u_long cmd, caddr_t data,
    int fflag, struct thread *td)
{
    struct netmap_priv_d *priv = NULL;
    struct ifnet *ifp;
    struct nmreq *nmr = (struct nmreq *) data;
    struct netmap_adapter *na;
    int error;
    u_int i, lim;
    struct netmap_if *nifp;

    /*
     * A small trick with void: these casts silence the compiler's
     * unused-parameter warnings.
     */
    (void)dev;  /* UNUSED */
    (void)fflag;    /* UNUSED */

    /* on Linux both of these macros are empty */
    CURVNET_SET(TD_TO_VNET(td));

    /*
     * devfs_get_cdevpriv is a macro on Linux. It fetches
     * struct file->private_data and yields 0 when private_data is not
     * NULL and ENOENT when it is NULL, so on Linux the condition below
     * is always false.
     */
    error = devfs_get_cdevpriv((void **)&priv);
    if (error != ENOENT && error != 0) {
        CURVNET_RESTORE();
        return (error);
    }

    error = 0;  /* Could be ENOENT */
    /*
     * Another sign of robust code: anything running in the kernel must
     * be stable, so the nmr->nr_name string is forcibly terminated to
     * keep its length valid.
     */
    nmr->nr_name[sizeof(nmr->nr_name) - 1] = '\0';  /* truncate name */

    ... ...

To keep the flow clear, the analysis of netmap_ioctl pauses here, and we continue along the usage flow from before.
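
The last step shown in netmap_ioctl, forcing the final byte of the interface name to be a terminator, is a habit worth copying for any fixed-size string that arrives from an untrusted caller. A minimal user-space sketch follows, with a hypothetical toy_req structure and a made-up 16-byte name field:

```c
#include <string.h>

#define TOY_NAMSIZ 16                  /* hypothetical name field size */

struct toy_req { char name[TOY_NAMSIZ]; };

/* Force termination of a fixed-size name that came from an untrusted
 * caller, then return its (now bounded) length. */
static size_t toy_sanitize_name(struct toy_req *req)
{
    req->name[sizeof(req->name) - 1] = '\0';
    return strlen(req->name);
}
```

Without that one assignment, a caller who fills all 16 bytes with non-zero characters would make every later strlen or strcmp on the name read past the end of the buffer.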

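Writing this, I realized that the example on the netmap website must be quite old: with the current netmap code it cannot work as written. What's done is done, though; take the example as conveying the idea. Fortunately the overall flow has not changed much.

Since the example code cannot be trusted, let's analyze netmap in the order of the commands its ioctl supports.

NIOCGINFO

Returns basic information about netmap.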

case NIOCGINFO:     /* return capabilities etc */
    /* memsize is always valid */
    /*
     * If I were writing this, I would probably do the version check
     * below first. netmap presumably fills these in first because the
     * information does not depend on the version.
     */
    nmr->nr_memsize = nm_mem->nm_totalsize;
    nmr->nr_offset = 0;
    nmr->nr_rx_rings = nmr->nr_tx_rings = 0;
    nmr->nr_rx_slots = nmr->nr_tx_slots = 0;
    if (nmr->nr_version != NETMAP_API) {
        D("API mismatch got %d have %d",
            nmr->nr_version, NETMAP_API);
        nmr->nr_version = NETMAP_API;
        error = EINVAL;
        break;
    }
    if (nmr->nr_name[0] == '\0')    /* just get memory info */
        break;
    /*
     * On Linux this calls dev_get_by_name to look up the struct
     * net_device by interface name, and uses NETMAP_CAPABLE to check
     * whether netmap is attached to that net_device (see the attach
     * discussion earlier if you have forgotten NETMAP_CAPABLE).
     */
    error = get_ifp(nmr->nr_name, &ifp); /* get a refcount */
    if (error)
        break;
    /* get the netmap structure attached to the NIC */
    na = NA(ifp); /* retrieve netmap_adapter */
    /* report the number of rings and the number of slots per ring */
    nmr->nr_rx_rings = na->num_rx_rings;
    nmr->nr_tx_rings = na->num_tx_rings;
    nmr->nr_rx_slots = na->num_rx_desc;
    nmr->nr_tx_slots = na->num_tx_desc;
    nm_if_rele(ifp);    /* return the refcount */
    break;

NIOCREGIF

Puts a specific NIC into netmap mode.

case NIOCREGIF:
    if (nmr->nr_version != NETMAP_API) {
        nmr->nr_version = NETMAP_API;
        error = EINVAL;
        break;
    }
    if (priv != NULL) { /* thread already registered */
        /* re-select which rings we are interested in; this function is covered later */
        error = netmap_set_ringid(priv, nmr->nr_ringid);
        break;
    }
    /* the next few lines fetch the netmap_adapter, just as in the NIOCGINFO case */
    /* find the interface and a reference */
    error = get_ifp(nmr->nr_name, &ifp); /* keep reference */
    if (error)
        break;
    na = NA(ifp); /* retrieve netmap adapter */

    /*
     * Allocate the private per-thread structure.
     * XXX perhaps we can use a blocking malloc ?
     */
    priv = malloc(sizeof(struct netmap_priv_d), M_DEVBUF,
              M_NOWAIT | M_ZERO);
    if (priv == NULL) {
        error = ENOMEM;
        nm_if_rele(ifp);   /* return the refcount */
        break;
    }

    /* retry loop: wait until the net_device is usable (not being deleted) */
    for (i = 10; i > 0; i--) {
        na->nm_lock(ifp, NETMAP_REG_LOCK, 0);
        if (!NETMAP_DELETING(na))
            break;
        na->nm_lock(ifp, NETMAP_REG_UNLOCK, 0);
        tsleep(na, 0, "NIOCREGIF", hz/10);
    }
    if (i == 0) {
        D("too many NIOCREGIF attempts, give up");
        error = EINVAL;
        free(priv, M_DEVBUF);
        nm_if_rele(ifp);    /* return the refcount */
        break;
    }

    /* store the net_device pointer */
    priv->np_ifp = ifp; /* store the reference */
    /* select the rings of interest, i.e. the rings that will interact with user space */
    error = netmap_set_ringid(priv, nmr->nr_ringid);
    if (error)
        goto error;
    /*
     * Each netmap descriptor has, for its NIC, one struct netmap_if,
     * namely priv->np_nifp.
     */
    priv->np_nifp = nifp = netmap_if_new(nmr->nr_name, na);
    if (nifp == NULL) { /* allocation failed */
        error = ENOMEM;
    } else if (ifp->if_capenable & IFCAP_NETMAP) {
        /* was already set */
        /* the NIC is already in netmap mode; nothing more to set up */
    } else {
        /* Otherwise set the card in netmap mode
         * and make it use the shared buffers.
         */
        /* the NIC is really entering netmap mode now, so initialize the per-ring locks */
        for (i = 0 ; i < na->num_tx_rings + 1; i++)
            mtx_init(&na->tx_rings[i].q_lock, "nm_txq_lock", MTX_NETWORK_LOCK, MTX_DEF);
        for (i = 0 ; i < na->num_rx_rings + 1; i++) {
            mtx_init(&na->rx_rings[i].q_lock, "nm_rxq_lock", MTX_NETWORK_LOCK, MTX_DEF);
        }
        /*
         * Switch the NIC's netmap mode on.
         * For the e1000 driver, nm_register is e1000_netmap_reg.
         */
        error = na->nm_register(ifp, 1); /* mode on */
        if (error)
            netmap_dtor_locked(priv);
    }

    if (error) {    /* reg. failed, release priv and ref */
error:
        na->nm_lock(ifp, NETMAP_REG_UNLOCK, 0);
        nm_if_rele(ifp);    /* return the refcount */
        bzero(priv, sizeof(*priv));
        free(priv, M_DEVBUF);
        break;
    }

    na->nm_lock(ifp, NETMAP_REG_UNLOCK, 0);
    /* on Linux, priv is stored in file->private_data */
    error = devfs_set_cdevpriv(priv, netmap_dtor);

    if (error != 0) {
        /* could not assign the private storage for the
         * thread, call the destructor explicitly.
         */
        netmap_dtor(priv);
        break;
    }

    /* return the offset of the netmap_if object */
    nmr->nr_rx_rings = na->num_rx_rings;
    nmr->nr_tx_rings = na->num_tx_rings;
    nmr->nr_rx_slots = na->num_rx_desc;
    nmr->nr_tx_slots = na->num_tx_desc;
    nmr->nr_memsize = nm_mem->nm_totalsize;
    /*
     * Get the offset of nifp inside the memory pool.
     * netmap is built on memory shared between the kernel and user
     * space, but kernel and user address ranges differ, so the same
     * physical memory has different addresses on the two sides.
     * An offset is what lets both sides refer to the same memory.
     */
    nmr->nr_offset = netmap_if_offset(nifp);
    break;
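
Why an offset rather than a pointer goes back to user space can be shown with two helpers. This is a sketch under my own assumptions: toy_offset and toy_at are hypothetical, and in a quick experiment two ordinary arrays with the same contents can stand in for the kernel mapping and the user mapping of one shared region.

```c
#include <stddef.h>
#include <string.h>

/* Offset of an object inside a shared region, relative to its base. */
static size_t toy_offset(const void *base, const void *obj)
{
    return (size_t)((const char *)obj - (const char *)base);
}

/* Resolve an offset against whatever base address this side uses. */
static void *toy_at(void *base, size_t off)
{
    return (char *)base + off;
}
```

The kernel side would compute toy_offset(kernel_base, nifp) and hand the number over; the user side would call toy_at(mmap_base, off) and arrive at the same object even though the two base addresses differ.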

netmap_ioctl

With NIOCGINFO and NIOCREGIF covered, the rest is fairly simple. Next come the remaining commands and the functions that netmap_ioctl calls.

NIOCUNREGIF

case NIOCUNREGIF:
    if (priv == NULL) {
        /* no priv means NIOCREGIF was never called, which is an error */
        error = ENXIO;
        break;
    }

    /* the interface is unregistered inside the
       destructor of the private data. */
    /* release the priv memory */
    devfs_clear_cdevpriv();
    break;

NIOCTXSYNC and NIOCRXSYNC

These two commands share the same code.

case NIOCTXSYNC:
case NIOCRXSYNC:
    /* Check priv to make sure NIOCREGIF was called earlier */
    if (priv == NULL) {
        error = ENXIO;
        break;
    }
    /*
    Recall from the NIOCREGIF analysis that priv->np_ifp holds the net_device pointer, so it can be fetched directly here.
    Do we need to worry about whether that pointer is still valid? No: NIOCREGIF took a reference when it looked up the net_device.
    */
    ifp = priv->np_ifp; /* we have a reference */
    na = NA(ifp); /* retrieve netmap adapter */

    /*
    np_qfirst is the first ring to check.
    NETMAP_SW_RING is a special value meaning that the host ring is processed.
    */
    if (priv->np_qfirst == NETMAP_SW_RING) { /* host rings */
        /*
        The host ring handling looked odd to me at first.
        When cmd is NIOCTXSYNC, packets are passed to the host;
        when cmd is NIOCRXSYNC, packets sent out by the host are picked up.
        It seemed reversed, so I emailed the author without knowing whether I would get a reply.

        The author has since replied, which was very kind of him. The direction is defined from netmap's point of view:
        on txsync netmap sends packets out, so they naturally go to the host, and the reverse holds for rxsync.
        */
        if (cmd == NIOCTXSYNC)
            netmap_sync_to_host(na);
        else
            netmap_sync_from_host(na, NULL, NULL);
        break;
    }

    /* find the last ring to scan */
    /*
    Get the last ring to check; if it is NETMAP_HW_RING, the limit is the total number of rings.
    np_qfirst and np_qlast will make sense once we reach netmap_set_ringid.
    */
    lim = priv->np_qlast;
    if (lim == NETMAP_HW_RING)
        lim = (cmd == NIOCTXSYNC) ?
            na->num_tx_rings : na->num_rx_rings;

    /* Walk every ring, starting from the first one */
    for (i = priv->np_qfirst; i < lim; i++) {
        if (cmd == NIOCTXSYNC) {
            struct netmap_kring *kring = &na->tx_rings[i];
            if (netmap_verbose & NM_VERB_TXSYNC)
                D("pre txsync ring %d cur %d hwcur %d",
                    i, kring->ring->cur,
                    kring->nr_hwcur);
            /* Do the transmit work; analyzed later */
            na->nm_txsync(ifp, i, 1 /* do lock */);
            if (netmap_verbose & NM_VERB_TXSYNC)
                D("post txsync ring %d cur %d hwcur %d",
                    i, kring->ring->cur,
                    kring->nr_hwcur);
        } else {
            /* Do the receive work; analyzed later */
            na->nm_rxsync(ifp, i, 1 /* do lock */);
            /*
            On Linux this actually calls do_gettimeofday. I do not know yet why the receive path needs this timestamp;
            perhaps the reason will become clear later.
            */
            microtime(&na->rx_rings[i].ring->ts);
        }
    }
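
The range selection in this case can be isolated into a small sketch. The helpers `scan_limit` and `rings_visited` are hypothetical names of mine, not netmap functions; only the two flag values come from the macros quoted in this article. The sketch shows how np_qfirst and np_qlast turn into the half-open range [np_qfirst, lim) that the for loop walks, and that the host ring never reaches that loop.

```c
#include <assert.h>

#define NETMAP_HW_RING 0x4000 /* flag values as quoted in this article */
#define NETMAP_SW_RING 0x2000

/* Hypothetical helper: upper bound of the scan for one sync direction. */
static unsigned scan_limit(unsigned np_qlast, int is_tx,
                           unsigned num_tx_rings, unsigned num_rx_rings)
{
    /* NETMAP_HW_RING as np_qlast means "all rings of this direction". */
    if (np_qlast == NETMAP_HW_RING)
        return is_tx ? num_tx_rings : num_rx_rings;
    return np_qlast;
}

/* Hypothetical helper: how many hardware rings one sync call would visit. */
static unsigned rings_visited(unsigned np_qfirst, unsigned np_qlast, int is_tx,
                              unsigned num_tx_rings, unsigned num_rx_rings)
{
    unsigned lim, i, visited = 0;

    /* The host ring is handled on a separate path before the loop. */
    if (np_qfirst == NETMAP_SW_RING)
        return 0;

    lim = scan_limit(np_qlast, is_tx, num_tx_rings, num_rx_rings);
    for (i = np_qfirst; i < lim; i++)
        visited++;
    return visited;
}
```

With 4 TX rings and 2 RX rings, the "all rings" setting (np_qfirst 0, np_qlast NETMAP_HW_RING) visits 4 rings on a TX sync and 2 on an RX sync, while a single-ring binding such as (1, 2) visits exactly one.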

That completes the walkthrough of netmap_ioctl.

netmap_set_ringid

static int
netmap_set_ringid(struct netmap_priv_d *priv, u_int ringid)
{
    struct ifnet *ifp = priv->np_ifp;
    struct netmap_adapter *na = NA(ifp);

    /*
    The four macros below show that ringid is a multiplexed field: the low 12 bits hold the ring number and the higher bits are flags.
    #define NETMAP_HW_RING      0x4000  (low bits indicate one hw ring)
    #define NETMAP_SW_RING      0x2000  (process the sw ring)
    #define NETMAP_NO_TX_POLL   0x1000  (no automatic txsync on poll)
    #define NETMAP_RING_MASK    0xfff   (the ring number)
    */
    u_int i = ringid & NETMAP_RING_MASK;
    /*
    According to the comment below, np_qfirst equals np_qlast during initialization, so no lock is needed.
    I have not figured this out. What happens if two threads enter at the same time?
    */
    /* initially (np_qfirst == np_qlast) we don't want to lock */
    int need_lock = (priv->np_qfirst != priv->np_qlast);
    int lim = na->num_rx_rings;

    /* The upper bound is the larger of the TX and RX ring counts */
    if (na->num_tx_rings > lim)
        lim = na->num_tx_rings;
    /* When a single HW ring is requested, validate the ring number */
    if ( (ringid & NETMAP_HW_RING) && i >= lim) {
        D("invalid ring id %d", i);
        return (EINVAL);
    }
    if (need_lock)
        na->nm_lock(ifp, NETMAP_CORE_LOCK, 0);
    priv->np_ringid = ringid;
    /*
    Set np_qfirst and np_qlast according to the flags. This also shows that np_qfirst can equal np_qlast only at initialization.
    */
    if (ringid & NETMAP_SW_RING) {
        priv->np_qfirst = NETMAP_SW_RING;
        priv->np_qlast = 0;
    } else if (ringid & NETMAP_HW_RING) {
        priv->np_qfirst = i;
        priv->np_qlast = i + 1;
    } else {
        priv->np_qfirst = 0;
        priv->np_qlast = NETMAP_HW_RING ;
    }
    /* Whether a poll for received packets should also transmit packets */
    priv->np_txpoll = (ringid & NETMAP_NO_TX_POLL) ? 0 : 1;
    if (need_lock)
        na->nm_lock(ifp, NETMAP_CORE_UNLOCK, 0);
    if (ringid & NETMAP_SW_RING)
        D("ringid %s set to SW RING", ifp->if_xname);
    else if (ringid & NETMAP_HW_RING)
        D("ringid %s set to HW RING %d", ifp->if_xname,
            priv->np_qfirst);
    else
        D("ringid %s set to all %d HW RINGS", ifp->if_xname, lim);
    return 0;
}
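
The decoding done by netmap_set_ringid can be tried out in user space with a stripped-down sketch. The struct `ring_selection` and the function `decode_ringid` are hypothetical names of mine, and locking and logging are left out. The macro values and the order of the checks follow the code above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NETMAP_HW_RING    0x4000 /* low bits indicate one hw ring */
#define NETMAP_SW_RING    0x2000 /* process the sw ring */
#define NETMAP_NO_TX_POLL 0x1000 /* no automatic txsync on poll */
#define NETMAP_RING_MASK  0xfff  /* the ring number */

/* Hypothetical result type: the three fields the kernel stores in priv. */
struct ring_selection {
    unsigned qfirst;
    unsigned qlast;
    int txpoll;
};

/*
 * Hypothetical stand-alone version of the decoding step.
 * Returns 0 on success or EINVAL for an out-of-range hardware ring.
 */
static int decode_ringid(unsigned ringid, unsigned num_tx_rings,
                         unsigned num_rx_rings, struct ring_selection *out)
{
    unsigned i = ringid & NETMAP_RING_MASK;
    unsigned lim = num_rx_rings > num_tx_rings ? num_rx_rings : num_tx_rings;

    assert(out != NULL);

    if ((ringid & NETMAP_HW_RING) && i >= lim)
        return EINVAL;

    if (ringid & NETMAP_SW_RING) {          /* host ring only */
        out->qfirst = NETMAP_SW_RING;
        out->qlast = 0;
    } else if (ringid & NETMAP_HW_RING) {   /* exactly one hardware ring */
        out->qfirst = i;
        out->qlast = i + 1;
    } else {                                /* all hardware rings */
        out->qfirst = 0;
        out->qlast = NETMAP_HW_RING;
    }
    out->txpoll = (ringid & NETMAP_NO_TX_POLL) ? 0 : 1;
    return 0;
}
```

For example, `NETMAP_HW_RING | 2` on a NIC with 4 rings selects the range (2, 3), while the same flag with ring number 7 is rejected with EINVAL. None of the three outcomes leaves qfirst equal to qlast, which matches the observation in the comment above.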

That finishes netmap_ioctl. Following the netmap example, the next step is netmap's mmap implementation.

Locating netmap's mmap

As mentioned earlier, netmap creates a device:

static struct miscdevice netmap_cdevsw = {  /* same name as FreeBSD */  
    MISC_DYNAMIC_MINOR,  
    "netmap",  
    &netmap_fops,  
};

netmap_fops defines the operations that the netmap device supports:

static struct file_operations netmap_fops = {
    .mmap = linux_netmap_mmap,
    LIN_IOCTL_NAME = linux_netmap_ioctl,
    .poll = linux_netmap_poll,
    .release = netmap_release,
};

OK, now we have found the mmap entry point: linux_netmap_mmap.

Analysis of linux_netmap_mmap

Let's go straight into the code of linux_netmap_mmap:

static int
linux_netmap_mmap(struct file *f, struct vm_area_struct *vma)
{
    int lut_skip, i, j;
    int user_skip = 0;
    struct lut_entry *l_entry;
    const struct netmap_obj_pool *p[] = {
        nm_mem->nm_if_pool,
        nm_mem->nm_ring_pool,
        nm_mem->nm_buf_pool };
    /*
    * vma->vm_start: start of mapping user address space
    * vma->vm_end: end of the mapping user address space
    */

    /*
    Another programming trick: (void)f generates no real code, yet it silences the unused-variable warning for f.
    Why is f in the parameter list if it is never used? There is no choice, because the Linux framework dictates it. linux_netmap_mmap is a registered callback, so it has to follow the signature Linux defines.
    */
    (void)f;    /* UNUSED */
    // XXX security checks

    for (i = 0; i < 3; i++) {  /* loop through obj_pools */
        /*
         * In each pool memory is allocated in clusters
         * of size _clustsize , each containing clustentries
         * entries. For each object k we already store the
         * vtophys mapping in lut[k] so we use that, scanning
         * the lut[] array in steps of clustentries,
         * and we map each cluster (not individual pages,
         * it would be overkill).
         */
        /*
        The comment above says it clearly.
        Memory in each pool is allocated in clusters of _clustsize bytes, each cluster holds clustentries objects, and a pool has _numclusters clusters in total.
        So during the mapping, user_skip is the amount of memory already mapped, vma->vm_start + user_skip is the start of the user address range not yet mapped, and lut_skip is the lut index of the first object in the cluster about to be mapped.
        */
        for (lut_skip = 0, j = 0; j < p[i]->_numclusters; j++) {
            l_entry = &p[i]->lut[lut_skip];
            if (remap_pfn_range(vma, vma->vm_start + user_skip,
                    l_entry->paddr >> PAGE_SHIFT, p[i]->_clustsize,
                    vma->vm_page_prot))
                return -EAGAIN; // XXX check return value
            lut_skip += p[i]->clustentries;
            user_skip += p[i]->_clustsize;
        }
    }

    /*
    Once the loops finish, netmap's three kernel object pools are fully mapped into user space.
    The function that actually performs the mapping is remap_pfn_range, a kernel function that maps physical page frames, here kernel memory, into a user address range.
    It is beyond the scope of this article; knowing what it does is enough.
    */

    return 0;
}
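
The bookkeeping in the two nested loops is easier to follow with numbers. The sketch below is a user-space simulation only: `struct pool_geometry`, `struct cluster_slot`, `locate_cluster` and `total_mapped` are hypothetical names of mine, and no real mapping takes place. It replays the same user_skip and lut_skip arithmetic to show where a given cluster ends up in the user mapping and which lut entry describes it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the three fields the mmap loop reads. */
struct pool_geometry {
    size_t clustsize;       /* bytes per cluster */
    unsigned clustentries;  /* objects per cluster */
    unsigned numclusters;   /* clusters in the pool */
};

/* Where one cluster lands: offset in the user mapping and its lut index. */
struct cluster_slot {
    size_t user_off;
    unsigned lut_index;
};

/* Replays the loop and reports the position of cluster j of pool pool_idx. */
static int locate_cluster(const struct pool_geometry *pools, unsigned npools,
                          unsigned pool_idx, unsigned j,
                          struct cluster_slot *out)
{
    size_t user_skip = 0;   /* bytes mapped so far, across all pools */
    unsigned i, k, lut_skip;

    assert(pools != NULL && out != NULL);

    for (i = 0; i < npools; i++) {
        /* lut_skip restarts for each pool; user_skip keeps growing. */
        for (lut_skip = 0, k = 0; k < pools[i].numclusters; k++) {
            if (i == pool_idx && k == j) {
                out->user_off = user_skip;
                out->lut_index = lut_skip;
                return 0;
            }
            lut_skip += pools[i].clustentries;
            user_skip += pools[i].clustsize;
        }
    }
    return -1; /* no such cluster */
}

/* Total bytes the loop would map, i.e. the final value of user_skip. */
static size_t total_mapped(const struct pool_geometry *pools, unsigned npools)
{
    size_t total = 0;
    unsigned i;

    for (i = 0; i < npools; i++)
        total += pools[i].clustsize * pools[i].numclusters;
    return total;
}
```

Take three made-up pools of (4096 bytes, 4 entries, 2 clusters), (8192, 1, 3) and (4096, 2, 4). The first cluster of the second pool starts at user offset 8192, right after the first pool, and the whole mapping covers 49152 bytes. The pools are laid out back to back in the order if, ring, buf, which is why the interface pool sits at the very start of the mapped region.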

Getting the NIC's netmap structure in user space

After the netmap kernel memory has been mapped into user space, the netmap example uses an offset to find the netmap structure of the NIC in question.

fd = open("/dev/netmap", O_RDWR);
strcpy(req.nr_name, "ix0"); // register the interface
ioctl(fd, NIOCREGIF, &req); // offset of the structure
mem = mmap(NULL, req.nr_memsize, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
nifp = NETMAP_IF(mem, req.nr_offset);

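In this example, the ioctl returns req.nr_offset, the offset of the netmap structure for the ix0 NIC. More precisely, it is the offset within the memory pool that netmap uses for interface structures. After mmap, mem is the mapping of netmap's memory, and because the interface-structure pool is the first item in that memory, mem can also be treated as the start address of that pool. Adding req.nr_offset to mem therefore yields the netmap structure of ix0, namely struct netmap_if.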
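
The chain from mem to nifp to a ring can be imitated with a mock region. Everything prefixed with `mock_` or `MOCK_` below is a hypothetical simplification of mine, not the real netmap layout. The one detail taken from netmap is the RX index formula `index + ni_tx_rings + 1`, where the extra slot is, as far as I understand it, the host TX ring that sits between the hardware TX rings and the RX rings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified stand-ins for netmap_ring and netmap_if. */
struct mock_ring {
    unsigned id;
};

struct mock_if {
    unsigned ni_tx_rings;
    unsigned ni_rx_rings;
    ptrdiff_t ring_ofs[8];  /* offsets relative to the mock_if itself */
};

#define MOCK_IF(base, off) \
    ((struct mock_if *)((char *)(base) + (off)))
#define MOCK_TXRING(nifp, idx) \
    ((struct mock_ring *)((char *)(nifp) + (nifp)->ring_ofs[(idx)]))
#define MOCK_RXRING(nifp, idx) \
    ((struct mock_ring *)((char *)(nifp) + \
        (nifp)->ring_ofs[(idx) + (nifp)->ni_tx_rings + 1]))

/*
 * Build a 4 KiB region holding one mock_if at offset 256 and six rings
 * (2 TX + host TX + 2 RX + host RX), each tagged with its ring_ofs index.
 * Returns the region (release it with free()) and stores the interface
 * offset in *if_off, playing the role of nr_offset.
 */
static void *build_mock_region(size_t *if_off)
{
    const unsigned nrings = 6;
    struct mock_if *nifp;
    unsigned n;
    char *mem = calloc(1, 4096);

    assert(if_off != NULL);
    if (mem == NULL)
        return NULL;

    *if_off = 256;
    nifp = MOCK_IF(mem, *if_off);
    nifp->ni_tx_rings = 2;
    nifp->ni_rx_rings = 2;
    for (n = 0; n < nrings; n++) {
        ptrdiff_t ofs = 512 + 64 * (ptrdiff_t)n;
        nifp->ring_ofs[n] = ofs;
        ((struct mock_ring *)((char *)nifp + ofs))->id = n;
    }
    return mem;
}
```

After `mem = build_mock_region(&off)`, the expression `MOCK_RXRING(MOCK_IF(mem, off), 0)` points at the ring tagged 3, because indexes 0 and 1 are the TX rings and index 2 is the host TX ring. No absolute pointer is stored anywhere in the region, so the same lookup works whatever address the region is mapped at.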

Walking through the working code of the netmap example

Following the netmap example, we now reach the code where netmap does its real work.

for (;;) {
    struct pollfd x[1];
    /*
    In the netmap code, NETMAP_RXRING is defined as follows:
    #define NETMAP_RXRING(nifp, index)                  \
        ((struct netmap_ring *)((char *)(nifp) +        \
        (nifp)->ring_ofs[index + (nifp)->ni_tx_rings + 1] ) )
    It returns the receive ring buffer of the NIC.

    A small gripe: why is Receive abbreviated as RX? I have seen it elsewhere too.
    */
    struct netmap_ring *ring = NETMAP_RXRING(nifp, 0);
    x[0].fd = fd;
    x[0].events = POLLIN;
    /* Wait up to 1 second for a receive event */
    poll(x, 1, 1000);
    /* ring->avail packets have been received */
    for ( ; ring->avail > 0 ; ring->avail--) {
        /* Index of the current packet */
        i = ring->cur;
        /* Get the corresponding packet buffer */
        buf = NETMAP_BUF(ring, i);
        /* Process the packet in user space */
        use_data(buf, ring->slot[i].len);
        /* Move on to the next packet to process */
        ring->cur = NETMAP_NEXT(ring, i);
    }
}
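
The inner loop can be exercised without a NIC by running it over a mock ring. The types `demo_slot` and `demo_ring` and the helpers `ring_next` and `drain_ring` are hypothetical names of mine. The sketch assumes that advancing cur wraps back to slot 0 at the end of the ring, which is what the next-slot macro in the example is for. It sums the packet lengths where the example calls use_data.

```c
#include <assert.h>

#define DEMO_SLOTS 4

/* Hypothetical miniature of a netmap slot and ring; only the used fields. */
struct demo_slot {
    unsigned len;
};

struct demo_ring {
    unsigned num_slots;
    unsigned cur;    /* next slot to process */
    unsigned avail;  /* slots ready for user space */
    struct demo_slot slot[DEMO_SLOTS];
};

/* Advance an index by one slot, wrapping at the end of the ring. */
static unsigned ring_next(const struct demo_ring *r, unsigned i)
{
    return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* Consume every available slot; returns the total number of bytes seen. */
static unsigned drain_ring(struct demo_ring *r)
{
    unsigned total = 0;

    assert(r->avail <= r->num_slots);
    for (; r->avail > 0; r->avail--) {
        unsigned i = r->cur;

        total += r->slot[i].len;   /* stands in for use_data(buf, len) */
        r->cur = ring_next(r, i);
    }
    return total;
}
```

Starting with cur at 3 and avail at 3 in a 4-slot ring whose slot lengths are 10, 20, 30 and 40, the loop visits slots 3, 0 and 1 in that order, returns 70, and leaves cur at 2 with avail at 0. The kernel refills avail on the next poll or NIOCRXSYNC.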
