
Linux Socket 子系统

socket 在所有的网络操作系统中都是必不可少的,而且在所有的网络应用程序中也是必不可少的。它是网络通信中应用程序对应的进程和网络协议之间的接口。本文将介绍其使用和内核实现原理,参考 Linux 内核版本为 v5.4。

Background

socket 在网络系统中的作用是:

  1. socket 位于协议之上,屏蔽了不同网络协议之间的差异;
  2. socket是网络编程的入口,它提供了大量的系统调用,构成了网络程序的主体;
  3. 在Linux系统中,socket属于文件系统的一部分,网络通信可以被看作是对文件的读取,使得我们对网络的控制和对文件的控制一样方便。

结构

无论是用 socket 操作 TCP,还是 UDP,我们首先都要调用 socket 函数。

int socket(int domain, int type, int protocol);

socket 函数用于创建一个 socket 的文件描述符,唯一标识一个 socket。之所以把它叫作文件描述符,是因为在内核中,我们会创建类似文件系统的数据结构,并且后续的操作都会用到它。

在创建 socket 的时候,有三个参数:

  • family(即函数签名中的 domain 参数)

    表示地址族。并不是所有的 socket 都要通过 IP 进行通信,还有其他的通信方式。例如,下面的定义中,unix domain socket 就是通过本地文件进行通信的,不需要 IP 地址。只不过,通过 IP 地址是最常用的模式,所以我们这里着重分析这种模式。

    include/linux/socket.h
    /* Supported address families. */
    #define AF_UNSPEC 0
    #define AF_UNIX 1 /* Unix domain sockets */
    #define AF_LOCAL 1 /* POSIX name for AF_UNIX */
    #define AF_INET 2 /* Internet IP Protocol */
    #define AF_AX25 3 /* Amateur Radio AX.25 */
    #define AF_IPX 4 /* Novell IPX */
    #define AF_APPLETALK 5 /* AppleTalk DDP */
    #define AF_NETROM 6 /* Amateur Radio NET/ROM */
    #define AF_BRIDGE 7 /* Multiprotocol bridge */
    #define AF_ATMPVC 8 /* ATM PVCs */
    #define AF_X25 9 /* Reserved for X.25 project */
    #define AF_INET6 10 /* IP version 6 */
    // ...
    #define AF_XDP 44 /* XDP sockets */

    #define AF_MAX 45 /* For now.. */

    对应的也有 Protocol Family:

    include/linux/socket.h
    /* Protocol families, same as address families. */
    #define PF_UNSPEC AF_UNSPEC
    #define PF_UNIX AF_UNIX
    #define PF_LOCAL AF_LOCAL
    #define PF_INET AF_INET
    #define PF_AX25 AF_AX25
    #define PF_IPX AF_IPX
    #define PF_APPLETALK AF_APPLETALK
    // ...
  • type

    常用的 Socket 类型有三种,分别是 SOCK_STREAM、SOCK_DGRAM 和 SOCK_RAW。

    include/linux/net.h
    enum sock_type {
    SOCK_STREAM = 1, // stream (connection) socket
    SOCK_DGRAM = 2, // datagram (conn.less) socket
    SOCK_RAW = 3, // raw socket
    SOCK_RDM = 4, // reliably-delivered message
    SOCK_SEQPACKET = 5, // sequential packet socket
    SOCK_DCCP = 6, // Datagram Congestion Control Protocol socket
    SOCK_PACKET = 10, // linux specific way of getting packets at the dev level.
    // For writing rarp and other similar things on the user level.
    };
    #define SOCK_MAX (SOCK_PACKET + 1)
    • SOCK_STREAM 是面向数据流的,协议 IPPROTO_TCP 属于这种类型。
    • SOCK_DGRAM 是面向数据报的,协议 IPPROTO_UDP 属于这种类型。如果在内核里面看的话,IPPROTO_ICMP 也属于这种类型。
    • SOCK_RAW 是原始的 IP 包,IPPROTO_IP 属于这种类型。

    这一节,我们重点看 SOCK_STREAM 类型和 IPPROTO_TCP 协议。

  • protocol

    第三个参数是 protocol,是协议。协议数目是比较多的,也就是说,多个协议会属于同一种类型。

    include/uapi/linux/in.h
    #if __UAPI_DEF_IN_IPPROTO
    /* Standard well-defined IP protocols. */
    enum {
    IPPROTO_IP = 0, /* Dummy protocol for TCP */
    #define IPPROTO_IP IPPROTO_IP
    IPPROTO_ICMP = 1, /* Internet Control Message Protocol */
    #define IPPROTO_ICMP IPPROTO_ICMP
    IPPROTO_IGMP = 2, /* Internet Group Management Protocol */
    #define IPPROTO_IGMP IPPROTO_IGMP
    IPPROTO_IPIP = 4, /* IPIP tunnels (older KA9Q tunnels use 94) */
    #define IPPROTO_IPIP IPPROTO_IPIP
    IPPROTO_TCP = 6, /* Transmission Control Protocol */
    #define IPPROTO_TCP IPPROTO_TCP
    IPPROTO_EGP = 8, /* Exterior Gateway Protocol */
    #define IPPROTO_EGP IPPROTO_EGP
    IPPROTO_PUP = 12, /* PUP protocol */
    #define IPPROTO_PUP IPPROTO_PUP
    IPPROTO_UDP = 17, /* User Datagram Protocol */
    #define IPPROTO_UDP IPPROTO_UDP
    IPPROTO_IDP = 22, /* XNS IDP protocol */
    #define IPPROTO_IDP IPPROTO_IDP
    IPPROTO_TP = 29, /* SO Transport Protocol Class 4 */
    #define IPPROTO_TP IPPROTO_TP
    IPPROTO_DCCP = 33, /* Datagram Congestion Control Protocol */
    #define IPPROTO_DCCP IPPROTO_DCCP
    IPPROTO_IPV6 = 41, /* IPv6-in-IPv4 tunnelling */
    #define IPPROTO_IPV6 IPPROTO_IPV6
    IPPROTO_RSVP = 46, /* RSVP Protocol */
    #define IPPROTO_RSVP IPPROTO_RSVP
    IPPROTO_GRE = 47, /* Cisco GRE tunnels (rfc 1701,1702) */
    #define IPPROTO_GRE IPPROTO_GRE
    IPPROTO_ESP = 50, /* Encapsulation Security Payload protocol */
    #define IPPROTO_ESP IPPROTO_ESP
    IPPROTO_AH = 51, /* Authentication Header protocol */
    #define IPPROTO_AH IPPROTO_AH
    IPPROTO_MTP = 92, /* Multicast Transport Protocol */
    #define IPPROTO_MTP IPPROTO_MTP
    IPPROTO_BEETPH = 94, /* IP option pseudo header for BEET */
    #define IPPROTO_BEETPH IPPROTO_BEETPH
    IPPROTO_ENCAP = 98, /* Encapsulation Header */
    #define IPPROTO_ENCAP IPPROTO_ENCAP
    IPPROTO_PIM = 103, /* Protocol Independent Multicast */
    #define IPPROTO_PIM IPPROTO_PIM
    IPPROTO_COMP = 108, /* Compression Header Protocol */
    #define IPPROTO_COMP IPPROTO_COMP
    IPPROTO_SCTP = 132, /* Stream Control Transport Protocol */
    #define IPPROTO_SCTP IPPROTO_SCTP
    IPPROTO_UDPLITE = 136, /* UDP-Lite (RFC 3828) */
    #define IPPROTO_UDPLITE IPPROTO_UDPLITE
    IPPROTO_MPLS = 137, /* MPLS in IP (RFC 4023) */
    #define IPPROTO_MPLS IPPROTO_MPLS
    IPPROTO_RAW = 255, /* Raw IP packets */
    #define IPPROTO_RAW IPPROTO_RAW
    IPPROTO_MAX
    };
    #endif

    上面的 type 和 protocol 并不能随意组合,如 SOCK_STREAM 不可以跟 IPPROTO_UDP 组合。当 protocol 为 0 时,会自动选择 type 对应的默认协议。
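    下面用一段简单的用户态代码(非内核源码,仅作演示)验证这一点:protocol 传 0 与显式传 IPPROTO_TCP / IPPROTO_UDP 是等价的。

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>   /* IPPROTO_TCP / IPPROTO_UDP */

    int main(void)
    {
        /* protocol 为 0:内核自动选择 SOCK_STREAM 对应的默认协议,即 TCP */
        int tcp_fd  = socket(AF_INET, SOCK_STREAM, 0);
        /* 等价于显式指定 IPPROTO_TCP */
        int tcp_fd2 = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        /* UDP:SOCK_DGRAM + IPPROTO_UDP(或 0) */
        int udp_fd  = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

        if (tcp_fd < 0 || tcp_fd2 < 0 || udp_fd < 0) {
            perror("socket");
            return 1;
        }
        printf("tcp_fd=%d tcp_fd2=%d udp_fd=%d\n", tcp_fd, tcp_fd2, udp_fd);

        close(tcp_fd);
        close(tcp_fd2);
        close(udp_fd);
        return 0;
    }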

为了管理 family、type、protocol 这三个分类层次,内核会创建对应的数据结构。原文此处有几张示意图,分别展示 UDP、TCP 以及三次握手建立连接过程中这些数据结构之间的关系。

Socket vs Sock

内核中socket有两个数据结构,一个是socket,另一个是sock:

  • socket 是 general BSD socket,是应用程序和4层协议之间的接口,屏蔽掉了相关的4层协议部分
  • sock 是内核中保存 socket 所需要使用的相关的4层协议的信息
  • socket和sock这两个结构都有保存对方的指针,因此可以很容易的存取对方

同样的,在 socket 和 sock 层分别有 proto_ops 和 proto 两个数据结构:

  • socket 的 ops 域保存了对于不同 socket 类型的操作函数
  • sock 中有一个 sock_common(__sk_common)域,其中的 skc_prot 域保存的是相应协议族的操作函数集合
  • proto_ops 相当于对 proto 的一层封装,最终会在 proto_ops 的实现中调用 proto 中的函数

内核调用路径

内核在调用相关操作时的路径是:

  • 先直接调用 socket 的 ops 域中的函数
  • 然后在 ops 域的实现中,调用相应 sock 结构体里 sock_common 域的 skc_prot 操作集中对应的函数

举个例子,假设现在我们使用 TCP 协议,然后调用 bind 方法,内核会先进入系统调用处理函数 __sys_bind:

int __sys_bind(int fd, struct sockaddr __user *umyaddr, int addrlen)
{
struct socket *sock;

/* ... */
err = sock->ops->bind(sock,
(struct sockaddr *)
&address, addrlen);
/* ... */
}

可以看到它调用的是 ops 域的 bind 方法。而这时我们的 ops 域是 inet_stream_ops,来看它的 bind 方法 inet_bind(属于 socket 层的 proto_ops 操作集):

// net/ipv4/af_inet.c
int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
{
struct sock *sk = sock->sk;
int err;

/* If the socket has its own bind function then use it. (RAW) */
if (sk->sk_prot->bind) {
return sk->sk_prot->bind(sk, uaddr, addr_len);
}

/* ... */

return __inet_bind(sk, uaddr, addr_len, false, true);
}

它会接着调用 sock 结构的 sk_prot 域(也就是 sock_common 的 skc_prot 域)的 bind 方法,而此时我们的 skc_prot 的值是 tcp_prot。不过对于 TCP 而言,tcp_prot 中并没有定义 bind,所以最终还是调用 __inet_bind。
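为了更直观地理解这种两层函数指针分发(socket 层的 proto_ops 最终调用 sock 层的 proto)的模式,下面给出一个与内核无关的最小化 C 示意,demo_* 均为虚构的演示名字:

#include <stdio.h>

/* 模拟 proto 层:协议相关的具体实现 */
struct demo_proto {
    int (*bind)(void *sk);   /* 可能为 NULL,表示使用通用实现 */
};

/* 模拟 sock 层:保存指向 proto 的指针 */
struct demo_sock {
    struct demo_proto *prot;
};

/* 模拟 proto_ops 层:socket 层的通用入口 */
static int demo_generic_bind(struct demo_sock *sk)
{
    /* 如果协议提供了自己的 bind(如 RAW),优先用它 */
    if (sk->prot->bind)
        return sk->prot->bind(sk);
    printf("fall back to generic bind path\n");
    return 0;
}

static struct demo_proto demo_tcp_prot = { .bind = NULL };  /* TCP 没有自己的 bind */

int main(void)
{
    struct demo_sock sk = { .prot = &demo_tcp_prot };
    return demo_generic_bind(&sk);
}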

原文此处有一张 socket 与 sock 关系的示意图;下面逐个来看这些数据结构。

socket

include/linux/net.h
// struct socket - general BSD socket
struct socket {
socket_state state; // socket state (%SS_CONNECTED, etc)
short type; // socket type (%SOCK_STREAM, etc)
unsigned long flags; // socket flags (%SOCK_NOSPACE, etc)
struct socket_wq *wq; // wait queue for several uses
struct file *file; // File back pointer for gc
struct sock *sk; // internal networking protocol agnostic socket representation
const struct proto_ops *ops; // protocol specific socket operations
};

proto_ops

include/linux/net.h
struct proto_ops {
int family;
struct module *owner;
int (*release) (struct socket *sock);
int (*bind) (struct socket *sock, struct sockaddr *myaddr, int sockaddr_len);
int (*connect) (struct socket *sock, struct sockaddr *vaddr, int sockaddr_len, int flags);
int (*socketpair)(struct socket *sock1, struct socket *sock2);
int (*accept) (struct socket *sock, struct socket *newsock, int flags, bool kern);
int (*getname) (struct socket *sock, struct sockaddr *addr, int peer);
__poll_t (*poll) (struct file *file, struct socket *sock, struct poll_table_struct *wait);
int (*ioctl) (struct socket *sock, unsigned int cmd, unsigned long arg);
int (*listen) (struct socket *sock, int len);
int (*shutdown) (struct socket *sock, int flags);
int (*setsockopt)(struct socket *sock, int level, int optname, char __user *optval, unsigned int optlen);
int (*getsockopt)(struct socket *sock, int level, int optname, char __user *optval, int __user *optlen);
int (*sendmsg) (struct socket *sock, struct msghdr *m, size_t total_len);
int (*recvmsg) (struct socket *sock, struct msghdr *m, size_t total_len, int flags);
int (*mmap) (struct file *file, struct socket *sock, struct vm_area_struct * vma);
ssize_t (*sendpage) (struct socket *sock, struct page *page, int offset, size_t size, int flags);
ssize_t (*splice_read)(struct socket *sock, loff_t *ppos, struct pipe_inode_info *pipe, size_t len, unsigned int flags);
int (*set_peek_off)(struct sock *sk, int val);
int (*peek_len)(struct socket *sock);
};

socket state

include/uapi/linux/net.h
typedef enum {
SS_FREE = 0, /* not allocated */
SS_UNCONNECTED, /* unconnected to any socket */
SS_CONNECTING, /* in process of connecting */
SS_CONNECTED, /* connected to socket */
SS_DISCONNECTING /* in process of disconnecting */
} socket_state;

sock

sock_common

include/net/sock.h
/**
* struct sock_common - minimal network layer representation of sockets
*
* This is the minimal network layer representation of sockets, the header
* for struct sock and struct inet_timewait_sock.
*/
struct sock_common {
/* skc_daddr and skc_rcv_saddr must be grouped on a 8 bytes aligned
* address on 64bit arches : cf INET_MATCH()
*/
union {
__addrpair skc_addrpair;
struct {
__be32 skc_daddr; // Foreign IPv4 addr
__be32 skc_rcv_saddr; // Bound local IPv4 addr
};
};
union {
unsigned int skc_hash; // hash value used with various protocol lookup tables
__u16 skc_u16hashes[2]; // two u16 hash values used by UDP lookup tables
};
/* skc_dport && skc_num must be grouped as well */
union {
__portpair skc_portpair;
struct {
__be16 skc_dport; // placeholder for inet_dport/tw_dport
__u16 skc_num; // placeholder for inet_num/tw_num
};
};

unsigned short skc_family; // network address family
volatile unsigned char skc_state; // Connection state
unsigned char skc_reuse:4; // %SO_REUSEADDR setting
unsigned char skc_reuseport:1; // %SO_REUSEPORT setting
unsigned char skc_ipv6only:1;
unsigned char skc_net_refcnt:1;
int skc_bound_dev_if; // bound device index if != 0
union {
struct hlist_node skc_bind_node; // bind hash linkage for various protocol lookup tables
struct hlist_node skc_portaddr_node; // second hash linkage for UDP/UDP-Lite protocol
};
struct proto *skc_prot; // protocol handlers inside a network family
possible_net_t skc_net; // reference to the network namespace of this socket

#if IS_ENABLED(CONFIG_IPV6)
struct in6_addr skc_v6_daddr;
struct in6_addr skc_v6_rcv_saddr;
#endif

atomic64_t skc_cookie;

/* following fields are padding to force
* offset(struct sock, sk_refcnt) == 128 on 64bit arches
* assuming IPV6 is enabled. We use this padding differently
* for different kind of 'sockets'
*/
union {
unsigned long skc_flags; /* place holder for sk_flags
* %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
* %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
*/
struct sock *skc_listener; /* request_sock */
struct inet_timewait_death_row *skc_tw_dr; /* inet_timewait_sock */
};
/*
* fields between dontcopy_begin/dontcopy_end
* are not copied in sock_copy()
*/
/* private: */
int skc_dontcopy_begin[0];
/* public: */
union {
struct hlist_node skc_node; // main hash linkage for various protocol lookup tables
struct hlist_nulls_node skc_nulls_node; // main hash linkage for TCP/UDP/UDP-Lite protocol
};
unsigned short skc_tx_queue_mapping; // tx queue number for this connection
#ifdef CONFIG_XPS
unsigned short skc_rx_queue_mapping; // rx queue number for this connection
#endif
union {
int skc_incoming_cpu; // record/match cpu processing incoming packets
u32 skc_rcv_wnd;
u32 skc_tw_rcv_nxt; /* struct tcp_timewait_sock */
};

refcount_t skc_refcnt; // reference count
/* private: */
int skc_dontcopy_end[0];
union {
u32 skc_rxhash;
u32 skc_window_clamp;
u32 skc_tw_snd_nxt; /* struct tcp_timewait_sock */
};
/* public: */
};

sock

include/net/sock.h
struct sock {
/*
* Now struct inet_timewait_sock also uses sock_common, so please just
* don't add nothing before this first member (__sk_common) --acme
*/
struct sock_common __sk_common; // shared layout with inet_timewait_sock

socket_lock_t sk_lock; // synchronizer
atomic_t sk_drops; // raw/udp drops counter
int sk_rcvlowat; // %SO_RCVLOWAT setting
struct sk_buff_head sk_error_queue; // rarely used
struct sk_buff_head sk_receive_queue; // incoming packets
/*
* The backlog queue is special, it is always used with
* the per-socket spinlock held and requires low latency
* access. Therefore we special case it's implementation.
* Note : rmem_alloc is in this structure to fill a hole
* on 64bit arches, not because its logically part of
* backlog.
*/
struct {
atomic_t rmem_alloc;
int len;
struct sk_buff *head;
struct sk_buff *tail;
} sk_backlog; // always used with the per-socket spinlock held
#define sk_rmem_alloc sk_backlog.rmem_alloc

int sk_forward_alloc; // space allocated forward
#ifdef CONFIG_NET_RX_BUSY_POLL
unsigned int sk_ll_usec; // usecs to busypoll when there is no data
/* ===== mostly read cache line ===== */
unsigned int sk_napi_id; // id of the last napi context to receive data for sk
#endif
int sk_rcvbuf; // size of receive buffer in bytes

struct sk_filter __rcu *sk_filter; // socket filtering instructions
union {
struct socket_wq __rcu *sk_wq; // sock wait queue and async head
struct socket_wq *sk_wq_raw;
};

struct dst_entry *sk_rx_dst; // receive input route used by early demux
struct dst_entry __rcu *sk_dst_cache; // destination cache
atomic_t sk_omem_alloc; // "o" is "option" or "other"
int sk_sndbuf; // size of send buffer in bytes

/* ===== cache line for TX ===== */
int sk_wmem_queued; // persistent queue size
refcount_t sk_wmem_alloc; // transmit queue bytes committed
unsigned long sk_tsq_flags; // TCP Small Queues flags
union {
struct sk_buff *sk_send_head; // front of stuff to transmit
struct rb_root tcp_rtx_queue;
};
struct sk_buff_head sk_write_queue; // Packet sending queue
__s32 sk_peek_off; // current peek_offset value
int sk_write_pending; // a write to stream socket waits to start
__u32 sk_dst_pending_confirm; // need to confirm neighbour
u32 sk_pacing_status; // see enum sk_pacing, Pacing status (requested, handled by sch_fq)
long sk_sndtimeo; // %SO_SNDTIMEO setting
struct timer_list sk_timer; // sock cleanup timer
__u32 sk_priority; // %SO_PRIORITY setting
__u32 sk_mark; // generic packet mark
u32 sk_pacing_rate; // bytes per second, Pacing rate (if supported by transport/packet scheduler)
u32 sk_max_pacing_rate; // Maximum pacing rate (%SO_MAX_PACING_RATE)
struct page_frag sk_frag; // cached page frag
netdev_features_t sk_route_caps; // route capabilities (e.g. %NETIF_F_TSO)
netdev_features_t sk_route_nocaps; // forbidden route capabilities (e.g NETIF_F_GSO_MASK)
netdev_features_t sk_route_forced_caps;
int sk_gso_type; // GSO type (e.g. %SKB_GSO_TCPV4)
unsigned int sk_gso_max_size; // Maximum GSO segment size to build
gfp_t sk_allocation; // allocation mode
__u32 sk_txhash; // computed flow hash for use on transmit

/*
* Because of non atomicity rules, all
* changes are protected by socket lock.
*/
unsigned int __sk_flags_offset[0]; // empty field used to determine location of bitfield

unsigned int sk_padding : 1, // unused element for alignment
sk_kern_sock : 1, // True if sock is using kernel lock classes
sk_no_check_tx : 1, // %SO_NO_CHECK setting, set checksum in TX packets
sk_no_check_rx : 1, // allow zero checksum in RX packets
sk_userlocks : 4, // %SO_SNDBUF and %SO_RCVBUF settings
sk_protocol : 8, // which protocol this socket belongs in this network family
sk_type : 16; // socket type (%SOCK_STREAM, etc)

u16 sk_gso_max_segs; // Maximum number of GSO segments
u8 sk_pacing_shift; // scaling factor for TCP Small Queues
unsigned long sk_lingertime; // %SO_LINGER l_linger setting
struct proto *sk_prot_creator; // sk_prot of original sock creator (see ipv6_setsockopt, IPV6_ADDRFORM for instance)
rwlock_t sk_callback_lock; // used with the callbacks in the end of this struct
int sk_err, // last error
sk_err_soft; // errors that don't cause failure but are the cause of a persistent failure not just 'timed out'
u32 sk_ack_backlog; // current listen backlog
u32 sk_max_ack_backlog; // listen backlog set in listen()
kuid_t sk_uid; // user id of owner
struct pid *sk_peer_pid; // &struct pid for this socket's peer
const struct cred *sk_peer_cred; // %SO_PEERCRED setting
long sk_rcvtimeo; // %SO_RCVTIMEO setting
ktime_t sk_stamp; // time stamp of last packet received
u16 sk_tsflags; // SO_TIMESTAMPING socket options
u8 sk_shutdown; // mask of %SEND_SHUTDOWN and/or %RCV_SHUTDOWN
u32 sk_tskey; // counter to disambiguate concurrent tstamp requests
atomic_t sk_zckey; // counter to order MSG_ZEROCOPY notifications

u8 sk_clockid; // clockid used by time-based scheduling (SO_TXTIME)
u8 sk_txtime_deadline_mode : 1, // set deadline mode for SO_TXTIME
sk_txtime_report_errors : 1,
sk_txtime_unused : 6; // unused txtime flags

struct socket *sk_socket; // Identd and reporting IO signals
void *sk_user_data; // RPC layer private data
struct sock_cgroup_data sk_cgrp_data; // cgroup data for this cgroup
struct mem_cgroup *sk_memcg; // this socket's memory cgroup association
void (*sk_state_change)(struct sock *sk); // callback to indicate change in the state of the sock
void (*sk_data_ready)(struct sock *sk); // callback to indicate there is data to be processed
void (*sk_write_space)(struct sock *sk); // callback to indicate there is bf sending space available
void (*sk_error_report)(struct sock *sk); // callback to indicate errors (e.g. %MSG_ERRQUEUE)
int (*sk_backlog_rcv)(struct sock *sk, struct sk_buff *skb); // callback to process the backlog
void (*sk_destruct)(struct sock *sk); // called at sock freeing time, i.e. when all refcnt == 0
struct sock_reuseport __rcu *sk_reuseport_cb; // reuseport group container
struct rcu_head sk_rcu; // used during RCU grace period
};

proto

很重要的是,sock 连接着对应的 proto 操作集,操作集中维护着对应的 hashinfo,而 hashinfo 存储着 sock、端口等重要信息。

include/net/sock.h
/* Networking protocol blocks we attach to sockets.
* socket layer -> transport layer interface
*/
struct proto {
void (*close)(struct sock *sk, long timeout);
int (*pre_connect)(struct sock *sk, struct sockaddr *uaddr, int addr_len);
int (*connect)(struct sock *sk, struct sockaddr *uaddr, int addr_len);
int (*disconnect)(struct sock *sk, int flags);

struct sock * (*accept)(struct sock *sk, int flags, int *err, bool kern);

int (*ioctl)(struct sock *sk, int cmd, unsigned long arg);
int (*init)(struct sock *sk);
void (*destroy)(struct sock *sk);
void (*shutdown)(struct sock *sk, int how);
int (*setsockopt)(struct sock *sk, int level, int optname, char __user *optval, unsigned int optlen);
int (*getsockopt)(struct sock *sk, int level, int optname, char __user *optval, int __user *option);
void (*keepalive)(struct sock *sk, int valbool);
int (*sendmsg)(struct sock *sk, struct msghdr *msg, size_t len);
int (*recvmsg)(struct sock *sk, struct msghdr *msg, size_t len, int noblock, int flags, int *addr_len);
int (*sendpage)(struct sock *sk, struct page *page, int offset, size_t size, int flags);
int (*bind)(struct sock *sk, struct sockaddr *uaddr, int addr_len);

int (*backlog_rcv) (struct sock *sk, struct sk_buff *skb);

void (*release_cb)(struct sock *sk);

/* Keeping track of sk's, looking them up, and port selection methods. */
int (*hash)(struct sock *sk);
void (*unhash)(struct sock *sk);
void (*rehash)(struct sock *sk);
int (*get_port)(struct sock *sk, unsigned short snum);

bool (*stream_memory_free)(const struct sock *sk);
bool (*stream_memory_read)(const struct sock *sk);
/* Memory pressure */
void (*enter_memory_pressure)(struct sock *sk);
void (*leave_memory_pressure)(struct sock *sk);
atomic_long_t *memory_allocated; /* Current allocated memory. */
struct percpu_counter *sockets_allocated; /* Current number of sockets. */
/*
* Pressure flag: try to collapse.
* Technical note: it is used by multiple contexts non atomically.
* All the __sk_mem_schedule() is of this nature: accounting
* is strict, actions are advisory and have some latency.
*/
unsigned long *memory_pressure;
long *sysctl_mem;

int *sysctl_wmem;
int *sysctl_rmem;
u32 sysctl_wmem_offset;
u32 sysctl_rmem_offset;

int max_header;
bool no_autobind;

struct kmem_cache *slab;
unsigned int obj_size;
slab_flags_t slab_flags;
unsigned int useroffset; /* Usercopy region offset */
unsigned int usersize; /* Usercopy region size */

struct percpu_counter *orphan_count;

struct request_sock_ops *rsk_prot;
struct timewait_sock_ops *twsk_prot;

union {
struct inet_hashinfo *hashinfo;
struct udp_table *udp_table;
struct raw_hashinfo *raw_hash;
struct smc_hashinfo *smc_hash;
} h;

struct module *owner;

char name[32];

struct list_head node;
int (*diag_destroy)(struct sock *sk, int err);
} __randomize_layout;

inet_sock

inet_sock 是 INET 域的 socket 表示,是对 struct sock 的一个扩展,提供 INET 域的一些属性,如TTL,组播列表,IP地址,端口等;

include/net/inet_sock.h
/** struct inet_sock - representation of INET sockets
*
* @sk - ancestor class
* @pinet6 - pointer to IPv6 control block
* @inet_daddr - Foreign IPv4 addr
* @inet_rcv_saddr - Bound local IPv4 addr
* @inet_dport - Destination port
* @inet_num - Local port
* @inet_saddr - Sending source
* @uc_ttl - Unicast TTL
* @inet_sport - Source port
*/
struct inet_sock {
struct sock sk;
#define inet_daddr sk.__sk_common.skc_daddr
#define inet_rcv_saddr sk.__sk_common.skc_rcv_saddr
#define inet_dport sk.__sk_common.skc_dport
#define inet_num sk.__sk_common.skc_num

__be32 inet_saddr;
__s16 uc_ttl;
__u16 cmsg_flags;
__be16 inet_sport;
__u16 inet_id;

/* ... */
};

raw_sock

它是 RAW 协议的一个 socket 表示,是对 struct inet_sock 的扩展,它要处理与ICMP相关的内容;

include/net/raw.h
struct raw_sock {
/* inet_sock has to be the first member */
struct inet_sock inet;
struct icmp_filter filter;
u32 ipmr_table;
};

udp_sock

它是UDP协议的socket表示,是对 struct inet_sock 的扩展;

include/linux/udp.h
struct udp_sock {
/* inet_sock has to be the first member */
struct inet_sock inet;
#define udp_port_hash inet.sk.__sk_common.skc_u16hashes[0]
#define udp_portaddr_hash inet.sk.__sk_common.skc_u16hashes[1]
#define udp_portaddr_node inet.sk.__sk_common.skc_portaddr_node
int pending; /* Any pending frames ? */
unsigned int corkflag; /* Cork is required */
__u8 encap_type; /* Is this an Encapsulation socket? */
unsigned char no_check6_tx:1,/* Send zero UDP6 checksums on TX? */
no_check6_rx:1;/* Allow zero UDP6 checksums on RX? */
/*
* Following member retains the information to create a UDP header
* when the socket is uncorked.
*/
__u16 len; /* total length of pending frames */
__u16 gso_size;

/*
* For encapsulation sockets.
*/
int (*encap_rcv)(struct sock *sk, struct sk_buff *skb);
void (*encap_destroy)(struct sock *sk);

/* GRO functions for UDP socket */
struct sk_buff * (*gro_receive)(struct sock *sk, struct list_head *head, struct sk_buff *skb);
int (*gro_complete)(struct sock *sk, struct sk_buff *skb,int nhoff);

/* udp_recvmsg try to use this before splicing sk_receive_queue */
struct sk_buff_head reader_queue ____cacheline_aligned_in_smp;

/* This field is dirtied by udp_recvmsg() */
int forward_deficit;
};

inet_connection_sock

inet_connection_sock 是所有 面向连接 的socket表示,是对 struct inet_sock 的扩展;它的第一个域就是inet_sock。主要维护了与其绑定的端口结构,以及监听等状态过程中的信息。

include/net/inet_connection_sock.h
// inet_connection_sock - INET connection oriented sock	   
struct inet_connection_sock {
/* inet_sock has to be the first member! */
struct inet_sock icsk_inet;
struct request_sock_queue icsk_accept_queue; // FIFO of established children
struct inet_bind_bucket *icsk_bind_hash; // Bind node
unsigned long icsk_timeout; // Timeout
struct timer_list icsk_retransmit_timer; // Resend (no ack)
struct timer_list icsk_delack_timer;
__u32 icsk_rto; // Retransmit timeout
__u32 icsk_pmtu_cookie; // Last pmtu seen by socket
const struct tcp_congestion_ops *icsk_ca_ops; // Pluggable congestion control hook
const struct inet_connection_sock_af_ops *icsk_af_ops; // Operations which are AF_INET{4,6} specific
const struct tcp_ulp_ops *icsk_ulp_ops; // Pluggable ULP control hook
void *icsk_ulp_data; // ULP private data
void (*icsk_clean_acked)(struct sock *sk, u32 acked_seq); // Clean acked data hook
struct hlist_node icsk_listen_portaddr_node; // hash to the portaddr listener hashtable
unsigned int (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
__u8 icsk_ca_state:6, // Congestion control state
icsk_ca_setsockopt:1,
icsk_ca_dst_locked:1;
__u8 icsk_retransmits; // Number of unrecovered [RTO] timeouts
__u8 icsk_pending; // Scheduled timer event
__u8 icsk_backoff; // Backoff
__u8 icsk_syn_retries; // Number of allowed SYN (or equivalent) retries
__u8 icsk_probes_out; // unanswered 0 window probes
__u16 icsk_ext_hdr_len; // Network protocol overhead (IP/IPv6 options)
struct {
__u8 pending; /* ACK is pending */
__u8 quick; /* Scheduled number of quick acks */
__u8 pingpong; /* The session is interactive */
__u8 blocked; /* Delayed ACK was blocked by socket lock */
__u32 ato; /* Predicted tick of soft clock */
unsigned long timeout; /* Currently scheduled timeout */
__u32 lrcvtime; /* timestamp of last received data packet */
__u16 last_seg_size; /* Size of last incoming segment */
__u16 rcv_mss; /* MSS used for delayed ACK decisions */
} icsk_ack; // Delayed ACK control data
struct {
int enabled;

/* Range of MTUs to search */
int search_high;
int search_low;

/* Information on the current probe. */
int probe_size;

u32 probe_timestamp;
} icsk_mtup; // MTU probing control data
u32 icsk_user_timeout;

u64 icsk_ca_priv[88 / sizeof(u64)];
#define ICSK_CA_PRIV_SIZE (11 * sizeof(u64))
};

tcp_sock

tcp_sock 是 TCP 协议的 socket 表示,是对 struct inet_connection_sock 的扩展,主要增加了滑动窗口、拥塞控制等一些 TCP 专用属性;它的第一个域就是 inet_connection_sock。

include/linux/tcp.h
struct tcp_sock {
/* inet_connection_sock has to be the first member of tcp_sock */
struct inet_connection_sock inet_conn;
u16 tcp_header_len; /* Bytes of tcp header to send */
u16 gso_segs; /* Max number of segs per GSO packet */

/*
* Header prediction flags
* 0x5?10 << 16 + snd_wnd in net byte order
*/
__be32 pred_flags;

/*
* RFC793 variables by their proper names. This means you can
* read the code and the spec side by side (and laugh ...)
* See RFC793 and RFC1122. The RFC writes these in capitals.
*/
u64 bytes_received; /* RFC4898 tcpEStatsAppHCThruOctetsReceived
* sum(delta(rcv_nxt)), or how many bytes
* were acked.
*/
u32 segs_in; /* RFC4898 tcpEStatsPerfSegsIn
* total number of segments in.
*/
u32 data_segs_in; /* RFC4898 tcpEStatsPerfDataSegsIn
* total number of data segments in.
*/
u32 rcv_nxt; /* What we want to receive next */
u32 copied_seq; /* Head of yet unread data */
u32 rcv_wup; /* rcv_nxt on last window update sent */
u32 snd_nxt; /* Next sequence we send */
u32 segs_out; /* RFC4898 tcpEStatsPerfSegsOut
* The total number of segments sent.
*/
u32 data_segs_out; /* RFC4898 tcpEStatsPerfDataSegsOut
* total number of data segments sent.
*/
u64 bytes_sent; /* RFC4898 tcpEStatsPerfHCDataOctetsOut
* total number of data bytes sent.
*/
u64 bytes_acked; /* RFC4898 tcpEStatsAppHCThruOctetsAcked
* sum(delta(snd_una)), or how many bytes
* were acked.
*/
u32 dsack_dups; /* RFC4898 tcpEStatsStackDSACKDups
* total number of DSACK blocks received
*/
u32 snd_una; /* First byte we want an ack for */
u32 snd_sml; /* Last byte of the most recently transmitted small packet */
u32 rcv_tstamp; /* timestamp of last received ACK (for keepalives) */
u32 lsndtime; /* timestamp of last sent data packet (for restart window) */
u32 last_oow_ack_time; /* timestamp of last out-of-window ACK */

u32 tsoffset; /* timestamp offset */

struct list_head tsq_node; /* anchor in tsq_tasklet.head list */
struct list_head tsorted_sent_queue; /* time-sorted sent but un-SACKed skbs */

u32 snd_wl1; /* Sequence for window update */
u32 snd_wnd; /* The window we expect to receive */
u32 max_window; /* Maximal window ever seen from peer */
u32 mss_cache; /* Cached effective mss, not including SACKS */

u32 window_clamp; /* Maximal window to advertise */
u32 rcv_ssthresh; /* Current window clamp */

/* Information of the most recently (s)acked skb */
struct tcp_rack {
u64 mstamp; /* (Re)sent time of the skb */
u32 rtt_us; /* Associated RTT */
u32 end_seq; /* Ending TCP sequence of the skb */
u32 last_delivered; /* tp->delivered at last reo_wnd adj */
u8 reo_wnd_steps; /* Allowed reordering window */
#define TCP_RACK_RECOVERY_THRESH 16
u8 reo_wnd_persist:5, /* No. of recovery since last adj */
dsack_seen:1, /* Whether DSACK seen after last adj */
advanced:1; /* mstamp advanced since last lost marking */
} rack;
u16 advmss; /* Advertised MSS */
u8 compressed_ack;
u32 chrono_start; /* Start time in jiffies of a TCP chrono */
u32 chrono_stat[3]; /* Time in jiffies for chrono_stat stats */
u8 chrono_type:2, /* current chronograph type */
rate_app_limited:1, /* rate_{delivered,interval_us} limited? */
fastopen_connect:1, /* FASTOPEN_CONNECT sockopt */
fastopen_no_cookie:1, /* Allow send/recv SYN+data without a cookie */
is_sack_reneg:1, /* in recovery from loss with SACK reneg? */
unused:2;
u8 nonagle : 4,/* Disable Nagle algorithm? */
thin_lto : 1,/* Use linear timeouts for thin streams */
recvmsg_inq : 1,/* Indicate # of bytes in queue upon recvmsg */
repair : 1,
frto : 1;/* F-RTO (RFC5682) activated in CA_Loss */
u8 repair_queue;
u8 syn_data:1, /* SYN includes data */
syn_fastopen:1, /* SYN includes Fast Open option */
syn_fastopen_exp:1,/* SYN includes Fast Open exp. option */
syn_fastopen_ch:1, /* Active TFO re-enabling probe */
syn_data_acked:1,/* data in SYN is acked by SYN-ACK */
save_syn:1, /* Save headers of SYN packet */
is_cwnd_limited:1,/* forward progress limited by snd_cwnd? */
syn_smc:1; /* SYN includes SMC */
u32 tlp_high_seq; /* snd_nxt at the time of TLP retransmit. */

/* RTT measurement */
u64 tcp_mstamp; /* most recent packet received/sent */
u32 srtt_us; /* smoothed round trip time << 3 in usecs */
u32 mdev_us; /* medium deviation */
u32 mdev_max_us; /* maximal mdev for the last rtt period */
u32 rttvar_us; /* smoothed mdev_max */
u32 rtt_seq; /* sequence number to update rttvar */
struct minmax rtt_min;

u32 packets_out; /* Packets which are "in flight" */
u32 retrans_out; /* Retransmitted packets out */
u32 max_packets_out; /* max packets_out in last window */
u32 max_packets_seq; /* right edge of max_packets_out flight */

u16 urg_data; /* Saved octet of OOB data and control flags */
u8 ecn_flags; /* ECN status bits. */
u8 keepalive_probes; /* num of allowed keep alive probes */
u32 reordering; /* Packet reordering metric. */
u32 reord_seen; /* number of data packet reordering events */
u32 snd_up; /* Urgent pointer */

/*
* Options received (usually on last packet, some only on SYN packets).
*/
struct tcp_options_received rx_opt;

/*
* Slow start and congestion control (see also Nagle, and Karn & Partridge)
*/
u32 snd_ssthresh; /* Slow start size threshold */
u32 snd_cwnd; /* Sending congestion window */
u32 snd_cwnd_cnt; /* Linear increase counter */
u32 snd_cwnd_clamp; /* Do not allow snd_cwnd to grow above this */
u32 snd_cwnd_used;
u32 snd_cwnd_stamp;
u32 prior_cwnd; /* cwnd right before starting loss recovery */
u32 prr_delivered; /* Number of newly delivered packets to
* receiver in Recovery. */
u32 prr_out; /* Total number of pkts sent during Recovery. */
u32 delivered; /* Total data packets delivered incl. rexmits */
u32 delivered_ce; /* Like the above but only ECE marked packets */
u32 lost; /* Total data packets lost incl. rexmits */
u32 app_limited; /* limited until "delivered" reaches this val */
u64 first_tx_mstamp; /* start of window send phase */
u64 delivered_mstamp; /* time we reached "delivered" */
u32 rate_delivered; /* saved rate sample: packets delivered */
u32 rate_interval_us; /* saved rate sample: time elapsed */

u32 rcv_wnd; /* Current receiver window */
u32 write_seq; /* Tail(+1) of data held in tcp send buffer */
u32 notsent_lowat; /* TCP_NOTSENT_LOWAT */
u32 pushed_seq; /* Last pushed seq, required to talk to windows */
u32 lost_out; /* Lost packets */
u32 sacked_out; /* SACK'd packets */

struct hrtimer pacing_timer;
struct hrtimer compressed_ack_timer;

/* from STCP, retrans queue hinting */
struct sk_buff* lost_skb_hint;
struct sk_buff *retransmit_skb_hint;

/* OOO segments go in this rbtree. Socket lock must be held. */
struct rb_root out_of_order_queue;
struct sk_buff *ooo_last_skb; /* cache rb_last(out_of_order_queue) */

/* SACKs data, these 2 need to be together (see tcp_options_write) */
struct tcp_sack_block duplicate_sack[1]; /* D-SACK block */
struct tcp_sack_block selective_acks[4]; /* The SACKS themselves*/

struct tcp_sack_block recv_sack_cache[4];

struct sk_buff *highest_sack; /* skb just after the highest
* skb with SACKed bit set
* (validity guaranteed only if
* sacked_out > 0)
*/

int lost_cnt_hint;

u32 prior_ssthresh; /* ssthresh saved at recovery start */
u32 high_seq; /* snd_nxt at onset of congestion */

u32 retrans_stamp; /* Timestamp of the last retransmit,
* also used in SYN-SENT to remember stamp of
* the first SYN. */
u32 undo_marker; /* snd_una upon a new recovery episode. */
int undo_retrans; /* number of undoable retransmissions. */
u64 bytes_retrans; /* RFC4898 tcpEStatsPerfOctetsRetrans
* Total data bytes retransmitted
*/
u32 total_retrans; /* Total retransmits for entire connection */

u32 urg_seq; /* Seq of received urgent pointer */
unsigned int keepalive_time; /* time before keep alive takes place */
unsigned int keepalive_intvl; /* time interval between keep alive probes */

int linger2;


/* Sock_ops bpf program related variables */
#ifdef CONFIG_BPF
u8 bpf_sock_ops_cb_flags; /* Control calling BPF programs
* values defined in uapi/linux/tcp.h
*/
#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
#else
#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
#endif

/* Receiver side RTT estimation */
u32 rcv_rtt_last_tsecr;
struct {
u32 rtt_us;
u32 seq;
u64 time;
} rcv_rtt_est;

/* Receiver queue space */
struct {
u32 space;
u32 seq;
u64 time;
} rcvq_space;

/* TCP-specific MTU probe information. */
struct {
u32 probe_seq_start;
u32 probe_seq_end;
} mtu_probe;
u32 mtu_info; /* We received an ICMP_FRAG_NEEDED / ICMPV6_PKT_TOOBIG
* while socket was owned by user.
*/

#ifdef CONFIG_TCP_MD5SIG
/* TCP AF-Specific parts; only used by MD5 Signature support so far */
const struct tcp_sock_af_ops *af_specific;

/* TCP MD5 Signature Option information */
struct tcp_md5sig_info __rcu *md5sig_info;
#endif

/* TCP fastopen related information */
struct tcp_fastopen_request *fastopen_req;
/* fastopen_rsk points to request_sock that resulted in this big
* socket. Used to retransmit SYNACKs etc.
*/
struct request_sock *fastopen_rsk;
u32 *saved_syn;
};

Socket

socket 系统调用

我们从 socket 系统调用开始:

SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
{
return __sys_socket(family, type, protocol);
}

int __sys_socket(int family, int type, int protocol)
{
int retval;
struct socket *sock;
int flags;

/* flags(SOCK_NONBLOCK / SOCK_CLOEXEC)复用了 type 参数的高位 */
flags = type & ~SOCK_TYPE_MASK;
if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
return -EINVAL;
type &= SOCK_TYPE_MASK;

if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
flags = (flags & ~SOCK_NONBLOCK) | O_NONBLOCK;

retval = sock_create(family, type, protocol, &sock);
if (retval < 0)
return retval;

return sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
}

这里面的代码比较容易看懂,socket 系统调用会调用 sock_create 创建一个 struct socket 结构,然后通过 sock_map_fd 和文件描述符对应起来。

sock_create

接下来,我们打开 sock_create 函数看一下,可以看到它调用了 __sock_create,添加了 network namespace 和 是否是 kern 的参数。

#define current get_current()  // return current task_struct

int sock_create(int family, int type, int protocol, struct socket **res)
{
return __sock_create(current->nsproxy->net_ns, family, type, protocol, res, 0);
}

int sock_create_kern(struct net *net, int family, int type, int protocol, struct socket **res)
{
return __sock_create(net, family, type, protocol, res, 1);
}

int __sock_create(struct net *net, int family, int type, int protocol,
struct socket **res, int kern)

{
int err;
struct socket *sock;
const struct net_proto_family *pf;

err = security_socket_create(family, type, protocol, kern);

sock = sock_alloc();
sock->type = type;

rcu_read_lock();
pf = rcu_dereference(net_families[family]);
/*
* We will call the ->create function, that possibly is in a loadable
* module, so we have to bump that loadable module refcnt first.
*/
if (!try_module_get(pf->owner))
goto out_release;

/* Now protected by module ref count */
rcu_read_unlock();

err = pf->create(net, sock, protocol, kern);

/*
* Now to bump the refcnt of the [loadable] module that owns this
* socket at sock_release time we decrement its refcnt.
*/
if (!try_module_get(sock->ops->owner))
goto out_module_busy;

/*
* Now that we're done with the ->create function, the [loadable]
* module can have its refcnt decremented
*/
module_put(pf->owner);
err = security_socket_post_create(sock, family, type, protocol, kern);
if (err)
goto out_sock_release;
*res = sock;

return 0;

}

这里主要做了两件事:

  • 调用 sock_alloc 分配了一个 struct socket 结构,并创建了该 socket 对应的 inode
  • 根据 family 参数拿到了对应的 net_proto_family,并调用 pf->create

net_proto_family 初始化

这里解释下 net_proto_family:linux 网络子系统有一个 net_families 数组,我们能够以 family 参数为下标,找到对应的 struct net_proto_family,每个 net_proto_family 都有一个 create 函数:

// include/linux/net.h
struct net_proto_family {
int family;
int (*create)(struct net *net, struct socket *sock,
int protocol, int kern);
struct module *owner;
};

// net/socket.c
static const struct net_proto_family __rcu *net_families[NPROTO] __read_mostly;

每一种地址族都有自己的 net_proto_family,IP 地址族的 net_proto_family 定义如下,里面最重要的 create 函数指向 inet_create

// net/ipv4/af_inet.c
static const struct net_proto_family inet_family_ops = {
.family = PF_INET,
.create = inet_create, //这个用于 socket 系统调用创建
.owner = THIS_MODULE,
}

inet_family_ops 是在 inet_init 中注册到 net_families 中的:


static int __init inet_init(void)
{
/* ... */
(void)sock_register(&inet_family_ops);

/* Register the socket-side information for inet_create. */
for (r = &inetsw[0]; r < &inetsw[SOCK_MAX]; ++r)
INIT_LIST_HEAD(r);

for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; ++q)
inet_register_protosw(q);

/* ... */
}

fs_initcall(inet_init);
sock_register

查看 sock_register 的实现,实际上就是向 net_families 这个数组插入对应的协议实现:

net/socket.c
int sock_register(const struct net_proto_family *ops)
{
int err;

spin_lock(&net_family_lock);
if (rcu_dereference_protected(net_families[ops->family],
lockdep_is_held(&net_family_lock)))
err = -EEXIST;
else {
rcu_assign_pointer(net_families[ops->family], ops);
err = 0;
}
spin_unlock(&net_family_lock);

pr_info("NET: Registered protocol family %d\n", ops->family);
return err;
}
注册 inetsw[]

inet_init 里面接下来初始化的是 inetsw 和 inetsw_array。

这里的 inetsw 也是一个数组,以 type 作为下标,每一项是一个链表,链表中的元素是 struct inet_protosw,也即 inetsw 数组对每个类型有一项,这一项里挂着属于这个类型的各个协议。

/* The inetsw table contains everything that inet_create needs to
* build a new socket.
*/
static struct list_head inetsw[SOCK_MAX];
static DEFINE_SPINLOCK(inetsw_lock);

static int __init inet_init(void)
{
/* ... */

/* Register the socket-side information for inet_create. */
for (r = &inetsw[0]; r < &inetsw[SOCK_MAX]; ++r)
INIT_LIST_HEAD(r);
for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; ++q)
inet_register_protosw(q);

/* ... */
}

inetsw 数组是在系统初始化的时候初始化的,就像下面代码里面实现的一样。

首先,一个循环会将 inetsw 数组的每一项,都初始化为一个链表。咱们前面说了,一个 type 类型会包含多个 protocol,因而我们需要一个链表。接下来一个循环,是将 inetsw_array 注册到 inetsw 数组里面去。inetsw_array 的定义如下,这个数组里面的内容很重要,后面会用到它们。

static struct inet_protosw inetsw_array[] =
{
{
.type = SOCK_STREAM,
.protocol = IPPROTO_TCP,
.prot = &tcp_prot,
.ops = &inet_stream_ops,
.flags = INET_PROTOSW_PERMANENT | INET_PROTOSW_ICSK,
},

{
.type = SOCK_DGRAM,
.protocol = IPPROTO_UDP,
.prot = &udp_prot,
.ops = &inet_dgram_ops,
.flags = INET_PROTOSW_PERMANENT,
},

{
.type = SOCK_DGRAM,
.protocol = IPPROTO_ICMP,
.prot = &ping_prot,
.ops = &inet_sockraw_ops,
.flags = INET_PROTOSW_REUSE,
},

{
.type = SOCK_RAW,
.protocol = IPPROTO_IP, /* wild card */
.prot = &raw_prot,
.ops = &inet_sockraw_ops,
.flags = INET_PROTOSW_REUSE,
}
};

#define INETSW_ARRAY_LEN ARRAY_SIZE(inetsw_array)

sock_alloc

初始化 socket 对象,主要与文件系统打交道:

在文件系统中分配inode
  • 通过 new_inode_pseudo 在 socket 文件系统中创建新的 inode
    • 其中 sock_mnt 为 sockfs 文件系统的挂载点,是在内核初始化安装 socket 文件系统时赋值的
    • mnt_sb 是该文件系统安装点的超级块对象的指针
net/socket.c
struct socket *sock_alloc(void)
{
struct inode *inode;
struct socket *sock;

inode = new_inode_pseudo(sock_mnt->mnt_sb);
if (!inode)
return NULL;

sock = SOCKET_I(inode);

inode->i_ino = get_next_ino();
inode->i_mode = S_IFSOCK | S_IRWXUGO;
inode->i_uid = current_fsuid();
inode->i_gid = current_fsgid();
inode->i_op = &sockfs_inode_ops;

return sock;
}
根据 inode 取得 socket 对象
  • 由于创建inode是文件系统的通用逻辑,因此其返回值是inode对象的指针;
  • 但这里在创建socket的inode后,需要根据inode得到socket对象;
  • 内联函数 SOCKET_I由此而来:
struct socket_alloc {
struct socket socket;
struct inode vfs_inode;
};

static inline struct socket *SOCKET_I(struct inode *inode)
{
return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
}

回到sock_alloc函数,SOCKET_I 根据 inode 取得 socket 变量后,记录当前进程的一些信息

  • 如fsuid, fsgid,并增加sockets_in_use的值(该变量表示创建socket的个数);
  • 设置 inode 的 i_mode 为 S_IFSOCK
  • 设置 inode 的 i_op 为 sockfs_inode_ops
static const struct inode_operations sockfs_inode_ops = {
.listxattr = sockfs_listxattr,
.setattr = sockfs_setattr,
};

inet_create

我们回到函数 __sock_create。接下来,在这里面,这个 inet_create 会被调用。

net/ipv4/af_inet.c
static int inet_create(struct net *net, struct socket *sock, int protocol, int kern)
{
struct sock *sk;
struct inet_protosw *answer;
struct inet_sock *inet;
struct proto *answer_prot;
unsigned char answer_flags;
int try_loading_module = 0;
int err;

sock->state = SS_UNCONNECTED;

/* Look for the requested type/protocol pair. */
lookup_protocol:
list_for_each_entry_rcu(answer, &inetsw[sock->type], list) {
err = 0;
/* Check the non-wild match. */
if (protocol == answer->protocol) {
if (protocol != IPPROTO_IP)
break;
} else {
/* Check for the two wild cases. */
if (IPPROTO_IP == protocol) {
protocol = answer->protocol;
break;
}
if (IPPROTO_IP == answer->protocol)
break;
}
err = -EPROTONOSUPPORT;
}

/* ... */
sock->ops = answer->ops;
answer_prot = answer->prot;
answer_flags = answer->flags;

/* ... */
sk = sk_alloc(net, PF_INET, GFP_KERNEL, answer_prot, kern);

/* ... */
inet = inet_sk(sk);
inet->nodefrag = 0;
if (SOCK_RAW == sock->type) {
inet->inet_num = protocol;
if (IPPROTO_RAW == protocol)
inet->hdrincl = 1;
}
inet->inet_id = 0;
sock_init_data(sock, sk);

sk->sk_destruct = inet_sock_destruct;
sk->sk_protocol = protocol;
sk->sk_backlog_rcv = sk->sk_prot->backlog_rcv;

inet->uc_ttl = -1;
inet->mc_loop = 1;
inet->mc_ttl = 1;
inet->mc_all = 1;
inet->mc_index = 0;
inet->mc_list = NULL;
inet->rcv_tos = 0;

if (inet->inet_num) {
inet->inet_sport = htons(inet->inet_num);
/* Add to protocol hash chains. */
err = sk->sk_prot->hash(sk);
}

if (sk->sk_prot->init) {
err = sk->sk_prot->init(sk);
}

/* ... */
}
设置 socket 状态为 SS_UNCONNECTED
sock->state = SS_UNCONNECTED;
找到 type/protocol 对应的 inet_protosw

在 inet_create 中,我们先会看到一个循环 list_for_each_entry_rcu。在这里,socket 的第二个参数 type 开始起作用。因为循环查看的是 inetsw[sock->type]

我们回到 inet_create 的 list_for_each_entry_rcu 循环中。到这里就好理解了:

  • 这是在 inetsw 数组中,根据 type 找到属于这个类型的列表
  • 依次比较列表中的 struct inet_protosw 的 protocol 是不是用户指定的 protocol;
    • 如果是,就得到了符合用户指定的 family->type->protocol 的 struct inet_protosw *answer 对象

接下来,struct socket *sock 的 ops 成员变量,被赋值为 answer 的 ops。对于 TCP 来讲,就是 inet_stream_ops。后面任何用户对于这个 socket 的操作,都是通过 inet_stream_ops 进行的。

const struct proto_ops inet_stream_ops = {
.family = PF_INET,
.owner = THIS_MODULE,
.release = inet_release,
.bind = inet_bind,
.connect = inet_stream_connect,
.socketpair = sock_no_socketpair,
.accept = inet_accept,
.getname = inet_getname,
.poll = tcp_poll,
.ioctl = inet_ioctl,
.listen = inet_listen,
.shutdown = inet_shutdown,
.setsockopt = sock_common_setsockopt,
.getsockopt = sock_common_getsockopt,
.sendmsg = inet_sendmsg,
.recvmsg = inet_recvmsg,

/* ... */
}
初始化 socket
sock->ops = answer->ops;
answer_prot = answer->prot;
answer_flags = answer->flags;

结合例子,socket 变量的ops指向 inet_stream_ops 结构体变量;

分配 sock 结构体变量 sk_alloc

接下来,我们创建一个 struct sock *sk 对象。socket 和 sock 看起来几乎一样,容易让人混淆,这里需要说明一下:socket 负责对上给用户提供接口,并且和文件系统关联;而 sock 负责向下对接内核网络协议栈。前面代码中已经看到,我们通过 sock_alloc 来初始化 socket 对象,通过 sk_alloc 来初始化 sock 对象。

sk = sk_alloc(net, PF_INET, GFP_KERNEL, answer_prot, kern);

在 sk_alloc 函数中,struct inet_protosw *answer 的 prot 域(对 TCP 来讲就是 tcp_prot)被赋值给了 struct sock *sk 的 sk_prot 成员。tcp_prot 的定义如下,里面定义了很多的函数,都是 sock 之下内核协议栈的动作。

struct proto tcp_prot = {
.name = "TCP",
.owner = THIS_MODULE,
.close = tcp_close,
.connect = tcp_v4_connect,
.disconnect = tcp_disconnect,
.accept = inet_csk_accept,
.ioctl = tcp_ioctl,
.init = tcp_v4_init_sock,
.destroy = tcp_v4_destroy_sock,
.shutdown = tcp_shutdown,
.setsockopt = tcp_setsockopt,
.getsockopt = tcp_getsockopt,
.keepalive = tcp_set_keepalive,
.recvmsg = tcp_recvmsg,
.sendmsg = tcp_sendmsg,
.sendpage = tcp_sendpage,
.backlog_rcv = tcp_v4_do_rcv,
.release_cb = tcp_release_cb,
.hash = inet_hash,
.get_port = inet_csk_get_port,
}

这里创建了 struct sock,主要用于初始化各种网络协议栈相关的对象。

net/core/sock.c
/*	
* sk_alloc - All socket objects are allocated here
*/
struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
struct proto *prot, int kern)
{
struct sock *sk;

sk = sk_prot_alloc(prot, priority | __GFP_ZERO, family);
if (sk) {
sk->sk_family = family;
/*
* See comment in struct sock definition to understand
* why we need sk_prot_creator -acme
*/
sk->sk_prot = sk->sk_prot_creator = prot;
sk->sk_kern_sock = kern;
sock_lock_init(sk);
sk->sk_net_refcnt = kern ? 0 : 1;
if (likely(sk->sk_net_refcnt)) {
get_net(net);
sock_inuse_add(net, 1);
}

sock_net_set(sk, net);
refcount_set(&sk->sk_wmem_alloc, 1);

mem_cgroup_sk_alloc(sk);
cgroup_sk_alloc(&sk->sk_cgrp_data);
sock_update_classid(&sk->sk_cgrp_data);
sock_update_netprioidx(&sk->sk_cgrp_data);
}

return sk;
}
建立 socket 与 sock 的关系 sock_init_data

inet_create 函数中,接下来通过 inet_sk(sk) 取得一个 struct inet_sock 结构。这个结构一开始就是 struct sock,然后扩展了一些其他的信息,剩下的代码就是填充这些信息。这一幕我们会经常看到:将一个结构放在另一个结构的开始位置,然后扩展一些成员,通过对指针的强制类型转换,来访问这些成员。

inet = inet_sk(sk);

这里为什么能直接将sock结构体变量强制转化为inet_sock结构体变量呢?只有一种可能,那就是在分配sock结构体变量时,真正分配的是inet_sock或是其他结构体。

我们回到分配sock结构体的那块代码(net/core/sock.c):

static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
int family)
{
struct sock *sk;
struct kmem_cache *slab;

slab = prot->slab;
if (slab != NULL)
// TCP 高速缓存
sk = kmem_cache_alloc(slab, priority & ~__GFP_ZERO);
else
// 内存分配
sk = kmalloc(prot->obj_size, priority);

return sk;

}

上面的代码在分配sock结构体时,有两种途径:

  • 一是从tcp专用高速缓存中分配,这种情况下在初始化高速缓存时,指定了结构体大小为 prot->obj_size
  • 二是从内存直接分配,这种情况下也指定了大小为 prot->obj_size

根据这点,我们看下tcp_prot变量中的obj_size:

// net/ipv4/tcp_ipv4.c
.obj_size = sizeof(struct tcp_sock),

也就是说,分配的真实结构体是 tcp_sock;由于 tcp_sock、inet_connection_sock、inet_sock、sock 依次嵌套,且各自都位于外层结构偏移量为 0 的位置,因此可以直接将 tcp_sock 强制转换为 inet_sock(原文此处有一张展示这几个结构体嵌套关系的示意图)。
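内核中对应的转换 helper 大致如下(简化示意:tcp_sk、inet_sk、inet_csk 分别定义在 include/linux/tcp.h、include/net/inet_sock.h、include/net/inet_connection_sock.h 中,核心就是一次指针强转):

static inline struct tcp_sock *tcp_sk(const struct sock *sk)
{
    return (struct tcp_sock *)sk;
}

static inline struct inet_sock *inet_sk(const struct sock *sk)
{
    return (struct inet_sock *)sk;
}

static inline struct inet_connection_sock *inet_csk(const struct sock *sk)
{
    return (struct inet_connection_sock *)sk;
}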

创建完sock变量之后,便是初始化sock结构体,并建立sock与socket之间的引用关系;调用函数为 sock_init_data()

该函数主要工作是:

  1. 初始化sock结构的缓冲区、队列等;
  2. 初始化sock结构的状态为 TCP_CLOSE;
  3. 建立socket与sock结构的相互引用关系;
// net/core/sock.c
void sock_init_data(struct socket *sock, struct sock *sk)
{
sk_init_common(sk);
sk->sk_send_head = NULL;

timer_setup(&sk->sk_timer, NULL, 0);

sk->sk_allocation = GFP_KERNEL;
sk->sk_rcvbuf = sysctl_rmem_default;
sk->sk_sndbuf = sysctl_wmem_default;
sk->sk_state = TCP_CLOSE;
sk_set_socket(sk, sock);

if (sock) {
sk->sk_type = sock->type;
sk->sk_wq = sock->wq;
sock->sk = sk;
sk->sk_uid = SOCK_INODE(sock)->i_uid;
} else {
sk->sk_wq = NULL;
sk->sk_uid = make_kuid(sock_net(sk)->user_ns, 0);
}
}
使用 tcp 协议初始化 sock

inet_create() 函数最后,通过相应的协议来初始化sock结构:

if (sk->sk_prot->init) {
err = sk->sk_prot->init(sk);
}

对 TCP 协议而言,其 init 函数是 tcp_v4_init_sock,它主要是对tcp_sock和inet_connection_sock进行一些初始化。

tcp_v4_init_sock
net/ipv4/tcp_ipv4.c
static int tcp_v4_init_sock(struct sock *sk)
{
tcp_init_sock(sk);
return 0;
}
tcp_init_sock
net/ipv4/tcp.c
/* Address-family independent initialization for a tcp_sock.
*
* NOTE: A lot of things set to zero explicitly by call to
* sk_alloc() so need not be done here.
*/
void tcp_init_sock(struct sock *sk)
{
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);

tp->out_of_order_queue = RB_ROOT;
sk->tcp_rtx_queue = RB_ROOT;
tcp_init_xmit_timers(sk);
INIT_LIST_HEAD(&tp->tsq_node);
INIT_LIST_HEAD(&tp->tsorted_sent_queue);

icsk->icsk_rto = TCP_TIMEOUT_INIT;
tp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
minmax_reset(&tp->rtt_min, tcp_jiffies32, ~0U);

/* So many TCP implementations out there (incorrectly) count the
* initial SYN frame in their delayed-ACK and congestion control
* algorithms that we must have the following bandaid to talk
* efficiently to them. -DaveM
*/
tp->snd_cwnd = TCP_INIT_CWND;

/* There's a bubble in the pipe until at least the first ACK. */
tp->app_limited = ~0U;

/* See draft-stevens-tcpca-spec-01 for discussion of the
* initialization of these values.
*/
tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
tp->snd_cwnd_clamp = ~0;
tp->mss_cache = TCP_MSS_DEFAULT;

tp->reordering = sock_net(sk)->ipv4.sysctl_tcp_reordering;
tcp_assign_congestion_control(sk);

tp->tsoffset = 0;
tp->rack.reo_wnd_steps = 1;

sk->sk_state = TCP_CLOSE;

sk->sk_write_space = sk_stream_write_space;
sock_set_flag(sk, SOCK_USE_WRITE_QUEUE);

icsk->icsk_sync_mss = tcp_sync_mss;

sk->sk_sndbuf = sock_net(sk)->ipv4.sysctl_tcp_wmem[1];
sk->sk_rcvbuf = sock_net(sk)->ipv4.sysctl_tcp_rmem[1];

sk_sockets_allocated_inc(sk);
sk->sk_route_forced_caps = NETIF_F_GSO;
}

sock_map_fd

创建好与socket相关的结构后,需要与文件系统关联,详见 sock_map_fd() 函数:

  • 申请文件描述符,并分配file结构和目录项结构;
  • 关联socket相关的文件操作函数表和目录项操作函数表;
  • 将 file->private_data 指向 socket;

关于 VFS,可以参考我的 这篇文章

static int sock_map_fd(struct socket *sock, int flags)
{
struct file *newfile;
int fd = get_unused_fd_flags(flags);
if (unlikely(fd < 0)) {
sock_release(sock);
return fd;
}

newfile = sock_alloc_file(sock, flags, NULL);
if (!IS_ERR(newfile)) {
fd_install(fd, newfile);
return fd;
}

put_unused_fd(fd);
return PTR_ERR(newfile);
}

get_unused_fd_flags

从当前进程的 files 结构分配一个 unused fd

int get_unused_fd_flags(unsigned flags)
{
return __alloc_fd(current->files, 0, rlimit(RLIMIT_NOFILE), flags);
}

sock_alloc_file

创建一个 file 对象,内核把 socket 指针赋值给 file 的 private_data。这样一来,就可以通过 fd 在 fdtable 中得到 file 对象,进而得到 socket 对象:

struct file *sock_alloc_file(struct socket *sock, int flags, const char *dname)
{
struct file *file;

if (!dname)
dname = sock->sk ? sock->sk->sk_prot_creator->name : "";

file = alloc_file_pseudo(SOCK_INODE(sock), sock_mnt, dname,
O_RDWR | (flags & O_NONBLOCK),
&socket_file_ops);
if (IS_ERR(file)) {
sock_release(sock);
return file;
}

sock->file = file;
file->private_data = sock;
return file;
}

这里的 socket_file_ops 定义如下:

// net/socket.c
static const struct file_operations socket_file_ops = {
.owner = THIS_MODULE,
.llseek = no_llseek,
.read_iter = sock_read_iter,
.write_iter = sock_write_iter,
.poll = sock_poll,
.unlocked_ioctl = sock_ioctl,
#ifdef CONFIG_COMPAT
.compat_ioctl = compat_sock_ioctl,
#endif
.mmap = sock_mmap,
.release = sock_close,
.fasync = sock_fasync,
.sendpage = sock_sendpage,
.splice_write = generic_splice_sendpage,
.splice_read = sock_splice_read,
};

经过这个赋值之后,我们直接对 socket 的 fd 进行 read/write 等操作,就会进入对应协议的处理。以 read 为例,这里会调用 sock_recvmsg,最终调用到对应协议的 recvmsg:

static ssize_t sock_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
struct file *file = iocb->ki_filp;
struct socket *sock = file->private_data;
struct msghdr msg = {.msg_iter = *to,
.msg_iocb = iocb};
ssize_t res;

if (file->f_flags & O_NONBLOCK)
msg.msg_flags = MSG_DONTWAIT;

if (iocb->ki_pos != 0)
return -ESPIPE;

if (!iov_iter_count(to)) /* Match SYS5 behaviour */
return 0;

res = sock_recvmsg(sock, &msg, msg.msg_flags);
*to = msg.msg_iter;
return res;
}
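
从用户态的角度看,这意味着对 socket fd 的普通 read 与 flags 为 0 的 recv 效果等价,都会走到 sock_recvmsg,进而到达对应协议(如 TCP 的 tcp_recvmsg)。下面是一个简单的示意(非内核源码):

#include <unistd.h>
#include <sys/socket.h>

/* fd 是一个已连接的 TCP socket */
ssize_t read_via_vfs(int fd, char *buf, size_t len)
{
    /* read 走 VFS 路径:sock_read_iter -> sock_recvmsg */
    return read(fd, buf, len);
}

ssize_t read_via_recv(int fd, char *buf, size_t len)
{
    /* recv 走 socket 系统调用路径,同样落到 sock_recvmsg */
    return recv(fd, buf, len, 0);
}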

fd_install

为当前进程构建 fd 与 file 的映射关系:

// fs/file.c
void fd_install(unsigned int fd, struct file *file)
{
__fd_install(current->files, fd, file);
}

void __fd_install(struct files_struct *files, unsigned int fd,
struct file *file)
{
struct fdtable *fdt;

rcu_read_lock_sched();

if (unlikely(files->resize_in_progress)) {
rcu_read_unlock_sched();
spin_lock(&files->file_lock);
fdt = files_fdtable(files);
BUG_ON(fdt->fd[fd] != NULL);
rcu_assign_pointer(fdt->fd[fd], file);
spin_unlock(&files->file_lock);
return;
}
/* coupled with smp_wmb() in expand_fdtable() */
smp_rmb();
fdt = rcu_dereference_sched(files->fdt);
BUG_ON(fdt->fd[fd] != NULL);
rcu_assign_pointer(fdt->fd[fd], file);
rcu_read_unlock_sched();
}

socket 调用总结

最终来总结一下:

  • 内核中,socket是作为一个伪文件系统来实现的,它在初始化时注册到内核
  • 每个进程的 files_struct 域保存了所有的句柄,包括 socket 的。
    • 一般的文件操作,内核直接调用 vfs 层的方法,然后会自动调用 socket 实现的相关方法
    • 内核通过 inode 结构的 i_mode 域就可以知道当前的句柄所关联的是不是 socket 类型
    • 这时遇到 socket 独有的操作,就通过 container_of 方法,计算出 socket 的地址,从而就可以进行相关的操作(见下面的示意代码)
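
上面所说的"通过 container_of 由 inode 反推出 socket",依赖于内核把 socket 和 inode 打包在同一个 socket_alloc 结构里。下面的定义摘自 include/net/sock.h(略有精简),可以直观看到这一点:

struct socket_alloc {
	struct socket socket;
	struct inode vfs_inode;
};

/* 由 inode 反推出外层的 socket_alloc,再取出其中的 socket */
static inline struct socket *SOCKET_I(struct inode *inode)
{
	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
}

/* 反过来,由 socket 取得对应的 inode,即前面代码里的 SOCK_INODE() */
static inline struct inode *SOCK_INODE(struct socket *socket)
{
	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
}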

Bind

地址结构

struct sockaddr

struct sockaddr 其实相当于一个"基类"的地址结构,其他地址结构都可以强制转换成 sockaddr。举个例子,当 sa_family 为 AF_INET(PF_INET)时,sa_data 中就存放了端口号和 IP 地址:

// include/linux/socket.h
struct sockaddr {
sa_family_t sa_family; /* address family, AF_xxx */
char sa_data[14]; /* 14 bytes of protocol address */
};

struct sockaddr_in

sockaddr_in 表示了所有的 ipv4 的地址结构,即代表 AF_INET 域的地址,可以看到它相当于 sockaddr 的一个子类。

// include/uapi/linux/in.h
#define __SOCK_SIZE__ 16 /* sizeof(struct sockaddr) */
struct sockaddr_in {
__kernel_sa_family_t sin_family; /* Address family */
__be16 sin_port; /* Port number */
struct in_addr sin_addr; /* Internet address */

/* Pad to size of `struct sockaddr'. */
unsigned char __pad[__SOCK_SIZE__ - sizeof(short int) -
sizeof(unsigned short int) - sizeof(struct in_addr)];
};

struct in_addr
{
in_addr_t s_addr;
};

struct sockaddr_storage

这里还有一个比较通用的地址结构 sockaddr_storage,它可以容纳所有类型的套接字地址结构,比如 IPv4、IPv6 等。相比于 sockaddr,它更大并且是强制对齐的。

// include/uapi/linux/socket.h
struct __kernel_sockaddr_storage {
union {
struct {
__kernel_sa_family_t ss_family; /* address family */
/* Following field(s) are implementation specific */
char __data[_K_SS_MAXSIZE - sizeof(unsigned short)];
/* space to achieve desired size, */
/* _SS_MAXSIZE value minus size of ss_family */
};
void *__align; /* implementation specific desired alignment */
};
};
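
sockaddr_storage 常用于协议无关的代码中。下面是一个用户态的小示意(listen_fd 假定是一个已处于 LISTEN 状态的套接字,省略了错误处理),用它来接收 accept 返回的对端地址,无论对端是 IPv4 还是 IPv6 都放得下:

#include <stdio.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

void accept_one(int listen_fd)
{
	struct sockaddr_storage ss;            /* 足够容纳任意协议族的地址 */
	socklen_t len = sizeof(ss);
	int conn = accept(listen_fd, (struct sockaddr *)&ss, &len);
	if (conn < 0)
		return;

	if (ss.ss_family == AF_INET) {
		struct sockaddr_in *v4 = (struct sockaddr_in *)&ss;
		char ip[INET_ADDRSTRLEN];
		inet_ntop(AF_INET, &v4->sin_addr, ip, sizeof(ip));
		printf("peer %s:%d\n", ip, ntohs(v4->sin_port));
	} else if (ss.ss_family == AF_INET6) {
		struct sockaddr_in6 *v6 = (struct sockaddr_in6 *)&ss;
		char ip[INET6_ADDRSTRLEN];
		inet_ntop(AF_INET6, &v6->sin6_addr, ip, sizeof(ip));
		printf("peer %s:%d\n", ip, ntohs(v6->sin6_port));
	}
	close(conn);
}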

存储数据结构

inet_hashinfo

第一个是 inet_hashinfo,它主要用来管理 tcp的 bind hash bucket

// include/net/inet_hashtables.h
#define INET_LHTABLE_SIZE 32

struct inet_hashinfo {
/* This is for sockets with full identity only. Sockets here will
* always be without wildcards and will have the following invariant:
*
* TCP_ESTABLISHED <= sk->sk_state < TCP_CLOSE
*
*/
struct inet_ehash_bucket *ehash;
spinlock_t *ehash_locks;
unsigned int ehash_mask;
unsigned int ehash_locks_mask;

struct kmem_cache *bind_bucket_cachep;
struct inet_bind_hashbucket *bhash;
unsigned int bhash_size;

/* ... */
struct inet_listen_hashbucket listening_hash[INET_LHTABLE_SIZE]
____cacheline_aligned_in_smp;
};

在这个结构体中,有几个元素:

  • ehash:e 指的是 establish,管理除了 LISTEN 状态的所有 socket 的 hash 表
    • 类似于 C++中的 unordered_map<Key, sock*>
    • key 是 {source_ip, destination_ip, source_port, destination_port}
    • sock* 可以指向 tcp_request_sock、tcp_sock、tcp_timewait_sock 之一
  • bhash: b 指的是 bind,负责端口分配
    • 类似于 C++中的 map<uint16_t, list<tcp_sock*>> 其中 list 对应 inet_bind_bucket
  • listening_hash:负责侦听(listening) socket,bucket 数量是在编译内核时确定,通常是 32 个。

tcp_hashinfo 是 inet_hashinfo 结构体的一个实例:

// net/ipv4/tcp_ipv4.c
struct inet_hashinfo tcp_hashinfo;
EXPORT_SYMBOL(tcp_hashinfo);

tcp_hashinfo 在 tcp_init 中被初始化:

// net/ipv4/tcp.c
void __init tcp_init(void)
{
/* ... */

inet_hashinfo_init(&tcp_hashinfo);
inet_hashinfo2_init(&tcp_hashinfo, "tcp_listen_portaddr_hash",
thash_entries, 21, /* one slot per 2 MB*/
0, 64 * 1024);

/* ... */
}

之后,tcp_hashinfo 的地址会被记录到 tcp_prot 的 h.hashinfo 域中,而 sock 的 sk_prot 指向的正是 tcp_prot,其定义在 net/ipv4/tcp_ipv4.c 中。

inet_ehash_bucket

struct inet_ehash_bucket 管理所有 TCP 状态介于 TCP_ESTABLISHED 和 TCP_CLOSE 之间的 sock:

1
2
3
4
// include/net/inet_hashtables.h
struct inet_ehash_bucket {
struct hlist_nulls_head chain;
};

这里 bucket 内部是一个 hlist_nulls_head:当两个 socket 的 hash 值落到同一个桶时,就以链表的形式串起来(也就是拉链法的 hash 表)。
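
选桶的方式就是用 hash 值按位与上 ehash_mask,对应的辅助函数大致如下(include/net/inet_hashtables.h):

static inline struct inet_ehash_bucket *inet_ehash_bucket(struct inet_hashinfo *hashinfo,
							   unsigned int hash)
{
	/* ehash_mask = bucket 数 - 1,bucket 数是 2 的幂 */
	return &hashinfo->ehash[hash & hashinfo->ehash_mask];
}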

tcp_init 中,我们会初始化 ehash,这里申请了一个系统级较大的 hash 表,hash 表有 thash_entries 个 bucket,它的大小在内核启动的时候确定(有参数可以配置),不会更改(即不会 rehash):

// net/ipv4/tcp.c
void __init tcp_init(void)
{
tcp_hashinfo.ehash =
alloc_large_system_hash("TCP established",
sizeof(struct inet_ehash_bucket),
thash_entries,
17, /* one slot per 128 KB of memory */
0,
NULL,
&tcp_hashinfo.ehash_mask,
0,
thash_entries ? 0 : 512 * 1024);
for (i = 0; i <= tcp_hashinfo.ehash_mask; i++)
INIT_HLIST_NULLS_HEAD(&tcp_hashinfo.ehash[i].chain, i);

if (inet_ehash_locks_alloc(&tcp_hashinfo))
panic("TCP: failed to alloc ehash_locks");

}

参考内核日志:

[    1.053697] NET: Registered protocol family 2
[ 1.055864] TCP established hash table entries: 1048576 (order: 12, 16777216 bytes)
[ 1.059805] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
[ 1.088226] hashinfo create: new net:ffffffff82013d40 reuse TCP_HASHINFO ffffffff823df040
[ 1.098897] TCP: Hash tables configured (established 1048576 bind 65536)
[ 1.100420] death_row create: reuse TCP_DEATH_ROW net:ffffffff82013d40
[ 1.101934] TCP: reno registered
[ 1.102982] UDP hash table entries: 8192 (order: 6, 262144 bytes)
[ 1.104444] UDP: udp_hash3_init
[ 1.105868] UDP hash3 hash table entries: 1048576 (order: 11, 8388608 bytes)
[ 1.108546] UDP-Lite hash table entries: 8192 (order: 6, 262144 bytes)

参考 alloc_large_system_hash 函数声明:

void *__init alloc_large_system_hash(const char *tablename,
unsigned long bucketsize,
unsigned long numentries,
int scale,
int flags,
unsigned int *_hash_shift,
unsigned int *_hash_mask,
unsigned long low_limit,
unsigned long high_limit)
{
/* ... */
}

inet_bind_hashbucket

inet_bind_hashbucket 是哈希桶结构, lock 成员是用于操作时对桶进行加锁, chain 成员是相同哈希值的节点的链表。结构如下,存储了所有的端口的信息:

// include/net/inet_hashtables.h

struct inet_bind_hashbucket {
spinlock_t lock;
struct hlist_head chain;
};

tcp_init 中,我们会初始化 bhash,这里申请了一个系统级较大的 hash 表,它的大小在内核启动的时候确定(有参数可以配置),不会更改(即不会 rehash):

// net/ipv4/tcp.c
void __init tcp_init(void)
{
/* ... */
tcp_hashinfo.bind_bucket_cachep =
kmem_cache_create("tcp_bind_bucket",
sizeof(struct inet_bind_bucket), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);

tcp_hashinfo.bhash =
alloc_large_system_hash("TCP bind",
sizeof(struct inet_bind_hashbucket),
tcp_hashinfo.ehash_mask + 1,
17, /* one slot per 128 KB of memory */
0,
&tcp_hashinfo.bhash_size,
NULL,
0,
64 * 1024);
tcp_hashinfo.bhash_size = 1U << tcp_hashinfo.bhash_size;
for (i = 0; i < tcp_hashinfo.bhash_size; i++) {
spin_lock_init(&tcp_hashinfo.bhash[i].lock);
INIT_HLIST_HEAD(&tcp_hashinfo.bhash[i].chain);
}
/* ... */
}

inet_bind_bucket

前面提到的哈希桶结构中的 chain 链表中的每个节点,其宿主结构体是 inet_bind_bucket ,该结构体通过成员 node 链入链表;

// include/net/inet_hashtables.h

struct inet_bind_bucket {
possible_net_t ib_net;
int l3mdev;
unsigned short port;
struct hlist_node node; // 作为bhash中chain链表的节点
struct hlist_head owners; // 绑定在该端口上的sock链表

/* ... */
};

用户编程实例

int server_sockfd = socket(AF_INET, SOCK_STREAM, 0);
struct sockaddr_in server_address;
socklen_t server_len;

server_address.sin_family = AF_INET;
server_address.sin_addr.s_addr = inet_addr("0.0.0.0");
server_address.sin_port = htons(9734);
server_len = sizeof(server_address);
bind(server_sockfd, (struct sockaddr *)&server_address, server_len);

bind 实现

int __sys_bind(int fd, struct sockaddr __user *umyaddr, int addrlen)
{
struct socket *sock;
struct sockaddr_storage address;
int err, fput_needed;

sock = sockfd_lookup_light(fd, &err, &fput_needed);
if (sock) {
err = move_addr_to_kernel(umyaddr, addrlen, &address);
if (!err) {
err = security_socket_bind(sock,
(struct sockaddr *)&address,
addrlen);
if (!err)
err = sock->ops->bind(sock,
(struct sockaddr *)
&address, addrlen);
}
fput_light(sock->file, fput_needed);
}
return err;
}

查找 socket 结构

根据 fd 取得关联的 socket 结构:

static struct socket *sockfd_lookup_light(int fd, int *err, int *fput_needed)
{
struct fd f = fdget(fd);
struct socket *sock;

*err = -EBADF;
if (f.file) {
sock = sock_from_file(f.file, err);
if (likely(sock)) {
*fput_needed = f.flags;
return sock;
}
fdput(f);
}
return NULL;
}

struct socket *sock_from_file(struct file *file, int *err)
{
if (file->f_op == &socket_file_ops)
return file->private_data; /* set in sock_map_fd */

*err = -ENOTSOCK;
return NULL;
}

将用户空间地址结构复制到内核空间

int move_addr_to_kernel(void __user *uaddr, int ulen, struct sockaddr_storage *kaddr)
{
if (ulen < 0 || ulen > sizeof(struct sockaddr_storage))
return -EINVAL;
if (ulen == 0)
return 0;
if (copy_from_user(kaddr, uaddr, ulen))
return -EFAULT;
return audit_sockaddr(ulen, kaddr);
}

调用 bind 函数 inet_bind

// net/ipv4/af_inet.c
int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
{
struct sock *sk = sock->sk;
int err;

/* If the socket has its own bind function then use it. (RAW) */
if (sk->sk_prot->bind) {
return sk->sk_prot->bind(sk, uaddr, addr_len);
}
if (addr_len < sizeof(struct sockaddr_in))
return -EINVAL;

/* BPF prog is run before any checks are done so that if the prog
* changes context in a wrong way it will be caught.
*/
err = BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr);
if (err)
return err;

return __inet_bind(sk, uaddr, addr_len, false, true);
}

可以看到,最终是调用了 __inet_bind 函数。

地址类型检查
// net/ipv4/af_inet.c
int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
bool force_bind_address_no_port, bool with_lock)
{
/* ... */

if (addr->sin_family != AF_INET) {
/* Compatibility games : accept AF_UNSPEC (mapped to AF_INET)
* only if s_addr is INADDR_ANY.
*/
err = -EAFNOSUPPORT;
if (addr->sin_family != AF_UNSPEC ||
addr->sin_addr.s_addr != htonl(INADDR_ANY))
goto out;
}


tb_id = l3mdev_fib_table_by_index(net, sk->sk_bound_dev_if) ? : tb_id;
chk_addr_ret = inet_addr_type_table(net, addr->sin_addr.s_addr, tb_id);

/* Not specified by any standard per-se, however it breaks too
* many applications when removed. It is unfortunate since
* allowing applications to make a non-local bind solves
* several problems with systems using dynamic addressing.
* (ie. your servers still start up even if your ISDN link
* is temporarily down)
*/
err = -EADDRNOTAVAIL;
if (!inet_can_nonlocal_bind(net, inet) &&
addr->sin_addr.s_addr != htonl(INADDR_ANY) &&
chk_addr_ret != RTN_LOCAL &&
chk_addr_ret != RTN_MULTICAST &&
chk_addr_ret != RTN_BROADCAST)
goto out;
/* ... */
}

这里的 inet_addr_type_table 用于检查该地址所属的类型:

  • 广播地址
  • 本地地址
  • 多播地址

然后再做对应判断

端口范围的检查
  • 如果端口号非 0 且小于 inet_prot_sock(默认 1024),则必须具有相应权限,否则出错返回
  • ns_capable 用来判断权限:如果没有 CAP_NET_BIND_SERVICE,则返回 -EACCES(本小节代码之后有一个用户态小实验)
// net/ipv4/af_inet.c

static inline int inet_prot_sock(struct net *net)
{
return PROT_SOCK;
}

int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
bool force_bind_address_no_port, bool with_lock)
{
/* ... */
snum = ntohs(addr->sin_port);
err = -EACCES;
if (snum && snum < inet_prot_sock(net) &&
!ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
goto out;

/* ... */
}
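
可以用一个用户态小实验直观验证上面的检查(端口号 80 只是示意):普通用户绑定 1024 以下的端口会得到 EACCES;如果通过 setcap 给可执行文件加上 CAP_NET_BIND_SERVICE,同样的代码就能绑定成功。

#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	struct sockaddr_in addr = {
		.sin_family      = AF_INET,
		.sin_port        = htons(80),            /* 特权端口,仅作示意 */
		.sin_addr.s_addr = htonl(INADDR_ANY),
	};

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		/* 非 root 且没有 CAP_NET_BIND_SERVICE 时,这里通常打印 Permission denied */
		printf("bind: %s\n", strerror(errno));
	else
		printf("bind ok\n");
	return 0;
}
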
设置发送源地址和接收源地址
  • 这里先检查 sock 的状态:如果状态不是 TCP_CLOSE,或者已经绑定过端口(inet_num 不为 0),则出错返回(这也呼应了创建 socket 时把 sock 的状态初始化为 TCP_CLOSE)
  • 如果地址类型是多播或广播,则发送源地址 inet_saddr 置 0(由路由选择出口设备的地址),而接收地址 inet_rcv_saddr 仍为设置的 IP 地址
// net/ipv4/af_inet.c
int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
bool force_bind_address_no_port, bool with_lock)
{
/* ... */

/* Check these errors (active socket, double bind). */
err = -EINVAL;
if (sk->sk_state != TCP_CLOSE || inet->inet_num)
goto out_release_sock;

inet->inet_rcv_saddr = inet->inet_saddr = addr->sin_addr.s_addr;
if (chk_addr_ret == RTN_MULTICAST || chk_addr_ret == RTN_BROADCAST)
inet->inet_saddr = 0; /* Use device */

/* ... */
}
检查端口是否被占用 inet_csk_get_port
// net/ipv4/af_inet.c
int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
bool force_bind_address_no_port, bool with_lock)
{
/* ... */

/* Make sure we are allowed to bind here. */
if (snum || !(inet->bind_address_no_port ||
force_bind_address_no_port)) {
if (sk->sk_prot->get_port(sk, snum)) {
inet->inet_saddr = inet->inet_rcv_saddr = 0;
err = -EADDRINUSE;
goto out_release_sock;
}
err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk);
if (err) {
inet->inet_saddr = inet->inet_rcv_saddr = 0;
goto out_release_sock;
}
}
}

用于检查端口的函数为 sk->sk_prot->get_port(sk, snum),在 tcp 中被实例化为 inet_csk_get_port,这里我先来介绍下inet_csk_get_port的流程:

  • 当绑定的 port 为 0 时,也即是需要 kernel 来分配一个新的 port
    1. 首先得到系统的port范围
    2. 随机分配一个port
    3. 从 bhash 中得到当前随机分配的端口的链表(也就是 inet_bind_bucket 链表)
    4. 遍历这个链表(链表为空的话,也说明这个port没有被使用),如果这个端口已经被使用,则将端口号加一,继续循环,直到找到当前没有被使用的port,也就是没有在 bhash中 存在的port
    5. 新建一个inet_bind_bucket,并插入到bhash中.
  • 当指定 port
    1. 从 bhash 中根据hash值 (port计算的) 取得当前指定端口对应的 inet_bind_bucket 结构
    2. 如果 bhash 中存在,则说明这个端口已经在使用,因此需要判断这个端口是否允许被reuse
    3. 如果不存在,则步骤和上面的第5步一样
// net/ipv4/inet_connection_sock.c
int inet_csk_get_port(struct sock *sk, unsigned short snum)
{
bool reuse = sk->sk_reuse && sk->sk_state != TCP_LISTEN;
struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
int ret = 1, port = snum;
struct inet_bind_hashbucket *head;
struct net *net = sock_net(sk);
struct inet_bind_bucket *tb = NULL;
kuid_t uid = sock_i_uid(sk);
int l3mdev;

l3mdev = inet_sk_bound_l3mdev(sk);

// 端口为0,也就是需要内核来分配端口
if (!port) {
head = inet_csk_find_open_port(sk, &tb, &port);
if (!head)
return ret;
if (!tb)
goto tb_not_found;
goto success;
}
head = &hinfo->bhash[inet_bhashfn(net, port,
hinfo->bhash_size)];
spin_lock_bh(&head->lock);
inet_bind_bucket_for_each(tb, &head->chain)
if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev &&
tb->port == port)
goto tb_found;
tb_not_found:
tb = inet_bind_bucket_create(hinfo->bind_bucket_cachep,
net, head, port, l3mdev);
if (!tb)
goto fail_unlock;
tb_found:
if (!hlist_empty(&tb->owners)) {
if (sk->sk_reuse == SK_FORCE_REUSE)
goto success;

if (inet_csk_bind_conflict(sk, tb, true, true))
goto fail_unlock;
}
success:
/* ... */

if (sk->sk_reuseport) {}

/* ... */
if (!inet_csk(sk)->icsk_bind_hash)
inet_bind_hash(sk, tb, port);

return ret;
}
// net/ipv4/inet_hashtables.c
struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
struct net *net,
struct inet_bind_hashbucket *head,
const unsigned short snum,
int l3mdev)
{
struct inet_bind_bucket *tb = kmem_cache_alloc(cachep, GFP_ATOMIC);

if (tb) {
write_pnet(&tb->ib_net, net);
tb->l3mdev = l3mdev;
tb->port = snum;
tb->fastreuse = 0;
tb->fastreuseport = 0;
INIT_HLIST_HEAD(&tb->owners);
hlist_add_head(&tb->node, &head->chain);
}
return tb;
}
初始化源端口和目的地址

bind 时还不知道目的端的信息,所以把目的地址和目的端口都置 0,同时把源端口记录到 inet_sport 中。

// net/ipv4/af_inet.c
int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
bool force_bind_address_no_port, bool with_lock)
{
/* ... */

if (inet->inet_rcv_saddr)
sk->sk_userlocks |= SOCK_BINDADDR_LOCK;
if (snum)
sk->sk_userlocks |= SOCK_BINDPORT_LOCK;
inet->inet_sport = htons(inet->inet_num);
inet->inet_daddr = 0;
inet->inet_dport = 0;
sk_dst_reset(sk);
err = 0;

/* ... */
}

Listen

数据结构

request_sock_queue

前面提到,inet_connection_sock 是所有面向连接的 socket 的表示,是对 struct inet_sock 的扩展,它的第一个域就是 inet_sock。它主要维护与其绑定的端口结构,以及监听等状态过程中的信息,其中包含一个 icsk_accept_queue 域,类型为 request_sock_queue。

每个需要 listen的sock都需要维护一个 request_sock_queue 结构,主要保存了与自己连接的半连接信息。

request_sock_queue 也就表示一个 request_sock 队列。这里我们知道,tcp中分为

  • 半连接队列:处于 SYN_RECV 状态,刚收到 SYN,等待三次握手完成
  • 已完成连接队列:处于established状态,已经完成三次握手,等待accept来读取

在连接建立过程中:

  • 当 SYN 报文到来时(进入半连接状态),会新建一个 request_sock 结构,并把它加入半连接哈希表(较新的内核中直接挂在全局的 ehash 表里)
  • 当三次握手完毕后(进入已完成连接状态),把它放入 request_sock_queue 的 rskq_accept_head / rskq_accept_tail 队列中
  • 当 accept 的时候,就直接从这个队列中读取,并将 request_sock 结构释放,然后在 BSD 层新建一个 socket 结构,把它和接收端新建的子 sock 结构关联起来
// include/net/request_sock.h
struct request_sock_queue {
spinlock_t rskq_lock;
u8 rskq_defer_accept;

u32 synflood_warned;
atomic_t qlen;
atomic_t young;

struct request_sock *rskq_accept_head;
struct request_sock *rskq_accept_tail;
struct fastopen_queue fastopenq; /* Check max_qlen != 0 to determine
* if TFO is enabled.
*/
};

listen_sock

listen_sock 是较老内核中用来维护半连接队列的结构,其中的 syn_table 是存放半连接 request_sock 的哈希数组:分配时大小为 TCP_SYNQ_HSIZE(512)项;成员 max_qlen_log 以 2 的对数形式表示队列的最大长度(例如队列上限为 1024 时其值为 10);qlen 是队列的当前长度;hash_rnd 是计算哈希值用的随机数。

在本文参考的 v5.4 内核中,listen_sock/syn_table 已经被移除:半连接的 request_sock 直接插入全局的 ehash 表,icsk_accept_queue 中只保留全连接(accept)队列以及 qlen、young 等计数(对应上面的结构定义)。

整体流程没有变化:每一个 SYN 请求都会新建一个 request_sock 结构并加入半连接哈希表,然后接收端回复一个 SYN/ACK;当请求端把三次握手的最后一个 ACK 发给接收端、且接收端校验通过后,request_sock 会从半连接哈希表中删除,并被加入 request_sock_queue 的 rskq_accept_head / rskq_accept_tail 队列;accept 系统调用不过是判断 accept 队列中是否存在完成三次握手的 request_sock,把它从队列中取出并释放,然后在 BSD 层新建一个 socket 结构,将它和接收端新建的子 sock 结构关联起来。

request_sock

request_sock保存了 tcp 双方传输所必需的一些域,比如窗口大小、对端速率、对端数据包序列号等等这些值.

// include/net/request_sock.h
struct request_sock {
struct sock_common __req_common;
#define rsk_refcnt __req_common.skc_refcnt
#define rsk_hash __req_common.skc_hash
#define rsk_listener __req_common.skc_listener
#define rsk_window_clamp __req_common.skc_window_clamp
#define rsk_rcv_wnd __req_common.skc_rcv_wnd

struct request_sock *dl_next;
u16 mss;
u8 num_retrans; /* number of retransmits */
u8 cookie_ts:1; /* syncookie: encode tcpopts in timestamp */
u8 num_timeout:7; /* number of timeouts */
u32 ts_recent;
struct timer_list rsk_timer;
const struct request_sock_ops *rsk_ops;
struct sock *sk;
u32 *saved_syn;
u32 secid;
u32 peer_secid;
};

inet_request_sock

struct inet_request_sock是request_sock的扩展,其定义如下

// include/linux/inet_sock.h

struct inet_request_sock {
struct request_sock req;
#define ir_loc_addr req.__req_common.skc_rcv_saddr
#define ir_rmt_addr req.__req_common.skc_daddr
#define ir_num req.__req_common.skc_num
#define ir_rmt_port req.__req_common.skc_dport
#define ir_v6_rmt_addr req.__req_common.skc_v6_daddr
#define ir_v6_loc_addr req.__req_common.skc_v6_rcv_saddr
#define ir_iif req.__req_common.skc_bound_dev_if
#define ir_cookie req.__req_common.skc_cookie
#define ireq_net req.__req_common.skc_net
#define ireq_state req.__req_common.skc_state
#define ireq_family req.__req_common.skc_family

u16 snd_wscale : 4,
rcv_wscale : 4,
tstamp_ok : 1,
sack_ok : 1,
wscale_ok : 1,
ecn_ok : 1,
acked : 1,
no_srccheck: 1,
smc_ok : 1;
u32 ir_mark;
union {
struct ip_options_rcu __rcu *ireq_opt;
};
};

tcp_request_sock

增加了对于接收和发送的ISN(初始序列号)等的维护。

// include/linux/tcp.h
struct tcp_request_sock {
struct inet_request_sock req;
const struct tcp_request_sock_ops *af_specific;
u64 snt_synack; /* first SYNACK sent time */
bool tfo_listener;
u32 txhash;
u32 rcv_isn;
u32 snt_isn;
u32 ts_off;
u32 last_oow_ack_time; /* last SYNACK */
u32 rcv_nxt; /* the ack # by SYNACK. For
* FastOpen it's the seq#
* after data-in-SYN.
*/
};

Listen 实现

Listen 库函数调用的主要工作可以分为以下几步:

  1. 根据 socket 文件描述符找到内核中对应的socket结构体变量;
  2. 设置 socket 的状态并初始化等待连接队列;
  3. 将 socket 放入 inet_hashinfo的 listen 哈希表中;

这里还有一个概念:backlog,它是用户调用 listen() 时传入的。在 Linux 中,backlog 指的是全连接(accept)队列的最大长度;半连接队列的最大长度也与 backlog 相关,由 backlog 和相关的 sysctl 参数共同决定。
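
换成用户态视角,backlog 就是 listen() 的第二个参数。下面是一个最小的示意(端口号等均为假设,省略了错误处理),从后面的 __sys_listen 代码可以看到,传入的值会被截断到 net.core.somaxconn:

#include <netinet/in.h>
#include <sys/socket.h>

int make_listener(unsigned short port, int backlog)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	struct sockaddr_in addr = {
		.sin_family      = AF_INET,
		.sin_port        = htons(port),
		.sin_addr.s_addr = htonl(INADDR_ANY),
	};

	bind(fd, (struct sockaddr *)&addr, sizeof(addr));

	/* backlog 是全连接(accept)队列长度的期望值,
	 * __sys_listen 中会和 sysctl_somaxconn 取较小者 */
	listen(fd, backlog);
	return fd;
}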

int __sys_listen(int fd, int backlog)
{
struct socket *sock;
int err, fput_needed;
int somaxconn;

sock = sockfd_lookup_light(fd, &err, &fput_needed);
if (sock) {
somaxconn = sock_net(sock->sk)->core.sysctl_somaxconn;
if ((unsigned int)backlog > somaxconn)
backlog = somaxconn;

err = security_socket_listen(sock, backlog);
if (!err)
err = sock->ops->listen(sock, backlog);

fput_light(sock->file, fput_needed);
}
return err;
}

inet_listen

// net/ipv4/af_inet.c

/*
* Move a socket into listening state.
*/
int inet_listen(struct socket *sock, int backlog)
{
struct sock *sk = sock->sk;
unsigned char old_state;
int err, tcp_fastopen;

lock_sock(sk);

err = -EINVAL;
if (sock->state != SS_UNCONNECTED || sock->type != SOCK_STREAM)
goto out;

old_state = sk->sk_state;
if (!((1 << old_state) & (TCPF_CLOSE | TCPF_LISTEN)))
goto out;

sk->sk_max_ack_backlog = backlog;
/* Really, if the socket is already in listen state
* we can only allow the backlog to be adjusted.
*/
if (old_state != TCP_LISTEN) {
/* Enable TFO w/o requiring TCP_FASTOPEN socket option.
* Note that only TCP sockets (SOCK_STREAM) will reach here.
* Also fastopen backlog may already been set via the option
* because the socket was in TCP_LISTEN state previously but
* was shutdown() rather than close().
*/
tcp_fastopen = sock_net(sk)->ipv4.sysctl_tcp_fastopen;
if ((tcp_fastopen & TFO_SERVER_WO_SOCKOPT1) &&
(tcp_fastopen & TFO_SERVER_ENABLE) &&
!inet_csk(sk)->icsk_accept_queue.fastopenq.max_qlen) {
fastopen_queue_tune(sk, backlog);
tcp_fastopen_init_key_once(sock_net(sk));
}

err = inet_csk_listen_start(sk, backlog);
if (err)
goto out;
tcp_call_bpf(sk, BPF_SOCK_OPS_TCP_LISTEN_CB, 0, NULL);
}
err = 0;

out:
release_sock(sk);
return err;
}

上面的代码中,有一点值得注意:当 sock 状态已经是 TCP_LISTEN 时,仍然可以再次调用 listen(),此时它的作用只是调整 sock 的最大并发连接请求数(backlog);

inet_csk_listen_start

它的主要工作是:初始化 inet_connection_sock 的 icsk_accept_queue(等待连接队列),把 sock 的状态设置为 TCP_LISTEN,再通过 get_port 对当前使用的端口做一次校验,并把 sock 加入 listen 哈希表,最后返回:

// net/ipv4/inet_connection_sock.c

int inet_csk_listen_start(struct sock *sk, int backlog)
{
struct inet_connection_sock *icsk = inet_csk(sk);
struct inet_sock *inet = inet_sk(sk);
int err = -EADDRINUSE;

// 初始化连接等待队列
reqsk_queue_alloc(&icsk->icsk_accept_queue);

sk->sk_ack_backlog = 0; // 已经连接的个数
inet_csk_delack_init(sk); // 将ack标记全部置0

/* There is race window here: we announce ourselves listening,
* but this transition is still not validated by get_port().
* It is OK, because this socket enters to hash table only
* after validation is complete.
*/
inet_sk_state_store(sk, TCP_LISTEN); // 设置sock的状态为TCP_LISTEN
if (!sk->sk_prot->get_port(sk, inet->inet_num)) {
inet->inet_sport = htons(inet->inet_num);

sk_dst_reset(sk);
err = sk->sk_prot->hash(sk); // 将当前socket哈希到inet_hashinfo中

if (likely(!err))
return 0;
}

inet_sk_set_state(sk, TCP_CLOSE);
return err;
}

request_sock_queue 的初始化 reqsk_queue_alloc

先看下 reqsk_queue_alloc() 的源代码:

// net/core/request_sock.c
void reqsk_queue_alloc(struct request_sock_queue *queue)
{
spin_lock_init(&queue->rskq_lock);

spin_lock_init(&queue->fastopenq.lock);
queue->fastopenq.rskq_rst_head = NULL;
queue->fastopenq.rskq_rst_tail = NULL;
queue->fastopenq.qlen = 0;

queue->rskq_accept_head = NULL;
}

可以看到,现在的 reqsk_queue_alloc() 并不分配哈希表,只是初始化 request_sock_queue 中的自旋锁、fastopen 队列和 accept 队列头;

端口注册

前面提到管理 socket 的哈希表结构 inet_hashinfo ,其中的成员 listening_hash 用于存放处于 TCP_LISTEN 状态的 sock ;

当 socket 通过 listen() 调用完成等待连接队列的初始化后,需要将当前 sock 放到该结构体中:

inet_sk_state_store(sk, TCP_LISTEN); // 设置sock的状态为TCP_LISTEN
if (!sk->sk_prot->get_port(sk, inet->inet_num)) {
inet->inet_sport = htons(inet->inet_num);

sk_dst_reset(sk);
err = sk->sk_prot->hash(sk); // 将当前socket哈希到inet_hashinfo中

if (likely(!err))
return 0;
}

这里 sk_prot->hash(sk) 调用了 net/ipv4/inet_hashtables.c:inet_hash() 方法:

int inet_hash(struct sock *sk)
{
int err = 0;

if (sk->sk_state != TCP_CLOSE) {
local_bh_disable();
err = __inet_hash(sk, NULL);
local_bh_enable();
}

return err;
}

int __inet_hash(struct sock *sk, struct sock *osk)
{
struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
struct inet_listen_hashbucket *ilb;
int err = 0;

// 状态检查
if (sk->sk_state != TCP_LISTEN) {
inet_ehash_nolisten(sk, osk);
return 0;
}
WARN_ON(!sk_unhashed(sk));

// 计算hash值,取得链表
ilb = &hashinfo->listening_hash[inet_sk_listen_hashfn(sk)];

spin_lock(&ilb->lock);
if (sk->sk_reuseport) {
err = inet_reuseport_add_sock(sk, ilb);
if (err)
goto unlock;
}
if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport &&
sk->sk_family == AF_INET6)
hlist_add_tail_rcu(&sk->sk_node, &ilb->head);
else
hlist_add_head_rcu(&sk->sk_node, &ilb->head);
inet_hash2(hashinfo, sk);
ilb->count++;
sock_set_flag(sk, SOCK_RCU_FREE);
// 将sock添加到listen链表中
sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
unlock:
spin_unlock(&ilb->lock);

return err;
}

Accept

accept的作用就是从 accept 队列中取出三次握手完成的sock,并将它关联到vfs上(其实操作和调用sys_socket时新建一个socket类似),然后返回。

这里还有个要注意的:

  • 如果传递给 accept 的 socket 是非阻塞的,就算 accept 队列为空,也会直接返回(errno 为 EAGAIN,见下面的用户态小例子)
  • 如果传递给 accept 的 socket 是阻塞的,则会休眠,等到 accept 队列有数据后再被唤醒
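
非阻塞 accept 在用户态的典型写法大致如下(示意,listen_fd 假定已设置 O_NONBLOCK / SOCK_NONBLOCK):accept 队列为空时立即返回 -1,errno 为 EAGAIN 或 EWOULDBLOCK:

#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>

int try_accept(int listen_fd)
{
	int conn = accept(listen_fd, NULL, NULL);
	if (conn < 0) {
		if (errno == EAGAIN || errno == EWOULDBLOCK)
			return -1;    /* accept 队列为空:对应内核里 timeo == 0 直接返回 -EAGAIN */
		perror("accept");
		return -1;
	}
	return conn;                  /* 取到一个已完成三次握手的连接 */
}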

接下来我们就来看它的实现。accept 对应的系统调用是 sys_accept,它最终会调用 __sys_accept4,因此我们直接来看 __sys_accept4:

int __sys_accept4(int fd, struct sockaddr __user *upeer_sockaddr,
int __user *upeer_addrlen, int flags)
{
struct socket *sock, *newsock;
struct file *newfile;
int err, len, newfd, fput_needed;
struct sockaddr_storage address;

if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
return -EINVAL;

if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
flags = (flags & ~SOCK_NONBLOCK) | O_NONBLOCK;

sock = sockfd_lookup_light(fd, &err, &fput_needed);
if (!sock)
goto out;

err = -ENFILE;
newsock = sock_alloc();
if (!newsock)
goto out_put;

newsock->type = sock->type;
newsock->ops = sock->ops;

/*
* We don't need try_module_get here, as the listening socket (sock)
* has the protocol module (sock->ops->owner) held.
*/
__module_get(newsock->ops->owner);

newfd = get_unused_fd_flags(flags);
if (unlikely(newfd < 0)) {
err = newfd;
sock_release(newsock);
goto out_put;
}
newfile = sock_alloc_file(newsock, flags, sock->sk->sk_prot_creator->name);
if (IS_ERR(newfile)) {
err = PTR_ERR(newfile);
put_unused_fd(newfd);
goto out_put;
}

err = security_socket_accept(sock, newsock);
if (err)
goto out_fd;

err = sock->ops->accept(sock, newsock, sock->file->f_flags, false);
if (err < 0)
goto out_fd;

if (upeer_sockaddr) {
len = newsock->ops->getname(newsock,
(struct sockaddr *)&address, 2);
if (len < 0) {
err = -ECONNABORTED;
goto out_fd;
}
err = move_addr_to_user(&address,
len, upeer_sockaddr, upeer_addrlen);
if (err < 0)
goto out_fd;
}

/* File flags are not inherited via accept() unlike another OSes. */

fd_install(newfd, newfile);
err = newfd;

out_put:
fput_light(sock->file, fput_needed);
out:
return err;
out_fd:
fput(newfile);
put_unused_fd(newfd);
goto out_put;
}

可以看到,这里创建了一个新的 socket 结构,然后调用了 sock->ops->accept,也就是 inet_accept

实现

inet_accept

可以看到流程很简单,最终的实现都集中在 inet_accept中了.而inet_accept主要工作如下:

  • 调用 inet_csk_accept 来进行对 accept 队列的操作,它会返回取得的sock.
  • 将从 inet_csk_accept返回的 sock 链接到传递进来的new的socket中
    • 这里就知道我们上面为什么只需要new一个 socket 而不是 sock了,因为sock我们是直接从 accept 队列中取得的
  • 设置新的 socket 的状态为 SS_CONNECTED
int inet_accept(struct socket *sock, struct socket *newsock, int flags,
bool kern)
{
struct sock *sk1 = sock->sk;
int err = -EINVAL;
struct sock *sk2 = sk1->sk_prot->accept(sk1, flags, &err, kern);

if (!sk2)
goto do_err;

lock_sock(sk2);

sock_rps_record_flow(sk2);
WARN_ON(!((1 << sk2->sk_state) &
(TCPF_ESTABLISHED | TCPF_SYN_RECV |
TCPF_CLOSE_WAIT | TCPF_CLOSE)));

sock_graft(sk2, newsock);

newsock->state = SS_CONNECTED;
err = 0;
release_sock(sk2);
do_err:
return err;
}

inet_csk_accept

inet_csk_accept 就是从accept队列中取出sock 然后返回

在看他的源码之前先来看几个相关函数的实现:

// include/net/request_sock.h
static inline bool reqsk_queue_empty(const struct request_sock_queue *queue)
{
return READ_ONCE(queue->rskq_accept_head) == NULL;
}

reqsk_queue_remove 主要是从accept队列中得到一个sock

static inline struct request_sock *reqsk_queue_remove(struct request_sock_queue *queue,
struct sock *parent)
{
struct request_sock *req;

spin_lock_bh(&queue->rskq_lock);
req = queue->rskq_accept_head;
if (req) {
sk_acceptq_removed(parent);
WRITE_ONCE(queue->rskq_accept_head, req->dl_next);
if (queue->rskq_accept_head == NULL)
queue->rskq_accept_tail = NULL;
}
spin_unlock_bh(&queue->rskq_lock);
return req;
}

inet_csk_wait_for_connect 用来在 accept 队列为空的情况下休眠一段时间(每个 socket 都有自己的等待队列)。

每个调用 accept 的进程都会声明一个 wait 队列项,把它挂到监听 socket 的等待队列链表中,然后休眠,直到被唤醒。

static int inet_csk_wait_for_connect(struct sock *sk, long timeo)
{
struct inet_connection_sock *icsk = inet_csk(sk);
// 定义一个waitqueue.
DEFINE_WAIT(wait);
int err;

for (;;) {
prepare_to_wait_exclusive(sk_sleep(sk), &wait,
TASK_INTERRUPTIBLE);
release_sock(sk);
if (reqsk_queue_empty(&icsk->icsk_accept_queue))
timeo = schedule_timeout(timeo);
sched_annotate_sleep();
lock_sock(sk);
err = 0;
if (!reqsk_queue_empty(&icsk->icsk_accept_queue))
break;
err = -EINVAL;
if (sk->sk_state != TCP_LISTEN)
break;
err = sock_intr_errno(timeo);
if (signal_pending(current))
break;
err = -EAGAIN;
if (!timeo)
break;
}
finish_wait(sk_sleep(sk), &wait);
return err;
}

inet_csk_accept 中同样区分阻塞和非阻塞:非阻塞时即使 accept 队列为空也会直接返回,此时错误码为 -EAGAIN。

struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern)
{
struct inet_connection_sock *icsk = inet_csk(sk);
struct request_sock_queue *queue = &icsk->icsk_accept_queue;
struct request_sock *req;
struct sock *newsk;
int error;

lock_sock(sk);

/* We need to make sure that this socket is listening,
* and that it has something pending.
*/
error = -EINVAL;
if (sk->sk_state != TCP_LISTEN)
goto out_err;

/* Find already established connection */
if (reqsk_queue_empty(queue)) {
long timeo = sock_rcvtimeo(sk, flags & O_NONBLOCK);

/* If this is a non blocking socket don't sleep */
error = -EAGAIN;
if (!timeo)
goto out_err;

error = inet_csk_wait_for_connect(sk, timeo);
if (error)
goto out_err;
}
req = reqsk_queue_remove(queue, sk);
newsk = req->sk;

if (sk->sk_protocol == IPPROTO_TCP &&
tcp_rsk(req)->tfo_listener) {
spin_lock_bh(&queue->fastopenq.lock);
if (tcp_rsk(req)->tfo_listener) {
/* We are still waiting for the final ACK from 3WHS
* so can't free req now. Instead, we set req->sk to
* NULL to signify that the child socket is taken
* so reqsk_fastopen_remove() will free the req
* when 3WHS finishes (or is aborted).
*/
req->sk = NULL;
req = NULL;
}
spin_unlock_bh(&queue->fastopenq.lock);
}
out:
release_sock(sk);
if (req)
reqsk_put(req);
return newsk;
out_err:
newsk = NULL;
req = NULL;
*err = error;
goto out;
}

sock_graft

将 socket 和 sock 关联。

static inline void sock_graft(struct sock *sk, struct socket *parent)
{
WARN_ON(parent->sk);
write_lock_bh(&sk->sk_callback_lock);
rcu_assign_pointer(sk->sk_wq, &parent->wq);
parent->sk = sk;
sk_set_socket(sk, parent);
sk->sk_uid = SOCK_INODE(parent)->i_uid;
security_sock_graft(sk, parent);
write_unlock_bh(&sk->sk_callback_lock);
}

Connect

什么情况下,icsk_accept_queue 才不为空呢?当然是三次握手结束才可以。接下来我们来分析三次握手的过程。

三次握手一般是由客户端调用 connect 发起,它的具体流程是:

  1. 由 fd 得到 socket, 并且将地址复制到内核空间
  2. 调用 inet_stream_connect 进行主要的处理

这里要注意connect也有个阻塞和非阻塞的区别,阻塞的话调用 inet_wait_for_connect 休眠,等待握手完成,否则直接返回。

int __sys_connect(int fd, struct sockaddr __user *uservaddr, int addrlen)
{
struct socket *sock;
struct sockaddr_storage address;
int err, fput_needed;

sock = sockfd_lookup_light(fd, &err, &fput_needed);
if (!sock)
goto out;
err = move_addr_to_kernel(uservaddr, addrlen, &address);
if (err < 0)
goto out_put;

err =
security_socket_connect(sock, (struct sockaddr *)&address, addrlen);
if (err)
goto out_put;

err = sock->ops->connect(sock, (struct sockaddr *)&address, addrlen,
sock->file->f_flags);
out_put:
fput_light(sock->file, fput_needed);
out:
return err;
}

client 端

inet_stream_connect

int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
int addr_len, int flags)
{
int err;

lock_sock(sock->sk);
err = __inet_stream_connect(sock, uaddr, addr_len, flags, 0);
release_sock(sock->sk);
return err;
}

可以看到,inet_stream_connect 主要调用了 __inet_stream_connect,它的主要工作是:

  1. 判断socket的状态,只有当为 SS_UNCONNECTED 也就是非连接状态时才调用 tcp_v4_connect 来进行连接处理
  2. 判断tcp的状态 sk_state 只能为 TCPF_SYN_SENT 或者TCPF_SYN_RECV 才进入相关处理
  3. 如果状态合适并且socket为阻塞模式则调用 inet_wait_for_connect 进入休眠等待三次握手完成来重新唤醒它。
  4. 否则直接返回,并设置错误号为EINPROGRESS.

需要注意一下,如果为非阻塞模式,用户可以通过select来查看socket是否可读写,从而判断是否已连接。
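
上面说的"非阻塞模式下通过 select 判断是否已连接",用户态的典型写法大致如下(示意,省略部分错误处理;fd、addr、len 由调用者准备):

#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/socket.h>

int connect_nonblock(int fd, const struct sockaddr *addr, socklen_t len)
{
	fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

	if (connect(fd, addr, len) == 0)
		return 0;                 /* 少见:本地连接可能立刻完成 */
	if (errno != EINPROGRESS)
		return -1;                /* 真正的错误 */

	fd_set wfds;
	FD_ZERO(&wfds);
	FD_SET(fd, &wfds);
	struct timeval tv = { .tv_sec = 5 };

	if (select(fd + 1, NULL, &wfds, NULL, &tv) <= 0)
		return -1;                /* 超时或出错 */

	/* 可写不代表成功,还要用 SO_ERROR 确认三次握手的结果 */
	int err = 0;
	socklen_t elen = sizeof(err);
	getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &elen);
	return err == 0 ? 0 : -1;
}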

int __inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
int addr_len, int flags, int is_sendmsg)
{
struct sock *sk = sock->sk;
int err;
long timeo;

if (uaddr) {
if (addr_len < sizeof(uaddr->sa_family))
return -EINVAL;

if (uaddr->sa_family == AF_UNSPEC) {
err = sk->sk_prot->disconnect(sk, flags);
sock->state = err ? SS_DISCONNECTING : SS_UNCONNECTED;
goto out;
}
}

switch (sock->state) {
default:
err = -EINVAL;
goto out;
case SS_CONNECTED:
err = -EISCONN;
goto out;
case SS_CONNECTING:
if (inet_sk(sk)->defer_connect)
err = is_sendmsg ? -EINPROGRESS : -EISCONN;
else
err = -EALREADY;
/* Fall out of switch with err, set for this state */
break;
case SS_UNCONNECTED:
err = -EISCONN;
if (sk->sk_state != TCP_CLOSE)
goto out;

if (BPF_CGROUP_PRE_CONNECT_ENABLED(sk)) {
err = sk->sk_prot->pre_connect(sk, uaddr, addr_len);
if (err)
goto out;
}

err = sk->sk_prot->connect(sk, uaddr, addr_len);
if (err < 0)
goto out;

sock->state = SS_CONNECTING;

if (!err && inet_sk(sk)->defer_connect)
goto out;

/* Just entered SS_CONNECTING state; the only
* difference is that return value in non-blocking
* case is EINPROGRESS, rather than EALREADY.
*/
err = -EINPROGRESS;
break;
}

timeo = sock_sndtimeo(sk, flags & O_NONBLOCK);

if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
int writebias = (sk->sk_protocol == IPPROTO_TCP) &&
tcp_sk(sk)->fastopen_req &&
tcp_sk(sk)->fastopen_req->data ? 1 : 0;

/* Error code is set above */
if (!timeo || !inet_wait_for_connect(sk, timeo, writebias))
goto out;

err = sock_intr_errno(timeo);
if (signal_pending(current))
goto out;
}

/* Connection was closed by RST, timeout, ICMP error
* or another process disconnected us.
*/
if (sk->sk_state == TCP_CLOSE)
goto sock_error;

/* sk->sk_err may be not zero now, if RECVERR was ordered by user
* and error was received after socket entered established state.
* Hence, it is handled normally after connect() return successfully.
*/

sock->state = SS_CONNECTED;
err = 0;
out:
return err;

sock_error:
err = sock_error(sk) ? : -ECONNABORTED;
sock->state = SS_UNCONNECTED;
if (sk->sk_prot->disconnect(sk, flags))
sock->state = SS_DISCONNECTING;
goto out;
}

tcp_v4_connect

tcp_v4_connect 的流程:

  1. 判断地址的一些合法性.
  2. 调用 ip_route_connect 来查找出去的路由(包括查找临时端口等等)
    1. 为什么呢?因为马上就要发送三次握手的 SYN 包了,需要凑齐源地址、源端口、目标地址、目标端口。目标地址和目标端口是服务端的,已经知道;源端口是客户端随机分配的;那源地址应该用哪一个呢?这时候要选择一条路由,看从哪个网卡出去,就填写哪个网卡的 IP 地址(连接建立后可以用 getsockname() 观察内核的选择,见 tcp_v4_connect 代码之后的小例子)。
  3. 设置sock的状态为 TCP_SYN_SENT,并调用 inet_hash_connect 来查找一个临时端口(也就是我们出去的端口),并加入到对应的hash链表(具体操作和get_port很相似)
  4. 调用 tcp_connect 来完成最终的操作:这个函数主要用来初始化将要发送的 SYN 包(包括窗口大小、ISN 等),然后把这个 sk_buff 加入到 socket 的写队列,最终调用 tcp_transmit_skb 传输到三层。
/* This will initiate an outgoing connection. */
int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
{
struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
struct inet_sock *inet = inet_sk(sk);
struct tcp_sock *tp = tcp_sk(sk);
__be16 orig_sport, orig_dport;
__be32 daddr, nexthop;
struct flowi4 *fl4;
struct rtable *rt;
int err;
struct ip_options_rcu *inet_opt;
struct inet_timewait_death_row *tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;

if (addr_len < sizeof(struct sockaddr_in))
return -EINVAL;

if (usin->sin_family != AF_INET)
return -EAFNOSUPPORT;

nexthop = daddr = usin->sin_addr.s_addr;
inet_opt = rcu_dereference_protected(inet->inet_opt,
lockdep_sock_is_held(sk));
if (inet_opt && inet_opt->opt.srr) {
if (!daddr)
return -EINVAL;
nexthop = inet_opt->opt.faddr;
}

orig_sport = inet->inet_sport;
orig_dport = usin->sin_port;
fl4 = &inet->cork.fl.u.ip4;
rt = ip_route_connect(fl4, nexthop, inet->inet_saddr,
RT_CONN_FLAGS(sk), sk->sk_bound_dev_if,
IPPROTO_TCP,
orig_sport, orig_dport, sk);
if (IS_ERR(rt)) {
err = PTR_ERR(rt);
if (err == -ENETUNREACH)
IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTNOROUTES);
return err;
}

if (rt->rt_flags & (RTCF_MULTICAST | RTCF_BROADCAST)) {
ip_rt_put(rt);
return -ENETUNREACH;
}

if (!inet_opt || !inet_opt->opt.srr)
daddr = fl4->daddr;

if (!inet->inet_saddr)
inet->inet_saddr = fl4->saddr;
sk_rcv_saddr_set(sk, inet->inet_saddr);

if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
/* Reset inherited state */
tp->rx_opt.ts_recent = 0;
tp->rx_opt.ts_recent_stamp = 0;
if (likely(!tp->repair))
WRITE_ONCE(tp->write_seq, 0);
}

inet->inet_dport = usin->sin_port;
sk_daddr_set(sk, daddr);

inet_csk(sk)->icsk_ext_hdr_len = 0;
if (inet_opt)
inet_csk(sk)->icsk_ext_hdr_len = inet_opt->opt.optlen;

tp->rx_opt.mss_clamp = TCP_MSS_DEFAULT;

/* Socket identity is still unknown (sport may be zero).
* However we set state to SYN-SENT and not releasing socket
* lock select source port, enter ourselves into the hash tables and
* complete initialization after this.
*/
tcp_set_state(sk, TCP_SYN_SENT);
err = inet_hash_connect(tcp_death_row, sk);
if (err)
goto failure;

sk_set_txhash(sk);

rt = ip_route_newports(fl4, rt, orig_sport, orig_dport,
inet->inet_sport, inet->inet_dport, sk);
if (IS_ERR(rt)) {
err = PTR_ERR(rt);
rt = NULL;
goto failure;
}
/* OK, now commit destination to socket. */
sk->sk_gso_type = SKB_GSO_TCPV4;
sk_setup_caps(sk, &rt->dst);
rt = NULL;

if (likely(!tp->repair)) {
if (!tp->write_seq)
WRITE_ONCE(tp->write_seq,
secure_tcp_seq(inet->inet_saddr,
inet->inet_daddr,
inet->inet_sport,
usin->sin_port));
tp->tsoffset = secure_tcp_ts_off(sock_net(sk),
inet->inet_saddr,
inet->inet_daddr);
}

inet->inet_id = prandom_u32();

if (tcp_fastopen_defer_connect(sk, &err))
return err;
if (err)
goto failure;

err = tcp_connect(sk);

if (err)
goto failure;

return 0;

failure:
/*
* This unhashes the socket and releases the local port,
* if necessary.
*/
tcp_set_state(sk, TCP_CLOSE);
ip_rt_put(rt);
sk->sk_route_caps = 0;
inet->inet_dport = 0;
return err;
}
EXPORT_SYMBOL(tcp_v4_connect);
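
连接建立之后,可以在用户态用 getsockname() 观察内核在上面流程中选出的源地址和临时端口(示意,fd 假定是已 connect 成功的 TCP 套接字):

#include <stdio.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

void show_local_endpoint(int fd)
{
	struct sockaddr_in local;
	socklen_t len = sizeof(local);

	if (getsockname(fd, (struct sockaddr *)&local, &len) == 0) {
		char ip[INET_ADDRSTRLEN];
		inet_ntop(AF_INET, &local.sin_addr, ip, sizeof(ip));
		/* IP 来自路由选择(ip_route_connect),端口来自 inet_hash_connect */
		printf("local %s:%d\n", ip, ntohs(local.sin_port));
	}
}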

tcp_connect

该函数完成发送SYN的作用,其流程如下:

  1. 初始化该次连接的 sk 结构 tcp_sock
  2. sk_stream_alloc_skb 分配一个skb数据包
  3. tcp_init_nondata_skb 初始化 skb,也就是 SYN 包
  4. skb 加入发送队列后,调用 tcp_transmit_skb 发送该数据包(SYN)
  5. 更新 snd_nxt,inet_csk_reset_xmit_timer 启动超时定时器。如果 SYN 发送不成功,则再次发送
/* Build a SYN and send it off. */
int tcp_connect(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk);
struct sk_buff *buff;
int err;

tcp_call_bpf(sk, BPF_SOCK_OPS_TCP_CONNECT_CB, 0, NULL);

if (inet_csk(sk)->icsk_af_ops->rebuild_header(sk))
return -EHOSTUNREACH; /* Routing failure or similar. */

tcp_connect_init(sk);

if (unlikely(tp->repair)) {
tcp_finish_connect(sk, NULL);
return 0;
}

buff = sk_stream_alloc_skb(sk, 0, sk->sk_allocation, true);
if (unlikely(!buff))
return -ENOBUFS;

tcp_init_nondata_skb(buff, tp->write_seq++, TCPHDR_SYN);
tcp_mstamp_refresh(tp);
tp->retrans_stamp = tcp_time_stamp(tp);
tcp_connect_queue_skb(sk, buff);
tcp_ecn_send_syn(sk, buff);
tcp_rbtree_insert(&sk->tcp_rtx_queue, buff);

/* Send off SYN; include data in Fast Open. */
err = tp->fastopen_req ? tcp_send_syn_data(sk, buff) :
tcp_transmit_skb(sk, buff, 1, sk->sk_allocation);
if (err == -ECONNREFUSED)
return err;

/* We change tp->snd_nxt after the tcp_transmit_skb() call
* in order to make this packet get counted in tcpOutSegs.
*/
WRITE_ONCE(tp->snd_nxt, tp->write_seq);
tp->pushed_seq = tp->write_seq;
buff = tcp_send_head(sk);
if (unlikely(buff)) {
WRITE_ONCE(tp->snd_nxt, TCP_SKB_CB(buff)->seq);
tp->pushed_seq = TCP_SKB_CB(buff)->seq;
}
TCP_INC_STATS(sock_net(sk), TCP_MIB_ACTIVEOPENS);

/* Timer for repeating the SYN until an answer. */
inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
return 0;
}

我们回到 __inet_stream_connect 函数,在调用 sk->sk_prot->connect 之后,inet_wait_for_connect 会一直等待客户端收到服务端的 ACK。而我们知道,服务端在 accept 之后,也是在等待中。

Server 端

为了分析三次握手,我们简单地看一下网络包到达 TCP 层之后所做的部分事情。

tcp_v4_rcv

static struct net_protocol tcp_protocol = {
.early_demux = tcp_v4_early_demux,
.early_demux_handler = tcp_v4_early_demux,
.handler = tcp_v4_rcv,
.err_handler = tcp_v4_err,
.no_policy = 1,
.netns_ok = 1,
.icmp_strict_tag_validation = 1,
};

我们通过 struct net_protocol 结构中的 handler 进行接收,调用的函数是 tcp_v4_rcv。接下来的调用链为 tcp_v4_rcv->tcp_v4_do_rcv->tcp_rcv_state_process

tcp_rcv_state_process

tcp_rcv_state_process,顾名思义,是用来处理接收一个网络包后引起状态变化的。

// net/ipv4/tcp_input.c
int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
{
struct tcp_sock *tp = tcp_sk(sk);
struct inet_connection_sock *icsk = inet_csk(sk);
const struct tcphdr *th = tcp_hdr(skb);
struct request_sock *req;
int queued = 0;
bool acceptable;

switch (sk->sk_state) {
/* ... */
case TCP_LISTEN:
/* ... */
if (th->syn) {
acceptable = icsk->icsk_af_ops->conn_request(sk, skb) >= 0;
if (!acceptable)
return 1;
consume_skb(skb);
return 0;
}
/* ... */
}

目前服务端是处于 TCP_LISTEN 状态的,而且发过来的包是 SYN,因而就有了上面的代码,调用 icsk->icsk_af_ops->conn_request 函数。struct inet_connection_sock 对应的操作是 inet_connection_sock_af_ops,按照下面的定义,其实调用的是 tcp_v4_conn_request

const struct inet_connection_sock_af_ops ipv4_specific = {
.queue_xmit = ip_queue_xmit,
.send_check = tcp_v4_send_check,
.rebuild_header = inet_sk_rebuild_header,
.sk_rx_dst_set = inet_sk_rx_dst_set,
.conn_request = tcp_v4_conn_request,
.syn_recv_sock = tcp_v4_syn_recv_sock,
.net_header_len = sizeof(struct iphdr),
.setsockopt = ip_setsockopt,
.getsockopt = ip_getsockopt,
.addr2sockaddr = inet_csk_addr2sockaddr,
.sockaddr_len = sizeof(struct sockaddr_in),
.mtu_reduced = tcp_v4_mtu_reduced,
};

tcp_v4_conn_request

tcp_v4_conn_request 会调用 tcp_conn_request,这个函数也比较长,里面调用了 send_synack,但实际调用的是 tcp_v4_send_synack。具体发送的过程我们不去管它,看注释我们能知道,这是收到了 SYN 后,回复一个 SYN-ACK,回复完毕后,服务端处于 TCP_SYN_RECV


int tcp_conn_request(struct request_sock_ops *rsk_ops,
const struct tcp_request_sock_ops *af_ops,
struct sock *sk, struct sk_buff *skb)
{
/* ... */
af_ops->send_synack(sk, dst, &fl, req, &foc,
!want_cookie ? TCP_SYNACK_NORMAL :
TCP_SYNACK_COOKIE);
/* ... */
}

/*
* Send a SYN-ACK after having received a SYN.
*/
static int tcp_v4_send_synack(const struct sock *sk, struct dst_entry *dst,
struct flowi *fl,
struct request_sock *req,
struct tcp_fastopen_cookie *foc,
enum tcp_synack_type synack_type)
{......}

这个时候,轮到客户端接收网络包了。都是 TCP 协议栈,所以过程和服务端没有太多区别,还是会走到 tcp_rcv_state_process 函数的,只不过由于客户端目前处于 TCP_SYN_SENT 状态,就进入了下面的代码分支。

int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
{
struct tcp_sock *tp = tcp_sk(sk);
struct inet_connection_sock *icsk = inet_csk(sk);
const struct tcphdr *th = tcp_hdr(skb);
struct request_sock *req;
int queued = 0;
bool acceptable;

switch (sk->sk_state) {
/* ... */
case TCP_SYN_SENT:
tp->rx_opt.saw_tstamp = 0;
tcp_mstamp_refresh(tp);
queued = tcp_rcv_synsent_state_process(sk, skb, th);
if (queued >= 0)
return queued;
/* Do step6 onward by hand. */
tcp_urg(sk, skb, th);
__kfree_skb(skb);
tcp_data_snd_check(sk);
return 0;
}
/* ... */
}

tcp_rcv_synsent_state_process 会调用 tcp_send_ack,发送三次握手的最后一个 ACK,发送后客户端处于 TCP_ESTABLISHED 状态。

又轮到服务端接收网络包了,我们还是归 tcp_rcv_state_process 函数处理。由于服务端目前处于状态 TCP_SYN_RECV 状态,因而又走了另外的分支。当收到这个网络包的时候,服务端也处于 TCP_ESTABLISHED 状态,三次握手结束。


int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
{
struct tcp_sock *tp = tcp_sk(sk);
struct inet_connection_sock *icsk = inet_csk(sk);
const struct tcphdr *th = tcp_hdr(skb);
struct request_sock *req;
int queued = 0;
bool acceptable;
/* ... */
switch (sk->sk_state) {
case TCP_SYN_RECV:
if (req) {
inet_csk(sk)->icsk_retransmits = 0;
reqsk_fastopen_remove(sk, req, false);
} else {
/* Make sure socket is routed, for correct metrics. */
icsk->icsk_af_ops->rebuild_header(sk);
tcp_call_bpf(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
tcp_init_congestion_control(sk);

tcp_mtup_init(sk);
tp->copied_seq = tp->rcv_nxt;
tcp_init_buffer_space(sk);
}
smp_mb();
tcp_set_state(sk, TCP_ESTABLISHED);
sk->sk_state_change(sk);
if (sk->sk_socket)
sk_wake_async(sk, SOCK_WAKE_IO, POLL_OUT);
tp->snd_una = TCP_SKB_CB(skb)->ack_seq;
tp->snd_wnd = ntohs(th->window) << tp->rx_opt.snd_wscale;
tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
break;
/* ... */
}

Send

数据结构

msghdr

struct msghdr {
void *msg_name; /* ptr to socket address structure */
int msg_namelen; /* size of socket address structure */
struct iov_iter msg_iter; /* data */
void *msg_control; /* ancillary data */
__kernel_size_t msg_controllen; /* ancillary data buffer length */
unsigned int msg_flags; /* flags on received message */
struct kiocb *msg_iocb; /* ptr to iocb for async requests */
};

kiocb

struct kiocb {
struct file *ki_filp;

/* The 'ki_filp' pointer is shared in a union for aio */
randomized_struct_fields_start

loff_t ki_pos;
void (*ki_complete)(struct kiocb *iocb, long ret, long ret2);
void *private;
int ki_flags;
u16 ki_hint;
u16 ki_ioprio; /* See linux/ioprio.h */
unsigned int ki_cookie; /* for ->iopoll */

randomized_struct_fields_end
};

iovec

struct iovec
{
void *iov_base; /* Pointer to data. */
size_t iov_len; /* Length of data. */
};
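
这几个结构在用户态 sendmsg(2) 中的对应用法大致如下(示意,fd 假定是一个已连接的套接字):用户态的 msghdr 持有一个 iovec 数组,进入内核后会被转换成上面 msghdr 里的 msg_iter(iov_iter):

#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* 用 sendmsg 把两段缓冲区一次性发出去(分散/聚集写) */
ssize_t send_two_parts(int fd)
{
	char hdr[]  = "HEAD";
	char body[] = "BODY";

	struct iovec iov[2] = {
		{ .iov_base = hdr,  .iov_len = strlen(hdr)  },
		{ .iov_base = body, .iov_len = strlen(body) },
	};

	struct msghdr msg = {
		.msg_iov    = iov,
		.msg_iovlen = 2,
	};

	return sendmsg(fd, &msg, 0);
}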

实现

系统调用

SYSCALL_DEFINE6(sendto, int, fd, void __user *, buff, size_t, len,
unsigned int, flags, struct sockaddr __user *, addr,
int, addr_len)
{
return __sys_sendto(fd, buff, len, flags, addr, addr_len);
}

__sys_sendto

int __sys_sendto(int fd, void __user *buff, size_t len, unsigned int flags,
struct sockaddr __user *addr, int addr_len)
{
struct socket *sock;
struct sockaddr_storage address;
int err;
struct msghdr msg;
struct iovec iov;
int fput_needed;

err = import_single_range(WRITE, buff, len, &iov, &msg.msg_iter);
if (unlikely(err))
return err;
sock = sockfd_lookup_light(fd, &err, &fput_needed);
if (!sock)
goto out;

msg.msg_name = NULL;
msg.msg_control = NULL;
msg.msg_controllen = 0;
msg.msg_namelen = 0;
if (addr) {
err = move_addr_to_kernel(addr, addr_len, &address);
if (err < 0)
goto out_put;
msg.msg_name = (struct sockaddr *)&address;
msg.msg_namelen = addr_len;
}
if (sock->file->f_flags & O_NONBLOCK)
flags |= MSG_DONTWAIT;
msg.msg_flags = flags;
err = sock_sendmsg(sock, &msg);

out_put:
fput_light(sock->file, fput_needed);
out:
return err;
}

sock_sendmsg

int sock_sendmsg(struct socket *sock, struct msghdr *msg)
{
int err = security_socket_sendmsg(sock, msg,
msg_data_left(msg));

return err ?: sock_sendmsg_nosec(sock, msg);
}

static inline int sock_sendmsg_nosec(struct socket *sock, struct msghdr *msg)
{
int ret = INDIRECT_CALL_INET(sock->ops->sendmsg, inet6_sendmsg,
inet_sendmsg, sock, msg,
msg_data_left(msg));
BUG_ON(ret == -EIOCBQUEUED);
return ret;
}

最终调用了 sock->ops->sendmsg,也就是 inet_sendmsg

inet_sendmsg

可以看到,最终调用的是 sk_prot 的 sendmsg 方法,对于 TCP 就是 tcp_sendmsg

// net/ipv4/af_inet.c
int inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
{
struct sock *sk = sock->sk;

if (unlikely(inet_send_prepare(sk)))
return -EAGAIN;

return INDIRECT_CALL_2(sk->sk_prot->sendmsg, tcp_sendmsg, udp_sendmsg,
sk, msg, size);
}

tcp_sendmsg

int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
{
struct tcp_sock *tp = tcp_sk(sk);
struct sk_buff *skb;
int flags, err, copied = 0;
int mss_now = 0, size_goal, copied_syn = 0;
long timeo;
/* ... */

/* Ok commence sending. */
copied = 0;
restart:
mss_now = tcp_send_mss(sk, &size_goal, flags);

while (msg_data_left(msg)) {
int copy = 0;
int max = size_goal;

skb = tcp_write_queue_tail(sk);
if (tcp_send_head(sk)) {
if (skb->ip_summed == CHECKSUM_NONE)
max = mss_now;
copy = max - skb->len;
}

if (copy <= 0 || !tcp_skb_can_collapse_to(skb)) {
bool first_skb;

new_segment:
/* Allocate new segment. If the interface is SG,
* allocate skb fitting to single page.
*/
if (!sk_stream_memory_free(sk))
goto wait_for_sndbuf;
/* ... */
first_skb = skb_queue_empty(&sk->sk_write_queue);
skb = sk_stream_alloc_skb(sk,
select_size(sk, sg, first_skb),
sk->sk_allocation,
first_skb);
/* ... */
skb_entail(sk, skb);
copy = size_goal;
max = size_goal;
/* ... */
}

/* Try to append data to the end of skb. */
if (copy > msg_data_left(msg))
copy = msg_data_left(msg);

/* Where to copy to? */
if (skb_availroom(skb) > 0) {
/* We have some space in skb head. Superb! */
copy = min_t(int, copy, skb_availroom(skb));
err = skb_add_data_nocache(sk, skb, &msg->msg_iter, copy);
/* ... */
} else {
bool merge = true;
int i = skb_shinfo(skb)->nr_frags;
struct page_frag *pfrag = sk_page_frag(sk);
/* ... */
copy = min_t(int, copy, pfrag->size - pfrag->offset);
/* ... */
err = skb_copy_to_page_nocache(sk, &msg->msg_iter, skb,
pfrag->page,
pfrag->offset,
copy);
/* ... */
pfrag->offset += copy;
}

/* ... */
tp->write_seq += copy;
TCP_SKB_CB(skb)->end_seq += copy;
tcp_skb_pcount_set(skb, 0);

copied += copy;
if (!msg_data_left(msg)) {
if (unlikely(flags & MSG_EOR))
TCP_SKB_CB(skb)->eor = 1;
goto out;
}

if (skb->len < max || (flags & MSG_OOB) || unlikely(tp->repair))
continue;

if (forced_push(tp)) {
tcp_mark_push(tp, skb);
__tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_PUSH);
} else if (skb == tcp_send_head(sk))
tcp_push_one(sk, mss_now);
continue;
/* ... */
}
/* ... */
}

tcp_sendmsg 的实现还是很复杂的,这里面做了这样几件事情。

  • msg 是用户要写入的数据,这个数据要拷贝到内核协议栈里面去发送;在内核协议栈里面,网络包的数据都是由 struct sk_buff 维护的,因而第一件事情就是找到一个空闲的内存空间,将用户要写入的数据,拷贝到 struct sk_buff 的管辖范围内。而第二件事情就是发送 struct sk_buff。
  • 在 tcp_sendmsg 中,我们首先通过强制类型转换,将 sock 结构转换为 struct tcp_sock,这个是维护 TCP 连接状态的重要数据结构。
  • 接下来是 tcp_sendmsg 的第一件事情,把数据拷贝到 struct sk_buff。

我们先声明一个变量 copied,初始化为 0,这表示拷贝了多少数据。紧接着是一个循环,while (msg_data_left(msg)),也即如果用户的数据没有发送完毕,就一直循环。循环里声明了一个 copy 变量,表示这次拷贝的数值,在循环的最后有 copied += copy,将每次拷贝的数量都加起来。

我们这里只需要看一次循环做了哪些事情。

  1. tcp_write_queue_tail 从 TCP 写入队列 sk_write_queue 中拿出最后一个 struct sk_buff,在这个写入队列中排满了要发送的 struct sk_buff,为什么要拿最后一个呢?这里面只有最后一个,可能会因为上次用户给的数据太少,而没有填满。

  2. tcp_send_mss 会计算 MSS,也即 Max Segment Size。这是什么呢?意思是说,我们在网络上传输的网络包的大小是有限制的,而这个限制从最底层就开始存在(本节末尾有一个简单的算例)。

  3. 如果 copy 小于等于 0,说明最后一个 struct sk_buff 已经没有地方存放新数据了,需要调用 sk_stream_alloc_skb,重新分配 struct sk_buff,然后调用 skb_entail,将新分配的 sk_buff 放到队列尾部。

  4. 为了减少内存拷贝的代价,有的网络设备支持分散聚合(Scatter/Gather)I/O,顾名思义,就是 IP 层没必要通过内存拷贝进行聚合,让零散的数据放在原处,在设备层进行聚合。在注释 /* Where to copy to? */ 后面有个 if-else 分支。

    1. if 分支就是 skb_add_data_nocache 将数据拷贝到连续的数据区域。
    2. else 分支就是 skb_copy_to_page_nocache 将数据拷贝到 struct skb_shared_info 结构指向的不需要连续的页面区域。

  5. 最后就是要发送网络包了。

    1. 第一种情况是积累的数据报数目太多了,因而我们需要通过调用 __tcp_push_pending_frames 发送网络包。
    2. 第二种情况是,这是第一个网络包,需要马上发送,调用 tcp_push_one
    3. 无论 __tcp_push_pending_frames 还是 tcp_push_one,都会调用 tcp_write_xmit 发送网络包。
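
举个简单的算例:以常见的以太网为例,MTU 为 1500 字节,减去 20 字节的 IPv4 基本头和 20 字节的 TCP 基本头,MSS 大约是 1500 - 20 - 20 = 1460 字节(若携带 TCP 选项则更小);tcp_send_mss 返回的 size_goal 在开启 GSO/TSO 时还可能是 MSS 的整数倍,也就是一次可以"攒"多个 MSS,再交由网卡切分。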

tcp_write_xmit

这里面主要的逻辑是一个循环,用来处理发送队列,只要队列不空,就会发送。

主要涉及 TSO 和滑动窗口拥塞控制相关的部分,此处不介绍了,最终调用了 tcp_transmit_skb

static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle, int push_one, gfp_t gfp)
{
struct tcp_sock *tp = tcp_sk(sk);
struct sk_buff *skb;
unsigned int tso_segs, sent_pkts;
int cwnd_quota;
/* ... */
max_segs = tcp_tso_segs(sk, mss_now);
while ((skb = tcp_send_head(sk))) {
unsigned int limit;
/* ... */
tso_segs = tcp_init_tso_segs(skb, mss_now);
/* ... */
cwnd_quota = tcp_cwnd_test(tp, skb);
/* ... */
if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now))) {
is_rwnd_limited = true;
break;
}
/* ... */
limit = mss_now;
if (tso_segs > 1 && !tcp_urg_mode(tp))
limit = tcp_mss_split_point(sk, skb, mss_now, min_t(unsigned int, cwnd_quota, max_segs), nonagle);

if (skb->len > limit &&
unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
break;
/* ... */
if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
break;

repair:
/* Advance the send_head. This one is sent out.
* This call will increment packets_out.
*/
tcp_event_new_data_sent(sk, skb);

tcp_minshall_update(tp, mss_now, skb);
sent_pkts += tcp_skb_pcount(skb);

if (push_one)
break;
}
/* ... */
}

tcp_transmit_skb


static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
gfp_t gfp_mask)
{
const struct inet_connection_sock *icsk = inet_csk(sk);
struct inet_sock *inet;
struct tcp_sock *tp;
struct tcp_skb_cb *tcb;
struct tcphdr *th;
int err;

tp = tcp_sk(sk);

skb->skb_mstamp = tp->tcp_mstamp;
inet = inet_sk(sk);
tcb = TCP_SKB_CB(skb);
memset(&opts, 0, sizeof(opts));

tcp_header_size = tcp_options_size + sizeof(struct tcphdr);
skb_push(skb, tcp_header_size);

/* Build TCP header and checksum it. */
th = (struct tcphdr *)skb->data;
th->source = inet->inet_sport;
th->dest = inet->inet_dport;
th->seq = htonl(tcb->seq);
th->ack_seq = htonl(tp->rcv_nxt);
*(((__be16 *)th) + 6) = htons(((tcp_header_size >> 2) << 12) |
tcb->tcp_flags);

th->check = 0;
th->urg_ptr = 0;
/* ... */
tcp_options_write((__be32 *)(th + 1), tp, &opts);
th->window = htons(min(tp->rcv_wnd, 65535U));
/* ... */
err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl);
/* ... */
}

tcp_transmit_skb 这个函数比较长,主要做了两件事情:

  • 填充 TCP 头
  • 会调用 icsk_af_ops 的 queue_xmit 方法
    • icsk_af_ops 指向 ipv4_specific,也即调用的是 ip_queue_xmit 函数

const struct inet_connection_sock_af_ops ipv4_specific = {
.queue_xmit = ip_queue_xmit,
.send_check = tcp_v4_send_check,
.rebuild_header = inet_sk_rebuild_header,
.sk_rx_dst_set = inet_sk_rx_dst_set,
.conn_request = tcp_v4_conn_request,
.syn_recv_sock = tcp_v4_syn_recv_sock,
.net_header_len = sizeof(struct iphdr),
.setsockopt = ip_setsockopt,
.getsockopt = ip_getsockopt,
.addr2sockaddr = inet_csk_addr2sockaddr,
.sockaddr_len = sizeof(struct sockaddr_in),
.mtu_reduced = tcp_v4_mtu_reduced,
};

Recv

数据到达

tcp_v4_rcv


int tcp_v4_rcv(struct sk_buff *skb)
{
struct net *net = dev_net(skb->dev);
const struct iphdr *iph;
const struct tcphdr *th;
bool refcounted;
struct sock *sk;
int ret;
......
th = (const struct tcphdr *)skb->data;
iph = ip_hdr(skb);
......
TCP_SKB_CB(skb)->seq = ntohl(th->seq);
TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin + skb->len - th->doff * 4);
TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
TCP_SKB_CB(skb)->tcp_flags = tcp_flag_byte(th);
TCP_SKB_CB(skb)->tcp_tw_isn = 0;
TCP_SKB_CB(skb)->ip_dsfield = ipv4_get_dsfield(iph);
TCP_SKB_CB(skb)->sacked = 0;

lookup:
sk = __inet_lookup_skb(&tcp_hashinfo, skb, __tcp_hdrlen(th), th->source, th->dest, &refcounted);

process:
if (sk->sk_state == TCP_TIME_WAIT)
goto do_time_wait;

if (sk->sk_state == TCP_NEW_SYN_RECV) {
......
}
......
th = (const struct tcphdr *)skb->data;
iph = ip_hdr(skb);

skb->dev = NULL;

if (sk->sk_state == TCP_LISTEN) {
ret = tcp_v4_do_rcv(sk, skb);
goto put_and_return;
}
......
if (!sock_owned_by_user(sk)) {
if (!tcp_prequeue(sk, skb))
ret = tcp_v4_do_rcv(sk, skb);
} else if (tcp_add_backlog(sk, skb)) {
goto discard_and_relse;
}
......
}

在 tcp_v4_rcv 中,得到 TCP 的头之后,我们可以开始处理 TCP 层的事情。因为 TCP 层是分状态的,状态被维护在数据结构 struct sock 里面,因而我们要根据 IP 地址以及 TCP 头里面的内容,在 tcp_hashinfo 中找到这个包对应的 struct sock,从而得到这个包对应的连接的状态。

The packet is then handled differently depending on that state. For example, TCP_LISTEN and TCP_NEW_SYN_RECV in the code above belong to connection establishment, which was covered in the discussion of the three-way handshake; TCP_TIME_WAIT belongs to connection teardown and can be ignored here.
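
The lookup itself, __inet_lookup_skb, essentially hashes the packet's 4-tuple (source address, source port, destination address, destination port) and walks the matching bucket of tcp_hashinfo until it finds the sock whose tuple matches. Below is a deliberately simplified user-space model of that idea; the types and the toy hash are illustrative only and nothing like the kernel's actual tables:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified connection table keyed by the 4-tuple. */
struct four_tuple {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

struct conn {
    struct four_tuple key;
    int state;                 /* e.g. established */
    struct conn *next;         /* bucket chain */
};

#define NBUCKETS 256

static unsigned int hash_tuple(const struct four_tuple *t)
{
    /* Toy hash; the kernel mixes the tuple with a random secret. */
    return (t->saddr ^ t->daddr ^ ((uint32_t)t->sport << 16 | t->dport)) % NBUCKETS;
}

static struct conn *lookup(struct conn *table[NBUCKETS],
                           const struct four_tuple *t)
{
    for (struct conn *c = table[hash_tuple(t)]; c; c = c->next)
        if (c->key.saddr == t->saddr && c->key.daddr == t->daddr &&
            c->key.sport == t->sport && c->key.dport == t->dport)
            return c;          /* found the "sock" for this packet */
    return NULL;
}

int main(void)
{
    static struct conn *table[NBUCKETS];
    struct conn c = { .key = { 0x0a000001, 0x0a000002, 12345, 80 }, .state = 1 };
    table[hash_tuple(&c.key)] = &c;

    struct four_tuple probe = { 0x0a000001, 0x0a000002, 12345, 80 };
    printf("lookup: %s\n", lookup(table, &probe) ? "hit" : "miss");
    return 0;
}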

Next, let's look at the mainstream receive path for a packet, which involves three queues:

  • the backlog queue
  • the prequeue queue
  • the sk_receive_queue queue

Why does the receive path have to shuffle the same packet between these three queues? Because a single packet is handed off between three parties:

  • The softirq handler. Remember that tcp_v4_rcv still runs inside softirq context, so any time spent here keeps that softirq busy.
  • The user-space process. When user space issues a read system call to fetch data, it has to find the packet in one of these queues.
  • The kernel protocol stack. Even if the user process never calls read, incoming packets still need somewhere to be stored.

Now the sock_owned_by_user check in the code above makes sense: it asks whether a user-space process currently holds this sock, i.e. is in the middle of a socket operation on it. If it does, the softirq cannot process the packet right away, so it calls tcp_add_backlog to park the packet in the backlog queue and leaves softirq processing as quickly as possible.

And if nobody holds the sock but a user-space process is sleeping in a read, waiting for data? Then we first call tcp_prequeue, i.e. hand the packet to that reader's prequeue queue and get out of softirq processing quickly. Inside tcp_prequeue there is a check on sysctl_tcp_low_latency, which decides whether packets should be processed with low latency.

With sysctl_tcp_low_latency set to 0, the packet is parked in the prequeue queue, so the softirq can return without waiting for the packet to be fully processed, at the cost of higher latency. With sysctl_tcp_low_latency set to 1, we still call tcp_v4_do_rcv and process the packet right here in the softirq.
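
Putting the three hand-offs together, the decision made in softirq context can be summarized by a small model. This is only a sketch of the dispatch logic described above; owned_by_user, reader_waiting and low_latency are placeholder flags, not kernel APIs:

#include <stdbool.h>
#include <stdio.h>

/* Simplified model of where tcp_v4_rcv puts an incoming packet. */
enum dest { PROCESS_NOW, PREQUEUE, BACKLOG };

static enum dest dispatch(bool owned_by_user, bool reader_waiting,
                          bool low_latency)
{
    if (owned_by_user)
        return BACKLOG;        /* user holds the socket lock: park the skb */
    if (reader_waiting && !low_latency)
        return PREQUEUE;       /* hand it to the sleeping reader, leave softirq */
    return PROCESS_NOW;        /* tcp_v4_do_rcv -> sk_receive_queue */
}

int main(void)
{
    printf("%d %d %d\n",
           dispatch(true,  false, false),   /* BACKLOG */
           dispatch(false, true,  false),   /* PREQUEUE */
           dispatch(false, true,  true));   /* PROCESS_NOW */
    return 0;
}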

tcp_v4_do_rcv

int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
{
    struct sock *rsk;

    if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
        struct dst_entry *dst = sk->sk_rx_dst;
        ......
        tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len);
        return 0;
    }
    ......
    if (tcp_rcv_state_process(sk, skb)) {
        ......
    }
    return 0;
    ......
}

tcp_v4_do_rcv distinguishes two cases: if the connection is already established (TCP_ESTABLISHED), it takes the fast path and calls tcp_rcv_established; for every other state it calls tcp_rcv_state_process.

tcp_rcv_established

In the established state we call tcp_rcv_established, which in turn calls tcp_data_queue to place the packet on the sk_receive_queue queue for further processing.

tcp_data_queue

static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
{
    struct tcp_sock *tp = tcp_sk(sk);
    bool fragstolen = false;
    int eaten = -1;
    ......
    if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt) {
        if (tcp_receive_window(tp) == 0)
            goto out_of_window;

        /* Ok. In sequence. In window. */
        if (tp->ucopy.task == current &&
            tp->copied_seq == tp->rcv_nxt && tp->ucopy.len &&
            sock_owned_by_user(sk) && !tp->urg_data) {
            int chunk = min_t(unsigned int, skb->len,
                              tp->ucopy.len);

            __set_current_state(TASK_RUNNING);

            if (!skb_copy_datagram_msg(skb, 0, tp->ucopy.msg, chunk)) {
                tp->ucopy.len -= chunk;
                tp->copied_seq += chunk;
                eaten = (chunk == skb->len);
                tcp_rcv_space_adjust(sk);
            }
        }

        if (eaten <= 0) {
queue_and_out:
            ......
            eaten = tcp_queue_rcv(sk, skb, 0, &fragstolen);
        }
        tcp_rcv_nxt_update(tp, TCP_SKB_CB(skb)->end_seq);
        ......
        if (!RB_EMPTY_ROOT(&tp->out_of_order_queue)) {
            tcp_ofo_queue(sk);
            ......
        }
        ......
        return;
    }

    if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
        /* A retransmit, 2nd most common case. Force an immediate ack. */
        tcp_dsack_set(sk, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);

out_of_window:
        tcp_enter_quickack_mode(sk);
        inet_csk_schedule_ack(sk);
drop:
        tcp_drop(sk, skb);
        return;
    }

    /* Out of window. F.e. zero window probe. */
    if (!before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt + tcp_receive_window(tp)))
        goto out_of_window;

    tcp_enter_quickack_mode(sk);

    if (before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
        /* Partial packet, seq < rcv_next < end_seq */
        tcp_dsack_set(sk, TCP_SKB_CB(skb)->seq, tp->rcv_nxt);
        /* If window is closed, drop tail of packet. But after
         * remembering D-SACK for its head made in previous line.
         */
        if (!tcp_receive_window(tp))
            goto out_of_window;
        goto queue_and_out;
    }

    tcp_data_queue_ofo(sk, skb);
}

tcp_data_queue handles an incoming packet differently depending on its sequence number.

In the first case, seq == tp->rcv_nxt, the packet is exactly the next one the receiver expects. If the packet is being processed in the context of the user process that is waiting in a read (tp->ucopy.task == current and sock_owned_by_user), the payload can be copied straight into that process's buffer with skb_copy_datagram_msg.

If no user process is waiting to read, or the direct copy fails (for example for memory reasons), tcp_queue_rcv places the packet on the sk_receive_queue queue instead.

Next, tcp_rcv_nxt_update sets tp->rcv_nxt to end_seq: once the current packet has been received successfully, the next expected sequence number is advanced.

At this point we also check the other queue, out_of_order_queue: now that this new packet has arrived, some packets that were queued out of order may have become contiguous and can be moved onto sk_receive_queue as well.
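
The before() and after() checks used above rely on wrap-safe sequence arithmetic: the difference of two 32-bit sequence numbers is interpreted as a signed value, so comparisons keep working across the 2^32 wrap. A small demo of that trick (the helpers mirror the definitions in include/net/tcp.h):

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

/* Wrap-safe sequence comparison, as in include/net/tcp.h. */
static bool before(uint32_t seq1, uint32_t seq2)
{
    return (int32_t)(seq1 - seq2) < 0;
}
#define after(seq2, seq1) before(seq1, seq2)

int main(void)
{
    uint32_t rcv_nxt = 0xfffffff0u;          /* receiver expects this next */
    uint32_t seq     = 0x00000010u;          /* arrives after a wrap-around */

    /* A plain "<" would call this packet old; the signed trick does not. */
    printf("naive: seq < rcv_nxt        -> %d\n", seq < rcv_nxt);
    printf("tcp:   before(seq, rcv_nxt) -> %d\n", before(seq, rcv_nxt));
    printf("tcp:   after(seq, rcv_nxt)  -> %d\n", after(seq, rcv_nxt));
    return 0;
}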

Implementation

System call

SYSCALL_DEFINE4(recv, int, fd, void __user *, ubuf, size_t, size,
                unsigned int, flags)
{
    return __sys_recvfrom(fd, ubuf, size, flags, NULL, NULL);
}

__sys_recvfrom

int __sys_recvfrom(int fd, void __user *ubuf, size_t size, unsigned int flags,
                   struct sockaddr __user *addr, int __user *addr_len)
{
    struct socket *sock;
    struct iovec iov;
    struct msghdr msg;
    struct sockaddr_storage address;
    int err, err2;
    int fput_needed;

    err = import_single_range(READ, ubuf, size, &iov, &msg.msg_iter);
    if (unlikely(err))
        return err;
    sock = sockfd_lookup_light(fd, &err, &fput_needed);
    if (!sock)
        goto out;

    msg.msg_control = NULL;
    msg.msg_controllen = 0;
    /* Save some cycles and don't copy the address if not needed */
    msg.msg_name = addr ? (struct sockaddr *)&address : NULL;
    /* We assume all kernel code knows the size of sockaddr_storage */
    msg.msg_namelen = 0;
    msg.msg_iocb = NULL;
    msg.msg_flags = 0;
    if (sock->file->f_flags & O_NONBLOCK)
        flags |= MSG_DONTWAIT;
    err = sock_recvmsg(sock, &msg, flags);

    if (err >= 0 && addr != NULL) {
        err2 = move_addr_to_user(&address,
                                 msg.msg_namelen, addr, addr_len);
        if (err2 < 0)
            err = err2;
    }

    fput_light(sock->file, fput_needed);
out:
    return err;
}
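
One detail worth noting in __sys_recvfrom: if the socket's file was opened with O_NONBLOCK, the kernel ORs MSG_DONTWAIT into the flags, so the per-socket flag and the per-call flag end up on the same non-blocking path. A small user-space illustration of the two equivalent options, assuming fd is a connected TCP socket:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Option 1: per call -- pass MSG_DONTWAIT to this recv() only. */
static ssize_t recv_nowait(int fd, char *buf, size_t len)
{
    ssize_t n = recv(fd, buf, len, MSG_DONTWAIT);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        fprintf(stderr, "no data queued yet\n");
    return n;
}

/* Option 2: per socket -- O_NONBLOCK on the file makes every recv()
 * non-blocking, because __sys_recvfrom ORs in MSG_DONTWAIT as seen above. */
static int make_nonblocking(int fd)
{
    int fl = fcntl(fd, F_GETFL, 0);
    return fl < 0 ? -1 : fcntl(fd, F_SETFL, fl | O_NONBLOCK);
}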

sock_recvmsg

int sock_recvmsg(struct socket *sock, struct msghdr *msg, int flags)
{
    int err = security_socket_recvmsg(sock, msg, msg_data_left(msg), flags);

    return err ?: sock_recvmsg_nosec(sock, msg, flags);
}

static inline int sock_recvmsg_nosec(struct socket *sock, struct msghdr *msg,
                                     int flags)
{
    return INDIRECT_CALL_INET(sock->ops->recvmsg, inet6_recvmsg,
                              inet_recvmsg, sock, msg, msg_data_left(msg),
                              flags);
}

This ends up calling sock->ops->recvmsg, which for an IPv4 socket is inet_recvmsg.

inet_recvmsg

As the code shows, it ultimately calls the recvmsg method of sk->sk_prot, which for TCP is tcp_recvmsg.

int inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
                 int flags)
{
    struct sock *sk = sock->sk;
    int addr_len = 0;
    int err;

    if (likely(!(flags & MSG_ERRQUEUE)))
        sock_rps_record_flow(sk);

    err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg,
                          sk, msg, size, flags & MSG_DONTWAIT,
                          flags & ~MSG_DONTWAIT, &addr_len);
    if (err >= 0)
        msg->msg_namelen = addr_len;
    return err;
}
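
So a recv() on a TCP socket passes through two layers of function pointers: sock->ops (the address-family layer; inet_stream_ops for a TCP socket) selects inet_recvmsg, which then calls sk->sk_prot->recvmsg (the protocol layer; tcp_prot) and lands in tcp_recvmsg. A stripped-down model of that two-level dispatch, with hypothetical structs for illustration only:

#include <stdio.h>

/* Hypothetical two-level dispatch mirroring sock->ops and sk->sk_prot. */
struct proto     { int (*recvmsg)(void); };                   /* e.g. tcp_prot */
struct proto_ops { int (*recvmsg)(const struct proto *); };   /* e.g. inet_stream_ops */

static int tcp_recvmsg_stub(void)
{
    puts("tcp_recvmsg");
    return 0;
}

static int inet_recvmsg_stub(const struct proto *prot)
{
    puts("inet_recvmsg");
    return prot->recvmsg();            /* sk->sk_prot->recvmsg */
}

int main(void)
{
    struct proto tcp_prot = { .recvmsg = tcp_recvmsg_stub };
    struct proto_ops inet_stream_ops = { .recvmsg = inet_recvmsg_stub };

    return inet_stream_ops.recvmsg(&tcp_prot);   /* sock->ops->recvmsg */
}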

tcp_recvmsg

int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
                int flags, int *addr_len)
{
    struct tcp_sock *tp = tcp_sk(sk);
    int copied = 0;
    u32 peek_seq;
    u32 *seq;
    unsigned long used;
    int err;
    int target;     /* Read at least this many bytes */
    long timeo;
    struct task_struct *user_recv = NULL;
    struct sk_buff *skb, *last;
    ......
    do {
        u32 offset;
        ......
        /* Next get a buffer. */
        last = skb_peek_tail(&sk->sk_receive_queue);
        skb_queue_walk(&sk->sk_receive_queue, skb) {
            last = skb;
            offset = *seq - TCP_SKB_CB(skb)->seq;
            if (offset < skb->len)
                goto found_ok_skb;
            ......
        }
        ......
        if (!sysctl_tcp_low_latency && tp->ucopy.task == user_recv) {
            /* Install new reader */
            if (!user_recv && !(flags & (MSG_TRUNC | MSG_PEEK))) {
                user_recv = current;
                tp->ucopy.task = user_recv;
                tp->ucopy.msg = msg;
            }

            tp->ucopy.len = len;
            /* Look: we have the following (pseudo)queues:
             *
             * 1. packets in flight
             * 2. backlog
             * 3. prequeue
             * 4. receive_queue
             *
             * Each queue can be processed only if the next ones
             * are empty.
             */
            if (!skb_queue_empty(&tp->ucopy.prequeue))
                goto do_prequeue;
        }

        if (copied >= target) {
            /* Do not sleep, just process backlog. */
            release_sock(sk);
            lock_sock(sk);
        } else {
            sk_wait_data(sk, &timeo, last);
        }

        if (user_recv) {
            int chunk;
            chunk = len - tp->ucopy.len;
            if (chunk != 0) {
                len -= chunk;
                copied += chunk;
            }

            if (tp->rcv_nxt == tp->copied_seq &&
                !skb_queue_empty(&tp->ucopy.prequeue)) {
do_prequeue:
                tcp_prequeue_process(sk);

                chunk = len - tp->ucopy.len;
                if (chunk != 0) {
                    len -= chunk;
                    copied += chunk;
                }
            }
        }
        continue;
found_ok_skb:
        /* Ok so how much can we use? */
        used = skb->len - offset;
        if (len < used)
            used = len;

        if (!(flags & MSG_TRUNC)) {
            err = skb_copy_datagram_msg(skb, offset, msg, used);
            ......
        }

        *seq += used;
        copied += used;
        len -= used;

        tcp_rcv_space_adjust(sk);
        ......
    } while (len > 0);
    ......
}

tcp_recvmsg is long and its logic is fairly involved, but a comment inside it sums up the idea. The comment mentions the three queues: receive_queue, prequeue and backlog, and a queue is only processed once the ones ahead of it have been drained: receive_queue first, then prequeue, then backlog.

That is exactly how tcp_recvmsg is structured: a while loop keeps pulling packets until enough data has been read.

The loop first walks the sk_receive_queue queue. If it finds a packet, it jumps to found_ok_skb, calls skb_copy_datagram_msg to copy the data into the user buffer, and then moves on to the next iteration.

Only when sk_receive_queue has been drained do we reach the sysctl_tcp_low_latency check. If low latency is not required, a prequeue queue exists, so we jump to do_prequeue and call tcp_prequeue_process to handle it.

If sysctl_tcp_low_latency is set to 1 (so there is no prequeue queue), or the prequeue queue is empty, the backlog queue still has to be processed; that happens in release_sock.

release_sock calls __release_sock, which walks the backlog and processes the queued packets one by one.

void release_sock(struct sock *sk)
{
    ......
    if (sk->sk_backlog.tail)
        __release_sock(sk);
    ......
}

static void __release_sock(struct sock *sk)
    __releases(&sk->sk_lock.slock)
    __acquires(&sk->sk_lock.slock)
{
    struct sk_buff *skb, *next;

    while ((skb = sk->sk_backlog.head) != NULL) {
        sk->sk_backlog.head = sk->sk_backlog.tail = NULL;
        do {
            next = skb->next;
            prefetch(next);
            skb->next = NULL;
            sk_backlog_rcv(sk, skb);
            cond_resched();
            skb = next;
        } while (skb != NULL);
    }
    ......
}

Finally, if none of the queues has any data, all we can do is call sk_wait_data and block there, waiting for a packet to arrive.
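
How long sk_wait_data is allowed to sleep comes from the socket's receive timeout (the timeo variable in tcp_recvmsg), so user space can bound this wait with SO_RCVTIMEO, or skip it entirely with MSG_DONTWAIT. A short sketch, assuming fd is a connected TCP socket:

#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* Bound the wait inside tcp_recvmsg/sk_wait_data to 2 seconds. */
static int set_recv_timeout(int fd)
{
    struct timeval tv = { .tv_sec = 2, .tv_usec = 0 };
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}

static ssize_t bounded_recv(int fd, char *buf, size_t len)
{
    ssize_t n = recv(fd, buf, len, 0);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        fprintf(stderr, "recv timed out waiting for data\n");
    return n;
}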

With that, the receive path for a network packet is complete.

References