summaryrefslogtreecommitdiff
path: root/include/uapi/linux
AgeCommit message (Collapse)Author
2013-04-29packet_diag: disclose uid valueNicolas Dichtel
This value is disclosed via /proc/net/packet but not via netlink messages. The goal is to have the same level of information. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-29vxlan: allow choosing destination port per vxlanstephen hemminger
Allow configuring the default destination port on a per-device basis. Adds new netlink paramater IFLA_VXLAN_PORT to allow setting destination port when creating new vxlan. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-29vxlan: source compatiablity with IFLA_VXLAN_GROUP (v2)stephen hemminger
Source compatiability for build iproute2 was broken by: commit c7995c43facc6e5dea4de63fa9d283a337aabeb1 Author: Atzm Watanabe <atzm@stratosphere.co.jp> vxlan: Allow setting destination to unicast address. Since this commit has not made it upstream (still net-next), and better to avoid gratitious changes to exported API's; go back to original definition, and add a comment. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-25packet: if hw/sw ts enabled in rx/tx ring, report which ts we gotDaniel Borkmann
Currently, there is no way to find out which timestamp is reported in tpacket{,2,3}_hdr's tp_sec, tp_{n,u}sec members. It can be one of SOF_TIMESTAMPING_SYS_HARDWARE, SOF_TIMESTAMPING_RAW_HARDWARE, SOF_TIMESTAMPING_SOFTWARE, or a fallback variant late call from the PF_PACKET code in software. Therefore, report in the tp_status member of the ring buffer which timestamp has been reported for RX and TX path. This should not break anything for the following reasons: i) in RX ring path, the user needs to test for tp_status & TP_STATUS_USER, and later for other flags as well such as TP_STATUS_VLAN_VALID et al, so adding other flags will do no harm; ii) in TX ring path, time stamps with PACKET_TIMESTAMP socketoption are not available resp. had no effect except that the application setting this is buggy. Next to TP_STATUS_AVAILABLE, the user also should check for other flags such as TP_STATUS_WRONG_FORMAT to reclaim frames to the application. Thus, in case TX ts are turned off (default case), nothing happens to the application logic, and in case we want to use this new feature, we now can also check which of the ts source is reported in the status field as provided in the docs. Reported-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-25packet: minor: convert status bits into shifting formatDaniel Borkmann
This makes it more readable and clearer what bits are still free to use. The compiler reduces this to a constant for us anyway. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-25Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== The following patchset contains fixes for recently applied Netfilter/IPVS updates to the net-next tree, most relevantly they are: * Fix sparse warnings introduced in the RCU conversion, from Julian Anastasov. * Fix wrong endianness in the size field of IPVS sync messages, from Simon Horman. * Fix missing if checking in nf_xfrm_me_harder, from Dan Carpenter. * Fix off by one access in the IPVS SCTP tracking code, again from Dan Carpenter. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-24Merge branch 'master' of ↵John W. Linville
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem
2013-04-23caif: Remove my bouncing email address.sjur.brandeland@stericsson.com
Remove my soon bouncing email address. Also remove the "Contact:" line in file header. The MAINTAINERS file is a better place to find the contact person anyway. Signed-off-by: Sjur Brændeland <sjur.brandeland@stericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-23ipvs: fix sparse warnings for some parametersJulian Anastasov
Some service fields are in network order: - netmask: used once in network order and also as prefix len for IPv6 - port Other parameters are in host order: - struct ip_vs_flags: flags and mask moved between user and kernel only - sync state: moved between user and kernel only - syncid: sent over network as single octet Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/emulex/benet/be_main.c drivers/net/ethernet/intel/igb/igb_main.c drivers/net/wireless/brcm80211/brcmsmac/mac80211_if.c include/net/scm.h net/batman-adv/routing.c net/ipv4/tcp_input.c The e{uid,gid} --> {uid,gid} credentials fix conflicted with the cleanup in net-next to now pass cred structs around. The be2net driver had a bug fix in 'net' that overlapped with the VLAN interface changes by Patrick McHardy in net-next. An IGB conflict existed because in 'net' the build_skb() support was reverted, and in 'net-next' there was a comment style fix within that code. Several batman-adv conflicts were resolved by making sure that all calls to batadv_is_my_mac() are changed to have a new bat_priv first argument. Eric Dumazet's TS ECR fix in TCP in 'net' conflicted with the F-RTO rewrite in 'net-next', mostly overlapping changes. Thanks to Stephen Rothwell and Antonio Quartulli for help with several of these merge resolutions. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-22Merge branch 'for-john' of ↵John W. Linville
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
2013-04-22cfg80211: introduce critical protocol indication from user-spaceArend van Spriel
Some protocols need a more reliable connection to complete successful in reasonable time. This patch adds a user-space API to indicate the wireless driver that a critical protocol is about to commence and when it is done, using nl80211 primitives NL80211_CMD_CRIT_PROTOCOL_START and NL80211_CRIT_PROTOCOL_STOP. There can be only on critical protocol session started per registered cfg80211 device. The driver can support this by implementing the cfg80211 callbacks .crit_proto_start() and .crit_proto_stop(). Examples of protocols that can benefit from this are DHCP, EAPOL, APIPA. Exactly how the link can/should be made more reliable is up to the driver. Things to consider are avoid scanning, no multi-channel operations, and alter coexistence schemes. Reviewed-by: Pieter-Paul Giesberts <pieterpg@broadcom.com> Reviewed-by: Franky (Zhenhui) Lin <frankyl@broadcom.com> Signed-off-by: Arend van Spriel <arend@broadcom.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2013-04-19netlink: add RX/TX-ring support to netlink diagPatrick McHardy
Based on AF_PACKET. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19netlink: mmaped netlink: ring setupPatrick McHardy
Add support for mmap'ed RX and TX ring setup and teardown based on the af_packet.c code. The following patches will use this to add the real mmap'ed receive and transmit functionality. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19net: vlan: add 802.1ad supportPatrick McHardy
Add support for 802.1ad VLAN devices. This mainly consists of checking for ETH_P_8021AD in addition to ETH_P_8021Q in a couple of places and check offloading capabilities based on the used protocol. Configuration is done using "ip link": # ip link add link eth0 eth0.1000 \ type vlan proto 802.1ad id 1000 # ip link add link eth0.1000 eth0.1000.1000 \ type vlan proto 802.1q id 1000 52:54:00:12:34:56 > 92:b1:54:28:e4:8c, ethertype 802.1Q (0x8100), length 106: vlan 1000, p 0, ethertype 802.1Q, vlan 1000, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto ICMP (1), length 84) 20.1.0.2 > 20.1.0.1: ICMP echo request, id 3003, seq 8, length 64 92:b1:54:28:e4:8c > 52:54:00:12:34:56, ethertype 802.1Q-QinQ (0x88a8), length 106: vlan 1000, p 0, ethertype 802.1Q, vlan 1000, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 47944, offset 0, flags [none], proto ICMP (1), length 84) 20.1.0.1 > 20.1.0.2: ICMP echo reply, id 3003, seq 8, length 64 Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-18tcp: introduce TCPSpuriousRtxHostQueues SNMP counterEric Dumazet
Host queues (Qdisc + NIC) can hold packets so long that TCP can eventually retransmit a packet before the first transmit even left the host. Its not clear right now if we could avoid this in the first place : - We could arm RTO timer not at the time we enqueue packets, but at the time we TX complete them (tcp_wfree()) - Cancel the sending of the new copy of the packet if prior one is still in queue. This patch adds instrumentation so that we can at least see how often this problem happens. TCPSpuriousRtxHostQueues SNMP counter is incremented every time we detect the fast clone is not yet freed in tcp_transmit_skb() Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-17Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch Jesse Gross says: ==================== A number of improvements for net-next/3.10. Highlights include: * Properly exposing linux/openvswitch.h to userspace after the uapi changes. * Simplification of locking. It immediately makes things simpler to reason about and avoids holding RTNL mutex for longer than necessary. In the near future it will also enable tunnel registration and more fine-grained locking. * Miscellaneous cleanups and simplifications. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-17fuse: fix type definitions in uapi headerMiklos Szeredi
Commit 7e98d53086d18c877cb44e9065219335184024de (Synchronize fuse header with one used in library) added #ifdef __linux__ around defines if it is not set. The kernel build is self-contained and can be built on non-Linux toolchains. After the mentioned commit builds on non-Linux toolchains will try to include stdint.h and fail due to -nostdinc, and then fail with a bunch of undefined type errors. Fix by checking for __KERNEL__ instead of __linux__ and using the standard int types instead of the linux specific ones. Reported-by: Arve Hjønnevåg <arve@android.com> Reported-by: Colin Cross <ccross@android.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-04-16vxlan: Allow setting destination to unicast address.Atzm Watanabe
This patch allows setting VXLAN destination to unicast address. It allows that VXLAN can be used as peer-to-peer tunnel without multicast. v4: generalize struct vxlan_dev, "gaddr" is replaced with vxlan_rdst. "GROUP" attribute is replaced with "REMOTE". they are based by David Stevens's comments. v3: move a new attribute REMOTE into the last of an enum list based by Stephen Hemminger's comments. v2: use a new attribute REMOTE instead of GROUP based by Cong Wang's comments. Signed-off-by: Atzm Watanabe <atzm@stratosphere.co.jp> Acked-by: David L Stevens <dlstevens@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-12rfkill: Add NFC to the list of supported radiosSamuel Ortiz
And return the proper string for it. Acked-by: Johannes Berg <johannes@sipsolutions.net> Acked-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2013-04-11Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== 1) Allow to avoid copying DSCP during encapsulation by setting a SA flag. From Nicolas Dichtel. 2) Constify the netlink dispatch table, no need to modify it at runtime. From Mathias Krause. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-11NFC: llcp: Add support in getsockopt for RW, LTO, and MIU remote parametersThierry Escande
Useful for LLCP validation tests. Signed-off-by: Thierry Escande <thierry.escande@linux.intel.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2013-04-09net: sctp: introduce uapi header for sctpDaniel Borkmann
This patch introduces an UAPI header for the SCTP protocol, so that we can facilitate the maintenance and development of user land applications or libraries, in particular in terms of header synchronization. To not break compatibility, some fragments from lksctp-tools' netinet/sctp.h have been carefully included, while taking care that neither kernel nor user land breaks, so both compile fine with this change (for lksctp-tools I tested with the old netinet/sctp.h header and with a newly adapted one that includes the uapi sctp header). lksctp-tools smoke test run through successfully as well in both cases. Suggested-by: Neil Horman <nhorman@tuxdriver.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-08net: ipv6: add tokenized interface identifier supportDaniel Borkmann
This patch adds support for IPv6 tokenized IIDs, that allow for administrators to assign well-known host-part addresses to nodes whilst still obtaining global network prefix from Router Advertisements. It is currently in draft status. The primary target for such support is server platforms where addresses are usually manually configured, rather than using DHCPv6 or SLAAC. By using tokenised identifiers, hosts can still determine their network prefix by use of SLAAC, but more readily be automatically renumbered should their network prefix change. [...] The disadvantage with static addresses is that they are likely to require manual editing should the network prefix in use change. If instead there were a method to only manually configure the static identifier part of the IPv6 address, then the address could be automatically updated when a new prefix was introduced, as described in [RFC4192] for example. In such cases a DNS server might be configured with such a tokenised interface identifier of ::53, and SLAAC would use the token in constructing the interface address, using the advertised prefix. [...] http://tools.ietf.org/html/draft-chown-6man-tokenised-ipv6-identifiers-02 The implementation is partially based on top of Mark K. Thompson's proof of concept. However, it uses the Netlink interface for configuration resp. data retrival, so that it can be easily extended in future. Successfully tested by myself. Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Cc: Thomas Graf <tgraf@suug.ch> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-02netfilter: fix struct ip6t_frag field descriptionMichal Kubeček
Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-04-02netfilter: xt_NFQUEUE: introduce CPU fanoutholger@eitzenberger.org
Current NFQUEUE target uses a hash, computed over source and destination address (and other parameters), for steering the packet to the actual NFQUEUE. This, however forgets about the fact that the packet eventually is handled by a particular CPU on user request. If E. g. 1) IRQ affinity is used to handle packets on a particular CPU already (both single-queue or multi-queue case) and/or 2) RPS is used to steer packets to a specific softirq the target easily chooses an NFQUEUE which is not handled by a process pinned to the same CPU. The idea is therefore to use the CPU index for determining the NFQUEUE handling the packet. E. g. when having a system with 4 CPUs, 4 MQ queues and 4 NFQUEUEs it looks like this: +-----+ +-----+ +-----+ +-----+ |NFQ#0| |NFQ#1| |NFQ#2| |NFQ#3| +-----+ +-----+ +-----+ +-----+ ^ ^ ^ ^ | |NFQUEUE | | + + + + +-----+ +-----+ +-----+ +-----+ |rx-0 | |rx-1 | |rx-2 | |rx-3 | +-----+ +-----+ +-----+ +-----+ The NFQUEUEs not necessarily have to start with number 0, setups with less NFQUEUEs than packet-handling CPUs are not a problem as well. This patch extends the NFQUEUE target to accept a new NFQ_FLAG_CPU_FANOUT flag. If this is specified the target uses the CPU index for determining the NFQUEUE being used. I have to introduce rev3 for this. The 'flags' are folded into _v2 'bypass'. By changing the way which queue is assigned, I'm able to improve the performance if the processes reading on the NFQUEUs are pinned correctly. Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-03-29openvswitch: Expose <linux/openvswitch.h> to userspaceThomas Graf
It contains the public netlink interface bits required by userspace to make use of the interface. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Jesse Gross <jesse@nicira.com>
2013-03-28net: add ETH_P_802_3_MINSimon Horman
Add a new constant ETH_P_802_3_MIN, the minimum ethernet type for an 802.3 frame. Frames with a lower value in the ethernet type field are Ethernet II. Also update all the users of this value that David Miller and I could find to use the new constant. Also correct a bug in util.c. The comparison with ETH_P_802_3_MIN should be >= not >. As suggested by Jesse Gross. Compile tested only. Cc: David Miller <davem@davemloft.net> Cc: Jesse Gross <jesse@nicira.com> Cc: Karsten Keil <isdn@linux-pingi.de> Cc: John W. Linville <linville@tuxdriver.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Bart De Schuymer <bart.de.schuymer@pandora.be> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Gustavo Padovan <gustavo@padovan.org> Cc: Johan Hedberg <johan.hedberg@gmail.com> Cc: linux-bluetooth@vger.kernel.org Cc: netfilter-devel@vger.kernel.org Cc: bridge@lists.linux-foundation.org Cc: linux-wireless@vger.kernel.org Cc: linux1394-devel@lists.sourceforge.net Cc: linux-media@vger.kernel.org Cc: netdev@vger.kernel.org Cc: dev@openvswitch.org Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com> Acked-by: Stefan Richter <stefanr@s5r6.in-berlin.de> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-26netlink: remove duplicated NLMSG_ALIGNHong zhi guo
NLMSG_HDRLEN is already aligned value. It's for directly reference without extra alignment. The redundant alignment here may confuse the API users. Signed-off-by: Hong Zhiguo <honkiko@gmail.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Pull to get the thermal netlink multicast group name fix, otherwise the assertion added in net-next to netlink to detect that kind of bug makes systems unbootable for some folks. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-21netlink: Diag core and basic socket info dumping (v2)Andrey Vagin
The netlink_diag can be built as a module, just like it's done in unix sockets. The core dumping message carries the basic info about netlink sockets: family, type and protocol, portis, dst_group, dst_portid, state. Groups can be received as an optional parameter NETLINK_DIAG_GROUPS. Netlink sockets cab be filtered by protocols. The socket inode number and cookie is reserved for future per-socket info retrieving. The per-protocol filtering is also reserved for future by requiring the sdiag_protocol to be zero. The file /proc/net/netlink doesn't provide enough information for dumping netlink sockets. It doesn't provide dst_group, dst_portid, groups above 32. v2: fix NETLINK_DIAG_MAX. Now it's equal to the last constant. Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Gao feng <gaofeng@cn.fujitsu.com> Cc: Thomas Graf <tgraf@suug.ch> Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-21net: fix *_DIAG_MAX constantsAndrey Vagin
Follow the common pattern and define *_DIAG_MAX like: [...] __XXX_DIAG_MAX, }; Because everyone is used to do: struct nlattr *attrs[XXX_DIAG_MAX+1]; nla_parse([...], XXX_DIAG_MAX, [...] Reported-by: Thomas Graf <tgraf@suug.ch> Cc: "David S. Miller" <davem@davemloft.net> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Eric Dumazet <edumazet@google.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-20Merge branch 'master' of ↵John W. Linville
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem
2013-03-20connector: Added coredumping event to the process connectorJesper Derehag
Process connector can now also detect coredumping events. Main aim of patch is get notified at start of coredumping, instead of having to wait for it to finish and then being notified through EXIT event. Could be used for instance by process-managers that want to get notified as soon as possible about process failures, and not necessarily beeing notified after coredump, which could be in the order of minutes depending on size of coredump, piping and so on. Signed-off-by: Jesper Derehag <jderehag@hotmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-20filter: add ANC_PAY_OFFSET instruction for loading payload start offsetDaniel Borkmann
It is very useful to do dynamic truncation of packets. In particular, we're interested to push the necessary header bytes to the user space and cut off user payload that should probably not be transferred for some reasons (e.g. privacy, speed, or others). With the ancillary extension PAY_OFFSET, we can load it into the accumulator, and return it. E.g. in bpfc syntax ... ld #poff ; { 0x20, 0, 0, 0xfffff034 }, ret a ; { 0x16, 0, 0, 0x00000000 }, ... as a filter will accomplish this without having to do a big hackery in a BPF filter itself. Follow-up JIT implementations are welcome. Thanks to Eric Dumazet for suggesting and discussing this during the Netfilter Workshop in Copenhagen. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Pull in the 'net' tree to get Daniel Borkmann's flow dissector infrastructure change. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-19packet: packet fanout rollover during socket overloadWillem de Bruijn
Changes: v3->v2: rebase (no other changes) passes selftest v2->v1: read f->num_members only once fix bug: test rollover mode + flag Minimize packet drop in a fanout group. If one socket is full, roll over packets to another from the group. Maintain flow affinity during normal load using an rxhash fanout policy, while dispersing unexpected traffic storms that hit a single cpu, such as spoofed-source DoS flows. Rollover breaks affinity for flows arriving at saturated sockets during those conditions. The patch adds a fanout policy ROLLOVER that rotates between sockets, filling each socket before moving to the next. It also adds a fanout flag ROLLOVER. If passed along with any other fanout policy, the primary policy is applied until the chosen socket is full. Then, rollover selects another socket, to delay packet drop until the entire system is saturated. Probing sockets is not free. Selecting the last used socket, as rollover does, is a greedy approach that maximizes chance of success, at the cost of extreme load imbalance. In practice, with sufficiently long queues to absorb bursts, sockets are drained in parallel and load balance looks uniform in `top`. To avoid contention, scales counters with number of sockets and accesses them lockfree. Values are bounds checked to ensure correctness. Tested using an application with 9 threads pinned to CPUs, one socket per thread and sufficient busywork per packet operation to limits each thread to handling 32 Kpps. When sent 500 Kpps single UDP stream packets, a FANOUT_CPU setup processes 32 Kpps in total without this patch, 270 Kpps with the patch. Tested with read() and with a packet ring (V1). Also, passes psock_fanout.c unit test added to selftests. Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-17tcp: Remove TCPCTChristoph Paasch
TCPCT uses option-number 253, reserved for experimental use and should not be used in production environments. Further, TCPCT does not fully implement RFC 6013. As a nice side-effect, removing TCPCT increases TCP's performance for very short flows: Doing an apache-benchmark with -c 100 -n 100000, sending HTTP-requests for files of 1KB size. before this patch: average (among 7 runs) of 20845.5 Requests/Second after: average (among 7 runs) of 21403.6 Requests/Second Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-17vxlan: generalize forwarding tablesDavid Stevens
This patch generalizes VXLAN forwarding table entries allowing an administrator to: 1) specify multiple destinations for a given MAC 2) specify alternate vni's in the VXLAN header 3) specify alternate destination UDP ports 4) use multicast MAC addresses as fdb lookup keys 5) specify multicast destinations 6) specify the outgoing interface for forwarded packets The combination allows configuration of more complex topologies using VXLAN encapsulation. Changes since v1: rebase to 3.9.0-rc2 Signed-Off-By: David L Stevens <dlstevens@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-13Merge branch 'akpm' (fixes from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: - A bunch of fixes - Finish off the idr API conversions before someone starts to use the old interfaces again. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: idr: idr_alloc() shouldn't trigger lowmem warning when preloaded UAPI: fix endianness conditionals in M32R's asm/stat.h UAPI: fix endianness conditionals in linux/raid/md_p.h UAPI: fix endianness conditionals in linux/acct.h UAPI: fix endianness conditionals in linux/aio_abi.h decompressors: fix typo "POWERPC" mm/fremap.c: fix oops on error path idr: deprecate idr_pre_get() and idr_get_new[_above]() tidspbridge: convert to idr_alloc() zcache: convert to idr_alloc() mlx4: remove leftover idr_pre_get() call workqueue: convert to idr_alloc() nfsd: convert to idr_alloc() nfsd: remove unused get_new_stid() kernel/signal.c: use __ARCH_HAS_SA_RESTORER instead of SA_RESTORER signal: always clear sa_restorer on execve mm: remove_memory(): fix end_pfn setting include/linux/res_counter.h needs errno.h
2013-03-13UAPI: fix endianness conditionals in linux/raid/md_p.hDavid Howells
In the UAPI header files, __BIG_ENDIAN and __LITTLE_ENDIAN must be compared against __BYTE_ORDER in preprocessor conditionals where these are exposed to userspace (that is they're not inside __KERNEL__ conditionals). However, in the main kernel the norm is to check for "defined(__XXX_ENDIAN)" rather than comparing against __BYTE_ORDER and this has incorrectly leaked into the userspace headers. The definition of struct mdp_superblock_s in linux/raid/md_p.h is wrong in this way. Note that userspace will likely interpret the ordering of the fields incorrectly as the big-endian variant on a little-endian machines - depending on header inclusion order. [!!!] NOTE [!!!] This patch may adversely change the userspace API. It might be better to fix the ordering of events_hi, events_lo, cp_events_hi and cp_events_lo in struct mdp_superblock_s / typedef mdp_super_t. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-13UAPI: fix endianness conditionals in linux/acct.hDavid Howells
In the UAPI header files, __BIG_ENDIAN and __LITTLE_ENDIAN must be compared against __BYTE_ORDER in preprocessor conditionals where these are exposed to userspace (that is they're not inside __KERNEL__ conditionals). However, in the main kernel the norm is to check for "defined(__XXX_ENDIAN)" rather than comparing against __BYTE_ORDER and this has incorrectly leaked into the userspace headers. The definition of ACCT_BYTEORDER in linux/acct.h is wrong in this way. Note that userspace will likely interpret this incorrectly as the big-endian variant on little-endian machines - depending on header inclusion order. [!!!] NOTE [!!!] This patch may adversely change the userspace API. It might be better to fix the value of ACCT_BYTEORDER. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-13UAPI: fix endianness conditionals in linux/aio_abi.hDavid Howells
In the UAPI header files, __BIG_ENDIAN and __LITTLE_ENDIAN must be compared against __BYTE_ORDER in preprocessor conditionals where these are exposed to userspace (that is they're not inside __KERNEL__ conditionals). However, in the main kernel the norm is to check for "defined(__XXX_ENDIAN)" rather than comparing against __BYTE_ORDER and this has incorrectly leaked into the userspace headers. The definition of PADDED() in linux/aio_abi.h is wrong in this way. Note that userspace will likely interpret this and thus the order of fields in struct iocb incorrectly as the little-endian variant on big-endian machines - depending on header inclusion order. [!!!] NOTE [!!!] This patch may adversely change the userspace API. It might be better to fix the ordering of aio_key and aio_reserved1 in struct iocb. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-12tty/serial: Add support for Altera serial portLey Foon Tan
Add support for Altera 8250/16550 compatible serial port. Signed-off-by: Ley Foon Tan <lftan@altera.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-12tcp: TLP loss detection.Nandita Dukkipati
This is the second of the TLP patch series; it augments the basic TLP algorithm with a loss detection scheme. This patch implements a mechanism for loss detection when a Tail loss probe retransmission plugs a hole thereby masking packet loss from the sender. The loss detection algorithm relies on counting TLP dupacks as outlined in Sec. 3 of: http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01 The basic idea is: Sender keeps track of TLP "episode" upon retransmission of a TLP packet. An episode ends when the sender receives an ACK above the SND.NXT (tracked by tlp_high_seq) at the time of the episode. We want to make sure that before the episode ends the sender receives a "TLP dupack", indicating that the TLP retransmission was unnecessary, so there was no loss/hole that needed plugging. If the sender gets no TLP dupack before the end of the episode, then it reduces ssthresh and the congestion window, because the TLP packet arriving at the receiver probably plugged a hole. Signed-off-by: Nandita Dukkipati <nanditad@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-12tcp: Tail loss probe (TLP)Nandita Dukkipati
This patch series implement the Tail loss probe (TLP) algorithm described in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The first patch implements the basic algorithm. TLP's goal is to reduce tail latency of short transactions. It achieves this by converting retransmission timeouts (RTOs) occuring due to tail losses (losses at end of transactions) into fast recovery. TLP transmits one packet in two round-trips when a connection is in Open state and isn't receiving any ACKs. The transmitted packet, aka loss probe, can be either new or a retransmission. When there is tail loss, the ACK from a loss probe triggers FACK/early-retransmit based fast recovery, thus avoiding a costly RTO. In the absence of loss, there is no change in the connection state. PTO stands for probe timeout. It is a timer event indicating that an ACK is overdue and triggers a loss probe packet. The PTO value is set to max(2*SRTT, 10ms) and is adjusted to account for delayed ACK timer when there is only one oustanding packet. TLP Algorithm On transmission of new data in Open state: -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms). -> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms) -> PTO = min(PTO, RTO) Conditions for scheduling PTO: -> Connection is in Open state. -> Connection is either cwnd limited or no new data to send. -> Number of probes per tail loss episode is limited to one. -> Connection is SACK enabled. When PTO fires: new_segment_exists: -> transmit new segment. -> packets_out++. cwnd remains same. no_new_packet: -> retransmit the last segment. Its ACK triggers FACK or early retransmit based recovery. ACK path: -> rearm RTO at start of ACK processing. -> reschedule PTO if need be. In addition, the patch includes a small variation to the Early Retransmit (ER) algorithm, such that ER and TLP together can in principle recover any N-degree of tail loss through fast recovery. TLP is controlled by the same sysctl as ER, tcp_early_retrans sysctl. tcp_early_retrans==0; disables TLP and ER. ==1; enables RFC5827 ER. ==2; delayed ER. ==3; TLP and delayed ER. [DEFAULT] ==4; TLP only. The TLP patch series have been extensively tested on Google Web servers. It is most effective for short Web trasactions, where it reduced RTOs by 15% and improved HTTP response time (average by 6%, 99th percentile by 10%). The transmitted probes account for <0.5% of the overall transmissions. Signed-off-by: Nandita Dukkipati <nanditad@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-10NFC: llcp: Service Name Lookup netlink interfaceThierry Escande
This adds a netlink interface for service name lookup support. Multiple URIs can be passed nested into the NFC_ATTR_LLC_SDP attribute using the NFC_CMD_LLC_SDREQ netlink command. When the SNL reply is received, a NFC_EVENT_LLC_SDRES event is sent to the user space. URI and SAP tuples are passed back, nested into NFC_ATTR_LLC_SDP attribute. Signed-off-by: Thierry Escande <thierry.escande@linux.intel.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2013-03-10NFC: llcp: Implement socket optionsSamuel Ortiz
Some LLCP services (e.g. the validation ones) require some control over the LLCP link parameters like the receive window (RW) or the MIU extension (MIUX). This can only be done through socket options. Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2013-03-08VSOCK: Split vm_sockets.h into kernel/uapiAndy King
Split the vSockets header into kernel and UAPI parts. The former gets the bits that used to be in __KERNEL__ guards, while the latter gets everything that is user-visible. Tested by compiling vsock (+transport) and a simple user-mode vSockets application. Reported-by: David Howells <dhowells@redhat.com> Acked-by: Dmitry Torokhov <dtor@vmware.com> Signed-off-by: Andy King <acking@vmware.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-06htb: add HTB_DIRECT_QLEN attributeEric Dumazet
HTB uses an internal pfifo queue, which limit is not reported to userland tools (tc), and value inherited from device tx_queue_len at setup time. Introduce TCA_HTB_DIRECT_QLEN attribute to allow finer control. Remove two obsolete pr_err() calls as well. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>