The kernel's rpcbind client creates and deletes an rpc_clnt and its
underlying transport socket for every upcall to the local rpcbind
daemon.
When starting a typical NFS server on IPv4 and IPv6, the NFS service
itself does three upcalls (one per version) times two upcalls (one
per transport) times two upcalls (one per address family), making 12,
plus another one for the initial call to unregister previous NFS
services. Starting the NLM service adds an additional 13 upcalls,
for similar reasons.
(Currently the NFS service doesn't start IPv6 listeners, but it will
soon enough).
Instead, let's create an rpc_clnt for rpcbind upcalls during the
first local rpcbind query, and cache it. This saves the overhead of
creating and destroying an rpc_clnt and a socket for every upcall.
The new logic also prevents the kernel from attempting an RPCB_SET or
RPCB_UNSET if it knows from the start that the local portmapper does
not support rpcbind protocol version 4. This will cut down on the
number of rpcbind upcalls in legacy environments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Clean up: At one point, rpcb_local_clnt() handled IPv6 loopback
addresses too, but it doesn't any more; only IPv4 loopback is used
now. Get rid of the @addr and @addrlen arguments to
rpcb_local_clnt().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The kernel sometimes makes RPC calls to services that aren't running.
Because the kernel's RPC client always assumes the hard retry semantic
when reconnecting a connection-oriented RPC transport, the underlying
reconnect logic takes a long while to time out, even though the remote
may have responded immediately with ECONNREFUSED.
In certain cases, like upcalls to our local rpcbind daemon, or for NFS
mount requests, we'd like the kernel to fail immediately if the remote
service isn't reachable. This allows another transport to be tried
immediately, or the pending request can be abandoned quickly.
Introduce a per-request flag which controls how call_transmit_status()
behaves when request transmission fails because the server cannot be
reached.
We don't want soft connection semantics to apply to other errors. The
default case of the switch statement in call_transmit_status() no
longer falls through; the fall through code is copied to the default
case, and a "break;" is added.
The transport's connection re-establishment timeout is also ignored for
such requests. We want the request to fail immediately, so the
reconnect delay is skipped. Additionally, we don't want a connect
failure here to further increase the reconnect timeout value, since
this request will not be retried.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The success case, where task->tk_status == 0, is by far the most
frequent case in call_transmit_status().
The default: arm of the switch statement in call_transmit_status()
handles the 0 case. default: was moved close to the top of the switch
statement in call_transmit_status() under the theory that the compiler
places object code for the earliest arms of a switch statement first,
making the CPU do less work.
The default: arm of a switch statement, however, is executed only
after all the other cases have been checked. Even if the compiler
rearranges the object code, the default: arm is the "last resort",
meaning all of the other cases have been explicitly exhausted. That
makes the current arrangement about as inefficient as it gets for the
common case.
To fix this, add an explicit check for zero before the switch
statement. That forces the compiler to do the zero check first, no
matter what optimizations it might try to do to the switch statement.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When the "rsize=" or "wsize=" mount options are not specified,
text-based mounts have slightly different behavior than legacy binary
mounts. Text-based mounts use the smaller of the server's maximum
and the client's maximum, but binary mounts use the smaller of the
server's _preferred_ size and the client's maximum.
This difference is actually pretty subtle. Most servers advertise
the same value as their maximum and their preferred transfer size, so
the end result is the same in most cases.
The reason for this difference is that for text-based mounts, if
r/wsize are not specified, they are set to the largest value supported
by the client. For legacy mounts, the values are set to zero if these
options are not specified.
nfs_server_set_fsinfo() can negotiate the transfer size defaults
correctly in any case. There's no need to specify any particular
value as default in the text-based option parsing logic.
Note that nfs4 doesn't use nfs_server_set_fsinfo(), but the mount.nfs4
command does set rsize and wsize to 0 if the user didn't specify these
options. So, make the same change for text-based NFSv4 mounts.
Thanks to James Pearson <james-p@moving-picture.com> for reporting and
diagnosing the problem.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Recent changes to snprintf() introduced the %pI6c formatter, which can
display an IPv6 address with standard shorthanding. Use this new
formatter when displaying IPv6 server addresses in /proc/mounts.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Recent changes to snprintf() introduced the %pI6c formatter, which can
display an IPv6 address with standard shorthanding. Using a
shorthanded address can save us a few bytes of memory for each stored
presentation address, or a few bytes on the wire when sending these in
a universal address.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
reorder nfs4_sequence_args to remove 8 bytes of padding on 64 bit
builds.
The size of this structure drops to 24 bytes from 32 and reduces the
text size of nfs.ko.
On my x86_64 size reports
text data bss
2.6.32-rc5 200996 8512 432 209940 33414 nfs.ko
+patch 200884 8512 432 209828 333a4 nfs.ko
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Solaris uses netids as values for the proto= option, so that when
someone specifies "tcp6" they get traffic over TCP + IPv6. Until
recently, this has never really been an issue for Linux since it didn't
support NFS over IPv6. The netid and the protocol name were generally
always the same (modulo any strange configuration in /etc/netconfig).
The solaris manpage documents their proto= option as:
proto= _netid_ | rdma
This patch is intended to bring Linux closer to how the Solaris proto=
option works, by declaring a static netid mapping in the kernel and
converting the proto= and mountproto= options to follow it and display
the proper values in /proc/mounts.
Much of this functionality will need to be provided by a userspace
mount.nfs patch. Chuck Lever has a patch to change mount.nfs in
the same way. In principle, we could do *all* of this in userspace but
that would mean that the options in /proc/mounts may not match the
options used by userspace.
The alternative to the static mapping here is to add a mechanism to
upcall to userspace for netid's. I'm not opposed to that option, but
it'll probably mean more overhead (and quite a bit more code). Rather
than shoot for that at first, I figured it was probably better to
start simply.
Comments welcome.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The nfs4_state_manager should not be looking at the error values when
deciding whether or not to loop round in order to handle a higher priority
state recovery task. It should rather be looking at the clp->cl_state.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If our lease expires, and the server reboots while we're recovering, we
need to be able to wait until the grace period is over.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs4_recovery_handle_error() will correctly handle errors such as
NFS4ERR_CB_PATH_DOWN, however because they are still passed back to the
main loop in nfs4_state_manager(), they can cause the latter to exit
prematurely.
Fix this by letting nfs4_recovery_handle_error() change the error value in
cases where there is no action required by the caller.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
It has been dead for the last three years (== since the initial driver
merge) and probability that it will ever get fixed is quite low.
Since there is no reason to keep this dead code around any longer just
remove it (it can still be retrieved from the git history if necessary).
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
In practice, we need to ensure that we call nfs4_state_end_reclaim_reboot
in 2 cases:
- If we lose the lease while we were reclaiming state
OR
- After we're done with reboot recovery
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Do not clear MWDMA timings for device on the other port when
programming slave device.
This change should be safe as this is how we have been doing
things in IDE slc90e66 host driver for years.
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
* do not clear PIO timings for master when programming slave
* do not clear PIO timings for device on the other port when
programming slave device
Both changes should be safe as this is how we have been doing
things in IDE slc90e66 host driver for years.
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Fix erroneous check for ap->udma_mask in do_pata_set_dmamode()
resulting in controller not being programmed properly for MWDMA.
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
When a firmware command fails, only the failure codes are printed.
It is difficult to co-relate this to the actual command that has failed.
These changes will now print the command code that has failed.
Signed-off-by: Sarveshwar Bandi <sarveshwarb@serverengines.com>
Signed-off-by: Ajit Khaparde <ajitk@serverengines.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function walks the whole hashtable so there is no point in
passing it a network namespace. Instead I purge all timewait
sockets from dead network namespaces that I find. If the namespace
is one of the once I am trying to purge I am guaranteed no new timewait
sockets can be formed so this will get them all. If the namespace
is one I am not acting for it might form a few more but I will
call inet_twsk_purge again and shortly to get rid of them. In
any even if the network namespace is dead timewait sockets are
useless.
Move the calls of inet_twsk_purge into batch_exit routines so
that if I am killing a bunch of namespaces at once I will just
call inet_twsk_purge once and save a lot of redundant unnecessary
work.
My simple 4k network namespace exit test the cleanup time dropped from
roughly 8.2s to 1.6s. While the time spent running inet_twsk_purge fell
to about 2ms. 1ms for ipv4 and 1ms for ipv6.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While we are looking up entries to free there is no reason to take
the lock in inet_twsk_purge. We have to drop locks and restart
occassionally anyway so adding a few more in case we get on the
wrong list because of a timewait move is no big deal. At the
same time not taking the lock for long periods of time is much
more polite to the rest of the users of the hash table.
In my test configuration of killing 4k network namespaces
this change causes 4k back to back runs of inet_twsk_purge on an
empty hash table to go from roughly 20.7s to 3.3s, and the total
time to destroy 4k network namespaces goes from roughly 44s to
3.3s.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor the code so fib_rules_register always takes a template instead
of the actual fib_rules_ops structure that will be used. This is
required for network namespace support so 2 out of the 3 callers already
do this, it allows the error handling to be made common, and it allows
fib_rules_unregister to free the template for hte caller.
Modify fib_rules_unregister to use call_rcu instead of syncrhonize_rcu
to allw multiple namespaces to be cleaned up in the same rcu grace
period.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows namespace exit methods to batch work that comes requires an
rcu barrier using call_rcu without having to treat the
unregister_pernet_operations cases specially.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
xfrm.nlsk is provided by the xfrm_user module and is access via rcu from
other parts of the xfrm code. Add xfrm.nlsk_stash a copy of xfrm.nlsk that
will never be set to NULL. This allows the synchronize_net and
netlink_kernel_release to be deferred until a whole batch of xfrm.nlsk sockets
have been set to NULL.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move network device exit batching from a special case in
net_namespace.c to using common mechanisms in dev.c
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Add exit_list to struct net to support building lists of network
namespaces to cleanup.
- Add exit_batch to pernet_operations to allow running operations only
once during a network namespace exit. Instead of once per network
namespace.
- Factor opt ops_exit_list and ops_exit_free so the logic with cleanup
up a network namespace does not need to be duplicated.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 8ec1e0ebe26087bfc5c0394ada5feb5758014fc8
Author: Patrick McHardy <kaber@trash.net>
Date: Thu Dec 3 12:16:35 2009 +0100
ipv4: add sysctl to accept packets with local source addresses
Change fib_validate_source() to accept packets with a local source address when
the "accept_local" sysctl is set for the incoming inet device. Combined with the
previous patches, this allows to communicate between multiple local interfaces
over the wire.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit d124356ce314fff22a047ea334379d5105b2d834
Author: Patrick McHardy <kaber@trash.net>
Date: Thu Dec 3 12:16:35 2009 +0100
net: fib_rules: allow to delete local rule
Allow to delete the local rule and recreate it with a higher priority. This
can be used to force packets with a local destination out on the wire instead
of routing them to loopback. Additionally this patch allows to recreate rules
with a priority of 0.
Combined with the previous patch to allow oif classification, a socket can
be bound to the desired interface and packets routed to the wire like this:
# move local rule to lower priority
ip rule add pref 1000 lookup local
ip rule del pref 0
# route packets of sockets bound to eth0 to the wire independant
# of the destination address
ip rule add pref 100 oif eth0 lookup 100
ip route add default dev eth0 table 100
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 68144d350f4f6c348659c825cde6a82b34c27a91
Author: Patrick McHardy <kaber@trash.net>
Date: Thu Dec 3 12:05:25 2009 +0100
net: fib_rules: add oif classification
Support routing table lookup based on the flow's oif. This is useful to
classify packets originating from sockets bound to interfaces differently.
The route cache already includes the oif and needs no changes.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 229e77eec406ad68662f18e49fda8b5d366768c5
Author: Patrick McHardy <kaber@trash.net>
Date: Thu Dec 3 12:05:23 2009 +0100
net: fib_rules: rename ifindex/ifname/FRA_IFNAME to iifindex/iifname/FRA_IIFNAME
The next patch will add oif classification, rename interface related members
and attributes to reflect that they're used for iif classification.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit b8952893d5d86f69c4e499d191b98c6658f64b0f
Author: Patrick McHardy <kaber@trash.net>
Date: Thu Dec 3 12:05:22 2009 +0100
net: fib_rules: rearrange struct fib_rule
The ifname member is only used to resolve interface names and is not needed
during rule lookups. The target and ctarget members however are used during
rule lookups and are currently located in a second cacheline.
Move ifname further to the end to make sure both target and ctarget are
located in the same cacheline as other members used during rule lookups.
The layout on 64 bit changes from:
struct fib_rule {
...
u32 table; /* 56 4 */
u8 action; /* 60 1 */
/* XXX 3 bytes hole, try to pack */
/* --- cacheline 1 boundary (64 bytes) --- */
u32 target; /* 64 4 */
/* XXX 4 bytes hole, try to pack */
struct fib_rule * ctarget; /* 72 8 */
struct rcu_head rcu; /* 80 16 */
struct net * fr_net; /* 96 8 */
};
to:
struct fib_rule {
...
u32 table; /* 40 4 */
u8 action; /* 44 1 */
/* XXX 3 bytes hole, try to pack */
u32 target; /* 48 4 */
/* XXX 4 bytes hole, try to pack */
struct fib_rule * ctarget; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
char ifname[16]; /* 64 16 */
struct rcu_head rcu; /* 80 16 */
struct net * fr_net; /* 96 8 */
};
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
If running in non-secure mode accessing
some registers of l2x0 will fault. So
check if l2x0 is already enabled, if so
do not access those secure registers.
Signed-off-by: srinidhi kasagar <srinidhi.kasagar@stericsson.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
removed old macro definition for io access, using
the generic macros defined in asm/io.h
Signed-off-by: Leo Hao Chen <leochen@broadcom.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
ahci can drive the Promise PDC42819, but obviously it can only use SATA
disks connected to this controller. The controller can actually support
SAS disks as well, but we only know how to use it in it's AHCI mode.
Add a message to let users know that because ahci is driving their chip
they can only use the SATA disks connected to this controller.
Signed-off-by: Mark Nelson <mdnelson8@gmail.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Some places were using PCI_CLASS_REVISION instead of PCI_REVISION_ID, so
they weren't converted by commit 44c10138fd
(PCI: Change all drivers to use pci_device->revision).
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
The comment in the driver actually describes HPT37x's timing register layout,
which is different from HPT36x. Fix it and reformat the comment, while at it.
Bump the driver version, accounting for several patches that forgot to do it.
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
These drivers inherited from the older 'hpt366' IDE driver the buggy timing
register masks in their set_piomode() metods. As a result, too low command
cycle active time is programmed for slow PIO modes. Quite fortunately, it's
later "fixed up" by the set_dmamode() methods which also "helpfully" reprogram
the command timings, usually to PIO mode 4; unfortunately, setting an UltraDMA
mode #N also reprograms already set PIO data timings, usually to MWDMA mode #
max(N, 2) timings...
However, the drivers added some breakage of their own too: the bit that they
set/clear to control the FIFO is sometimes wrong -- it's actually the MSB of
the command cycle setup time; also, setting it in DMA mode is wrong as this
bit is only for PIO actually and clearing it for PIO modes is not needed as
no mode in any timing table has it set...
Fix all this, inverting the masks while at it, like in the 'hpt366' and
'pata_hpt366' drivers; bump the drivers' versions, accounting for recent
patches that forgot to do it...
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Cc: stable@kernel.org
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
We were never able to get docs for this out of Toshiba for years. Dave
Barnes produced a NetBSD driver however and from that we can fill in the
needed tables.
As we correct the PCI identifiers a bit also update the old ide generic driver
at the same time so it stays compiling.
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
I have observed cases where the implicit stop_machine_destroy() done by
stop_machine() hangs while destroying the workqueues, specifically in
kthread_stop(). This seems to be because timer ticks are not restarted
until after stop_machine() returns.
Fortunately stop_machine provides a facility to pre-create/post-destroy
the workqueues so use this to ensure that workqueues are only destroyed
after everything is really up and running again.
I only actually observed this failure with 2.6.30. It seems that newer
kernels are somehow more robust against doing kthread_stop() without timer
interrupts (I tried some backports of some likely looking candidates but
did not track down the commit which added this robustness). However this
change seems like a reasonable belt&braces thing to do.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stable Kernel <stable@kernel.org>
The existing error handling has a few issues:
- If freeze_processes() fails it exits with shutting_down = SHUTDOWN_SUSPEND.
- If dpm_suspend_noirq() fails it exits without resuming xenbus.
- If stop_machine() fails it exits without resuming xenbus or calling
dpm_resume_end().
- xs_suspend()/xs_resume() and dpm_suspend_noirq()/dpm_resume_noirq() were not
nested in the obvious way.
Fix by ensuring each failure case goto's the correct label. Treat a failure of
stop_machine() as a cancelled suspend in order to follow the correct resume
path.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stable Kernel <stable@kernel.org>