Commit Graph

19915 Commits

Author SHA1 Message Date
Paul E. McKenney
8d7dc9283f rcu: Control grace-period delays directly from value
In a misguided attempt to avoid an #ifdef, the use of the
gp_init_delay module parameter was conditioned on the corresponding
RCU_TORTURE_TEST_SLOW_INIT Kconfig variable, using IS_ENABLED() at
the point of use in the code.  This meant that the compiler always saw
the delay, which meant that RCU_TORTURE_TEST_SLOW_INIT_DELAY had to be
unconditionally defined.  This in turn caused "make oldconfig" to ask
pointless questions about the value of RCU_TORTURE_TEST_SLOW_INIT_DELAY
in cases where it was not even used.

This commit avoids these pointless questions by defining gp_init_delay
under #ifdef.  In one branch, gp_init_delay is initialized to
RCU_TORTURE_TEST_SLOW_INIT_DELAY and is also a module parameter (thus
allowing boot-time modification), and in the other branch gp_init_delay
is a const variable initialized by default to zero.

This approach also simplifies the code at the delay point by eliminating
the IS_ENABLED().  Because gp_init_delay is constant zero in the no-delay
case intended for production use, the "gp_init_delay > 0" check causes
the delay to become dead code, as desired in this case.  In addition,
this commit replaces magic constant "10" with the preprocessor variable
PER_RCU_NODE_PERIOD, which controls the number of grace periods that
are allowed to elapse at full speed before a delay is inserted.
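
A hedged sketch of the resulting pattern; the delay-point check is shown out of context, and the rsp->gpnum/rcu_num_nodes usage is illustrative rather than the exact diff:

    #define PER_RCU_NODE_PERIOD 10  /* Grace periods between delays. */

    #ifdef CONFIG_RCU_TORTURE_TEST_SLOW_INIT
    static int gp_init_delay = CONFIG_RCU_TORTURE_TEST_SLOW_INIT_DELAY;
    module_param(gp_init_delay, int, 0644);  /* Boot-time modifiable. */
    #else
    static const int gp_init_delay;  /* Constant zero in production. */
    #endif

    /* At the delay point: dead code when gp_init_delay is const zero. */
    if (gp_init_delay > 0 &&
        !(rsp->gpnum % (rcu_num_nodes * PER_RCU_NODE_PERIOD)))
        schedule_timeout_uninterruptible(gp_init_delay);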

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-04-14 19:33:59 -07:00
Paul E. McKenney
00df35f991 cpu: Defer smpboot kthread unparking until CPU known to scheduler
Currently, smpboot_unpark_threads() is invoked before the incoming CPU
has been added to the scheduler's runqueue structures.  This can
cause the unparked kthread to run on the wrong CPU, since the
correct CPU isn't fully set up yet.

That causes a sporadic, hard to debug boot crash triggering on some
systems, reported by Borislav Petkov, and bisected down to:

  2a442c9c64 ("x86: Use common outgoing-CPU-notification code")

This patch places smpboot_unpark_threads() in a CPU hotplug
notifier with priority set so that these kthreads are unparked just after
the CPU has been added to the runqueues.
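
A minimal sketch of such a notifier, assuming a priority constant (here CPU_PRI_SMPBOOT) that orders it just after the scheduler's own CPU_ONLINE handling:

    static int smpboot_thread_call(struct notifier_block *nfb,
                                   unsigned long action, void *hcpu)
    {
        unsigned int cpu = (unsigned long)hcpu;

        switch (action & ~CPU_TASKS_FROZEN) {
        case CPU_ONLINE:
            /* The CPU is now on the runqueues; safe to unpark. */
            smpboot_unpark_threads(cpu);
            break;
        default:
            break;
        }
        return NOTIFY_OK;
    }

    static struct notifier_block smpboot_thread_notifier = {
        .notifier_call = smpboot_thread_call,
        .priority = CPU_PRI_SMPBOOT,
    };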

Reported-and-tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-13 08:25:16 +02:00
Ingo Molnar
4bfe186dbe Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

  - Documentation updates.

  - Changes permitting use of call_rcu() and friends very early in
    boot, for example, before rcu_init() is invoked.

  - Miscellaneous fixes.

  - Add in-kernel API to enable and disable expediting of normal RCU
    grace periods.

  - Improve RCU's handling of (hotplug-) outgoing CPUs.

    Note: ARM support is lagging a bit here, and these improved
    diagnostics might generate (harmless) splats.

  - NO_HZ_FULL_SYSIDLE fixes.

  - Tiny RCU updates to make it more tiny.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 10:04:06 +01:00
Mel Gorman
074c238177 mm: numa: slow PTE scan rate if migration failures occur
Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226

  Across the board the 4.0-rc1 numbers are much slower, and the degradation
  is far worse when using the large memory footprint configs. Perf points
  straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:

   -   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
      - default_send_IPI_mask_sequence_phys
         - 99.99% physflat_send_IPI_mask
            - 99.37% native_send_call_func_ipi
                 smp_call_function_many
               - native_flush_tlb_others
                  - 99.85% flush_tlb_page
                       ptep_clear_flush
                       try_to_unmap_one
                       rmap_walk
                       try_to_unmap
                       migrate_pages
                       migrate_misplaced_page
                     - handle_mm_fault
                        - 99.73% __do_page_fault
                             trace_do_page_fault
                             do_async_page_fault
                           + async_page_fault
              0.63% native_send_call_func_single_ipi
                 generic_exec_single
                 smp_call_function_single

This is showing excessive migration activity even though excessive
migrations are meant to get throttled.  Normally, the scan rate is tuned
on a per-task basis depending on the locality of faults.  However, if
migrations fail for any reason then the PTE scanner may scan faster if
the faults continue to be remote.  This means there is higher system CPU
overhead and fault trapping at exactly the time we know that migrations
cannot happen.  This patch tracks when migration failures occur and
slows the PTE scanner.
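
A hedged sketch of the idea, assuming a TNF_MIGRATE_FAIL fault flag and a spare slot in numa_faults_locality[] used as the failure bucket (details differ from the exact diff):

    /* In task_numa_fault(): count pages whose migration failed. */
    if (flags & TNF_MIGRATE_FAIL)
        p->numa_faults_locality[2] += pages;

    /* In update_task_scan_period(): back off rather than speed up
     * when recent migrations have failed. */
    if (local + shared == 0 || p->numa_faults_locality[2]) {
        p->numa_scan_period = min(p->numa_scan_period_max,
                                  p->numa_scan_period << 1);
        return;
    }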

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Dave Chinner <david@fromorbit.com>
Tested-by: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-25 16:20:31 -07:00
Paul E. McKenney
42528795ac Merge branches 'doc.2015.02.26a', 'earlycb.2015.03.03a', 'fixes.2015.03.03a', 'gpexp.2015.02.26a', 'hotplug.2015.03.20a', 'sysidle.2015.02.26b' and 'tiny.2015.02.26a' into HEAD
doc.2015.02.26a:  Documentation changes
earlycb.2015.03.03a:  Permit early-boot RCU callbacks
fixes.2015.03.03a:  Miscellaneous fixes
gpexp.2015.02.26a:  In-kernel expediting of normal grace periods
hotplug.2015.03.20a:  CPU hotplug fixes
sysidle.2015.02.26b:  NO_HZ_FULL_SYSIDLE fixes
tiny.2015.02.26a:  TINY_RCU fixes
2015-03-20 08:31:01 -07:00
Paul E. McKenney
654e953340 rcu: Associate quiescent-state reports with grace period
As noted in earlier commit logs, CPU hotplug operations running
concurrently with grace-period initialization can result in a given
leaf rcu_node structure having all CPUs offline and no blocked readers,
but with this rcu_node structure nevertheless blocking the current
grace period.  Therefore, the quiescent-state forcing code now checks
for this situation and repairs it.

Unfortunately, this checking can result in false positives, for example,
when the last task has just removed itself from this leaf rcu_node
structure, but has not yet started clearing the ->qsmask bits further
up the structure.  This means that the grace-period kthread (which
forces quiescent states) and some other task might be attempting to
concurrently clear these ->qsmask bits.  This is usually not a problem:
One of these tasks will be the first to acquire the upper-level rcu_node
structure's lock and will therefore clear the bit, and the other task,
seeing the bit already cleared, will stop trying to clear bits.

Sadly, this means that the following unusual sequence of events -can-
result in a problem:

1.	The grace-period kthread wins, and clears the ->qsmask bits.

2.	This is the last thing blocking the current grace period, so
	that the grace-period kthread clears ->qsmask bits all the way
	to the root and finds that the root ->qsmask field is now zero.

3.	Another grace period is required, so that the grace period kthread
	initializes it, including setting all the needed qsmask bits.

4.	The leaf rcu_node structure (the one that started this whole
	mess) is blocking this new grace period, either because it
	has at least one online CPU or because there is at least one
	task that had blocked within an RCU read-side critical section
	while running on one of this leaf rcu_node structure's CPUs.
	(And yes, that CPU might well have gone offline before the
	grace period in step (3) above started, which can mean that
	there is a task on the leaf rcu_node structure's ->blkd_tasks
	list, but ->qsmask equal to zero.)

5.	The other kthread didn't get around to trying to clear the upper
	level ->qsmask bits until all the above had happened.  This means
	that it now sees bits set in the upper-level ->qsmask field, so it
	proceeds to clear them.  Too bad that it is doing so on behalf of
	a quiescent state that does not apply to the current grace period!

This sequence of events can result in the new grace period being too
short.  It can also result in the new grace period ending before the
leaf rcu_node structure's ->qsmask bits have been cleared, which will
result in splats during initialization of the next grace period.  In
addition, it can result in tasks blocking the new grace period still
being queued at the start of the next grace period, which will result
in other splats.  Sasha's testing turned up another of these splats,
as did rcutorture testing.  (And yes, rcutorture is being adjusted to
make these splats show up more quickly.  Which probably is having the
undesirable side effect of making other problems show up less quickly.
Can't have everything!)
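
The shape of the fix, as a hedged sketch: quiescent-state reports now carry the grace-period number they apply to, and stale reports are dropped (parameter name illustrative):

    static void rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
                                  struct rcu_node *rnp, unsigned long gps,
                                  unsigned long flags)
    {
        if (rnp->gpnum != gps) {
            /* Report is from an earlier grace period; ignore it. */
            raw_spin_unlock_irqrestore(&rnp->lock, flags);
            return;
        }
        ...
    }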

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> # 4.0.x
Tested-by: Sasha Levin <sasha.levin@oracle.com>
2015-03-20 08:28:25 -07:00
Paul E. McKenney
a77da14ce9 rcu: Yet another fix for preemption and CPU hotplug
As noted earlier, the following sequence of events can occur when
running PREEMPT_RCU and HOTPLUG_CPU on a system with a multi-level
rcu_node combining tree:

1.	A group of tasks block on CPUs corresponding to a given leaf
	rcu_node structure while within RCU read-side critical sections.
2.	All CPUs corresponding to that rcu_node structure go offline.
3.	The next grace period starts, but because there are still tasks
	blocked, the upper-level bits corresponding to this leaf rcu_node
	structure remain set.
4.	All the tasks exit their RCU read-side critical sections and
	remove themselves from the leaf rcu_node structure's list,
	leaving it empty.
5.	But because there now is code to check for this condition at
	force-quiescent-state time, the upper bits are cleared and the
	grace period completes.

However, there is another complication that can occur following step 4 above:

4a.	The grace period starts, and the leaf rcu_node structure's
	gp_tasks pointer is set to NULL because there are no tasks
	blocked on this structure.
4b.	One of the CPUs corresponding to the leaf rcu_node structure
	comes back online.
4c.	An endless stream of tasks is preempted within RCU read-side
	critical sections on this CPU, such that the ->blkd_tasks
	list is always non-empty.

The grace period will never end.

This commit therefore makes the force-quiescent-state processing check only
for absence of tasks blocking the current grace period rather than absence
of tasks altogether.  This will cause a quiescent state to be reported if
the current leaf rcu_node structure is not blocking the current grace period
and its parent thinks that it is, regardless of how RCU managed to get
itself into this state.
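
A sketch of the adjusted check; rcu_preempt_blocked_readers_cgp() is the existing helper that reports whether tasks block the current grace period, and the exact call site is elided:

    /* Report a quiescent state only if no tasks block the *current* GP. */
    if (rnp->qsmask == 0 && !rcu_preempt_blocked_readers_cgp(rnp))
        rcu_report_unblock_qs_rnp(rsp, rnp, flags);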

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> # 4.0.x
Tested-by: Sasha Levin <sasha.levin@oracle.com>
2015-03-20 08:27:33 -07:00
Linus Torvalds
da11508eb0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching
Pull livepatching fix from Jiri Kosina:

 - fix for a potential race with module loading, from Petr Mladek.

   The race is very unlikely to be seen in the real world and has been found
   by code inspection, but should be fixed for 4.0 anyway.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
  livepatch: Fix subtle race with coming and going modules
2015-03-18 10:46:39 -07:00
Linus Torvalds
13326e5a62 Merge branches 'perf-urgent-for-linus' and 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf and timer fixes from Ingo Molnar:
 "Two small perf fixes:
   - kernel side context leak fix
   - tooling crash fix

  And two clocksource driver fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix context leak in put_event()
  perf annotate: Fix fallback to unparsed disassembler line

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clockevents: sun5i: Fix setup_irq init sequence
  clocksource: efm32: Fix a NULL pointer dereference
2015-03-17 13:22:29 -07:00
Petr Mladek
8cb2c2dc47 livepatch: Fix subtle race with coming and going modules
There is a notifier that handles live patches for coming and going modules.
It takes klp_mutex lock to avoid races with coming and going patches but
it does not keep the lock all the time. Therefore the following races are
possible:

  1. The notifier is called sometime in MODULE_STATE_COMING. The module
     is visible to find_module() in this state all the time. This means
     that a new patch can be registered and enabled even before the
     notifier is called, which might create a wrong order of stacked
     patches; see below for an example.

   2. A new patch could still see the module in the GOING state even after
      the notifier has been called. It will try to initialize the related
      object structures, but the module could disappear at any time,
      leaving the structures in an inconsistent state. It might even cause
      an invalid memory access.

This patch solves the problem by adding a boolean variable to struct module.
The value is true after the coming handler and before the going handler is
called. New patches are applied to a module only while the value is true,
and they ignore the module while it is false.
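
As a sketch (the field name here follows the mainline result, klp_alive):

    struct module {
        ...
        /* True between the COMING and GOING notifier calls. */
        bool klp_alive;
        ...
    };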

Note that we need to know the state of all modules on the system: the races
are related to new patches, so we cannot know in advance which modules will
get patched.

Also note that we cannot simply ignore going modules. The code from the
module could be called even in the GOING state until mod->exit() finishes.
If we start supporting patches with semantic changes between function
calls, we need to apply new patches to any still usable code.
See below for an example.

Finally note that the patch solves only the situation when a new patch is
registered. There are no such problems when the patch is being removed.
It does not matter who disables the patch first, whether the normal
disable_patch() or the module notifier. There is nothing to do
once the patch is disabled.

Alternative solutions:
======================

+ reject new patches when a patched module is coming or going; this is ugly

+ wait with adding new patch until the module leaves the COMING and GOING
  states; this might be dangerous and complicated; we would need to release
  klp_mutex in the middle of the patch registration to avoid a deadlock
  with the coming and going handlers; also we might need a waitqueue for
  each module which seems to be even bigger overhead than the boolean

+ stop modules from entering COMING and GOING states; wait until modules
  leave these states when they are already there; looks complicated; we would
  need to ignore the module that asked to stop the others to avoid a deadlock;
  also it is unclear what to do when two modules asked to stop others and
  both are in COMING state (situation when two new patches are applied)

+ always register/enable new patches and fix up the potential mess (registered
  patches order) in klp_module_init(); this is nasty and prone to regressions
  in the future development

+ add another MODULE_STATE where the kallsyms are visible but the module is not
  used yet; this looks too complex; the module states are checked on "many"
  locations

Example of patch stacking breakage:
===================================

The notifier could _not_ _simply_ ignore already initialized module objects.
For example, let's have three patches (P1, P2, P3) for functions a() and b()
where a() is from vmcore and b() is from a module M. Something like:

	a()	b()
P1	a1()	b1()
P2	a2()	b2()
P3	a3()	b3()

If you load the module M after all patches are registered and enabled,
the ftrace ops for functions a() and b() have the functions listed in
this order:

	ops_a->func_stack -> list(a3,a2,a1)
	ops_b->func_stack -> list(b3,b2,b1)

so the pointer to b3() is first and will be used.

Then you might have the following scenario. Let's start with the state where
patches P1 and P2 are registered and enabled but the module M is not loaded,
so the ftrace ops for b() do not exist. Then we get into the following race:

CPU0					CPU1

load_module(M)

  complete_formation()

  mod->state = MODULE_STATE_COMING;
  mutex_unlock(&module_mutex);

					klp_register_patch(P3);
					klp_enable_patch(P3);

					# STATE 1

  klp_module_notify(M)
    klp_module_notify_coming(P1);
    klp_module_notify_coming(P2);
    klp_module_notify_coming(P3);

					# STATE 2

The ftrace ops for a() and b() then look like this:

  STATE1:

	ops_a->func_stack -> list(a3,a2,a1);
	ops_b->func_stack -> list(b3);

  STATE2:

	ops_a->func_stack -> list(a3,a2,a1);
	ops_b->func_stack -> list(b2,b1,b3);

Therefore, b2() is used for the module but a3() is used for vmcore,
because each was the last one added to its list.

Example of the race with going modules:
=======================================

CPU0					CPU1

delete_module()  #SYSCALL

   try_stop_module()
     mod->state = MODULE_STATE_GOING;

   mutex_unlock(&module_mutex);

					klp_register_patch()
					klp_enable_patch()

					#save place to switch universe

					b()     # from module that is going
					  a()   # from core (patched)

   mod->exit();

Note that the function b() can be called until we call mod->exit().

If we do not apply the patch to b() because its module is in
MODULE_STATE_GOING, b() will call the patched a() with modified semantics,
and things might go wrong.

[jpoimboe@redhat.com: use one boolean instead of two]
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-03-17 10:31:54 +01:00
Leon Yu
d415a7f1c1 perf: Fix context leak in put_event()
Commit:

  a83fe28e2e ("perf: Fix put_event() ctx lock")

changed the locking logic in put_event() by replacing mutex_lock_nested()
with perf_event_ctx_lock_nested(), but didn't fix the subsequent
mutex_unlock() with a correct counterpart, perf_event_ctx_unlock().

Contexts are thus leaked as a result of the refcount incremented in
perf_event_ctx_lock_nested().

Signed-off-by: Leon Yu <chianglungyu@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Fixes: a83fe28e2e ("perf: Fix put_event() ctx lock")
Link: http://lkml.kernel.org/r/1424954613-5034-1-git-send-email-chianglungyu@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-13 10:02:18 +01:00
Andrey Ryabinin
a5af5aa8b6 kasan, module, vmalloc: rework shadow allocation for modules
The current approach to handling shadow memory for modules is broken.

Shadow memory may be freed only after the memory it shadows is no longer
used.  vfree() called from interrupt context could use the memory it is
freeing to store a 'struct llist_node' in it:

    void vfree(const void *addr)
    {
    ...
        if (unlikely(in_interrupt())) {
            struct vfree_deferred *p = this_cpu_ptr(&vfree_deferred);
            if (llist_add((struct llist_node *)addr, &p->list))
                    schedule_work(&p->wq);

This list node is later used in free_work(), which actually frees the
memory.  Currently, module_memfree() called in interrupt context will free
the shadow before freeing the module's memory, which could provoke a
kernel crash.

So shadow memory should be freed after module's memory.  However, such
deallocation order could race with kasan_module_alloc() in module_alloc().

Free the shadow right before releasing the vm area.  At this point the
vfree()'d memory is not used anymore, yet is not available for other
allocations.  A new VM_KASAN flag indicates that a vm area has dynamically
allocated shadow memory, so kasan frees the shadow only if it was
previously allocated.
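
A sketch of the resulting ordering in __vunmap(), under the VM_KASAN flag described above (simplified):

    static void __vunmap(const void *addr, int deallocate_pages)
    {
        struct vm_struct *area = remove_vm_area(addr);
        ...
        /* The vfree()'d memory is unused and not yet reallocatable,
         * so drop the shadow now, if it was ever allocated. */
        if (area->flags & VM_KASAN)
            kasan_free_shadow(area);
        ...
    }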

Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-12 18:46:08 -07:00
Paul E. McKenney
5c60d25fa1 rcu: Add diagnostics to grace-period cleanup
At grace-period initialization time, RCU checks that all quiescent
states were really reported for the previous grace period.  Now that
grace-period cleanup has been split out of grace-period initialization,
this commit also performs those checks at grace-period cleanup time.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-12 15:19:38 -07:00
Paul E. McKenney
88428cc5c2 rcu: Handle outgoing CPUs on exit from idle loop
This commit informs RCU of an outgoing CPU just before that CPU invokes
arch_cpu_idle_dead() during its last pass through the idle loop (via a
new CPU_DYING_IDLE notifier value).  This change means that RCU need not
deal with outgoing CPUs passing through the scheduler after informing
RCU that they are no longer online.  Note that removing the CPU from
the rcu_node ->qsmaskinit bit masks is done at CPU_DYING_IDLE time,
and orphaning callbacks is still done at CPU_DEAD time, the reason being
that at CPU_DEAD time we have another CPU that can adopt them.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-12 15:19:38 -07:00
Paul E. McKenney
528a25b00e cpu: Make CPU-offline idle-loop transition point more precise
This commit uses a per-CPU variable to make the CPU-offline code path
through the idle loop more precise, so that the outgoing CPU is
guaranteed to make it into the idle loop before it is powered off.
This commit is in preparation for putting the RCU offline-handling
code on this code path, which will eliminate the magic one-jiffy
wait that RCU uses as the maximum time for an outgoing CPU to get
all the way through the scheduler.

The magic one-jiffy wait for incoming CPUs remains a separate issue.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-12 15:19:37 -07:00
Paul E. McKenney
c199068913 rcu: Eliminate ->onoff_mutex from rcu_node structure
Because RCU grace-period initialization need no longer exclude
CPU-hotplug operations, this commit eliminates the ->onoff_mutex and
its uses.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-12 15:19:37 -07:00
Paul E. McKenney
0aa04b055e rcu: Process offlining and onlining only at grace-period start
Races between CPU hotplug and grace periods can be difficult to resolve,
so the ->onoff_mutex is used to exclude the two events.  Unfortunately,
this means that it is impossible for an outgoing CPU to perform the
last bits of its offlining from its last pass through the idle loop,
because sleeplocks cannot be acquired in that context.

This commit avoids these problems by buffering online and offline events
in a new ->qsmaskinitnext field in the leaf rcu_node structures.  When a
grace period starts, the events accumulated in this mask are applied to
the ->qsmaskinit field, and, if needed, up the rcu_node tree.  The special
case of all CPUs corresponding to a given leaf rcu_node structure being
offline while there are still elements in that structure's ->blkd_tasks
list is handled using a new ->wait_blkd_tasks field.  In this case,
propagating the offline bits up the tree is deferred until the beginning
of the grace period after all of the tasks have exited their RCU read-side
critical sections and removed themselves from the list, at which point
the ->wait_blkd_tasks flag is cleared.  If one of that leaf rcu_node
structure's CPUs comes back online before the list empties, then the
->wait_blkd_tasks flag is simply cleared.

This of course means that RCU's notion of which CPUs are offline can be
out of date.  This is OK because RCU need only wait on CPUs that were
online at the time that the grace period started.  In addition, RCU's
force-quiescent-state actions will handle the case where a CPU goes
offline after the grace period starts.
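
A hedged sketch of the grace-period-start step (locking simplified, ->wait_blkd_tasks handling omitted):

    rcu_for_each_leaf_node(rsp, rnp) {
        raw_spin_lock_irq(&rnp->lock);
        if (rnp->qsmaskinit == rnp->qsmaskinitnext) {
            raw_spin_unlock_irq(&rnp->lock);
            continue;  /* No hotplug events since the last GP. */
        }
        /* Apply buffered online/offline events, then propagate any
         * resulting changes up the rcu_node tree. */
        rnp->qsmaskinit = rnp->qsmaskinitnext;
        ...
        raw_spin_unlock_irq(&rnp->lock);
    }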

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-12 15:19:37 -07:00
Paul E. McKenney
cc99a310ca rcu: Move rcu_report_unblock_qs_rnp() to common code
The rcu_report_unblock_qs_rnp() function is invoked when the
last task blocking the current grace period exits its outermost
RCU read-side critical section.  Previously, this was called only
from rcu_read_unlock_special(), and was therefore defined only when
CONFIG_RCU_PREEMPT=y.  However, this function will be invoked even when
CONFIG_RCU_PREEMPT=n once CPU-hotplug operations are processed only at
the beginnings of RCU grace periods.  The reason for this change is that
the last task on a given leaf rcu_node structure's ->blkd_tasks list
might well exit its RCU read-side critical section between the time that
recent CPU-hotplug operations were applied and when the new grace period
was initialized.  This situation could result in RCU waiting forever on
that leaf rcu_node structure, because if all that structure's CPUs were
already offline, there would be no quiescent-state events to drive that
structure's part of the grace period.

This commit therefore moves rcu_report_unblock_qs_rnp() to common code
that is built unconditionally so that the quiescent-state-forcing code
can clean up after this situation, avoiding the grace-period stall.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-12 15:19:36 -07:00
Paul E. McKenney
8eb74b2b29 rcu: Rework preemptible expedited bitmask handling
Currently, the rcu_node tree ->expmask bitmasks are initially set to
reflect the online CPUs.  This is pointless, because only the CPUs
preempted within RCU read-side critical sections by the preceding
synchronize_sched_expedited() need to be tracked.  This commit therefore
instead sets up these bitmasks based on the state of the ->blkd_tasks
lists.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-12 15:18:42 -07:00
Paul E. McKenney
999c286347 rcu: Remove event tracing from rcu_cpu_notify(), used by offline CPUs
Offline CPUs cannot safely invoke trace events, but such CPUs do execute
within rcu_cpu_notify().  Therefore, this commit removes the trace events
from rcu_cpu_notify().  These trace events are for utilization, against
which rcu_cpu_notify() execution time should be negligible.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-11 13:22:39 -07:00
Paul E. McKenney
37745d2810 rcu: Provide diagnostic option to slow down grace-period initialization
Grace-period initialization normally proceeds quite quickly, so
that it is very difficult to reproduce races against grace-period
initialization.  This commit therefore allows grace-period
initialization to be artificially slowed down, increasing
race-reproduction probability.  A pair of new Kconfig parameters are
provided, CONFIG_RCU_TORTURE_TEST_SLOW_INIT to enable the slowdowns, and
CONFIG_RCU_TORTURE_TEST_SLOW_INIT_DELAY to specify the number of jiffies
of slowdown to apply.  A boot-time parameter named rcutree.gp_init_delay
allows boot-time delay to be specified.  By default, no delay will be
applied even if CONFIG_RCU_TORTURE_TEST_SLOW_INIT is set.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-11 13:22:38 -07:00
Paul E. McKenney
237a0f2193 rcu: Detect stalls caused by failure to propagate up rcu_node tree
If all CPUs have passed through quiescent states, then stalls might be
due to starvation of the grace-period kthread or to failure to propagate
the quiescent states up the rcu_node combining tree.  The current stall
warning messages do not differentiate, so this commit adds a printout
of the root rcu_node structure's ->qsmask field.
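
A sketch of the added printout (format string illustrative):

    pr_err("rcu_node root: ->qsmask %#lx\n", rcu_get_root(rsp)->qsmask);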

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-11 13:22:38 -07:00
Paul E. McKenney
18c629eaeb rcu: Eliminate empty HOTPLUG_CPU ifdef
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-11 13:22:37 -07:00
Paul E. McKenney
c8aead6a9b rcu: Simplify sync_rcu_preempt_exp_init()
This commit eliminates a boolean and associated "if" statement by
rearranging the code.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-11 13:22:37 -07:00
Paul E. McKenney
78043c467a rcu: Put all orphan-callback-related code under same comment
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-11 13:22:37 -07:00
Paul E. McKenney
b33078b609 rcu: Consolidate offline-CPU callback initialization
Currently, both rcu_cleanup_dead_cpu() and rcu_send_cbs_to_orphanage()
initialize the outgoing CPU's callback list.  However, only
rcu_cleanup_dead_cpu() invokes rcu_send_cbs_to_orphanage(), and
it does so unconditionally, which means that only one of these
initializations is required.  This commit therefore consolidates the
callback-list initialization with the rest of the callback handling in
rcu_send_cbs_to_orphanage().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-11 13:22:36 -07:00
Paul E. McKenney
8038dad7e8 smpboot: Add common code for notification from dying CPU
RCU ignores offlined CPUs, so they cannot safely run RCU read-side code.
(They -can- use SRCU, but not RCU.)  This means that any use of RCU
during or after the call to arch_cpu_idle_dead() is a bug.  Unfortunately,
commit 2ed53c0d6c added a complete() call, which will contain RCU
read-side critical sections if there is a task waiting to be awakened.

Which, as it turns out, there almost never is.  In my qemu/KVM testing,
the to-be-awakened task is not yet asleep more than 99.5% of the time.
In current mainline, failure is even harder to reproduce, requiring a
virtualized environment that delays the outgoing CPU by at least three
jiffies between the time it exits its stop_machine() task at CPU_DYING
time and the time it calls arch_cpu_idle_dead() from the idle loop.
However, this problem really can occur, especially in virtualized
environments, and therefore really does need to be fixed.

This suggests moving back to the polling loop, but using a much shorter
wait, with gentle exponential backoff instead of the old 100-millisecond
wait.  Most of the time, the loop will exit without waiting at all,
and almost all of the remaining uses will wait only five microseconds.
If the outgoing CPU is preempted, a loop will wait one jiffy, then
increase the wait by a factor of 11/10ths, rounding up.  As before, there
is a five-second timeout.
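
A hedged sketch of that wait loop; cpu_is_dead() stands in for the commit's per-CPU handoff state:

    static bool wait_for_cpu_death_sketch(unsigned int cpu, int seconds)
    {
        int jf_left = seconds * HZ;
        int sleep_jf = 1;

        udelay(5);                      /* Covers almost all cases. */
        while (!cpu_is_dead(cpu)) {     /* Placeholder predicate. */
            if ((jf_left -= sleep_jf) <= 0)
                return false;           /* Five-second timeout. */
            schedule_timeout_uninterruptible(sleep_jf);
            sleep_jf = DIV_ROUND_UP(sleep_jf * 11, 10);
        }
        return true;
    }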

This commit therefore provides common-code infrastructure to do the
dying-to-surviving CPU handoff in a safe manner.  This code also
provides an indication at CPU-online of whether the CPU to be onlined
previously timed out on offline.  The new cpu_check_up_prepare() function
returns -EBUSY if this CPU previously took more than five seconds to
go offline, or -EAGAIN if it has not yet managed to go offline.  The
rationale for -EAGAIN is that it might still be preempted, so an additional
wait might well find it correctly offlined.  Architecture-specific code
can decide how to handle these conditions.  Systems in which CPUs take
themselves completely offline might respond to an -EBUSY return as if
it was a zero (success) return.  Systems in which the surviving CPU must
take some action might take it at this time, or might simply mark the
other CPU as unusable.

Note that architectures that take the easy way out and simply pass the
-EBUSY and -EAGAIN upwards will change the sysfs API.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: <linux-api@vger.kernel.org>
Cc: <linux-arch@vger.kernel.org>
[ paulmck: Fixed state machine for architectures that don't check earlier
  CPU-hotplug results as suggested by James Hogan. ]
2015-03-11 13:20:25 -07:00
Linus Torvalds
e7901af143 Merge tag 'trace-fixes-v4.0-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull seq-buf/ftrace fixes from Steven Rostedt:
 "This includes fixes for seq_buf_bprintf() truncation issue.  It also
  contains fixes to ftrace when /proc/sys/kernel/ftrace_enabled and
  function tracing are started.  Doing the following causes some issues:

    # echo 0 > /proc/sys/kernel/ftrace_enabled
    # echo function_graph > /sys/kernel/debug/tracing/current_tracer
    # echo 1 > /proc/sys/kernel/ftrace_enabled
    # echo nop > /sys/kernel/debug/tracing/current_tracer
    # echo function_graph > /sys/kernel/debug/tracing/current_tracer

  The same issues occur with function tracing too.  Pratyush Anand first reported
  this issue to me and supplied a patch.  When I tested this on my x86
  test box, it caused thousands of backtraces and warnings to appear in
  dmesg, which also caused a denial of service (a warning for every
  function that was listed).  I applied Pratyush's patch but it did not
  fix the issue for me.  I looked into it and found a slight problem
  with trampoline accounting.  I fixed it and sent Pratyush a patch, but
  he said that it did not fix the issue for him.

  I later learned that Pratyush was using an ARM64 server, and when I
  tested on my ARM board, I was able to reproduce the same issue as
  Pratyush.  After applying his patch, it fixed the problem.  The above
  test uncovered two different bugs, one in x86 and one in ARM and
  ARM64.  As this looked like it would affect PowerPC, I tested it on my
  PPC64 box.  It too broke, but neither the patch that fixed ARM or x86
  fixed this box (the changes were all in generic code!).  The above
  test uncovered two more bugs that affected PowerPC.  Again, the
  changes were only done to generic code.  It's the way the arch code
  expected things to be done that was different between the archs.  Some
  were more sensitive than others.

  The rest of this series fixes the PPC bugs as well"

* tag 'trace-fixes-v4.0-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Fix ftrace enable ordering of sysctl ftrace_enabled
  ftrace: Fix en(dis)able graph caller when en(dis)abling record via sysctl
  ftrace: Clear REGS_EN and TRAMP_EN flags on disabling record via sysctl
  seq_buf: Fix seq_buf_bprintf() truncation
  seq_buf: Fix seq_buf_vprintf() truncation
2015-03-09 18:44:06 -07:00
Linus Torvalds
c0e99a71bd Merge branch 'for-4.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "The cgroup iteration update two years ago and the recent cpuset
  restructuring introduced regressions in a subset of cpuset
  configurations.  Three patches to fix them.

  All are marked for -stable"

* 'for-4.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cpuset: Fix cpuset sched_relax_domain_level
  cpuset: fix a warning when clearing configured masks in old hierarchy
  cpuset: initialize effective masks when clone_children is enabled
2015-03-09 17:30:09 -07:00
Linus Torvalds
b695f31f4e Merge branch 'for-4.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
 "One fix patch for a subtle livelock condition which can happen on
  PREEMPT_NONE kernels involving two racing cancel_work calls.  Whoever
  comes in second has to wait for the previous one to finish.  This
  was implemented by making the later one block on the same condition
  that the former would (work item completion) and then loop and
  retest; unfortunately, depending on the wake up order, the later one
  could lock the former out of finishing by busy looping on the cpu.

  This is fixed by implementing an explicit wait mechanism.  The work item
  might not belong anywhere at this point and there's a remote possibility
  of a thundering herd problem.  I originally tried to use bit_waitqueue
  but it didn't work for static work items on modules.  It's currently
  using single wait queue with filtering wake up function and exclusive
  wakeup.  If this ever becomes a problem, which is not very likely, we
  can try to figure out a way to piggy back on bit_waitqueue"

* 'for-4.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE
2015-03-09 17:00:54 -07:00
Steven Rostedt (Red Hat)
524a386825 ftrace: Fix ftrace enable ordering of sysctl ftrace_enabled
Some archs (specifically PowerPC) are sensitive to the ordering of
enabling the calls to function tracing and setting the function to
be traced.

That is, update_ftrace_function() sets what function the ftrace_caller
trampoline should call. Some archs require this to be set before
calling ftrace_run_update_code().

Another bug was discovered, that ftrace_startup_sysctl() called
ftrace_run_update_code() directly. If the function the ftrace_caller
trampoline calls changes, then it will not be updated. Instead a call
to ftrace_startup_enable() should be called because it tests to see
if the callback changed since the code was disabled, and will
tell the arch to update appropriately. Most archs do not need this
notification, but PowerPC does.

The problem could be seen by the following commands:

 # echo 0 > /proc/sys/kernel/ftrace_enabled
 # echo function > /sys/kernel/debug/tracing/current_tracer
 # echo 1 > /proc/sys/kernel/ftrace_enabled
 # cat /sys/kernel/debug/tracing/trace

The trace will show that function tracing was not active.

Cc: stable@vger.kernel.org # 2.6.27+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-03-09 10:55:34 -04:00
Pratyush Anand
1619dc3f8f ftrace: Fix en(dis)able graph caller when en(dis)abling record via sysctl
When ftrace is enabled globally through the proc interface, we must check if
ftrace_graph_active is set. If it is set, then we should also pass the
FTRACE_START_FUNC_RET command to ftrace_run_update_code(). Similarly, when
ftrace is disabled globally through the proc interface, we must check if
ftrace_graph_active is set. If it is set, then we should also pass the
FTRACE_STOP_FUNC_RET command to ftrace_run_update_code().
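
A sketch of the intended command assembly on both sysctl paths (surrounding code elided):

    /* Enabling via sysctl: */
    command = FTRACE_UPDATE_CALLS;
    if (ftrace_graph_active)
        command |= FTRACE_START_FUNC_RET;
    ftrace_run_update_code(command);

    /* Disabling via sysctl: */
    command = FTRACE_DISABLE_CALLS;
    if (ftrace_graph_active)
        command |= FTRACE_STOP_FUNC_RET;
    ftrace_run_update_code(command);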

Consider the following situation.

 # echo 0 > /proc/sys/kernel/ftrace_enabled

After this ftrace_enabled = 0.

 # echo function_graph > /sys/kernel/debug/tracing/current_tracer

Since ftrace_enabled = 0, ftrace_enable_ftrace_graph_caller() is never
called.

 # echo 1 > /proc/sys/kernel/ftrace_enabled

Now ftrace_enabled will be set to true, but still
ftrace_enable_ftrace_graph_caller() will not be called, which is not
desired.

Further if we execute the following after this:
  # echo nop > /sys/kernel/debug/tracing/current_tracer

Now since ftrace_enabled is set it will call
ftrace_disable_ftrace_graph_caller(), which causes a kernel warning on
the ARM platform.

On the ARM platform, when ftrace_enable_ftrace_graph_caller() is called,
it checks whether the old instruction is a nop or not. If it's not a nop,
then it returns an error. If it is a nop then it replaces instruction at
that address with a branch to ftrace_graph_caller.
ftrace_disable_ftrace_graph_caller() behaves just the opposite. Therefore,
if generic ftrace code ever calls either ftrace_enable_ftrace_graph_caller()
or ftrace_disable_ftrace_graph_caller() consecutively two times in a row,
then it will return an error, which will cause the generic ftrace code to
raise a warning.

Note, x86 does not have an issue with this because the architecture
specific code for ftrace_enable_ftrace_graph_caller() and
ftrace_disable_ftrace_graph_caller() does not check the previous state,
and calling either of these functions twice in a row has no ill effect.

Link: http://lkml.kernel.org/r/e4fbe64cdac0dd0e86a3bf914b0f83c0b419f146.1425666454.git.panand@redhat.com

Cc: stable@vger.kernel.org # 2.6.31+
Signed-off-by: Pratyush Anand <panand@redhat.com>
[
  removed extra if (ftrace_start_up) and defined ftrace_graph_active as 0
  if CONFIG_FUNCTION_GRAPH_TRACER is not set.
]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-03-09 10:50:51 -04:00
Steven Rostedt (Red Hat)
b24d443b8f ftrace: Clear REGS_EN and TRAMP_EN flags on disabling record via sysctl
When /proc/sys/kernel/ftrace_enabled is set to zero, all function
tracing is disabled. But the records that represent the functions
still hold information about the ftrace_ops that are hooked to them.

ftrace_ops may request "REGS" (have a full set of pt_regs passed to
the callback), or "TRAMP" (the ops has its own trampoline to use).
When the record is updated to represent the state of the ops hooked
to it, it sets "REGS_EN" and/or "TRAMP_EN" to state that the callback
points to the correct trampoline (REGS has its own trampoline).

When ftrace_enabled is set to zero, all ftrace locations are a nop,
so they do not point to any trampoline. But the _EN flags are still
set. This can cause the accounting to go wrong when ftrace_enabled
is cleared and an ops that has a trampoline is registered or unregistered.

For example, the following will cause ftrace to crash:

 # echo function_graph > /sys/kernel/debug/tracing/current_tracer
 # echo 0 > /proc/sys/kernel/ftrace_enabled
 # echo nop > /sys/kernel/debug/tracing/current_tracer
 # echo 1 > /proc/sys/kernel/ftrace_enabled
 # echo function_graph > /sys/kernel/debug/tracing/current_tracer

As function_graph uses a trampoline, when ftrace_enabled is set to zero
the updates to the record are not done. When enabling function_graph
again, the record will still have the TRAMP_EN flag set, and it will
look for an op that has a trampoline other than the function_graph
ops, and fail to find one.
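
A sketch of the repair: when tracing is disabled via the sysctl, strip the enabled-state flags from each record so the next enable re-evaluates which trampoline to use (flag names per ftrace; the exact place in the update path is elided):

    rec->flags &= ~(FTRACE_FL_ENABLED | FTRACE_FL_REGS_EN |
                    FTRACE_FL_TRAMP_EN);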

Cc: stable@vger.kernel.org # 3.17+
Reported-by: Pratyush Anand <panand@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-03-09 10:46:00 -04:00
Linus Torvalds
bbbce516bb Merge tag 'tty-4.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial fixes from Greg KH:
 "Here are some tty and serial driver fixes for 4.0-rc3.

  Along with the atime fix that you know about, here are some other
  serial driver bugfixes as well.  Most notable is a wait_until_sent
  bugfix that was traced back to being around since before 2.6.12 that
  Johan has fixed up.

  All have been in linux-next successfully"

* tag 'tty-4.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  TTY: fix tty_wait_until_sent maximum timeout
  TTY: fix tty_wait_until_sent on 64-bit machines
  USB: serial: fix infinite wait_until_sent timeout
  TTY: bfin_jtag_comm: remove incorrect wait_until_sent operation
  net: irda: fix wait_until_sent poll timeout
  serial: uapi: Declare all userspace-visible io types
  serial: core: Fix iotype userspace breakage
  serial: sprd: Fix missing spin_unlock in sprd_handle_irq()
  console: Fix console name size mismatch
  tty: fix up atime/mtime mess, take four
  serial: 8250_dw: Fix get_mctrl behaviour
  serial:8250:8250_pci: delete unneeded quirk entries
  serial:8250:8250_pci: fix redundant entry report for WCH_CH352_2S
  Change email address for 8250_pci
  serial: 8250: Revert "tty: serial: 8250_core: read only RX if there is something in the FIFO"
  Revert "tty/serial: of_serial: add DT alias ID handling"
2015-03-08 12:25:40 -07:00
Linus Torvalds
9aae0df6a3 Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
 "arm64 and generic kernel/module.c (acked by Rusty) fixes for
  CONFIG_DEBUG_SET_MODULE_RONX"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  kernel/module.c: Update debug alignment after symtable generation
  arm64: Don't use is_module_addr in setting page attributes
2015-03-07 11:31:17 -08:00
Peter Hurley
30a22c215a console: Fix console name size mismatch
commit 6ae9200f2c ("enlarge console.name") increased the storage
for the console name to 16 bytes, but not the corresponding
struct console_cmdline::name storage.  Console names longer than
8 bytes cause a read beyond the end of the string and a failure to
match the console; I'm not sure if there are other unexpected consequences.

Cc: <stable@vger.kernel.org> # 2.6.22+
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-07 03:39:55 +01:00
Linus Torvalds
0d9b9c1674 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching
Pull livepatching fix from Jiri Kosina:
 "Fix an RCU unlock misplacement in live patching infrastructure, from
  Peter Zijlstra"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
  livepatch: fix RCU usage in klp_find_external_symbol()
2015-03-06 13:47:56 -08:00
Laura Abbott
168e47f2a6 kernel/module.c: Update debug alignment after symtable generation
When CONFIG_DEBUG_SET_MODULE_RONX is enabled, the sizes of
module sections are aligned up so appropriate permissions can
be applied. Adjusting for the symbol table may cause them to
become unaligned. Make sure to re-align the sizes afterward.
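
A sketch of the fix, using the debug_align() helper that kernel/module.c already applies during section layout:

    /* Symbol-table layout may have left the sizes unaligned; redo it. */
    mod->core_size = debug_align(mod->core_size);
    mod->init_size = debug_align(mod->init_size);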

Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2015-03-06 12:04:22 +00:00
Rafael J. Wysocki
79d223646b Merge branch 'irq-pm'
* irq-pm:
  genirq / PM: describe IRQF_COND_SUSPEND
  tty: serial: atmel: rework interrupt and wakeup handling
  watchdog: at91sam9: request the irq with IRQF_NO_SUSPEND
  clk: at91: implement suspend/resume for the PMC irqchip
  rtc: at91rm9200: rework wakeup and interrupt handling
  rtc: at91sam9: rework wakeup and interrupt handling
  PM / wakeup: export pm_system_wakeup symbol
  genirq / PM: Add flag for shared NO_SUSPEND interrupt lines
  genirq / PM: better describe IRQF_NO_SUSPEND semantics
2015-03-06 01:29:05 +01:00
Rafael J. Wysocki
eef16e4362 Merge branch 'suspend-to-idle'
* suspend-to-idle:
  cpuidle / sleep: Use broadcast timer for states that stop local timer
  cpuidle: Clean up fallback handling in cpuidle_idle_call()
  cpuidle / sleep: Do sanity checks in cpuidle_enter_freeze() too
  idle / sleep: Avoid excessive disabling and enabling interrupts
2015-03-05 23:14:51 +01:00
Rafael J. Wysocki
ef2b22ac54 cpuidle / sleep: Use broadcast timer for states that stop local timer
Commit 3810631332 (PM / sleep: Re-implement suspend-to-idle handling)
overlooked the fact that entering some sufficiently deep idle states
by CPUs may cause their local timers to stop and in those cases it
is necessary to switch over to a broadcast timer prior to entering
the idle state.  If the cpuidle driver in use does not provide
the new ->enter_freeze callback for any of the idle states, that
problem affects suspend-to-idle too, but it is not taken into account
after the changes made by commit 3810631332.

Fix that by changing the definition of cpuidle_enter_freeze() and
re-arranging of the code in cpuidle_idle_call(), so the former does
not call cpuidle_enter() any more and the fallback case is handled
by cpuidle_idle_call() directly.

Fixes: 3810631332 (PM / sleep: Re-implement suspend-to-idle handling)
Reported-and-tested-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2015-03-05 23:13:19 +01:00
Tejun Heo
8603e1b300 workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE
cancel[_delayed]_work_sync() are implemented using
__cancel_work_timer() which grabs the PENDING bit using
try_to_grab_pending() and then flushes the work item with PENDING set
to prevent the on-going execution of the work item from requeueing
itself.

try_to_grab_pending() can always grab PENDING bit without blocking
except when someone else is doing the above flushing during
cancelation.  In that case, try_to_grab_pending() returns -ENOENT, and
__cancel_work_timer() currently invokes flush_work().  The
assumption is that the completion of the work item is what the other
canceling task would be waiting for too, and thus waiting for the same
condition and retrying should allow forward progress without excessive
busy looping.

Unfortunately, this doesn't work if preemption is disabled or the
latter task has real time priority.  Let's say task A just got woken
up from flush_work() by the completion of the target work item.  If,
before task A starts executing, task B gets scheduled and invokes
__cancel_work_timer() on the same work item, its try_to_grab_pending()
will return -ENOENT as the work item is still being canceled by task A
and flush_work() will also immediately return false as the work item
is no longer executing.  This puts task B in a busy loop possibly
preventing task A from executing and clearing the canceling state on
the work item leading to a hang.

task A			task B			worker

						executing work
__cancel_work_timer()
  try_to_grab_pending()
  set work CANCELING
  flush_work()
    block for work completion
						completion, wakes up A
			__cancel_work_timer()
			while (forever) {
			  try_to_grab_pending()
			    -ENOENT as work is being canceled
			  flush_work()
			    false as work is no longer executing
			}

This patch removes the possible hang by updating __cancel_work_timer()
to explicitly wait for clearing of CANCELING rather than invoking
flush_work() after try_to_grab_pending() fails with -ENOENT.
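
A sketch of that mechanism, patterned on the description above: one global wait queue plus a wake function that filters on the work item being canceled (names illustrative):

    static DECLARE_WAIT_QUEUE_HEAD(cancel_waitq);

    struct cwt_wait {
        wait_queue_t        wait;
        struct work_struct  *work;
    };

    static int cwt_wakefn(wait_queue_t *wait, unsigned mode, int sync,
                          void *key)
    {
        struct cwt_wait *cwait = container_of(wait, struct cwt_wait, wait);

        if (cwait->work != key)
            return 0;  /* Only wake the canceler of this work item. */
        return autoremove_wake_function(wait, mode, sync, key);
    }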

Link: http://lkml.kernel.org/g/20150206171156.GA8942@axis.com

v3: bit_waitqueue() can't be used for work items defined in vmalloc
    area.  Switched to custom wake function which matches the target
    work item and exclusive wait and wakeup.

v2: v1 used wake_up() on bit_waitqueue() which leads to NULL deref if
    the target bit waitqueue has wait_bit_queue's on it.  Use
    DEFINE_WAIT_BIT() and __wake_up_bit() instead.  Reported by Tomeu
    Vizoso.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Rabin Vincent <rabin.vincent@axis.com>
Cc: Tomeu Vizoso <tomeu.vizoso@gmail.com>
Cc: stable@vger.kernel.org
Tested-by: Jesper Nilsson <jesper.nilsson@axis.com>
Tested-by: Rabin Vincent <rabin.vincent@axis.com>
2015-03-05 08:04:13 -05:00
Rafael J. Wysocki
17f4803420 genirq / PM: Add flag for shared NO_SUSPEND interrupt lines
It currently is required that all users of NO_SUSPEND interrupt
lines pass the IRQF_NO_SUSPEND flag when requesting the IRQ or the
WARN_ON_ONCE() in irq_pm_install_action() will trigger.  That is
done to warn about situations in which unprepared interrupt handlers
may be run unnecessarily for suspended devices and may attempt to
access those devices by mistake.  However, it may cause drivers
that have no technical reasons for using IRQF_NO_SUSPEND to set
that flag just because they happen to share the interrupt line
with something like a timer.

Moreover, the generic handling of wakeup interrupts introduced by
commit 9ce7a25849 (genirq: Simplify wakeup mechanism) only works
for IRQs without any NO_SUSPEND users, so the drivers of wakeup
devices needing to use shared NO_SUSPEND interrupt lines for
signaling system wakeup generally have to detect wakeup in their
interrupt handlers.  Thus if they happen to share an interrupt line
with a NO_SUSPEND user, they also need to request that their
interrupt handlers be run after suspend_device_irqs().

In both cases the reason for using IRQF_NO_SUSPEND is not because
the driver in question has a genuine need to run its interrupt
handler after suspend_device_irqs(), but because it happens to
share the line with some other NO_SUSPEND user.  Otherwise, the
driver would do without IRQF_NO_SUSPEND just fine.

To make it possible to specify that condition explicitly, introduce
a new IRQ action handler flag for shared IRQs, IRQF_COND_SUSPEND,
that, when set, will indicate to the IRQ core that the interrupt
user is generally fine with suspending the IRQ, but it also can
tolerate handler invocations after suspend_device_irqs() and, in
particular, it is capable of detecting system wakeup and triggering
it as appropriate from its interrupt handler.

That will allow us to work around a problem with a shared timer
interrupt line on at91 platforms.
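
Usage then looks something like this hedged example (handler and device names hypothetical):

    /* A driver sharing a NO_SUSPEND line that can cope with handler
     * invocations after suspend_device_irqs(): */
    ret = request_irq(irq, my_wakeup_handler,
                      IRQF_SHARED | IRQF_COND_SUSPEND, "my-dev", dev);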

Link: http://marc.info/?l=linux-kernel&m=142252777602084&w=2
Link: http://marc.info/?t=142252775300011&r=1&w=2
Link: https://lkml.org/lkml/2014/12/15/552
Reported-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
2015-03-04 21:42:19 +01:00
Yao Dongdong
9910affa89 rcu: Remove redundant check of cpu_online()
Because invoke_rcu_core() checks whether the current CPU is online,
there is no need for __call_rcu_core() to redundantly check it.
There should not be any performance degradation because the called
function is visible to the compiler.  This commit therefore removes
the redundant check.

Signed-off-by: Yao Dongdong <yaodongdong@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-03 11:17:34 -08:00
Paul E. McKenney
e7580f3388 rcu: Get rcu_sched_force_quiescent_state() where it belongs
The very similar functions rcu_force_quiescent_state(),
rcu_bh_force_quiescent_state(), and rcu_sched_force_quiescent_state()
are supposed to be together, but have drifted apart.  This commit
restores rcu_sched_force_quiescent_state() to its rightful place.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-03 11:17:19 -08:00
Paul E. McKenney
a3bd2c09ad rcu: Add boot-up check for non-default CONFIG_RCU_FANOUT_LEAF values
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-03 11:16:31 -08:00
Paul E. McKenney
ab6f5bd674 rcu: Use IS_ENABLED() to simplify rcu_bootup_announce_oddness()
This commit gets rid of some inline #ifdefs by replacing them with
IS_ENABLED.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-03 11:16:20 -08:00
Paul E. McKenney
d24209bb68 rcu: Improve diagnostics for blocked critical sections in irq
If an RCU read-side critical section occurs within an interrupt handler
or a softirq handler, it cannot have been preempted.  Therefore,
rcu_read_unlock_special() contains a check for this error.  However,
when this check triggers, it lacks diagnostic information.  This commit
therefore moves rcu_read_unlock()'s lockdep annotation to follow the
call to __rcu_read_unlock() and changes rcu_read_unlock_special()'s
WARN_ON_ONCE() to a lockdep_rcu_suspicious() in order to locate where
the offending RCU read-side critical section began.  In addition, the
value of the ->rcu_read_unlock_special field is printed.
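
The diagnostic, sketched (message text illustrative; t is the current task):

    lockdep_rcu_suspicious(__FILE__, __LINE__,
                           "blocking rcu_read_unlock() from irq/softirq!");
    pr_info("->rcu_read_unlock_special: %#x\n",
            t->rcu_read_unlock_special.s);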

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-03 11:16:00 -08:00
Paul E. McKenney
6629240575 rcu: Use IS_ENABLED() to remove CONFIG_RCU_FANOUT_EXACT #ifdef
This commit uses IS_ENABLED() to remove the #ifdef from the
rcu_init_levelspread() functions.  No effect on executable code.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-03 11:14:08 -08:00
Paul E. McKenney
4762767810 rcu: Move early boot callback tests earlier
Because callbacks can now be posted quite early in boot, move the
early boot callback tests to precede RCU initialization.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-03 11:06:22 -08:00