1
0
Commit Graph

19927 Commits

Author SHA1 Message Date
Linus Torvalds
49d2953c72 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "Major changes:

   - Reworked CPU capacity code, for better SMP load balancing on
     systems with assymetric CPUs. (Vincent Guittot, Morten Rasmussen)

   - Reworked RT task SMP balancing to be push based instead of pull
     based, to reduce latencies on large CPU count systems. (Steven
     Rostedt)

   - SCHED_DEADLINE support updates and fixes. (Juri Lelli)

   - SCHED_DEADLINE task migration support during CPU hotplug. (Wanpeng Li)

   - x86 mwait-idle optimizations and fixes. (Mike Galbraith, Len Brown)

   - sched/numa improvements. (Rik van Riel)

   - various cleanups"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
  sched/core: Drop debugging leftover trace_printk call
  sched/deadline: Support DL task migration during CPU hotplug
  sched/core: Check for available DL bandwidth in cpuset_cpu_inactive()
  sched/deadline: Always enqueue on previous rq when dl_task_timer() fires
  sched/core: Remove unused argument from init_[rt|dl]_rq()
  sched/deadline: Fix rt runtime corruption when dl fails its global constraints
  sched/deadline: Avoid a superfluous check
  sched: Improve load balancing in the presence of idle CPUs
  sched: Optimize freq invariant accounting
  sched: Move CFS tasks to CPUs with higher capacity
  sched: Add SD_PREFER_SIBLING for SMT level
  sched: Remove unused struct sched_group_capacity::capacity_orig
  sched: Replace capacity_factor by usage
  sched: Calculate CPU's usage statistic and put it into struct sg_lb_stats::group_usage
  sched: Add struct rq::cpu_capacity_orig
  sched: Make scale_rt invariant with frequency
  sched: Make sched entity usage tracking scale-invariant
  sched: Remove frequency scaling from cpu_capacity
  sched: Track group sched_entity usage contributions
  sched: Add sched_avg::utilization_avg_contrib
  ...
2015-04-13 10:47:34 -07:00
Linus Torvalds
cc76ee75a9 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core locking changes from Ingo Molnar:
 "Main changes:

   - jump label asm preparatory work for PowerPC (Anton Blanchard)

   - rwsem optimizations and cleanups (Davidlohr Bueso)

   - mutex optimizations and cleanups (Jason Low)

   - futex fix (Oleg Nesterov)

   - remove broken atomicity checks from {READ,WRITE}_ONCE() (Peter
     Zijlstra)"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  powerpc, jump_label: Include linux/jump_label.h to get HAVE_JUMP_LABEL define
  jump_label: Allow jump labels to be used in assembly
  jump_label: Allow asm/jump_label.h to be included in assembly
  locking/mutex: Further simplify mutex_spin_on_owner()
  locking: Remove atomicy checks from {READ,WRITE}_ONCE
  locking/rtmutex: Rename argument in the rt_mutex_adjust_prio_chain() documentation as well
  locking/rwsem: Fix lock optimistic spinning when owner is not running
  locking: Remove ACCESS_ONCE() usage
  locking/rwsem: Check for active lock before bailing on spinning
  locking/rwsem: Avoid deceiving lock spinners
  locking/rwsem: Set lock ownership ASAP
  locking/rwsem: Document barrier need when waking tasks
  locking/futex: Check PF_KTHREAD rather than !p->mm to filter out kthreads
  locking/mutex: Refactor mutex_spin_on_owner()
  locking/mutex: In mutex_spin_on_owner(), return true when owner changes
2015-04-13 10:27:28 -07:00
Linus Torvalds
9003601310 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
 "First batch of KVM changes for 4.1

  The most interesting bit here is irqfd/ioeventfd support for ARM and
  ARM64.

  Summary:

  ARM/ARM64:
     fixes for live migration, irqfd and ioeventfd support (enabling
     vhost, too), page aging

  s390:
     interrupt handling rework, allowing to inject all local interrupts
     via new ioctl and to get/set the full local irq state for migration
     and introspection.  New ioctls to access memory by virtual address,
     and to get/set the guest storage keys.  SIMD support.

  MIPS:
     FPU and MIPS SIMD Architecture (MSA) support.  Includes some
     patches from Ralf Baechle's MIPS tree.

  x86:
     bugfixes (notably for pvclock, the others are small) and cleanups.
     Another small latency improvement for the TSC deadline timer"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (146 commits)
  KVM: use slowpath for cross page cached accesses
  kvm: mmu: lazy collapse small sptes into large sptes
  KVM: x86: Clear CR2 on VCPU reset
  KVM: x86: DR0-DR3 are not clear on reset
  KVM: x86: BSP in MSR_IA32_APICBASE is writable
  KVM: x86: simplify kvm_apic_map
  KVM: x86: avoid logical_map when it is invalid
  KVM: x86: fix mixed APIC mode broadcast
  KVM: x86: use MDA for interrupt matching
  kvm/ppc/mpic: drop unused IRQ_testbit
  KVM: nVMX: remove unnecessary double caching of MAXPHYADDR
  KVM: nVMX: checks for address bits beyond MAXPHYADDR on VM-entry
  KVM: x86: cache maxphyaddr CPUID leaf in struct kvm_vcpu
  KVM: vmx: pass error code with internal error #2
  x86: vdso: fix pvclock races with task migration
  KVM: remove kvm_read_hva and kvm_read_hva_atomic
  KVM: x86: optimize delivery of TSC deadline timer interrupt
  KVM: x86: extract blocking logic from __vcpu_run
  kvm: x86: fix x86 eflags fixed bit
  KVM: s390: migrate vcpu interrupt state
  ...
2015-04-13 09:47:01 -07:00
Linus Torvalds
e5e02de066 Merge tag 'pm+acpi-4.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management and ACPI fixes from Rafael Wysocki:
 "These are stable-candidate fixes of some recently reported issues in
  the cpufreq core, cpuidle core, the ACPI cpuidle driver and the
  hibernate core.

  Specifics:

   - Revert a 3.17 hibernate commit that was supposed to fix an issue
     related to e820 reserved regions, but broke resume from hibernation
     on Lenovo x230 (Rafael J Wysocki).

   - Prevent the ACPI cpuidle driver from overwriting the name and
     description of the C0 state set by the core when the list of
     C-states changes (Thomas Schlichter).

   - Remove the no longer needed state_count field from struct
     cpuidle_device which prevents the list of C-states shown by the
     sysfs interface from becoming incorrect when the current number of
     them is different from the number of C-states on boot (Bartlomiej
     Zolnierkiewicz).

   - The cpufreq core updates the policy object of the only online CPU
     during system resume to make it reflect the current hardware state,
     but it always assumes that CPU to be CPU0 which need not be the
     case, so fix the code to avoid that assumption (Viresh Kumar)"

* tag 'pm+acpi-4.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  Revert "PM / hibernate: avoid unsafe pages in e820 reserved regions"
  cpuidle: ACPI: do not overwrite name and description of C0
  cpuidle: remove state_count field from struct cpuidle_device
  cpufreq: Schedule work for the first-online CPU on resume
2015-04-09 17:44:27 -07:00
Rafael J. Wysocki
b2d5fb97d3 Merge branches 'pm-sleep', 'pm-cpufreq' and 'pm-cpuidle'
* pm-sleep:
  Revert "PM / hibernate: avoid unsafe pages in e820 reserved regions"

* pm-cpufreq:
  cpufreq: Schedule work for the first-online CPU on resume

* pm-cpuidle:
  cpuidle: ACPI: do not overwrite name and description of C0
  cpuidle: remove state_count field from struct cpuidle_device
2015-04-09 23:25:23 +02:00
Jason Low
01ac33c1f9 locking/mutex: Further simplify mutex_spin_on_owner()
Similar to what Linus suggested for rwsem_spin_on_owner(), in
mutex_spin_on_owner() instead of having while (true) and
breaking out of the spin loop on lock->owner != owner, we can
have the loop directly check for while (lock->owner == owner) to
improve the readability of the code.

It also shrinks the code a bit:

   text    data     bss     dec     hex filename
   3721       0       0    3721     e89 mutex.o.before
   3705       0       0    3705     e79 mutex.o.after

Signed-off-by: Jason Low <jason.low2@hp.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Link: http://lkml.kernel.org/r/1428521960-5268-2-git-send-email-jason.low2@hp.com
[ Added code generation info. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-09 08:10:23 +02:00
Linus Torvalds
b97fdef8e6 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "Three fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm: numa: disable change protection for vma(VM_HUGETLB)
  include/linux/dmapool.h: declare struct device
  mm: move zone lock to a different cache line than order-0 free page lists
2015-04-08 14:42:49 -07:00
Linus Torvalds
3afe9f8496 Copy the kernel module data from user space in chunks
Unlike most (all?) other copies from user space, kernel module loading
is almost unlimited in size.  So we do a potentially huge
"copy_from_user()" when we copy the module data from user space to the
kernel buffer, which can be a latency concern when preemption is
disabled (or voluntary).

Also, because 'copy_from_user()' clears the tail of the kernel buffer on
failures, even a *failed* copy can end up wasting a lot of time.

Normally neither of these are concerns in real life, but they do trigger
when doing stress-testing with trinity.  Running in a VM seems to add
its own overheadm causing trinity module load testing to even trigger
the watchdog.

The simple fix is to just chunk up the module loading, so that it never
tries to copy insanely big areas in one go.  That bounds the latency,
and also the amount of (unnecessarily, in this case) cleared memory for
the failure case.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-08 14:35:48 -07:00
Naoya Horiguchi
6b79c57b92 mm: numa: disable change protection for vma(VM_HUGETLB)
Currently when a process accesses a hugetlb range protected with
PROTNONE, unexpected COWs are triggered, which finally puts the hugetlb
subsystem into a broken/uncontrollable state, where for example
h->resv_huge_pages is subtracted too much and wraps around to a very
large number, and the free hugepage pool is no longer maintainable.

This patch simply stops changing protection for vma(VM_HUGETLB) to fix
the problem.  And this also allows us to avoid useless overhead of minor
faults.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Suggested-by: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-07 16:45:33 -07:00
Rafael J. Wysocki
f82daee49c Revert "PM / hibernate: avoid unsafe pages in e820 reserved regions"
Commit 84c91b7ae0 (PM / hibernate: avoid unsafe pages in e820 reserved
regions) is reported to make resume from hibernation on Lenovo x230
unreliable, so revert it.

We will revisit the issue the commit in question was supposed to fix
in the future.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=96111
Reported-by: rhn <kebuac.rhn@porcupinefactory.org>
Cc: 3.17+ <stable@vger.kernel.org> # 3.17+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-04-07 01:13:23 +02:00
Borislav Petkov
62a935b256 sched/core: Drop debugging leftover trace_printk call
Commit:

  3c18d447b3 ("sched/core: Check for available DL bandwidth in cpuset_cpu_inactive()")

forgot a trace_printk() debugging piece in and Steve's banner screamed
in dmesg. Remove it.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1428050570-21041-1-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-03 10:48:25 +02:00
Wanpeng Li
fa9c9d10e9 sched/deadline: Support DL task migration during CPU hotplug
I observed that DL tasks can't be migrated to other CPUs during CPU
hotplug, in addition, task may/may not be running again if CPU is
added back.

The root cause which I found is that DL tasks will be throtted and
removed from the DL rq after comsuming all their budget, which
leads to the situation that stop task can't pick them up from the
DL rq and migrate them to other CPUs during hotplug.

The method to reproduce:

  schedtool -E -t 50000:100000 -e ./test

Actually './test' is just a simple for loop. Then observe which CPU the
test task is on and offline it:

  echo 0 > /sys/devices/system/cpu/cpuN/online

This patch adds the DL task migration during CPU hotplug by finding a
most suitable later deadline rq after DL timer fires if current rq is
offline.

If it fails to find a suitable later deadline rq then it falls back to
any eligible online CPU in so that the deadline task will come back
to us, and the push/pull mechanism should then move it around properly.

Suggested-and-Acked-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427411315-4298-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02 17:42:57 +02:00
Juri Lelli
3c18d447b3 sched/core: Check for available DL bandwidth in cpuset_cpu_inactive()
Hotplug operations are destructive w.r.t. cpusets. In case such an
operation is performed on a CPU belonging to an exlusive cpuset, the
DL bandwidth information associated with the corresponding root
domain is gone even if the operation fails (in sched_cpu_inactive()).

For this reason we need to move the check we currently have in
sched_cpu_inactive() to cpuset_cpu_inactive() to prevent useless
cpusets reconfiguration in the CPU_DOWN_FAILED path.

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Link: http://lkml.kernel.org/r/1427792017-7356-2-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02 17:42:56 +02:00
Juri Lelli
4cd57f9713 sched/deadline: Always enqueue on previous rq when dl_task_timer() fires
dl_task_timer() may fire on a different rq from where a task was removed
after throttling. Since the call path is:

  dl_task_timer() ->
    enqueue_task_dl() ->
      enqueue_dl_entity() ->
        replenish_dl_entity()

and replenish_dl_entity() uses dl_se's rq, we can't use current's rq
in dl_task_timer(), but we need to lock the task's previous one.

Tested-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Kirill Tkhai <ktkhai@parallels.com>
Cc: Juri Lelli <juri.lelli@gmail.com>
Fixes: 3960c8c0c7 ("sched: Make dl_task_time() use task_rq_lock()")
Link: http://lkml.kernel.org/r/1427792017-7356-1-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02 17:42:56 +02:00
Abel Vesa
07c54f7a7f sched/core: Remove unused argument from init_[rt|dl]_rq()
Obviously, 'rq' is not used in these two functions, therefore,
there is no reason for it to be passed as an argument.

Signed-off-by: Abel Vesa <abelvesa@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1425383427-26244-1-git-send-email-abelvesa@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02 17:42:55 +02:00
Linus Torvalds
b6c3a5946c Merge tag 'lazytime_fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull lazytime fixes from Ted Ts'o:
 "This fixes a problem in the lazy time patches, which can cause
  frequently updated inods to never have their timestamps updated.

  These changes guarantee that no timestamp on disk will be stale by
  more than 24 hours"

* tag 'lazytime_fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  fs: add dirtytime_expire_seconds sysctl
  fs: make sure the timestamps for lazytime inodes eventually get written
2015-04-01 10:05:42 -07:00
Linus Torvalds
4a89452e70 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Ingo Molnar:
 "Two clocksource driver fixes, and an idle loop RCU warning fix"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource/drivers/sun5i: Fix cpufreq interaction with sched_clock()
  clocksource/drivers: Fix various !CONFIG_HAS_IOMEM build errors
  timers/tick/broadcast-hrtimer: Fix suspicious RCU usage in idle loop
2015-03-28 11:21:23 -07:00
Linus Torvalds
19dba4f3e9 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
 "A single sched/rt corner case fix for RLIMIT_RTIME correctness"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix RLIMIT_RTTIME when PI-boosting to RT
2015-03-28 11:17:32 -07:00
Linus Torvalds
ee9b63dd0f Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Ingo Molnar:
 "A perf kernel side fix for a fuzzer triggered lockup"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix irq_work 'tail' recursion
2015-03-28 11:12:08 -07:00
Linus Torvalds
0fa7271a8a Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Ingo Molnar:
 "A module unload lockdep race fix"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lockdep: Fix the module unload key range freeing logic
2015-03-28 11:05:03 -07:00
Wanpeng Li
a1963b81de sched/deadline: Fix rt runtime corruption when dl fails its global constraints
One version of sched_rt_global_constaints() (the !rt-cgroup one)
changes state, therefore if we fail the later sched_dl_global_constraints()
call the state is left in an inconsistent state.

Fix this by changing the order of the calls.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
[ Improved the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Link: http://lkml.kernel.org/r/1426590931-4639-2-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:36:15 +01:00
Wanpeng Li
bd4bde14b9 sched/deadline: Avoid a superfluous check
Since commit 40767b0dc7 ("sched/deadline: Fix deadline parameter
modification handling") we clear the thottled state when switching
from a dl task, therefore we should never find it set in switching to
a dl task.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
[ Improved the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Link: http://lkml.kernel.org/r/1426590931-4639-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:36:12 +01:00
Preeti U Murthy
d4573c3e1c sched: Improve load balancing in the presence of idle CPUs
When a CPU is kicked to do nohz idle balancing, it wakes up to do load
balancing on itself, followed by load balancing on behalf of idle CPUs.
But it may end up with load after the load balancing attempt on itself.
This aborts nohz idle balancing. As a result several idle CPUs are left
without tasks till such a time that an ILB CPU finds it unfavorable to
pull tasks upon itself. This delays spreading of load across idle CPUs
and worse, clutters only a few CPUs with tasks.

The effect of the above problem was observed on an SMT8 POWER server
with 2 levels of numa domains. Busy loops equal to number of cores were
spawned. Since load balancing on fork/exec is discouraged across numa
domains, all busy loops would start on one of the numa domains. However
it was expected that eventually one busy loop would run per core across
all domains due to nohz idle load balancing. But it was observed that it
took as long as 10 seconds to spread the load across numa domains.

Further investigation showed that this was a consequence of the
following:

 1. An ILB CPU was chosen from the first numa domain to trigger nohz idle
    load balancing [Given the experiment, upto 6 CPUs per core could be
    potentially idle in this domain.]

 2. However the ILB CPU would call load_balance() on itself before
    initiating nohz idle load balancing.

 3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
    tasks from its sibling cores to even out load.

 4. Now that the ILB CPU was no longer idle, it would abort nohz idle
    load balancing

As a result the opportunities to spread load across numa domains were
lost until such a time that the cores within the first numa domain had
equal number of tasks among themselves.  This is a pretty bad scenario,
since the cores within the first numa domain would have as many as 4
tasks each, while cores in the neighbouring numa domains would all
remain idle.

Fix this, by checking if a CPU was woken up to do nohz idle load
balancing, before it does load balancing upon itself. This way we allow
idle CPUs across the system to do load balancing which results in
quicker spread of load, instead of performing load balancing within the
local sched domain hierarchy of the ILB CPU alone under circumstances
such as above.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jason Low <jason.low2@hp.com>
Cc: benh@kernel.crashing.org
Cc: daniel.lezcano@linaro.org
Cc: efault@gmx.de
Cc: iamjoonsoo.kim@lge.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: riel@redhat.com
Cc: srikar@linux.vnet.ibm.com
Cc: svaidy@linux.vnet.ibm.com
Cc: tim.c.chen@linux.intel.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/20150326130014.21532.17158.stgit@preeti.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:36:09 +01:00
Peter Zijlstra
dfbca41f34 sched: Optimize freq invariant accounting
Currently the freq invariant accounting (in
__update_entity_runnable_avg() and sched_rt_avg_update()) get the
scale factor from a weak function call, this means that even for archs
that default on their implementation the compiler cannot see into this
function and optimize the extra scaling math away.

This is sad, esp. since its a 64-bit multiplication which can be quite
costly on some platforms.

So replace the weak function with #ifdef and __always_inline goo. This
is not quite as nice from an arch support PoV but should at least
result in compile time errors if done wrong.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Morten.Rasmussen@arm.com
Cc: Paul Turner <pjt@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/20150323131905.GF23123@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:36:08 +01:00
Vincent Guittot
1aaf90a4b8 sched: Move CFS tasks to CPUs with higher capacity
When a CPU is used to handle a lot of IRQs or some RT tasks, the remaining
capacity for CFS tasks can be significantly reduced. Once we detect such
situation by comparing cpu_capacity_orig and cpu_capacity, we trig an idle
load balance to check if it's worth moving its tasks on an idle CPU.

It's worth trying to move the task before the CPU is fully utilized to
minimize the preemption by irq or RT tasks.

Once the idle load_balance has selected the busiest CPU, it will look for an
active load balance for only two cases:

  - There is only 1 task on the busiest CPU.

  - We haven't been able to move a task of the busiest rq.

A CPU with a reduced capacity is included in the 1st case, and it's worth to
actively migrate its task if the idle CPU has got more available capacity for
CFS tasks. This test has been added in need_active_balance.

As a sidenote, this will not generate more spurious ilb because we already
trig an ilb if there is more than 1 busy cpu. If this cpu is the only one that
has a task, we will trig the ilb once for migrating the task.

The nohz_kick_needed function has been cleaned up a bit while adding the new
test

env.src_cpu and env.src_rq must be set unconditionnally because they are used
in need_active_balance which is called even if busiest->nr_running equals 1

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-12-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:36:06 +01:00
Vincent Guittot
caff37ef96 sched: Add SD_PREFER_SIBLING for SMT level
Add the SD_PREFER_SIBLING flag for SMT level in order to ensure that
the scheduler will place at least one task per core.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-11-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:36:05 +01:00
Vincent Guittot
dc7ff76ead sched: Remove unused struct sched_group_capacity::capacity_orig
The 'struct sched_group_capacity::capacity_orig' field is no longer used
in the scheduler so we can remove it.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425378903-5349-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:36:05 +01:00
Vincent Guittot
ea67821b9a sched: Replace capacity_factor by usage
The scheduler tries to compute how many tasks a group of CPUs can handle by
assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is
SCHED_CAPACITY_SCALE.

'struct sg_lb_stats:group_capacity_factor' divides the capacity of the group
by SCHED_LOAD_SCALE to estimate how many task can run in the group. Then, it
compares this value with the sum of nr_running to decide if the group is
overloaded or not.

But the 'group_capacity_factor' concept is hardly working for SMT systems, it
sometimes works for big cores but fails to do the right thing for little cores.

Below are two examples to illustrate the problem that this patch solves:

1- If the original capacity of a CPU is less than SCHED_CAPACITY_SCALE
   (640 as an example), a group of 3 CPUS will have a max capacity_factor of 2
   (div_round_closest(3x640/1024) = 2) which means that it will be seen as
   overloaded even if we have only one task per CPU.

2 - If the original capacity of a CPU is greater than SCHED_CAPACITY_SCALE
   (1512 as an example), a group of 4 CPUs will have a capacity_factor of 4
   (at max and thanks to the fix [0] for SMT system that prevent the apparition
   of ghost CPUs) but if one CPU is fully used by rt tasks (and its capacity is
   reduced to nearly nothing), the capacity factor of the group will still be 4
   (div_round_closest(3*1512/1024) = 5 which is cap to 4 with [0]).

So, this patch tries to solve this issue by removing capacity_factor and
replacing it with the 2 following metrics:

  - The available CPU's capacity for CFS tasks which is already used by
    load_balance().

  - The usage of the CPU by the CFS tasks. For the latter, utilization_avg_contrib
    has been re-introduced to compute the usage of a CPU by CFS tasks.

'group_capacity_factor' and 'group_has_free_capacity' has been removed and replaced
by 'group_no_capacity'. We compare the number of task with the number of CPUs and
we evaluate the level of utilization of the CPUs to define if a group is
overloaded or if a group has capacity to handle more tasks.

For SD_PREFER_SIBLING, a group is tagged overloaded if it has more than 1 task
so it will be selected in priority (among the overloaded groups). Since [1],
SD_PREFER_SIBLING is no more concerned by the computation of 'load_above_capacity'
because local is not overloaded.

[1] 9a5d9ba6a3 ("sched/fair: Allow calculate_imbalance() to move idle cpus")

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1425052454-25797-9-git-send-email-vincent.guittot@linaro.org
[ Tidied up the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:36:04 +01:00
Vincent Guittot
8bb5b00c2f sched: Calculate CPU's usage statistic and put it into struct sg_lb_stats::group_usage
Monitor the usage level of each group of each sched_domain level. The usage is
the portion of cpu_capacity_orig that is currently used on a CPU or group of
CPUs. We use the utilization_load_avg to evaluate the usage level of each
group.

The utilization_load_avg only takes into account the running time of the CFS
tasks on a CPU with a maximum value of SCHED_LOAD_SCALE when the CPU is fully
utilized. Nevertheless, we must cap utilization_load_avg which can be
temporally greater than SCHED_LOAD_SCALE after the migration of a task on this
CPU and until the metrics are stabilized.

The utilization_load_avg is in the range [0..SCHED_LOAD_SCALE] to reflect the
running load on the CPU whereas the available capacity for the CFS task is in
the range [0..cpu_capacity_orig]. In order to test if a CPU is fully utilized
by CFS tasks, we have to scale the utilization in the cpu_capacity_orig range
of the CPU to get the usage of the latter. The usage can then be compared with
the available capacity (ie cpu_capacity) to deduct the usage level of a CPU.

The frequency scaling invariance of the usage is not taken into account in this
patch, it will be solved in another patch which will deal with frequency
scaling invariance on the utilization_load_avg.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425455327-13508-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:36:03 +01:00
Vincent Guittot
ca6d75e690 sched: Add struct rq::cpu_capacity_orig
This new field 'cpu_capacity_orig' reflects the original capacity of a CPU
before being altered by rt tasks and/or IRQ

The cpu_capacity_orig will be used:

  - to detect when the capacity of a CPU has been noticeably reduced so we can
    trig load balance to look for a CPU with better capacity. As an example, we
    can detect when a CPU handles a significant amount of irq
    (with CONFIG_IRQ_TIME_ACCOUNTING) but this CPU is seen as an idle CPU by
    scheduler whereas CPUs, which are really idle, are available.

  - evaluate the available capacity for CFS tasks

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-7-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:36:02 +01:00
Vincent Guittot
b5b4860d1d sched: Make scale_rt invariant with frequency
The average running time of RT tasks is used to estimate the remaining compute
capacity for CFS tasks. This remaining capacity is the original capacity scaled
down by a factor (aka scale_rt_capacity). This estimation of available capacity
must also be invariant with frequency scaling.

A frequency scaling factor is applied on the running time of the RT tasks for
computing scale_rt_capacity.

In sched_rt_avg_update(), we now scale the RT execution time like below:

  rq->rt_avg += rt_delta * arch_scale_freq_capacity() >> SCHED_CAPACITY_SHIFT

Then, scale_rt_capacity can be summarized by:

  scale_rt_capacity = SCHED_CAPACITY_SCALE * available / total

with available = total - rq->rt_avg

This has been been optimized in current code by:

  scale_rt_capacity = available / (total >> SCHED_CAPACITY_SHIFT)

But we can also developed the equation like below:

  scale_rt_capacity = SCHED_CAPACITY_SCALE - ((rq->rt_avg << SCHED_CAPACITY_SHIFT) / total)

and we can optimize the equation by removing SCHED_CAPACITY_SHIFT shift in
the computation of rq->rt_avg and scale_rt_capacity().

so rq->rt_avg += rt_delta * arch_scale_freq_capacity()
and
scale_rt_capacity = SCHED_CAPACITY_SCALE - (rq->rt_avg / total)

arch_scale_frequency_capacity() will be called in the hot path of the scheduler
which implies to have a short and efficient function.

As an example, arch_scale_frequency_capacity() should return a cached value that
is updated periodically outside of the hot path.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-6-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:36:01 +01:00
Morten Rasmussen
0c1dc6b27d sched: Make sched entity usage tracking scale-invariant
Apply frequency scale-invariance correction factor to usage tracking.

Each segment of the running_avg_sum geometric series is now scaled by the
current frequency so the utilization_avg_contrib of each entity will be
invariant with frequency scaling.

As a result, utilization_load_avg which is the sum of utilization_avg_contrib,
becomes invariant too. So the usage level that is returned by get_cpu_usage(),
stays relative to the max frequency as the cpu_capacity which is is compared against.

Then, we want the keep the load tracking values in a 32-bit type, which implies
that the max value of {runnable|running}_avg_sum must be lower than
2^32/88761=48388 (88761 is the max weigth of a task). As LOAD_AVG_MAX = 47742,
arch_scale_freq_capacity() must return a value less than
(48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_SCALE_CAPACITY = 1024).
So we define the range to [0..SCHED_SCALE_CAPACITY] in order to avoid overflow.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Morten.Rasmussen@arm.com
Cc: Paul Turner <pjt@google.com>
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425455186-13451-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:36:00 +01:00
Vincent Guittot
a8faa8f55d sched: Remove frequency scaling from cpu_capacity
Now that arch_scale_cpu_capacity has been introduced to scale the original
capacity, the arch_scale_freq_capacity is no longer used (it was
previously used by ARM arch).

Remove arch_scale_freq_capacity from the computation of cpu_capacity.
The frequency invariance will be handled in the load tracking and not in
the CPU capacity. arch_scale_freq_capacity will be revisited for scaling
load with the current frequency of the CPUs in a later patch.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:35:59 +01:00
Morten Rasmussen
21f4486630 sched: Track group sched_entity usage contributions
Add usage contribution tracking for group entities. Unlike
se->avg.load_avg_contrib, se->avg.utilization_avg_contrib for group
entities is the sum of se->avg.utilization_avg_contrib for all entities on the
group runqueue.

It is _not_ influenced in any way by the task group h_load. Hence it is
representing the actual cpu usage of the group, not its intended load
contribution which may differ significantly from the utilization on
lightly utilized systems.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Morten.Rasmussen@arm.com
Cc: Paul Turner <pjt@google.com>
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:35:58 +01:00
Vincent Guittot
36ee28e45d sched: Add sched_avg::utilization_avg_contrib
Add new statistics which reflect the average time a task is running on the CPU
and the sum of these running time of the tasks on a runqueue. The latter is
named utilization_load_avg.

This patch is based on the usage metric that was proposed in the 1st
versions of the per-entity load tracking patchset by Paul Turner
<pjt@google.com> but that has be removed afterwards. This version differs from
the original one in the sense that it's not linked to task_group.

The rq's utilization_load_avg will be used to check if a rq is overloaded or
not instead of trying to compute how many tasks a group of CPUs can handle.

Rename runnable_avg_period into avg_period as it is now used with both
runnable_avg_sum and running_avg_sum.

Add some descriptions of the variables to explain their differences.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Morten.Rasmussen@arm.com
Cc: Paul Turner <pjt@google.com>
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:35:57 +01:00
Mel Gorman
074c238177 mm: numa: slow PTE scan rate if migration failures occur
Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226

  Across the board the 4.0-rc1 numbers are much slower, and the degradation
  is far worse when using the large memory footprint configs. Perf points
  straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:

   -   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
      - default_send_IPI_mask_sequence_phys
         - 99.99% physflat_send_IPI_mask
            - 99.37% native_send_call_func_ipi
                 smp_call_function_many
               - native_flush_tlb_others
                  - 99.85% flush_tlb_page
                       ptep_clear_flush
                       try_to_unmap_one
                       rmap_walk
                       try_to_unmap
                       migrate_pages
                       migrate_misplaced_page
                     - handle_mm_fault
                        - 99.73% __do_page_fault
                             trace_do_page_fault
                             do_async_page_fault
                           + async_page_fault
              0.63% native_send_call_func_single_ipi
                 generic_exec_single
                 smp_call_function_single

This is showing excessive migration activity even though excessive
migrations are meant to get throttled.  Normally, the scan rate is tuned
on a per-task basis depending on the locality of faults.  However, if
migrations fail for any reason then the PTE scanner may scan faster if
the faults continue to be remote.  This means there is higher system CPU
overhead and fault trapping at exactly the time we know that migrations
cannot happen.  This patch tracks when migration failures occur and
slows the PTE scanner.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Dave Chinner <david@fromorbit.com>
Tested-by: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-25 16:20:31 -07:00
Tom(JeHyeon) Yeon
e6beaa363d locking/rtmutex: Rename argument in the rt_mutex_adjust_prio_chain() documentation as well
The following commit changed "deadlock_detect" to "chwalk":

   8930ed80f9 ("rtmutex: Cleanup deadlock detector debug logic")

do that rename in the function's documentation as well.

Signed-off-by: Tom(JeHyeon) Yeon <tom.yeon@windriver.com>
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1426655010-31651-1-git-send-email-tom.yeon@windriver.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-25 13:43:55 +01:00
Marcelo Tosatti
0a4e6be9ca x86: kvm: Revert "remove sched notifier for cross-cpu migrations"
The following point:

    2. per-CPU pvclock time info is updated if the
       underlying CPU changes.

Is not true anymore since "KVM: x86: update pvclock area conditionally,
on cpu migration".

Add task migration notification back.

Problem noticed by Andy Lutomirski.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
CC: stable@kernel.org # 3.11+
2015-03-23 20:22:48 -03:00
Steven Rostedt
b6366f048e sched/rt: Use IPI to trigger RT task push migration instead of pulling
When debugging the latencies on a 40 core box, where we hit 300 to
500 microsecond latencies, I found there was a huge contention on the
runqueue locks.

Investigating it further, running ftrace, I found that it was due to
the pulling of RT tasks.

The test that was run was the following:

 cyclictest --numa -p95 -m -d0 -i100

This created a thread on each CPU, that would set its wakeup in iterations
of 100 microseconds. The -d0 means that all the threads had the same
interval (100us). Each thread sleeps for 100us and wakes up and measures
its latencies.

cyclictest is maintained at:
 git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git

What happened was another RT task would be scheduled on one of the CPUs
that was running our test, when the other CPU tests went to sleep and
scheduled idle. This caused the "pull" operation to execute on all
these CPUs. Each one of these saw the RT task that was overloaded on
the CPU of the test that was still running, and each one tried
to grab that task in a thundering herd way.

To grab the task, each thread would do a double rq lock grab, grabbing
its own lock as well as the rq of the overloaded CPU. As the sched
domains on this box was rather flat for its size, I saw up to 12 CPUs
block on this lock at once. This caused a ripple affect with the
rq locks especially since the taking was done via a double rq lock, which
means that several of the CPUs had their own rq locks held while trying
to take this rq lock. As these locks were blocked, any wakeups or load
balanceing on these CPUs would also block on these locks, and the wait
time escalated.

I've tried various methods to lessen the load, but things like an
atomic counter to only let one CPU grab the task wont work, because
the task may have a limited affinity, and we may pick the wrong
CPU to take that lock and do the pull, to only find out that the
CPU we picked isn't in the task's affinity.

Instead of doing the PULL, I now have the CPUs that want the pull to
send over an IPI to the overloaded CPU, and let that CPU pick what
CPU to push the task to. No more need to grab the rq lock, and the
push/pull algorithm still works fine.

With this patch, the latency dropped to just 150us over a 20 hour run.
Without the patch, the huge latencies would trigger in seconds.

I've created a new sched feature called RT_PUSH_IPI, which is enabled
by default.

When RT_PUSH_IPI is not enabled, the old method of grabbing the rq locks
and having the pulling CPU do the work is implemented. When RT_PUSH_IPI
is enabled, the IPI is sent to the overloaded CPU to do a push.

To enabled or disable this at run time:

 # mount -t debugfs nodev /sys/kernel/debug
 # echo RT_PUSH_IPI > /sys/kernel/debug/sched_features
or
 # echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features

Update: This original patch would send an IPI to all CPUs in the RT overload
list. But that could theoretically cause the reverse issue. That is, there
could be lots of overloaded RT queues and one CPU lowers its priority. It would
then send an IPI to all the overloaded RT queues and they could then all try
to grab the rq lock of the CPU lowering its priority, and then we have the
same problem.

The latest design sends out only one IPI to the first overloaded CPU. It tries to
push any tasks that it can, and then looks for the next overloaded CPU that can
push to the source CPU. The IPIs stop when all overloaded CPUs that have pushable
tasks that have priorities greater than the source CPU are covered. In case the
source CPU lowers its priority again, a flag is set to tell the IPI traversal to
restart with the first RT overloaded CPU after the source CPU.

Parts-suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joern Engel <joern@purestorage.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150318144946.2f3cc982@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-23 10:55:22 +01:00
Ingo Molnar
e1b63dec2d Merge branch 'sched/urgent' into sched/core, to pick up fixes before applying new patches
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-23 10:50:29 +01:00
Preeti U Murthy
a127d2bcf1 timers/tick/broadcast-hrtimer: Fix suspicious RCU usage in idle loop
The hrtimer mode of broadcast queues hrtimers in the idle entry
path so as to wakeup cpus in deep idle states. The associated
call graph is :

	cpuidle_idle_call()
	|____ clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, ....))
	     |_____tick_broadcast_set_event()
		   |____clockevents_program_event()
			|____bc_set_next()

The hrtimer_{start/cancel} functions call into tracing which uses RCU.
But it is not legal to call into RCU in cpuidle because it is one of the
quiescent states. Hence protect this region with RCU_NONIDLE which informs
RCU that the cpu is momentarily non-idle.

As an aside it is helpful to point out that the clock event device that is
programmed here is not a per-cpu clock device; it is a
pseudo clock device, used by the broadcast framework alone.
The per-cpu clock device programming never goes through bc_set_next().

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: linuxppc-dev@ozlabs.org
Cc: mpe@ellerman.id.au
Cc: tglx@linutronix.de
Link: http://lkml.kernel.org/r/20150318104705.17763.56668.stgit@preeti.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-23 10:50:05 +01:00
Peter Zijlstra
35a9393c95 lockdep: Fix the module unload key range freeing logic
Module unload calls lockdep_free_key_range(), which removes entries
from the data structures. Most of the lockdep code OTOH assumes the
data structures are append only; in specific see the comments in
add_lock_to_list() and look_up_lock_class().

Clearly this has only worked by accident; make it work proper. The
actual scenario to make it go boom would involve the memory freed by
the module unlock being re-allocated and re-used for a lock inside of
a rcu-sched grace period. This is a very unlikely scenario, still
better plug the hole.

Use RCU list iteration in all places and ammend the comments.

Change lockdep_free_key_range() to issue a sync_sched() between
removal from the lists and returning -- which results in the memory
being freed. Further ensure the callers are placed correctly and
comment the requirements.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Tsyvarev <tsyvarev@ispras.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-23 10:49:07 +01:00
Brian Silverman
746db9443e sched: Fix RLIMIT_RTTIME when PI-boosting to RT
When non-realtime tasks get priority-inheritance boosted to a realtime
scheduling class, RLIMIT_RTTIME starts to apply to them. However, the
counter used for checking this (the same one used for SCHED_RR
timeslices) was not getting reset. This meant that tasks running with a
non-realtime scheduling class which are repeatedly boosted to a realtime
one, but never block while they are running realtime, eventually hit the
timeout without ever running for a time over the limit. This patch
resets the realtime timeslice counter when un-PI-boosting from an RT to
a non-RT scheduling class.

I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT
mutex which induces priority boosting and spins while boosted that gets
killed by a SIGXCPU on non-fixed kernels but doesn't with this patch
applied. It happens much faster with a CONFIG_PREEMPT_RT kernel, and
does happen eventually with PREEMPT_VOLUNTARY kernels.

Signed-off-by: Brian Silverman <brian@peloton-tech.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: austin@peloton-tech.com
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/1424305436-6716-1-git-send-email-brian@peloton-tech.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-23 10:47:55 +01:00
Peter Zijlstra
d525211f9d perf: Fix irq_work 'tail' recursion
Vince reported a watchdog lockup like:

	[<ffffffff8115e114>] perf_tp_event+0xc4/0x210
	[<ffffffff810b4f8a>] perf_trace_lock+0x12a/0x160
	[<ffffffff810b7f10>] lock_release+0x130/0x260
	[<ffffffff816c7474>] _raw_spin_unlock_irqrestore+0x24/0x40
	[<ffffffff8107bb4d>] do_send_sig_info+0x5d/0x80
	[<ffffffff811f69df>] send_sigio_to_task+0x12f/0x1a0
	[<ffffffff811f71ce>] send_sigio+0xae/0x100
	[<ffffffff811f72b7>] kill_fasync+0x97/0xf0
	[<ffffffff8115d0b4>] perf_event_wakeup+0xd4/0xf0
	[<ffffffff8115d103>] perf_pending_event+0x33/0x60
	[<ffffffff8114e3fc>] irq_work_run_list+0x4c/0x80
	[<ffffffff8114e448>] irq_work_run+0x18/0x40
	[<ffffffff810196af>] smp_trace_irq_work_interrupt+0x3f/0xc0
	[<ffffffff816c99bd>] trace_irq_work_interrupt+0x6d/0x80

Which is caused by an irq_work generating new irq_work and therefore
not allowing forward progress.

This happens because processing the perf irq_work triggers another
perf event (tracepoint stuff) which in turn generates an irq_work ad
infinitum.

Avoid this by raising the recursion counter in the irq_work -- which
effectively disables all software events (including tracepoints) from
actually triggering again.

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20150219170311.GH21418@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-23 10:46:32 +01:00
Linus Torvalds
da11508eb0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching
Pull livepatching fix from Jiri Kosina:

 - fix for potential race with module loading, from Petr Mladek.

   The race is very unlikely to be seen in real world and has been found
   by code inspection, but should be fixed for 4.0 anyway.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
  livepatch: Fix subtle race with coming and going modules
2015-03-18 10:46:39 -07:00
Linus Torvalds
13326e5a62 Merge branches 'perf-urgent-for-linus' and 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf and timer fixes from Ingo Molnar:
 "Two small perf fixes:
   - kernel side context leak fix
   - tooling crash fix

  And two clocksource driver fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix context leak in put_event()
  perf annotate: Fix fallback to unparsed disassembler line

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clockevents: sun5i: Fix setup_irq init sequence
  clocksource: efm32: Fix a NULL pointer dereference
2015-03-17 13:22:29 -07:00
Theodore Ts'o
1efff914af fs: add dirtytime_expire_seconds sysctl
Add a tuning knob so we can adjust the dirtytime expiration timeout,
which is very useful for testing lazytime.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2015-03-17 12:23:32 -04:00
Petr Mladek
8cb2c2dc47 livepatch: Fix subtle race with coming and going modules
There is a notifier that handles live patches for coming and going modules.
It takes klp_mutex lock to avoid races with coming and going patches but
it does not keep the lock all the time. Therefore the following races are
possible:

  1. The notifier is called sometime in STATE_MODULE_COMING. The module
     is visible by find_module() in this state all the time. It means that
     new patch can be registered and enabled even before the notifier is
     called. It might create wrong order of stacked patches, see below
     for an example.

   2. New patch could still see the module in the GOING state even after
      the notifier has been called. It will try to initialize the related
      object structures but the module could disappear at any time. There
      will stay mess in the structures. It might even cause an invalid
      memory access.

This patch solves the problem by adding a boolean variable into struct module.
The value is true after the coming and before the going handler is called.
New patches need to be applied when the value is true and they need to ignore
the module when the value is false.

Note that we need to know state of all modules on the system. The races are
related to new patches. Therefore we do not know what modules will get
patched.

Also note that we could not simply ignore going modules. The code from the
module could be called even in the GOING state until mod->exit() finishes.
If we start supporting patches with semantic changes between function
calls, we need to apply new patches to any still usable code.
See below for an example.

Finally note that the patch solves only the situation when a new patch is
registered. There are no such problems when the patch is being removed.
It does not matter who disable the patch first, whether the normal
disable_patch() or the module notifier. There is nothing to do
once the patch is disabled.

Alternative solutions:
======================

+ reject new patches when a patched module is coming or going; this is ugly

+ wait with adding new patch until the module leaves the COMING and GOING
  states; this might be dangerous and complicated; we would need to release
  kgr_lock in the middle of the patch registration to avoid a deadlock
  with the coming and going handlers; also we might need a waitqueue for
  each module which seems to be even bigger overhead than the boolean

+ stop modules from entering COMING and GOING states; wait until modules
  leave these states when they are already there; looks complicated; we would
  need to ignore the module that asked to stop the others to avoid a deadlock;
  also it is unclear what to do when two modules asked to stop others and
  both are in COMING state (situation when two new patches are applied)

+ always register/enable new patches and fix up the potential mess (registered
  patches order) in klp_module_init(); this is nasty and prone to regressions
  in the future development

+ add another MODULE_STATE where the kallsyms are visible but the module is not
  used yet; this looks too complex; the module states are checked on "many"
  locations

Example of patch stacking breakage:
===================================

The notifier could _not_ _simply_ ignore already initialized module objects.
For example, let's have three patches (P1, P2, P3) for functions a() and b()
where a() is from vmcore and b() is from a module M. Something like:

	a()	b()
P1	a1()	b1()
P2	a2()	b2()
P3	a3()	b3(3)

If you load the module M after all patches are registered and enabled.
The ftrace ops for function a() and b() has listed the functions in this
order:

	ops_a->func_stack -> list(a3,a2,a1)
	ops_b->func_stack -> list(b3,b2,b1)

, so the pointer to b3() is the first and will be used.

Then you might have the following scenario. Let's start with state when patches
P1 and P2 are registered and enabled but the module M is not loaded. Then ftrace
ops for b() does not exist. Then we get into the following race:

CPU0					CPU1

load_module(M)

  complete_formation()

  mod->state = MODULE_STATE_COMING;
  mutex_unlock(&module_mutex);

					klp_register_patch(P3);
					klp_enable_patch(P3);

					# STATE 1

  klp_module_notify(M)
    klp_module_notify_coming(P1);
    klp_module_notify_coming(P2);
    klp_module_notify_coming(P3);

					# STATE 2

The ftrace ops for a() and b() then looks:

  STATE1:

	ops_a->func_stack -> list(a3,a2,a1);
	ops_b->func_stack -> list(b3);

  STATE2:
	ops_a->func_stack -> list(a3,a2,a1);
	ops_b->func_stack -> list(b2,b1,b3);

therefore, b2() is used for the module but a3() is used for vmcore
because they were the last added.

Example of the race with going modules:
=======================================

CPU0					CPU1

delete_module()  #SYSCALL

   try_stop_module()
     mod->state = MODULE_STATE_GOING;

   mutex_unlock(&module_mutex);

					klp_register_patch()
					klp_enable_patch()

					#save place to switch universe

					b()     # from module that is going
					  a()   # from core (patched)

   mod->exit();

Note that the function b() can be called until we call mod->exit().

If we do not apply patch against b() because it is in MODULE_STATE_GOING,
it will call patched a() with modified semantic and things might get wrong.

[jpoimboe@redhat.com: use one boolean instead of two]
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-03-17 10:31:54 +01:00
Leon Yu
d415a7f1c1 perf: Fix context leak in put_event()
Commit:

  a83fe28e2e ("perf: Fix put_event() ctx lock")

changed the locking logic in put_event() by replacing mutex_lock_nested()
with perf_event_ctx_lock_nested(), but didn't fix the subsequent
mutex_unlock() with a correct counterpart, perf_event_ctx_unlock().

Contexts are thus leaked as a result of incremented refcount
in perf_event_ctx_lock_nested().

Signed-off-by: Leon Yu <chianglungyu@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Fixes: a83fe28e2e ("perf: Fix put_event() ctx lock")
Link: http://lkml.kernel.org/r/1424954613-5034-1-git-send-email-chianglungyu@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-13 10:02:18 +01:00
Andrey Ryabinin
a5af5aa8b6 kasan, module, vmalloc: rework shadow allocation for modules
Current approach in handling shadow memory for modules is broken.

Shadow memory could be freed only after memory shadow corresponds it is no
longer used.  vfree() called from interrupt context could use memory its
freeing to store 'struct llist_node' in it:

    void vfree(const void *addr)
    {
    ...
        if (unlikely(in_interrupt())) {
            struct vfree_deferred *p = this_cpu_ptr(&vfree_deferred);
            if (llist_add((struct llist_node *)addr, &p->list))
                    schedule_work(&p->wq);

Later this list node used in free_work() which actually frees memory.
Currently module_memfree() called in interrupt context will free shadow
before freeing module's memory which could provoke kernel crash.

So shadow memory should be freed after module's memory.  However, such
deallocation order could race with kasan_module_alloc() in module_alloc().

Free shadow right before releasing vm area.  At this point vfree()'d
memory is not used anymore and yet not available for other allocations.
New VM_KASAN flag used to indicate that vm area has dynamically allocated
shadow memory so kasan frees shadow only if it was previously allocated.

Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-12 18:46:08 -07:00