We currently hold the IOMMU spinlock around tce_build and tce_flush.
This causes our spinlock hold times to be much higher than required
and can impact multiqueue adapters.
This patch moves tce_build and tce_flush outside of the lock in
iommu_alloc, and tce_flush outside of the lock in iommu_free.
Some performance numbers were obtained with a Chelsio T3 adapter on
two POWER7 boxes, running a 100 session TCP round robin test.
Performance improved 32% with this patch applied.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
tce_buildmulti_pSeriesLP uses a per cpu page to communicate with the
hypervisor. We currently rely on the IOMMU table spinlock but
subsequent patches will be removing that so disable interrupts
around all accesses of tce_page.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Implement a POWER7 optimised memcpy using VMX and enhanced prefetch
instructions.
This is a copy of the POWER7 optimised copy_to_user/copy_from_user
loop. Detailed implementation and performance details can be found in
commit a66086b819 (powerpc: POWER7 optimised
copy_to_user/copy_from_user using VMX).
I noticed memcpy issues when profiling a RAID6 workload:
.memcpy
.async_memcpy
.async_copy_data
.__raid_run_ops
.handle_stripe
.raid5d
.md_thread
I created a simplified testcase by building a RAID6 array with 4 1GB
ramdisks (booting with brd.rd_size=1048576):
# mdadm -CR -e 1.2 /dev/md0 --level=6 -n4 /dev/ram[0-3]
I then timed how long it took to write to the entire array:
# dd if=/dev/zero of=/dev/md0 bs=1M
Before: 892 MB/s
After: 999 MB/s
A 12% improvement.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Version 2.06 of the POWER ISA introduced enhanced touch instructions,
allowing us to specify a number of attributes including the length of
a stream.
This patch adds a software stream for both loads and stores in the
POWER7 copy_tofrom_user loop. Since the setup is quite complicated
and we have to use an eieio to ensure correct ordering of the "GO"
command we only do this for copies above 4kB.
To quantify any performance improvements we need a working set
bigger than the caches so we operate on a 1GB file:
# dd if=/dev/zero of=/tmp/foo bs=1M count=1024
And we compare how fast we can read the file:
# dd if=/tmp/foo of=/dev/null bs=1M
before: 7.7 GB/s
after: 9.6 GB/s
A 25% improvement.
The worst case for this patch will be a completely L1 cache contained
copy of just over 4kB. We can test this with the copy_to_user
testcase we used to tune copy_tofrom_user originally:
http://ozlabs.org/~anton/junkcode/copy_to_user.c
# time ./copy_to_user2 -l 4224 -i 10000000
before: 6.807 s
after: 6.946 s
A 2% slowdown, which seems reasonable considering our data is unlikely
to be completely L1 contained.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
While creating the PCI root bus through function pci_create_root_bus()
of PCI core, it should have assigned the secondary bus number for the
newly created PCI root bus. Thus we needn't do the explicit assignment
for the secondary bus number again in pcibios_scan_phb().
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The form affinity for NUMA is set to 1 if the firmware supports
OPAL. Otherwise, we have to retrieve that from OF node "/chosen".
For the latter case, OF node "/chosen" reference count was never
decreased.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Implement a POWER7 optimised copy_page using VMX and enhanced
prefetch instructions. We use enhanced prefetch hints to prefetch
both the load and store side. We copy a cacheline at a time and
fall back to regular loads and stores if we are unable to use VMX
(eg we are in an interrupt).
The following microbenchmark was used to assess the impact of
the patch:
http://ozlabs.org/~anton/junkcode/page_fault_file.c
We test MAP_PRIVATE page faults across a 1GB file, 100 times:
# time ./page_fault_file -p -l 1G -i 100
Before: 22.25s
After: 18.89s
17% faster
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Subsequent patches will add more VMX library functions and it makes
sense to keep all the c-code helper functions in the one file.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
mtmsrd is an expensive instruction, we save a few cycles by
doing it once instead of twice.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Version 2.06 of the POWER ISA introduced enhanced touch instructions,
allowing us to specify a number of attributes including the length of
a stream.
This patch adds a software stream for both loads and stores in the
POWER7 copy_tofrom_user loop. Since the setup is quite complicated
and we have to use an eieio to ensure correct ordering of the "GO"
command we only do this for copies above 4kB.
To quantify any performance improvements we need a working set
bigger than the caches so we operate on a 1GB file:
# dd if=/dev/zero of=/tmp/foo bs=1M count=1024
And we compare how fast we can read the file:
# dd if=/tmp/foo of=/dev/null bs=1M
before: 7.7 GB/s
after: 9.6 GB/s
A 25% improvement.
The worst case for this patch will be a completely L1 cache contained
copy of just over 4kB. We can test this with the copy_to_user
testcase we used to tune copy_tofrom_user originally:
http://ozlabs.org/~anton/junkcode/copy_to_user.c
# time ./copy_to_user2 -l 4224 -i 10000000
before: 6.807 s
after: 6.946 s
A 2% slowdown, which seems reasonable considering our data is unlikely
to be completely L1 contained.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
1) call_function.lock used in smp_call_function_many() is just to protect
call_function.queue and &data->refs, cpu_online_mask is outside of the
lock. And it's not necessary to protect cpu_online_mask,
because data->cpumask is pre-calculate and even if a cpu is brougt up
when calling arch_send_call_function_ipi_mask(), it's harmless because
validation test in generic_smp_call_function_interrupt() will take care
of it.
2) For cpu down issue, stop_machine() will guarantee that no concurrent
smp_call_fuction() is processing.
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
I noticed __clear_user high up in a profile of one of my RAID stress
tests. The testcase was doing a dd from /dev/zero which ends up
calling __clear_user.
__clear_user is basically a loop with a single 4 byte store which
is horribly slow. We can do much better by aligning the desination
and doing 32 bytes of 8 byte stores in a loop.
The following testcase was used to verify the patch:
http://ozlabs.org/~anton/junkcode/stress_clear_user.c
To show the improvement in performance I ran a dd from /dev/zero
to /dev/null on a POWER7 box:
Before:
# dd if=/dev/zero of=/dev/null bs=1M count=10000
10485760000 bytes (10 GB) copied, 3.72379 s, 2.8 GB/s
After:
# time dd if=/dev/zero of=/dev/null bs=1M count=10000
10485760000 bytes (10 GB) copied, 0.728318 s, 14.4 GB/s
Over 5x faster.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
irq_entry, irq_exit, timer_interrupt_entry and timer_interrupt_exit
all do the same thing so use DECLARE_EVENT_CLASS to avoid duplicating
everything 4 times.
This saves quite a lot of space in both instruction text and data:
text data bss dec hex filename
9265 19622 16 28903 70e7 arch/powerpc/kernel/irq.o
6817 19019 16 25852 64fc arch/powerpc/kernel/irq.o
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When looking through some instruction traces I noticed our tracepoint
checks were inline. It turns out we don't have CONFIG_JUMP_LABEL
enabled.
By enabling CONFIG_JUMP_LABEL we replace a load/compare/branch with
a nop at every tracepoint call. For example in do_IRQ:
CONFIG_JUMP_LABEL disabled:
stdx 3,11,9
lwz 0,8(29)
cmpwi 7,0,0
bne- 7,.L124
bl .irq_enter
CONFIG_JUMP_LABEL enabled:
stdx 3,11,9
nop
bl .irq_enter
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The following patch is to remove the pseries_notify_add_cpu() call
and replace it by a hot plug notifier.
This would prevent cpuidle resources being released and allocated each
time cpu comes online on pseries.
The earlier design was causing a lockdep problem
in start_secondary as reported on this thread
-https://lkml.org/lkml/2012/5/17/2
This applies on 3.4-rc7
Signed-off-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
An upcoming release of firmware will add DDW extensions, in particular
an API to "reset" the DMA window to the original configuration (32-bit,
2GB in size). With that API available, we can safely remove the default
window, increasing the resources available to firmware for creation of
larger windows for the slot in question -- if we encounter an error, we
can use the new API to reset the state of the slot.
Further, this same release of firmware will make it a hard requirement
for OSes to release the existing window before any other windows will be
shown as available, to avoid conflicts in addressing between the two
windows.
In anticipation of these changes, always remove the default window
before we do any DDW manipulations.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The patch_instruction() interface is made to modify kernel text. It is
safer to use that then the probe_kernel_write() when modifying kernel
code.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
For ftrace to use the patch_instruction code, it needs to check for
faults on write. Ftrace updates code all over the kernel, and we need to
know if code is updated or not due to protections that are placed on
some portions of the kernel. If ftrace does not detect a fault, it will
error later on, and it will be much more difficult to find the problem.
By changing patch_instruction() to detect faults, then ftrace will be
able to make use of it too.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
PowerPC does not have the synchronization issues that x86 has with
modifying code on one CPU while another CPU is executing it.
The other CPU will either see the old or new code without any
issues, unlike x86 which may issue a GPF.
Instead of calling the heavy stop_machine, just update the code.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently we build all board files regardless of the final zImage
target. This is sub-optimal (in terms on compilation) and leads to
problems in one platform needlessly causing failures for other
platforms.
Use the Kconfig variables to selectively construct this board files to
build.
Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pull a couple more powerpc fixes from Benjamin Herrenschmidt:
"Here are two more fixes that I "missed" when scrubbing patchwork last
week which are worth still having in 3.5."
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc/kvm: sldi should be sld
powerpc/xmon: Use cpumask iterator to avoid warning
Aas the flexcan driver has DT binding for some time now. This patch makes
the flexcan driver available in the can drivers menu if a SOC with flexcan
is enabled.
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Acked-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
The recently added SDHI2 interface to kzm9g needs power supplies for the
upcoming transition away from OCR masks, provided in platform data.
Signed-off-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de>
Acked-by: Magnus Damm <damm@opensource.se>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
From Shawn Guo <shawn.guo@linaro.org>, this makes it possible to use
sparse irqs with mach-imx.
* 'imx/sparse-irq' of git://git.linaro.org/people/shawnguo/linux-2.6:
ARM: imx: enable SPARSE_IRQ for imx platform
ARM: fiq: change FIQ_START to a variable
tty: serial: imx: remove the use of MXC_INTERNAL_IRQS
ARM: imx: remove unneeded mach/irq.h inclusion
i2c: imx: remove unneeded mach/irqs.h inclusion
ARM: imx: add a legacy irqdomain for mx31ads
ARM: imx: add a legacy irqdomain for 3ds_debugboard
ARM: imx: pass gpio than irq number into mxc_expio_init
ARM: imx: leave irq_base of wm8350_platform_data uninitialized
dma: ipu: remove the use of ipu_platform_data
ARM: imx: move irq_domain_add_legacy call into avic driver
ARM: imx: move irq_domain_add_legacy call into tzic driver
gpio/mxc: move irq_domain_add_legacy call into gpio driver
ARM: imx: eliminate macro IRQ_GPIOx()
ARM: imx: eliminate macro IOMUX_TO_IRQ()
ARM: imx: eliminate macro IMX_GPIO_TO_IRQ()
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
A subsequent patch will add a generic PWM API driver for the Tegra PWFM
controller, supporting all four PWM devices with a single PWM chip. The
device will be named tegra-pwm and only one clock needs to be provided.
Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>
Signed-off-by: Stephen Warren <swarren@nvidia.com>
PWM clock source registers in Tegra 2 have different clock source selection bit
fields than other registers. PWM clock source bits in CLK_SOURCE_PWM_0 register
are located at bit field bit[30:28] while others are at bit field bit[31:30] in
their respective clock source register.
This patch updates the clock programming to correctly reflect that, by adding a
flag to indicate the alternate bit field format and checking for it when
selecting a clock source (parent clock).
Signed-off-by: Bill Huang <bilhuang@nvidia.com>
Signed-off-by: Simon Que <sque@chromium.org>
Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>
Signed-off-by: Stephen Warren <swarren@nvidia.com>
Move the driver to drivers/pwm/ and convert it to use the framework.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Cc: Ben Dooks <ben-linux@fluff.org>
Cc: Kukjin Kim <kgene.kim@samsung.com>
[eric@eukrea.com: fix pwmchip_add return code test]
Signed-off-by: Eric Bénard <eric@eukrea.com>
Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>
Move the driver to drivers/pwm/ and convert it to use the framework.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
[eric@eukrea.com: set chip.dev to prevent probe failure]
[eric@eukrea.com: fix pwmchip_add return code test]
Signed-off-by: Eric Bénard <eric@eukrea.com>
Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>
This commit moves the PXA PWM driver to the drivers/pwm subdirectory and
converts it to use the new PWM framework.
Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>
This commit moves the Blackfin PWM driver to the drivers/pwm sub-
directory and converts it to register with the new PWM framework.
Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>
Add auxdata to instantiate the PWFM controller from a device tree,
include the corresponding nodes in the dtsi files for Tegra 20 and
Tegra 30 and add binding documentation.
Acked-by: Stephen Warren <swarren@wwwdotorg.org>
Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>
This reverts commit 616c310e83.
(Move PREEMPT_RCU preemption to switch_to() invocation).
Testing by Sasha Levin <levinsasha928@gmail.com> showed that this
can result in deadlock due to invoking the scheduler when one of
the runqueue locks is held. Because this commit was simply a
performance optimization, revert it.
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Sasha Levin <levinsasha928@gmail.com>
The number of lines of AIC5 has increased from 32 to 128. Due to this
increase, a source select register has been introduced for the interrupt
line selection. Moreover, register mapping has been changed. For that reasons,
we need some dedicated callbacks for AIC5.
Power management is also concerned by these changes. On suspend, we can't get
the whole interrupt mask register as before, we have to read this register 128
times. To reduce this overhead, a snapshot of the whole IMR is maintained.
Signed-off-by: Ludovic Desroches <ludovic.desroches@atmel.com>
Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com>
mach/irqs only defines FIQ_START which doesn't appear to be used anywhere
so remove it.
Signed-off-by: Ludovic Desroches <ludovic.desroches@atmel.com>
Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Enable sparse irq support for multisoc image. It involves to add the
NR_IRQS_LEGACY offset to static SoC irq number definitions since NR_IRQS_LEGACY
irq descs are allocated before AIC requests irq descs allocation.
Move NR_AIC_IRQS macro to a more appropiate place with the purpose to
remove mach/irqs.h later.
Signed-off-by: Ludovic Desroches <ludovic.desroches@atmel.com>
Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com>
SOC_AT91SAM9 selects MULTI_IRQ_HANDLER in order to let machines specify their
own IRQ handler at run time.
Signed-off-by: Ludovic Desroches <ludovic.desroches@atmel.com>
Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Since irq priorites are managed in DT, static ones are no more required for
sam9x5 which only has DT support.
Signed-off-by: Ludovic Desroches <ludovic.desroches@atmel.com>
Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com>
The Advanced Interrupt Controller allows us to use the fast EOI handler type.
It lets us remove the Atmel specific workaround into arch/arm/kernel/irq.c
used to indicate to the AIC the end of the interrupt treatment.
Signed-off-by: Ludovic Desroches <ludovic.desroches@atmel.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Commit 82c583e3ae ("AT91RM9200 hardware
headers") introduced arch/arm/mach-at91/include/mach/at91_spi.h and
arch/arm/mach-at91/include/mach/at91_ssc.h. (These files were called
at91rm9200_spi.h and at91rm9200_ssc.h then.) That commit was added in
the v2.6.18 development cycle.
Nothing includes them now and nothing uses the named constants they
provide. (There's no way these named constants could be used unless
these files were included somehow.) It actually seems these two headers
have never been used since they were added to the tree. They can safely
be removed.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com>
The SYS_NIRQ1 pin is the interupt line for the PMIC part of the TWL6030
and interrupts from the PMIC are needed as wakeup sources.
Ensure this pin is mux'd as input and has wakeup enabled so PMIC
interupts (e.g. RTC) can be used as wakeup sources.
Tested on OMAP4430/Panda.
Signed-off-by: Kevin Hilman <khilman@ti.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
In order for suspend/resume dependencies to work correctly, I2C has to
be initialized (more specifically, registered with the driver core)
before MMC. Without this, the MMC driver fails to adjust the VMMC
regulator (using i2c writes) during the suspend path.
Problem found testing suspend/resume on 3730/OveroSTORM platform.
Signed-off-by: Kevin Hilman <khilman@ti.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
Register OMAP DRM/KMS platform device. DMM is split into a
separate device using hwmod.
Signed-off-by: Andy Gross <andy.gross@ti.com>
Signed-off-by: Rob Clark <rob.clark@linaro.org>
Signed-off-by: Tony Lindgren <tony@atomide.com>