Right now, when a negative dentry is opened, several openers can race
to instantiate/rehash the same dentry and hit a BUG_ON in d_add.
But if we got a negative dentry in atomic_open, we have just
revalidated it, so there is no point in talking to the MDS at all:
just return ENOENT and make the race go away completely.
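The early return described above can be sketched in userspace C roughly
as follows. This is only an illustration: struct sketch_dentry,
struct sketch_inode and sketch_atomic_open_precheck() are hypothetical
stand-ins, not the kernel or Lustre definitions, and the real
atomic_open path may check more than this.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct sketch_inode {
	int unused;
};

struct sketch_dentry {
	struct sketch_inode *d_inode;	/* NULL means the dentry is negative */
};

/*
 * Returns 0 if the open should go on to the server, or -ENOENT if it
 * can be answered locally.
 */
static int sketch_atomic_open_precheck(const struct sketch_dentry *dentry)
{
	/*
	 * A negative dentry reaching this point was just revalidated, so
	 * the name is known not to exist: no need to ask the server, and
	 * nobody tries to instantiate the dentry a second time.
	 */
	if (dentry->d_inode == NULL)
		return -ENOENT;

	return 0;
}
```

Because racing openers all take the same early exit, none of them gets
as far as rehashing the dentry.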
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Cc: stable <stable@vger.kernel.org> # 4.7+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull more vfs updates from Al Viro:
"Assorted cleanups and fixes.
In the "trivial API change" department - ->d_compare() losing 'parent'
argument"
* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
cachefiles: Fix race between inactivating and culling a cache object
9p: use clone_fid()
9p: fix braino introduced in "9p: new helper - v9fs_parent_fid()"
vfs: make dentry_needs_remove_privs() internal
vfs: remove file_needs_remove_privs()
vfs: fix deadlock in file_remove_privs() on overlayfs
get rid of 'parent' argument of ->d_compare()
cifs, msdos, vfat, hfs+: don't bother with parent in ->d_compare()
affs ->d_compare(): don't bother with ->d_inode
fold _d_rehash() and __d_rehash() together
fold dentry_rcuwalk_invalidate() into its only remaining caller
Pull qstr constification updates from Al Viro:
"Fairly self-contained bunch - surprising lot of places passes struct
qstr * as an argument when const struct qstr * would suffice; it
complicates analysis for no good reason.
I'd prefer to feed that separately from the assorted fixes (those are
in #for-linus and with somewhat trickier topology)"
* 'work.const-qstr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
qstr: constify instances in adfs
qstr: constify instances in lustre
qstr: constify instances in f2fs
qstr: constify instances in ext2
qstr: constify instances in vfat
qstr: constify instances in procfs
qstr: constify instances in fuse
qstr constify instances in fs/dcache.c
qstr: constify instances in nfs
qstr: constify instances in ocfs2
qstr: constify instances in autofs4
qstr: constify instances in hfs
qstr: constify instances in hfsplus
qstr: constify instances in logfs
qstr: constify dentry_init_security
Pull userns vfs updates from Eric Biederman:
"This tree contains some very long awaited work on generalizing the
user namespace support for mounting filesystems to include filesystems
with a backing store. The real world target is fuse but the goal is
to update the vfs to allow any filesystem to be supported. This
patchset is based on a lot of code review and testing to approach that
goal.
While looking at what is needed to support the fuse filesystem it
became clear that there were things like xattrs for security modules
that needed special treatment. That the resolution of those concerns
would not be fuse specific. That sorting out these general issues
made most sense at the generic level, where the right people could be
drawn into the conversation, and the issues could be solved for
everyone.
At a high level this patchset does a couple of simple things:
- Add a user namespace owner (s_user_ns) to struct super_block.
- Teach the vfs to handle filesystem uids and gids not mapping into
kuids and kgids and being reported as INVALID_UID and
INVALID_GID in vfs data structures.
By assigning a user namespace owner filesystems that are mounted with
only user namespace privilege can be detected. This allows security
modules and the like to know which mounts may not be trusted. This
also allows the set of uids and gids that are communicated to the
filesystem to be capped at the set of kuids and kgids that are in the
owning user namespace of the filesystem.
One of the crazier corner cases this handles is the case of inodes
whose i_uid or i_gid are not mapped into the vfs. Most of the code
simply doesn't care but it is easy to confuse the inode writeback path
so no operation that could cause an inode write-back is permitted for
such inodes (aka only reads are allowed).
This set of changes starts out by cleaning up the code paths involved
in user namespace permitted mounts. Then when things are clean enough
adds code that cleanly sets s_user_ns. Then additional restrictions
are added that are possible now that the filesystem superblock
contains owner information.
These changes should not affect anyone in practice, but there are some
parts of these restrictions that are changes in behavior.
- Andy's restriction on suid executables that does not honor the
suid bit when the path is from another mount namespace (think
/proc/[pid]/fd/) or when the filesystem was mounted by a less
privileged user.
- The replacement of the user namespace implicit setting of MNT_NODEV
with implicitly setting SB_I_NODEV on the filesystem superblock
instead.
Using SB_I_NODEV is a stronger form that happens to make this state
user invisible. The user visibility can be managed but it caused
problems when it was introduced from applications reasonably
expecting mount flags to be what they were set to.
There is a little bit of work remaining before it is safe to support
mounting filesystems with backing store in user namespaces, beyond
what is in this set of changes.
- Verifying the mounter has permission to read/write the block device
during mount.
- Teaching the integrity modules IMA and EVM to handle filesystems
mounted with only user namespace root and to reduce trust in their
security xattrs accordingly.
- Capturing the mounter's credentials and using that for permission
checks in d_automount and the like. (Given that overlayfs already
does this, and we need the work in d_automount it makes sense to
generalize this case).
Furthermore there are a few changes that are on the wishlist:
- Get all filesystems supporting posix acls using the generic posix
acls so that posix_acl_fix_xattr_from_user and
posix_acl_fix_xattr_to_user may be removed. [Maintainability]
- Reducing the permission checks in places such as remount to allow
the superblock owner to perform them.
- Allowing the superblock owner to chown files with unmapped uids and
gids to something that is mapped so the files may be treated
normally.
I am not considering even obvious relaxations of permission checks
until it is clear there are no more corner cases that need to be
locked down and handled generically.
Many thanks to Seth Forshee who kept this code alive, for putting up
with me rewriting substantial portions of what he did to handle more
corner cases, and for his diligent testing and reviewing of my
changes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (30 commits)
fs: Call d_automount with the filesystems creds
fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
evm: Translate user/group ids relative to s_user_ns when computing HMAC
dquot: For now explicitly don't support filesystems outside of init_user_ns
quota: Handle quota data stored in s_user_ns in quota_setxquota
quota: Ensure qids map to the filesystem
vfs: Don't create inodes with a uid or gid unknown to the vfs
vfs: Don't modify inodes with a uid or gid unknown to the vfs
cred: Reject inodes with invalid ids in set_create_file_as()
fs: Check for invalid i_uid in may_follow_link()
vfs: Verify acls are valid within superblock's s_user_ns.
userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
fs: Refuse uid/gid changes which don't map into s_user_ns
selinux: Add support for unprivileged mounts from user namespaces
Smack: Handle labels consistently in untrusted mounts
Smack: Add support for unprivileged mounts from user namespaces
fs: Treat foreign mounts as nosuid
fs: Limit file caps to the user namespace of the super block
userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag
userns: Remove implicit MNT_NODEV fragility.
...
Merge more updates from Andrew Morton:
"The rest of MM"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (101 commits)
mm, compaction: simplify contended compaction handling
mm, compaction: introduce direct compaction priority
mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
mm, page_alloc: make THP-specific decisions more generic
mm, page_alloc: restructure direct compaction handling in slowpath
mm, page_alloc: don't retry initial attempt in slowpath
mm, page_alloc: set alloc_flags only once in slowpath
lib/stackdepot.c: use __GFP_NOWARN for stack allocations
mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB
mm, kasan: account for object redzone in SLUB's nearest_obj()
mm: fix use-after-free if memory allocation failed in vma_adjust()
zsmalloc: Delete an unnecessary check before the function call "iput"
mm/memblock.c: fix index adjustment error in __next_mem_range_rev()
mem-hotplug: alloc new page from a nearest neighbor node when mem-offline
mm: optimize copy_page_to/from_iter_iovec
mm: add cond_resched() to generic_swapfile_activate()
Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements"
mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode
mm: hwpoison: remove incorrect comments
make __section_nr() more efficient
...
This changes the vfs dentry hashing to mix in the parent pointer at the
_beginning_ of the hash, rather than at the end.
That actually improves both the hash and the code generation, because we
can move more of the computation to the "static" part of the dcache
setup, and do less at lookup runtime.
It turns out that a lot of other hash users also really wanted to mix in
a base pointer as a 'salt' for the hash, and so the slightly extended
interface ends up working well for other cases too.
Users that want a string hash that is purely about the string pass in a
'salt' pointer of NULL.
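A minimal userspace sketch of the idea, assuming a made-up mixing step:
the salt pointer seeds the running hash before any name byte is folded
in, so a NULL salt yields a plain string hash. sketch_salted_hash() and
its multiplier are illustrative only and are not the kernel's hash
function.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative salted string hash. The salt pointer is mixed in at the
 * beginning, then each byte of the name is folded into the state.
 * The multiplier is the 64-bit FNV prime, chosen only because it is
 * odd, which keeps each step a one-to-one mapping on 64-bit values.
 */
static uint64_t sketch_salted_hash(const void *salt, const char *name,
				   size_t len)
{
	uint64_t hash = (uint64_t)(uintptr_t)salt;	/* salt goes in first */
	size_t i;

	for (i = 0; i < len; i++)
		hash = (hash + (unsigned char)name[i]) * 1099511628211ULL;

	return hash;
}
```

Since the salt is the initial state, a caller that hashes many names
against the same base pointer could compute that starting value once
and reuse it.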
* merge branch 'salted-string-hash':
fs/dcache.c: Save one 32-bit multiply in dcache lookup
vfs: make the string hashes salt the hash
Update posix_acl_valid to verify that an acl is within a user namespace.
Update the callers of posix_acl_valid to pass in an appropriate
user namespace. For posix_acl_xattr_set and v9fs_xattr_set_acl pass in
inode->i_sb->s_user_ns to posix_acl_valid. For md_unpack_acl pass in
&init_user_ns as no inode or superblock is in sight.
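The namespace part of the check can be pictured with the userspace
sketch below. Everything in it is an assumption made for illustration:
struct sketch_user_ns models a namespace as a single contiguous id
range, and sketch_acl_valid() only tests that every entry's id maps
into that range. It does not reproduce the other checks done by
posix_acl_valid.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: a namespace maps one contiguous range of ids. */
struct sketch_user_ns {
	uint32_t first;
	uint32_t count;
};

/* Hypothetical ACL entry carrying only the id it refers to. */
struct sketch_acl_entry {
	uint32_t id;
};

static bool sketch_id_has_mapping(const struct sketch_user_ns *ns,
				  uint32_t id)
{
	return id >= ns->first && id - ns->first < ns->count;
}

/* An ACL is accepted only if every id in it maps into the namespace. */
static bool sketch_acl_valid(const struct sketch_user_ns *ns,
			     const struct sketch_acl_entry *entries,
			     size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (!sketch_id_has_mapping(ns, entries[i].id))
			return false;

	return true;
}
```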
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Lockdep complains about potential recursive locking during mount
because the client configuration log is holding a lock on the MGC
obd_device to prevent it from being torn down, while also getting
mutexes on the MDC and OSC devices as they are instantiated:
Lustre: Mounted myth-client
=============================================
[ INFO: possible recursive locking detected ]
4.7.0-rc2-vm-nfs+ #127 Tainted: G C
---------------------------------------------
May be due to missing lock nesting notation
2 locks held by ll_cfg_requeue/5928:
#0: (&cli->cl_sem){.+.+.+}, at: mgc_requeue_thread+0x15d/0x730 [mgc]
#1: (&cld->cld_lock){+.+.+.}, at: mgc_process_log+0x5e/0xf80 [mgc]
CPU: 0 PID: 5928 Comm: ll_cfg_requeue
Call Trace:
[<ffffffff814a0855>] dump_stack+0x86/0xc1
[<ffffffff810e7766>] __lock_acquire+0x726/0x1210
[<ffffffff810e86be>] lock_acquire+0xfe/0x1f0
[<ffffffff81888171>] down_read+0x51/0xa0
[<ffffffffa04a8477>] sptlrpc_conf_client_adapt+0x47/0x150 [ptlrpc]
[<ffffffffa0186b16>] mdc_set_info_async+0x2b6/0x470 [mdc]
[<ffffffffa0294090>] class_notify_sptlrpc_conf+0x190/0x360 [obdclass]
[<ffffffffa01a9e85>] mgc_process_log+0x925/0xf80 [mgc]
[<ffffffffa01abafa>] mgc_requeue_thread+0x1fa/0x730 [mgc]
[<ffffffff810af331>] kthread+0x101/0x120
[<ffffffff8188ad6f>] ret_from_fork+0x1f/0x40
Add a separate lock class for the MGC callpath, since it will always
be held first, and none of the other obd_device locks should ever
be held concurrently.
Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
kthread_run might sleep during an allocation, and so
it's considered unsafe to call with a state that's not
RUNNABLE.
Move the state setting to after kthread_run call.
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ac5b148109 ("staging: lustre: osc: Track and limit
"unstable" pages") added a new sysfs variable, but the corresponding bit of
documentation was forgotten.
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There are a couple of cases in ll_revalidate_dentry() where
we are pretty sure the dentry is valid, so check for them early
and save more expensive checks for later.
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mark dentries that came to us via NFS in a special way so that
we can tell them apart during open and activate open cache
(we really don't want to do open/close RPC for every NFS IO).
This became needed since dentry revalidate no longer reimplements
any RPCs for lookup, and as such if a dentry is valid,
ll_revalidate_dentry returns 1 and ll_lookup_it() is never visited
during opens, so we get straight into ll_file_open() without a valid
intent/RPC. This used to be only true for NFS, so opencache was
engaged needlessly, and it carries a cost of its own if there is
in fact no repetitive file opening-closing going on.
Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>
Reviewed-on: http://review.whamcloud.com/20354
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-8019
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Li Xi <lixi@ddn.com>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
They are just one-liners, so no point in having them exported
and called through a different module.
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The lli_trunc_sem is taken in 'read' mode in both
ll_page_mkwrite and vvp_io_fault_start. This can lead to a
deadlock with another thread which asks for the semaphore
in write mode between these two read calls.
Since all users of lli_trunc_sem are in the vvp layer, we
can satisfy the requirement to exclude truncate by taking
the semaphore only in vvp_io_fault_start.
Signed-off-by: Patrick Farrell <paf@cray.com>
Reviewed-on: http://review.whamcloud.com/19315
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-7981
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Reviewed-by: Andriy Skulysh <andriy.skulysh@seagate.com>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The reverse order of request_out_callback() and reply_in_callback()
puts the RPC into UNREGISTERING state, which is waiting for RPC &
bulk md unlink, whereas only RPC md unlink has been called so far.
If bulk is lost, even expired_set does not check for UNREGISTERING
state.
The same for write if server returns an error.
This phase is ambiguous, split to UNREG_RPC and UNREG_BULK.
Signed-off-by: Vitaly Fertman <vitaly.fertman@seagate.com>
Seagate-bug-id: MRP-2953, MRP-3206
Reviewed-by: Andriy Skulysh <andriy.skulysh@seagate.com>
Reviewed-by: Alexey Leonidovich Lyashkov <alexey.lyashkov@seagate.com>
Tested-by: Elena V. Gryaznova <elena.gryaznova@seagate.com>
Reviewed-on: http://review.whamcloud.com/19953
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Ann Koehler <amk@cray.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch changes a few things:
- There is no guarantee that request_out_callback will happen
before reply_in_callback, if a request got reply and unlinked
reply buffer before request_out_callback is called, then the
thread waiting on ptlrpc_request_set will miss wakeup event.
This may seriously impact performance of some IO workloads or
result in RPC timeout
- To make the code easier to understand, this patch changes
action-bits "rq_req_unlink" and "rq_reply_unlink" to
status-bits "rq_req_unlinked" and "rq_reply_unlinked"
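The status-bit idea can be sketched in userspace C as below. The struct
and the sketch_* helpers are hypothetical and model only the two bits
named above. The assumption being illustrated is that each callback
records its own completion and that the waiter is woken once both bits
are set, whichever callback happens to run last.

```c
#include <stdbool.h>

/* Hypothetical request; only the two status bits are modeled. */
struct sketch_request {
	unsigned int rq_req_unlinked:1;		/* request buffer unlinked */
	unsigned int rq_reply_unlinked:1;	/* reply buffer unlinked */
};

/* The request is idle once both buffers have been unlinked. */
static bool sketch_request_idle(const struct sketch_request *req)
{
	return req->rq_req_unlinked && req->rq_reply_unlinked;
}

/* Records the request-out event; returns true if the waiter should wake. */
static bool sketch_request_out_callback(struct sketch_request *req)
{
	req->rq_req_unlinked = 1;
	return sketch_request_idle(req);
}

/* Records the reply-in event; returns true if the waiter should wake. */
static bool sketch_reply_in_callback(struct sketch_request *req)
{
	req->rq_reply_unlinked = 1;
	return sketch_request_idle(req);
}
```

With state recorded this way the wakeup does not depend on the order in
which the two callbacks arrive.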
Signed-off-by: Liang Zhen <liang.zhen@intel.com>
Reviewed-on: http://review.whamcloud.com/12158
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-5696
Reviewed-by: Johann Lombardi <johann.lombardi@intel.com>
Reviewed-by: Li Wei <wei.g.li@intel.com>
Reviewed-by: Mike Pershin <mike.pershin@intel.com>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When importing the clio simplification patch, the check for
object got reversed by mistake: when converting from
if (obj == NULL) it somehow became if (obj), which is obviously wrong,
and so when it does hit, a crash was happening as a result.
Fix the condition and all is fine now.
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There have been several Lustre client crashes reported by sites
running Lustre versions 2.1/2.5, all showing the same
corrupted dentry->d_hash->next pointer as the cause.
This patch fixes a regression that was introduced long ago
by commit:
(LU-506 kernel: FC15 - support dcache scalability changes.)
where i_lock protection usage was removed, which
is likely to cause a race condition during dentry [un]hashing
and to be the root cause of these crashes.
Signed-off-by: Bruno Faccini <bruno.faccini@intel.com>
Reviewed-on: http://review.whamcloud.com/19287
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-7973
Reviewed-by: Lai Siyao <lai.siyao@intel.com>
Reviewed-by: Yang Sheng <yang.sheng@intel.com>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
These are just doing spin_lock/unlock on the inode's i_lock,
so just do the spinlock directly to make the code clearer.
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Since the inflight request holds an import refcount as well as an export
one, sometimes obd_disconnect() in client_common_put_super() can't put
the last refcount of the OSC import (e.g. due to network disconnection),
which causes cl_cache to be accessed after it has been freed.
To fix this issue, ccc_users is used as cl_cache refcount, and
lov/llite/osc all hold one cl_cache refcount respectively, to avoid
the race that a new OST is being added into the system when the client
is mounted.
The following cl_cache functions are added:
- cl_cache_init(): allocate and initialize cl_cache
- cl_cache_incref(): increase cl_cache refcount
- cl_cache_decref(): decrease cl_cache refcount and free the cache
if refcount=0.
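The refcount lifecycle listed above can be sketched in userspace C as
follows. The names are deliberately prefixed with sketch_ because this
is not the Lustre code: struct sketch_cache keeps only a counter
standing in for ccc_users, and malloc/free replace the kernel
allocators.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical cache object; "users" plays the role of ccc_users. */
struct sketch_cache {
	atomic_int users;
};

/* Allocate the cache with one reference held by the caller. */
static struct sketch_cache *sketch_cache_init(void)
{
	struct sketch_cache *cache = malloc(sizeof(*cache));

	if (!cache)
		return NULL;
	atomic_init(&cache->users, 1);
	return cache;
}

/* Take an additional reference. */
static void sketch_cache_incref(struct sketch_cache *cache)
{
	atomic_fetch_add(&cache->users, 1);
}

/*
 * Drop one reference. Returns true if this was the last one and the
 * cache has been freed; the pointer must not be used after that.
 */
static bool sketch_cache_decref(struct sketch_cache *cache)
{
	if (atomic_fetch_sub(&cache->users, 1) == 1) {
		free(cache);
		return true;
	}
	return false;
}
```

With three holders, as in the lov/llite/osc case, the cache stays alive
until the third decref, so a holder that goes away late cannot touch
freed memory.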
Signed-off-by: Emoly Liu <emoly.liu@intel.com>
Reviewed-on: http://review.whamcloud.com/13746
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-6173
Reviewed-by: Niu Yawei <yawei.niu@intel.com>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We want the fixes in here, and we can resolve a merge issue in
drivers/iio/industrialio-trigger.c
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch fixes all checkpatch occurrences of
"CHECK: No space is necessary after a cast"
in Lustre code.
Signed-off-by: Emoly Liu <emoly.liu@intel.com>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch fixes all checkpatch occurrences of
"CHECK: Logical continuations should be on the previous line"
in Lustre code.
Signed-off-by: Emoly Liu <emoly.liu@intel.com>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch fixes one checkpatch warning in lustre:
WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
Signed-off-by: Emoly Liu <emoly.liu@intel.com>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The "Please contact Oracle Corporation" lines are removed: not only
does Oracle have nothing to do with Lustre anymore, but there is also
already a pointer to the GPL that is independent of any particular company.
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The 'copy of GPLv2]' is an ending from template that's no longer needed,
so remove it to avoid any extra confusion.
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Since SUN is no longer around and there's no point in contacting them,
just remove that whole thing. Copy of GPL is available online anyway
(URLs to be updated in next patch).
This patch was generated with:
find drivers/staging/lustre -name "*.[ch]" -exec perl -0777 -i -pe 's/ \* Please contact Sun Microsystems, Inc., 4150 Network Circle, Santa Clara,\n \* CA 95054 USA or visit www.sun.com if you need additional information or\n \* have any questions.\n \*\n//igs' {} \;
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Reported-by: Xose Vazquez Perez <xose.vazquez@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>