It does the same thing as "do_brk()", except it handles the VM locking
too.
It turns out that all external callers want that anyway, so we can make
do_brk() static to just mm/mmap.c while at it.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When hostname contains colon (e.g. when it is an IPv6 address) it needs
to be enclosed in brackets to make parsing of NFS device string possible.
Fix nfs_do_root_mount() to enclose hostname properly when needed. NFS code
actually does not need this as it does not parse the string passed by
nfs_do_root_mount() but the device string is exposed to userspace in
/proc/mounts.
CC: Josh Boyer <jwboyer@redhat.com>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In the recent update of the cifs_iovec_write code to use async writes,
the handling of the file position was broken. That patch added a local
"offset" variable to handle the offset, and then only updated the
original "*poffset" before exiting.
Unfortunately, it copied off the original offset from the beginning,
instead of doing so after generic_write_checks had been called. Fix
this by moving the initialization of "offset" after that in the
function.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Pull nfsd bugfixes from J. Bruce Fields:
"One bugfix, and one minor header fix from Jeff Layton while we're
here"
* 'for-3.4' of git://linux-nfs.org/~bfields/linux:
nfsd: include cld.h in the headers_install target
nfsd: don't fail unchecked creates of non-special files
Since we may be simulating flock() locks using NFS byte range locks,
we can't rely on the VFS having checked the file open mode for us.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
All callers of nfs4_handle_exception() that need to handle
NFS4ERR_OPENMODE correctly should set exception->inode
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
Pull fuse updates from Miklos Szeredi.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: use flexible array in fuse.h
fuse: allow nanosecond granularity
fuse: O_DIRECT support for files
fuse: fix nlink after unlink
The bitfield member mount_opt was too small by one bit to hold the mount
option that enabled to include data extents in the integrity checker.
Since the same issue happened when the BTRFS_MOUNT_PANIC_ON_FATAL_ERROR
option was added (git rebase silently merges so that the increase of the
size of the bitfield member is lost), the bit limit was removed entirely.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
When a filesystem is mounted with the degraded option, it is
possible that some of the devices are not there.
btrfs_ioctl_dev_info() crashs in this case because the device
name is a NULL pointer. This ioctl was only used for scrub.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
It is basically a good thing if we are interruptible when waiting for
free space, but the generality in which it is implemented currently
leads to system calls being interruptible that are not documented this
way. For example git can't handle interrupted unlink(), leading to
corrupt repos under space pressure.
Instead we raise the bar to only be interruptible by SIGKILL.
Thanks to David Sterba for suggesting this.
Signed-off-by: Arne Jansen <sensille@gmx.net>
The caller expects this function to return with the lock held and
releases it immediately on error.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
A user reported a panic where we were trying to fix a bad mirror but the
mirror number we were giving was 0, which is invalid. This is because we
don't do the transid verification until after the read, so as far as the
read code is concerned the read was a success. So instead store the mirror
we read from so that if there is some failure post read we know which mirror
to try next and which mirror needs to be fixed if we find a good copy of the
block. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Fix a bug, where in case we need to adjust stripe_size so that the
length of the resulting chunk is less than or equal to max_chunk_size,
DUP chunks turn out to be only half as big as they could be.
Cc: Arne Jansen <sensille@gmx.net>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
iref_to_path and iterate_irefs both increment the eb's refcount to use it
after releasing the path. Both depend on consistent data remaining in the
extent buffer and need a read lock to protect it.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Avoid calling free_extent_buffer more than once when the iterator function
returns non-zero. The only code that uses this is scrub repair for corrupted
nodatasum blocks.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Make free_ipath() behave like most other freeing functions in the
kernel and gracefully do nothing when passed a NULL pointer.
Besides this making the bahaviour consistent with functions such as
kfree(), vfree(), btrfs_free_path() etc etc, it also fixes a real NULL
deref issue in fs/btrfs/ioctl.c::btrfs_ioctl_ino_to_path(). In that
function we have this code:
...
ipath = init_ipath(size, root, path);
if (IS_ERR(ipath)) {
ret = PTR_ERR(ipath);
ipath = NULL;
goto out;
}
...
out:
btrfs_free_path(path);
free_ipath(ipath);
...
If we ever take the true branch of that 'if' statement we'll end up
passing a NULL pointer to free_ipath() which will subsequently
dereference it and we'll go "Boom" :-(
This patch will avoid that.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
clear_extent_bit()
{
next_node = rb_next(&state->rb_node);
...
clear_state_bit(state); <-- this may free next_node
if (next_node) {
state = rb_entry(next_node);
...
}
}
clear_state_bit() calls merge_state() which may free the next node
of the passing extent_state, so clear_extent_bit() may end up
referencing freed memory.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Currently it returns a set of bits that were cleared, but this return
value is not used at all.
Moreover it doesn't seem to be useful, because we may clear the bits
of a few extent_states, but only the cleared bits of last one is
returned.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Our code is not ready to cope with a sectorsize that's not equal to PAGE_SIZE.
It will lead to hanging-on while writing something.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Normally when there are 2 copies of a block, we add both to the
reada extent tree and prefetch only the one that is easier to reach.
This way we can better utilize multiple devices.
In case of DUP this makes no sense as both copies reside on the
same device.
Signed-off-by: Arne Jansen <sensille@gmx.net>
When inserting into the radix tree returns EEXIST, get the existing
entry without giving up the spinlock in between.
There was a race for both the zones trees and the extent tree.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Follow those instructions, and you'll trigger a warning in the
beginning of d_set_d_op():
# mkfs.btrfs /dev/loop3
# mount /dev/loop3 /mnt
# btrfs sub create /mnt/sub
# btrfs sub snap /mnt /mnt/snap
# touch /mnt/snap/sub
touch: cannot touch `tmp': Permission denied
__d_alloc() set d_op to sb->s_d_op (btrfs_dentry_operations), and
then simple_lookup() reset it to simple_dentry_operations, which
triggered the warning.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Pull ext4 regression fixes from Ted Ts'o:
"This fixes a scalability problem reported by Andi Kleen and Tim Chen;
they were quite secretive about the precise nature of their workload,
but they later admitted that it only showed up when they were using a
large sparse file, so the amount of data I/O that was needed was close
to zero.
I'm not sure how realistic this is and it's only a regression if you
consider changes made since 2.6.39 to be a "regression" vis-a-vis the
policy regarding post-merge window bug fixes, but Linus agreed it was
worth fixing, so I'm including it in this pull request.
This also fixes the journalled quota mount options, which I
accidentally broke while I was cleaning up the mount option handling."
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix handling of journalled quota options
ext4: address scalability issue by removing extent cache statistics
Pull vfs fixes from Al Viro:
"A bunch of endianness fixes and a couple of nfsd error value fixes.
Speaking of endianness stuff, I'm rather tempted to slap
ccflags-y += -D__CHECK_ENDIAN__
in fs/Makefile, if not making it default for the entire tree; nfsd
regressions I've caught make one hell of a pile and we'd obviously
benefit from having that kind of stuff caught earlier..."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
lockd: fix the endianness bug
ocfs2: ->e_leaf_clusters endianness breakage
ocfs2: ->rl_count endianness breakage
ocfs: ->rl_used breakage on big-endian
ocfs2: ->l_next_free_req breakage on big-endian
btrfs: btrfs_root_readonly() broken on big-endian
ext4: fix endianness breakage in ext4_split_extent_at()
nfsd: fix compose_entry_fh() failure exits
nfsd: fix error value on allocation failure in nfsd4_decode_test_stateid()
nfsd: fix endianness breakage in TEST_STATEID handling
nfsd: fix error values returned by nfsd4_lockt() when nfsd_open() fails
nfsd: fix b0rken error value for setattr on read-only mount
Because the mount process can run a quotacheck and consume lots of
inodes, we need to be able to run periodic inode reclaim during the
mount process. This will prevent running the system out of memory
during quota checks.
This essentially reverts 2bcf6e97, but that is safe to do now that
the quota sync code that was causing problems during long quotacheck
executions is now gone.
The reclaim work is currently protected from running during the
unmount process by a check against MS_ACTIVE. Unfortunately, this
also means that the reclaim work cannot run during mount. The
unmount process should stop the reclaim cleanly before freeing
anything that the reclaim work depends on, so there is no need to
have this guard in place.
Also, the inode reclaim work is demand driven, so there is no need
to start it immediately during mount. It will be started the moment
an inode is queued for reclaim, so qutoacheck will trigger it just
fine.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Pull CIFS fixes from Steve French.
* git://git.samba.org/sfrench/cifs-2.6:
Fix number parsing in cifs_parse_mount_options
Cleanup handling of NULL value passed for a mount option
Commit 26092bf5 broke handling of journalled quota mount options by
trying to parse argument of every mount option as a number. Fix this
by dealing with the quota options before we call match_int().
Thanks to Jan Kara for discovering this regression.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Check if the project quota is running or not before performing
xfs_qm_statvfs(), just return if not. Otherwise the ASSERT
XFS_IS_QUOTA_RUNNING in xfs_qm_dqget will be popped.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Andi Kleen and Tim Chen have reported that under certain circumstances
the extent cache statistics are causing scalability problems due to
cache line bounces.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
Pull the minimal btrfs branch from Chris Mason:
"We have a use-after-free in there, along with errors when mount -o
discard is enabled, and a BUG_ON(we should compile with UP more
often)."
* 'for-linus-min' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: use commit root when loading free space cache
Btrfs: fix use-after-free in __btrfs_end_transaction
Btrfs: check return value of bio_alloc() properly
Btrfs: remove lock assert from get_restripe_target()
Btrfs: fix eof while discarding extents
Btrfs: fix uninit variable in repair_eb_io_failure
Revert "Btrfs: increase the global block reserve estimates"
With this change, calling
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)
disables privilege granting operations at execve-time. For example, a
process will not be able to execute a setuid binary to change their uid
or gid if this bit is set. The same is true for file capabilities.
Additionally, LSM_UNSAFE_NO_NEW_PRIVS is defined to ensure that
LSMs respect the requested behavior.
To determine if the NO_NEW_PRIVS bit is set, a task may call
prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
It returns 1 if set and 0 if it is not set. If any of the arguments are
non-zero, it will return -1 and set errno to -EINVAL.
(PR_SET_NO_NEW_PRIVS behaves similarly.)
This functionality is desired for the proposed seccomp filter patch
series. By using PR_SET_NO_NEW_PRIVS, it allows a task to modify the
system call behavior for itself and its child tasks without being
able to impact the behavior of a more privileged task.
Another potential use is making certain privileged operations
unprivileged. For example, chroot may be considered "safe" if it cannot
affect privileged tasks.
Note, this patch causes execve to fail when PR_SET_NO_NEW_PRIVS is
set and AppArmor is in use. It is fixed in a subsequent patch.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Will Drewry <wad@chromium.org>
Acked-by: Eric Paris <eparis@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
v18: updated change desc
v17: using new define values as per 3.4
Signed-off-by: James Morris <james.l.morris@oracle.com>
->root_flags is __le64 and all accesses to it go through the helpers
that do proper conversions. Except for btrfs_root_readonly(), which
checks bit 0 as in host-endian...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The function kstrtoul() used to parse number strings in the mount
option parser is set to expect a base 10 number . This treats the octal
numbers passed for mount options such as file_mode as base10 numbers
leading to incorrect behavior.
Change the 'base' argument passed to kstrtoul from 10 to 0 to
allow it to auto-detect the base of the number passed.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Acked-by: Jeff Layton <jlayton@samba.org>
Reported-by: Chris Clayton <chris2553@googlemail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
->ee_len is __le16, so assigning cpu_to_le32() to it is going to do
Bad Things(tm) on big-endian hosts...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Restore the original logics ("fail on mountpoints, negatives and in
case of fh_compose() failures"). Since commit 8177e (nfsd: clean up
readdirplus encoding) that got broken -
rv = fh_compose(fhp, exp, dchild, &cd->fh);
if (rv)
goto out;
if (!dchild->d_inode)
goto out;
rv = 0;
out:
is equivalent to
rv = fh_compose(fhp, exp, dchild, &cd->fh);
out:
and the second check has no effect whatsoever...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>