While it may be true that playing games with old_fs' block count
during a grow operation shuts up a bunch of warnings, resize2fs
doesn't actually expand the group descriptor array to match the size
we're artificially stuffing into old_fs, which means that if we
actually need to allocate a block out of the larger fs (i.e. we're in
desperation mode), ext2fs_block_alloc_stats2() scribbles on the heap,
leading to crashes if you're lucky and FS corruption if not.
So, rip that piece out and turn off com_err warnings properly and add
a test case to deal with growing a nearly full filesystem.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Recalculate the unused inode count and the block/inode uninit flags
when resizing a filesystem. This can speed up future e2fsck runs
considerably and will reduce mount times.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
resize2fs tries to infer the RAID stride by observing differences
between the locations of adjacent block groups' block and inode
bitmaps within the block group. If the two block groups being
compared belong to different flexbgs, however, it'll be fooled by the
large offset into thinking that the FS has an abnormally large RAID
stride.
Therefore, teach it not to get confused by crossing a flexbg.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: TR Reardon <thomas_reardon@hotmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When shrinking a filesystem, resize2fs wants to free per-bg metadata
blocks that are no longer needed. This behavior is gated on whether
there's a superblock in the group as told by new_fs. The check really
should be against old_fs, since we're effectively freeing blocks out
of old_fs in the transition to new_fs, but prior to sparse_super2 this
didn't matter since superblocks didn't move, so it didn't matter.
Under sparse_super2, however, there's a superblock in the last group,
so now we need to change the test to use old_fs as it should.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently maximum number of bad blocks is not limited in any way.
However our code can really handle at most INT_MAX/2 bad blocks (for
larger numbers binary search indexes start overflowing). So report
number of bad blocks is just too big instead of plain segfaulting.
It won't be too hard to raise the limit but I don't think there's any
real use for disks with over 1 billion of bad blocks...
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
My previous change ended up requiring that the filesystem
be fsck'd after the last mount, even if we are only querying
the minimum size. This is a bit draconian, and it burned
the Fedora installer, which wants to calculate minimum size
for every filesystem in the box at install time, which in turn
requires a full fsck of every filesystem.
Try this one more time, and separate out the tests to make things
a bit more clear. If we're only printing the min size, don't
require the fsck, as this is a bit less dangerous/critical.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When we're moving an inode on a metadata_csum filesystem, we need to
rewrite the checksum of all interior nodes of the extent tree. The
current code does this inefficiently via set_bmap, but we can do this
more efficiently through direct iteration of the extent tree.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If we're shrinking a sparse_super2 filesystem to a single block group,
the superblock will be in block 0. This is perfectly valid (for block
group 0 with a blocksize > 1024) so don't exit.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
At mke2fs time, if we discard the device and discard zeroes data,
don't bother zeroing the inode table blocks a second time.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If we're disabling metadata_csum and the user doesn't provide explicit
instructions to enable or disable uninit_bg, assume that they want
uninit_bg to be turned on by default. Otherwise, we lose all block
group flags and unused inode count, which is a big hit to performance.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Warn the user if we're trying to enable metadata_csum on a FS that
doesn't support extents (since block maps cannot contain checksums).
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Don't complain about checksum failures on the root dir when we're
trying to find l+f if the root dir is going to be rehashed anyway.
The test case for this is t_enable_mcsum in the next patch.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If a directory block lacks space for a checksum and the user directs
e2fsck to fix the directory block (by rehashing it), don't complain a
second time about the checksum verification failure when we get to the
end of the directory block.
Also, don't complain about broken HTREE directories if we're already
planning to rebuild the HTREE directory.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Don't display unused inodes twice, and make it clear that we're
printing a descriptor checksum.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: TR Reardon <thomas_reardon@hotmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The current mk_hugefile code in mke2fs doesn't support creating
non-extent files, so disable the functionality when we're mkfs'ing
without extent support.
The fallocate patches further on will eliminate the need for this.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Add an API so that client programs can discover a reasonable maximum
extent tree depth. This will eventually be used by e2fsck as one of
the criteria to decide if an extent-based file should have its extent
tree rebuilt.
Turn some related magic numbers into constants while we're at it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When we're splitting an extent node, try to allocate the new interior
tree block just prior to the first extent in the block we're trying to
split. The previous logic only set a goal block if we had to split
both the current node and its parent, which is somewhat infrequent.
When that would happen, the goal would start at zero, leading to poor
locality.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Try to be a little smarter about where we go to allocate blocks for a
inode. For a given inode and logical offset, set the goal as if the
file were physically continuous. If it's bmapped, just start looking
at wherever lblk 0 is. If that's not possible (the file has no
lblk>pblk mappings, inline data, etc.) then start looking in the
inode's block group.
[ Fixed memory leak --tytso ]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If the caller supplies a buffer to ext2fs_alloc_block2(), use it
instead of calling ext2fs_zero_blocks2().
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Dynamically grow the block zeroing buffer to a maximum of 4MB, and
allow callers to provide their own zeroed buffer in
ext2fs_zero_blocks2().
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
bmap_rb_extent is defined as __u64:blk __u64:count. So count can
exceed INT_MAX on populated filesystems.
TESTCASE: xfstest ext4/004
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When we're rechecking an inode checksum failure, we need to force the
inode to be re-read from disk so that the verification routine runs,
so drop the stashed inode.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Don't say the physical block number is invalid if an extent block
fails only the checksum. It passes checks, so it's not invalid.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The file IO routines do not handle uninit blocks at all. The read
method should check for the uninit flag and return a buffer of zeroes,
and the write routine should convert unwritten extents.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Don't open-code the creation of the extent tree header, since
ext2fs_extent_open2() knows how to take care of this.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Regression test for a problem inadvertently fixed by the patchset
"e2fsprogs/tune2fs: fix memory leak in inode_scan_and_fix()" by Xiaoguang Wang.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If we apply this patch 'e2fsprogs/tune2fs: rewrite metadata checksums
when resizing inode size', we will trigger a segfault, this is because
of the inode cache issues.
Firstly we should notice that in expand_inode_table(), we have change
the super block's s_inode_size to new inode size(for example, 256).
Then we re-compute metadata checksums, see below code flow:
|-->rewrite_metadata_checksums
|----->rewrite_inodes
|-------->ext2fs_write_inode_full
In ext2fs_write_inode_full(), if an inode cache is hit, the below code will be executed:
/* Check to see if the inode cache needs to be updated */
if (fs->icache) {
for (i=0; i < fs->icache->cache_size; i++) {
if (fs->icache->cache[i].ino == ino) {
memcpy(fs->icache->cache[i].inode, inode,
(bufsize > length) ? length : bufsize);
break;
}
}
}
Before executing rewrite_inodes(), actually the inode in inode cache
is allocated by old inode size(for example, 128), but here the memcpy
will obviously write overflow, '(bufsize > length) ? length : bufsize'
here will return 256(new inode size), so this is wrong, we need to fix
this. I think we should call ext2fs_free_inode_cache() in
expand_inode_table(), to drop the inode cache, because inode size has
changed, if necessary, we will re-create this inode cache.
Steps to reproduce this bug (apply 'tune2fs: rewrite metadata checksums
when resizing inode size' first):
dd if=/dev/zero of=file.img bs=1M count=128
device_name=$(/sbin/losetup -f)
/sbin/losetup -f file.img
mkfs.ext4 -I 128 -O ^flex_bg $device_name
tune2fs -I 256 $device_name
Signed-off-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When we use tune2fs -I new_ino_size to change inode size, if
everything is OK, the corresponding ext4_group_desc.bg_free_blocks_count
will be decreased, so obviously, we need to re-compute the group
descriptor checksums, and the inode 's size has also changed, we also
need to recompute the checksums of inodes for metadata_csum
filesystem, so here we choose to call a rewrite_metadata_checksums(),
this will fix checksum issues.
Meanwhile, the patch will trigger an existing memory write overflow,
which will casue segfault, please see the next patch.
Signed-off-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If the inode size is large enough that there are fewer than two inodes
per block, don't report an inode checksum failure as a garbage inode
during the scan because the "more than half are broken" criteria that
we use to decide if a block of inodes is garbage doesn't really apply.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When looking for the start of the hugefile range, the 'next' variable
is incorrectly decremented. If we happened to find a single free
block, the effect of this decrement is that blk == next, which means
that we never modify the loop control variable, so get_start_block
never returns.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When we use ext2fs_open_inode_scan() to iterate inodes and finish
jobs, we also need a ext2fs_close_inode_scan(scan) operation, but in
inode_scan_and_fix(), we forgot to call it, fix this error.
Signed-off-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Don't allow callers to feed bad block/inode numbers to
ext2fs_*_alloc_stats2, because evil callers (<cough>resize2fs<cough>)
can corrupt library state this way, leading to a crash.
(There will be a subsequent patch to resize2fs to fix its bad
behavior.)
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Set BLOCK_UNINIT in any group whose blocks are all unused, so long as
it isn't the last group. This helps us speed up future e2fsck runs
and mounts because we don't need to read or checksum block bitmaps for
these groups.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If we're going to read the "nr - 1" entry in an indirect block for use
as a "goal" input to the block allocator, we need to byteswap the
entry. While we're at it, if we're allocating blocks for the zeroth
entry in the indirect block, we might as well use the indirect block
as the starting point to try to reduce fragmentation.
(d_fallocate_blkmap will test this...)
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If the user doesn't provide any arguments, the guard fails to run and
the whole thing segfaults on ext2fs_open2(). Don't do that.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fix some gcc-4.8 warnings and other problems that broke the build.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fix various tests that break on non-Linux systems; in particular,
sed -i doesn't work the same on all platforms; and try to keep the
GNU getopt-isms out.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The metadata_csum feature (really, the journal checksum disk format)
didn't stabilize until the 3.18 kernel, at which point the companion
journal_csum feature was turned on by default if metadata_csum was
enabled. Therefore, warn the user if they try to create such a
filesystem on a pre-3.18 kernel.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
e2fsck uses an array to store directory usage information during pass
3; the usage context also contains a pointer to the last directory
looked up. When expanding the dir_info array, this cache pointer
needs to be cleared if the array resize changed the pointer location,
or else we'll later walk off the end of this dead pointer.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: Sami Liedes <sami.liedes@iki.fi>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Sami Liedes reports that e2fsck fails to report the correct directory
inode number during a pass2 check for unexpected HTREE blocks.
Provide the inode number in the problem report.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: Sami Liedes <sami.liedes@iki.fi>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Commit 3e683eef93 ("define bitwise types and annotate conversion
routines") broke the build on various platforms. Turns out that
crossing our fingers wasn't such a good idea, so just define it
separately.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Don't let users change metadata_csum on a mounted filesystem because
there's no way to tell the kernel to turn on the feature; there's no
way to prevent the kernel from rewriting on-disk structures while
tune2fs is also rewriting them; and there's no way to tell the kernel
to reload them after tune2fs is finished.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When enabling checksums, tune2fs naively rewrites every extent in the
entire tree! This is unnecessary since we only need to rewrite each
extent tree block; therefore, only rewrite the extent if it's the
first one in an internal extent tree block.
Also, don't bother iterating the extent tree when clearing checksums.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>