If we think we're going to need to repair either the root directory or
the lost+found directory, reserve a block at the end of pass 1 to
reduce the likelihood of an e2fsck abort while reconstructing
root/lost+found during pass 3.
If / and/or /lost+found are corrupt and duplicate processing in pass
1b allocates all the free blocks in the FS, fsck aborts with an
unusable FS since pass 3 can't recreate / or /lost+found. If either
of those directories are missing, an admin can't easily mount the FS
and access the directory tree to move files off the injured FS and
free up space; this in turn prevents subsequent runs of e2fsck from
being able to continue repairs of the FS.
(One could migrate files manually with debugfs without the help of
path names, but it seems easier if users can simply mount the FS and
use regular FS management tools.)
[ Fixed up an obvious C trap: const char * and const char [] are not
the same thing when you are taking the size of the parameter.
People, run your regression tests! Like spinach, it's good for you. :-)
-- tytso ]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Provide an API to set i_size in an inode and take care of all required
feature flag modifications. Refactor the code to use this new
function.
[ Moved the function to lib/ext2fs/blk_num.c, which is the rest of
these sorts of functions live, and renamed it to be
ext2fs_inode_size_set() instead of ext2fs_inode_set_size() to be
consistent with the other functions in in blk_num.c -- tytso ]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We rely on a nasty hack to adjust the free block count where we pass
signed value into ext2fs_free_blocks_count_add(), which takes an
64-bit unsigned value, and relies on overflow and C's signed->unsigned
semantics to do the subtraction. This works, so long as a 64-bit
signed value is used.
Unfortunately, ext2fs_block_alloc_stats2() and
ext2fs_block_alloc_stats_range(), this is not true, so on a 64-bit
file system, the free blocks accounting can get screwed up.
A simple way to demonstrate the problem is:
mke2fs -F -t ext4 -O 64bit /tmp/foo.img 1M
e2fsck -fy /tmp/foo.img
... which will result in the following e2fsck complaint:
Pass 5: Checking group summary information
Free blocks count wrong (4294968278, counted=982).
Fix? yes
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
There are a number of places where we need convert groups to blocks or
clusters by multiply the groups by blocks/clusters per group.
Unfortunately, both quantities are 32-bit, but the result needs to be
64-bit, and very often the cast to 64-bit gets lost.
Fix this by adding new macros, EXT2_GROUPS_TO_BLOCKS() and
EXT2_GROUPS_TO_CLUSTERS().
This should fix a bug where resizing a 64bit file system can result in
calculate_minimum_resize_size() looping forever.
Addresses-Launchpad-Bug: #1321958
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Using C99 initializers makes the code a bit more readable, and it
avoids some gcc -Wall warnings regarding missing initializers.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When resizing an empty 21T file system to 28T, resize2fs was using
this much CPU time and memory:
216.98user 19.77system 4:02.92elapsed 97%CPU (0avgtext+0avgdata 4485664maxresident)k
8inputs+1068680outputs (0major+800745minor)pagefaults 0swaps
After this one-line change:
222.29user 0.49system 3:48.79elapsed 97%CPU (0avgtext+0avgdata 30080maxresident)k
8inputs+1068552outputs (0major+2497minor)pagefaults 0swaps
So this reduces the max memory utilized from 4.2GB to 29MB!
For future work, the primary place where we are spending the most cpu
time (from resize2fs -d 16) are these two places:
blocks_to_move: Memory used: 2508k/25096k (1903k/606k), time: 91.42/91.53/ 0.00
and
calculate_summary_stats: Memory used: 2508k/25612k (1908k/601k), time: 95.33/95.45/ 0.00
The calculate_summary_stats pass can be sped up by using
ext2fs_find_first_{zero,set}_block_bitmap2(), instead of iterating
over the entire block bitmap one bit at a time.
The blocks_to_move pass can be sped up by using a bitmap to store the
location of fs metadata blocks, to avoid an O(N**2) algorithm where N
is the number of groups in the file system.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The bits between end and real_end are set as a safety measure for the
kernel when it uses the bit scan instructions. We need to take this
into account when shrinking or growing the block allocation bitmap,
before we can safely use rbtree bitmaps in resize2fs.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fix a few warnings about unused and uninitialized variables.
Also fix util/subst.c to include <sys/time.h> to avoid using
undeclared functions gettimeofday() and futimes().
Signed-off-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Creates a program that fuzzes only the metadata blocks (or optionally
all in-use blocks) of an ext* filesystem. There's also a script to
automate fuzz testing of the kernel and e2fsck in a loop.
[ Modified by tytso to add e2fuzz to the clean makefile rule ]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Port tune2fs' -e flag to mke2fs so that we can set error behavior at
format time, and introduce the equivalent errors= setting into
mke2fs.conf.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Directories can't have uninitialized extents, so offer to clear the
uninit flag when we find this situation. The actual directory blocks
will be checked in pass 2 and 3 regardless of the uninit flag.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently, directories cannot be fallocated, which means that the only
way they get bigger is for the kernel to append blocks one by one.
Therefore, if we encounter a logical block offset that is too big, we
needn't bother adding it to the dblist for pass2 processing, because
it's unlikely to contain a valid directory block. The code that
handles extent based directories also does not add toobig blocks to
the dblist.
Note that we can easily cause e2fsck to fail with ENOMEM if we start
feeding it really large logical block offsets, as the dblist
implementation will try to realloc() an array big enough to hold it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Always iterate logical block 0 in a directory, even if no physical
block has been allocated. Pass 2 will notice the lack of mapping and
offer to allocate a new directory block; this enables us to link the
directory into lost+found.
Previously, if there were no logical blocks mapped, we would fail to
pick up even block 0 of the directory for processing in pass 2. This
meant that e2fsck never allocated a block 0 and therefore wouldn't fix
the missing . and .. entries for the directory; subsequent e2fsck runs
would complain about (yet never fix) the problem.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If we notice a hole in the block map of an extent-based directory,
offer to collapse the hole by decreasing the logical block # of the
extent. This saves us from pass 3's inefficient strategy, which fills
the holes by mapping in a lot of empty directory blocks.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If a user crafts a carefully constructed filesystem containing a
single directory entry block with an invalid checksum and fewer than
two entries, and then runs e2fsck to fix the filesystem, fsck will
crash when it tries to "compress" the short dir and passes a negative
dirent array length to qsort. Therefore, don't allow directory
"compression" in this situation.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The third argument to strncat is the maximum number of characters to
copy out of the second argument; it is not the maximum length of the
first argument.
Therefore, code in a check just in case we ever find a /sys/block/X
path long enough to hit the end of the buffer. FWIW the longest path
I could find on my machine was 133 bytes.
Fixes-Coverity-Bug: 1252003
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In the loop in ext2fs_get_free_blocks2, we ask the bitmap if there's a
range of free blocks starting at "b" and ending at "b + num - 1".
That quantity is the number of the last block in the range. Since
ext2fs_blocks_count() returns the number of blocks and not the number
of the last block in the filesystem, the check is incorrect.
Put in a shortcut to exit the loop if finish > start, because in that
case it's obvious that we don't need to reset to the beginning of the
FS to continue the search for blocks. This is needed to terminate the
loop because the broken test meant that b could get large enough to
equal finish, which would end the while loop.
The attached testcase shows that with the off by one error, it is
possible to throw e2fsck into an infinite loop while it tries to
find space for the inode table even though there's no space for one.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Since fs->group_desc_count is the number of block groups, the number
of the last group is always one less than this count. Fix the bounds
check to reflect that.
This flaw shouldn't have any user-visible side effects, since the
block bitmap test based on last_grp later on can handle overbig block
numbers.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
During the later passes of efsck, we sometimes need to allocate and
map blocks into a file. This can happen either by fsck directly
calling new_block() or indirectly by the library calling new_block
because it needs to allocate a block for lower level metadata (bmap2()
with BMAP_SET; block_iterate3() with BLOCK_CHANGED).
We need to force new_block to allocate blocks from the found block
map, because the FS block map could be inaccurate for various reasons:
the map is wrong, there are missing blocks, the checksum failed, etc.
Therefore, any time fsck does something that could to allocate blocks,
we need to intercept allocation requests so that they're sourced from
the found block map. Remove the previous code that swapped bitmap
pointers as this is now unneeded.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When we call ext2fs_close_free at the end of main(), we need to supply
the address of ctx->fs, because the subsequent e2fsck_free_context
call will try to access ctx->fs (which is now set to a freed block) to
see if it should free the directory block list. This is clearly not
desirable, so fix the problem.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If we encounter an inode with IND/DIND/TIND blocks or internal extent
tree blocks that point into critical FS metadata such as the
superblock, the group descriptors, the bitmaps, or the inode table,
it's quite possible that the validation code for those blocks is not
going to like what it finds, and it'll ask to try to fix the block.
Unfortunately, this happens before duplicate block processing (pass
1b), which means that we can end up doing stupid things like writing
extent blocks into the inode table, which multiplies e2fsck'
destructive effect and can render a filesystem unfixable.
To solve this, create a bitmap of all the critical FS metadata. If
before pass1b runs (basically check_blocks) we find a metadata block
that points into these critical regions, continue processing that
block, but avoid making any modifications, because we could be
misinterpreting inodes as block maps. Pass 1b will find the
multiply-owned blocks and fix that situation, which means that we can
then restart e2fsck from the beginning and actually fix whatever
problems we find.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When we're dumping a fast symlink inode, we print some odd things to
stdout. To clean this up, first don't print inline data EA, since the
inode dump doesn't display file and directory contents. Then, teach
the inode dump function how to print out either an inline data fast
symlink or a non-inline data fast symlink.
(This is a follow-up to the earlier patch "debugfs: Only print the
first 60 bytes from i_block on a fast symlink")
[ Modified by tytso so that the d_inline_dump test works when build
directory is different from the source directory --- i.e., when
doing a VPATH build. ]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When we're about to iterate the blocks of a block-map file, we need to
write the inode out to disk if it's dirty because block_iterate3()
will re-read the inode from disk. (In practice this won't happen
because nothing dirties block-mapped inodes before the iterate call,
but we can program defensively).
More importantly, we need to re-read the inode after the iterate()
operation because it's possible that mappings were changed (or erased)
during the iteration. If we then dirty or clear the inode, we'll
mistakenly write the old inode values back out to disk!
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If the bitmaps are known to be unreadable, don't bother clearing them;
just mark fsck to restart itself after pass 5, by which time the
bitmaps should be fixed.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If e2fsck knows the bitmaps are bad at the exit (probably because they
were bad at the start and have not been fixed), don't offer to
recreate the journal because doing so causes e2fsck to abort a second
time.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If there's a problem with the inode scan during pass 1b, report the
inode that we were trying to examine when the error happened, not the
inode that just went through the checker.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Allow set_inode_field's bmap command in debugfs to allocate blocks,
which enables us to allocate blocks for indirect blocks and internal
extent tree blocks. True, we could do this manually, but seems like
unnecessary bookkeeping activity for humans.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Create a command that will dump an entire inode's space in hex.
[ Modified by tytso to add a description to the man page, and to add
the more formal command name, inode_dump, in addition to short
command name of "idump". ]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently, e4defrag avoids increasing file fragmentation by comparing
the number of runs of physical extents of both the original and the
donor files. Unfortunately, there is a bug in the routine that counts
physical extents, since it doesn't look at the logical block offsets
of the extents. Therefore, a file whose blocks were allocated in
reverse order will be seen as only having one big physical extent, and
therefore will not be defragmented.
Fix the counting routine to consider logical extent offset so that we
defragment backwards-allocated files. This could be problematic if we
ever gain the ability to lay out logically sparse extents in a
physically contiguous manner, but presumably one wouldn't call defrag
on such a file.
Reported-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Coverity (re-)spotted this; it was triaged as a false positive,
but it seems pretty clear that the bh (which was just checked)
isn't currently freed before the function exits.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
It's only necessary to build tst_libext2fs when running "make check".
Also make sure the links of the tst_* programs are done with
$(ALL_LDFLAGS).
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Most systems have a backwards compatibility symlink in
/usr/include/syscall.h to /usr/include/sys/syscall.h, but
sys/syscall.h is the documented location of the header file. Fix two
locations where we were using <syscall.h> instead of <sys/syscall.h>.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
There were other protections which would prevent a buffer overflow
from happening, but we should fix this nevertheless.
Addresses-Coverity-Bug: #1225003
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The single quote character must not be in the first character in a
line, or else it can get mistaken as a macro call.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
As an object lesson in why autoreconf is fundamentally unsafe, the
newer version of nls.m4 no longer handles @MKINSTALLDIRS@. So add
this back, since our Makefiles depend on it.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>