e2fsck pass1 is modified to use the block group data prefetch function
to try to fetch the inode tables into the pagecache before it is
needed. We iterate through the blockgroups until we have enough inode
tables that need reading such that we can issue readahead; then we sit
and wait until the last inode table block read of the last group to
start fetching the next bunch.
pass2 is modified to use the dirblock prefetching function to prefetch
the list of directory blocks that are assembled in pass1. We use the
"iterate a subset of a dblist" and avoid copying the dblist. Directory
blocks are fetched incrementally as we walk through the directory
block list. In previous iterations of this patch we would free the
directory blocks after processing, but the performance hit to e2fsck
itself wasn't worth it. Furthermore, it is anticipated that most
users will then mount the FS and start using the directories, so they
may as well remain in the page cache.
pass4 is modified to prefetch the block and inode bitmaps in
anticipation of pass 5, because pass4 is entirely CPU bound.
In general, these mechanisms can decrease fsck time by 10-40%, if the
host system has sufficient memory and the storage system can provide a
lot of IOPs. Pretty much any storage system capable of handling
multiple IOs in-flight at any time will see a fairly large performance
boost. (Single-issue USB mass storage disks seem to suffer badly.)
By default, the readahead buffer size will be set to the size of a block
group's inode table (which is 2MiB for a regular ext4 FS). The -E
readahead_kb= option can be given to specify the amount of memory to
use for readahead or zero to disable it entirely; or an option can be
given in e2fsck.conf.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If the number of unused inodes is greater than number of inodes a
block group, this can cause an e2fsck -n run of the file system to
crash.
We should add more checks to e2fsck to detect this case directly, but
this will at least protect progams (tune2fs, dump, etc.) which use the
inode_scan abstraction from crashing on an invalid file system.
Addresses-Debian-Bug: #773795
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If the inode size is large enough that there are fewer than two inodes
per block, don't report an inode checksum failure as a garbage inode
during the scan because the "more than half are broken" criteria that
we use to decide if a block of inodes is garbage doesn't really apply.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
On BE platforms, we need to swap the inode bytes after doing the
checksum verification but before looking at i_blocks.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If an inode fails checksum verification, don't stuff a copy of it in
the inode cache, because this can cause the library to fail to return
the "corrupt inode" error code.
In general, this happens if ext2fs_read_inode_full() is called twice
on an inode with an incorrect checksum. If fs->flags has
EXT2_FLAG_IGNORE_CSUM_ERRORS set during the first call and *unset*
during the second call, the cache hit during the second call fails to
return EXT2_ET_INODE_CSUM_INVALID as you'd expect. This happens
during fsck because the first read_inode call happens as part of
check_blocks and the second call happens during inode checksum
revalidation. A file system with a slightly corrupt non-extent inode
will trigger this.
While we're at it, make the inode read function consistent with the
rest of libext2fs -- copy the metadata object into the caller's buffer
even if it fails checksum verification. This will help e2fsck avoid a
double re-read later on down the line.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When we need to modify the "ignore checksum error" behavior flag to
get us past a library call, it's possible that the library call can
result in other flag bits being changed. Therefore, it is not correct
to restore unconditionally the previous flags value, since this will
have unintended side effects on the other fs->flags; nor is it correct
to assume that we can unconditionally set (or clear) the "ignore csum
error" flag bit. Therefore, we must merge the previous value of the
"ignore csum error" flag with the value of flags after the call.
Note that we want to leave checksum verification on as much as
possible because doing so exposes e2fsck bugs where two metadata
blocks are "sharing" the same disk block, and attempting to fix one
before relocating the other causes major filesystem damage. The
damage is much more obvious when a previously checked piece of
metadata suddenly fails in a subsequent pass.
The modifications to the pass 2, 3, and 3A code are justified as
follows: When e2fsck encounters a block of directory entries and
cannot find the placeholder entry at the end that contains the
checksum, it will try to insert the placeholder. If that fails, it
will schedule the directory for a pass 3A reconstruction. Until that
happens, we don't want directory block writing (pass 2), block
iteration (pass 3), or block reading (pass 3A) to fail due to checksum
errors, because failing to find the placeholder is itself a checksum
verification error, which causes e2fsck to abort without fixing
anything.
The e2fsck call to ext2fs_read_bitmaps must never fail due to a
checksum error because e2fsck subsequently (a) verifies the bitmaps
itself; or (b) decides that they don't match what has been observed,
and rewrites them.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Add a new behavior flag to the inode scan functions; when specified,
this flag will do some simple sanity checking of entire inode table
blocks. If all the checksums are ok, we can skip checksum
verification on individual inodes later on. If more than half of the
inodes look "insane" (bad extent tree root or checksum failure) then
ext2fs_get_next_inode_full() can return a special status code
indicating that what's in the buffer is probably garbage.
When e2fsck' inode scan encounters the 'inode is garbage' return code
it'll offer to zap the inode straightaway instead of trying to recover
anything. This replaces the previous behavior of asking to zap
anything with a checksum error (strict_csum).
Signed-off-by: Darrick J. Wong <darrick.wong@orale.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently we have already exported inode cache flush and free functions
for users. This commit exports inode cache creation function. Later
we will use this function to initialize inode cache and do some unit
tests for inline data.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Fix all the places where we should be using a blk64_t instead of a
blk_t. These fixes are more severe because 64bit values could be
truncated silently.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The ext2fs_read_inode_full() function should not use fs->read_inode()
if the caller has requested more than the base 128 byte inode
structure and the inode size is greater than 128 bytes. Otherwise the
caller won't get all of the bytes that they were asking for, since
there's no way for the fs->read_inode override function can know what
the size of the buffer passed to ext2fs_read_inode_full().
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The changes to support metadata checksum allocated a single large
array for all of the inodes in the inode cache. This is slightly more
efficient, but given that the inode cache is small (only 4 inodes) it
doesn't really have that much benefit. The problem with doing things
this way is that the memory overruns, such as the one fixed in commit
43c4910371, do not get detected by valgrind.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
An inode cache slot will be overrun if a caller to ext2fs_read_inode_full()
or ext2fs_write_inode_full() attempts to read or write a full sized 156
byte inode when the target filesystem contains 128 byte inodes. Limit the
copied inode to the smaller of the target filesystem's or the caller's
requested inode size.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Change the block group algorithm to use the same algorithm as the rest
of the metadata_csum. This mostly involves providing a helper
function to tell if group descriptors should have checksums set or
verified, and modifying the gdt checksum code to use the correct
algorithm.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch adds the ability for the libext2fs functions to read and
write the inode checksum.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Change libext2fs to read and write full-size inodes in preparation for
the metadata checksumming patchset, which will require this. Due to
ABI compatibility requirements, this change must be hidden from client
programs.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Create a new function, io_channel_alloc_buf() which allocates I/O
buffers with appropriate alignment if we are using direct I/O. The
original code was sometimes using a larger alignment factor than
necessary, and would always request an aligned memory buffer even when
it was not necessary since the block device was not opened with
O_DIRECT.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The DEFS line in MCONFIG had gotten so long that it exceeded 4k, and
this was starting to cause some tools heartburn. It also made "make
V=1" almost useless, since trying to following the individual commands
run by make was lost in the noise of all of the defines.
So fix this by putting the configure-generated defines in lib/config.h
and the directory pathnames to lib/dirpaths.h.
In addition, clean up some vestigal defines in configure.in and in the
Makefiles to further shorten the cc command lines.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In inode_open(), if the allocation of &io fails, we go to cleanup
and dereference io to test io->name, which is a bug.
Similarly in undo_open() if allocation of &data fails, we
go to cleanup and dereference data to test data->real.
In the test_open() case we explicitly set retval to the only
possible error return from ext2fs_get_mem(), so remove that
for tidiness.
The other changes just make make earlier returns go through
the error goto for consistency.
In many cases we returned directly from the first error, but
"goto cleanup" etc for every subsequent error. In some
cases this leads to "impossible" tests such as:
if (ptr)
ext2fs_free_mem(&ptr)
on paths where ptr cannot be null because we would have
returned directly earlier, and Coverity flags this.
This isn't really indicative of an error in most cases, but
I think it can be clearer to always exit through the error goto
if it's used later in the function.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Allocate various memory structures to be properly aligned to avoid
needing to use a bounce buffer when doing direct I/O read/writes.
This should also help on FreeBSD systems which require aligned buffers
unconditionally.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The top-level COPYING file states that the e2p and ext2fs libraries
are available under the LGPLv2. The files were incorrectly labelled.
Alex Thomas/Luster has been consulted wrt to the ext3_extents.h file;
the rest of the files were primarily authored by Theodore Ts'o.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
After cleaning up ext2fs_bg_flag_set() and ext2fs_bg_flag_clear(),
we're left with ext2fs_bg_flag_test(). Convert it to
ext2fs_bg_flags_test().
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Some of these could affect filesystems between 2^31 and 2^32-1 blocks.
Thanks to Valerie Aurora Henson for pointing out the problems in
lib/ext2fs/alloc_tables.c, which led me to do a "make gcc-wall" scan
over the source tree.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
As Li Zefan <lizf@cn.fujitsu.com> reported, the creation timestamp was
not getting set on the lost+found inode. This patch makes sure all of
the timestamps are appropriately set.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
It looks like the right place to check for ino=0 in
ext2fs_read_inode_full() is before creating the inode cache, otherwise
since we set icache[i].ino = 0 in create_icache(), it will match the
loop below and thus we return a wrong value.
Signed-off-by: "Manish Katiyar" <mkatiyar@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Previously, the portion of the inode table for block group 0 was
always completely zero'ed out, so the ext2fs_open_inode_scan() didn't
handle a non-zero bg_itable_used value for the first block group. Fix
this.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This simplifies the code, and using the uninit_bg with the inode table
lazily initialized is just as good.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This addresses a potential security vulnerability where an untrusted
filesystem can be corrupted in such a way that a program using
libext2fs will allocate a buffer which is far too small. This can
lead to either a crash or potentially a heap-based buffer overflow
crash. No known exploits exist, but main concern is where an
untrusted user who possesses privileged access in a guest Xen
environment could corrupt a filesystem which is then accessed by the
pygrub program, running as root in the dom0 host environment, thus
allowing the untrusted user to gain privileged access in the host OS.
Thanks to the McAfee AVERT Research group for reporting this issue.
Addresses CVE-2007-5497.
Signed-off-by: Rafal Wojtczuk <rafal_wojtczuk@mcafee.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
On big-endian systems, while swapping, ext2fs_swap_inode_full() swaps
only 128+extra_isize bytes and the EAs if they are present. Now if inode
N has EAs, (and this is the inode in the "scratch inode") then inode N+1
also carries seems to have them since the "scratch inode" was never
zeroed.
Signed-off-by: Kalpak Shah <kalpak@clusterfs.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The following patch addresses a memory leak in libext2fs
that occurs when using ext2fs_write_new_inode() on a file system
configured with large inodes.
Signed-off-by: Jim Garlick <garlick@llnl.gov>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This feature is initially intended for testing purposes; it allows an
ext2/ext3 developer to create very large filesystems using sparse files
where most of the block groups are not initialized and so do not require
much disk space. Eventually it could be used as a way of speeding up
mke2fs and e2fsck for large filesystem, but that would be best done by
adding an RO_COMPAT extension to the filesystem to allow the inode table
to be lazily initialized on a per-block basis, instead of being entirely initialized
or entirely unused on a per-blockgroup basis.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>