Compare commits

...

14 Commits

Author SHA1 Message Date
Vitaliy Filippov 3668203dc0 Return error on failed shrink 2024-03-07 18:09:05 +03:00
Vitaliy Filippov c623c969bd Implement rename over an existing file/directory 2024-03-07 18:09:05 +03:00
Vitaliy Filippov d9df554026 Support --logfile in nfs-proxy 2024-03-07 18:09:05 +03:00
Vitaliy Filippov bc3e340b9d Fix shared file overlap, add FIXMEs 2024-03-07 18:09:05 +03:00
Vitaliy Filippov b14b8ad4a0 Create inode, then direntry, not direntry, then inode; retry ID collisions 2024-03-07 18:09:05 +03:00
Vitaliy Filippov b19d4f19b4 Fix NFS shared/aligned write FIXMEs 2024-03-07 18:09:05 +03:00
Vitaliy Filippov dbacb23ac0 Allow to disable per-inode stats for VitastorFS pools 2024-03-07 18:09:05 +03:00
Vitaliy Filippov 8b22200e0e Add basic NFS tests, fix bugs 2024-03-07 18:09:05 +03:00
Vitaliy Filippov cdf2192788 Return block NFS implementation back as an option too 2024-03-07 18:09:05 +03:00
Vitaliy Filippov a5cfd2a1f2 Move KV FS header into a separate file 2024-03-07 18:09:05 +03:00
Vitaliy Filippov b8fcfe435b Implement packing small files into shared inodes 2024-03-07 18:09:05 +03:00
Vitaliy Filippov dc93c4a7e3 Split new NFS proxy implementation into multiple files 2024-03-07 18:09:05 +03:00
Vitaliy Filippov 0c38bc20b2 WIP VitastorFS with metadata storage in VitastorKV 2024-03-07 18:09:05 +03:00
Vitaliy Filippov 56b7b18adf Fix vitastor-kv hang on reopen & unfinished closed listing 2024-03-07 18:09:05 +03:00
43 changed files with 4357 additions and 449 deletions

View File

@ -22,7 +22,7 @@ RUN apt-get update
 RUN apt-get -y install etcd qemu-system-x86 qemu-block-extra qemu-utils fio libasan5 \
 liburing1 liburing-dev libgoogle-perftools-dev devscripts libjerasure-dev cmake libibverbs-dev libisal-dev
 RUN apt-get -y build-dep fio qemu=`dpkg -s qemu-system-x86|grep ^Version:|awk '{print $2}'`
-RUN apt-get -y install jq lp-solve sudo
+RUN apt-get -y install jq lp-solve sudo nfs-common
 RUN apt-get --download-only source fio qemu=`dpkg -s qemu-system-x86|grep ^Version:|awk '{print $2}'`
 RUN set -ex; \

View File

@ -856,3 +856,21 @@ jobs:
echo "" echo ""
done done
test_nfs:
runs-on: ubuntu-latest
needs: build
container: ${{env.TEST_IMAGE}}:${{github.sha}}
steps:
- name: Run test
id: test
timeout-minutes: 3
run: /root/vitastor/tests/test_nfs.sh
- name: Print logs
if: always() && steps.test.outcome == 'failure'
run: |
for i in /root/vitastor/testdata/*.log /root/vitastor/testdata/*.txt; do
echo "-------- $i --------"
cat $i
echo ""
done

View File

@ -267,6 +267,7 @@ Optional parameters:
| `--immediate_commit none` | Put pool only on OSDs with this or larger immediate_commit (none < small < all) | | `--immediate_commit none` | Put pool only on OSDs with this or larger immediate_commit (none < small < all) |
| `--primary_affinity_tags tags` | Prefer to put primary copies on OSDs with all specified tags | | `--primary_affinity_tags tags` | Prefer to put primary copies on OSDs with all specified tags |
| `--scrub_interval <time>` | Enable regular scrubbing for this pool. Format: number + unit s/m/h/d/M/y | | `--scrub_interval <time>` | Enable regular scrubbing for this pool. Format: number + unit s/m/h/d/M/y |
| `--no_inode_stats 1` | Disable per-inode statistics for this pool (use for VitastorFS pools) |
| `--pg_stripe_size <number>` | Increase object grouping stripe | | `--pg_stripe_size <number>` | Increase object grouping stripe |
| `--max_osd_combinations 10000` | Maximum number of random combinations for LP solver input | | `--max_osd_combinations 10000` | Maximum number of random combinations for LP solver input |
| `--wait` | Wait for the new pool to come online | | `--wait` | Wait for the new pool to come online |
@ -288,7 +289,7 @@ Modify an existing pool. Modifiable parameters:
``` ```
[-s|--pg_size <number>] [--pg_minsize <number>] [-n|--pg_count <count>] [-s|--pg_size <number>] [--pg_minsize <number>] [-n|--pg_count <count>]
[--failure_domain <level>] [--root_node <node>] [--osd_tags <tags>] [--failure_domain <level>] [--root_node <node>] [--osd_tags <tags>] [--no_inode_stats 0|1]
[--max_osd_combinations <number>] [--primary_affinity_tags <tags>] [--scrub_interval <time>] [--max_osd_combinations <number>] [--primary_affinity_tags <tags>] [--scrub_interval <time>]
``` ```

View File

@ -1737,8 +1737,11 @@ class Mon
 for (const inode_num in this.state.osd.space[osd_num][pool_id])
 {
 const u = BigInt(this.state.osd.space[osd_num][pool_id][inode_num]||0);
-inode_stats[pool_id][inode_num] = inode_stats[pool_id][inode_num] || inode_stub();
-inode_stats[pool_id][inode_num].raw_used += u;
+if (inode_num)
+{
+inode_stats[pool_id][inode_num] = inode_stats[pool_id][inode_num] || inode_stub();
+inode_stats[pool_id][inode_num].raw_used += u;
+}
 this.state.pool.stats[pool_id].used_raw_tb += u;
 }
 }

View File

@ -185,29 +185,48 @@ target_link_libraries(vitastor-nbd
vitastor_client vitastor_client
) )
# vitastor-kv # libvitastor_kv.so
add_executable(vitastor-kv add_library(vitastor_kv SHARED
kv_cli.cpp
kv_db.cpp kv_db.cpp
kv_db.h kv_db.h
) )
target_link_libraries(vitastor-kv target_link_libraries(vitastor_kv
vitastor_client vitastor_client
) )
set_target_properties(vitastor_kv PROPERTIES VERSION ${VERSION} SOVERSION 0)
# vitastor-kv
add_executable(vitastor-kv
kv_cli.cpp
)
target_link_libraries(vitastor-kv
vitastor_kv
)
add_executable(vitastor-kv-stress add_executable(vitastor-kv-stress
kv_stress.cpp kv_stress.cpp
kv_db.cpp
kv_db.h
) )
target_link_libraries(vitastor-kv-stress target_link_libraries(vitastor-kv-stress
vitastor_client vitastor_kv
) )
# vitastor-nfs # vitastor-nfs
add_executable(vitastor-nfs add_executable(vitastor-nfs
nfs_proxy.cpp nfs_proxy.cpp
nfs_conn.cpp nfs_block.cpp
nfs_kv.cpp
nfs_kv_create.cpp
nfs_kv_getattr.cpp
nfs_kv_link.cpp
nfs_kv_lookup.cpp
nfs_kv_read.cpp
nfs_kv_readdir.cpp
nfs_kv_remove.cpp
nfs_kv_rename.cpp
nfs_kv_setattr.cpp
nfs_kv_write.cpp
nfs_fsstat.cpp
nfs_mount.cpp
nfs_portmap.cpp nfs_portmap.cpp
sha256.c sha256.c
nfs/xdr_impl.cpp nfs/xdr_impl.cpp
@ -217,6 +236,7 @@ add_executable(vitastor-nfs
) )
target_link_libraries(vitastor-nfs target_link_libraries(vitastor-nfs
vitastor_client vitastor_client
vitastor_kv
) )
# vitastor-cli # vitastor-cli

View File

@ -82,3 +82,8 @@ uint32_t blockstore_t::get_bitmap_granularity()
{ {
return impl->get_bitmap_granularity(); return impl->get_bitmap_granularity();
} }
void blockstore_t::set_no_inode_stats(const std::vector<uint64_t> & pool_ids)
{
impl->set_no_inode_stats(pool_ids);
}

View File

@ -216,6 +216,9 @@ public:
// Get per-inode space usage statistics // Get per-inode space usage statistics
std::map<uint64_t, uint64_t> & get_inode_space_stats(); std::map<uint64_t, uint64_t> & get_inode_space_stats();
// Set per-pool no_inode_stats
void set_no_inode_stats(const std::vector<uint64_t> & pool_ids);
// Print diagnostics to stdout // Print diagnostics to stdout
void dump_diagnostics(); void dump_diagnostics();

View File

@ -733,3 +733,86 @@ void blockstore_impl_t::disk_error_abort(const char *op, int retval, int expecte
fprintf(stderr, "Disk %s failed: result is %d, expected %d. Can't continue, sorry :-(\n", op, retval, expected); fprintf(stderr, "Disk %s failed: result is %d, expected %d. Can't continue, sorry :-(\n", op, retval, expected);
exit(1); exit(1);
} }
void blockstore_impl_t::set_no_inode_stats(const std::vector<uint64_t> & pool_ids)
{
for (auto & np: no_inode_stats)
{
np.second = 2;
}
for (auto pool_id: pool_ids)
{
if (!no_inode_stats[pool_id])
recalc_inode_space_stats(pool_id, false);
no_inode_stats[pool_id] = 1;
}
for (auto np_it = no_inode_stats.begin(); np_it != no_inode_stats.end(); )
{
if (np_it->second == 2)
{
recalc_inode_space_stats(np_it->first, true);
no_inode_stats.erase(np_it++);
}
else
np_it++;
}
}
void blockstore_impl_t::recalc_inode_space_stats(uint64_t pool_id, bool per_inode)
{
auto sp_begin = inode_space_stats.lower_bound((pool_id << (64-POOL_ID_BITS)));
auto sp_end = inode_space_stats.lower_bound(((pool_id+1) << (64-POOL_ID_BITS)));
inode_space_stats.erase(sp_begin, sp_end);
auto sh_it = clean_db_shards.lower_bound((pool_id << (64-POOL_ID_BITS)));
while (sh_it != clean_db_shards.end() &&
(sh_it->first >> (64-POOL_ID_BITS)) == pool_id)
{
for (auto & pair: sh_it->second)
{
uint64_t space_id = per_inode ? pair.first.inode : (pool_id << (64-POOL_ID_BITS));
inode_space_stats[space_id] += dsk.data_block_size;
}
sh_it++;
}
object_id last_oid = {};
bool last_exists = false;
auto dirty_it = dirty_db.lower_bound((obj_ver_id){ .oid = { .inode = (pool_id << (64-POOL_ID_BITS)) } });
while (dirty_it != dirty_db.end() && (dirty_it->first.oid.inode >> (64-POOL_ID_BITS)) == pool_id)
{
if (IS_STABLE(dirty_it->second.state) && (IS_BIG_WRITE(dirty_it->second.state) || IS_DELETE(dirty_it->second.state)))
{
bool exists = false;
if (last_oid == dirty_it->first.oid)
{
exists = last_exists;
}
else
{
auto & clean_db = clean_db_shard(dirty_it->first.oid);
auto clean_it = clean_db.find(dirty_it->first.oid);
exists = clean_it != clean_db.end();
}
uint64_t space_id = per_inode ? dirty_it->first.oid.inode : (pool_id << (64-POOL_ID_BITS));
if (IS_BIG_WRITE(dirty_it->second.state))
{
if (!exists)
inode_space_stats[space_id] += dsk.data_block_size;
last_exists = true;
}
else
{
if (exists)
{
auto & sp = inode_space_stats[space_id];
if (sp > dsk.data_block_size)
sp -= dsk.data_block_size;
else
inode_space_stats.erase(space_id);
}
last_exists = false;
}
last_oid = dirty_it->first.oid;
}
dirty_it++;
}
}
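
The `set_no_inode_stats()` function added in this hunk reconciles the stored set of pools against a new list in three passes: every known pool is first marked with 2, pools present in the new list are set to 1 (and recalculated if they were not known before), and pools still marked 2 are recalculated back to per-inode mode and erased. A minimal standalone sketch of that pattern follows. The `stats_flags_t` type and its `recalc_log` are hypothetical stand-ins: instead of touching real statistics, the sketch only records which recalculations would be triggered.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the blockstore: it only logs the
// recalculations that would be requested, it does not compute statistics.
struct stats_flags_t
{
    std::map<uint64_t, int> no_inode_stats; // pool_id -> 1 (listed) or 2 (stale mark)
    std::vector<std::string> recalc_log;

    void recalc(uint64_t pool_id, bool per_inode)
    {
        recalc_log.push_back(std::to_string(pool_id) + (per_inode ? ":per_inode" : ":per_pool"));
    }

    void set_no_inode_stats(const std::vector<uint64_t> & pool_ids)
    {
        // Pass 1: mark every known pool as possibly stale
        for (auto & np: no_inode_stats)
            np.second = 2;
        // Pass 2: listed pools become active; previously unknown ones are recalculated
        for (auto pool_id: pool_ids)
        {
            if (!no_inode_stats[pool_id])
                recalc(pool_id, false);
            no_inode_stats[pool_id] = 1;
        }
        // Pass 3: pools still marked stale were dropped from the list
        for (auto it = no_inode_stats.begin(); it != no_inode_stats.end(); )
        {
            if (it->second == 2)
            {
                recalc(it->first, true);
                it = no_inode_stats.erase(it);
            }
            else
                ++it;
        }
    }
};
```

Calling `set_no_inode_stats({ 1, 2 })` and then `set_no_inode_stats({ 2, 3 })` on this sketch leaves pool 2 untouched by the second call, switches pool 3 to per-pool accounting and switches pool 1 back to per-inode accounting.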

View File

@ -272,6 +272,7 @@ class blockstore_impl_t
std::map<pool_id_t, pool_shard_settings_t> clean_db_settings; std::map<pool_id_t, pool_shard_settings_t> clean_db_settings;
std::map<pool_pg_id_t, blockstore_clean_db_t> clean_db_shards; std::map<pool_pg_id_t, blockstore_clean_db_t> clean_db_shards;
std::map<uint64_t, int> no_inode_stats;
uint8_t *clean_bitmaps = NULL; uint8_t *clean_bitmaps = NULL;
blockstore_dirty_db_t dirty_db; blockstore_dirty_db_t dirty_db;
std::vector<blockstore_op_t*> submit_queue; std::vector<blockstore_op_t*> submit_queue;
@ -318,6 +319,7 @@ class blockstore_impl_t
blockstore_clean_db_t& clean_db_shard(object_id oid); blockstore_clean_db_t& clean_db_shard(object_id oid);
void reshard_clean_db(pool_id_t pool_id, uint32_t pg_count, uint32_t pg_stripe_size); void reshard_clean_db(pool_id_t pool_id, uint32_t pg_count, uint32_t pg_stripe_size);
void recalc_inode_space_stats(uint64_t pool_id, bool per_inode);
// Journaling // Journaling
void prepare_journal_sector_write(int sector, blockstore_op_t *op); void prepare_journal_sector_write(int sector, blockstore_op_t *op);
@ -428,6 +430,9 @@ public:
// Space usage statistics // Space usage statistics
std::map<uint64_t, uint64_t> inode_space_stats; std::map<uint64_t, uint64_t> inode_space_stats;
// Set per-pool no_inode_stats
void set_no_inode_stats(const std::vector<uint64_t> & pool_ids);
// Print diagnostics to stdout // Print diagnostics to stdout
void dump_diagnostics(); void dump_diagnostics();

View File

@ -487,18 +487,24 @@ void blockstore_impl_t::mark_stable(obj_ver_id v, bool forget_dirty)
 }
 if (!exists)
 {
-inode_space_stats[dirty_it->first.oid.inode] += dsk.data_block_size;
+uint64_t space_id = dirty_it->first.oid.inode;
+if (no_inode_stats[dirty_it->first.oid.inode >> (64-POOL_ID_BITS)])
+space_id = space_id & ~(((uint64_t)1 << (64-POOL_ID_BITS)) - 1);
+inode_space_stats[space_id] += dsk.data_block_size;
 used_blocks++;
 }
 big_to_flush++;
 }
 else if (IS_DELETE(dirty_it->second.state))
 {
-auto & sp = inode_space_stats[dirty_it->first.oid.inode];
+uint64_t space_id = dirty_it->first.oid.inode;
+if (no_inode_stats[dirty_it->first.oid.inode >> (64-POOL_ID_BITS)])
+space_id = space_id & ~(((uint64_t)1 << (64-POOL_ID_BITS)) - 1);
+auto & sp = inode_space_stats[space_id];
 if (sp > dsk.data_block_size)
 sp -= dsk.data_block_size;
 else
-inode_space_stats.erase(dirty_it->first.oid.inode);
+inode_space_stats.erase(space_id);
 used_blocks--;
 big_to_flush++;
 }
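
The hunk above changes the key used in `inode_space_stats`: normally it is the full inode number, but when per-inode statistics are disabled for the pool, the low bits are cleared so that all usage in the pool accumulates under a single key. The sketch below illustrates that masking in isolation. The value `POOL_ID_BITS = 16` is an assumption made for the example (the diff does not show the constant), and `pool_of()` and `space_key()` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Assumed value, for illustration only; the diff does not show the real constant.
constexpr unsigned POOL_ID_BITS = 16;

// The pool ID occupies the top POOL_ID_BITS bits of a 64-bit inode number
constexpr uint64_t pool_of(uint64_t inode)
{
    return inode >> (64 - POOL_ID_BITS);
}

// Hypothetical helper: the key under which an inode's space usage is accounted.
// With no_inode_stats set, every inode of a pool maps to the same key.
constexpr uint64_t space_key(uint64_t inode, bool no_inode_stats)
{
    return no_inode_stats
        ? inode & ~(((uint64_t)1 << (64 - POOL_ID_BITS)) - 1)
        : inode;
}
```

Under these assumptions, two different inodes of pool 3 yield distinct keys with per-inode statistics enabled and the same key, `3 << 48`, with them disabled.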

View File

@ -131,6 +131,7 @@ static const char* help_text =
" --immediate_commit none Put pool only on OSDs with this or larger immediate_commit (none < small < all)\n" " --immediate_commit none Put pool only on OSDs with this or larger immediate_commit (none < small < all)\n"
" --primary_affinity_tags tags Prefer to put primary copies on OSDs with all specified tags\n" " --primary_affinity_tags tags Prefer to put primary copies on OSDs with all specified tags\n"
" --scrub_interval <time> Enable regular scrubbing for this pool. Format: number + unit s/m/h/d/M/y\n" " --scrub_interval <time> Enable regular scrubbing for this pool. Format: number + unit s/m/h/d/M/y\n"
" --no_inode_stats 1 Disable per-inode statistics for this pool (use for VitastorFS pools)\n"
" --pg_stripe_size <number> Increase object grouping stripe\n" " --pg_stripe_size <number> Increase object grouping stripe\n"
" --max_osd_combinations 10000 Maximum number of random combinations for LP solver input\n" " --max_osd_combinations 10000 Maximum number of random combinations for LP solver input\n"
" --wait Wait for the new pool to come online\n" " --wait Wait for the new pool to come online\n"
@ -142,7 +143,7 @@ static const char* help_text =
"vitastor-cli modify-pool|pool-modify <id|name> [--name <new_name>] [PARAMETERS...]\n" "vitastor-cli modify-pool|pool-modify <id|name> [--name <new_name>] [PARAMETERS...]\n"
" Modify an existing pool. Modifiable parameters:\n" " Modify an existing pool. Modifiable parameters:\n"
" [-s|--pg_size <number>] [--pg_minsize <number>] [-n|--pg_count <count>]\n" " [-s|--pg_size <number>] [--pg_minsize <number>] [-n|--pg_count <count>]\n"
" [--failure_domain <level>] [--root_node <node>] [--osd_tags <tags>]\n" " [--failure_domain <level>] [--root_node <node>] [--osd_tags <tags>] [--no_inode_stats 0|1]\n"
" [--max_osd_combinations <number>] [--primary_affinity_tags <tags>] [--scrub_interval <time>]\n" " [--max_osd_combinations <number>] [--primary_affinity_tags <tags>] [--scrub_interval <time>]\n"
" Non-modifiable parameters (changing them WILL lead to data loss):\n" " Non-modifiable parameters (changing them WILL lead to data loss):\n"
" [--block_size <size>] [--bitmap_granularity <size>]\n" " [--block_size <size>] [--bitmap_granularity <size>]\n"

View File

@ -153,6 +153,7 @@ void cli_tool_t::loop_and_wait(std::function<bool(cli_result_t &)> loop_cb, std:
ringloop->unregister_consumer(&looper->consumer); ringloop->unregister_consumer(&looper->consumer);
looper->loop_cb = NULL; looper->loop_cb = NULL;
looper->complete_cb(looper->result); looper->complete_cb(looper->result);
ringloop->submit();
delete looper; delete looper;
return; return;
} }

View File

@ -81,6 +81,11 @@ std::string validate_pool_config(json11::Json::object & new_cfg, json11::Json ol
} }
value = value.uint64_value(); value = value.uint64_value();
} }
else if (key == "no_inode_stats" && value.bool_value())
{
// Leave true, remove false
value = true;
}
else if (key == "name" || key == "scheme" || key == "immediate_commit" || else if (key == "name" || key == "scheme" || key == "immediate_commit" ||
key == "failure_domain" || key == "root_node" || key == "scrub_interval") key == "failure_domain" || key == "root_node" || key == "scrub_interval")
{ {
@ -248,7 +253,7 @@ std::string validate_pool_config(json11::Json::object & new_cfg, json11::Json ol
 // immediate_commit
 if (!cfg["immediate_commit"].is_null() && !etcd_state_client_t::parse_immediate_commit(cfg["immediate_commit"].string_value()))
 {
-return "immediate_commit must be one of \"all\", \"small\", or \"none\", but it is "+cfg["scrub_interval"].as_string();
+return "immediate_commit must be one of \"all\", \"small\", or \"none\", but it is "+cfg["immediate_commit"].as_string();
 }
 // scrub_interval

View File

@ -529,6 +529,8 @@ resume_3:
st["block_size_fmt"] = format_size(st["block_size"].uint64_value()); st["block_size_fmt"] = format_size(st["block_size"].uint64_value());
if (st["bitmap_granularity"].uint64_value()) if (st["bitmap_granularity"].uint64_value())
st["bitmap_granularity_fmt"] = format_size(st["bitmap_granularity"].uint64_value()); st["bitmap_granularity_fmt"] = format_size(st["bitmap_granularity"].uint64_value());
if (st["no_inode_stats"].bool_value())
st["inode_stats_fmt"] = "disabled";
} }
// All pool parameters are only displayed in the "detailed" mode // All pool parameters are only displayed in the "detailed" mode
// because there's too many of them to show them in table // because there's too many of them to show them in table
@ -547,6 +549,7 @@ resume_3:
{ "bitmap_granularity_fmt", "Bitmap granularity" }, { "bitmap_granularity_fmt", "Bitmap granularity" },
{ "immediate_commit", "Immediate commit" }, { "immediate_commit", "Immediate commit" },
{ "scrub_interval", "Scrub interval" }, { "scrub_interval", "Scrub interval" },
{ "inode_stats_fmt", "Per-inode stats" },
{ "pg_stripe_size", "PG stripe size" }, { "pg_stripe_size", "PG stripe size" },
{ "max_osd_combinations", "Max OSD combinations" }, { "max_osd_combinations", "Max OSD combinations" },
{ "total_fmt", "Total" }, { "total_fmt", "Total" },

View File

@ -101,7 +101,7 @@ void epoll_manager_t::handle_uring_event()
 my_uring_prep_poll_add(sqe, epoll_fd, POLLIN);
 data->callback = [this](ring_data_t *data)
 {
-if (data->res < 0)
+if (data->res < 0 && data->res != -ECANCELED)
 {
 throw std::runtime_error(std::string("epoll failed: ") + strerror(-data->res));
 }
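
The changed condition above stops treating `-ECANCELED` as a fatal poll result. io_uring reports that code for requests that were cancelled, so it can show up without the epoll descriptor itself having failed. The sketch below isolates the check. `throw_on_poll_error()` is a hypothetical helper, not a function from the diff, and it covers only the condition, not whatever the real callback does afterwards.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical helper mirroring the condition in the hunk:
// any negative result except -ECANCELED is treated as fatal.
static void throw_on_poll_error(int res)
{
    if (res < 0 && res != -ECANCELED)
    {
        throw std::runtime_error(std::string("epoll failed: ") + strerror(-res));
    }
}
```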

View File

@ -863,6 +863,8 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
pc.scrub_interval = parse_time(pool_item.second["scrub_interval"].string_value()); pc.scrub_interval = parse_time(pool_item.second["scrub_interval"].string_value());
if (!pc.scrub_interval) if (!pc.scrub_interval)
pc.scrub_interval = 0; pc.scrub_interval = 0;
// Disable per-inode stats
pc.no_inode_stats = pool_item.second["no_inode_stats"].bool_value();
// Immediate Commit Mode // Immediate Commit Mode
pc.immediate_commit = pool_item.second["immediate_commit"].is_string() pc.immediate_commit = pool_item.second["immediate_commit"].is_string()
? parse_immediate_commit(pool_item.second["immediate_commit"].string_value()) ? parse_immediate_commit(pool_item.second["immediate_commit"].string_value())

View File

@ -60,6 +60,7 @@ struct pool_config_t
uint64_t pg_stripe_size; uint64_t pg_stripe_size;
std::map<pg_num_t, pg_config_t> pg_config; std::map<pg_num_t, pg_config_t> pg_config;
uint64_t scrub_interval; uint64_t scrub_interval;
bool no_inode_stats;
}; };
struct inode_config_t struct inode_config_t

View File

@ -197,6 +197,7 @@ struct kv_op_t
void exec(); void exec();
void next(); // for list void next(); // for list
~kv_op_t();
protected: protected:
int recheck_policy = KV_RECHECK_LEAF; int recheck_policy = KV_RECHECK_LEAF;
bool started = false; bool started = false;
@ -985,14 +986,24 @@ void kv_op_t::exec()
finish(-ENOSYS); finish(-ENOSYS);
} }
kv_op_t::~kv_op_t()
{
if (started && !done)
{
done = true;
db->active_ops--;
}
}
void kv_op_t::finish(int res) void kv_op_t::finish(int res)
{ {
auto db = this->db;
this->res = res; this->res = res;
this->done = true; this->done = true;
db->active_ops--; db->active_ops--;
(std::function<void(kv_op_t *)>(callback))(this);
if (!db->active_ops && db->closing) if (!db->active_ops && db->closing)
db->close(db->on_close); db->close(db->on_close);
(std::function<void(kv_op_t *)>(callback))(this);
} }
void kv_op_t::get() void kv_op_t::get()
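
Two details of this hunk are easier to see in isolation: `finish()` copies `db` into a local and invokes a copy of the callback, and the new destructor releases the `active_ops` slot of an operation that was started but never finished. The sketch below uses hypothetical `db_t` and `op_t` types rather than the real `kv_db.cpp` classes. It assumes the callback may delete the operation, which is one plausible reason for the local copies. It calls the callback before the close check, and that order is also an assumption, because the flattened diff does not make the final order unambiguous.

```cpp
#include <cassert>
#include <functional>

// Hypothetical minimal database: only the counters relevant to the pattern
struct db_t
{
    int active_ops = 0;
    bool closing = false;
    bool closed = false;
    void close() { closed = true; }
};

struct op_t
{
    db_t *db = nullptr;
    int res = 0;
    bool started = false, done = false;
    std::function<void(op_t*)> callback;

    void start()
    {
        started = true;
        db->active_ops++;
    }

    void finish(int result)
    {
        // Copy everything needed after the callback into locals first,
        // because the callback is allowed to delete the operation
        auto db = this->db;
        auto cb = callback;
        res = result;
        done = true;
        db->active_ops--;
        cb(this);
        // `this` may be gone here, so only locals are used below
        if (!db->active_ops && db->closing)
            db->close();
    }

    ~op_t()
    {
        // An operation destroyed before finishing must still release its slot
        if (started && !done)
        {
            done = true;
            db->active_ops--;
        }
    }
};
```

Because `done` is set before the callback runs, deleting the operation inside the callback does not decrement `active_ops` a second time in the destructor.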

View File

@ -147,6 +147,8 @@ json11::Json::object kv_test_t::parse_args(int narg, const char *args[])
" Fraction of key delete operations\n" " Fraction of key delete operations\n"
" --list_prob 300\n" " --list_prob 300\n"
" Fraction of listing operations\n" " Fraction of listing operations\n"
" --reopen_prob 1\n"
" Fraction of database reopens\n"
" --min_key_len 10\n" " --min_key_len 10\n"
" Minimum key size in bytes\n" " Minimum key size in bytes\n"
" --max_key_len 70\n" " --max_key_len 70\n"
@ -607,10 +609,8 @@ void kv_test_t::add_stat(kv_test_lat_t & stat, timespec tv_begin)
 int64_t usec = (tv_end.tv_sec - tv_begin.tv_sec)*1000000 +
 (tv_end.tv_nsec - tv_begin.tv_nsec)/1000;
 if (usec > 0)
-{
 stat.usec += usec;
 stat.count++;
-}
 }
void kv_test_t::print_stats(kv_test_stat_t & prev_stat, timespec & prev_stat_time) void kv_test_t::print_stats(kv_test_stat_t & prev_stat, timespec & prev_stat_time)

View File

@ -146,7 +146,7 @@ public:
" Note that nbd_timeout, nbd_max_devices and nbd_max_part options may also be specified\n" " Note that nbd_timeout, nbd_max_devices and nbd_max_part options may also be specified\n"
" in /etc/vitastor/vitastor.conf or in other configuration file specified with --config_file.\n" " in /etc/vitastor/vitastor.conf or in other configuration file specified with --config_file.\n"
" --logfile /path/to/log/file.txt\n" " --logfile /path/to/log/file.txt\n"
" Wite log messages to the specified file instead of dropping them (in background mode)\n" " Write log messages to the specified file instead of dropping them (in background mode)\n"
" or printing them to the standard output (in foreground mode).\n" " or printing them to the standard output (in foreground mode).\n"
" --dev_num N\n" " --dev_num N\n"
" Use the specified device /dev/nbdN instead of automatic selection.\n" " Use the specified device /dev/nbdN instead of automatic selection.\n"
@ -298,7 +298,7 @@ public:
 }
 }
 }
-if (cfg["logfile"].is_string())
+if (cfg["logfile"].string_value() != "")
 {
 logfile = cfg["logfile"].string_value();
 }

View File

@ -1,23 +1,18 @@
// Copyright (c) Vitaliy Filippov, 2019+ // Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details) // License: VNPL-1.1 (see README.md for details)
// //
// NFS connection handler for NFS proxy // NFS proxy over Vitastor block images
#include <sys/time.h> #include <sys/time.h>
#include "str_util.h" #include "str_util.h"
#include "nfs_proxy.h" #include "nfs_proxy.h"
#include "nfs_common.h"
#include "nfs_block.h"
#include "nfs/nfs.h" #include "nfs/nfs.h"
#include "cli.h" #include "cli.h"
#define TRUE 1
#define FALSE 0
#define MAX_REQUEST_SIZE 128*1024*1024
static unsigned len_pad4(unsigned len) static unsigned len_pad4(unsigned len)
{ {
return len + (len&3 ? 4-(len&3) : 0); return len + (len&3 ? 4-(len&3) : 0);
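
The `len_pad4()` helper kept as context above rounds a length up to the next multiple of 4, which matches XDR's rule that variable-length data is padded to 4-byte units. The sketch below repeats the helper next to a branch-free equivalent so the two can be compared. `len_pad4_mask()` is a hypothetical name introduced for the example and is not part of the diff.

```cpp
#include <cassert>

// Same expression as in the diff: add the missing bytes up to a multiple of 4
static unsigned len_pad4(unsigned len)
{
    return len + (len&3 ? 4-(len&3) : 0);
}

// Hypothetical branch-free variant: round up by adding 3 and clearing the low 2 bits
static unsigned len_pad4_mask(unsigned len)
{
    return (len + 3) & ~3u;
}
```

Both functions map 0 to 0, 1 through 4 to 4, and 5 through 8 to 8.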
@ -28,10 +23,10 @@ static std::string get_inode_name(nfs_client_t *self, diropargs3 & what)
// Get name // Get name
std::string dirhash = what.dir; std::string dirhash = what.dir;
std::string dir; std::string dir;
if (dirhash != "roothandle") if (dirhash != NFS_ROOT_HANDLE)
{ {
auto dir_it = self->parent->dir_by_hash.find(dirhash); auto dir_it = self->parent->blockfs->dir_by_hash.find(dirhash);
if (dir_it != self->parent->dir_by_hash.end()) if (dir_it != self->parent->blockfs->dir_by_hash.end())
dir = dir_it->second; dir = dir_it->second;
else else
return ""; return "";
@ -42,24 +37,9 @@ static std::string get_inode_name(nfs_client_t *self, diropargs3 & what)
: self->parent->name_prefix+name); : self->parent->name_prefix+name);
} }
static nfsstat3 vitastor_nfs_map_err(int err)
{
return (err == EINVAL ? NFS3ERR_INVAL
: (err == ENOENT ? NFS3ERR_NOENT
: (err == ENOSPC ? NFS3ERR_NOSPC
: (err == EEXIST ? NFS3ERR_EXIST
: (err == EIO ? NFS3ERR_IO : (err ? NFS3ERR_IO : NFS3_OK))))));
}
static int nfs3_null_proc(void *opaque, rpc_op_t *rop)
{
rpc_queue_reply(rop);
return 0;
}
static fattr3 get_dir_attributes(nfs_client_t *self, std::string dir) static fattr3 get_dir_attributes(nfs_client_t *self, std::string dir)
{ {
auto & dinf = self->parent->dir_info.at(dir); auto & dinf = self->parent->blockfs->dir_info.at(dir);
return (fattr3){ return (fattr3){
.type = NF3DIR, .type = NF3DIR,
.mode = 0755, .mode = 0755,
@ -108,7 +88,7 @@ static fattr3 get_file_attributes(nfs_client_t *self, inode_t inode_num)
}; };
} }
static int nfs3_getattr_proc(void *opaque, rpc_op_t *rop) static int block_nfs3_getattr_proc(void *opaque, rpc_op_t *rop)
{ {
nfs_client_t *self = (nfs_client_t*)opaque; nfs_client_t *self = (nfs_client_t*)opaque;
GETATTR3args *args = (GETATTR3args*)rop->request; GETATTR3args *args = (GETATTR3args*)rop->request;
@ -116,12 +96,12 @@ static int nfs3_getattr_proc(void *opaque, rpc_op_t *rop)
bool is_dir = false; bool is_dir = false;
std::string dirhash = args->object; std::string dirhash = args->object;
std::string dir; std::string dir;
if (args->object == "roothandle") if (args->object == NFS_ROOT_HANDLE)
is_dir = true; is_dir = true;
else else
{ {
auto dir_it = self->parent->dir_by_hash.find(dirhash); auto dir_it = self->parent->blockfs->dir_by_hash.find(dirhash);
if (dir_it != self->parent->dir_by_hash.end()) if (dir_it != self->parent->blockfs->dir_by_hash.end())
{ {
is_dir = true; is_dir = true;
dir = dir_it->second; dir = dir_it->second;
@ -140,8 +120,8 @@ static int nfs3_getattr_proc(void *opaque, rpc_op_t *rop)
else else
{ {
uint64_t inode_num = 0; uint64_t inode_num = 0;
auto inode_num_it = self->parent->inode_by_hash.find(dirhash); auto inode_num_it = self->parent->blockfs->inode_by_hash.find(dirhash);
if (inode_num_it != self->parent->inode_by_hash.end()) if (inode_num_it != self->parent->blockfs->inode_by_hash.end())
inode_num = inode_num_it->second; inode_num = inode_num_it->second;
auto inode_it = self->parent->cli->st_cli.inode_config.find(inode_num); auto inode_it = self->parent->cli->st_cli.inode_config.find(inode_num);
if (inode_num && inode_it != self->parent->cli->st_cli.inode_config.end()) if (inode_num && inode_it != self->parent->cli->st_cli.inode_config.end())
@ -179,16 +159,16 @@ static int nfs3_getattr_proc(void *opaque, rpc_op_t *rop)
return 0; return 0;
} }
static int nfs3_setattr_proc(void *opaque, rpc_op_t *rop) static int block_nfs3_setattr_proc(void *opaque, rpc_op_t *rop)
{ {
nfs_client_t *self = (nfs_client_t*)opaque; nfs_client_t *self = (nfs_client_t*)opaque;
SETATTR3args *args = (SETATTR3args*)rop->request; SETATTR3args *args = (SETATTR3args*)rop->request;
SETATTR3res *reply = (SETATTR3res*)rop->reply; SETATTR3res *reply = (SETATTR3res*)rop->reply;
std::string handle = args->object; std::string handle = args->object;
auto ino_it = self->parent->inode_by_hash.find(handle); auto ino_it = self->parent->blockfs->inode_by_hash.find(handle);
if (ino_it == self->parent->inode_by_hash.end()) if (ino_it == self->parent->blockfs->inode_by_hash.end())
{ {
if (handle == "roothandle" || self->parent->dir_by_hash.find(handle) != self->parent->dir_by_hash.end()) if (handle == NFS_ROOT_HANDLE || self->parent->blockfs->dir_by_hash.find(handle) != self->parent->blockfs->dir_by_hash.end())
{ {
if (args->new_attributes.size.set_it) if (args->new_attributes.size.set_it)
{ {
@ -228,7 +208,7 @@ static int nfs3_setattr_proc(void *opaque, rpc_op_t *rop)
return 0; return 0;
} }
static int nfs3_lookup_proc(void *opaque, rpc_op_t *rop) static int block_nfs3_lookup_proc(void *opaque, rpc_op_t *rop)
{ {
nfs_client_t *self = (nfs_client_t*)opaque; nfs_client_t *self = (nfs_client_t*)opaque;
LOOKUP3args *args = (LOOKUP3args*)rop->request; LOOKUP3args *args = (LOOKUP3args*)rop->request;
@ -255,8 +235,8 @@ static int nfs3_lookup_proc(void *opaque, rpc_op_t *rop)
return 0; return 0;
} }
} }
auto dir_it = self->parent->dir_info.find(full_name); auto dir_it = self->parent->blockfs->dir_info.find(full_name);
if (dir_it != self->parent->dir_info.end()) if (dir_it != self->parent->blockfs->dir_info.end())
{ {
*reply = (LOOKUP3res){ *reply = (LOOKUP3res){
.status = NFS3_OK, .status = NFS3_OK,
@ -277,7 +257,7 @@ static int nfs3_lookup_proc(void *opaque, rpc_op_t *rop)
return 0; return 0;
} }
static int nfs3_access_proc(void *opaque, rpc_op_t *rop) static int block_nfs3_access_proc(void *opaque, rpc_op_t *rop)
{ {
//nfs_client_t *self = (nfs_client_t*)opaque; //nfs_client_t *self = (nfs_client_t*)opaque;
ACCESS3args *args = (ACCESS3args*)rop->request; ACCESS3args *args = (ACCESS3args*)rop->request;
@ -292,7 +272,7 @@ static int nfs3_access_proc(void *opaque, rpc_op_t *rop)
return 0; return 0;
} }
static int nfs3_readlink_proc(void *opaque, rpc_op_t *rop) static int block_nfs3_readlink_proc(void *opaque, rpc_op_t *rop)
{ {
//nfs_client_t *self = (nfs_client_t*)opaque; //nfs_client_t *self = (nfs_client_t*)opaque;
//READLINK3args *args = (READLINK3args*)rop->request; //READLINK3args *args = (READLINK3args*)rop->request;
@ -303,14 +283,14 @@ static int nfs3_readlink_proc(void *opaque, rpc_op_t *rop)
return 0; return 0;
} }
static int nfs3_read_proc(void *opaque, rpc_op_t *rop) static int block_nfs3_read_proc(void *opaque, rpc_op_t *rop)
{ {
nfs_client_t *self = (nfs_client_t*)opaque; nfs_client_t *self = (nfs_client_t*)opaque;
READ3args *args = (READ3args*)rop->request; READ3args *args = (READ3args*)rop->request;
READ3res *reply = (READ3res*)rop->reply; READ3res *reply = (READ3res*)rop->reply;
std::string handle = args->file; std::string handle = args->file;
auto ino_it = self->parent->inode_by_hash.find(handle); auto ino_it = self->parent->blockfs->inode_by_hash.find(handle);
if (ino_it == self->parent->inode_by_hash.end()) if (ino_it == self->parent->blockfs->inode_by_hash.end())
{ {
*reply = (READ3res){ .status = NFS3ERR_NOENT }; *reply = (READ3res){ .status = NFS3ERR_NOENT };
rpc_queue_reply(rop); rpc_queue_reply(rop);
@ -367,14 +347,14 @@ static int nfs3_read_proc(void *opaque, rpc_op_t *rop)
static void nfs_resize_write(nfs_client_t *self, rpc_op_t *rop, uint64_t inode, uint64_t new_size, uint64_t offset, uint64_t count, void *buf); static void nfs_resize_write(nfs_client_t *self, rpc_op_t *rop, uint64_t inode, uint64_t new_size, uint64_t offset, uint64_t count, void *buf);
static int nfs3_write_proc(void *opaque, rpc_op_t *rop) static int block_nfs3_write_proc(void *opaque, rpc_op_t *rop)
{ {
nfs_client_t *self = (nfs_client_t*)opaque; nfs_client_t *self = (nfs_client_t*)opaque;
WRITE3args *args = (WRITE3args*)rop->request; WRITE3args *args = (WRITE3args*)rop->request;
WRITE3res *reply = (WRITE3res*)rop->reply; WRITE3res *reply = (WRITE3res*)rop->reply;
std::string handle = args->file; std::string handle = args->file;
auto ino_it = self->parent->inode_by_hash.find(handle); auto ino_it = self->parent->blockfs->inode_by_hash.find(handle);
if (ino_it == self->parent->inode_by_hash.end()) if (ino_it == self->parent->blockfs->inode_by_hash.end())
{ {
*reply = (WRITE3res){ .status = NFS3ERR_NOENT }; *reply = (WRITE3res){ .status = NFS3ERR_NOENT };
rpc_queue_reply(rop); rpc_queue_reply(rop);
@ -480,8 +460,8 @@ static void complete_extend_write(nfs_client_t *self, rpc_op_t *rop, inode_t ino
static void complete_extend_inode(nfs_client_t *self, uint64_t inode, uint64_t new_size, int err) static void complete_extend_inode(nfs_client_t *self, uint64_t inode, uint64_t new_size, int err)
{ {
auto ext_it = self->extend_writes.lower_bound((extend_size_t){ .inode = inode, .new_size = 0 }); auto ext_it = self->parent->blockfs->extend_writes.lower_bound((extend_size_t){ .inode = inode, .new_size = 0 });
while (ext_it != self->extend_writes.end() && while (ext_it != self->parent->blockfs->extend_writes.end() &&
ext_it->first.inode == inode && ext_it->first.inode == inode &&
ext_it->first.new_size <= new_size) ext_it->first.new_size <= new_size)
{ {
@ -490,7 +470,7 @@ static void complete_extend_inode(nfs_client_t *self, uint64_t inode, uint64_t n
{ {
complete_extend_write(self, ext_it->second.rop, inode, ext_it->second.write_res < 0 complete_extend_write(self, ext_it->second.rop, inode, ext_it->second.write_res < 0
? ext_it->second.write_res : ext_it->second.resize_res); ? ext_it->second.write_res : ext_it->second.resize_res);
self->extend_writes.erase(ext_it++); self->parent->blockfs->extend_writes.erase(ext_it++);
} }
else else
ext_it++; ext_it++;
@ -500,7 +480,7 @@ static void complete_extend_inode(nfs_client_t *self, uint64_t inode, uint64_t n
static void extend_inode(nfs_client_t *self, uint64_t inode, uint64_t new_size) static void extend_inode(nfs_client_t *self, uint64_t inode, uint64_t new_size)
{ {
// Send an extend request // Send an extend request
auto & ext = self->extends[inode]; auto & ext = self->parent->blockfs->extends[inode];
ext.cur_extend = new_size; ext.cur_extend = new_size;
auto inode_it = self->parent->cli->st_cli.inode_config.find(inode); auto inode_it = self->parent->cli->st_cli.inode_config.find(inode);
if (inode_it != self->parent->cli->st_cli.inode_config.end() && if (inode_it != self->parent->cli->st_cli.inode_config.end() &&
@ -514,10 +494,10 @@ static void extend_inode(nfs_client_t *self, uint64_t inode, uint64_t new_size)
{ "force_size", true }, { "force_size", true },
}), [=](const cli_result_t & r) }), [=](const cli_result_t & r)
{ {
auto & ext = self->extends[inode]; auto & ext = self->parent->blockfs->extends[inode];
if (r.err) if (r.err)
{ {
fprintf(stderr, "Error extending inode %ju to %ju bytes: %s\n", inode, new_size, r.text.c_str()); fprintf(stderr, "Error extending inode %lu to %lu bytes: %s\n", inode, new_size, r.text.c_str());
} }
if (r.err == EAGAIN || ext.next_extend > ext.cur_extend) if (r.err == EAGAIN || ext.next_extend > ext.cur_extend)
{ {
@ -548,7 +528,7 @@ static void nfs_do_write(nfs_client_t *self, std::multimap<extend_size_t, extend
{ {
auto inode = op->inode; auto inode = op->inode;
int write_res = op->retval < 0 ? op->retval : (op->retval != op->len ? -ERANGE : 0); int write_res = op->retval < 0 ? op->retval : (op->retval != op->len ? -ERANGE : 0);
if (ewr_it == self->extend_writes.end()) if (ewr_it == self->parent->blockfs->extend_writes.end())
{ {
complete_extend_write(self, rop, inode, write_res); complete_extend_write(self, rop, inode, write_res);
} }
@ -558,7 +538,7 @@ static void nfs_do_write(nfs_client_t *self, std::multimap<extend_size_t, extend
if (ewr_it->second.resize_res <= 0) if (ewr_it->second.resize_res <= 0)
{ {
complete_extend_write(self, rop, inode, write_res < 0 ? write_res : ewr_it->second.resize_res); complete_extend_write(self, rop, inode, write_res < 0 ? write_res : ewr_it->second.resize_res);
self->extend_writes.erase(ewr_it); self->parent->blockfs->extend_writes.erase(ewr_it);
} }
} }
}; };
@ -572,7 +552,7 @@ static void nfs_resize_write(nfs_client_t *self, rpc_op_t *rop, uint64_t inode,
if (inode_it != self->parent->cli->st_cli.inode_config.end() && if (inode_it != self->parent->cli->st_cli.inode_config.end() &&
inode_it->second.size < new_size) inode_it->second.size < new_size)
{ {
auto ewr_it = self->extend_writes.emplace((extend_size_t){ auto ewr_it = self->parent->blockfs->extend_writes.emplace((extend_size_t){
.inode = inode, .inode = inode,
.new_size = new_size, .new_size = new_size,
}, (extend_write_t){ }, (extend_write_t){
@ -580,7 +560,7 @@ static void nfs_resize_write(nfs_client_t *self, rpc_op_t *rop, uint64_t inode,
.resize_res = 1, .resize_res = 1,
.write_res = 1, .write_res = 1,
}); });
auto & ext = self->extends[inode]; auto & ext = self->parent->blockfs->extends[inode];
if (ext.cur_extend > 0) if (ext.cur_extend > 0)
{ {
// Already resizing, just wait // Already resizing, just wait
@ -595,11 +575,11 @@ static void nfs_resize_write(nfs_client_t *self, rpc_op_t *rop, uint64_t inode,
} }
else else
{ {
nfs_do_write(self, self->extend_writes.end(), rop, inode, offset, count, buf); nfs_do_write(self, self->parent->blockfs->extend_writes.end(), rop, inode, offset, count, buf);
} }
} }
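The `extend_writes` bookkeeping touched in this hunk can be read as a multimap keyed by `(inode, new_size)`: a write that needs a larger inode is parked there until both its data write and the resize have finished. Below is a minimal sketch of the completion pass only, under stated assumptions: `ExtendKey`, `PendingWrite` and `complete_resizes` are hypothetical stand-ins (not names from the diff), the integer `id` stands in for the `rpc_op_t` pointer, and the step that stores the resize result is inferred from the visible context rather than shown in the hunk.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical stand-in for the (inode, new_size) key used by extend_writes.
struct ExtendKey
{
    uint64_t inode;
    uint64_t new_size;
};

// Order by inode first, then by requested size, so that all requests
// for one inode form a contiguous, size-sorted range.
inline bool operator < (const ExtendKey & a, const ExtendKey & b)
{
    return a.inode < b.inode || (a.inode == b.inode && a.new_size < b.new_size);
}

// Hypothetical stand-in for the parked write.
struct PendingWrite
{
    int id;          // stands in for the rpc_op_t pointer
    int resize_res;  // 1 = resize still pending, <= 0 = finished
    int write_res;   // 1 = write still pending, <= 0 = finished
};

// Record the resize result for every parked write of `inode` whose required
// size is <= new_size, and erase those whose data write has already finished.
// Returns the ids of the erased (completed) writes in key order.
std::vector<int> complete_resizes(std::multimap<ExtendKey, PendingWrite> & pending,
    uint64_t inode, uint64_t new_size, int err)
{
    std::vector<int> done;
    auto it = pending.lower_bound(ExtendKey{ inode, 0 });
    while (it != pending.end() && it->first.inode == inode && it->first.new_size <= new_size)
    {
        it->second.resize_res = err;
        if (it->second.write_res <= 0)
        {
            done.push_back(it->second.id);
            it = pending.erase(it); // erase() returns the next valid iterator
        }
        else
            ++it;
    }
    return done;
}
```

The sketch uses `it = pending.erase(it)` where the diff uses `erase(ext_it++)`; both keep the iterator valid for `std::multimap`.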
static int nfs3_create_proc(void *opaque, rpc_op_t *rop) static int block_nfs3_create_proc(void *opaque, rpc_op_t *rop)
{ {
nfs_client_t *self = (nfs_client_t*)opaque; nfs_client_t *self = (nfs_client_t*)opaque;
CREATE3args *args = (CREATE3args*)rop->request; CREATE3args *args = (CREATE3args*)rop->request;
@ -650,7 +630,7 @@ static int nfs3_create_proc(void *opaque, rpc_op_t *rop)
return 1; return 1;
} }
static int nfs3_mkdir_proc(void *opaque, rpc_op_t *rop) static int block_nfs3_mkdir_proc(void *opaque, rpc_op_t *rop)
{ {
nfs_client_t *self = (nfs_client_t*)opaque; nfs_client_t *self = (nfs_client_t*)opaque;
MKDIR3args *args = (MKDIR3args*)rop->request; MKDIR3args *args = (MKDIR3args*)rop->request;
@ -669,19 +649,19 @@ static int nfs3_mkdir_proc(void *opaque, rpc_op_t *rop)
rpc_queue_reply(rop); rpc_queue_reply(rop);
return 0; return 0;
} }
auto dir_id_it = self->parent->dir_info.find(full_name); auto dir_id_it = self->parent->blockfs->dir_info.find(full_name);
if (dir_id_it != self->parent->dir_info.end()) if (dir_id_it != self->parent->blockfs->dir_info.end())
{ {
*reply = (MKDIR3res){ .status = NFS3ERR_EXIST }; *reply = (MKDIR3res){ .status = NFS3ERR_EXIST };
rpc_queue_reply(rop); rpc_queue_reply(rop);
return 0; return 0;
} }
// FIXME: Persist empty directories in some etcd keys, like /vitastor/dir/... // FIXME: Persist empty directories in some etcd keys, like /vitastor/dir/...
self->parent->dir_info[full_name] = (nfs_dir_t){ self->parent->blockfs->dir_info[full_name] = (nfs_dir_t){
.id = self->parent->next_dir_id++, .id = self->parent->blockfs->next_dir_id++,
.mod_rev = 0, .mod_rev = 0,
}; };
self->parent->dir_by_hash["S"+base64_encode(sha256(full_name))] = full_name; self->parent->blockfs->dir_by_hash["S"+base64_encode(sha256(full_name))] = full_name;
*reply = (MKDIR3res){ *reply = (MKDIR3res){
.status = NFS3_OK, .status = NFS3_OK,
.resok = (MKDIR3resok){ .resok = (MKDIR3resok){
@ -700,7 +680,7 @@ static int nfs3_mkdir_proc(void *opaque, rpc_op_t *rop)
return 0; return 0;
} }
static int nfs3_symlink_proc(void *opaque, rpc_op_t *rop) static int block_nfs3_symlink_proc(void *opaque, rpc_op_t *rop)
{ {
// nfs_client_t *self = (nfs_client_t*)opaque; // nfs_client_t *self = (nfs_client_t*)opaque;
// SYMLINK3args *args = (SYMLINK3args*)rop->request; // SYMLINK3args *args = (SYMLINK3args*)rop->request;
@ -711,7 +691,7 @@ static int nfs3_symlink_proc(void *opaque, rpc_op_t *rop)
return 0; return 0;
} }
static int nfs3_mknod_proc(void *opaque, rpc_op_t *rop) static int block_nfs3_mknod_proc(void *opaque, rpc_op_t *rop)
{ {
// nfs_client_t *self = (nfs_client_t*)opaque; // nfs_client_t *self = (nfs_client_t*)opaque;
// MKNOD3args *args = (MKNOD3args*)rop->request; // MKNOD3args *args = (MKNOD3args*)rop->request;
@ -722,7 +702,7 @@ static int nfs3_mknod_proc(void *opaque, rpc_op_t *rop)
return 0; return 0;
} }
static int nfs3_remove_proc(void *opaque, rpc_op_t *rop) static int block_nfs3_remove_proc(void *opaque, rpc_op_t *rop)
{ {
nfs_client_t *self = (nfs_client_t*)opaque; nfs_client_t *self = (nfs_client_t*)opaque;
REMOVE3res *reply = (REMOVE3res*)rop->reply; REMOVE3res *reply = (REMOVE3res*)rop->reply;
@ -752,7 +732,7 @@ static int nfs3_remove_proc(void *opaque, rpc_op_t *rop)
return 1; return 1;
} }
static int nfs3_rmdir_proc(void *opaque, rpc_op_t *rop) static int block_nfs3_rmdir_proc(void *opaque, rpc_op_t *rop)
{ {
nfs_client_t *self = (nfs_client_t*)opaque; nfs_client_t *self = (nfs_client_t*)opaque;
RMDIR3args *args = (RMDIR3args*)rop->request; RMDIR3args *args = (RMDIR3args*)rop->request;
@ -764,8 +744,8 @@ static int nfs3_rmdir_proc(void *opaque, rpc_op_t *rop)
rpc_queue_reply(rop); rpc_queue_reply(rop);
return 0; return 0;
} }
auto dir_it = self->parent->dir_info.find(full_name); auto dir_it = self->parent->blockfs->dir_info.find(full_name);
if (dir_it == self->parent->dir_info.end()) if (dir_it == self->parent->blockfs->dir_info.end())
{ {
*reply = (RMDIR3res){ .status = NFS3ERR_NOENT }; *reply = (RMDIR3res){ .status = NFS3ERR_NOENT };
rpc_queue_reply(rop); rpc_queue_reply(rop);
@ -781,8 +761,8 @@ static int nfs3_rmdir_proc(void *opaque, rpc_op_t *rop)
return 0; return 0;
} }
} }
self->parent->dir_by_hash.erase("S"+base64_encode(sha256(full_name))); self->parent->blockfs->dir_by_hash.erase("S"+base64_encode(sha256(full_name)));
self->parent->dir_info.erase(dir_it); self->parent->blockfs->dir_info.erase(dir_it);
*reply = (RMDIR3res){ .status = NFS3_OK }; *reply = (RMDIR3res){ .status = NFS3_OK };
rpc_queue_reply(rop); rpc_queue_reply(rop);
return 0; return 0;
@ -811,12 +791,12 @@ static int continue_dir_rename(nfs_dir_rename_state *rename_st)
if (!rename_st->items.size()) if (!rename_st->items.size())
{ {
// old dir // old dir
auto old_info = self->parent->dir_info.at(rename_st->old_name); auto old_info = self->parent->blockfs->dir_info.at(rename_st->old_name);
self->parent->dir_info.erase(rename_st->old_name); self->parent->blockfs->dir_info.erase(rename_st->old_name);
self->parent->dir_by_hash.erase("S"+base64_encode(sha256(rename_st->old_name))); self->parent->blockfs->dir_by_hash.erase("S"+base64_encode(sha256(rename_st->old_name)));
// new dir // new dir
self->parent->dir_info[rename_st->new_name] = old_info; self->parent->blockfs->dir_info[rename_st->new_name] = old_info;
self->parent->dir_by_hash["S"+base64_encode(sha256(rename_st->new_name))] = rename_st->new_name; self->parent->blockfs->dir_by_hash["S"+base64_encode(sha256(rename_st->new_name))] = rename_st->new_name;
RENAME3res *reply = (RENAME3res*)rename_st->rop->reply; RENAME3res *reply = (RENAME3res*)rename_st->rop->reply;
*reply = (RENAME3res){ *reply = (RENAME3res){
.status = NFS3_OK, .status = NFS3_OK,
@ -853,7 +833,7 @@ static int continue_dir_rename(nfs_dir_rename_state *rename_st)
static void nfs_do_rename(nfs_client_t *self, rpc_op_t *rop, std::string old_name, std::string new_name); static void nfs_do_rename(nfs_client_t *self, rpc_op_t *rop, std::string old_name, std::string new_name);
static int nfs3_rename_proc(void *opaque, rpc_op_t *rop) static int block_nfs3_rename_proc(void *opaque, rpc_op_t *rop)
{ {
nfs_client_t *self = (nfs_client_t*)opaque; nfs_client_t *self = (nfs_client_t*)opaque;
RENAME3args *args = (RENAME3args*)rop->request; RENAME3args *args = (RENAME3args*)rop->request;
@ -866,8 +846,8 @@ static int nfs3_rename_proc(void *opaque, rpc_op_t *rop)
rpc_queue_reply(rop); rpc_queue_reply(rop);
return 0; return 0;
} }
bool old_is_dir = self->parent->dir_info.find(old_name) != self->parent->dir_info.end(); bool old_is_dir = self->parent->blockfs->dir_info.find(old_name) != self->parent->blockfs->dir_info.end();
bool new_is_dir = self->parent->dir_info.find(new_name) != self->parent->dir_info.end(); bool new_is_dir = self->parent->blockfs->dir_info.find(new_name) != self->parent->blockfs->dir_info.end();
bool old_is_file = false, new_is_file = false; bool old_is_file = false, new_is_file = false;
for (auto & ic: self->parent->cli->st_cli.inode_config) for (auto & ic: self->parent->cli->st_cli.inode_config)
{ {
@ -948,7 +928,7 @@ static void nfs_do_rename(nfs_client_t *self, rpc_op_t *rop, std::string old_nam
}); });
} }
static int nfs3_link_proc(void *opaque, rpc_op_t *rop) static int block_nfs3_link_proc(void *opaque, rpc_op_t *rop)
{ {
//nfs_client_t *self = (nfs_client_t*)opaque; //nfs_client_t *self = (nfs_client_t*)opaque;
//LINK3args *args = (LINK3args*)rop->request; //LINK3args *args = (LINK3args*)rop->request;
@ -962,7 +942,7 @@ static int nfs3_link_proc(void *opaque, rpc_op_t *rop)
static void fill_dir_entry(nfs_client_t *self, rpc_op_t *rop, static void fill_dir_entry(nfs_client_t *self, rpc_op_t *rop,
std::map<std::string, nfs_dir_t>::iterator dir_id_it, struct entryplus3 *entry, bool is_plus) std::map<std::string, nfs_dir_t>::iterator dir_id_it, struct entryplus3 *entry, bool is_plus)
{ {
if (dir_id_it == self->parent->dir_info.end()) if (dir_id_it == self->parent->blockfs->dir_info.end())
{ {
return; return;
} }
@ -980,7 +960,7 @@ static void fill_dir_entry(nfs_client_t *self, rpc_op_t *rop,
} }
} }
static void nfs3_readdir_common(void *opaque, rpc_op_t *rop, bool is_plus) static void block_nfs3_readdir_common(void *opaque, rpc_op_t *rop, bool is_plus)
{ {
nfs_client_t *self = (nfs_client_t*)opaque; nfs_client_t *self = (nfs_client_t*)opaque;
READDIRPLUS3args plus_args; READDIRPLUS3args plus_args;
@ -999,10 +979,10 @@ static void nfs3_readdir_common(void *opaque, rpc_op_t *rop, bool is_plus)
} }
std::string dirhash = args->dir; std::string dirhash = args->dir;
std::string dir; std::string dir;
if (dirhash != "roothandle") if (dirhash != NFS_ROOT_HANDLE)
{ {
auto dir_it = self->parent->dir_by_hash.find(dirhash); auto dir_it = self->parent->blockfs->dir_by_hash.find(dirhash);
if (dir_it != self->parent->dir_by_hash.end()) if (dir_it != self->parent->blockfs->dir_by_hash.end())
dir = dir_it->second; dir = dir_it->second;
} }
std::string prefix = dir.size() ? dir+"/" : self->parent->name_prefix; std::string prefix = dir.size() ? dir+"/" : self->parent->name_prefix;
@ -1043,12 +1023,12 @@ static void nfs3_readdir_common(void *opaque, rpc_op_t *rop, bool is_plus)
} }
else else
{ {
// skip directories, they will be added from dir_info // skip directories, they will be added from blockfs->dir_info
} }
} }
// Add directories from dir_info // Add directories from blockfs->dir_info
for (auto dir_id_it = self->parent->dir_info.lower_bound(prefix); for (auto dir_id_it = self->parent->blockfs->dir_info.lower_bound(prefix);
dir_id_it != self->parent->dir_info.end(); dir_id_it++) dir_id_it != self->parent->blockfs->dir_info.end(); dir_id_it++)
{ {
if (prefix != "" && dir_id_it->first.substr(0, prefix.size()) != prefix) if (prefix != "" && dir_id_it->first.substr(0, prefix.size()) != prefix)
break; break;
@ -1061,12 +1041,12 @@ static void nfs3_readdir_common(void *opaque, rpc_op_t *rop, bool is_plus)
} }
// Add . and .. // Add . and ..
{ {
auto dir_id_it = self->parent->dir_info.find(dir); auto dir_id_it = self->parent->blockfs->dir_info.find(dir);
fill_dir_entry(self, rop, dir_id_it, &entries["."], is_plus); fill_dir_entry(self, rop, dir_id_it, &entries["."], is_plus);
auto sl = dir.rfind("/"); auto sl = dir.rfind("/");
if (sl != std::string::npos) if (sl != std::string::npos)
{ {
auto dir_id_it = self->parent->dir_info.find(dir.substr(0, sl)); auto dir_id_it = self->parent->blockfs->dir_info.find(dir.substr(0, sl));
fill_dir_entry(self, rop, dir_id_it, &entries[".."], is_plus); fill_dir_entry(self, rop, dir_id_it, &entries[".."], is_plus);
} }
} }
@ -1147,7 +1127,7 @@ static void nfs3_readdir_common(void *opaque, rpc_op_t *rop, bool is_plus)
{ {
READDIRPLUS3res *reply = (READDIRPLUS3res*)rop->reply; READDIRPLUS3res *reply = (READDIRPLUS3res*)rop->reply;
*reply = { .status = NFS3_OK }; *reply = { .status = NFS3_OK };
*(uint64_t*)(reply->resok.cookieverf) = self->parent->dir_info.at(dir).mod_rev; *(uint64_t*)(reply->resok.cookieverf) = self->parent->blockfs->dir_info.at(dir).mod_rev;
reply->resok.reply.entries = entries.size() ? &entries.begin()->second : NULL; reply->resok.reply.entries = entries.size() ? &entries.begin()->second : NULL;
reply->resok.reply.eof = eof; reply->resok.reply.eof = eof;
} }
@ -1155,250 +1135,123 @@ static void nfs3_readdir_common(void *opaque, rpc_op_t *rop, bool is_plus)
{ {
READDIR3res *reply = (READDIR3res*)rop->reply; READDIR3res *reply = (READDIR3res*)rop->reply;
*reply = { .status = NFS3_OK }; *reply = { .status = NFS3_OK };
*(uint64_t*)(reply->resok.cookieverf) = self->parent->dir_info.at(dir).mod_rev; *(uint64_t*)(reply->resok.cookieverf) = self->parent->blockfs->dir_info.at(dir).mod_rev;
reply->resok.reply.entries = entries.size() ? (entry3*)&entries.begin()->second : NULL; reply->resok.reply.entries = entries.size() ? (entry3*)&entries.begin()->second : NULL;
reply->resok.reply.eof = eof; reply->resok.reply.eof = eof;
} }
rpc_queue_reply(rop); rpc_queue_reply(rop);
} }
static int nfs3_readdir_proc(void *opaque, rpc_op_t *rop) static int block_nfs3_readdir_proc(void *opaque, rpc_op_t *rop)
{ {
nfs3_readdir_common(opaque, rop, false); block_nfs3_readdir_common(opaque, rop, false);
return 0; return 0;
} }
static int nfs3_readdirplus_proc(void *opaque, rpc_op_t *rop) static int block_nfs3_readdirplus_proc(void *opaque, rpc_op_t *rop)
{ {
nfs3_readdir_common(opaque, rop, true); block_nfs3_readdir_common(opaque, rop, true);
return 0; return 0;
} }
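The directory listing above walks `dir_info` from `lower_bound(prefix)` and stops at the first key that no longer starts with the prefix. This works because `std::map` keeps its keys sorted, so all keys sharing a prefix are adjacent. A minimal sketch of just that range scan follows; `keys_with_prefix` and the `int` value type are hypothetical, and the extra filtering that the real function applies to each entry is not reproduced.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Collect the keys of an ordered map that start with `prefix`.
// lower_bound() jumps to the first key >= prefix; since keys are sorted,
// the scan can stop at the first key that does not carry the prefix.
std::vector<std::string> keys_with_prefix(const std::map<std::string, int> & dirs,
    const std::string & prefix)
{
    std::vector<std::string> found;
    for (auto it = dirs.lower_bound(prefix); it != dirs.end(); ++it)
    {
        if (it->first.compare(0, prefix.size(), prefix) != 0)
            break;
        found.push_back(it->first);
    }
    return found;
}
```

An empty prefix matches every key, which corresponds to the `prefix != ""` guard in the diff.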
// Get file system statistics void block_fs_state_t::init(nfs_proxy_t *proxy)
static int nfs3_fsstat_proc(void *opaque, rpc_op_t *rop)
{ {
nfs_client_t *self = (nfs_client_t*)opaque; // We need inode name hashes for NFS handles to remain stateless and <= 64 bytes long
//FSSTAT3args *args = (FSSTAT3args*)rop->request; dir_info[""] = (nfs_dir_t){
FSSTAT3res *reply = (FSSTAT3res*)rop->reply; .id = 1,
uint64_t tbytes = 0, fbytes = 0; .mod_rev = 0,
auto pst_it = self->parent->pool_stats.find(self->parent->default_pool_id);
if (pst_it != self->parent->pool_stats.end())
{
auto ttb = pst_it->second["total_raw_tb"].number_value();
auto ftb = (pst_it->second["total_raw_tb"].number_value() - pst_it->second["used_raw_tb"].number_value());
tbytes = ttb / pst_it->second["raw_to_usable"].number_value() * ((uint64_t)2<<40);
fbytes = ftb / pst_it->second["raw_to_usable"].number_value() * ((uint64_t)2<<40);
}
*reply = (FSSTAT3res){
.status = NFS3_OK,
.resok = (FSSTAT3resok){
.obj_attributes = {
.attributes_follow = 1,
.attributes = get_dir_attributes(self, ""),
},
.tbytes = tbytes, // total bytes
.fbytes = fbytes, // free bytes
.abytes = fbytes, // available bytes
.tfiles = (size3)(1 << 31), // maximum total files
.ffiles = (size3)(1 << 31), // free files
.afiles = (size3)(1 << 31), // available files
.invarsec = 0,
},
}; };
rpc_queue_reply(rop); clock_gettime(CLOCK_REALTIME, &dir_info[""].mtime);
return 0; assert(proxy->cli->st_cli.on_inode_change_hook == NULL);
} proxy->cli->st_cli.on_inode_change_hook = [this, proxy](inode_t changed_inode, bool removed)
static int nfs3_fsinfo_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
FSINFO3args *args = (FSINFO3args*)rop->request;
FSINFO3res *reply = (FSINFO3res*)rop->reply;
if (args->fsroot != "roothandle")
{ {
// Example error auto inode_cfg_it = proxy->cli->st_cli.inode_config.find(changed_inode);
*reply = (FSINFO3res){ .status = NFS3ERR_INVAL }; if (inode_cfg_it == proxy->cli->st_cli.inode_config.end())
} {
else return;
{ }
// Fill info auto & inode_cfg = inode_cfg_it->second;
*reply = (FSINFO3res){ std::string full_name = inode_cfg.name;
.status = NFS3_OK, if (proxy->name_prefix != "" && full_name.substr(0, proxy->name_prefix.size()) != proxy->name_prefix)
.resok = (FSINFO3resok){ {
.obj_attributes = { return;
.attributes_follow = 1, }
.attributes = get_dir_attributes(self, ""), // Calculate directory modification time and revision (used as "cookie verifier")
}, timespec now;
.rtmax = 128*1024*1024, clock_gettime(CLOCK_REALTIME, &now);
.rtpref = 128*1024*1024, dir_info[""].mod_rev = dir_info[""].mod_rev < inode_cfg.mod_revision ? inode_cfg.mod_revision : dir_info[""].mod_rev;
.rtmult = 4096, dir_info[""].mtime = now;
.wtmax = 128*1024*1024, int pos = full_name.find('/', proxy->name_prefix.size());
.wtpref = 128*1024*1024, while (pos >= 0)
.wtmult = 4096, {
.dtpref = 128, std::string dir = full_name.substr(0, pos);
.maxfilesize = 0x7fffffffffffffff, auto & dinf = dir_info[dir];
.time_delta = { if (!dinf.id)
.seconds = 1, dinf.id = next_dir_id++;
.nseconds = 0, dinf.mod_rev = dinf.mod_rev < inode_cfg.mod_revision ? inode_cfg.mod_revision : dinf.mod_rev;
}, dinf.mtime = now;
.properties = FSF3_SYMLINK | FSF3_HOMOGENEOUS, dir_by_hash["S"+base64_encode(sha256(dir))] = dir;
}, pos = full_name.find('/', pos+1);
}; }
} // Alter inode_by_hash
rpc_queue_reply(rop); if (removed)
return 0; {
} auto ino_it = hash_by_inode.find(changed_inode);
if (ino_it != hash_by_inode.end())
static int nfs3_pathconf_proc(void *opaque, rpc_op_t *rop) {
{ inode_by_hash.erase(ino_it->second);
//nfs_client_t *self = (nfs_client_t*)opaque; hash_by_inode.erase(ino_it);
PATHCONF3args *args = (PATHCONF3args*)rop->request; }
PATHCONF3res *reply = (PATHCONF3res*)rop->reply; }
if (args->object != "roothandle") else
{ {
// Example error std::string hash = "S"+base64_encode(sha256(full_name));
*reply = (PATHCONF3res){ .status = NFS3ERR_INVAL }; auto hbi_it = hash_by_inode.find(changed_inode);
} if (hbi_it != hash_by_inode.end() && hbi_it->second != hash)
else {
{ // inode had a different name, remove old hash=>inode pointer
// Fill info inode_by_hash.erase(hbi_it->second);
bool_t x = FALSE; }
*reply = (PATHCONF3res){ inode_by_hash[hash] = changed_inode;
.status = NFS3_OK, hash_by_inode[changed_inode] = hash;
.resok = (PATHCONF3resok){ }
.obj_attributes = {
// Without at least one reference to a non-constant value (local variable or something else),
// with gcc 8 we get "internal compiler error: side-effects element in no-side-effects CONSTRUCTOR" here
// FIXME: get rid of this after raising compiler requirement
.attributes_follow = x,
},
.linkmax = 0,
.name_max = 255,
.no_trunc = TRUE,
.chown_restricted = FALSE,
.case_insensitive = FALSE,
.case_preserving = TRUE,
},
};
}
rpc_queue_reply(rop);
return 0;
}
static int nfs3_commit_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
//COMMIT3args *args = (COMMIT3args*)rop->request;
cluster_op_t *op = new cluster_op_t;
// fsync. we don't know how to fsync a single inode, so just fsync everything
op->opcode = OSD_OP_SYNC;
op->callback = [self, rop](cluster_op_t *op)
{
COMMIT3res *reply = (COMMIT3res*)rop->reply;
*reply = (COMMIT3res){ .status = vitastor_nfs_map_err(op->retval) };
*(uint64_t*)reply->resok.verf = self->parent->server_id;
rpc_queue_reply(rop);
}; };
self->parent->cli->execute(op);
return 1;
} }
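The inode change hook in `block_fs_state_t::init` keeps two maps in step: handle to inode (`inode_by_hash`) and inode to handle (`hash_by_inode`). The reverse map is what lets a rename drop the stale handle. The following is a minimal sketch of that two-way update, under stated assumptions: `HandleIndex` and `make_handle` are hypothetical, and `make_handle` is a plain stand-in for `"S"+base64_encode(sha256(name))`, so it does not produce fixed-length handles the way the real digest does.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for "S"+base64_encode(sha256(full_name)).
// It is NOT a digest; it only keeps the sketch deterministic.
static std::string make_handle(const std::string & full_name)
{
    return "S:" + full_name;
}

struct HandleIndex
{
    std::map<std::string, uint64_t> inode_by_hash;
    std::map<uint64_t, std::string> hash_by_inode;

    // Apply one inode change: removal drops both directions, while a
    // create or rename replaces any stale handle of the same inode.
    void on_inode_change(uint64_t inode, const std::string & full_name, bool removed)
    {
        auto it = hash_by_inode.find(inode);
        if (removed)
        {
            if (it != hash_by_inode.end())
            {
                inode_by_hash.erase(it->second);
                hash_by_inode.erase(it);
            }
            return;
        }
        std::string hash = make_handle(full_name);
        if (it != hash_by_inode.end() && it->second != hash)
        {
            // The inode had a different name: remove the old handle
            inode_by_hash.erase(it->second);
        }
        inode_by_hash[hash] = inode;
        hash_by_inode[inode] = hash;
    }
};
```

Without `hash_by_inode`, a rename would leave the old handle pointing at the inode until the proxy restarted.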
static int mount3_mnt_proc(void *opaque, rpc_op_t *rop) void nfs_block_procs(nfs_client_t *self)
{
//nfs_client_t *self = (nfs_client_t*)opaque;
//nfs_dirpath *args = (nfs_dirpath*)rop->request;
nfs_mountres3 *reply = (nfs_mountres3*)rop->reply;
u_int flavor = RPC_AUTH_NONE;
reply->fhs_status = MNT3_OK;
reply->mountinfo.fhandle = xdr_copy_string(rop->xdrs, "roothandle");
reply->mountinfo.auth_flavors.auth_flavors_len = 1;
reply->mountinfo.auth_flavors.auth_flavors_val = (u_int*)xdr_copy_string(rop->xdrs, (char*)&flavor, sizeof(u_int)).data;
rpc_queue_reply(rop);
return 0;
}
static int mount3_dump_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
nfs_mountlist *reply = (nfs_mountlist*)rop->reply;
*reply = (struct nfs_mountbody*)malloc_or_die(sizeof(struct nfs_mountbody));
xdr_add_malloc(rop->xdrs, *reply);
(*reply)->ml_hostname = xdr_copy_string(rop->xdrs, "127.0.0.1");
(*reply)->ml_directory = xdr_copy_string(rop->xdrs, self->parent->export_root);
(*reply)->ml_next = NULL;
rpc_queue_reply(rop);
return 0;
}
static int mount3_umnt_proc(void *opaque, rpc_op_t *rop)
{
//nfs_client_t *self = (nfs_client_t*)opaque;
//nfs_dirpath *arg = (nfs_dirpath*)rop->request;
// do nothing
rpc_queue_reply(rop);
return 0;
}
static int mount3_umntall_proc(void *opaque, rpc_op_t *rop)
{
// do nothing
rpc_queue_reply(rop);
return 0;
}
static int mount3_export_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
nfs_exports *reply = (nfs_exports*)rop->reply;
*reply = (struct nfs_exportnode*)calloc_or_die(1, sizeof(struct nfs_exportnode) + sizeof(struct nfs_groupnode));
xdr_add_malloc(rop->xdrs, *reply);
(*reply)->ex_dir = xdr_copy_string(rop->xdrs, self->parent->export_root);
(*reply)->ex_groups = (struct nfs_groupnode*)(reply+1);
(*reply)->ex_groups->gr_name = xdr_copy_string(rop->xdrs, "127.0.0.1");
(*reply)->ex_groups->gr_next = NULL;
(*reply)->ex_next = NULL;
rpc_queue_reply(rop);
return 0;
}
nfs_client_t::nfs_client_t()
{ {
struct rpc_service_proc_t pt[] = { struct rpc_service_proc_t pt[] = {
{NFS_PROGRAM, NFS_V3, NFS3_NULL, nfs3_null_proc, NULL, 0, NULL, 0, this}, {NFS_PROGRAM, NFS_V3, NFS3_NULL, nfs3_null_proc, NULL, 0, NULL, 0, self},
{NFS_PROGRAM, NFS_V3, NFS3_GETATTR, nfs3_getattr_proc, (xdrproc_t)xdr_GETATTR3args, sizeof(GETATTR3args), (xdrproc_t)xdr_GETATTR3res, sizeof(GETATTR3res), this}, {NFS_PROGRAM, NFS_V3, NFS3_GETATTR, block_nfs3_getattr_proc, (xdrproc_t)xdr_GETATTR3args, sizeof(GETATTR3args), (xdrproc_t)xdr_GETATTR3res, sizeof(GETATTR3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_SETATTR, nfs3_setattr_proc, (xdrproc_t)xdr_SETATTR3args, sizeof(SETATTR3args), (xdrproc_t)xdr_SETATTR3res, sizeof(SETATTR3res), this}, {NFS_PROGRAM, NFS_V3, NFS3_SETATTR, block_nfs3_setattr_proc, (xdrproc_t)xdr_SETATTR3args, sizeof(SETATTR3args), (xdrproc_t)xdr_SETATTR3res, sizeof(SETATTR3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_LOOKUP, nfs3_lookup_proc, (xdrproc_t)xdr_LOOKUP3args, sizeof(LOOKUP3args), (xdrproc_t)xdr_LOOKUP3res, sizeof(LOOKUP3res), this}, {NFS_PROGRAM, NFS_V3, NFS3_LOOKUP, block_nfs3_lookup_proc, (xdrproc_t)xdr_LOOKUP3args, sizeof(LOOKUP3args), (xdrproc_t)xdr_LOOKUP3res, sizeof(LOOKUP3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_ACCESS, nfs3_access_proc, (xdrproc_t)xdr_ACCESS3args, sizeof(ACCESS3args), (xdrproc_t)xdr_ACCESS3res, sizeof(ACCESS3res), this}, {NFS_PROGRAM, NFS_V3, NFS3_ACCESS, block_nfs3_access_proc, (xdrproc_t)xdr_ACCESS3args, sizeof(ACCESS3args), (xdrproc_t)xdr_ACCESS3res, sizeof(ACCESS3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_READLINK, nfs3_readlink_proc, (xdrproc_t)xdr_READLINK3args, sizeof(READLINK3args), (xdrproc_t)xdr_READLINK3res, sizeof(READLINK3res), this}, {NFS_PROGRAM, NFS_V3, NFS3_READLINK, block_nfs3_readlink_proc, (xdrproc_t)xdr_READLINK3args, sizeof(READLINK3args), (xdrproc_t)xdr_READLINK3res, sizeof(READLINK3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_READ, nfs3_read_proc, (xdrproc_t)xdr_READ3args, sizeof(READ3args), (xdrproc_t)xdr_READ3res, sizeof(READ3res), this}, {NFS_PROGRAM, NFS_V3, NFS3_READ, block_nfs3_read_proc, (xdrproc_t)xdr_READ3args, sizeof(READ3args), (xdrproc_t)xdr_READ3res, sizeof(READ3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_WRITE, nfs3_write_proc, (xdrproc_t)xdr_WRITE3args, sizeof(WRITE3args), (xdrproc_t)xdr_WRITE3res, sizeof(WRITE3res), this}, {NFS_PROGRAM, NFS_V3, NFS3_WRITE, block_nfs3_write_proc, (xdrproc_t)xdr_WRITE3args, sizeof(WRITE3args), (xdrproc_t)xdr_WRITE3res, sizeof(WRITE3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_CREATE, nfs3_create_proc, (xdrproc_t)xdr_CREATE3args, sizeof(CREATE3args), (xdrproc_t)xdr_CREATE3res, sizeof(CREATE3res), this}, {NFS_PROGRAM, NFS_V3, NFS3_CREATE, block_nfs3_create_proc, (xdrproc_t)xdr_CREATE3args, sizeof(CREATE3args), (xdrproc_t)xdr_CREATE3res, sizeof(CREATE3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_MKDIR, nfs3_mkdir_proc, (xdrproc_t)xdr_MKDIR3args, sizeof(MKDIR3args), (xdrproc_t)xdr_MKDIR3res, sizeof(MKDIR3res), this}, {NFS_PROGRAM, NFS_V3, NFS3_MKDIR, block_nfs3_mkdir_proc, (xdrproc_t)xdr_MKDIR3args, sizeof(MKDIR3args), (xdrproc_t)xdr_MKDIR3res, sizeof(MKDIR3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_SYMLINK, nfs3_symlink_proc, (xdrproc_t)xdr_SYMLINK3args, sizeof(SYMLINK3args), (xdrproc_t)xdr_SYMLINK3res, sizeof(SYMLINK3res), this}, {NFS_PROGRAM, NFS_V3, NFS3_SYMLINK, block_nfs3_symlink_proc, (xdrproc_t)xdr_SYMLINK3args, sizeof(SYMLINK3args), (xdrproc_t)xdr_SYMLINK3res, sizeof(SYMLINK3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_MKNOD, nfs3_mknod_proc, (xdrproc_t)xdr_MKNOD3args, sizeof(MKNOD3args), (xdrproc_t)xdr_MKNOD3res, sizeof(MKNOD3res), this}, {NFS_PROGRAM, NFS_V3, NFS3_MKNOD, block_nfs3_mknod_proc, (xdrproc_t)xdr_MKNOD3args, sizeof(MKNOD3args), (xdrproc_t)xdr_MKNOD3res, sizeof(MKNOD3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_REMOVE, nfs3_remove_proc, (xdrproc_t)xdr_REMOVE3args, sizeof(REMOVE3args), (xdrproc_t)xdr_REMOVE3res, sizeof(REMOVE3res), this}, {NFS_PROGRAM, NFS_V3, NFS3_REMOVE, block_nfs3_remove_proc, (xdrproc_t)xdr_REMOVE3args, sizeof(REMOVE3args), (xdrproc_t)xdr_REMOVE3res, sizeof(REMOVE3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_RMDIR, nfs3_rmdir_proc, (xdrproc_t)xdr_RMDIR3args, sizeof(RMDIR3args), (xdrproc_t)xdr_RMDIR3res, sizeof(RMDIR3res), this}, {NFS_PROGRAM, NFS_V3, NFS3_RMDIR, block_nfs3_rmdir_proc, (xdrproc_t)xdr_RMDIR3args, sizeof(RMDIR3args), (xdrproc_t)xdr_RMDIR3res, sizeof(RMDIR3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_RENAME, nfs3_rename_proc, (xdrproc_t)xdr_RENAME3args, sizeof(RENAME3args), (xdrproc_t)xdr_RENAME3res, sizeof(RENAME3res), this}, {NFS_PROGRAM, NFS_V3, NFS3_RENAME, block_nfs3_rename_proc, (xdrproc_t)xdr_RENAME3args, sizeof(RENAME3args), (xdrproc_t)xdr_RENAME3res, sizeof(RENAME3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_LINK, nfs3_link_proc, (xdrproc_t)xdr_LINK3args, sizeof(LINK3args), (xdrproc_t)xdr_LINK3res, sizeof(LINK3res), this}, {NFS_PROGRAM, NFS_V3, NFS3_LINK, block_nfs3_link_proc, (xdrproc_t)xdr_LINK3args, sizeof(LINK3args), (xdrproc_t)xdr_LINK3res, sizeof(LINK3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_READDIR, nfs3_readdir_proc, (xdrproc_t)xdr_READDIR3args, sizeof(READDIR3args), (xdrproc_t)xdr_READDIR3res, sizeof(READDIR3res), this}, {NFS_PROGRAM, NFS_V3, NFS3_READDIR, block_nfs3_readdir_proc, (xdrproc_t)xdr_READDIR3args, sizeof(READDIR3args), (xdrproc_t)xdr_READDIR3res, sizeof(READDIR3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_READDIRPLUS, nfs3_readdirplus_proc, (xdrproc_t)xdr_READDIRPLUS3args, sizeof(READDIRPLUS3args), (xdrproc_t)xdr_READDIRPLUS3res, sizeof(READDIRPLUS3res), this}, {NFS_PROGRAM, NFS_V3, NFS3_READDIRPLUS, block_nfs3_readdirplus_proc, (xdrproc_t)xdr_READDIRPLUS3args, sizeof(READDIRPLUS3args), (xdrproc_t)xdr_READDIRPLUS3res, sizeof(READDIRPLUS3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_FSSTAT, nfs3_fsstat_proc, (xdrproc_t)xdr_FSSTAT3args, sizeof(FSSTAT3args), (xdrproc_t)xdr_FSSTAT3res, sizeof(FSSTAT3res), this}, {NFS_PROGRAM, NFS_V3, NFS3_FSSTAT, nfs3_fsstat_proc, (xdrproc_t)xdr_FSSTAT3args, sizeof(FSSTAT3args), (xdrproc_t)xdr_FSSTAT3res, sizeof(FSSTAT3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_FSINFO, nfs3_fsinfo_proc, (xdrproc_t)xdr_FSINFO3args, sizeof(FSINFO3args), (xdrproc_t)xdr_FSINFO3res, sizeof(FSINFO3res), this}, {NFS_PROGRAM, NFS_V3, NFS3_FSINFO, nfs3_fsinfo_proc, (xdrproc_t)xdr_FSINFO3args, sizeof(FSINFO3args), (xdrproc_t)xdr_FSINFO3res, sizeof(FSINFO3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_PATHCONF, nfs3_pathconf_proc, (xdrproc_t)xdr_PATHCONF3args, sizeof(PATHCONF3args), (xdrproc_t)xdr_PATHCONF3res, sizeof(PATHCONF3res), this}, {NFS_PROGRAM, NFS_V3, NFS3_PATHCONF, nfs3_pathconf_proc, (xdrproc_t)xdr_PATHCONF3args, sizeof(PATHCONF3args), (xdrproc_t)xdr_PATHCONF3res, sizeof(PATHCONF3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_COMMIT, nfs3_commit_proc, (xdrproc_t)xdr_COMMIT3args, sizeof(COMMIT3args), (xdrproc_t)xdr_COMMIT3res, sizeof(COMMIT3res), this}, {NFS_PROGRAM, NFS_V3, NFS3_COMMIT, nfs3_commit_proc, (xdrproc_t)xdr_COMMIT3args, sizeof(COMMIT3args), (xdrproc_t)xdr_COMMIT3res, sizeof(COMMIT3res), self},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_NULL, nfs3_null_proc, NULL, 0, NULL, 0, this}, {MOUNT_PROGRAM, MOUNT_V3, MOUNT3_NULL, nfs3_null_proc, NULL, 0, NULL, 0, self},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_MNT, mount3_mnt_proc, (xdrproc_t)xdr_nfs_dirpath, sizeof(nfs_dirpath), (xdrproc_t)xdr_nfs_mountres3, sizeof(nfs_mountres3), this}, {MOUNT_PROGRAM, MOUNT_V3, MOUNT3_MNT, mount3_mnt_proc, (xdrproc_t)xdr_nfs_dirpath, sizeof(nfs_dirpath), (xdrproc_t)xdr_nfs_mountres3, sizeof(nfs_mountres3), self},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_DUMP, mount3_dump_proc, NULL, 0, (xdrproc_t)xdr_nfs_mountlist, sizeof(nfs_mountlist), this}, {MOUNT_PROGRAM, MOUNT_V3, MOUNT3_DUMP, mount3_dump_proc, NULL, 0, (xdrproc_t)xdr_nfs_mountlist, sizeof(nfs_mountlist), self},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_UMNT, mount3_umnt_proc, (xdrproc_t)xdr_nfs_dirpath, sizeof(nfs_dirpath), NULL, 0, this}, {MOUNT_PROGRAM, MOUNT_V3, MOUNT3_UMNT, mount3_umnt_proc, (xdrproc_t)xdr_nfs_dirpath, sizeof(nfs_dirpath), NULL, 0, self},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_UMNTALL, mount3_umntall_proc, NULL, 0, NULL, 0, this}, {MOUNT_PROGRAM, MOUNT_V3, MOUNT3_UMNTALL, mount3_umntall_proc, NULL, 0, NULL, 0, self},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_EXPORT, mount3_export_proc, NULL, 0, (xdrproc_t)xdr_nfs_exports, sizeof(nfs_exports), this}, {MOUNT_PROGRAM, MOUNT_V3, MOUNT3_EXPORT, mount3_export_proc, NULL, 0, (xdrproc_t)xdr_nfs_exports, sizeof(nfs_exports), self},
}; };
for (int i = 0; i < sizeof(pt)/sizeof(pt[0]); i++) for (int i = 0; i < sizeof(pt)/sizeof(pt[0]); i++)
{ {
proc_table.insert(pt[i]); self->proc_table.insert(pt[i]);
} }
} }
nfs_client_t::~nfs_client_t()
{
}

57
src/nfs_block.h Normal file

@ -0,0 +1,57 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy over Vitastor block images - header
#pragma once
struct nfs_dir_t
{
uint64_t id;
uint64_t mod_rev;
timespec mtime;
};
struct extend_size_t
{
inode_t inode;
uint64_t new_size;
};
inline bool operator < (const extend_size_t &a, const extend_size_t &b)
{
return a.inode < b.inode || a.inode == b.inode && a.new_size < b.new_size;
}
struct extend_write_t
{
rpc_op_t *rop;
int resize_res, write_res; // 1 = started, 0 = completed OK, -errno = completed with error
};
struct extend_inode_t
{
uint64_t cur_extend = 0, next_extend = 0;
};
struct block_fs_state_t
{
// filehandle = "S"+base64(sha256(full name with prefix)), or "roothandle" for the mount root
uint64_t next_dir_id = 2;
// filehandle => dir with name_prefix
std::map<std::string, std::string> dir_by_hash;
// dir with name_prefix => dir info
std::map<std::string, nfs_dir_t> dir_info;
// filehandle => inode ID
std::map<std::string, inode_t> inode_by_hash;
// inode ID => filehandle
std::map<inode_t, std::string> hash_by_inode;
// inode extend requests in progress
std::map<inode_t, extend_inode_t> extends;
std::multimap<extend_size_t, extend_write_t> extend_writes;
void init(nfs_proxy_t *proxy);
};
nfsstat3 vitastor_nfs_map_err(int err);

22
src/nfs_common.h Normal file

@ -0,0 +1,22 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy - common functions
#pragma once
#include "nfs/nfs.h"
void nfs_block_procs(nfs_client_t *self);
void nfs_kv_procs(nfs_client_t *self);
int nfs3_fsstat_proc(void *opaque, rpc_op_t *rop);
int nfs3_fsinfo_proc(void *opaque, rpc_op_t *rop);
int nfs3_pathconf_proc(void *opaque, rpc_op_t *rop);
int nfs3_access_proc(void *opaque, rpc_op_t *rop);
int nfs3_null_proc(void *opaque, rpc_op_t *rop);
int nfs3_commit_proc(void *opaque, rpc_op_t *rop);
int mount3_mnt_proc(void *opaque, rpc_op_t *rop);
int mount3_dump_proc(void *opaque, rpc_op_t *rop);
int mount3_umnt_proc(void *opaque, rpc_op_t *rop);
int mount3_umntall_proc(void *opaque, rpc_op_t *rop);
int mount3_export_proc(void *opaque, rpc_op_t *rop);

124
src/nfs_fsstat.cpp Normal file

@ -0,0 +1,124 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy - common FSSTAT, FSINFO, PATHCONF
#include <sys/time.h>
#include "nfs_proxy.h"
#include "nfs_kv.h"
// Get file system statistics
int nfs3_fsstat_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
//FSSTAT3args *args = (FSSTAT3args*)rop->request;
if (self->parent->trace)
fprintf(stderr, "[%d] FSSTAT\n", self->nfs_fd);
FSSTAT3res *reply = (FSSTAT3res*)rop->reply;
uint64_t tbytes = 0, fbytes = 0;
auto pst_it = self->parent->pool_stats.find(self->parent->default_pool_id);
if (pst_it != self->parent->pool_stats.end())
{
auto ttb = pst_it->second["total_raw_tb"].number_value();
auto ftb = (pst_it->second["total_raw_tb"].number_value() - pst_it->second["used_raw_tb"].number_value());
// 1 TB here is 2^40 bytes, i.e. (uint64_t)1<<40 (2<<40 would double the reported size)
tbytes = ttb / pst_it->second["raw_to_usable"].number_value() * ((uint64_t)1<<40);
fbytes = ftb / pst_it->second["raw_to_usable"].number_value() * ((uint64_t)1<<40);
}
*reply = (FSSTAT3res){
.status = NFS3_OK,
.resok = (FSSTAT3resok){
.obj_attributes = {
.attributes_follow = 0,
//.attributes = get_root_attributes(self),
},
.tbytes = tbytes, // total bytes
.fbytes = fbytes, // free bytes
.abytes = fbytes, // available bytes
.tfiles = (size3)1 << (63-POOL_ID_BITS), // maximum total files
.ffiles = (size3)1 << (63-POOL_ID_BITS), // free files
.afiles = (size3)1 << (63-POOL_ID_BITS), // available files
.invarsec = 0,
},
};
rpc_queue_reply(rop);
return 0;
}
int nfs3_fsinfo_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
FSINFO3args *args = (FSINFO3args*)rop->request;
FSINFO3res *reply = (FSINFO3res*)rop->reply;
if (self->parent->trace)
fprintf(stderr, "[%d] FSINFO %s\n", self->nfs_fd, std::string(args->fsroot).c_str());
if (args->fsroot != NFS_ROOT_HANDLE)
{
*reply = (FSINFO3res){ .status = NFS3ERR_INVAL };
}
else
{
// Fill info
*reply = (FSINFO3res){
.status = NFS3_OK,
.resok = (FSINFO3resok){
.obj_attributes = {
.attributes_follow = 0,
//.attributes = get_root_attributes(self),
},
.rtmax = 128*1024*1024,
.rtpref = 128*1024*1024,
.rtmult = 4096,
.wtmax = 128*1024*1024,
.wtpref = 128*1024*1024,
.wtmult = 4096,
.dtpref = 128,
.maxfilesize = 0x7fffffffffffffff,
.time_delta = {
.seconds = 1,
.nseconds = 0,
},
.properties = FSF3_SYMLINK | FSF3_HOMOGENEOUS,
},
};
}
rpc_queue_reply(rop);
return 0;
}
int nfs3_pathconf_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
PATHCONF3args *args = (PATHCONF3args*)rop->request;
PATHCONF3res *reply = (PATHCONF3res*)rop->reply;
if (self->parent->trace)
fprintf(stderr, "[%d] PATHCONF %s\n", self->nfs_fd, std::string(args->object).c_str());
if (args->object != NFS_ROOT_HANDLE)
{
*reply = (PATHCONF3res){ .status = NFS3ERR_INVAL };
}
else
{
// Fill info
*reply = (PATHCONF3res){
.status = NFS3_OK,
.resok = (PATHCONF3resok){
.obj_attributes = {
// Without at least one reference to a non-constant value (local variable or something else),
// with gcc 8 we get "internal compiler error: side-effects element in no-side-effects CONSTRUCTOR" here
// FIXME: get rid of this after raising compiler requirement
.attributes_follow = 0,
//.attributes = get_root_attributes(self),
},
.linkmax = 0,
.name_max = 255,
.no_trunc = TRUE,
.chown_restricted = FALSE,
.case_insensitive = FALSE,
.case_preserving = TRUE,
},
};
}
rpc_queue_reply(rop);
return 0;
}
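
The capacity arithmetic in `nfs3_fsstat_proc` can be sketched in isolation as follows. This is only an illustration under stated assumptions: `sketch_usable_bytes` is a hypothetical helper that does not exist in the source, the `*_raw_tb` pool statistics are assumed to be expressed in units of 2^40 bytes, and the guard against a non-positive `raw_to_usable` ratio is an addition that the original code does not have.

```cpp
#include <cstdint>

// Hypothetical helper (not part of the source): convert a raw capacity given
// in TB (assumed to mean 2^40 bytes) into usable bytes, dividing by the
// pool's raw-to-usable ratio (e.g. 3.0 for 3 replicas).
uint64_t sketch_usable_bytes(double raw_tb, double raw_to_usable)
{
    // Assumption: a missing or invalid ratio is reported as "0 bytes"
    // instead of dividing by zero.
    if (raw_to_usable <= 0)
        return 0;
    return (uint64_t)(raw_tb / raw_to_usable * (double)((uint64_t)1 << 40));
}
```

For example, 6 raw TB in a pool with a ratio of 1.5 would be reported as 4 usable TB, i.e. `(uint64_t)1 << 42` bytes.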

192
src/nfs_kv.cpp Normal file

@ -0,0 +1,192 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy over VitastorKV database - common functions
#include <sys/time.h>
#include "str_util.h"
#include "nfs_proxy.h"
#include "nfs_common.h"
#include "nfs_kv.h"
nfstime3 nfstime_from_str(const std::string & s)
{
// Zero-initialize so that nseconds is defined when the string has no fractional part
nfstime3 t = {};
auto p = s.find(".");
if (p != std::string::npos)
{
t.seconds = stoull_full(s.substr(0, p), 10);
t.nseconds = stoull_full(s.substr(p+1), 10);
p = s.size()-p-1;
for (; p < 9; p++)
t.nseconds *= 10;
for (; p > 9; p--)
t.nseconds /= 10;
}
else
t.seconds = stoull_full(s, 10);
return t;
}
static std::string timespec_to_str(timespec t)
{
char buf[64];
// %ju expects uintmax_t, so cast time_t / long explicitly
snprintf(buf, sizeof(buf), "%ju.%09ju", (uintmax_t)t.tv_sec, (uintmax_t)t.tv_nsec);
int l = strlen(buf);
while (l > 0 && buf[l-1] == '0')
l--;
if (l > 0 && buf[l-1] == '.')
l--;
buf[l] = 0;
return buf;
}
std::string nfstime_to_str(nfstime3 t)
{
return timespec_to_str((timespec){ .tv_sec = t.seconds, .tv_nsec = t.nseconds });
}
std::string nfstime_now_str()
{
timespec t;
clock_gettime(CLOCK_REALTIME, &t);
return timespec_to_str(t);
}
int kv_map_type(const std::string & type)
{
return (type == "" || type == "file" ? NF3REG :
(type == "dir" ? NF3DIR :
(type == "blk" ? NF3BLK :
(type == "chr" ? NF3CHR :
(type == "link" ? NF3LNK :
(type == "sock" ? NF3SOCK :
(type == "fifo" ? NF3FIFO : -1)))))));
}
fattr3 get_kv_attributes(nfs_client_t *self, uint64_t ino, json11::Json attrs)
{
auto type = kv_map_type(attrs["type"].string_value());
auto mode = attrs["mode"].uint64_value();
auto nlink = attrs["nlink"].uint64_value();
nfstime3 mtime = nfstime_from_str(attrs["mtime"].string_value());
nfstime3 atime = attrs["atime"].is_null() ? mtime : nfstime_from_str(attrs["atime"].string_value());
// FIXME In theory we could store the binary structure itself instead of JSON
return (fattr3){
.type = (type == 0 ? NF3REG : (ftype3)type),
.mode = (attrs["mode"].is_null() ? (type == NF3DIR ? 0755 : 0644) : (uint32_t)mode),
.nlink = (nlink == 0 ? 1 : (uint32_t)nlink),
.uid = (uint32_t)attrs["uid"].uint64_value(),
.gid = (uint32_t)attrs["gid"].uint64_value(),
.size = (type == NF3DIR ? 4096 : attrs["size"].uint64_value()),
.used = (type == NF3DIR ? 4096 : attrs["alloc"].uint64_value()),
.rdev = (type == NF3BLK || type == NF3CHR
? (specdata3){ (uint32_t)attrs["major"].uint64_value(), (uint32_t)attrs["minor"].uint64_value() }
: (specdata3){}),
.fsid = self->parent->fsid,
.fileid = ino,
.atime = atime,
.mtime = mtime,
.ctime = mtime,
};
}
std::string kv_direntry_key(uint64_t dir_ino, const std::string & filename)
{
// encode as: d <length> <hex dir_ino> / <filename>
char key[24] = { 0 };
snprintf(key, sizeof(key), "d-%jx/", dir_ino);
int n = strnlen(key, sizeof(key)-1) - 3;
if (n < 10)
key[1] = '0'+n;
else
key[1] = 'A'+(n-10);
return (char*)key + filename;
}
std::string kv_direntry_filename(const std::string & key)
{
// decode as: d <length> <hex dir_ino> / <filename>
auto pos = key.find("/");
if (pos != std::string::npos)
return key.substr(pos+1);
return key;
}
std::string kv_inode_key(uint64_t ino)
{
char key[24] = { 0 };
snprintf(key, sizeof(key), "i-%jx", ino);
int n = strnlen(key, sizeof(key)-1) - 2;
if (n < 10)
key[1] = '0'+n;
else
key[1] = 'A'+(n-10);
return std::string(key, n+2);
}
std::string kv_fh(uint64_t ino)
{
return "S"+std::string((char*)&ino, 8);
}
uint64_t kv_fh_inode(const std::string & fh)
{
if (fh.size() == 1 && fh[0] == 'R')
{
return 1;
}
else if (fh.size() == 9 && fh[0] == 'S')
{
return *(uint64_t*)&fh[1];
}
else if (fh.size() > 17 && fh[0] == 'I')
{
return *(uint64_t*)&fh[fh.size()-8];
}
return 0;
}
bool kv_fh_valid(const std::string & fh)
{
return fh == NFS_ROOT_HANDLE || fh.size() == 9 && fh[0] == 'S' || fh.size() > 17 && fh[0] == 'I';
}
void nfs_kv_procs(nfs_client_t *self)
{
struct rpc_service_proc_t pt[] = {
{NFS_PROGRAM, NFS_V3, NFS3_NULL, nfs3_null_proc, NULL, 0, NULL, 0, self},
{NFS_PROGRAM, NFS_V3, NFS3_GETATTR, kv_nfs3_getattr_proc, (xdrproc_t)xdr_GETATTR3args, sizeof(GETATTR3args), (xdrproc_t)xdr_GETATTR3res, sizeof(GETATTR3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_SETATTR, kv_nfs3_setattr_proc, (xdrproc_t)xdr_SETATTR3args, sizeof(SETATTR3args), (xdrproc_t)xdr_SETATTR3res, sizeof(SETATTR3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_LOOKUP, kv_nfs3_lookup_proc, (xdrproc_t)xdr_LOOKUP3args, sizeof(LOOKUP3args), (xdrproc_t)xdr_LOOKUP3res, sizeof(LOOKUP3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_ACCESS, nfs3_access_proc, (xdrproc_t)xdr_ACCESS3args, sizeof(ACCESS3args), (xdrproc_t)xdr_ACCESS3res, sizeof(ACCESS3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_READLINK, kv_nfs3_readlink_proc, (xdrproc_t)xdr_READLINK3args, sizeof(READLINK3args), (xdrproc_t)xdr_READLINK3res, sizeof(READLINK3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_READ, kv_nfs3_read_proc, (xdrproc_t)xdr_READ3args, sizeof(READ3args), (xdrproc_t)xdr_READ3res, sizeof(READ3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_WRITE, kv_nfs3_write_proc, (xdrproc_t)xdr_WRITE3args, sizeof(WRITE3args), (xdrproc_t)xdr_WRITE3res, sizeof(WRITE3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_CREATE, kv_nfs3_create_proc, (xdrproc_t)xdr_CREATE3args, sizeof(CREATE3args), (xdrproc_t)xdr_CREATE3res, sizeof(CREATE3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_MKDIR, kv_nfs3_mkdir_proc, (xdrproc_t)xdr_MKDIR3args, sizeof(MKDIR3args), (xdrproc_t)xdr_MKDIR3res, sizeof(MKDIR3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_SYMLINK, kv_nfs3_symlink_proc, (xdrproc_t)xdr_SYMLINK3args, sizeof(SYMLINK3args), (xdrproc_t)xdr_SYMLINK3res, sizeof(SYMLINK3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_MKNOD, kv_nfs3_mknod_proc, (xdrproc_t)xdr_MKNOD3args, sizeof(MKNOD3args), (xdrproc_t)xdr_MKNOD3res, sizeof(MKNOD3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_REMOVE, kv_nfs3_remove_proc, (xdrproc_t)xdr_REMOVE3args, sizeof(REMOVE3args), (xdrproc_t)xdr_REMOVE3res, sizeof(REMOVE3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_RMDIR, kv_nfs3_rmdir_proc, (xdrproc_t)xdr_RMDIR3args, sizeof(RMDIR3args), (xdrproc_t)xdr_RMDIR3res, sizeof(RMDIR3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_RENAME, kv_nfs3_rename_proc, (xdrproc_t)xdr_RENAME3args, sizeof(RENAME3args), (xdrproc_t)xdr_RENAME3res, sizeof(RENAME3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_LINK, kv_nfs3_link_proc, (xdrproc_t)xdr_LINK3args, sizeof(LINK3args), (xdrproc_t)xdr_LINK3res, sizeof(LINK3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_READDIR, kv_nfs3_readdir_proc, (xdrproc_t)xdr_READDIR3args, sizeof(READDIR3args), (xdrproc_t)xdr_READDIR3res, sizeof(READDIR3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_READDIRPLUS, kv_nfs3_readdirplus_proc, (xdrproc_t)xdr_READDIRPLUS3args, sizeof(READDIRPLUS3args), (xdrproc_t)xdr_READDIRPLUS3res, sizeof(READDIRPLUS3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_FSSTAT, nfs3_fsstat_proc, (xdrproc_t)xdr_FSSTAT3args, sizeof(FSSTAT3args), (xdrproc_t)xdr_FSSTAT3res, sizeof(FSSTAT3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_FSINFO, nfs3_fsinfo_proc, (xdrproc_t)xdr_FSINFO3args, sizeof(FSINFO3args), (xdrproc_t)xdr_FSINFO3res, sizeof(FSINFO3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_PATHCONF, nfs3_pathconf_proc, (xdrproc_t)xdr_PATHCONF3args, sizeof(PATHCONF3args), (xdrproc_t)xdr_PATHCONF3res, sizeof(PATHCONF3res), self},
{NFS_PROGRAM, NFS_V3, NFS3_COMMIT, nfs3_commit_proc, (xdrproc_t)xdr_COMMIT3args, sizeof(COMMIT3args), (xdrproc_t)xdr_COMMIT3res, sizeof(COMMIT3res), self},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_NULL, nfs3_null_proc, NULL, 0, NULL, 0, self},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_MNT, mount3_mnt_proc, (xdrproc_t)xdr_nfs_dirpath, sizeof(nfs_dirpath), (xdrproc_t)xdr_nfs_mountres3, sizeof(nfs_mountres3), self},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_DUMP, mount3_dump_proc, NULL, 0, (xdrproc_t)xdr_nfs_mountlist, sizeof(nfs_mountlist), self},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_UMNT, mount3_umnt_proc, (xdrproc_t)xdr_nfs_dirpath, sizeof(nfs_dirpath), NULL, 0, self},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_UMNTALL, mount3_umntall_proc, NULL, 0, NULL, 0, self},
{MOUNT_PROGRAM, MOUNT_V3, MOUNT3_EXPORT, mount3_export_proc, NULL, 0, (xdrproc_t)xdr_nfs_exports, sizeof(nfs_exports), self},
};
for (int i = 0; i < sizeof(pt)/sizeof(pt[0]); i++)
{
self->proc_table.insert(pt[i]);
}
}
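
The key layout produced by `kv_inode_key()` and `kv_direntry_key()` above (a type letter, one character encoding the number of hex digits, the hex inode number, and for direntries a `/` plus the file name) can be reproduced with the minimal sketch below. The `sketch_*` names are hypothetical and exist only for this illustration. The apparent benefit of the length character, as far as can be inferred from the code, is that plain lexicographic ordering of keys then matches numeric ordering of inode numbers.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: encode the count of hex digits (1..16) as one character,
// '1'..'9' for 1..9 and 'A'..'G' for 10..16, so it sorts in numeric order.
static char sketch_len_char(int n)
{
    return (char)(n < 10 ? '0' + n : 'A' + (n - 10));
}

// Sketch of the inode key layout: i <length> <hex ino>
std::string sketch_inode_key(uint64_t ino)
{
    char hex[17];
    int n = snprintf(hex, sizeof(hex), "%" PRIx64, ino);
    return std::string("i") + sketch_len_char(n) + hex;
}

// Sketch of the direntry key layout: d <length> <hex dir_ino> / <filename>
std::string sketch_direntry_key(uint64_t dir_ino, const std::string & filename)
{
    char hex[17];
    int n = snprintf(hex, sizeof(hex), "%" PRIx64, dir_ino);
    return std::string("d") + sketch_len_char(n) + hex + "/" + filename;
}
```

With this layout inode `0xff` maps to `i2ff` and inode `0x100` to `i3100`, so the shorter number sorts first even though `f` is greater than `1` as a character.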

97
src/nfs_kv.h Normal file

@ -0,0 +1,97 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy over VitastorKV database - header
#pragma once
#include "nfs/nfs.h"
#define KV_ROOT_INODE 1
#define KV_NEXT_ID_KEY "id"
#define SHARED_FILE_MAGIC_V1 0x711A5158A6EDF17E
struct nfs_kv_write_state;
struct list_cookie_t
{
uint64_t dir_ino, cookieverf, cookie;
};
inline bool operator < (const list_cookie_t & a, const list_cookie_t & b)
{
return a.dir_ino < b.dir_ino || a.dir_ino == b.dir_ino &&
(a.cookieverf < b.cookieverf || a.cookieverf == b.cookieverf && a.cookie < b.cookie);
}
struct list_cookie_val_t
{
std::string key;
};
struct shared_alloc_queue_t
{
nfs_kv_write_state *st;
int state;
uint64_t size;
};
struct kv_inode_extend_t
{
int refcnt = 0;
uint64_t cur_extend = 0, next_extend = 0, done_extend = 0;
std::vector<std::function<void()>> waiters;
};
struct kv_fs_state_t
{
std::map<list_cookie_t, list_cookie_val_t> list_cookies;
uint64_t fs_next_id = 1, fs_allocated_id = 0;
std::vector<uint64_t> unallocated_ids;
std::vector<shared_alloc_queue_t> allocating_shared;
uint64_t cur_shared_inode = 0, cur_shared_offset = 0;
std::map<inode_t, kv_inode_extend_t> extends;
std::vector<uint8_t> zero_block;
};
struct shared_file_header_t
{
uint64_t magic = 0;
uint64_t inode = 0;
uint64_t alloc = 0;
};
nfsstat3 vitastor_nfs_map_err(int err);
nfstime3 nfstime_from_str(const std::string & s);
std::string nfstime_to_str(nfstime3 t);
std::string nfstime_now_str();
int kv_map_type(const std::string & type);
fattr3 get_kv_attributes(nfs_client_t *self, uint64_t ino, json11::Json attrs);
std::string kv_direntry_key(uint64_t dir_ino, const std::string & filename);
std::string kv_direntry_filename(const std::string & key);
std::string kv_inode_key(uint64_t ino);
std::string kv_fh(uint64_t ino);
uint64_t kv_fh_inode(const std::string & fh);
bool kv_fh_valid(const std::string & fh);
void allocate_new_id(nfs_client_t *self, std::function<void(int res, uint64_t new_id)> cb);
void kv_read_inode(nfs_client_t *self, uint64_t ino,
std::function<void(int res, const std::string & value, json11::Json ientry)> cb,
bool allow_cache = false);
uint64_t align_shared_size(nfs_client_t *self, uint64_t size);
int kv_nfs3_getattr_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_setattr_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_lookup_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_readlink_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_read_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_write_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_create_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_mkdir_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_symlink_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_mknod_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_remove_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_rmdir_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_rename_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_link_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_readdir_proc(void *opaque, rpc_op_t *rop);
int kv_nfs3_readdirplus_proc(void *opaque, rpc_op_t *rop);
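
The `nfstime_from_str()` / `nfstime_to_str()` helpers declared above store timestamps as decimal strings such as `12.5`, where the fractional part is scaled to nanoseconds on parsing and trailing zeros are dropped on formatting. The sketch below shows that round trip with standard library calls only. `sketch_time_t` and both functions are hypothetical stand-ins (the real code uses `nfstime3` and `stoull_full`), and well-formed input is assumed because `std::stoull` throws on an empty or non-numeric string.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical stand-in for nfstime3
struct sketch_time_t
{
    uint64_t seconds = 0;
    uint32_t nseconds = 0;
};

// Parse "<seconds>[.<fraction>]"; the fraction is padded or truncated to 9 digits
sketch_time_t sketch_time_from_str(const std::string & s)
{
    sketch_time_t t;
    size_t p = s.find('.');
    t.seconds = std::stoull(s.substr(0, p));
    if (p != std::string::npos)
    {
        std::string frac = s.substr(p + 1);
        frac.resize(9, '0'); // "5" -> "500000000", "0000000019" -> "000000001"
        t.nseconds = (uint32_t)std::stoul(frac);
    }
    return t;
}

// Format as "<seconds>.<9 digits>", then strip trailing zeros and a trailing dot
std::string sketch_time_to_str(sketch_time_t t)
{
    char buf[64];
    snprintf(buf, sizeof(buf), "%llu.%09u", (unsigned long long)t.seconds, (unsigned)t.nseconds);
    size_t l = strlen(buf);
    while (l > 0 && buf[l-1] == '0')
        l--;
    if (l > 0 && buf[l-1] == '.')
        l--;
    return std::string(buf, l);
}
```

Stripping zeros is safe for whole seconds such as `10` because the formatted string always contains a dot, so the loop stops there before it can reach the integer part.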

343
src/nfs_kv_create.cpp Normal file

@ -0,0 +1,343 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy over VitastorKV database - CREATE, MKDIR, SYMLINK, MKNOD
#include <sys/time.h>
#include "str_util.h"
#include "nfs_proxy.h"
#include "nfs_kv.h"
void allocate_new_id(nfs_client_t *self, std::function<void(int res, uint64_t new_id)> cb)
{
if (self->parent->kvfs->fs_next_id <= self->parent->kvfs->fs_allocated_id)
{
cb(0, self->parent->kvfs->fs_next_id++);
return;
}
else if (self->parent->kvfs->fs_next_id > self->parent->fs_inode_count)
{
cb(-ENOSPC, 0);
return;
}
self->parent->db->get(KV_NEXT_ID_KEY, [=](int res, const std::string & prev_str)
{
if (res < 0 && res != -ENOENT)
{
cb(res, 0);
return;
}
uint64_t prev_val = stoull_full(prev_str);
if (prev_val >= self->parent->fs_inode_count)
{
cb(-ENOSPC, 0);
return;
}
if (prev_val < 1)
{
prev_val = 1;
}
uint64_t new_val = prev_val + self->parent->id_alloc_batch_size;
if (new_val >= self->parent->fs_inode_count)
{
new_val = self->parent->fs_inode_count;
}
self->parent->db->set(KV_NEXT_ID_KEY, std::to_string(new_val), [=](int res)
{
if (res == -EAGAIN)
{
// CAS failure - retry
allocate_new_id(self, cb);
}
else if (res < 0)
{
cb(res, 0);
}
else
{
self->parent->kvfs->fs_next_id = prev_val+2;
self->parent->kvfs->fs_allocated_id = new_val;
cb(0, prev_val+1);
}
}, [prev_val](int res, const std::string & value)
{
// FIXME: Allow to modify value from CAS callback? ("update" query)
return res < 0 || stoull_full(value) == prev_val;
});
});
}
struct kv_create_state
{
nfs_client_t *self = NULL;
rpc_op_t *rop = NULL;
bool exclusive = false;
uint64_t verf = 0;
uint64_t dir_ino = 0;
std::string filename;
int res = 0;
uint64_t new_id = 0;
json11::Json::object attrobj;
json11::Json attrs;
std::string direntry_text;
uint64_t dup_ino = 0;
std::function<void(int res)> cb;
};
static void kv_continue_create(kv_create_state *st, int state)
{
if (state == 0) {}
else if (state == 1) goto resume_1;
else if (state == 2) goto resume_2;
else if (state == 3) goto resume_3;
else if (state == 4) goto resume_4;
else if (state == 5) goto resume_5;
if (st->self->parent->trace)
fprintf(stderr, "[%d] CREATE %ju/%s ATTRS %s\n", st->self->nfs_fd, st->dir_ino, st->filename.c_str(), json11::Json(st->attrobj).dump().c_str());
if (st->filename == "" || st->filename.find("/") != std::string::npos)
{
auto cb = std::move(st->cb);
cb(-EINVAL);
return;
}
if (st->attrobj.find("mtime") == st->attrobj.end())
st->attrobj["mtime"] = nfstime_now_str();
if (st->attrobj.find("atime") == st->attrobj.end())
st->attrobj["atime"] = st->attrobj["mtime"];
st->attrs = std::move(st->attrobj);
resume_1:
// Generate inode ID
allocate_new_id(st->self, [st](int res, uint64_t new_id)
{
st->res = res;
st->new_id = new_id;
kv_continue_create(st, 2);
});
return;
resume_2:
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
st->self->parent->db->set(kv_inode_key(st->new_id), st->attrs.dump().c_str(), [st](int res)
{
st->res = res;
kv_continue_create(st, 3);
}, [st](int res, const std::string & value)
{
return res == -ENOENT;
});
return;
resume_3:
if (st->res == -EAGAIN)
{
// Inode ID generator failure - retry
goto resume_1;
}
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
{
auto direntry = json11::Json::object{ { "ino", st->new_id } };
if (st->attrs["type"].string_value() == "dir")
{
direntry["type"] = "dir";
}
st->direntry_text = json11::Json(direntry).dump().c_str();
}
// Set direntry
st->dup_ino = 0;
st->self->parent->db->set(kv_direntry_key(st->dir_ino, st->filename), st->direntry_text, [st](int res)
{
st->res = res;
kv_continue_create(st, 4);
}, [st](int res, const std::string & value)
{
// CAS compare - check that the key doesn't exist
if (res == 0)
{
std::string err;
auto direntry = json11::Json::parse(value, err);
if (err != "")
{
fprintf(stderr, "Invalid JSON in direntry %s = %s: %s, overwriting\n",
kv_direntry_key(st->dir_ino, st->filename).c_str(), value.c_str(), err.c_str());
return true;
}
if (st->exclusive && direntry["verf"].uint64_value() == st->verf)
{
st->dup_ino = direntry["ino"].uint64_value();
return false;
}
return false;
}
return true;
});
return;
resume_4:
if (st->res == -EAGAIN)
{
// Direntry already exists
st->self->parent->db->del(kv_inode_key(st->new_id), [st](int res)
{
st->res = res;
kv_continue_create(st, 5);
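});
// Wait for the delete to complete; without this return we would fall through
// to resume_5 immediately and the callback would later touch a freed state
return;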
resume_5:
if (st->res < 0)
{
fprintf(stderr, "failed to delete duplicate inode %ju left from create %s (code %d)\n", st->new_id, strerror(-st->res), st->res);
}
else
{
st->self->parent->kvfs->unallocated_ids.push_back(st->new_id);
}
if (st->dup_ino)
{
// Successfully created by the previous "exclusive" request
st->new_id = st->dup_ino;
}
st->res = st->dup_ino ? 0 : -EEXIST;
}
auto cb = std::move(st->cb);
cb(st->res);
}
static void kv_create_setattr(json11::Json::object & attrobj, sattr3 & sattr)
{
if (sattr.mode.set_it)
attrobj["mode"] = (uint64_t)sattr.mode.mode;
if (sattr.uid.set_it)
attrobj["uid"] = (uint64_t)sattr.uid.uid;
if (sattr.gid.set_it)
attrobj["gid"] = (uint64_t)sattr.gid.gid;
if (sattr.atime.set_it)
attrobj["atime"] = nfstime_to_str(sattr.atime.atime);
if (sattr.mtime.set_it)
attrobj["mtime"] = nfstime_to_str(sattr.mtime.mtime);
}
template<class T, class Tok> static void kv_create_reply(kv_create_state *st, int res)
{
T *reply = (T*)st->rop->reply;
if (res < 0)
{
*reply = (T){ .status = vitastor_nfs_map_err(-res) };
}
else
{
*reply = (T){
.status = NFS3_OK,
.resok = (Tok){
.obj = {
.handle_follows = 1,
.handle = xdr_copy_string(st->rop->xdrs, kv_fh(st->new_id)),
},
.obj_attributes = {
.attributes_follow = 1,
.attributes = get_kv_attributes(st->self, st->new_id, st->attrs),
},
},
};
}
rpc_queue_reply(st->rop);
delete st;
}
int kv_nfs3_create_proc(void *opaque, rpc_op_t *rop)
{
kv_create_state *st = new kv_create_state;
st->self = (nfs_client_t*)opaque;
st->rop = rop;
auto args = (CREATE3args*)rop->request;
st->exclusive = args->how.mode == NFS_EXCLUSIVE;
st->verf = st->exclusive ? *(uint64_t*)&args->how.verf : 0;
st->dir_ino = kv_fh_inode(args->where.dir);
st->filename = args->where.name;
if (args->how.mode == NFS_EXCLUSIVE)
{
st->attrobj["verf"] = *(uint64_t*)&args->how.verf;
}
else if (args->how.mode == NFS_UNCHECKED)
{
kv_create_setattr(st->attrobj, args->how.obj_attributes);
if (args->how.obj_attributes.size.set_it)
{
st->attrobj["size"] = (uint64_t)args->how.obj_attributes.size.size;
st->attrobj["empty"] = true;
}
}
st->cb = [st](int res) { kv_create_reply<CREATE3res, CREATE3resok>(st, res); };
kv_continue_create(st, 0);
return 1;
}
int kv_nfs3_mkdir_proc(void *opaque, rpc_op_t *rop)
{
kv_create_state *st = new kv_create_state;
st->self = (nfs_client_t*)opaque;
st->rop = rop;
auto args = (MKDIR3args*)rop->request;
st->dir_ino = kv_fh_inode(args->where.dir);
st->filename = args->where.name;
st->attrobj["type"] = "dir";
st->attrobj["parent_ino"] = st->dir_ino;
kv_create_setattr(st->attrobj, args->attributes);
st->cb = [st](int res) { kv_create_reply<MKDIR3res, MKDIR3resok>(st, res); };
kv_continue_create(st, 0);
return 1;
}
int kv_nfs3_symlink_proc(void *opaque, rpc_op_t *rop)
{
kv_create_state *st = new kv_create_state;
st->self = (nfs_client_t*)opaque;
st->rop = rop;
auto args = (SYMLINK3args*)rop->request;
st->dir_ino = kv_fh_inode(args->where.dir);
st->filename = args->where.name;
st->attrobj["type"] = "link";
st->attrobj["symlink"] = (std::string)args->symlink.symlink_data;
kv_create_setattr(st->attrobj, args->symlink.symlink_attributes);
st->cb = [st](int res) { kv_create_reply<SYMLINK3res, SYMLINK3resok>(st, res); };
kv_continue_create(st, 0);
return 1;
}
int kv_nfs3_mknod_proc(void *opaque, rpc_op_t *rop)
{
kv_create_state *st = new kv_create_state;
st->self = (nfs_client_t*)opaque;
st->rop = rop;
auto args = (MKNOD3args*)rop->request;
st->dir_ino = kv_fh_inode(args->where.dir);
st->filename = args->where.name;
if (args->what.type == NF3CHR || args->what.type == NF3BLK)
{
st->attrobj["type"] = (args->what.type == NF3CHR ? "chr" : "blk");
st->attrobj["major"] = (uint64_t)args->what.chr_device.spec.specdata1;
st->attrobj["minor"] = (uint64_t)args->what.chr_device.spec.specdata2;
kv_create_setattr(st->attrobj, args->what.chr_device.dev_attributes);
}
else if (args->what.type == NF3SOCK || args->what.type == NF3FIFO)
{
st->attrobj["type"] = (args->what.type == NF3SOCK ? "sock" : "fifo");
kv_create_setattr(st->attrobj, args->what.sock_attributes);
}
else
{
*(MKNOD3res*)rop->reply = (MKNOD3res){ .status = NFS3ERR_INVAL };
rpc_queue_reply(rop);
delete st;
return 0;
}
st->cb = [st](int res) { kv_create_reply<MKNOD3res, MKNOD3resok>(st, res); };
kv_continue_create(st, 0);
return 1;
}
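
The batched ID reservation in `allocate_new_id()` above can be illustrated with a synchronous, in-memory sketch. Everything named `sketch_*` is hypothetical: the real code is asynchronous, keeps the counter under `KV_NEXT_ID_KEY` in the database and detects a lost compare-and-set through `-EAGAIN`. The sketch also returns 0 in place of `-ENOSPC` and omits the early `fs_next_id > fs_inode_count` check.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical in-memory stand-in for the shared "next ID" key
struct sketch_counter_store
{
    uint64_t value = 0;
    // Compare-and-set: write only if nobody changed the value in between
    bool cas(uint64_t expected, uint64_t desired)
    {
        if (value != expected)
            return false;
        value = desired;
        return true;
    }
};

struct sketch_id_allocator
{
    sketch_counter_store *store;
    uint64_t batch_size, max_id;
    uint64_t next_id = 1, allocated_id = 0;

    sketch_id_allocator(sketch_counter_store *s, uint64_t batch, uint64_t max)
        : store(s), batch_size(batch), max_id(max) {}

    // Returns a fresh ID, or 0 when the ID space is exhausted
    uint64_t allocate()
    {
        // Fast path: hand out IDs from the locally reserved batch
        if (next_id <= allocated_id)
            return next_id++;
        for (;;)
        {
            uint64_t seen = store->value;
            if (seen >= max_id)
                return 0;
            uint64_t prev = std::max<uint64_t>(seen, 1); // ID 1 is the root inode
            uint64_t reserved = std::min(prev + batch_size, max_id);
            if (!store->cas(seen, reserved))
                continue; // another allocator won the race, retry
            next_id = prev + 2;
            allocated_id = reserved;
            return prev + 1;
        }
    }
};
```

With a batch size of 3 the first allocator receives IDs 2, 3 and 4 after a single store update, and a second allocator sharing the same store starts from a later range, so the two never hand out the same ID.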

78
src/nfs_kv_getattr.cpp Normal file

@ -0,0 +1,78 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy over VitastorKV database - GETATTR
#include <sys/time.h>
#include "nfs_proxy.h"
#include "nfs_kv.h"
// Attributes are always stored in the inode
void kv_read_inode(nfs_client_t *self, uint64_t ino,
std::function<void(int res, const std::string & value, json11::Json ientry)> cb,
bool allow_cache)
{
auto key = kv_inode_key(ino);
self->parent->db->get(key, [=](int res, const std::string & value)
{
if (ino == KV_ROOT_INODE && res == -ENOENT)
{
// Allow root inode to not exist
cb(0, "", json11::Json(json11::Json::object{ { "type", "dir" } }));
return;
}
if (res < 0)
{
if (res != -ENOENT)
fprintf(stderr, "Error reading inode %s: %s (code %d)\n", kv_inode_key(ino).c_str(), strerror(-res), res);
cb(res, "", json11::Json());
return;
}
std::string err;
auto attrs = json11::Json::parse(value, err);
if (err != "")
{
fprintf(stderr, "Invalid JSON in inode %s = %s: %s\n", kv_inode_key(ino).c_str(), value.c_str(), err.c_str());
res = -EIO;
}
cb(res, value, attrs);
}, allow_cache);
}
int kv_nfs3_getattr_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
GETATTR3args *args = (GETATTR3args*)rop->request;
GETATTR3res *reply = (GETATTR3res*)rop->reply;
std::string fh = args->object;
auto ino = kv_fh_inode(fh);
if (self->parent->trace)
fprintf(stderr, "[%d] GETATTR %ju\n", self->nfs_fd, ino);
if (!kv_fh_valid(fh))
{
*reply = (GETATTR3res){ .status = NFS3ERR_INVAL };
rpc_queue_reply(rop);
return 0;
}
kv_read_inode(self, ino, [=](int res, const std::string & value, json11::Json attrs)
{
if (self->parent->trace)
fprintf(stderr, "[%d] GETATTR %ju -> %s\n", self->nfs_fd, ino, value.c_str());
if (res < 0)
{
*reply = (GETATTR3res){ .status = vitastor_nfs_map_err(-res) };
}
else
{
*reply = (GETATTR3res){
.status = NFS3_OK,
.resok = (GETATTR3resok){
.obj_attributes = get_kv_attributes(self, ino, attrs),
},
};
}
rpc_queue_reply(rop);
});
return 1;
}

188
src/nfs_kv_link.cpp Normal file

@ -0,0 +1,188 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy over VitastorKV database - LINK
#include <sys/time.h>
#include "nfs_proxy.h"
#include "nfs_kv.h"
struct nfs_kv_link_state
{
nfs_client_t *self = NULL;
rpc_op_t *rop = NULL;
uint64_t ino = 0;
uint64_t dir_ino = 0;
std::string filename;
std::string ientry_text;
json11::Json ientry;
bool retrying = false;
int wait = 0;
int res = 0, res2 = 0;
std::function<void(int)> cb;
};
static void nfs_kv_continue_link(nfs_kv_link_state *st, int state)
{
// 1) Read the source inode
// 2) If it's a directory - fail with -EISDIR
// 3) Create the new direntry with the same inode reference
// 4) Update the inode entry with refcount++
// 5) Retry update if CAS failed but the inode exists
// 6) Otherwise fail and remove the new direntry
// A crash between steps 3 and 4 may leave a direntry that is not counted in nlink,
// but the opposite order could leave an inode with a refcount that is too high
if (state == 0) {}
else if (state == 1) goto resume_1;
else if (state == 2) goto resume_2;
else if (state == 3) goto resume_3;
else if (state == 4) goto resume_4;
else
{
fprintf(stderr, "BUG: invalid state in nfs_kv_continue_link()\n");
abort();
}
resume_0:
// Check that the source inode exists and is not a directory
st->wait = st->retrying ? 1 : 2;
st->res2 = 0;
kv_read_inode(st->self, st->ino, [st](int res, const std::string & value, json11::Json attrs)
{
st->res = res == 0 ? (attrs["type"].string_value() == "dir" ? -EISDIR : 0) : res;
st->ientry_text = value;
st->ientry = attrs;
if (!--st->wait)
nfs_kv_continue_link(st, 1);
});
if (!st->retrying)
{
// Check that the destination directory exists and is really a directory
kv_read_inode(st->self, st->dir_ino, [st](int res, const std::string & value, json11::Json attrs)
{
st->res2 = res == 0 ? (attrs["type"].string_value() == "dir" ? 0 : -ENOTDIR) : res;
if (!--st->wait)
nfs_kv_continue_link(st, 1);
});
}
return;
resume_1:
if (st->res < 0 || st->res2 < 0)
{
auto cb = std::move(st->cb);
cb(st->res < 0 ? st->res : st->res2);
return;
}
// Write the new direntry
if (!st->retrying)
{
st->self->parent->db->set(kv_direntry_key(st->dir_ino, st->filename),
json11::Json(json11::Json::object{ { "ino", st->ino } }).dump(), [st](int res)
{
st->res = res;
nfs_kv_continue_link(st, 2);
}, [st](int res, const std::string & old_value)
{
return res == -ENOENT;
});
return;
resume_2:
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
}
// Increase inode refcount
{
auto new_ientry = st->ientry.object_items();
auto nlink = new_ientry["nlink"].uint64_value();
new_ientry["nlink"] = nlink ? nlink+1 : 2;
st->ientry = new_ientry;
}
st->self->parent->db->set(kv_inode_key(st->ino), st->ientry.dump(), [st](int res)
{
st->res = res;
nfs_kv_continue_link(st, 3);
}, [st](int res, const std::string & old_value)
{
st->res2 = res;
return res == 0 && old_value == st->ientry_text;
});
return;
resume_3:
if (st->res2 == -ENOENT)
{
st->res = -ENOENT;
}
if (st->res == -EAGAIN)
{
// Re-read inode and retry
st->retrying = true;
goto resume_0;
}
if (st->res < 0)
{
// Maybe inode was deleted in the meantime, delete our direntry
st->self->parent->db->del(kv_direntry_key(st->dir_ino, st->filename), [st](int res)
{
st->res2 = res;
nfs_kv_continue_link(st, 4);
});
return;
resume_4:
if (st->res2 < 0)
{
fprintf(stderr, "Warning: failed to delete new linked direntry %ju/%s: %s (code %d)\n",
st->dir_ino, st->filename.c_str(), strerror(-st->res2), st->res2);
}
}
auto cb = std::move(st->cb);
cb(st->res);
}
int kv_nfs3_link_proc(void *opaque, rpc_op_t *rop)
{
auto st = new nfs_kv_link_state;
st->self = (nfs_client_t*)opaque;
st->rop = rop;
LINK3args *args = (LINK3args*)rop->request;
st->ino = kv_fh_inode(args->file);
st->dir_ino = kv_fh_inode(args->link.dir);
st->filename = args->link.name;
if (st->self->parent->trace)
fprintf(stderr, "[%d] LINK %ju -> %ju/%s\n", st->self->nfs_fd, st->ino, st->dir_ino, st->filename.c_str());
if (!st->ino || !st->dir_ino || st->filename == "")
{
LINK3res *reply = (LINK3res*)rop->reply;
*reply = (LINK3res){ .status = NFS3ERR_INVAL };
rpc_queue_reply(rop);
delete st;
return 0;
}
st->cb = [st](int res)
{
LINK3res *reply = (LINK3res*)st->rop->reply;
if (res < 0)
{
*reply = (LINK3res){ .status = vitastor_nfs_map_err(res) };
}
else
{
*reply = (LINK3res){
.status = NFS3_OK,
.resok = (LINK3resok){
.file_attributes = (post_op_attr){
.attributes_follow = 1,
.attributes = get_kv_attributes(st->self, st->ino, st->ientry),
},
},
};
}
rpc_queue_reply(st->rop);
delete st;
};
nfs_kv_continue_link(st, 0);
return 1;
}
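
Review note on steps 4 to 6 of nfs_kv_continue_link(): the inode update is a compare-and-set against the text that was read earlier, and it is retried when the store reports -EAGAIN. The sketch below shows only that loop. It assumes a hypothetical in-memory store (toy_kv_t, toy_cas_set and toy_inc_nlink are illustrative names, not VitastorKV calls) and keeps the counter as a plain decimal string instead of a JSON inode entry.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Hypothetical in-memory stand-in for the KV database (not the VitastorKV API)
using toy_kv_t = std::map<std::string, std::string>;

// Compare-and-set: write new_value only if the key still holds `expected`.
// Returns 0 on success, -ENOENT if the key is gone, -EAGAIN if the value changed.
int toy_cas_set(toy_kv_t & db, const std::string & key,
    const std::string & expected, const std::string & new_value)
{
    auto it = db.find(key);
    if (it == db.end())
        return -ENOENT;
    if (it->second != expected)
        return -EAGAIN;
    it->second = new_value;
    return 0;
}

// Increment a link counter with read, CAS and retry on -EAGAIN.
// between_read_and_write is a test hook that simulates a concurrent writer.
int toy_inc_nlink(toy_kv_t & db, const std::string & key,
    const std::function<void()> & between_read_and_write = nullptr)
{
    for (int attempt = 0; attempt < 16; attempt++)
    {
        auto it = db.find(key);
        if (it == db.end())
            return -ENOENT;
        std::string old_text = it->second;
        uint64_t nlink = std::stoull(old_text);
        if (between_read_and_write)
            between_read_and_write();
        // Same rule as the diff: a missing (zero) counter is treated as 1
        int res = toy_cas_set(db, key, old_text, std::to_string(nlink ? nlink+1 : 2));
        if (res != -EAGAIN)
            return res;
    }
    return -EAGAIN;
}
```

If the key disappears between the read and the write, the helper returns -ENOENT, which corresponds to the case where the real code gives up and removes the new direntry.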

104
src/nfs_kv_lookup.cpp Normal file

@@ -0,0 +1,104 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy over VitastorKV database - LOOKUP, READLINK
#include <sys/time.h>
#include "nfs_proxy.h"
#include "nfs_kv.h"
int kv_nfs3_lookup_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
LOOKUP3args *args = (LOOKUP3args*)rop->request;
LOOKUP3res *reply = (LOOKUP3res*)rop->reply;
inode_t dir_ino = kv_fh_inode(args->what.dir);
std::string filename = args->what.name;
if (self->parent->trace)
fprintf(stderr, "[%d] LOOKUP %ju/%s\n", self->nfs_fd, dir_ino, filename.c_str());
if (!dir_ino || filename == "")
{
*reply = (LOOKUP3res){ .status = NFS3ERR_INVAL };
rpc_queue_reply(rop);
return 0;
}
self->parent->db->get(kv_direntry_key(dir_ino, filename), [=](int res, const std::string & value)
{
if (res < 0)
{
*reply = (LOOKUP3res){ .status = vitastor_nfs_map_err(-res) };
rpc_queue_reply(rop);
return;
}
std::string err;
auto direntry = json11::Json::parse(value, err);
if (err != "")
{
fprintf(stderr, "Invalid JSON in direntry %s = %s: %s\n", kv_direntry_key(dir_ino, filename).c_str(), value.c_str(), err.c_str());
*reply = (LOOKUP3res){ .status = NFS3ERR_IO };
rpc_queue_reply(rop);
return;
}
uint64_t ino = direntry["ino"].uint64_value();
kv_read_inode(self, ino, [=](int res, const std::string & value, json11::Json ientry)
{
if (res < 0)
{
*reply = (LOOKUP3res){ .status = vitastor_nfs_map_err(res == -ENOENT ? -EIO : res) };
rpc_queue_reply(rop);
return;
}
*reply = (LOOKUP3res){
.status = NFS3_OK,
.resok = (LOOKUP3resok){
.object = xdr_copy_string(rop->xdrs, kv_fh(ino)),
.obj_attributes = {
.attributes_follow = 1,
.attributes = get_kv_attributes(self, ino, ientry),
},
},
};
rpc_queue_reply(rop);
});
});
return 1;
}
int kv_nfs3_readlink_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
READLINK3args *args = (READLINK3args*)rop->request;
if (self->parent->trace)
fprintf(stderr, "[%d] READLINK %ju\n", self->nfs_fd, kv_fh_inode(args->symlink));
READLINK3res *reply = (READLINK3res*)rop->reply;
if (!kv_fh_valid(args->symlink) || args->symlink == NFS_ROOT_HANDLE)
{
// Invalid filehandle or trying to read symlink from root entry
*reply = (READLINK3res){ .status = NFS3ERR_INVAL };
rpc_queue_reply(rop);
return 0;
}
kv_read_inode(self, kv_fh_inode(args->symlink), [=](int res, const std::string & value, json11::Json attrs)
{
if (res < 0)
{
*reply = (READLINK3res){ .status = vitastor_nfs_map_err(-res) };
}
else if (attrs["type"] != "link")
{
*reply = (READLINK3res){ .status = NFS3ERR_INVAL };
}
else
{
*reply = (READLINK3res){
.status = NFS3_OK,
.resok = (READLINK3resok){
.data = xdr_copy_string(rop->xdrs, attrs["symlink"].string_value()),
},
};
}
rpc_queue_reply(rop);
});
return 1;
}

163
src/nfs_kv_read.cpp Normal file

@@ -0,0 +1,163 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy over VitastorKV database - READ
#include <sys/time.h>
#include "nfs_proxy.h"
#include "nfs_kv.h"
struct nfs_kv_read_state
{
nfs_client_t *self = NULL;
rpc_op_t *rop = NULL;
bool allow_cache = true;
inode_t ino = 0;
uint64_t offset = 0, size = 0;
std::function<void(int)> cb;
// state
int res = 0;
json11::Json ientry;
uint64_t aligned_size = 0, aligned_offset = 0;
uint8_t *aligned_buf = NULL;
cluster_op_t *op = NULL;
uint8_t *buf = NULL;
};
static void nfs_kv_continue_read(nfs_kv_read_state *st, int state)
{
if (state == 0) {}
else if (state == 1) goto resume_1;
else if (state == 2) goto resume_2;
else if (state == 3) goto resume_3;
else
{
fprintf(stderr, "BUG: invalid state in nfs_kv_continue_read()\n");
abort();
}
if (st->offset + sizeof(shared_file_header_t) < st->self->parent->shared_inode_threshold)
{
kv_read_inode(st->self, st->ino, [st](int res, const std::string & value, json11::Json attrs)
{
st->res = res;
st->ientry = attrs;
nfs_kv_continue_read(st, 1);
}, st->allow_cache);
return;
resume_1:
if (st->res < 0 || kv_map_type(st->ientry["type"].string_value()) != NF3REG)
{
auto cb = std::move(st->cb);
cb(st->res < 0 ? st->res : -EINVAL);
return;
}
if (st->ientry["shared_ino"].uint64_value() != 0)
{
st->aligned_size = align_shared_size(st->self, st->offset+st->size);
st->aligned_buf = (uint8_t*)malloc_or_die(st->aligned_size);
st->buf = st->aligned_buf + sizeof(shared_file_header_t) + st->offset;
st->op = new cluster_op_t;
st->op->opcode = OSD_OP_READ;
st->op->inode = st->self->parent->fs_base_inode + st->ientry["shared_ino"].uint64_value();
st->op->offset = st->ientry["shared_offset"].uint64_value();
if (st->offset+st->size > st->ientry["size"].uint64_value())
{
st->op->len = align_shared_size(st->self, st->ientry["size"].uint64_value());
memset(st->aligned_buf+st->op->len, 0, st->aligned_size-st->op->len);
}
else
st->op->len = st->aligned_size;
st->op->iov.push_back(st->aligned_buf, st->op->len);
st->op->callback = [st](cluster_op_t *op)
{
st->res = op->retval == op->len ? 0 : op->retval;
delete op;
nfs_kv_continue_read(st, 2);
};
st->self->parent->cli->execute(st->op);
return;
resume_2:
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
auto hdr = ((shared_file_header_t*)st->aligned_buf);
if (hdr->magic != SHARED_FILE_MAGIC_V1 || hdr->inode != st->ino)
{
// Got unrelated data - retry from the beginning
free(st->aligned_buf);
st->aligned_buf = NULL;
st->allow_cache = false;
nfs_kv_continue_read(st, 0);
return;
}
auto cb = std::move(st->cb);
cb(0);
return;
}
}
st->aligned_offset = (st->offset & ~(st->self->parent->pool_alignment-1));
st->aligned_size = ((st->offset + st->size + st->self->parent->pool_alignment-1) &
~(st->self->parent->pool_alignment-1)) - st->aligned_offset;
st->aligned_buf = (uint8_t*)malloc_or_die(st->aligned_size);
st->buf = st->aligned_buf + st->offset - st->aligned_offset;
st->op = new cluster_op_t;
st->op->opcode = OSD_OP_READ;
st->op->inode = st->self->parent->fs_base_inode + st->ino;
st->op->offset = st->aligned_offset;
st->op->len = st->aligned_size;
st->op->iov.push_back(st->aligned_buf, st->aligned_size);
st->op->callback = [st](cluster_op_t *op)
{
st->res = op->retval;
delete op;
nfs_kv_continue_read(st, 3);
};
st->self->parent->cli->execute(st->op);
return;
resume_3:
auto cb = std::move(st->cb);
cb(st->res < 0 ? st->res : 0);
return;
}
int kv_nfs3_read_proc(void *opaque, rpc_op_t *rop)
{
READ3args *args = (READ3args*)rop->request;
READ3res *reply = (READ3res*)rop->reply;
auto ino = kv_fh_inode(args->file);
if (args->count > MAX_REQUEST_SIZE || !ino)
{
*reply = (READ3res){ .status = NFS3ERR_INVAL };
rpc_queue_reply(rop);
return 0;
}
auto st = new nfs_kv_read_state;
st->self = (nfs_client_t*)opaque;
st->rop = rop;
st->ino = ino;
st->offset = args->offset;
st->size = args->count;
st->cb = [st](int res)
{
READ3res *reply = (READ3res*)st->rop->reply;
*reply = (READ3res){ .status = vitastor_nfs_map_err(res) };
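if (res == 0)
{
// Ownership of the buffer passes to the XDR layer, which frees it after the reply is sent
xdr_add_malloc(st->rop->xdrs, st->aligned_buf);
reply->resok.data.data = (char*)st->buf;
reply->resok.data.size = st->size;
reply->resok.count = st->size;
reply->resok.eof = 0;
}
else
{
// On failure the buffer is never handed over, so release it here (free(NULL) is a no-op)
free(st->aligned_buf);
}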
rpc_queue_reply(st->rop);
delete st;
};
if (st->self->parent->trace)
fprintf(stderr, "[%d] READ %ju %ju+%ju\n", st->self->nfs_fd, st->ino, st->offset, st->size);
nfs_kv_continue_read(st, 0);
return 1;
}
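
Review note on the non-shared read path: the offset is rounded down and the end of the range is rounded up to the pool alignment, and the requested bytes then start at a shifted position inside the aligned buffer. A small sketch of that arithmetic follows. The struct and function names are hypothetical, and the code assumes that the alignment is a nonzero power of two, which the bit masks in the diff also rely on.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the arithmetic at the end of nfs_kv_continue_read().
// Assumes alignment is a nonzero power of two.
struct aligned_range_t
{
    uint64_t aligned_offset; // start of the read, rounded down
    uint64_t aligned_size;   // length of the read, covering [offset, offset+size)
    uint64_t buf_shift;      // where the requested bytes start inside the buffer
};

aligned_range_t align_read(uint64_t offset, uint64_t size, uint64_t alignment)
{
    aligned_range_t r;
    r.aligned_offset = offset & ~(alignment-1);
    r.aligned_size = ((offset + size + alignment-1) & ~(alignment-1)) - r.aligned_offset;
    r.buf_shift = offset - r.aligned_offset;
    return r;
}
```

For a request of 100 bytes at offset 5000 with a 4096-byte alignment, this gives an aligned read of 4096 bytes at offset 4096, with the requested data starting 904 bytes into the buffer.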

375
src/nfs_kv_readdir.cpp Normal file

@@ -0,0 +1,375 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy over VitastorKV database - READDIR, READDIRPLUS
#include <sys/time.h>
#include "nfs_proxy.h"
#include "nfs_kv.h"
static unsigned len_pad4(unsigned len)
{
return len + (len&3 ? 4-(len&3) : 0);
}
struct nfs_kv_readdir_state
{
nfs_client_t *self = NULL;
rpc_op_t *rop = NULL;
// Request:
bool is_plus = false;
uint64_t cookie = 0;
uint64_t cookieverf = 0;
uint64_t dir_ino = 0;
uint64_t maxcount = 0;
std::function<void(int)> cb;
// State:
int res = 0;
std::string prefix, start;
void *list_handle;
uint64_t parent_ino = 0;
std::string ientry_text, parent_ientry_text;
json11::Json ientry, parent_ientry;
std::string cur_key, cur_value;
int reply_size = 0;
int to_skip = 0;
uint64_t offset = 0;
int getattr_running = 0, getattr_cur = 0;
// Result:
bool eof = false;
//uint64_t cookieverf = 0; // same field
std::vector<entryplus3> entries;
};
static void nfs_kv_continue_readdir(nfs_kv_readdir_state *st, int state);
static void kv_getattr_next(nfs_kv_readdir_state *st)
{
while (st->is_plus && st->getattr_cur < st->entries.size() && st->getattr_running < st->self->parent->readdir_getattr_parallel)
{
auto idx = st->getattr_cur++;
st->getattr_running++;
kv_read_inode(st->self, st->entries[idx].fileid, [st, idx](int res, const std::string & value, json11::Json ientry)
{
if (res == 0)
{
st->entries[idx].name_attributes = (post_op_attr){
// FIXME: maybe do not read parent attributes and leave them to a GETATTR?
.attributes_follow = 1,
.attributes = get_kv_attributes(st->self, st->entries[idx].fileid, ientry),
};
}
st->getattr_running--;
kv_getattr_next(st);
if (st->getattr_running == 0 && !st->list_handle)
{
nfs_kv_continue_readdir(st, 4);
}
});
}
}
static void nfs_kv_continue_readdir(nfs_kv_readdir_state *st, int state)
{
if (state == 0) {}
else if (state == 1) goto resume_1;
else if (state == 2) goto resume_2;
else if (state == 3) goto resume_3;
else if (state == 4) goto resume_4;
else
{
fprintf(stderr, "BUG: invalid state in nfs_kv_continue_readdir()\n");
abort();
}
// Limit results based on maximum reply size
// Sadly we have to calculate reply size by hand
// reply without entries is 4+4+(dir_attributes ? sizeof(fattr3) : 0)+8+4 bytes
st->reply_size = 20;
if (st->reply_size > st->maxcount)
{
// Error, too small max reply size
auto cb = std::move(st->cb);
cb(-NFS3ERR_TOOSMALL);
return;
}
// Add . and ..
if (st->cookie <= 1)
{
kv_read_inode(st->self, st->dir_ino, [st](int res, const std::string & value, json11::Json ientry)
{
st->res = res;
st->ientry_text = value;
st->ientry = ientry;
nfs_kv_continue_readdir(st, 1);
});
return;
resume_1:
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
if (st->cookie == 0)
{
auto fh = kv_fh(st->dir_ino);
auto entry_size = 20 + 4/*len_pad4(".")*/ + (st->is_plus ? 8 + 88 + len_pad4(fh.size()) : 0);
if (st->reply_size + entry_size > st->maxcount)
{
auto cb = std::move(st->cb);
cb(-NFS3ERR_TOOSMALL);
return;
}
entryplus3 dot = {};
dot.name = xdr_copy_string(st->rop->xdrs, ".");
dot.fileid = st->dir_ino;
dot.name_attributes = (post_op_attr){
.attributes_follow = 1,
.attributes = get_kv_attributes(st->self, st->dir_ino, st->ientry),
};
dot.name_handle = (post_op_fh3){
.handle_follows = 1,
.handle = xdr_copy_string(st->rop->xdrs, fh),
};
st->entries.push_back(dot);
st->reply_size += entry_size;
}
st->parent_ino = st->ientry["parent_ino"].uint64_value();
if (st->parent_ino)
{
kv_read_inode(st->self, st->ientry["parent_ino"].uint64_value(), [st](int res, const std::string & value, json11::Json ientry)
{
st->res = res;
st->parent_ientry_text = value;
st->parent_ientry = ientry;
nfs_kv_continue_readdir(st, 2);
});
return;
resume_2:
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
}
auto fh = kv_fh(st->parent_ino ? st->parent_ino : st->dir_ino);
auto entry_size = 20 + 4/*len_pad4("..")*/ + (st->is_plus ? 8 + 88 + len_pad4(fh.size()) : 0);
if (st->reply_size + entry_size > st->maxcount)
{
st->eof = false;
auto cb = std::move(st->cb);
cb(0);
return;
}
entryplus3 dotdot = {};
dotdot.name = xdr_copy_string(st->rop->xdrs, "..");
dotdot.fileid = st->parent_ino ? st->parent_ino : st->dir_ino;
dotdot.name_attributes = (post_op_attr){
// FIXME: maybe do not read parent attributes and leave them to a GETATTR?
.attributes_follow = 1,
.attributes = get_kv_attributes(st->self,
st->parent_ino ? st->parent_ino : st->dir_ino,
st->parent_ino ? st->parent_ientry : st->ientry),
};
dotdot.name_handle = (post_op_fh3){
.handle_follows = 1,
.handle = xdr_copy_string(st->rop->xdrs, fh),
};
st->entries.push_back(dotdot);
st->reply_size += entry_size;
}
st->prefix = kv_direntry_key(st->dir_ino, "");
st->eof = true;
st->start = st->prefix;
if (st->cookie > 1)
{
auto lc_it = st->self->parent->kvfs->list_cookies.find((list_cookie_t){ st->dir_ino, st->cookieverf, st->cookie });
if (lc_it != st->self->parent->kvfs->list_cookies.end())
{
st->start = st->prefix+lc_it->second.key;
st->to_skip = 1;
st->offset = st->cookie;
}
else
{
st->to_skip = st->cookie-2;
st->offset = 2;
st->cookieverf = ((uint64_t)lrand48() | ((uint64_t)lrand48() << 31) | ((uint64_t)lrand48() << 62));
}
}
else
{
st->to_skip = 0;
st->offset = 2;
st->cookieverf = ((uint64_t)lrand48() | ((uint64_t)lrand48() << 31) | ((uint64_t)lrand48() << 62));
}
{
auto lc_it = st->self->parent->kvfs->list_cookies.lower_bound((list_cookie_t){ st->dir_ino, st->cookieverf, 0 });
if (lc_it != st->self->parent->kvfs->list_cookies.end() &&
lc_it->first.dir_ino == st->dir_ino &&
lc_it->first.cookieverf == st->cookieverf &&
lc_it->first.cookie < st->cookie)
{
auto lc_start = lc_it;
while (lc_it != st->self->parent->kvfs->list_cookies.end() && lc_it->first.cookieverf == st->cookieverf)
{
lc_it++;
}
st->self->parent->kvfs->list_cookies.erase(lc_start, lc_it);
}
}
st->getattr_cur = st->entries.size();
st->list_handle = st->self->parent->db->list_start(st->start);
st->self->parent->db->list_next(st->list_handle, [=](int res, const std::string & key, const std::string & value)
{
st->res = res;
st->cur_key = key;
st->cur_value = value;
nfs_kv_continue_readdir(st, 3);
});
return;
while (st->list_handle)
{
st->self->parent->db->list_next(st->list_handle, NULL);
return;
resume_3:
if (st->res == -ENOENT || st->cur_key.size() < st->prefix.size() || st->cur_key.substr(0, st->prefix.size()) != st->prefix)
{
st->self->parent->db->list_close(st->list_handle);
st->list_handle = NULL;
break;
}
if (st->to_skip > 0)
{
st->to_skip--;
continue;
}
std::string err;
auto direntry = json11::Json::parse(st->cur_value, err);
if (err != "")
{
fprintf(stderr, "readdir: direntry %s contains invalid JSON: %s, skipping\n",
st->cur_key.c_str(), st->cur_value.c_str());
continue;
}
auto ino = direntry["ino"].uint64_value();
auto name = kv_direntry_filename(st->cur_key);
if (st->self->parent->trace)
{
fprintf(stderr, "[%d] READDIR %ju %ju %s\n",
st->self->nfs_fd, st->dir_ino, st->offset, name.c_str());
}
auto fh = kv_fh(ino);
// 1 entry3 is (8+4+(filename_len+3)/4*4+8) bytes
// 1 entryplus3 is (8+4+(filename_len+3)/4*4+8
// + 4+(name_attributes ? (sizeof(fattr3) = 84) : 0)
// + 4+(name_handle ? 4+(handle_len+3)/4*4 : 0)) bytes
auto entry_size = 20 + len_pad4(name.size()) + (st->is_plus ? 8 + 88 + len_pad4(fh.size()) : 0);
if (st->reply_size + entry_size > st->maxcount)
{
st->eof = false;
st->self->parent->db->list_close(st->list_handle);
st->list_handle = NULL;
break;
}
st->reply_size += entry_size;
auto idx = st->entries.size();
st->entries.push_back((entryplus3){});
auto entry = &st->entries[idx];
entry->name = xdr_copy_string(st->rop->xdrs, name);
entry->fileid = ino;
entry->cookie = st->offset++;
st->self->parent->kvfs->list_cookies[(list_cookie_t){ st->dir_ino, st->cookieverf, entry->cookie }] = { .key = name };
if (st->is_plus)
{
entry->name_handle = (post_op_fh3){
.handle_follows = 1,
.handle = xdr_copy_string(st->rop->xdrs, fh),
};
kv_getattr_next(st);
}
}
resume_4:
while (st->getattr_running > 0)
{
return;
}
void *prev = NULL;
for (int i = 0; i < st->entries.size(); i++)
{
entryplus3 *entry = &st->entries[i];
if (prev)
{
if (st->is_plus)
((entryplus3*)prev)->nextentry = entry;
else
((entry3*)prev)->nextentry = (entry3*)entry;
}
prev = entry;
}
// Send reply
auto cb = std::move(st->cb);
cb(0);
}
static void nfs3_readdir_common(void *opaque, rpc_op_t *rop, bool is_plus)
{
auto st = new nfs_kv_readdir_state;
st->self = (nfs_client_t*)opaque;
st->rop = rop;
st->is_plus = is_plus;
if (st->is_plus)
{
READDIRPLUS3args *args = (READDIRPLUS3args*)rop->request;
st->dir_ino = kv_fh_inode(args->dir);
st->cookie = args->cookie;
st->cookieverf = *((uint64_t*)args->cookieverf);
st->maxcount = args->maxcount;
}
else
{
READDIR3args *args = ((READDIR3args*)rop->request);
st->dir_ino = kv_fh_inode(args->dir);
st->cookie = args->cookie;
st->cookieverf = *((uint64_t*)args->cookieverf);
st->maxcount = args->count;
}
if (st->self->parent->trace)
fprintf(stderr, "[%d] READDIR %ju VERF %jx OFFSET %ju LIMIT %ju\n", st->self->nfs_fd, st->dir_ino, st->cookieverf, st->cookie, st->maxcount);
st->cb = [st](int res)
{
if (st->is_plus)
{
READDIRPLUS3res *reply = (READDIRPLUS3res*)st->rop->reply;
*reply = (READDIRPLUS3res){ .status = vitastor_nfs_map_err(res) };
*(uint64_t*)(reply->resok.cookieverf) = st->cookieverf;
reply->resok.reply.entries = st->entries.size() ? &st->entries[0] : NULL;
reply->resok.reply.eof = st->eof;
}
else
{
READDIR3res *reply = (READDIR3res*)st->rop->reply;
*reply = (READDIR3res){ .status = vitastor_nfs_map_err(res) };
*(uint64_t*)(reply->resok.cookieverf) = st->cookieverf;
reply->resok.reply.entries = st->entries.size() ? (entry3*)&st->entries[0] : NULL;
reply->resok.reply.eof = st->eof;
}
rpc_queue_reply(st->rop);
delete st;
};
nfs_kv_continue_readdir(st, 0);
}
int kv_nfs3_readdir_proc(void *opaque, rpc_op_t *rop)
{
nfs3_readdir_common(opaque, rop, false);
return 0;
}
int kv_nfs3_readdirplus_proc(void *opaque, rpc_op_t *rop)
{
nfs3_readdir_common(opaque, rop, true);
return 0;
}
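
Review note on the reply-size accounting in nfs_kv_continue_readdir(): each entry is charged 20 fixed bytes plus the name padded to 4 bytes, and READDIRPLUS adds 8 + 88 bytes plus the padded file handle. The sketch below restates that formula as a standalone function. readdir_entry_size is a hypothetical name, and the constants are taken from the comments in the diff, not recomputed from the protocol definition.

```cpp
#include <cassert>

// Round a length up to a multiple of 4, as in the diff
unsigned len_pad4(unsigned len)
{
    return len + (len&3 ? 4-(len&3) : 0);
}

// Hypothetical helper: estimated size of one directory entry in the reply.
// 20 = 8 (fileid) + 4 (name length) + 8 (cookie), per the source comment.
// For READDIRPLUS, 8 + 88 covers the attribute and handle fields, per the source comment.
unsigned readdir_entry_size(unsigned name_len, unsigned fh_len, bool is_plus)
{
    return 20 + len_pad4(name_len) + (is_plus ? 8 + 88 + len_pad4(fh_len) : 0);
}
```

With a 3-byte name and an 8-byte handle this yields 24 bytes for READDIR and 128 bytes for READDIRPLUS.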

315
src/nfs_kv_remove.cpp Normal file

@@ -0,0 +1,315 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy over VitastorKV database - REMOVE, RMDIR
#include <sys/time.h>
#include "nfs_proxy.h"
#include "nfs_kv.h"
#include "cli.h"
struct kv_del_state
{
nfs_client_t *self = NULL;
rpc_op_t *rop = NULL;
uint64_t dir_ino = 0;
std::string filename;
uint64_t ino = 0;
void *list_handle = NULL;
std::string prefix, list_key, direntry_text, ientry_text;
json11::Json direntry, ientry;
int type = 0;
bool is_rmdir = false;
bool rm_data = false;
bool allow_cache = true;
int res = 0, res2 = 0;
std::function<void(int)> cb;
};
static void nfs_kv_continue_delete(kv_del_state *st, int state)
{
// Overall algorithm:
// 1) Get inode attributes and check that it's not a directory (REMOVE)
// 2) Get inode attributes and check that it is a directory (RMDIR)
// 3) Delete direntry with CAS
// 4) Check that the directory didn't contain files (RMDIR) and restore it if it did
// 5) Reduce inode refcount by 1 or delete inode
// 6) If regular file and inode is deleted: delete data
if (state == 0) {}
else if (state == 1) goto resume_1;
else if (state == 2) goto resume_2;
else if (state == 3) goto resume_3;
else if (state == 4) goto resume_4;
else if (state == 5) goto resume_5;
else if (state == 6) goto resume_6;
else if (state == 7) goto resume_7;
else
{
fprintf(stderr, "BUG: invalid state in nfs_kv_continue_delete()\n");
abort();
}
resume_0:
st->self->parent->db->get(kv_direntry_key(st->dir_ino, st->filename), [st](int res, const std::string & value)
{
st->res = res;
st->direntry_text = value;
nfs_kv_continue_delete(st, 1);
}, st->allow_cache);
return;
resume_1:
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
{
std::string err;
st->direntry = json11::Json::parse(st->direntry_text, err);
if (err != "")
{
fprintf(stderr, "Invalid JSON in direntry %s = %s: %s, deleting\n",
kv_direntry_key(st->dir_ino, st->filename).c_str(), st->direntry_text.c_str(), err.c_str());
// Just delete direntry and skip inode
}
else
{
st->ino = st->direntry["ino"].uint64_value();
}
}
// Get inode
st->self->parent->db->get(kv_inode_key(st->ino), [st](int res, const std::string & value)
{
st->res = res;
st->ientry_text = value;
nfs_kv_continue_delete(st, 2);
}, st->allow_cache);
return;
resume_2:
if (st->res < 0)
{
fprintf(stderr, "error reading inode %s: %s (code %d)\n",
kv_inode_key(st->ino).c_str(), strerror(-st->res), st->res);
auto cb = std::move(st->cb);
cb(st->res);
return;
}
{
std::string err;
st->ientry = json11::Json::parse(st->ientry_text, err);
if (err != "")
{
fprintf(stderr, "Invalid JSON in inode %s = %s: %s, treating as a regular file\n",
kv_inode_key(st->ino).c_str(), st->ientry_text.c_str(), err.c_str());
}
}
// (1-2) Check type
st->type = kv_map_type(st->ientry["type"].string_value());
if (st->type == -1 || st->is_rmdir != (st->type == NF3DIR))
{
auto cb = std::move(st->cb);
cb(st->is_rmdir ? -ENOTDIR : -EISDIR);
return;
}
// (3) Delete direntry with CAS
st->self->parent->db->del(kv_direntry_key(st->dir_ino, st->filename), [st](int res)
{
st->res = res;
nfs_kv_continue_delete(st, 3);
}, [st](int res, const std::string & value)
{
return value == st->direntry_text;
});
return;
resume_3:
if (st->res == -EAGAIN)
{
// CAS failure, restart from the beginning
st->allow_cache = false;
goto resume_0;
}
else if (st->res < 0 && st->res != -ENOENT)
{
fprintf(stderr, "failed to remove direntry %s: %s (code %d)\n",
kv_direntry_key(st->dir_ino, st->filename).c_str(), strerror(-st->res), st->res);
auto cb = std::move(st->cb);
cb(st->res);
return;
}
if (!st->ino)
{
// direntry contained invalid JSON and was deleted, finish
auto cb = std::move(st->cb);
cb(0);
return;
}
if (st->is_rmdir)
{
// (4) Check that the directory is actually empty
st->list_handle = st->self->parent->db->list_start(kv_direntry_key(st->ino, ""));
st->self->parent->db->list_next(st->list_handle, [st](int res, const std::string & key, const std::string & value)
{
st->res = res;
st->list_key = key;
st->self->parent->db->list_close(st->list_handle);
nfs_kv_continue_delete(st, 4);
});
return;
resume_4:
st->prefix = kv_direntry_key(st->ino, "");
if (st->res == -ENOENT || st->list_key.size() < st->prefix.size() || st->list_key.substr(0, st->prefix.size()) != st->prefix)
{
// OK, directory is empty
}
else
{
// Not OK, restore direntry: write the saved text back, but only if the key is still absent
st->self->parent->db->set(kv_direntry_key(st->dir_ino, st->filename), st->direntry_text, [st](int res)
{
st->res2 = res;
nfs_kv_continue_delete(st, 5);
}, [st](int res, const std::string & value)
{
return res == -ENOENT;
});
return;
resume_5:
if (st->res2 < 0)
{
fprintf(stderr, "failed to restore direntry %s (%s): %s (code %d)",
kv_direntry_key(st->dir_ino, st->filename).c_str(), st->direntry_text.c_str(), strerror(-st->res2), st->res2);
fprintf(stderr, " - inode %ju may be left as garbage\n", st->ino);
}
if (st->res < 0)
{
fprintf(stderr, "failed to list entries from %s: %s (code %d)\n",
kv_direntry_key(st->ino, "").c_str(), strerror(-st->res), st->res);
}
auto cb = std::move(st->cb);
cb(st->res < 0 ? st->res : -ENOTEMPTY);
return;
}
}
// (5) Reduce inode refcount by 1 or delete inode
if (st->ientry["nlink"].uint64_value() > 1)
{
auto copy = st->ientry.object_items();
copy["nlink"] = st->ientry["nlink"].uint64_value()-1;
st->self->parent->db->set(kv_inode_key(st->ino), json11::Json(copy).dump(), [st](int res)
{
st->res = res;
nfs_kv_continue_delete(st, 6);
}, [st](int res, const std::string & old_value)
{
return old_value == st->ientry_text;
});
}
else
{
st->self->parent->db->del(kv_inode_key(st->ino), [st](int res)
{
st->res = res;
nfs_kv_continue_delete(st, 6);
}, [st](int res, const std::string & old_value)
{
return old_value == st->ientry_text;
});
}
return;
resume_6:
if (st->res < 0)
{
// Assume EAGAIN is OK, maybe someone created a hard link in the meantime
auto cb = std::move(st->cb);
cb(st->res == -EAGAIN ? 0 : st->res);
return;
}
// (6) If regular file and inode is deleted: delete data
if ((!st->type || st->type == NF3REG) && st->ientry["nlink"].uint64_value() <= 1 &&
!st->ientry["shared_ino"].uint64_value())
{
// Remove data
st->self->parent->cmd->loop_and_wait(st->self->parent->cmd->start_rm_data(json11::Json::object {
{ "inode", INODE_NO_POOL(st->self->parent->fs_base_inode + st->ino) },
{ "pool", (uint64_t)INODE_POOL(st->self->parent->fs_base_inode + st->ino) },
}), [st](const cli_result_t & r)
{
if (r.err)
{
fprintf(stderr, "Failed to remove inode %jx data: %s (code %d)\n",
st->ino, r.text.c_str(), r.err);
}
st->res = r.err;
nfs_kv_continue_delete(st, 7);
});
return;
resume_7:
auto cb = std::move(st->cb);
cb(st->res);
return;
}
auto cb = std::move(st->cb);
cb(0);
}
int kv_nfs3_remove_proc(void *opaque, rpc_op_t *rop)
{
kv_del_state *st = new kv_del_state;
st->self = (nfs_client_t*)opaque;
st->rop = rop;
REMOVE3res *reply = (REMOVE3res*)rop->reply;
REMOVE3args *args = (REMOVE3args*)rop->request;
st->dir_ino = kv_fh_inode(args->object.dir);
st->filename = args->object.name;
if (st->self->parent->trace)
fprintf(stderr, "[%d] REMOVE %ju/%s\n", st->self->nfs_fd, st->dir_ino, st->filename.c_str());
if (!st->dir_ino)
{
*reply = (REMOVE3res){ .status = NFS3ERR_INVAL };
rpc_queue_reply(rop);
delete st;
return 0;
}
st->cb = [st](int res)
{
*((REMOVE3res*)st->rop->reply) = (REMOVE3res){
.status = vitastor_nfs_map_err(res),
};
rpc_queue_reply(st->rop);
delete st;
};
nfs_kv_continue_delete(st, 0);
return 1;
}
int kv_nfs3_rmdir_proc(void *opaque, rpc_op_t *rop)
{
kv_del_state *st = new kv_del_state;
st->self = (nfs_client_t*)opaque;
st->rop = rop;
RMDIR3args *args = (RMDIR3args*)rop->request;
RMDIR3res *reply = (RMDIR3res*)rop->reply;
st->dir_ino = kv_fh_inode(args->object.dir);
st->filename = args->object.name;
st->is_rmdir = true;
if (st->self->parent->trace)
fprintf(stderr, "[%d] RMDIR %ju/%s\n", st->self->nfs_fd, st->dir_ino, st->filename.c_str());
if (!st->dir_ino)
{
*reply = (RMDIR3res){ .status = NFS3ERR_INVAL };
rpc_queue_reply(rop);
delete st;
return 0;
}
st->cb = [st](int res)
{
*((RMDIR3res*)st->rop->reply) = (RMDIR3res){
.status = vitastor_nfs_map_err(res),
};
rpc_queue_reply(st->rop);
delete st;
};
nfs_kv_continue_delete(st, 0);
return 1;
}
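
Review note on step 4 of nfs_kv_continue_delete(): the directory counts as empty when the first key listed from its direntry prefix either does not exist or does not start with that prefix. The sketch below reproduces that check on an ordered in-memory map. toy_kv_t, toy_dir_is_empty and the "d/<ino>/<name>" key layout are assumptions made for illustration only and are not the real key format.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical in-memory stand-in for the ordered KV database
using toy_kv_t = std::map<std::string, std::string>;

// A directory is empty when the first key at or after its prefix
// does not start with that prefix, or when there is no such key at all
bool toy_dir_is_empty(const toy_kv_t & db, const std::string & prefix)
{
    auto it = db.lower_bound(prefix);
    return it == db.end() || it->first.size() < prefix.size() ||
        it->first.substr(0, prefix.size()) != prefix;
}
```

The prefix has to end with a separator. Otherwise a listing for directory 1 would also match keys that belong to directory 10.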

392
src/nfs_kv_rename.cpp Normal file

@@ -0,0 +1,392 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy over VitastorKV database - RENAME
#include <sys/time.h>
#include "nfs_proxy.h"
#include "nfs_kv.h"
#include "cli.h"
struct nfs_kv_rename_state
{
nfs_client_t *self = NULL;
rpc_op_t *rop = NULL;
// params:
uint64_t old_dir_ino = 0, new_dir_ino = 0;
std::string old_name, new_name;
// state:
bool allow_cache = true;
std::string old_direntry_text, old_ientry_text, new_direntry_text, new_ientry_text;
json11::Json old_direntry, old_ientry, new_direntry, new_ientry;
std::string new_dir_prefix;
void *list_handle = NULL;
bool new_exists = false;
bool rm_dest_data = false;
int res = 0, res2 = 0;
std::function<void(int)> cb;
};
static void nfs_kv_continue_rename(nfs_kv_rename_state *st, int state)
{
// Algorithm (non-atomic of course):
// 1) Read source direntry
// 2) Read destination direntry
// 3) If destination exists:
// 3.1) Check file/folder compatibility (EISDIR/ENOTDIR)
// 3.2) Check if destination is empty if it's a folder
// 4) If not:
// 4.1) Check that the destination directory is actually a directory
// 5) Overwrite destination direntry, restart from beginning if CAS failure
// 6) Delete source direntry, restart from beginning if CAS failure
// 7) If the moved direntry was a regular file:
// 7.1) Read inode
// 7.2) Delete inode if its link count <= 1
// 7.3) Delete inode data if its link count <= 1 and it's a regular non-shared file
// 7.4) Reduce link count by 1 if it's > 1
// 8) If the moved direntry is a directory:
// 8.1) Change parent_ino reference in its inode
if (state == 0) {}
else if (state == 1) goto resume_1;
else if (state == 2) goto resume_2;
else if (state == 3) goto resume_3;
else if (state == 4) goto resume_4;
else if (state == 5) goto resume_5;
else if (state == 6) goto resume_6;
else if (state == 7) goto resume_7;
else if (state == 8) goto resume_8;
else if (state == 9) goto resume_9;
else if (state == 10) goto resume_10;
else if (state == 11) goto resume_11;
else if (state == 12) goto resume_12;
else
{
fprintf(stderr, "BUG: invalid state in nfs_kv_continue_rename()\n");
abort();
}
resume_0:
// Read the old direntry
st->self->parent->db->get(kv_direntry_key(st->old_dir_ino, st->old_name), [=](int res, const std::string & value)
{
st->res = res;
st->old_direntry_text = value;
nfs_kv_continue_rename(st, 1);
}, st->allow_cache);
return;
resume_1:
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
{
std::string err;
st->old_direntry = json11::Json::parse(st->old_direntry_text, err);
if (err != "")
{
fprintf(stderr, "Invalid JSON in direntry %s = %s: %s\n",
kv_direntry_key(st->old_dir_ino, st->old_name).c_str(),
st->old_direntry_text.c_str(), err.c_str());
auto cb = std::move(st->cb);
cb(-EIO);
return;
}
}
// Read the new direntry
st->self->parent->db->get(kv_direntry_key(st->new_dir_ino, st->new_name), [=](int res, const std::string & value)
{
st->res = res;
st->new_direntry_text = value;
nfs_kv_continue_rename(st, 2);
}, st->allow_cache);
return;
resume_2:
if (st->res < 0 && st->res != -ENOENT)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
if (st->res == 0)
{
std::string err;
st->new_direntry = json11::Json::parse(st->new_direntry_text, err);
if (err != "")
{
fprintf(stderr, "Invalid JSON in direntry %s = %s: %s\n",
kv_direntry_key(st->new_dir_ino, st->new_name).c_str(),
st->new_direntry_text.c_str(), err.c_str());
auto cb = std::move(st->cb);
cb(-EIO);
return;
}
}
st->new_exists = st->res == 0;
if (st->new_exists)
{
// Check file/folder compatibility (EISDIR/ENOTDIR)
if ((st->old_direntry["type"] == "dir") != (st->new_direntry["type"] == "dir"))
{
auto cb = std::move(st->cb);
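// POSIX rename(): EISDIR if the destination is a directory and the source is not,
// ENOTDIR if the source is a directory and the destination is not
cb((st->new_direntry["type"] == "dir") ? -EISDIR : -ENOTDIR);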
return;
}
if (st->new_direntry["type"] == "dir")
{
// Check that the destination directory is empty
st->new_dir_prefix = kv_direntry_key(st->new_direntry["ino"].uint64_value(), "");
st->list_handle = st->self->parent->db->list_start(st->new_dir_prefix);
st->self->parent->db->list_next(st->list_handle, [st](int res, const std::string & key, const std::string & value)
{
st->res = res;
nfs_kv_continue_rename(st, 3);
});
return;
resume_3:
st->self->parent->db->list_close(st->list_handle);
if (st->res != -ENOENT)
{
auto cb = std::move(st->cb);
cb(-ENOTEMPTY);
return;
}
}
}
else
{
// Check that the new directory is actually a directory
kv_read_inode(st->self, st->new_dir_ino, [st](int res, const std::string & value, json11::Json attrs)
{
st->res = res == 0 ? (attrs["type"].string_value() == "dir" ? 0 : -ENOTDIR) : res;
nfs_kv_continue_rename(st, 4);
});
return;
resume_4:
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
}
// Write the new direntry
st->self->parent->db->set(kv_direntry_key(st->new_dir_ino, st->new_name), st->old_direntry_text, [st](int res)
{
st->res = res;
nfs_kv_continue_rename(st, 5);
}, [st](int res, const std::string & old_value)
{
return st->new_exists ? (old_value == st->new_direntry_text) : (res == -ENOENT);
});
return;
resume_5:
if (st->res == -EAGAIN)
{
// CAS failure
st->allow_cache = false;
goto resume_0;
}
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
// Delete the old direntry
st->self->parent->db->del(kv_direntry_key(st->old_dir_ino, st->old_name), [st](int res)
{
st->res = res;
nfs_kv_continue_rename(st, 6);
}, [=](int res, const std::string & old_value)
{
return res == 0 && old_value == st->old_direntry_text;
});
return;
resume_6:
if (st->res == -EAGAIN)
{
// CAS failure
st->allow_cache = false;
goto resume_0;
}
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
st->allow_cache = true;
resume_7again:
if (st->new_exists && st->new_direntry["type"].string_value() != "dir")
{
// (Maybe) delete old destination file data
kv_read_inode(st->self, st->new_direntry["ino"].uint64_value(), [st](int res, const std::string & value, json11::Json attrs)
{
st->res = res;
st->new_ientry_text = value;
st->new_ientry = attrs;
nfs_kv_continue_rename(st, 7);
}, st->allow_cache);
return;
resume_7:
if (st->res == 0)
{
// (7.2-7.4) Reduce inode refcount by 1 or delete inode
if (st->new_ientry["nlink"].uint64_value() > 1)
{
auto copy = st->new_ientry.object_items();
copy["nlink"] = st->new_ientry["nlink"].uint64_value()-1;
st->self->parent->db->set(kv_inode_key(st->new_direntry["ino"].uint64_value()), json11::Json(copy).dump(), [st](int res)
{
st->res = res;
nfs_kv_continue_rename(st, 8);
}, [st](int res, const std::string & old_value)
{
return old_value == st->new_ientry_text;
});
}
else
{
st->rm_dest_data = kv_map_type(st->new_ientry["type"].string_value()) == NF3REG
&& !st->new_ientry["shared_ino"].uint64_value();
st->self->parent->db->del(kv_inode_key(st->new_direntry["ino"].uint64_value()), [st](int res)
{
st->res = res;
nfs_kv_continue_rename(st, 8);
}, [st](int res, const std::string & old_value)
{
return old_value == st->new_ientry_text;
});
}
return;
resume_8:
if (st->res == -EAGAIN)
{
// CAS failure - re-read inode
st->allow_cache = false;
goto resume_7again;
}
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
// Delete inode data if required
if (st->rm_dest_data)
{
st->self->parent->cmd->loop_and_wait(st->self->parent->cmd->start_rm_data(json11::Json::object {
{ "inode", INODE_NO_POOL(st->self->parent->fs_base_inode + st->new_direntry["ino"].uint64_value()) },
{ "pool", (uint64_t)INODE_POOL(st->self->parent->fs_base_inode + st->new_direntry["ino"].uint64_value()) },
}), [st](const cli_result_t & r)
{
if (r.err)
{
fprintf(stderr, "Failed to remove inode %jx data: %s (code %d)\n",
st->new_direntry["ino"].uint64_value(), r.text.c_str(), r.err);
}
st->res = r.err;
nfs_kv_continue_rename(st, 9);
});
return;
resume_9:
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
}
}
}
if (st->old_direntry["type"].string_value() == "dir" && st->new_dir_ino != st->old_dir_ino)
{
// Change parent_ino in old ientry
st->allow_cache = true;
resume_10:
kv_read_inode(st->self, st->old_direntry["ino"].uint64_value(), [st](int res, const std::string & value, json11::Json ientry)
{
st->res = res;
st->old_ientry_text = value;
st->old_ientry = ientry;
nfs_kv_continue_rename(st, 11);
}, st->allow_cache);
return;
resume_11:
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
{
auto ientry_new = st->old_ientry.object_items();
ientry_new["parent_ino"] = st->new_dir_ino;
st->self->parent->db->set(kv_inode_key(st->old_direntry["ino"].uint64_value()), json11::Json(ientry_new).dump(), [st](int res)
{
st->res = res;
nfs_kv_continue_rename(st, 12);
}, [st](int res, const std::string & old_value)
{
return old_value == st->old_ientry_text;
});
}
return;
resume_12:
if (st->res == -EAGAIN)
{
// CAS failure - try again
st->allow_cache = false;
goto resume_10;
}
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
}
auto cb = std::move(st->cb);
cb(st->res);
}
int kv_nfs3_rename_proc(void *opaque, rpc_op_t *rop)
{
auto st = new nfs_kv_rename_state;
st->self = (nfs_client_t*)opaque;
st->rop = rop;
RENAME3args *args = (RENAME3args*)rop->request;
st->old_dir_ino = kv_fh_inode(args->from.dir);
st->new_dir_ino = kv_fh_inode(args->to.dir);
st->old_name = args->from.name;
st->new_name = args->to.name;
if (st->self->parent->trace)
fprintf(stderr, "[%d] RENAME %ju/%s -> %ju/%s\n", st->self->nfs_fd, st->old_dir_ino, st->old_name.c_str(), st->new_dir_ino, st->new_name.c_str());
if (!st->old_dir_ino || !st->new_dir_ino || st->old_name == "" || st->new_name == "")
{
RENAME3res *reply = (RENAME3res*)rop->reply;
*reply = (RENAME3res){ .status = NFS3ERR_INVAL };
rpc_queue_reply(rop);
delete st;
return 0;
}
if (st->old_dir_ino == st->new_dir_ino && st->old_name == st->new_name)
{
RENAME3res *reply = (RENAME3res*)rop->reply;
*reply = (RENAME3res){ .status = NFS3_OK };
rpc_queue_reply(st->rop);
delete st;
return 0;
}
st->cb = [st](int res)
{
RENAME3res *reply = (RENAME3res*)st->rop->reply;
*reply = (RENAME3res){ .status = vitastor_nfs_map_err(res) };
rpc_queue_reply(st->rop);
delete st;
};
nfs_kv_continue_rename(st, 0);
return 1;
}
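
The rename handler above is written as a resumable state machine: each asynchronous call returns immediately, and its completion callback re-enters the function with a state number that jumps to the matching `resume_N` label. A minimal standalone sketch of that pattern follows. Everything in it (`pending`, `async_get`, `run_pending`, `demo_state`, `demo_continue`) is hypothetical and only imitates the shape of the code above; it does not use the NFS proxy or VitastorKV APIs.

```cpp
#include <cassert>
#include <cerrno>
#include <deque>
#include <functional>
#include <string>

// Hypothetical stand-in for an event loop: callbacks are queued here
// and executed later by run_pending(). Not part of the NFS proxy.
static std::deque<std::function<void()>> pending;

// Hypothetical asynchronous read: "missing" yields -ENOENT, any other key succeeds.
static void async_get(const std::string & key, std::function<void(int, const std::string &)> cb)
{
    pending.push_back([key, cb]()
    {
        if (key == "missing")
            cb(-ENOENT, "");
        else
            cb(0, "value-of-" + key);
    });
}

// Run queued callbacks until none are left
static void run_pending()
{
    while (!pending.empty())
    {
        auto task = std::move(pending.front());
        pending.pop_front();
        task();
    }
}

struct demo_state
{
    std::string key1, key2;
    std::string value1, value2;
    int res = 0;
    std::function<void(int)> cb;
};

// Runs until the next asynchronous step, then returns.
// The completion callback re-enters at the matching label.
static void demo_continue(demo_state *st, int state)
{
    if (state == 1) goto resume_1;
    else if (state == 2) goto resume_2;
    // State 0: start the first read and return to the event loop
    async_get(st->key1, [st](int res, const std::string & value)
    {
        st->res = res;
        st->value1 = value;
        demo_continue(st, 1);
    });
    return;
resume_1:
    if (st->res < 0)
    {
        // Stop early and report the error to the caller
        auto cb = std::move(st->cb);
        cb(st->res);
        return;
    }
    async_get(st->key2, [st](int res, const std::string & value)
    {
        st->res = res;
        st->value2 = value;
        demo_continue(st, 2);
    });
    return;
resume_2:
    auto cb = std::move(st->cb);
    cb(st->res);
}
```

All data that must survive between steps lives in the state struct, because local variables are lost every time the function returns to the event loop.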

189
src/nfs_kv_setattr.cpp Normal file

@ -0,0 +1,189 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy over VitastorKV database - SETATTR
#include <sys/time.h>
#include "nfs_proxy.h"
#include "nfs_kv.h"
#include "cli.h"
struct nfs_kv_setattr_state
{
nfs_client_t *self = NULL;
rpc_op_t *rop = NULL;
uint64_t ino = 0;
uint64_t old_size = 0, new_size = 0;
json11::Json::object set_attrs;
int res = 0, cas_res = 0;
std::string ientry_text;
json11::Json ientry;
json11::Json::object new_attrs;
std::function<void(int)> cb;
};
static void nfs_kv_continue_setattr(nfs_kv_setattr_state *st, int state)
{
// FIXME: NFS client does a lot of setattr calls, so maybe process them asynchronously
if (state == 0) {}
else if (state == 1) goto resume_1;
else if (state == 2) goto resume_2;
else if (state == 3) goto resume_3;
else
{
fprintf(stderr, "BUG: invalid state in nfs_kv_continue_setattr()\n");
abort();
}
resume_0:
kv_read_inode(st->self, st->ino, [st](int res, const std::string & value, json11::Json attrs)
{
st->res = res;
st->ientry_text = value;
st->ientry = attrs;
nfs_kv_continue_setattr(st, 1);
});
return;
resume_1:
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
if (st->ientry["type"].string_value() != "file" &&
st->ientry["type"].string_value() != "" &&
!st->set_attrs["size"].is_null())
{
auto cb = std::move(st->cb);
cb(-EINVAL);
return;
}
// Now we can update it
st->new_attrs = st->ientry.object_items();
st->old_size = st->ientry["size"].uint64_value();
for (auto & kv: st->set_attrs)
{
if (kv.first == "size")
{
st->new_size = kv.second.uint64_value();
}
st->new_attrs[kv.first] = kv.second;
}
st->self->parent->db->set(kv_inode_key(st->ino), json11::Json(st->new_attrs).dump(), [st](int res)
{
st->res = res;
nfs_kv_continue_setattr(st, 2);
}, [st](int res, const std::string & cas_value)
{
st->cas_res = res;
return (res == 0 || (res == -ENOENT && st->ino == KV_ROOT_INODE)) && cas_value == st->ientry_text;
});
return;
resume_2:
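// A missing root inode is accepted by the CAS check above, so it is not an error
if (st->cas_res == -ENOENT && st->ino != KV_ROOT_INODE)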
{
st->res = -ENOENT;
}
if (st->res == -EAGAIN)
{
// Retry
fprintf(stderr, "CAS failure during setattr, retrying\n");
goto resume_0;
}
if (st->res < 0)
{
fprintf(stderr, "Failed to update inode %ju: %s (code %d)\n", st->ino, strerror(-st->res), st->res);
auto cb = std::move(st->cb);
cb(st->res);
return;
}
if (!st->set_attrs["size"].is_null() &&
st->ientry["size"].uint64_value() > st->set_attrs["size"].uint64_value() &&
!st->ientry["shared_ino"].uint64_value())
{
// Delete extra data when downsizing
st->self->parent->cmd->loop_and_wait(st->self->parent->cmd->start_rm_data(json11::Json::object {
{ "inode", INODE_NO_POOL(st->self->parent->fs_base_inode + st->ino) },
{ "pool", (uint64_t)INODE_POOL(st->self->parent->fs_base_inode + st->ino) },
{ "min_offset", st->set_attrs["size"].uint64_value() },
}), [st](const cli_result_t & r)
{
if (r.err)
{
fprintf(stderr, "Failed to truncate inode %ju: %s (code %d)\n",
st->ino, r.text.c_str(), r.err);
}
st->res = r.err;
nfs_kv_continue_setattr(st, 3);
});
return;
}
resume_3:
auto cb = std::move(st->cb);
cb(st->res);
}
int kv_nfs3_setattr_proc(void *opaque, rpc_op_t *rop)
{
nfs_kv_setattr_state *st = new nfs_kv_setattr_state;
st->self = (nfs_client_t*)opaque;
st->rop = rop;
auto args = (SETATTR3args*)rop->request;
auto reply = (SETATTR3res*)rop->reply;
std::string fh = args->object;
if (!kv_fh_valid(fh))
{
*reply = (SETATTR3res){ .status = NFS3ERR_INVAL };
rpc_queue_reply(rop);
delete st;
return 0;
}
st->ino = kv_fh_inode(fh);
if (args->new_attributes.size.set_it)
st->set_attrs["size"] = args->new_attributes.size.size;
if (args->new_attributes.mode.set_it)
st->set_attrs["mode"] = (uint64_t)args->new_attributes.mode.mode;
if (args->new_attributes.uid.set_it)
st->set_attrs["uid"] = (uint64_t)args->new_attributes.uid.uid;
if (args->new_attributes.gid.set_it)
st->set_attrs["gid"] = (uint64_t)args->new_attributes.gid.gid;
if (args->new_attributes.atime.set_it == SET_TO_SERVER_TIME)
st->set_attrs["atime"] = nfstime_now_str();
else if (args->new_attributes.atime.set_it == SET_TO_CLIENT_TIME)
st->set_attrs["atime"] = nfstime_to_str(args->new_attributes.atime.atime);
if (args->new_attributes.mtime.set_it == SET_TO_SERVER_TIME)
st->set_attrs["mtime"] = nfstime_now_str();
else if (args->new_attributes.mtime.set_it == SET_TO_CLIENT_TIME)
st->set_attrs["mtime"] = nfstime_to_str(args->new_attributes.mtime.mtime);
if (st->self->parent->trace)
fprintf(stderr, "[%d] SETATTR %ju ATTRS %s\n", st->self->nfs_fd, st->ino, json11::Json(st->set_attrs).dump().c_str());
st->cb = [st](int res)
{
auto reply = (SETATTR3res*)st->rop->reply;
if (res < 0)
{
*reply = (SETATTR3res){
.status = vitastor_nfs_map_err(res),
};
}
else
{
*reply = (SETATTR3res){
.status = NFS3_OK,
.resok = (SETATTR3resok){
.obj_wcc = (wcc_data){
.after = (post_op_attr){
.attributes_follow = 1,
.attributes = get_kv_attributes(st->self, st->ino, st->new_attrs),
},
},
},
};
}
rpc_queue_reply(st->rop);
delete st;
};
nfs_kv_continue_setattr(st, 0);
return 1;
}
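
Both handlers above update metadata with a compare-and-set: the write carries a predicate over the currently stored value, and a result of `-EAGAIN` means the value changed since it was read, so the operation re-reads and retries. The sketch below shows that retry loop against a toy in-memory store. `toy_kv`, `append_with_cas` and the `before_write` hook are hypothetical names invented for illustration; the store is synchronous and is not the VitastorKV API.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>
#include <map>
#include <string>

// Hypothetical in-memory store, NOT the VitastorKV API
struct toy_kv
{
    std::map<std::string, std::string> data;

    // Write the value only if cas_check(res, old_value) returns true,
    // where res is 0 if the key exists and -ENOENT otherwise
    int set(const std::string & key, const std::string & value,
        std::function<bool(int, const std::string &)> cas_check)
    {
        static const std::string empty;
        auto it = data.find(key);
        int res = it == data.end() ? -ENOENT : 0;
        if (!cas_check(res, it == data.end() ? empty : it->second))
            return -EAGAIN;
        data[key] = value;
        return 0;
    }
};

// Read-modify-write with retry: append a suffix to the stored value.
// Returns the number of attempts on success or a negative errno.
// before_write is a hypothetical hook used to simulate a concurrent writer.
static int append_with_cas(toy_kv & db, const std::string & key, const std::string & suffix,
    std::function<void(int)> before_write = nullptr)
{
    int attempts = 0;
    while (true)
    {
        attempts++;
        auto it = db.data.find(key);
        if (it == db.data.end())
            return -ENOENT;
        // Remember exactly what was read, the predicate compares against it
        std::string seen = it->second;
        if (before_write)
            before_write(attempts);
        int res = db.set(key, seen + suffix, [&seen](int r, const std::string & old_value)
        {
            return r == 0 && old_value == seen;
        });
        if (res != -EAGAIN)
            return res < 0 ? res : attempts;
        // CAS failure: somebody changed the value, re-read and retry
    }
}
```

The predicate compares the full serialized text that was read earlier, the same way the handlers compare `old_value` with `st->ientry_text`, so any concurrent change to the entry forces a retry.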

913
src/nfs_kv_write.cpp Normal file

@ -0,0 +1,913 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy over VitastorKV database - WRITE
#include <sys/time.h>
#include "nfs_proxy.h"
#include "nfs_kv.h"
struct nfs_rmw_t
{
nfs_kv_write_state *st = NULL;
int continue_state = 0;
uint64_t ino = 0;
uint64_t offset = 0;
uint8_t *buf = NULL;
uint64_t size = 0;
uint8_t *part_buf = NULL;
uint64_t version = 0;
};
struct nfs_kv_write_state
{
nfs_client_t *self = NULL;
rpc_op_t *rop = NULL;
uint64_t ino = 0;
uint64_t offset = 0, size = 0;
bool stable = false;
uint8_t *buf = NULL;
std::function<void(int res)> cb;
// state
bool allow_cache = true;
int res = 0, res2 = 0;
int waiting = 0;
std::string ientry_text;
json11::Json ientry;
uint64_t new_size = 0;
uint64_t aligned_size = 0;
uint8_t *aligned_buf = NULL;
uint64_t shared_inode = 0, shared_offset = 0;
bool was_immediate = false;
nfs_rmw_t rmw[2];
shared_file_header_t shdr;
kv_inode_extend_t *ext = NULL;
~nfs_kv_write_state()
{
if (aligned_buf)
{
free(aligned_buf);
aligned_buf = NULL;
}
}
};
static void nfs_kv_continue_write(nfs_kv_write_state *st, int state);
static void finish_allocate_shared(nfs_client_t *self, int res)
{
std::vector<shared_alloc_queue_t> waiting;
waiting.swap(self->parent->kvfs->allocating_shared);
for (auto & w: waiting)
{
w.st->res = res;
if (res == 0)
{
w.st->shared_inode = self->parent->kvfs->cur_shared_inode;
w.st->shared_offset = self->parent->kvfs->cur_shared_offset;
self->parent->kvfs->cur_shared_offset += (w.size + self->parent->pool_alignment-1) & ~(self->parent->pool_alignment-1);
}
nfs_kv_continue_write(w.st, w.state);
}
}
static void allocate_shared_inode(nfs_kv_write_state *st, int state, uint64_t size)
{
if (st->self->parent->kvfs->cur_shared_inode == 0)
{
st->self->parent->kvfs->allocating_shared.push_back({ st, state, size });
if (st->self->parent->kvfs->allocating_shared.size() > 1)
{
return;
}
allocate_new_id(st->self, [st](int res, uint64_t new_id)
{
if (res < 0)
{
finish_allocate_shared(st->self, res);
return;
}
st->self->parent->kvfs->cur_shared_inode = new_id;
st->self->parent->kvfs->cur_shared_offset = 0;
st->self->parent->db->set(
kv_inode_key(new_id), json11::Json(json11::Json::object{ { "type", "shared" } }).dump(),
[st](int res)
{
if (res < 0)
{
st->self->parent->kvfs->cur_shared_inode = 0;
}
finish_allocate_shared(st->self, res);
},
[](int res, const std::string & old_value)
{
return res == -ENOENT;
}
);
});
}
else
{
st->res = 0;
st->shared_inode = st->self->parent->kvfs->cur_shared_inode;
st->shared_offset = st->self->parent->kvfs->cur_shared_offset;
st->self->parent->kvfs->cur_shared_offset += (size + st->self->parent->pool_alignment-1) & ~(st->self->parent->pool_alignment-1);
nfs_kv_continue_write(st, state);
}
}
uint64_t align_shared_size(nfs_client_t *self, uint64_t size)
{
return (size + sizeof(shared_file_header_t) + self->parent->pool_alignment-1)
& ~(self->parent->pool_alignment-1);
}
static void nfs_do_write(uint64_t ino, uint64_t offset, uint64_t size, std::function<void(cluster_op_t *op)> prepare, nfs_kv_write_state *st, int state)
{
auto op = new cluster_op_t;
op->opcode = OSD_OP_WRITE;
op->inode = st->self->parent->fs_base_inode + ino;
op->offset = offset;
op->len = size;
prepare(op);
st->waiting++;
op->callback = [st, state](cluster_op_t *op)
{
if (op->retval != op->len)
{
st->res = op->retval >= 0 ? -EIO : op->retval;
}
delete op;
st->waiting--;
if (!st->waiting)
{
nfs_kv_continue_write(st, state);
}
};
st->self->parent->cli->execute(op);
}
static void nfs_do_unshare_write(nfs_kv_write_state *st, int state)
{
uint64_t unshare_size = (st->ientry["size"].uint64_value() + st->self->parent->pool_alignment-1)
& ~(st->self->parent->pool_alignment-1);
nfs_do_write(st->ino, 0, unshare_size, [&](cluster_op_t *op)
{
op->iov.push_back(st->aligned_buf + sizeof(shared_file_header_t), unshare_size);
}, st, state);
}
static void nfs_do_rmw(nfs_rmw_t *rmw)
{
auto parent = rmw->st->self->parent;
auto align = parent->pool_alignment;
assert(rmw->size < align);
assert((rmw->offset/parent->pool_block_size) == ((rmw->offset+rmw->size-1)/parent->pool_block_size));
if (!rmw->part_buf)
{
rmw->part_buf = (uint8_t*)malloc_or_die(align);
}
auto op = new cluster_op_t;
op->opcode = OSD_OP_READ;
op->inode = parent->fs_base_inode + rmw->ino;
op->offset = rmw->offset & ~(align-1);
op->len = align;
op->iov.push_back(rmw->part_buf, op->len);
rmw->st->waiting++;
op->callback = [rmw](cluster_op_t *rd_op)
{
if (rd_op->retval != rd_op->len)
{
free(rmw->part_buf);
rmw->part_buf = NULL;
rmw->st->res = rd_op->retval >= 0 ? -EIO : rd_op->retval;
rmw->st->waiting--;
if (!rmw->st->waiting)
{
nfs_kv_continue_write(rmw->st, rmw->continue_state);
}
}
else
{
if (!rmw->version)
{
auto st = rmw->st;
rmw->version = rd_op->version+1;
if (st->rmw[0].st && st->rmw[1].st &&
st->rmw[0].offset/st->self->parent->pool_block_size == st->rmw[1].offset/st->self->parent->pool_block_size)
{
// Same block... RMWs should be sequential
int other = rmw == &st->rmw[0] ? 1 : 0;
st->rmw[other].version = rmw->version+1;
}
}
auto parent = rmw->st->self->parent;
auto align = parent->pool_alignment;
bool is_begin = (rmw->offset % align);
bool is_end = ((rmw->offset+rmw->size) % align);
auto op = new cluster_op_t;
op->opcode = OSD_OP_WRITE;
op->inode = rmw->st->self->parent->fs_base_inode + rmw->ino;
op->offset = rmw->offset & ~(align-1);
op->len = align;
op->version = rmw->version;
if (is_begin)
{
op->iov.push_back(rmw->part_buf, rmw->offset % align);
}
op->iov.push_back(rmw->buf, rmw->size);
if (is_end)
{
op->iov.push_back(rmw->part_buf + (rmw->offset % align) + rmw->size, align - (rmw->offset % align) - rmw->size);
}
op->callback = [rmw](cluster_op_t *op)
{
if (op->retval == -EINTR)
{
// CAS failure - retry
rmw->version = 0;
rmw->st->waiting--;
nfs_do_rmw(rmw);
}
else
{
free(rmw->part_buf);
rmw->part_buf = NULL;
if (op->retval != op->len)
{
rmw->st->res = (op->retval >= 0 ? -EIO : op->retval);
}
rmw->st->waiting--;
if (!rmw->st->waiting)
{
nfs_kv_continue_write(rmw->st, rmw->continue_state);
}
}
delete op;
};
parent->cli->execute(op);
}
delete rd_op;
};
parent->cli->execute(op);
}
static void nfs_do_shared_read(nfs_kv_write_state *st, int state)
{
auto op = new cluster_op_t;
op->opcode = OSD_OP_READ;
op->inode = st->self->parent->fs_base_inode + st->ientry["shared_ino"].uint64_value();
op->offset = st->ientry["shared_offset"].uint64_value();
op->len = align_shared_size(st->self, st->ientry["size"].uint64_value());
op->iov.push_back(st->aligned_buf, op->len);
op->callback = [st, state](cluster_op_t *op)
{
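// Treat a short read as an I/O error, like the other callbacks in this file
st->res = op->retval == op->len ? 0 : (op->retval >= 0 ? -EIO : op->retval);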
delete op;
nfs_kv_continue_write(st, state);
};
st->self->parent->cli->execute(op);
}
static void nfs_do_fsync(nfs_kv_write_state *st, int state)
{
// Client requested a stable write. Add an fsync
auto op = new cluster_op_t;
op->opcode = OSD_OP_SYNC;
op->callback = [st, state](cluster_op_t *op)
{
delete op;
nfs_kv_continue_write(st, state);
};
st->self->parent->cli->execute(op);
}
static bool nfs_do_shared_readmodify(nfs_kv_write_state *st, int base_state, int state, bool unshare)
{
assert(state <= base_state);
if (state < base_state) {}
else if (state == base_state) goto resume_0;
assert(!st->aligned_buf);
st->aligned_size = unshare
? sizeof(shared_file_header_t) + ((st->new_size + st->self->parent->pool_alignment-1) & ~(st->self->parent->pool_alignment-1))
: align_shared_size(st->self, st->new_size);
st->aligned_buf = (uint8_t*)malloc_or_die(st->aligned_size);
// FIXME do not allocate zeroes if we only need zeroes
memset(st->aligned_buf + sizeof(shared_file_header_t), 0, st->offset);
memset(st->aligned_buf + sizeof(shared_file_header_t) + st->offset + st->size, 0,
st->aligned_size - sizeof(shared_file_header_t) - st->offset - st->size);
if (st->ientry["shared_ino"].uint64_value() != 0 &&
st->ientry["size"].uint64_value() != 0)
{
// Read old data if shared non-empty
nfs_do_shared_read(st, base_state);
return false;
resume_0:
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return false;
}
auto hdr = ((shared_file_header_t*)st->aligned_buf);
if (hdr->magic != SHARED_FILE_MAGIC_V1 || hdr->inode != st->ino)
{
// Got unrelated data - retry from the beginning
st->allow_cache = false;
free(st->aligned_buf);
st->aligned_buf = NULL;
nfs_kv_continue_write(st, 0);
return false;
}
}
// FIXME put shared_file_header_t after data to not break alignment
*((shared_file_header_t*)st->aligned_buf) = {
.magic = SHARED_FILE_MAGIC_V1,
.inode = st->ino,
.alloc = st->aligned_size,
};
return true;
}
static void nfs_do_shared_write(nfs_kv_write_state *st, int state, bool only_aligned)
{
nfs_do_write(st->shared_inode, st->shared_offset, st->aligned_size, [&](cluster_op_t *op)
{
if (only_aligned)
op->iov.push_back(st->aligned_buf, st->aligned_size);
else
{
op->iov.push_back(st->aligned_buf, sizeof(shared_file_header_t) + st->offset);
op->iov.push_back(st->buf, st->size);
op->iov.push_back(
st->aligned_buf + sizeof(shared_file_header_t) + st->offset + st->size,
st->aligned_size - (sizeof(shared_file_header_t) + st->offset + st->size)
);
}
}, st, state);
}
static void nfs_do_align_write(nfs_kv_write_state *st, uint64_t ino, uint64_t offset, uint64_t shared_alloc, int state)
{
auto alignment = st->self->parent->pool_alignment;
uint64_t end = (offset+st->size);
uint8_t *good_buf = st->buf;
uint64_t good_offset = offset;
uint64_t good_size = st->size;
bool begin_shdr = false;
uint64_t end_pad = 0;
st->waiting++;
st->rmw[0].st = NULL;
st->rmw[1].st = NULL;
if (offset % alignment)
{
if (shared_alloc && st->offset == 0 && (offset % alignment) == sizeof(shared_file_header_t))
{
// RMW can be skipped at shared beginning
st->shdr = {
.magic = SHARED_FILE_MAGIC_V1,
.inode = st->ino,
.alloc = shared_alloc,
};
begin_shdr = true;
good_offset -= sizeof(shared_file_header_t);
offset = 0;
}
else
{
// Requires read-modify-write in the beginning
auto s = (alignment - (offset % alignment));
if (good_size > s)
{
good_buf += s;
good_offset += s;
good_size -= s;
}
else
good_size = 0;
s = s > st->size ? st->size : s;
st->rmw[0] = {
.st = st,
.continue_state = state,
.ino = ino,
.offset = offset,
.buf = st->buf,
.size = s,
};
nfs_do_rmw(&st->rmw[0]);
}
}
if ((end % alignment) &&
(offset == 0 || end/alignment > (offset-1)/alignment))
{
// Requires read-modify-write in the end
assert(st->offset+st->size <= st->new_size);
if (st->offset+st->size == st->new_size)
{
// rmw can be skipped at end - we can just zero pad the request
end_pad = alignment - (end % alignment);
}
else
{
auto s = (end % alignment);
if (good_size > s)
good_size -= s;
else
good_size = 0;
st->rmw[1] = {
.st = st,
.continue_state = state,
.ino = ino,
.offset = end - s,
.buf = st->buf + st->size - s,
.size = s,
};
nfs_do_rmw(&st->rmw[1]);
}
}
if (good_size > 0 || end_pad > 0 || begin_shdr)
{
// Normal write
nfs_do_write(ino, good_offset, (begin_shdr ? sizeof(shared_file_header_t) : 0)+good_size+end_pad, [&](cluster_op_t *op)
{
if (begin_shdr)
op->iov.push_back(&st->shdr, sizeof(shared_file_header_t));
op->iov.push_back(good_buf, good_size);
if (end_pad)
op->iov.push_back(st->self->parent->kvfs->zero_block.data(), end_pad);
}, st, state);
}
st->waiting--;
if (!st->waiting)
{
nfs_kv_continue_write(st, state);
}
}
static std::string new_normal_ientry(nfs_kv_write_state *st)
{
auto ni = st->ientry.object_items();
ni.erase("empty");
ni.erase("shared_ino");
ni.erase("shared_offset");
ni.erase("shared_alloc");
ni.erase("shared_ver");
ni["size"] = st->ext->cur_extend;
return json11::Json(ni).dump();
}
static std::string new_moved_ientry(nfs_kv_write_state *st)
{
auto ni = st->ientry.object_items();
ni.erase("empty");
ni["shared_ino"] = st->shared_inode;
ni["shared_offset"] = st->shared_offset;
ni["shared_alloc"] = st->aligned_size;
ni.erase("shared_ver");
ni["size"] = st->new_size;
return json11::Json(ni).dump();
}
static std::string new_shared_ientry(nfs_kv_write_state *st)
{
auto ni = st->ientry.object_items();
ni.erase("empty");
ni["size"] = st->new_size;
ni["shared_ver"] = ni["shared_ver"].uint64_value()+1;
return json11::Json(ni).dump();
}
static std::string new_unshared_ientry(nfs_kv_write_state *st)
{
auto ni = st->ientry.object_items();
ni.erase("empty");
ni.erase("shared_ino");
ni.erase("shared_offset");
ni.erase("shared_alloc");
ni.erase("shared_ver");
return json11::Json(ni).dump();
}
static void nfs_kv_extend_inode(nfs_kv_write_state *st, int state, int base_state)
{
if (state == base_state+1)
goto resume_1;
st->ext->cur_extend = st->ext->next_extend;
st->ext->next_extend = 0;
st->res2 = -EAGAIN;
st->self->parent->db->set(kv_inode_key(st->ino), new_normal_ientry(st), [st, base_state](int res)
{
st->res = res;
nfs_kv_continue_write(st, base_state+1);
}, [st](int res, const std::string & old_value)
{
if (res != 0)
{
return false;
}
if (old_value == st->ientry_text)
{
return true;
}
std::string err;
auto ientry = json11::Json::parse(old_value, err).object_items();
if (err != "")
{
fprintf(stderr, "Invalid JSON in inode %ju = %s: %s\n", st->ino, old_value.c_str(), err.c_str());
st->res2 = -EINVAL;
return false;
}
else if (ientry.size() == st->ientry.object_items().size())
{
for (auto & kv: st->ientry.object_items())
{
if (kv.first != "size" && ientry[kv.first] != kv.second)
{
// Something except size changed
return false;
}
}
// OK, only size changed
if (ientry["size"].uint64_value() >= st->new_size)
{
// Already extended
st->res2 = 0;
return false;
}
// size is different but can still be extended, other parameters don't differ
return true;
}
return false;
});
return;
resume_1:
if (st->res == -EAGAIN)
{
// EAGAIN may be OK in fact (see above)
st->res = st->res2;
}
if (st->res == 0)
{
st->ext->done_extend = st->ext->cur_extend;
}
st->ext->cur_extend = 0;
// Wake up other extenders anyway
auto waiters = std::move(st->ext->waiters);
for (auto & cb: waiters)
{
cb();
}
}
// Packing small files into "shared inodes". Insane algorithm...
// Write:
// - If (offset+size <= threshold):
// - Read inode from cache
// - If inode does not exist - stop with -ENOENT
// - If inode is not a regular file - stop with -EINVAL
// - If it's empty (size == 0 || empty == true):
// - If preset size is larger than threshold:
// - Write data into non-shared inode
// - In parallel: clear empty flag
// - If CAS failure: re-read inode and restart
// - Otherwise:
// - Allocate/take a shared inode
// - Allocate space in its end
// - Write data into shared inode
// - If CAS failure: allocate another shared inode and retry
// - Write shared inode reference, set size
// - If CAS failure: free allocated shared space, re-read inode and restart
// - If it's not empty:
// - If non-shared:
// - Write data into non-shared inode
// - In parallel: check if data fits into inode size and extend if it doesn't
// - If CAS failure: re-read inode and retry to extend the size
// - If shared:
// - If data doesn't fit into the same shared inode:
// - Allocate space in a new shared inode
// - Read whole file from shared inode
// - Write data into the new shared inode
// - If CAS failure: allocate another shared inode and retry
// - Update inode metadata (set new size and new shared inode)
// - If CAS failure: free allocated shared space, re-read inode and restart
// - If it fits:
// - Update shared inode data in-place
// - Update inode entry in any case to block parallel non-shared writes
// - If CAS failure: re-read inode and restart
// - Otherwise:
// - Read inode
// - If not a regular file - stop with -EINVAL
// - If shared:
// - Read whole file from shared inode
// - Write data into non-shared inode
// - If CAS failure (block should not exist): restart
// - Update inode metadata (make non-shared, update size)
// - If CAS failure: restart
// - Zero out the shared inode header
// - Write data into non-shared inode
// - Check if size fits
// - Extend if it doesn't
// Read:
// - If (offset+size <= threshold):
// - Read inode from cache
// - If empty: return zeroes
// - If shared:
// - Read the whole file from shared inode, or at least data and shared inode header
// - If the file header in data doesn't match: re-read inode and restart
// - If non-shared:
// - Read data from non-shared inode
// - Otherwise:
// - Read data from non-shared inode
static void nfs_kv_continue_write(nfs_kv_write_state *st, int state)
{
if (state == 0) {}
else if (state == 1) goto resume_1;
else if (state == 2) goto resume_2;
else if (state == 3) goto resume_3;
else if (state == 4) goto resume_4;
else if (state == 5) goto resume_5;
else if (state == 6) goto resume_6;
else if (state == 7) goto resume_7;
else if (state == 8) goto resume_8;
else if (state == 9) goto resume_9;
else if (state == 10) goto resume_10;
else if (state == 11) goto resume_11;
else if (state == 12) goto resume_12;
else if (state == 13) goto resume_13;
else if (state == 14) goto resume_14;
else if (state == 15) goto resume_15;
else if (state == 16) goto resume_16;
else
{
fprintf(stderr, "BUG: invalid state in nfs_kv_continue_write()\n");
abort();
}
resume_0:
if (!st->size)
{
auto cb = std::move(st->cb);
cb(0);
return;
}
kv_read_inode(st->self, st->ino, [st](int res, const std::string & value, json11::Json attrs)
{
st->res = res;
st->ientry_text = value;
st->ientry = attrs;
nfs_kv_continue_write(st, 1);
}, st->allow_cache);
return;
resume_1:
if (st->res < 0 || kv_map_type(st->ientry["type"].string_value()) != NF3REG)
{
auto cb = std::move(st->cb);
cb(st->res == 0 ? -EINVAL : st->res);
return;
}
st->was_immediate = st->self->parent->cli->get_immediate_commit(st->self->parent->fs_base_inode + st->ino);
st->new_size = st->ientry["size"].uint64_value();
if (st->new_size < st->offset + st->size)
{
st->new_size = st->offset + st->size;
}
if (st->offset + st->size + sizeof(shared_file_header_t) < st->self->parent->shared_inode_threshold)
{
if ((st->ientry["size"].uint64_value() == 0 &&
st->ientry["shared_ino"].uint64_value() == 0) ||
(st->ientry["empty"].bool_value() &&
(st->ientry["size"].uint64_value() + sizeof(shared_file_header_t)) < st->self->parent->shared_inode_threshold) ||
(st->ientry["shared_ino"].uint64_value() != 0 &&
st->ientry["shared_alloc"].uint64_value() < sizeof(shared_file_header_t)+st->offset+st->size))
{
// Either empty, or shared and requires moving into a larger place (redirect-write)
allocate_shared_inode(st, 2, align_shared_size(st->self, st->new_size));
return;
resume_2:
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
resume_3:
if (!nfs_do_shared_readmodify(st, 3, state, false))
return;
nfs_do_shared_write(st, 4, false);
return;
resume_4:
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
st->self->parent->db->set(kv_inode_key(st->ino), new_moved_ientry(st), [st](int res)
{
st->res = res;
nfs_kv_continue_write(st, 5);
}, [st](int res, const std::string & old_value)
{
return res == 0 && old_value == st->ientry_text;
});
return;
resume_5:
if (st->res < 0)
{
st->res2 = st->res;
memset(st->aligned_buf, 0, st->aligned_size);
nfs_do_shared_write(st, 6, true);
return;
resume_6:
free(st->aligned_buf);
st->aligned_buf = NULL;
if (st->res2 == -EAGAIN)
{
goto resume_0;
}
else
{
auto cb = std::move(st->cb);
cb(st->res2);
return;
}
}
auto cb = std::move(st->cb);
cb(0);
return;
}
else if (st->ientry["shared_ino"].uint64_value() != 0)
{
// Non-empty, shared, can be updated in-place
nfs_do_align_write(st, st->ientry["shared_ino"].uint64_value(),
st->ientry["shared_offset"].uint64_value() + sizeof(shared_file_header_t) + st->offset,
st->ientry["shared_alloc"].uint64_value(), 7);
return;
resume_7:
if (st->res == 0 && st->stable && !st->was_immediate)
{
nfs_do_fsync(st, 8);
return;
}
resume_8:
// We always have to change inode entry on shared writes
st->self->parent->db->set(kv_inode_key(st->ino), new_shared_ientry(st), [st](int res)
{
st->res = res;
nfs_kv_continue_write(st, 9);
}, [st](int res, const std::string & old_value)
{
return res == 0 && old_value == st->ientry_text;
});
return;
resume_9:
if (st->res == -EAGAIN)
{
goto resume_0;
}
auto cb = std::move(st->cb);
cb(st->res);
return;
}
// Fall through for non-shared
}
// Unshare?
if (st->ientry["shared_ino"].uint64_value() != 0)
{
if (st->ientry["size"].uint64_value() != 0)
{
assert(!st->aligned_buf);
st->aligned_size = align_shared_size(st->self, st->ientry["size"].uint64_value());
st->aligned_buf = (uint8_t*)malloc_or_die(st->aligned_size);
nfs_do_shared_read(st, 10);
return;
resume_10:
nfs_do_unshare_write(st, 11);
return;
resume_11:
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
}
st->self->parent->db->set(kv_inode_key(st->ino), new_unshared_ientry(st), [st](int res)
{
st->res = res;
nfs_kv_continue_write(st, 12);
}, [st](int res, const std::string & old_value)
{
return res == 0 && old_value == st->ientry_text;
});
return;
resume_12:
if (st->res == -EAGAIN)
{
// Restart
goto resume_0;
}
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
st->ientry_text = new_unshared_ientry(st);
}
// Non-shared write
nfs_do_align_write(st, st->ino, st->offset, 0, 13);
return;
resume_13:
if (st->res == 0 && st->stable && !st->was_immediate)
{
nfs_do_fsync(st, 14);
return;
}
resume_14:
if (st->res < 0)
{
auto cb = std::move(st->cb);
cb(st->res);
return;
}
if (st->ientry["empty"].bool_value() ||
st->ientry["size"].uint64_value() < st->new_size ||
st->ientry["shared_ino"].uint64_value() != 0)
{
st->ext = &st->self->parent->kvfs->extends[st->ino];
st->ext->refcnt++;
resume_15:
if (st->ext->next_extend < st->new_size)
{
// Aggregate inode extension requests
st->ext->next_extend = st->new_size;
}
if (st->ext->cur_extend > 0)
{
// Wait for current extend which is already in progress
st->ext->waiters.push_back([st](){ nfs_kv_continue_write(st, 15); });
return;
}
if (st->ext->done_extend < st->new_size)
{
nfs_kv_extend_inode(st, 15, 15);
return;
resume_16:
nfs_kv_extend_inode(st, 16, 15);
}
st->ext->refcnt--;
assert(st->ext->refcnt >= 0);
if (st->ext->refcnt == 0)
{
st->self->parent->kvfs->extends.erase(st->ino);
}
}
if (st->res == -EAGAIN)
{
// Restart
goto resume_0;
}
auto cb = std::move(st->cb);
cb(st->res);
}
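The function above is a hand-written resumable state machine: each asynchronous call passes a state number, and the completion callback re-enters the function, which jumps to the matching `resume_N` label. The following is a minimal sketch of that pattern only. Every name in it (`demo_op_t`, `demo_async_step`, `demo_continue`, `demo_run`) is hypothetical, and a plain `std::deque` stands in for the event loop.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical operation state (illustrative names, not Vitastor's)
struct demo_op_t
{
    int res = 0;
    int attempts = 0;
    // Stand-in for the event loop: completions are queued, not run inline
    std::deque<std::function<void()>> *loop = nullptr;
    std::function<void(int)> cb;
};

// Fake asynchronous step: reports -EAGAIN on the first attempt only
static void demo_async_step(demo_op_t *op, std::function<void()> done)
{
    int result = op->attempts < 2 ? -EAGAIN : 0;
    op->loop->push_back([op, result, done]() { op->res = result; done(); });
}

void demo_continue(demo_op_t *op, int state)
{
    if (state == 0) goto resume_0;
    else if (state == 1) goto resume_1;
    else abort();
resume_0:
    op->attempts++;
    demo_async_step(op, [op]() { demo_continue(op, 1); });
    return;
resume_1:
    if (op->res == -EAGAIN)
    {
        // Restart the whole operation from the beginning
        goto resume_0;
    }
    {
        // Move the callback out first, because it may free the state
        auto cb = std::move(op->cb);
        cb(op->res);
    }
}

// Runs one operation to completion and returns the result given to the callback
int demo_run(int *attempts_out)
{
    std::deque<std::function<void()>> loop;
    demo_op_t op;
    int final_res = -1;
    op.loop = &loop;
    op.cb = [&final_res](int res) { final_res = res; };
    demo_continue(&op, 0);
    while (!loop.empty())
    {
        auto fn = std::move(loop.front());
        loop.pop_front();
        fn();
    }
    *attempts_out = op.attempts;
    return final_res;
}
```

The labels sit before any initialized local declaration in their scope, which is what keeps the forward `goto` jumps legal in C++.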
int kv_nfs3_write_proc(void *opaque, rpc_op_t *rop)
{
nfs_kv_write_state *st = new nfs_kv_write_state;
st->self = (nfs_client_t*)opaque;
st->rop = rop;
WRITE3args *args = (WRITE3args*)rop->request;
WRITE3res *reply = (WRITE3res*)rop->reply;
st->ino = kv_fh_inode(args->file);
st->offset = args->offset;
st->size = (args->count > args->data.size ? args->data.size : args->count);
if (st->self->parent->trace)
fprintf(stderr, "[%d] WRITE %ju %ju+%ju\n", st->self->nfs_fd, st->ino, st->offset, st->size);
if (!st->ino || st->size > MAX_REQUEST_SIZE)
{
*reply = (WRITE3res){ .status = NFS3ERR_INVAL };
rpc_queue_reply(rop);
delete st;
return 0;
}
st->buf = (uint8_t*)args->data.data;
st->stable = (args->stable != UNSTABLE);
st->cb = [st](int res)
{
WRITE3res *reply = (WRITE3res*)st->rop->reply;
*reply = (WRITE3res){ .status = vitastor_nfs_map_err(res) };
if (res == 0)
{
reply->resok.count = (unsigned)st->size;
reply->resok.committed = st->stable || st->was_immediate ? FILE_SYNC : UNSTABLE;
*(uint64_t*)reply->resok.verf = st->self->parent->server_id;
}
rpc_queue_reply(st->rop);
delete st;
};
nfs_kv_continue_write(st, 0);
return 1;
}
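The inode updates in `nfs_kv_continue_write()` pass a second lambda to `db->set()` that accepts the write only if the stored value still equals `st->ientry_text`, and the caller restarts on `-EAGAIN`. Below is a rough sketch of that optimistic read-modify-write loop against a hypothetical in-memory store. `demo_kv_t`, `demo_append` and the `before_write` hook are invented for illustration and say nothing about the real `kv_dbw_t` interface.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>
#include <map>
#include <string>

// Hypothetical in-memory store; only the compare-before-write idea is from the diff
struct demo_kv_t
{
    std::map<std::string, std::string> data;

    // Returns 0 on success or -EAGAIN when cas_compare rejects the current value
    int set(const std::string & key, const std::string & value,
        std::function<bool(int, const std::string &)> cas_compare)
    {
        static const std::string empty;
        auto it = data.find(key);
        int res = (it == data.end() ? -ENOENT : 0);
        if (cas_compare && !cas_compare(res, it == data.end() ? empty : it->second))
            return -EAGAIN;
        data[key] = value;
        return 0;
    }
};

// Appends a suffix to the stored value, restarting when the value changed in between.
// before_write is a test hook that simulates a concurrent writer.
int demo_append(demo_kv_t & db, const std::string & key, const std::string & suffix,
    std::function<void(int)> before_write = nullptr)
{
    for (int attempt = 1; ; attempt++)
    {
        std::string old_value = db.data[key];
        if (before_write)
            before_write(attempt);
        int res = db.set(key, old_value + suffix, [&](int r, const std::string & cur)
        {
            return r == 0 && cur == old_value;
        });
        if (res != -EAGAIN)
            return res;
    }
}
```

Retrying from the read step, rather than only repeating the write, is what lets the new value be recomputed from whatever the other writer stored.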

126
src/nfs_mount.cpp Normal file

@@ -0,0 +1,126 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// NFS proxy - common NULL, ACCESS, COMMIT, DUMP, EXPORT, MNT, UMNT, UMNTALL
#include <sys/time.h>
#include "nfs_proxy.h"
#include "nfs/nfs.h"
nfsstat3 vitastor_nfs_map_err(int err)
{
if (err < 0)
{
err = -err;
}
return (err == EINVAL ? NFS3ERR_INVAL
: (err == ENOENT ? NFS3ERR_NOENT
: (err == ENOSPC ? NFS3ERR_NOSPC
: (err == EEXIST ? NFS3ERR_EXIST
: (err == EISDIR ? NFS3ERR_ISDIR
: (err == ENOTDIR ? NFS3ERR_NOTDIR
: (err == ENOTEMPTY ? NFS3ERR_NOTEMPTY
: (err == EIO ? NFS3ERR_IO : (err ? NFS3ERR_IO : NFS3_OK)))))))));
}
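The nested conditional in `vitastor_nfs_map_err()` can also be read as a plain lookup. As a sketch, the same mapping written as a `switch` might look like the code below. The `demo_` enum is a local stand-in for `nfsstat3` (numeric values taken from RFC 1813) so that the example compiles without the project's headers.

```cpp
#include <cassert>
#include <cerrno>

// Local stand-ins for a few nfsstat3 codes; numeric values as in RFC 1813
enum demo_nfsstat3
{
    DEMO_NFS3_OK = 0,
    DEMO_NFS3ERR_NOENT = 2,
    DEMO_NFS3ERR_IO = 5,
    DEMO_NFS3ERR_EXIST = 17,
    DEMO_NFS3ERR_NOTDIR = 20,
    DEMO_NFS3ERR_ISDIR = 21,
    DEMO_NFS3ERR_INVAL = 22,
    DEMO_NFS3ERR_NOSPC = 28,
    DEMO_NFS3ERR_NOTEMPTY = 66,
};

// Accepts either errno or -errno; anything unknown becomes an I/O error
demo_nfsstat3 demo_map_err(int err)
{
    if (err < 0)
        err = -err;
    switch (err)
    {
    case 0: return DEMO_NFS3_OK;
    case EINVAL: return DEMO_NFS3ERR_INVAL;
    case ENOENT: return DEMO_NFS3ERR_NOENT;
    case ENOSPC: return DEMO_NFS3ERR_NOSPC;
    case EEXIST: return DEMO_NFS3ERR_EXIST;
    case EISDIR: return DEMO_NFS3ERR_ISDIR;
    case ENOTDIR: return DEMO_NFS3ERR_NOTDIR;
    case ENOTEMPTY: return DEMO_NFS3ERR_NOTEMPTY;
    default: return DEMO_NFS3ERR_IO;
    }
}
```

One caveat of the `switch` form: it does not compile on platforms where `EEXIST` and `ENOTEMPTY` share a value, while the conditional chain does.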
int nfs3_null_proc(void *opaque, rpc_op_t *rop)
{
rpc_queue_reply(rop);
return 0;
}
int nfs3_access_proc(void *opaque, rpc_op_t *rop)
{
//nfs_client_t *self = (nfs_client_t*)opaque;
ACCESS3args *args = (ACCESS3args*)rop->request;
ACCESS3res *reply = (ACCESS3res*)rop->reply;
*reply = (ACCESS3res){
.status = NFS3_OK,
.resok = (ACCESS3resok){
.access = args->access,
},
};
rpc_queue_reply(rop);
return 0;
}
int nfs3_commit_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
//COMMIT3args *args = (COMMIT3args*)rop->request;
cluster_op_t *op = new cluster_op_t;
// fsync. we don't know how to fsync a single inode, so just fsync everything
op->opcode = OSD_OP_SYNC;
op->callback = [self, rop](cluster_op_t *op)
{
COMMIT3res *reply = (COMMIT3res*)rop->reply;
*reply = (COMMIT3res){ .status = vitastor_nfs_map_err(op->retval) };
*(uint64_t*)reply->resok.verf = self->parent->server_id;
rpc_queue_reply(rop);
// The operation was allocated with new above, so free it once the reply is queued
delete op;
};
self->parent->cli->execute(op);
return 1;
}
int mount3_mnt_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
//nfs_dirpath *args = (nfs_dirpath*)rop->request;
if (self->parent->trace)
fprintf(stderr, "[%d] MNT\n", self->nfs_fd);
nfs_mountres3 *reply = (nfs_mountres3*)rop->reply;
u_int flavor = RPC_AUTH_NONE;
reply->fhs_status = MNT3_OK;
reply->mountinfo.fhandle = xdr_copy_string(rop->xdrs, NFS_ROOT_HANDLE);
reply->mountinfo.auth_flavors.auth_flavors_len = 1;
reply->mountinfo.auth_flavors.auth_flavors_val = (u_int*)xdr_copy_string(rop->xdrs, (char*)&flavor, sizeof(u_int)).data;
rpc_queue_reply(rop);
return 0;
}
int mount3_dump_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
if (self->parent->trace)
fprintf(stderr, "[%d] DUMP\n", self->nfs_fd);
nfs_mountlist *reply = (nfs_mountlist*)rop->reply;
*reply = (struct nfs_mountbody*)malloc_or_die(sizeof(struct nfs_mountbody));
xdr_add_malloc(rop->xdrs, *reply);
(*reply)->ml_hostname = xdr_copy_string(rop->xdrs, "127.0.0.1");
(*reply)->ml_directory = xdr_copy_string(rop->xdrs, self->parent->export_root);
(*reply)->ml_next = NULL;
rpc_queue_reply(rop);
return 0;
}
int mount3_umnt_proc(void *opaque, rpc_op_t *rop)
{
//nfs_client_t *self = (nfs_client_t*)opaque;
//nfs_dirpath *arg = (nfs_dirpath*)rop->request;
// do nothing
rpc_queue_reply(rop);
return 0;
}
int mount3_umntall_proc(void *opaque, rpc_op_t *rop)
{
// do nothing
rpc_queue_reply(rop);
return 0;
}
int mount3_export_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
nfs_exports *reply = (nfs_exports*)rop->reply;
*reply = (struct nfs_exportnode*)calloc_or_die(1, sizeof(struct nfs_exportnode) + sizeof(struct nfs_groupnode));
xdr_add_malloc(rop->xdrs, *reply);
(*reply)->ex_dir = xdr_copy_string(rop->xdrs, self->parent->export_root);
// The group node lives right after the export node in the same allocation
(*reply)->ex_groups = (struct nfs_groupnode*)(*reply+1);
(*reply)->ex_groups->gr_name = xdr_copy_string(rop->xdrs, "127.0.0.1");
(*reply)->ex_groups->gr_next = NULL;
(*reply)->ex_next = NULL;
rpc_queue_reply(rop);
return 0;
}
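`mount3_export_proc()` allocates the export node and its single group node in one `calloc` call, so one registered pointer releases both. A small sketch of that layout with hypothetical stand-in structs (`demo_export`, `demo_group`), not the generated XDR types:

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical stand-ins for the export and group list nodes
struct demo_group
{
    const char *name;
    demo_group *next;
};

struct demo_export
{
    const char *dir;
    demo_group *groups;
    demo_export *next;
};

// Allocates an export node with one group node placed directly after it.
// The caller releases both with a single free().
demo_export *demo_make_export(const char *dir, const char *group_name)
{
    demo_export *e = (demo_export*)calloc(1, sizeof(demo_export) + sizeof(demo_group));
    if (!e)
        abort();
    e->dir = dir;
    // Pointer arithmetic on the node itself: e+1 is the first byte past the export node
    e->groups = (demo_group*)(e + 1);
    e->groups->name = group_name;
    e->groups->next = NULL;
    e->next = NULL;
    return e;
}
```

The arithmetic has to be done on the node pointer. Adding 1 to a pointer-to-pointer would step past the pointer variable instead of past the node.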


@@ -21,6 +21,9 @@
  #include "addr_util.h"
  #include "str_util.h"
  #include "nfs_proxy.h"
+ #include "nfs_kv.h"
+ #include "nfs_block.h"
+ #include "nfs_common.h"
  #include "http_client.h"
  #include "cli.h"
@@ -31,6 +34,8 @@ const char *exe_name = NULL;
  nfs_proxy_t::~nfs_proxy_t()
  {
+ if (db)
+ delete db;
  if (cmd)
  delete cmd;
  if (cli)
@@ -57,12 +62,14 @@ json11::Json::object nfs_proxy_t::parse_args(int narg, const char *args[])
  "\n"
  "USAGE:\n"
  " %s [STANDARD OPTIONS] [OTHER OPTIONS]\n"
+ " --fs <META> mount VitastorFS with metadata in image <META>\n"
  " --subdir <DIR> export images prefixed <DIR>/ (default empty - export all images)\n"
  " --portmap 0 do not listen on port 111 (portmap/rpcbind, requires root)\n"
  " --bind <IP> bind service to <IP> address (default 0.0.0.0)\n"
  " --nfspath <PATH> set NFS export path to <PATH> (default is /)\n"
  " --port <PORT> use port <PORT> for NFS services (default is 2049)\n"
  " --pool <POOL> use <POOL> as default pool for new files (images)\n"
+ " --logfile <FILE> log to the specified file\n"
  " --foreground 1 stay in foreground, do not daemonize\n"
  "\n"
  "NFS proxy is stateless if you use immediate_commit=all in your cluster and if\n"
@@ -92,6 +99,9 @@ void nfs_proxy_t::run(json11::Json cfg)
  srand48(tv.tv_sec*1000000000 + tv.tv_nsec);
  server_id = (uint64_t)lrand48() | ((uint64_t)lrand48() << 31) | ((uint64_t)lrand48() << 62);
  // Parse options
+ if (cfg["logfile"].string_value() != "")
+ logfile = cfg["logfile"].string_value();
+ trace = cfg["log_level"].uint64_value() > 5 || cfg["trace"].uint64_value() > 0;
  bind_address = cfg["bind"].string_value();
  if (bind_address == "")
  bind_address = "0.0.0.0";
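The `server_id` context line in this hunk builds a 64-bit value from three `lrand48()` results. `lrand48()` returns a non-negative value below 2^31, so the three calls fill bits 0..30, 31..61 and 62..63 (the upper bits of the third value are shifted out). A sketch of just the bit packing, with a hypothetical helper name:

```cpp
#include <cassert>
#include <cstdint>

// Packs three values of up to 31 bits each into one 64-bit value.
// Only the lowest 2 bits of c survive, because they land in bits 62..63.
uint64_t demo_combine_31bit(uint64_t a, uint64_t b, uint64_t c)
{
    return a | (b << 31) | (c << 62);
}
```

Passing the largest possible inputs shows that every bit of the result can be set, so no part of the 64-bit range is left unused.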
@@ -131,67 +141,12 @@ void nfs_proxy_t::run(json11::Json cfg)
  cmd->ringloop = ringloop;
  cmd->epmgr = epmgr;
  cmd->cli = cli;
- // We need inode name hashes for NFS handles to remain stateless and <= 64 bytes long
- dir_info[""] = (nfs_dir_t){
- .id = 1,
- .mod_rev = 0,
- };
- clock_gettime(CLOCK_REALTIME, &dir_info[""].mtime);
  watch_stats();
- assert(cli->st_cli.on_inode_change_hook == NULL);
- cli->st_cli.on_inode_change_hook = [this](inode_t changed_inode, bool removed)
- {
- auto inode_cfg_it = cli->st_cli.inode_config.find(changed_inode);
- if (inode_cfg_it == cli->st_cli.inode_config.end())
- {
- return;
- }
- auto & inode_cfg = inode_cfg_it->second;
- std::string full_name = inode_cfg.name;
- if (name_prefix != "" && full_name.substr(0, name_prefix.size()) != name_prefix)
- {
- return;
- }
- // Calculate directory modification time and revision (used as "cookie verifier")
- timespec now;
- clock_gettime(CLOCK_REALTIME, &now);
- dir_info[""].mod_rev = dir_info[""].mod_rev < inode_cfg.mod_revision ? inode_cfg.mod_revision : dir_info[""].mod_rev;
- dir_info[""].mtime = now;
- int pos = full_name.find('/', name_prefix.size());
- while (pos >= 0)
- {
- std::string dir = full_name.substr(0, pos);
- auto & dinf = dir_info[dir];
- if (!dinf.id)
- dinf.id = next_dir_id++;
- dinf.mod_rev = dinf.mod_rev < inode_cfg.mod_revision ? inode_cfg.mod_revision : dinf.mod_rev;
- dinf.mtime = now;
- dir_by_hash["S"+base64_encode(sha256(dir))] = dir;
- pos = full_name.find('/', pos+1);
- }
- // Alter inode_by_hash
- if (removed)
- {
- auto ino_it = hash_by_inode.find(changed_inode);
- if (ino_it != hash_by_inode.end())
- {
- inode_by_hash.erase(ino_it->second);
- hash_by_inode.erase(ino_it);
- }
- }
- else
- {
- std::string hash = "S"+base64_encode(sha256(full_name));
- auto hbi_it = hash_by_inode.find(changed_inode);
- if (hbi_it != hash_by_inode.end() && hbi_it->second != hash)
- {
- // inode had a different name, remove old hash=>inode pointer
- inode_by_hash.erase(hbi_it->second);
- }
- inode_by_hash[hash] = changed_inode;
- hash_by_inode[changed_inode] = hash;
- }
- };
+ if (!fs_kv_inode)
+ {
+ blockfs = new block_fs_state_t();
+ blockfs->init(this);
+ }
  // Load image metadata
  while (!cli->is_ready())
  {
@@ -202,6 +157,72 @@ void nfs_proxy_t::run(json11::Json cfg)
  }
  // Check default pool
  check_default_pool();
+ // Check if we're using VitastorFS
+ fs_kv_inode = cfg["fs"].uint64_value();
+ if (fs_kv_inode)
+ {
+ if (!INODE_POOL(fs_kv_inode))
+ {
+ fprintf(stderr, "FS metadata inode number must include pool\n");
+ exit(1);
+ }
+ }
+ else if (cfg["fs"].is_string())
+ {
+ for (auto & ic: cli->st_cli.inode_config)
+ {
+ if (ic.second.name == cfg["fs"].string_value())
+ {
+ fs_kv_inode = ic.first;
+ break;
+ }
+ }
+ if (!fs_kv_inode)
+ {
+ fprintf(stderr, "FS metadata image \"%s\" does not exist\n", cfg["fs"].string_value().c_str());
+ exit(1);
+ }
+ }
+ readdir_getattr_parallel = cfg["readdir_getattr_parallel"].uint64_value();
+ if (!readdir_getattr_parallel)
+ readdir_getattr_parallel = 8;
+ id_alloc_batch_size = cfg["id_alloc_batch_size"].uint64_value();
+ if (!id_alloc_batch_size)
+ id_alloc_batch_size = 200;
+ if (fs_kv_inode)
+ {
+ // Open DB and wait
+ int open_res = 0;
+ bool open_done = false;
+ db = new kv_dbw_t(cli);
+ db->open(fs_kv_inode, cfg, [&](int res)
+ {
+ open_done = true;
+ open_res = res;
+ });
+ while (!open_done)
+ {
+ ringloop->loop();
+ if (open_done)
+ break;
+ ringloop->wait();
+ }
+ if (open_res < 0)
+ {
+ fprintf(stderr, "Failed to open key/value filesystem metadata index: %s (code %d)\n",
+ strerror(-open_res), open_res);
+ exit(1);
+ }
+ fs_base_inode = ((uint64_t)default_pool_id << (64-POOL_ID_BITS));
+ fs_inode_count = ((uint64_t)1 << (64-POOL_ID_BITS)) - 1;
+ shared_inode_threshold = pool_block_size;
+ if (!cfg["shared_inode_threshold"].is_null())
+ {
+ shared_inode_threshold = cfg["shared_inode_threshold"].uint64_value();
+ }
+ kvfs = new kv_fs_state_t;
+ kvfs->zero_block.resize(pool_block_size);
+ }
  // Self-register portmap and NFS
  pmap.reg_ports.insert((portmap_id_t){
  .prog = PMAP_PROGRAM,
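The `fs_base_inode` and `fs_inode_count` assignments in this hunk place the pool ID in the top `POOL_ID_BITS` bits of a 64-bit inode number and leave the remaining bits for the file's own number. The sketch below illustrates that layout. It assumes 16 pool ID bits purely for the example (the real value comes from the project's headers), and the `demo_` helpers are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumption for this sketch only; the real constant is POOL_ID_BITS
static const int DEMO_POOL_ID_BITS = 16;

// First inode number belonging to the pool (pool ID in the top bits)
uint64_t demo_fs_base_inode(uint64_t pool_id)
{
    return pool_id << (64 - DEMO_POOL_ID_BITS);
}

// Mask and count of inode numbers available inside one pool
uint64_t demo_fs_inode_count()
{
    return ((uint64_t)1 << (64 - DEMO_POOL_ID_BITS)) - 1;
}

// Extracts the pool ID back from a full inode number
uint64_t demo_inode_pool(uint64_t inode)
{
    return inode >> (64 - DEMO_POOL_ID_BITS);
}
```

With this layout, a per-filesystem inode number such as `st->ino` is turned into a cluster-wide one by adding the base, as in `fs_base_inode + st->ino` in the write path.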
@@ -275,9 +296,13 @@ void nfs_proxy_t::run(json11::Json cfg)
  }
  // Destroy the client
  cli->flush();
+ delete kvfs;
+ delete db;
  delete cli;
  delete epmgr;
  delete ringloop;
+ kvfs = NULL;
+ db = NULL;
  cli = NULL;
  epmgr = NULL;
  ringloop = NULL;
@@ -382,8 +407,11 @@ void nfs_proxy_t::check_default_pool()
  {
  if (cli->st_cli.pool_config.size() == 1)
  {
- default_pool = cli->st_cli.pool_config.begin()->second.name;
- default_pool_id = cli->st_cli.pool_config.begin()->first;
+ auto pool_it = cli->st_cli.pool_config.begin();
+ default_pool_id = pool_it->first;
+ default_pool = pool_it->second.name;
+ pool_block_size = pool_it->second.pg_stripe_size;
+ pool_alignment = pool_it->second.bitmap_granularity;
  }
  else
  {
@@ -398,6 +426,8 @@ void nfs_proxy_t::check_default_pool()
  if (p.second.name == default_pool)
  {
  default_pool_id = p.first;
+ pool_block_size = p.second.pg_stripe_size;
+ pool_alignment = p.second.bitmap_granularity;
  break;
  }
  }
@@ -421,6 +451,10 @@ void nfs_proxy_t::do_accept(int listen_fd)
  int one = 1;
  setsockopt(nfs_fd, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
  auto cli = new nfs_client_t();
+ if (fs_kv_inode)
+ nfs_kv_procs(cli);
+ else
+ nfs_block_procs(cli);
  cli->parent = this;
  cli->nfs_fd = nfs_fd;
  for (auto & fn: pmap.proc_table)
@@ -968,8 +1002,8 @@ void nfs_proxy_t::daemonize()
  close(1);
  close(2);
  open("/dev/null", O_RDONLY);
- open("/dev/null", O_WRONLY);
- open("/dev/null", O_WRONLY);
+ open(logfile.c_str(), O_WRONLY|O_APPEND|O_CREAT, 0666);
+ open(logfile.c_str(), O_WRONLY|O_APPEND|O_CREAT, 0666);
  }
  int main(int narg, const char *args[])


@@ -4,17 +4,18 @@
  #include "epoll_manager.h"
  #include "nfs_portmap.h"
  #include "nfs/xdr_impl.h"
+ #include "kv_db.h"
+ #define NFS_ROOT_HANDLE "R"
  #define RPC_INIT_BUF_SIZE 32768
+ #define MAX_REQUEST_SIZE 128*1024*1024
+ #define TRUE 1
+ #define FALSE 0
  class cli_tool_t;
- struct nfs_dir_t
- {
- uint64_t id;
- uint64_t mod_rev;
- timespec mtime;
- };
+ struct kv_fs_state_t;
+ struct block_fs_state_t;
  class nfs_proxy_t
  {
@@ -27,28 +28,29 @@ public:
  std::string export_root;
  bool portmap_enabled;
  unsigned nfs_port;
+ uint64_t fs_kv_inode = 0;
+ uint64_t fs_base_inode = 0;
+ uint64_t fs_inode_count = 0;
+ int readdir_getattr_parallel = 8, id_alloc_batch_size = 200;
+ int trace = 0;
+ std::string logfile = "/dev/null";
  pool_id_t default_pool_id;
+ uint64_t pool_block_size = 0;
+ uint64_t pool_alignment = 0;
+ uint64_t shared_inode_threshold = 0;
  portmap_service_t pmap;
  ring_loop_t *ringloop = NULL;
  epoll_manager_t *epmgr = NULL;
  cluster_client_t *cli = NULL;
  cli_tool_t *cmd = NULL;
+ kv_dbw_t *db = NULL;
+ kv_fs_state_t *kvfs = NULL;
+ block_fs_state_t *blockfs = NULL;
  std::vector<XDR*> xdr_pool;
- // filehandle = "S"+base64(sha256(full name with prefix)) or "roothandle" for mount root)
- uint64_t next_dir_id = 2;
- // filehandle => dir with name_prefix
- std::map<std::string, std::string> dir_by_hash;
- // dir with name_prefix => dir info
- std::map<std::string, nfs_dir_t> dir_info;
- // filehandle => inode ID
- std::map<std::string, inode_t> inode_by_hash;
- // inode ID => filehandle
- std::map<inode_t, std::string> hash_by_inode;
  // inode ID => statistics
  std::map<inode_t, json11::Json> inode_stats;
  // pool ID => statistics
@@ -86,28 +88,6 @@ struct rpc_free_buffer_t
  unsigned size;
  };
- struct extend_size_t
- {
- inode_t inode;
- uint64_t new_size;
- };
- inline bool operator < (const extend_size_t &a, const extend_size_t &b)
- {
- return a.inode < b.inode || a.inode == b.inode && a.new_size < b.new_size;
- }
- struct extend_write_t
- {
- rpc_op_t *rop;
- int resize_res, write_res; // 1 = started, 0 = completed OK, -errno = completed with error
- };
- struct extend_inode_t
- {
- uint64_t cur_extend = 0, next_extend = 0;
- };
  class nfs_client_t
  {
  public:
@@ -122,8 +102,6 @@ public:
  rpc_cur_buffer_t cur_buffer = { 0 };
  std::map<uint8_t*, rpc_used_buffer_t> used_buffers;
  std::vector<rpc_free_buffer_t> free_buffers;
- std::map<inode_t, extend_inode_t> extends;
- std::multimap<extend_size_t, extend_write_t> extend_writes;
  iovec read_iov;
  msghdr read_msg = { 0 };
@@ -133,9 +111,6 @@ public:
  std::vector<iovec> send_list, next_send_list;
  std::vector<rpc_op_t*> outbox, next_outbox;
- nfs_client_t();
- ~nfs_client_t();
  void select_read_buffer(unsigned wanted_size);
  void submit_read(unsigned wanted_size);
  void handle_read(int result);


@@ -239,6 +239,7 @@ class osd_t
  void report_statistics();
  void report_pg_state(pg_t & pg);
  void report_pg_states();
+ void apply_no_inode_stats();
  void apply_pg_count();
  void apply_pg_config();


@@ -390,7 +390,16 @@ void osd_t::on_change_etcd_state_hook(std::map<std::string, etcd_kv_t> & changes
  }
  if (run_primary)
  {
- apply_pg_count();
+ bool pools = changes.find(st_cli.etcd_prefix+"/config/pools") != changes.end();
+ bool pgs = changes.find(st_cli.etcd_prefix+"/config/pgs") != changes.end();
+ if (pools)
+ {
+ apply_no_inode_stats();
+ }
+ if (pools || pgs)
+ {
+ apply_pg_count();
+ }
  apply_pg_config();
  }
  }
@@ -602,11 +611,32 @@ void osd_t::on_load_pgs_hook(bool success)
  else
  {
  peering_state &= ~OSD_LOADING_PGS;
- apply_pg_count();
- apply_pg_config();
+ if (run_primary)
+ {
+ apply_no_inode_stats();
+ apply_pg_count();
+ apply_pg_config();
+ }
  }
  }
+ void osd_t::apply_no_inode_stats()
+ {
+ if (!bs)
+ {
+ return;
+ }
+ std::vector<uint64_t> no_inode_stats;
+ for (auto & pool_item: st_cli.pool_config)
+ {
+ if (pool_item.second.no_inode_stats)
+ {
+ no_inode_stats.push_back(pool_item.first);
+ }
+ }
+ bs->set_no_inode_stats(no_inode_stats);
+ }
  void osd_t::apply_pg_count()
  {
  for (auto & pool_item: st_cli.pool_config)


@@ -68,3 +68,5 @@ SCHEME=xor ./test_scrub.sh
  PG_SIZE=3 ./test_scrub.sh
  PG_SIZE=6 PG_MINSIZE=4 OSD_COUNT=6 SCHEME=ec ./test_scrub.sh
  SCHEME=ec ./test_scrub.sh
+ ./test_nfs.sh

167
tests/test_nfs.sh Executable file

@@ -0,0 +1,167 @@
#!/bin/bash -ex
PG_COUNT=16
. `dirname $0`/run_3osds.sh
build/src/vitastor-cli --etcd_address $ETCD_URL create -s 10G fsmeta
build/src/vitastor-nfs --fs fsmeta --etcd_address $ETCD_URL --portmap 0 --port 2050 --foreground 1 --trace 1 >>./testdata/nfs.log 2>&1 &
NFS_PID=$!
mkdir -p testdata/nfs
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
MNT=$(pwd)/testdata/nfs
trap "sudo umount -f $MNT"' || true; kill -9 $(jobs -p)' EXIT
# write small file
ls -l ./testdata/nfs
dd if=/dev/urandom of=./testdata/f1 bs=100k count=1
cp testdata/f1 ./testdata/nfs/
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
ls -l ./testdata/nfs | grep f1
diff ./testdata/f1 ./testdata/nfs/f1
format_green "100K file ok"
# overwrite it inplace
dd if=/dev/urandom of=./testdata/f1_90k bs=90k count=1
cp testdata/f1_90k ./testdata/nfs/f1
sudo umount ./testdata/nfs/
format_green "inplace overwrite 90K ok"
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
ls -l ./testdata/nfs | grep f1
# create another copy
dd if=./testdata/f1_90k of=./testdata/nfs/f1_nfs bs=1M
diff ./testdata/f1_90k ./testdata/nfs/f1_nfs
sudo umount ./testdata/nfs/
format_green "another copy 90K ok"
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
ls -l ./testdata/nfs | grep f1
cp ./testdata/nfs/f1 ./testdata/f1_nfs
diff ./testdata/f1_90k ./testdata/nfs/f1
format_green "90K data ok"
# test partial shared overwrite
dd if=/dev/urandom of=./testdata/f1_90k bs=9317 count=1 seek=5 conv=notrunc
dd if=./testdata/f1_90k of=./testdata/nfs/f1 bs=9317 count=1 skip=5 seek=5 conv=notrunc
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
diff ./testdata/f1_90k ./testdata/nfs/f1
format_green "partial inplace shared overwrite ok"
# move it to a larger shared space
dd if=/dev/urandom of=./testdata/f1_110k bs=110k count=1
cp testdata/f1_110k ./testdata/nfs/f1
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
ls -l ./testdata/nfs | grep f1
diff ./testdata/f1_110k ./testdata/nfs/f1
format_green "move shared 90K -> 110K ok"
# extend it to large file + rm
dd if=/dev/urandom of=./testdata/f1_2M bs=2M count=1
cp ./testdata/f1_2M ./testdata/nfs/f1
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
ls -l ./testdata/nfs | grep f1
cp ./testdata/nfs/f1 ./testdata/f1_nfs
diff ./testdata/f1_2M ./testdata/nfs/f1
rm ./testdata/nfs/f1
format_green "extend to 2M + rm ok"
# mkdir
mkdir -p ./testdata/nfs/dir1/dir2
echo abcdef > ./testdata/nfs/dir1/dir2/hnpfls
# rename dir
mv ./testdata/nfs/dir1 ./testdata/nfs/dir3
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
ls -l ./testdata/nfs | grep dir3
ls -l ./testdata/nfs/dir3 | grep dir2
ls -l ./testdata/nfs/dir3/dir2 | grep hnpfls
echo abcdef > ./testdata/hnpfls
diff ./testdata/hnpfls ./testdata/nfs/dir3/dir2/hnpfls
format_green "rename dir with file ok"
# touch
touch -t 202401011404 ./testdata/nfs/dir3/dir2/hnpfls
sudo chown 65534:65534 ./testdata/nfs/dir3/dir2/hnpfls
sudo chmod 755 ./testdata/nfs/dir3/dir2/hnpfls
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
T=`stat -c '%a %u %g %y' ./testdata/nfs/dir3/dir2/hnpfls | perl -pe 's/(:\d+)(.*)/$1/'`
[[ "$T" = "755 65534 65534 2024-01-01 14:04" ]]
format_green "set attrs ok"
# move dir
mv ./testdata/nfs/dir3/dir2 ./testdata/nfs/
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
ls -l ./testdata/nfs | grep dir3
ls -l ./testdata/nfs | grep dir2
format_green "move dir ok"
# symlink, readlink
ln -s dir2 ./testdata/nfs/sym2
[[ "`stat -c '%A' ./testdata/nfs/sym2`" = "lrwxrwxrwx" ]]
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
[[ "`stat -c '%A' ./testdata/nfs/sym2`" = "lrwxrwxrwx" ]]
[[ "`readlink ./testdata/nfs/sym2`" = "dir2" ]]
format_green "symlink, readlink ok"
# mknod: chr, blk, sock, fifo + remove
sudo mknod ./testdata/nfs/nod_chr c 1 5
sudo mknod ./testdata/nfs/nod_blk b 2 6
mkfifo ./testdata/nfs/nod_fifo
perl -e 'use Socket; socket($sock, PF_UNIX, SOCK_STREAM, undef) || die $!; bind($sock, sockaddr_un("./testdata/nfs/nod_sock")) || die $!;'
chmod 777 ./testdata/nfs/nod_*
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
[[ "`ls ./testdata/nfs|wc -l`" -ge 4 ]]
[[ "`stat -c '%A' ./testdata/nfs/nod_blk`" = "brwxrwxrwx" ]]
[[ "`stat -c '%A' ./testdata/nfs/nod_chr`" = "crwxrwxrwx" ]]
[[ "`stat -c '%A' ./testdata/nfs/nod_fifo`" = "prwxrwxrwx" ]]
[[ "`stat -c '%A' ./testdata/nfs/nod_sock`" = "srwxrwxrwx" ]]
sudo rm ./testdata/nfs/nod_*
format_green "mknod + rm ok"
# hardlink
echo ABCDEF > ./testdata/nfs/linked1
i=`stat -c '%i' ./testdata/nfs/linked1`
ln ./testdata/nfs/linked1 ./testdata/nfs/linked2
[[ "`stat -c '%i' ./testdata/nfs/linked2`" -eq $i ]]
echo BABABA > ./testdata/nfs/linked2
diff ./testdata/nfs/linked2 ./testdata/nfs/linked1
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
diff ./testdata/nfs/linked2 ./testdata/nfs/linked1
[[ "`cat ./testdata/nfs/linked2`" = "BABABA" ]]
rm ./testdata/nfs/linked2
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
[[ "`cat ./testdata/nfs/linked1`" = "BABABA" ]]
format_green "hardlink ok"
# rm small
ls -l ./testdata/nfs
dd if=/dev/urandom of=./testdata/nfs/smallfile bs=100k count=1
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
rm ./testdata/nfs/smallfile
if ls ./testdata/nfs | grep smallfile; then false; fi
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
if ls ./testdata/nfs | grep smallfile; then false; fi
format_green "rm small ok"
# rename over existing
echo ZXCVBN > ./testdata/nfs/over1
mv ./testdata/nfs/over1 ./testdata/nfs/linked2
sudo umount ./testdata/nfs/
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
if ls ./testdata/nfs | grep over1; then false; fi
[[ "`cat ./testdata/nfs/linked2`" = "ZXCVBN" ]]
[[ "`cat ./testdata/nfs/linked1`" = "BABABA" ]]
format_green "rename over existing file ok"
format_green OK