Sharding
We support partitioning a collection into a number of smaller subsets called shards.
Right now, only a forward index can be partitioned, by running the partition_fwd_index command.
For convenience, we provide the shards command, which supports certain bulk operations on all shards.
Partitioning a collection
We support two methods of partitioning: randomly, and by a defined mapping. For example, to partition a collection randomly:
# Use up to 8 threads (-j) and partition randomly into 123 shards (-r).
$ partition_fwd_index \
    -j 8 \
    -i full_index_prefix \
    -o shard_prefix \
    -r 123
Alternatively, the mapping can be defined by a set of files.
Let's assume we have a folder shard-titles with a set of text files.
Each file contains newline-delimited document titles (e.g., TREC-IDs) for one partition.
Then, one would call:
# Use up to 8 threads (-j) and take one mapping file per shard (-s).
$ partition_fwd_index \
    -j 8 \
    -i full_index_prefix \
    -o shard_prefix \
    -s shard-titles/*
Note that the names of the files passed with -s are ignored.
Instead, each shard is assigned a numerical ID from 0 to N - 1, in the order in which the files are passed on the command line.
Each resulting forward index then has .ID appended to its name prefix: shard_prefix.000, shard_prefix.001, and so on.
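The numbering rule can be illustrated with a short shell sketch. It is not part of the tool: shard_name and list_shard_assignment are hypothetical helpers, and the three-digit zero padding is an assumption inferred from the example names shard_prefix.000 and shard_prefix.001. Because the shell expands shard-titles/* in sorted order, that order is what determines the IDs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the forward index name for a given shard ID.
# Assumes three-digit zero padding, as in shard_prefix.000.
shard_name() {
    local prefix="$1" id="$2"
    printf '%s.%03d\n' "$prefix" "$id"
}

# Hypothetical helper: prints which output name each mapping file
# corresponds to, in the order the files are passed.
list_shard_assignment() {
    local prefix="$1"
    shift
    local id=0
    local file
    for file in "$@"; do
        printf '%s -> %s\n' "$file" "$(shard_name "$prefix" "$id")"
        id=$((id + 1))
    done
}
```

Calling list_shard_assignment shard_prefix shard-titles/* before partitioning shows which file would end up in which shard, which helps when the original file names carry meaning that the numeric IDs do not.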
Working with shards
The shards tool performs some index operations in bulk on all shards at once.
At the moment, the following subcommands are supported:
- invert,
- compress,
- wand-data, and
- reorder-docids.
All input and output paths passed to the subcommands are expanded for each individual shard, either by appending .<shard-id> (e.g., .000) or, if the substring {} is present, by substituting the shard number for it. For example:
shards reorder-docids --by-url \
-c inv \
-o inv.url \
--documents fwd.{}.doclex \
--reordered-documents fwd.url.{}.doclex
is equivalent to running the following command for every shard XYZ:
reorder-docids --by-url \
-c inv.XYZ \
-o inv.url.XYZ \
--documents fwd.XYZ.doclex \
--reordered-documents fwd.url.XYZ.doclex
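The expansion rule itself can be sketched as a small shell function. This is only an illustration of the behavior described above, not the tool's implementation: expand_shard_path is a hypothetical name, and replacing every occurrence of {} is an assumption, since the text does not say what happens when the placeholder appears more than once.

```shell
#!/usr/bin/env bash
# Hypothetical helper illustrating the path expansion rule.
# Assumption: every occurrence of {} is replaced by the shard ID.
expand_shard_path() {
    local path="$1" shard="$2"
    if [[ "$path" == *"{}"* ]]; then
        # Placeholder present: substitute the shard ID in place.
        printf '%s\n' "${path//\{\}/$shard}"
    else
        # No placeholder: append .<shard-id> to the path.
        printf '%s\n' "$path.$shard"
    fi
}
```

For shard 000, expand_shard_path inv 000 gives inv.000, while expand_shard_path 'fwd.{}.doclex' 000 gives fwd.000.doclex, matching the two forms used in the example.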