queries
Usage
Benchmarks queries on a given index.
Usage: ../../../build/bin/queries [OPTIONS]
Options:
  -h,--help                   Print this help message and exit
  -e,--encoding TEXT REQUIRED Index encoding
  -i,--index TEXT REQUIRED    Inverted index filename
  -w,--wand TEXT              WAND data filename
  --compressed-wand Needs: --wand
                              Compressed WAND data file
  --tokenizer TEXT:{english,whitespace} [english]
                              Tokenizer
  -H,--html                   Strip HTML
  -F,--token-filters TEXT:{krovetz,lowercase,porter2} ...
                              Token filters
  --stopwords TEXT            Path to file containing a list of stop words to filter out
  -q,--queries TEXT           Path to file with queries
  --terms TEXT                Term lexicon
  --weighted                  Weights scores by query frequency
  -k INT REQUIRED             The number of top results to return
  -a,--algorithm TEXT REQUIRED
                              Query processing algorithm
  -s,--scorer TEXT REQUIRED   Scorer function
  --bm25-k1 FLOAT Needs: --scorer
                              BM25 k1 parameter.
  --bm25-b FLOAT Needs: --scorer
                              BM25 b parameter.
  --pl2-c FLOAT Needs: --scorer
                              PL2 c parameter.
  --qld-mu FLOAT Needs: --scorer
                              QLD mu parameter.
  -T,--thresholds TEXT        File containing query thresholds
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn} [info]
                              Log level
  --config                    Configuration .ini file
  --quantized                 Quantized scores
  --extract                   Extract individual query times
  --safe Needs: --thresholds  Rerun if not enough results with pruning.
Description
Runs query benchmarks.
Executes each query on the given index multiple times and takes the minimum running time as that query's final value. It then aggregates statistics across all queries.
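The procedure above can be sketched as follows. This is a minimal illustration, not the program's implementation: the `run` callable, the number of repetitions, and the particular statistics reported (mean, median, max) are all assumptions made for the example.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_query(run: Callable[[str], object], query: str, repetitions: int) -> float:
    """Run one query `repetitions` times; return the fastest time in microseconds."""
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run(query)  # hypothetical callable that executes the query
        timings.append((time.perf_counter() - start) * 1e6)
    return min(timings)


def summarize(per_query_us: List[float]) -> Dict[str, float]:
    """Aggregate the per-query minimums into summary statistics."""
    ordered = sorted(per_query_us)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "max": ordered[-1],
    }
```

Taking the minimum rather than the mean of the repeated runs reduces the influence of one-off noise such as cold caches or scheduler interruptions.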
Input
This program takes a compressed index as its input, along with a file
containing the queries, one per line. Note that you need to specify the
correct index encoding with the --encoding option, as this is currently
not stored in the index. If the index is quantized, you must pass the
--quantized flag.
For certain types of retrieval algorithms, you will also need to pass the so-called "WAND file", which contains some metadata like skip lists and max scores.
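As a rough sketch, the input-related options could be assembled into an argument list like this. Only flags listed in the help message are used, and the "Needs: --wand" dependency of --compressed-wand is checked up front. The helper name, the file names, and the `ENCODING` and `ALGORITHM` placeholders are hypothetical; substitute values valid for your build.

```python
from typing import List, Optional


def build_queries_command(
    binary: str,
    encoding: str,
    index: str,
    algorithm: str,
    scorer: str,
    k: int,
    wand: Optional[str] = None,
    compressed_wand: bool = False,
    quantized: bool = False,
) -> List[str]:
    """Assemble an argument list using only flags from the help message."""
    if compressed_wand and wand is None:
        # Mirrors the "Needs: --wand" note in the help message.
        raise ValueError("--compressed-wand needs --wand")
    args = [binary, "--encoding", encoding, "--index", index,
            "--algorithm", algorithm, "--scorer", scorer, "-k", str(k)]
    if wand is not None:
        args += ["--wand", wand]
    if compressed_wand:
        args.append("--compressed-wand")
    if quantized:
        args.append("--quantized")
    return args


# Hypothetical file names and placeholder values; substitute your own.
command = build_queries_command(
    binary="queries", encoding="ENCODING", index="index.bin",
    algorithm="ALGORITHM", scorer="bm25", k=10, wand="index.wand",
)
```

The resulting list can be passed directly to `subprocess.run`, which avoids shell quoting issues with file paths.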
Query Parsing
There are several parameters you can define to instruct the program on
how to parse and process the input queries, including which tokenizer to
use, whether to strip HTML from the query, and a list of token filters
(such as stemmers). For a more comprehensive description, see
parse_collection.
You can also pass a file containing stop-words, which will be excluded from the parsed queries.
For the parsing to actually take place, you also need to provide the
term lexicon with --terms. If it is not defined, the queries will be
interpreted as lists of term IDs.
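To make the two modes concrete, here is a minimal sketch of both paths. It assumes a whitespace tokenizer, a lowercase filter, and an in-memory dictionary standing in for the term lexicon (the real lexicon is a file passed with --terms). Skipping terms that are missing from the lexicon is also an assumption of this example.

```python
from typing import Dict, List, Set


def parse_query(line: str, lexicon: Dict[str, int], stopwords: Set[str]) -> List[int]:
    """Whitespace-tokenize, lowercase, drop stop words, and map terms to IDs."""
    term_ids = []
    for token in line.split():
        term = token.lower()
        if term in stopwords:
            continue
        term_id = lexicon.get(term)
        if term_id is not None:  # assumption: out-of-lexicon terms are skipped
            term_ids.append(term_id)
    return term_ids


def parse_id_query(line: str) -> List[int]:
    """Without a lexicon, read the line as whitespace-separated numeric IDs."""
    return [int(token) for token in line.split()]
```

For example, with the lexicon `{"quick": 0, "brown": 1, "fox": 2}` and the stop word `the`, the query `The Quick brown FOX` would map to the IDs 0, 1, and 2.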
Algorithm
You can specify what retrieval algorithm to use with --algorithm.
Furthermore, the -k option defines how many results to retrieve for each
query.
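To illustrate what -k controls, the sketch below keeps the k highest-scoring documents from a stream of (document ID, score) pairs using a min-heap. It scores nothing itself and skips nothing: the algorithms selectable with --algorithm differ mainly in how they avoid evaluating every document, which this example does not attempt to show. The function name and input shape are assumptions.

```python
import heapq
from typing import Iterable, List, Tuple


def top_k(scored_docs: Iterable[Tuple[int, float]], k: int) -> List[Tuple[int, float]]:
    """Keep the k highest-scoring (doc_id, score) pairs using a min-heap."""
    if k <= 0:
        return []
    heap: List[Tuple[float, int]] = []  # smallest retained score sits at heap[0]
    for doc_id, score in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            # The new score beats the current k-th best, so replace it.
            heapq.heapreplace(heap, (score, doc_id))
    return [(doc_id, score) for score, doc_id in sorted(heap, reverse=True)]
```

Once the heap is full, its smallest score acts as a running threshold: any document scoring at or below it cannot enter the top k.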
Scoring
Use the --scorer option to define which scoring function you want to use
(bm25, dph, pl2, qld). Some scoring functions have additional parameters
that you may override; see the help message above.
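As an illustration of what the k1 and b parameters control, here is one common textbook formulation of the BM25 score for a single term. It is a sketch only: the program's exact formula may differ, and the default values used here are common textbook choices, not necessarily the program's defaults.

```python
import math


def bm25_term_score(
    tf: float,
    doc_len: float,
    avg_doc_len: float,
    num_docs: int,
    doc_freq: int,
    k1: float = 1.2,   # textbook default, an assumption for this sketch
    b: float = 0.75,   # textbook default, an assumption for this sketch
) -> float:
    """Textbook BM25 contribution of one term to one document's score."""
    # Inverse document frequency: rarer terms weigh more.
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: b = 0 ignores document length, b = 1 applies it fully.
    norm = tf + k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / norm
```

In this formulation, k1 controls how quickly the score saturates as the term frequency grows, and b controls how strongly long documents are penalized.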
Thresholds
You can also pass a file with a list of initial score thresholds. Any
document whose score falls below its query's threshold will be excluded.
This can speed up the algorithm, but if the threshold is too high, it
may exclude some of the relevant top-k results. If you want to ensure
that the results are always the same as if the initial threshold were
zero, you can pass the --safe flag. It forces the entire query to be
recomputed without an initial threshold whenever it is detected that
relevant documents have been excluded. This may be useful if you have
mostly accurate threshold estimates but still need the safety: even
though some queries will be slower, most will be much faster, thus
improving overall throughput and average latency.
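The fallback behavior can be sketched as follows. The `search` callable is a hypothetical stand-in for running one query with a given initial threshold, and the detection rule (fewer than k results came back) is an assumption based on the help text "Rerun if not enough results with pruning"; the program's actual check may be more involved.

```python
from typing import Callable, List, Tuple

Result = Tuple[int, float]  # (doc_id, score)


def search_safe(
    search: Callable[[float], List[Result]],
    k: int,
    initial_threshold: float,
) -> List[Result]:
    """Run with an initial threshold; rerun from zero if too few results come back."""
    results = search(initial_threshold)
    if len(results) < k and initial_threshold > 0.0:
        # The threshold may have pruned documents that belong in the top k,
        # so repeat the query without any initial threshold.
        results = search(0.0)
    return results
```

Only queries whose threshold was overestimated pay for the second run, which is why the overall cost stays low when most estimates are accurate.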