queries

Usage

Benchmarks queries on a given index.
Usage: ../../../build/bin/queries [OPTIONS]

Options:
  -h,--help                   Print this help message and exit
  -e,--encoding TEXT REQUIRED Index encoding
  -i,--index TEXT REQUIRED    Inverted index filename
  -w,--wand TEXT              WAND data filename
  --compressed-wand Needs: --wand
                              Compressed WAND data file
  --tokenizer TEXT:{english,whitespace} [english] 
                              Tokenizer
  -H,--html                   Strip HTML
  -F,--token-filters TEXT:{krovetz,lowercase,porter2} ...
                              Token filters
  --stopwords TEXT            Path to file containing a list of stop words to filter out
  -q,--queries TEXT           Path to file with queries
  --terms TEXT                Term lexicon
  --weighted                  Weights scores by query frequency
  -k INT REQUIRED             The number of top results to return
  -a,--algorithm TEXT REQUIRED
                              Query processing algorithm
  -s,--scorer TEXT REQUIRED   Scorer function
  --bm25-k1 FLOAT Needs: --scorer
                              BM25 k1 parameter.
  --bm25-b FLOAT Needs: --scorer
                              BM25 b parameter.
  --pl2-c FLOAT Needs: --scorer
                              PL2 c parameter.
  --qld-mu FLOAT Needs: --scorer
                              QLD mu parameter.
  -T,--thresholds TEXT        File containing query thresholds
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn} [info] 
                              Log level
  --config                    Configuration .ini file
  --quantized                 Quantized scores
  --extract                   Extract individual query times
  --safe Needs: --thresholds  Rerun if not enough results with pruning.

Description

Runs query benchmarks.

Executes each query on the given index multiple times and takes the minimum running time as that query's final value. It then aggregates statistics across all queries.
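The measurement scheme described above can be sketched in a few lines of Python. This is only an illustration, not the tool's actual implementation: the function names `benchmark` and `summarize`, the default of three runs, and the particular statistics reported are all assumptions made for the example.

```python
import statistics
import time


def benchmark(queries, run_query, runs=3):
    """Return the best (minimum) wall-clock time per query, in microseconds.

    `run_query` is a hypothetical callable that executes a single query;
    `runs` is an assumed repetition count, not the tool's real setting.
    """
    best_times = []
    for query in queries:
        timings = []
        for _ in range(runs):
            start = time.perf_counter()
            run_query(query)
            timings.append((time.perf_counter() - start) * 1e6)
        # The minimum is the run least disturbed by noise (caches, scheduling).
        best_times.append(min(timings))
    return best_times


def summarize(best_times):
    """Aggregate the per-query minimums into a few summary statistics."""
    ordered = sorted(best_times)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "max": ordered[-1],
    }
```

Taking the minimum rather than the mean of the repeated runs is a common choice in benchmarks, because interference from the rest of the system can only make a run slower, never faster.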

Input

This program takes a compressed index as its input, along with a file containing the queries, one per line. Note that you need to specify the correct index encoding with the --encoding option, as this is currently not stored in the index. If the index is quantized, you must pass the --quantized flag.

For certain types of retrieval algorithms, you will also need to pass the so-called "WAND file" with --wand, which contains some metadata like skip lists and max scores.

Query Parsing

There are several parameters you can define to instruct the program on how to parse and process the input queries, including which tokenizer to use, whether to strip HTML from the query, and a list of token filters (such as stemmers). For a more comprehensive description, see parse_collection.

You can also pass a file containing stop words with --stopwords; these will be excluded from the parsed queries.

For the parsing to actually take place, you must also provide the term lexicon with --terms. If it is not defined, the queries will be interpreted as lists of term IDs.
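The two parsing modes can be illustrated with a small Python sketch. It is a simplification under stated assumptions: `parse_query` is a hypothetical helper, tokenization is plain whitespace splitting with lowercasing as the only token filter, and terms missing from the lexicon are silently dropped, which may not match what the program does.

```python
def parse_query(line, lexicon=None, stopwords=frozenset()):
    """Turn one query line into a list of numeric IDs.

    `lexicon` is a hypothetical dict mapping a term to its numeric ID.
    Without it, the line is read as whitespace-separated numeric IDs.
    """
    tokens = line.split()
    if lexicon is None:
        # No lexicon: the tokens are already IDs, so no text processing applies.
        return [int(token) for token in tokens]
    ids = []
    for token in tokens:
        term = token.lower()  # stand-in for the configured token filters
        if term in stopwords:
            continue  # stop words are excluded from the parsed query
        if term in lexicon:  # assumption: unknown terms are dropped
            ids.append(lexicon[term])
    return ids
```

For example, with a lexicon of `{"hello": 0, "world": 1}` and the stop word `the`, the line `Hello the World` resolves to `[0, 1]`, while the line `3 1 4` parsed without a lexicon is returned as `[3, 1, 4]`.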

Algorithm

You can specify which retrieval algorithm to use with --algorithm. The -k option defines how many results to retrieve for each query.

Scoring

Use the --scorer option to define which scoring function you want to use (bm25, dph, pl2, qld). Some scoring functions have additional parameters that you may override; see the help message above.

Thresholds

You can also pass a file with a list of initial score thresholds, one per query, using --thresholds. Any document that evaluates to a score below its query's threshold will be excluded. This can speed up the algorithm, but if the threshold is too high, it may exclude some of the relevant top-k results. If you want to ensure that the results are always the same as if the initial threshold were zero, you can pass the --safe flag. It forces the program to rerun the entire query without an initial threshold whenever it detects that relevant documents have been excluded. This may be useful if you have mostly accurate threshold estimates but still need the safety: even though some queries will be slower, most will be much faster, thus improving overall throughput and average latency.
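The interaction between an initial threshold and the --safe fallback can be sketched as follows. This is a toy model, not the program's retrieval code: it scores an in-memory list instead of traversing an index, the names `top_k` and `safe_top_k` are hypothetical, and "relevant documents have been excluded" is modeled as "fewer than k results came back", following the wording of the --safe help text.

```python
import heapq


def top_k(scored_docs, k, threshold=0.0):
    """Return up to k (score, doc) pairs, best first, with score >= threshold.

    `scored_docs` is a hypothetical iterable of (doc, score) pairs.
    Scores are assumed to be non-negative, so 0.0 excludes nothing.
    """
    candidates = [(score, doc) for doc, score in scored_docs if score >= threshold]
    return heapq.nlargest(k, candidates)


def safe_top_k(scored_docs, k, threshold):
    """Try the thresholded query first; rerun from zero if it comes up short."""
    results = top_k(scored_docs, k, threshold)
    if len(results) < k:
        # The threshold was too high for this query: pay for a second,
        # unthresholded pass to recover the true top-k.
        results = top_k(scored_docs, k)
    return results
```

With documents scored `[("a", 5.0), ("b", 3.0), ("c", 1.0)]`, k = 2, and a threshold of 4.0, the thresholded pass finds only `a`, so the safe variant reruns and returns both `a` and `b`. With a threshold of 2.0, the first pass already yields two results and no rerun is needed.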