sample_inverted_index
Usage
A tool for sampling an inverted index.
Usage: ../../../build/bin/sample_inverted_index [OPTIONS]
Options:
-h,--help Print this help message and exit
-L,--log-level TEXT:{critical,debug,err,info,off,trace,warn} [info]
Log level
--config Configuration .ini file
-c,--collection TEXT REQUIRED
Input collection basename
-o,--output TEXT REQUIRED Output collection basename
-r,--rate FLOAT REQUIRED Sampling rate (proportional size of the output index)
-t,--type TEXT REQUIRED Sampling type
--terms-to-drop TEXT A filename containing a list of term IDs that we want to drop
--seed UINT Seed state
Description
Creates a smaller inverted index from an existing one by sampling postings or documents. The sampled index takes less time and space to process while preserving the main statistical properties of the original collection, which makes it useful for faster experiments and debugging.
Sampling strategy (-t, --type)
random_postings: keep a random subset of the postings within each posting list (posting lists are thinned, not dropped whole).
random_docids: keep all postings belonging to a random subset of documents.
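To make the difference between the two strategies concrete, here is a minimal Python sketch on a toy in-memory index. It is only an illustration of the idea, not the tool's implementation: the dictionary layout, the function names `sample_random_postings` and `sample_random_docids`, the rounding, and the "keep at least one" rule are all assumptions made for this example.

```python
import random

# Toy inverted index: term ID -> sorted list of document IDs (a posting list).
# Hypothetical in-memory stand-in, not the tool's on-disk format.
index = {
    0: [0, 1, 2, 3, 4, 5, 6, 7],
    1: [1, 3, 5, 7],
    2: [0, 4],
}
num_docs = 8


def sample_random_postings(index, rate, seed=0):
    """Keep roughly `rate` of the postings in each list, chosen at random."""
    rng = random.Random(seed)
    sampled = {}
    for term, postings in index.items():
        # Assumption: keep at least one posting so no list becomes empty.
        keep = max(1, round(rate * len(postings)))
        sampled[term] = sorted(rng.sample(postings, keep))
    return sampled


def sample_random_docids(index, num_docs, rate, seed=0):
    """Keep every posting that belongs to a random subset of documents."""
    rng = random.Random(seed)
    # Assumption: keep at least one document.
    keep = max(1, round(rate * num_docs))
    kept_docs = set(rng.sample(range(num_docs), keep))
    sampled = {}
    for term, postings in index.items():
        # A list may end up empty if none of its documents were kept.
        sampled[term] = [doc for doc in postings if doc in kept_docs]
    return sampled, kept_docs


by_postings = sample_random_postings(index, rate=0.25, seed=42)
by_docids, kept = sample_random_docids(index, num_docs, rate=0.25, seed=42)
print("random_postings:", by_postings)
print("random_docids:  ", by_docids, "kept documents:", sorted(kept))
```

In this sketch, random_postings shrinks every list independently, so a document can survive in one list and vanish from another, whereas random_docids keeps the surviving documents intact across all lists. Fixing the seed makes the sample reproducible, which is presumably what the `--seed` option is for.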
Examples
Keep ~25% of postings
sample_inverted_index \
-c path/to/inverted \
-o path/to/inverted.sampled \
-r 0.25 \
-t random_postings
Keep ~25% of the documents
sample_inverted_index \
-c path/to/inverted \
-o path/to/inverted.sampled \
-r 0.25 \
-t random_docids