sample_inverted_index

Usage

A tool for sampling an inverted index.
Usage: ../../../build/bin/sample_inverted_index [OPTIONS]

Options:
  -h,--help                   Print this help message and exit
  -L,--log-level TEXT:{critical,debug,err,info,off,trace,warn} [info] 
                              Log level
  --config                    Configuration .ini file
  -c,--collection TEXT REQUIRED
                              Input collection basename
  -o,--output TEXT REQUIRED   Output collection basename
  -r,--rate FLOAT REQUIRED    Sampling rate (proportional size of the output index)
  -t,--type TEXT REQUIRED     Sampling type
  --terms-to-drop TEXT        A filename containing a list of term IDs that we want to drop
  --seed UINT                 Seed state

Description

Creates a smaller inverted index from an existing one by sampling postings or documents. The sampled index takes less time and space to process while preserving the main statistical properties of the original collection, which makes it useful for faster experiments and debugging.

Sampling strategy (-t, --type)

  • random_postings: keep a random subset of the postings within each posting list (the sampling unit is the individual posting, not the whole posting list).
  • random_docids: keep all postings belonging to a random subset of documents.
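
The difference between the two strategies can be illustrated with a minimal Python sketch on a toy in-memory index. This is not the tool's implementation: the function names, the dict-of-lists index layout, the rounding rule, and the choice to keep at least one posting per non-empty list are all assumptions made for illustration.

```python
import random

# Toy index (hypothetical layout): term -> sorted list of document IDs.
toy_index = {"a": [0, 1, 2, 3], "b": [1, 3], "c": [2]}


def sample_random_postings(index, rate, seed=0):
    """Sample postings independently inside every posting list."""
    rng = random.Random(seed)
    sampled = {}
    for term, postings in index.items():
        # Assumption: keep at least one posting of each non-empty list.
        keep = min(len(postings), max(1, round(rate * len(postings))))
        positions = sorted(rng.sample(range(len(postings)), keep))
        sampled[term] = [postings[i] for i in positions]
    return sampled


def sample_random_docids(index, rate, seed=0):
    """Pick a random subset of documents and keep only their postings."""
    rng = random.Random(seed)
    docids = sorted({doc for postings in index.values() for doc in postings})
    keep = min(len(docids), max(1, round(rate * len(docids))))
    kept = set(rng.sample(docids, keep))
    # In this sketch a term whose documents were all dropped keeps an empty list.
    return {
        term: [doc for doc in postings if doc in kept]
        for term, postings in index.items()
    }


by_postings = sample_random_postings(toy_index, rate=0.5, seed=1)
by_docids = sample_random_docids(toy_index, rate=0.5, seed=1)
```

In the sketch, sampling by postings shrinks every list on its own, so each term survives with a shorter list. Sampling by document IDs removes the same documents from every list, so a posting list can shrink a lot, a little, or become empty depending on which documents were kept.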

Examples

Keep ~25% of the postings

sample_inverted_index \
    -c path/to/inverted \
    -o path/to/inverted.sampled \
    -r 0.25 \
    -t random_postings

Keep ~25% of the documents

sample_inverted_index \
    -c path/to/inverted \
    -o path/to/inverted.sampled \
    -r 0.25 \
    -t random_docids