Filter a large number of prefixes with an input file #117

@digizeph

Description

At times, we need to filter on a large number of prefixes, more than the CLI prefix filter parameters can practically hold. We should allow users to provide an input file that lists the filtering criteria.

This may involve updating bgpkit-parser so that it performs well with a large number of filters. We could design a general filter file format (e.g. JSON) and accept it as input.

Concrete Use Cases

1. Prefix-list filtering from RIB extraction

The most common pattern that hits this limit is BGP outage/visibility investigation:

  1. Extract the prefix list for a target ASN from a pre-event RIB dump
  2. Use that prefix list to filter subsequent BGP update files

This is necessary because BGP withdrawal messages carry no AS path or origin ASN — you can only filter withdrawals by prefix. The current workaround is:

# Step 1: Extract prefix list from RIB
monocle parse -o <ASN> /tmp/rib_pre.gz 2>/dev/null | \
  cut -d'|' -f5 | sort -u > /tmp/prefixes.txt

# Step 2: Build comma-separated string
PFXS=$(cat /tmp/prefixes.txt | tr '\n' ',' | sed 's/,$//')

# Step 3: Pass as -p argument
monocle parse -p "$PFXS" /tmp/updates.gz

This breaks down at scale:

| ASN size | Approx. prefixes | -p arg size | Works? |
|---|---|---|---|
| Small ISP | ~200 | ~3.6 KB | Yes |
| Medium ISP | ~1,000 | ~18 KB | Yes |
| Large carrier | ~5,000 | ~90 KB | Marginal |
| Tier-1 / hyper | ~10,000+ | ~180 KB+ | Exceeds ARG_MAX on many systems |

On Linux, ARG_MAX is typically ~2 MB, but the limit for a single argument is much lower: the kernel caps each argument string at 128 KiB (MAX_ARG_STRLEN, with 4 KiB pages), and other systems have their own limits. Even below these limits, very long arguments cause performance issues in shell expansion.
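
A quick way to tell whether a given prefix list will fit is to measure the joined string before passing it to -p. The sketch below assumes Linux with 4 KiB pages, where a single argument is capped at 128 KiB (the kernel constant MAX_ARG_STRLEN). The function names are illustrative and not part of monocle.

```shell
# Per-argument cap on Linux with 4 KiB pages (32 pages). Assumption: other
# platforms and page sizes use different values.
MAX_ARG_STRLEN=131072

# Print the byte length of the comma-joined -p argument for a prefix file
# that holds one prefix per line.
pfx_arg_bytes() {
  paste -s -d, "$1" | tr -d '\n' | wc -c | tr -d ' '
}

# Succeed only if the joined string (plus its terminating NUL byte) fits
# under the per-argument cap.
fits_in_one_arg() {
  [ "$(pfx_arg_bytes "$1")" -lt "$MAX_ARG_STRLEN" ]
}

# Usage:
#   getconf ARG_MAX    # total budget for all arguments plus environment
#   fits_in_one_arg /tmp/prefixes.txt || echo "prefix list too long for -p"
```

Checking the size up front gives a clear message instead of an "Argument list too long" error from the shell.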

2. Country-level investigation

When investigating a country-level event, you may need to filter by all prefixes originated by ASNs in that country — potentially tens of thousands of prefixes. This is impractical with -p on the command line.

3. Repeated parsing with the same filter set

In a typical investigation, the same prefix list is applied to 10-40+ update files sequentially. Each invocation re-parses the comma-separated -p argument from scratch. A file-based input that's parsed once and reused across invocations would be more efficient.
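
Until a file-based flag exists, one stopgap is to parse without -p and filter the output against the prefix file afterwards. This is a workaround sketch, not the proposed feature: it does exact string matching on field 5 (the prefix column used by the cut -f5 step in the workaround above), so it does not catch more-specific or less-specific prefixes, and every record is parsed before being filtered. The function name is illustrative.

```shell
# Keep only pipe-delimited records read from stdin whose 5th field appears
# verbatim in the prefix file given as $1 (one prefix per line).
filter_by_prefix_file() {
  awk -F'|' '
    NR == FNR { if ($0 != "") want[$0] = 1; next }  # first input: prefix file
    $5 in want                                      # second input: stdin records
  ' "$1" -
}

# Usage (sketch): one prefix file reused across many update files.
#   for f in /tmp/updates_*.gz; do
#     monocle parse "$f" | filter_by_prefix_file /tmp/prefixes.txt
#   done
```

Because the prefix list is read from a file rather than passed as an argument, its size is not bound by the command-line limits described earlier.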

Proposed Design

Filter file format (JSON)

{
  "prefixes": ["192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32"],
  "origin_asns": [64496, 64497],
  "peer_asns": [174, 6939],
  "as_path_regex": "174 64496$",
  "elem_type": "w",
  "communities": ["64496:100", "64496:200"]
}

All fields are optional; when multiple fields are present, they combine with AND logic (same as the existing CLI filters).
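
To connect this format with the RIB-extraction workflow, a newline-delimited prefix list could be converted into a filter file with standard tools. The sketch below assumes the proposed "prefixes" field is adopted as shown above; the function name is hypothetical, and since the format is only a proposal, the output shape may change.

```shell
# Convert a newline-delimited prefix list ($1) into a JSON object with a
# single "prefixes" array. Blank lines are skipped. Assumption: each line
# holds only a prefix, so no JSON string escaping is needed.
prefixes_to_filter_json() {
  awk '
    BEGIN { printf "{\"prefixes\": [" }
    NF    { printf "%s\"%s\"", (n++ ? ", " : ""), $1 }
    END   { print "]}" }
  ' "$1"
}

# Usage:
#   prefixes_to_filter_json /tmp/prefixes.txt > /tmp/filters.json
```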

CLI integration

# Use filter file instead of CLI flags
monocle parse --filter-file /tmp/filters.json /tmp/updates.gz
monocle search --filter-file /tmp/filters.json -t 2025-09-01T12:00:00Z -d 2h

# Filter file can be combined with CLI flags (AND logic)
monocle parse --filter-file /tmp/filters.json -c rrc00 /tmp/updates.gz

Alternative: plain text prefix list

For the most common case (prefix-only filtering), also support a simple newline-delimited file:

# One prefix per line
monocle parse --prefix-file /tmp/prefixes.txt /tmp/updates.gz

This is the most ergonomic option for the RIB-extract-then-filter-updates workflow, since monocle parse -o <ASN> rib.gz | cut -d'|' -f5 | sort -u already produces a newline-delimited prefix list.

Interaction with #82

If #82 adds RIB snapshot queries with --sqlite-path output, the extracted prefix list could be queried from SQLite and fed as a filter file to subsequent update analysis — replacing the current fragile shell pipeline.
