OSINT Processing Pipeline

The 6-stage intelligence processing pipeline inside aesop_intell — from raw document ingestion to structured entity graphs.

Pipeline Stages

[Pipeline diagram] Stage 0 Ingestion (FREE) → Stage 1 Extraction (FREE) → Stage 2 Language (LOW) → Stage 3 Relevance Gating (ZERO: the critical cost filter; rejected documents never reach NER) → Stage 4 NER, Mistral Small LLM (HIGH: the cost spike) → Stage 5 Classification & Geo (LOW) → Embedding, local sentence-transformer (FREE) → Entity Graph.

Pipeline Stage Reference

| Stage | Process | Cost | Technology | Key Service File |
|-------|---------|------|------------|------------------|
| 0 | Ingestion: normalize raw input to Document model, assign channel provenance | FREE | Django ORM | services/ingestion.py |
| 1 | Extraction: clean HTML/PDF, chunk text, extract title and date metadata | FREE | BeautifulSoup, pdfplumber | services/extraction.py |
| 2 | Language: detect language, auto-translate non-default-language documents | LOW | langdetect, LLM (conditional) | services/language.py |
| 3 | Relevance Gating: keyword-weighted relevance scoring, reject irrelevant documents | ZERO | Keyword matching (no LLM) | services/relevance.py |
| 4 | NER: LLM-based entity extraction (6 types) plus relation extraction | HIGH | Mistral Small (per-document) | services/ner.py |
| 5 | Classification & Geo: domain/context assignment, location resolution, H3 indexing | LOW | H3, geocoder | services/classification.py |
| + | Embedding: vector generation via sentence-transformer for similarity search | FREE | sentence-transformers (local) | services/embedding.py |
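
The stage sequencing in the table can be sketched as a single processing function. Every stage below is a stub standing in for the real service module named in the table; the field names on the document dict are illustrative assumptions, not the actual aesop_intell Document model.

```python
# Hypothetical sketch of the 6-stage flow; real logic lives in services/*.py.

RELEVANT_KEYWORDS = {"convoy", "militia", "checkpoint"}  # illustrative gate terms

def is_relevant(doc):
    """Stage 3: zero-cost keyword gate -- no LLM call, no API cost."""
    text = doc.get("text", "").lower()
    return any(kw in text for kw in RELEVANT_KEYWORDS)

def process_document(doc):
    doc.setdefault("channel", "unknown")      # Stage 0: ingestion, provenance (stub)
    doc["text"] = doc.get("raw", "").strip()  # Stage 1: extraction (stub)
    doc.setdefault("lang", "en")              # Stage 2: language detection (stub)
    if not is_relevant(doc):                  # Stage 3: gate sits right before the LLM
        return None                           # rejected documents never reach NER
    doc["entities"] = []                      # Stage 4: per-document LLM NER (stub, HIGH cost)
    doc["domain"] = "general"                 # Stage 5: classification/geo (stub)
    doc["embedding"] = [0.0] * 384            # local sentence-transformer (stub, FREE)
    return doc
```

Note that the early `return None` is what makes Stage 4's cost scale with relevant documents rather than ingested documents.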

Why Relevance Gating is the Most Important Architectural Decision

Stage 3 (Relevance Gating) is the single most impactful design choice in the pipeline.

The NER stage (Stage 4) is the only stage that requires per-document LLM inference via Mistral Small. Every document that reaches Stage 4 incurs a real computational cost. Relevance gating sits immediately before NER and acts as a zero-cost filter — it uses simple keyword-weighted scoring with no LLM calls, no API costs, and negligible compute.

Documents rejected at Stage 3 never reach NER. This means the entire cost of the pipeline scales not with the number of ingested documents, but with the number of relevant documents. In a typical OSINT pipeline processing thousands of documents daily, the majority are noise. Without this gate, every noisy document would trigger an expensive LLM call.

The design principle: place the cheapest possible filter immediately before the most expensive stage. The relevance gate can reject 60-90% of incoming documents at zero marginal cost, reducing the effective NER bill by the same proportion.
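
A minimal sketch of keyword-weighted scoring under that principle. The specific keywords, weights, and threshold here are illustrative assumptions; the actual values in services/relevance.py are not shown in this document.

```python
# Illustrative weight table: positive weights for signal terms,
# negative weights to down-rank known noise.
KEYWORD_WEIGHTS = {
    "brigade": 2.0,
    "checkpoint": 1.5,
    "sanctions": 1.0,
    "weather": -2.0,
}
THRESHOLD = 2.0

def relevance_score(text):
    """Sum the weights of every known keyword appearing in the text."""
    tokens = text.lower().split()
    return sum(KEYWORD_WEIGHTS.get(tok, 0.0) for tok in tokens)

def passes_gate(text):
    """Documents failing here are rejected before the expensive NER stage."""
    return relevance_score(text) >= THRESHOLD
```

The whole gate is a dictionary lookup per token: no model weights in memory, no network call, effectively zero marginal cost per document.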

NER Output Schema

Entity Types Extracted

Stage 4 NER extracts six structured entity types from each document:

  • Person — Named individuals (leaders, officials, operatives)
  • Organization — Companies, agencies, groups, militias
  • Location — Cities, regions, facilities, coordinates
  • Event — Incidents, operations, meetings, declarations
  • Asset — Weapons, vehicles, infrastructure, resources
  • Concept — Doctrines, policies, strategies, threats
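
The six types above could be encoded as a closed enum so downstream code rejects anything outside the schema. The enum names match the list, but this encoding is an assumption, not the actual schema in services/ner.py.

```python
from enum import Enum

class EntityType(str, Enum):
    """The six entity types the NER stage is allowed to emit (hypothetical encoding)."""
    PERSON = "Person"
    ORGANIZATION = "Organization"
    LOCATION = "Location"
    EVENT = "Event"
    ASSET = "Asset"
    CONCEPT = "Concept"
```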

Relation Types Extracted

The NER LLM also extracts pairwise relations between entities:

  • COMMANDS — Person/Org directs another entity
  • LOCATED_IN — Entity is situated at a Location
  • SUPPLIES — Org/Person provides resources to another entity
  • FUNDS — Financial relationship between entities
  • MEMBER_OF — Person belongs to an Organization
  • PARTICIPATES_IN — Entity is involved in an Event
  • ALLIED_WITH — Cooperative relationship
  • OPPOSES — Adversarial relationship
  • OWNS — Ownership or control relationship
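
A relation coming back from the NER LLM is essentially a typed triple. The field names below are illustrative, not the actual output schema; only the nine relation type names come from the list above.

```python
from dataclasses import dataclass

RELATION_TYPES = {
    "COMMANDS", "LOCATED_IN", "SUPPLIES", "FUNDS", "MEMBER_OF",
    "PARTICIPATES_IN", "ALLIED_WITH", "OPPOSES", "OWNS",
}

@dataclass(frozen=True)
class Relation:
    """One edge of the entity graph: (source entity) --relation--> (target entity)."""
    source: str    # entity name, e.g. a Person or Organization
    relation: str  # one of RELATION_TYPES
    target: str    # entity name, e.g. a Location or Event

    def __post_init__(self):
        # Reject anything outside the closed relation vocabulary.
        if self.relation not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {self.relation}")
```

Validating against a closed vocabulary matters here because LLM output is free-form: an unconstrained string field would let hallucinated relation names leak into the entity graph.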

Embedding Model Options

Sentence-Transformer Model Choices (All Local, No API Cost)

The embedding stage supports 11 model options ranging from ultra-lightweight to premium quality. All run locally — zero API cost regardless of volume.

| Tier | Model | Dimensions | Size | Notes |
|------|-------|------------|------|-------|
| Budget | all-MiniLM-L6-v2 | 384 | ~80 MB | Fastest, lowest RAM, good baseline |
| Budget | all-MiniLM-L12-v2 | 384 | ~120 MB | Slightly better than L6 |
| Budget | paraphrase-MiniLM-L6-v2 | 384 | ~80 MB | Optimized for paraphrase detection |
| Mid | all-mpnet-base-v2 | 768 | ~420 MB | Best quality/speed ratio |
| Mid | multi-qa-mpnet-base-dot-v1 | 768 | ~420 MB | Optimized for question answering |
| Mid | msmarco-distilbert-base-v4 | 768 | ~250 MB | Search/retrieval focused |
| Mid | paraphrase-multilingual-MiniLM-L12-v2 | 384 | ~470 MB | 50+ languages, good for OSINT |
| Premium | all-distilroberta-v1 | 768 | ~290 MB | RoBERTa backbone, strong general-purpose |
| Premium | bge-large-en-v1.5 | 1024 | ~1.3 GB | Top MTEB benchmark scores |
| Premium | e5-large-v2 | 1024 | ~1.3 GB | Microsoft, excellent retrieval |
| Premium | gte-large | 1024 | ~1.3 GB | Alibaba DAMO, strong multilingual |

Embedding model selection is a deployment-time decision, not a code change. All 11 models are interchangeable at the configuration level. Budget models (MiniLM) are recommended for development and low-RAM deployments. Premium models (bge-large, e5-large, gte-large) offer measurably better semantic similarity at the cost of 3-5x more RAM and slower inference. For multilingual OSINT workloads, paraphrase-multilingual-MiniLM-L12-v2 is the recommended default.
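
A deployment-time selection like this could be expressed as a small model registry keyed by configuration. The `EMBEDDING_MODEL` environment variable and the registry itself are assumptions for illustration; the model names and dimensions match the table above.

```python
import os

# Subset of the supported models, mapping name -> embedding dimension.
EMBEDDING_MODELS = {
    "all-MiniLM-L6-v2": 384,
    "all-mpnet-base-v2": 768,
    "paraphrase-multilingual-MiniLM-L12-v2": 384,
    "bge-large-en-v1.5": 1024,
}

def select_model(name=None):
    """Resolve the embedding model from config, defaulting to the multilingual pick.

    Returns (model_name, embedding_dimension)."""
    name = name or os.environ.get(
        "EMBEDDING_MODEL", "paraphrase-multilingual-MiniLM-L12-v2"
    )
    if name not in EMBEDDING_MODELS:
        raise ValueError(f"unsupported embedding model: {name}")
    return name, EMBEDDING_MODELS[name]
```

At runtime the resolved name would be handed to `SentenceTransformer(model_name)`, which caches the weights locally on first load; because the dimension is part of the registry, a config change that alters vector width can be caught before it silently corrupts an existing similarity index.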