OSINT Processing Pipeline

The 6-stage intelligence processing pipeline inside aesop_intell — from raw document ingestion to structured entity graphs.

Pipeline Stages

[Pipeline diagram] Stage 0 Ingestion (FREE) → Stage 1 Extraction (FREE) → Stage 2 Language (LOW) → Stage 3 Relevance Gating (ZERO: the critical cost filter; rejected documents never reach NER) → Stage 4 NER, Mistral Small LLM (HIGH: the cost spike) → Stage 5 Classification & Geo (LOW) → Embedding, local sentence-transformer (FREE) → Entity Graph.

Pipeline Stage Reference

| Stage | Process | Cost | Technology | Key Service File |
|-------|---------|------|------------|------------------|
| 0 | Ingestion: normalize raw input to Document model, assign channel provenance | FREE | Django ORM | services/ingestion.py |
| 1 | Extraction: clean HTML/PDF, chunk text, extract title and date metadata | FREE | BeautifulSoup, pdfplumber | services/extraction.py |
| 2 | Language: detect language, auto-translate non-default-language documents | LOW | langdetect, LLM (conditional) | services/language.py |
| 3 | Relevance Gating: keyword-weighted relevance scoring, reject irrelevant documents | ZERO | Keyword matching (no LLM) | services/relevance.py |
| 4 | NER: LLM-based entity extraction (6 types) plus relation extraction | HIGH | Mistral Small (per-document) | services/ner.py |
| 5 | Classification & Geo: domain/context assignment, location resolution, H3 indexing | LOW | H3, geocoder | services/classification.py |
| + | Embedding: vector generation via sentence-transformer for similarity search | FREE | sentence-transformers (local) | services/embedding.py |
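
The stage sequencing in the table can be sketched as a single processing function. Every stage below is a stub standing in for the real service module named in the table; the field names on the document dict are illustrative assumptions, not the actual aesop_intell Document model.

```python
# Hypothetical sketch of the 6-stage flow; real logic lives in services/*.py.

RELEVANT_KEYWORDS = {"convoy", "militia", "checkpoint"}  # illustrative gate terms

def is_relevant(doc):
    """Stage 3: zero-cost keyword gate -- no LLM call, no API cost."""
    text = doc.get("text", "").lower()
    return any(kw in text for kw in RELEVANT_KEYWORDS)

def process_document(doc):
    doc.setdefault("channel", "unknown")      # Stage 0: ingestion, provenance (stub)
    doc["text"] = doc.get("raw", "").strip()  # Stage 1: extraction (stub)
    doc.setdefault("lang", "en")              # Stage 2: language detection (stub)
    if not is_relevant(doc):                  # Stage 3: gate sits right before the LLM
        return None                           # rejected documents never reach NER
    doc["entities"] = []                      # Stage 4: per-document LLM NER (stub, HIGH cost)
    doc["domain"] = "general"                 # Stage 5: classification/geo (stub)
    doc["embedding"] = [0.0] * 384            # local sentence-transformer (stub, FREE)
    return doc
```

Note that the early `return None` is what makes Stage 4's cost scale with relevant documents rather than ingested documents.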

Why Relevance Gating is the Most Important Architectural Decision

Stage 3 (Relevance Gating) is the single most impactful design choice in the pipeline.

The NER stage (Stage 4) is the only stage that requires per-document LLM inference via Mistral Small. Every document that reaches Stage 4 incurs a real computational cost. Relevance gating sits immediately before NER and acts as a zero-cost filter — it uses simple keyword-weighted scoring with no LLM calls, no API costs, and negligible compute.

Documents rejected at Stage 3 never reach NER. This means the entire cost of the pipeline scales not with the number of ingested documents, but with the number of relevant documents. In a typical OSINT pipeline processing thousands of documents daily, the majority are noise. Without this gate, every noisy document would trigger an expensive LLM call.

The design principle: place the cheapest possible filter immediately before the most expensive stage. The relevance gate can reject 60-90% of incoming documents at zero marginal cost, reducing the effective NER bill by the same proportion.
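
A minimal sketch of keyword-weighted scoring under that principle. The specific keywords, weights, and threshold here are illustrative assumptions; the actual values in services/relevance.py are not shown in this document.

```python
# Illustrative weight table: positive weights for signal terms,
# negative weights to down-rank known noise.
KEYWORD_WEIGHTS = {
    "brigade": 2.0,
    "checkpoint": 1.5,
    "sanctions": 1.0,
    "weather": -2.0,
}
THRESHOLD = 2.0

def relevance_score(text):
    """Sum the weights of every known keyword appearing in the text."""
    tokens = text.lower().split()
    return sum(KEYWORD_WEIGHTS.get(tok, 0.0) for tok in tokens)

def passes_gate(text):
    """Documents failing here are rejected before the expensive NER stage."""
    return relevance_score(text) >= THRESHOLD
```

The whole gate is a dictionary lookup per token: no model weights in memory, no network call, effectively zero marginal cost per document.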

NER Output Schema

Entity Types Extracted

Stage 4 NER extracts six structured entity types from each document:

  • Person — Named individuals (leaders, officials, operatives)
  • Organization — Companies, agencies, groups, militias
  • Location — Cities, regions, facilities, coordinates
  • Event — Incidents, operations, meetings, declarations
  • Asset — Weapons, vehicles, infrastructure, resources
  • Concept — Doctrines, policies, strategies, threats
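
The six types above could be encoded as a closed enum so downstream code rejects anything outside the schema. The enum names match the list, but this encoding is an assumption, not the actual schema in services/ner.py.

```python
from enum import Enum

class EntityType(str, Enum):
    """The six entity types the NER stage is allowed to emit (hypothetical encoding)."""
    PERSON = "Person"
    ORGANIZATION = "Organization"
    LOCATION = "Location"
    EVENT = "Event"
    ASSET = "Asset"
    CONCEPT = "Concept"
```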

Relation Types Extracted

The NER LLM also extracts pairwise relations between entities:

  • COMMANDS — Person/Org directs another entity
  • LOCATED_IN — Entity is situated at a Location
  • SUPPLIES — Org/Person provides resources to another entity
  • FUNDS — Financial relationship between entities
  • MEMBER_OF — Person belongs to an Organization
  • PARTICIPATES_IN — Entity is involved in an Event
  • ALLIED_WITH — Cooperative relationship
  • OPPOSES — Adversarial relationship
  • OWNS — Ownership or control relationship
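
A relation coming back from the NER LLM is essentially a typed triple. The field names below are illustrative, not the actual output schema; only the nine relation type names come from the list above.

```python
from dataclasses import dataclass

RELATION_TYPES = {
    "COMMANDS", "LOCATED_IN", "SUPPLIES", "FUNDS", "MEMBER_OF",
    "PARTICIPATES_IN", "ALLIED_WITH", "OPPOSES", "OWNS",
}

@dataclass(frozen=True)
class Relation:
    """One edge of the entity graph: (source entity) --relation--> (target entity)."""
    source: str    # entity name, e.g. a Person or Organization
    relation: str  # one of RELATION_TYPES
    target: str    # entity name, e.g. a Location or Event

    def __post_init__(self):
        # Reject anything outside the closed relation vocabulary.
        if self.relation not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {self.relation}")
```

Validating against a closed vocabulary matters here because LLM output is free-form: an unconstrained string field would let hallucinated relation names leak into the entity graph.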

Embedding Model Options

Sentence-Transformer Model Choices (All Local, No API Cost)

The embedding stage supports 11 model options ranging from ultra-lightweight to premium quality. All run locally — zero API cost regardless of volume.

| Tier | Model | Dimensions | Size | Notes |
|------|-------|------------|------|-------|
| Budget | all-MiniLM-L6-v2 | 384 | ~80 MB | Fastest, lowest RAM, good baseline |
| Budget | all-MiniLM-L12-v2 | 384 | ~120 MB | Slightly better than L6 |
| Budget | paraphrase-MiniLM-L6-v2 | 384 | ~80 MB | Optimized for paraphrase detection |
| Mid | all-mpnet-base-v2 | 768 | ~420 MB | Best quality/speed ratio |
| Mid | multi-qa-mpnet-base-dot-v1 | 768 | ~420 MB | Optimized for question answering |
| Mid | msmarco-distilbert-base-v4 | 768 | ~250 MB | Search/retrieval focused |
| Mid | paraphrase-multilingual-MiniLM-L12-v2 | 384 | ~470 MB | 50+ languages, good for OSINT |
| Premium | all-distilroberta-v1 | 768 | ~290 MB | RoBERTa backbone, strong general-purpose |
| Premium | bge-large-en-v1.5 | 1024 | ~1.3 GB | Top MTEB benchmark scores |
| Premium | e5-large-v2 | 1024 | ~1.3 GB | Microsoft, excellent retrieval |
| Premium | gte-large | 1024 | ~1.3 GB | Alibaba DAMO, strong multilingual |

Embedding model selection is a deployment-time decision, not a code change. All 11 models are interchangeable at the configuration level. Budget models (MiniLM) are recommended for development and low-RAM deployments. Premium models (bge-large, e5-large, gte-large) offer measurably better semantic similarity at the cost of 3-5x more RAM and slower inference. For multilingual OSINT workloads, paraphrase-multilingual-MiniLM-L12-v2 is the recommended default.
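
A deployment-time selection like this could be expressed as a small model registry keyed by configuration. The `EMBEDDING_MODEL` environment variable and the registry itself are assumptions for illustration; the model names and dimensions match the table above.

```python
import os

# Subset of the supported models, mapping name -> embedding dimension.
EMBEDDING_MODELS = {
    "all-MiniLM-L6-v2": 384,
    "all-mpnet-base-v2": 768,
    "paraphrase-multilingual-MiniLM-L12-v2": 384,
    "bge-large-en-v1.5": 1024,
}

def select_model(name=None):
    """Resolve the embedding model from config, defaulting to the multilingual pick.

    Returns (model_name, embedding_dimension)."""
    name = name or os.environ.get(
        "EMBEDDING_MODEL", "paraphrase-multilingual-MiniLM-L12-v2"
    )
    if name not in EMBEDDING_MODELS:
        raise ValueError(f"unsupported embedding model: {name}")
    return name, EMBEDDING_MODELS[name]
```

At runtime the resolved name would be handed to `SentenceTransformer(model_name)`, which caches the weights locally on first load; because the dimension is part of the registry, a config change that alters vector width can be caught before it silently corrupts an existing similarity index.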