# OSINT Processing Pipeline
The 6-stage intelligence processing pipeline inside aesop_intell — from raw document ingestion to structured entity graphs.
## Pipeline Stages
### Pipeline Stage Reference
| Stage | Process | Cost | Technology | Key Service File |
|---|---|---|---|---|
| 0 Ingestion | Normalize raw input to Document model, assign channel provenance | FREE | Django ORM | services/ingestion.py |
| 1 Extraction | Clean HTML/PDF, chunk text, extract title and date metadata | FREE | BeautifulSoup, pdfplumber | services/extraction.py |
| 2 Language | Detect language, auto-translate non-default documents | LOW | langdetect, LLM (conditional) | services/language.py |
| 3 Relevance Gating | Keyword-weighted relevance scoring, reject irrelevant documents | ZERO | Keyword matching (no LLM) | services/relevance.py |
| 4 NER | LLM-based entity extraction (6 types) + relation extraction | HIGH | Mistral Small (per-document) | services/ner.py |
| 5 Classification & Geo | Domain/context assignment, location resolution, H3 indexing | LOW | H3, geocoder | services/classification.py |
| + Embedding | Vector generation from sentence-transformer for similarity search | FREE | sentence-transformers (local) | services/embedding.py |
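The stage ordering above can be sketched as a simple chain. This is a hypothetical sketch: the stage stubs below stand in for the real service modules (`services/ingestion.py`, `services/ner.py`, etc.), whose actual signatures may differ.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    raw: str
    text: str = ""
    language: str = ""
    entities: list = field(default_factory=list)

def ingest(doc):                 # Stage 0: normalize raw input (FREE)
    doc.text = doc.raw
    return doc

def extract(doc):                # Stage 1: clean and chunk (FREE)
    doc.text = doc.text.strip()
    return doc

def detect_language(doc):        # Stage 2: detect/translate (LOW)
    doc.language = "en"
    return doc

def passes_relevance_gate(doc):  # Stage 3: keyword gate, no LLM (ZERO)
    return "militia" in doc.text.lower()

def run_ner(doc):                # Stage 4: the only per-document LLM call (HIGH)
    doc.entities = [("militia", "Organization")]
    return doc

def classify_and_geo(doc):       # Stage 5: domain + H3 indexing (LOW)
    return doc

def embed(doc):                  # +: local sentence-transformer (FREE)
    return doc

def run_pipeline(doc: Document):
    for stage in (ingest, extract, detect_language):
        doc = stage(doc)
    if not passes_relevance_gate(doc):
        return None              # rejected documents never reach NER
    for stage in (run_ner, classify_and_geo, embed):
        doc = stage(doc)
    return doc
```

Note how the gate sits between the free/low-cost stages and the single expensive one, so a rejection short-circuits before any LLM work.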
## Why Relevance Gating Is the Most Important Architectural Decision
Stage 3 (Relevance Gating) is the single most impactful design choice in the pipeline.
The NER stage (Stage 4) is the only stage that requires per-document LLM inference via Mistral Small. Every document that reaches Stage 4 incurs a real computational cost. Relevance gating sits immediately before NER and acts as a zero-cost filter — it uses simple keyword-weighted scoring with no LLM calls, no API costs, and negligible compute.
Documents rejected at Stage 3 never reach NER. This means the entire cost of the pipeline scales not with the number of ingested documents, but with the number of relevant documents. In a typical OSINT pipeline processing thousands of documents daily, the majority are noise. Without this gate, every noisy document would trigger an expensive LLM call.
The design principle: place the cheapest possible filter immediately before the most expensive stage. The relevance gate can reject 60-90% of incoming documents at zero marginal cost, reducing the effective NER bill by the same proportion.
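A minimal sketch of such a keyword-weighted gate is below. The keyword list, weights, and threshold are illustrative placeholders, not aesop_intell's actual configuration in `services/relevance.py`.

```python
# Illustrative vocabulary and weights; the real config will differ.
KEYWORD_WEIGHTS = {
    "militia": 3.0,
    "convoy": 2.0,
    "sanctions": 2.0,
    "border": 1.0,
}
THRESHOLD = 3.0  # minimum score to forward a document to NER

def relevance_score(text: str) -> float:
    # Plain dict lookups over tokens: no LLM call, no API cost.
    tokens = text.lower().split()
    return sum(KEYWORD_WEIGHTS.get(tok, 0.0) for tok in tokens)

def passes_gate(text: str) -> bool:
    return relevance_score(text) >= THRESHOLD
```

Because scoring is a linear scan over tokens, the gate's cost is effectively zero next to a per-document LLM call, which is exactly what the design principle requires.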
## NER Output Schema
### Entity Types Extracted
Stage 4 NER extracts six structured entity types from each document:
- Person — Named individuals (leaders, officials, operatives)
- Organization — Companies, agencies, groups, militias
- Location — Cities, regions, facilities, coordinates
- Event — Incidents, operations, meetings, declarations
- Asset — Weapons, vehicles, infrastructure, resources
- Concept — Doctrines, policies, strategies, threats
### Relation Types Extracted
Pairwise relations between entities, also extracted by the NER LLM:
- COMMANDS — Person/Org directs another entity
- LOCATED_IN — Entity is situated at a Location
- SUPPLIES — Org/Person provides resources to another
- FUNDS — Financial relationship between entities
- MEMBER_OF — Person belongs to Organization
- PARTICIPATES_IN — Entity involved in Event
- ALLIED_WITH — Cooperative relationship
- OPPOSES — Adversarial relationship
- OWNS — Ownership or control relationship
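Taken together, the entity and relation vocabularies imply an output shape like the following. The field names (`entities`, `relations`, `head`, `tail`) and the example values are assumptions for illustration, not the exact schema emitted by `services/ner.py`.

```python
# Closed vocabularies, copied from the lists above.
ENTITY_TYPES = {"Person", "Organization", "Location", "Event", "Asset", "Concept"}
RELATION_TYPES = {
    "COMMANDS", "LOCATED_IN", "SUPPLIES", "FUNDS", "MEMBER_OF",
    "PARTICIPATES_IN", "ALLIED_WITH", "OPPOSES", "OWNS",
}

# Hypothetical per-document output; names are invented examples.
example_output = {
    "entities": [
        {"name": "Gen. A. Example", "type": "Person"},
        {"name": "Northern Militia", "type": "Organization"},
        {"name": "Port of Exampleville", "type": "Location"},
    ],
    "relations": [
        {"head": "Gen. A. Example", "type": "COMMANDS", "tail": "Northern Militia"},
        {"head": "Northern Militia", "type": "LOCATED_IN", "tail": "Port of Exampleville"},
    ],
}

def validate(output: dict) -> bool:
    # Every extracted type must come from the closed vocabularies,
    # which keeps the downstream entity graph schema stable.
    return (
        all(e["type"] in ENTITY_TYPES for e in output["entities"])
        and all(r["type"] in RELATION_TYPES for r in output["relations"])
    )
```

Validating against closed type sets at the pipeline boundary is what lets later stages build a typed entity graph without defending against free-form LLM output.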
## Embedding Model Options
### Sentence-Transformer Model Choices (All Local, No API Cost)
The embedding stage supports 11 model options ranging from ultra-lightweight to premium quality. All run locally — zero API cost regardless of volume.
| Tier | Model | Dimensions | Size | Notes |
|---|---|---|---|---|
| Budget | all-MiniLM-L6-v2 | 384 | ~80 MB | Fastest, lowest RAM, good baseline |
| Budget | all-MiniLM-L12-v2 | 384 | ~120 MB | Slightly better than L6 |
| Budget | paraphrase-MiniLM-L6-v2 | 384 | ~80 MB | Optimized for paraphrase detection |
| Mid | all-mpnet-base-v2 | 768 | ~420 MB | Best quality/speed ratio |
| Mid | multi-qa-mpnet-base-dot-v1 | 768 | ~420 MB | Optimized for question-answering |
| Mid | msmarco-distilbert-base-v4 | 768 | ~250 MB | Search/retrieval focused |
| Mid | paraphrase-multilingual-MiniLM-L12-v2 | 384 | ~470 MB | 50+ languages, good for OSINT |
| Premium | all-distilroberta-v1 | 768 | ~290 MB | RoBERTa backbone, strong general |
| Premium | bge-large-en-v1.5 | 1024 | ~1.3 GB | Top MTEB benchmark scores |
| Premium | e5-large-v2 | 1024 | ~1.3 GB | Microsoft, excellent retrieval |
| Premium | gte-large | 1024 | ~1.3 GB | Alibaba DAMO, strong multilingual |
**Embedding model selection is a deployment-time decision, not a code change.**
All 11 models are interchangeable at the configuration level. Budget models (MiniLM) are recommended for development and low-RAM deployments. Premium models (bge-large, e5-large, gte-large) offer measurably better semantic similarity at the cost of 3-5x more RAM and slower inference. For multilingual OSINT workloads, paraphrase-multilingual-MiniLM-L12-v2 is the recommended default.