Alden Scientific
A low-latency natural language interface for querying complex medical datasets, combining LLM-powered semantic indexing with deterministic runtime routing to deliver sub-20ms medical intent resolution.
Overview
Alden Scientific needed a high-performance natural language interface capable of querying complex medical datasets — specifically LOINC (Logical Observation Identifiers Names and Codes) — with sub-20ms latency. The core challenge was the “latency vs. intelligence” trade-off: delivering the semantic depth of Large Language Models at the speed of deterministic computation.
We designed a hybrid architecture that decouples semantic reasoning from runtime execution, combining offline LLM-powered indexing with a lightning-fast deterministic routing engine.
The Challenge
Medical terminology is inherently ambiguous and complex. Clinicians and researchers need to query diagnostic test data using natural language — including colloquial terms, abbreviations, and multi-part temporal comparisons — while receiving instant, accurate results. The system needed to handle:
- Semantic ambiguity in medical terminology (e.g., mapping “FH” to “Familial Hypercholesterolemia”)
- Complex temporal queries (“Compare glucose between Jan and March of 2020 vs 2021”)
- General medical knowledge questions alongside structured data retrieval
- Production-grade latency requirements under 20ms on CPU-only infrastructure
Our Solution
1. Offline Semantic Indexing — The “Build” Phase
We built an offline ETL pipeline using PyTorch and Hugging Face Transformers to handle medical terminology ambiguity without incurring runtime inference costs.
Using Qwen 2.5-14B (loaded via bitsandbytes in 4-bit quantization), we semantically analyzed over 5,000 raw LOINC test descriptions. Through carefully engineered prompts, the model extracted structured metadata — normalizing Body Systems, Organs, Conditions, and observable Symptoms from unstructured text.
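The extraction step can be sketched around the model call. This is an illustrative, hypothetical sketch: the prompt wording, the JSON field names (`body_system`, `organ`, `conditions`, `symptoms`), and the helper names are assumptions, not the actual Golden Index schema, and the model reply is mocked.

```python
import json

# Hypothetical prompt template; field names are illustrative assumptions.
PROMPT_TEMPLATE = (
    "Extract structured metadata from this LOINC test description.\n"
    "Return JSON with keys: body_system, organ, conditions, symptoms.\n"
    "Description: {description}\nJSON:"
)

def build_prompt(description: str) -> str:
    return PROMPT_TEMPLATE.format(description=description)

def parse_model_output(raw: str) -> dict:
    """Parse the model's JSON reply, tolerating surrounding chatter."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object in model output")
    record = json.loads(raw[start : end + 1])
    # Normalize list fields so every index entry has the same shape.
    for key in ("conditions", "symptoms"):
        value = record.get(key, [])
        record[key] = [value] if isinstance(value, str) else list(value)
    return record

# Example with a mocked model reply:
reply = ('Sure! {"body_system": "Hepatic", "organ": "Liver", '
         '"conditions": "Hepatitis", "symptoms": ["Jaundice"]}')
meta = parse_model_output(reply)
print(meta["organ"])  # Liver
```

Because the parsing and normalization live outside the model call, the same code validates every record before it is written into the offline index.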
We also developed a custom XML parser using lxml to ingest the MedlinePlus taxonomy, creating a synonym graph that maps colloquial terms to canonical medical concepts. The output: a deterministic “Golden Index” — a highly optimized JSON artifact containing rank-ordered, semantically enriched test metadata ready for instant retrieval.
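The synonym-graph idea can be shown with a few lines of parsing code. A minimal sketch follows, with stdlib `xml.etree` standing in for lxml and an assumed `<topic>`/`<also-called>` structure that is illustrative, not the real MedlinePlus schema:

```python
import xml.etree.ElementTree as ET

# Toy input; element and attribute names are assumptions for illustration.
SAMPLE = """
<topics>
  <topic title="Familial Hypercholesterolemia">
    <also-called>FH</also-called>
    <also-called>Type 2 hyperlipoproteinemia</also-called>
  </topic>
</topics>
"""

def build_synonym_graph(xml_text: str) -> dict:
    """Map every alias (lower-cased) to its canonical medical concept."""
    graph = {}
    for topic in ET.fromstring(xml_text).iter("topic"):
        canonical = topic.get("title")
        graph[canonical.lower()] = canonical
        for alias in topic.iter("also-called"):
            graph[alias.text.strip().lower()] = canonical
    return graph

graph = build_synonym_graph(SAMPLE)
print(graph["fh"])  # Familial Hypercholesterolemia
```

The resulting flat dictionary is what makes runtime lookup constant-time: all graph traversal happens once, offline.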
2. Runtime Routing Engine — The “Reflex” Phase
We developed a CPU-optimized Python search engine using spaCy and FlashText to replace traditional SQL LIKE queries.
The entity resolution pipeline combines lemmatization (spaCy) for morphological normalization with FlashText for keyword extraction whose per-token cost stays constant regardless of dictionary size, ensuring exact matches even against massive medical vocabularies. For ambiguous queries, a fallback mechanism using SentenceTransformers (all-MiniLM-L6-v2) embeds queries into vector space and compares them against pre-computed intent prototypes via cosine similarity.
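The two-pass resolution above can be illustrated in pure Python. This is a toy stand-in, not the production pipeline: the lemma table replaces spaCy, the dictionary lookup replaces FlashText, and the two-dimensional prototype vectors are fabricated for the example.

```python
import math

# Toy stand-ins for the spaCy lemma table and the FlashText dictionary.
LEMMAS = {"livers": "liver", "enzymes": "enzyme"}
KEYWORDS = {"liver": "MARKER:ALT", "glucose": "MARKER:GLU"}

def resolve_entities(query: str) -> list:
    """Exact-match pass: lemmatize each token, then a constant-time lookup."""
    hits = []
    for token in query.lower().split():
        lemma = LEMMAS.get(token, token)
        if lemma in KEYWORDS:
            hits.append(KEYWORDS[lemma])
    return hits

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Fallback pass: compare the query embedding against intent prototypes
# (vectors here are fabricated; the real ones come from all-MiniLM-L6-v2).
PROTOTYPES = {"lab_lookup": (0.9, 0.1), "general_knowledge": (0.2, 0.95)}

def classify_intent(query_vec) -> str:
    return max(PROTOTYPES, key=lambda name: cosine(query_vec, PROTOTYPES[name]))

print(resolve_entities("livers glucose panel"))  # ['MARKER:ALT', 'MARKER:GLU']
print(classify_intent((0.85, 0.2)))              # lab_lookup
```

The design point is the ordering: the cheap exact-match pass answers most queries, and the embedding fallback only runs when it fails.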
To solve “semantic dilution” — where a term like “Liver” matches both Glucose and ALT — we implemented a custom ranking algorithm that dynamically boosts markers belonging to the target anatomical system while penalizing physiologically tangential matches.
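The boost-and-penalize idea can be sketched in a few lines. The candidate records, weights, and field names below are hypothetical illustrations, not the production scoring model:

```python
# Fabricated candidates: "liver" matched both, but only ALT is hepatic.
CANDIDATES = [
    {"marker": "ALT", "system": "hepatic", "base": 0.70},
    {"marker": "Glucose", "system": "metabolic", "base": 0.72},
]

def rank(candidates, target_system, boost=0.25, penalty=0.15):
    """Boost markers in the query's anatomical system; penalize the rest."""
    scored = []
    for c in candidates:
        score = c["base"]
        score += boost if c["system"] == target_system else -penalty
        scored.append((score, c["marker"]))
    return [marker for _, marker in sorted(scored, reverse=True)]

print(rank(CANDIDATES, target_system="hepatic"))  # ['ALT', 'Glucose']
```

Even though Glucose had the higher raw lexical score, the system-aware adjustment surfaces the physiologically relevant marker first.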
3. Advanced Temporal Logic & SQL Generation
We addressed the complexity of longitudinal patient data with a custom stateful parser. A logic engine using dateparser and regular expressions handles linguistic ellipsis, allowing the system to infer missing year or month contexts in multi-part comparison queries. The router translates natural language timeframes into precise, composable SQL WHERE clauses, automatically handling explicit ranges, snapshots, and complex multi-year comparison sets.
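The ellipsis-inheritance behavior can be shown on one query shape. This sketch uses only stdlib `re` in place of dateparser, covers a single pattern out of the full grammar, and assumes a hypothetical `ts` timestamp column in the generated SQL:

```python
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"])}

def parse_comparison(text: str):
    """Parse 'between <m1> and <m2> of <y1> vs <y2>'. The second clause
    elides its months, so they are inherited from the first (ellipsis)."""
    pat = re.compile(
        r"between\s+(\w+)\s+and\s+(\w+)\s+of\s+(\d{4})\s+vs\s+(\d{4})", re.I)
    m = pat.search(text)
    m1 = MONTHS[m.group(1)[:3].lower()]
    m2 = MONTHS[m.group(2)[:3].lower()]
    # Both years reuse the same month window inferred from clause one.
    ranges = [(int(m.group(3)), m1, m2), (int(m.group(4)), m1, m2)]
    return [
        f"(YEAR(ts) = {y} AND MONTH(ts) BETWEEN {a} AND {b})"
        for y, a, b in ranges
    ]

clauses = parse_comparison(
    "Compare glucose between Jan and March of 2020 vs 2021")
print(clauses[0])  # (YEAR(ts) = 2020 AND MONTH(ts) BETWEEN 1 AND 3)
```

Because each timeframe compiles to a self-contained WHERE fragment, the router can OR them together for comparisons or use them individually for snapshots.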
4. RAG Integration for General Knowledge
For queries requiring definitions, causal links, or usage explanations (e.g., “What does high Neutrophil count mean?”), the system integrates a Retrieval-Augmented Generation module. The offline pipeline extracts and embeds clinical context alongside marker data, grounding medical explanations in vetted source text, which minimizes hallucination risk and avoids external API latency.
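The retrieval half of that module reduces to nearest-neighbor search over the pre-embedded context. A minimal sketch, with fabricated three-dimensional vectors and passages standing in for the real clinical embeddings:

```python
import math

# Fabricated corpus: (embedding, passage) pairs produced offline.
CORPUS = [
    ((0.9, 0.1, 0.0), "Neutrophils fight bacterial infection."),
    ((0.1, 0.9, 0.1), "ALT is a liver enzyme measured in blood."),
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, k=1):
    """Return the k passages closest to the query embedding; the generator
    then answers strictly from this retrieved context (grounding)."""
    ranked = sorted(CORPUS, key=lambda item: cosine(query_vec, item[0]),
                    reverse=True)
    return [text for _, text in ranked[:k]]

print(retrieve((0.95, 0.05, 0.0)))  # ['Neutrophils fight bacterial infection.']
```

Since the corpus is embedded offline and small enough to hold in memory, retrieval stays within the same CPU latency budget as the rest of the router.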
5. Infrastructure & Optimization
We reduced the Docker container size by 90% — from ~4GB to ~300MB — by stripping GPU drivers and using CPU-only PyTorch wheels for the runtime environment. Dependency management via uv ensures fast, reproducible environment resolution.
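The shape of that slim image can be sketched as a Dockerfile. This is an illustrative fragment, not the deployed file: the base image, file names, and module entry point are placeholders; the CPU-only wheel index (`download.pytorch.org/whl/cpu`) is PyTorch's published one.

```dockerfile
# Illustrative sketch; paths and versions are placeholders.
FROM python:3.11-slim

# CPU-only PyTorch wheels avoid bundling CUDA libraries (multi-GB savings).
RUN pip install uv && \
    uv pip install --system torch --index-url https://download.pytorch.org/whl/cpu

# Hypothetical artifact and entry point names for illustration.
COPY golden_index.json router/ /app/
WORKDIR /app
CMD ["python", "-m", "router"]
```

Shipping the pre-computed Golden Index as a plain JSON file is what lets the runtime image skip the model weights entirely.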
Results
The system delivers production-grade medical NLP with remarkable efficiency:
- Sub-20ms latency for natural language medical queries on CPU-only infrastructure
- 5,000+ LOINC tests semantically indexed with enriched metadata
- 90% reduction in Docker container size for lean deployment
- Zero runtime LLM dependency — all intelligence is pre-computed offline
- Hybrid intent classification seamlessly handles both structured data queries and open-ended medical knowledge questions
Have a similar challenge?
We'd love to hear about your project and explore how AI can help you achieve your goals.
Start a conversation →