Building BrainGap
Psychometric assessment has been gatekept behind million-dollar research budgets for decades. I'm building BrainGap to change that.
About Me
Justin Boehnen
Software engineer building on the cutting edge with AI. I like taking complex problems and creating clean solutions through web technology.
@boehnen
The Problem
I was studying for Azure AZ-104 and got frustrated. Every practice test felt the same. Random questions, a percentage score at the end, no real insight into what I actually needed to study.
The tools that could tell me exactly where my gaps were? They existed, but only inside organizations like ETS, Pearson, and College Board. Building a psychometrically valid assessment required a team of PhDs and a seven-figure budget.
Then LLMs changed everything. Suddenly, the hardest part of assessment (generating quality questions across cognitive levels) became tractable. The question became: could you combine AI generation with real psychometric rigor?
I use AI to be the expert where I'm not. But I don't blindly trust it. Every piece of AI-generated content flows through grounded feedback loops that continuously refine the output based on real data.
The Architecture
Four systems that feed into each other:
A two-parameter logistic (2PL) IRT model, the same math behind the GRE and GMAT. Items are calibrated from real response data, not just AI confidence.
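The 2PL item response function itself is compact: each item has a discrimination parameter (a) and a difficulty parameter (b), and the probability of a correct response depends on how the test-taker's ability (theta) compares to the item's difficulty. A minimal sketch:

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function: probability that a person with
    ability theta answers correctly.
    a = discrimination (how sharply the item separates abilities),
    b = difficulty (the theta at which P = 0.5)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

When theta equals b, the probability is exactly 0.5; a larger discrimination makes the curve steeper around that point.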
Claude generates items across cognitive levels. OpenAI embeddings deduplicate concepts. Everything flows through structured prompts with validation.
Maximum Fisher Information picks the item that reduces uncertainty most. No more wasting time on questions that are too easy or too hard.
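For the 2PL model, an item's Fisher information at a given ability is a² · P · (1 − P), which peaks when the item's difficulty matches the test-taker's current ability estimate. A sketch of the selection step (the `(item_id, a, b)` tuple shape is illustrative, not BrainGap's actual data model):

```python
import math

def fisher_info(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def select_item(theta: float, items):
    """Pick the item that is most informative at the current ability estimate.
    items: iterable of (item_id, a, b) tuples."""
    return max(items, key=lambda it: fisher_info(theta, it[1], it[2]))
```

With equal discriminations, this always prefers the item whose difficulty sits closest to theta, which is exactly why too-easy and too-hard questions stop being asked.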
Every response updates item parameters. Poor discrimination, high variance, or negative feedback automatically flags items for review.
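The review-flagging logic described above can be sketched as a simple threshold check; the threshold values here are illustrative placeholders, not BrainGap's actual settings:

```python
def needs_review(a: float, se_b: float, thumbs_down_rate: float,
                 min_a: float = 0.3, max_se: float = 0.5,
                 max_down: float = 0.2) -> bool:
    """Flag an item for human review when calibration looks unhealthy:
    - a (discrimination) too low: the item barely separates abilities
    - se_b (standard error of difficulty) too high: the estimate is unstable
    - thumbs_down_rate too high: users report the item as bad"""
    return a < min_a or se_b > max_se or thumbs_down_rate > max_down
```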
How It Evolved
BrainGap started as "Rekindl", a simple cert prep app. Here's the technical evolution:
Hardcoded questions, percentage scores, no adaptivity. Just validating that people would use a cert prep tool.
Integrated Claude for question generation. Could suddenly create questions for any topic. Added OpenAI embeddings for concept deduplication (cosine distance < 0.05 = merge).
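The 0.05 merge threshold reads most naturally as a cosine distance (1 − cosine similarity): two concept embeddings that are nearly parallel get merged. A self-contained sketch of that check:

```python
import math

def cosine_distance(u, v) -> float:
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return 1.0 - dot / (norm_u * norm_v)

def is_duplicate(emb_a, emb_b, threshold: float = 0.05) -> bool:
    """Merge two concepts when their embeddings are nearly identical."""
    return cosine_distance(emb_a, emb_b) < threshold
```

In practice the embeddings would come from the OpenAI embeddings API; plain Python lists stand in for them here.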
Added cognitive levels: Remember → Understand → Apply → Analyze → Evaluate → Create. Used Exponential Moving Average (EMA) for score updates. Better than percentages, but still not real measurement.
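An EMA score update is a one-liner: blend the latest outcome into the running score, weighting recent responses more heavily. A sketch (the 0.3 smoothing factor is an illustrative choice):

```python
def ema_update(score: float, outcome: float, alpha: float = 0.3) -> float:
    """Exponential moving average update.
    score: current running score in [0, 1]
    outcome: latest response, 1.0 for correct, 0.0 for incorrect
    alpha: weight given to the newest outcome"""
    return alpha * outcome + (1.0 - alpha) * score
```

The limitation is visible in the formula: the score drifts toward recent performance, but it carries no notion of item difficulty or measurement error, which is what motivated the move to IRT.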
The big pivot. Replaced EMA with 2PL IRT. Added item calibration from response data. Every response now updates difficulty (b) and discrimination (a) parameters. Real ability estimates with standard error.
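The "ability estimate with standard error" part can be sketched as a Newton-Raphson maximum-likelihood fit over a user's responses; this is a textbook approach under the 2PL model, not necessarily BrainGap's exact estimator:

```python
import math

def estimate_theta(responses, iters: int = 25):
    """MLE of ability under the 2PL model via Newton-Raphson.
    responses: list of (a, b, u) where u is 1 (correct) or 0 (incorrect).
    Returns (theta_hat, standard_error). Assumes a mix of correct and
    incorrect responses so the MLE is finite."""
    theta = 0.0
    info = 0.0
    for _ in range(iters):
        grad = 0.0   # score function of the log-likelihood
        info = 0.0   # Fisher information (sum over items)
        for a, b, u in responses:
            p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
            grad += a * (u - p)
            info += a * a * p * (1.0 - p)
        if info == 0.0:
            break
        theta += grad / info
    return theta, 1.0 / math.sqrt(info)
```

The standard error is 1/√(information), which is what lets the engine say not just "your ability is X" but "with this much uncertainty", and stop testing once that uncertainty is small enough.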
Moved from "legacy mode" (fixed tier limits, truncation) to "smart mode" (quality-driven). Multi-tier overlap detection, granularity validation, automatic gap filling. Replaced "how many can we fit?" with "how many do we need?"
Items generated when pools run thin. Facet struggle detection: if a user fails 2+ items on the same facet, generate targeted items. Stale concept detection triggers background generation.
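The "fails 2+ items on the same facet" trigger is a straightforward counting rule. A sketch (the `(facet, correct)` response shape is illustrative):

```python
from collections import Counter

def struggling_facets(responses, threshold: int = 2):
    """Return facets where the user has missed at least `threshold` items.
    responses: list of (facet, correct) pairs."""
    misses = Counter(facet for facet, correct in responses if not correct)
    return [facet for facet, n in misses.items() if n >= threshold]
```

Facets returned here would then feed the targeted-generation queue.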
Rebranded and rebuilt as an API-first platform. Multi-tenant architecture with OAuth. Anyone can embed psychometrically valid assessment into their product.
What's Next
The core engine works. Now it's about scale and trust:
- AI detection: flag responses that look LLM-generated
- Fairness analysis: DIF (differential item functioning) detection to ensure items don't advantage or disadvantage subgroups
- More blueprints: beyond cloud certs, into technical interviews, compliance training, onboarding
- SDKs: TypeScript, Python, Go clients for easier integration
Want to build with BrainGap or just chat about psychometrics and AI?