Building BrainGap
Psychometric assessment has been gatekept behind million-dollar research budgets for decades. I'm building BrainGap to change that.
About Me
Justin Boehnen
Software engineer building on the cutting edge with AI. I like taking complex problems and creating clean solutions through web technology.
@boehnen
The Problem
I was studying for Azure AZ-104 and got frustrated. Every practice test felt the same. Random questions, a percentage score at the end, no real insight into what I actually needed to study.
The tools that could tell me exactly where my gaps were? They existed, but only inside organizations like ETS, Pearson, and College Board. Building a psychometrically valid assessment required a team of PhDs and a seven-figure budget.
Then LLMs changed everything. Suddenly, the hardest part of assessment (generating quality questions across cognitive levels) became tractable. The question became: could you combine AI generation with real psychometric rigor?
I use AI to be the expert where I'm not. But I don't blindly trust it. Every piece of AI-generated content flows through grounded feedback loops that continuously refine the output based on real data.
The Architecture
Four systems that feed into each other:
A two-parameter logistic (2PL) IRT model, the same math behind the GRE and GMAT. Items are calibrated from real response data, not just AI confidence.
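The 2PL item response function itself is compact: each item has a discrimination parameter (a) and a difficulty parameter (b), and the probability of a correct response depends on how the test-taker's ability (theta) compares to the item's difficulty. A minimal sketch:

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function: probability that a person with
    ability theta answers correctly.
    a = discrimination (how sharply the item separates abilities),
    b = difficulty (the theta at which P = 0.5)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

When theta equals b, the probability is exactly 0.5; a larger discrimination makes the curve steeper around that point.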
Claude generates items across cognitive levels. OpenAI embeddings deduplicate concepts. Everything flows through structured prompts with validation.
Maximum Fisher Information picks the item that reduces uncertainty most. No more wasting time on questions that are too easy or too hard.
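For the 2PL model, an item's Fisher information at a given ability is a² · P · (1 − P), which peaks when the item's difficulty matches the test-taker's current ability estimate. A sketch of the selection step (the `(item_id, a, b)` tuple shape is illustrative, not BrainGap's actual data model):

```python
import math

def fisher_info(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def select_item(theta: float, items):
    """Pick the item that is most informative at the current ability estimate.
    items: iterable of (item_id, a, b) tuples."""
    return max(items, key=lambda it: fisher_info(theta, it[1], it[2]))
```

With equal discriminations, this always prefers the item whose difficulty sits closest to theta, which is exactly why too-easy and too-hard questions stop being asked.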
Every response updates item parameters. Poor discrimination, high variance, or negative feedback automatically flags items for review.
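The review-flagging logic described above can be sketched as a simple threshold check; the threshold values here are illustrative placeholders, not BrainGap's actual settings:

```python
def needs_review(a: float, se_b: float, thumbs_down_rate: float,
                 min_a: float = 0.3, max_se: float = 0.5,
                 max_down: float = 0.2) -> bool:
    """Flag an item for human review when calibration looks unhealthy:
    - a (discrimination) too low: the item barely separates abilities
    - se_b (standard error of difficulty) too high: the estimate is unstable
    - thumbs_down_rate too high: users report the item as bad"""
    return a < min_a or se_b > max_se or thumbs_down_rate > max_down
```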
How It Evolved
BrainGap started as "Rekindl", a simple cert prep app. Here's the technical evolution:
Hardcoded questions, percentage scores, no adaptivity. Just validating that people would use a cert prep tool.
Integrated Claude for question generation. Could suddenly create questions for any topic. Added OpenAI embeddings for concept deduplication (cosine distance < 0.05 = merge).
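The 0.05 merge threshold reads most naturally as a cosine distance (1 − cosine similarity): two concept embeddings that are nearly parallel get merged. A self-contained sketch of that check:

```python
import math

def cosine_distance(u, v) -> float:
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return 1.0 - dot / (norm_u * norm_v)

def is_duplicate(emb_a, emb_b, threshold: float = 0.05) -> bool:
    """Merge two concepts when their embeddings are nearly identical."""
    return cosine_distance(emb_a, emb_b) < threshold
```

In practice the embeddings would come from the OpenAI embeddings API; plain Python lists stand in for them here.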
Added cognitive levels: Remember → Understand → Apply → Analyze → Evaluate → Create. Used Exponential Moving Average (EMA) for score updates. Better than percentages, but still not real measurement.
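An EMA score update is a one-liner: blend the latest outcome into the running score, weighting recent responses more heavily. A sketch (the 0.3 smoothing factor is an illustrative choice):

```python
def ema_update(score: float, outcome: float, alpha: float = 0.3) -> float:
    """Exponential moving average update.
    score: current running score in [0, 1]
    outcome: latest response, 1.0 for correct, 0.0 for incorrect
    alpha: weight given to the newest outcome"""
    return alpha * outcome + (1.0 - alpha) * score
```

The limitation is visible in the formula: the score drifts toward recent performance, but it carries no notion of item difficulty or measurement error, which is what motivated the move to IRT.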
The big pivot. Replaced EMA with 2PL IRT. Added item calibration from response data. Every response now updates difficulty (b) and discrimination (a) parameters. Real ability estimates with standard error.
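The "ability estimate with standard error" part can be sketched as a Newton-Raphson maximum-likelihood fit over a user's responses; this is a textbook approach under the 2PL model, not necessarily BrainGap's exact estimator:

```python
import math

def estimate_theta(responses, iters: int = 25):
    """MLE of ability under the 2PL model via Newton-Raphson.
    responses: list of (a, b, u) where u is 1 (correct) or 0 (incorrect).
    Returns (theta_hat, standard_error). Assumes a mix of correct and
    incorrect responses so the MLE is finite."""
    theta = 0.0
    info = 0.0
    for _ in range(iters):
        grad = 0.0   # score function of the log-likelihood
        info = 0.0   # Fisher information (sum over items)
        for a, b, u in responses:
            p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
            grad += a * (u - p)
            info += a * a * p * (1.0 - p)
        if info == 0.0:
            break
        theta += grad / info
    return theta, 1.0 / math.sqrt(info)
```

The standard error is 1/√(information), which is what lets the engine say not just "your ability is X" but "with this much uncertainty", and stop testing once that uncertainty is small enough.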
Moved from "legacy mode" (fixed tier limits, truncation) to "smart mode" (quality-driven). Multi-tier overlap detection, granularity validation, automatic gap filling. Replaced "how many can we fit?" with "how many do we need?"
Items generated when pools run thin. Facet struggle detection: if a user fails 2+ items on the same facet, generate targeted items. Stale concept detection triggers background generation.
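The "fails 2+ items on the same facet" trigger is a straightforward counting rule. A sketch (the `(facet, correct)` response shape is illustrative):

```python
from collections import Counter

def struggling_facets(responses, threshold: int = 2):
    """Return facets where the user has missed at least `threshold` items.
    responses: list of (facet, correct) pairs."""
    misses = Counter(facet for facet, correct in responses if not correct)
    return [facet for facet, n in misses.items() if n >= threshold]
```

Facets returned here would then feed the targeted-generation queue.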
Rebranded and rebuilt as an API-first platform. Multi-tenant architecture with OAuth. Anyone can embed psychometrically valid assessment into their product.
What's Next
The core engine works. Now it's about scale and trust:
- AI detection: flag responses that look LLM-generated
- Fairness analysis: DIF (differential item functioning) detection to ensure items don't advantage or disadvantage subgroups
- More blueprints: beyond cloud certs, into technical interviews, compliance training, onboarding
- SDKs: TypeScript, Python, Go clients for easier integration
Want to build with BrainGap or just chat about psychometrics and AI?