Production System

AskAlan

A production AI teaching assistant that gives University of Toronto engineering students trusted, instructor-approved access to their own course material — deployed across 19 courses and answering thousands of student questions.

Python · FastAPI · PostgreSQL · Anthropic Claude · RAG

Production system · University of Toronto

19 Courses in Production

118K Document Chunks Indexed

~40% Retrieval Latency Reduced

8K+ Real Queries Validated

01 / Overview

AskAlan is a retrieval-augmented AI assistant that gives University of Toronto engineering students trusted access to their instructors' own course material — lectures, exercises, past exams, and announcements — instead of the generic, unsourced answers public chatbots give. It began as a capstone project and is now a production system maintained by a faculty-led team, deployed across 19 courses.

I led the consolidation of its retrieval stack onto Postgres. The first phase migrated the keyword-search layer off Elasticsearch behind a behavior-preserving adapter, mirroring roughly 118,000 document chunks across all 19 courses and validating ranking parity against more than 8,000 real student queries before any cutover. It cut keyword-search latency by about 40% and removed one of three separate datastores from the operational footprint, with dual-read/dual-write safety and instant rollback throughout.

Beyond the migration, I worked across the stack: rebuilding multimodal retrieval so exam and image content surfaces correctly, instrumenting end-to-end token and cost tracking across five model providers, cutting streaming latency by about 35%, and hardening the live deployment's secret handling. The throughline is shipping reliable changes inside a real system with real users and real constraints, and leaving it measurably faster and cheaper to run.

02 / Key Features

Zero-Downtime Search Migration

Re-architected the retrieval backbone from Elasticsearch onto Postgres behind a behavior-preserving adapter, with dual-read/dual-write validation and one-flag rollback, so live courses saw no disruption.
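
As a rough illustration of the adapter pattern described here (not the production code), the sketch below wraps two search backends behind one interface, reads from both, records disagreements, and serves from whichever backend a single flag selects. The class name, the `serve_from_candidate` flag, and the `(query, k) -> list of chunk ids` signature are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Assumed shape of a keyword-search backend: (query, k) -> ranked chunk ids.
SearchFn = Callable[[str, int], List[str]]


@dataclass
class KeywordSearchAdapter:
    """Hypothetical dual-read adapter with a one-flag rollback."""

    legacy: SearchFn      # stands in for the Elasticsearch-backed search
    candidate: SearchFn   # stands in for the Postgres-backed search
    serve_from_candidate: bool = False  # flip to cut over, flip back to roll back
    mismatches: List[str] = field(default_factory=list)

    def search(self, query: str, k: int = 10) -> List[str]:
        legacy_hits = self.legacy(query, k)

        # The candidate must never take down live traffic, so failures
        # are swallowed and treated as "no candidate result".
        candidate_hits: Optional[List[str]]
        try:
            candidate_hits = self.candidate(query, k)
        except Exception:
            candidate_hits = None

        # Record any disagreement for offline review.
        if candidate_hits != legacy_hits:
            self.mismatches.append(query)

        if self.serve_from_candidate and candidate_hits is not None:
            return candidate_hits
        return legacy_hits
```

Because callers only ever see `search`, cutting over or rolling back is a change to one boolean rather than a redeploy of calling code.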

Validated at Production Scale

Replayed 8,000+ real student queries through both engines to prove ranking parity before cutover, and cut keyword-search latency by roughly 40%.
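
One plausible way to quantify ranking parity in a replay like this is top-k overlap between the two engines' results. The sketch below assumes that metric, along with a hypothetical 0.9 threshold and a `(query, k) -> list of chunk ids` backend signature. The metric actually used in the migration may differ.

```python
from typing import Callable, Iterable, List, Tuple

# Assumed shape of a search backend: (query, k) -> ranked chunk ids.
SearchFn = Callable[[str, int], List[str]]


def overlap_at_k(a: List[str], b: List[str], k: int) -> float:
    """Fraction of the top-k results shared by two rankings (order-insensitive)."""
    top_a, top_b = set(a[:k]), set(b[:k])
    if not top_a and not top_b:
        return 1.0  # both engines returned nothing: treat as agreement
    return len(top_a & top_b) / max(len(top_a), len(top_b))


def replay(
    queries: Iterable[str],
    old: SearchFn,
    new: SearchFn,
    k: int = 10,
    threshold: float = 0.9,  # hypothetical pass mark
) -> Tuple[float, List[str]]:
    """Run every query through both engines; return mean overlap and failing queries."""
    scores: List[float] = []
    failures: List[str] = []
    for q in queries:
        score = overlap_at_k(old(q, k), new(q, k), k)
        scores.append(score)
        if score < threshold:
            failures.append(q)
    mean = sum(scores) / len(scores) if scores else 1.0
    return mean, failures
```

The list of failing queries is the useful output in practice: it points at the specific queries where tokenization or scoring differences need a closer look before cutover.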

Multimodal Retrieval

Rebuilt the pipeline so lecture text and exam-image content surface together, re-embedding thousands of page images and fixing an embedding defect that was burying real results.

Full-Stack Cost Observability

Instrumented every model and embedding call across five providers for per-course, per-feature cost attribution, surfaced in live faculty and developer dashboards.
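
To make the attribution idea concrete, here is a minimal sketch that rolls per-call token usage up into cost per (course, feature) pair. The record fields, the model names, and the prices in `PRICES` are placeholders invented for the example, not the real providers' rates or the system's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

# Placeholder prices in USD per million tokens: (input, output).
PRICES: Dict[str, Tuple[float, float]] = {
    "chat-model": (3.0, 15.0),
    "embedding-model": (0.1, 0.0),
}


@dataclass(frozen=True)
class Usage:
    """One model or embedding call, tagged with where it came from."""

    course: str
    feature: str
    model: str
    input_tokens: int
    output_tokens: int


def cost_usd(u: Usage) -> float:
    """Cost of a single call under the placeholder price table."""
    in_price, out_price = PRICES[u.model]
    return (u.input_tokens * in_price + u.output_tokens * out_price) / 1_000_000


def attribute(records: Iterable[Usage]) -> Dict[Tuple[str, str], float]:
    """Sum cost per (course, feature) across all recorded calls."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for u in records:
        totals[(u.course, u.feature)] += cost_usd(u)
    return dict(totals)
```

Tagging each call at the point where it is made keeps the dashboard query a plain group-by, whichever provider served the request.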

Streaming Latency Optimization

Reordered the response-streaming path to cut perceived latency by about 35%, improving time-to-first-token on every conversation.

Stack

Python
FastAPI
PostgreSQL
ParadeDB / pgvector
Elasticsearch
Pinecone
Anthropic Claude
Voyage AI
Docker
