Bibliography

What SourcePrep was built on. Notes on the papers, repositories, essays, and standards that shaped the project.

A working list of the papers, repositories, essays, and standards SourcePrep draws on. Each entry includes a one-line note on how it shaped the project — and what we changed when we disagreed.

Retrieval & Long Context

Why context engineering matters more than raw context size — and what changes when language models meet long, noisy windows.

Paper · Liu et al. · TACL 2024

Lost in the Middle: How Language Models Use Long Contexts

Liu et al. show that language models attend to the start and end of a long context far more than the middle. The finding sets the ceiling for any retrieval system that pads its context naïvely. SourcePrep ranks results by relevance and assembles them so the highest-scoring chunks bracket the prompt, never bury it.
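The placement idea is simple enough to sketch in a few lines. This is a toy illustration of bracket ordering, not SourcePrep's actual assembler, and the function name is hypothetical:

```python
def bracket_order(chunks_by_score):
    """Given chunks sorted best-first, place the strongest ones at the
    start and end of the assembled context and push the weakest toward
    the middle, where long-context models attend least."""
    ordered = [None] * len(chunks_by_score)
    front, back = 0, len(chunks_by_score) - 1
    for i, chunk in enumerate(chunks_by_score):
        if i % 2 == 0:          # even-ranked chunks fill from the front
            ordered[front] = chunk
            front += 1
        else:                   # odd-ranked chunks fill from the back
            ordered[back] = chunk
            back -= 1
    return ordered
```

Given five chunks ranked a through e, the two best end up bracketing the window and the weakest lands dead center.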

Paper · Han et al. · arXiv 2025

RAG vs. GraphRAG: A Systematic Evaluation and Key Insights

Han et al. compare flat-vector RAG against graph-augmented RAG across reasoning-heavy benchmarks and find that local community search wins on multi-hop questions. SourcePrep’s prep_search follows the same logic: vector hits seed the query, then a trace-graph hop expands the neighborhood before the final assembly.
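The seed-then-expand pattern can be sketched as a bounded breadth-first walk. The index class and graph shape below are stand-ins, not SourcePrep internals:

```python
from collections import deque

class VectorIndex:
    """Toy stand-in for a vector index; the real retrieval layer differs."""
    def __init__(self, ranked_ids):
        self.ranked_ids = ranked_ids

    def top_k(self, query_vec, k):
        return self.ranked_ids[:k]

def search_then_expand(query_vec, index, trace_graph, k=5, hops=1):
    """Vector hits seed the result set; a bounded breadth-first walk
    over the trace graph then expands the neighborhood before assembly."""
    seeds = index.top_k(query_vec, k)
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for neighbor in trace_graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen
```

With one hop the seed's direct neighbors join the result set; raising `hops` widens the neighborhood, which is where the multi-hop gains in Han et al. come from.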

Essay · Anthropic · Anthropic blog · 2024

Contextual Retrieval

Anthropic’s post argued that prepending a few lines of file-level context to each chunk before embedding reduced retrieval failures by 49% in their tests. SourcePrep’s semantic chunker now does exactly this — every chunk carries a synopsis prefix derived from its enclosing module so the embedding sees the same neighborhood the model will reason over.
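The mechanism is just string assembly before embedding. The prefix format below is illustrative, not SourcePrep's actual one:

```python
def contextualize(chunk_text, enclosing_path, module_synopsis):
    """Prepend file-level context to a chunk before embedding, so the
    embedding model sees the same neighborhood the LLM will later
    reason over (after Anthropic's Contextual Retrieval)."""
    prefix = f"File: {enclosing_path}\nSynopsis: {module_synopsis}\n\n"
    return prefix + chunk_text
```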

Compression & Levels of Detail

Why SourcePrep’s context assembler ladders code from full source down to one-line signatures, and the research that makes signature-only context defensible.

Paper · Ostby · arXiv 2026

Stingy Context: 18:1 Hierarchical Code Compression for LLM Auto-Coding

Stingy Context demonstrates that hierarchical level-of-detail extraction can compress code by 18:1 with negligible quality loss for auto-coding tasks. SourcePrep’s context assembler implements the same ladder: full source for the focal file, compressed forms for callees, and one-line signatures for everything else.
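A minimal sketch of the ladder, assuming symbols carry `source` and `signature` fields (the field names and level numbers here are assumptions, not SourcePrep's real API):

```python
def render_at_lod(symbol, level):
    """Illustrative level-of-detail ladder: level 1 = full source,
    level 2 = signature with body elided, anything higher = bare
    one-line signature."""
    if level == 1:
        return symbol["source"]
    if level == 2:
        return symbol["signature"] + "\n    ...  # body elided"
    return symbol["signature"]

def assemble_context(focal, callees, others):
    """Full source for the focal symbol, compressed callees,
    signatures for everything else, in that order."""
    parts = [render_at_lod(focal, 1)]
    parts += [render_at_lod(s, 2) for s in callees]
    parts += [render_at_lod(s, 4) for s in others]
    return "\n\n".join(parts)
```

The compression ratio comes from the tail: most symbols in a large repo land in the signature-only tier.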

Repository · GitHub

Aider-AI/aider

Aider’s repo-map prunes a project into one-line signatures and ranks the visible set per turn. It is public proof that signature-only context (LOD 4) holds up in production across the open-source agentic-coding community. SourcePrep’s context assembler implements a similar selection step on top of the trace graph rather than a flat AST extract.

Repository · GitHub

microsoft/LLMLingua

LLMLingua-2 uses a small BERT classifier to drop the lowest-information tokens from a prompt without losing meaning. SourcePrep runs it over Markdown and docstrings while letting code chunks flow through the structural LOD ladder — two compressors, one assembly. The repo hosts both versions; v2 is what SourcePrep adopts.
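The two-compressor dispatch can be sketched as below. Both compressors are crude stand-ins: the real system runs LLMLingua-2's learned token classifier on prose and the structural LOD ladder on code, neither of which is reproduced here:

```python
def prune_tokens(text):
    # Stand-in for LLMLingua-2: drop a few stop words. The real model
    # scores every token's informativeness with a trained classifier.
    filler = {"the", "a", "an", "is", "very"}
    return " ".join(w for w in text.split() if w.lower() not in filler)

def to_signature(code):
    # Stand-in for the structural LOD ladder: keep only the first line.
    return code.splitlines()[0]

def compress_chunk(chunk):
    """Route prose through token pruning and code through structural
    compression: two compressors, one assembly."""
    if chunk["kind"] in ("markdown", "docstring"):
        return prune_tokens(chunk["text"])
    return to_signature(chunk["text"])
```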

Further reading

Paper

Repoformer: Selective Retrieval for Repository-Level Code Completion

Wu et al. · ICML 2024

Validates score-gated retrieval — knowing when *not* to fetch context improves accuracy.

Paper

GraphCoder: Code Completion via Code Context Graph-based Retrieval

Liu et al. · ASE 2024

Baseline for graph-vs-embedding retrieval comparisons.

Paper

RepoHyper: Search-Expand-Refine on Semantic Graphs for Repository-Level Code Completion

Phan et al. · arXiv 2024

The Search→Expand→Refine pipeline maps 1:1 onto SourcePrep’s search → trace expansion → LOD assembly.

Paper

STALL+: Boosting LLM-based Repository-Level Code Completion with Static Analysis

Liu et al. · arXiv 2024

Static-analysis-at-prompting pattern; mirrors SourcePrep’s use of trace-graph import edges to drive dependency-aware retrieval.

Paper

In Line with Context: Repository-Level Code Generation via Context Inlining

Guo et al. · arXiv 2026

Flagged as a Phase-2 enhancement — inline callees/callers on top of existing LOD results.

Paper

Long Context Compression with Activation Beacon

Zhang et al. · ICLR 2024

Model-internal KV compression — explicitly complementary to SourcePrep’s pre-prompt compression layer.

Paper

LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

Pan et al. · ACL 2024

BERT-classifier token pruning. Adopted as the language/docs compressor.

Paper

On the Impacts of Contexts on Repository-Level Code Generation

Hai et al. · NAACL Findings 2025

Empirical evidence that signatures + docstrings are the highest-ROI context type.

Repo

yamadashy/repomix

GitHub

Production tree-sitter compression at ~70% reduction — evidence that LOD extraction is practical at scale.

Repo

YerbaPage/LongCodeZip

GitHub

Evaluated as an off-the-shelf compressor; rejected due to 7B-model dependency incompatible with local-first architecture.

Paper

CodeRAG: Supportive Code Retrieval on Bigraph

arXiv 2025

Bigraph retrieval reference — supports the case for graph-structured code representation.

Code Structure & Chunking

Why chunking on AST boundaries beats character splits for code, and how structural awareness changes retrieval quality.

Repository · GitHub

garrytan/gbrain

Garry Tan’s gbrain pairs Savitzky–Golay smoothing for semantic boundary detection with reciprocal-rank-fusion across multiple query expansions. Reading it shaped SourcePrep’s current chunker: smooth the similarity curve, cut on local minima, fuse vector and keyword hits with RRF.
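The fusion half of that recipe is small enough to show whole. This is standard reciprocal-rank fusion, not code from either project:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal-rank fusion: each result list contributes 1/(k + rank)
    per document; k=60 is the conventional constant from the original
    RRF paper. Documents ranked well by several lists float to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Fusing a vector ranking with a keyword ranking this way rewards documents both retrievers agree on without needing their scores to be comparable.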

Paper · Wang et al. · arXiv 2025

cAST: Enhancing Code RAG with Structural Awareness

cAST shows that chunking on AST boundaries produces meaningfully better embeddings than fixed-window splits, particularly for languages with strong nesting structure. SourcePrep’s tree-sitter chunker is grounded in this finding — a chunk never splits mid-function and the chunk header carries the full enclosing path.

Paper · Edge et al. · Microsoft Research · 2024

GraphRAG: From Local to Global — A Graph-RAG Approach to Query-Focused Summarization

Edge et al. layer entity extraction, community detection, and per-community summarization to make a knowledge graph queryable as a hierarchy. SourcePrep’s atlas does the same trick on code: directories and modules become communities, each with a generated synopsis that the assembler can hand to the model in place of the underlying files.

Concepts, Knowledge & Standards

Why SourcePrep treats concepts as first-class artifacts, where the protocol surface comes from, and the older work that grounds the system in something deeper than recent papers.

Paper · Guo et al. · ICLR 2021

GraphCodeBERT: Pre-training Code Representations with Data Flow

Guo et al. show that pre-training code models on data-flow graphs (not just token streams) measurably improves downstream code understanding. SourcePrep’s trace index encodes the same intuition by carrying control- and data-flow edges alongside symbol references, so retrievers can hop on dependency, not just on text similarity.
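Hopping on dependency rather than text similarity amounts to filtering edges by type. The edge list and names below are illustrative; SourcePrep's real trace-index schema is not shown here:

```python
# Hypothetical typed edges: (source symbol, target symbol, edge kind).
EDGES = [
    ("parse_config", "load_file", "calls"),     # control-flow edge
    ("load_file", "Config.path", "reads"),      # data-flow edge
    ("parse_config", "Config", "references"),   # symbol reference
]

def hop(edges, node, kinds):
    """Follow only edges of the requested kinds out of a node, so a
    retriever expands on dependency instead of text similarity."""
    return {dst for src, dst, kind in edges if src == node and kind in kinds}
```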

Book · Nonaka & Takeuchi · Oxford University Press · 1995

The Knowledge-Creating Company (SECI Model)

Nonaka & Takeuchi’s SECI model frames organizational knowledge as a four-step cycle: socialize, externalize, combine, internalize. SourcePrep’s concepts feature is a literal externalization tool — the tacit "we don’t do it that way" assumptions in a team’s head become typed, anchored, testable artifacts that downstream agents can read.

Specification · Anthropic · modelcontextprotocol.io · 2024

Model Context Protocol

MCP is the protocol surface SourcePrep ships its primary interface on. Every prep_* tool, the resources system, and the per-client context budgets are MCP-shaped from the ground up. Without this spec there would be no SourcePrep MCP server, and no surface to advertise to any agent in any IDE.
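For a sense of what "MCP-shaped" means in practice: MCP exposes tools as name, description, and JSON-Schema input. The descriptor below is hypothetical; the parameter names are illustrative, not SourcePrep's actual schema:

```python
# Hypothetical prep_* tool descriptor in the shape an MCP tools/list
# response uses. Field contents are assumptions for illustration.
PREP_SEARCH_TOOL = {
    "name": "prep_search",
    "description": "Hybrid vector + trace-graph search over the indexed repo.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "max_results": {"type": "integer", "default": 10},
        },
        "required": ["query"],
    },
}
```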

Further reading

Paper

The Program Dependence Graph and its Use in Optimization

Ferrante, Ottenstein, Warren · ACM TOPLAS · 1987

Classical PDG reference; grounds SourcePrep’s combined control-flow + data-flow trace graph.

Paper

From Louvain to Leiden: Guaranteeing Well-Connected Communities

Traag, Waltman, van Eck · Scientific Reports · 2019

Theoretical backing for community-detection-driven concept clustering.

Book

Formal Concept Analysis: Mathematical Foundations

Ganter & Wille · Springer · 1999

Mathematical justification for lattice-based concept organization.

Paper

Towards a Theory of the Comprehension of Computer Programs

Ruven Brooks · IJMMS · 1983

Top-down program comprehension theory — the cognitive basis for hypothesis-driven retrieval.

Paper

Stimulus Structures and Mental Representations in Expert Comprehension of Computer Programs

Nancy Pennington · Cognitive Psychology · 1987

Bottom-up comprehension counterpart to Brooks; SourcePrep’s structural trace index supports this mode.

Paper

The Magical Number Seven, Plus or Minus Two

George A. Miller · Psychological Review · 1956

Cognitive-load rationale for concept clustering at human-readable cardinality.

Essay

Documenting Architecture Decisions

Michael Nygard · cognitect.com · 2011

ADR template convention; SourcePrep concepts extend ADRs beyond per-node decisions.

Paper

LLMs4OL Challenge — Large Language Models for Ontology Learning

ISWC · 2024

Establishes SOTA for automated concept extraction; informed SourcePrep’s hybrid embedding+LLM concept-discovery pipeline.

Paper

Traceability Transformed: Generating More Accurate Links with Pre-Trained BERT Models (T-BERT)

arXiv 2021

Researched for requirements↔code linking; rejected as too heavy for the SourcePrep architecture but kept as a baseline.

Spec

NASA SWE-072: Bidirectional Traceability

NASA SWE Handbook

Grounds SourcePrep’s curated traceability framework in an established engineering standard.

Spec

Agent Client Protocol (ACP)

agentclientprotocol.com

Zed-backed standard — SourcePrep’s multi-editor integration target.

Spec

Agent-to-Agent Protocol (A2A)

Google / Linux Foundation · a2a-protocol.org · 2025

Identified as a future Layer 4 target for cross-agent discovery.

Spec

SARIF 2.1.0 — Static Analysis Results Interchange Format

OASIS · 2020

SARIF-in / SARIF-out enrichment is a shipped prep_audit capability.

Spec

OCSF — Open Cybersecurity Schema Framework

ocsf.io

Alternative audit-export format; AWS/Splunk-backed.

Spec

agents.md

agents.md

Emerging convention for agent-facing context files — SourcePrep auto-generates AGENTS.md via rules_generator.py.

Paper

KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment

Lu & Wang · NeurIPS 2025

Multi-agent KG enrichment that parallels SourcePrep’s multi-pass enrichment pipeline.