LIVE Hugging Face Spaces

CodeLens

Chat with Any GitHub Repository

Clone any public GitHub repo, ask it questions in plain English, get answers with file-level citations — so you always know exactly where to look. A RAG pipeline built for codebases, because reading a new repo cold is genuinely miserable.

Python LangChain ChromaDB Groq API sentence-transformers GitHub API Gradio

Where this started

I wanted to build tools that help people. Genuinely noble intention. So I sat down, thought very hard about what problems people have, and ran through a mental list of big ideas: healthcare, education, climate, finance—

Then it hit me. I'm people. And I have a problem that comes up every single time I pick up a new codebase. I spend the first hour doing nothing but grep-ing around, opening random files, trying to figure out where things live. "Where's the auth logic?" "How does data get from the API to the database?" "What does this file even do?"

What if you could just... ask it?

That was CodeLens. Clone any public GitHub repo, index all the code into a vector database, and chat with it in plain English. Get answers with file-level citations so you always know exactly where to look next. The RAG pipeline for codebases. Simple idea, real problem, let's build it.

Phase 1 & 2

Ingestion — getting the right code into the right chunks

The first design question: how do you get a repo's contents without making 400 API calls? GitHub's Git Trees API has a recursive mode: one request returns the full file tree, every path in the repo. No directory traversal loop and, for ordinary repos, no pagination (very large trees come back with a truncated flag set instead of the full listing). That's what I used.
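A minimal sketch of that single-request approach, using only the standard library. The function names, the default ref of "main", and the sample response are illustrative assumptions, not the project's actual code; the endpoint path and the truncated flag follow GitHub's public REST API.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def tree_url(owner: str, repo: str, ref: str = "main") -> str:
    """Build the recursive Git Trees URL for one branch, tag, or SHA."""
    return f"{API_ROOT}/repos/{owner}/{repo}/git/trees/{ref}?recursive=1"


def blob_paths(tree_response: dict) -> list[str]:
    """Return file paths from a tree response, skipping directory entries."""
    if tree_response.get("truncated"):
        # GitHub sets this flag when the tree exceeds its size limits;
        # a real pipeline would need a fallback for that case.
        print("warning: tree was truncated, some files are missing")
    return [e["path"] for e in tree_response.get("tree", []) if e["type"] == "blob"]


def fetch_tree(owner: str, repo: str, ref: str = "main") -> dict:
    """Fetch the full tree in one request (needs network access)."""
    req = urllib.request.Request(
        tree_url(owner, repo, ref),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Offline example: a hand-written response in the shape the API returns.
sample = {
    "truncated": False,
    "tree": [
        {"path": "src", "type": "tree"},
        {"path": "src/auth.py", "type": "blob"},
        {"path": "README.md", "type": "blob"},
    ],
}
print(blob_paths(sample))
```

Unauthenticated requests are rate-limited, so a deployed version would probably want to send a token as well.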

Then the filtering question: not everything in a repo belongs in a code index. Binary files, lock files, build artifacts, node_modules — if someone asks "where is auth handled," they don't want matches from package-lock.json. The ingestion pipeline filters by extension (Python, JS, TS, Go, Rust, Markdown, and a handful of others) and skips directories like dist, .git, build, and __pycache__. Only source code and documentation go into the index.
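The filter can be sketched in a few lines. The exact extension and directory sets below are assumptions built from the examples named above; the real lists are longer.

```python
from pathlib import PurePosixPath

# Illustrative sets only: the write-up names some entries, not the full lists.
ALLOWED_EXTENSIONS = {".py", ".js", ".ts", ".go", ".rs", ".md"}
SKIPPED_DIRS = {"dist", ".git", "build", "__pycache__", "node_modules"}


def should_index(path: str) -> bool:
    """True if a repo-relative path looks like source code or documentation."""
    p = PurePosixPath(path)
    # Reject anything that lives under a skipped directory at any depth.
    if any(part in SKIPPED_DIRS for part in p.parts[:-1]):
        return False
    return p.suffix.lower() in ALLOWED_EXTENSIONS


paths = [
    "src/auth.py",
    "package-lock.json",
    "node_modules/lodash/index.js",
    "docs/README.md",
    "build/app.js",
]
indexed = [p for p in paths if should_index(p)]
print(indexed)
```

Checking directory names against every path component, not just the first, is what keeps a nested node_modules or build folder out of the index.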

Then chunking. LangChain's RecursiveCharacterTextSplitter has a language-aware mode that splits code along meaningful boundaries — for Python, it prefers to break on class and function definitions; for JavaScript, on function and arrow function blocks. This matters: a chunk that cuts through the middle of a function is much harder for the retrieval model to interpret than one that contains a complete unit. Raw character-count splitting doesn't know what a function boundary is. The language-aware version does.
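To make the boundary idea concrete, here is a deliberately simplified stand-in written with the standard library. It is not LangChain's implementation (in LangChain the equivalent entry point is RecursiveCharacterTextSplitter.from_language); it only shows what "split on definitions, then pack up to a size limit" means. The function name and the max_chars value are assumptions.

```python
import re

# Zero-width match before any line that starts a top-level class or function.
BOUNDARY = re.compile(r"(?m)^(?=(?:class|def|async def)\s)")


def split_python_source(source: str, max_chars: int = 800) -> list[str]:
    """Split at top-level definitions, then pack pieces up to max_chars.

    A single definition longer than max_chars is kept whole here; a real
    splitter would recurse into it with finer separators.
    """
    pieces = [p for p in BOUNDARY.split(source) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = ""
        current += piece
    if current:
        chunks.append(current)
    return chunks


source = (
    "import os\n"
    "\n"
    "def load(path):\n"
    "    return open(path).read()\n"
    "\n"
    "class Auth:\n"
    "    def check(self, token):\n"
    "        return bool(token)\n"
)
chunks = split_python_source(source, max_chars=80)
for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---\n{chunk}")
```

The indented method inside the class does not trigger a split, so the class stays in one chunk, which is the property that matters for retrieval.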

Jupyter notebooks needed special handling. Raw .ipynb files are JSON blobs containing code cells, markdown cells, cell outputs, error tracebacks, and embedded images, most of which are noise for a code search index. The pipeline pre-processes notebooks by stripping the outputs and keeping only the code and markdown source, then treats the result like any other source file.
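A sketch of that pre-processing step, assuming the standard nbformat layout (a top-level cells list, each cell with a cell_type and a source). The function name and the sample notebook are made up for illustration.

```python
import json


def notebook_to_text(raw: str) -> str:
    """Keep code and markdown cell sources; drop outputs and other cell types."""
    nb = json.loads(raw)
    parts = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") not in ("code", "markdown"):
            continue
        src = cell.get("source", "")
        # nbformat allows source as a single string or a list of lines.
        text = "".join(src) if isinstance(src, list) else src
        if text.strip():
            parts.append(text.rstrip())
    return "\n\n".join(parts)


raw_notebook = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Load data\n"]},
        {
            "cell_type": "code",
            "source": ["import pandas as pd\n", "df = pd.read_csv('x.csv')"],
            "outputs": [{"output_type": "stream", "text": ["1000 rows\n"]}],
        },
        {"cell_type": "raw", "source": ["ignored"]},
    ],
    "nbformat": 4,
})
print(notebook_to_text(raw_notebook))
```

Because outputs live inside each code cell rather than in cells of their own, simply never reading the outputs key is enough to drop them.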

Two small snags: Language.SQL wasn't in LangChain's Language enum, so SQL files fell back to plain text splitting — fine. And I managed to call a method named _assert_ready() before the line that actually defined it. Got an AttributeError, moved the definition above the call, moved on.

Phase 3

The RAG stack — tool choices and the reasoning behind them

Three decisions to make before any retrieval could happen: vector store, embedding model, and retrieval strategy.

ChromaDB for the vector store. It runs in-process — no server to spin up, no Docker, no config. Persistence is just a local directory. For a developer tool where session latency matters, having the store embed directly into the app process is a real advantage. It's also fast enough for repo-sized datasets (tens of thousands of chunks at most).

sentence-transformers (all-MiniLM-L6-v2) for embeddings. It's small, runs on CPU, and loads in under a second. A larger model would improve retrieval precision but would triple the cold-start time on Hugging Face Spaces' free tier. For a tool where the first user action is "paste a GitHub URL and wait," that tradeoff matters. Small and fast was the right call here.

Top-k retrieval (k=5). Each query pulls the 5 most semantically similar chunks. Every chunk carries metadata: file path, chunk index, and a line-range estimate. The generation prompt instructs the LLM to reference those file paths explicitly in its answer — that's the citation mechanism. There's no post-processing magic: the model is told what files it's reading from, and it includes those in its response. Simple, but it works.
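The retrieval-plus-citation idea, reduced to a toy that runs without a vector store. The two-dimensional embeddings, the Chunk fields, and the prompt wording are all assumptions for illustration; in the real pipeline ChromaDB does the similarity search and the embeddings come from the sentence-transformers model.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    path: str              # file path, surfaced to the LLM for citations
    index: int             # position of the chunk within its file
    embedding: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunks: list[Chunk], k: int = 5) -> list[Chunk]:
    """Return the k chunks most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c.embedding), reverse=True)
    return ranked[:k]


def build_prompt(question: str, retrieved: list[Chunk]) -> str:
    """Label every chunk with its file path so the model can cite it."""
    context = "\n\n".join(f"[{c.path} (chunk {c.index})]\n{c.text}" for c in retrieved)
    return (
        "Answer using only the context below. "
        "Cite the file paths you relied on.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


chunks = [
    Chunk("def verify_token(t): ...", "src/auth.py", 0, [1.0, 0.0]),
    Chunk("def connect(): ...", "src/db.py", 0, [0.0, 1.0]),
    Chunk("app.route('/login')", "src/routes.py", 2, [0.7, 0.7]),
]
hits = top_k([1.0, 0.1], chunks, k=2)
print(build_prompt("Where is auth handled?", hits))
```

The citation mechanism is nothing more than the bracketed path labels in the context: the model can only name files it was shown.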

Groq + Llama 3.3 70B for generation. Low latency at the inference layer matters here — users are already waiting for the repo to index. Adding a slow generation step on top would make the whole tool feel sluggish. Groq's inference is fast enough that answers arrive in under two seconds after retrieval.

Predictably: HuggingFaceEmbeddings had moved — again — from langchain_community to langchain_huggingface. I hit this exact ImportError on SentinelPH. I hit it again here. At some point I'll check the migration docs first. That point was not this project.

Phase 4

Conversational layer — and the LangChain incident

Single-turn Q&A wasn't enough. Real codebase exploration is conversational: you ask "where is auth handled?", get an answer, then follow up with "what calls that function?" or "are there tests for this?" Those follow-up questions depend on context from the previous turn — without it, every question retrieves against the raw query text alone and loses the thread.

The solution: a ConversationalRetrievalChain with ConversationBufferWindowMemory. The pipeline keeps a rolling window of the last few exchanges, condenses the current question plus conversation history into a standalone question, retrieves on that, and generates a grounded answer. Follow-up questions work because the retrieval step has context about what you were already asking about.
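Stripped of the framework, the two moving parts are a bounded history and a "rewrite this follow-up" step. The sketch below uses hypothetical names throughout (WindowMemory, standalone_question) and a fake LLM callable in place of a real model; it shows the shape of the idea, not the LangChain classes themselves.

```python
from collections import deque
from typing import Callable


class WindowMemory:
    """Keep only the last k question/answer exchanges."""

    def __init__(self, k: int = 3):
        self.turns: deque = deque(maxlen=k)

    def add(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))

    def as_text(self) -> str:
        return "\n".join(f"User: {q}\nAssistant: {a}" for q, a in self.turns)


def standalone_question(
    memory: WindowMemory, follow_up: str, llm: Callable[[str], str]
) -> str:
    """Rewrite a follow-up so it can be retrieved on without the history."""
    if not memory.turns:
        return follow_up  # first turn: nothing to condense
    prompt = (
        "Rewrite the follow-up as a standalone question, "
        "using the conversation for context.\n\n"
        f"{memory.as_text()}\n\nFollow-up: {follow_up}\nStandalone question:"
    )
    return llm(prompt).strip()


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call, so the example runs offline.
    return " What calls verify_token in src/auth.py? "


memory = WindowMemory(k=2)
print(standalone_question(memory, "Where is auth handled?", fake_llm))
memory.add("Where is auth handled?", "In verify_token, src/auth.py.")
print(standalone_question(memory, "What calls that function?", fake_llm))
```

Retrieval then runs on the rewritten question, which is why "that function" can still find the right chunks.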

That pipeline worked. Then I ran pip install langchain to pull in a small utility, and it resolved to LangChain 1.3.1.

In LangChain 1.x, ConversationalRetrievalChain and ConversationBufferWindowMemory are both gone from the main langchain package. Not deprecated, not marked legacy: removed. The entire conversational layer was gone.

Rewrote it using LCEL: a RunnableWithMessageHistory wrapping a retrieval chain backed by ChatMessageHistory. The LCEL version is actually more readable — every step in the pipeline is explicit instead of buried in inherited method chains. Better architecture. Terrible timing. Pin your dependencies.

Phase 6

UI — four versions to get one right

The UX flow had a natural two-step shape: first you give CodeLens a repo, then you chat with it. These are distinct modes — indexing takes time, and the chat interface only makes sense once the index is ready. So the design needed to reflect that: a setup step, then a conversation.

v1 put both steps side by side — repo input on the left, chat on the right. Functional. I hated looking at it. It treated indexing as a background detail instead of a deliberate first step. Back to the drawing board.

v2 introduced a proper two-stage flow: a landing page where you enter the repo URL, then a full-screen chat interface that appears once indexing completes. Better concept. Several implementation problems.

The action buttons were orange. Bright, saturated, Gradio-default orange. I added CSS to override the color. The buttons stayed orange. Gradio's variant="primary" sets inline styles that override any external CSS. Fix: remove the variant parameter and style the buttons from scratch without triggering Gradio's defaults.

The landing page stretched edge-to-edge on wide screens. Fixed by wrapping the content in a 3-column gr.Row() with scale ratios 1:2:1 — the outer columns act as gutters, the center holds the content at a readable width.

The page went completely black on load. No error, no traceback. After a while I found it: show_progress=True was rendering a full-screen loading overlay before the UI had appeared. Changed to show_progress=False. Visible again.

v3 and v4 were incremental — refining the chat layout, improving how citations surface in the response, tightening spacing. By v4 it felt right. That's what's live.

Phase 8

Deployment — three crashes, one launch

Pushed to Hugging Face Spaces. It crashed three times before staying up.

Crash 1: The README.md metadata had colorFrom: cyan. Cyan is not a valid Hugging Face Space color. Changed to blue. Accepted without complaint.
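A pre-push check would have caught this one. The sketch below is an assumption-heavy illustration: the color set is the list given in the Hugging Face Spaces configuration reference as I understand it, and the front-matter parser only handles plain key: value lines, not full YAML.

```python
# Colors accepted for colorFrom / colorTo (assumed from the Spaces config docs).
VALID_COLORS = {"red", "yellow", "green", "blue", "indigo", "purple", "pink", "gray"}


def front_matter(readme: str) -> dict[str, str]:
    """Parse simple `key: value` pairs from the leading --- block."""
    lines = readme.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def color_errors(readme: str) -> list[str]:
    """List every color field whose value is not in the accepted set."""
    meta = front_matter(readme)
    return [
        f"{key}: {meta[key]!r} is not an accepted color"
        for key in ("colorFrom", "colorTo")
        if key in meta and meta[key] not in VALID_COLORS
    ]


readme = "---\ntitle: CodeLens\ncolorFrom: cyan\ncolorTo: blue\nsdk: gradio\n---\n# CodeLens\n"
print(color_errors(readme))
```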

Crash 2: requirements.txt pinned gradio==5.29.1, but Hugging Face Spaces ships with its own Gradio (6.14.0). The two versions conflicted and the Space wouldn't start. Fix: remove gradio from requirements entirely and let the platform manage it. Any Gradio Space should do this.

Crash 3: AttributeError: Language has no attribute 'R' — on startup, before any user interaction. The Spaces environment had a different LangChain version than my local, and Language.R didn't exist in it. Fixed with a hasattr(Language, attr) guard: if the language isn't in the installed version, skip it and fall back to plain text splitting.
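The guard pattern looks roughly like this. To keep the example runnable anywhere, Language here is a small stand-in enum that deliberately lacks an R member, mimicking the older install; the mapping and function name are hypothetical.

```python
from enum import Enum


class Language(Enum):
    """Stand-in for the installed library's enum; note there is no R member."""
    PYTHON = "python"
    JS = "js"
    GO = "go"


# Map extensions to member *names* (strings), so nothing is looked up at
# import time and a missing member cannot crash startup.
EXTENSION_TO_LANGUAGE_NAME = {".py": "PYTHON", ".js": "JS", ".go": "GO", ".r": "R"}


def language_for(extension: str):
    """Return the enum member if this install has it, else None (plain text)."""
    name = EXTENSION_TO_LANGUAGE_NAME.get(extension.lower())
    if name is None or not hasattr(Language, name):
        return None
    return getattr(Language, name)


for ext in (".py", ".r", ".sql"):
    print(ext, "->", language_for(ext))
```

The important detail is storing names rather than members in the mapping: a dict literal containing Language.R would raise the same AttributeError at import, before any guard could run.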

After crash 3, the Space stayed up. The lesson: the Hugging Face Spaces runtime is not the same as your local environment, and anything that touches LangChain enums or pinned library versions will find that out the hard way.

By The Numbers

  • 1 API call for full tree: GitHub recursive tree endpoint
  • k=5 chunks retrieved per query: top-k semantic similarity
  • 4 UI redesigns: v1 → v2 → v3 → v4
  • 3 deployment crashes: before first clean launch
  • 1 full pipeline rewrite: courtesy of LangChain 1.3.1
  • LangChain import hell: SentinelPH wasn't enough

What I'd do differently

  • Pin every dependency to an exact version from day one, especially LangChain. "Latest" is a gamble when you're mid-build: one unpinned pip install jumped to 1.x and took the conversational chain classes with it.
  • Check the Hugging Face Spaces runtime environment before writing environment-specific code. The installed LangChain version, the Gradio version, available system packages — all differ from local.
  • Design with the two-step flow from the start. v1 was a placeholder I knew wouldn't survive, and it wasted a session.
  • Add a hasattr guard for every LangChain enum from the beginning — Language, VectorStoreType, anything the platform might ship a different version of.
  • Swap in a larger embedding model behind a loading screen. all-MiniLM is fast but the retrieval quality shows its limits on deeply technical questions. The UX cost of a 5-second load is lower than the cost of poor answers.

What's next

  • Private repo support via GitHub OAuth — right now it's public repos only.
  • Persistent vector stores: re-indexing on every session is the biggest UX bottleneck. Caching indexed repos by commit SHA would eliminate the wait on repeat visits.
  • Code-graph retrieval: instead of pure semantic similarity, use the AST to understand call graphs and import relationships. "What calls this function?" should retrieve based on structure, not just text similarity.
  • A repo map panel alongside the chat — a tree view of the indexed structure so you can navigate the codebase visually while you ask questions.

CodeLens clones public repositories for indexing purposes only.
No code is stored beyond your session.