CodeLens
Chat with Any GitHub Repository
Clone any public GitHub repo, ask it questions in plain English, get answers with file-level citations — so you always know exactly where to look. A RAG pipeline built for codebases, because reading a new repo cold is genuinely miserable.
Where this started
I wanted to build tools that help people. Genuinely noble intention. So I sat down, thought very hard about what problems people have, and ran through a mental list of big ideas: healthcare, education, climate, finance—
Then it hit me. I'm people. And I have a problem that comes up every single time I pick up a new codebase. I spend the first hour doing nothing but grep-ing around, opening random files, trying to figure out where things live. "Where's the auth logic?" "How does data get from the API to the database?" "What does this file even do?"
What if you could just... ask it?
That was CodeLens. Clone any public GitHub repo, index all the code into a vector database, and chat with it in plain English. Get answers with file-level citations so you always know exactly where to look next. The RAG pipeline for codebases. Simple idea, real problem, let's build it.
Ingestion — getting the right code into the right chunks
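The first design question: how do you get a repo's contents without making 400 API calls? GitHub's Tree API has a recursive mode: one request, full file tree, every path in the repo. No directory traversal loop, no pagination. That's what I used. The one caveat: very large repos can exceed the API's size limit, in which case the response comes back with truncated set to true and the listing is incomplete.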
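A minimal sketch of that single request, using only the standard library. The helper names (tree_url, blob_paths, fetch_tree) and the default branch name are illustrative assumptions, not CodeLens's actual code; the endpoint shape and the truncated flag come from GitHub's REST API.

```python
import json
import urllib.request


def tree_url(owner, repo, ref):
    """Build the recursive Git Trees URL for a branch name, tag, or SHA."""
    return f"https://api.github.com/repos/{owner}/{repo}/git/trees/{ref}?recursive=1"


def blob_paths(tree_response):
    """Return file paths only; entries of type 'tree' are directories."""
    if tree_response.get("truncated"):
        # GitHub sets this flag when the repo is too large for one response.
        raise ValueError("tree listing was truncated; fall back to per-directory requests")
    return [entry["path"] for entry in tree_response["tree"] if entry["type"] == "blob"]


def fetch_tree(owner, repo, ref="main"):
    """One HTTP request for the whole file tree (hypothetical helper)."""
    request = urllib.request.Request(
        tree_url(owner, repo, ref),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Typical use would be blob_paths(fetch_tree("owner", "repo")), with the resulting paths handed to the filtering step.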
Then the filtering question: not everything in a repo belongs in a code index. Binary files, lock files, build artifacts, node_modules. If someone asks "where is auth handled," they don't want matches from package-lock.json. The ingestion pipeline filters by extension (Python, JS, TS, Go, Rust, Markdown, and a handful of others) and skips directories like dist, .git, build, and __pycache__. Only source code and documentation go into the index.
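The filter described above can be sketched in a few lines. The exact extension and directory sets below are assumptions for illustration (the real lists include "a handful of others"), and should_index is a hypothetical name.

```python
from pathlib import PurePosixPath

# Assumed allow-list; the real pipeline covers more extensions.
ALLOWED_EXTENSIONS = {".py", ".js", ".ts", ".go", ".rs", ".md"}
# Assumed skip-list, built from the directories named in the text.
SKIPPED_DIRS = {"dist", ".git", "build", "__pycache__", "node_modules"}


def should_index(path):
    """Return True if a repo-relative path belongs in the code index."""
    p = PurePosixPath(path)
    # Reject the file if any parent directory is on the skip-list.
    if any(part in SKIPPED_DIRS for part in p.parts[:-1]):
        return False
    return p.suffix.lower() in ALLOWED_EXTENSIONS
```

Because the check is an allow-list, package-lock.json and binaries drop out without needing their own rules.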
Then chunking. LangChain's RecursiveCharacterTextSplitter has a language-aware mode that splits code along meaningful boundaries: for Python, it prefers to break on class and function definitions; for JavaScript, on function and arrow function blocks. This matters: a chunk that cuts through the middle of a function is much harder for the retrieval model to interpret than one that contains a complete unit. Raw character-count splitting doesn't know what a function boundary is. The language-aware version does.
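To make the idea concrete, here is a simplified, standard-library sketch of separator-priority splitting. It is not LangChain's implementation: the separator list, the split_code name, and the greedy merge are assumptions chosen to show why definitions stay intact when they fit in a chunk.

```python
import re

# Assumed priority order: try the most meaningful boundary first.
PYTHON_SEPARATORS = ["\nclass ", "\ndef ", "\n\n", "\n", " "]


def split_code(text, max_chars, separators=PYTHON_SEPARATORS):
    """Split text into chunks of at most max_chars, preferring early separators."""
    if len(text) <= max_chars:
        return [text] if text else []
    for i, sep in enumerate(separators):
        # Lookahead split keeps each separator attached to the piece it starts.
        pieces = [p for p in re.split(f"(?={re.escape(sep)})", text) if p]
        if len(pieces) < 2:
            continue  # this separator does not divide the text; try the next one
        chunks, current = [], ""
        for piece in pieces:
            if len(current) + len(piece) <= max_chars:
                current += piece  # greedily pack small pieces together
                continue
            if current:
                chunks.append(current)
                current = ""
            if len(piece) <= max_chars:
                current = piece
            else:
                # Piece is still too big: recurse with finer-grained separators.
                chunks.extend(split_code(piece, max_chars, separators[i + 1:]))
        if current:
            chunks.append(current)
        return chunks
    # No separator helped: fall back to a hard character cut.
    return [text[j:j + max_chars] for j in range(0, len(text), max_chars)]
```

Joining the chunks reproduces the input exactly, and a function that fits within max_chars is never cut in half unless no coarser boundary exists.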
Jupyter notebooks needed special handling. Raw .ipynb files are JSON blobs containing code cells, markdown cells, cell outputs, error tracebacks, and embedded images, most of which are noise for a code search index. The pipeline pre-processes notebooks by stripping the outputs and keeping only the code and markdown cells, then treats the result like any other source file.
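A rough sketch of that notebook pre-processing, relying only on the standard .ipynb JSON layout (a top-level cells list whose entries carry cell_type and source). The notebook_to_text name and the choice to turn markdown into comments are assumptions for illustration.

```python
import json


def notebook_to_text(raw_ipynb):
    """Keep code and markdown sources from a notebook; drop outputs entirely."""
    notebook = json.loads(raw_ipynb)
    parts = []
    for cell in notebook.get("cells", []):
        cell_type = cell.get("cell_type")
        if cell_type not in ("code", "markdown"):
            continue  # raw cells and anything unknown are skipped
        source = cell.get("source", "")
        if isinstance(source, list):
            # The format allows source as a list of lines.
            source = "".join(source)
        if not source.strip():
            continue
        if cell_type == "markdown":
            # Assumption: render prose as comments so the result reads like a script.
            source = "\n".join("# " + line for line in source.splitlines())
        parts.append(source)
    # The "outputs" key is never read, so tracebacks and images never reach the index.
    return "\n\n".join(parts)
```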
Two small snags: Language.SQL wasn't in LangChain's Language enum, so SQL files fell back to plain text splitting, which was fine. And I managed to call a method named _assert_ready() before the line that actually defined it. Got an AttributeError, moved the definition above the call, moved on.
The RAG stack — tool choices and the reasoning behind them
Three decisions to make before any retrieval could happen: vector store, embedding model, and retrieval strategy.
ChromaDB for the vector store. It runs in-process — no server to spin up, no Docker, no config. Persistence is just a local directory. For a developer tool where session latency matters, having the store embed directly into the app process is a real advantage. It's also fast enough for repo-sized datasets (tens of thousands of chunks at most).
sentence-transformers (all-MiniLM-L6-v2) for embeddings. It's small, runs on CPU, and loads in under a second. A larger model would improve retrieval precision but would triple the cold-start time on Hugging Face Spaces' free tier. For a tool where the first user action is "paste a GitHub URL and wait," that tradeoff matters. Small and fast was the right call here.
Top-k retrieval (k=5). Each query pulls the 5 most semantically similar chunks. Every chunk carries metadata: file path, chunk index, and a line-range estimate. The generation prompt instructs the LLM to reference those file paths explicitly in its answer — that's the citation mechanism. There's no post-processing magic: the model is told what files it's reading from, and it includes those in its response. Simple, but it works.
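The retrieval-plus-citation step can be sketched without any vector database. This is a toy stand-in, not the ChromaDB query the project uses: the chunk dictionary keys, the prompt wording, and the function names are all assumptions, and real embeddings would come from the embedding model rather than hand-written vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunks, k=5):
    """Return the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
    return ranked[:k]


def build_prompt(question, retrieved):
    """Label every chunk with its file path so the model can cite it."""
    context = "\n\n".join(
        f"[{c['path']} | chunk {c['chunk_index']} | lines {c['lines']}]\n{c['text']}"
        for c in retrieved
    )
    return (
        "Answer using only the context below. "
        "Cite the file path of every chunk you rely on.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

The citation mechanism lives entirely in build_prompt: the paths are in the context, and the instruction asks the model to repeat them.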
Groq + Llama 3.3 70B for generation. Low latency at the inference layer matters here — users are already waiting for the repo to index. Adding a slow generation step on top would make the whole tool feel sluggish. Groq's inference is fast enough that answers arrive in under two seconds after retrieval.
Predictably: HuggingFaceEmbeddings had moved (again) from langchain_community to langchain_huggingface. I hit this exact ImportError on SentinelPH. I hit it again here. At some point I'll check the migration docs first. That point was not this project.
Conversational layer — and the LangChain incident
Single-turn Q&A wasn't enough. Real codebase exploration is conversational: you ask "where is auth handled?", get an answer, then follow up with "what calls that function?" or "are there tests for this?" Those follow-up questions depend on context from the previous turn — without it, every question retrieves against the raw query text alone and loses the thread.
The solution: a ConversationalRetrievalChain with ConversationBufferWindowMemory. The pipeline keeps a rolling window of the last few exchanges, condenses the current question plus conversation history into a standalone question, retrieves on that, and generates a grounded answer. Follow-up questions work because the retrieval step has context about what you were already asking about.
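Stripped of the framework, the windowed-memory and condense steps look roughly like this. It is a plain-Python sketch of the pattern, not the LangChain classes named above; WindowMemory, condense_prompt, the window size, and the prompt text are all assumed for illustration.

```python
from collections import deque


class WindowMemory:
    """Rolling window over the most recent question/answer exchanges."""

    def __init__(self, max_turns=3):
        # deque with maxlen silently discards the oldest turn when full.
        self.turns = deque(maxlen=max_turns)

    def add(self, question, answer):
        self.turns.append((question, answer))


def condense_prompt(memory, follow_up):
    """Build the prompt asking an LLM to rewrite a follow-up as a standalone question."""
    if not memory.turns:
        return follow_up  # first turn: nothing to condense
    history = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in memory.turns)
    return (
        "Given the conversation below, rewrite the follow-up as a standalone question.\n\n"
        f"{history}\n\nFollow-up: {follow_up}\nStandalone question:"
    )
```

The LLM's rewrite of the follow-up, not the raw follow-up text, is what gets embedded and sent to retrieval.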
That pipeline worked. Then I ran pip install langchain to pull in a small utility, and it resolved to LangChain 1.3.1.
In LangChain 1.x, ConversationalRetrievalChain and ConversationBufferWindowMemory were both removed. Not deprecated, not marked legacy: removed. The entire conversational layer was gone.
Rewrote it using LCEL: a RunnableWithMessageHistory wrapping a retrieval chain backed by ChatMessageHistory. The LCEL version is actually more readable: every step in the pipeline is explicit instead of buried in inherited method chains. Better architecture. Terrible timing. Pin your dependencies.
UI — four versions to get one right
The UX flow had a natural two-step shape: first you give CodeLens a repo, then you chat with it. These are distinct modes — indexing takes time, and the chat interface only makes sense once the index is ready. So the design needed to reflect that: a setup step, then a conversation.
v1 put both steps side by side — repo input on the left, chat on the right. Functional. I hated looking at it. It treated indexing as a background detail instead of a deliberate first step. Back to the drawing board.
v2 introduced a proper two-stage flow: a landing page where you enter the repo URL, then a full-screen chat interface that appears once indexing completes. Better concept. Several implementation problems.
The action buttons were orange. Bright, saturated, Gradio-default orange. I added CSS to override the color. The buttons stayed orange. Gradio's variant="primary" sets inline styles that override any external CSS. Fix: remove the variant parameter and style the buttons from scratch without triggering Gradio's defaults.
The landing page stretched edge-to-edge on wide screens. Fixed by wrapping the content in a 3-column gr.Row() with scale ratios 1:2:1. The outer columns act as gutters, and the center holds the content at a readable width.
The page went completely black on load. No error, no traceback. After a while I found it: show_progress=True was rendering a full-screen loading overlay before the UI had appeared. Changed to show_progress=False. Visible again.
v3 and v4 were incremental — refining the chat layout, improving how citations surface in the response, tightening spacing. By v4 it felt right. That's what's live.
Deployment — three crashes, one launch
Pushed to Hugging Face Spaces. It crashed three times before staying up.
Crash 1: The README.md metadata had colorFrom: cyan. Cyan is not a valid Hugging Face Space color. Changed to blue. Accepted without complaint.
Crash 2: requirements.txt pinned gradio==5.29.1, but Hugging Face Spaces ships with its own Gradio (6.14.0). The versions conflicted and the Space wouldn't start. Fix: remove gradio from requirements entirely and let the platform manage it. Any Gradio Space should do this.
Crash 3: AttributeError: Language has no attribute 'R', raised on startup, before any user interaction. The Spaces environment had a different LangChain version than my local, and Language.R didn't exist in it. Fixed with a hasattr(Language, attr) guard: if the language isn't in the installed version, skip it and fall back to plain text splitting.
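The guard is small enough to show in full. To keep the sketch runnable without LangChain installed, the Language enum below is a local stand-in that deliberately lacks an R member; the extension mapping and resolve_language are likewise hypothetical names.

```python
from enum import Enum


class Language(Enum):
    """Stand-in for LangChain's Language enum in an older install (no R member)."""
    PYTHON = "python"
    JS = "js"


# Assumed mapping from file extension to the enum member *name*.
# Storing names instead of members means this dict never touches a missing attribute.
EXTENSION_TO_LANGUAGE_NAME = {".py": "PYTHON", ".js": "JS", ".r": "R"}


def resolve_language(extension):
    """Return the enum member for an extension, or None to signal plain-text splitting."""
    name = EXTENSION_TO_LANGUAGE_NAME.get(extension.lower())
    if name is None or not hasattr(Language, name):
        return None  # unknown extension, or the installed version lacks this member
    return getattr(Language, name)
```

The important detail is that the lookup happens at call time by name, so importing the module can no longer crash on a member the installed version doesn't define.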
After crash 3, the Space stayed up. The lesson: the Hugging Face Spaces runtime is not the same as your local environment, and anything that touches LangChain enums or pinned library versions will find that out the hard way.
What I'd do differently
- → Pin every dependency to an exact version from day one, especially LangChain. "Latest" is a gamble when you're mid-build and the library ships breaking changes without major version bumps.
- → Check the Hugging Face Spaces runtime environment before writing environment-specific code. The installed LangChain version, the Gradio version, available system packages — all differ from local.
- → Design with the two-step flow from the start. v1 was a placeholder I knew wouldn't survive, and it wasted a session.
- → Add a hasattr guard for every LangChain enum from the beginning — Language, VectorStoreType, anything the platform might ship a different version of.
- → Swap in a larger embedding model behind a loading screen. all-MiniLM is fast but the retrieval quality shows its limits on deeply technical questions. The UX cost of a 5-second load is lower than the cost of poor answers.
What's next
- → Private repo support via GitHub OAuth — right now it's public repos only.
- → Persistent vector stores: re-indexing on every session is the biggest UX bottleneck. Caching indexed repos by commit SHA would eliminate the wait on repeat visits.
- → Code-graph retrieval: instead of pure semantic similarity, use the AST to understand call graphs and import relationships. "What calls this function?" should retrieve based on structure, not just text similarity.
- → A repo map panel alongside the chat — a tree view of the indexed structure so you can navigate the codebase visually while you ask questions.
CodeLens clones public repositories for indexing purposes only.
No code is stored beyond your session.