At Uniswap Labs’ biannual onsite, we dedicate most of the week to a company hackathon. It’s a time to think creatively, work on work-adjacent personal projects, and collaborate with others. This time around, I wrote a short script to gather and coalesce BigQuery table schemas into a text document for use with an LLM. An executable, the code, and a usage guide are available on my GitHub.

The problem to solve
As part of my work, I regularly use LLMs to help with data queries. We have a broad set of high-quality onchain and offchain data (the latter being proprietary app data) stored in Google BigQuery, and I write SQL to access and transform that data into actionable insights. LLMs like Claude and ChatGPT are very helpful for accelerating that work because they let me iterate on my code quickly.
An important part of using LLMs to maximum effect is learning how to prompt them. Part of this involves ensuring that your prompt has the necessary context for the LLM. Most simply, if you’re asking an LLM to write a SQL query that roughly takes the form SELECT (column_names) FROM (table_name), you need to provide the LLM the relevant column and table names. If you don’t, you’ll need to edit them in manually before running the query, which can be tedious.
What I do now
You could describe the columns and table names in words, but that’s pretty tedious. Instead, I typically provide the LLM a screenshot of the table schema, which shows the model the column names and their types. However, this has a few drawbacks:
- If you’re writing a query that requires multiple tables, which is common, it can be ineffective to provide many images to the model. Most models only allow four screenshots, and it can be hard to identify which screenshot corresponds to which table, or to explain the relationships between the screenshots.
- The model has to “tokenize” the screenshot, which (I think) introduces more error than text tokenization. This could lead to lower accuracy.
- Images are not “semantically dense” relative to the context tokens they consume. For example, the image of a table schema I tested used 2,205 tokens, while the text representation of the same schema used 668 tokens, less than a third as many. If you feed an LLM more tokens than strictly necessary, it will perform worse, all else equal.
- It takes a little bit of repetitive effort to find and provide the relevant table schema to the model each time. This is a small, but constant, annoyance.
What’s better
A new standard for providing context to models has emerged: the llms.txt file. An llms.txt file is essentially a compressed summary of documentation, especially useful for API providers that want to make it easier for developers to integrate their APIs with the help of LLMs. Here is Anthropic’s llms.txt file. Other AI tools like Cursor use similar concepts: you can define context in a ‘Cursor Rules’ file that guides the LLM’s output according to your preferences.
The project
I wanted to write a tool that could programmatically pull the schema for any table in our Google BigQuery projects into a single text file. The idea is to feed that one text file in alongside each of my prompts, so that any LLM I use has full information about the table names and schemas I work with.
I decided to write the project in Rust, just because I wanted to. No, it’s not the best language for the job. But I’ve been working on learning Rust, enjoy writing the language, and want to make more Rust contributions. Plus, it’s blazingly fast.
What I made
Conceptually, the script is very straightforward. Google provides relatively usable APIs that help users identify and collect information about Google BigQuery assets — exactly what I wanted to do. At a high level, I:
- Intake the user’s arguments, which include some config details (credentials, output filepath) and the tables that the user wants to collect (lns. 97-128)
- Connect to the Google BigQuery API and authenticate (lns. 130-160; see the first sketch after this list)
- Every table in Google BigQuery is defined by a unique project.dataset.table identifier, so I iterate through the user’s designated projects, datasets, and tables to collect all table names (lns. 163-257)
- Collect the schema for every table from the Google BigQuery API (lns. 259-300; see the second sketch after this list)
- Write every schema to a text file, with formatting (lns. 305-364).
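The authentication step is mostly boilerplate around a service-account key. Here is a minimal sketch, not the repo’s exact code, assuming the yup_oauth2 service-account flow and the read-only BigQuery scope; accessor names vary slightly across yup_oauth2 versions.

```rust
// Minimal auth sketch: exchange a service-account key for an OAuth token
// that can be attached to BigQuery REST requests as a bearer token.
// Assumes the yup-oauth2 and tokio crates; not the repo's exact code.
use yup_oauth2::{read_service_account_key, ServiceAccountAuthenticator};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Path to the service-account key JSON (a placeholder here).
    let key = read_service_account_key("service_account.json").await?;

    let auth = ServiceAccountAuthenticator::builder(key).build().await?;

    // The read-only BigQuery scope is enough for listing tables and schemas.
    let scopes = &["https://www.googleapis.com/auth/bigquery.readonly"];
    let access_token = auth.token(scopes).await?;

    // In recent yup_oauth2 versions the token string is optional.
    let bearer = access_token.token().ok_or("no access token returned")?;
    println!("got a token ({} chars)", bearer.len());
    Ok(())
}
```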
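And here is a rough sketch of the collect-and-write steps: pulling one table’s schema from the BigQuery tables.get REST endpoint with reqwest and serde, then rendering it as plain text. The identifiers, the BQ_TOKEN environment variable, and the output format are placeholders for illustration; the real tool builds the token with yup_oauth2 as above and loops over every table it discovered.

```rust
// Fetch-and-format sketch. Assumes reqwest (with its "json" feature),
// serde (with "derive"), and tokio. Nested RECORD fields are omitted.
use serde::Deserialize;

// Subset of the BigQuery tables.get response we care about.
#[derive(Deserialize)]
struct Table {
    schema: TableSchema,
}

#[derive(Deserialize)]
struct TableSchema {
    fields: Vec<Field>,
}

#[derive(Deserialize)]
struct Field {
    name: String,
    #[serde(rename = "type")]
    field_type: String,
    mode: Option<String>,
    description: Option<String>,
}

// Fetch one table's schema from the BigQuery REST API.
async fn fetch_schema(
    client: &reqwest::Client,
    token: &str,
    project: &str,
    dataset: &str,
    table: &str,
) -> Result<Table, reqwest::Error> {
    let url = format!(
        "https://bigquery.googleapis.com/bigquery/v2/projects/{project}/datasets/{dataset}/tables/{table}"
    );
    client.get(url).bearer_auth(token).send().await?.json::<Table>().await
}

// Render one schema as plain text for the context file.
fn format_schema(project: &str, dataset: &str, table: &str, t: &Table) -> String {
    let mut out = format!("TABLE {project}.{dataset}.{table}\n");
    for f in &t.schema.fields {
        let mode = f.mode.as_deref().unwrap_or("NULLABLE");
        let desc = f.description.as_deref().unwrap_or("");
        out.push_str(&format!("  {} {} ({}) {}\n", f.name, f.field_type, mode, desc));
    }
    out
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Token taken from an env var here for brevity, e.g. from
    // `gcloud auth print-access-token`; BQ_TOKEN is a made-up name.
    let token = std::env::var("BQ_TOKEN")?;
    let client = reqwest::Client::new();
    // Hypothetical identifiers, for illustration only.
    let (project, dataset, table) = ("my-project", "analytics", "orderflows");
    let t = fetch_schema(&client, &token, project, dataset, table).await?;
    std::fs::write("schemas.txt", format_schema(project, dataset, table, &t))?;
    Ok(())
}
```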
For this project, I used the clap, serde, reqwest, and yup_oauth2 crates for the first time 🙂
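If you haven’t used clap’s derive API, the argument-intake step looks roughly like the sketch below. The flag names are illustrative guesses rather than the tool’s actual interface.

```rust
// Argument-intake sketch using clap's derive API (clap with the "derive" feature).
use clap::Parser;

/// Dump BigQuery table schemas into a single text file for LLM context.
#[derive(Parser)]
struct Args {
    /// Path to a Google service-account key JSON file
    #[arg(long)]
    credentials: String,

    /// Where to write the combined schema file
    #[arg(long, default_value = "schemas.txt")]
    output: String,

    /// Tables to include, as project.dataset.table (repeat the flag per table)
    #[arg(long = "table")]
    tables: Vec<String>,
}

fn main() {
    let args = Args::parse();
    println!(
        "would write {} table schema(s) to {}",
        args.tables.len(),
        args.output
    );
}
```

Running something like cargo run -- --credentials key.json --table my-project.analytics.orderflows then parses straight into the Args struct, and clap generates --help text from the doc comments for free.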
Reflections
This was a fun project where I got to code on company time 🙂 But it also produced something usable and useful for me: I’ve already been using the file to accelerate my work. I can chat with the LLM I’m using more like I would with another employee, referencing “the Orderflows table” and specific columns and correcting its work in natural language rather than writing SQL myself. Here are some reflections on the project:
- My company is pushing to use AI to accelerate work, which is new to some people at the company. Having used AI for more than two years helped me understand what I needed in order to use it better.
- I think the context file I generated could save me up to 40 minutes per day on analysis-intensive days (plus a good amount of mental overhead). That’s roughly an 8% increase in productivity (assuming eight-hour days). While that might not sound like much, it’s being applied to a base where I’m already 1.7-2x more productive than I was 1.5 years ago. Improvements like these compound and yield 10x gains over time.
- A 10% productivity improvement for a typical tech employee is worth something like $30k/yr including benefits. Capital investments in internal AI tooling seem well, well worth the cost.
- I like writing Rust! As a beginner engineer, I don’t know all of the errors I might run into. Writing Rust with robust error handling and strong types feels (at the very least) very secure and reliable. When I write Python scripts, I often feel like I’m just praying they still execute when I return to them a few months later.
You can use this tool yourself! Check out the README and usage guide for more information.