Multi-agent chats as the Step Beyond ChatGPT

This Easter weekend, I forbade myself from working. I half-succeeded: although I made two AutoGen projects, neither was for my day job! #soproud Here goes: a cover letter generator and a multi-provider therapy session. (A friend needed both and I thought it would be a good distraction for us to convert a human problem into a technical problem. Because that's healthy.)

The concept behind each is simple:

  1. Set up a Poetry environment (ideally from a Copier template that makes the environment immediately pip-installable) and install pyautogen.
  2. Create a UserProxyAgent as a stand-in for the user. While by default, the UserProxyAgent prompts for human input every time it's invoked, it does not need to, and you can use it just to simulate the opening of the conversation.
  3. Break down the big task into subtasks, ideally each with a clear input, output, and instruction set.
  4. Create one agent per subtask, with a clear set of instructions, output requirements, and who to pass the baton to under what circumstances.
  5. If applicable: Create a set of "allowable speaker transitions", i.e. which agent can speak after this agent is done, and set up the GroupChat.
  6. Create a CLI script that will invoke the setup with the right environment variables and flexible input parameters.
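Step 6 can be as small as an argparse wrapper. Here is a minimal sketch of what I mean, not the actual script: the flag names, the defaults, and the choice of OPENAI_API_KEY as the required variable are all assumptions for illustration, and the chat itself is left as a placeholder.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface for one multi-agent run (flag names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Run a multi-agent chat.")
    parser.add_argument("--job-description", required=True, help="Path to the job description text.")
    parser.add_argument("--resume", required=True, help="Path to the resume text.")
    parser.add_argument("--output-dir", default="out", help="Folder for the finished cover letter.")
    parser.add_argument("--max-rounds", type=int, default=12, help="Upper bound on chat rounds.")
    return parser


def require_env(name: str) -> str:
    """Fail early, with a readable message, if a required environment variable is missing."""
    value = os.environ.get(name)
    if not value:
        raise SystemExit(f"Environment variable {name} is not set.")
    return value


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    require_env("OPENAI_API_KEY")  # assumed variable name
    # Placeholder: the agents and the GroupChat would be built and started here.
    print(f"Would run up to {args.max_rounds} rounds, writing to {args.output_dir}/")
    return 0
```

To use it as a script, call main() under the usual `if __name__ == "__main__":` guard.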

(I should really make a Copier template for this. Of course, it's a little complicated by the fact that the nature, prompt and setup of each agent is a little different each time, but there's sufficient similarity that it might be worth it.)

I want to talk about the cool parts of the process.

Custom prompting

I think this is the most important part of the process to get right, especially when making a multi-agent process for a highly personalized activity like therapy or job applications. Of course, you should follow the best practices for prompt engineering, but if this is a task you already have some experience in, you should strive to make the secret sauce explicit.

To take the example of the cover letter writing:

  • In the age of RLHF, most ChatGPT prose sounds the same; I wanted to make sure the cover letter was different. (Yes, I notice the irony.) So I gave the Critic agent a prompt that asked it to check for the presence of an authentic voice, and gave the Writer agent permission to engage in a little whimsy if requested. (Tuning it to just the right amount of whimsy was a little tricky, of course. I got a lot of wacky cover letters during the testing.)
  • I also wanted to make sure that the cover letter was good, so I asked the Critic to check for clear sentence structure, paragraph progression, veracity to the resume excerpts, and a few other things I like to have in my writing.

Or, of course - I know the kind of therapy I like, and my friend knew what kind of therapy he prefers. So we could make the prompts specific to our needs. This is a challenge in putting the prompts out there, in fact - both the cover-letter and the therapy session requirements can get a little too personal.

Agent transitions on the happy path and away from it

This is the trickiest part to me, as the finite-state machine transitions are defined once for both success and failure - which means the same definition has to permit both, at the risk of confusing the two.

  • On success, it's very easy to define the next agent to speak. In the cover letter example, the Critic agent would pass the baton to the Writer agent if the cover letter was good enough. In the therapy session, the UserProxyAgent would pass the baton to any Therapist agent if the user indicated that they were ready to move on.
  • On failure, however, things become less clear. If a single-use agent fails, especially if it fails on a tool invocation, it's difficult to define the next agent to speak. But this is, in part, because it's difficult to define what failure is. Let's use a few examples, in decreasing order of clarity:

    • In the cover letter example, the Job Description Ingester agent could miss an important detail for the cover letter submission (e.g. the company name or the requirement that the hiring agent be addressed by Mr. Banana lest they fail the candidate for not reading the JD closely enough).
    • Still in the cover letter example, the Critic agent could fail if they had no more criticism to give, didn't call the function to save the cover letter to the target folder, but still terminated the workflow.
    • Again in the cover letter example, the Critic agent could fail if the cover letter was not good enough, yet it didn't pick up on it. But what does "not good enough" mean? Some of that can be defined in structural terms (e.g. no intro), but some of it is inevitably subjective.
    • Finally, one of the therapy agents could provide bad advice - but what does "bad" therapeutic advice even mean, short of actively instigating harm to self or others?
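The clearer failures in the list above can at least be detected after the fact if the transition table is written down as plain data. This is a rough sketch, not my actual setup: plain strings stand in for agent objects, and the table, the agent names, and both helper functions are hypothetical.

```python
# Hypothetical transition table: each agent maps to the agents allowed to speak next.
ALLOWED_TRANSITIONS = {
    "user_proxy": ["jd_ingester"],
    "jd_ingester": ["writer"],
    "writer": ["critic"],
    # The Critic either sends the draft back to the Writer or ends the chat.
    "critic": ["writer"],
}
TERMINAL_AGENTS = {"critic"}  # only the Critic may end the workflow


def unreachable_agents(transitions, start):
    """Return agents that can never speak when the chat starts at `start`."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in transitions.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return set(transitions) - seen


def check_run(speakers, saved_letter):
    """Classify a finished run from its speaker order and whether the save tool was called."""
    for current, nxt in zip(speakers, speakers[1:]):
        if nxt not in ALLOWED_TRANSITIONS.get(current, []):
            return f"illegal transition: {current} -> {nxt}"
    if speakers[-1] not in TERMINAL_AGENTS:
        return f"ended on non-terminal agent: {speakers[-1]}"
    if not saved_letter:
        return "terminated without saving the cover letter"
    return "ok"
```

This catches the "terminated without saving" case and any off-table handoff. It does nothing for the subjective failures, which is rather the point of the list.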

From vibes to DeepEval evals

This was probably the toughest part, for two reasons: (1) developing the evals themselves is difficult, and (2) running them is expensive, so you don't want to run them too often, which means you get less feedback, less often, than you need.

(This is where Small Language Models could shine, in theory! But then you have to adapt the prompts to different models, and that's a whole other can of worms.)
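One way to stretch the eval budget is to run free, deterministic checks on every draft and save the LLM-based evals for drafts that pass them. A minimal sketch, assuming nothing about DeepEval itself: the criteria and thresholds below are made up for illustration and are not the rubric I gave the Critic.

```python
import re


def structural_checks(letter: str, company: str, max_words: int = 400) -> dict:
    """Cheap, deterministic checks on a cover letter draft (illustrative criteria only)."""
    paragraphs = [p for p in re.split(r"\n\s*\n", letter.strip()) if p.strip()]
    words = letter.split()
    return {
        "mentions_company": company.lower() in letter.lower(),
        "has_intro_body_close": len(paragraphs) >= 3,
        "within_length": 0 < len(words) <= max_words,
        # Unfilled template slots such as [Company] are a common structural failure.
        "no_placeholders": re.search(r"\[[^\]]*\]", letter) is None,
    }


def passes(letter: str, company: str) -> bool:
    """True only if every structural check passes."""
    return all(structural_checks(letter, company).values())
```

Only the subjective questions (voice, whimsy, persuasiveness) then need a model in the loop.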

State maintenance

This doesn't have a good answer yet, even though it's a problem that's been solved many times over. The Teachability feature is a sort-of too-smart solution to a different problem, which is that agents don't remember dynamically what you told them - but what you often want is a fully deterministic key-value store. Well, guess what - fully deterministic key-value stores are not exactly uncommon in the tech space! But AutoGen doesn't support any of them out-of-the-box, so there's a wide space of custom implementations.
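For a sense of how little "custom implementation" can mean, here is a sketch of a JSON-file-backed store. The class and its method names are hypothetical, and it assumes a single process and JSON-serializable values.

```python
import json
from pathlib import Path


class JsonKeyValueStore:
    """A tiny deterministic key-value store backed by one JSON file."""

    def __init__(self, path):
        self.path = Path(path)
        # Load existing state if the file is already there; otherwise start empty.
        self._data = json.loads(self.path.read_text()) if self.path.exists() else {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value):
        self._data[key] = value
        # Write through on every update so a crashed run keeps what it had stored.
        self.path.write_text(json.dumps(self._data, indent=2, sort_keys=True))
```

The get and set methods are small enough to expose to agents as tools, which would keep the state outside of anyone's context window.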


By default, there isn't much - on a run that completes gracefully, you get an object with .chat_history, otherwise you get a traceback. There's some default instrumentation, but none that appears easily extensible.


Each run cost between $0.05 and $0.50, depending on the length of the conversation, using gpt-4-turbo-preview. This is a little expensive - I still wouldn't hesitate to use it for a personal use case, but would likely balk at providing it as a free service to the general public. (Unless it's "bring your own API key", I suppose.)
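A back-of-the-envelope estimate makes the range easier to reason about. The prices in this sketch are assumptions (the list prices I believe applied to gpt-4-turbo-preview at the time, per 1,000 tokens), and the token counts in the example are invented rather than measured.

```python
# Assumed list prices in USD per 1,000 tokens; check the provider's current pricing.
PRICE_PER_1K_INPUT = 0.01
PRICE_PER_1K_OUTPUT = 0.03


def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost of one run from its total prompt and completion token counts."""
    cost = (prompt_tokens / 1000) * PRICE_PER_1K_INPUT
    cost += (completion_tokens / 1000) * PRICE_PER_1K_OUTPUT
    return round(cost, 4)


# A hypothetical long conversation: 30,000 prompt tokens and 5,000 completion tokens.
long_run = estimate_cost(30_000, 5_000)  # 0.30 + 0.15 = 0.45
```

Since every turn in a group chat resends the conversation so far, prompt tokens tend to dominate as the chat gets longer.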


Multi-agent workflows impress me. In specific use cases, they're a clear step above bare GPT-4 prompting - and even though they're not a panacea, the list of shortcomings is highly tractable. I'm looking forward to shortening it.


Cleaning up 5 years of genomics work in 36 hours

In September 2021, I spent the last 36 hours on a rollercoaster made of snakemake, {Rmarkdown}, {renv}, and conda/Docker. The ending had me think I might become the Joker. But the journey was fun.

What we were doing

@Laura_Kellman spent the last couple of years working on a cutting-edge cancer genomics project, accumulating many gigabytes of files and writing tens of RMarkdown notebooks. I volunteered to do a code review and functional replication on another machine.

Step 0: Gather everything

Where do you start when you want to do this? Gather everything.

  • Make a folder.
  • Gather data. If you have any idea where the data came from, great! If you don't, stick them in the pile and we'll figure it out later! Put them all in the data/ subfolder.
  • Gather code.
  • Gather software requirements. What do I need to do to re-create your laptop's computational environment?
  • Bake a cheesecake. It will help you in your moments of anguish.

Step 1: Version control everything, including large data files

To be able to track any changes to the scripts, as well as any alterations to the data generated, we initialized a git repository in RStudio and threw everything in it. And I mean everything, including - perhaps against best practice - data files of up to 1 GB.

(Yes, it would have been better to install git-lfs or git-annex. You certainly cannot do this with really big data, but we decided everything under 1 GB was fair game, despite RStudio's heartfelt protests. The idea is that this repository is perhaps not the version history we'll share, but rather a safety net beneath our efforts.)

Note that if you do this, you cannot upload the repository to GitHub straight away. You'll need to either purge the large files from the version history or discard this version history altogether (i.e. remove/rename the .git folder), re-initialize with an updated .gitignore, and git add only the things you want to share.
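Before re-initializing, it helps to know exactly which files are the problem. The sketch below is not something we ran at the time; the function name is hypothetical, and the 100 MB threshold reflects GitHub's per-file limit for regular pushes.

```python
from pathlib import Path

GITHUB_LIMIT_BYTES = 100 * 1024 * 1024  # GitHub rejects pushes containing files over 100 MB


def oversized_files(root, limit=GITHUB_LIMIT_BYTES):
    """Return (relative path, size in bytes) for files above `limit`, skipping .git/."""
    root = Path(root)
    found = []
    for path in root.rglob("*"):
        if ".git" in path.relative_to(root).parts or not path.is_file():
            continue
        size = path.stat().st_size
        if size > limit:
            found.append((path.relative_to(root).as_posix(), size))
    return sorted(found)
```

The returned paths are relative to the project root, so they can go into the updated .gitignore as they are.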

Step 1b: {here}

Hard-coded paths are common in academic code. Needless to say, they have to go! The main solution in this space is {here}, which basically walks up through the parent folders until it finds one with an .Rproj file or a .git folder, takes that as the project root, and then makes all paths relative to it.

In other words: read_csv(here("data/input.tsv")) is unambiguous, whereas read_csv("~/Documents/Projects/MPRA/Attempt3/data/input.tsv") only works on one machine, and only if you never move the project folders.
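The lookup is simple enough to sketch. This is a Python analogue of the idea, not {here}'s actual implementation: it checks only the two markers mentioned above, and both function names are mine.

```python
from pathlib import Path


def find_project_root(start=None):
    """Walk up from `start` until a folder holds a .git entry or an .Rproj file."""
    current = Path(start or Path.cwd()).resolve()
    for folder in [current, *current.parents]:
        if (folder / ".git").exists() or any(folder.glob("*.Rproj")):
            return folder
    raise FileNotFoundError(f"No project root found above {current}")


def here(*parts, start=None):
    """Build a path relative to the project root, e.g. here("data", "input.tsv")."""
    return find_project_root(start).joinpath(*parts)
```

Because the nearest marker wins, a script buried three folders deep resolves data/input.tsv to the same file as one sitting at the top of the project.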

Step 2: {renv} and R

{renv}, the preeminent dependency manager for R, was a clear choice. We installed it from GitHub, initialized it (renv::init()), installed RMarkdown to allow it to parse packages from RMarkdown sources with renv::hydrate(), let it install everything, and then saved it to a lockfile so that we could reproduce it at will (renv::snapshot()).

(I actually made a mistake here -- I hadn't checked what R version Laura had been using, nor had I updated the R that ran on my machine, so we ended up on R 4.0.2 instead of Laura's R 3.6.1 or the latest R 4.1.1. To my surprise, this presented less of a problem than I thought it would. Still, don't repeat my mistake! Think of your R version first.)

Why not conda? The main alternative to {renv} was conda with the conda-forge and bioconda channels. Historically, that route has been more painful and often incomplete, plus conda doesn't play nice with RStudio. In hindsight, though, conda was worth attempting, if only to see whether it could make the work with Snakemake easier.

Step 3: Snakemake

For the past several years, I've used Makefiles to document my data processing workflows. I'd read about Snakemake, but I'd never had time to try it out. Then the time came to help @Laura_Kellman optimize her PhD data processing workflow, so of course I thought now was the time to experiment. (Sorry, Laura.)

To be fair, Snakemake had quite a few features to offer over GNU Make! But I had only ever read the docs and had never actually deployed it. Rookie mistake.

Aside: the problems with Snakemake

  • Snakemake version -- the latest version requires using Mamba and Mambaforge, but on Windows, it still defaults to 5.4.0, which is more than a major version behind. (On OS X, it works correctly, at least, so long as you use Mamba.)
  • Snakemake and {renv} + {RMarkdown} really don't play nice. script: "X.Rmd" will fail without an explicit error; with the benefit of hindsight, I guess {RMarkdown}/{knitr} was only installed in the {renv} library and so the very first command that Snakemake was trying to execute was failing. It sure feels like it could tell me that, though, instead of failing with a meaningless StopIteration error and nothing at all in the log file!
  • Snakemake and {renv} + bare R don't play nice, but just in a regular way. By default, Snakemake will call any R script with Rscript --vanilla {script}, which intentionally skips loading .Rprofile, which is {renv}'s entire mechanism of action. To counteract that, add source(".Rprofile") to the top of the scripts.
  • Snakemake, {renv} + {RMarkdown} in a separate conda environment? Yeah, couldn't get that working at all.

So, what worked?

  • Snakemake as a sanity check of what goes where. Even if Snakemake didn't work, writing a Snakefile clarified where to start and which scripts depend on which. It made it easy to write a bare R script that would run all the notebooks in order, which is basically a low-tech Makefile.
  • Making scripts Snakemake-ready is a great preparation for deploying them into a computational cluster. We didn't have time to test it on Sherlock / SCG, but I believe Snakemake would make it easy. (Then again, that's what I thought at the start about this whole undertaking...)
  • Snakemake makes a pretty dependency graph. Look! It's pretty!

Sample Snakemake dependency graph from the Snakemake documentation.
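The "low-tech Makefile" idea boils down to a topological sort of that same graph. Ours was a bare R script; the sketch below shows the idea in Python with hypothetical notebook names, using the standard library's graphlib (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical notebooks: each maps to the notebooks whose outputs it needs.
DEPENDENCIES = {
    "01_import.Rmd": set(),
    "02_qc.Rmd": {"01_import.Rmd"},
    "03_model.Rmd": {"02_qc.Rmd"},
    "04_figures.Rmd": {"02_qc.Rmd", "03_model.Rmd"},
}


def run_order(dependencies):
    """Return a valid execution order; raises graphlib.CycleError on circular dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def render_commands(dependencies):
    """Build one Rscript command per notebook, in dependency order."""
    return [
        ["Rscript", "-e", f'rmarkdown::render("{notebook}")']
        for notebook in run_order(dependencies)
    ]
```

Each command can then be handed to subprocess.run(command, check=True), which stops at the first notebook that fails.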

Step 4: Docker (and {renv})

Having a cross-platform replication of the pipeline was already good news, but to make the project run anywhere, we wanted a Docker container for it.

(If you've never heard of Docker, imagine a blank-slate computer that you set up just to run your code. Instead of manually tuning it, you write up every step of the setup in an automated instruction file called the Dockerfile. That way, anyone can set up that same blank-slate computer and anyone can run your code!

...if this isn't clear, see Boettiger (2014) and Nuest et al. (2020).)

Here were the issues I encountered along the way:

  • Docker and {renv} play games, at least if your plan is to bind-mount the project in at runtime. By default, renv::restore() puts files in your project/renv/library/R-{version}/{system/architecture} - which of course won't work if you're mounting project/ over it. One way to get around this is to call renv::isolate(), which appears to place your local library in /usr/local/lib/R/site-library - but then you can't let .Rprofile run (because it will redirect your .libPaths() to the local renv/ folder), so you have to run your mounted scripts with Rscript --vanilla {script} and you can't explicitly invoke source(".Rprofile") or you'll break it. (You'll note that this is, ironically, the exact opposite of what you need to do to work around Snakemake + {renv} interoperability.)
  • Docker and Snakemake need Singularity to play nice. Singularity is a pain to install on bare Windows; having WSL2 helps.

Steps 5 through N: What we didn't do, but wanted to

There wasn't time to do everything. Three omissions stand out:

  • Actual code review. We made everything run, and we scrutinized some key variables along the way according to criteria that Laura knew but we didn't formally codify.
  • Automated tests. We didn't make logical tests with assertr or input/output tests with great_expectations.
  • Cluster deployment. Snakemake promised to make this reasonably easy, but of course by now I've learned not to trust Snakemake's claims. But I'm very excited to try it.


This was unreasonably fun, so much so that I'd consider a job doing nothing but helping bioinformaticians with their pipelines (I briefly looked for one - let me know if you know of any). It was also much, much slower than anticipated, and converted me into a Snakemake skeptic in a way I didn't anticipate. Looking forward to the next time!
