Monday, June 8, 2026

Profile-Driven Observability

a graph of a metric showing a regular series of dips

Postmortem Culture means getting more actionable followups out of each Postmortem, and a very frequent action item we look for is, "where did we not have the visibility that we wanted to have?" Adding metrics, log messages, and trace points are all fine action items to take after an outage. Surely there is something which could have been more clear and reduced the time to remediate.

Observability vendors already provide help in instrumenting a codebase, such as Honeycomb's Agent skills and MCP service. I'd like learnings from a Postmortem to flow back into the tooling like a profile-driven optimization: take our handling of recent incidents into account when suggesting instrumentation. Even if the tooling might not otherwise suggest adding visibility in a particular area, if it is something we have struggled with then it is worth adding more help.

Monday, June 1, 2026

Observability Agents and Postmortem Culture

A service I wish to will into existence would help us better leverage our observability work, our traces and metrics and logs, in incident response during an outage.

I want the service to be part of a channel where we’re working on an outage, in Slack or Microsoft Teams or in some incident response tool, so the agent can see what is being worked on and get context of what we’re looking at.
I want the service to continue helping as we close the incident and proceed to followup on issues noted.
I want the service to guide us to do more extensive followup after an incident, not just the most proximate causes.

I want it to offer proactive help during the stressful work for the team, for example:

Unprompted interjection of new information.
Handle comms and quell panic.
Guide us to declare the end of the Incident.
Help draft the Postmortem.
Make the Postmortem better.

1. Unprompted interjection of new information

I want the agent to call out outlier metrics/traces/etc which we don’t appear to already be aware of. It is important to set the threshold appropriately: repeatedly interrupting us for things we already know would be unwelcome, while interjecting about something relevant which we don’t appear to have noticed could be a lifesaver. I think this part of the functionality is already being addressed, for example by Honeycomb Canvas. It is a natural progression for an observability tool.
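To make the threshold idea concrete, here is a minimal sketch in Go. It is not how Honeycomb Canvas or any other product works; the Interjector type, its methods, and the z-score rule are all assumptions made up for illustration. It flags a reading only when it is far outside the recent baseline and the responders have not already been told about that metric.

```go
package main

import (
	"fmt"
	"math"
)

// Interjector decides whether a reading is worth interrupting responders for.
// The type and its methods are hypothetical, not any vendor's API.
type Interjector struct {
	Threshold float64         // standard deviations from the mean that count as an outlier
	known     map[string]bool // metrics the responders are already aware of
}

// NewInterjector returns an Interjector with the given z-score threshold.
func NewInterjector(threshold float64) *Interjector {
	return &Interjector{Threshold: threshold, known: make(map[string]bool)}
}

// MarkKnown records that responders are already discussing a metric.
func (i *Interjector) MarkKnown(metric string) { i.known[metric] = true }

// ShouldInterject reports whether latest is an outlier against baseline for a
// metric the responders do not yet know about. A positive result marks the
// metric as known so the same callout is not repeated.
func (i *Interjector) ShouldInterject(metric string, baseline []float64, latest float64) bool {
	if i.known[metric] || len(baseline) < 2 {
		return false
	}
	mean, std := meanStd(baseline)
	var z float64
	switch {
	case std > 0:
		z = math.Abs(latest-mean) / std
	case latest != mean:
		z = math.Inf(1) // flat baseline: any change is an outlier
	}
	if z < i.Threshold {
		return false
	}
	i.known[metric] = true
	return true
}

// meanStd returns the mean and population standard deviation of xs.
func meanStd(xs []float64) (mean, std float64) {
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	for _, x := range xs {
		std += (x - mean) * (x - mean)
	}
	return mean, math.Sqrt(std / float64(len(xs)))
}

func main() {
	in := NewInterjector(3)
	baseline := []float64{100, 102, 98, 101, 99}
	fmt.Println(in.ShouldInterject("p99_latency_ms", baseline, 101)) // within normal variation
	fmt.Println(in.ShouldInterject("p99_latency_ms", baseline, 150)) // far outside the baseline
	fmt.Println(in.ShouldInterject("p99_latency_ms", baseline, 150)) // already called out once
}
```

A real agent would presumably infer what the responders know from the channel conversation rather than from an explicit MarkKnown call; the map just stands in for that.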

2. Handle comms and quell panic

Someone needs to reflect pithy summaries of what is happening to a Slack or Teams channel for the rest of the company, to provide reassurance and quell panic. We write SOPs to emphasize Comms, but people in the thick of it get focused on the firefighting and sometimes neglect to do so. Summarizing the state of the response and adjusting its level of detail for a broader and less technical audience seems like something a suitable LLM could do.

3. Guide us to declare the end of the Incident

Sometimes a problem just trails off once we’ve addressed enough of its causes, but we may not decide to close the Incident until well after the point where keeping it open is really warranted. I'd like the service to let us know that things appear to be on the path to normalcy and estimate how long before the changes it is seeing would bring us back into the usual range.

4. Help draft the Postmortem

After the incident, the agent should help us write the Postmortem. Tooling often only focuses on the straightforward parts of that: the summarized description of the problem and a timeline, especially if it can annotate the timeline with graphs of impacted metrics and relevant details.

That would be dandy and would save us time, and help get the Postmortem out with less delay, which is important, to be sure. We want to let people give feedback and contribute to the Postmortem while everything is still fresh. I think this part of the functionality is already being addressed, for example by incident.io.

but also...

5. Make the Postmortem better

Vastly more valuable in the Postmortem would be to help us extract more actionable followups, not just the immediate triggers but as many things which could be better as we can find. If the agent can see negative trends in metrics or traces which do not appear to be a direct result of what we’ve identified as the root cause, being able to implement more fixes and improve the system's robustness without needing to suffer through another incident first would be very valuable.

Most of the incident-focused products and work that I've seen focus on the description, the timeline, and the supporting data in a Postmortem, which is dandy and saves time in the writing of it, but those are part of the Archaeology when what I really want to focus on is the Future. Even things we might not be positioned to take on for a while, we could still try to address proactively before they happen again.

Postmortem Culture

We do our best to design in redundancy and robustness and build reliable systems, but we always end up responding to failures which we didn't adequately control for and improve the reliability of the system over time as it operates. One of the primary tools to do this is the Postmortem, where we describe a problem which happened and list off what we are going to do about it.

A scene from the Simpsons with many rakes arrayed around the ground.

We want to learn as much from every incident as we can. We want to address as many weaknesses in our system as we can, without having an outage for every one of them. If we can identify more things which went wrong, things which were perhaps not the primary problem but nonetheless still a problem, we accelerate the process of improvement. Making maximal use of Postmortems to improve the system is Postmortem Culture.

Every outage starts with stepping on a rake and being hit in the face. We should be able to look past the rake which just hit us in the face, and look around for nearby rakes which we haven't stepped on yet. Postmortem Culture is the rakes we did not step on.

Existing Products

1. Honeycomb Canvas is an existing product in this space, particularly the live assistance during an incident using observability data.

Sample page from Honeycomb Canvas, showing chat boxes and graphs

2. incident.io is another product in this space, especially in helping to draft postmortems — the Archaeology part of the postmortem, at least. incident.io is evolving from an on-call and incident management tool, not an OpenTelemetry collector. The assistance it can currently provide during the incident is more in looking for patterns with prior incidents.

incident.io chat window where the agent answers questions about prior incidents with similar symptoms

3. There are a few products which describe themselves as a Virtual SRE team, though I don't really like that term. An experienced SRE team is a hugely valuable resource and the tools I've seen are at best automating a small part of what SRE would do. I'll write more about these kinds of products as I learn more about them.

Friday, October 10, 2025

New Google Blogger Features ?!?

**Try our New Beta Features**: Create a more engaging reading experience with the help of Google

Google Search previews: Easily insert visual Google Search previews for popular people, locations, pop-culture and more directly in your blog! In Compose View, look for the ‘G’ button in the editor tool bar to get started.

That is the notice greeting me at the top of draft.blogger.com today. After years of not noticing any change in the service at all, it is now getting search previews.

Honestly I would have expected any sudden burst of activity in Google Blogger to be more distinctly AI-related, part of someone's promotion packet to sprinkle LLMs anywhere and everywhere.

Wednesday, September 10, 2025

Continuous Improvement in LLM Code Generation

One week ago, I wrote:

"I wish wish wish that Claude Code would automatically populate a .gitignore for node_modules. Not for the first time, I checked 437 Megabytes of code into git and had to rewrite the history to remove it."

I used Claude Code to create a new frontend project, using Qwik this time, and what do I see?

dgentry@llm:frontend$ cat .gitignore
# Build
/dist
/lib
/lib-types
/server

# Development
node_modules
.env
*.local

...

A classic hacker stock photo in a darkened room sitting in front of a laptop wearing a hoodie and mask, except the person typing is a robot

I don't know if this represents something which the Claude Code team specifically made happen since the last time I had it generate code like this, or if the training data of Qwik codebases is so much more likely to have included node_modules in their .gitignore file.

It is one of the perverse things about use of tools like this: we tend to give credit to the tool, and not the community which created the information upon which it relies.

Tuesday, September 2, 2025

LLMs to Blaze a Trail

As an engineering executive there are a few ideas and practices which I reinforce via repetition to the team, either explicitly at the start of a recurring meeting or implicitly by bringing it up whenever relevant. The first of these ideas is:

The product is not the code, not the features, not the designs.
The product is that people can use the service for things that are important to them.
The business is not the code, not the features, not the designs.
The business is that people can use the service for things that are valuable to them.

But the topic of this post is not that. The topic is the second thing I frequently reinforce via repetition:

Get something, anything, working end to end as quickly as you can.
Not even a minimum viable thing. Any thing.

It has been my experience that, as developers, we tend to focus in on one area of a system to explore its requirements and build it out sufficiently until we feel confident that we understand what else will need to be done before moving on to the next piece. This results in a system where the understanding and the plan for development is grown by accretion, each piece layered atop the previous which is left undisturbed by later developments. We might go back and harmonize all of them later... maybe.

It has also been my experience that everything starts progressing more quickly once the system does something, anything end-to-end.

We gain perspective on how the whole system will work and apply it to everything we do subsequently.
One can make a change and see it function all the way through. Enthusiasm improves productivity.
It is far more effective when multiple people work on a system in parallel if they can all see the impacts of each other's work.

Thus:

Get something, anything, working end to end as quickly as you can.
Not even a minimum viable thing. Any thing.

LLMs to Blaze a Trail

With the maturing capabilities of LLM code generation, I tried an experiment with Claude Code. At Google one of the classes in orientation was to construct a web scraper. I asked Claude Code to build a scraper, but an even simpler one: scrape a metric.

In a new scraper directory, create a go program which will scrape a web page formatted
in prometheus metrics format, and extract a floating point value labeled "example"

Create an SQL schema for a timeseries, with columns for a timestamp and a floating point value.

Have scraper connect to a Postgres database and write each sample it collects to the database.

In a new webui/frontend directory, create a web page using React and typescript which will
poll a backend server for changes in a loop and display rows of timeseries data with timestamp,
sample name, and value.

In a new webui/backend directory, create a go program which will handle queries from
webui/frontend and fetch timeseries data from the postgres database.

A classic hacker stock photo in a darkened room sitting in front of a laptop wearing a hoodie and mask, except the person typing is a robot

It produced a small, functional implementation.

scraper	123 lines	Go
backend	169 lines	Go
frontend	127 lines	Typescript
	100 lines	CSS
	43 lines	HTML

A few interesting tidbits:

It produced no unit tests for the Go code. I didn't tell it to.
It did produce unit tests for the TypeScript code, even though I did not tell it to. I think this speaks well for the TypeScript community: the training data is infused with testing as an expected practice.
I wish wish wish that Claude Code would automatically populate a .gitignore for node_modules. Not for the first time, I checked 437 Megabytes of code into git and had to rewrite the history to remove it.

Unit Tests

Having no tests at all sets a bad example. I don't actually want to encourage the construction of large system test suites at this stage of a project, as the effort to keep updating a large test as the system evolves is likely to outweigh the value of the test at this stage. Yet I do want to set the example by ensuring there is something.

In the scraper directory, keep the main() function in main.go but move the rest of the code
to a scrape.go file. Write tests for scrape.go with a local prometheus server and in-memory
database. Check that metrics are correctly stored in the database.

Claude Code generated 377 lines of test cases, including scraping one value and several values. Most of the code was to set up an in-memory database using sqlite and to run a local Prometheus server.

The cost of the first prompt to generate the system and the second prompt to add unit tests: 93 cents.

Non-trivial example

That example was pretty contrived. How about an example of a more realistic system which:

Implements a protocol connecting to a legacy communications system.
Implements a set of modern protocols connecting to current Internet communications infrastructure, to forward messages to and from the legacy protocol.
Has a management layer watching all of the connections and can stop or restart them as needed.
Has a dashboard and console showing the status and configuration of the system.

Can it produce this? Well... not exactly. I kind of cheated: this is the first thing I attempted, and I made up the contrived system later.

The problem is that first step. Claude Code was not much help in producing the first piece, connecting to the legacy system. The tasks there were more like engineering archaeology:

Trying variations on the digest hash function until the remote system suddenly returned 200 OK.
Figuring out what portions of the poorly documented header fields were actually implemented.
Diagnosing failures when the only indication we get is "Invalid" with no further information about what was invalid.

There just isn't any training data for this, and so trying to rapidly get to a functioning end-to-end system entirely via code generation didn't work. I was able to work on the management layer and the dashboard and so on while still debugging the first piece, but it only started working when that first piece was done.

Could I have set that first piece aside with a mockup, and worked on the rest? Probably, but it was just me, not a team, and the first piece was the biggest risk. I focussed on eliminating the risk.

In an engineering team, I think I would approach this with a small team whose job is to sketch out the overall system. It might be entirely senior engineers or at least led by a quite senior engineer, and tasked to identify and quantify risks and to plan out a system. That team could multiply its efforts using LLMs to help generate the more well understood portions of the system.

Monday, September 1, 2025

On the Persistence of Human Memory

Tell me this looks wrong to you, too.

Claude Code doesn't see it. I mean, of course Claude Code doesn't see it, it has no eyes or other senses. Nonetheless I tried to get Claude Code to fix it by leading it to a solution.

In frontend/ in the Delivery page, align the Save and Cancel buttons vertically.

In frontend/ in the Delivery page, remove the height property from the Save and Cancel
buttons. Put both buttons inside a div, and set the height of the div to 40px.

Neither of these fixed it, because these were not the problem. The actual problem was:

.save-button,
.add-button {
  background-color: #48bb78;
  margin-top: 1rem;
}

This was leftover from when the button was elsewhere on the page, and not removed when it moved to be next to the Cancel button. Poking around with Chrome's Developer Tools and looking at the Elements on the page identified it.

On the Persistence of Human Memory

One thing I am finding is that memory of code generated with the help of an LLM fades much more quickly. Some portions of this system were not amenable to getting help from Claude Code: things which involve low level interoperability with existing and legacy systems. There is no relevant material in the training set, and Claude Code could not help with the iterative debugging of staring at the errors from the legacy system to figure out what to do next.

Those portions of the codebase, those developed with blood and sweat and tears, remain clear in my memory. Even months later I can predict how they will be impacted by other changes and what will need to be done.

That is not true of the portions which the LLM generated. Continuing with the analogy of treating it as an early career developer, I only reviewed the code; I didn't write it. As with any code review, the memory of how it works fades much more quickly compared with actually digging in to the work.

(This is better than Claude Code, though, which retains no memory at all of how code has evolved and instead discovers it all afresh at the start of each session).

Treating an LLM like an early career programming partner can provide large increases in productivity, but it also means that one has less personal recollection of the windy path the code took to get to its current state. One must be able to go spelunking. This isn't that much different from a codebase which one has worked on over a long period: little detailed memory of specific portions of the code remains, but an overall sense of the codebase is retained much longer.

Monday, August 25, 2025

Claude Code's 19 cent Parser

A brief prompt:

In authheader.go write a function to parse a SIP WWW-Authenticate header for Digest
authentication. It should return a map[string]string of key:value pairs which are
present. It should handle the case of valueless parameter with no "=" by populating
an empty string in the map.

Write unit tests, including these WWW-Authenticate headers:
1. WWW-Authenticate: Digest algorithm=MD5,realm="example.com",nonce="abcd="
2. WWW-Authenticate: Digest realm="example.com", nonce="efgh=", opaque="1234__", algorithm=MD5, qop="auth"

From this, Claude Code generated quite reasonable parsing code for a SIP WWW-Authenticate header. It did this in approximately one minute of wall-clock time at a cost of 19 cents. This is considerably quicker and cheaper than I could have produced a similar function.

I made one manual fix: the string comparisons for "Digest" and for parameter field names are supposed to be case-insensitive, and I added unit tests for that. I hadn't specified this in the prompt, and Claude Code didn't figure that out from the mention of SIP.

I remain of the opinion that vibe coding can be a force multiplier for expertise, not a complete replacement for expertise.

Wisdom

Returning to an earlier topic: does the code which Claude Code generated exhibit wisdom? Did it have shortcomings which would be harmful? Claude Code came up with the following test cases, and wrote a Go table-driven test case for them.

The two I explicitly gave it.
Header with valueless parameter
Header with unquoted values
Empty header
Header with comma in quoted value
Header with extra spaces

I looked into the handling of unquoted values. The SIP standard says that fields like algorithm or qop which are enumerated in specifications can be left unquoted. What Claude Code generated would allow any field to be unquoted, including arbitrary text strings like realm.

The spec says these values must be quoted. Yet there is also the Robustness Principle, to be liberal in what you accept and strict in what you send.

Postel's Law Considered Harmful

Nowadays I think this principle has ultimately done more harm than good. Over time we end up with a protocol which is only partially specified, where real implementations require a neverending series of quirks handling to work around the behaviors of widely deployed yet incorrect implementations which other implementations have liberally accepted. For new protocols I'm a fan of be strict in what you send and strict in what you accept, to not allow quirks to accumulate. Like barnacles, quirks slow the forward progress over time and tend to cause standards to bog down and eventually stop even trying to evolve.

But SIP is ancient. In Internet Years it is a centenarian. What should one do about SIP? Being strict in what one accepts would lead to a series of relaxations being added during deployment, when engineering philosophy meets the harsh reality that there are a lot of barely-compliant production services run by vendors far too large to care what some Internet Rando thinks of their implementation.

Epilogue

I did consider whether to just leave it this way, and allow unquoted strings for all fields. Life is too short to fight the weight of Internet Protocol Inertia... but I couldn't do it. That would make my little corner of the SIP world be part of the problem. I made it only accept unquoted strings for algorithm and qop, the two enumerated fields which my system deals with.

In authheader.go:parseWWWAuthenticate() fields named “algorithm” or “qop” may be
quoted or unquoted. Any other field name must have its value quoted to be accepted.

In authheader_test.go add test cases:
1. fields named “algorithm” or “qop” may be quoted or unquoted.
2. Any other field name must have its value quoted to be accepted.
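The resulting rule can be sketched as follows. This is my own minimal illustration, not the code Claude Code produced: ParseDigestChallenge, splitParams, and unquotedOK are hypothetical names, the function takes the header value (the text after "WWW-Authenticate:"), and it does not handle backslash-escaped quotes inside quoted strings. It also applies the case-insensitive matching of "Digest" and parameter names mentioned earlier.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// unquotedOK lists the parameters whose values may appear without quotes.
var unquotedOK = map[string]bool{"algorithm": true, "qop": true}

// ParseDigestChallenge parses the value of a WWW-Authenticate header into a
// map keyed by lower-cased parameter name. Valueless parameters map to "".
// Unquoted values are rejected except for the names in unquotedOK.
func ParseDigestChallenge(value string) (map[string]string, error) {
	scheme, rest, _ := strings.Cut(strings.TrimSpace(value), " ")
	if !strings.EqualFold(scheme, "Digest") {
		return nil, errors.New("not a Digest challenge")
	}
	params := make(map[string]string)
	for _, part := range splitParams(rest) {
		name, val, hasValue := strings.Cut(part, "=")
		name = strings.ToLower(strings.TrimSpace(name))
		val = strings.TrimSpace(val)
		if name == "" {
			continue // empty segment, e.g. a trailing comma
		}
		if !hasValue {
			params[name] = ""
			continue
		}
		quoted := len(val) >= 2 && strings.HasPrefix(val, `"`) && strings.HasSuffix(val, `"`)
		switch {
		case quoted:
			params[name] = val[1 : len(val)-1]
		case unquotedOK[name]:
			params[name] = val
		default:
			return nil, fmt.Errorf("parameter %q must be quoted", name)
		}
	}
	return params, nil
}

// splitParams splits s on commas that fall outside double quotes.
func splitParams(s string) []string {
	var parts []string
	inQuotes := false
	start := 0
	for i := 0; i < len(s); i++ {
		switch s[i] {
		case '"':
			inQuotes = !inQuotes
		case ',':
			if !inQuotes {
				parts = append(parts, s[start:i])
				start = i + 1
			}
		}
	}
	return append(parts, s[start:])
}

func main() {
	p, err := ParseDigestChallenge(`Digest algorithm=MD5,realm="example.com",nonce="abcd="`)
	fmt.Println(p, err)
	_, err = ParseDigestChallenge(`Digest realm=example.com`)
	fmt.Println(err)
}
```

Keeping the exceptions in a single map means that a relaxation forced by some barely-compliant peer is at least an explicit, reviewable change rather than a quirk that accumulates silently.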

Monday, August 18, 2025

Training Gemma3-270m for German Q-and-A

Google recently introduced Gemma3-270M, a smaller Gemma3 model with "only" 270 million parameters instead of billions.

The most interesting aspect of this model to me is that it is explicitly intended to be able to run locally, without requiring highly specialized infrastructure — well within what is achievable outside of specialized datacenters. The potential to run the model with an air gap, isolating it from outside, would be interesting for some future stuff I'm working on.

The eventual uses would involve communication in the German language, so I decided to see about adding training to answer questions in German specifically. I referenced an existing colab notebook, which uses Gemma3-270M to predict chess moves. Chess as an application for LLMs isn't as interesting for me personally, we have better ways to use neural networks to play chess, but the training flow is the same.

We start by loading dependencies and instantiating the gemma-3-270m-it model.

%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft
    !pip install --no-deps trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth


from unsloth import FastModel
import torch
max_seq_length = 2048
model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-270m-it",
    max_seq_length = max_seq_length, # Choose any for long context!
    load_in_4bit = False,  # 4 bit quantization to reduce memory
    load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
    # token = "hf_...", # use one if using gated models
)

We set it up to accept training data in a chat format using the Hugging Face deepset/germanquad dataset, a curated set of training data from the German Wikipedia and various academic sources.

model = FastModel.get_peft_model(
    model, r = 128,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 128, lora_dropout = 0, bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407, # Seems pretty random
    use_rslora = False, loftq_config = None,
)

from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(tokenizer, chat_template = "gemma3")

from datasets import load_dataset
dataset = load_dataset("deepset/germanquad", split = "train[:10000]")

def convert_to_chatml(example):
    return {
        "conversations": [
            {"role": "system", "content": example["context"]},
            {"role": "user", "content": example["question"]},
            {"role": "assistant", "content": example["answers"]["text"][0]}
        ]
    }
dataset = dataset.map(convert_to_chatml)

def formatting_prompts_func(examples):
   convos = examples["conversations"]
   texts = [tokenizer.apply_chat_template(convo,tokenize = False,
       add_generation_prompt = False).removeprefix('<bos>') for convo in convos]
   return { "text" : texts, }
dataset = dataset.map(formatting_prompts_func, batched = True)

from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model, tokenizer = tokenizer,
    train_dataset = dataset, eval_dataset = None,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 8,
        gradient_accumulation_steps = 1,
        warmup_steps = 5, num_train_epochs = 1,
        max_steps = 100, learning_rate = 5e-5,
        logging_steps = 1, optim = "adamw_8bit",
        weight_decay = 0.01, lr_scheduler_type = "linear",
        seed = 3407, output_dir="outputs",
        report_to = "none",
    ),
)

from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)

We then train the model. This took about three minutes on Google Colab using a Tesla T4 GPU.

trainer_stats = trainer.train()

Now, the real test: can it give good answers to questions not in its training data?

messages = [
    {'role': 'system','content': 'Bielefeld'},
    {"role" : 'user', 'content' : 'Gibt es Bielefeld?'}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
).removeprefix('<bos>')

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 125,
    temperature = 1, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

<bos><start_of_turn>user
Gibt es Bielefeld?
<end_of_turn>

<start_of_turn>model
Ja
<end_of_turn>

Indeed yes, it can!

If that interaction doesn't make much sense: it is a German joke, alleging that the city of Bielefeld doesn't actually exist. Wikipedia has an explanation in English.

The trained model says that Bielefeld does exist. Clearly it has no sense of humor.

Friday, July 11, 2025

Vibe Coding and Wisdom

I started experimenting with Claude Code a while ago. I am not the first person to make this observation, but thinking of Claude Code as an early career developer whom one is mentoring and needs to guide to a solution is a good mental model for it. It is pretty impressive in what it can do.

Asking it to produce something the size of what one would want to see in a code review from an early career developer produces good results, far faster than I could write myself, at a cost of a few cents.

For example:

Add a command line utility written in Go in cmd/adduser. It takes command line arguments for email, phone, imsi, realm, remsim, gpp_hostname, ns, slack_app_token, slack_channel_id, and smtp_list. It encrypts the slack_app_token, slack_channel_id, and smtp_list using the code in internal/db/encryption.go. It opens a connection to the database using the code in internal/db/server.go, and adds a new row to the accounts table using the command line arguments it was given.

This resulted in a quite functional command line program which did what I asked.

func main() {
        var (
                email          = flag.String("email", "", "Email address (required)")
                phone          = flag.String("phone", "", "Phone number (required)")
                imsi           = flag.String("imsi", "", "IMSI (required)")
                realm          = flag.String("realm", "", "Realm (required)")
                remsim         = flag.String("remsim", "", "Remsim (required)")
                gppHostname    = flag.String("gpp_hostname", "", "GPP hostname (required)")
                ns             = flag.String("ns", "", "NS (required)")
                slackAppToken  = flag.String("slack_app_token", "", "Slack app token")
                slackChannelID = flag.String("slack_channel_id", "", "Slack channel ID")
                smtpList       = flag.String("smtp_list", "", "SMTP list")
        )
        //...omit the rest but it was straightforward code...

Unit Testing

Claude Code doesn't seem to produce unit tests as a regular part of its development... like some engineers I've worked with I suppose. However it can be prompted to do so and produces a reasonable result.

In cmd/forwarding create a unit test which starts a server on localhost and uses an http client to connect and POST JSON with input arguments.

In cmd/user-management create a unit test which starts a server on localhost and uses an http client to fetch /api/account/<uuid>/db

Claude Code produced a quite reasonable table-driven Go test, including creating an in-memory sqlite3 to mock the Postgres database.

+       testCases := []struct {
+               name           string
+               uuid           string
+               requestBody    ForwardRequest
+               expectedStatus int
+               setupDB        func(*sql.DB, string)
+       }{
+               {
+                       name: "successful forward",
+                       uuid: testUUID,
+                       requestBody: ForwardRequest{
+                               Sender:   "test@example.com",
+                               Received: "2023-12-01T10:00:00Z",
+                               Content:  "test message",
+                       },
+                       expectedStatus: http.StatusOK,
+                       setupDB: func(database *sql.DB, uuid string) {
+                               slackToken, _ := db.Encrypt("test-slack-token")
+                               slackChannel, _ := db.Encrypt("test-channel")
+                               smtpList, _ := db.Encrypt("test@smtp.com")
+
+                               query := `INSERT INTO accounts (id, account_number, email, phone_number,
+                                         imsi, realm, remsim, gpp_hostname, ns, slack_app_token,
+                                         slack_channel_id, smtp_list)
+                                                 VALUES (?, 1001, 'test@example.com', '+1234567890',
+                                                         '123456789012345', 'test.realm', 'test-remsim',
+                                                         'test-gpp', 'test-ns', ?, ?, ?)`
+                               database.Exec(query, uuid, slackToken, slackChannel, smtpList)
+                       },
+               },

Image produced by Claude Code's distant relative, Stable Diffusion

Unwise Approaches

Attempting to get Claude Code to generate larger systems with less guidance didn't go as well. What it produced was functional and quite impressive, but unwise.

I wanted to try out an admin panel with a login link sent to an email address. Not suitable for all environments, but sufficient for many services which rely on the user's email.

In the admin sub-directory, generate a web user interface for an admin console using typescript and react, with a backend server written in Go.

The login screen has a text box to enter an email address. When the Submit button is pressed, the backend server should generate a 128 bit random string and use os.exec to run an email.sh process. The backend server should redirect the user to an interstitial page which says "Please click the login link sent to <email address>."

Once logged in, the main page has ...

Claude Code generated a quite functional admin console. One could submit an email address and it would fork the script to send email. It maintained a map of pending login tokens in the Go backend. When one clicked the link in the email the backend would respond with ok if it found that token in its active table, otherwise failure. Quite exhilarating to see all of that work within a couple minutes of starting on it.

However this means the client code, itself, was deciding the success or failure of the login link. If it got an ok from the backend, it would proceed to the URL for the admin panel. The backend code would serve up whatever it was asked for, there was no enforcement in the backend.

Anyone capable of understanding the client JavaScript could figure out the URL of the admin panel for any user. The login link only provided the illusion of protection. It was trivial to bypass.

One can observe that Claude Code generated exactly what I told it to, which is a fair observation. One might also observe that Claude Code just regurgitates its training set, meaning that human developers have done similar things in large numbers. This is also a fair observation.

Nonetheless it reinforces that vibe coding is best used as a multiplier, not a substitute, for actual expertise.