Ask Good Questions: Do not let AI build the elephant in the room

I published a new Ask Good Questions guide and companion field note that may be useful to anyone using AI on larger programming projects.

Guide:

https://askgoodquestions.dev/guides/do-not-let-ai-build-the-elephant-in-the-room

Field note:

https://askgoodquestions.dev/field-notes/the-elephant-does-not-walk-into-the-room

This one is about a very ordinary problem that AI can make worse if we’re not careful.

A project starts with a reasonable structure. The files make sense. The responsibilities are mostly clear. Then you ask the AI to add a feature, fix a panel, add a mode, handle one special case, and then another.

Each change may be reasonable.

Each change may even work.

But over time, one file can become the unofficial center of the application. It starts out as a component, then becomes a component plus state manager, then a component plus state manager plus parser, plus export logic, plus platform-specific behavior.

That is the elephant in the room.

The guide looks at this from the practical side: file size, design drift, full-file replacement, living architecture notes, and scheduled structural reviews.
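As a rough illustration of the file-size signal mentioned above, here is a minimal Python sketch that lists source files longer than a chosen line count. The function names, the file extensions, and the 800-line threshold are assumptions made for this example; none of them come from the guide.

```python
from pathlib import Path


def count_lines(path: Path) -> int:
    """Count lines in a text file, replacing any undecodable bytes."""
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        return sum(1 for _ in handle)


def find_large_files(root, extensions=(".clw", ".inc", ".py"), max_lines=800):
    """Return (relative path, line count) for source files longer than max_lines.

    The extensions and the threshold are illustrative defaults only.
    """
    root = Path(root)
    wanted = {ext.lower() for ext in extensions}
    oversized = []
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in wanted:
            lines = count_lines(path)
            if lines > max_lines:
                oversized.append((path.relative_to(root).as_posix(), lines))
    # Largest files first, so the likeliest "elephant" is at the top.
    return sorted(oversized, key=lambda item: item[1], reverse=True)
```

Calling `find_large_files(".")` from the project root gives a short list to bring into a structural review. A long file is a reason to look, not proof of a design problem.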

There is also a review prompt in the guide that can be adapted for your own projects. The point of the prompt is to make the AI stop and analyze the structure before it starts changing code.

The field note is the more personal version. It is about the moment when you realize the file did not become unmanageable because of one terrible decision. It got there one small, reasonable shortcut at a time.

One line from the article sums it up:

The elephant doesn’t appear by magic. You build it.

If you’re using AI in your own development work, I would be interested in whether you are doing any deliberate codebase structure reviews, or whether you mostly review the immediate changes the AI makes.

Yes, it’s important that the AI know it’s OK to refactor as things grow. Sometimes it does it on its own, but more often it’s because I nudge it in that direction. This is because I’m fixated on good organization, as well as SOLID design principles.

Getting from proof of concept to a complete system can be done iteratively, but it definitely needs to be done with constant awareness. :wink:

It would be interesting to see whether, if the design principles are defined in memory, the AI respects them and keeps generating good-quality code without constant reminders, and whether it stays true to the instructions over time.

Indeed Mike.

I think that’s the part that gets missed sometimes. Iterative development is fine. In fact, it’s probably the only sane way to get from a proof of concept to a real system.

But it has to be iterative with awareness.

The dangerous pattern is when the proof of concept quietly becomes the architecture. Then every new feature gets bolted onto whatever was already there, and before long the elephant isn’t just in the room. It owns the room.

That’s where I think the developer still has to stay very much in charge.

AI can help refactor. It can help reorganize. It can even point out where things are starting to smell wrong. But it usually needs to be invited into that conversation.

Otherwise, it tends to keep walking forward on the path it’s already on, even when that path is clearly turning into a mess.

That would be interesting, and I think the answer is probably “yes, but only up to a point”.

Having design principles in memory or in a standing project instruction absolutely helps. I’ve seen that myself. If the AI knows that I care about organization, separation of concerns, avoiding made-up methods, keeping files structured, and not turning every quick fix into another layer of duct tape, the output is usually better.

But I still don’t think memory replaces active direction.

The problem is that as the conversation gets longer, the project gets bigger, or the task gets more specific, the AI can still drift. It may remember the principle in a general sense, but not always apply it at the exact moment where it matters. That’s especially true when it is focused on making the immediate change work.

So I tend to think of memory or project instructions as guardrails, not autopilot.

They make the AI more likely to stay on the road, but the developer still has to watch where the road is going. That means reviewing the design, catching drift early, and occasionally saying, “No, stop here. This is turning into a mess. Refactor before we go any farther”.

That’s really the heart of it for me. The AI can follow design principles, but the developer still has to own the design.

Definitely! Now I understand your point better, and it makes complete sense. There’s no autopilot; the guardrails help keep things on track, but the AI still needs active direction.

Exactly.

That’s the balance I’m trying to describe.

The guardrails matter. They make a huge difference, especially when working with Clarion or with a project that has very specific architectural rules.

But the developer still has to stay engaged. The AI can help carry the load, but it shouldn’t be the one deciding where the load is going.

That’s why I think memory, project instructions, and design principles are valuable, but they work best when the developer treats them as part of the process, not as a replacement for the process.

Sounds like you’ve defined AppGen.

Yes, there’s definitely a parallel, but I think there’s also one very important difference.

With AppGen, the rules are built into the generation system. You can generate the application a hundred times in a row, and the rules are going to be applied the same way every time.

AI doesn’t really work that way.

Even if the rules are in memory or in the starting prompt, the AI is still working inside the context of a conversation. As the session gets longer, there’s more and more material competing for attention. The rules may still technically be there, but they don’t always carry the same weight they did at the beginning.

That’s one reason I think updated prompts, checkpoint prompts, and session-based workflow matter so much. Instead of assuming the AI will keep perfectly honoring one original instruction forever, you keep restating the current architecture, the current constraints, and the current goal as the project evolves.
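One way to picture a checkpoint prompt is as a small template that is refilled from the current project state each time. The sketch below is a hypothetical Python helper: `ProjectState`, `build_checkpoint_prompt`, and the prompt wording are all invented for illustration, and it does not call any particular AI tool.

```python
from dataclasses import dataclass, field


@dataclass
class ProjectState:
    """Hypothetical snapshot of what should stay in front of the model."""
    architecture: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    goal: str = ""


def build_checkpoint_prompt(state: ProjectState) -> str:
    """Render the current architecture, constraints, and goal as one text block."""
    def bullets(items):
        # An explicit placeholder makes an empty section obvious to the reader.
        return "\n".join(f"- {item}" for item in items) if items else "- (none recorded)"

    return "\n".join([
        "CHECKPOINT: read this before changing any code.",
        "",
        "Current architecture:",
        bullets(state.architecture),
        "",
        "Current constraints:",
        bullets(state.constraints),
        "",
        f"Current goal: {state.goal or '(not set)'}",
        "",
        "If the requested change conflicts with anything above, stop and say so first.",
    ])
```

The developer updates the `ProjectState` as the design changes and pastes the rendered text at the start of each session or checkpoint, so the rules are restated rather than assumed to still carry weight.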

There’s also a difference between a session-based guided workflow and an agentic coding workflow where the rules live in one or more Markdown files. Those files can be very useful, but now you’re depending on the AI to decide when it needs to open them, read them, and apply them.

That isn’t the same thing as having the current working rules placed directly in front of the model as part of the active prompt.

So yes, AppGen is probably a good comparison, but AppGen enforces the rules mechanically. With AI, we’re still partly responsible for keeping the rules fresh, visible, and actively applied.

AppGen gives you 100% consistency and time savings.

AI doesn’t, for some of the reasons you explain.

I agree.

AppGen gives you mechanical consistency. Once the template rules are defined, they are applied the same way every time. That is a very different kind of time savings.

AI can absolutely save time too, but it is not the same kind of savings. It is more like having a very fast assistant who can help draft, reorganize, explain, review, and generate large amounts of working material quickly.

But that assistant still has to be managed.

With AppGen, the rules are enforced by the generator. With AI, the rules have to be kept visible, current, and checked as the work moves forward. That is why I think AI works best when the developer treats it as an accelerated development partner, not as an AppGen replacement.

Different tool. Different strengths. Different risks.

AI is somewhere between Disney’s The Sorcerer’s Apprentice and pair programming, depending on the programming language. :grinning_face:

That’s a pretty fair description. :slight_smile:

And I think the “depending on the programming language” part matters a lot.

In a language or framework where the AI has seen millions of examples, it can feel a lot closer to pair programming. It may still make mistakes, but at least it has a large body of patterns to draw from.

With Clarion, it can drift into Sorcerer’s Apprentice mode pretty quickly if you don’t keep it under control. It may confidently invent functions, borrow ideas from other languages, or keep generating more code when what you really needed was a better structure.

That doesn’t make it useless. It just means the developer has to stay in charge of the broom. :slight_smile:

Whilst the millions of examples help to reinforce the use of code, it’s not necessarily the best code either.

Absolutely.

Millions of examples may make the AI more fluent in a language, but fluency is not the same thing as judgment.

A lot of public code is old, rushed, copied, overcomplicated, under-structured, or written to solve one narrow problem in one narrow context. So the AI may have more patterns to draw from, but that does not mean those patterns are automatically good architecture.

That’s another reason I try to remember that AI is sometimes like an overeager junior programmer who has read the entire internet, can type 3,000 words a minute, and is amped up on caffeine.

Very useful?

Absolutely.

Always the person you want making the final design decision?

Probably not. :slight_smile:

AI can suggest code quickly. It can often produce something that compiles or looks plausible. But it does not really know whether that code fits your application, your design standards, your long-term maintenance goals, or the way your product should evolve.

So yes, more examples help with syntax and common usage, but they do not replace design judgment.

Regarding design standards, I saw this the other day.

It shows how different parts of the world influence how they/we operate.

Norway’s Sovereign LLM:
LLMs are predominantly English-speaking, so English-speaking countries have an advantage over non-English-speaking countries, in addition to the advantage from how their internet infrastructure evolved.

That is crucial for AI to understand if it’s to enter those domains and perform well.

That’s a good point.

When we talk about design standards, it is easy to think only in terms of code structure, naming conventions, architecture, or framework rules. But design standards also come from the domain, the business culture, the legal environment, the language, and the expectations of the people who will actually use the software.

That’s another place where AI can be both powerful and risky.

If the AI has mostly learned from English-language examples, American business assumptions, and common internet code patterns, then it may be fluent, but not necessarily appropriate for a different market or domain.

A SaaS product aimed at Germany, Austria, or Switzerland may have different expectations around privacy, documentation, contracts, payment terms, support, localization, and even the tone of the UI. A Norwegian LLM makes sense for the same reason. If the model does not understand the language, culture, documents, institutions, and local usage patterns, then it is going to miss things that matter.

That applies to programming too.

Clarion is its own little world with its own history, conventions, templates, AppGen model, ABC patterns, embed points, and developer expectations. If the AI treats Clarion like “slightly odd C#” or “old Pascal with windows”, it will get into trouble quickly.

So yes, for AI to perform well in a domain, it needs more than syntax. It needs context. It needs the local rules. It needs the design standards. And even then, the person using it still has to know enough about the domain to notice when the AI sounds confident but is standing in the wrong country, the wrong language, or the wrong programming model.

I’ve started using Claude and I like it a lot better than ChatGPT. The code it writes is cleaner. However, I will say that when it gets backed into a corner while we are trying to find a solution, it starts grasping at straws and makes things up. I was working on creating a CSV file, which it did very well: it created the group, the CSVFile definition, even most of the mapping. It understood the software that I was sending the data to, all of it. However, for some reason there was a space at the beginning of the CSV record. It couldn’t find the cause, so it recommended that I PUT the record, which of course was impossible because we were adding records. I told it what PUT was for, and it apologized and admitted it was wrong. At least it’s polite.
But I’ve seen it happen with writing Clarion code a number of times. Trust but verify, I say. Now I use it to write some PHP and interface with Stripe. I don’t know PHP that well, but it is very good at it and doesn’t seem to hallucinate on it at all.
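For a stray leading space like the one in the CSV story above, an independent check on the output file can at least show which records and fields are affected. This is a minimal Python sketch using only the standard `csv` module; the function name and the sample data are made up, and it does not diagnose the cause on the Clarion side.

```python
import csv
import io


def find_leading_whitespace(csv_text: str):
    """Return (record number, field index, raw value) for fields that start with whitespace."""
    problems = []
    # csv.reader keeps leading spaces by default (skipinitialspace=False).
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    for record_number, row in enumerate(reader, start=1):
        for field_index, value in enumerate(row):
            if value and value[0].isspace():
                problems.append((record_number, field_index, value))
    return problems
```

If only field 0 of every record is flagged, the extra character is probably being written at the start of each record rather than inside the data, which narrows down where to look.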

The other thing I’ve had happen a few times, which has wasted a bunch of my time, is that it will take me down the wrong path. For instance, I wanted to interface with Stripe, and it recommended a couple of third-party interfaces. With each one I tried, I ran into a roadblock for what I wanted to do, so I tried a different one. Finally, I asked it whether it would be better just to interface directly with Stripe. Claude came back with a big, hearty yes: that would be the easiest way.
That being said, it still allowed me to build a PHP webhook so I could notify myself when a payment came across and send the new customer a temp code. While I probably wasted a week, I still got the job done much faster than if I had had no help at all.

Best Junior programmer ever, just check its work!

That matches my experience too.

I use Claude and ChatGPT both pretty actively, and my own experience is that both are capable of producing very good code. I also test with Gemini, and occasionally Grok and DeepSeek, just to see how different models handle the same kind of problem.

What I’ve found is that they all make mistakes.

I’ve had cases where two of them could not solve a problem, or kept circling around the wrong answer, and the third one fixed it almost immediately on the first pass. But that has not been consistent enough for me to say “this model is always better than that model”.

To me, there is no magic LLM.

Some models may be stronger in certain areas. Some may write cleaner-looking code by default. Some may explain better. Some may be better at a particular language or framework because they have seen more examples. But they can all hallucinate, they can all choose the wrong path, and they can all sound very confident while doing it.

That is why I keep coming back to the same point: the prompt and the workflow matter.

If you give the AI vague instructions, let it make the architectural decisions, and then keep accepting whatever it produces, you can get into trouble pretty quickly. But if you give it clear constraints, keep the design direction current, make it work in smaller pieces, and review what it is doing, the results can be dramatically better.

So I agree with “best junior programmer ever, just check its work”.

I’d just add that we also need to check its direction, not only its code.

Some of your responses are what I would expect an AI to come up with. :grinning_face: