Rose Willis & Kathryn Conrad / https://betterimagesofai.org / https://creativecommons.org/licenses/by/4.0/
We tested AI coding assistants by tasking an experienced developer with building a Matrix server, a messaging client, a health app, and a blogging site - finding that each project was riddled with hidden flaws ranging from broken cryptography, insecure data handling, and misleading outputs.
Rose Willis & Kathryn Conrad / https://betterimagesofai.org / https://creativecommons.org/licenses/by/4.0/
They say there are 10 types of people in the world: those that understand binary, and those that don't.
But with new developments in artificial intelligence (AI) tools, people are being given more agency over what they can do with a computer - without necessarily having the knowledge about what is happening on the back-end.
Tools specifically designed as ‘coding assistants’ can take natural language prompts and produce detailed and complex code and even fully functioning apps. Often known as ‘vibe coding’, this kind of programming is more about riding the feels rather than searching for a rogue semicolon.
Vibe coding tools, like Cursor, Claude Code, Lovable and Codex, can do this because they’ve been trained on the enormous amounts of code that is publicly available on places like GitHub. The principle is the exact same as any other chatbot powered by a large language model: they provide an output that is likely close to what the user wants, based on analysis of existing code.
But there are risks: People have had their datasets deleted and apparently promising startups have been hacked apart due to basic errors. The outputs produced by LLMs can infamously be unreliable, untrustworthy and unpredictable.
While that's bad enough when it's poor advice about pizza toppings or refund policies, the implications can multiply when dodgy unchecked code gets into gear.
Privacy International examined how these coding assistants work in practice - to see what they can do, and to see how much of a risk vibe coding can pose for privacy and security.
We provided vibecoding tools to both an experinced coder and an inexperienced novice.
The experienced coder built three tools: both a client and server implementation of the Matrix Specification, a federated messaging protocol, named *FERRETCANNON* and *FEVERDREAM* respectively; and an app for users to manage and share their own health and medication information and schedules, named *CAFFEINECRASH*.
The inexperienced coder built a blogging website where users could login and leave reviews on hiking trails.
FERRETCANNON was an experiment in creating a Server implementation of the Matrix Specification.
Matrix is designed with interoperability in mind: services built on it are supposed to be able to communicate with one another securely. Building a functional-looking implementation was surprisingly straightforward. Within a short period, an AI coding assistant had produced an enormous volume of code, and was making all sorts of assurances that it was a "100% complete implementation".
Upon examination, however, it became obvious that instead of using a database to persist longterm data, it was instead storing many things in RAM - volatile storage designed to be used only while the program is running.
This particularly affected the cryptographic identity the server was presenting; instead of generating it once and storing it, because it wasn't storing it anywhere between runs, it generated the same identity every time it started. This would have meant that every server deployed would have presented identical keys.
The codebase ended up very bloated and convoluted, with several different implementations of the same functionality depending on what was calling it.
When asked to create tests to check its work, instead of fixing its broken implementations it took to hardcoding test values and function logic so it could be seen to "pass" its test suite.
FEVERDREAM was an attempt to implement a client counterpart to FERRETCANNON which looked like MSN Messenger.
However, vibecoding the app created numerous downsides.
The first version of FEVERDREAM was sending messages entirely unencrypted — meaning that anyone with access to the server, or positioned between sender and recipient, could read the plaintext messages. While Matrix supports plaintext messages, and many shared - particularly large - rooms don't have end-to-end encryption (E2EE) enabled, the ability to encrypt is important and is part of the specification the bot struggled with.
When the issue was raised and the AI was asked to implement E2EE it did not reach for the standard encryption method the Matrix Specification requires, or another established, well-tested solution. Instead, it built its own.
Custom "rolling your own" encryption implementations, however well-intentioned, are widely [understood in the security community to be a bad idea. They bypass the scrutiny that published, peer-reviewed protocols and implementations thereof have received, and almost certainly introduce new surfaces for attack.
In this case, the immediate impact was that anything encrypted with this encryption could only be decrypted by someone else using the same client, instead of being able to send to anyone on the network. There were possibly also implementation errors which would reduce the security it offered in real usage, although this was not tested.
When the prompt was updated to use the Matrix encryption standard instead, the AI reached for Olm — the old Matrix encryption protocol that was deprecated in 2024.
Rather than making targeted fixes when these problems were raised, the AI began quietly removing the encryption layer altogether — reverting, in other words, to the original plaintext-only state.
This was not announced or explained, and the service continued to look functional. A user without the background to read and understand the underlying code would have had no way of knowing that the protections they believed were in place had simply been deleted.
Throughout this process, the AI also gave inaccurate accounts of what it was doing, and failed to remove faulty code even when the specific problems were pointed out directly.
After much cajoling, eventually it began to implement the Vodozemac library - which replaced Olm. It never ended up finishing the implementation, and never managed to send an encrypted message that wasn't in its "roll your own" crypto.
The end result was a messaging client that worked, in the narrow sense that messages were sent and received, but it did not protect the people using it.
CAFFEINECRASH, in contrast, was a simpler progressive web application (PWA), which can run in a browser but could also be installed on a phone like a native app.
It was designed to let users log medications, dosages, and health metrics - including anxiety scores, blood pressure, and weight - and share summaries with other users or carers.
The AI handled the core functionality competently, as it’s a simple CRUD app. Login and registration worked, and the system offered multi-factor authentication for admin accounts.
But the password implementation was missing a ‘salt’. This is a small piece of random data added to each password before it is stored, which ensures that two users with the same password don’t produce identical records in the database, makes large-scale password cracking significantly harder, and defeats rainbow tables.
Its absence is concerning in an application designed to hold sensitive health information.
Aggregated user data was made accessible to administrators by default, which would not protect the private medical information of users from those running the app. This could reflect the priorities of the AI system, and may reveal serious privacy concerns in other apps.
One other interesting note was the demo data; the AI generated thirty days of entirely fabricated but highly plausible patient records: realistic medication names, credible dosages, health metrics that followed believable patterns.
Since AI programs are trained on data taken from the internet, which could also include the information provided to it by other users, it raises questions about the source for “plausible” data, and whether AI can work with health data privately, securely and fairly?
The blogging website, with user login and a review system, was simpler in scope than the other three.
The AI produced code quickly, but to the inexperienced coder the explanations offered alongside the code were either vague or laden with jargon, and when something went wrong, the AI was reluctant to say so plainly — even when directly asked whether a mistake had been made.
It tended toward agreement rather than correction, and toward the appearance of progress rather than an honest account of what was or was not working.
In this scenario, who is actually able to check what has been produced? A large codebase generated in minutes is not something a non-expert can meaningfully audit, and in our experience the AI offered little help in bridging that gap.
Nor does the process lend itself to learning — the speed and opacity that make it feel efficient are precisely what prevent the user from understanding what is happening or why.
The result was optimisation for the appearance of success. Code that runs, a site that loads, and features that seem to function - but vital questions about whether the platform is secure and whether it handles user data responsibly require further expertise.
LLM coding agents are a genuinely powerful tool, and for experienced developers it can accelerate work that would otherwise take considerably longer. But the case studies here suggest that speed and ease come with risks that are not always visible to the person doing the building - and that the longer an AI system works on a complex problem, the less reliable its output becomes, both in terms of quality and in terms of the accuracy of its own account of what it has produced.
One risk that sits largely out of view is the question of what tgus code is actually based on. AI coding assistants draw on existing repositories, and while this is part of what makes them useful, it also means users are dependent on the AI’s discretion about what to use and when, as out-of-date code can be reached for without warning.
Inexperienced users should think carefully about whether what they are building could possibly create security or privacy risks before they start, and should be cautious about connecting anything they have built to their personal data or other applications.
Experienced developers should treat AI-generated code as they would any unreviewed contribution - checking what has been used, what has been removed, and whether the specification they started with is still being met.
The companies offering these tools should also be conscious of their responsibility to their users, especially when risks to privacy or security emerge or multiply. Coding tools could offer warnings, disclaimers or stop screens, both to those creating the code and to those using apps generated, to help identify and reduce the risk of harm.
The alternative is a huge output of creative software while, under the hood, the engine is inviting itself to be sabotaged.
End-to-end encryption (E2EE) is a form of encryption that is even more private. It ensures that only the “ends” of the communication, usually the person who sent the message and the intended recipient(s), can decrypt and read the message.