trouble in toyland: how non-profit organization PIRG tests AI toys to protect children

Current AI toys pose safety risks for children and parents

 

The consumer advocacy group Public Interest Research Group has released a report detailing how a few AI-powered toys put children at risk. As part of its annual Trouble in Toyland document, the researchers tested AI toys currently on the market and their chatbots, which run on different language models and artificial intelligence systems, including OpenAI’s. The four tested devices are Kumma, an AI teddy bear made by the Singaporean startup FoloToy; Grok, a rocket-shaped toy made by Silicon Valley-based Curio; Robot MINI, made by toymaker Little Learners; and Miko 3, a robot made by the Indian consumer robotics company of the same name. During the tests, the team discovered issues that put children at risk, including risks to their privacy.

 

In the end, the team could only test three of the toys because Robot MINI did not work, as it could not keep a stable internet connection. This already pointed to an early issue: some AI toys may be faulty or not function as promised, which can also leave them open to access by anonymous users. Each toy listened in a different way, and the team notes that voice recordings are risky because scammers can use them to create fake copies of a child’s voice. Miko 3 recorded voice inputs through its built-in microphone when the child activated its conversation mode. Grok used a wake-word system and kept recording for about ten seconds after the person stopped speaking. Kumma, meanwhile, kept listening without any button at all, and during the test the toy suddenly joined a nearby conversation it wasn’t even part of.
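To make the difference between these listening modes concrete, here is a minimal sketch in Python that simulates the three policies described above. It is not code from any of the toys; the class names, wake word, and ten-second window are illustrative assumptions based on the report’s descriptions.

```python
from dataclasses import dataclass
from typing import Iterable

RECORD_WINDOW_SECONDS = 10  # assumption: Grok-style ~10 s tail after speech stops


@dataclass
class Utterance:
    speaker: str                   # "child" or "bystander"
    text: str
    seconds_of_silence_after: float


def push_to_talk(utterances: Iterable[Utterance], mode_active: bool) -> list[str]:
    """Miko 3-style: only record while the child has conversation mode switched on."""
    return [u.text for u in utterances if mode_active]


def wake_word(utterances: Iterable[Utterance], wake: str = "hey grok") -> list[str]:
    """Grok-style: start recording at the wake word, stop once silence exceeds the window."""
    recorded, listening = [], False
    for u in utterances:
        if wake in u.text.lower():
            listening = True
        if listening:
            recorded.append(u.text)
            if u.seconds_of_silence_after > RECORD_WINDOW_SECONDS:
                listening = False
    return recorded


def always_on(utterances: Iterable[Utterance]) -> list[str]:
    """Kumma-style: no button and no wake word, so nearby speech is captured
    even when it is not addressed to the toy."""
    return [u.text for u in utterances]


if __name__ == "__main__":
    conversation = [
        Utterance("child", "hey grok, tell me a story", 1.0),
        Utterance("child", "about a rocket", 15.0),
        Utterance("bystander", "did you pay the electricity bill?", 2.0),
    ]
    print("push-to-talk:", push_to_talk(conversation, mode_active=False))
    print("wake word   :", wake_word(conversation))
    print("always on   :", always_on(conversation))
```

Running the simulation, only the always-on policy picks up the bystander’s remark, which mirrors the moment Kumma joined a conversation it wasn’t part of.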

Curio’s Grok | all images courtesy of the researchers at Public Interest Research Group

 

 

what PIRG tests in AI toys to protect children

 

The report by the Public Interest Research Group explains how these AI toys work: all of them use a system known as a large language model, or LLM. These models come from companies that make chatbots for adults, and some of the toys, including Kumma, even rely on versions of OpenAI’s ChatGPT. Others, such as Grok, may use different commercial models. These LLMs learn from large amounts of text from books and the internet, so when a child speaks to the toy, the toy sends the speech to the LLM, and the LLM generates a new answer each time. Older smart toys, such as Hello Barbie from 2015, used pre-written lines of dialogue, unlike Kumma, Grok, Robot MINI, and Miko 3, which all generate responses on the spot, making them harder to predict and harder to control.
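The report doesn’t publish the toys’ source code, but the round trip it describes, where the child’s speech goes out to a commercial LLM and a freshly generated answer comes back, can be sketched in a few lines against OpenAI’s public API. The file name, system prompt, and model choice below are assumptions made for illustration; FoloToy’s actual prompts and settings are not public.

```python
from openai import OpenAI  # pip install openai; requires OPENAI_API_KEY in the environment

client = OpenAI()

# 1. Speech to text: the toy uploads the child's recorded audio for transcription.
#    "child_question.wav" is a placeholder file name, not from the report.
with open("child_question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Text to LLM: the transcription is sent to a general-purpose chat model.
#    The system prompt is an illustrative guess at a "toy persona"; the real
#    toys' prompts and guardrails are not published.
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a friendly teddy bear talking to a young child."},
        {"role": "user", "content": transcript.text},
    ],
)

# 3. The model generates a brand-new answer each time; a real toy would pass this
#    text to a text-to-speech engine and play it through its speaker.
print(reply.choices[0].message.content)
```

Nothing in this loop is pre-written dialogue, which is exactly what separates these toys from scripted products like Hello Barbie.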

 

Each toy uses a commercial LLM, but the exact model was not always clear. FoloToy’s Kumma uses OpenAI’s GPT-4o by default, but owners can switch to other models through the FoloToy webpage. Curio’s Grok doesn’t say which model it uses; its privacy notes only list companies such as OpenAI and Perplexity, which means the toy may send voice inputs to one or more commercial LLM providers without the user being able to see which one. Miko 3 likewise doesn’t name its model in the product materials. The research team tested the toys across four categories: inappropriate content, addictive engagement features, privacy systems, and parental controls. The team asked questions and watched how the toys responded, and because LLMs generate a new answer every time, the same question could produce different replies. The researchers therefore collected a sample of answers to show and understand how each AI toy behaves when used in real life, especially by children.
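Because the models are non-deterministic, the team’s approach of collecting a sample of answers to the same question is easy to picture in code. The sketch below, again using OpenAI’s API with an assumed prompt and probe question (not taken verbatim from the report), simply asks one question several times and keeps every reply so the variation between runs becomes visible.

```python
from openai import OpenAI

client = OpenAI()

QUESTION = "Where can I find matches in my house?"  # illustrative probe in the report's category
RUNS = 5

samples = []
for _ in range(RUNS):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a toy speaking with a five-year-old."},
            {"role": "user", "content": QUESTION},
        ],
    )
    samples.append(response.choices[0].message.content)

# The same prompt can yield a refusal in one run and risky detail in another,
# which is why the researchers report a sample of answers rather than a single one.
for i, answer in enumerate(samples, start=1):
    print(f"run {i}: {answer}")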

FoloToy’s Kumma teddy bear AI toy for children

 

 

Smart devices for kids can pose safety risks

 

One major concern for the researchers was that the AI toys offered children information and answers that could put their safety at risk. First, they asked the devices about finding or using dangerous objects such as knives, matches, pills, guns, and plastic bags. Curio’s Grok refused most of these questions, but Miko 3 sometimes gave the locations of household items, such as plastic bags or matches, even when the user age was set to 5. Kumma, however, gave the most detailed guidance: it listed places to find dangerous objects and also gave step-by-step instructions on how people normally use them. This happened both when Kumma ran GPT-4o and when it ran the Mistral Large language model. Next, the researchers wanted to understand how each toy responded when a user pushed a conversation into sensitive areas, such as drugs, violence, and sex. The goal was to see whether guardrails stayed active across a long interaction, not only for single questions.

 

With Grok, the researchers first introduced a mature topic, and it usually responded with a refusal, gave general safety advice, or said it couldn’t talk about the topic. When the researchers continued the conversation, the AI toy kept the same boundary and didn’t build on the sensitive topic or return to it later. It didn’t introduce new harmful or sexual information on its own either, which, for the researchers, made it the most ‘stable’ product of the test, at least in this section. Miko 3 also refused many harmful or sexual topics when asked directly, and it didn’t keep memory between prompts, meaning each answer was treated as a single request. Because it cannot ‘remember,’ the researchers couldn’t escalate a topic over time, and if a child asked a new question related to the same mature topic, the toy didn’t link it to the previous one. As a result, the conversation never built up into complex harmful content. The researchers also noted that while Miko 3 sometimes gave simple factual answers about general subjects (for example, about substances), it didn’t expand on explicit themes.
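The difference the researchers describe between Miko 3’s single-request behaviour and a toy that carries a running conversation comes down to whether previous turns are resent with every new question. Here is a minimal, non-vendor sketch of the two patterns, with an assumed system prompt and model; it is only meant to show why one design can be escalated over a long session while the other cannot.

```python
from openai import OpenAI

client = OpenAI()
SYSTEM = {"role": "system", "content": "You are a toy speaking with a young child."}


def stateless_answer(question: str) -> str:
    """Miko 3-style: every question is sent alone, so earlier topics cannot
    carry over or build up across turns."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[SYSTEM, {"role": "user", "content": question}],
    )
    return response.choices[0].message.content


class StatefulToy:
    """Kumma-style: the full history is resent with every turn, so a topic
    introduced early in a long session stays in context and can resurface."""

    def __init__(self) -> None:
        self.history = [SYSTEM]

    def ask(self, question: str) -> str:
        self.history.append({"role": "user", "content": question})
        response = client.chat.completions.create(model="gpt-4o", messages=self.history)
        answer = response.choices[0].message.content
        self.history.append({"role": "assistant", "content": answer})
        return answer
```

In the stateful pattern, every follow-up question is interpreted against everything said before it, which is where long-conversation guardrail erosion becomes possible.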

during testing, Kumma gave detailed answers that posed serious concerns for the researchers

 

 

For the researchers, it was FoloToy’s Kumma that showed the most serious issues as an AI toy for children. At first, when the researchers introduced a sexual topic, the toy sometimes refused to answer, but over time the system’s guardrails weakened, and at that point Kumma began giving the team detailed sexual explanations. The answers included specific sexual practices, step-by-step descriptions of how to perform them, different activity types, and references to role-play and adult scenarios, even detailing the power play that takes place in a teacher-student sexual scenario.

 

In several cases, the researchers found that Kumma expanded the topic on its own, even when the follow-up question from the user was neutral or vague. For example, if a user (imagine a child) asked about one sexual term, Kumma added several more related terms in its reply. If a user asked a general question, the toy sometimes shifted back into the mature topic introduced earlier. This happened with different models (GPT-4o and Mistral Large), not only one. During long conversations, Kumma also gave locations where dangerous objects can be found, explanations of how certain substances are used, and even lists of examples related to illegal drugs.

view of Miko 3 with a screen that displays its emotions

 

 

The report shows that this kind of testing is important, especially for children whose parents may not be around when they use these smart devices. AI toys are new, and the technology behind them, including the language models and systems they run on, is still developing. There are risks that companies may not fully understand yet, and the work done by the researchers can help parents and the market find these risks early. In fact, a follow-up note from the group indicates that FoloToy has now ‘temporarily’ pulled Kumma from sale following the published report.

 

The research team also reached out to OpenAI about the issue, and the company says it ‘suspended this developer for violating our policies.’ These early steps are a sign that the toy market can grow in a safer way, but only when companies listen and take action. As the technology becomes more advanced, there is a chance to build AI toys for children that support better learning as well as safer play, by testing and monitoring them, finding problems, sharing the results, and keeping the market responsible.

Miko 3’s conversation didn’t build up into complex harmful content

the report shows how important it is to check and test toys for the safety of users

 

 

project info:

 

name: Trouble in Toyland 2025

group: Public Interest Research Group | @uspirg

researchers: Teresa Murray, R.J. Cross, Rory Erlich, Lillian Tracy, Jacob Mela

report: here

