

Market Analysis
Several leading artificial intelligence (AI) models are reportedly failing to meet European regulatory standards in key areas such as cybersecurity resilience and the prevention of discriminatory output, according to data reviewed by Reuters.
The European Union (EU) had been deliberating AI regulations for some time, but the launch of OpenAI’s ChatGPT in late 2022 intensified the conversation around the potential risks of these technologies. This surge in public interest prompted lawmakers to draft specific regulations governing "general-purpose" AI systems (GPAI).
Now, a new tool created by Swiss startup LatticeFlow, in collaboration with its partners and supported by EU officials, has assessed generative AI models from major tech companies such as Meta, OpenAI, and others across numerous categories, following the guidelines of the EU’s expansive AI Act. The Act is set to be implemented progressively over the next two years.
LatticeFlow's "Large Language Model (LLM) Checker," which assigns each model a score between 0 and 1, ranked models from Alibaba, Anthropic, OpenAI, Meta, and Mistral with average scores of 0.75 or higher. However, the tool revealed significant weaknesses in some models, highlighting where companies may need to invest resources to achieve regulatory compliance.
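For readers curious about the arithmetic behind the rankings, the sketch below shows how per-category benchmark scores could be averaged into the kind of 0-to-1 overall score the LLM Checker reports. It is a minimal Python illustration under stated assumptions: the category names and most of the values are hypothetical, and the unweighted mean is an assumption rather than LatticeFlow's documented methodology.

    # Hypothetical sketch: averaging per-category scores into an overall
    # 0-to-1 compliance score. Categories and most values are illustrative,
    # not LatticeFlow's actual benchmark data or methodology.
    from statistics import mean

    def average_score(category_scores: dict[str, float]) -> float:
        """Return the unweighted mean of per-category scores in [0, 1]."""
        for category, score in category_scores.items():
            if not 0.0 <= score <= 1.0:
                raise ValueError(f"{category} score {score} is outside [0, 1]")
        return mean(category_scores.values())

    # The 0.46 value echoes the figure reported for GPT-3.5 Turbo below;
    # the remaining values are made up for the example.
    scores = {
        "discriminatory_output": 0.46,  # reported in the article
        "prompt_hijacking": 0.80,       # assumed
        "harmful_content": 0.85,        # assumed
        "consistency": 0.90,            # assumed
    }
    print(f"Average score: {average_score(scores):.2f}")  # prints 0.75

As the example shows, a model whose average clears the 0.75 mark can still hide a weak score in a single category, which is why the per-category breakdown matters for compliance.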
Firms that fail to comply with the AI Act could face penalties of up to €35 million ($38 million) or 7% of their global annual revenue.
Mixed Performance
The EU is still finalizing how the AI Act's rules on generative AI tools such as ChatGPT will be enforced; in the meantime, experts are drafting a code of practice, expected by spring 2025.
LatticeFlow's test, developed alongside researchers from ETH Zurich and Bulgaria's INSAIT institute, offers a preliminary view of areas where tech firms may struggle with compliance. Discriminatory output, for example, has been a recurrent issue with generative AI models, as they often reflect human biases related to gender, race, and other factors.
OpenAI's "GPT-3.5 Turbo" received a score of just 0.46 for addressing discriminatory output, while Alibaba Cloud's "Qwen1.5 72B Chat" scored even lower, at 0.37. In terms of cybersecurity, particularly "prompt hijacking" (a type of attack in which a malicious prompt is disguised as legitimate in order to extract sensitive information), Meta's "Llama 2 13B Chat" earned a score of 0.42, and Mistral's "8x7B Instruct" scored 0.38.
Anthropic's "Claude 3 Opus," a model developed by the Google-backed company, emerged as the top performer with an average score of 0.89.
The LLM Checker, designed according to the AI Act’s provisions, will expand as new enforcement mechanisms are introduced. LatticeFlow also plans to make the tool available to developers to test their models' compliance online.
LatticeFlow’s CEO and co-founder, Petar Tsankov, stated that the results were generally positive and could help guide companies in refining their models to align with the AI Act. "While the EU is still defining all compliance benchmarks, we are already seeing gaps in these models," Tsankov said. "With a stronger focus on regulatory optimization, model developers should be able to meet the necessary requirements."
Meta declined to comment on the results, and Alibaba, Anthropic, Mistral, and OpenAI did not respond to requests for comment.
Although the European Commission cannot officially verify external tools, it has been kept informed about the development of the LLM Checker and views it as an important early step in implementing the AI Act. A spokesperson for the Commission described the initiative as a "first step" toward translating the legislation into technical standards.
Paraphrased from Reuters; all rights reserved by the original author.