Today I came across a new model called Teuken for the first time.
What stood out is that its training data includes significantly more non-English content than is typical, covering all 24 official EU languages. It's a publicly funded initiative, involving two Fraunhofer Institutes (IAIS and IIS), the Jülich Research Center, the German AI Association, TU Dresden, DFKI, IONOS, Aleph Alpha, ControlExpert, and WDR - working together under the OpenGPT-X project.
They claim the model performs on par with others of similar size, but I guess we'll have to wait for real-world feedback from users. For context, the latest frontier models have hit 2 trillion parameters - roughly 285 times more than Teuken's 7 billion. So, based purely on size, it wouldn't have made waves - not even in that November 2023 chart below.
Still, I think it’s a meaningful effort. The project took three years (starting in January 2022, released on November 26, 2024) and cost around €14 million. That’s obviously a drop in the ocean compared to what Big Tech and leading foundation model vendors are spending - but it’s a start. If Europe wants to catch up in this field, doing nothing (as has been the case for most tech waves over the last 40 years) is no longer an option.
Now at least Europe is in the game. But if the goal is not to become yet another example of falling behind in cutting-edge tech, the next step will require much bigger investments.
As for Teuken’s relevance today - it could be a helpful tool for startups and researchers working with less common (or “exotic”) languages. But to drive adoption, it still needs to align better with real-world application needs, which it doesn’t fully do at this stage.
For the record:
Teuken-7B-instruct-commercial-v0.4 is an instruction-tuned, 7B-parameter multilingual LLM trained on 4 trillion tokens across all 24 official EU languages. It’s released under the Apache 2.0 license as part of the OpenGPT-X project.
The base model (Teuken-7B-base-v0.4) is available upon request via contact@opengpt-x.de.
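
For anyone who wants to actually try it, the instruct model is published on Hugging Face and loads through the standard transformers API. The snippet below is only a sketch: the repo id, the "User" role name, and the per-language chat template option are taken from my reading of the model card, so check them there before relying on any of it.

```python
# Minimal sketch: loading Teuken-7B-instruct via Hugging Face transformers.
# The repo id and chat-template details are assumptions based on the
# OpenGPT-X model card - verify them there before use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openGPT-X/Teuken-7B-instruct-commercial-v0.4"  # assumed Hugging Face repo id

# The model ships custom tokenizer/template code, hence trust_remote_code=True.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to(device).eval()

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    use_fast=False,
    trust_remote_code=True,
)

# The model card documents per-language chat templates (e.g. "EN", "DE");
# the capitalized role name "User" follows the card's example.
messages = [{"role": "User", "content": "What are the 24 official languages of the EU?"}]
prompt_ids = tokenizer.apply_chat_template(
    messages,
    chat_template="EN",
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(device)

output = model.generate(
    prompt_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

As a rough rule of thumb, the bfloat16 weights of a 7B model take about 14 GB of memory, so smaller teams will likely want a quantized variant for local experiments.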
https://cmte.ieee.org/futuredirections/2023/11/14/llms-hitting-2-trillions-parameters/

https://opengpt-x.de/en/