Open source LLMs
Reading: Not all ‘open source’ AI models are actually open: here’s a ranking, Nature, news article 19 June 2024.
With AI models, the "open source" label has extra dimensions. It's not just about the source code and license, but also the training data, the hyperparameters, and the trained weights. (It's a bit of a mess, but there's work in progress on defining what open-source AI actually is.)
When looking at a model, for my purposes, I judge its openness on the license, and then on two questions:
- Can I run it? If I have the code and the weights (or whatever the trained representation is), I can maybe do something useful with the model.
- Can I reproduce it? If I have the paper and the training data, I can learn from and build on the work, and maybe do something adjacent. This is not a practical concern for LLMs, because I'm not likely to drop $10k to $50m to train a huge model. But for other types of AI/ML, or for fine-tuning, I could.
But you can go into a lot more detail than that. The meat of this Nature news article is its reporting on work by Liesenfeld & Dingemanse, Rethinking open source generative AI: open-washing and the EU AI Act. They assessed LLMs on 14 attributes and gave each model a score.
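As a sketch of the idea (the attribute names, three-level grading, and point values here are my own illustration, not the paper's exact rubric), an attribute-based openness score could look like:

```python
# Hypothetical sketch of attribute-based openness scoring, loosely in the
# spirit of Liesenfeld & Dingemanse: grade each dimension of a model, then
# sum the grades into one score. All names and weights are illustrative.

GRADE_POINTS = {"open": 1.0, "partial": 0.5, "closed": 0.0}

def openness_score(grades: dict[str, str]) -> float:
    """Sum grade points across attributes; higher means more open."""
    return sum(GRADE_POINTS[g] for g in grades.values())

# An invented example model, graded on a few of the many possible attributes.
example = {
    "source_code": "open",
    "training_data": "closed",   # "generic descriptors" only, say
    "model_weights": "open",
    "license": "partial",
    "paper": "partial",          # corporate preprint, low on detail
}

print(openness_score(example))  # 3.0 out of a possible 5.0
```

The useful part is less the arithmetic than the forced itemisation: to grade a model at all, you have to go attribute by attribute, which is exactly where "open"-labelled models turn out to be closed on training data or documentation.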
A couple of highlights from the analysis:
- “Around half of the models that they analysed do not provide any details about data sets beyond generic descriptors” — you don’t know what the model was trained on, or the legal status of that data!
- “Peer review seems to have ‘almost completely fallen out of fashion’, being replaced by blog posts with cherry-picked examples, or corporate preprints that are low on detail.”
Their table is over at https://opening-up-chatgpt.github.io
Stanford also maintains the Ecosystem Graphs to track models. They’ve gone with a single column describing model access as “open”, “closed”, or “limited”.
But why stop there? You’d also want to cross-check that against leaderboards for hallucination rates and performance.