Hello,
I'm excited about improving documentation for licensing in the AI space, but I know this is a tricky topic, so I decided to audit ten model listings. The models were chosen somewhat arbitrarily, with a bias towards models I knew well enough to easily check the answers.
Unfortunately, nine of the ten listings contained major errors. I don't think this can serve as a useful database in its current form, given how pervasive the mistakes are and, especially, how important the affected fields are. Some particularly notable examples: seven of the ten entries contain errors in the model parameters listing; three of the ten contain errors in the training code, evaluation code, and inference code listings (and only six of the audited models included these items at all); and every single listing appears to be wrong about the licensing of the paper and/or model card.
I really love and want to support efforts like this, but given how deeply pervasive the errors are right now, I worry that the primary impact of this database is to spread misinformation about a critical topic. I very strongly recommend a systematic review of the information in this repo and an assessment of how quality-control issues can be prevented from arising in the future, and I think you should seriously consider making the repo private until the problems have been adequately addressed.
- The following items are incorrectly listed as "Component not included" for BLOOM: training code, inference code, evaluation code, model parameters (intermediate), model metadata, and evaluation results. Additionally, the paper and technical report are incorrectly listed as "license not specified."
- For BloombergGPT, I think everything currently labeled "undisclosed" should actually be labeled "Component not included," or something like "proprietary - unreleased," which isn't currently an option. The license on the technical report is the arXiv perpetual non-exclusive license.
- DeepSeek-R1 looks largely correct, though I'm confused by the different values entered for "technical report" and "paper." Also, the model card has an MIT license.
- FairSeq Dense has an MIT license on the model weights and an arXiv perpetual non-exclusive license on the paper.
- GPT-4's listing is internally inconsistent. For example, model parameters (intermediate) is listed as proprietary while model parameters (final) is listed as not included. Similarly, the technical report and paper disagree on how they're licensed, despite presumably being the same artifact. It's also hard to tell how entries are determined to be "proprietary" vs. "component not included": the GPT-4 architecture is only known via leaks yet is listed as proprietary, while the evaluation library is listed as "component not included" even though OpenAI has a proprietary evaluation library (the same applies to many other libraries). Some other items are simply wrong, such as the claims that there is no model card and that there are no supporting tools (OpenAI released their tokenization library).
- Chinchilla's parameters and architecture are wrongly listed as Apache 2.0; the Chinchilla model was never released. Again, the technical report and paper listings are inconsistent. Model metadata and evaluation results are incorrectly listed as not included.
- GPT-NeoX-20B has the following incorrectly listed as "component not included": data processing code, training code, inference code, evaluation code, and evaluation results. It also has the following incorrectly listed as "license not specified": model card, data card, technical report, and research papers.
- Galactica is incorrectly listed as "license not specified" for the model parameters and model card; both have a CC-BY-NC license. As seems to be consistently the case, the values for the tech report and paper are inconsistent (which is odd), and the value for the paper is incorrect.
- LaMDA's listing seems to apply different criteria for "proprietary" vs. "not included" vs. "license unspecified" than GPT-4's does. It also incorrectly claims that the paper's license is unspecified.
- Polyglot-ko has most of the entries incorrectly listed as "component not included." The model card and paper are incorrectly listed as "license not specified."