Data Licensing

Using openly available data doesn’t mean the model has to be open, often referencing the cost and labor that goes into building the model itself (open_data_model, 10)
Copyrighted training data is a barrier to open sourcing a model (copyright_barrier_open, 6)
The variety of sources from which data come makes it easier for companies to get away with a sharealike-type license, especially since there is so little of the data remaining in the output of LLMs. This is a strategic choice, not a constraint (variety_of_sources, 3)
Using openly available data means the model has to be open (open_means_model, 3)
Precedents like Google Books give clues to whether LLMs are also “transformative work” (transformative_work_precedent, 2)
Data partnership is likely a protective measure against copyright lawsuits (data_partnership_protection, 1)

pstore