Microsoft's MAI Models Trained on Unlicensed Web Data
Microsoft trained its MAI models partly on unlicensed web data, contradicting its claims of using only 'enterprise grade, clean and commercially licensed data'.

Microsoft's approach to training its large language models (LLMs) appears to be no different from that of its competitors. Despite touting its use of 'enterprise grade, clean and commercially licensed data,' the company has relied partly on unlicensed web data, such as Common Crawl, to train its new MAI models. This revelation raises questions about the company's commitment to responsible AI development and its adherence to intellectual property rights.
Microsoft's reliance on unlicensed web data puts it in the same boat as other AI labs, which have faced criticism for using copyrighted material without permission. The company treats scraping websites without permission as fair use, a stance that shifts the burden to site owners, who must block its crawlers if they want to opt out. This approach has sparked debate about the boundaries of fair use and the need for clearer guidelines on AI training data.
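For site owners, that blocking is usually done through robots.txt. The sketch below is an illustration rather than anything taken from Microsoft or The Decoder: it uses Python's standard `urllib.robotparser` to check whether a given crawler may fetch a page under a sample robots.txt. `CCBot` is the user-agent token that Common Crawl's crawler announces; the rules, the `example.com` URL, the `SomeOtherBot` name, and the `is_allowed` helper are hypothetical. Note that robots.txt is a voluntary convention, so it only works against crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block Common Crawl's crawler (CCBot),
# allow every other user agent.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"  # placeholder URL
    for agent in ("CCBot", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, page))
```

A rule like this only affects future crawls; it does not remove pages already captured in earlier Common Crawl snapshots.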
The use of unlicensed web data to train AI models has become a contentious issue, with many arguing that it infringes on the rights of content creators. Microsoft's decision to use such data, despite its claims of using only licensed material, highlights how difficult it is to develop AI models while respecting intellectual property rights. How the company sources its training data has significant implications for the future of AI development.
As AI models become more powerful and pervasive, the need for transparent and responsible development practices grows more pressing. Microsoft's use of unlicensed web data is a reminder that even tech giants must respect intellectual property rights. In response to these concerns, Microsoft and other AI labs must re-examine how they source training data and prioritize transparency and accountability. This may involve developing new methods for sourcing and licensing training data or advocating for clearer guidelines on fair use and intellectual property rights in AI development.
Source: The Decoder