Baidu blocks Google, Bing from scraping content amid demand for data used on AI projects

arifnsn August 24, 2024

Chinese internet search giant Baidu appears to have started blocking the online search engines of Alphabet’s Google and Microsoft’s Bing from scraping content from Baidu Baike, the mainland firm’s Wikipedia-style service, a Post survey found.

A recent update of Baidu Baike’s robots.txt – a file that tells search engine crawlers which uniform resource locators, commonly known as web addresses, can be accessed from a site – has outright blocked the ability of the Googlebot and Bingbot crawlers to index content from the Chinese platform.
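As a rough illustration of how such a file is read by a crawler, the sketch below uses Python's standard `urllib.robotparser` module. The rules in `SAMPLE_ROBOTS_TXT`, the helper `can_crawl` and the `example.com` address are hypothetical and written only for this example; they are not a copy of Baidu Baike's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT Baidu Baike's real robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: Bingbot
Disallow: /

User-agent: Baiduspider
Allow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Placeholder address, assumed for the example.
    url = "https://example.com/item/sample-entry"
    for agent in ("Googlebot", "Bingbot", "Baiduspider"):
        verdict = "allowed" if can_crawl(SAMPLE_ROBOTS_TXT, agent, url) else "blocked"
        print(f"{agent}: {verdict}")
```

A `Disallow: /` rule under a crawler's name covers every path on the site, which is what an outright block looks like. The file is a request rather than an enforcement mechanism: it only works if the crawler chooses to honour it.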

That update appears to have been made some time on August 8, according to records on internet archive service the Wayback Machine. The records also showed that earlier on the same day Baidu Baike still allowed Google and Bing to browse and index its online repository of nearly 30 million entries, with only part of its website designated as off limits.

This initiative shows Beijing-based Baidu’s increased effort to safeguard its online assets, as demand for vast troves of data has increased for training and building artificial intelligence (AI) models and applications.

That followed US social news aggregation platform and forum Reddit’s move in July, when it blocked various search engines, except Google, from indexing its online posts and discussions. Google has a multimillion-dollar deal with Reddit that gives it the right to scrape the social media platform for data to train its AI services.

Since OpenAI released ChatGPT on November 30, 2022, major search platforms Google and Microsoft have sought to obtain more data for use in their own generative artificial intelligence systems. Photo: Shutterstock

By comparison, the Chinese version of online encyclopaedia Wikipedia has 1.43 million entries to date, which are made accessible to search engine crawlers.

Following Baidu Baike’s robots.txt update, the Post’s survey of Google and Bing on Friday found many entries – probably from older cached content – from the Wikipedia-style service still come up in the US search platforms’ results.

Representatives from Baidu, Google and Microsoft did not immediately reply to requests for comment on Friday.

Nearly two years after the groundbreaking launch of OpenAI’s ChatGPT, many large AI developers around the world are striking deals with content publishers for access to quality content for their GenAI projects.

GenAI refers to the algorithms and services, such as ChatGPT, that are used to create new content, including audio, code, images, text, simulations and videos.

OpenAI, for example, in June forged a deal with American news magazine Time that gives it access to all the archived content from more than 100 years of the publication’s history.
