Technology

Open-Source Datasets: A Double-Edged Sword

In late December, a New York Times article highlighted a new artificial intelligence (AI) program run by the People’s Republic of China’s (PRC) Ministry of State Security, the country’s primary intelligence agency. The article details that the PRC acquired an AI program that can create “instant dossiers on every person of interest in the area and analyze their behavior patterns.” According to internal memos obtained by The Times, PRC intelligence officers proposed feeding the program information from databases and footage from cameras, not only to select targets but also to map out a target’s network and vulnerabilities.

The Times story is particularly worrying at a time when open-source datasets are proliferating rapidly, presenting the PRC with ample opportunity to collect U.S. data for use in its AI programs. While open-source datasets open to the public what was once the province of the technological elite, this democratization of data also increases the vulnerability of U.S. information to PRC exploitation. Reciprocally, the proliferation of open-source datasets enables the United States to collect adversary information and make predictions about PRC behavior. These risks and opportunities make open-source datasets a double-edged sword: the same data can be exploited for advantage by either side.

How Machine Learning Works

Machine learning methods require data to learn from, and AI systems are only as effective as the data that trains them. An AI model built with machine learning is generated by providing prepared training data to the relevant algorithm. The first two steps in this process are gathering the raw data and then preparing that data for training. During training, the system learns patterns and associations from the data; after successful training, it can make predictions, perform actions, or generate synthetic data. This process underscores the importance of the datasets. As the Belfer Center notes, a suitable algorithm that learns from an extensive and relevant dataset outperforms a great algorithm that learns from minimal or poor data.
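A minimal sketch of that pipeline, using scikit-learn and one of its bundled illustrative datasets, shows the steps in order: gather, prepare, train, predict.

```python
# Minimal sketch of the machine learning pipeline described above.
# Uses scikit-learn's bundled digits dataset purely for illustration.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 1: gather the raw data.
X, y = load_digits(return_X_y=True)

# Step 2: prepare it for training; here, a simple train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 3: train; the algorithm learns patterns and associations from the data.
model = LogisticRegression(max_iter=2000)
model.fit(X_train, y_train)

# Step 4: after successful training, the model can make predictions on new data.
predictions = model.predict(X_test)
print(f"Accuracy on held-out data: {accuracy_score(y_test, predictions):.2f}")
```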

When machine learning was first developed, there were very few places researchers could reliably access training data large enough to build high-performance systems. Today that is no longer the case, owing to the proliferation of open-source datasets. The international Open Data Charter defines “open data” as “data that is made freely available for open consumption, at no direct cost to the public, which can be efficiently located, filtered, downloaded, processed, shared, and reused without any significant restrictions on associated derivatives, use, and reuse.”

Now, across the Internet, there is an ever-increasing supply of massive datasets and open-source code libraries, so much so that product developers no longer have to start from scratch. AI developers can begin with off-the-shelf large language models that are freely available to download, such as Meta’s open-source Llama 2, and then turn to online repositories such as GitHub, Hugging Face, and Kaggle for datasets that can train generative AI systems. One public benefit of this proliferation is that openly available data reduces data monopolies, in which a single company hoards all the data on a given issue. Open-source data “takes data out of the confines of the technologically privileged few and makes it freely available for all people and entities to use, reuse, and consume.” The proliferation of open-source datasets also saves the time and expense of collecting, aggregating, and storing data, freeing researchers to work on other problems.
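As a hedged illustration of this workflow, the snippet below pulls a pretrained model and an open dataset from Hugging Face; it assumes the transformers and datasets libraries are installed, and note that Llama 2 weights are gated behind Meta’s license on Hugging Face, so the model identifier presumes access has already been granted.

```python
# Sketch: starting from an off-the-shelf model and an open dataset rather
# than from scratch. Requires the `transformers` and `datasets` libraries.
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Download a pretrained large language model. Llama 2 is gated behind
# Meta's license; any openly licensed causal LM would work the same way.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Pull an open-source dataset from the same hub to fine-tune or evaluate on.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
print(dataset[0]["text"][:200])
```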

However, open-source datasets are not without drawbacks: they raise copyright, bias, and privacy concerns. While datasets are often freely available, they may include improperly licensed data, as evidenced by the rise of copyright lawsuits against AI firms. Datasets may also contain biases, such as selection bias, in which one group is overrepresented and another underrepresented. Perhaps most significantly, open-source datasets pose a privacy risk to U.S. citizens. AI’s ability to detect patterns makes these systems highly effective at reidentifying personal data in anonymized datasets. These concerns extend to nation-state adversaries, such as the PRC, collecting data on U.S. citizens to conduct targeted intelligence operations.
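To make the reidentification risk concrete, the sketch below shows a classic linkage attack: joining an “anonymized” dataset to a public one on shared quasi-identifiers. The file names and column names are hypothetical; the technique mirrors the well-known ZIP code, birth date, and sex linkage studies in the privacy literature.

```python
# Sketch of a linkage (reidentification) attack. File names and columns
# are hypothetical, used only to illustrate the technique.
import pandas as pd

# An anonymized dataset: names removed, but quasi-identifiers intact.
anonymized = pd.read_csv("anonymized_health_records.csv")  # zip, birth_date, sex, diagnosis
# A public dataset (e.g., a voter roll) with the same quasi-identifiers plus names.
public = pd.read_csv("public_voter_roll.csv")              # zip, birth_date, sex, name

# Joining on the shared columns reattaches identities to "anonymous" records.
reidentified = anonymized.merge(public, on=["zip", "birth_date", "sex"])
print(f"Reidentified {len(reidentified)} of {len(anonymized)} records")
```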

The PRC’s Exploitation of Open-Source Datasets

The PRC currently collects a vast amount of open-source data to support its intelligence, military, and security operations, including information warfare and influence targeting. For example, the United Front Work Department, a PRC agency responsible for coordinating influence operations abroad, built an “Overseas Key Information Database,” which appeared to crawl the Internet to build out personal and professional profiles of key individuals globally. The database included records on familial and friendship links, records from international think tanks and universities, scientific journal articles, blog posts, and large amounts of public sector employee records. By crawling existing databases and pulling from such a wide variety of open-source data, the database could track key influencers in a given area and visualize how news and opinions moved through social media platforms, thereby enabling PRC targeting and messaging operations. The PRC need only visit GitHub, Hugging Face, or Kaggle to access U.S. open-source datasets.

While the proliferation of open-source, AI-ready datasets has significant benefits for technology and research, adversaries now have access to these datasets, which contain large amounts of information on U.S. citizens. For example, Hugging Face allows users to post their own AI models, train them, and collaborate with others. Its platform hosts over 50,000 datasets, ranging from collections of chatbot conversations to archived tweets of financial influencers on X, medical datasets, and more. Kaggle, another site built around machine learning, advertises a “huge repository of community published models, data, and code for your next project.” Datasets range from COVID-19 pandemic data to Netflix user scores to 1.3 million U.S. patent records, electric vehicle population data by state, high school student performance and demographics, and more. There are even sites for open-source datasets by city, such as “Open Data New York,” with datasets on New York City civil service workers.
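To underscore how low the barrier to access is, the hedged sketch below enumerates public Hugging Face datasets programmatically; it assumes the huggingface_hub library, and the search term is purely illustrative.

```python
# Sketch: anyone, anywhere, can enumerate matching public datasets
# in a few lines. Requires the `huggingface_hub` library.
from huggingface_hub import list_datasets

# Search the public hub for datasets matching a topic of interest.
for ds in list_datasets(search="medical", limit=10):
    print(ds.id)
```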

The risk is that adversaries like the PRC can exploit this vast array of U.S. information to inform their own AI systems about U.S. behavior and decision-making processes. By collecting a dataset on something as innocuous as Netflix user scores and feeding it into AI algorithms, the PRC can better understand U.S. consumers’ streaming preferences, contributing to PRC efforts to influence the film industry and Hollywood. Datasets on powerful U.S. financial influencers can help the PRC predict potential targets for a misinformation campaign. Additionally, datasets on U.S. patent records can inform PRC intellectual property theft.
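To illustrate the kind of inference at stake, the sketch below clusters viewers into audience segments from ratings alone; the data is synthetic and the clustering approach is merely one plausible technique, not a description of any actual PRC system.

```python
# Sketch: behavioral inference from innocuous ratings data.
# The data here is synthetic and purely illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Rows are viewers; columns are average ratings for, say,
# [drama, action, documentary].
ratings = np.vstack([
    rng.normal([4.5, 2.0, 3.0], 0.3, size=(50, 3)),  # one taste profile
    rng.normal([2.0, 4.5, 1.5], 0.3, size=(50, 3)),  # another taste profile
])

# Unsupervised learning recovers audience segments from ratings alone,
# segments a campaign could then target with tailored content.
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(ratings)
print(np.bincount(segments))
```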

Recommendations

With the knowledge that the PRC is conducting widespread collection of U.S. open-source datasets, the United States should seek to better protect sensitive datasets from adversaries that may use them to develop their own AI models. Additionally, while the United States should not discourage open-source data sharing, it should recognize that it hands U.S. data to interested nation-state adversaries on a silver platter. With this information, adversaries can train their AI models to better understand and predict U.S. behavior and trends. Still, the opportunity to use adversaries’ open-source datasets to better understand their decision-making and to support intelligence operations does not rest solely with the PRC; the United States can and should take advantage of it, too.

Just as the number of publicly available U.S. open-source datasets for training AI models grows daily, the volume of available PRC datasets is growing as well. Examples can be found on Metatext, where more than 64 PRC datasets are available for machine learning, including conversations between PRC citizens and doctors, scientific journal articles, and datasets built from microblogging websites. The Global Data Barometer likewise covers the PRC, drawing on datasets from more than 30 PRC government agencies and private companies. Mandarin-language datasets are also widely available on Hugging Face and Kaggle.

As the United States collects PRC open-source datasets, the U.S. Intelligence Community should be wary of potential PRC deception operations to influence U.S. AI systems through fabricated datasets. If the PRC is aware of U.S. collection tactics, it may covertly disseminate open-source datasets containing false information to fool U.S. AI systems and cause them to misperform. This technique, known as data poisoning, involves inserting “poisoned” or false data into training sets to undermine machine learning-based models. Poisoned data can fool algorithms into misclassifying inputs and can also overwhelm U.S. systems with fabricated, inaccurate data. For example, poisoned data could cause a deep neural network image classifier to falsely recognize a friend as a foe.
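A minimal sketch of label flipping, one simple form of data poisoning, appears below; the dataset and model are illustrative stand-ins, not the image classifier described above.

```python
# Sketch: label-flipping data poisoning. Corrupting a fraction of
# training labels degrades the resulting model. Dataset and model
# are illustrative stand-ins.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def train_and_score(labels):
    """Train on the given labels and score on clean test data."""
    return LogisticRegression(max_iter=1000).fit(X_train, labels).score(X_test, y_test)

# The adversary flips 30% of the training labels ("friend" becomes "foe").
rng = np.random.default_rng(0)
poisoned = y_train.copy()
flip = rng.choice(len(poisoned), size=int(0.3 * len(poisoned)), replace=False)
poisoned[flip] = 1 - poisoned[flip]

print(f"Accuracy with clean labels:    {train_and_score(y_train):.2f}")
print(f"Accuracy with poisoned labels: {train_and_score(poisoned):.2f}")
```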

Accordingly, the United States should expect adversaries such as the PRC to attempt to use open-source datasets to deceive U.S. AI systems by flooding the U.S. collection enterprise with disinformation. Therefore, when collecting PRC open-source datasets, the U.S. Intelligence Community should treat all collected information with caution, as it may be a ploy to compromise U.S. AI systems. The U.S. Intelligence Community should incorporate counter-deception analysis into its processes to determine whether any collected dataset is part of a PRC deception operation.
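One form such counter-deception analysis could take, sketched below on synthetic data, is statistical screening of collected datasets for out-of-distribution records before they are used for training; the choice of detector here is an assumption for illustration, not an established Intelligence Community practice.

```python
# Sketch: flagging statistically anomalous records in a collected dataset
# before use. IsolationForest stands in for whatever vetting an analyst
# would actually layer on top; the data is synthetic.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
collected = np.vstack([
    rng.normal(0, 1, size=(950, 5)),  # plausibly genuine records
    rng.normal(6, 1, size=(50, 5)),   # planted, out-of-distribution records
])

# Fit an anomaly detector and flag suspect records for human review.
detector = IsolationForest(contamination=0.05, random_state=0).fit(collected)
suspect = detector.predict(collected) == -1  # -1 marks anomalies
print(f"Flagged {suspect.sum()} records for analyst review")
```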

The proliferation of open-source datasets that are publicly available to train AI systems presents both risks and opportunities for the United States. The U.S. Intelligence Community should take advantage of PRC open-source datasets to further U.S. intelligence collection on the PRC while remaining attuned to potential PRC deception operations designed to fool U.S. AI algorithms. At the same time, the United States should be aware of the reciprocal PRC collection of U.S. open-source datasets and seek to better protect sensitive U.S. datasets. Open-source datasets are a double-edged sword: they present the PRC with opportunities to collect vast amounts of U.S. data and wage deception operations against the United States, while presenting the United States with the same opportunities and threats.


Views expressed are the author’s own and do not represent the views of GSSR, Georgetown University, or any other entity. Image Credit: Pexels