
From Laboratories to Language Models: Can AI Support Rigor in the Jungle of Policy Analysis?

Policymakers navigate deep uncertainty when dealing with national security and foreign policy dilemmas. Compared to some fields of engineering and traditional “hard sciences” (e.g., physics, medicine), intelligence analysts and political scientists are often unable to test their claims via rigorous experiments or other well-defined and reliable procedures. As a result, policymakers struggle to identify trustworthy sources of information: analysts might selectively choose evidence that supports their arguments, cater to the biases of their audience, or misrepresent opposing viewpoints. In short, whereas so-called “hard scientists” can farm knowledge by relying on generalized procedures, national security policymakers often must forage for information in dark jungles filled with obstacles and poisonous fruit.

In a previous GSSR article, I proposed two methods for improving the reliability or auditability of analyses: visualizing arguments as nodes and links to help readers understand or search for relationships between claims, and using dynamic documents that allow post-publication commentary. However, I recognized that these methods would not compare to the impact of scientific tools or methods such as microscopes, statistical calculators, and randomized controlled trials. Central to the problem was the seemingly irreducible complexity of natural language and human intuition: you can design software that reliably performs statistical analyses, but you cannot algorithmically define “common sense.”

Coincidentally, that article was published on the same day that OpenAI released ChatGPT (GPT-3.5). A few months later, OpenAI released GPT-4. These developments forced me—and many others—to reconsider what was technologically feasible in the near term: perhaps one cannot define common sense and similar concepts through handwritten algorithms, but might it be possible to train large language models (LLMs) to automate tasks that rely on these concepts? A variety of experiments in 2023 suggested that LLMs or similar generative artificial intelligence (AI) models could eventually support research tasks or methods such as peer review, qualitative data collection and categorization, and even agent-based modeling. While researchers and analysts should recognize that current AI models still have many limitations, it seems reasonably plausible (>25% likely) that within four years, LLMs and related AI models will enable significantly greater efficiency or rigor in policy analysis and social sciences. Thus, intelligence analysts, social scientists, and policymakers should seriously examine these models’ potential and pave the way for their use where appropriate.

The Power of Experiments

To understand how LLMs and related tools could be of value to social scientists and policy analysts, it is helpful to understand some of the mechanisms that make experiments valuable rather than treating them as inimitably magical. This is especially important given that many esteemed scientists have insisted that experiments are “the sole test of the validity of any idea”—that people should eschew reliance on the kind of non-deductive reasoning that is unavoidable in foreign policy and social science. This pessimism is not entirely unfounded: policymakers may review debates between “experts,” request case studies, apply red-teaming methods, and leverage first-principles thinking, but these and other methods still often fail to prevent analytical errors, or they may even induce overconfidence—especially when the policymakers or researchers are biased. Unsurprisingly, many debates go unresolved: military and intelligence analysts disagree over the likelihood that China will invade Taiwan by 2028, scholars clash over why democracies are less likely to engage in conflicts with each other, and engineers quarrel over the risks posed by AI.

In contrast, well-designed experiments such as randomized controlled trials are useful because they efficiently test a wide range of confounding variables or counterarguments—including arguments that the experimenters may not have even imagined or may have been too biased to accurately evaluate. This is especially important for complex fields like security policy, where it is easy to overlook objections or otherwise make “honest mistakes”; as Thomas Schelling quipped, even the most capable analyst cannot “draw up a list of things that would never occur to [them].” However, experiments are certainly not the only procedure that can reveal flaws in arguments. Oftentimes, researchers who want to improve their assessments can do so by spending tens or hundreds of hours exploring objections, weighing the evidence, and seeking feedback from peers. Thus, in some cases the problem that plagues decision-makers is not simply that analysts are incapable of producing better assessments, but rather that decision-makers have limited time or expertise to determine who is more trustworthy.

Why is this problem of trustworthiness often less severe in traditional sciences? One explanation is that experiments and other clearly specifiable procedures can serve as due diligence standards that make it easier for audiences to evaluate the quality of research. For example, it is often easier to properly interpret the results of an experiment than a debate on the same issue, and verifying the integrity of an experiment is often easier than conducting the original experiment. Importantly, thorough verifications (e.g., full replications) are often unnecessary because the threat of punishment incentivizes researchers to be honest/diligent. For example, a researcher in a marketing department could report sloppy or fraudulent experimental results to their boss, but if their boss audits the findings for irregularities and determines that the researcher was deceptive or negligent—that they failed to follow the clearly specified procedure—they can fire the researcher.

In summary, experiments and similar procedures can not only improve the accuracy of a researcher’s assessment by testing counterarguments but also make it significantly easier for less familiar audiences to verify the results.

The Limits of Experimentation in Policy and Social Science

Unfortunately, it is often impossible to rely on direct experiments in policymaking, especially for national security and foreign policy: analysts cannot “rerun alternative histories” to determine whether sanctions on some country (e.g., North Korea) had a deterrent effect on other actors or test hypotheses for why democratic states rarely fight one another. If the People’s Republic of China (PRC) manages to catch up to Western semiconductor manufacturing in the next five years, it may be difficult to determine the degree to which U.S. semiconductor export controls actually accelerated this outcome, in part because we cannot observe the counterfactual world where such policies were not enacted. Additionally, it is often hard to draw insights from case studies given that they may be limited in number or similarity, and relevant data may be unreliable (e.g., intentionally misreported) or hard to objectively measure (e.g., leadership “quality”). 

Peer review can improve some analyses, but it can still fail to identify flaws in research or take too long to return feedback in fast-moving fields such as technology policy. Audiences can try to set generic analytical standards: Intelligence Community Directive 203 dictates that analysis should “be informed by all relevant information available,” “identify and assess plausible alternative hypotheses,” and “make the most accurate judgments and assessments possible.” However, these standards are far more ambiguous than standards like “assessments should be based on statistically significant results from randomized controlled trials.” Such ambiguity makes it difficult to verify compliance and subsequently punish bad-faith non-compliance (e.g., cherry-picking evidence, misrepresenting opponents’ arguments). Moreover, it may not be possible for analysts to actually follow very stringent standards (e.g., “consider every plausibly relevant case study”).

The Potential for LLMs and Other Generative AI Models to Support Analysis

Current frontier LLMs (e.g., GPT-4) have many shortcomings, but they can already accelerate some research tasks, and researchers should be open-minded about future models’ potential to dramatically improve research standards—especially given the surprising progress in models’ capabilities over the past three years. Some noteworthy potential use cases (in loose order of difficulty) include:

“Smart Search”: LLMs can help find information even when the searcher does not know the exact words/phrases to search for, and it seems very likely that near-future (<2 years) AI models could perform these concept searches across many long documents, such as archival records of wars, intelligence reports, or journal articles. This could allow analysts to more quickly find relevant information or make verifiable claims like “GPT-5 did not identify a rebuttal to X argument in any of the 72 listed articles on this topic.”
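To make the idea concrete, below is a minimal sketch of embedding-based concept search, assuming the OpenAI Python SDK and an illustrative embedding model name; any provider could be substituted, and a real deployment would also need document chunking, caching, and human verification of whatever passages are retrieved.

```python
# Minimal concept-search sketch (illustrative model name, not a recommendation).
from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def embed(texts: list[str]) -> np.ndarray:
    """Return one embedding vector per input text."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])


def concept_search(query: str, passages: list[str], top_k: int = 5):
    """Rank passages by semantic similarity to a described concept, not keyword overlap."""
    vectors = embed(passages + [query])
    passage_vecs, query_vec = vectors[:-1], vectors[-1]
    # Cosine similarity between the query concept and each passage.
    sims = passage_vecs @ query_vec / (
        np.linalg.norm(passage_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return sorted(zip(sims, passages), reverse=True)[:top_k]


# Example: find passages about deterrence failures even if the word
# "deterrence" never appears in the archive.
# concept_search("cases where a threat failed to change an adversary's behavior",
#                archive_passages)
```

The design point is that the search key is a description of a concept rather than a keyword, which is what would let an analyst make claims about what a model did or did not find across a defined corpus.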

Larger-scale, replicable qualitative coding: It can be difficult to collect and label data for potentially important variables in social science, especially when those variables require sifting through large amounts of natural language text or are “subjective.” For example, the Consortium for the Study of Terrorism and Responses to Terrorism hired dozens of interns (including me) to spend hundreds of hours trawling documents for details about the biographies and relationships of domestic extremists—a task that current LLMs or their near-future successors would very likely accelerate. Such capabilities could make some large-scale data collection feasible, thus allowing audiences to set higher standards for empirical analyses. For more controversial or “subjective” measures (e.g., patent quality), researchers might carefully fine-tune models with high-effort, high-quality evaluations and, if the model performs reliably in out-of-sample testing, use it as a more credible or replicable judge.
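As a rough illustration, the sketch below applies a toy codebook to text excerpts with an LLM and checks agreement against a human-coded validation sample; the codebook, labels, and model name are placeholders rather than a recommended scheme.

```python
# Minimal LLM-assisted qualitative coding sketch (toy codebook, illustrative model name).
from openai import OpenAI

client = OpenAI()

CODEBOOK = """Label the excerpt with exactly one category:
- RECRUITMENT: describes how an individual joined a group
- FINANCING: describes how activities were funded
- OTHER: anything else
Respond with the label only."""


def code_excerpt(excerpt: str, model: str = "gpt-4o") -> str:
    """Ask the model to apply the codebook to one excerpt."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # more deterministic output, for replicability
        messages=[
            {"role": "system", "content": CODEBOOK},
            {"role": "user", "content": excerpt},
        ],
    )
    return resp.choices[0].message.content.strip()


def agreement_rate(excerpts: list[str], human_labels: list[str]) -> float:
    """Compare model labels against a human-coded validation sample."""
    model_labels = [code_excerpt(e) for e in excerpts]
    matches = sum(m == h for m, h in zip(model_labels, human_labels))
    return matches / len(human_labels)
```

The out-of-sample agreement check is the part that matters for credibility: a model used as a coder is only as trustworthy as its measured agreement with careful human judgments on data it has not seen.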

Preliminary argument review and analytical support: Journals or other institutions could potentially accelerate or deepen the peer review process by relying on LLMs to provide initial rounds of feedback. Some researchers have examined GPT-4’s ability to provide feedback, with one paper concluding that “LLM feedback could benefit researchers.” Relatedly, in 2023, the Intelligence Advanced Research Projects Activity launched the Rapid Explanation, Analysis, and Sourcing Online (REASON) Program to “exploit recent advances in artificial intelligence” to assist analysts by “pointing them to key pieces of evidence beyond what they have already considered.” LLMs will almost certainly fail to catch some issues (false negatives) and may erroneously flag legitimate claims (false positives) more often than a human reviewer, yet they could still enable analysts to identify patterns across data sources more efficiently than traditional methods or identify important objections “that would never occur to [them].” As a result, audiences could set standards such as “subject your arguments to review by state-of-the-art LLM X (and at least report the results in an appendix)”—then if the audience finds that the LLM consistently raises a strong rebuttal or alternative explanation that a researcher failed to acknowledge, the audience could treat this as negligence.
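Such a workflow could start as simply as the sketch below, which asks an LLM for an initial round of critical feedback on a draft; the prompt wording and model name are illustrative, and a real process would log the model version and full output so the review step itself is auditable.

```python
# Minimal "LLM as preliminary reviewer" sketch (illustrative prompt and model name).
from openai import OpenAI

client = OpenAI()

REVIEW_PROMPT = (
    "You are reviewing a draft policy analysis. List the strongest "
    "counterarguments, alternative explanations, and missing evidence the "
    "author should address. Number each point and keep each one concise."
)


def llm_review(draft: str, model: str = "gpt-4o") -> str:
    """Return an initial round of critical feedback on a draft analysis."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": REVIEW_PROMPT},
            {"role": "user", "content": draft},
        ],
    )
    return resp.choices[0].message.content


# The output could be attached as an appendix, letting an audience check
# whether the author acknowledged the objections the model raised.
```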

Animal models for human interactions and reasoning: Mice are very different from humans, yet scientists have found that experiments with mice still provide valuable insights into human biology, such as determining whether a substance is likely to cause cancer. Although the results for mice do not always directly translate to humans, these experiments can provide preliminary tests of hypotheses’ plausibility or flag surprising dynamics for closer study. LLMs are also very different from humans but might still be useful for experiments relevant to human reasoning or social interactions. One 2023 study used GPT-3.5 (not even GPT-4) to control dozens of virtual characters in a village (reminiscent of The Sims) and found that, despite their flaws, the agents produced believable simulations of “both individual and emergent group behavior.” It seems plausible (>10% likely) that by the end of this decade, researchers evaluating the impact of a proposed organizational structure (e.g., creating the Space Force, merging the State Department and USAID, creating a “new government agency” to regulate AI) would find it worthwhile to test their claims experimentally via these “social simulations,” even if they are imperfect. Unlike wargames or other simulations that depend on human participants, experimenters could run thousands of these LLM-based simulations and carefully control the dozens of high-level variables that seem relevant. As for individual human reasoning, LLMs might make it easier for researchers to directly test the usefulness, rather than the “correctness,” of theories (e.g., realism, liberalism), guidance (e.g., Intelligence Community Directive 203’s Analytical Standards), or structured analytical techniques (e.g., analysis of competing hypotheses), such as by evaluating their impact on the predictive performance of LLMs.
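For intuition, the sketch below shows how small such a “social simulation” can be in code: a few persona-conditioned agents take turns adding to a shared transcript. It is loosely inspired by the generative-agents study above but is far simpler, and the personas, model name, and turn structure are purely illustrative.

```python
# Minimal multi-agent "social simulation" sketch (illustrative personas and model name).
from openai import OpenAI

client = OpenAI()

PERSONAS = {
    "Agency director": "You run a new regulatory agency and want clear authority.",
    "Industry liaison": "You represent firms worried about compliance costs.",
    "Legislative staffer": "You care about oversight and budget constraints.",
}


def agent_turn(name: str, persona: str, transcript: list[str],
               model: str = "gpt-4o") -> str:
    """Generate one agent's next statement given the conversation so far."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": f"You are {name}. {persona} Respond in one or two sentences."},
            {"role": "user", "content": "\n".join(transcript) or "Begin the meeting."},
        ],
    )
    return resp.choices[0].message.content


def run_simulation(rounds: int = 3) -> list[str]:
    """Run a short multi-agent discussion and return the transcript."""
    transcript: list[str] = []
    for _ in range(rounds):
        for name, persona in PERSONAS.items():
            transcript.append(f"{name}: {agent_turn(name, persona, transcript)}")
    return transcript
```

Running many such simulations while varying one persona or rule at a time is what would allow an experimenter to treat the setup as a crude controlled experiment rather than a single anecdote.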

Researchers Should Test LLMs’ Abilities and Prepare Where Appropriate

Within many scientific fields, procedural standards such as experiments and statistical significance tests can produce results that are more accurate and easier to verify—even for non-experts. In some fields (e.g., microbiology), developing such standards required tools that made the standards practical, such as microscopes to test theories through observation. However, such high-quality standards have long been impractical for many topics in national security and foreign policy—as well as the social sciences more broadly, on which those policy topics often rely. Some nice-sounding due diligence standards (e.g., “test your claims against all relevant case studies”) are difficult to enforce because they demand too much from researchers or are hard to operationalize (e.g., what counts as a “relevant” case study?). Formal peer review can help but is often slow and fails to catch some errors in research. Thus, it can be difficult for policymakers to determine which (if any) analysts and their assessments are reliable.

Should analysts give up hope of meaningfully improving rigor in these areas? Will these domains forever be treacherous jungles of complexity for their audiences and inhabitants? Perhaps, yet the ascendance of LLMs over the past two years offers some reason for optimism: it is entirely plausible that these and related AI models will enable better methods this decade, such as by making qualitative data collection more practical or by generating initial rebuttals to one’s analysis. As with the earliest microscopes, such tools will likely require further improvements before scholars can trust them with some tasks, but if effective, they could significantly reduce some of the barriers to empirical analyses and operationalize due diligence standards that were previously unenforceable.

Thankfully—unlike with many policy questions—researchers can begin experimentally evaluating LLMs’ usefulness for policy analysis and social science. Given how valuable even marginal improvements in national security and foreign policy assessments could be, more scholars and analysts should thoroughly explore these tools’ potential and prepare for their integration where appropriate.


Views expressed are the author’s own and do not represent the views of GSSR, Georgetown University, or any other entity. Image Credit: Canva Images