From Data Analysis to a Partner in Discovery: Artificial intelligence in science is undergoing a profound transformation. For years it was seen mainly as a tool for analysing large datasets. Today we’re witnessing an evolution from a passive analyst to an active partner in research — AI systems can help formulate hypotheses, design experiments, and interpret results (always under human oversight). This trend is well documented in 2025 surveys and reports, including the Stanford AI Index 2025 and the State of AI Report 2025.
1. The new paradigm: concrete uses of AI agents in science
Specialised “scientist agents” are accelerating the discovery cycle across multiple disciplines. Below are examples from 2024–2025 grounded in peer-reviewed publications and official reports.
1.1. Biology & medicine: decoding the “language of life”
- Protein and drug design. The ESM3 model (EvolutionaryScale) generated an entirely new fluorescent protein, esmGFP, documented in Science (2025); the protein lies at a large evolutionary distance from natural fluorescent proteins and was validated experimentally. See Science (2025) and the team’s blog EvolutionaryScale.
- Structures and biomolecular interactions. AlphaFold 3 (2024) broadened prediction to complexes involving proteins, nucleic acids and small molecules [Nature, 2024]. Independent analyses also highlight limitations for some RNA cases and edge scenarios [JCIM, 2025], [C&EN, 2025].
- “Co-Scientist”. A multi-agent system from Google DeepMind proposed and tested biological mechanisms; results on bacterial gene transfer were published in Cell (2025) [Cell, 2025]. Overview: Google Research.
- Virtual lab & nanobodies. A Stanford-led team demonstrated a “Virtual Lab”, a group of AI agents working alongside scientists, that designed 92 candidate nanobodies targeting SARS-CoV-2 variants; selected constructs were experimentally validated, and the work was published in Nature (2025): Nature (article). See also summaries by Stanford Medicine (news), Reuters — Health Rounds, and Nature News on the preprint.
- Universal interaction models. ATOMICA learns representations of intermolecular interfaces (protein–ligand, protein–RNA, etc.); this is a preprint with experimental validation of selected cases [bioRxiv, 2025] (project page: Harvard).
1.2. Chemistry & materials: towards autonomous labs
- Robotised synthesis. A classic reference is Liverpool’s mobile “robot chemist”, which planned and ran experiments autonomously [Nature, 2020]. Newer “self-driving lab” platforms markedly increase throughput; for example, NC State’s flow-chemistry platforms demonstrate high-rate experiments and data intensification, reported in Nature Chemical Engineering (2025) [overview], [NC State lab news].
- New materials from generative models. LLM-based approaches (e.g., CrystaLLM) and diffusion models generate candidate crystal structures whose stability is corroborated by simulation [Nat. Commun., 2024]. Recent work combines LLMs with diffusion to improve stability and novelty [arXiv, 2025], and LLMs alone can yield stable crystals without finetuning in MatLLMSearch [arXiv, 2025]; a minimal propose-then-verify sketch follows this list.
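The common thread in these generators is a propose-then-verify loop: sample many candidate structures, estimate their stability in simulation, and keep only the plausible ones for eventual synthesis. Below is a minimal sketch of that loop; `sample_structures` and `energy_above_hull` are hypothetical stand-ins for a trained generative model and a DFT/ML-potential stability estimate, not any real package’s API.

```python
# Minimal propose-then-verify loop for generative materials discovery.
# Both helper functions are hypothetical placeholders, not a real API.

from dataclasses import dataclass

@dataclass
class Candidate:
    formula: str          # reduced chemical formula, e.g. "Li2MnO3"
    cif: str              # crystal structure serialised as CIF text
    e_above_hull: float | None = None  # eV/atom; 0.0 = thermodynamically stable

def sample_structures(n: int) -> list[Candidate]:
    """Placeholder: draw n candidate structures from a generative model."""
    raise NotImplementedError

def energy_above_hull(c: Candidate) -> float:
    """Placeholder: estimate stability with DFT or an ML interatomic potential."""
    raise NotImplementedError

def discover(n_samples: int = 1000, threshold: float = 0.1) -> list[Candidate]:
    """Keep only candidates whose predicted energy above hull is low."""
    kept = []
    for cand in sample_structures(n_samples):
        cand.e_above_hull = energy_above_hull(cand)
        if cand.e_above_hull <= threshold:  # plausibly (meta)stable
            kept.append(cand)
    # Most promising first; experimental synthesis remains the final arbiter.
    return sorted(kept, key=lambda c: c.e_above_hull)
```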
1.3. Formal & computational sciences
- Algorithm discovery. AlphaEvolve (DeepMind) is an evolutionary coding agent that proposed and verified new algorithms. The team reports a 4×4 matrix-multiplication scheme using 48 scalar multiplications, plus speedups to selected compute kernels in Google’s stack; a toy version of its propose-evaluate-select loop is sketched after this list. See the blog and arXiv (2025).
- Mathematical reasoning. In 2025, systems from OpenAI and Google achieved “gold-level” performance at the International Mathematical Olympiad, solving 5/6 problems under time constraints — see Reuters and DeepMind (with solutions PDF).
- Earth-system & weather models. ORBIT (ORNL) scales to ~113B parameters on Frontier [arXiv, 2024], [ORNL news, 2024]. In weather forecasting, the WeatherNext/GenCast family (Google DeepMind) achieves high-quality probabilistic forecasts up to 15 days — see GenCast, the Nature article, and the public dataset in Earth Engine WeatherNext Gen. In parallel, ECMWF has made its AI Forecast System (AIFS) operational with up-to-15-day forecasts [ECMWF, 2025].
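Returning to the algorithm-discovery item above: at its core, an AlphaEvolve-style system couples an LLM that proposes code rewrites with a fully verifiable fitness function (correctness tests plus an objective such as operation count). The toy sketch below shows that propose-evaluate-select loop under heavy simplification; `passes_unit_tests`, `count_scalar_multiplications` and `mutate_with_llm` are hypothetical placeholders, not any published interface.

```python
# Toy evolutionary search over candidate programs, in the spirit of
# AlphaEvolve's propose-evaluate-select loop (heavily simplified).

import random

def passes_unit_tests(program: str) -> bool:
    """Placeholder: run the candidate against a fixed correctness suite."""
    raise NotImplementedError

def count_scalar_multiplications(program: str) -> int:
    """Placeholder: count scalar multiplies performed by the candidate."""
    raise NotImplementedError

def mutate_with_llm(program: str) -> str:
    """Placeholder: ask an LLM for a semantics-preserving rewrite."""
    raise NotImplementedError

def fitness(program: str) -> float:
    """Verifiable score: failing the tests is disqualifying; otherwise fewer multiplies win."""
    if not passes_unit_tests(program):
        return float("-inf")
    return -count_scalar_multiplications(program)

def evolve(seeds: list[str], generations: int = 100, pop_size: int = 32) -> str:
    """Mutate a strong parent each round; keep only the best-scoring programs."""
    population = list(seeds)
    for _ in range(generations):
        parent = max(random.sample(population, k=min(4, len(population))), key=fitness)
        population.append(mutate_with_llm(parent))
        population = sorted(population, key=fitness, reverse=True)[:pop_size]
    return population[0]
```

Because every candidate is checked against the test suite before it can score at all, the loop can only ever return a verified-correct program, which is what makes this search strategy safe to automate.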
These examples fuel innovation but still demand careful validation.
2. “Engines of discovery”: technologies powering the revolution
2.1. Advanced reasoning & planning
New models (e.g., o1/o3, Gemini Deep Think) use “think-before-answering” techniques and increased test-time compute. At the same time, research shows brittleness to distractors — models can fail due to small, irrelevant additions to a problem [arXiv: Cats Confuse Reasoning LLMs, 2025].
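To make the distractor finding concrete, here is a minimal sketch of how such brittleness can be probed: run the same benchmark twice, once with an irrelevant sentence appended, and compare accuracies. The `query_model` callable is an assumption standing in for any chat-API wrapper; the distractor text mirrors the cat-themed triggers from the cited paper.

```python
# Sketch of a distractor-robustness probe in the spirit of "Cats Confuse
# Reasoning LLMs": append one irrelevant sentence to each problem and
# measure the accuracy drop.

DISTRACTOR = "Interesting fact: cats sleep for most of their lives."

def accuracy(problems: list[str], answers: list[str],
             query_model, distract: bool = False) -> float:
    """Exact-match accuracy, optionally with the distractor appended."""
    hits = 0
    for problem, answer in zip(problems, answers):
        prompt = f"{problem} {DISTRACTOR}" if distract else problem
        hits += query_model(prompt).strip() == answer
    return hits / len(problems)

# robustness_gap = accuracy(p, a, qm) - accuracy(p, a, qm, distract=True)
```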
2.2. Agent systems & tool use
A true scientific agent integrates an LLM “brain” with digital and physical tools via function calling/APIs — from structural databases to robotised lab equipment. Practical demonstrations include Co-Scientist [Google Research] and virtual labs [Nature, 2025].
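A minimal sketch of that integration pattern is below: tools are registered as plain functions, the LLM picks one per step, and each observation flows back into the context until the agent produces a final answer. The tool names (`search_pdb`, `run_assay`) and the `llm_decide` callable are illustrative assumptions, not a real framework’s API.

```python
# Minimal tool-use loop: an LLM "brain" decides which registered tool to
# call, the runtime executes it, and the observation is fed back.

import json
from typing import Callable

TOOLS: dict[str, Callable[..., str]] = {}

def tool(fn: Callable[..., str]) -> Callable[..., str]:
    """Register a plain Python function as a callable tool."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def search_pdb(query: str) -> str:
    """Placeholder: query a structural database for matching entries."""
    raise NotImplementedError

@tool
def run_assay(sample_id: str) -> str:
    """Placeholder: queue an experiment on robotised lab equipment."""
    raise NotImplementedError

def agent_loop(goal: str, llm_decide, max_steps: int = 10) -> str:
    """llm_decide(history) returns JSON: {"tool": name, "args": {...}} or {"final": text}."""
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        action = json.loads(llm_decide(history))
        if "final" in action:
            return action["final"]
        observation = TOOLS[action["tool"]](**action["args"])
        history.append({"role": "tool", "content": observation})
    return "step budget exhausted"
```

Capping the loop at `max_steps` and routing physical actions (like `run_assay`) through explicit registration are the simplest forms of the human oversight the section keeps returning to.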
2.3. Reinforcement learning with verifiable rewards (RLVR)
Where correctness can be automatically checked (proofs, program tests, molecular properties), RL with verifiable rewards speeds up learning to reason — yet stability and generalisation still require rigorous evaluation. (See AI Index 2025 for an overview; examples include AlphaEvolve and IMO 2025.) See also DeepSeek-R1 in Nature [2025] and a recent survey [arXiv, 2025].
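For intuition, a verifiable reward for code-generation tasks can be as blunt as “all unit tests pass”. The sketch below shows that idea; the `solve` entry point and the test format are illustrative assumptions, and real systems execute candidates in a sandbox rather than in-process.

```python
# Sketch of a verifiable reward for RLVR-style training on code tasks:
# reward 1.0 only if the generated program passes every unit test.
# Running untrusted code in-process is unsafe; real systems sandbox this.

def verifiable_reward(program: str, tests: list[tuple[tuple, object]],
                      entry_point: str = "solve") -> float:
    """Return 1.0 if `program` defines `entry_point` and passes all tests, else 0.0."""
    namespace: dict = {}
    try:
        exec(program, namespace)                 # sandbox this in practice!
        fn = namespace[entry_point]
        for args, expected in tests:
            if fn(*args) != expected:
                return 0.0
    except Exception:
        return 0.0
    return 1.0

# Usage: verifiable_reward("def solve(x): return x * 2", [((3,), 6)])  -> 1.0
```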
2.4. World models & simulation
The Genie family (DeepMind) creates interactive, controllable worlds for low-cost hypothesis testing (Genie 2, Genie 3). Such simulations shorten the “hypothesis → experiment” loop before moving to expensive wet-lab work.
3. The lab of the future: challenges and prospects
3.1. Towards integrated “AI labs”
The strategic goal is an integrated chain: from literature review and hypothesis generation, through planning and running experiments (in silico + robotics), to analysis and manuscript drafting, all with a human in the loop. The concept of open-endedness (systems that keep raising both the difficulty of their tasks and their own competence) is becoming central.
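One way to picture that chain is as a staged pipeline in which a scientist signs off before each increasingly expensive step runs. This is only an illustrative skeleton under that assumption; the stage names and gating logic are inventions for the sketch, not a description of any existing system.

```python
# Illustrative skeleton of an integrated "AI lab" pipeline with a human
# in the loop: every artefact passes an approval gate before the next
# (increasingly expensive) stage runs.

from typing import Callable

Stage = Callable[[str], str]

def human_gate(stage_name: str, artefact: str) -> bool:
    """A scientist reviews the artefact before committing further resources."""
    return input(f"[{stage_name}] approve this output? (y/n)\n{artefact}\n> ") == "y"

def run_pipeline(question: str, stages: list[tuple[str, Stage]]) -> str | None:
    artefact = question
    for name, stage in stages:
        artefact = stage(artefact)
        if not human_gate(name, artefact):
            return None  # halted by human oversight
    return artefact

# stages might be: [("literature review", review), ("hypotheses", hypothesise),
#                   ("experiment plan", plan), ("in-silico run", simulate),
#                   ("analysis", analyse), ("draft", draft_manuscript)]
```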
3.2. Challenges and limitations
| Technical challenge | Implications |
|---|---|
| Brittleness of reasoning | Susceptibility to distractors and errors in multi-step tasks (arXiv, 2025). |
| Out-of-distribution generalisation | Quality drops on novel problem classes; examples of limitations in AF3 for RNA/unusual cases (JCIM, 2025; C&EN, 2025). |
| Real-world validation | Predictions and simulations remain hypotheses: drug candidates, new molecules and materials require costly, time-consuming validation in wet labs and/or production (technology readiness levels, TRL). |
3.3. The evolving role of scientists
AI will not replace researchers; it will amplify them — automating tedious steps and freeing time for creative questions, cross-disciplinary synthesis, and ethical oversight. The biggest wins come from human↔machine synergy.
References & selected sources
- Stanford AI Index 2025 — report page
- State of AI Report 2025 — online
- ESM3 / esmGFP — Science (2025), EvolutionaryScale blog
- AlphaFold 3 — Nature (2024); limitations: JCIM (2025), C&EN (2025)
- Virtual Lab / nanobodies — Nature (2025), Stanford Medicine, Reuters
- ATOMICA — bioRxiv (2025), project page
- AlphaEvolve — blog, arXiv (2025)
- IMO 2025 — gold level — Reuters, DeepMind, solutions (PDF)
- ORBIT (113B) / climate models — arXiv (2024), ORNL news
- WeatherNext / GenCast — project page, Nature, Earth Engine dataset
- ECMWF AIFS — press release (2025)
- Brittleness of reasoning — arXiv (2025)