
SciDER: Scientific Data-centric End-to-end Researcher

Beginner
Ke Lin, Yilin Lu, Shreyas Bhat et al. · 3/2/2026
arXiv

Key Summary

  • SciDER is a team of smart AI helpers that can run almost the whole research process: think of ideas, read raw data, write and run code, and improve itself with feedback.
  • Unlike many past systems that only work on clean public datasets, SciDER can open messy, real lab data and turn it into organized knowledge before coding.
  • It uses a self-evolving memory so it can remember what worked before and get better over time, like a scientist who keeps good lab notes.
  • Four specialized agents share the work: an Ideation agent, a Data Analysis agent, an Experimentation agent (coding and running), and a Critic agent.
  • A retrieval step (RAG) helps the system pull the right memories and references into context so it can reason with the most relevant facts.
  • On idea generation tests, SciDER produced more original and feasible research ideas than other systems.
  • On machine learning competitions (MLE-Bench), SciDER earned more gold-level solutions than strong baselines in a single try.
  • On tough science coding tasks (SciCode) across physics, chemistry, biology, and more, SciDER solved more problems than several top models.
  • It comes as a modular Python package and a lightweight web app so researchers can upload data and run a full workflow with little setup.
  • Limitations include no built-in paper-writing module yet and reliance on external AI APIs, which can affect cost and data privacy.

Why This Research Matters

When research moves faster and fails less often, everyone benefits: medicines can be discovered sooner, materials can become stronger and lighter, and clean energy solutions can be tested more quickly. SciDER focuses on the most common pain points—messy data and brittle code—and turns them into strengths by starting with data analysis. Its critic-led loop reduces silent errors that might otherwise mislead scientists. The self-evolving memory means today’s hard-won fix becomes tomorrow’s instant advantage. Because it is modular and easy to run, smaller labs and startups can access advanced research automation without building everything from scratch. Over time, this can democratize discovery and widen participation in cutting-edge science.

Detailed Explanation


01 Background & Problem Definition

You know how a school science fair project has many steps: think of a question, gather materials, run tests, study the results, and share what you learned? For a long time, AI tools could help with only a few of these steps, especially when the data was already neat and tidy. But real science is messier: data comes in strange formats, with missing pieces and odd labels, and the right code depends on those real details.

Before SciDER, many AI research systems were like recipe books that worked best with pre-packaged ingredients. They performed okay on public machine learning datasets that were cleaned and standardized. But when scientists brought their own raw lab data—say, microscope images with custom metadata, physics logs in unusual text formats, or biology files from special instruments—those systems often got confused. They could suggest big ideas, but they struggled to open the files, understand the structure, and write runnable code that fits the exact data.

Here’s the problem researchers faced: translating an abstract idea into a precise experiment depends on the nitty-gritty of the data. If a system cannot inspect folders, parse file types, understand schemas, and detect quality issues, it may write code that crashes or uses the wrong columns. Without that data-first understanding, even clever ideas can’t turn into working experiments.

People tried a few paths. Some built agents that were great at brainstorming, but they needed humans to clean data and fix code. Others made agents that could run pipelines, but only after the data was already in a standard shape. A few added memory, but often it was not organized or used in a way that helps at the exact moment of need. The result: slow feedback loops, many manual fixes, and limited success on specialized tasks.

What was missing was a data-centric approach—where the system begins by carefully reading the raw data, learning its structure and meaning, and then uses that knowledge to guide ideation, coding, and execution. Also missing was a memory that improves over time and can be recalled just-in-time for similar future tasks, like a scientist’s lab notebook that grows with every experiment.

🍞 Top Bread (Hook): Imagine you’re building a LEGO city. If you don’t look into the box to see which pieces you have, you might design a bridge that’s impossible to build. 🥬 Filling (The Actual Concept): Data-centric approach

  • What it is: A way of doing research where raw data is the starting point that guides ideas, code, and experiments.
  • How it works:
    1. Open and inspect the real files and folders you have.
    2. Figure out formats, columns, links, and data quality.
    3. Write a structured report about the data’s structure, meaning, and dependencies.
    4. Use that report to design experiments and generate code that truly fits the data.
  • Why it matters: Without it, the system might write code that doesn’t run or test the wrong thing, wasting time and missing discoveries. 🍞 Bottom Bread (Anchor): A biology lab uploads images with custom labels in sidecar JSON files. A data-centric system reads both the images and the JSON, learns how they match, and then writes code that loads and analyzes them correctly.
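The data-centric steps above can be sketched in a few lines of Python. This is a minimal illustration, not SciDER's actual code: `profile_dataset` and its report fields are hypothetical names, and a real system would profile far more (schemas, encodings, quality checks) than this file inventory and image-to-sidecar pairing.

```python
import json
import os
from collections import Counter

def profile_dataset(root):
    """Walk a raw data folder and build a minimal structural report:
    a file-type inventory plus image/sidecar-JSON label pairings."""
    report = {"file_types": Counter(), "pairs": [], "orphans": []}
    files = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            files.append(os.path.join(dirpath, name))
            report["file_types"][os.path.splitext(name)[1].lower()] += 1
    # Pair each image with a sidecar JSON that shares its base name,
    # mirroring the biology-lab example above.
    images = {os.path.splitext(f)[0]: f for f in files
              if f.lower().endswith((".png", ".tif", ".jpg"))}
    sidecars = {os.path.splitext(f)[0]: f for f in files if f.endswith(".json")}
    for base, img in images.items():
        if base in sidecars:
            with open(sidecars[base]) as fh:
                meta = json.load(fh)
            report["pairs"].append({"image": img, "label": meta.get("label")})
        else:
            report["orphans"].append(img)  # images with no matching labels
    return report
```

A coding step that consumes such a report knows exactly which images have labels before it writes a single loader.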

🍞 Top Bread (Hook): Think of a toolbox where each tool has a specific job, and you can swap tools in and out depending on what you’re building. 🥬 Filling (The Actual Concept): Modular Python package

  • What it is: SciDER is delivered as a set of interchangeable Python modules and a simple web app.
  • How it works:
    1. Each agent (idea, data analysis, coding, execution, critic) is its own module.
    2. You can run the full workflow or just one part.
    3. You plug in your favorite AI models and change settings as needed.
  • Why it matters: Without modularity, you are stuck with a one-size-fits-all tool that can't adapt to different projects or preferences. 🍞 Bottom Bread (Anchor): A researcher who only wants to analyze data now and code later can run the Data Analysis module today, save the report, and come back to the Coding module next week.
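Here is a tiny sketch of the modular pattern, assuming nothing about SciDER's real API: stages are registered by name so you can run the full pipeline or any single stage on its own, exactly like the run-analysis-today, code-next-week workflow above. All names (`STAGES`, `stage`, `run`) are illustrative.

```python
# Each pipeline stage is a callable registered under a name, so the
# orchestrator can run all stages or any subset in order.
STAGES = {}

def stage(name):
    def register(fn):
        STAGES[name] = fn
        return fn
    return register

@stage("data_analysis")
def analyze(ctx):
    # Stand-in for the Data Analysis agent: produce a report from raw data.
    ctx["report"] = f"profiled {ctx['data_path']}"
    return ctx

@stage("coding")
def code(ctx):
    # Stand-in for the coding phase: build on whatever report exists.
    ctx["script"] = f"pipeline built from: {ctx.get('report', 'no report')}"
    return ctx

def run(ctx, stages=("data_analysis", "coding")):
    for name in stages:
        ctx = STAGES[name](ctx)
    return ctx

# Run only Data Analysis today; resume with Coding later from the saved context.
ctx = run({"data_path": "lab_scans/"}, stages=("data_analysis",))
ctx = run(ctx, stages=("coding",))
```

The saved `ctx` dict plays the role of the persisted report that lets work resume mid-pipeline.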

The gap SciDER fills is this: it introduces a data-first pipeline that can open diverse raw scientific data, produces a clear report that guides code generation, and uses a critic agent to refine each step. It also adds self-evolving memory so it learns from experience and becomes more capable over time.

Why should anyone care? Because many discoveries depend on squeezing truth from messy data. If an AI system can quickly parse your unusual files, propose smart hypotheses grounded in what’s really there, write runnable code, and learn from each attempt, then you can test more ideas faster. That means better medical analyses, stronger materials, cleaner energy solutions, and even smarter space missions. Daily life benefits when science speeds up and becomes more reliable.

In short, SciDER is like a careful, curious junior scientist that looks at the data first, plans experiments that truly fit, writes working code, and asks a mentor (the critic) for feedback—then remembers what it learned for next time.

02 Core Idea

The aha moment in one sentence: If you make the AI start from the raw data and give it a memory that learns from each attempt, it can reliably turn fuzzy ideas into working experiments across many scientific fields.

Let’s explain this idea three different ways:

  1. Detective analogy: A good detective doesn’t guess first and check clues later; they inspect the crime scene, gather evidence, form hypotheses that match the facts, and refine their theory. SciDER is that kind of detective for science data.
  2. Chef analogy: A chef checks the pantry and produce before designing the menu. SciDER checks your data pantry, then cooks up experiments that use what’s truly available.
  3. Sports coach analogy: A coach studies game footage (data), designs plays (experiments) tailored to the team’s strengths, runs drills (execution), reviews mistakes (critic), and builds a playbook (memory) for future games.

Before vs After:

  • Before: Many agents created shiny plans that fell apart when they hit messy, custom data. They treated all datasets like identical puzzles and often wrote code that didn’t run or didn’t match the real structure.
  • After: With SciDER, the data analysis comes first and shapes everything that follows. The coding agent writes scripts that match the discovered formats and dependencies. The critic agent spots errors early, and the memory keeps what worked so future attempts improve.

Why it works (the intuition):

  • Decisions are better when they’re tied to the truth on disk. By grounding ideation and code in the data report, the system avoids wishful thinking.
  • Iteration with feedback fixes mistakes fast. The critic agent acts like a peer reviewer who catches gaps and suggests concrete fixes.
  • Remembering past reasoning speeds up future success. SciDER’s memory pulls in the most relevant know-how at just the right moment.

🍞 Top Bread (Hook): You know how your brain brings back the right memory when you face a similar problem, like remembering how you solved last week’s math puzzle? 🥬 Filling (The Actual Concept): Retrieval-augmented generation (RAG)

  • What it is: A way for AI to fetch the most relevant notes, examples, or references from a library before it writes or decides.
  • How it works:
    1. Turn past notes and guides into searchable chunks.
    2. When a new task arrives, search for the most related chunks.
    3. Feed those chunks into the AI’s context so it can use them while thinking.
  • Why it matters: Without RAG, the AI relies only on what it can recall in the moment and may miss crucial, specific knowledge. 🍞 Bottom Bread (Anchor): When handling a new physics dataset with similar units as a past project, RAG retrieves a note explaining how to convert the units correctly, helping the AI avoid mistakes.
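The three RAG steps above can be sketched with plain bag-of-words retrieval. This is a deliberately simple stand-in: production RAG systems use dense embeddings and vector indexes, and the note texts below are invented examples, not SciDER's memory contents.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, notes, k=2):
    """Step 2 of RAG: rank stored chunks by similarity to the query
    and return the top k to place into the model's context."""
    q = Counter(tokenize(query))
    ranked = sorted(notes, key=lambda n: cosine(q, Counter(tokenize(n))),
                    reverse=True)
    return ranked[:k]

notes = [
    "Convert flux units from Jy to erg/s before fitting light curves.",
    "Use stratified splits when classes are imbalanced.",
    "Forward-fill short gaps in heart-rate series before modeling.",
]
context = retrieve("how do I handle units when fitting light curves?", notes, k=1)
```

Feeding `context` into the prompt is step 3: the model now reasons with the retrieved note in front of it.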

🍞 Top Bread (Hook): Imagine keeping two notebooks: a small sticky note pad for what you need right now, and a big binder with lessons that will help across the whole semester. 🥬 Filling (The Actual Concept): Self-evolving memory mechanism

  • What it is: SciDER’s memory that grows over time, split into short-term and long-term, with project- and task-specific parts.
  • How it works:
    1. While working, the system summarizes useful reasoning into small chunks.
    2. It saves general tips into task memory and project-specific insights into project memory.
    3. Next time, it retrieves the most relevant chunks with a search and places them into context.
    4. After finishing, it updates memories with new lessons learned.
  • Why it matters: Without this memory, the system would repeat the same mistakes and never improve, like doing every lab as if it were the first time. 🍞 Bottom Bread (Anchor): If a dataset needed a special way to parse timestamps, SciDER saves that parsing trick. Next project with similar logs? It reuses the trick immediately.
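A minimal sketch of the two-scope memory described above, with invented class and method names (`EvolvingMemory`, `save`, `recall`): general lessons go to a task scope, project-specific tips to a project scope, and recall ranks by tag overlap as a crude stand-in for the semantic search a real system would use.

```python
from collections import defaultdict

class EvolvingMemory:
    """Toy two-scope memory: 'task' holds general lessons that transfer
    across projects, 'project' holds project-specific tips."""
    def __init__(self):
        self.store = defaultdict(list)  # scope -> list of (tag set, lesson)

    def save(self, scope, tags, lesson):
        self.store[scope].append((set(tags), lesson))

    def recall(self, tags, scopes=("task", "project")):
        """Return lessons whose tags overlap the current task's tags,
        most-overlapping first."""
        tags = set(tags)
        hits = [(len(t & tags), lesson)
                for scope in scopes for t, lesson in self.store[scope]
                if t & tags]
        return [lesson for _, lesson in sorted(hits, reverse=True)]

mem = EvolvingMemory()
mem.save("task", {"csv", "timestamps"}, "Parse day-month-year dates explicitly.")
mem.save("project", {"microscopy"}, "Sidecar JSON holds the class labels.")
lessons = mem.recall({"csv", "timestamps", "logs"})
```

The update step after each run is just another `save`, which is how the memory "evolves" between projects.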

Building blocks of SciDER’s core idea:

  • Data analysis first: produce a trustworthy data report.
  • Idea generation tied to real data: hypothesize based on discovered structures and quality.
  • Code that matches the report: generate scripts that load, clean, and train exactly as the data demands.
  • Critic feedback: an internal reviewer to catch errors, incompleteness, or bias and push refinements.
  • Memory plus RAG: keep the best reasoning chunks and fetch them when needed.

Together, these pieces form a loop: inspect data, propose, code, run, review, remember, repeat—each turn of the loop makes the system more capable and more efficient.
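The loop itself can be written as a short orchestration sketch. All six callables below are hypothetical stand-ins for the agents described above; the point is only the control flow: review gates the loop, and memory is updated on every turn, pass or fail.

```python
def research_loop(inspect, propose, build, run, review, remember, max_iters=3):
    """Inspect data, propose, code, run, review, remember, repeat."""
    report = inspect()                  # Data Analysis agent
    plan = propose(report)              # Ideation agent
    result = None
    for _ in range(max_iters):
        code = build(plan, report)      # coding phase
        result = run(code)              # execution phase
        verdict = review(result)        # Critic agent
        remember(verdict)               # memory update, every turn
        if verdict["ok"]:
            return result
        plan = verdict["revised_plan"]  # critic's fix feeds the next turn
    return result
```

Each turn either ends the loop with an accepted result or feeds the critic's revision back into planning.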

03 Methodology

At a high level: Input (user query and raw data) → Ideation agent (hypotheses and plan) → Data Analysis agent (structured report) → Coding agent (experiment scripts) → Execution agent (run and monitor) → Critic agent (review and refine) → Memory (save lessons) → Output (results, reports, codebase).

Let’s walk through each step like a recipe, and explain why each step exists and what breaks without it.

  1. Inputs and setup
  • What happens: A researcher provides a research question and uploads raw data. Optional hints (like instrument info or file quirks) can be added. SciDER prepares a workspace and model settings, then starts the loop.
  • Why it exists: Without clear inputs, the system cannot tailor ideas or code to the real problem. Without a workspace, files and outputs get messy.
  • Example: A materials scientist uploads microscopy images with CSV metadata and asks to detect crystal defects.
  2. Ideation agent: turning questions into testable plans 🍞 Top Bread (Hook): Think of a brainstorming buddy who reads about your topic and comes back with a short list of smart experiments you could actually do tomorrow. 🥬 Filling (The Actual Concept): Ideation agent
  • What it is: An agent that searches literature, drafts hypotheses, outlines experiments, and checks novelty.
  • How it works:
    1. Extracts keywords from the question and context.
    2. Retrieves relevant papers and datasets.
    3. Proposes a hypothesis and an experiment outline grounded in the available data.
    4. Scores novelty and feasibility using an AI-as-judge and self-revises if weak.
  • Why it matters: Without targeted, data-aware ideas, you get vague plans that don’t match the files you actually have. 🍞 Bottom Bread (Anchor): For noisy telescope light curves, the ideation agent proposes testing whether dips in brightness with certain durations indicate exoplanet transits, with a plan to compare several classifiers.
  3. Data Analysis agent: making the data speak first 🍞 Top Bread (Hook): Imagine a friendly librarian who unpacks a box of mixed books, labels them, and writes a guide so you can find the right chapters fast. 🥬 Filling (The Actual Concept): Data Analysis agent
  • What it is: An agent that opens raw files, understands formats and schemas, profiles quality, and produces a structured report.
  • How it works:
    1. Walks the folder tree and inventories files.
    2. Uses readers for text, tables, images, and custom formats.
    3. Profiles structure (schemas, encodings), quality (missing values, outliers), semantics (what fields mean), and dependencies (how files link).
    4. Outputs a clear report that the coding agent will follow.
  • Why it matters: Without this report, code may guess wrong column names, misread encodings, or miss file links and crash. 🍞 Bottom Bread (Anchor): It notes that sensor_data.csv has a time column in UTC and a temperature column with some missing values, and that labels.json maps IDs to classes.
  4. Coding agent: writing experiments that actually run 🍞 Top Bread (Hook): Picture a careful carpenter who measures twice and cuts once—building to the exact blueprint. 🥬 Filling (The Actual Concept): Experimentation agent (coding phase)
  • What it is: An agent that generates runnable scripts tailored to the data report and experiment plan.
  • How it works:
    1. Reads the data analysis report and idea outline.
    2. Writes data loaders, preprocessing, model training, evaluation, and saving of results.
    3. If an error appears (like a type mismatch), it reflects, edits, and retries until the workspace is ready.
  • Why it matters: Without code that matches the data, even great ideas fail at import time or produce nonsense results. 🍞 Bottom Bread (Anchor): For images plus JSON labels, it writes a dataset class that pairs each image path with the correct label from the JSON, then trains a CNN with proper transforms.
  5. Execution agent: running and monitoring 🍞 Top Bread (Hook): Think of a kitchen timer and a safety monitor combined—start the oven, watch the bake, stop if smoke appears. 🥬 Filling (The Actual Concept): Experimentation agent (execution phase)
  • What it is: An agent that launches the experiment, monitors logs, and halts on trouble.
  • How it works:
    1. Starts the run and watches for progress.
    2. Detects problems like timeouts, frozen screens, failing tests, or exploding losses.
    3. Stops and reports feedback to the coding agent for fixes.
  • Why it matters: Without monitoring, you might waste hours on a stuck job or miss early warnings. 🍞 Bottom Bread (Anchor): If validation loss never improves after many steps, it stops early and suggests trying a different learning rate.
  6. Critic agent: internal peer review 🍞 Top Bread (Hook): Imagine a coach who watches the replay and says, here’s what went well, here’s what to change next time. 🥬 Filling (The Actual Concept): Critic agent
  • What it is: An agent that checks accuracy, completeness, bias, and gaps, then proposes concrete improvements.
  • How it works:
    1. Reviews the idea, data report, code, and results.
    2. Flags missing tests, weak evaluation, or suspicious assumptions.
    3. Suggests specific fixes and tighter baselines or ablations.
  • Why it matters: Without critique, errors slip through and results look better (or worse) than they really are. 🍞 Bottom Bread (Anchor): The critic notices class imbalance and recommends stratified splits and F1 scoring, then requests an ablation without the new feature to confirm its value.
  7. Memory: learn once, help forever 🍞 Top Bread (Hook): Picture a growing lab notebook where every trick, bug fix, and best practice is saved for future you. 🍞 Filling (The Actual Concept): Self-evolving memory mechanism
  • What it is: A memory bank split into short-term and long-term, with task- and project-specific chunks that are retrieved when needed.
  • How it works:
    1. Summarizes new reasoning into small chunks after each step.
    2. Saves general lessons to task memory and project-specific tips to project memory.
    3. Uses search to bring the most relevant chunks into context on the next task.
    4. Updates after each iteration.
  • Why it matters: Without memory, the system repeats avoidable mistakes and wastes time rediscovering fixes. 🍞 Bottom Bread (Anchor): After parsing a tricky custom TIFF header once, SciDER stores the parsing logic and reuses it next time a similar file appears.
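The execution agent's early-warning check from step 5 can be sketched as a patience-based monitor over the validation-loss stream. The function name, thresholds, and hint text below are illustrative, not SciDER's actual monitoring logic.

```python
def monitor(losses, patience=3, min_delta=1e-3):
    """Flag a stuck run: stop once the validation loss has failed to
    improve by at least min_delta for `patience` consecutive steps."""
    best = float("inf")
    stale = 0
    for step, loss in enumerate(losses):
        if loss < best - min_delta:
            best, stale = loss, 0      # real improvement: reset the counter
        else:
            stale += 1                  # another step with no progress
        if stale >= patience:
            return {"stop": True, "step": step,
                    "hint": "no improvement; try a different learning rate"}
    return {"stop": False, "step": len(losses) - 1, "hint": None}
```

In the full loop, a `stop` verdict would be reported back to the coding agent along with the hint, closing the fix-and-retry cycle.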

The secret sauce

  • Data-first grounding makes the plan realistic.
  • Critic-led feedback closes the loop quickly.
  • Memory plus retrieval turns experience into immediate skill.

Concrete mini example with actual data flow

  • Input: A CSV of patient vitals (weird date format), a JSON with diagnosis labels.
  • Data Analysis report: Notes date format like day-month-year, missing heart-rate entries, and that patient_id links CSV rows to JSON labels.
  • Coding: Writes a parser for the date, imputes heart rate by forward fill, merges labels by patient_id, trains a gradient boosting model, saves metrics.
  • Execution: Runs training, detects slow convergence, suggests a learning rate tweak.
  • Critic: Requests cross-validation and a confusion matrix; asks for a baseline logistic regression.
  • Memory: Keeps the date parser and the best preprocessing pipeline for future medical CSVs.
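The parsing and merging steps of the mini example above can be sketched with the standard library (the model-training and metrics steps are omitted). Column names and the `build_training_rows` helper are assumptions matching the example, not code from the paper.

```python
import csv
import io
from datetime import datetime

def build_training_rows(vitals_csv, labels):
    """Parse day-month-year dates, forward-fill missing heart rates,
    and join diagnosis labels by patient_id."""
    rows = []
    last_hr = None
    for rec in csv.DictReader(io.StringIO(vitals_csv)):
        ts = datetime.strptime(rec["date"], "%d-%m-%Y")  # day-month-year
        hr = rec["heart_rate"]
        last_hr = float(hr) if hr else last_hr           # forward fill gaps
        rows.append({"patient_id": rec["patient_id"], "ts": ts,
                     "heart_rate": last_hr,
                     "diagnosis": labels.get(rec["patient_id"])})
    return rows

vitals = ("patient_id,date,heart_rate\n"
          "p1,03-02-2026,72\n"
          "p1,04-02-2026,\n")   # second reading is missing
labels = {"p1": "healthy"}
rows = build_training_rows(vitals, labels)
```

The resulting rows are exactly what a downstream model (gradient boosting in the example) would train on.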

04 Experiments & Results

What did they test and why?

  • Idea generation quality: Can SciDER invent research ideas that are both original and doable, compared to top-tier literature? This matters because weak or copycat ideas waste time.
  • End-to-end ML performance: Can SciDER build competitive pipelines on real Kaggle-style tasks in one shot? This shows practical usefulness beyond toy demos.
  • Science coding skill across domains: Can it solve physics, chemistry, biology, and math coding tasks that require deep reasoning? This demonstrates generalization to tough, less-familiar problems.

Who was the competition?

  • Systems like AI Scientist, AI Scientist v2, AI Researcher, and others that are strong at ideation or pipelines but often less data-centric or without robust memory and critic loops.

Scoreboard with context

  • On idea generation (AI-Idea-Bench): SciDER achieved much higher novelty scores than strong baselines and also improved feasibility and alignment. Think of it like bringing a project proposal that is both fresher and more practical than what most other students turned in.
  • On MLE-Bench (a collection of real ML competitions): In a single try, SciDER earned more gold-level results than several competitive systems while keeping similar overall medal rates. That is like winning the top ribbon more often even when everyone only gets one shot.
  • On SciCode (difficult science coding tasks): SciDER solved a larger share of both main and sub-problems than several powerful general models. Imagine a multi-subject exam where SciDER scored higher not just in one class but across many, from physics to biology.

Why these results are meaningful

  • Strong novelty and feasibility together show that SciDER doesn’t just dream big; it dreams smart and buildable.
  • Better gold-level outcomes on MLE-Bench suggest that the data-first analysis and critic feedback help reach robust, high-scoring solutions without many retries.
  • Success on SciCode indicates that grounding code in data reports, plus iterative critique and memory, helps on unfamiliar, research-grade problems where generic coding assistants often stumble.

Surprising findings

  • The biggest jump appeared in novelty scoring, suggesting that data-grounded ideation may actually unlock more creative directions (perhaps because it connects real data affordances to fresh hypothesis shapes).
  • The critic-led loop and execution monitoring seemed to prevent wasted compute on bad runs, boosting efficiency. Even simple interventions like stopping early and tuning basic settings made a noticeable difference.
  • Disabling memory during formal benchmarking (to avoid leakage) still left SciDER clearly ahead, implying that the core data-first and critic design are strong foundations even without long-term learning.

A simple story to tie it together

  • Imagine three classmates tackling a tough project. One writes flashy ideas without checking the materials. Another codes fast but ignores the blueprint. SciDER is the classmate who inspects the parts first, suggests a clever plan that fits, builds carefully, tests while watching for smoke, asks for feedback, and writes down what worked. No wonder the final grade is higher.

05 Discussion & Limitations

Limitations

  • No built-in paper-writing yet: SciDER focuses on the hard middle of research—data analysis and experiments—so you still need a separate step or tool to turn results into a polished manuscript.
  • Dependence on external AI APIs: Costs, rate limits, and privacy policies can affect usage, especially on sensitive datasets.
  • Requires readable data access: If data is locked behind proprietary formats with no reader, SciDER may need custom adapters.
  • Quality of tools and models matters: Using weaker base models or turning off monitoring may reduce performance.

Required resources

  • A Python environment with the SciDER package and access to chosen AI models.
  • Compute suitable for your experiments (from CPU for light tasks up to GPUs for deep learning).
  • Storage and permissions to handle raw data, intermediate artifacts, and logs.

When not to use

  • Highly sensitive or regulated data when you cannot guarantee on-prem or privacy-preserving model use.
  • Situations where results must be deterministic and fully auditable without any AI-generated steps.
  • Extremely novel file types with no parsers and no possibility to write one in time.

Open questions

  • How to best compress and generalize the memory so it helps across projects without mixing contexts?
  • Can we add trustworthy uncertainty estimates so users know when to doubt a result?
  • What are the strongest safety checks to prevent subtle data leakage, bias, or overfitting in autonomous loops?
  • How can we integrate automated paper drafting and figure generation without sacrificing factual rigor?
  • What human-in-the-loop checkpoints offer the biggest safety boost for the least extra effort?

06 Conclusion & Future Work

Three-sentence summary

  • SciDER is a data-centric, end-to-end research system that starts from raw data, proposes grounded ideas, writes and runs code, and improves through a critic and self-evolving memory.
  • By anchoring every step in what the data truly looks like, it avoids fragile, one-size-fits-all plans and delivers runnable, competitive experiments across diverse scientific domains.
  • Evaluations show stronger originality, feasibility, and coding success than several leading baselines, and a modular package plus web UI make it practical to adopt.

Main achievement

  • Turning data analysis into the foundation of ideation and coding—then closing the loop with criticism and memory—so the agent reliably moves from abstract ideas to working experiments.

Future directions

  • Add automated paper writing and figure generation with strict verification checks.
  • Expand privacy-preserving and on-prem model options for sensitive data.
  • Grow the library of custom readers for tricky scientific formats and improve memory generalization.

Why remember this

  • SciDER demonstrates that looking at the data first, checking yourself with a critic, and remembering what works are the keys to trustworthy, scalable AI-driven science. It is a recipe that other research agents can follow to turn more bright ideas into real, testable discoveries.

Practical Applications

  • Automate data profiling for new lab datasets to catch issues before coding begins.
  • Generate data-aware experiment plans that respect file formats, schemas, and dependencies.
  • Produce runnable machine learning pipelines for competitions or internal projects in a single pass.
  • Monitor long-running experiments and auto-stop on early warning signs to save compute.
  • Add ablation studies and baseline checks via the critic to strengthen result credibility.
  • Build reusable readers for custom scientific formats and store them in memory for future reuse.
  • Run only the parts you need (e.g., just Data Analysis) to integrate with existing workflows.
  • Use RAG to inject prior project notes or domain tips into new tasks for faster ramp-up.
  • Support multidisciplinary teams (physics, chemistry, biology) with one consistent pipeline.
  • Teach junior team members with transparent, step-by-step reports and code they can audit.
Tags: data-centric AI, AI research agent, self-evolving memory, retrieval-augmented generation, autonomous experimentation, critic feedback loop, scientific data parsing, modular Python package, end-to-end research system, SciCode benchmark, MLE-Bench, AI-Idea-Bench, agentic workflows, research automation, domain-specific coding