
DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval

Intermediate
Maojun Sun, Yue Wu, Yifei Xie et al. · 3/5/2026
arXiv

Key Summary

  • DARE is a new way for AI assistants to find the right R functions by also looking at what the data looks like, not just the words in the question.
  • The team built RPKB, a clean library of 8,191 R package functions with extra tags that describe the data each function expects.
  • DARE uses a tiny dual-encoder (about 23M parameters) that mixes the user’s request with a short 'data profile' to get better matches.
  • On a retrieval test, DARE reached 93.47% NDCG@10, which is like jumping from a solid B to an A+ compared to popular big embedding models.
  • DARE is fast (about 3.7 ms per query and 8,512 queries per second), so it fits real-time AI agent workflows.
  • RCodingAgent is an R-focused AI agent that, when plugged into DARE, solves tough statistical coding tasks much more reliably.
  • Across 16 realistic R analysis challenges, using DARE raised success rates by up to 56.25%, especially helping smaller and mid-size LLMs.
  • The secret sauce: conditioning retrieval on data distributions (like 'Poisson counts' or 'high-dimensional genomic data') to avoid picking the wrong tool.
  • This work helps AI agents stop guessing R functions and start choosing statistically compatible ones.
  • It narrows the gap between modern AI automation and decades of trusted methods in the R ecosystem.

Why This Research Matters

When AI agents pick statistically wrong tools, people can make bad decisions—about health, finance, or science—without realizing it. DARE helps agents choose R functions that fit the data’s true shape, not just the question’s wording, which greatly reduces silent errors. Because DARE is tiny and fast, it works in real-time, fitting naturally into interactive analysis. It lifts the performance of both small and frontier LLMs, making advanced statistics accessible and reliable. It also unlocks the deep expertise stored in R’s ecosystem, honoring decades of statistical research. By focusing retrieval on data distributions, DARE sets a template that can extend to other languages and domains. In short, it turns open-book AI into right-book, right-page AI.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): You know how choosing the right tool depends on the job and the material? You wouldn’t use a butter knife to cut wood; you’d pick a saw because wood is hard and thick.

🥬 Filling (The Actual Story):

  • What the world looked like before: AI agents (LLMs) learned lots of coding tricks, mostly in Python. They could clean data, make plots, and build models pretty well. But the R world—the home of many careful, time-tested statistical methods—was like a huge toolbox hidden behind a foggy window.
  • The problem: When people asked AI agents to do statistical analysis in R, the agents often picked the wrong function, made up function names, or used bad settings. Why? Because statistical methods don’t just depend on words in the question. They also depend on the shape of the data: Is it counts like 0, 1, 2 (Poisson)? Is it super wide with thousands of columns (high-dimensional)? Are values missing? Is it time-series, genomic, or text? Traditional retrieval engines matched text-to-text but ignored data distributions.

Concept: Large Language Model (LLM) agents 🍞 Hook: Imagine a clever helper robot that can read, write, and run code to solve your homework. 🥬 Concept: LLM agents are AI systems that plan, write code, run tools, and explain results in natural language.

  • How it works: (1) read your request; (2) plan steps; (3) choose tools; (4) write and run code; (5) check results; (6) fix mistakes.
  • Why it matters: Without the right tools, even a smart robot can get stuck or make wrong choices. 🍞 Anchor: Asking 'compare two treatments' should lead the agent to a correct statistical test, not just any function that sounds similar.

Concept: The R ecosystem and CRAN 🍞 Hook: Think of a giant, organized toolbox full of specialized, professional instruments. 🥬 Concept: R and CRAN host thousands of carefully reviewed statistical packages built by experts.

  • How it works: Each package offers functions with specific assumptions (like 'data must be counts' or 'assumes normal errors').
  • Why it matters: Using a function with the wrong assumptions can give misleading results. 🍞 Anchor: If your data are counts from sequencing, a Poisson-friendly method beats a generic average.

Concept: Retrieval-Augmented Generation (RAG) 🍞 Hook: Imagine open-book tests—you do better when you can look up the right page. 🥬 Concept: RAG lets AI look up documents or function descriptions while solving a task.

  • How it works: (1) read the question; (2) search a library; (3) bring the best pages into context; (4) write the answer using them.
  • Why it matters: Without looking things up, the AI may guess or hallucinate. 🍞 Anchor: If a query is about 'survival analysis with truncation,' RAG should fetch 'cut data by event' from the right package.

Concept: Data distribution characteristics 🍞 Hook: You bake differently for cupcakes vs. sourdough bread; ingredients and baking times change. 🥬 Concept: Statistical methods depend on data traits like distribution (normal, Poisson), dimensionality (low/high), and modality (tabular, genomic).

  • How it works: You tag data with a 'profile' that states these traits. Then retrieval uses both question and data profile.
  • Why it matters: Ignoring these traits can match you to a tool that looks right in words but is wrong in math. 🍞 Anchor: 'High-dimensional genomic counts' should not retrieve a simple linear regression designed for small, normal data.

What people tried and why it failed: Many retrieval models ranked R functions by textual similarity only: if the words matched, it looked like a win. But two functions can sound similar and still be statistically incompatible. For example, two regressions—one for normal noise and one for count data—share words like 'fit' and 'model,' but the math is different.

The gap: We needed a search method that looks at what the data look like (distribution, size, modality) while also matching the words in the question.

The real stakes for daily life: If a public health analyst picks the wrong test, their conclusion about a medicine might be off. If a finance analyst uses a tool that assumes normality on heavy-tailed returns, risk may be mismeasured. If a bioinformatician analyzes count data with the wrong model, key genes could be missed.

Concept: RPKB (R Package Knowledge Base) 🍞 Hook: Think of a library where every book has a label card telling you exactly which readers it’s for. 🥬 Concept: RPKB is a cleaned collection of 8,191 R functions with both text descriptions and structured 'data profiles' (like 'Poisson counts,' 'high-dimensional').

  • How it works: Crawl CRAN docs, filter out generic utilities, and add structured profiles describing data assumptions.
  • Why it matters: The profiles let search engines reason about statistical fit, not just word match. 🍞 Anchor: A function for 'ATAC-STARR genomic counts, length filters 150–600 bp' can be found precisely when the query says so.

Bottom Bread (Anchor): With DARE and RPKB, an agent that gets 'I have high-dimensional genomic counts; estimate regulatory scores' retrieves the exact R function 'sharpr2' and uses it correctly—no guesswork.

02Core Idea

🍞 Top Bread (Hook): Imagine shopping for shoes. If you search only by color ('blue shoes'), you might grab ones that look right but don’t fit. Add your size ('blue shoes, size 7'), and your matches become truly wearable.

🥬 Filling (The Main Innovation):

  • The 'aha!' in one sentence: DARE improves function retrieval by mixing the request text with a compact 'data profile' so results are both semantically relevant and statistically compatible.

Multiple analogies:

  1. Cooking: Don’t just search 'spicy recipe'; include 'vegetarian and gluten-free' so you get a recipe you can actually eat.
  2. Library: Don’t just ask for 'books on planets'; add 'for 5th graders' so the reading level fits.
  3. Sports: Don’t just pick 'shoes for running'; add 'trail, wet conditions' so you get grippy soles.

Before vs. After:

  • Before: Retrieval focused on words. Two near-identical-sounding methods (e.g., for normal vs. count data) could be mixed up.
  • After: Retrieval also sees the data’s shape. It distinguishes look-alike functions and ranks the truly compatible one at the top.

Why it works (intuition, not equations):

  • Adding a data profile is like telling the search engine the 'rules of the game' your data follows. This shrinks the candidate pool to methods that make sense under those rules. DARE’s training teaches the model to put semantically-similar but distribution-mismatched functions farther apart, and to pull matched pairs closer.

Building Blocks (each with a sandwich explanation):

Concept: Data Profile (query-side and function-side) 🍞 Hook: Like putting your shoe size on a sticky note when asking a store clerk. 🥬 Concept: A short, structured summary of key data traits (e.g., 'tabular, numerical, Poisson, high-dimensional').

  • How it works: The profile is concatenated with the text (question or function docs) before encoding.
  • Why it matters: It injects the most crucial, non-textual clues about compatibility. 🍞 Anchor: Query profile 'genomic counts, high-dimensional' pairs with function profile 'ATAC-STARR counts; 150–600 bp fragments'.
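In code, building the query-side joint input might look like this minimal sketch. The `[PROFILE]` separator and the `key=value` serialization are illustrative assumptions, not the paper's actual format:

```python
def build_joint_input(text, profile):
    """Concatenate free text with its structured data profile before
    encoding. The '[PROFILE]' separator and key=value serialization
    are illustrative choices, not the paper's exact format."""
    profile_str = "; ".join(f"{k}={v}" for k, v in profile.items())
    return f"{text} [PROFILE] {profile_str}"

query_input = build_joint_input(
    "I have high-dimensional genomic counts; estimate regulatory scores",
    {"modality": "genomic", "distribution": "Poisson counts",
     "dimensionality": "high"},
)
print(query_input)
```

The same helper would run on the function side, pairing each function's docs with its stored profile before encoding.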

Concept: Bi-encoder with shared weights 🍞 Hook: Imagine twins trained to understand different notes but speak the same secret language. 🥬 Concept: Two encoders turn the query (with its profile) and each function (with its profile) into vectors in the same space.

  • How it works: Same neural network encodes both sides, so comparable meanings land near each other.
  • Why it matters: Makes fast, scalable nearest-neighbor search over thousands of functions possible. 🍞 Anchor: The vector for 'high-dimensional genomic counts' lands closer to the right genomic scoring function than to a generic regression.

Concept: Cosine similarity for scoring 🍞 Hook: Think of the angle between two arrows; smaller angle means they point in the same direction. 🥬 Concept: Cosine similarity measures how aligned the query and function vectors are.

  • How it works: Higher cosine = better match; retrieval ranks by this score.
  • Why it matters: It gives a smooth, direction-based signal that works well with embeddings. 🍞 Anchor: The right function’s vector points almost the same way as the query vector, so it ranks #1.

Math (with simple examples):

  • Cosine similarity: $s(e_q, e_f) = \frac{e_q^\top e_f}{\lVert e_q \rVert\,\lVert e_f \rVert}$. Example: if $e_q = (1, 2)$ and $e_f = (2, 1)$, then $e_q^\top e_f = 1\times2 + 2\times1 = 4$, $\lVert e_q \rVert = \lVert e_f \rVert = \sqrt{5} \approx 2.236$, so $s = 4/5 = 0.8$.
  • Best function selection: $\hat f(q, c_q) = \arg\max_{f \in \mathcal{F}} s\big((q, c_q), f\big)$. Example: if three functions score 0.8, 0.6, and 0.2, we pick the first (0.8).
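The two formulas above can be checked in a few lines of Python (a toy illustration, not the paper's implementation):

```python
import math

def cosine_sim(eq, ef):
    """s(e_q, e_f) = (e_q . e_f) / (||e_q|| * ||e_f||)."""
    dot = sum(a * b for a, b in zip(eq, ef))
    norm_q = math.sqrt(sum(a * a for a in eq))
    norm_f = math.sqrt(sum(b * b for b in ef))
    return dot / (norm_q * norm_f)

def best_function(query_vec, function_vecs):
    """Arg max over candidate functions by cosine score."""
    scores = [cosine_sim(query_vec, f) for f in function_vecs]
    return max(range(len(scores)), key=scores.__getitem__)

# Worked example from the text: (1,2) vs (2,1) gives 4/5 = 0.8
print(round(cosine_sim((1, 2), (2, 1)), 3))  # 0.8
```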

Concept: Contrastive training (InfoNCE) 🍞 Hook: Learn by telling apart look-alikes: like sorting red apples from red tomatoes. 🥬 Concept: Pull matching query–function pairs closer and push mismatched ones apart.

  • How it works: Use in-batch negatives so every other function in the batch becomes a 'contrast' to the true one.
  • Why it matters: Teaches the model fine-grained distinctions, especially when semantics are similar but distributions differ. 🍞 Anchor: 'glm' for Gaussian vs. 'glm.nb' for counts—contrastive learning helps rank the correct one higher given Poisson-like data.

Math (with simple example):

  • InfoNCE loss: $L_i = -\log \frac{\exp(\cos(e_{q_i}, e_{f_i})/\tau)}{\sum_{j=1}^{N} \exp(\cos(e_{q_i}, e_{f_j})/\tau)}$. Example: suppose a batch of $N = 3$ has cosine scores to (pos, neg1, neg2) of $(0.9, 0.3, 0.1)$ and $\tau = 1$. Numerator $= e^{0.9} \approx 2.46$; denominator $= e^{0.9} + e^{0.3} + e^{0.1} \approx 2.46 + 1.35 + 1.11 = 4.92$; the fraction is $\approx 2.46/4.92 = 0.5$, so $L_i = -\log(0.5) \approx 0.693$.
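A minimal sketch of this loss for a single query, reproducing the worked example (illustrative only; real training would batch this and backpropagate):

```python
import math

def info_nce(cos_scores, tau=1.0):
    """InfoNCE for one query: cos_scores[0] is the positive pair,
    the rest are in-batch negatives."""
    exps = [math.exp(s / tau) for s in cos_scores]
    return -math.log(exps[0] / sum(exps))

# Worked example: scores (0.9, 0.3, 0.1), tau = 1 -> about 0.69
print(round(info_nce([0.9, 0.3, 0.1]), 2))  # 0.69
```

Note that the loss drops as the positive pulls ahead of the negatives, which is exactly the training pressure described above.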

Concept: Efficient nearest-neighbor search 🍞 Hook: Like pre-sorting a phone book so you can look up names fast. 🥬 Concept: Precompute function vectors, then use fast maximum inner-product search.

  • How it works: Query once, scan many efficiently.
  • Why it matters: Agents need low latency while retrieving from thousands of functions repeatedly. 🍞 Anchor: With DARE, retrieval adds only a few milliseconds to an agent’s loop, keeping it snappy.
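The precompute-then-scan idea can be sketched as a toy in-memory index (real systems would use an optimized ANN library; this just shows why normalizing once makes cosine search a plain inner product):

```python
import math

def _unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

class FunctionIndex:
    """Toy index: normalize every function embedding once up front, so
    cosine similarity becomes a plain inner product at query time."""

    def __init__(self, entries):
        # entries: list of (name, embedding) pairs
        self.names = [name for name, _ in entries]
        self.vecs = [_unit(vec) for _, vec in entries]

    def top_k(self, query_vec, k=3):
        q = _unit(query_vec)
        scored = sorted(
            ((sum(a * b for a, b in zip(q, v)), name)
             for name, v in zip(self.names, self.vecs)),
            reverse=True)
        return [name for _, name in scored[:k]]
```

With precomputed unit vectors, each query costs one pass of dot products, which is why retrieval stays in the millisecond range.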

03Methodology

At a high level: Input (user question + data sample) → Build a query-side data profile → Encode query and profiles → Retrieve top R functions by cosine similarity → Feed retrieved docs to the agent → Generate and run R code → Validate outputs.

Step-by-step (with 'why' and examples):

  1. Build the knowledge base (RPKB) 🍞 Hook: Imagine tidying a garage into neat bins with clear labels. 🥬 Concept: RPKB is made by crawling CRAN docs, extracting function-level pieces (Description, Usage, Arguments, Value), filtering out generic utilities, and adding structured data profiles.
  • How it works:
    • Extraction: Parse HTML/PDF to get metadata and per-function docs.
    • Chunk & filter: Keep analytical/statistical methods; drop basic I/O or vague helpers.
    • Data profile generation: Use an LLM to infer modality (tabular/genomic/etc.), distribution (normal/Poisson/etc.), dimensionality (low/high), missing-data rules, and any specific constraints (e.g., 'columns: start, end, PLASMID, RNA').
    • Storage: Index in a vector DB for retrieval.
  • Why it matters: Clean, profile-rich entries let retrieval consider statistical fit, not just matching words. 🍞 Anchor: The 'sharpr2' entry includes precise constraints for ATAC-STARR counts and fragment lengths 150–600 bp, so it’s findable for that exact need.
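An RPKB entry might be represented like this sketch. The field names and example values are a hypothetical schema, loosely based on the 'sharpr2' description above, not RPKB's actual format:

```python
from dataclasses import dataclass, field

@dataclass
class RPKBEntry:
    """One function-level entry: free-text docs plus a structured
    data profile. Field names are an illustrative guess at the kind
    of schema RPKB uses, not its actual format."""
    package: str
    function: str
    description: str
    modality: str               # e.g. "tabular", "genomic", "text"
    distribution: str           # e.g. "Gaussian", "Poisson"
    dimensionality: str         # "low" or "high"
    constraints: list = field(default_factory=list)

sharpr2 = RPKBEntry(
    package="sharpr2",
    function="sharpr2",
    description="Estimate regulatory scores from ATAC-STARR counts",
    modality="genomic",
    distribution="Poisson",
    dimensionality="high",
    constraints=["fragment lengths 150-600 bp",
                 "columns: start, end, PLASMID, RNA"],
)
```

The structured fields are what retrieval conditions on; the free text is what gets matched semantically.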
  2. Formulate the retrieval problem 🍞 Hook: Think of matching a question card to the best answer card in a giant deck. 🥬 Concept: Each function is a pair (text docs, function-side data profile). Each user query also gets a query-side data profile.
  • How it works: Build a joint representation so the model can compute a similarity score between the query and each function.
  • Why it matters: This lets the system prefer statistically compatible functions. 🍞 Anchor: The query 'high-dimensional genomic counts' pairs with the function 'genomic counts; length filters; Poisson-like'.

Math for selection:

  • Retrieval target: $\hat f(q, c_q) = \arg\max_{f \in \mathcal{F}} s\big((q, c_q), f\big)$. Example: if three candidate scores are 0.92, 0.75, and 0.40, we choose the first (0.92).
  3. Encode queries and functions (DARE bi-encoder) 🍞 Hook: Like translating both the question and each tool description into the same map so distances are comparable. 🥬 Concept: A shared encoder $\varepsilon(\cdot)$ (initialized from all-MiniLM-L6-v2) embeds text-plus-profile into vectors.
  • How it works: Concatenate text and its profile, then encode to a vector. Do this for the query ($e_q$) and for every function ($e_f$).
  • Why it matters: Shared space enables meaningful comparisons with cosine similarity. 🍞 Anchor: The query vector for 'DNA/RNA counts; high-dim; fragment filters' lands near the 'sharpr2' function vector.

Math for cosine similarity:

  • $s(e_q, e_f) = \frac{e_q^\top e_f}{\lVert e_q \rVert\,\lVert e_f \rVert}$. Example: if $e_q = (3, 4)$ and $e_f = (6, 8)$, then $e_q^\top e_f = 3\times6 + 4\times8 = 50$, $\lVert e_q \rVert = 5$, $\lVert e_f \rVert = 10$, so $s = 50/(5\times10) = 1.0$ (perfect alignment).
  4. Train with contrastive learning (InfoNCE) 🍞 Hook: Practice separating near-twins so you can always tell which one is yours. 🥬 Concept: For each query, its true function is the 'positive'; other batch functions are 'negatives'. The loss boosts the positive score and lowers the negative scores.
  • How it works: Compute the InfoNCE for each query and average over the batch.
  • Why it matters: It sharpens the model’s ability to avoid semantically-similar but distribution-mismatched tools. 🍞 Anchor: 'glm.nb' should win when the profile says 'count-like'; 'glm' should win when the profile says 'Gaussian-like'.

Math for InfoNCE:

  • $L_i = -\log \frac{\exp(\cos(e_{q_i}, e_{f_i})/\tau)}{\sum_{j=1}^{N} \exp(\cos(e_{q_i}, e_{f_j})/\tau)}$. Example: with $N = 4$, cosines $(0.8, 0.1, 0.0, -0.2)$ for (pos, neg1, neg2, neg3), and $\tau = 1$: numerator $= e^{0.8} \approx 2.23$; denominator $= e^{0.8} + e^{0.1} + e^{0.0} + e^{-0.2} \approx 2.23 + 1.11 + 1.00 + 0.82 = 5.16$; the fraction is $\approx 2.23/5.16 = 0.432$, so $L_i \approx -\log(0.432) \approx 0.84$.
  5. Plug DARE into an R-focused agent (RCodingAgent) 🍞 Hook: Like giving your helper robot a special R compass that points to the right tool. 🥬 Concept: The agent first retrieves with DARE, then reads the retrieved docs (usage, arguments, examples) to write and run R code, check outputs, and refine if needed.
  • How it works: (1) DARE returns top functions plus structured metadata; (2) the agent copies correct syntax and parameters; (3) execute; (4) verify a ground-truth number.
  • Why it matters: Grounded docs cut hallucinations, and retrieval ensures the tool fits the data’s assumptions. 🍞 Anchor: For genomic counts with fragment filters, the agent uses 'sharpr2' exactly as the docs show and prints the first estimated score.
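The retrieve, generate, execute, verify loop can be sketched with stub components standing in for DARE, the LLM, and an R runtime (all names here are illustrative, not the paper's API):

```python
def run_agent(query, profile, retrieve, generate, execute_r, expected,
              max_rounds=3):
    """Retrieve -> generate -> execute -> verify loop, with refinement.
    `retrieve`, `generate`, and `execute_r` are stand-ins for DARE,
    the LLM, and an R runtime."""
    docs = retrieve(query, profile)              # top functions + docs
    feedback = ""
    code = ""
    for _ in range(max_rounds):
        code = generate(query, docs, feedback)   # write R code from docs
        output = execute_r(code)                 # run it
        if output == expected:                   # check ground-truth value
            return True, code
        feedback = f"got {output!r}, expected {expected!r}"  # refine
    return False, code
```

The key design point is that retrieval happens once up front, so every refinement round is grounded in the same correct documentation.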
  6. Inference efficiency 🍞 Hook: Speed matters—no one likes a laggy helper. 🥬 Concept: Because DARE is small (about 23M parameters) and uses a simple bi-encoder, it’s fast and scalable.
  • How it works: Precompute function embeddings; encode a query in milliseconds; search with optimized nearest-neighbor.
  • Why it matters: Agents often retrieve many times per task; slow retrieval would bottleneck the whole system. 🍞 Anchor: DARE adds only a few ms, keeping interactive analysis smooth.
  7. The secret sauce 🍞 Hook: The difference between 'pretty good' and 'spot on' is using the right rulebook. 🥬 Concept: Concatenating the data profile with text on both the query and function sides and training contrastively under those conditions.
  • Why it matters: It teaches the model to treat 'distribution and dimensionality' as first-class citizens in retrieval. 🍞 Anchor: This is why 'non-Gaussian, high-dimensional ICA' retrieves a specialized ICA fortify function rather than a generic PCA routine.

04Experiments & Results

What was tested and why:

  • The team measured retrieval quality (how well the search ranks the right R function) and end-to-end agent success (does the agent produce the correct numeric answer on real R scripts?). They also measured speed (latency and throughput) because agents must be responsive.

Concept: Evaluation metrics 🍞 Hook: Think of grading a test: you don’t just count right answers; you also care if the best answers appear at the top. 🥬 Concept: Recall@k, MRR@k, and NDCG@k score different parts of ranking quality; Latency and QPS score speed.

  • How it works:
    • Recall@k: Did the right answer show up in the first k spots?
    • MRR@k: How early did the first correct answer appear, on average?
    • NDCG@k: Rewards putting the correct item very early.
    • Latency/QPS: How fast per query and how many queries per second.
  • Why it matters: Early, accurate hits save tokens, reduce confusion, and speed up the agent. 🍞 Anchor: If the correct tool is ranked #1 most of the time, the agent is both cheaper and more reliable.

Math with simple examples:

  • Recall@k: $\text{Recall@}k = \frac{1}{|Q|}\sum_{q \in Q} \mathbf{1}\{\exists\, j \le k : \text{rel}(f^{(q)}_{(j)}) = 1\}$. Example: if in 10 queries the correct tool appears in the top 3 for 8 of them, Recall@3 $= 8/10 = 0.8$.
  • DCG@k: $\text{DCG@}k = \sum_{j=1}^{k} \frac{\text{rel}(f_{(j)})}{\log(j+1)}$. Example: if the correct item is at position 1 with rel $= 1$, DCG@3 $= 1/\log(2) \approx 1.44$; if at position 3, DCG@3 $= 1/\log(4) \approx 0.72$.
  • NDCG@k: $\text{NDCG@}k = \text{DCG@}k / \text{IDCG@}k$. Example: if IDCG@3 is 1.44 and your DCG@3 is 0.72, then NDCG@3 $= 0.72/1.44 = 0.5$.
  • MRR@k: $\text{MRR@}k = \frac{1}{|Q|}\sum_{q \in Q} 1/\text{rank}(q)$. Example: if the first relevant ranks across 3 queries are (1, 2, not in top-$k$), MRR $= (1 + 1/2 + 0)/3 = 0.5$.
  • Latency: $L = \frac{T_{\text{seq}}}{|Q|} \times 1000$ ms/query. Example: if $T_{\text{seq}} = 2$ seconds for $|Q| = 500$ queries, $L = \frac{2}{500} \times 1000 = 4$ ms/query.
  • QPS: $\text{QPS} = \frac{|Q|}{T_{\text{batch}}}$. Example: if 10,000 queries finish in 2 seconds, QPS $= 10{,}000/2 = 5{,}000$.
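These metrics are short enough to implement directly. The sketch below uses the natural log, matching the worked numbers above (a toy illustration, not an evaluation harness):

```python
import math

def recall_at_k(ranks, k):
    """ranks: 1-based rank of the first relevant hit per query
    (None when it never appears in the list)."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def dcg_at_k(rels, k):
    """rels: 0/1 relevance by position; natural log, as in the text."""
    return sum(rels[j] / math.log(j + 2) for j in range(min(k, len(rels))))

def ndcg_at_k(rels, k):
    idcg = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / idcg if idcg else 0.0

def mrr_at_k(ranks, k):
    return sum(1 / r for r in ranks if r is not None and r <= k) / len(ranks)

# Worked examples: correct item at position 1 vs. position 3
print(round(dcg_at_k([1, 0, 0], 3), 2))   # 1.44
print(round(ndcg_at_k([0, 0, 1], 3), 2))  # 0.5
```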

The competition:

  • DARE vs. strong open-source embedders (e.g., BGE-M3, E5-large-v2, Snowflake Arctic, MXBAI, GTE-large, MPNet, MiniLM base). Many have 10–25x more parameters than DARE.

The scoreboard (with context):

  • DARE achieved 93.47% NDCG@10, beating the best large baseline by about 17–18%. That’s like jumping from a class average of B- to an A+.
  • Recall@10 was 98.63% (almost everything relevant is in the top-10), and Recall@1 hit 87.39% (the right tool is number one most of the time).
  • MRR@10 reached 91.76%—high consistency across queries.
  • Speed: DARE ran at ~3.7 ms/query and ~8,512 QPS; larger baselines often took >10 ms and <3,000 QPS. For interactive agents, this is the difference between snappy and sluggish.

Downstream agent results:

  • Setup: 16 real R analysis tasks (e.g., hypothesis tests, survival analysis, mixed-effects models, genomic scoring). Agents had to output a specific ground-truth number (like a p-value or first score) to pass.
  • Without DARE: Even capable LLMs often stumbled in R, defaulting to Python habits, picking wrong tools, or hallucinating function names.
  • With DARE: Success rates rose dramatically—up to +56.25% absolute improvement. Some mid-size models more than tripled their success.

Surprising findings:

  • A tiny 23M-parameter retriever, trained with distribution-aware contrastive signals, outperformed huge general-purpose models by large margins.
  • The biggest wins appeared when tasks demanded precise statistical assumptions (e.g., counts vs. Gaussian, high-dimensionality constraints). This confirms the core hypothesis: distribution-aware conditioning is the missing piece.

05Discussion & Limitations

Limitations:

  • Coverage: DARE depends on the RPKB corpus. If a needed function isn’t in RPKB or its profile is incomplete, retrieval may underperform.
  • Profile quality: Query-side profiles must be inferred. If profiling is wrong (e.g., calling non-Gaussian data 'Gaussian'), retrieval can steer to mismatched tools.
  • Scope: The current system targets R. Python and other ecosystems would need their own curated knowledge bases and profiles.
  • Granularity: Some functions have complex, nested assumptions (e.g., censoring plus missingness plus non-Gaussian noise). A short profile may not capture every nuance.

Required resources:

  • A clean, indexed function library (like RPKB) with profiles.
  • A mid-range GPU for fine-tuning (training used an A100 but fine-tuning can be scaled down); CPU inference is feasible due to the small model size.
  • A vector database or ANN index for fast retrieval.
  • Optional: Light-weight LLM prompt engineering to infer query-side profiles automatically.

When not to use:

  • If tasks don’t rely on statistical assumptions (e.g., simple string parsing), plain semantic retrieval might suffice.
  • If the data profile is unknown and cannot be inferred, forcing distribution-aware matching may not help.
  • If you need cross-language retrieval (Python, Julia) without curated corpora, DARE’s R-trained version won’t generalize out-of-the-box.

Open questions:

  • How to auto-check and correct wrong profiles? A feedback loop from execution results back into profiling could further boost reliability.
  • Can we model hierarchical assumptions (e.g., 'Gaussian but with censoring and missingness') with structured, compositional profiles?
  • How does multi-hop tool composition (sequence of functions) change retrieval signals? A graph of function dependencies might help.
  • Could mixture-of-experts agents dynamically route to R, Python, or SQL specialists based on the detected data profile and task?
  • What’s the best way to generalize DARE beyond R—can we unify tool profiles across ecosystems so agents switch smoothly when needed?

06Conclusion & Future Work

Three-sentence summary:

  • DARE adds data-distribution awareness to retrieval, so AI agents pick R functions that not only sound right but also fit the data’s statistical rules.
  • Backed by RPKB and contrastive training, DARE delivers both accuracy (NDCG@10 = 93.47%) and speed (~3.7 ms/query), beating bigger general-purpose models.
  • Plugged into RCodingAgent, DARE turns many failing R tasks into successes, with up to 56.25% absolute gains on real statistical analyses.

Main achievement:

  • Showing that a small, plug-and-play, distribution-aware retriever can massively improve tool selection and downstream reliability in the R statistical ecosystem.

Future directions:

  • Broaden the corpus and keep profiles fresh; strengthen auto-profiling and error-correction; support multi-tool compositions; integrate into mixture-of-experts systems; extend the approach to other languages and domains.

Why remember this:

  • It’s the moment retrieval for AI agents learned to care about the data’s shape, not just the words—making statistical automation safer, smarter, and faster in the real world.

Practical Applications

  • Automated selection of appropriate statistical tests based on detected data distributions (e.g., Poisson vs. Gaussian).
  • Reliable R code generation for specialized domains like genomics, survival analysis, and mixed-effects modeling.
  • Fast, accurate tool lookup in interactive notebooks and data science IDEs.
  • Guardrails for LLM agents to prevent hallucinated function names and mismatched parameters.
  • Enterprise analytics assistants that follow domain best practices from curated R packages.
  • Education platforms that suggest statistically compatible R functions for student datasets.
  • Research pipelines that auto-profile datasets and retrieve matching methods for reproducible workflows.
  • Data platform copilots that standardize preprocessing steps (e.g., EHR dosing, age normalization) using vetted R tools.
  • Quality assurance bots that verify tool–data compatibility before running long experiments.
  • Cross-team knowledge hubs where function profiles document assumptions and constraints for safer reuse.
Tags: distribution-aware retrieval, RPKB, RCodingAgent, bi-encoder, InfoNCE, cosine similarity, CRAN, retrieval-augmented generation, statistical package retrieval, NDCG, maximum inner-product search, data profile, R LLM agents, contrastive learning, tool learning