
OpenAutoNLU: Open Source AutoML Library for NLU

Beginner
Grigory Arshinov, Aleksandr Boriskin, Sergey Senichev et al. Ā· 3/2/2026
arXiv

Key Summary

  • OpenAutoNLU is a simple, open-source tool that automatically builds text understanding models for you.
  • It looks at your dataset and smartly picks the best training style for your data size without you setting knobs.
  • It works for both text classification (sorting messages) and NER (finding names and places in text).
  • It checks your data quality first and flags confusing or mislabeled examples so you don’t train on bad data.
  • It can detect out-of-distribution (OOD) inputs, which means it knows when a message is too different from what it learned.
  • It uses light, fast models and can export to ONNX for super-quick deployment on many devices.
  • In tests on four popular datasets, it usually matches or beats other AutoML tools while being easier to use.
  • It can even ask an LLM to create extra training or test examples when you don’t have enough data.
  • Its key trick is data-aware regime selection: it chooses between few-shot learning and full fine-tuning automatically.
  • You can train and run a model in just a couple of lines of code, with OOD detection included if you want.

Why This Research Matters

OpenAutoNLU lowers the barrier to building reliable text models by auto-picking the right training strategy for your data size. That means small teams can ship strong chatbots, ticket routers, and content filters without wrestling with complex settings. Its built-in data checks reduce label noise, which is a major hidden cause of poor performance. Configurable OOD detection makes systems safer, since they can admit uncertainty instead of guessing. ONNX export and lightweight pipelines keep inference fast and cheap, fitting real-world latency and budget limits. LLM-based augmentation and test synthesis help when labeled data is scarce. All together, it turns messy, real-life NLP needs into a smooth, low-code pipeline from dataset to deployment.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: Imagine you’re packing a school lunchbox. Some days you have tons of ingredients; other days you have just a few. You still need a tasty, healthy lunch either way.

🄬 The Concept (AutoML): What it is: AutoML is a helper that chooses and trains machine learning models for you. How it works (step by step):

1) Look at your data, 2) pick a fitting model type and settings, 3) train and tune, 4) evaluate and export. Why it matters: Without AutoML, you must guess many settings yourself, which is slow and error-prone. šŸž Anchor: Like a smart lunch robot that sees what’s in your fridge and picks the fastest recipe that still tastes great.

šŸž Hook: You know how you can understand what someone means even if they use different words? Computers want to do that too.

🄬 The Concept (NLU): What it is: Natural Language Understanding (NLU) helps computers make sense of human text. How it works: 1) Read words, 2) turn them into numbers a computer understands, 3) find patterns, 4) predict labels or entities. Why it matters: Without NLU, apps can’t reliably sort emails, answer questions, or extract facts. šŸž Anchor: Your email app labeling a message as ā€œPromotionsā€ or ā€œImportantā€ is NLU at work.

šŸž Hook: Think about sorting your Lego by color or size.

🄬 The Concept (Text Classification): What it is: Text classification sorts a message into a category. How it works: 1) Read a message, 2) compare it to patterns learned from examples, 3) assign a label. Why it matters: Without it, chatbots can’t tell if you’re asking about refunds or delivery. šŸž Anchor: A support bot tagging ā€œI forgot my passwordā€ as a ā€œlogin problem.ā€

šŸž Hook: When you read a story, you notice names of people and places.

🄬 The Concept (NER): What it is: Named Entity Recognition finds special names like people, places, or companies in text. How it works: 1) Scan the sentence, 2) mark token-by-token which parts are entities, 3) output typed spans. Why it matters: Without NER, systems can’t pull useful nuggets like ā€œParisā€ or ā€œAliceā€ from text. šŸž Anchor: Highlighting ā€œHogwartsā€ as a place and ā€œHarry Potterā€ as a person in a sentence.

The World Before: Building reliable text classifiers or NER models was hard. People had to choose between many approaches: classic ML with embeddings, few-shot tricks, or full transformer fine-tuning. Every option worked best in different data sizes, and small mistakes in setup could waste time and compute. General AutoML tools focused mostly on spreadsheets (tabular data), and even NLP-focused ones often needed many settings. Meanwhile, data was messy: mislabeled texts, confusing examples, and unbalanced classes hurt results. And in the real world, models get weird, never-seen inputs (called out-of-distribution, or OOD). Most tools didn’t handle all these needs together.

šŸž Hook: You know how a teacher spots when a test question is badly written or confusing?

🄬 The Concept (Data Quality Diagnostics): What it is: A set of tests that flag mislabeled or low-signal examples before training. How it works: 1) Train a quick model, 2) see which examples it is uncertain about or disagrees with, 3) map easy vs. hard examples, 4) suggest fixes. Why it matters: Without this, bad data sneaks in and weakens your final model. šŸž Anchor: If 10 math problems have wrong answers written on them, your study will suffer; spotting them first helps you learn better.

šŸž Hook: Imagine a librarian who knows when a book doesn’t belong in a section.

🄬 The Concept (Out-of-Distribution Detection): What it is: A way to tell if a new message is too different from the training data. How it works: 1) Score how ā€œfamiliarā€ the input looks, 2) compare to a threshold, 3) if too strange, flag as OOD. Why it matters: Without OOD detection, the model confidently guesses on nonsense, causing bad user experiences. šŸž Anchor: A travel bot that says, ā€œI don’t understandā€ for a recipe request instead of giving a random flight price.

šŸž Hook: Think of a super talkative parrot that can help you come up with sentences.

🄬 The Concept (LLMs): What it is: Large Language Models are powerful text generators and readers. How it works: 1) Read a prompt, 2) predict likely next words many times, 3) produce helpful text. Why it matters: Without LLM help, small datasets stay small; with them, you can synthesize training or testing examples. šŸž Anchor: If you only have 3 examples of ā€œcancel order,ā€ an LLM can write 20 more realistic variations.

The Problem: People needed a simple, text-first AutoML tool that automatically chooses the right approach for their data size, checks data quality, supports both classification and NER, and can say ā€œI don’t knowā€ sensibly.

Failed Attempts: General AutoML (H2O, LightAutoML, AutoGluon) focused on tables or required extra setup to work well on text. NLP-specific tools (like AutoIntent) helped, but still needed manual preset choices, had limits for tiny classes, or lacked unified NER support and flexible OOD.

The Gap: A one-stop, low-code, NLP-native AutoML that: (1) auto-selects the training regime based on per-class counts, (2) includes text-specific data checks, (3) offers configurable OOD, (4) supports both classification and NER, and (5) deploys easily.

Real Stakes: Better customer support bots, safer content filters, faster on-premise deployments, and fewer mislabeled-data headaches. Schools, hospitals, banks, and small startups can all benefit without hiring a big ML team.

šŸž Hook: Report cards average your grades across subjects, not just the one you’re best at.

🄬 The Concept (Macro F1 Score): What it is: A fairness-minded score that averages performance equally across all classes. How it works: For each class, compute F1 = 2 Ā· (precision Ā· recall) / (precision + recall); Macro-F1 is then the average of the per-class scores, Macro-F1 = (1/C) Ī£ F1_i. Why it matters: Without Macro-F1, a model might look great by only doing well on common classes and ignoring rare ones. šŸž Anchor: If you score A in Math but D in Art and History, the average tells the full story, not just the A.

Numerical example after the formulas: Suppose two classes (C = 2). For class 1, precision = 0.8 and recall = 0.5, so F1_1 = 2 Ā· (0.8 Ā· 0.5) / (0.8 + 0.5) = 0.8 / 1.3 ā‰ˆ 0.615. For class 2, precision = 0.6 and recall = 0.6, so F1_2 = 0.6. Macro-F1 = (0.615 + 0.6) / 2 ā‰ˆ 0.608.
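The averaging above is easy to check in a few lines of plain Python. This mirrors the worked numbers only; it is not the library's internal metric code:

```python
# Macro-F1 from per-class precision/recall, mirroring the worked example above.
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0 if both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(per_class: list[tuple[float, float]]) -> float:
    """Average F1 equally over classes, regardless of class frequency."""
    return sum(f1(p, r) for p, r in per_class) / len(per_class)

score = macro_f1([(0.8, 0.5), (0.6, 0.6)])
print(round(score, 4))  # 0.6077
```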

02Core Idea

šŸž Hook: You know how coaches pick different drills depending on how many kids show up to practice? Two kids = simple drills; a full team = full scrimmage.

🄬 The Concept (Data-Aware Training Regime Selection): What it is: The key insight is to automatically choose the training method based on how many labeled examples each class has. How it works: 1) Count the minimum number of examples per class, call it n_min. 2) If 2 ≤ n_min ≤ 5, use an ultra few-shot method (AncSetFit). 3) If 5 < n_min ≤ 80, use a standard few-shot method (SetFit). 4) If n_min > 80, do full transformer fine-tuning. Why it matters: Without this, users must guess which method to try and may waste time or hurt quality. šŸž Anchor: If the smallest class has n_min = 4, OpenAutoNLU picks AncSetFit; if n_min = 50, it picks SetFit; if n_min = 120, it fine-tunes a transformer.

Numerical example after the inequalities: If your class counts are (4, 200, 85), then n_min = 4, which fits 2 ≤ n_min ≤ 5, so AncSetFit is chosen. If your class counts are (50, 90, 300), then n_min = 50, which fits 5 < n_min ≤ 80, so SetFit is chosen. If your class counts are (81, 100, 150), then n_min = 81, which fits n_min > 80, so full fine-tuning is chosen.
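The selection rule can be sketched in plain Python. The thresholds follow the paper, but the function name and shape here are illustrative, not OpenAutoNLU's actual API:

```python
from collections import Counter

def select_regime(labels: list[str]) -> str:
    """Pick a training regime from the smallest per-class count n_min."""
    n_min = min(Counter(labels).values())
    if n_min < 2:
        raise ValueError("need at least 2 examples in every class")
    if n_min <= 5:
        return "AncSetFit"      # ultra few-shot: 2 <= n_min <= 5
    if n_min <= 80:
        return "SetFit"         # few-shot: 5 < n_min <= 80
    return "fine-tuning"        # data-rich: n_min > 80

# First example from the text: class counts (4, 200, 85) -> n_min = 4.
labels = ["a"] * 4 + ["b"] * 200 + ["c"] * 85
print(select_regime(labels))  # AncSetFit
```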

Three Analogies:

  • Thermostat analogy: The system ā€œfeelsā€ how cold (data-scarce) or warm (data-rich) your dataset is and flips to the right heating mode (few-shot vs. fine-tuning).
  • Toolbox analogy: For tiny screws (very few examples), use a precision screwdriver (AncSetFit). For regular screws (dozens), use a standard one (SetFit). For big bolts (hundreds+), bring the power drill (full fine-tuning).
  • Classroom analogy: 3 students? Do tutoring (few-shot). A packed classroom? Run a normal lesson (full fine-tuning).

šŸž Hook: If you barely have any flashcards, you study differently than when you have a full deck.

🄬 The Concept (Few-Shot Learning): What it is: Teaching models from very few examples per class. How it works: 1) Build an embedding space, 2) pull similar examples together and push different ones apart (contrastive learning), 3) add a simple head to classify. Why it matters: Without few-shot, tiny datasets underperform or can’t train at all. šŸž Anchor: Learning to recognize ā€œrefund requestā€ intent after seeing 5 examples.

šŸž Hook: When you have a whole textbook, you can fine-tune your knowledge to the test.

🄬 The Concept (Full Fine-Tuning): What it is: Adjusting all (or most) weights of a big model so it specializes for your task. How it works: 1) Start from a pretrained transformer, 2) train on your labeled data, 3) tune hyperparameters to maximize validation score. Why it matters: Without fine-tuning, you might cap out in accuracy even with lots of data. šŸž Anchor: With 100+ examples per class, the model customizes deeply and gets higher accuracy.

šŸž Hook: Imagine a language-savvy friend who reads entire sentences to grasp meaning.

🄬 The Concept (Transformer Models): What it is: A powerful architecture (like BERT) that understands context in text. How it works: 1) Break text into tokens, 2) apply attention to weigh important parts, 3) produce contextual embeddings. Why it matters: Without transformers, many NLP tasks would be much less accurate. šŸž Anchor: Focusing on the words ā€œcapital of Franceā€ to answer ā€œParis.ā€

šŸž Hook: To tell apples from oranges, you look at them side-by-side.

🄬 The Concept (Contrastive Learning, as in SetFit): What it is: A way to learn by comparing pairs and triplets so similar things get close and different things get far in embedding space. How it works: 1) Make positive pairs (same label) and negative pairs (different labels), 2) train to shrink distance for positives and grow it for negatives, 3) then train a simple classifier. Why it matters: Without contrastive learning, few-shot models struggle to form strong class boundaries with little data. šŸž Anchor: Pull ā€œreset passwordā€ and ā€œforgot passwordā€ closer, push ā€œorder pizzaā€ away.
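As a rough illustration of the pair-building step (step 1 above), here is a tiny stdlib-only sketch; real SetFit training samples pairs and trains a sentence encoder on them, which is omitted here:

```python
from itertools import combinations

def make_pairs(examples: list[tuple[str, str]]):
    """Yield ((text_a, text_b), is_positive) for every pair of (text, label) examples."""
    for (text_a, label_a), (text_b, label_b) in combinations(examples, 2):
        yield (text_a, text_b), label_a == label_b  # positive pair iff same label

data = [
    ("reset password", "login"),
    ("forgot password", "login"),
    ("order pizza", "food"),
]
positives = [pair for pair, is_pos in make_pairs(data) if is_pos]
print(positives)  # [('reset password', 'forgot password')]
```

Training then shrinks the embedding distance for positive pairs and grows it for negative ones before fitting the classifier head.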

šŸž Hook: Baking a cake is easier if the oven temp and time are just right.

🄬 The Concept (Hyperparameter Optimization): What it is: Systematically searching for the best training settings (like learning rate or batch size). How it works: 1) Try smart guesses (e.g., TPE/Optuna), 2) compare validation scores, 3) keep the best. Why it matters: Without this, you may overfit or underfit. šŸž Anchor: Trying learning rates like 3Ɨ10⁻⁵ or 1Ɨ10⁻⁓ and picking the one that wins on validation.

Numerical example after the learning-rate range: If the search range is [10⁻⁶, 10⁻³], you might test 3Ɨ10⁻⁵, 1Ɨ10⁻⁓, and 5Ɨ10⁻⁵; if 1Ɨ10⁻⁓ gives the highest Macro-F1 (say 0.89 vs. 0.86 and 0.88), you keep 1Ɨ10⁻⁓.
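The ā€œcompare validation scores, keep the bestā€ loop boils down to an argmax. This toy sketch uses precomputed scores in place of real training runs, and plain selection in place of Optuna’s TPE sampler:

```python
def pick_best(candidates, validate):
    """Return the candidate (e.g., a learning rate) with the highest validation score."""
    return max(candidates, key=validate)

# Hypothetical validation Macro-F1 per learning rate, standing in for real runs:
scores = {3e-5: 0.86, 1e-4: 0.89, 5e-5: 0.88}
best = pick_best(scores, scores.get)
print(best)  # 0.0001
```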

Before vs. After:

  • Before: Users juggled presets and frameworks, guessing which method matched their data and risking wasted compute.
  • After: OpenAutoNLU checks n_min and picks few-shot vs. fine-tuning automatically, adds data-quality checks, and lets you toggle OOD with a single flag.

Why It Works (intuition): Different data sizes call for different bias–variance trade-offs. Few-shot methods build strong structure from minimal data via contrastive learning; full fine-tuning shines when there’s enough signal to reliably adapt a large model. Automatically matching the method to n_min reduces overfitting in tiny regimes and underfitting in large ones.

Building Blocks:

  • Data scanning (compute n_min)
  • Method resolver (AncSetFit, SetFit, or Finetuning)
  • Optional data-quality diagnostics
  • Optional data-level balancing (upsampling or downsampling)
  • Training and evaluation (with Macro-F1)
  • Optional OOD detection layer
  • ONNX export for fast deployment

03Methodology

High-level recipe: Input text and labels → (optional) data-quality check → method selection by n_min → data balancing (if needed) → train (few-shot or fine-tune) → evaluate (Macro-F1, optionally OOD) → export model (ONNX).

Step 1: Data Quality Check (optional)

  • What happens: The system trains a quick model, watches per-example behavior across epochs, and flags mislabeled, ambiguous, or low-signal samples (using retagging, uncertainty, V-information, and dataset cartography for classification; for NER, it uses label aggregation with Monte Carlo dropout).
  • Why it exists: Bad data poisons training—spotting it early prevents wasted compute and boosts final quality.
  • Example: If a sample labeled ā€œbillingā€ consistently looks like ā€œloginā€ to the model, it’s flagged for review.

šŸž Hook: Drawing a map helps you find easy roads and tricky turns.

🄬 The Concept (Dataset Cartography): What it is: A way to visualize which samples are easy, ambiguous, or hard based on training dynamics. How it works: 1) Track confidence and variability per sample over epochs, 2) cluster into regions, 3) show a map and thresholds. Why it matters: Without it, tough or mislabeled samples hide in the mix. šŸž Anchor: Seeing a cluster of ā€œhard-to-learnā€ examples tells you where to clean labels first.
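A minimal sketch of the cartography idea, assuming per-epoch confidence in the gold label is available; the region thresholds used here (0.7, 0.3, 0.15) are illustrative, not the paper's:

```python
from statistics import mean, pstdev

def cartography_region(confidences: list[float]) -> str:
    """Classify a sample by its training dynamics: mean confidence vs. variability."""
    mu, sigma = mean(confidences), pstdev(confidences)
    if mu > 0.7 and sigma < 0.15:
        return "easy-to-learn"
    if mu < 0.3 and sigma < 0.15:
        return "hard-to-learn (check the label)"
    return "ambiguous"

# Confidence in the gold label over 4 epochs, for three samples:
print(cartography_region([0.9, 0.92, 0.95, 0.96]))  # easy-to-learn
print(cartography_region([0.1, 0.15, 0.12, 0.08]))  # hard-to-learn (check the label)
print(cartography_region([0.2, 0.8, 0.3, 0.7]))     # ambiguous
```

Samples in the ā€œhard-to-learnā€ region with steadily low confidence are the first candidates for label review.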

Step 2: Compute n_min and Choose Method

  • What happens: The pipeline computes the smallest per-class sample count n_min, then applies thresholds to select the method.
  • Why it exists: Matching method to data regime maximizes learning efficiency and accuracy.
  • Formula and example: If class counts are (6, 6, 100), then n_min = 6, which fits 5 < n_min ≤ 80 → choose SetFit.

Step 3: Data-Level Optimization (Balancing)

  • What happens: If many classes are under 81 examples, underrepresented classes can be upsampled to n = 81 via character/word perturbations or LLM paraphrases; for few-shot methods, very large classes can be downsampled to stay balanced.
  • Why it exists: Balanced classes help models treat all labels fairly and often improve Macro-F1.
  • Example: If counts are (20, 30, 200), upsample the first two classes to 81 (e.g., LLM paraphrases) so all reach at least 81; then fine-tuning becomes viable.

Numerical example after upsampling: Starting counts (20, 30, 200) become (81, 81, 200). Now n_min = 81 satisfies n_min > 80, so the system can switch from SetFit to full fine-tuning.
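The balancing step can be sketched as follows; random duplication stands in for the perturbation or paraphrase generation the library would actually use:

```python
import random

def balance(by_class: dict[str, list[str]], target: int = 81) -> dict[str, list[str]]:
    """Upsample classes below `target` examples; leave larger classes untouched."""
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    out = {}
    for label, texts in by_class.items():
        if len(texts) >= target:
            out[label] = list(texts)
        else:
            extra = rng.choices(texts, k=target - len(texts))  # stand-in for paraphrases
            out[label] = list(texts) + extra
    return out

data = {"refund": [f"r{i}" for i in range(20)],
        "delivery": [f"d{i}" for i in range(30)],
        "login": [f"l{i}" for i in range(200)]}
print({k: len(v) for k, v in balance(data).items()})
# {'refund': 81, 'delivery': 81, 'login': 200}
```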

Step 4: Train with the Selected Regime

  • Few-shot (AncSetFit when 2 ≤ n_min ≤ 5): Uses anchor labels (short descriptions) and triplet/contrastive learning to build a strong embedding space, then a simple classifier.
  • Few-shot (SetFit when 5 < n_min ≤ 80): Builds sentence embeddings via contrastive training and fits a logistic-regression head.
  • Full fine-tuning (when n_min > 80): Fine-tunes a transformer with early stopping; optionally runs Optuna with TPE to pick learning rate, batch size, and weight decay.
  • Why it exists: Each regime is matched to the data size to avoid under/overfitting.
  • Example: With 60 examples per smallest class, SetFit runs 20 contrastive iterations and trains a light classifier; with 120 examples per smallest class, the system fine-tunes a BERT checkpoint.

Step 5: Objective and Hyperparameter Search (Fine-tuning Only)

  • What happens: Optimize Macro-F1 on a validation split using Optuna’s TPE sampler over learning rate in [10⁻⁶, 10⁻³], batch size, and weight decay; early stopping with patience avoids overtraining.
  • Why it exists: Fine-tuning has many knobs; a smart search finds a strong combo.
  • Numerical example: Try learning rates 1Ɨ10⁻⁵, 3Ɨ10⁻⁵, and 1Ɨ10⁻⁓. If 3Ɨ10⁻⁵ yields Macro-F1 0.90 vs. 0.88 and 0.89, keep 3Ɨ10⁻⁵.

Step 6: OOD Detection (Optional)

  • What happens: Choose an OOD method matched to the regime (e.g., maximum softmax probability for SetFit, Mahalanobis distance for fine-tuned encoders, or an explicit out-of-scope class). Generate some synthetic OOD (like gibberish) and tune the threshold on validation.
  • Why it exists: Real systems face strange inputs; a thresholded OOD score lets the model say ā€œI’m not sure.ā€
  • Example: If the OOD score threshold is tuned so that only 1 in 100 in-scope examples is mistakenly flagged, you get cautious but accurate OOD behavior.
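One of the OOD options named above, maximum softmax probability, is simple enough to sketch directly; the 0.5 threshold is a placeholder for the value you would tune on validation:

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def is_ood(logits: list[float], threshold: float = 0.5) -> bool:
    """Flag as out-of-distribution when the top class probability is too low."""
    return max(softmax(logits)) < threshold

print(is_ood([4.0, 0.1, 0.2]))  # False: one class dominates, confident in-scope
print(is_ood([0.9, 1.0, 1.1]))  # True: flat distribution, likely OOD
```

Raising the threshold makes the model more cautious (more ā€œI don’t knowā€); lowering it trades safety for coverage.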

Step 7: LLM-Assisted Data Generation (Optional)

  • What happens: When data is scarce, an LLM creates new training or test examples following your label definitions.
  • Why it exists: Extra realistic examples boost learning and give a proxy test set when you lack one.
  • Example: With only 8 examples per class, generate 40 more paraphrases per class to reach stronger few-shot training.

Step 8: NER Pipeline Details (if doing NER)

  • What happens: Accepts offset- or bracket-style annotations; converts to BIO tags; stratified splitting preserves entity proportions; evaluates at the entity level (precision/recall/F1), including partial matches.
  • Why it exists: NER needs careful tagging and fair evaluation so entities aren’t split incorrectly.
  • Example: ā€œVisit New York Cityā€ → ā€œNew York Cityā€ gets tags B-LOC, I-LOC, I-LOC (and ā€œVisitā€ gets O); partial-match scoring gives credit if ā€œNew Yorkā€ is found but ā€œCityā€ is missed.
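The offset-to-BIO conversion can be sketched for whitespace tokens; a real pipeline would align offsets with the model's subword tokenizer instead:

```python
def to_bio(text: str, entities: list[tuple[int, int, str]]) -> list[str]:
    """Convert (start, end, label) character spans to BIO tags over whitespace tokens."""
    tags, pos = [], 0
    for token in text.split():
        start = text.index(token, pos)  # character offset of this token
        pos = start + len(token)
        tag = "O"
        for (ent_start, ent_end, label) in entities:
            if start == ent_start:
                tag = f"B-{label}"          # token begins the entity
            elif ent_start < start < ent_end:
                tag = f"I-{label}"          # token continues the entity
        tags.append(tag)
    return tags

print(to_bio("Visit New York City", [(6, 19, "LOC")]))
# ['O', 'B-LOC', 'I-LOC', 'I-LOC']
```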

Step 9: Export for Fast Inference (ONNX)

  • What happens: Export to ONNX with tokenizer files and label mapping; at inference, the runner auto-detects hardware (GPU/CPU) and picks batch sizes that won’t crash memory.
  • Why it exists: Deployment must be fast, portable, and safe.
  • Example: Move your model from training server to a CPU-only edge box and still get snappy responses.

šŸž Hook: Packing your lunch neatly makes it easy to grab and go.

🄬 The Concept (ONNX Export): What it is: A portable format that lets models run quickly on many devices. How it works: 1) Convert model graph, 2) bundle tokenizer and labels, 3) load on target hardware. Why it matters: Without ONNX, deployment can be slower or platform-tied. šŸž Anchor: Saving your trained model as ONNX so an app can run it in two lines of code.

Secret Sauce:

  • The method chooser keyed by n_min, plus adaptive up/down-sampling, quietly moves you into the most effective regime.
  • Built-in data diagnostics reduce label noise before it bites.
  • Configurable OOD lets you tune safety vs. recall with a simple threshold factor.

04Experiments & Results

The Test: The team measured how well OpenAutoNLU classifies intents and handles OOD across four well-known datasets: Banking77, HWU64, MASSIVE, and SNIPS. They reported fairness-minded Macro-F1 (overall in-domain quality), In-scope F1 (with no OOD noise in test), and OOD F1 (how well it says ā€œthis is out of scopeā€).

The Competition: They compared against AutoIntent (intent-focused AutoML), AutoGluon (general AutoML), LightAutoML, and H2O. To be fair, they held the pretrained backbones as consistent as possible (e.g., BERT where supported), so differences would come from the AutoML logic and pipelines, not just representations.

Scoreboard with Context:

  • In OOD-unaware settings (realistic: OOD shows up at test, but in-domain score is reported), OpenAutoNLU typically leads or ties on 3/4 datasets. AutoGluon slightly tops on Banking77 but at higher compute cost. Think: OpenAutoNLU getting an A while others get B-to-B+ on most subjects, and AutoGluon gets an A on just one subject but studies much longer.
  • In clean, controlled tests without OOD in the test set, OpenAutoNLU again shines on HWU64, MASSIVE, and SNIPS and is very close on Banking77 (0.912 vs. 0.920). That’s like getting 91.2% vs. 92.0%; both are strong, but OpenAutoNLU keeps this level consistently across multiple courses.
  • For OOD detection: OpenAutoNLU’s unsupervised OOD option is particularly strong, often beating supervised approaches; it balances in-domain accuracy with being cautious about strange inputs. In some cases, adding supervised OOD examples helps OOD F1 but can slightly reduce in-domain Macro-F1, showing a trade-off you can tune.

Surprising Findings:

  • OpenAutoNLU’s unsupervised OOD can be so good that adding labeled OOD doesn’t always help overall results—sometimes it hurts in-domain classification.
  • Data-aware selection plus light balancing (upsampling to n = 81 when many classes are small) often moves datasets into a regime where full fine-tuning wins—without the user changing any code.
  • LLM-made test sets track real test sets closely in small/medium data ranges (differences under 5 percentage points), offering a practical proxy when a true test set isn’t available.

Numbers Made Meaningful:

  • Example in-scope Macro-F1 on full datasets (no OOD in test): OpenAutoNLU hits about 0.912 (Banking77), 0.890 (HWU64), 0.876 (MASSIVE), 0.921 (SNIPS). That’s like scoring mostly A-/A across four classes.
  • In tougher OOD-unaware tests (OOD present, but you grade only in-domain classes), OpenAutoNLU generally stays top-tier while being compute-efficient—a strong sign it’s production-ready.

Takeaway: Across varied data sizes and OOD conditions, the simple ā€œlet the system pick the regimeā€ idea pays off. You get solid accuracy, sensible OOD handling, and fast deployment—with less tinkering.

05Discussion & Limitations

Limitations:

  • Fixed thresholds for method switching (n_min = 5 and n_min = 80) are empirically good defaults but may be suboptimal for unusual datasets. A future meta-model could learn better cutoffs per domain.
  • LLM-generated test sets are reliable mainly for small/medium regimes; for very large data, generated tests can drift and be less predictive of true performance.
  • OOD trade-offs exist: boosting OOD F1 with supervision can reduce in-domain Macro-F1, so you must tune the threshold to your risk tolerance.

Required Resources:

  • GPU recommended (for fine-tuning and speedy contrastive training), but ONNX export enables fast CPU inference after training.
  • Access to an LLM endpoint (optional) for augmentation/test synthesis if you want data boosts.

When NOT to Use:

  • Extremely tiny single-class or unbalanced datasets with nearly no variation may not benefit enough—consider collecting a few more examples per rare class first.
  • Highly sensitive domains where synthetic (LLM) data is not allowed and you can’t gather more labeled data may limit performance.
  • If your latency budget cannot tolerate any OOD scoring overhead, you may run without OOD (but accept the risks).

Open Questions:

  • Can a learned meta-model pick not just the training regime but also the best augmentation and OOD method per dataset automatically?
  • How to further improve unsupervised OOD in close-to-in-domain settings without hurting in-domain accuracy?
  • Can we better estimate label noise per annotator and auto-clean at scale for NER and classification alike?
  • What are the best strategies to combine LLM-generated examples with real data without introducing bias or drift?

06Conclusion & Future Work

Three-Sentence Summary: OpenAutoNLU is a text-first AutoML library that automatically chooses between few-shot learning and full fine-tuning based on your dataset’s smallest class size, so you don’t have to guess. It bundles data-quality checks, configurable OOD detection, NER support, LLM-based augmentation/test synthesis, and one-click ONNX export in a low-code API. In benchmarks, it’s competitive or better than other AutoML tools while being simpler and faster to deploy.

Main Achievement: Turning a tricky, many-knob NLP setup into a data-aware, mostly automatic pipeline that works across small, medium, and large data regimes with strong in-domain and OOD performance.

Future Directions: Replace fixed thresholds with a learned meta-model, refine unsupervised OOD (especially on close OOD), and deepen data-quality tooling for even better label cleanup. Expand multilingual coverage and domain adapters while maintaining low latency.

Why Remember This: It’s a practical, batteries-included path from a handful of examples to a production-grade model—classification or NER—without changing your code. In other words, it’s the smart lunch robot for NLP: it checks your fridge (data), picks the right recipe (training regime), improves ingredients (data quality), says ā€œI don’t knowā€ when needed (OOD), and packs it to-go (ONNX).

Practical Applications

  • Customer support intent routing that stays accurate even with few examples per new intent.
  • Helpdesk triage that flags unfamiliar tickets as OOD instead of misrouting them.
  • E-commerce product review classification with automatic data cleaning before training.
  • On-device or on-premise text classifiers using ONNX export for low-latency inference.
  • Healthcare note de-identification via NER with careful entity-level evaluation.
  • Content moderation that rejects unusual or adversarial posts as OOD to avoid false approvals.
  • Internal email tagging and prioritization with minimal configuration and robust OOD handling.
  • Financial document NER for extracting organizations and amounts with stratified splitting.
  • Low-resource languages or domains boosted by LLM-generated paraphrases and test sets.
  • Rapid prototype-to-production pipelines where the same API handles few-shot and full-data cases.
#AutoML#Natural Language Understanding#Text Classification#Named Entity Recognition#Few-Shot Learning#Transformer Fine-Tuning#Contrastive Learning#Out-of-Distribution Detection#Data Quality Diagnostics#Macro F1#SetFit#AncSetFit#Optuna#ONNX#Dataset Cartography