OpenAutoNLU: Open Source AutoML Library for NLU
Key Summary
- OpenAutoNLU is a simple, open-source tool that automatically builds text understanding models for you.
- It looks at your dataset and smartly picks the best training style for your data size without you setting knobs.
- It works for both text classification (sorting messages) and NER (finding names and places in text).
- It checks your data quality first and flags confusing or mislabeled examples so you don't train on bad data.
- It can detect out-of-distribution (OOD) inputs, which means it knows when a message is too different from what it learned.
- It uses light, fast models and can export to ONNX for super-quick deployment on many devices.
- In tests on four popular datasets, it usually matches or beats other AutoML tools while being easier to use.
- It can even ask an LLM to create extra training or test examples when you don't have enough data.
- Its key trick is data-aware regime selection: it chooses between few-shot learning and full fine-tuning automatically.
- You can train and run a model in just a couple of lines of code, with OOD detection included if you want.
Why This Research Matters
OpenAutoNLU lowers the barrier to building reliable text models by auto-picking the right training strategy for your data size. That means small teams can ship strong chatbots, ticket routers, and content filters without wrestling with complex settings. Its built-in data checks reduce label noise, which is a major hidden cause of poor performance. Configurable OOD detection makes systems safer, since they can admit uncertainty instead of guessing. ONNX export and lightweight pipelines keep inference fast and cheap, fitting real-world latency and budget limits. LLM-based augmentation and test synthesis help when labeled data is scarce. Altogether, it turns messy, real-life NLP needs into a smooth, low-code pipeline from dataset to deployment.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're packing a school lunchbox. Some days you have tons of ingredients; other days you have just a few. You still need a tasty, healthy lunch either way.
The Concept (AutoML): What it is: AutoML is a helper that chooses and trains machine learning models for you. How it works (step by step): 1) Look at your data. 2) Pick a fitting model type and settings. 3) Train and tune. 4) Evaluate and export. Why it matters: Without AutoML, you must guess many settings yourself, which is slow and error-prone. Anchor: Like a smart lunch robot that sees what's in your fridge and picks the fastest recipe that still tastes great.
Hook: You know how you can understand what someone means even if they use different words? Computers want to do that too.
The Concept (NLU): What it is: Natural Language Understanding (NLU) helps computers make sense of human text. How it works: 1) Read words, 2) turn them into numbers a computer understands, 3) find patterns, 4) predict labels or entities. Why it matters: Without NLU, apps can't reliably sort emails, answer questions, or extract facts. Anchor: Your email app labeling a message as "Promotions" or "Important" is NLU at work.
Hook: Think about sorting your Lego by color or size.
The Concept (Text Classification): What it is: Text classification sorts a message into a category. How it works: 1) Read a message, 2) compare it to patterns learned from examples, 3) assign a label. Why it matters: Without it, chatbots can't tell if you're asking about refunds or delivery. Anchor: A support bot tagging "I forgot my password" as a "login problem."
Hook: When you read a story, you notice names of people and places.
The Concept (NER): What it is: Named Entity Recognition finds special names like people, places, or companies in text. How it works: 1) Scan the sentence, 2) mark token-by-token which parts are entities, 3) output typed spans. Why it matters: Without NER, systems can't pull useful nuggets like "Paris" or "Alice" from text. Anchor: Highlighting "Hogwarts" as a place and "Harry Potter" as a person in a sentence.
The World Before: Building reliable text classifiers or NER models was hard. People had to choose between many approaches: classic ML with embeddings, few-shot tricks, or full transformer fine-tuning. Every option worked best in different data sizes, and small mistakes in setup could waste time and compute. General AutoML tools focused mostly on spreadsheets (tabular data), and even NLP-focused ones often needed many settings. Meanwhile, data was messy: mislabeled texts, confusing examples, and unbalanced classes hurt results. And in the real world, models get weird, never-seen inputs (called out-of-distribution, or OOD). Most tools didn't handle all these needs together.
Hook: You know how a teacher spots when a test question is badly written or confusing?
The Concept (Data Quality Diagnostics): What it is: A set of tests that flag mislabeled or low-signal examples before training. How it works: 1) Train a quick model, 2) see which examples it is uncertain about or disagrees with, 3) map easy vs. hard examples, 4) suggest fixes. Why it matters: Without this, bad data sneaks in and weakens your final model. Anchor: If 10 math problems have wrong answers written on them, your study will suffer; spotting them first helps you learn better.
Hook: Imagine a librarian who knows when a book doesn't belong in a section.
The Concept (Out-of-Distribution Detection): What it is: A way to tell if a new message is too different from the training data. How it works: 1) Score how "familiar" the input looks, 2) compare to a threshold, 3) if too strange, flag as OOD. Why it matters: Without OOD detection, the model confidently guesses on nonsense, causing bad user experiences. Anchor: A travel bot that says, "I don't understand" for a recipe request instead of giving a random flight price.
Hook: Think of a super talkative parrot that can help you come up with sentences.
The Concept (LLMs): What it is: Large Language Models are powerful text generators and readers. How it works: 1) Read a prompt, 2) predict likely next words many times, 3) produce helpful text. Why it matters: Without LLM help, small datasets stay small; with them, you can synthesize training or testing examples. Anchor: If you only have 3 examples of "cancel order," an LLM can write 20 more realistic variations.
The Problem: People needed a simple, text-first AutoML tool that automatically chooses the right approach for their data size, checks data quality, supports both classification and NER, and can say "I don't know" sensibly.
Failed Attempts: General AutoML (H2O, LightAutoML, AutoGluon) focused on tables or required extra setup to work well on text. NLP-specific tools (like AutoIntent) helped, but still needed manual preset choices, had limits for tiny classes, or lacked unified NER support and flexible OOD.
The Gap: A one-stop, low-code, NLP-native AutoML that: (1) auto-selects the training regime based on per-class counts, (2) includes text-specific data checks, (3) offers configurable OOD, (4) supports both classification and NER, and (5) deploys easily.
Real Stakes: Better customer support bots, safer content filters, faster on-premise deployments, and fewer mislabeled-data headaches. Schools, hospitals, banks, and small startups can all benefit without hiring a big ML team.
Hook: Report cards average your grades across subjects, not just the one you're best at.
The Concept (Macro F1 Score): What it is: A fairness-minded score that averages performance equally across all classes. How it works: For each class c, compute F1_c = 2 * P_c * R_c / (P_c + R_c), where P_c is precision and R_c is recall; Macro-F1 is then the unweighted mean (1/C) * (F1_1 + ... + F1_C) over all C classes. Why it matters: Without Macro-F1, a model might look great by only doing well on common classes and ignoring rare ones. Anchor: If you score A in Math but D in Art and History, the average tells the full story, not just the A.
Numerical example after the formulas: Suppose two classes (C = 2). For class 1, P_1 = 0.8 and R_1 = 0.6, so F1_1 = 2 * 0.8 * 0.6 / (0.8 + 0.6) ≈ 0.686. For class 2, P_2 = 0.5 and R_2 = 0.5, so F1_2 = 0.5. Macro-F1 = (0.686 + 0.5) / 2 ≈ 0.593.
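The Macro-F1 arithmetic is easy to sanity-check in plain Python. This is a minimal sketch with illustrative per-class precision/recall values, not numbers reported for any real model:

```python
def f1(precision: float, recall: float) -> float:
    """Per-class F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(per_class: list[tuple[float, float]]) -> float:
    """Average per-class F1 scores, weighting every class equally."""
    return sum(f1(p, r) for p, r in per_class) / len(per_class)

# Two classes with illustrative (precision, recall) pairs.
score = macro_f1([(0.8, 0.6), (0.5, 0.5)])
print(round(score, 3))  # 0.593
```

Because the mean is unweighted, a rare class with poor F1 drags the score down just as much as a common one would.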
02 Core Idea
Hook: You know how coaches pick different drills depending on how many kids show up to practice? Two kids = simple drills; a full team = full scrimmage.
The Concept (Data-Aware Training Regime Selection): What it is: The key insight is to automatically choose the training method based on how many labeled examples each class has. How it works: 1) Count the minimum number of examples per class; call it n_min. 2) If n_min is very small (below the lower few-shot cutoff), use an ultra few-shot method (AncSetFit). 3) If n_min is moderate (between that cutoff and 81), use a standard few-shot method (SetFit). 4) If n_min ≥ 81, do full transformer fine-tuning. Why it matters: Without this, users must guess which method to try and may waste time or hurt quality. Anchor: If the smallest class has only a handful of examples, OpenAutoNLU picks AncSetFit; with a few dozen, it picks SetFit; with n_min ≥ 81, it fine-tunes a transformer.
Numerical example after the inequalities: If your class counts are {5, 40, 200}, then n_min = 5, which falls in the ultra few-shot range, so AncSetFit is chosen. If your class counts are {60, 150, 300}, then n_min = 60, which falls in the standard few-shot range, so SetFit is chosen. If your class counts are {120, 200, 350}, then n_min = 120 ≥ 81, so full fine-tuning is chosen.
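The selection rule can be sketched as a tiny resolver function. Two assumptions to flag: the 81-example fine-tuning boundary comes from the balancing step described in this write-up, while `ANCSETFIT_CUTOFF` is a made-up placeholder, not OpenAutoNLU's actual constant:

```python
ANCSETFIT_CUTOFF = 16  # assumed lower threshold for the ultra few-shot regime
FINETUNE_CUTOFF = 81   # smallest class size at which full fine-tuning is used

def pick_regime(class_counts: dict[str, int]) -> str:
    """Choose a training regime from the smallest per-class sample count."""
    n_min = min(class_counts.values())
    if n_min >= FINETUNE_CUTOFF:
        return "finetuning"
    if n_min >= ANCSETFIT_CUTOFF:
        return "setfit"
    return "ancsetfit"

print(pick_regime({"refund": 5, "delivery": 40}))     # ancsetfit
print(pick_regime({"refund": 60, "delivery": 150}))   # setfit
print(pick_regime({"refund": 120, "delivery": 300}))  # finetuning
```

Only the minimum count matters: one starved class is enough to pull the whole dataset into a few-shot regime.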
Three Analogies:
- Thermostat analogy: The system "feels" how cold (data-scarce) or warm (data-rich) your dataset is and flips to the right heating mode (few-shot vs. fine-tuning).
- Toolbox analogy: For tiny screws (very few examples), use a precision screwdriver (AncSetFit). For regular screws (dozens), use a standard one (SetFit). For big bolts (hundreds+), bring the power drill (full fine-tuning).
- Classroom analogy: 3 students? Do tutoring (few-shot). 30 students? Run a normal lesson (fine-tuning).
Hook: If you barely have any flashcards, you study differently than when you have a full deck.
The Concept (Few-Shot Learning): What it is: Teaching models from very few examples per class. How it works: 1) Build an embedding space, 2) pull similar examples together and push different ones apart (contrastive learning), 3) add a simple head to classify. Why it matters: Without few-shot, tiny datasets underperform or can't train at all. Anchor: Learning to recognize "refund request" intent after seeing 5 examples.
Hook: When you have a whole textbook, you can fine-tune your knowledge to the test.
The Concept (Full Fine-Tuning): What it is: Adjusting all (or most) weights of a big model so it specializes for your task. How it works: 1) Start from a pretrained transformer, 2) train on your labeled data, 3) tune hyperparameters to maximize validation score. Why it matters: Without fine-tuning, you might cap out in accuracy even with lots of data. Anchor: With 100+ examples per class, the model customizes deeply and gets higher accuracy.
Hook: Imagine a language-savvy friend who reads entire sentences to grasp meaning.
The Concept (Transformer Models): What it is: A powerful architecture (like BERT) that understands context in text. How it works: 1) Break text into tokens, 2) apply attention to weigh important parts, 3) produce contextual embeddings. Why it matters: Without transformers, many NLP tasks would be much less accurate. Anchor: Focusing on the words "capital of France" to answer "Paris."
Hook: To tell apples from oranges, you look at them side-by-side.
The Concept (Contrastive Learning, as in SetFit): What it is: A way to learn by comparing pairs and triplets so similar things get close and different things get far in embedding space. How it works: 1) Make positive pairs (same label) and negative pairs (different labels), 2) train to shrink distance for positives and grow it for negatives, 3) then train a simple classifier. Why it matters: Without contrastive learning, few-shot models struggle to form strong class boundaries with little data. Anchor: Pull "reset password" and "forgot password" closer, push "order pizza" away.
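The pair-building half of contrastive training can be sketched in a few lines. The texts and labels here are invented for illustration; a real SetFit run would feed such pairs into a sentence-embedding model:

```python
from itertools import combinations

# Labeled texts: same-label pairs become positives, cross-label pairs negatives.
examples = [
    ("reset password", "login"),
    ("forgot password", "login"),
    ("order pizza", "food"),
]

positives, negatives = [], []
for (t1, l1), (t2, l2) in combinations(examples, 2):
    (positives if l1 == l2 else negatives).append((t1, t2))

print(positives)       # [('reset password', 'forgot password')]
print(len(negatives))  # 2
```

Training then minimizes embedding distance over `positives` and maximizes it over `negatives`, which is how strong class boundaries emerge from so few examples.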
Hook: Baking a cake is easier if the oven temp and time are just right.
The Concept (Hyperparameter Optimization): What it is: Systematically searching for the best training settings (like learning rate or batch size). How it works: 1) Try smart guesses (e.g., TPE/Optuna), 2) compare validation scores, 3) keep the best. Why it matters: Without this, you may overfit or underfit. Anchor: Trying learning rates like 1e-5 or 5e-5 and picking the one that wins on validation.
Numerical example after the learning-rate range: If the search range is [1e-5, 5e-5], you might test 1e-5, 2e-5, and 5e-5; if 2e-5 gives the highest Macro-F1 (say 0.89 vs. 0.87 and 0.88), you keep 2e-5.
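A minimal sketch of that search loop. The lookup table of Macro-F1 scores is made up and stands in for real training runs; in practice a tool like Optuna would propose candidates and each `validate` call would train a model:

```python
# Illustrative validation scores per candidate learning rate.
ILLUSTRATIVE_SCORES = {1e-5: 0.87, 2e-5: 0.89, 5e-5: 0.88}

def validate(lr: float) -> float:
    """Stand-in for 'train with this lr, then score Macro-F1 on dev'."""
    return ILLUSTRATIVE_SCORES[lr]

# Keep whichever candidate maximizes the validation score.
best_lr = max(ILLUSTRATIVE_SCORES, key=validate)
print(best_lr)  # 2e-05
```

The same pattern extends to batch size and weight decay: propose, evaluate on validation, keep the winner.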
Before vs. After:
- Before: Users juggled presets and frameworks, guessing which method matched their data and risking wasted compute.
- After: OpenAutoNLU checks n_min and picks few-shot vs. fine-tuning automatically, adds data-quality checks, and lets you toggle OOD with a single flag.
Why It Works (intuition): Different data sizes call for different bias-variance trade-offs. Few-shot methods build strong structure from minimal data via contrastive learning; full fine-tuning shines when there's enough signal to reliably adapt a large model. Automatically matching the method to n_min reduces overfitting in tiny regimes and underfitting in large ones.
Building Blocks:
- Data scanning (compute n_min, the smallest per-class sample count)
- Method resolver (AncSetFit, SetFit, or Finetuning)
- Optional data-quality diagnostics
- Optional data-level balancing (upsampling or downsampling)
- Training and evaluation (with Macro-F1)
- Optional OOD detection layer
- ONNX export for fast deployment
03 Methodology
High-level recipe: Input text and labels → (optional) data-quality check → method selection by n_min → data balancing (if needed) → train (few-shot or fine-tune) → evaluate (Macro-F1, optionally OOD) → export model (ONNX).
Step 1: Data Quality Check (optional)
- What happens: The system trains a quick model, watches per-example behavior across epochs, and flags mislabeled, ambiguous, or low-signal samples (using retagging, uncertainty, V-information, and dataset cartography for classification; for NER, it uses label aggregation with Monte Carlo dropout).
- Why it exists: Bad data poisons training; spotting it early prevents wasted compute and boosts final quality.
- Example: If a sample labeled "billing" consistently looks like "login" to the model, it's flagged for review.
Hook: Drawing a map helps you find easy roads and tricky turns.
The Concept (Dataset Cartography): What it is: A way to visualize which samples are easy, ambiguous, or hard based on training dynamics. How it works: 1) Track confidence and variability per sample over epochs, 2) cluster into regions, 3) show a map and thresholds. Why it matters: Without it, tough or mislabeled samples hide in the mix. Anchor: Seeing a cluster of "hard-to-learn" examples tells you where to clean labels first.
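The two cartography statistics are easy to compute once you have logged each example's gold-label probability per epoch. A rough sketch, with an invented probability history:

```python
import statistics

def cartography(prob_history: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    """Per-example training-dynamics stats.

    prob_history maps an example id to its gold-label probability at each epoch.
    High confidence = easy-to-learn; high variability = ambiguous.
    """
    return {
        ex_id: {
            "confidence": statistics.mean(probs),
            "variability": statistics.pstdev(probs),
        }
        for ex_id, probs in prob_history.items()
    }

history = {
    "easy": [0.90, 0.92, 0.95],       # consistently confident
    "ambiguous": [0.20, 0.80, 0.40],  # swings between epochs
    "hard": [0.10, 0.15, 0.12],       # consistently wrong
}
stats = cartography(history)
print(stats["easy"]["confidence"] > stats["hard"]["confidence"])  # True
```

Plotting confidence against variability yields the familiar map: easy examples sit top-left, ambiguous ones to the right, and hard-to-learn (often mislabeled) ones bottom-left.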
Step 2: Compute n_min and Choose Method
- What happens: The pipeline computes the smallest per-class sample count n_min, then applies thresholds to select the method.
- Why it exists: Matching method to data regime maximizes learning efficiency and accuracy.
- Formula and example: If class counts are {60, 150, 300}, then n_min = 60, which falls in the SetFit range → choose SetFit.
Step 3: Data-Level Optimization (Balancing)
- What happens: If many classes are under 81 examples, underrepresented classes can be upsampled to 81 via character/word perturbations or LLM paraphrases; for few-shot methods, very large classes can be downsampled to stay balanced.
- Why it exists: Balanced classes help models treat all labels fairly and often improve Macro-F1.
- Example: If counts are {20, 45, 120}, upsample the first two to 81 (e.g., via LLM paraphrases) so all classes reach at least 81; then fine-tuning becomes viable.
Numerical example after upsampling: Starting counts {20, 45, 120} become {81, 81, 120}. Now n_min = 81 satisfies n_min ≥ 81, so the system can switch from SetFit to full fine-tuning.
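A toy upsampler in the spirit of the word-perturbation option. The neighbor-swap perturbation is a deliberately crude stand-in for real paraphrasing, and 81 is the target count from the text:

```python
import random

def perturb(text: str, rng: random.Random) -> str:
    """Crude word-level perturbation: swap two neighboring words."""
    words = text.split()
    if len(words) > 1:
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def upsample(texts: list[str], target: int = 81, seed: int = 0) -> list[str]:
    """Grow a class to `target` examples by perturbing random originals."""
    rng = random.Random(seed)
    out = list(texts)
    while len(out) < target:
        out.append(perturb(rng.choice(texts), rng))
    return out

grown = upsample(["please cancel my order", "cancel the order now"], target=81)
print(len(grown))  # 81
```

Real pipelines would prefer label-preserving paraphrases (LLM or rule-based) over blind swaps, since a bad perturbation can change the meaning of a short intent.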
Step 4: Train with the Selected Regime
- Few-shot (AncSetFit, when n_min is below the lower cutoff): Uses anchor labels (short descriptions) and triplet/contrastive learning to build a strong embedding space, then a simple classifier.
- Few-shot (SetFit, when n_min is between the lower cutoff and 81): Builds sentence embeddings via contrastive training and fits a logistic-regression head.
- Full fine-tuning (when n_min ≥ 81): Fine-tunes a transformer with early stopping; optionally runs Optuna with TPE to pick learning rate, batch size, and weight decay.
- Why it exists: Each regime is matched to the data size to avoid under/overfitting.
- Example: With 60 examples per smallest class, SetFit runs 20 contrastive iterations and trains a light classifier; with 120 examples per smallest class, the system fine-tunes a BERT checkpoint.
Step 5: Objective and Hyperparameter Search (Fine-tuning Only)
- What happens: Optimize Macro-F1 on a validation split using Optuna's TPE sampler over learning rate, batch size, and weight decay; early stopping with patience avoids overtraining.
- Why it exists: Fine-tuning has many knobs; a smart search finds a strong combo.
- Numerical example: Try learning rates 1e-5, 2e-5, and 5e-5. If 2e-5 yields Macro-F1 0.89 vs. 0.87 and 0.88, keep 2e-5.
Step 6: OOD Detection (Optional)
- What happens: Choose an OOD method matched to the regime (e.g., maximum softmax probability for SetFit, Mahalanobis distance for fine-tuned encoders, or an explicit out-of-scope class). Generate some synthetic OOD (like gibberish) and tune the threshold on validation.
- Why it exists: Real systems face strange inputs; a thresholded OOD score lets the model say "I'm not sure."
- Example: If the OOD score threshold is tuned so that only 1 in 100 in-scope examples is mistakenly flagged, you get cautious but accurate OOD behavior.
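The maximum-softmax-probability variant with that "1 in 100" tuning can be sketched end to end. All logits below are synthetic, and the percentile-based tuning is one straightforward way to hit a target false-flag rate, not necessarily the library's exact procedure:

```python
import math

def softmax(logits: list[float]) -> list[float]:
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def msp(logits: list[float]) -> float:
    """Maximum softmax probability: higher means more 'familiar'."""
    return max(softmax(logits))

def tune_threshold(in_scope_logits: list[list[float]], fpr: float = 0.01) -> float:
    """Pick a threshold that falsely flags roughly `fpr` of in-scope data."""
    scores = sorted(msp(l) for l in in_scope_logits)
    k = int(len(scores) * fpr)  # allow this many in-scope false flags
    return scores[k]

def is_ood(logits: list[float], threshold: float) -> bool:
    return msp(logits) < threshold

# 100 confidently classified in-scope validation examples (synthetic).
in_scope = [[5.0, 0.0, 0.0] for _ in range(100)]
thr = tune_threshold(in_scope)
print(is_ood([0.1, 0.0, 0.1], thr))  # True: flat logits look unfamiliar
print(is_ood([6.0, 0.0, 0.0], thr))  # False: confident prediction passes
```

Raising `fpr` makes the system more cautious (more "I don't know" answers); lowering it favors answering at the risk of confident mistakes.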
Step 7: LLM-Assisted Data Generation (Optional)
- What happens: When data is scarce, an LLM creates new training or test examples following your label definitions.
- Why it exists: Extra realistic examples boost learning and give a proxy test set when you lack one.
- Example: With only 8 examples per class, generate 40 more paraphrases per class to reach stronger few-shot training.
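The augmentation loop itself is simple; the hard part is the generation. In this sketch `paraphrase` is a stub standing in for a real LLM call, so only the control flow reflects the idea, not any actual OpenAutoNLU API:

```python
def paraphrase(text: str, variant: int) -> str:
    """Stub: a real implementation would prompt an LLM for a rewrite."""
    return f"{text} (paraphrase {variant})"

def augment(examples: dict[str, list[str]], per_class: int = 40) -> dict[str, list[str]]:
    """Add `per_class` generated examples to each label's real examples."""
    out = {}
    for label, texts in examples.items():
        extra = [paraphrase(t, i) for i, t in enumerate(texts * per_class)][:per_class]
        out[label] = texts + extra
    return out

# 8 real examples + 40 generated ones = 48 per class.
data = augment({"cancel_order": ["please cancel my order"] * 8}, per_class=40)
print(len(data["cancel_order"]))  # 48
```

In practice you would also deduplicate and spot-check the generated texts, since LLM paraphrases can drift off-label.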
Step 8: NER Pipeline Details (if doing NER)
- What happens: Accepts offset- or bracket-style annotations; converts to BIO tags; stratified splitting preserves entity proportions; evaluates at the entity level (precision/recall/F1), including partial matches.
- Why it exists: NER needs careful tagging and fair evaluation so entities aren't split incorrectly.
- Example: "Visit New York City" → tags O B-LOC I-LOC I-LOC; partial-match scoring gives credit if "New York" is found but "City" is missed.
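The offset-to-BIO step can be sketched for whitespace tokens. This is a simplified illustration, not the library's code; real pipelines use proper tokenizers and also accept bracket-style annotations:

```python
def to_bio(text: str, spans: list[tuple[int, int, str]]) -> list[str]:
    """Convert character-offset entity spans to BIO tags over whitespace tokens."""
    tags = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)  # character offset of this token
        end = start + len(token)
        pos = end
        tag = "O"
        for s, e, label in spans:
            if start >= s and end <= e:  # token lies inside the entity span
                tag = ("B-" if start == s else "I-") + label
                break
        tags.append(tag)
    return tags

# "New York City" occupies characters 6-19 and is a LOC entity.
print(to_bio("Visit New York City", [(6, 19, "LOC")]))
# ['O', 'B-LOC', 'I-LOC', 'I-LOC']
```

B- marks the first token of an entity and I- its continuation, which is what lets entity-level scoring distinguish one three-token entity from three separate ones.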
Step 9: Export for Fast Inference (ONNX)
- What happens: Export to ONNX with tokenizer files and label mapping; at inference, the runner auto-detects hardware (GPU/CPU) and picks batch sizes that won't crash memory.
- Why it exists: Deployment must be fast, portable, and safe.
- Example: Move your model from training server to a CPU-only edge box and still get snappy responses.
Hook: Packing your lunch neatly makes it easy to grab and go.
The Concept (ONNX Export): What it is: A portable format that lets models run quickly on many devices. How it works: 1) Convert model graph, 2) bundle tokenizer and labels, 3) load on target hardware. Why it matters: Without ONNX, deployment can be slower or platform-tied. Anchor: Saving your trained model as ONNX so an app can run it in two lines of code.
Secret Sauce:
- The method chooser keyed by n_min, plus adaptive up/down-sampling, quietly moves you into the most effective regime.
- Built-in data diagnostics reduce label noise before it bites.
- Configurable OOD lets you tune safety vs. recall with a simple threshold factor.
04 Experiments & Results
The Test: The team measured how well OpenAutoNLU classifies intents and handles OOD across four well-known datasets: Banking77, HWU64, MASSIVE, and SNIPS. They reported fairness-minded Macro-F1 (overall in-domain quality), In-scope F1 (with no OOD noise in test), and OOD F1 (how well it says "this is out of scope").
The Competition: They compared against AutoIntent (intent-focused AutoML), AutoGluon (general AutoML), LightAutoML, and H2O. To be fair, they held the pretrained backbones as consistent as possible (e.g., BERT where supported), so differences would come from the AutoML logic and pipelines, not just representations.
Scoreboard with Context:
- In OOD-unaware settings (realistic: OOD shows up at test, but in-domain score is reported), OpenAutoNLU typically leads or ties on 3/4 datasets. AutoGluon slightly tops on Banking77 but at higher compute cost. Think: OpenAutoNLU getting an A while others get B-to-B+ on most subjects, and AutoGluon gets an A on just one subject but studies much longer.
- In clean, controlled tests without OOD in the test set, OpenAutoNLU again shines on HWU64, MASSIVE, and SNIPS and is very close on Banking77 (0.912 vs. 0.920). That's like getting 91.2% vs. 92.0%; both are strong, but OpenAutoNLU keeps this level consistently across multiple courses.
- For OOD detection: OpenAutoNLU's unsupervised OOD option is particularly strong, often beating supervised approaches; it balances in-domain accuracy with being cautious about strange inputs. In some cases, adding supervised OOD examples helps OOD F1 but can slightly reduce in-domain Macro-F1, showing a trade-off you can tune.
Surprising Findings:
- OpenAutoNLU's unsupervised OOD can be so good that adding labeled OOD doesn't always help overall results; sometimes it hurts in-domain classification.
- Data-aware selection plus light balancing (upsampling to 81 when many classes are small) often moves datasets into a regime where full fine-tuning wins, without the user changing any code.
- LLM-made test sets track real test sets closely in small/medium data ranges (differences under 5 percentage points), offering a practical proxy when a true test set isn't available.
Numbers Made Meaningful:
- Example in-scope Macro-F1 on full datasets (no OOD in test): OpenAutoNLU hits about 0.912 (Banking77), 0.890 (HWU64), 0.876 (MASSIVE), 0.921 (SNIPS). That's like scoring mostly A-/A across four classes.
- In tougher OOD-unaware tests (OOD present, but you grade only in-domain classes), OpenAutoNLU generally stays top-tier while being compute-efficient, a strong sign it's production-ready.
Takeaway: Across varied data sizes and OOD conditions, the simple "let the system pick the regime" idea pays off. You get solid accuracy, sensible OOD handling, and fast deployment, with less tinkering.
05 Discussion & Limitations
Limitations:
- Fixed thresholds for method switching (the lower few-shot cutoff and the 81-example fine-tuning boundary) are empirically good defaults but may be suboptimal for unusual datasets. A future meta-model could learn better cutoffs per domain.
- LLM-generated test sets are reliable mainly for small/medium regimes; for very large data, generated tests can drift and be less predictive of true performance.
- OOD trade-offs exist: boosting OOD F1 with supervision can reduce in-domain Macro-F1, so you must tune the threshold to your risk tolerance.
Required Resources:
- GPU recommended (for fine-tuning and speedy contrastive training), but ONNX export enables fast CPU inference after training.
- Access to an LLM endpoint (optional) for augmentation/test synthesis if you want data boosts.
When NOT to Use:
- Extremely tiny single-class or unbalanced datasets with nearly no variation may not benefit enough; consider collecting a few more examples per rare class first.
- Highly sensitive domains where synthetic (LLM) data is not allowed and you can't gather more labeled data may limit performance.
- If your latency budget cannot tolerate any OOD scoring overhead, you may run without OOD (but accept the risks).
Open Questions:
- Can a learned meta-model pick not just the training regime but also the best augmentation and OOD method per dataset automatically?
- How to further improve unsupervised OOD in close-to-in-domain settings without hurting in-domain accuracy?
- Can we better estimate label noise per annotator and auto-clean at scale for NER and classification alike?
- What are the best strategies to combine LLM-generated examples with real data without introducing bias or drift?
06 Conclusion & Future Work
Three-Sentence Summary: OpenAutoNLU is a text-first AutoML library that automatically chooses between few-shot learning and full fine-tuning based on your dataset's smallest class size, so you don't have to guess. It bundles data-quality checks, configurable OOD detection, NER support, LLM-based augmentation/test synthesis, and one-click ONNX export in a low-code API. In benchmarks, it's competitive with or better than other AutoML tools while being simpler and faster to deploy.
Main Achievement: Turning a tricky, many-knob NLP setup into a data-aware, mostly automatic pipeline that works across small, medium, and large data regimes with strong in-domain and OOD performance.
Future Directions: Replace fixed thresholds with a learned meta-model, refine unsupervised OOD (especially on close OOD), and deepen data-quality tooling for even better label cleanup. Expand multilingual coverage and domain adapters while maintaining low latency.
Why Remember This: It's a practical, batteries-included path from a handful of examples to a production-grade model, classification or NER, without changing your code. In other words, it's the smart lunch robot for NLP: it checks your fridge (data), picks the right recipe (training regime), improves ingredients (data quality), says "I don't know" when needed (OOD), and packs it to-go (ONNX).
Practical Applications
- Customer support intent routing that stays accurate even with few examples per new intent.
- Helpdesk triage that flags unfamiliar tickets as OOD instead of misrouting them.
- E-commerce product review classification with automatic data cleaning before training.
- On-device or on-premise text classifiers using ONNX export for low-latency inference.
- Healthcare note de-identification via NER with careful entity-level evaluation.
- Content moderation that rejects unusual or adversarial posts as OOD to avoid false approvals.
- Internal email tagging and prioritization with minimal configuration and robust OOD handling.
- Financial document NER for extracting organizations and amounts with stratified splitting.
- Low-resource languages or domains boosted by LLM-generated paraphrases and test sets.
- Rapid prototype-to-production pipelines where the same API handles few-shot and full-data cases.