MolHIT: Advancing Molecular-Graph Generation with Hierarchical Discrete Diffusion Models
Key Summary
- MolHIT is a new AI that builds molecules as graphs, moving from broad chemical groups to exact atoms step by step.
- It adds a middle “group” layer to the diffusion process (HDDM) so the model first decides the kind of atom, then the exact atom.
- It splits atom types by their roles (Decoupled Atom Encoding), like separating charged and aromatic atoms, so tricky motifs aren’t lost.
- On the MOSES benchmark, MolHIT reaches 99.1% validity and the best quality among graph models (94.2%) while still being creative (high scaffold novelty).
- On the wider GuacaMol set, MolHIT stays robust and generates more realistic charged molecules than past graph models.
- A new Project-and-Noise sampler plus temperature (top-p) sampling gives a clean control knob for the quality-versus-novelty trade-off.
- MolHIT beats strong 1D string models by keeping graph-level novelty without losing chemical correctness.
- It also works well when given goals (like target drug-likeness or molecular weight) and when asked to extend a given scaffold.
Why This Research Matters
Designing new molecules is slow and expensive, and tiny errors can make a design useless. MolHIT makes graph-based models both creative and chemically reliable, which can save months of lab time. By correctly handling aromatic and charged atoms, it better reflects real medicinal chemistry, not just simplified cases. The coarse-to-fine hierarchy turns hard jumps into easy steps, improving generation stability. Clear quality-versus-novelty control helps teams tune outputs for early exploration or late-stage refinement. Strong conditional generation lets scientists aim for target properties, speeding up lead optimization.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine building with LEGO bricks. If you follow the instructions, your spaceship looks real. If you place even one brick wrong, the wings won’t fit and the ship falls apart.
🥬 The Concept: Machine Learning Basics
- What it is: A way for computers to learn patterns from examples so they can make or create new things.
- How it works: (1) Show many examples; (2) The computer guesses; (3) We tell it how close it was; (4) It adjusts; (5) Repeat until it gets good.
- Why it matters: Without learning from examples, the computer would just guess randomly, making nonsense molecules.
🍞 Bottom Bread (Anchor): Show a computer many drug-like molecules; it learns which shapes look “right” so it can make new ones that also look “right.”
🍞 Top Bread (Hook): You know how weather forecasts talk in chances, like a 70% chance of rain? That’s using probabilities to make decisions.
🥬 The Concept: Probabilistic Modeling
- What it is: Teaching computers to use uncertainty numbers (probabilities) when making choices.
- How it works: (1) Look at data; (2) Estimate how likely each choice is; (3) Pick or sample; (4) Adjust those likelihoods with feedback.
- Why it matters: Molecules are combinations with many options; probabilities help the model choose sensible atoms and bonds.
🍞 Bottom Bread (Anchor): If there’s a 90% chance oxygen fits here and 10% chance sulfur fits, the model mostly picks oxygen—leading to more valid molecules.
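Here is that idea as a minimal Python sketch, using the toy oxygen-vs-sulfur numbers from the anchor above (the probabilities are illustrative, not from the paper):

```python
import random

# Toy probabilities for which atom fits at one position (illustrative numbers).
probs = {"O": 0.90, "S": 0.10}

# Sample an atom according to those probabilities, as a probabilistic model would.
atom = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
print(atom)  # prints "O" about 9 times out of 10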
🍞 Top Bread (Hook): A friendship map shows who’s connected to whom. Molecules are like that too—atoms are friends if a bond connects them.
🥬 The Concept: Graph Theory
- What it is: A way to represent things (nodes) and their connections (edges).
- How it works: (1) Make a node for each atom; (2) Draw edges for bonds; (3) Use math to understand paths, rings, and clusters.
- Why it matters: If we ignore connections, we may build impossible molecules.
🍞 Bottom Bread (Anchor): In benzene, the six carbon atoms form a ring; graph theory captures that ring clearly.
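You can see that ring structure recovered in a couple of lines with RDKit (a quick illustration, not part of MolHIT itself):

```python
from rdkit import Chem

# Benzene as a graph: six carbon nodes joined in a cycle.
benzene = Chem.MolFromSmiles("c1ccccc1")

# Graph analysis recovers the ring explicitly.
print(benzene.GetRingInfo().AtomRings())  # ((0, 1, 2, 3, 4, 5),) - one six-atom ring
```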
🍞 Top Bread (Hook): In chemistry sets, you learn that some pieces go together and others don’t—like magnets with plus and minus ends.
🥬 The Concept: Chemistry Fundamentals
- What it is: The rules atoms follow—valence, charge, and bonding—that ensure molecules are real and stable.
- How it works: (1) Each atom has allowed bonds; (2) Charges change how atoms connect; (3) Rings and double bonds have special patterns.
- Why it matters: If we break these rules, we generate molecules that can’t exist.
🍞 Bottom Bread (Anchor): Carbon usually makes four bonds; if a model gives it six, RDKit flags it as invalid.
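The anchor's valence check is easy to demonstrate with RDKit:

```python
from rdkit import Chem

# A carbon bonded to six carbons breaks valence rules; RDKit's sanitization
# rejects it, so MolFromSmiles returns None.
bad = Chem.MolFromSmiles("C(C)(C)(C)(C)(C)C")   # six bonds on the central carbon
good = Chem.MolFromSmiles("C(C)(C)C")           # isobutane: four bonds, fine
print(bad is None, good is None)                # True False
```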
🍞 Top Bread (Hook): Think of a subway map where stations are atoms and lines are bonds. The whole map is the molecule.
🥬 The Concept: Molecular Graphs
- What it is: A diagram where atoms are nodes and bonds are edges.
- How it works: (1) List atoms; (2) Add bond types (single, double, aromatic); (3) Keep track of special roles like charge or aromaticity.
- Why it matters: This picture lets AI learn real chemical structure, not just a string of letters.
🍞 Bottom Bread (Anchor): C–C single bonds and C=C double bonds look different in a graph, so the model treats them differently too.
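In code, walking a molecule as a graph looks like this (an RDKit illustration of the representation, not MolHIT's internal format):

```python
from rdkit import Chem

# Read a molecule as a graph: atoms are nodes, bonds are typed edges.
mol = Chem.MolFromSmiles("C=CCO")  # allyl alcohol: one C=C, then C-C and C-O
for atom in mol.GetAtoms():
    print(atom.GetIdx(), atom.GetSymbol(), atom.GetFormalCharge(), atom.GetIsAromatic())
for bond in mol.GetBonds():
    print(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondType())
# The bond loop prints DOUBLE, SINGLE, SINGLE - the graph keeps bond types distinct.
```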
The world before: Two big AI approaches tried to make new molecules. One treated molecules as 1D strings (like SMILES). This often gave very high validity—like spelling words correctly—but these models tended to copy common patterns, limiting true novelty. The other approach used 2D graphs that match how chemists think. These were great at exploring new structures but produced invalid or unstable molecules more often than the 1D models did.
The problem: Could we get the best of both worlds—keep the graph model’s creativity while matching the 1D model’s high validity?
Failed attempts: Past graph diffusion methods treated each atom type as totally separate and assumed the model would learn everything from scratch. They also used coarse atom tokens that ignored crucial properties like charges and aromaticity. That made it hard to reconstruct certain common motifs (like [nH] in drug-like rings), and the models struggled to output valid chemistry with realistic diversity.
The gap: We needed (1) a smarter way to guide the noising/denoising steps that respects chemical groups, and (2) better atom tokens that don’t hide important roles like “charged” or “aromatic.”
Real stakes: Better molecular generation can speed up drug discovery and materials design. That could mean finding medicines faster, cheaper solar materials, or greener chemicals. Even tiny improvements in validity and novelty can save enormous lab time and cost.
02 Core Idea
🍞 Top Bread (Hook): Imagine sketching a face. First you draw big shapes (oval for the head), then you add details (eyes, nose), and finally tiny features (eyelashes). Coarse-to-fine drawing gives better results.
🥬 The Concept: Diffusion Models
- What it is: A way for AI to learn by adding noise to data and then learning to remove it to recover the original.
- How it works: (1) Start with a clean example; (2) Add small random noise many times; (3) Train a model to undo each noisy step; (4) At generation time, start from noise and undo it step by step.
- Why it matters: This “undoing noise” skill lets AI create brand new, realistic samples.
🍞 Bottom Bread (Anchor): Start from a scrambled molecule and denoise it to a fresh, valid molecule.
🍞 Top Bread (Hook): Think of sorting crayons. First, you group them by warm vs. cool colors, then you pick the exact shade.
🥬 The Concept: Discrete Diffusion Models
- What it is: A diffusion method for categories (like atom types) instead of pixels.
- How it works: (1) Replace exact categories with broader or masked tokens step by step; (2) Learn to reverse that; (3) Sample by stepping back from masked/rough states to exact categories.
- Why it matters: Atoms and bonds are categories, so we need diffusion that handles choices, not just numbers.
🍞 Bottom Bread (Anchor): Instead of a pixel turning from 120 to 118, a token turns from “masked” to “nitrogen.”
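A toy sketch of the masked (absorbing-state) corruption for a single token; the linear schedule here is illustrative, not the paper's:

```python
import random

# Toy absorbing-state corruption: as the step t grows, the token is more
# likely to be replaced by [MASK]. The model is trained to reverse this.
def corrupt(token, t, T):
    return "[MASK]" if random.random() < t / T else token

print([corrupt("N", t, T=10) for t in range(11)])  # more masking at later steps
```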
🍞 Top Bread (Hook): When you shop for fruit, you might first choose “citrus,” then pick “orange,” then choose a specific variety.
🥬 The Concept: Hierarchical Discrete Diffusion Model (HDDM)
- What it is: A diffusion process with an extra mid-level category stage between “exact atom” and “masked.”
- How it works: (1) Add noise so exact atoms become mid-level groups (like {N,O,S}); (2) Then to a masked state; (3) During generation, go back from masked → group → exact atom; (4) Use schedules so these steps are consistent.
- Why it matters: Without the group stage, the model must jump directly from masked to exact, which is harder and causes more invalid atoms.
🍞 Bottom Bread (Anchor): First choose “halogen group,” then decide “chlorine,” leading to a more reliable final molecule.
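A minimal sketch of the coarse-to-fine reverse direction, assuming MOSES-style groups; the uniform random choices stand in for the trained model's learned probabilities:

```python
import random

# Hypothetical group table echoing the halogen example above.
GROUPS = {"{C}": ["C"], "{N,O,S}": ["N", "O", "S"], "{F,Cl,Br}": ["F", "Cl", "Br"]}

def reverse_step(token):
    """One refinement per call: masked -> group -> exact atom."""
    if token == "[MASK]":
        return random.choice(list(GROUPS))   # easy first decision: the broad family
    if token in GROUPS:
        return random.choice(GROUPS[token])  # exact pick, inside the right family
    return token                             # already an exact atom

state = reverse_step("[MASK]")  # e.g. "{F,Cl,Br}"
state = reverse_step(state)     # e.g. "Cl"
```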
🍞 Top Bread (Hook): If a recipe just says “use cheese,” you might pick the wrong kind. If it says “soft, salty, white cheese,” you’ll choose correctly.
🥬 The Concept: Decoupled Atom Encoding (DAE)
- What it is: Splitting atom tokens by their roles (like aromatic vs. aliphatic and charged vs. neutral) so the model sees the real chemistry.
- How it works: (1) Expand the atom vocabulary to include roles (like n, [nH], n+, c, c+, etc.); (2) Train and decode with these richer tokens; (3) Keep graphs simple (no extra hydrogen nodes) but keep chemical precision.
- Why it matters: Without these roles, the same token can mean different things, so the model can’t reconstruct key motifs or make valid charged species.
🍞 Bottom Bread (Anchor): Treating [nH] as its own token rescues drug-like ring motifs that older models frequently missed.
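A sketch of what role-aware tokenization might look like with RDKit; the function name and the exact vocabulary rules are assumptions for illustration, not the paper's implementation:

```python
from rdkit import Chem

def dae_token(atom):
    """Sketch of a role-aware atom token (aromaticity, charge, pyrrolic H).
    The paper's exact vocabulary may differ."""
    sym = atom.GetSymbol().lower() if atom.GetIsAromatic() else atom.GetSymbol()
    if sym == "n" and atom.GetTotalNumHs() > 0:
        sym = "[nH]"                     # pyrrolic nitrogen gets its own token
    if atom.GetFormalCharge() != 0:
        sym += "+" if atom.GetFormalCharge() > 0 else "-"
    return sym

mol = Chem.MolFromSmiles("c1cc[nH]c1")         # pyrrole
print([dae_token(a) for a in mol.GetAtoms()])  # ['c', 'c', 'c', '[nH]', 'c']
```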
The “Aha!” in one sentence: Teach the model to decide atoms in two steps (group then exact) and give it tokens that explicitly encode chemical roles—this makes valid and novel molecule generation much easier.
Multiple analogies:
- School to major to class: Pick a school (science vs. arts), then a major (chemistry), then a specific class (organic lab).
- Map to country to city: Zoom from the world map to a country, and only then choose the exact city.
- Sorting laundry: First separate by color, then by fabric, then by exact washing setting.
Before vs. After:
- Before: Graph models were creative but often invalid; 1D models were valid but less novel.
- After: MolHIT’s HDDM + DAE lets graph models stay creative while achieving near-1D validity on MOSES.
Why it works (intuition):
- The mid-level group gives the model an easier decision first (broad chemistry), so the final choice (exact atom) is made within the right neighborhood.
- Richer tokens (DAE) remove hidden ambiguity, so the model doesn’t guess roles from bonds alone.
- A sampling tweak (Project-and-Noise) avoids getting stuck and encourages exploration.
Building blocks:
- HDDM with group states and simple schedules.
- DAE vocabulary that encodes aromaticity and charge.
- A PN-sampler with temperature/top-p control for quality vs. novelty.
- A graph transformer backbone for node-edge prediction.
- Optional conditioning (via adaptive layer norm and classifier-free guidance) to target properties.
03 Methodology
At a high level: Input molecular graph → Add noise to atoms (hierarchical) and bonds (uniform) → Train denoiser → Sample with PN-sampler and temperature → Output a valid, novel molecule.
🍞 Top Bread (Hook): Imagine painting a wall. You first cover it with primer (mask), then add big color strokes (groups), and finally add details (exact shades).
🥬 The Concept: Forward Process in HDDM (atoms) and Uniform (bonds)
- What it is: The way we add noise to move from exact atoms → mid-level groups → masked, while bonds follow a simple uniform noising.
- How it works: (1) For atoms, use two schedules (α, β) to blend identity, group projection, and mask; (2) For bonds, use a uniform schedule that evenly randomizes bond categories; (3) Repeat across timesteps.
- Why it matters: If atoms and bonds are noised poorly, the reverse (denoise) step becomes too hard and leads to invalid chemistry.
🍞 Bottom Bread (Anchor): An exact “chlorine” slowly turns into a “halogen group” token, then eventually to “masked,” ready to be guessed back later.
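One plausible reading of that two-schedule blend, written as the marginal distribution of a single noisy atom token. This is a sketch under stated assumptions; the paper's exact kernel, schedules, and vocabulary may differ:

```python
import numpy as np

# Illustrative vocabulary: exact atoms, group tokens, and the absorbing mask.
VOCAB = ["C", "N", "O", "S", "{C}", "{N,O,S}", "[MASK]"]
GROUP_OF = {"C": "{C}", "N": "{N,O,S}", "O": "{N,O,S}", "S": "{N,O,S}"}

def forward_marginal(x, alpha_t, beta_t):
    """Assumed form: q(z_t | x) = alpha_t * exact + beta_t * group
    + (1 - alpha_t - beta_t) * mask, with alpha_t + beta_t <= 1."""
    q = np.zeros(len(VOCAB))
    q[VOCAB.index(x)] += alpha_t                        # keep the exact atom
    q[VOCAB.index(GROUP_OF[x])] += beta_t               # coarsen to its group
    q[VOCAB.index("[MASK]")] += 1.0 - alpha_t - beta_t  # absorb into the mask
    return q

print(forward_marginal("N", alpha_t=0.5, beta_t=0.3))
# Bonds, by contrast, would mix their one-hot with a uniform distribution over bond types.
```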
Step-by-step recipe:
- Inputs and encoding
- What happens: Read a molecule as a graph: atoms (nodes) and bonds (edges). Encode atoms with DAE so roles like [nH] and charges are explicit; encode bonds as categorical (single, double, aromatic…).
- Why it exists: Without rich atom roles, the model confuses similar-looking states and fails to reconstruct common drug motifs.
- Example: In indole, the pyrrolic nitrogen [nH] gets its own token, not just a generic “N.”
- Forward diffusion (training-time noising)
- What happens: Apply HDDM to atoms: exact → group → masked, controlled by α and β; apply uniform noising to bonds.
- Why it exists: The mid-level group reduces the difficulty gap between masked and exact states; bond uniformity stabilizes edge learning.
- Example: A neutral aromatic “c” moves to the aromatic-group bucket, and later to a mask; an aromatic bond becomes a uniformly random bond at later times.
- Denoiser (graph transformer)
- What happens: A graph transformer sees the noisy graph at time t and predicts the clean atom and bond distributions; the cross-entropy training loss ties back to ELBO guarantees.
- Why it exists: The network must learn to remove noise correctly or generation collapses.
- Example: Given a mid-level group “{N,O,S},” the model proposes high probability for “N” in this spot and suggests a single bond to a carbon neighbor.
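In code, that cross-entropy objective might be sketched like this; the tensor shapes and the edge weight `lam` are assumptions for illustration, not the paper's exact setup:

```python
import torch.nn.functional as F

def denoising_loss(node_logits, edge_logits, clean_atoms, clean_bonds, lam=1.0):
    """node_logits: [B, N, atom_vocab]; edge_logits: [B, N, N, bond_vocab];
    clean_atoms: [B, N]; clean_bonds: [B, N, N] (integer class labels)."""
    node_loss = F.cross_entropy(node_logits.flatten(0, 1), clean_atoms.flatten())
    edge_loss = F.cross_entropy(edge_logits.flatten(0, 2), clean_bonds.flatten())
    return node_loss + lam * edge_loss  # lam balances atoms vs. bonds (assumed)
```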
- Grouping strategy
- What happens: Use deterministic groups aligned with chemistry. On MOSES: {C}, {N,O,S}, {F,Cl,Br}, and {c,o,n,[nH],s}. On GuacaMol, extend with charged and heavier atoms.
- Why it exists: Meaningful groups make coarse decisions easier and more accurate.
- Example: “F,Cl,Br” share halogen behavior, so the model learns a sensible coarse choice before picking chlorine.
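Written out as a lookup table, the MOSES grouping named above looks like this (the GuacaMol version would extend the map with charged and heavier atoms):

```python
ATOM_GROUPS = {
    "{C}": ["C"],
    "{N,O,S}": ["N", "O", "S"],
    "{F,Cl,Br}": ["F", "Cl", "Br"],
    "{c,o,n,[nH],s}": ["c", "o", "n", "[nH]", "s"],
}
# Invert it to get the deterministic atom -> group projection used when noising.
GROUP_OF = {atom: group for group, atoms in ATOM_GROUPS.items() for atom in atoms}
```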
- PN-sampler for generation (sketched in code below)
🍞 Top Bread (Hook): When sketching, you might erase and redraw a line to explore better shapes.
🥬 The Concept: Project-and-Noise (PN) Sampler
- What it is: A sampling trick that takes the model’s guessed clean graph, snaps it to a discrete one-hot choice, then re-applies forward noise so denoising can continue from the previous, less-noisy timestep.
- How it works: (1) Predict clean atom/bond distributions; (2) Sample discrete atoms/bonds (project to one-hot); (3) Re-noise to the previous time using the known forward kernel; (4) Repeat.
- Why it matters: If we always stick to the exact reverse posterior, we may explore too little; PN boosts diversity while staying principled.
🍞 Bottom Bread (Anchor): The model picks “chlorine,” commits to it, then lightly re-noises to keep options open in neighbors, encouraging fresh structures.
- Temperature and top-p control (sketched in code below)
🍞 Top Bread (Hook): At an ice cream shop, you can choose from all flavors or only the top few best-sellers.
🥬 The Concept: Temperature and Top-p Sampling
- What it is: Methods to control how adventurous the model is when choosing tokens.
- How it works: (1) Temperature smooths or sharpens probabilities; (2) Top-p keeps only the most probable tokens that sum to p, then samples.
- Why it matters: Without control, you either get boring repeats or wild, invalid outputs.
🍞 Bottom Bread (Anchor): With top-p=0.9, MolHIT balances high validity (up to 99.4%) and strong novelty on MOSES.
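A generic implementation of that control knob (standard temperature plus nucleus sampling, not MolHIT-specific code):

```python
import numpy as np

def top_p_sample(probs, p=0.9, temperature=1.0, seed=None):
    """Temperature-sharpen, then sample from the smallest set of tokens
    whose probability mass reaches p."""
    rng = np.random.default_rng(seed)
    scaled = probs ** (1.0 / temperature)           # T<1 sharpens, T>1 flattens
    scaled /= scaled.sum()
    order = np.argsort(scaled)[::-1]                # most likely first
    cutoff = np.searchsorted(np.cumsum(scaled[order]), p) + 1
    keep = order[:cutoff]                           # the "nucleus" of tokens
    kept = scaled[keep] / scaled[keep].sum()
    return keep[rng.choice(len(keep), p=kept)]

print(top_p_sample(np.array([0.6, 0.3, 0.08, 0.02]), p=0.9))  # usually 0 or 1
```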
- Conditional modeling (optional; sketched in code below)
🍞 Top Bread (Hook): If you tell a chef you want a dish that’s low-salt and under 500 calories, they adjust the recipe.
🥬 The Concept: Classifier-Free Guidance (CFG) with Adaptive LayerNorm
- What it is: A way to steer generation toward target properties (like QED or MW) without a separate classifier.
- How it works: (1) Embed the target properties; (2) Use adaptive layer norm to modulate node features; (3) Blend guided and unguided predictions.
- Why it matters: Without guidance, hitting multiple properties at once is much harder.
🍞 Bottom Bread (Anchor): Ask for molecules with high QED and moderate MW; MolHIT tracks those targets closely.
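At sampling time, the classifier-free blend can be sketched in a few lines; `model` and the guidance weight `w` are placeholders, and the adaptive-layer-norm conditioning lives inside the network itself:

```python
def cfg_logits(model, noisy_graph, t, cond, w=2.0):
    """Classifier-free guidance: query the same network with and without
    the property condition, then blend the two predictions."""
    uncond = model(noisy_graph, t, cond=None)  # condition is also dropped during training
    guided = model(noisy_graph, t, cond=cond)
    return uncond + w * (guided - uncond)      # w=0: unconditional; larger w: stronger pull
```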
The secret sauce:
- The extra mid-level atom stage makes denoising easier and safer.
- DAE removes hidden ambiguities, so reconstruction is near-perfect on tough motifs.
- PN-sampling plus top-p gives a smooth quality–novelty dial.
- Simple, closed-form forward kernels and a cross-entropy loss tie back to principled ELBO training.
04 Experiments & Results
🍞 Top Bread (Hook): Think of a sports league where teams are graded not just on wins, but also on fair play and creativity. We want a model that “wins” (valid chemistry), plays fair (matches real data), and is creative (new scaffolds).
🥬 The Concept: Validity, Novelty, and Scaffold Novelty
- What it is: Validity checks chemistry rules; novelty checks if molecules are unseen; scaffold novelty checks if backbones are truly new.
- How it works: (1) RDKit sanitization for validity; (2) Compare with training set for novelty; (3) Use Bemis–Murcko scaffolds to count brand-new backbones.
- Why it matters: High novelty alone can be noisy junk; scaffold novelty rewards real structural innovation.
🍞 Bottom Bread (Anchor): A molecule that’s valid, unique, and has a never-seen-before scaffold is a big win for discovery.
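Both checks are easy to sketch with RDKit; the benchmarks' official scripts define the exact variants, so treat this as illustrative:

```python
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def validity(smiles_list):
    """Fraction of generated strings RDKit can parse and sanitize."""
    return sum(Chem.MolFromSmiles(s) is not None for s in smiles_list) / len(smiles_list)

def scaffold_novelty(gen_smiles, train_smiles):
    """Fraction of valid generated molecules whose Bemis-Murcko scaffold
    never appears in the training set."""
    train_scaffolds = {MurckoScaffold.MurckoScaffoldSmiles(s) for s in train_smiles}
    gen_scaffolds = [MurckoScaffold.MurckoScaffoldSmiles(s)
                     for s in gen_smiles if Chem.MolFromSmiles(s) is not None]
    return sum(sc not in train_scaffolds for sc in gen_scaffolds) / max(len(gen_scaffolds), 1)
```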
The tests and baselines:
- Datasets: MOSES (drug-like, mostly neutral) and GuacaMol (broader chemistry with charged species).
- Baselines: 1D sequence models (VAE, CharRNN, SAFE-GPT, GenMol) and 2D graph models (DiGress, DisCo, Cometh, DeFoG).
- Metrics: Validity, Uniqueness, Novelty, Filters, FCD, SNN, Scaffold Similarity, plus quality and new scaffold-based metrics.
Scoreboard on MOSES:
- Validity: MolHIT hits 99.1%, essentially matching top 1D approaches while staying a graph model.
- Quality: 94.2% (state of the art among graph models), meaning valid, unique, and drug-like with reasonable synthetic accessibility.
- Scaffold Novelty: 0.39, Pareto-strong; MolHIT finds new backbones better than others while keeping validity high.
- FCD: 1.03, competitive and better than many graph baselines (lower is better).
- Takeaway: This is like getting an A+ in “plays by the rules” while also scoring top marks in “builds new shapes.”
GuacaMol results:
- MolHIT is robust across validity, uniqueness, novelty, and KL; FCD is competitive but not the best.
- Crucially, with DAE, MolHIT restores realistic rates of charged and special atoms that baselines miss, reflecting the true data distribution better.
- Note: MolHIT was trained for fewer epochs than some baselines here; more training would likely improve FCD further.
Multi-property guided generation (on MOSES):
- Targets: QED, SA, logP, MW.
- MolHIT’s average MAE is 0.058 (about half the error of a strong marginal+DAE baseline) and average Pearson r ≈ 0.807.
- Validity stays high (≈96%), so guidance does not break chemistry.
- Meaning: When we ask for certain properties, MolHIT tracks them closely without falling apart.
Scaffold extension:
- Task: Given a test scaffold, generate full molecules that include it.
- MolHIT boosts validity (≈84%) and Hit@5 (≈9.8%), beating DiGress by a large margin.
- Interpretation: MolHIT can “finish the puzzle” around a fixed core better than past models.
Ablations (what matters most):
- Adding DAE to DiGress: Validity jumps (to ≈96%).
- Then PN-sampler: Quality rises strongly (to ≈92.9%).
- Then HDDM: Final boost to SOTA levels (quality ≈94.2%, FCD ≈1.03, validity ≈99.1%).
Surprises:
- Temperature (top-p) works very well for molecular graphs. With top-p around 0.8–0.9, quality and validity rise while novelty remains strong—like smartly narrowing options without becoming boring.
05 Discussion & Limitations
Limitations:
- Model size and architecture: The backbone is a standard graph transformer. Larger or specialized models could push results higher but weren’t explored.
- Training budget: On GuacaMol, MolHIT was trained for fewer epochs than some baselines; extended training would likely improve metrics like FCD further.
- Vocabulary complexity: DAE expands tokens (especially on GuacaMol). While this brings chemical precision, it can also make learning harder without enough data.
- 3D realism: This work focuses on 2D graphs. Some drug-relevant behaviors depend on 3D shape, which is not modeled here.
Required resources:
- A GPU setup that can handle graph transformers and diffusion training on millions of molecules.
- RDKit and preprocessing to compute properties and scaffolds.
- Careful tuning of sampling hyperparameters (temperature, top-p) and HDDM schedules.
When not to use:
- If you require guaranteed 3D-accurate geometry or protein-ligand fit out of the box—MolHIT is 2D and won’t replace docking or 3D generative models directly.
- If your domain rarely includes charged or aromatic species, the benefit of DAE may be smaller (though usually still helpful).
- If your dataset is extremely tiny, the expanded vocabulary could overfit without careful regularization.
Open questions:
- How large can HDDM+DAE scale with model size and data? Is there a clear path to match or beat best 1D FCD on GuacaMol with longer training?
- What is the best automated grouping for HDDM—can we learn groups unsupervised from chemistry data or use expert-curated ontologies?
- How to merge this with 3D generation so 2D validity pairs with 3D plausibility?
- Can PN-sampling be adaptively tuned during sampling to automatically target a chosen quality–novelty frontier?
06 Conclusion & Future Work
Three-sentence summary:
- MolHIT introduces a Hierarchical Discrete Diffusion Model that denoises atoms in two steps—group first, then exact—while bonds follow a stable uniform schedule.
- It pairs this with Decoupled Atom Encoding, which makes key roles like aromaticity and charge explicit, fixing reconstruction failures and boosting chemical validity.
- A Project-and-Noise sampler plus temperature/top-p control gives a reliable, tunable path to state-of-the-art quality and near-perfect validity on MOSES, with strong performance on GuacaMol and conditional tasks.
Main achievement:
- MolHIT closes the validity gap for graph-based models while keeping their hallmark novelty, setting a new Pareto frontier in the quality–novelty trade-off.
Future directions:
- Scale up the backbone and training budget, learn optimal HDDM groups automatically, extend to 3D generation and protein design, and refine adaptive sampling strategies.
Why remember this:
- The simple idea of adding a mid-level group and richer atom tokens transforms graph diffusion from “often invalid” to “reliably valid and creative.” It’s a blueprint for many other discrete generative tasks where coarse-to-fine choices and better tokens can unlock big gains.
Practical Applications
- Early-stage drug discovery: Generate diverse, valid scaffolds to jump-start hit finding.
- Lead optimization: Steer molecules toward higher QED or better SA while preserving core scaffolds.
- Materials design: Propose polymers or small molecules with targeted lipophilicity or mass.
- Scaffold hopping: Discover new backbones that keep activity but avoid patent overlaps.
- Library enrichment: Expand screening collections with valid, novel, and synthesizable candidates.
- Educational tool: Visualize how atom roles (aromatic, charged) affect molecular validity.
- Pre-filtering for docking: Generate valid 2D graphs that convert cleanly to 3D for downstream docking.
- Chemical space exploration: Use top-p to balance safe quality vs. daring novelty.
- Designing charged ligands: Restore realistic rates of charged motifs for transporter or enzyme targets.