Generative Deep Learning for Molecular SMILES Generation

A comparative study of graph‐to‐sequence models using GNNs, Graphormer, and Birdie training for chemically valid molecule design.

Introduction

This experiment investigates advanced graph-to-sequence deep learning approaches for molecular design, focusing on generating chemically valid SMILES strings from molecular graphs. We compare:

  1. GNN→Transformer: A baseline message-passing GNN encoder with a Transformer decoder.
  2. Graphormer→Transformer: A structurally-biased Transformer encoder with explicit centrality, spatial, and edge encodings.
  3. Graphormer + Birdie→Transformer: The Graphormer model further trained with stochastic, corruption-based objectives (infilling, deshuffling, selective copying, standard copying) to improve robustness.

All models are trained with a combined cross-entropy and REINFORCE‐style validity loss, rewarding chemically valid outputs via RDKit validation.
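A minimal sketch of how such a combined objective can be wired up in PyTorch (the mixing weight, the 0/1 reward, and the mean baseline are illustrative assumptions, not the report's exact settings):

```python
import torch
import torch.nn.functional as F
from rdkit import Chem, RDLogger

RDLogger.DisableLog("rdApp.*")  # silence RDKit parse warnings during sampling

def combined_loss(logits, labels, sampled_smiles, sampled_logprobs, lam=0.5):
    """Teacher-forced cross-entropy plus a REINFORCE-style validity term.
    logits: [B, T, V]; labels: [B, T]; sampled_logprobs: [B] summed token
    log-probs of sequences sampled from the decoder."""
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    # Reward 1.0 if RDKit parses the sampled SMILES, else 0.0 (assumed reward)
    rewards = torch.tensor(
        [float(Chem.MolFromSmiles(s) is not None) for s in sampled_smiles],
        device=logits.device,
    )
    baseline = rewards.mean()  # simple variance-reduction baseline
    rl = -((rewards - baseline) * sampled_logprobs).mean()
    return ce + lam * rl  # lam balances likelihood against the validity reward
```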

Background and Motivation

Discovering novel drug‐like molecules traditionally requires expensive, manual exploration of vast chemical spaces. Language‐modeling techniques applied to SMILES can generate syntactically valid strings but often miss underlying graph topology, producing chemically invalid compounds. Bridging this gap demands encoder–decoder architectures that respect molecular graphs’ structure while leveraging powerful sequence modeling.

Graph-Based Generative Models

Graph Neural Networks (GNNs) perform local message passing over nodes and edges, capturing atomic interactions, but they struggle with long-range dependencies and global structure (a consequence of over-smoothing). Graphormer extends Transformers to graphs by injecting:

  1. Centrality encodings (node degrees)
  2. Spatial encodings (shortest‐path distances)
  3. Edge encodings (bond‐type biases along paths)
  4. Virtual node ([VNode]) for global readout

This combination enables rich, topology-aware representations; the sketch below shows how the first two encodings enter self-attention.
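A minimal sketch of the centrality and spatial encodings, assuming learnable embedding tables and the additive attention-bias formulation from Ying et al. (edge encodings along paths are omitted for brevity):

```python
import torch
import torch.nn as nn

class GraphormerBias(nn.Module):
    """Centrality embeddings are added to node features; shortest-path
    distances become per-head additive biases on attention scores."""
    def __init__(self, d_model=512, num_heads=8, max_degree=64, max_dist=32):
        super().__init__()
        self.degree_emb = nn.Embedding(max_degree, d_model)    # centrality
        self.spatial_bias = nn.Embedding(max_dist, num_heads)  # spatial

    def forward(self, x, degrees, sp_dist):
        # x: [B, N, d_model]; degrees: [B, N] ints; sp_dist: [B, N, N] ints
        x = x + self.degree_emb(degrees.clamp(max=63))
        # [B, N, N, H] -> [B, H, N, N], added to QK^T scores before softmax
        bias = self.spatial_bias(sp_dist.clamp(max=31)).permute(0, 3, 1, 2)
        return x, bias
```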

Problem Statement

Research Objectives

  1. Effective graph encoding: Compare standard GNN vs. Graphormer encoders.
  2. SMILES accuracy & validity: Evaluate token-level accuracy and RDKit‐verified validity.
  3. Robustness via self-supervision: Apply Birdie training objectives to improve long-range dependency modeling.
  4. Comprehensive benchmarking: Quantitatively assess all variants on accuracy, validity, diversity, and noise resilience.

Contributions

  1. Architectural comparison of three encoder–decoder variants (GNN, Graphormer, Graphormer+Birdie).
  2. Structural inductive biases: Integration of centrality, spatial, and edge encodings into attention.
  3. Validity‐aware loss: A REINFORCE‐style auxiliary loss rewarding chemically valid outputs.
  4. Birdie multi-task pretraining: Randomized infilling, deshuffling, selective copying, and autoencoding objectives.
  5. Full PyTorch pipeline: From SDF parsing and feature extraction to batched training and inference.

Data Preparation

We processed a dataset of ~100K molecules from ChEMBL, ensuring chemical diversity and drug-like properties:

Data Processing: SDF → SMILES and Tokenization

```python
# Parse a molecule with RDKit and emit its canonical SMILES.
# "molecules.sdf" is a placeholder path for the ChEMBL SDF export.
from rdkit import Chem

supplier = Chem.SDMolSupplier("molecules.sdf")
molecule = next(m for m in supplier if m is not None)
smiles = Chem.MolToSmiles(molecule, canonical=True)
# e.g. smiles == "CC(=O)O"

# Character-level tokenization with start/end markers
tokens = ["<sos>"] + list(smiles) + ["<eos>"]
# ["<sos>", "C", "C", "(", "=", "O", ")", "O", "<eos>"]

# Decoder inputs and labels (simple pretraining): targets are shifted by one
dec_input_ids = tokens[:-1]  # ["<sos>", "C", "C", "(", "=", "O", ")", "O"]
labels = tokens[1:]          # ["C", "C", "(", "=", "O", ")", "O", "<eos>"]
```
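Character-level splitting would break two-letter elements such as Cl and Br into separate tokens; a regex-based tokenizer (the pattern commonly used in SMILES language modeling, shown here as an alternative rather than the report's method) keeps them whole:

```python
import re

# Splits SMILES into chemically meaningful tokens: bracket atoms, two-letter
# elements (Cl, Br), ring-closure digits, bonds, and branch parentheses.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles):
    return ["<sos>"] + SMILES_PATTERN.findall(smiles) + ["<eos>"]

tokenize_smiles("CC(=O)Cl")
# ['<sos>', 'C', 'C', '(', '=', 'O', ')', 'Cl', '<eos>']
```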

Model Architectures

GNN Encoder → Transformer Decoder

Figure: GNN→Transformer architecture. Input molecular graph is embedded via an MPNN (3 layers, hidden=128), global readout pools to a 512-d vector, then passed to an 8-layer causal Transformer decoder (d_model=512, heads=8) for SMILES generation.
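A condensed PyTorch skeleton of that wiring (the mean-pool readout and `nn.TransformerDecoder` are assumptions consistent with the caption's hyperparameters; the MPNN states are produced upstream, e.g. with torch_geometric):

```python
import torch
import torch.nn as nn

class GNNToTransformer(nn.Module):
    """Per-node states from the 3-layer MPNN (hidden=128) are mean-pooled,
    projected to a 512-d memory vector, and attended to by an 8-layer
    causal Transformer decoder that emits SMILES tokens."""
    def __init__(self, vocab_size, hidden=128, d_model=512,
                 num_heads=8, num_dec_layers=8):
        super().__init__()
        self.readout = nn.Linear(hidden, d_model)
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_dec_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, node_states, dec_input_ids):
        # node_states: [B, N, 128] from message passing; global mean readout
        memory = self.readout(node_states.mean(dim=1, keepdim=True))  # [B, 1, 512]
        tgt = self.tok_emb(dec_input_ids)                             # [B, T, 512]
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.decoder(tgt, memory, tgt_mask=mask.to(tgt.device))
        return self.lm_head(out)                                      # [B, T, V]
```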

Graphormer Encoder → Transformer Decoder

Figure: Graphormer→Transformer. The Graphormer encoder integrates centrality, spatial, and edge biases into self-attention across 6 layers (d_model=512). The pooled [VNode] token seeds a Transformer decoder configured identically to the baseline's.

Graphormer + Birdie → Transformer Decoder

Figure: Graphormer+Birdie→Transformer. Same architecture as the Graphormer model, with stochastic Birdie objectives (infilling, deshuffling, selective copying) applied per batch to encoder inputs and SMILES targets; a token-level sketch follows.
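A sketch of what the per-batch corruptions might look like at the token level (the objective mix, rates, and marker tokens are illustrative assumptions; plain copying is retained as the fourth objective, per the introduction):

```python
import random

def birdie_corrupt(tokens, mask_token="<mask>"):
    """Sample one Birdie objective per example and corrupt accordingly.
    Returns (corrupted_input, target); rates and markers are illustrative."""
    target = list(tokens)
    if len(tokens) < 4:
        return list(tokens), target  # too short to corrupt meaningfully
    objective = random.choice(["infill", "deshuffle", "selective_copy", "copy"])
    if objective == "infill":
        # Mask ~15% of tokens; the model reconstructs the originals
        corrupted = [mask_token if random.random() < 0.15 else t for t in tokens]
    elif objective == "deshuffle":
        # Shuffle a contiguous span; the model must restore the order
        i, j = sorted(random.sample(range(len(tokens)), 2))
        span = tokens[i:j]
        random.shuffle(span)
        corrupted = tokens[:i] + span + tokens[j:]
    elif objective == "selective_copy":
        # Mark a span inside the sequence; the target is just that span
        i, j = sorted(random.sample(range(len(tokens)), 2))
        corrupted = tokens[:i] + ["<copy>"] + tokens[i:j] + ["</copy>"] + tokens[j:]
        target = tokens[i:j]
    else:  # plain copying / autoencoding
        corrupted = list(tokens)
    return corrupted, target
```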

Experiments & Results

Loss and validity comparisons

| Model | Final Total Loss | Chemical Validity (%) |
| --- | --- | --- |
| GNN → Transformer | 1.10 | 81.7 |
| Graphormer → Transformer | 0.40 | 91.3 |
| Graphormer + Birdie → Transformer | 0.30 | 92.0 |
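The validity column is the fraction of sampled strings that RDKit can parse; a check of the kind the evaluation implies:

```python
from rdkit import Chem

def validity_rate(smiles_list):
    """Percentage of generated SMILES that RDKit parses into a molecule."""
    valid = sum(Chem.MolFromSmiles(s) is not None for s in smiles_list)
    return 100.0 * valid / len(smiles_list)

validity_rate(["CC(=O)O", "c1ccccc1", "C(("])  # -> 66.7 (last one is invalid)
```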

**Sample molecules from the Graphormer+Birdie model**


Findings:


Conclusion

Combining graph-aware self-attention with corruption-based pretraining substantially improves molecular SMILES generation. The Graphormer encoder captures global topology effectively, and the Birdie objectives reinforce decoder robustness, achieving a 92% chemical validity rate and strong convergence behavior.

Future Work


Citations

  1. Basnet, S. (2025). Generative Deep Learning Experiment in Molecule Generation (Report).
  2. Ying, C. et al. (2021). Do Transformers Really Perform Badly for Graph Representation? (Graphormer).
  3. Hu, W. et al. (2020). Strategies for Pre‐training Graph Neural Networks.
  4. Vaswani, A. et al. (2017). Attention Is All You Need.
  5. Krenn, M. et al. (2020). SELFIES: a robust molecular string representation.
  6. Kipf, T. & Welling, M. (2017). Semi‐Supervised Classification with Graph Convolutional Networks.
  7. RDKit: Open‐source cheminformatics; https://www.rdkit.org