Sentinel-2 Super-Resolution

Benchmarking Deep Learning Models for 10m to 2.5m Upsampling

Aman Bagrecha

2026-04-04

Agenda

  1. Why super-resolution for Sentinel-2?
  2. How does super-resolution work?
  3. Model families: CNN, GAN, Diffusion, Multi-image
  4. Benchmarked models & setup
  5. Visual comparison over Riyadh
  6. Performance & resource comparison
  7. Recommendations by use case

Why Super-Resolution for Sentinel-2?

The problem

  • Sentinel-2 delivers free, global, 5-day revisit imagery
  • But spatial resolution is limited: 10m for RGBN bands
  • Many applications need 2–5m detail:
    • Urban mapping & building footprints
    • Precision agriculture
    • Infrastructure monitoring
    • Change detection

The opportunity

  • Deep learning can hallucinate plausible detail from learned priors
  • 4x upsampling: 10m → 2.5m
  • Avoids cost of commercial VHR imagery
  • Can leverage temporal stacks for even better results
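The 4x factor is easiest to pin down against the crudest possible baseline: nearest-neighbour replication, which every learned model must beat. A minimal numpy sketch (array shapes are illustrative, not from the benchmark):

```python
import numpy as np

# Hypothetical 10 m RGBN chip: (bands, height, width)
lr = np.random.rand(4, 256, 256).astype(np.float32)

# 4x nearest-neighbour upsampling: 10 m -> 2.5 m pixel spacing.
# This adds no information -- deep SR models aim to replace this
# replication with plausible detail drawn from learned priors.
sr_naive = lr.repeat(4, axis=1).repeat(4, axis=2)

print(sr_naive.shape)  # (4, 1024, 1024)
```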

How Does Super-Resolution Work?

Core idea: Learn a mapping \(f: I_{LR} \rightarrow I_{SR}\) from low-resolution to high-resolution images.

Three main paradigms:

  1. Regression-based (CNN/Transformer)
    • Minimize pixel-wise loss (L1/L2)
    • Fast, deterministic, but can be blurry
  2. Adversarial (GAN-based)
    • Generator + discriminator
    • Sharper outputs, but can introduce artifacts
  3. Diffusion-based
    • Iterative denoising from noise to image
    • Highest perceptual quality, slowest inference
LR Image (10m)
     │
Feature Extraction
     │
     ├─ CNN ──────── L1/L2 loss
     │                   │
     ├─ GAN ──────── Adversarial loss
     │                   │
     ├─ Diffusion ── Score matching
     │                   │
     ▼                   ▼
            SR Output (2.5m)
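Why regression-based models tend to blur: under an L2 loss, the risk-minimizing prediction is the average of all plausible high-resolution outputs. A toy numpy illustration, where the two "targets" are a hypothetical sharp edge at two equally likely positions:

```python
import numpy as np

# Two equally plausible sharp explanations of the same low-res pixel row:
# an edge falling one pixel earlier or later.
hr_a = np.array([0.0, 0.0, 1.0, 1.0])
hr_b = np.array([0.0, 1.0, 1.0, 1.0])

def expected_l2(pred):
    """Expected squared error when each target is equally likely."""
    return 0.5 * np.sum((pred - hr_a) ** 2) + 0.5 * np.sum((pred - hr_b) ** 2)

blurry = (hr_a + hr_b) / 2  # pixel-wise mean: a soft 0.5 edge

# The blurred average beats either sharp answer in expectation,
# which is exactly what an L2-trained network learns to output.
assert expected_l2(blurry) < expected_l2(hr_a)
assert expected_l2(blurry) < expected_l2(hr_b)
```

Adversarial and diffusion losses sidestep this averaging by rewarding samples from the distribution of sharp outputs rather than its mean, which is why they look crisper but can hallucinate.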

Single-Image vs Multi-Image SR

Single-Image SR (SISR)

  • Uses one input image
  • Relies on learned spatial priors
  • Faster, simpler pipeline
  • Most models fall here

Used by: SEN2SRLite, EvoLand, DiffFuSR, OpenSR, L1BSR

Multi-Image SR (MISR)

  • Uses multiple revisits of the same area
  • Exploits sub-pixel shifts between acquisitions
  • Can recover real detail, not just hallucinated
  • More complex data pipeline

Used by: Satlas (8 images), WorldStrat (8 images)

Multi-image SR has a theoretical advantage: information from multiple views can resolve ambiguities that no single image can.
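A toy sketch of that advantage, assuming the revisits are already co-registered and differ only by noise (real MISR models like Satlas additionally exploit sub-pixel shifts, which this sketch ignores):

```python
import numpy as np

rng = np.random.default_rng(42)
truth = rng.random((64, 64))  # hypothetical true reflectance patch

# Eight noisy "revisits" of the same area (registration assumed done)
revisits = [truth + rng.normal(0.0, 0.1, truth.shape) for _ in range(8)]

single_err = np.abs(revisits[0] - truth).mean()
fused_err = np.abs(np.mean(revisits, axis=0) - truth).mean()

# Even naive averaging of 8 views cuts error a single view cannot:
# independent noise shrinks roughly by sqrt(8).
assert fused_err < single_err
```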

Models Benchmarked

| Model      | Type         | Bands | Input    | Output | Architecture               |
|------------|--------------|-------|----------|--------|----------------------------|
| SEN2SRLite | CNN          | RGBN  | 1 image  | 2.5m   | Lightweight CNN via mlstac |
| EvoLand    | CNN (ONNX)   | RGBN  | 1 image  | 2.5m   | ONNX-deployed spatial SR   |
| L1BSR      | CNN          | RGBN  | 1 image  | 5m     | Registration-aware SR      |
| DiffFuSR   | Diffusion    | RGB   | 1 image  | 2.5m   | Score-based diffusion      |
| OpenSR     | Diffusion    | RGBN  | 1 image  | 2.5m   | Latent diffusion (LDSR)    |
| Satlas     | GAN (ESRGAN) | RGB   | 8 images | 2.5m   | Multi-temporal ESRGAN      |

Test Area: Riyadh

Setup:

  • Location: Riyadh (46.68 E, 24.71 N)
  • Source: Sentinel-2 L2A from Planetary Computer
  • Scene: S2A_MSIL2A_20241016T072901
  • Input bands: B04, B03, B02, B08 (RGBN)
  • Area: 1024 x 1024 px at 10m

Input Sentinel-2 RGB (10m)

Full Scene Comparison

Input 10m

SEN2SRLite 2.5m

EvoLand 2.5m

DiffFuSR 2.5m

L1BSR 5m

Satlas 2.5m

Patch Comparison – Input vs SEN2SRLite vs OpenSR

Input (10m → upscaled)

SEN2SRLite (2.5m)

OpenSR (2.5m)

Both resolve building edges and road network well; SEN2SRLite appears slightly smoother while OpenSR adds finer texture detail.

Patch Comparison – SEN2SRLite vs EvoLand

Input (10m → upscaled)

SEN2SRLite (2.5m)

EvoLand (2.5m)

Very similar sharpness; EvoLand has slightly darker shadows and higher contrast compared to SEN2SRLite’s more neutral tone.

Patch Comparison – DiffFuSR vs OpenSR

Input (10m → upscaled)

DiffFuSR (2.5m)

OpenSR (2.5m)

Both diffusion models produce sharp results; DiffFuSR has slightly warmer tones, while OpenSR preserves a cooler radiometry closer to the input.

Patch Comparison – DiffFuSR vs Satlas

Input (10m → upscaled)

DiffFuSR (2.5m)

Satlas (2.5m, 8-image)

Satlas aggregates 8 dates, producing a noticeably different color tone; DiffFuSR is sharper on single-date detail, while Satlas resolves road markings better through temporal fusion.

Patch Comparison – L1BSR vs SEN2SRLite

Input (10m → upscaled)

L1BSR (5m)

SEN2SRLite (2.5m)

L1BSR outputs 5m (2x) rather than 2.5m (4x), so it appears slightly softer; building outlines are resolved but fine urban texture is less detailed.

Visual Observations Summary

| Model      | Observation                                                                                          |
|------------|------------------------------------------------------------------------------------------------------|
| SEN2SRLite | Clean sharpening with neutral colors; best all-round edge definition for buildings and roads          |
| OpenSR     | Fine texture detail from diffusion; slightly cooler tone than input                                   |
| EvoLand    | Comparable to SEN2SRLite in sharpness; darker shadows, higher contrast                                |
| DiffFuSR   | Natural-looking textures with warm tones; RGB-only limits multispectral use                           |
| L1BSR      | Good for 5m but noticeably softer than the 2.5m models on fine urban structure                        |
| Satlas     | Multi-date fusion produces a distinct color tone; resolves some details others miss but introduces temporal averaging |

Performance Comparison

Benchmarked on NVIDIA P100 GPU
| Model      | Runtime | GPU Mem | CPU Mem | Output        |
|------------|---------|---------|---------|---------------|
| SEN2SRLite | 9.1 s   | 0.44 GB | 1.12 GB | 2.5m, 4-band  |
| L1BSR      | 9.0 s   | 1.20 GB | –       | 5m, 4-band    |
| EvoLand    | 29.6 s  | 0 GB    | 1.76 GB | 2.5m, 4-band  |
| Satlas     | 34.8 s  | 0.59 GB | 1.03 GB | 2.5m, 3-band  |
| DiffFuSR   | 83.2 s  | 9.75 GB | 4.03 GB | 2.5m, 3-band  |
| OpenSR     | 136.8 s | 4.25 GB | 2.47 GB | 2.5m, 4-band  |

Key takeaway: CNN models are 10–15x faster than diffusion models and use only a fraction of the GPU memory.

Runtime vs GPU Memory

Practical Considerations

Deployment complexity

| Model      | Complexity                      |
|------------|---------------------------------|
| EvoLand    | Low – ONNX runtime only         |
| SEN2SRLite | Low – pip install               |
| L1BSR      | Medium – repo clone             |
| Satlas     | High – preprocessing pipeline   |
| DiffFuSR   | High – manual checkpoint wiring |
| OpenSR     | High – custom low-mem wrapper   |

Common failure modes

  • GPU OOM: OpenSR & DiffFuSR need large GPUs
  • Band ordering: EvoLand & L1BSR expect B02,B03,B04,B08 (not RGBN)
  • Package conflicts: Satlas needs pinned torch==2.1.0
  • Geo-referencing: Satlas output needs manual georeferencing
  • Domain mismatch: L1BSR designed for L1B, not L2A data
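The band-ordering pitfall in particular is cheap to guard against in code. A minimal sketch, assuming a channel-first stack built in RGBN order (B04, B03, B02, B08) as in the setup above:

```python
import numpy as np

# Stack loaded in RGBN order: B04 (R), B03 (G), B02 (B), B08 (NIR)
rgbn = np.random.rand(4, 512, 512).astype(np.float32)

# EvoLand / L1BSR expect native band order B02, B03, B04, B08:
# swap the red and blue channels, keep NIR last.
native_order = rgbn[[2, 1, 0, 3]]

assert np.array_equal(native_order[0], rgbn[2])  # B02 now first
assert np.array_equal(native_order[3], rgbn[3])  # B08 unchanged
```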

Recommendations

| Use Case                   | Best Model | Why                                        |
|----------------------------|------------|--------------------------------------------|
| Default baseline           | SEN2SRLite | Fast (9s), low memory, 4-band output       |
| Operational deployment     | EvoLand    | ONNX-based, no GPU required, deterministic |
| Multi-temporal analysis    | Satlas     | Only working multi-image model             |
| Highest perceptual quality | DiffFuSR   | Best textures, but RGB-only & GPU-heavy    |
| Modern diffusion pipeline  | OpenSR     | Geospatial-aware tiling, but slowest       |
| 5m is sufficient           | L1BSR      | Fastest, lightweight                       |

For urban regions, SEN2SRLite offers the best balance of sharpness, speed, and band coverage.

Key Takeaways

  1. CNN-based models (SEN2SRLite, EvoLand) offer the best speed-quality tradeoff for operational use
  2. Diffusion models produce perceptually pleasing results but at 10–15x the compute cost
  3. Multi-image SR (Satlas) is promising but requires complex preprocessing
  4. No single model wins everywhere – choice depends on:
    • Band requirements (RGB vs RGBN)
    • Latency constraints
    • Available GPU memory
    • Deployment complexity tolerance

Next Steps

  • Quantitative metrics (PSNR, SSIM) against VHR reference
  • Test on diverse land cover types
  • Evaluate full multispectral fusion (DiffFuSR 12-band path)
  • Operationalize best model in production pipeline

Thank You

Aman Bagrecha

All code, data, and benchmarks available in this repository.