Sentinel-2 Super-Resolution

Benchmarking Deep Learning Models for 10m to 2.5m Upsampling

Aman Bagrecha

2026-04-04

Agenda

  1. Why super-resolution for Sentinel-2?
  2. How does super-resolution work?
  3. Model families: CNN, GAN, Diffusion, Multi-image
  4. Benchmarked models & setup
  5. Visual comparison over Riyadh
  6. Performance & resource comparison
  7. Recommendations by use case

Why Super-Resolution for Sentinel-2?

The problem

  • Sentinel-2 delivers free, global, 5-day revisit imagery
  • But spatial resolution is limited: 10m for RGBN bands
  • Many applications need 2–5m detail:
    • Urban mapping & building footprints
    • Precision agriculture
    • Infrastructure monitoring
    • Change detection

The opportunity

  • Deep learning can hallucinate plausible detail from learned priors
  • 4x upsampling: 10m → 2.5m
  • Avoids cost of commercial VHR imagery
  • Can leverage temporal stacks for even better results
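The 4x factor is easiest to pin down against the crudest possible baseline: nearest-neighbour replication, which every learned model must beat. A minimal numpy sketch (array shapes are illustrative, not from the benchmark):

```python
import numpy as np

# Hypothetical 10 m RGBN chip: (bands, height, width)
lr = np.random.rand(4, 256, 256).astype(np.float32)

# 4x nearest-neighbour upsampling: 10 m -> 2.5 m pixel spacing.
# This adds no information -- deep SR models aim to replace this
# replication with plausible detail drawn from learned priors.
sr_naive = lr.repeat(4, axis=1).repeat(4, axis=2)

print(sr_naive.shape)  # (4, 1024, 1024)
```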

How Does Super-Resolution Work?

Core idea: Learn a mapping \(f: I_{LR} \rightarrow I_{SR}\) from low-resolution to high-resolution images.

Three main paradigms:

  1. Regression-based (CNN/Transformer)
    • Minimize pixel-wise loss (L1/L2)
    • Fast, deterministic, but can be blurry
  2. Adversarial (GAN-based)
    • Generator + discriminator
    • Sharper outputs, but can introduce artifacts
  3. Diffusion-based
    • Iterative denoising from noise to image
    • Highest perceptual quality, slowest inference
LR Image (10m)
     │
Feature Extraction
     │
     ├─ CNN ──────── L1/L2 loss
     │                   │
     ├─ GAN ──────── Adversarial loss
     │                   │
     ├─ Diffusion ── Score matching
     │                   │
     ▼                   ▼
            SR Output (2.5m)
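Why regression-based models tend to blur: under an L2 loss, the risk-minimizing prediction is the average of all plausible high-resolution outputs. A toy numpy illustration, where the two "targets" are a hypothetical sharp edge at two equally likely positions:

```python
import numpy as np

# Two equally plausible sharp explanations of the same low-res pixel row:
# an edge falling one pixel earlier or later.
hr_a = np.array([0.0, 0.0, 1.0, 1.0])
hr_b = np.array([0.0, 1.0, 1.0, 1.0])

def expected_l2(pred):
    """Expected squared error when each target is equally likely."""
    return 0.5 * np.sum((pred - hr_a) ** 2) + 0.5 * np.sum((pred - hr_b) ** 2)

blurry = (hr_a + hr_b) / 2  # pixel-wise mean: a soft 0.5 edge

# The blurred average beats either sharp answer in expectation,
# which is exactly what an L2-trained network learns to output.
assert expected_l2(blurry) < expected_l2(hr_a)
assert expected_l2(blurry) < expected_l2(hr_b)
```

Adversarial and diffusion losses sidestep this averaging by rewarding samples from the distribution of sharp outputs rather than its mean, which is why they look crisper but can hallucinate.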

Single-Image vs Multi-Image SR

Single-Image SR (SISR)

  • Uses one input image
  • Relies on learned spatial priors
  • Faster, simpler pipeline
  • Most models fall here

Used by: SEN2SRLite, EvoLand, DiffFuSR, OpenSR, L1BSR

Multi-Image SR (MISR)

  • Uses multiple revisits of the same area
  • Exploits sub-pixel shifts between acquisitions
  • Can recover real detail, not just hallucinated
  • More complex data pipeline

Used by: Satlas (8 images), WorldStrat (8 images)

Multi-image SR has a theoretical advantage: information from multiple views can resolve ambiguities that no single image can.
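A toy sketch of that advantage, assuming the revisits are already co-registered and differ only by noise (real MISR models like Satlas additionally exploit sub-pixel shifts, which this sketch ignores):

```python
import numpy as np

rng = np.random.default_rng(42)
truth = rng.random((64, 64))  # hypothetical true reflectance patch

# Eight noisy "revisits" of the same area (registration assumed done)
revisits = [truth + rng.normal(0.0, 0.1, truth.shape) for _ in range(8)]

single_err = np.abs(revisits[0] - truth).mean()
fused_err = np.abs(np.mean(revisits, axis=0) - truth).mean()

# Even naive averaging of 8 views cuts error a single view cannot:
# independent noise shrinks roughly by sqrt(8).
assert fused_err < single_err
```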

Models Benchmarked

| Model      | Type         | Bands | Input    | Output | Architecture               |
|------------|--------------|-------|----------|--------|----------------------------|
| SEN2SRLite | CNN          | RGBN  | 1 image  | 2.5m   | Lightweight CNN via mlstac |
| EvoLand    | CNN (ONNX)   | RGBN  | 1 image  | 2.5m   | ONNX-deployed spatial SR   |
| L1BSR      | CNN          | RGBN  | 1 image  | 5m     | Registration-aware SR      |
| DiffFuSR   | Diffusion    | RGB   | 1 image  | 2.5m   | Score-based diffusion      |
| OpenSR     | Diffusion    | RGBN  | 1 image  | 2.5m   | Latent diffusion (LDSR)    |
| Satlas     | GAN (ESRGAN) | RGB   | 8 images | 2.5m   | Multi-temporal ESRGAN      |

Test Area: Riyadh

Setup:

  • Location: Riyadh (46.68 E, 24.71 N)
  • Source: Sentinel-2 L2A from Planetary Computer
  • Scene: S2A_MSIL2A_20241016T072901
  • Input bands: B04, B03, B02, B08 (RGBN)
  • Area: 1024 x 1024 px at 10m

Input Sentinel-2 RGB (10m)

Full Scene Comparison

Input 10m

SEN2SRLite 2.5m

EvoLand 2.5m

DiffFuSR 2.5m

L1BSR 5m

Satlas 2.5m

Patch Comparison – Input vs SEN2SRLite vs OpenSR

Input (10m → upscaled)

SEN2SRLite (2.5m)

OpenSR (2.5m)

Both resolve building edges and road network well; SEN2SRLite appears slightly smoother while OpenSR adds finer texture detail.

Patch Comparison – SEN2SRLite vs EvoLand

Input (10m → upscaled)

SEN2SRLite (2.5m)

EvoLand (2.5m)

Very similar sharpness; EvoLand has slightly darker shadows and higher contrast compared to SEN2SRLite’s more neutral tone.

Patch Comparison – DiffFuSR vs OpenSR

Input (10m → upscaled)

DiffFuSR (2.5m)

OpenSR (2.5m)

Both diffusion models produce sharp results; DiffFuSR has slightly warmer tones, while OpenSR preserves a cooler radiometry closer to the input.

Patch Comparison – DiffFuSR vs Satlas

Input (10m → upscaled)

DiffFuSR (2.5m)

Satlas (2.5m, 8-image)

Satlas aggregates 8 dates, producing a noticeably different color tone; DiffFuSR is sharper on single-date detail, while Satlas resolves road markings better through temporal fusion.

Patch Comparison – L1BSR vs SEN2SRLite

Input (10m → upscaled)

L1BSR (5m)

SEN2SRLite (2.5m)

L1BSR outputs 5m (2x) rather than 2.5m (4x), so it appears slightly softer; building outlines are resolved but fine urban texture is less detailed.

Visual Observations Summary

| Model      | Observation                                                                                          |
|------------|------------------------------------------------------------------------------------------------------|
| SEN2SRLite | Clean sharpening with neutral colors; best all-round edge definition for buildings and roads          |
| OpenSR     | Fine texture detail from diffusion; slightly cooler tone than input                                   |
| EvoLand    | Comparable to SEN2SRLite in sharpness; darker shadows, higher contrast                                |
| DiffFuSR   | Natural-looking textures with warm tones; RGB-only limits multispectral use                           |
| L1BSR      | Good for 5m but noticeably softer than the 2.5m models on fine urban structure                        |
| Satlas     | Multi-date fusion produces a distinct color tone; resolves some details others miss but introduces temporal averaging |

Performance Comparison

Benchmarked on NVIDIA P100 GPU
| Model      | Runtime | GPU Mem | CPU Mem | Output        |
|------------|---------|---------|---------|---------------|
| SEN2SRLite | 9.1 s   | 0.44 GB | 1.12 GB | 2.5m, 4-band  |
| L1BSR      | 9.0 s   | 1.20 GB | –       | 5m, 4-band    |
| EvoLand    | 29.6 s  | 0 GB    | 1.76 GB | 2.5m, 4-band  |
| Satlas     | 34.8 s  | 0.59 GB | 1.03 GB | 2.5m, 3-band  |
| DiffFuSR   | 83.2 s  | 9.75 GB | 4.03 GB | 2.5m, 3-band  |
| OpenSR     | 136.8 s | 4.25 GB | 2.47 GB | 2.5m, 4-band  |

Key takeaway: CNN models are 10–15x faster than diffusion models and use only a fraction of the GPU memory.

Runtime vs GPU Memory

Practical Considerations

Deployment complexity

| Model      | Complexity                      |
|------------|---------------------------------|
| EvoLand    | Low – ONNX runtime only         |
| SEN2SRLite | Low – pip install               |
| L1BSR      | Medium – repo clone             |
| Satlas     | High – preprocessing pipeline   |
| DiffFuSR   | High – manual checkpoint wiring |
| OpenSR     | High – custom low-mem wrapper   |

Common failure modes

  • GPU OOM: OpenSR & DiffFuSR need large GPUs
  • Band ordering: EvoLand & L1BSR expect B02,B03,B04,B08 (not RGBN)
  • Package conflicts: Satlas needs pinned torch==2.1.0
  • Geo-referencing: Satlas output needs manual georeferencing
  • Domain mismatch: L1BSR designed for L1B, not L2A data
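The band-ordering pitfall in particular is cheap to guard against in code. A minimal sketch, assuming a channel-first stack built in RGBN order (B04, B03, B02, B08) as in the setup above:

```python
import numpy as np

# Stack loaded in RGBN order: B04 (R), B03 (G), B02 (B), B08 (NIR)
rgbn = np.random.rand(4, 512, 512).astype(np.float32)

# EvoLand / L1BSR expect native band order B02, B03, B04, B08:
# swap the red and blue channels, keep NIR last.
native_order = rgbn[[2, 1, 0, 3]]

assert np.array_equal(native_order[0], rgbn[2])  # B02 now first
assert np.array_equal(native_order[3], rgbn[3])  # B08 unchanged
```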

Recommendations

| Use Case                   | Best Model | Why                                        |
|----------------------------|------------|--------------------------------------------|
| Default baseline           | SEN2SRLite | Fast (9s), low memory, 4-band output       |
| Operational deployment     | EvoLand    | ONNX-based, no GPU required, deterministic |
| Multi-temporal analysis    | Satlas     | Only working multi-image model             |
| Highest perceptual quality | DiffFuSR   | Best textures, but RGB-only & GPU-heavy    |
| Modern diffusion pipeline  | OpenSR     | Geospatial-aware tiling, but slowest       |
| 5m is sufficient           | L1BSR      | Fastest, lightweight                       |

For urban regions, SEN2SRLite offers the best balance of sharpness, speed, and band coverage.

Key Takeaways

  1. CNN-based models (SEN2SRLite, EvoLand) offer the best speed-quality tradeoff for operational use
  2. Diffusion models produce perceptually pleasing results but at 10–15x the compute cost
  3. Multi-image SR (Satlas) is promising but requires complex preprocessing
  4. No single model wins everywhere – choice depends on:
    • Band requirements (RGB vs RGBN)
    • Latency constraints
    • Available GPU memory
    • Deployment complexity tolerance

Next Steps

  • Quantitative metrics (PSNR, SSIM) against VHR reference
  • Test on diverse land cover types
  • Evaluate full multispectral fusion (DiffFuSR 12-band path)
  • Operationalize best model in production pipeline

Thank You

Aman Bagrecha

All code, data, and benchmarks available in this repository.