Large File Handling in DVC for GIS: A Production-Ready Workflow

Geospatial datasets routinely exceed the practical limits of Git. Version-controlling them with DVC, as part of the Geospatial Data Versioning Fundamentals & Architecture discipline, decouples heavy binary payloads from repository metadata, so multi-gigabyte orthomosaics, LiDAR point clouds, and national-scale vector exports can be tracked, shared, and reproduced without repository bloat.

Prerequisites & Environment Setup

Before implementing a DVC-backed spatial workflow, confirm every item below:

Git 2.30+ with a clean working directory; Git LFS disabled (DVC replaces LFS for spatial workflows)
DVC 3.0+ installed via pip install "dvc[s3]" — substitute gcs, azure, or hdfs for your backend
Python 3.9+ with geopandas, rasterio, pyproj, and shapely installed
Object-storage bucket (AWS S3, GCS, Azure Blob, or MinIO) with IAM credentials that allow GetObject, PutObject, DeleteObject, and ListBucket
Multipart upload permitted on the bucket; S3 requires it for objects above 5 GB and enforces a minimum part size of 5 MiB for every part except the last
Coordinate reference system (EPSG:4326 or project CRS) defined and consistent across all input datasets
Precision tolerance standard documented — e.g. vertex snap tolerance 1e-8 degrees for EPSG:4326 data

DVC replaces heavy binary payloads with lightweight .dvc pointer files that record a content hash (MD5), the file size, and the workspace-relative path; the storage location itself comes from the remote configured in .dvc/config. The hash exists for content addressing and deduplication, not as a security control. For spatial data, the critical design consideration is whether embedded metadata (projections, CRS identifiers, attribute schemas, and tiling structures) is preserved alongside the binary payload, and how checksums are computed across chunked reads.

Core Algorithmic Patterns

Content-Addressable Storage and Pointer Files

DVC stores every tracked artifact in a content-addressable cache keyed by its hash. The pointer file (.dvc) contains the hash, size, and workspace-relative path, not the data itself; the remote location is resolved from .dvc/config. This differs from Git LFS, which substitutes a pointer for the file inside Git itself, whereas DVC keeps a separate .dvc file beside the Git-ignored data file. The spatial implication: two versions of the same orthomosaic with pixel-level edits produce entirely separate cache entries, making rollback a cache lookup rather than a delta reconstruction. For workflows where incremental vector changes are more common than full-file replacements, the delta tracking algorithm that underpins feature-level diffing can complement DVC by reducing upload volume for frequently updated layers.
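The content-addressing idea can be illustrated with a minimal sketch. It hashes a file in fixed-size chunks, so a multi-gigabyte raster never has to fit in RAM, and derives a cache-relative path from the digest. The files/md5/<first two characters>/<rest> layout is an assumption based on DVC 3.x conventions, and file_md5 and cache_relpath are hypothetical helper names, not DVC functions.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time (illustrative value)


def file_md5(path: Path, chunk_size: int = CHUNK_SIZE) -> str:
    """Hash a file incrementally; the digest equals hashing all bytes at once."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_relpath(md5_hex: str) -> Path:
    """Assumed DVC 3.x layout: two-character directory, remainder as file name."""
    return Path("files") / "md5" / md5_hex[:2] / md5_hex[2:]
```

Because the path is derived purely from content, editing a single pixel yields a different digest and therefore a different cache entry, which is why rollback reduces to a lookup.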

Pipeline DAG and Stage Dependency Resolution

dvc.yaml encodes a directed acyclic graph (DAG) of processing stages. DVC’s dependency resolver performs a topological sort at runtime, comparing stored hashes against current hashes for each deps entry. A stage re-executes only when at least one dependency hash changes. Spatial implication: for pipelines that emit many intermediate tiles (e.g. tiled COG exports), declare the tile directory as a single outs entry rather than listing individual files. DVC still hashes every tile, but it records them in one directory manifest, so dvc.yaml and dvc.lock stay compact as the tile count grows.
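The resolution logic described above can be approximated in a few lines. This is a simplified sketch, not DVC's implementation: stages_to_run is a hypothetical function, the inputs are plain dictionaries rather than parsed dvc.yaml, and a downstream stage is conservatively marked stale whenever an upstream stage is stale (DVC itself re-checks the regenerated output's hash).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def stages_to_run(stage_deps: dict, stage_outs: dict, changed_paths: set) -> list:
    """Return stale stages in a valid execution order.

    stage_deps:    {stage name: [paths the stage reads]}
    stage_outs:    {stage name: [paths the stage writes]}
    changed_paths: paths whose current hash differs from the recorded hash
    """
    # Map each output path back to the stage that produces it.
    producer = {out: stage for stage, outs in stage_outs.items() for out in outs}
    # Build stage -> set of upstream stages, the shape TopologicalSorter expects.
    graph = {
        stage: {producer[dep] for dep in deps if dep in producer}
        for stage, deps in stage_deps.items()
    }
    order = list(TopologicalSorter(graph).static_order())
    stale = set()
    for stage in order:
        dep_changed = any(dep in changed_paths for dep in stage_deps[stage])
        upstream_stale = any(up in stale for up in graph[stage])
        if dep_changed or upstream_stale:
            stale.add(stage)
    return [stage for stage in order if stage in stale]
```

With a clip stage feeding a tile stage, a change to the raw raster marks both stages stale, while an unrelated stats stage is skipped.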

Immutable Artifact Versioning via Git Tags

DVC integrates with Git tagging: a git tag v2024.06.23 committed alongside .dvc files fixes the exact hash of every tracked artifact at that point in time. Teams relying on pointer synchronization for raster datasets benefit directly — the tag provides the stable reference that clients resolve during dvc pull, eliminating hash mismatches between branches.

Production Workflow Implementation

The diagram below shows the full lifecycle from raw data ingestion through to team synchronization.

Step 1: Repository Initialization and Remote Configuration

Initialize a standard Git repository and layer DVC on top. Configure a remote storage backend that supports multipart uploads — essential for files exceeding 5 GB.

mkdir spatial-dvc-repo && cd spatial-dvc-repo
git init
dvc init
git add .dvc .dvcignore
git commit -m "Initialize DVC tracking"

Configure the remote with region-specific endpoints and parallelism tuned to your network:

# Add default remote
dvc remote add -d geodata s3://your-bucket-name/spatial-data

# Set the bucket region and parallel transfer job count
# (use endpointurl instead only for S3-compatible stores such as MinIO)
dvc remote modify geodata region us-east-1
dvc remote modify geodata jobs 8

# Commit remote config
git add .dvc/config
git commit -m "Configure S3 remote for large file transfers"

Step 2: Tracking Large Geospatial Assets

Add datasets using dvc add. DVC automatically computes hashes and generates .dvc files. For multi-file directories — tiled GeoTIFFs, shapefile components, or LAS/LAZ point clouds — track the parent directory to maintain structural integrity.

# Track a single large raster
dvc add data/raw/orthomosaic_2024.tif

# Track a directory containing shapefile components
dvc add data/raw/admin_boundaries_2024/

# Commit the generated pointer files
# dvc add writes its ignore rules to data/raw/.gitignore, next to the tracked data
git add data/raw/orthomosaic_2024.tif.dvc data/raw/admin_boundaries_2024.dvc data/raw/.gitignore
git commit -m "Track raw spatial assets via DVC"

For vector data that undergoes frequent attribute updates or geometry edits, the default DVC behaviour re-uploads entire files on any modification. Teams managing high-frequency vector revisions can significantly reduce bandwidth by pairing DVC with the delta tracking algorithm for vector data that stores only changed feature records as incremental patches, then wrapping the resulting patch file with a standard dvc add.
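To make the idea of feature-level patches concrete, here is a minimal sketch that compares two lists of GeoJSON-like features and keeps only what changed. It is an illustration under stated assumptions, not the delta tracking algorithm referenced above: feature_fingerprint and diff_features are hypothetical names, and every feature is assumed to carry a stable identifier under the key "id".

```python
import hashlib
import json


def feature_fingerprint(feature: dict) -> str:
    """Stable hash of a feature's geometry and properties (key order ignored)."""
    payload = json.dumps(
        {"geometry": feature.get("geometry"), "properties": feature.get("properties")},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def diff_features(old: list, new: list, id_key: str = "id") -> dict:
    """Return added and changed features plus the ids of removed features."""
    old_hashes = {f[id_key]: feature_fingerprint(f) for f in old}
    new_by_id = {f[id_key]: f for f in new}
    added = [f for fid, f in new_by_id.items() if fid not in old_hashes]
    changed = [
        f for fid, f in new_by_id.items()
        if fid in old_hashes and feature_fingerprint(f) != old_hashes[fid]
    ]
    removed = [fid for fid in old_hashes if fid not in new_by_id]
    return {"added": added, "changed": changed, "removed": removed}
```

Serializing the returned dictionary to a patch file and tracking that file with dvc add means an attribute edit on one feature uploads one record instead of the whole layer.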

Step 3: Pipeline Integration and Reproducible Processing

Tracking files is only half the workflow. Production GIS environments require automated, reproducible processing chains. DVC pipelines (dvc.yaml) capture command execution, input/output dependencies, parameter state, and emitted metrics.

# dvc.yaml — raster preprocessing stage
stages:
  preprocess_orthomosaic:
    cmd: >-
      python scripts/preprocess_raster.py
      data/raw/orthomosaic_2024.tif
      data/processed/orthomosaic_clipped.tif
    deps:
      - data/raw/orthomosaic_2024.tif
      - scripts/preprocess_raster.py
      - requirements.txt
    outs:
      - data/processed/orthomosaic_clipped.tif
    metrics:
      - metrics.json
    params:
      - params.yaml:
          - clip_bbox
          - target_crs
          - output_resolution

Run the pipeline with:

dvc repro

DVC caches outputs in .dvc/cache, addressed by content hash: the first two hex characters of the hash name a subdirectory and the remainder names the file. Subsequent runs skip unchanged stages, saving hours of compute time. Always externalize spatial parameters (CRS, resolution, clipping extents) into params.yaml to enable rapid experimentation without modifying pipeline definitions.

Step 4: Push, Pull, and Team Collaboration

Once local processing completes, push tracked files and pipeline metadata:

dvc push
git push

Team members restore the full workspace from scratch with:

git clone <repo-url> && cd spatial-dvc-repo
dvc pull

Use .dvcignore to prevent accidental tracking of temporary files, GDAL sidecars, and application states:

# .dvcignore
*.aux.xml
*.ovr
*.qgz
*.lock
__pycache__/
*.tmp

When integrating PostGIS database dumps or GeoJSON feature collections alongside binary rasters, see how to configure DVC for PostGIS and GeoJSON for connection pooling, transaction isolation, and schema versioning alignment with the immutable artifact model.

Code Reliability Patterns

Defensive DVC usage for spatial pipelines centres on three failure surfaces: hash verification, partial transfers, and stale cache.

Hash verification before merge. Before promoting a feature branch that updates a tracked raster, verify the pointer resolves cleanly:

import json
import subprocess
import sys

def verify_dvc_pointer(dvc_file: str) -> bool:
    """Return True if the local cache and the remote agree for this .dvc pointer."""
    result = subprocess.run(
        # --json is needed: the default output is human-readable text, not JSON
        ["dvc", "status", "--cloud", "--json", dvc_file],
        capture_output=True, text=True
    )
    if result.returncode != 0:
        print(f"dvc status error: {result.stderr}", file=sys.stderr)
        return False
    status = json.loads(result.stdout) if result.stdout.strip() else {}
    if status:
        print(f"Out-of-sync: {status}", file=sys.stderr)
        return False
    return True

if not verify_dvc_pointer("data/raw/orthomosaic_2024.tif.dvc"):
    sys.exit(1)

Rollback logic. If a dvc repro stage fails mid-pipeline, the previous cache entry remains intact. Roll back the output to the last known-good state:

# Pipeline outputs are recorded in dvc.lock, not in a separate .dvc file
git checkout HEAD~1 -- dvc.lock
dvc checkout data/processed/orthomosaic_clipped.tif

Tolerance snapping before tracking. Snap vertices to the project tolerance before hashing to prevent spurious hash changes from floating-point drift introduced by CRS reprojection:

from shapely import set_precision
import geopandas as gpd

gdf = gpd.read_file("data/raw/admin_boundaries.gpkg")
gdf.geometry = gdf.geometry.apply(lambda g: set_precision(g, grid_size=1e-8))
gdf.to_file("data/raw/admin_boundaries_snapped.gpkg", driver="GPKG")

Performance and Scale Considerations

Scenario	Recommendation
Files > 5 GB	Ensure multipart upload is enabled; set `dvc remote modify geodata jobs 8` or higher
10 Gbps+ network	Raise `jobs` to 16–32; benchmark with `dvc push --verbose` to identify per-file overhead
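Memory-constrained CI runners	Set `DVC_NO_ANALYTICS=1` to skip telemetry; lower parallelism with `dvc pull --jobs 4` and pass explicit targets (e.g. `dvc pull data/raw/orthomosaic_2024.tif.dvc`) so only required files are fetched
Archives > 50 GB	DVC addresses content by MD5 for deduplication, not security; where collision resistance or tamper evidence matters, record a SHA-256 digest of each archive alongside the pointer and verify it after `dvc pull`
Tiled COG exports	Track the tile directory as one `outs` entry so `dvc.lock` holds a single directory hash instead of one entry per tile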
Satellite or unstable links	Wrap `dvc push` and `dvc pull` in a shell retry loop with exponential backoff
Local disk pressure	Run `dvc gc -w` to evict local cache entries not referenced by the current workspace; do not add `-c` unless you also intend to delete those objects from the remote
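The retry recommendation for unstable links can be sketched as follows. This is a Python equivalent of the shell loop mentioned in the table, written under assumptions: run_with_backoff is a hypothetical helper, and the attempt count and base delay are illustrative defaults rather than tuned values. The run and sleep parameters exist only so the function can be exercised without a real remote.

```python
import subprocess
import time


def run_with_backoff(cmd, attempts=5, base_delay=2.0, run=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits 0, doubling the wait after each failure.

    Returns the number of the attempt that succeeded.
    Raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        result = run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Waits of base_delay, 2x, 4x, ... between consecutive attempts.
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"{' '.join(cmd)} failed after {attempts} attempts")
```

A typical call would be run_with_backoff(["dvc", "push"]). Because DVC skips objects that already exist on the remote, each retry should only transfer what the previous attempt did not finish.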

Link type tuning. Hardlinks avoid data duplication when the cache lives on the same filesystem as the workspace:

dvc config cache.type hardlink,symlink,copy
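Whether hardlinks are possible depends on the cache and the workspace sharing a filesystem. A quick diagnostic, sketched here with hypothetical helper names, compares the device ids of the two directories. Matching device ids are a necessary condition, not a guarantee, since some filesystems do not support hardlinks at all; DVC falls back through the configured list on its own, so this check only explains which link type to expect.

```python
import os


def same_device(cache_dir: str, workspace_dir: str) -> bool:
    """True when both directories report the same device id (st_dev)."""
    return os.stat(cache_dir).st_dev == os.stat(workspace_dir).st_dev


def suggested_cache_types(cache_dir: str, workspace_dir: str) -> str:
    """Suggest a cache.type value; hardlink is dropped for cross-device setups."""
    if same_device(cache_dir, workspace_dir):
        return "hardlink,symlink,copy"
    return "symlink,copy"
```

The returned string can be passed to dvc config cache.type as shown above.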

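Windowed raster reads in pipeline scripts. Use rasterio’s window-based reading to avoid loading entire GeoTIFFs into RAM during processing stages. The bounds must be expressed in the raster's own CRS; the example below assumes the orthomosaic is stored in EPSG:4326: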

import rasterio
from rasterio.windows import from_bounds

with rasterio.open("data/raw/orthomosaic_2024.tif") as src:
    window = from_bounds(
        left=-74.02, bottom=40.68, right=-73.95, top=40.75,
        transform=src.transform
    )
    data = src.read(window=window)   # only the clip region enters RAM

Troubleshooting and Failure Modes

Symptom	Root cause	Fix
`ERROR: failed to pull data from the cloud`	IAM credentials expired or missing `ListBucket` permission	Re-export credentials; verify bucket policy grants `s3:ListBucket` to the service account
`dvc status` shows files as `modified` after `dvc checkout`	Cache link type incompatible with filesystem (e.g. hardlinks across mount points)	Set `dvc config cache.type copy` for cross-device setups
Hash mismatch on `dvc push` for large raster	Partial upload from a previous interrupted transfer left a corrupt object	Delete the partial object in the bucket console, then re-run `dvc push`
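`dvc repro` re-runs all stages despite no code changes	`dvc.lock` is missing or was never committed, or a listed dependency's content changed (e.g. `requirements.txt` regenerated with different pins); DVC compares content hashes, so a timestamp change alone does not trigger a re-run	Commit `dvc.lock` after every `dvc repro`; run `dvc status` to see which dependency changed; pin `requirements.txt` with `pip freeze`
`OSError: [Errno 28] No space left on device` during `dvc pull`	DVC cache fills available disk space	Run `dvc gc -w` to evict unused local entries (adding `-c` would also delete them from the remote); move cache to a larger volume with `dvc config cache.dir`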
QGIS `.qgz` or `.aux.xml` sidecars tracked accidentally	Files were in a tracked directory before `.dvcignore` was configured	Add patterns to `.dvcignore`, then `dvc add` the parent directory again to regenerate the pointer

Security Boundaries and Access Control

Geospatial data frequently contains sensitive infrastructure coordinates, proprietary survey data, or regulated environmental measurements. DVC does not encrypt payloads natively; security is enforced at the storage layer.

The security boundaries in spatial repositories guidance covers the full governance framework; the DVC-specific controls are:

IAM least privilege. Grant storage buckets read/write access only to service accounts executing dvc push/dvc pull. Use bucket policies to block public access and restrict DeleteObject to designated archive roles.
Encryption at rest. Enable server-side encryption (SSE-S3 or SSE-KMS) on the bucket. DVC passes encrypted blobs through transparently without modification.
Secrets management. Never hardcode credentials in .dvc/config. Use environment variables (AWS_ACCESS_KEY_ID, GOOGLE_APPLICATION_CREDENTIALS) or a secret manager injected by your CI/CD platform.
Audit trails. Enable storage bucket access logging. Cross-reference with git log to establish a complete audit trail: who committed which .dvc pointer, and when the corresponding payload was written to the remote.
Air-gapped environments. MinIO with TLS termination is a drop-in S3-compatible backend. Configure it identically to a cloud remote; DVC’s behaviour is backend-agnostic.
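The secrets-management rule lends itself to an automated pre-commit or CI check. The sketch below scans the text of .dvc/config for credential-bearing keys. It is a minimal example under assumptions: find_hardcoded_secrets is a hypothetical helper, and SECRET_KEYS is an illustrative deny-list covering common S3-style options that should be extended for other backends.

```python
# Illustrative deny-list; extend it for the backends your team uses.
SECRET_KEYS = {"access_key_id", "secret_access_key", "session_token", "password"}


def find_hardcoded_secrets(config_text: str) -> list:
    """Return (section, key) pairs for credential-like keys found in the config."""
    hits = []
    section = ""
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blanks and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        key, sep, _value = line.partition("=")
        if sep and key.strip().lower() in SECRET_KEYS:
            hits.append((section, key.strip()))
    return hits
```

Failing the build when the list is non-empty keeps credentials out of Git history. Values that must live on disk belong in the Git-ignored .dvc/config.local, which dvc remote modify --local writes to.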

FAQ

Does DVC re-upload the entire file when only a small attribute changes?

By default, yes — DVC computes a hash of the whole file and treats any change as a new version. For high-frequency vector attribute updates, combine DVC with the delta tracking workflow so only changed feature records are stored as incremental patches, then wrap the resulting patch file with a standard dvc add.

What is the largest single file DVC can reliably handle?

DVC streams uploads using multipart chunking (configurable chunk size), so DVC itself imposes no hard file-size ceiling; the limits that apply are those of the storage backend (5 TiB per object on S3). In practice, single files up to several hundred gigabytes are routine. Beyond that, consider splitting into tiled sub-directories so team members can dvc pull only the tiles they need for a given processing area.
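The backend limits translate into simple arithmetic. The worked example below uses the published S3 multipart limits (5 MiB minimum part size, at most 10,000 parts, 5 TiB maximum object size) to compute the smallest part size that fits a given file. min_part_size is a hypothetical helper, and other backends have different limits.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

S3_MIN_PART = 5 * MIB            # minimum size of every part except the last
S3_MAX_PARTS = 10_000            # maximum number of parts per upload
S3_MAX_OBJECT = 5 * 1024 ** 4    # 5 TiB object size ceiling


def min_part_size(file_bytes: int) -> int:
    """Smallest part size (bytes) that uploads file_bytes within S3's part limit."""
    if file_bytes > S3_MAX_OBJECT:
        raise ValueError("file exceeds the S3 single-object limit; split it into tiles")
    needed = -(-file_bytes // S3_MAX_PARTS)  # integer ceiling division
    return max(S3_MIN_PART, needed)
```

For a 200 GiB orthomosaic, min_part_size(200 * GIB) comes to roughly 20.5 MiB, so a chunk size left at the 5 MiB minimum would run out of parts before the upload finishes.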

How does DVC interact with cloud-optimised GeoTIFF (COG) or Zarr?

DVC tracks COG and Zarr at the file or directory level, just like any other binary. Because COG files are self-contained and HTTP range-readable, teams often skip the local cache for read-only analysis: they resolve the object's storage URL from the pointer (for example with dvc.api.get_url) and read tiles directly from the remote bucket. This combines content-addressable version control with the performance benefits of cloud-native formats.

Can DVC pipelines run in CI/CD (GitHub Actions, GitLab CI)?

Yes. Inject cloud credentials as CI secrets, run dvc pull to restore inputs, and run dvc repro to execute the pipeline. Use dvc push in the post-stage to persist new outputs. Cache the .dvc/cache directory using the platform’s cache action to avoid re-downloading unchanged assets between pipeline runs.

How do I prevent QGIS sidecar files and GDAL overviews from being tracked?

List them in .dvcignore. Common patterns: *.aux.xml, *.ovr, *.qgz, *.lock, and __pycache__/. DVC honours .dvcignore the same way Git honours .gitignore — entries are excluded from hashing and tracking entirely. Configure .dvcignore before running dvc add on any directory that QGIS has touched.

Delta Tracking Algorithms for Vector Data — feature-level diffing to minimise re-upload volume for frequently updated layers
Pointer Synchronization for Raster Datasets — resolving hash mismatches and partial downloads across distributed compute nodes
Security Boundaries in Spatial Repositories — full access-control and audit-trail governance for sensitive geospatial data
How to Configure DVC for PostGIS and GeoJSON — extending DVC tracking to database dumps and vector feature collections