Why does set_precision matter before geometry comparison?

Spatial coordinates are IEEE 754 doubles. Identical features from different sources rarely match at the bit level. set_precision snaps all vertices to a fixed grid, eliminating false positives caused by sub-millimeter floating-point drift.
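
A minimal sketch of that effect, assuming Shapely 2.0+ and two hand-made squares that differ only by a 1e-9 offset on one vertex. The names `a` and `b` and the 1e-6 grid size are illustrative choices, not values taken from a real dataset:

```python
import shapely
from shapely.geometry import Polygon

# Two copies of the same unit square; b carries tiny drift on one vertex.
a = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
b = Polygon([(0, 0), (1 + 1e-9, 0), (1, 1), (0, 1)])

# Comparison on the raw coordinates.
raw_equal = bool(shapely.equals(a, b))

# Snap both geometries to the same 1e-6 grid, then compare again.
a_snapped = shapely.set_precision(a, 1e-6)
b_snapped = shapely.set_precision(b, 1e-6)
snapped_equal = bool(shapely.equals(a_snapped, b_snapped))

print(raw_equal, snapped_equal)
```

The grid size has to be larger than the drift you want to absorb and smaller than the smallest edit you want to detect.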

When should I use a spatial join instead of a primary-key join for diffing?

Only when primary keys are missing or unreliable. Spatial proximity matching is ambiguous: two different features can overlap legitimately, and nearby features may be mismatched when boundaries shift. Reserve sjoin for datasets that have no stable identifier.
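
As an illustration of the primary-key route, here is a small sketch using `pandas.merge` with `indicator=True`. The `feature_id` column and the toy rows are assumptions for the example, and plain DataFrames stand in for GeoDataFrames:

```python
import pandas as pd

base = pd.DataFrame({"feature_id": [1, 2, 3], "name": ["a", "b", "c"]})
target = pd.DataFrame({"feature_id": [2, 3, 4], "name": ["b", "c2", "d"]})

# An outer join on the key pairs each feature with its counterpart, if any.
# validate="one_to_one" raises MergeError if either side has duplicate keys.
joined = base.merge(
    target,
    on="feature_id",
    how="outer",
    suffixes=("_base", "_target"),
    indicator=True,
    validate="one_to_one",
)

# _merge is "left_only" (removed), "right_only" (added) or "both" (shared).
status = joined.set_index("feature_id")["_merge"].astype(str).to_dict()
print(status)
```

Each feature lands in exactly one bucket, which is the property a proximity-based match cannot guarantee.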

How do I handle datasets too large to fit in RAM?

Partition by spatial grid (H3 or S2 cells) or administrative boundary, compute diffs per partition with a small buffer overlap, then concatenate. Use dask-geopandas to parallelize topology repair across partitions.
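
A rough sketch of the partition-assignment step, using a plain square grid as a stand-in for H3 or S2 cells (both of which need their own libraries). The function name `grid_cell`, the cell size, and the toy centroids are hypothetical:

```python
import math
from collections import defaultdict


def grid_cell(x: float, y: float, cell_size: float) -> tuple[int, int]:
    """Return the (column, row) index of the square cell containing (x, y)."""
    return (math.floor(x / cell_size), math.floor(y / cell_size))


# Toy input: feature_id -> representative point (for example a centroid).
centroids = {"a": (0.2, 0.7), "b": (1.5, 0.1), "c": (1.9, 0.9), "d": (-0.3, 2.4)}

# Group feature IDs by the cell that contains their representative point.
partitions: dict[tuple[int, int], list[str]] = defaultdict(list)
for feature_id, (x, y) in centroids.items():
    partitions[grid_cell(x, y, cell_size=1.0)].append(feature_id)

for cell, ids in sorted(partitions.items()):
    print(cell, ids)
```

Each partition can then be diffed on its own and the per-partition results concatenated. A feature whose representative point crosses a cell boundary between versions is the case the buffer overlap is meant to catch.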

Implementing Spatial Diff Algorithms in Python

A focused implementation guide for computing deterministic change sets between two polygon GeoDataFrame objects — part of Spatial Diff Algorithms for Polygon Data.

Concept & Context

Spatial diff is harder than text diff because geometry is continuous, topology-dependent, and sensitive to coordinate representation. Two polygons that look identical on screen may differ at the 14th decimal place because they were produced by different GIS tools. A diff algorithm that skips precision normalization can flag such features as modified even though nothing meaningful changed, filling your change log with noise.

The approach here combines geopandas for tabular-spatial joins with shapely 2.0+ for vectorized GEOS-backed operations. By avoiding row-by-row iteration and using a stable primary-key join rather than a positional or spatial-proximity join, the pipeline produces a deterministic change set that integrates directly into the branching and merge workflows used by GIS data engineering teams. When automated conflict detection in merge requests needs to gate a pull request, it reads exactly this kind of structured diff output.

Core Algorithmic Pipeline

The pipeline enforces a strict sequence; skipping any step produces either false positives (phantom changes) or false negatives (missed edits).

Normalize CRS — reproject target to base.crs before any geometric operation. Comparing coordinates across different projections produces nonsense differences.
Repair Topology — call make_valid on every geometry. Invalid rings and self-intersections cause GEOS overlay operations to fail silently or return None.
Quantize Coordinates — apply set_precision(grid_size) to snap all vertices to a fixed grid. This collapses sub-millimeter floating-point noise into exact matches, eliminating false positives without altering visible geometry.
Classify Feature IDs — use Python set arithmetic on the primary-key column to split features into added, removed, and shared buckets in O(n) time.
Detect Geometry Modifications — for shared IDs, run a vectorized shapely.equals across numpy arrays to identify features whose geometry changed.
Reconcile Attributes — for shared IDs, compare non-spatial columns using pandas boolean indexing to separate geometry-only edits from attribute-only updates.

For the attribute reconciliation step, the automating attribute reconciliation with pandas and geopandas guide covers the composite change-type logic in detail.

Working Implementation

The function below is the complete production implementation. It returns a unified change table with a change_type column (added, removed, modified, unchanged) suitable for feeding into CI/CD gates or audit logs.

import geopandas as gpd
import pandas as pd
import numpy as np
import shapely
from shapely.validation import make_valid


def compute_spatial_diff(
    base: gpd.GeoDataFrame,
    target: gpd.GeoDataFrame,
    id_col: str = "feature_id",
    precision: float = 1e-6,
) -> gpd.GeoDataFrame:
    """
    Vectorized spatial diff for polygon GeoDataFrames.

    Parameters
    ----------
    base      : GeoDataFrame — the reference (older) version
    target    : GeoDataFrame — the candidate (newer) version
    id_col    : column name that uniquely identifies each feature
    precision : grid size for coordinate snapping (degrees or metres
                depending on CRS; 1e-6 degrees ≈ 0.1 m at mid-latitudes)

    Returns
    -------
    GeoDataFrame with columns [id_col, 'change_type', 'geometry'],
    CRS inherited from base.
    """
    base, target = base.copy(), target.copy()

    # --- Step 1: CRS Alignment -------------------------------------------
    if base.crs != target.crs:
        target = target.to_crs(base.crs)

    # --- Step 2: Topology Repair -----------------------------------------
    # make_valid repairs self-intersections and other invalid ring structures.
    # GeoSeries.make_valid() is vectorized (GeoPandas 0.12+), unlike .apply().
    base["geometry"] = base.geometry.make_valid()
    target["geometry"] = target.geometry.make_valid()

    # --- Step 3: Coordinate Precision Normalization ----------------------
    # GeoSeries.set_precision (GeoPandas 1.0+, backed by Shapely 2.0+) snaps
    # vertices to a fixed grid, removing drift smaller than the grid size.
    base["geometry"] = base.geometry.set_precision(precision)
    target["geometry"] = target.geometry.set_precision(precision)

    # --- Step 4: Set-based ID Classification ----------------------------
    base_ids = set(base[id_col])
    target_ids = set(target[id_col])

    added_ids   = target_ids - base_ids   # new features not in base
    removed_ids = base_ids - target_ids   # deleted features not in target
    shared_ids  = base_ids & target_ids   # features present in both

    # --- Step 5: Vectorized Geometry Equality for Shared Features --------
    # Sort both slices by id_col to align arrays before element-wise compare.
    shared_base   = base[base[id_col].isin(shared_ids)].set_index(id_col).sort_index()
    shared_target = target[target[id_col].isin(shared_ids)].set_index(id_col).sort_index()

    # shapely.equals dispatches to GEOS in compiled C; ~10–50× faster than
    # a Python loop calling .equals() on individual geometry objects.
    geom_equal   = shapely.equals(
        shared_base["geometry"].values,
        shared_target["geometry"].values,
    )
    modified_ids  = set(shared_base.index[~geom_equal])
    unchanged_ids = set(shared_base.index[geom_equal])

    # --- Step 6: Build Output Slices -------------------------------------
    added     = target[target[id_col].isin(added_ids)].copy()
    removed   = base[base[id_col].isin(removed_ids)].copy()
    modified  = target[target[id_col].isin(modified_ids)].copy()
    unchanged = target[target[id_col].isin(unchanged_ids)].copy()

    for frame, label in [
        (added, "added"), (removed, "removed"),
        (modified, "modified"), (unchanged, "unchanged"),
    ]:
        frame["change_type"] = label

    diff = pd.concat([added, removed, modified, unchanged], ignore_index=True)
    return gpd.GeoDataFrame(
        diff[[id_col, "change_type", "geometry"]],
        geometry="geometry",
        crs=base.crs,
    )
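
Because step 5 compares the shared features element-wise, both inputs need unique values in `id_col`. A small guard that could run before calling the function is sketched below. The helper name `require_unique_ids` is hypothetical, and plain DataFrames stand in for GeoDataFrames:

```python
import pandas as pd


def require_unique_ids(frame: pd.DataFrame, id_col: str, label: str) -> None:
    """Raise ValueError listing the duplicated IDs, if there are any."""
    dupes = frame.loc[frame[id_col].duplicated(keep=False), id_col].unique()
    if len(dupes) > 0:
        raise ValueError(
            f"{label}: duplicate {id_col} values {sorted(dupes.tolist())}"
        )


base = pd.DataFrame({"feature_id": [1, 2, 2, 3]})

message = ""
try:
    require_unique_ids(base, "feature_id", "base")
except ValueError as exc:
    message = str(exc)
    print(message)

# One possible remedy: keep the last row per ID, then re-check.
deduped = base.drop_duplicates(subset="feature_id", keep="last")
require_unique_ids(deduped, "feature_id", "base")
```

Whether `keep="last"` is the right policy depends on how the duplicates arose, so treat the deduplication line as a placeholder for a deliberate decision.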

Extending to Attribute Changes

To distinguish geometry-only edits from attribute-only updates, extend step 6 with a pandas comparison across non-spatial columns:

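# Attribute diff for shared features. shared_ids and modified_ids are local to
# compute_spatial_diff, so run this inside the function or return those sets.
base_attrs   = base[base[id_col].isin(shared_ids)].drop(columns="geometry").set_index(id_col).sort_index()
target_attrs = target[target[id_col].isin(shared_ids)].drop(columns="geometry").set_index(id_col).sort_index()

# Compare only columns present in both frames, in the same order;
# != raises ValueError on DataFrames whose labels are not identical.
common_cols  = base_attrs.columns.intersection(target_attrs.columns)
base_attrs, target_attrs = base_attrs[common_cols], target_attrs[common_cols]

# True where at least one attribute differs; NaN on both sides counts as equal
both_null    = base_attrs.isna() & target_attrs.isna()
attr_changed = ((base_attrs != target_attrs) & ~both_null).any(axis=1)
attr_changed_ids = set(base_attrs.index[attr_changed])

# Composite labels
geometry_only_ids = modified_ids - attr_changed_ids
attr_only_ids     = attr_changed_ids - modified_ids
full_update_ids   = modified_ids & attr_changed_ids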

Combine these with modified_ids to produce composite change types (geometry_only, attribute_only, full_update) before writing the diff to your audit table.
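
One way to turn the two signals into a single label per shared feature is sketched below. The toy ID sets and the helper name `composite_label` are assumptions for illustration:

```python
# Toy inputs: which shared IDs had a geometry change, which an attribute change.
shared_ids = [10, 11, 12, 13]
modified_ids = {11, 13}        # geometry differs
attr_changed_ids = {12, 13}    # at least one attribute differs


def composite_label(fid: int) -> str:
    """Map the two boolean signals for one feature to a single change type."""
    geom_changed = fid in modified_ids
    attr_changed = fid in attr_changed_ids
    if geom_changed and attr_changed:
        return "full_update"
    if geom_changed:
        return "geometry_only"
    if attr_changed:
        return "attribute_only"
    return "unchanged"


labels = {fid: composite_label(fid) for fid in shared_ids}
print(labels)
```

The resulting mapping can be joined back onto the diff table by `id_col` to replace the coarser `modified` label.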

Validation & Output Verification

After running compute_spatial_diff, confirm correctness with these assertions before writing output to storage:

result = compute_spatial_diff(base_gdf, target_gdf, id_col="feature_id")

# 1. Row-count conservation: every feature must appear exactly once
total_base   = len(base_gdf)
total_target = len(target_gdf)
n_added      = (result["change_type"] == "added").sum()
n_removed    = (result["change_type"] == "removed").sum()
n_unchanged  = (result["change_type"] == "unchanged").sum()
n_modified   = (result["change_type"] == "modified").sum()

assert result["feature_id"].is_unique, \
    "Duplicate feature IDs in diff output"
assert n_added + n_unchanged + n_modified == total_target, \
    "Target feature count mismatch"
assert n_removed + n_unchanged + n_modified == total_base, \
    "Base feature count mismatch"

# 2. CRS preserved
assert result.crs == base_gdf.crs, "CRS was altered during diff"

# 3. Topology validity of output geometries
invalid_mask = ~result.geometry.is_valid
if invalid_mask.any():
    print(f"WARNING: {invalid_mask.sum()} invalid geometries in diff output")
    result.loc[invalid_mask, "geometry"] = result.loc[invalid_mask, "geometry"].apply(make_valid)

# 4. No null geometries in non-removed features
non_removed = result[result["change_type"] != "removed"]
assert non_removed.geometry.notna().all(), "Null geometry in non-removed features"

# 5. Change-type distribution check (sanity — adjust thresholds to your data)
print(result["change_type"].value_counts())

For large datasets, also compute a SHA-256 hash of the sorted feature_id + change_type pairs and store it with the diff artifact. This lets downstream pipeline stages verify they received an uncorrupted diff without re-running the comparison.

import hashlib, json

fingerprint_rows = (
    result[["feature_id", "change_type"]]
    .sort_values("feature_id")
    .to_dict(orient="records")
)
diff_hash = hashlib.sha256(
    json.dumps(fingerprint_rows, sort_keys=True).encode()
).hexdigest()
print(f"Diff fingerprint: {diff_hash}")
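
On the consuming side, a stage could recompute the fingerprint from the rows it received and compare it with the stored value. A minimal sketch, with a hypothetical helper `diff_fingerprint` and hand-written rows in place of a real diff:

```python
import hashlib
import json


def diff_fingerprint(rows: list[dict]) -> str:
    """SHA-256 over the rows sorted by feature_id, serialized as JSON."""
    ordered = sorted(rows, key=lambda r: r["feature_id"])
    payload = json.dumps(ordered, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


produced = [
    {"feature_id": 2, "change_type": "modified"},
    {"feature_id": 1, "change_type": "added"},
]
stored_hash = diff_fingerprint(produced)

# Downstream: the same rows in a different order give the same fingerprint.
received = list(reversed(produced))
matches = diff_fingerprint(received) == stored_hash

# Any altered label changes the fingerprint.
tampered = [dict(r) for r in received]
tampered[0]["change_type"] = "unchanged"
tamper_detected = diff_fingerprint(tampered) != stored_hash

print(matches, tamper_detected)
```

Sorting inside the helper means producer and consumer do not have to agree on row order, only on the serialization.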

Failure Modes

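Symptom: modified_ids contains nearly every shared feature even though the datasets are visually identical. Root cause: set_precision was not applied, or the grid size is too fine for the CRS units to absorb the coordinate drift. Fix: Apply set_precision(1e-6) to geographic (degree) data before comparison; for projected data in metres such as EPSG:3857, use 1e-3 (one millimetre), because 1e-6 there is a micrometre grid that may leave the drift intact. An overly coarse grid has the opposite effect: it snaps distinct vertices together and hides real edits.
Symptom: make_valid returns a MultiPolygon or GeometryCollection instead of a Polygon, causing downstream code that expects single polygons to fail. Root cause: The input polygon had a bowtie (figure-eight) self-intersection that make_valid splits into a multipart geometry; a GeometryCollection appears when the repair also yields lines or points. Fix: After make_valid, explode multi-part results with gdf.explode(index_parts=False), drop any non-polygonal parts, and re-check feature counts, because exploding repeats id_col for every part.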
Symptom: shapely.equals raises ValueError: operands could not be broadcast together on the shared-feature arrays. Root cause: shared_base and shared_target are not the same length because id_col contains duplicates in one dataset. Fix: Assert base[id_col].is_unique and target[id_col].is_unique before running the diff; deduplicate with .drop_duplicates(subset=id_col, keep="last") if needed.
Symptom: The diff output is correct but writing to GeoPackage fails with a GDAL duplicate field name error. Root cause: The output GeoDataFrame still carries extra columns from the concat (e.g., two columns named geometry from different CRS objects). Fix: Select only [id_col, "change_type", "geometry"] explicitly in the return statement, as the implementation does.
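
The make_valid behaviour described in the list above can be reproduced with a hand-made bowtie. The sketch below also shows one possible way to keep only the polygonal part of a repair result while preserving one geometry per feature; the helper name `polygonal_part` is hypothetical and this is an alternative to exploding, not the method used by the implementation above:

```python
from shapely.geometry import GeometryCollection, LineString, Polygon
from shapely.ops import unary_union
from shapely.validation import make_valid


def polygonal_part(geom):
    """Return the Polygon/MultiPolygon content of geom, or an empty Polygon."""
    if geom.geom_type in ("Polygon", "MultiPolygon"):
        return geom
    if geom.geom_type == "GeometryCollection":
        polys = [g for g in geom.geoms if g.geom_type in ("Polygon", "MultiPolygon")]
        return unary_union(polys) if polys else Polygon()
    return Polygon()


# A ring that crosses itself at (1, 1): the classic bowtie.
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])
fixed = make_valid(bowtie)
print(bowtie.is_valid, fixed.geom_type, fixed.area)

# A mixed collection, built by hand to mimic a repair that left a stray line.
mixed = GeometryCollection(
    [Polygon([(0, 0), (1, 0), (1, 1)]), LineString([(0, 0), (5, 5)])]
)
cleaned = polygonal_part(mixed)
print(cleaned.geom_type)
```

Keeping a single (multi)polygon per feature avoids the repeated `id_col` values that exploding introduces, at the cost of carrying multipart geometries downstream.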

Spatial Diff Algorithms for Polygon Data — parent cluster covering the broader algorithmic landscape, including hash-based and topology-graph approaches
Automated Conflict Detection in Merge Requests — how diff output feeds into merge-gate CI/CD checks
Automating Attribute Reconciliation with Pandas and GeoPandas — extending attribute-only change detection from the composite diff
Resolving Topology Errors During Branch Merges — remediation steps when diff output surfaces topology breaks before merge
