What happens when feature IDs are missing from both datasets?

Records without a matching ID on either side appear as NaN-filled rows after an outer merge. Pre-validate schemas with pandera before merging; drop or quarantine rows where the id_col is null rather than propagating empty identifiers into the reconciled output.
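
A minimal sketch of that quarantine step, written with plain pandas rather than pandera so it stays self-contained. The helper name split_null_ids and the sample columns are illustrative assumptions, not part of the pipeline described later. The indicator=True flag of pd.merge is used afterwards to show which side each surviving row came from.

```python
import pandas as pd


def split_null_ids(df: pd.DataFrame, id_col: str = "feature_id"):
    """Return (clean, quarantined); quarantined rows have a null ID."""
    null_mask = df[id_col].isna()
    return df.loc[~null_mask].copy(), df.loc[null_mask].copy()


# Hypothetical sample data: one base row has no identifier
base = pd.DataFrame(
    {"feature_id": [1, 2, None], "land_use": ["res", "ind", "com"]}
)
incoming = pd.DataFrame({"feature_id": [2, 3], "land_use": ["ind", "agr"]})

base_clean, base_quarantined = split_null_ids(base)
incoming_clean, incoming_quarantined = split_null_ids(incoming)

# indicator=True adds a _merge column: "left_only", "right_only" or "both"
merged = pd.merge(
    base_clean,
    incoming_clean,
    on="feature_id",
    how="outer",
    suffixes=("_base", "_incoming"),
    indicator=True,
)
print(merged["_merge"].value_counts())
```

Rows in base_quarantined and incoming_quarantined can be written to a separate review file instead of being silently dropped.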

Can this pipeline handle multi-geometry types in the same GeoDataFrame?

Yes, but only if both GeoDataFrames share a compatible geometry column type. Use gdf.geom_type.value_counts() to audit heterogeneous collections before merging, and call gdf.explode(index_parts=True) to flatten GeometryCollections that would otherwise break spatial index rebuilding.
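
As a rough illustration of that audit, the sketch below builds a tiny GeoDataFrame containing one GeometryCollection, counts the geometry types, and flattens the collection with explode. The sample geometries and the feature_id column are assumptions made for the example.

```python
import geopandas as gpd
from shapely.geometry import GeometryCollection, LineString, Point

# Hypothetical frame mixing a plain Point with a GeometryCollection
gdf = gpd.GeoDataFrame(
    {"feature_id": [1, 2]},
    geometry=[
        Point(0, 0),
        GeometryCollection([Point(1, 1), LineString([(0, 0), (1, 1)])]),
    ],
)

# Audit: how many features of each geometry type are present?
type_counts = gdf.geom_type.value_counts()
print(type_counts)

# Flatten collections into one row per member geometry; index_parts=True
# keeps the original row label plus a part number in a MultiIndex
flat = gdf.explode(index_parts=True)
print(flat.geom_type.value_counts())
```

After exploding, a single feature_id can appear on several rows, so the ID is no longer unique and a key-based merge needs an aggregation or a per-part key first.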

How should I handle float attribute comparisons to avoid false-positive conflicts?

Use numpy.isclose() with an absolute tolerance matching your dataset's precision (e.g., atol=1e-6 for survey-grade coordinates) instead of strict inequality, and pass rtol=0 so that NumPy's default relative tolerance of 1e-5 does not hide real differences between large values. Replace the != operator in the conflict mask with ~np.isclose(merged[base_col], merged[inc_col], rtol=0, atol=1e-6, equal_nan=True).
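
The difference between the three comparison styles can be seen on a small example. The column names and values below are made up for illustration; the point is how strict inequality, isclose with NumPy's default rtol, and isclose with rtol=0.0 classify the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical merged frame with projected coordinates in metres
merged = pd.DataFrame(
    {
        "easting_base": [500000.0, 500000.0, np.nan, 12.5],
        "easting_incoming": [500000.0 + 1e-8, 500001.0, np.nan, np.nan],
    }
)
base_vals = merged["easting_base"].to_numpy()
inc_vals = merged["easting_incoming"].to_numpy()

# Only rows with a value on both sides can be in conflict
both_present = merged["easting_base"].notna() & merged["easting_incoming"].notna()

# 1. Strict inequality: flags floating-point noise as a conflict
strict = both_present & (base_vals != inc_vals)

# 2. isclose with the default rtol=1e-5: tolerance scales with the value
default_rtol = both_present & ~np.isclose(
    base_vals, inc_vals, atol=1e-6, equal_nan=True
)

# 3. isclose with rtol=0.0: atol is the only tolerance applied
abs_only = both_present & ~np.isclose(
    base_vals, inc_vals, rtol=0.0, atol=1e-6, equal_nan=True
)

print(pd.DataFrame({"strict": strict, "default_rtol": default_rtol, "abs_only": abs_only}))
```

With these values the strict mask flags rows 0 and 1, the default-rtol mask flags nothing (a 1 m difference at an easting of 500 000 m falls inside the relative tolerance), and the absolute-only mask flags just row 1.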

Automating Attribute Reconciliation with Pandas and GeoPandas

A focused implementation guide for replacing error-prone manual GIS editing with a deterministic, code-driven reconciliation pipeline — part of Attribute Reconciliation for Tabular Spatial Data.

Concept & Context

Attribute reconciliation for tabular spatial data is the process of merging two diverged versions of the same feature dataset into a single authoritative output, resolving field-level disagreements according to explicit rules. Doing this by hand in QGIS or ArcGIS is feasible for a dozen features; at thousands of features across weekly sprint cycles, it becomes the primary bottleneck in distributed GIS team workflows.

Pandas and GeoPandas together solve this problem at the right altitude: Pandas handles the tabular join and vectorized delta logic, while GeoPandas manages CRS normalization and geometry reconstruction. The resulting pipeline is stateless and deterministic — given identical inputs it always produces identical outputs — which is essential for the audit trails that conflict resolution and team synchronization workflows require. When geometry itself is in dispute rather than just attributes, the companion technique of automated patching for minor geometry shifts handles the spatial side.

The pipeline described here is version-control friendly: every conflict decision is recorded in a structured log, making the reconciliation step reproducible from a CI/CD trigger or a DVC stage.

Core Algorithmic Pipeline

The reconciliation algorithm runs five sequential steps. Each step has an exact contract so that any step can be replaced or extended without breaking the others.

1. CRS normalization. Before any tabular operation, reproject incoming_gdf to base_gdf.crs using gdf.to_crs(). A silent CRS mismatch propagates all the way through the merge and corrupts every downstream spatial join or distance calculation. Fail fast here rather than later.
2. Outer join on feature ID. Drop the geometry column from both frames before calling pd.merge(..., how="outer", suffixes=("_base", "_incoming")). The outer join preserves newly added features from either dataset; the suffixes isolate each source's values so the original columns are never overwritten before comparison.
3. Vectorized delta detection. For each attribute column, compute a boolean mask_conflict that is True where both sides are non-null and unequal. The comparison is a single O(n) vectorized NumPy operation with no Python-level row iteration. Float columns require numpy.isclose() with an absolute tolerance matched to your dataset's precision rather than strict !=.
4. Timestamp-priority resolution. Where mask_conflict is True, compare the last_updated columns using np.where. The more recent timestamp wins; on a tie or a missing timestamp the base value is kept. Simultaneously, record each conflict in a flat log DataFrame: feature ID, field name, both original values, and the applied decision ("base" or "incoming").
5. Geometry reconstruction and audit export. Reattach geometries from base_gdf backfilled with incoming_gdf via Series.combine_first, then wrap as gpd.GeoDataFrame. Compile the per-column conflict records into a single audit log and return both as a tuple.

Working Implementation

import pandas as pd
import geopandas as gpd
import numpy as np
from typing import Tuple

# Enable copy-on-write to prevent chained-assignment warnings (Pandas 2.0+)
pd.options.mode.copy_on_write = True


def reconcile_attributes(
    base_gdf: gpd.GeoDataFrame,
    incoming_gdf: gpd.GeoDataFrame,
    id_col: str = "feature_id",
    priority_col: str = "last_updated",
    float_atol: float = 1e-9,
) -> Tuple[gpd.GeoDataFrame, pd.DataFrame]:
    """
    Reconcile attribute conflicts between two diverged GeoDataFrames.

    Returns:
        reconciled  — a new GeoDataFrame with conflicts resolved deterministically
        audit_log   — a flat DataFrame recording every conflict and its resolution
    """

    # ── Step 1: CRS normalization ──────────────────────────────────────────────
    if base_gdf.crs != incoming_gdf.crs:
        incoming_gdf = incoming_gdf.to_crs(base_gdf.crs)

    # ── Step 2: Outer join on feature ID (geometry excluded from merge) ────────
    merged = pd.merge(
        base_gdf.drop(columns="geometry"),
        incoming_gdf.drop(columns="geometry"),
        on=id_col,
        how="outer",
        suffixes=("_base", "_incoming"),
    )

    # Columns that are not attribute fields (priority timestamps + ID)
    meta_cols = {id_col, f"{priority_col}_base", f"{priority_col}_incoming"}
    # Collect only the _base side of each attribute pair
    base_attr_cols = [
        c for c in merged.columns
        if c.endswith("_base") and c not in meta_cols
    ]

    conflict_records: list[pd.DataFrame] = []
    resolved_cols: dict[str, pd.Series] = {}

    # ── Steps 3 & 4: Vectorized delta detection + priority resolution ──────────
    for base_col in base_attr_cols:
        attr = base_col.removesuffix("_base")          # e.g. "land_use"
        inc_col = f"{attr}_incoming"

        if inc_col not in merged.columns:
            resolved_cols[attr] = merged[base_col]
            continue

        both_present = merged[base_col].notna() & merged[inc_col].notna()

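        # Float-safe inequality check. rtol=0.0 makes float_atol the only
        # tolerance; NumPy's default rtol=1e-5 would otherwise scale the
        # tolerance with the magnitude of the values being compared.
        if pd.api.types.is_float_dtype(merged[base_col]):
            values_differ = ~np.isclose(
                merged[base_col].to_numpy(dtype="float64", na_value=np.nan),
                merged[inc_col].to_numpy(dtype="float64", na_value=np.nan),
                rtol=0.0,
                atol=float_atol,
                equal_nan=True,
            )
        else:
            values_differ = merged[base_col] != merged[inc_col]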

        mask_conflict = both_present & values_differ

        # Incoming wins when its timestamp is strictly newer
        incoming_newer = (
            merged[f"{priority_col}_incoming"] > merged[f"{priority_col}_base"]
        )

        # Apply resolution; fall back to whichever side is non-null
        resolved = merged[base_col].copy()
        if mask_conflict.any():
            resolved = resolved.where(
                ~mask_conflict,
                np.where(incoming_newer, merged[inc_col], merged[base_col]),
            )
            # Audit: record every conflict with its decision
            log_slice = merged.loc[mask_conflict, [id_col, base_col, inc_col]].copy()
            log_slice["field"] = attr
            log_slice["resolution"] = np.where(
                incoming_newer[mask_conflict], "incoming", "base"
            )
            log_slice = log_slice.rename(
                columns={base_col: "value_base", inc_col: "value_incoming"}
            )
            conflict_records.append(
                log_slice[[id_col, "field", "value_base", "value_incoming", "resolution"]]
            )

        # Back-fill rows where the base value is null (for example, features
        # that exist only in the incoming dataset) with the incoming value
        resolved = resolved.combine_first(merged[inc_col])
        resolved_cols[attr] = resolved

    # ── Step 5: Reconstruct GeoDataFrame ──────────────────────────────────────
    reconciled_df = pd.DataFrame(resolved_cols)
    reconciled_df[id_col] = merged[id_col].values

    # Geometry: prefer base, fill gaps from incoming
    geom_base = base_gdf.set_index(id_col)["geometry"]
    geom_inc  = incoming_gdf.set_index(id_col)["geometry"]
    geometry  = geom_base.combine_first(geom_inc)

    reconciled_df = reconciled_df.join(geometry, on=id_col)
    reconciled = gpd.GeoDataFrame(reconciled_df, geometry="geometry", crs=base_gdf.crs)

    # ── Step 5 (continued): Compile audit log ──────────────────────────────────
    audit_log = (
        pd.concat(conflict_records, ignore_index=True)
        if conflict_records
        else pd.DataFrame(columns=[id_col, "field", "value_base", "value_incoming", "resolution"])
    )

    return reconciled, audit_log

When to use spatial proximity instead of key-based joins

If unique identifiers are absent or unreliable — common after a schema migration or when integrating a third-party dataset — replace pd.merge with gpd.sjoin_nearest. Spatial joins require an explicit max_distance tolerance and should always be followed by a secondary attribute validator to catch false-positive matches before the conflict resolution step runs.
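
A minimal sketch of that fallback is shown below. The sample points, the land_use attribute, the 2-metre max_distance, and the attr_agrees column are all assumptions chosen for illustration; a projected CRS is used so that distances are in metres.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical inputs without a shared feature ID
base = gpd.GeoDataFrame(
    {"land_use": ["res", "com", "ind"]},
    geometry=[Point(0, 0), Point(50, 50), Point(100, 100)],
    crs="EPSG:3857",
)
incoming = gpd.GeoDataFrame(
    {"land_use": ["res", "ind", "agr"]},
    geometry=[Point(0.5, 0), Point(50.5, 50), Point(500, 500)],
    crs="EPSG:3857",
)

# Match each base feature to its nearest incoming feature within 2 m;
# base features with no neighbour inside max_distance are dropped by how="inner"
matches = gpd.sjoin_nearest(
    base,
    incoming,
    how="inner",
    max_distance=2.0,
    distance_col="match_dist",
    lsuffix="base",
    rsuffix="incoming",
)

# Secondary attribute validator: a spatial match whose attributes disagree
# is treated as a possible false positive and held back for review
matches["attr_agrees"] = matches["land_use_base"] == matches["land_use_incoming"]
suspect = matches.loc[~matches["attr_agrees"]]
print(matches[["land_use_base", "land_use_incoming", "match_dist", "attr_agrees"]])
```

Which attribute is stable enough to serve as the validator depends on the dataset; a field that editors change often would flag every legitimate edit as a mismatch.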

Validation & Output Verification

After reconcile_attributes returns, verify the output before writing to disk or committing to version control.

Row-count assertions confirm no silent data loss:

expected = len(base_gdf) + len(incoming_gdf.loc[
    ~incoming_gdf[id_col].isin(base_gdf[id_col])
])
assert len(reconciled) == expected, (
    f"Row count mismatch: expected {expected}, got {len(reconciled)}"
)

CRS round-trip check confirms geometry integrity survived reconstruction:

assert reconciled.crs == base_gdf.crs, "CRS lost during geometry reconstruction"
assert reconciled.geometry.is_valid.all(), "Invalid geometries after reconciliation"

Audit log completeness — every conflict must have a recorded decision:

assert audit_log["resolution"].isin({"base", "incoming"}).all(), \
    "Unresolved conflicts in audit log"
print(audit_log.groupby("field")["resolution"].value_counts())

Topology validation for polygon datasets — pipe through automated conflict detection in merge requests to catch ring self-intersections introduced during geometry backfill:

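from shapely.validation import explain_validity

# Explain each invalid geometry. Missing geometries are skipped here because
# explain_validity() needs an actual geometry object.
geoms = reconciled.geometry
invalid = reconciled.loc[geoms.notna() & ~geoms.is_valid, [id_col, "geometry"]]
for fid, geom in zip(invalid[id_col], invalid["geometry"]):
    print(fid, explain_validity(geom))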

Export the audit log as a versioned Parquet partition alongside the reconciled file:

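from datetime import datetime, timezone
from pathlib import Path

# Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
run_id = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
Path("output").mkdir(exist_ok=True)

reconciled["reconciliation_run_id"] = run_id
reconciled.to_file(f"output/reconciled_{run_id}.gpkg", driver="GPKG")

audit_log["reconciliation_run_id"] = run_id
# The value_* columns mix dtypes across fields; cast them to strings so the
# Parquet writer (pyarrow must be installed) can serialise them
value_cols = ["value_base", "value_incoming"]
audit_log[value_cols] = audit_log[value_cols].astype(str)
audit_log.to_parquet(f"output/audit_{run_id}.parquet", index=False)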

Failure Modes

CRS error during normalization: the to_crs() guard only works when both frames carry a CRS. If either base_gdf.crs or incoming_gdf.crs is None, the frames compare as different and to_crs() raises an error (a ValueError in current GeoPandas releases) instead of reprojecting. Fix: assert that base_gdf.crs and incoming_gdf.crs are both not None before the function call and enforce EPSG:4326 (or your organisation's standard) as a pipeline precondition.
Silent dtype coercion during np.where: mixing object and int64 columns causes np.where to return an object array, so a resolved attribute column can end up holding a mix of integers and strings that breaks downstream joins and typed exports. Fix: pre-validate schemas with pandera before calling reconcile_attributes; enforce column dtypes with df.astype(schema_dtypes).
Geometry None in reconciled output: Series.combine_first aligns on the union of both indexes, so a feature that exists only in incoming_gdf does receive its incoming geometry. A None geometry in the output therefore means no usable geometry was found for that ID in either input, most often because it is null on both sides. Fix: assert reconciled.geometry.notna().all() after reconstruction and quarantine features whose geometry is null in both inputs.
Memory exhaustion on large datasets: an outer join on two 10 M-row GeoDataFrames materialises a merged frame holding both sets of attribute columns while both inputs are still in memory. Fix: partition the merge by bounding box or administrative unit using spatial indexing, or switch the tabular layer to dask-geopandas with lazy evaluation.
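
The dtype-coercion failure mode is easy to reproduce in isolation. In this sketch the two Series stand in for the base and incoming sides of one attribute; the values and the incoming_newer mask are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical attribute stored as int64 in base but as text in incoming
base_vals = pd.Series([10, 20, 30], dtype="int64")
inc_vals = pd.Series(["10", "25", "30"], dtype=object)
incoming_newer = np.array([False, True, False])

# np.where falls back to an object array when the inputs disagree on dtype
mixed = pd.Series(np.where(incoming_newer, inc_vals, base_vals))
print(mixed.dtype)
print([type(v).__name__ for v in mixed])

# Casting both sides to one dtype first keeps the resolved column typed
typed = pd.Series(
    np.where(incoming_newer, inc_vals.astype("int64"), base_vals)
)
print(typed.dtype)
```

The mixed column raises no error at resolution time, which is why enforcing dtypes before the merge is safer than trying to detect the problem afterwards.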

Attribute Reconciliation for Tabular Spatial Data — parent overview of the reconciliation strategies this page implements
Resolving Overlapping Polygons in Collaborative Editing — handle the geometric dimension of merge conflicts
Automated Patching for Minor Geometry Shifts — coordinate-level patch logic to complement attribute reconciliation
Automated Conflict Detection in Merge Requests — gate topology validity before a reconciled dataset reaches the main branch