How to parse municipal zoning PDFs into GeoJSON
To parse municipal zoning PDFs into GeoJSON, extract embedded vector paths using a PDF parsing library, apply a 2D affine transformation to map PDF point coordinates to real-world spatial references, and serialize the calibrated geometries through GeoPandas to produce RFC 7946-compliant output. Municipal zoning documents are rarely born-geospatial; they require coordinate calibration, topology validation, and attribute mapping before they can reliably feed into automated compliance pipelines. This guide provides a production-ready Python workflow for vector-exported zoning maps.
Prerequisites & Toolchain Setup
Municipal zoning PDFs generally fall into two categories: vector-exported maps (CAD/GIS exports) and scanned/raster documents. This pipeline targets vector PDFs, which retain parseable path geometry. For raster-heavy documents, see the fallback section below.
pip install pdfplumber geopandas shapely pyproj numpy
Ensure Python 3.9+ is active. pdfplumber reliably extracts path data across modern PDF generators, while GeoPandas manages spatial serialization, projection handling, and topology cleanup. Consult the GeoPandas Documentation for advanced spatial operations and CRS management.
Step 1: Extract Closed Vector Paths
Zoning districts are represented as closed polygons. pdfplumber exposes page paths as dictionaries containing coordinate tuples and metadata. Filter for closed paths and deduplicate the closing vertex to prevent invalid polygon construction.
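import pdfplumber
from shapely.geometry import Polygon


def extract_zoning_paths(pdf_path: str) -> list[Polygon]:
    """Extract closed vector paths from a zoning PDF."""
    polygons = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # pdfplumber exposes drawn paths through page.curves (plain
            # rectangles appear under page.rects). Vertices are stored under
            # "pts"; older releases used "points", so check both keys.
            for curve in page.curves:
                pts = [tuple(p) for p in (curve.get("pts") or curve.get("points") or [])]
                # Treat a path as closed when it ends where it started
                if len(pts) < 4 or pts[0] != pts[-1]:
                    continue
                # Remove the duplicate closing point; Shapely closes rings itself
                pts = pts[:-1]
                # Zoning boundaries need at least 3 unique vertices
                if len(set(pts)) >= 3:
                    polygons.append(Polygon(pts))
    return polygons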
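Native PDF coordinates use a bottom-left origin measured in points (1/72 inch), while pdfplumber reports vertical positions as a distance from the top of the page. Either way, these values are relative to the page and must be mapped to a real-world coordinate reference system (CRS) before spatial analysis. Measure control points in the same frame as the extracted vertices so that the calibration absorbs any axis flip.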
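The two page-coordinate conventions differ only by a vertical flip. The sketch below shows the conversion between a top-referenced position and PDF's native bottom-left frame, plus the point-to-inch scale. The helper names and the US Letter page height are assumptions made for the example, not part of any library.

```python
POINTS_PER_INCH = 72.0


def top_to_bottom_origin(x, top, page_height):
    """Convert (x, top), measured from the top-left corner, to
    PDF-native (x, y), measured from the bottom-left corner."""
    return (x, page_height - top)


def points_to_inches(value):
    """Convert a length in PDF points to inches."""
    return value / POINTS_PER_INCH


page_height = 792.0  # US Letter height in points (assumed for the example)
x, y = top_to_bottom_origin(72.0, 100.0, page_height)
print(x, y)                 # 72.0 692.0
print(points_to_inches(x))  # 1.0
```

If the control points are read from the same extraction output as the polygons, this conversion is unnecessary: the affine fit picks up the flip as a negative y scale.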
Step 2: Calibrate Coordinate Space
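Georeferencing with an affine transform requires at least three non-collinear control points, each pairing (x_pdf, y_pdf) with (x_real, y_real). A 2D affine transform has six parameters that together cover translation, rotation, scaling, and skew; two points would only be enough for a similarity transform without skew. We fit the parameters by least squares so that extra control points absorb measurement noise or imperfect control point selection.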
import numpy as np
from shapely.affinity import affine_transform
from shapely.geometry import Polygon


def solve_affine_matrix(pdf_pts: list[tuple], real_pts: list[tuple]) -> list[float]:
    """
    Solve for Shapely's [a, b, d, e, xoff, yoff] affine matrix.
    x' = a*x + b*y + xoff
    y' = d*x + e*y + yoff
    """
    if len(pdf_pts) != len(real_pts) or len(pdf_pts) < 3:
        raise ValueError("Need at least three matched control points")
    # One row [x, y, 1] per control point; x' and y' are fitted separately
    A = np.array([[px, py, 1.0] for px, py in pdf_pts])
    B_x = np.array([rx for rx, _ in real_pts], dtype=float)
    B_y = np.array([ry for _, ry in real_pts], dtype=float)
    coeffs_x, _, _, _ = np.linalg.lstsq(A, B_x, rcond=None)  # [a, b, xoff]
    coeffs_y, _, _, _ = np.linalg.lstsq(A, B_y, rcond=None)  # [d, e, yoff]
    # Return in Shapely's expected order
    ordered = (coeffs_x[0], coeffs_x[1], coeffs_y[0], coeffs_y[1], coeffs_x[2], coeffs_y[2])
    return [float(v) for v in ordered]


def apply_transform(polygons: list[Polygon], matrix: list[float]) -> list[Polygon]:
    return [affine_transform(poly, matrix) for poly in polygons]
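Select control points at unambiguous intersections (e.g., street grid corners, parcel vertices, or surveyed monuments) and spread them across the map rather than clustering them in one corner. Three non-collinear points determine the transform exactly; four or more let the least-squares fit average out digitization error and expose a bad point through its residual.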
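A quick way to check a calibration is to fit the transform and inspect the residual at the control points. The following is a minimal, self-contained sketch using NumPy only; `fit_affine` and the sample coordinates are hypothetical, chosen so that the expected transform (scale 0.5, flipped y axis) is easy to verify by eye.

```python
import numpy as np


def fit_affine(pdf_pts, real_pts):
    """Least-squares fit of x' = a*x + b*y + xoff, y' = d*x + e*y + yoff.

    Returns ([a, b, d, e, xoff, yoff], rmse), where rmse is the
    root-mean-square distance between fitted and given control points.
    """
    src = np.asarray(pdf_pts, dtype=float)
    dst = np.asarray(real_pts, dtype=float)
    if len(src) < 3 or len(src) != len(dst):
        raise ValueError("need at least three matched control points")
    design = np.column_stack([src, np.ones(len(src))])  # rows of [x, y, 1]
    coeffs, _, rank, _ = np.linalg.lstsq(design, dst, rcond=None)  # shape (3, 2)
    if rank < 3:
        raise ValueError("control points are collinear")
    residuals = design @ coeffs - dst
    rmse = float(np.sqrt((residuals ** 2).sum(axis=1).mean()))
    (a, d), (b, e), (xoff, yoff) = coeffs
    return [float(v) for v in (a, b, d, e, xoff, yoff)], rmse


# Four page corners (in points) mapped to made-up projected coordinates
pdf_pts = [(0, 0), (612, 0), (612, 792), (0, 792)]
real_pts = [(500000, 4500000), (500306, 4500000),
            (500306, 4499604), (500000, 4499604)]
matrix, rmse = fit_affine(pdf_pts, real_pts)
print([round(v, 6) for v in matrix], round(rmse, 6))
```

An RMSE that is large relative to the map's stated accuracy usually points to a mistyped coordinate or a misidentified control point rather than to a problem with the fit itself.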
Step 3: Serialize & Validate Topology
Raw PDF exports frequently contain slivers, self-intersections, or overlapping district boundaries. Before exporting to GeoJSON, validate geometries, assign a CRS, and clean topology.
import geopandas as gpd
from shapely.geometry import Polygon


def polygons_to_geojson(
    polygons: list[Polygon],
    crs: str = "EPSG:26918",  # Example: NAD83 / UTM zone 18N
    output_path: str = "zoning.geojson",
) -> str:
    gdf = gpd.GeoDataFrame(geometry=polygons, crs=crs)
    # Fix self-intersections and invalid rings
    gdf.geometry = gdf.geometry.make_valid()
    # Optional: remove slivers (< 10 sq meters); assumes a projected CRS in meters
    gdf = gdf[gdf.geometry.area > 10.0]
    # RFC 7946 requires WGS 84 longitude/latitude, so reproject before export
    gdf = gdf.to_crs("EPSG:4326")
    gdf.to_file(output_path, driver="GeoJSON")
    return output_path
RFC 7946: The GeoJSON Format requires coordinates in WGS 84 longitude/latitude (EPSG:4326). A file written in that CRS conforms to the specification and stays compatible with web mapping libraries, compliance engines, and municipal GIS portals.
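For readers who want to see what the serialized structure looks like, here is a minimal standard-library sketch of a single-feature GeoJSON document. It assumes the ring is already in longitude/latitude and applies the RFC 7946 conventions of a closed ring and a counterclockwise exterior. The `zone_code` property and the coordinates are hypothetical.

```python
import json


def signed_area(ring):
    """Shoelace formula; positive for counterclockwise rings."""
    return 0.5 * sum(
        x1 * y2 - x2 * y1
        for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1])
    )


def to_feature(ring, properties):
    """Build a GeoJSON Polygon feature from an open ring of (lon, lat) pairs."""
    ring = [tuple(pt) for pt in ring]
    if signed_area(ring) < 0:   # exterior rings should run counterclockwise
        ring = ring[::-1]
    closed = ring + [ring[0]]   # first and last positions must be identical
    return {
        "type": "Feature",
        "properties": properties,
        "geometry": {
            "type": "Polygon",
            "coordinates": [[list(p) for p in closed]],
        },
    }


# A clockwise square; "zone_code" is a hypothetical attribute name
ring = [(-75.0, 40.0), (-75.0, 40.001), (-74.999, 40.001), (-74.999, 40.0)]
collection = {
    "type": "FeatureCollection",
    "features": [to_feature(ring, {"zone_code": "R-1"})],
}
print(json.dumps(collection, indent=2))
```

GeoPandas handles this serialization for you; the sketch is only meant to make the target structure concrete when debugging output by hand.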
Handling Raster-Heavy Zoning Maps
If pdfplumber returns empty paths, the PDF is likely a scanned image. Vector extraction will fail. In these cases:
- Convert the PDF to TIFF using pdf2image or PyMuPDF.
- Georeference the raster in QGIS using the Georeferencer plugin, saving a World File (.tfw) or GeoTIFF.
- Vectorize using rasterio.features.shapes() or gdal_polygonize.py.
- Clean topology and export as GeoJSON using the pipeline above.
Raster workflows introduce higher error margins and require manual ground-truthing. Prioritize vector PDFs when available.
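The World File mentioned in the steps above is itself a six-parameter affine transform, stored one number per line in the order A, D, B, E, C, F. A minimal sketch of reading one and mapping a pixel position to world coordinates follows; the function names and sample values are illustrative assumptions.

```python
def parse_world_file(text):
    """Return (A, D, B, E, C, F) from the six numeric lines of a world file."""
    values = [float(token) for token in text.split()]
    if len(values) != 6:
        raise ValueError("a world file holds exactly six numeric lines")
    return tuple(values)


def pixel_to_world(col, row, params):
    """Map a pixel (col, row) to world (x, y).

    x = A*col + B*row + C
    y = D*col + E*row + F
    C and F locate the center of the upper-left pixel.
    """
    A, D, B, E, C, F = params
    return (A * col + B * row + C, D * col + E * row + F)


# 0.5 m pixels, no rotation; E is negative because rows increase downward
tfw = "0.5\n0.0\n0.0\n-0.5\n500000.25\n4500000.75\n"
params = parse_world_file(tfw)
print(pixel_to_world(0, 0, params))    # (500000.25, 4500000.75)
print(pixel_to_world(10, 20, params))  # (500005.25, 4499990.75)
```

A world file carries no CRS information, so the CRS has to be recorded separately, for example in the control-point registry for that municipality.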
Production Considerations
Automating zoning PDF ingestion requires consistent control point sourcing, CRS standardization, and error logging. Municipal documents vary widely in drafting standards; implement validation gates that flag geometries with extreme aspect ratios, zero or near-zero areas, or CRS mismatches before ingestion.
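A validation gate can be as simple as a function that returns the reasons a geometry should be held for review. The sketch below works on plain coordinate rings in a projected CRS; the thresholds (10 square units, 50:1 aspect ratio) and flag names are hypothetical and should be tuned per jurisdiction. The aspect ratio is taken from the axis-aligned bounding box, so a long diagonal sliver can slip past this particular check.

```python
def ring_area(ring):
    """Unsigned shoelace area of an open ring of (x, y) pairs."""
    return abs(0.5 * sum(
        x1 * y2 - x2 * y1
        for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1])
    ))


def validation_flags(ring, min_area=10.0, max_aspect=50.0):
    """Return a list of reasons this ring should be held for review."""
    flags = []
    xs, ys = zip(*ring)
    width, height = max(xs) - min(xs), max(ys) - min(ys)
    if ring_area(ring) < min_area:
        flags.append("area_below_minimum")
    short, long_side = sorted((width, height))
    if short == 0 or long_side / short > max_aspect:
        flags.append("extreme_aspect_ratio")
    return flags


parcel = [(0, 0), (100, 0), (100, 60), (0, 60)]
sliver = [(0, 0), (1000, 0), (1000, 0.005), (0, 0.005)]
print(validation_flags(parcel))  # []
print(validation_flags(sliver))  # ['area_below_minimum', 'extreme_aspect_ratio']
```

Logging the flags alongside the source PDF name and page number makes it easier to trace a rejected geometry back to the drawing that produced it.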
This extraction pipeline forms the foundation of a broader Core Geospatial Compliance Architecture & Regulatory Mapping framework, enabling automated setback calculations, use-case validation, and overlay analysis. For teams scaling this across multiple jurisdictions or integrating with CI/CD data pipelines, review our Zoning Layer Ingestion Strategies guide to handle batch processing, version control, and regulatory update triggers.
Maintain a control-point registry per municipality. Store PDF-to-real-world mappings alongside extracted GeoJSON to enable reproducible georeferencing and audit trails for compliance reviews.
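One possible shape for such a registry entry is sketched below using only the standard library. The field names, the municipality, and the file name are placeholders, not a prescribed schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ControlPoint:
    pdf_xy: tuple[float, float]   # position on the page, in PDF points
    real_xy: tuple[float, float]  # matching position in the target CRS
    description: str = ""


@dataclass
class RegistryEntry:
    municipality: str
    source_pdf: str
    crs: str
    control_points: list[ControlPoint] = field(default_factory=list)


entry = RegistryEntry(
    municipality="Example Township",  # placeholder name
    source_pdf="zoning_map.pdf",      # placeholder file name
    crs="EPSG:26918",
    control_points=[
        ControlPoint((0.0, 0.0), (500000.0, 4500000.0), "SW page corner"),
        ControlPoint((612.0, 0.0), (500306.0, 4500000.0), "SE page corner"),
        ControlPoint((0.0, 792.0), (500000.0, 4499604.0), "NW page corner"),
    ],
)

# Serialize next to the extracted GeoJSON so the calibration can be replayed
record = json.dumps(asdict(entry), indent=2)
print(record)
```

Keeping this record under version control with the GeoJSON output means a reviewer can rerun the affine fit and confirm that the published geometry matches the documented control points.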