Skip to content
ES

Location Intelligence: Data Collection

Topic

From the PointSav Documentation

Updated 2026-06-18 · History
vendor-internal

Status (2026-06-11): All seven steps below are complete. VWH production build: 6,368 clusters (T1=852/T2=1,327/T3=4,189). PKS production build: 6,953 clusters (T1=691/T2=2,658/T3=3,604). This guide is retained as a reference runbook for re-ingesting or extending to new chains and countries.

Initial pipeline setup for ingesting VWH (Vertical Warehouse) and PKS (Parking Structures) chain and infrastructure data. Steps are recorded in execution order.

Working directory for all commands: app-orchestration-gis/ (inside the GIS monorepo clone).

[edit]Prerequisites

  • Overpass API access (queries run via ingest-osm.py; no API key required)
  • Python 3.11+ with pipeline dependencies installed
  • TOTEBOX_DATA_PATH pointing to the active deployment data directory

Verify the pipeline is clean:

python3 -c "from taxonomy import CATEGORIES, BRAND_FILL; print('taxonomy OK')"
python3 -c "from config import TOTEBOX_DATA_PATH; print('config OK')"

[edit]Step 1 — Run existing YAML ingests (VWH auto-parts + paint)

Five chain YAMLs were scaffolded on 2026-06-01. Run the ingest to download OSM records:

python3 ingest-osm.py --chain \
  autozone-us \
  oreilley-auto-us \
  napa-us \
  sherwin-williams-us \
  halfords-uk

Expected output files in $TOTEBOX_DATA_PATH/service-fs/service-business/:

  • autozone-us.jsonl — expect 5,000–7,000 records
  • oreilley-auto-us.jsonl — expect 5,000–7,000 records
  • napa-us.jsonl — expect 3,000–6,000 records (franchise network; OSM partial)
  • sherwin-williams-us.jsonl — expect 3,000–5,000 records
  • halfords-uk.jsonl — expect 300–450 records

Verify:

for f in autozone-us oreilley-auto-us napa-us sherwin-williams-us halfords-uk; do
  echo "$f: $(wc -l < $TOTEBOX_DATA_PATH/service-fs/service-business/$f.jsonl) records"
done

If a chain returns 0 records, check if Wikidata tag coverage is sparse in OSM and add name_query: fallback to the YAML.

[edit]Step 2 — Add Würth (biggest EU MRO gap)

Würth is the single highest-value chain not yet in the taxonomy. ~1,500 EU branches in industrial parks across DE/FR/IT/PL/AT/NL and beyond. OSM has brand:wikidata=Q183759 applied to many records.

2a. Create YAML:

cat > $TOTEBOX_DATA_PATH/service-fs/service-business/wurth-de.yaml << 'EOF'
schema: service-business-chain-v1
chain_id: wurth-de
country: Germany
country_code: DE
region: europe-central
category: Industrial MRO supply
category_slug: mro-industrial
naics_code: "423840"
top_category: Industrial and Personal Service Paper and Related Products
sub_category: Industrial MRO Supply
overture_taxonomy:
  - industrial
  - wholesale
brand_family: MROIndustrial
retailer: Würth
canonical_name: "Adolf Würth GmbH & Co. KG"
parent_company: "Würth Group"
website: wuerth.de
wikidata_id: Q183759
osm_overpass_tag: brand:wikidata=Q183759
store_count_approx: 1500
locations_file: locations/wurth-de.jsonl
locations_status: pending
last_updated: 2026-06-01
notes: "VWH archetype signal. MRO distributor; branches in industrial parks across EU. Multi-country — set country_code per-ISO ingest or use multi_country: true."
EOF

2b. Add mro_industrial category to taxonomy.py:

In taxonomy.py, inside the CATEGORIES dict, after the paint block:

"mro_industrial": {
    "label": "Industrial MRO Supply",
    "naics": "423840",
    "description": "Maintenance, repair, and operations distributor — VWH signal. Never gates tier.",
},
"flooring": {
    "label": "Flooring & Tile Supply",
    "naics": "442210",
    "description": "Contractor-facing flooring/tile warehouse — VWH signal. Never gates tier.",
},
"tool_rental": {
    "label": "Tool & Equipment Rental",
    "naics": "532412",
    "description": "Equipment rental branch — VWH signal; deliberate hardware co-location. Never gates tier.",
},
"lumber": {
    "label": "Lumber & Building Materials",
    "naics": "444190",
    "description": "Lumber yard or building materials dealer — VWH signal. Never gates tier.",
},
"car_rental": {
    "label": "Car Rental",
    "naics": "532111",
    "description": "Car rental branch — PKS signal; defines transit node commercial zone. Never gates tier.",
},

None of these go into _RETAIL_CATS — they never affect T1/T2/T3 tier logic.

2c. Add BRAND_FILL entries for all new categories:

In the BRAND_FILL dict, after the existing paint block:

"mro_industrial": {
    "DE": ["wurth-de"],
    "FR": ["wurth-de"],   # Würth is multi-country; chain_id shared
    "IT": ["wurth-de"], "ES": ["wurth-de"], "PL": ["wurth-de"],
    "AT": ["wurth-de"], "NL": ["wurth-de"], "GB": ["wurth-de"],
    "US": ["fastenal-us", "grainger-us"],
    "CA": [], "MX": [],
    "SE": ["wurth-de"], "DK": ["wurth-de"], "NO": ["wurth-de"],
    "FI": ["wurth-de"], "IS": [], "GR": [], "PT": [],
},
"flooring": {
    "US": ["floor-decor-us"],
    "GB": ["topps-tiles-uk"],
    "CA": [], "MX": [], "FR": [], "DE": [], "ES": [], "IT": [],
    "GR": [], "PL": [], "AT": [], "NL": [], "PT": [],
    "SE": [], "DK": [], "NO": [], "FI": [], "IS": [],
},
"tool_rental": {
    "US": ["united-rentals-us", "sunbelt-rentals-us"],
    "CA": ["united-rentals-us"],
    "FR": ["loxam-fr", "kiloutou-fr"],
    "MX": [], "GB": [], "DE": [], "ES": [], "IT": [],
    "GR": [], "PL": [], "AT": [], "NL": [], "PT": [],
    "SE": [], "DK": [], "NO": [], "FI": [], "IS": [],
},
"lumber": {
    "US": ["84-lumber-us", "builders-firstsource-us"],
    "CA": ["kent-building-supplies-ca"],
    "MX": [], "GB": [], "FR": [], "DE": [], "ES": [], "IT": [],
    "GR": [], "PL": [], "AT": [], "NL": [], "PT": [],
    "SE": [], "DK": [], "NO": [], "FI": [], "IS": [],
},
"car_rental": {
    "US": ["enterprise-us", "hertz-us", "avis-us"],
    "CA": ["enterprise-us", "hertz-us"],
    "MX": [],
    "DE": ["sixt-de"], "FR": ["europcar-fr", "sixt-de"],
    "GB": ["europcar-fr", "sixt-de"], "ES": ["europcar-fr"],
    "IT": ["europcar-fr"], "NL": ["europcar-fr"],
    "GR": [], "PL": [], "AT": ["sixt-de"], "PT": [],
    "SE": ["sixt-de"], "DK": ["sixt-de"], "NO": [], "FI": [], "IS": [],
},

2d. Run Würth ingest:

python3 ingest-osm.py --chain wurth-de

Expect 800–1,500 records. Würth branches are often tagged shop=trade or shop=wholesale in OSM — the brand:wikidata tag is the reliable signal.

[edit]Step 3 — Add Tier A retail/rental VWH chains

Create YAMLs following the schema below, then run ingest.

YAML schema reference:

schema: service-business-chain-v1
chain_id: floor-decor-us          # matches BRAND_FILL key
country: United States
country_code: US
region: north-america
category: Flooring supply          # display only
category_slug: flooring-supply
naics_code: "442210"
top_category: Floor Covering Stores
sub_category: Floor Covering Stores
brand_family: Flooring             # matches category
retailer: Floor & Decor
canonical_name: "Floor & Decor Holdings, Inc."
parent_company: "Floor & Decor Holdings, Inc. (public; NYSE: FND)"
website: flooranddecor.com
wikidata_id: Q22350998
osm_overpass_tag: brand:wikidata=Q22350998
store_count_approx: 240
locations_file: locations/floor-decor-us.jsonl
locations_status: pending
last_updated: 2026-06-01
notes: "VWH Tier A. Warehouse-format contractor flooring; same footprint as Home Depot."

Chains to create (copy and adapt schema above):

chain_id retailer wikidata_id country_code approx_count
floor-decor-us Floor & Decor Q22350998 US 240
topps-tiles-uk Topps Tiles Q7825827 GB 300
united-rentals-us United Rentals Q7889284 US 1,400
sunbelt-rentals-us Sunbelt Rentals Q7645154 US 1,100
loxam-fr Loxam Q6692217 FR 1,100
kiloutou-fr Kiloutou Q3197034 FR 600
fastenal-us Fastenal Q1394323 US 3,400
grainger-us Grainger Q904633 US 600
hilti-ch Hilti Q565285 CH 600
84-lumber-us 84 Lumber Q4641204 US 310
builders-firstsource-us Builders FirstSource Q4934620 US 570
kent-building-supplies-ca Kent Building Supplies Q6383907 CA 45

Run all at once after YAMLs are created:

python3 ingest-osm.py --chain \
  floor-decor-us topps-tiles-uk \
  united-rentals-us sunbelt-rentals-us loxam-fr kiloutou-fr \
  fastenal-us grainger-us hilti-ch \
  84-lumber-us builders-firstsource-us kent-building-supplies-ca

Note: fastenal-us and 84-lumber-us may return 0 records via wikidata — add name_query: "Fastenal" / name_query: "84 Lumber" fallback if needed.

[edit]Step 4 — Write ingest-osm-airports.py (commercial airport filter)

The existing Overture airport data (20,841 records) includes private airstrips, heliports, and military fields. This script replaces it with OSM-sourced commercial airports only.

Pattern: copy ingest-osm-civic.py, change the Overpass query to:

QUERY = """
[out:json][timeout:60];
(
  node["aeroway"="aerodrome"]
    ["aerodrome:type"~"^(public|international|regional|domestic)$"]
    ({bbox});
  node["aeroway"="aerodrome"]["iata"~"."]({bbox});
  way["aeroway"="aerodrome"]
    ["aerodrome:type"~"^(public|international|regional|domestic)$"]
    ({bbox});
  way["aeroway"="aerodrome"]["iata"~"."]({bbox});
);
out center;
"""

Output: $TOTEBOX_DATA_PATH/service-places/cleansed-civic-airports.jsonl

Schema: same as cleansed-places.jsonl with category_id: "airport", naics_code: "488119". Enrich with IATA code from the OSM iata tag where present.

Run per country using COUNTRY_BBOX from config.py (defined for all 17 display ISOs).

Expected output: ~5,000–8,000 records (vs. 20,841 from Overture).

[edit]Step 5 — Write ingest-osm-railway.py (intercity stations)

Output: $TOTEBOX_DATA_PATH/service-places/cleansed-civic-railway.jsonl

Overpass query per country:

QUERY = """
[out:json][timeout:60];
(
  node["railway"="station"]
    ["station"!="subway"]
    ["station"!="light_rail"]
    ["station"!="tram"]
    ["station"!="monorail"]
    ({bbox});
  way["railway"="station"]
    ["station"!="subway"]
    ["station"!="light_rail"]
    ["station"!="tram"]
    ["station"!="monorail"]
    ({bbox});
);
out center;
"""

Post-processing: filter to intercity operators only using the operator= tag:

INTERCITY_OPERATORS = {
    "US": ["Amtrak"],
    "CA": ["VIA Rail Canada", "Via Rail"],
    "FR": ["SNCF", "Société Nationale des Chemins de fer Français"],
    "DE": ["Deutsche Bahn", "DB", "DB Regio"],
    "ES": ["Renfe", "Renfe Operadora"],
    "IT": ["Trenitalia", "Italo", "Ferrovie dello Stato"],
    "AT": ["ÖBB", "Österreichische Bundesbahnen"],
    "NL": ["NS", "Nederlandse Spoorwegen"],
    "SE": ["SJ", "Norrtåg"],
    "DK": ["DSB", "Danske Statsbaner"],
    "NO": ["Vy", "NSB"],
    "FI": ["VR", "VR Group"],
    "PT": ["CP", "Comboios de Portugal"],
    "PL": ["PKP Intercity", "RegioJet"],
    "GB": None,  # All Network Rail TOCs qualify — no filtering needed
    "MX": None,  # No passenger rail — skip
    "IS": None,  # No passenger rail — skip
    "GR": ["TrainOSE"],
}

For countries where operator= filtering applies: only keep stations where the operator tag matches (case-insensitive substring). For GB: keep all. For MX and IS: skip entirely.

Schema: category_id: "railway_station", naics_code: "482111".

[edit]Step 6 — Add PKS car rental chain YAMLs

Create five YAMLs following the same schema as Step 3:

chain_id retailer wikidata_id notes
enterprise-us Enterprise Rent-A-Car Q2283517 Multi-country NA
hertz-us Hertz Q379425 Multi-country NA + EU
avis-us Avis Q849144 Multi-country
sixt-de Sixt Q704156 EU primary; some NA
europcar-fr Europcar Q466704 EU primary

naics_code: "532111" (Passenger Car Rental). brand_family: CarRental.

These all need multi_country: true in the YAML since they operate across multiple ISOs in the display list.

Run ingest:

python3 ingest-osm.py --chain enterprise-us hertz-us avis-us sixt-de europcar-fr

Expect high counts due to airport ubiquity: enterprise-us ~4,000–8,000 records; sixt-de ~500–800 EU records.

[edit]Step 7 — Production calibration and deployment (complete 2026-06-11)

test-cluster-archetypes.py was superseded by dedicated DBSCAN build scripts:

  • build-vwh-clusters.pyarchetype-vwh.geojson (6,368 features)
  • build-pks-clusters.pyarchetype-pks.geojson (6,953 features)

To re-run after adding new chains:

python3 build-vwh-clusters.py   # VWH — outputs work/archetype-vwh.geojson
python3 build-pks-clusters.py   # PKS — outputs work/archetype-pks.geojson

Copy outputs to the active deployment's data directory, then verify cluster counts match the production baseline (VWH ≥ 6,368 features; PKS ≥ 6,953 features).

[edit]See also

Category:How To
Last edited:
Edit this page · View source