Location Intelligence: Data Collection
TopicFrom the PointSav Documentation
Status (2026-06-11): All seven steps below are complete. VWH production build: 6,368 clusters (T1=852/T2=1,327/T3=4,189). PKS production build: 6,953 clusters (T1=691/T2=2,658/T3=3,604). This guide is retained as a reference runbook for re-ingesting or extending to new chains and countries.
Initial pipeline setup for ingesting VWH (Vertical Warehouse) and PKS (Parking Structures) chain and infrastructure data. Steps are recorded in execution order.
Working directory for all commands: app-orchestration-gis/ (inside the GIS monorepo clone).
[edit]Prerequisites
- Overpass API access (queries run via
ingest-osm.py; no API key required) - Python 3.11+ with pipeline dependencies installed
TOTEBOX_DATA_PATHpointing to the active deployment data directory
Verify the pipeline is clean:
python3 -c "from taxonomy import CATEGORIES, BRAND_FILL; print('taxonomy OK')"
python3 -c "from config import TOTEBOX_DATA_PATH; print('config OK')"
[edit]Step 1 — Run existing YAML ingests (VWH auto-parts + paint)
Five chain YAMLs were scaffolded on 2026-06-01. Run the ingest to download OSM records:
python3 ingest-osm.py --chain \
autozone-us \
oreilley-auto-us \
napa-us \
sherwin-williams-us \
halfords-uk
Expected output files in $TOTEBOX_DATA_PATH/service-fs/service-business/:
autozone-us.jsonl— expect 5,000–7,000 recordsoreilley-auto-us.jsonl— expect 5,000–7,000 recordsnapa-us.jsonl— expect 3,000–6,000 records (franchise network; OSM partial)sherwin-williams-us.jsonl— expect 3,000–5,000 recordshalfords-uk.jsonl— expect 300–450 records
Verify:
for f in autozone-us oreilley-auto-us napa-us sherwin-williams-us halfords-uk; do
echo "$f: $(wc -l < $TOTEBOX_DATA_PATH/service-fs/service-business/$f.jsonl) records"
done
If a chain returns 0 records, check if Wikidata tag coverage is sparse in OSM and add
name_query: fallback to the YAML.
[edit]Step 2 — Add Würth (biggest EU MRO gap)
Würth is the single highest-value chain not yet in the taxonomy. ~1,500 EU branches in
industrial parks across DE/FR/IT/PL/AT/NL and beyond. OSM has brand:wikidata=Q183759
applied to many records.
2a. Create YAML:
cat > $TOTEBOX_DATA_PATH/service-fs/service-business/wurth-de.yaml << 'EOF'
schema: service-business-chain-v1
chain_id: wurth-de
country: Germany
country_code: DE
region: europe-central
category: Industrial MRO supply
category_slug: mro-industrial
naics_code: "423840"
top_category: Industrial and Personal Service Paper and Related Products
sub_category: Industrial MRO Supply
overture_taxonomy:
- industrial
- wholesale
brand_family: MROIndustrial
retailer: Würth
canonical_name: "Adolf Würth GmbH & Co. KG"
parent_company: "Würth Group"
website: wuerth.de
wikidata_id: Q183759
osm_overpass_tag: brand:wikidata=Q183759
store_count_approx: 1500
locations_file: locations/wurth-de.jsonl
locations_status: pending
last_updated: 2026-06-01
notes: "VWH archetype signal. MRO distributor; branches in industrial parks across EU. Multi-country — set country_code per-ISO ingest or use multi_country: true."
EOF
2b. Add mro_industrial category to taxonomy.py:
In taxonomy.py, inside the CATEGORIES dict, after the paint block:
"mro_industrial": {
"label": "Industrial MRO Supply",
"naics": "423840",
"description": "Maintenance, repair, and operations distributor — VWH signal. Never gates tier.",
},
"flooring": {
"label": "Flooring & Tile Supply",
"naics": "442210",
"description": "Contractor-facing flooring/tile warehouse — VWH signal. Never gates tier.",
},
"tool_rental": {
"label": "Tool & Equipment Rental",
"naics": "532412",
"description": "Equipment rental branch — VWH signal; deliberate hardware co-location. Never gates tier.",
},
"lumber": {
"label": "Lumber & Building Materials",
"naics": "444190",
"description": "Lumber yard or building materials dealer — VWH signal. Never gates tier.",
},
"car_rental": {
"label": "Car Rental",
"naics": "532111",
"description": "Car rental branch — PKS signal; defines transit node commercial zone. Never gates tier.",
},
None of these go into _RETAIL_CATS — they never affect T1/T2/T3 tier logic.
2c. Add BRAND_FILL entries for all new categories:
In the BRAND_FILL dict, after the existing paint block:
"mro_industrial": {
"DE": ["wurth-de"],
"FR": ["wurth-de"], # Würth is multi-country; chain_id shared
"IT": ["wurth-de"], "ES": ["wurth-de"], "PL": ["wurth-de"],
"AT": ["wurth-de"], "NL": ["wurth-de"], "GB": ["wurth-de"],
"US": ["fastenal-us", "grainger-us"],
"CA": [], "MX": [],
"SE": ["wurth-de"], "DK": ["wurth-de"], "NO": ["wurth-de"],
"FI": ["wurth-de"], "IS": [], "GR": [], "PT": [],
},
"flooring": {
"US": ["floor-decor-us"],
"GB": ["topps-tiles-uk"],
"CA": [], "MX": [], "FR": [], "DE": [], "ES": [], "IT": [],
"GR": [], "PL": [], "AT": [], "NL": [], "PT": [],
"SE": [], "DK": [], "NO": [], "FI": [], "IS": [],
},
"tool_rental": {
"US": ["united-rentals-us", "sunbelt-rentals-us"],
"CA": ["united-rentals-us"],
"FR": ["loxam-fr", "kiloutou-fr"],
"MX": [], "GB": [], "DE": [], "ES": [], "IT": [],
"GR": [], "PL": [], "AT": [], "NL": [], "PT": [],
"SE": [], "DK": [], "NO": [], "FI": [], "IS": [],
},
"lumber": {
"US": ["84-lumber-us", "builders-firstsource-us"],
"CA": ["kent-building-supplies-ca"],
"MX": [], "GB": [], "FR": [], "DE": [], "ES": [], "IT": [],
"GR": [], "PL": [], "AT": [], "NL": [], "PT": [],
"SE": [], "DK": [], "NO": [], "FI": [], "IS": [],
},
"car_rental": {
"US": ["enterprise-us", "hertz-us", "avis-us"],
"CA": ["enterprise-us", "hertz-us"],
"MX": [],
"DE": ["sixt-de"], "FR": ["europcar-fr", "sixt-de"],
"GB": ["europcar-fr", "sixt-de"], "ES": ["europcar-fr"],
"IT": ["europcar-fr"], "NL": ["europcar-fr"],
"GR": [], "PL": [], "AT": ["sixt-de"], "PT": [],
"SE": ["sixt-de"], "DK": ["sixt-de"], "NO": [], "FI": [], "IS": [],
},
2d. Run Würth ingest:
python3 ingest-osm.py --chain wurth-de
Expect 800–1,500 records. Würth branches are often tagged shop=trade or shop=wholesale
in OSM — the brand:wikidata tag is the reliable signal.
[edit]Step 3 — Add Tier A retail/rental VWH chains
Create YAMLs following the schema below, then run ingest.
YAML schema reference:
schema: service-business-chain-v1
chain_id: floor-decor-us # matches BRAND_FILL key
country: United States
country_code: US
region: north-america
category: Flooring supply # display only
category_slug: flooring-supply
naics_code: "442210"
top_category: Floor Covering Stores
sub_category: Floor Covering Stores
brand_family: Flooring # matches category
retailer: Floor & Decor
canonical_name: "Floor & Decor Holdings, Inc."
parent_company: "Floor & Decor Holdings, Inc. (public; NYSE: FND)"
website: flooranddecor.com
wikidata_id: Q22350998
osm_overpass_tag: brand:wikidata=Q22350998
store_count_approx: 240
locations_file: locations/floor-decor-us.jsonl
locations_status: pending
last_updated: 2026-06-01
notes: "VWH Tier A. Warehouse-format contractor flooring; same footprint as Home Depot."
Chains to create (copy and adapt schema above):
| chain_id | retailer | wikidata_id | country_code | approx_count |
|---|---|---|---|---|
floor-decor-us |
Floor & Decor | Q22350998 | US | 240 |
topps-tiles-uk |
Topps Tiles | Q7825827 | GB | 300 |
united-rentals-us |
United Rentals | Q7889284 | US | 1,400 |
sunbelt-rentals-us |
Sunbelt Rentals | Q7645154 | US | 1,100 |
loxam-fr |
Loxam | Q6692217 | FR | 1,100 |
kiloutou-fr |
Kiloutou | Q3197034 | FR | 600 |
fastenal-us |
Fastenal | Q1394323 | US | 3,400 |
grainger-us |
Grainger | Q904633 | US | 600 |
hilti-ch |
Hilti | Q565285 | CH | 600 |
84-lumber-us |
84 Lumber | Q4641204 | US | 310 |
builders-firstsource-us |
Builders FirstSource | Q4934620 | US | 570 |
kent-building-supplies-ca |
Kent Building Supplies | Q6383907 | CA | 45 |
Run all at once after YAMLs are created:
python3 ingest-osm.py --chain \
floor-decor-us topps-tiles-uk \
united-rentals-us sunbelt-rentals-us loxam-fr kiloutou-fr \
fastenal-us grainger-us hilti-ch \
84-lumber-us builders-firstsource-us kent-building-supplies-ca
Note: fastenal-us and 84-lumber-us may return 0 records via wikidata — add
name_query: "Fastenal" / name_query: "84 Lumber" fallback if needed.
[edit]Step 4 — Write ingest-osm-airports.py (commercial airport filter)
The existing Overture airport data (20,841 records) includes private airstrips, heliports, and military fields. This script replaces it with OSM-sourced commercial airports only.
Pattern: copy ingest-osm-civic.py, change the Overpass query to:
QUERY = """
[out:json][timeout:60];
(
node["aeroway"="aerodrome"]
["aerodrome:type"~"^(public|international|regional|domestic)$"]
({bbox});
node["aeroway"="aerodrome"]["iata"~"."]({bbox});
way["aeroway"="aerodrome"]
["aerodrome:type"~"^(public|international|regional|domestic)$"]
({bbox});
way["aeroway"="aerodrome"]["iata"~"."]({bbox});
);
out center;
"""
Output: $TOTEBOX_DATA_PATH/service-places/cleansed-civic-airports.jsonl
Schema: same as cleansed-places.jsonl with category_id: "airport", naics_code: "488119".
Enrich with IATA code from the OSM iata tag where present.
Run per country using COUNTRY_BBOX from config.py (defined for all 17 display ISOs).
Expected output: ~5,000–8,000 records (vs. 20,841 from Overture).
[edit]Step 5 — Write ingest-osm-railway.py (intercity stations)
Output: $TOTEBOX_DATA_PATH/service-places/cleansed-civic-railway.jsonl
Overpass query per country:
QUERY = """
[out:json][timeout:60];
(
node["railway"="station"]
["station"!="subway"]
["station"!="light_rail"]
["station"!="tram"]
["station"!="monorail"]
({bbox});
way["railway"="station"]
["station"!="subway"]
["station"!="light_rail"]
["station"!="tram"]
["station"!="monorail"]
({bbox});
);
out center;
"""
Post-processing: filter to intercity operators only using the operator= tag:
INTERCITY_OPERATORS = {
"US": ["Amtrak"],
"CA": ["VIA Rail Canada", "Via Rail"],
"FR": ["SNCF", "Société Nationale des Chemins de fer Français"],
"DE": ["Deutsche Bahn", "DB", "DB Regio"],
"ES": ["Renfe", "Renfe Operadora"],
"IT": ["Trenitalia", "Italo", "Ferrovie dello Stato"],
"AT": ["ÖBB", "Österreichische Bundesbahnen"],
"NL": ["NS", "Nederlandse Spoorwegen"],
"SE": ["SJ", "Norrtåg"],
"DK": ["DSB", "Danske Statsbaner"],
"NO": ["Vy", "NSB"],
"FI": ["VR", "VR Group"],
"PT": ["CP", "Comboios de Portugal"],
"PL": ["PKP Intercity", "RegioJet"],
"GB": None, # All Network Rail TOCs qualify — no filtering needed
"MX": None, # No passenger rail — skip
"IS": None, # No passenger rail — skip
"GR": ["TrainOSE"],
}
For countries where operator= filtering applies: only keep stations where the operator
tag matches (case-insensitive substring). For GB: keep all. For MX and IS: skip entirely.
Schema: category_id: "railway_station", naics_code: "482111".
[edit]Step 6 — Add PKS car rental chain YAMLs
Create five YAMLs following the same schema as Step 3:
| chain_id | retailer | wikidata_id | notes |
|---|---|---|---|
enterprise-us |
Enterprise Rent-A-Car | Q2283517 | Multi-country NA |
hertz-us |
Hertz | Q379425 | Multi-country NA + EU |
avis-us |
Avis | Q849144 | Multi-country |
sixt-de |
Sixt | Q704156 | EU primary; some NA |
europcar-fr |
Europcar | Q466704 | EU primary |
naics_code: "532111" (Passenger Car Rental). brand_family: CarRental.
These all need multi_country: true in the YAML since they operate across multiple ISOs
in the display list.
Run ingest:
python3 ingest-osm.py --chain enterprise-us hertz-us avis-us sixt-de europcar-fr
Expect high counts due to airport ubiquity: enterprise-us ~4,000–8,000 records;
sixt-de ~500–800 EU records.
[edit]Step 7 — Production calibration and deployment (complete 2026-06-11)
test-cluster-archetypes.py was superseded by dedicated DBSCAN build scripts:
build-vwh-clusters.py→archetype-vwh.geojson(6,368 features)build-pks-clusters.py→archetype-pks.geojson(6,953 features)
To re-run after adding new chains:
python3 build-vwh-clusters.py # VWH — outputs work/archetype-vwh.geojson
python3 build-pks-clusters.py # PKS — outputs work/archetype-pks.geojson
Copy outputs to the active deployment's data directory, then verify cluster counts match the production baseline (VWH ≥ 6,368 features; PKS ≥ 6,953 features).
[edit]See also
- vertical-warehouse — the VWH archetype model and chain taxonomy
- parking-structures — the PKS archetype model and chain taxonomy
- location-intelligence-archetypes — PRO/VWH/PKS overview and map integration
- connect-osm-data-pipeline — generic single-chain ingest for new retail categories