Diff: how-to/connect-osm-data-pipeline
From 1c02ec1 to 1c02ec1
+0 / −0 lines
| Before | After |
|---|---|
| --- | --- |
| schema: foundry-doc-v1 | schema: foundry-doc-v1 |
| title: "How to connect to the OSM data pipeline" | title: "How to connect to the OSM data pipeline" |
| slug: connect-osm-data-pipeline | slug: connect-osm-data-pipeline |
| category: how-to | category: how-to |
| content_type: how-to | content_type: how-to |
| type: how-to | type: how-to |
| status: active | status: active |
| last_edited: 2026-06-14 | last_edited: 2026-06-14 |
| editor: pointsav-engineering | editor: pointsav-engineering |
| paired_with: connect-osm-data-pipeline.es.md | paired_with: connect-osm-data-pipeline.es.md |
| --- | --- |
| The platform's location intelligence system ingests point-of-interest (POI) data from OpenStreetMap via JSONL ingest files. Connecting to the OSM data pipeline means writing or adapting an ingest script that queries the Overpass API, producing a JSONL file in the platform's schema, and registering the ingest with the taxonomy configuration. This guide covers a single-chain ingest for a new retail or service category. | The platform's location intelligence system ingests point-of-interest (POI) data from OpenStreetMap via JSONL ingest files. Connecting to the OSM data pipeline means writing or adapting an ingest script that queries the Overpass API, producing a JSONL file in the platform's schema, and registering the ingest with the taxonomy configuration. This guide covers a single-chain ingest for a new retail or service category. |
| For the GIS engine architecture, see [[pointsav-gis-engine]]. For building a map from ingested cluster data, see [[build-a-colocation-map]]. | For the GIS engine architecture, see [[pointsav-gis-engine]]. For building a map from ingested cluster data, see [[build-a-colocation-map]]. |
| ## Prerequisites | ## Prerequisites |
| - Access to the `app-orchestration-gis` working directory (the pipeline scripts) | - Access to the `app-orchestration-gis` working directory (the pipeline scripts) |
| - Python 3.9+ with `requests` available | - Python 3.9+ with `requests` available |
| - Network access to the Overpass API (`overpass-api.de` or a local mirror) | - Network access to the Overpass API (`overpass-api.de` or a local mirror) |
| - A Wikidata Q-ID for the chain or category being ingested (look up at `wikidata.org`) | - A Wikidata Q-ID for the chain or category being ingested (look up at `wikidata.org`) |
| ## Step 1: Identify the Wikidata Q-ID | ## Step 1: Identify the Wikidata Q-ID |
| Every chain in the taxonomy is anchored to a Wikidata Q-ID. This provides a stable, language-neutral identifier for the entity. Look up the chain on Wikidata and record the Q-ID (e.g., Walmart: Q483551, IKEA: Q54078). | Every chain in the taxonomy is anchored to a Wikidata Q-ID. This provides a stable, language-neutral identifier for the entity. Look up the chain on Wikidata and record the Q-ID (e.g., Walmart: Q483551, IKEA: Q54078). |
| If the category has no single Wikidata entry, use a name-based query (`name_query` mode) rather than a Q-ID lookup. | If the category has no single Wikidata entry, use a name-based query (`name_query` mode) rather than a Q-ID lookup. |
| ## Step 2: Write the ingest YAML | ## Step 2: Write the ingest YAML |
| Create an ingest YAML file under `service-business/` named `<chain-name>-<country-code>.yaml`: | Create an ingest YAML file under `service-business/` named `<chain-name>-<country-code>.yaml`: |
| ```yaml | ```yaml |
| chain: walmart-us | chain: walmart-us |
| wikidata_id: Q483551 | wikidata_id: Q483551 |
| query_mode: wikidata # or: name_query | query_mode: wikidata # or: name_query |
| name_query: null # used only when query_mode: name_query | name_query: null # used only when query_mode: name_query |
| country_code: US | country_code: US |
| bbox: [-125.0, 24.4, -66.9, 49.4] # bounding box for the country | bbox: [-125.0, 24.4, -66.9, 49.4] # bounding box for the country |
| output: service-business/walmart-us.jsonl | output: service-business/walmart-us.jsonl |
| taxonomy_family: ALPHA_HYPERMARKET | taxonomy_family: ALPHA_HYPERMARKET |
| taxonomy_tier: 1 | taxonomy_tier: 1 |
| ``` | ``` |
| For `name_query` mode (when Wikidata coverage is sparse), set `query_mode: name_query` and provide `name_query: "Walmart"`. The ingest script performs a free-text name search in the Overpass API. | For `name_query` mode (when Wikidata coverage is sparse), set `query_mode: name_query` and provide `name_query: "Walmart"`. The ingest script performs a free-text name search in the Overpass API. |
| ## Step 3: Run the ingest script | ## Step 3: Run the ingest script |
| Run the existing ingest script with the new YAML: | Run the existing ingest script with the new YAML: |
| ``` | ``` |
| python3 app-orchestration-gis/ingest-chain.py service-business/walmart-us.yaml | python3 app-orchestration-gis/ingest-chain.py service-business/walmart-us.yaml |
| ``` | ``` |
| The script queries the Overpass API, filters results by the bounding box and country code, and writes JSONL records to `service-business/walmart-us.jsonl`. Each record contains: `name`, `lat`, `lon`, `wikidata_id`, `chain`, `country`, `taxonomy_family`, `taxonomy_tier`. | The script queries the Overpass API, filters results by the bounding box and country code, and writes JSONL records to `service-business/walmart-us.jsonl`. Each record contains: `name`, `lat`, `lon`, `wikidata_id`, `chain`, `country`, `taxonomy_family`, `taxonomy_tier`. |
| Typical record counts: dense urban chains produce 500–2,000 records; national hypermarket chains produce 100–500; specialty retailers produce 50–200. | Typical record counts: dense urban chains produce 500–2,000 records; national hypermarket chains produce 100–500; specialty retailers produce 50–200. |
| ## Step 4: Register the chain in the taxonomy | ## Step 4: Register the chain in the taxonomy |
| Add the new chain to the taxonomy configuration in `app-orchestration-gis/taxonomy.py` under the appropriate family group: | Add the new chain to the taxonomy configuration in `app-orchestration-gis/taxonomy.py` under the appropriate family group: |
| ```python | ```python |
| "walmart-us": TaxonomyEntry( | "walmart-us": TaxonomyEntry( |
| family="ALPHA_HYPERMARKET", | family="ALPHA_HYPERMARKET", |
| tier=1, | tier=1, |
| jsonl_path="service-business/walmart-us.jsonl", | jsonl_path="service-business/walmart-us.jsonl", |
| wikidata_id="Q483551", | wikidata_id="Q483551", |
| ), | ), |
| ``` | ``` |
| ## Step 5: Rebuild the cluster layer | ## Step 5: Rebuild the cluster layer |
| After registration, rebuild the cluster layer to incorporate the new POI data: | After registration, rebuild the cluster layer to incorporate the new POI data: |
| ``` | ``` |
| python3 app-orchestration-gis/build-geometric-ranking.py | python3 app-orchestration-gis/build-geometric-ranking.py |
| ``` | ``` |
| The rebuild reads all registered JSONL files, runs the DBSCAN clustering pass, and regenerates `clusters-meta.json`. Verify the new chain appears in the cluster output: | The rebuild reads all registered JSONL files, runs the DBSCAN clustering pass, and regenerates `clusters-meta.json`. Verify the new chain appears in the cluster output: |
| ``` | ``` |
| python3 -c "import json; d=json.load(open('gateway/www/data/clusters-meta.json')); print(sum(1 for c in d['clusters'] if 'walmart' in str(c)))" | python3 -c "import json; d=json.load(open('gateway/www/data/clusters-meta.json')); print(sum(1 for c in d['clusters'] if 'walmart' in str(c)))" |
| ``` | ``` |
| ## Key takeaways | ## Key takeaways |
| - Every chain requires a YAML ingest descriptor and a JSONL output file in `service-business/` | - Every chain requires a YAML ingest descriptor and a JSONL output file in `service-business/` |
| - Wikidata Q-IDs are preferred over name queries; fall back to name queries only when Wikidata coverage is absent | - Wikidata Q-IDs are preferred over name queries; fall back to name queries only when Wikidata coverage is absent |
| - The taxonomy registration step links the JSONL file to the clustering pipeline | - The taxonomy registration step links the JSONL file to the clustering pipeline |
| - A full cluster rebuild is required after adding a new chain — incremental updates are not supported in the current pipeline | - A full cluster rebuild is required after adding a new chain — incremental updates are not supported in the current pipeline |
| ## See also | ## See also |
| - [[pointsav-gis-engine]] — the GIS engine architecture and the DBSCAN clustering pipeline | - [[pointsav-gis-engine]] — the GIS engine architecture and the DBSCAN clustering pipeline |
| - [[build-a-colocation-map]] — how to surface cluster data in a MapLibre web application | - [[build-a-colocation-map]] — how to surface cluster data in a MapLibre web application |
| - [[location-intelligence-archetypes]] — the PRO/VWH/PKS archetype model that the taxonomy feeds | - [[location-intelligence-archetypes]] — the PRO/VWH/PKS archetype model that the taxonomy feeds |
| - [[export-structured-data]] — exporting the resulting GeoJSON for external use | - [[export-structured-data]] — exporting the resulting GeoJSON for external use |