[DOCS] Add geojson docs #1814
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,216 @@ | ||
| <!-- | ||
| Licensed to the Apache Software Foundation (ASF) under one | ||
| or more contributor license agreements. See the NOTICE file | ||
| distributed with this work for additional information | ||
| regarding copyright ownership. The ASF licenses this file | ||
| to you under the Apache License, Version 2.0 (the | ||
| "License"); you may not use this file except in compliance | ||
| with the License. You may obtain a copy of the License at | ||
|
|
||
| http://www.apache.org/licenses/LICENSE-2.0 | ||
|
|
||
| Unless required by applicable law or agreed to in writing, | ||
| software distributed under the License is distributed on an | ||
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| KIND, either express or implied. See the License for the | ||
| specific language governing permissions and limitations | ||
| under the License. | ||
| --> | ||
|
|
||
| # Apache Sedona GeoJSON with Spark | ||
|
|
||
| This page shows how to read/write single-line GeoJSON files and multiline GeoJSON files with Apache Sedona and Spark. | ||
|
|
||
| The page concludes with a summary of the benefits and drawbacks of the GeoJSON file format for spatial analyses. | |
|
|
||
| GeoJSON is based on JSON and supports the following geometry types: | |
|
|
||
| * Point | |
| * LineString | |
| * Polygon | |
| * MultiPoint | |
| * MultiLineString | |
| * MultiPolygon | |
| * GeometryCollection | |
|
|
||
| See the [GeoJSON format specification (RFC 7946)](https://datatracker.ietf.org/doc/html/rfc7946) for more details. | |
|
|
||
| ## Read multiline GeoJSON files with Sedona and Spark | ||
|
|
||
| Here’s how to read a multiline GeoJSON file with Sedona: | ||
|
|
||
| ```python | |
| from pyspark.sql.functions import expr | |
| # Read the FeatureCollection, then flatten it to one row per feature | |
| df = ( | |
|     sedona.read.format("geojson") | |
|     .option("multiLine", "true") | |
|     .load("data/multiline_geojson.json") | |
|     .selectExpr("explode(features) as features") | |
|     .select("features.*") | |
|     .withColumn("prop0", expr("properties['prop0']")) | |
|     .drop("properties") | |
|     .drop("type") | |
| ) | |
| df.show(truncate=False) | |
| ``` | |
|
|
||
| Here’s the output: | ||
|
|
||
| ``` | ||
| +---------------------------------------------+------+ | ||
| |geometry |prop0 | | ||
| +---------------------------------------------+------+ | ||
| |POINT (102 0.5) |value0| | ||
| |LINESTRING (102 0, 103 1, 104 0, 105 1) |value1| | ||
| |POLYGON ((100 0, 101 0, 101 1, 100 1, 100 0))|value2| | ||
| +---------------------------------------------+------+ | ||
| ``` | ||
|
|
||
| The multiline GeoJSON file contains a point, a linestring, and a polygon. Let’s inspect the content of the file: | ||
|
|
||
| ```json | ||
| { "type": "FeatureCollection", | ||
| "features": [ | ||
| { "type": "Feature", | ||
| "geometry": {"type": "Point", "coordinates": [102.0, 0.5]}, | ||
| "properties": {"prop0": "value0"} | ||
| }, | ||
| { "type": "Feature", | ||
| "geometry": { | ||
| "type": "LineString", | ||
| "coordinates": [ | ||
| [102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [105.0, 1.0] | ||
| ] | ||
| }, | ||
| "properties": { | ||
| "prop0": "value1", | ||
| "prop1": 0.0 | ||
| } | ||
| }, | ||
| { "type": "Feature", | ||
| "geometry": { | ||
| "type": "Polygon", | ||
| "coordinates": [ | ||
| [ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0], | ||
| [100.0, 1.0], [100.0, 0.0] ] | ||
| ] | ||
| }, | ||
| "properties": { | ||
| "prop0": "value2", | ||
| "prop1": {"this": "that"} | ||
| } | ||
| } | ||
| ] | ||
| } | ||
| ``` | ||
|
|
||
| Notice how the data is modeled as a `FeatureCollection`. Each feature has a geometry type, geometry coordinates, and properties. | ||
|
|
||
| Multiline GeoJSON is nicely formatted for humans but inefficient for machines to process at scale. It’s better to store each feature as a JSON object on its own line. | |
|
|
||
| ## Read single-line GeoJSON files with Sedona and Spark | ||
|
|
||
| Here’s how to read single-line GeoJSON files with Sedona: | ||
|
|
||
| ```python | ||
| df = ( | ||
| sedona.read.format("geojson") | ||
| .load("data/singleline_geojson.json") | ||
| .withColumn("prop0", expr("properties['prop0']")) | ||
| .drop("properties") | ||
| .drop("type") | ||
| ) | ||
| df.show(truncate=False) | ||
| ``` | ||
|
|
||
| Here’s the result: | ||
|
|
||
| ``` | ||
| +---------------------------------------------+------+ | ||
| |geometry |prop0 | | ||
| +---------------------------------------------+------+ | ||
| |POINT (102 0.5) |value0| | ||
| |LINESTRING (102 0, 103 1, 104 0, 105 1) |value1| | ||
| |POLYGON ((100 0, 101 0, 101 1, 100 1, 100 0))|value2| | ||
| +---------------------------------------------+------+ | ||
| ``` | ||
|
|
||
| Here’s the data: | ||
|
|
||
| ``` | ||
| {"type":"Feature","geometry":{"type":"Point","coordinates":[102.0,0.5]},"properties":{"prop0":"value0"}} | ||
| {"type":"Feature","geometry":{"type":"LineString","coordinates":[[102.0,0.0],[103.0,1.0],[104.0,0.0],[105.0,1.0]]},"properties":{"prop0":"value1"}} | ||
| {"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[100.0,0.0],[101.0,0.0],[101.0,1.0],[100.0,1.0],[100.0,0.0]]]},"properties":{"prop0":"value2"}} | ||
| ``` | ||
|
|
||
| Notice how the multiline GeoJSON file wraps all features in a single `FeatureCollection`, whereas each row of the single-line GeoJSON file is a separate `Feature`. | |
|
|
||
| Single-line GeoJSON files are better suited to large datasets because query engines can split them and read the pieces in parallel. | |
|
|
||
| Now, let's see how to create GeoJSON files with Sedona by writing out DataFrames. | ||
|
|
||
| ## Write to GeoJSON with Sedona and Spark | ||
|
|
||
| Let’s create a Sedona DataFrame and then write it out to GeoJSON files: | ||
|
|
||
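| ```python | |
| from pyspark.sql.functions import col | |
| # Import path may differ across Sedona versions | |
| from sedona.sql.st_constructors import ST_GeomFromText | |
| df = sedona.createDataFrame([ | |
|     ("a", 'LINESTRING(2.0 5.0,6.0 1.0)'), | |
|     ("b", 'LINESTRING(7.0 4.0,9.0 2.0)'), | |
|     ("c", 'LINESTRING(1.0 3.0,3.0 1.0)'), | |
| ], ["id", "geometry"]) | |
| # Parse the WKT strings into a geometry column before writing | |
| actual = df.withColumn("geometry", ST_GeomFromText(col("geometry"))) | |
| actual.write.format("geojson").mode("overwrite").save("/tmp/a_thing") | |
| ``` | |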
|
|
||
| Here are the files that get written: | ||
|
|
||
| ``` | ||
| a_thing/ | ||
| _SUCCESS | ||
| part-00000-856044c5-ae35-4306-bf7a-ae9c3cb25434-c000.json | ||
| part-00003-856044c5-ae35-4306-bf7a-ae9c3cb25434-c000.json | ||
| part-00007-856044c5-ae35-4306-bf7a-ae9c3cb25434-c000.json | ||
| part-00011-856044c5-ae35-4306-bf7a-ae9c3cb25434-c000.json | ||
| ``` | ||
|
|
||
| Sedona writes multiple GeoJSON files in parallel, which is faster than writing a single file. | ||
|
|
||
| Note that the DataFrame must contain a column named `geometry` for the write operation to work. | |
|
|
||
| Now let’s read these GeoJSON files into a DataFrame: | ||
|
|
||
| ```python | ||
| df = sedona.read.format("geojson").load("/tmp/a_thing") | ||
| df.show(truncate=False) | ||
| ``` | ||
|
|
||
| ``` | ||
| +---------------------+----------+-------+ | ||
| |geometry |properties|type | | ||
| +---------------------+----------+-------+ | ||
| |LINESTRING (1 3, 3 1)|{c} |Feature| | ||
| |LINESTRING (2 5, 6 1)|{a} |Feature| | ||
| |LINESTRING (7 4, 9 2)|{b} |Feature| | ||
| +---------------------+----------+-------+ | ||
| ``` | ||
|
|
||
| ## Benefits of the GeoJSON file format | ||
|
|
||
| The GeoJSON file format has several advantages: | |
|
|
||
| * It is human-readable. | |
| * It can be output in multiple files, which allows for faster I/O for parallel processing engines. | |
| * Many engines support GeoJSON / JSON files. | |
|
|
||
| However, GeoJSON has many downsides, making it a suboptimal choice for storing geospatial data. | ||
|
|
||
| ## Limitations of the GeoJSON file format | ||
|
|
||
| The GeoJSON format has many limitations that can make it a slow option for spatial data lakes: | ||
|
|
||
| * It’s a row-oriented file format, so performance optimizations like column pruning aren’t available (column-oriented file formats, like GeoParquet, can take advantage of this optimization). | ||
| * It does not store metadata information on row groups, so row-group filtering isn’t possible (row-group filtering is a Parquet performance optimization). | ||
| * The files don’t embed a schema (Parquet stores one in the file footer), so the schema needs to be manually written or inferred. | |
|
|
||
| * The GeoJSON specification requires a specific structure that can be rigid for certain types of datasets. | ||
| * You can only build GeoJSON data lakes. You can’t use GeoJSON to build data lakehouses. | ||
|
|
||
| ## Conclusion | ||
|
|
||
| GeoJSON is a common file format in spatial data analyses, and it’s convenient that Apache Sedona offers full read and write capabilities. | ||
|
|
||
| GeoJSON is well-supported and human-readable, but it’s pretty slow compared to formats like GeoParquet. It’s generally best to use GeoParquet or Iceberg for spatial data analyses because the performance is much better. | ||
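
To make the difference between the multiline and single-line layouts concrete, here is a minimal sketch that converts a `FeatureCollection` into one `Feature` per line. It uses only the Python standard library and is not part of the Sedona API. The function name `feature_collection_to_lines` and the shortened sample data are assumptions made for illustration.

```python
import json


def feature_collection_to_lines(collection_text: str) -> str:
    """Convert a GeoJSON FeatureCollection into one Feature per line."""
    collection = json.loads(collection_text)
    if collection.get("type") != "FeatureCollection":
        raise ValueError("expected a GeoJSON FeatureCollection")
    # Compact separators keep each feature on one line with no extra whitespace.
    return "\n".join(
        json.dumps(feature, separators=(",", ":"))
        for feature in collection["features"]
    )


# Shortened, hypothetical sample modeled on the multiline file shown earlier.
multiline = """
{ "type": "FeatureCollection",
  "features": [
    { "type": "Feature",
      "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
      "properties": {"prop0": "value0"}
    },
    { "type": "Feature",
      "geometry": {
        "type": "LineString",
        "coordinates": [[102.0, 0.0], [103.0, 1.0]]
      },
      "properties": {"prop0": "value1"}
    }
  ]
}
"""

single_line = feature_collection_to_lines(multiline)
print(single_line)
```

Each output line is a complete JSON object, so a reader can start parsing at any line boundary. For files too large to fit in memory, it is likely simpler to read the multiline file with Sedona and write it back out with the GeoJSON writer.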