216 changes: 216 additions & 0 deletions docs/tutorial/files/geojson_sedona_spark.md
@@ -0,0 +1,216 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Apache Sedona GeoJSON with Spark

This page shows how to read/write single-line GeoJSON files and multiline GeoJSON files with Apache Sedona and Spark.

This page concludes with a summary of the benefits and drawbacks of the GeoJSON file format for spatial analyses.

GeoJSON is based on JSON and supports the following geometry types:

* Point
* LineString
* Polygon
* MultiPoint
* MultiLineString
* MultiPolygon
* GeometryCollection

See [RFC 7946](https://datatracker.ietf.org/doc/html/rfc7946) for more details about the GeoJSON format specification.

## Read multiline GeoJSON files with Sedona and Spark

Here’s how to read a multiline GeoJSON file with Sedona:

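```python
from pyspark.sql.functions import expr

# `sedona` is the Sedona-enabled SparkSession
df = (
    sedona.read.format("geojson")
    .option("multiLine", "true")
    .load("data/multiline_geojson.json")
    # One row per feature instead of one row per FeatureCollection
    .selectExpr("explode(features) as features")
    .select("features.*")
    .withColumn("prop0", expr("properties['prop0']"))
    .drop("properties")
    .drop("type")
)
df.show(truncate=False)
```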

Here’s the output:

```
+---------------------------------------------+------+
|geometry |prop0 |
+---------------------------------------------+------+
|POINT (102 0.5) |value0|
|LINESTRING (102 0, 103 1, 104 0, 105 1) |value1|
|POLYGON ((100 0, 101 0, 101 1, 100 1, 100 0))|value2|
+---------------------------------------------+------+
```

The multiline GeoJSON file contains a point, a linestring, and a polygon. Let’s inspect the content of the file:

```json
{ "type": "FeatureCollection",
  "features": [
    { "type": "Feature",
      "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
      "properties": {"prop0": "value0"}
    },
    { "type": "Feature",
      "geometry": {
        "type": "LineString",
        "coordinates": [
          [102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [105.0, 1.0]
        ]
      },
      "properties": {
        "prop0": "value1",
        "prop1": 0.0
      }
    },
    { "type": "Feature",
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0],
            [100.0, 1.0], [100.0, 0.0] ]
        ]
      },
      "properties": {
        "prop0": "value2",
        "prop1": {"this": "that"}
      }
    }
  ]
}
```

Notice how the data is modeled as a `FeatureCollection`. Each feature has a geometry type, geometry coordinates, and properties.
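
To make the link between a feature’s `coordinates` and the WKT strings in the Sedona output more concrete, here is a minimal pure-Python sketch that uses only the standard library. The helper names `fmt`, `ring`, and `geometry_to_wkt` are hypothetical, the code only handles the three 2D geometry types found in this file, and it is an illustration rather than how Sedona parses GeoJSON internally.

```python
import json


def fmt(value):
    """Format a coordinate compactly: 102.0 -> '102', 0.5 -> '0.5'."""
    value = float(value)
    return str(int(value)) if value.is_integer() else repr(value)


def ring(coords):
    """Join a list of [x, y] positions into 'x y, x y, ...'."""
    return ", ".join(f"{fmt(x)} {fmt(y)}" for x, y in coords)


def geometry_to_wkt(geometry):
    """Convert a 2D Point, LineString, or Polygon GeoJSON geometry to WKT."""
    kind, coords = geometry["type"], geometry["coordinates"]
    if kind == "Point":
        return f"POINT ({fmt(coords[0])} {fmt(coords[1])})"
    if kind == "LineString":
        return f"LINESTRING ({ring(coords)})"
    if kind == "Polygon":
        # A polygon is a list of rings: the exterior first, then any holes
        rings = ", ".join(f"({ring(r)})" for r in coords)
        return f"POLYGON ({rings})"
    raise ValueError(f"unsupported geometry type: {kind}")


# The third feature from the file above
feature = json.loads(
    '{"type": "Feature",'
    ' "geometry": {"type": "Polygon", "coordinates":'
    ' [[[100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0]]]},'
    ' "properties": {"prop0": "value2"}}'
)
print(geometry_to_wkt(feature["geometry"]))
# POLYGON ((100 0, 101 0, 101 1, 100 1, 100 0))
```

The extra pair of parentheses in the polygon comes from the extra level of nesting in its `coordinates` array.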

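Multiline GeoJSON is nicely formatted for humans but inefficient for machines: the whole file is one JSON document, so a query engine can’t split it into chunks and read them in parallel. For large datasets, it’s better to store each feature on its own line.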

## Read single-line GeoJSON files with Sedona and Spark

Here’s how to read single-line GeoJSON files with Sedona:

```python
from pyspark.sql.functions import expr

df = (
    sedona.read.format("geojson")
    .load("data/singleline_geojson.json")
    .withColumn("prop0", expr("properties['prop0']"))
    .drop("properties")
    .drop("type")
)
df.show(truncate=False)
```

Here’s the result:

```
+---------------------------------------------+------+
|geometry |prop0 |
+---------------------------------------------+------+
|POINT (102 0.5) |value0|
|LINESTRING (102 0, 103 1, 104 0, 105 1) |value1|
|POLYGON ((100 0, 101 0, 101 1, 100 1, 100 0))|value2|
+---------------------------------------------+------+
```

Here’s the data:

```
{"type":"Feature","geometry":{"type":"Point","coordinates":[102.0,0.5]},"properties":{"prop0":"value0"}}
{"type":"Feature","geometry":{"type":"LineString","coordinates":[[102.0,0.0],[103.0,1.0],[104.0,0.0],[105.0,1.0]]},"properties":{"prop0":"value1"}}
{"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[100.0,0.0],[101.0,0.0],[101.0,1.0],[100.0,1.0],[100.0,0.0]]]},"properties":{"prop0":"value2"}}
```

Notice how the multiline GeoJSON file wraps all the features in a single `FeatureCollection`, whereas each row of the single-line GeoJSON file is a standalone `Feature`.

Single-line GeoJSON files are better for large datasets because query engines can split them and read the pieces in parallel.
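
If you only have a multiline `FeatureCollection`, converting it to the single-line layout is mostly a matter of writing each feature as compact JSON on its own line. The following is a minimal sketch using only Python’s standard library. The function name `feature_collection_to_lines` is hypothetical, and the whole file is loaded into memory, so this approach only suits small files.

```python
import json


def feature_collection_to_lines(text):
    """Return one compact JSON string per feature in a FeatureCollection."""
    collection = json.loads(text)
    if collection.get("type") != "FeatureCollection":
        raise ValueError("expected a GeoJSON FeatureCollection")
    # separators=(",", ":") drops the whitespace that json.dumps adds by default
    return [json.dumps(f, separators=(",", ":")) for f in collection["features"]]


multiline = """
{ "type": "FeatureCollection",
  "features": [
    { "type": "Feature",
      "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
      "properties": {"prop0": "value0"}
    }
  ]
}
"""

for line in feature_collection_to_lines(multiline):
    print(line)
# {"type":"Feature","geometry":{"type":"Point","coordinates":[102.0,0.5]},"properties":{"prop0":"value0"}}
```

Joining the returned strings with `"\n"` and writing them to a file gives the same layout as the single-line data shown above.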

Now, let's see how to create GeoJSON files with Sedona by writing out DataFrames.

## Write to GeoJSON with Sedona and Spark

Let’s create a Sedona DataFrame and then write it out to GeoJSON files:

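```python
from pyspark.sql.functions import col
from sedona.sql.st_constructors import ST_GeomFromText

df = sedona.createDataFrame([
    ("a", 'LINESTRING(2.0 5.0,6.0 1.0)'),
    ("b", 'LINESTRING(7.0 4.0,9.0 2.0)'),
    ("c", 'LINESTRING(1.0 3.0,3.0 1.0)'),
], ["id", "geometry"])
# Parse the WKT strings into geometry objects before writing
actual = df.withColumn("geometry", ST_GeomFromText(col("geometry")))
actual.write.format("geojson").mode("overwrite").save("/tmp/a_thing")
```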

Here are the files that get written:

```
a_thing/
  _SUCCESS
  part-00000-856044c5-ae35-4306-bf7a-ae9c3cb25434-c000.json
  part-00003-856044c5-ae35-4306-bf7a-ae9c3cb25434-c000.json
  part-00007-856044c5-ae35-4306-bf7a-ae9c3cb25434-c000.json
  part-00011-856044c5-ae35-4306-bf7a-ae9c3cb25434-c000.json

Sedona writes multiple GeoJSON files in parallel, which is faster than writing a single file.

Note that the DataFrame must contain a column named `geometry` for the write operation to work.

Now let’s read these GeoJSON files into a DataFrame:

```python
df = sedona.read.format("geojson").load("/tmp/a_thing")
df.show(truncate=False)
```

```
+---------------------+----------+-------+
|geometry |properties|type |
+---------------------+----------+-------+
|LINESTRING (1 3, 3 1)|{c} |Feature|
|LINESTRING (2 5, 6 1)|{a} |Feature|
|LINESTRING (7 4, 9 2)|{b} |Feature|
+---------------------+----------+-------+
```
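
Some tools outside of Spark expect a single `FeatureCollection` rather than a directory of part files. Here is a minimal standard-library sketch of one way to merge them. It assumes each part file holds one `Feature` per line with the non-geometry columns stored under `properties`, which is what the read-back output above suggests. The function name `merge_part_files` and the stand-in part files are hypothetical.

```python
import glob
import json
import os
import tempfile


def merge_part_files(directory):
    """Collect the features from every part-*.json file into one FeatureCollection."""
    features = []
    for path in sorted(glob.glob(os.path.join(directory, "part-*.json"))):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # skip blank lines
                    features.append(json.loads(line))
    return {"type": "FeatureCollection", "features": features}


# Demo with two stand-in part files (assumed layout, not real Sedona output)
samples = {
    "part-00000.json": {
        "type": "Feature",
        "geometry": {"type": "LineString", "coordinates": [[2.0, 5.0], [6.0, 1.0]]},
        "properties": {"id": "a"},
    },
    "part-00003.json": {
        "type": "Feature",
        "geometry": {"type": "LineString", "coordinates": [[7.0, 4.0], [9.0, 2.0]]},
        "properties": {"id": "b"},
    },
}

with tempfile.TemporaryDirectory() as tmp:
    for name, feature in samples.items():
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
            handle.write(json.dumps(feature) + "\n")
    # Marker files such as _SUCCESS are ignored by the part-*.json pattern
    open(os.path.join(tmp, "_SUCCESS"), "w").close()
    merged = merge_part_files(tmp)

print(len(merged["features"]))
```

Merging gives up the parallel-read benefit of multiple files, so this only makes sense for small outputs destined for tools that need one file.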

## Benefits of the GeoJSON file format

The GeoJSON file format has several advantages:

* It is human-readable.
* It can be written as multiple files, which allows faster I/O for parallel processing engines.
* Many engines support GeoJSON / JSON files.

However, GeoJSON also has several downsides that make it a suboptimal choice for storing geospatial data.

## Limitations of the GeoJSON file format

The GeoJSON format has many limitations that can make it a slow option for spatial data lakes:

* It’s a row-oriented file format, so performance optimizations like column pruning aren’t available (column-oriented file formats, like GeoParquet, can take advantage of this optimization).
* It does not store row-group metadata, so row-group filtering isn’t possible (row-group filtering is a Parquet performance optimization).
* GeoJSON files have no footer that stores the schema, so the schema must be written manually or inferred from the data.
* The GeoJSON specification requires a specific structure that can be rigid for certain types of datasets.
* You can build data lakes with GeoJSON files, but not data lakehouses: lakehouse table formats such as Iceberg store data in formats like Parquet, not GeoJSON.
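
The schema limitation is easy to see with the sample data from this page. Because nothing in the file declares the types, a reader has to scan every feature to learn which properties exist and what types they hold. The sketch below uses only the standard library and the hypothetical helper `infer_property_types`. It illustrates the idea and is not how Spark or Sedona infer schemas. Geometries are set to `null` for brevity, which RFC 7946 allows for unlocated features.

```python
import json

# Properties taken from the multiline example earlier on this page
lines = [
    '{"type":"Feature","geometry":null,"properties":{"prop0":"value0"}}',
    '{"type":"Feature","geometry":null,"properties":{"prop0":"value1","prop1":0.0}}',
    '{"type":"Feature","geometry":null,"properties":{"prop0":"value2","prop1":{"this":"that"}}}',
]


def infer_property_types(rows):
    """Scan every row and record the Python types seen for each property."""
    seen = {}
    for row in rows:
        for key, value in json.loads(row)["properties"].items():
            seen.setdefault(key, set()).add(type(value).__name__)
    return seen


schema = infer_property_types(lines)
for name in sorted(schema):
    print(name, sorted(schema[name]))
# prop0 ['str']
# prop1 ['dict', 'float']
```

Here `prop1` is a number in one feature and an object in another, so a reader has to pick a type that can hold both or be given an explicit schema. A format with a typed schema in its footer avoids both the full scan and the ambiguity.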

## Conclusion

GeoJSON is a common file format in spatial data analyses, and it’s convenient that Apache Sedona offers full read and write capabilities.

GeoJSON is well-supported and human-readable, but it’s slow compared to formats like GeoParquet. It’s generally best to use GeoParquet or Iceberg for spatial data analyses because they perform much better.
2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -60,6 +60,8 @@ nav:
- Spatial RDD app: tutorial/rdd.md
- Sedona R: api/rdocs
- Work with GeoPandas and Shapely: tutorial/geopandas-shapely.md
- Files:
- GeoJSON: tutorial/files/geojson_sedona_spark.md
- Map visualization SQL app:
- Scala/Java: tutorial/viz.md
- Use Apache Zeppelin: tutorial/zeppelin.md