[SEDONA-714] Add geopandas to spark arrow conversion. #1825
I'll fix the missing function issue.
Starting from Spark 4.0, we can pass the whole Arrow table to Spark.createDataFrame. I don't know when the release will be.
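To illustrate the version split mentioned above, here is a minimal sketch. The helper name `create_dataframe_from_arrow` is hypothetical and is not part of this PR. The pandas fallback is only an assumption for plain (non-geometry) tables on older Spark versions; the PR itself takes a different route by backporting conversion code from Spark 4.0.

```python
def create_dataframe_from_arrow(spark, table):
    """Build a Spark DataFrame from a pyarrow.Table (hypothetical helper).

    `spark` is expected to be a SparkSession and `table` a pyarrow.Table.
    """
    # SparkSession.version is a string such as "3.5.1" or "4.0.0".
    major = int(spark.version.split(".")[0])
    if major >= 4:
        # Spark 4.0+ accepts the Arrow table directly.
        return spark.createDataFrame(table)
    # Assumption: on older versions, go through pandas instead.
    # This does not handle geometry columns specially.
    return spark.createDataFrame(table.to_pandas())
```

Nothing is imported at module level, so the function itself does not add a hard dependency on pyarrow.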
paleolimbot
left a comment
This is awesome! I'm new to this code base, so consider my comments optional nits 🙂
Starting from Spark 4.0, we can pass the whole arrow table to Spark.createDataFrame
Based on this PR, I'm happy to attempt backporting GeoArrow import of anything implementing __arrow_c_stream__ as a follow-up, which would avoid materializing the GeoPandas data frame 🙂
python/sedona/utils/geoarrow.py
from pyspark.sql import SparkSession
from pyspark.sql import DataFrame
from pyspark.sql.types import StructType, StructField, DataType, ArrayType, MapType
import pyarrow as pa
I am not sure what the dependency situation is like for Spark, but it may be worth making this a lazy import (e.g., like in dataframe_to_arrow) so that when we import from sedona.utils.geoarrow in sedona/spark/__init__.py we don't necessarily require pyarrow to be installed. Alternatively, we could add pyarrow to the apache-sedona[spark] extras to match the runtime requirement.
Good idea. I'll make all the changes later today. Thank you for the review!
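The lazy-import idea discussed in this thread can be sketched roughly as follows. The names `lazy_import` and `table_num_rows` are hypothetical and only illustrate the pattern; they are not the functions in python/sedona/utils/geoarrow.py.

```python
import importlib
from types import ModuleType


def lazy_import(module_name: str, hint: str = "") -> ModuleType:
    """Import a module on first use and raise a readable error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        message = f"'{module_name}' is required for this feature but is not installed."
        if hint:
            message += f" {hint}"
        raise ImportError(message) from exc


def table_num_rows(columns: dict) -> int:
    """Example caller: pyarrow is imported here, not when the module is loaded."""
    pa = lazy_import("pyarrow", "Install it with: pip install pyarrow")
    return pa.table(columns).num_rows
```

With this shape, importing the surrounding package succeeds without pyarrow, and the error only appears when an Arrow-dependent function is actually called.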
    return [gen_new_name[name]() for name in names]


def _deduplicate_field_names(dt: DataType) -> DataType:
- def _deduplicate_field_names(dt: DataType) -> DataType:
+ # Backport from Spark 4.0
+ # https://github.com/apache/spark/blob/3515b207c41d78194d11933cd04bddc21f8418dd/python/pyspark/sql/pandas/types.py#L1385
+ def _deduplicate_field_names(dt: DataType) -> DataType:
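For readers unfamiliar with the helper under review, here is a rough sketch of how duplicate field names might be made unique. Only the final list comprehension appears in the diff above; the function name `dedup_names`, the `_N` suffix scheme, and everything else here are assumptions made for illustration, not the code in this PR or in Spark.

```python
import itertools
from typing import Callable, Dict, List


def dedup_names(names: List[str]) -> List[str]:
    """Return names with repeated entries made unique by a numeric suffix."""
    if len(set(names)) == len(names):
        return list(names)  # already unique, nothing to rename

    def numbered(name: str) -> Callable[[], str]:
        # Each call yields name_0, name_1, ... for a repeated name.
        counter = itertools.count()
        return lambda: f"{name}_{next(counter)}"

    def unchanged(name: str) -> Callable[[], str]:
        # Names that occur once are returned as they are.
        return lambda: name

    counts: Dict[str, int] = {}
    for name in names:
        counts[name] = counts.get(name, 0) + 1

    gen_new_name = {
        name: numbered(name) if n > 1 else unchanged(name)
        for name, n in counts.items()
    }
    return [gen_new_name[name]() for name in names]


print(dedup_names(["id", "geom", "id"]))  # ['id_0', 'geom', 'id_1']
```

Note that this simple scheme does not guard against a generated name such as `id_0` colliding with a field that already has that name.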
@paleolimbot |
jiayuasu
left a comment
Can you add documentation to this page? https://sedona.apache.org/latest/tutorial/geopandas-shapely/
Sure.
Co-authored-by: Dewey Dunnington <[email protected]>
f328661 to 1c96da0
paleolimbot
left a comment
Apologies for the late review...this is awesome! Thank you!
* SEDONA-714 Add geopandas to spark arrow conversion.
* SEDONA-714 Add geopandas to spark arrow conversion.
* SEDONA-714 Add geopandas to spark arrow conversion.
* SEDONA-714 Add geopandas to spark arrow conversion.
* SEDONA-714 Add geopandas to spark arrow conversion.
* Update python/sedona/utils/geoarrow.py
  Co-authored-by: Dewey Dunnington <[email protected]>
* SEDONA-714 Add geopandas to spark arrow conversion.
* SEDONA-714 Add docs.
* SEDONA-714 Add docs.
---------
Co-authored-by: Dewey Dunnington <[email protected]>
Did you read the Contributor Guide?
Yes, I have read the Contributor Rules and Contributor Development Guide
No, I haven't read it.
Is this PR related to a JIRA ticket?
Yes, the URL of the associated JIRA ticket is https://issues.apache.org/jira/browse/SEDONA-XXX. The PR name follows the format [SEDONA-XXX] my subject.
No:
[DOCS] my subject
[CI] my subject
What changes were proposed in this PR?
How was this patch tested?
Did this PR include necessary documentation updates?
vX.Y.Z format.