- Added support to specify a vector of column names in `spark_read_csv`, so column names can be set without having to specify the type of each column.
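
  A minimal sketch of this usage, with a hypothetical file and column names:

  ```r
  library(sparklyr)
  sc <- spark_connect(master = "local")

  # Name the columns of a headerless CSV without declaring their types
  people <- spark_read_csv(
    sc,
    name = "people",
    path = "data/people.csv",        # hypothetical file
    header = FALSE,
    columns = c("name", "age", "city")
  )
  ```
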
- Improved `copy_to`, `sdf_copy_to` and `dbWriteTable` performance under `yarn-client` mode.
- Added `tbl_change_db()`. This function changes the current database.
- Added `sdf_pivot()`. This function provides a mechanism for constructing pivot tables, using Spark's `groupBy` + `pivot` functionality, with a formula interface similar to that of `reshape2::dcast()`.
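
  A minimal sketch, assuming a local connection and the built-in `iris` data set (the default aggregate is a count):

  ```r
  library(sparklyr)
  library(dplyr)

  sc <- spark_connect(master = "local")
  iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)

  # One row per Species, one column per distinct Petal_Width, cells are counts
  iris_tbl %>%
    sdf_pivot(Species ~ Petal_Width)
  ```
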
- Spark Null objects (objects of class NullType) discovered within numeric vectors are now collected as NAs, rather than lists of NAs.
- Fixed a warning emitted while connecting with Livy, and improved the 401 message.
- Fixed an issue in `spark_read_parquet()` and other read methods in which `spark_normalize_path()` would not work on some platforms while loading data using custom protocols like `s3n://` for Amazon S3.
- Added `ft_count_vectorizer()`. This function can be used to transform columns of a Spark DataFrame so that they might be used as input to `ml_lda()`. This should make it easier to invoke `ml_lda()` on Spark data sets.
- Added support for the `sparklyr.ui.connections` option, which adds additional connection options into the new connections dialog. The `rstudio.spark.connections` option is now deprecated.
- Implemented the "new connection dialog" as a Shiny application, to support newer versions of RStudio that deprecate the current connections UI.
- Improved performance of `sample_n()` and `sample_frac()` by using a `TABLESAMPLE` query.
- Resolved an issue in `spark_save()` / `load_table()` to support saving / loading data, and added a `path` parameter in `spark_load_table()` for consistency with other functions.
- Implemented basic authorization for Livy connections using `livy_config_auth()`.
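
  A hedged sketch of how this might be used; the argument names and the Livy endpoint below are assumptions, not taken from the source:

  ```r
  # Assumed usage: build a config carrying basic-auth credentials,
  # then pass it to spark_connect() for a Livy connection
  config <- livy_config_auth(username = "user", password = "pass")  # argument names assumed

  sc <- spark_connect(
    master = "http://localhost:8998",  # hypothetical Livy endpoint
    method = "livy",
    config = config
  )
  ```
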
- Added support to specify additional `spark-submit` parameters using the `sparklyr.shell.args` environment variable.
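
  For example, extra `spark-submit` flags could be supplied through the environment variable before connecting (the flag shown is only illustrative):

  ```r
  # Additional spark-submit arguments picked up at connection time
  Sys.setenv(sparklyr.shell.args = "--verbose")
  sc <- spark_connect(master = "local")
  ```
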
- Renamed `sdf_load()` and `sdf_save()` to `spark_read()` and `spark_write()` for consistency.
- The functions `tbl_cache()` and `tbl_uncache()` can now be used without requiring the `dplyr` namespace to be loaded.
- `spark_read_csv(..., columns = <...>, header = FALSE)` should now work as expected -- previously, `sparklyr` would still attempt to normalize the column names provided.
- Support to configure Livy using the `livy.` prefix in the `config.yml` file.
- Implemented experimental support for Livy through `livy_install()`, `livy_service_start()`, `livy_service_stop()` and `spark_connect(method = "livy")`.
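
  A sketch of the experimental workflow; the local Livy endpoint shown is the usual default, but treat it as an assumption:

  ```r
  library(sparklyr)

  livy_install()        # install a local Livy instance for testing
  livy_service_start()  # start the local Livy service

  sc <- spark_connect(master = "http://localhost:8998", method = "livy")
  # ... use sc as with any other sparklyr connection ...
  spark_disconnect(sc)

  livy_service_stop()
  ```
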
- The `ml` routines now accept `data` as an optional argument, to support calls of the form e.g. `ml_linear_regression(y ~ x, data = data)`. This should be especially helpful in conjunction with `dplyr::do()`.
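
  A small sketch of the new calling style, assuming an existing connection `sc` and a copied `mtcars` table:

  ```r
  mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

  # Formula-first call with the data supplied explicitly
  fit <- ml_linear_regression(mpg ~ wt + cyl, data = mtcars_tbl)
  ```
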
- Spark `DenseVector` and `SparseVector` objects are now deserialized as R numeric vectors, rather than Spark objects. This should make it easier to work with the output produced by `sdf_predict()` with Random Forest models, for example.
- Implemented `dim.tbl_spark()`. This should ensure that `dim()`, `nrow()` and `ncol()` all produce the expected result with `tbl_spark`s.
- Improved Spark 2.0 installation on Windows by creating `spark-defaults.conf` and configuring `spark.sql.warehouse.dir`.
- Embedded Apache Spark package dependencies to avoid requiring internet connectivity while connecting for the first time through `spark_connect`. The `sparklyr.csv.embedded` config setting was added to configure a regular expression to match Spark versions where the embedded package is deployed.
- Increased exception callstack and message length to include full error details when an exception is thrown in Spark.
- Improved validation of supported Java versions.
- The `spark_read_csv()` function now accepts the `infer_schema` parameter, controlling whether the column schema should be inferred from the underlying file itself. Disabling this should improve performance when the schema is known beforehand.
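
  For example, inference can be disabled when the schema is already known or when reading everything as strings is acceptable (the path is hypothetical):

  ```r
  events <- spark_read_csv(
    sc,
    name = "events",
    path = "data/events.csv",  # hypothetical file
    header = TRUE,
    infer_schema = FALSE       # skip the extra pass over the file
  )
  ```
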
- Added a `do_.tbl_spark` implementation, allowing for the execution of `dplyr::do` statements on Spark DataFrames. Currently, the computation is performed in serial across the different groups specified on the Spark DataFrame; in the future we hope to explore a parallel implementation. Note that `do_` always returns a `tbl_df` rather than a `tbl_spark`, as the objects produced within a `do_` query may not necessarily be Spark objects.
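
  A sketch of a grouped `do()` call, assuming an existing connection `sc` and a copied `mtcars` table; each group is fit in serial and the result is a plain `tbl_df`:

  ```r
  library(dplyr)

  mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

  # One linear model per number of cylinders
  fits <- mtcars_tbl %>%
    group_by(cyl) %>%
    do(model = ml_linear_regression(mpg ~ wt, data = .))
  ```
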
- Improved errors, warnings and fallbacks for unsupported Spark versions.
- `sparklyr` now defaults to `tar = "internal"` in its calls to `untar()`. This should help resolve issues some Windows users have seen related to an inability to connect to Spark, which ultimately were caused by a lack of permissions on the Spark installation.
- Resolved an issue where `copy_to()` and other R => Spark data transfer functions could fail when the last column contained missing / empty values. (#265)
- Added `sdf_persist()` as a wrapper to the Spark DataFrame `persist()` API.
- Resolved an issue where `predict()` could produce results in the wrong order for large Spark DataFrames.
- Implemented support for `na.action` with the various Spark ML routines. The value of `getOption("na.action")` is used by default. Users can customize the `na.action` argument through the `ml.options` object accepted by all ML routines.
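
  A hedged sketch of the intended usage; whether `ml_options()` accepts an `na.action` argument exactly as shown is an assumption based on the description above:

  ```r
  # Assumed usage: override the default na.action for a single model fit
  fit <- ml_linear_regression(
    mpg ~ wt + cyl, data = mtcars_tbl,
    ml.options = ml_options(na.action = "na.omit")  # argument name assumed
  )
  ```
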
- On Windows, long paths, and paths containing spaces, are now supported within calls to `spark_connect()`.
- The `lag()` window function now accepts numeric values for `n`. Previously, only integer values were accepted. (#249)
- Added support to configure Spark environment variables using the `spark.env.*` config settings.
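
  For example, an environment variable can be set for the Spark processes through the connection config (the variable and value below are only illustrative):

  ```r
  config <- spark_config()
  config$spark.env.SPARK_LOCAL_IP <- "127.0.0.1"  # illustrative variable / value

  sc <- spark_connect(master = "local", config = config)
  ```
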
- Added support for the `Tokenizer` and `RegexTokenizer` feature transformers. These are exported as the `ft_tokenizer()` and `ft_regex_tokenizer()` functions.
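
  A small sketch of the tokenizer transformers; the input table and column names are hypothetical, and argument naming has varied across sparklyr versions:

  ```r
  # Split a string column into a column of word arrays
  tokens_tbl <- ft_tokenizer(
    sentences_tbl,        # hypothetical tbl_spark with a "text" column
    input_col = "text",   # argument names assumed; older releases used input.col
    output_col = "words"
  )
  ```
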
- Resolved an issue where attempting to call `copy_to()` with an R `data.frame` containing many columns could fail with a Java StackOverflow. (#244)
- Resolved an issue where attempting to call `collect()` on a Spark DataFrame containing many columns could produce the wrong result. (#242)
- Added support to parameterize network timeouts using the `sparklyr.backend.timeout`, `sparklyr.gateway.start.timeout` and `sparklyr.gateway.connect.timeout` config settings.
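
  These can be set through `spark_config()` before connecting; the values below (in seconds, assumed) are only illustrative:

  ```r
  config <- spark_config()
  config$sparklyr.gateway.start.timeout   <- 120  # illustrative value
  config$sparklyr.gateway.connect.timeout <- 60   # illustrative value
  config$sparklyr.backend.timeout         <- 600  # illustrative value

  sc <- spark_connect(master = "local", config = config)
  ```
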
- Improved logging while establishing connections to `sparklyr`.
- Added `sparklyr.gateway.port` and `sparklyr.gateway.address` as config settings.
- The `spark_log()` function now accepts the `filter` parameter. This can be used to filter entries within the Spark log.
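
  For example, assuming an existing connection `sc`:

  ```r
  # Show only recent log entries that mention "sparklyr"
  spark_log(sc, n = 50, filter = "sparklyr")
  ```
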
- Increased network timeout for `sparklyr.backend.timeout`.
- Moved `spark.jars.default` setting from options to Spark config.
- `sparklyr` now properly respects the Hive metastore directory with the `sdf_save_table()` and `sdf_load_table()` APIs for Spark < 2.0.0.
- Added `sdf_quantile()` as a means of computing (approximate) quantiles for a column of a Spark DataFrame.
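
  A minimal sketch, assuming a copied `mtcars` table:

  ```r
  # Approximate quartiles of the mpg column
  sdf_quantile(mtcars_tbl, column = "mpg", probabilities = c(0.25, 0.5, 0.75))
  ```
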
- Added support for `n_distinct(...)` within the `dplyr` interface, based on a call to the Hive function `count(DISTINCT ...)`. (#220)
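
  For example, within a `summarise()` on a Spark table:

  ```r
  library(dplyr)

  # Translated to count(DISTINCT gear) in the generated HiveQL
  mtcars_tbl %>%
    summarise(gears = n_distinct(gear))
  ```
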
- First release to CRAN.