Higher-order Functions, Avro and Custom Serializers

sparklyr 1.3 is now available on CRAN, with the following major new features: higher-order functions, support for Apache Avro, and custom serialization.

To install sparklyr 1.3 from CRAN, run

install.packages("sparklyr")

In this post, we shall highlight some major new features introduced in sparklyr 1.3, and showcase scenarios where such features come in handy. While a number of enhancements and bug fixes (especially those related to spark_apply(), Apache Arrow, and secondary Spark connections) were also an important part of this release, they will not be the topic of this post, and it will be an easy exercise for the reader to find out more about them from the sparklyr NEWS file.

Higher-order Functions

Higher-order functions are built-in Spark SQL constructs that allow user-defined lambda expressions to be applied efficiently to complex data types such as arrays and structs. As a quick demo to see why higher-order functions are useful, let's say one day Scrooge McDuck dove into his huge vault of money and found large quantities of pennies, nickels, dimes, and quarters. Having an impeccable taste in data structures, he decided to store the quantities and face values of everything into two Spark SQL array columns:


library(sparklyr)
sc <- spark_connect(master = "local", version = "2.4.5")
coins_tbl <- copy_to(sc, tibble::tibble(
    quantities = list(c(4000, 3000, 2000, 1000)),
    values = list(c(1, 5, 10, 25))
))

Thus declaring his net worth of 4k pennies, 3k nickels, 2k dimes, and 1k quarters. To help Scrooge McDuck calculate the total value of each type of coin in sparklyr 1.3 or above, we can apply hof_zip_with(), the sparklyr equivalent of ZIP_WITH, to the quantities column and the values column, combining pairs of elements from arrays in both columns. As you might have guessed, we also need to specify how to combine those elements, and what better way to accomplish that than a concise one-sided formula ~ .x * .y in R, which says we want (quantity * value) for each type of coin? So, we have the following:

result_tbl <- coins_tbl %>%
  hof_zip_with(~ .x * .y, dest_col = total_values) %>%
  dplyr::select(total_values)

result_tbl %>% dplyr::pull(total_values)
[1]  4000 15000 20000 25000

With the result 4000 15000 20000 25000 (in cents) telling us there are in total $40 worth of pennies, $150 worth of nickels, $200 worth of dimes, and $250 worth of quarters, as expected.
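For readers who want to check the arithmetic without a Spark connection, the element-wise pairing can be sketched in plain R. This is only a local analogue of what ZIP_WITH does to the two arrays, not how sparklyr evaluates the expression (that happens inside Spark SQL):

```r
# Plain-R analogue of ZIP_WITH: combine two equal-length vectors pairwise.
quantities <- c(4000, 3000, 2000, 1000)
values <- c(1, 5, 10, 25)  # face values in cents

# mapply() walks both vectors in parallel, much like ZIP_WITH walks both arrays.
total_values <- mapply(function(x, y) x * y, quantities, values)

print(total_values)
# [1]  4000 15000 20000 25000

# Convert cents to dollars for each type of coin.
print(total_values / 100)
# [1]  40 150 200 250
```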

Using another sparklyr function named hof_aggregate(), which performs an AGGREGATE operation in Spark, we can then compute the net worth of Scrooge McDuck based on result_tbl, storing the result in a new column named total. Notice that for this aggregate operation to work, we need to ensure the starting value of the aggregation has a data type (namely, BIGINT) that is consistent with the data type of total_values (which is ARRAY<BIGINT>), as shown below:

result_tbl %>%
  dplyr::mutate(zero = dplyr::sql("CAST (0 AS BIGINT)")) %>%
  hof_aggregate(start = zero, ~ .x + .y, expr = total_values, dest_col = total) %>%
  dplyr::select(total) %>%
  dplyr::pull(total)
[1] 64000

So Scrooge McDuck's net worth is 64000 cents, or $640.
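The fold performed by AGGREGATE (a starting value plus a two-argument merge function) has a close relative in base R's Reduce(). The sketch below is a local illustration of the same arithmetic under that analogy; it does not touch Spark, and the type-matching concern about BIGINT does not arise in plain R:

```r
# Plain-R analogue of AGGREGATE: fold an array into a single value,
# starting from an explicit initial accumulator.
total_values <- c(4000, 15000, 20000, 25000)  # cents per type of coin
zero <- 0                                     # starting value of the fold

# The accumulator `acc` plays the role of .x, the current element that of .y.
total <- Reduce(function(acc, x) acc + x, total_values, zero)

print(total)
# [1] 64000

print(total / 100)  # net worth in dollars
# [1] 640
```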

Other higher-order functions supported by Spark SQL so far include transform, filter, and exists, as documented in the Spark SQL built-in functions reference, and similar to the example above, their counterparts (namely, hof_transform(), hof_filter(), and hof_exists()) all exist in sparklyr 1.3, so that they can be integrated with other dplyr verbs in an idiomatic manner in R.
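To give a rough feel for what these three operations compute, here are their closest plain-R relatives applied to the coin totals from earlier. This is a sketch of the semantics only: it uses base R functions rather than the hof_*() functions, and the thresholds are made up for illustration.

```r
total_values <- c(4000, 15000, 20000, 25000)  # cents per type of coin

# transform: apply a function to every element (here, cents to dollars).
in_dollars <- vapply(total_values, function(x) x / 100, numeric(1))

# filter: keep only the elements satisfying a predicate.
at_least_150_dollars <- Filter(function(x) x >= 15000, total_values)

# exists: TRUE if at least one element satisfies a predicate.
has_total_under_50_dollars <- any(
  vapply(total_values, function(x) x < 5000, logical(1))
)

print(in_dollars)
# [1]  40 150 200 250
print(at_least_150_dollars)
# [1] 15000 20000 25000
print(has_total_under_50_dollars)
# [1] TRUE
```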


Avro

Another highlight of the sparklyr 1.3 release is its built-in support for Avro data sources. Apache Avro is a widely used data serialization protocol that combines the efficiency of a binary data format with the flexibility of JSON schema definitions. To make working with Avro data sources simpler, in sparklyr 1.3, as soon as a Spark connection is instantiated with spark_connect(..., package = "avro"), sparklyr will automatically figure out which version of the spark-avro package to use with that connection, saving a lot of potential headaches for sparklyr users trying to determine the correct version of spark-avro by themselves. Similar to how spark_read_csv() and spark_write_csv() are in place to work with CSV data, spark_read_avro() and spark_write_avro() methods were implemented in sparklyr 1.3 to facilitate reading and writing Avro files through an Avro-capable Spark connection, as illustrated in the example below:


library(sparklyr)

# The `package = "avro"` option is only supported in Spark 2.4 or higher
sc <- spark_connect(master = "local", version = "2.4.5", package = "avro")

sdf <- sdf_copy_to(
  sc,
  tibble::tibble(
    a = c(1, NaN, 3, 4, NaN),
    b = c(-2L, 0L, 1L, 3L, 2L),
    c = c("a", "b", "c", "", "d")
  )
)

# This example Avro schema is a JSON string that essentially says all columns
# ("a", "b", "c") of `sdf` are nullable.
avro_schema <- jsonlite::toJSON(list(
  type = "record",
  name = "topLevelRecord",
  fields = list(
    list(name = "a", type = list("double", "null")),
    list(name = "b", type = list("int", "null")),
    list(name = "c", type = list("string", "null"))
  )
), auto_unbox = TRUE)

# persist the Spark data frame from above in Avro format
spark_write_avro(sdf, "/tmp/data.avro", as.character(avro_schema))

# and then read the same data frame back
spark_read_avro(sc, "/tmp/data.avro")
# Source: spark<data> [?? x 3]
      a     b c
  <dbl> <int> <chr>
  1     1    -2 "a"
  2   NaN     0 "b"
  3     3     1 "c"
  4     4     3 ""
  5   NaN     2 "d"
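It can help to look at the schema string itself before handing it to spark_write_avro(). The sketch below rebuilds the same schema with a small helper and prints the resulting JSON; it needs only the jsonlite package and no Spark connection. The helper name nullable_field() is hypothetical and introduced here purely for illustration.

```r
library(jsonlite)

# Hypothetical helper: pairs a column name with a union type. Listing "null"
# in the union is what marks the field as nullable in an Avro schema.
nullable_field <- function(name, type) {
  list(name = name, type = list(type, "null"))
}

avro_schema <- toJSON(list(
  type = "record",
  name = "topLevelRecord",
  fields = list(
    nullable_field("a", "double"),
    nullable_field("b", "int"),
    nullable_field("c", "string")
  )
), auto_unbox = TRUE)

# Show the JSON that would be passed on as the schema string.
cat(prettify(avro_schema), "\n")

# Parse it back to confirm the structure survived the round trip.
parsed <- fromJSON(avro_schema, simplifyVector = FALSE)
print(length(parsed$fields))
```

With auto_unbox = TRUE, single-element vectors such as "record" are written as JSON scalars rather than one-element arrays, while each unnamed list such as list("double", "null") becomes a JSON array, which is the form Avro expects for a union type.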

Custom Serialization

In addition to commonly used data serialization formats such as CSV, JSON, Parquet, and Avro, starting from sparklyr 1.3, customized data frame serialization and deserialization procedures implemented in R can also be run on Spark workers via the newly implemented spark_read() and spark_write() methods. We can see both of them in action through a quick example below, where saveRDS() is called from a user-defined writer function to save all rows within a Spark data frame into 2 RDS files on disk, and readRDS() is called from a user-defined reader function to read the data from the RDS files back to Spark:


library(sparklyr)

sc <- spark_connect(master = "local")
sdf <- sdf_len(sc, 7)
paths <- c("/tmp/file1.RDS", "/tmp/file2.RDS")

spark_write(sdf, writer = function(df, path) saveRDS(df, path), paths = paths)
spark_read(sc, paths, reader = function(path) readRDS(path), columns = c(id = "integer"))
# Source: spark<?> [?? x 1]
     id
  <int>
1     1
2     2
3     3
4     4
5     5
6     6
7     7
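The writer and reader functions are ordinary R functions, so they can be tried out locally before being handed to Spark. The sketch below runs the same pair against an in-memory data frame and two temporary files. The 4/3 split of rows across files is an assumption made for illustration only; how spark_write() actually distributes rows between the two paths is not described here.

```r
# Same writer/reader pair as above, exercised without Spark.
writer <- function(df, path) saveRDS(df, path)
reader <- function(path) readRDS(path)

df <- data.frame(id = 1:7)
paths <- c(tempfile(fileext = ".RDS"), tempfile(fileext = ".RDS"))

# Illustrative split (an assumption): rows 1-4 to the first file, 5-7 to the second.
chunks <- split(df, rep(1:2, times = c(4, 3)))

# Write each chunk to its own file, then read every file back and stack the rows.
invisible(Map(writer, chunks, paths))
restored <- do.call(rbind, lapply(paths, reader))
rownames(restored) <- NULL

print(restored$id)
# [1] 1 2 3 4 5 6 7

unlink(paths)  # clean up the temporary files
```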

Other Improvements


Sparklyr.flint

Sparklyr.flint is a sparklyr extension that aims to make functionalities from the Flint time-series library easily accessible from R. It is currently under active development. One piece of good news is that, while the original Flint library was designed to work with Spark 2.x, a slightly modified fork of it will work well with Spark 3.0, and within the existing sparklyr extension framework. sparklyr.flint can automatically determine which version of the Flint library to load based on the version of Spark it is connected to. Another bit of good news is that, being under active development as mentioned above, sparklyr.flint does not know too much about its own destiny yet. Maybe you can play an active part in shaping its future!

EMR 6.0

This release also features a small but important change that allows sparklyr to correctly connect to the version of Spark 2.4 that is included in Amazon EMR 6.0.

Previously, sparklyr automatically assumed any Spark 2.x it was connecting to was built with Scala 2.11 and tried to load any required Scala artifacts built with Scala 2.11 as well. This became problematic when connecting to Spark 2.4 from Amazon EMR 6.0, which is built with Scala 2.12. Starting from sparklyr 1.3, such a problem can be fixed by simply specifying scala_version = "2.12" when calling spark_connect() (e.g., spark_connect(master = "yarn-client", scala_version = "2.12")).

Spark 3.0

Last but not least, it is worthwhile to mention that sparklyr 1.3.0 is known to be fully compatible with the recently released Spark 3.0. We highly recommend upgrading your copy of sparklyr to 1.3.0 if you plan to have Spark 3.0 as part of your data workflow in future.


Acknowledgement

In chronological order, we would like to thank the following individuals for submitting pull requests towards sparklyr 1.3:

We are also grateful for valuable input on the sparklyr 1.3 roadmap, #2434, and #2551 from @javierluraschi, and great spiritual advice on #1773 and #2514 from @mattpollock and @benmwhite.

Please note that if you believe you are missing from the acknowledgement above, it may be because your contribution has been considered part of the next sparklyr release rather than part of the current release. We do make every effort to ensure all contributors are mentioned in this section. In case you believe there is a mistake, please feel free to contact the author of this blog post via e-mail (yitao at rstudio dot com) and request a correction.

If you wish to learn more about sparklyr, we recommend checking out the sparklyr documentation and some of the previous release posts such as sparklyr 1.2 and sparklyr 1.1.

Thanks for reading!
