I have two scenarios for the same 23 GB of partitioned Parquet data: read a few of the columns, cache them upfront, and then fire a series of subsequent queries later on.
Setup:
- Cluster: 12 node EMR
- Spark version: 1.6
- Spark configurations: default
- Run configurations: same in both cases
Case 1:
    val paths = Array("s3://my/parquet/path", ...)
    val parqfile = sqlContext.read.parquet(paths: _*)
    parqfile.registerTempTable("productviewbase")
    val dfmain = sqlContext.sql("select guid, email, eventkey, timestamp, pogid from productviewbase")
    dfmain.cache.count
From the Spark UI, the input data read is 6.2 GB and the cached object is 15.1 GB.
Case 2:
    val paths = Array("s3://my/parquet/path", ...)
    val parqfile = sqlContext.read.parquet(paths: _*)
    parqfile.registerTempTable("productviewbase")
    val dfmain = sqlContext.sql("select guid, email, eventkey, timestamp, pogid from productviewbase order by pogid")
    dfmain.cache.count
From the Spark UI, the input data read is 6.2 GB, but the cached object is only 5.5 GB.
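For reference, the numbers in the Storage tab can also be read programmatically; a minimal sketch, assuming the spark-shell sc and the dfmain cached above:

    // Sketch: read the cached in-memory size off the SparkContext instead of the UI.
    // Each cached DataFrame shows up as one RDD entry with its memory footprint.
    sc.getRDDStorageInfo.foreach { info =>
      println(s"${info.name}: ${info.memSize / (1024.0 * 1024.0)} MB in memory")
    }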
Any explanation of, or code reference for, this behavior?
It is relatively simple. As you can read in the SQL guide:
Spark SQL can cache tables using an in-memory columnar format ... Spark SQL will scan only required columns and will automatically tune compression
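For completeness, a small sketch of the table-level form of that API, reusing the temp table registered in the question (dfmain.cache in the question is the DataFrame-level equivalent):

    // Mark the registered temp table for in-memory columnar caching,
    // materialise it with a query, then release it when done.
    sqlContext.cacheTable("productviewbase")
    sqlContext.sql("select count(*) from productviewbase").show()   // fills the cache
    println(sqlContext.isCached("productviewbase"))                 // true
    sqlContext.uncacheTable("productviewbase")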
The nice thing about sorted columnar storage is that it is easy to compress on typical data. When you sort, you get blocks of similar records which can be squashed together using even very simple techniques like RLE.
This is a property that is used quite often in databases with columnar storage, because it is not only efficient in terms of storage but also for aggregations.
Different aspects of Spark columnar compression are covered by the sql.execution.columnar.compression package, and as you can see there, RunLengthEncoding is indeed one of the available compression schemes.
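To see why sorting helps RLE, here is a toy run-length encoder in plain Scala (not the Spark internals, just an illustration) applied to the same low-cardinality values before and after sorting:

    // Toy RLE: collapse consecutive equal values into (value, run length) pairs.
    def rle[T](xs: Seq[T]): Seq[(T, Int)] =
      xs.foldLeft(List.empty[(T, Int)]) {
        case ((v, n) :: rest, x) if v == x => (v, n + 1) :: rest   // extend current run
        case (acc, x)                      => (x, 1) :: acc        // start a new run
      }.reverse

    val pogids = Seq("p3", "p1", "p3", "p2", "p1", "p3", "p2", "p3")
    rle(pogids)        // 8 runs -- nothing to compress
    rle(pogids.sorted) // 3 runs: (p1,2), (p2,2), (p3,4) -- compresses well

The sorted column collapses into a handful of (value, count) pairs, which is exactly the effect the order by in case 2 has on the cached pogid column.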
So there are two pieces here:
- Spark can adjust the compression method on the fly based on statistics: "Spark SQL will automatically select a compression codec for each column based on statistics of the data."
- Sorting can cluster similar records together, making compression much more efficient.
If there are correlations between columns (and when is that not the case?), even a simple sort based on a single column can have a relatively large impact and improve the performance of different compression schemes.
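Putting the two pieces together, here is a self-contained sketch of the same comparison (illustrative table and column names, Spark 1.6 shell API; how big the gap is depends on how correlated and how low-cardinality the real columns are):

    import org.apache.spark.sql.functions.rand

    // ~1M rows with a low-cardinality column, mimicking a column like pogid.
    val df = sqlContext.range(0L, 1000000L)
      .withColumn("pogid", (rand(1L) * 100).cast("int"))
    df.registerTempTable("toy")

    // Case 1: cache the projection as-is.
    sqlContext.sql("select id, pogid from toy").cache().count()

    // Case 2: cache the same projection sorted by the low-cardinality column.
    sqlContext.sql("select id, pogid from toy order by pogid").cache().count()

    // Compare the two entries in the Storage tab of the Spark UI
    // (or via sc.getRDDStorageInfo, as in the earlier sketch).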