I have two scenarios for the same 23 GB of partitioned Parquet data: read a few of the columns, cache them upfront, and then fire a series of subsequent queries later on.
Setup:
- Cluster: 12 node EMR
- Spark version: 1.6
- Spark configurations: default
- Run configurations: same in both cases
Case 1:
    val paths = Array("s3://my/parquet/path", ...)
    val parqfile = sqlContext.read.parquet(paths: _*)
    parqfile.registerTempTable("productviewbase")
    val dfmain = sqlContext.sql("select guid, email, eventkey, timestamp, pogid from productviewbase")
    dfmain.cache.count
From the Spark UI, the input data read is 6.2 GB and the cached object is 15.1 GB.
Case 2:
    val paths = Array("s3://my/parquet/path", ...)
    val parqfile = sqlContext.read.parquet(paths: _*)
    parqfile.registerTempTable("productviewbase")
    val dfmain = sqlContext.sql("select guid, email, eventkey, timestamp, pogid from productviewbase order by pogid")
    dfmain.cache.count
From the Spark UI, the input data read is 6.2 GB, but the cached object is only 5.5 GB.
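For reference, the numbers in the Storage tab can also be read programmatically; a minimal sketch, assuming the spark-shell sc and the dfmain cached above:

    // Sketch: read the cached in-memory size off the SparkContext instead of the UI.
    // Each cached DataFrame shows up as one RDD entry with its memory footprint.
    sc.getRDDStorageInfo.foreach { info =>
      println(s"${info.name}: ${info.memSize / (1024.0 * 1024.0)} MB in memory")
    }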
Any explanation of, or code reference for, this behavior?
It is relatively simple. As you can read in the SQL guide:
Spark SQL can cache tables using an in-memory columnar format ... Spark SQL will scan only required columns and will automatically tune compression
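For completeness, a small sketch of the table-level form of that API, reusing the temp table registered in the question (dfmain.cache in the question is the DataFrame-level equivalent):

    // Mark the registered temp table for in-memory columnar caching,
    // materialise it with a query, then release it when done.
    sqlContext.cacheTable("productviewbase")
    sqlContext.sql("select count(*) from productviewbase").show()   // fills the cache
    println(sqlContext.isCached("productviewbase"))                 // true
    sqlContext.uncacheTable("productviewbase")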
The nice thing about sorted columnar storage is that it is easy to compress on typical data. When you sort, you get blocks of similar records which can be squashed together using even very simple techniques like RLE.
This is a property that is used quite often in databases with columnar storage, because it is not only efficient in terms of storage but also for aggregations.
Different aspects of Spark columnar compression are covered by the sql.execution.columnar.compression package, and as you can see there, RunLengthEncoding is indeed one of the available compression schemes.
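To see why sorting helps RLE, here is a toy run-length encoder in plain Scala (not the Spark internals, just an illustration) applied to the same low-cardinality values before and after sorting:

    // Toy RLE: collapse consecutive equal values into (value, run length) pairs.
    def rle[T](xs: Seq[T]): Seq[(T, Int)] =
      xs.foldLeft(List.empty[(T, Int)]) {
        case ((v, n) :: rest, x) if v == x => (v, n + 1) :: rest   // extend current run
        case (acc, x)                      => (x, 1) :: acc        // start a new run
      }.reverse

    val pogids = Seq("p3", "p1", "p3", "p2", "p1", "p3", "p2", "p3")
    rle(pogids)        // 8 runs -- nothing to compress
    rle(pogids.sorted) // 3 runs: (p1,2), (p2,2), (p3,4) -- compresses well

The sorted column collapses into a handful of (value, count) pairs, which is exactly the effect the order by in case 2 has on the cached pogid column.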
So there are two pieces here:
- Spark can adjust the compression method on the fly based on statistics: "Spark SQL will automatically select a compression codec for each column based on statistics of the data."
- Sorting can cluster similar records together, making compression much more efficient.
If there are correlations between columns (and when is that not the case?), even a simple sort based on a single column can have a relatively large impact and improve the performance of different compression schemes.
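Putting the two pieces together, here is a self-contained sketch of the same comparison (illustrative table and column names, Spark 1.6 shell API; how big the gap is depends on how correlated and how low-cardinality the real columns are):

    import org.apache.spark.sql.functions.rand

    // ~1M rows with a low-cardinality column, mimicking a column like pogid.
    val df = sqlContext.range(0L, 1000000L)
      .withColumn("pogid", (rand(1L) * 100).cast("int"))
    df.registerTempTable("toy")

    // Case 1: cache the projection as-is.
    sqlContext.sql("select id, pogid from toy").cache().count()

    // Case 2: cache the same projection sorted by the low-cardinality column.
    sqlContext.sql("select id, pogid from toy order by pogid").cache().count()

    // Compare the two entries in the Storage tab of the Spark UI
    // (or via sc.getRDDStorageInfo, as in the earlier sketch).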