Saturday, October 23, 2021

Miscellaneous Spark

Feature names in Spark ML

Why isn't this better known? This is a great way to extract the feature names from a Dataset after the features have been turned into vectors (even after one-hot encoding):

    # Map each position in the feature vector to its feature name.
    index_to_name = {}
    attrs = df.schema[OUTPUT_COL].metadata["ml_attr"]["attrs"]
    # attrs is keyed by attribute type (e.g. "numeric", "binary", "nominal")
    for attr_type, meta_map in attrs.items():
        for m in meta_map:
            index_to_name[m["idx"]] = m["name"]

where OUTPUT_COL is the name of the column containing a Vector of encoded features.

Non-deterministic Lists

I've seen this Spark-related bug twice in our code base: even though lists have an order, collect_list "is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle." [StackOverflow]

The solution in the SO answer is to run the collect over a Window that performs an orderBy.
