Feature names in Spark ML
Why isn't this better known? This is a great way to extract the feature names from a Dataset after the features have been turned into vectors (even after one-hot encoding):
index_to_name = {}
# The metadata groups the attributes by type (e.g. numeric, binary, nominal)
attrs = df.schema[OUTPUT_COL].metadata["ml_attr"]["attrs"]
for attr_type in attrs:
    for m in attrs[attr_type]:
        index_to_name[m["idx"]] = m["name"]
where OUTPUT_COL is the name of the column containing a Vector of encoded features.
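To see why the nested loop is needed, it helps to look at the shape of the metadata. The sketch below runs the same logic on a hand-written stand-in for df.schema[OUTPUT_COL].metadata, so it works without a Spark session. The helper name index_to_name_from_metadata and the sample feature names (age, country_vec_UK, country_vec_US) are made up for illustration, and the exact attribute groups you see will depend on how your vector was assembled.

```python
from typing import Dict, List


def index_to_name_from_metadata(metadata: dict) -> Dict[int, str]:
    """Flatten the attribute groups under ml_attr/attrs into {vector index: name}."""
    attrs = metadata["ml_attr"]["attrs"]
    mapping: Dict[int, str] = {}
    for attr_group in attrs.values():  # one list of attributes per attribute type
        for attr in attr_group:
            mapping[attr["idx"]] = attr["name"]
    return mapping


# Hypothetical stand-in for df.schema[OUTPUT_COL].metadata:
# one numeric feature plus a one-hot encoded column with two categories.
sample_metadata = {
    "ml_attr": {
        "attrs": {
            "numeric": [{"idx": 0, "name": "age"}],
            "binary": [
                {"idx": 1, "name": "country_vec_UK"},
                {"idx": 2, "name": "country_vec_US"},
            ],
        },
        "num_attrs": 3,
    }
}

index_to_name = index_to_name_from_metadata(sample_metadata)

# Names in vector order, handy for labelling coefficients or importances.
feature_names: List[str] = [index_to_name[i] for i in sorted(index_to_name)]
```

Against a real DataFrame, the call would be index_to_name_from_metadata(df.schema[OUTPUT_COL].metadata). Sorting by index matters because the attributes are grouped by type, so they do not necessarily appear in vector order.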
Non-deterministic Lists
I've seen this Spark-related bug twice in our code base: even though lists have an order, collect_list "is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle." [StackOverflow]
The solution in the SO answer is to run collect_list over a Window that performs an orderBy.