Friday, April 6, 2018

The Trouble with Hashing

If you use Spark's HashingTF, you avoid the need for an external key/value store that would complicate the code, but there are downsides. Collisions aside, the matrix it generates is unnecessarily wide and sparse. This can cause problems for some machine learning algorithms.
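To see why the matrix comes out so wide, here is a minimal sketch of the hashing trick in plain Scala (no Spark required). HashingTF uses MurmurHash3 internally; `hashCode` stands in for it here, and the terms are illustrative:

```scala
// Each term is mapped straight to a column index by hashing, so the matrix
// width is the full feature space regardless of how many distinct terms
// actually occur in the data.
val numFeatures = 1 << 18  // HashingTF's default width: 262144 columns

// Non-negative index in [0, numFeatures), as the hashing trick requires.
def termIndex(term: String): Int =
  ((term.hashCode % numFeatures) + numFeatures) % numFeatures

val terms = Seq("spark", "hashing", "tf", "spark")

// Term-frequency vector as a map of non-zero (index -> count) entries.
val counts: Map[Int, Double] =
  terms.groupBy(termIndex).map { case (i, ts) => i -> ts.size.toDouble }

// Only a handful of the 262144 columns are non-zero; every other column
// is wasted width.
println(counts.size)  // one index per distinct term, barring collisions
```

With a realistic vocabulary of a few thousand terms, the overwhelming majority of those 262144 columns never hold a non-zero value, which is exactly the waste the compaction below removes.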

There are a number of clever mathematical dimensionality-reduction techniques, but since HashingTF makes the matrix unnecessarily sparse, you can compact it trivially with:

import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{udf, col}

// Pull the non-zero indices out of the SparseVector in column i of a Row.
def toIndices(i: Int)(r: Row): Array[Int] = r.getAs[SparseVector](i).indices

// Remap each old index to its new, contiguous index and shrink the vector
// to the number of columns that are actually used.
def compacting(replacements: Map[Int, Int])(v: SparseVector): SparseVector = {
  val indexVals  = v.indices.zip(v.values)
  val newIndices = indexVals.map { case (i, x) => (replacements(i), x) }.sortBy(_._1)
  new SparseVector(replacements.size, newIndices.map(_._1), newIndices.map(_._2))
}

def compactingUdf(replacements: Map[Int, Int]): UserDefinedFunction = {
  val compactingFn = compacting(replacements) _
  udf(compactingFn)
}

(You can find this code in my GitHub repository).

Now, you can compact your matrix. First, collect the non-zero indices:

val toIndicesFn: Row => TraversableOnce[Int] = toIndices(0)
val allIndices = df.select(col(YOUR_COLUMN_NAME)).rdd
  .flatMap(toIndicesFn)
  .distinct()
  .collect()
  .sorted
  .zipWithIndex
  .toMap

and then replace the old indices:

val cUdf        = compactingUdf(allIndices)
val compactedDF = df.withColumn(YOUR_COLUMN_NAME, cUdf(col(YOUR_COLUMN_NAME)))

Now your matrix is considerably narrower and less sparse.
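The whole remapping can be illustrated end to end in plain Scala (no Spark), using maps of (index -> value) entries to stand in for SparseVectors. The row contents below are made up for illustration:

```scala
// Two "rows" whose non-zero entries sit at scattered hash indices.
val rows: Seq[Map[Int, Double]] = Seq(
  Map(7 -> 1.0, 100003 -> 2.0),
  Map(7 -> 3.0, 262000 -> 1.0)
)

// Collect every index that is non-zero in at least one row...
val allIndices: Seq[Int] = rows.flatMap(_.keys).distinct.sorted

// ...and map each old index to a new, contiguous one.
val replacements: Map[Int, Int] = allIndices.zipWithIndex.toMap

// The width shrinks from the full hash space to the number of live columns.
val compacted: Seq[Map[Int, Double]] =
  rows.map(_.map { case (i, x) => replacements(i) -> x })

println(compacted)  // indices are now 0, 1, 2 instead of 7, 100003, 262000
```

Because `allIndices` is sorted before `zipWithIndex`, the remapping also preserves the relative order of the original columns, which keeps the vectors' index arrays valid.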

For example, using this technique decreased the time taken by Spark's RandomForestClassifier from over an hour to a few minutes. The results in my example were still comparable (23% vs 25%), although the sparse matrix actually reduced the number of nodes from 1000 to 821. With the compacted matrix, I could increase the number of trees to nearly 3000 before seeing the same warning.
