Hurray! My PySpark pull request to optimize hyperparameters using randomization has been accepted and merged into branch 3.2!
Here are some things I have learned about maintaining the Spark codebase.
- ScalaCheck is used extensively. This is a great little library that I should really use more in my work code. It generates test inputs for you and makes a point of hitting corner cases: if your function takes an Int, how does it handle Int.MaxValue? How about Int.MinValue? (There's a small sketch of this after the list.)
- Unfortunately (in my opinion), the documentation is generated with Jekyll, which runs on Ruby and required me to install all sorts of weird (to me) software on my system. This StackOverflow post helped me avoid doing all sorts of nasty things to my laptop as root. The documentation is built with bundle exec jekyll build, which takes the .md files and enhances them, so tools like IntelliJ's markdown viewer are not terribly useful. And it seems you have to spend minutes rebuilding the site each time you change it. Compare that to mdoc, which via an sbt plugin lets you do things like rebuild the documentation the moment you change a file with docs/mdoc --watch (see the build sketch after this list).
- The style checks are draconian, with even the spaces between imports being significant. I guess this makes a lot of sense in that code reviews are now more automated (I was surprised that it was only really the overworked Sean Owen who was reviewing my code), but it was annoying to have builds fail because of very minor infractions. The $SPARK_HOME/dev directory, however, has some scripts that let you run these checks on your own laptop first.
- All code must be accessible to Java, Scala and Python. Scala/Java compatibility is not without its wrinkles (a couple of the usual patterns are sketched after this list), but PySpark is largely a parallel code structure that wraps the JVM code. Where it doesn't (for example, code that runs on the client side), the Python must have as similar an interface to the JVM classes as possible.
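To make the corner-case point concrete, here is a minimal ScalaCheck property. The clamp function is just a stand-in, not Spark code; the interesting part is that ScalaCheck's built-in Int generator deliberately throws in extremes like Int.MaxValue and Int.MinValue, so overflow-style bugs tend to surface quickly.

```scala
import org.scalacheck.Prop.forAll
import org.scalacheck.Properties

object ClampSpec extends Properties("clamp") {

  // Not Spark code: a trivial stand-in function to exercise.
  def clamp(x: Int, lo: Int, hi: Int): Int = math.max(lo, math.min(hi, x))

  // The default Arbitrary[Int] favours special values (0, MinValue, MaxValue)
  // as well as random ones, which is exactly what catches lazy assumptions.
  property("result stays within bounds") = forAll { (x: Int) =>
    val y = clamp(x, -100, 100)
    y >= -100 && y <= 100
  }
}
```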
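For comparison, this is roughly what the mdoc wiring looks like in a build.sbt. This is a sketch, not Spark's actual build; the project name, paths and version number are illustrative.

```scala
// build.sbt sketch. Requires, in project/plugins.sbt:
//   addSbtPlugin("org.scalameta" % "sbt-mdoc" % "2.3.7")   // version illustrative

lazy val docs = project
  .in(file("docs"))
  .enablePlugins(MdocPlugin)
  .settings(
    mdocIn  := baseDirectory.value / "src",  // markdown sources with "scala mdoc" fenced blocks
    mdocOut := baseDirectory.value / "out"   // compiled, evaluated markdown
  )
```

With that in place, sbt "docs/mdoc --watch" type-checks and regenerates the affected pages the moment a source file is saved.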
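As for the Scala/Java wrinkles, here is a hypothetical class (again, not from the Spark codebase) showing two patterns this kind of API design pushes you towards: explicit overloads instead of Scala default arguments, and @varargs so Java callers get a natural varargs signature.

```scala
import scala.annotation.varargs

// Hypothetical class illustrating Java-friendly Scala API design.
class Searcher(val maxIter: Int) {

  // Scala default arguments are awkward to call from Java, so provide an
  // explicit auxiliary constructor instead of `maxIter: Int = 10`.
  def this() = this(10)

  // @varargs generates a bridge method so Java sees setParams(String...)
  // rather than a Scala Seq[String].
  @varargs
  def setParams(params: String*): this.type = {
    // a real implementation would validate and store the params
    this
  }
}
```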
All in all, it was quite difficult, but I think this is a fixed cost. My next pull request should be much faster.