Hello! In this post, we're going to walk through the five steps I would take if I could rewind to before I had ever touched a cloud platform and wanted to start doing machine learning on one. My own path was a lot more circuitous than the one outlined here — you're getting a distillation of all of my pain and suffering. Anyways, step 1:
- Start expanding beyond Python. When I first started doing machine learning, I insisted on using Python for all of my data manipulation. With real-world datasets, that's difficult at best and impossible at worst: companies routinely have datasets in the tens of terabytes. To transform that data and do meaningful feature engineering on it, you'll need to learn a distributed computing framework like Spark. Bridge the gap with PySpark, which exposes Spark through a Python API, so picking up Spark feels a lot more approachable. I was far more comfortable starting with PySpark than diving straight into something like Scala.
- Learn a cluster infrastructure where you can deploy Hadoop. You can run Hadoop and Spark locally on your computer; nothing requires Spark to run on a computing cluster. However, you won't get any speed advantage that way. If you really take the time in step 1 to learn how Spark works, you'll understand that it's ideal for processing large datasets with highly parallelizable operations. My favorite choice is Amazon EMR. An EMR cluster consists of a master node plus core and task nodes that act as workers, each computing a part of the total workload. A resource manager (YARN, by default) tracks the work and resources needed and delegates it to the core/task nodes.
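To make the master/core/task layout concrete, here's a hedged sketch of what an EMR cluster spec looks like via boto3. The cluster name, instance types, and counts are all assumptions for illustration, not recommendations:

```python
# Sketch of an EMR cluster spec: one master node coordinating the work,
# plus core nodes acting as workers. All names/types here are placeholder choices.
job_flow = {
    "Name": "spark-ml-demo",                 # hypothetical cluster name
    "ReleaseLabel": "emr-6.15.0",            # pick whatever EMR release is current
    "Applications": [{"Name": "Spark"}, {"Name": "Hadoop"}],
    "Instances": {
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE",   "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Shut the cluster down when there are no more steps -- idle clusters cost money.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# With AWS credentials configured (and a willingness to be billed!), you'd launch it like:
# import boto3
# emr = boto3.client("emr", region_name="us-east-1")
# response = emr.run_job_flow(**job_flow)
```

Notice the spec itself encodes the architecture from above: one MASTER instance group and a CORE group of workers that split up the workload.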
- Learn by doing. It's going to be really hard to learn Spark just by reading — for one, there aren't a ton of great Spark resources out there. Instead, take one of your local machine learning projects and get it running in the cloud. You'll notice there are A LOT of things that need to change. For example, if you're using sklearn, it isn't built to scale out, meaning you can't really run it across multiple nodes. PySpark has built-in modules like random forest regressors and classifiers that can take the place of some sklearn functions, but the syntax is different. Sure, you can plop your Python code up in the cloud as-is, but that will perform the same as running it on your local machine. I'll emphasize it again: the main benefit of the cloud is scalability — being able to run on 10 computers instead of 1.
- Implement something that is production ready. Machine learning in the cloud is going to REALLY set you apart from others when you go to interview. There's a wild clustering effect, so to speak: either people get opportunities to build up lots of cloud computing experience, or they don't. My experience was a lot like needing a job to get a job, except it was needing cloud computing experience to get cloud computing experience. I remember an interview for what became my second job: I had decent programming experience and data analytics experience, and they only required a bachelor's, so I felt really qualified. They point blank asked if I had ever used Amazon EMR. My answer was: well, not really, but I've done tons of machine learning locally — I can always learn it! They were not thrilled with that. They want people who can hit the ground running with cloud tech. Most companies right now would much rather have a model that can be scaled and deployed to production than a local machine learning algorithm with incredible accuracy. What good is that model if it can't scale to terabytes of data? Less accuracy is acceptable if it means getting visibility into something that was previously invisible.
- Understand monitoring. With EMR, it's all too easy to just set it and forget it. Learn the ways you can monitor an EMR cluster to confirm it's working as expected. AWS does a pretty good job of letting you do this right from the console, and EMR also publishes metrics to CloudWatch. There are other integrations too, like Ganglia, which can be installed when you go to spin up your cluster.
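As one concrete example, EMR publishes an "IsIdle" metric to CloudWatch that tells you whether a cluster is sitting there doing nothing (and still costing money). Here's a hedged sketch of the query you'd build — the cluster id is a placeholder:

```python
from datetime import datetime, timedelta, timezone

# EMR metrics live in CloudWatch under the AWS/ElasticMapReduce namespace.
# "IsIdle" averages near 1.0 when the cluster has had no work for a while.
now = datetime.now(timezone.utc)
params = {
    "Namespace": "AWS/ElasticMapReduce",
    "MetricName": "IsIdle",
    "Dimensions": [{"Name": "JobFlowId", "Value": "j-PLACEHOLDERID"}],  # your cluster id
    "StartTime": now - timedelta(hours=1),
    "EndTime": now,
    "Period": 300,                # 5-minute granularity
    "Statistics": ["Average"],
}

# With AWS credentials configured, you'd pull the datapoints like this:
# import boto3
# cw = boto3.client("cloudwatch", region_name="us-east-1")
# datapoints = cw.get_metric_statistics(**params)["Datapoints"]
```

Wiring a check like this into an alarm is a simple way to avoid paying for a cluster that finished its work hours ago.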
So in summary: expand your horizons beyond Python, push a local machine learning project to the cloud, and figure out how you'd make it run twice as fast if you needed to. Also learn how to monitor execution time and cluster resources. That's about all I have for you today – have a good day! BYE