This Small Corner

Eric Koyanagi's tech blog. At least it's free!

Want to leverage AI in a small dev team? A look at Amazon SageMaker!

By Eric Koyanagi

Working with AI on a Small Dev Team

Everyone and their mom seems to be obsessed with machine learning nowadays, and the complexity of these technologies (and how radically different they are from traditional coding) might make you think that this tech is completely unreachable for smaller teams. While Amazon is aiming to change that perspective with new products, they have a ways to go...their products are famously opaque and require a bit of high-level understanding just to make sense of.

You aren't limited to merely paying for OpenAI's API to leverage machine learning -- although that might be the most straightforward option for simple use cases, what if you want more control? Let's look into building our own AI implementations using Amazon SageMaker. To be clear, you can get all this information by watching some of Amazon's videos under their "machine learning plan", but I will do my best to quickly summarize what's involved. Hopefully this can help you decide if it is worth experimenting with SageMaker for your product.

Labelling Data or Images

The first step to training a model is having labelled data. If you aren't familiar with deep learning as a process, this data is required because something needs to "tell" the algorithm what is true. Let's imagine the classification example explored in the linked tutorial -- identifying if an image contains a picture of a bee (and drawing a bounding box around the bee).

If you were trying to teach an alien what a "bee" is, the natural first step is to pull out a bunch of pictures of bees, point to the insect, and say "this is a bee". Trying to describe the bee would be a lot less effective than simply showing an example and letting the alien intuit the description on their own -- okay, a bee is this black and yellow creature with six legs and antennae.

As a smaller org, this step can seem prohibitive. The reality is that you need a lot of labelled data to train the model correctly (e.g. thousands of images of bees). Does your org really have the resources to have someone spend days and days drawing boxes around bees? Can you afford to check their (likely low paying) work to ensure the model is clean? Probably not.

Fortunately for you, SageMaker Ground Truth has systems in place to help you achieve this at-scale. Unfortunately for your sense of ethics, it's using Amazon Mechanical Turk.

Amazon Mechanical Turk and SageMaker Ground Truth

Launched in 2005, the Mechanical Turk has popped up a few times through my career as marketing-savvy co-workers would utilize it to scale their menial but manual workloads. First, the name is based on the historical Mechanical Turk, a chess-playing machine created in the 1770s that performed surprisingly well against human opponents. No, it wasn't a novel precursor to a computer...it was of course an illusion that allowed a human chess-master to hide within. It even featured a letter board allowing the operator to communicate with challengers.

Is it odd that Amazon decided to name its crowdsourcing platform after a famous illusion? Maybe...but maybe not, because the whole point is that the Turk abstracts away the human beings behind each HIT (human intelligence task, they even have to make that sound machine-like), which typically pay anywhere from $0.03 to a dollar or so. The average pay rate for the Turk is about $1 to $6 an hour (per a 2018 study).

At these rates, you might assume that the Turk isn't actually so bad because "99% of turkers are overseas" where that $1 per hour might go further. This isn't really true, though, as people do turk (yes, even full time) in the US. According to a 2016 study of 3,000 US turkers, most make under $5 an hour. What's my point...? Don't buy into the illusion like the 18th and 19th century elites did (for almost 90 years) -- don't forget that there are humans behind these tasks.

By default, every labeling task will assign five workers -- so if you have 5,000 images, that translates to 25,000 human intelligence tasks. It algorithmically picks the "best answers" using those five results. Even with that scale, that's only around $850 of labor.

Federal minimum wage is $7.25/hour. If each task requires about 1 minute to draw a bounding box, that means about 416 hours of work, or over $3,000 at federal minimum wage. You might think that this process is very slow, but the Turk has 500,000 workers and tasks are competitive...simple tasks will be done in minutes, usually. It can feel a lot like the (in)famous historical illusion where you forget that it isn't a machine, and that's probably Amazon's intent.
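The arithmetic above fits in a few lines. Note that the per-HIT price here is an assumption back-solved from the ~$850 figure; real Turk pricing varies by task:

```python
# Back-of-the-envelope labeling cost estimate for a Ground Truth job
# routed to Mechanical Turk. PRICE_PER_HIT is an assumed value that
# reproduces the article's ~$850 total, not a published rate.

IMAGES = 5_000
WORKERS_PER_IMAGE = 5      # default consensus count per data object
PRICE_PER_HIT = 0.034      # assumed, USD
MINUTES_PER_HIT = 1
FED_MIN_WAGE = 7.25        # USD per hour

hits = IMAGES * WORKERS_PER_IMAGE                 # 25,000 tasks
turk_cost = hits * PRICE_PER_HIT                  # ~$850
labor_hours = hits * MINUTES_PER_HIT / 60         # ~417 hours of human work
min_wage_cost = labor_hours * FED_MIN_WAGE        # ~$3,000 at minimum wage

print(f"{hits} HITs, ~${turk_cost:,.0f} via the Turk")
print(f"~{labor_hours:.0f} hours of labor, ~${min_wage_cost:,.0f} at minimum wage")
```

The gap between those two totals is the whole economic story of the Turk in one subtraction.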

It's important to remember that these are human beings because one of the major issues with the Turk is wildly unfair practices by job posters. Some will reject jobs far too easily (yes, posters get to decide that they won't pay), some will demand tasks that simply can't be done, and others will provide really unclear specifications, then reject tasks because of their own laziness (or because the task was literally impossible to complete). One study calculated a grim average hourly wage of just $1.76 factoring in wasted/rejected HITs and the time spent looking for tasks. The same study states that just 4% of "turkers" make more than the federal minimum wage (and to be clear, you have to do thousands of low paying tasks before you can even qualify for higher paying tasks).

It's even worse than that, though, because remember that Amazon doesn't do anything for free. They take a cut from the workers, too! For some transactions, they might make exactly as much as the worker, with both Amazon and the worker being paid one penny.

To make matters even more scammy, overseas workers often have little choice but to be paid in Amazon Gift Cards to avoid various fees...which is just plain dirty. For workers that already make so little, forcing them to shop in your "company store" feels...massively wrong for a company pulling in hundreds of billions of dollars in revenue each year.

If you're going to use the Mechanical Turk, don't be an asshole about it. Remember that these are humans already making very little money for their effort. Be careful and fair if you must reject a HIT. Understand that the worker has far less leverage than you do; Amazon is infamous for giving job posters a lot of leeway and not investigating claims of abuse. Workers have to expend a lot of time and effort just to be paid what little was agreed on for a task, which is madness! There should be much more granular rules about rejections, ones that don't put the onus on the worker to claw back a few pennies from a system that lets job posters reject work almost unilaterally.

All this being said, you don't need to rely only on human labelers via the Turk. You can label things completely in-house (a requirement if you have sensitive PII or adult content, for example) and can leverage AI-assisted labelling. Yes, AI can be used to label data that will be used to train AI! This can be set up such that SageMaker auto-labels data it has a "high confidence" about, then delegates anything else to the turk. For many use cases, this will work very well.
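To make the auto-labeling setup concrete, here is a sketch of roughly what a Ground Truth labeling job request looks like via boto3. All names, buckets, and ARNs are placeholders, and most of the worker/UI configuration is omitted; the point is the `LabelingJobAlgorithmsConfig` block, which is what turns on model-assisted labeling:

```python
# Sketch of a SageMaker Ground Truth labeling job with automated data
# labeling enabled. Every identifier below is a placeholder. In practice
# you'd pass this dict to boto3.client("sagemaker").create_labeling_job(**request).

request = {
    "LabelingJobName": "bees-bounding-box",        # hypothetical job name
    "LabelAttributeName": "bee",
    "InputConfig": {
        "DataSource": {
            "S3DataSource": {
                "ManifestS3Uri": "s3://my-bucket/input.manifest"  # placeholder
            }
        }
    },
    "OutputConfig": {"S3OutputPath": "s3://my-bucket/output/"},   # placeholder
    "RoleArn": "arn:aws:iam::123456789012:role/GroundTruthRole",  # placeholder
    # The auto-labeling switch: items the model labels with high confidence
    # skip humans entirely; everything else falls through to workers.
    "LabelingJobAlgorithmsConfig": {
        "LabelingJobAlgorithmSpecificationArn":
            "arn:aws:sagemaker:us-east-1:123456789012:"
            "labeling-job-algorithm-specification/object-detection"  # placeholder ARN
    },
    "HumanTaskConfig": {
        "NumberOfHumanWorkersPerDataObject": 5,  # the default consensus count
        # ...workteam ARN, task UI template, and pricing config omitted
    },
}
```

Swapping the workteam between a public (Turk) and private (in-house) workforce is just a different ARN in `HumanTaskConfig`, which is what makes the PII-sensitive in-house option straightforward.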

Overall, I think it's worth spending even a little time thinking about the people that actually make AI work: labelers making very little money and fighting tooth and nail just to get paid for these tasks, sometimes in "store currency" they can't even use to pay their rent! It makes perfect sense that Amazon wants you to see only the "machine" part of the Turk and forget that it's an army of human beings behind the strings. It doesn't mean you have to avoid it...but at least don't be an ass about it! Behind each rejection just might be someone trying to afford their insulin or simply survive.

Training a Model with SageMaker

Once you have your labelled data, you can start the training process. Creating a training job from the SageMaker UI, you'll see a mess of algorithm options (such as "object detection" as used in the bees tutorial). This is where you can experiment with various algorithms and tune the parameters that control the training, which will be algorithm-specific. You can also define the instance type and resources you'll devote to training. These training jobs will require several minutes to run even with small datasets (the bees tutorial only had 500 images and it takes ~10 minutes). This could become expensive, but then...you knew that going into the project, right? AI isn't especially cheap.
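For orientation, here's roughly what kicking off that training job looks like with the SageMaker Python SDK. The image URI, role, and S3 paths are placeholders, and the hyperparameter values are illustrative rather than tuned:

```python
# Sketch of a training job for SageMaker's built-in object-detection
# algorithm. The hyperparameter names mirror that algorithm's docs;
# the values here are illustrative starting points, not tuned ones.

hyperparameters = {
    "num_classes": 1,              # just "bee"
    "epochs": 30,
    "learning_rate": 0.001,
    "mini_batch_size": 16,
    "num_training_samples": 500,   # the tiny bees dataset
}

# With AWS credentials configured, launching looks roughly like:
#
# import sagemaker
# est = sagemaker.estimator.Estimator(
#     image_uri="<object-detection image for your region>",  # placeholder
#     role="arn:aws:iam::123456789012:role/SageMakerRole",   # placeholder
#     instance_count=1,
#     instance_type="ml.p3.2xlarge",  # GPU instance; this is where cost lives
#     hyperparameters=hyperparameters,
# )
# est.fit({"train": "s3://my-bucket/train/",
#          "validation": "s3://my-bucket/val/"})  # placeholder S3 paths
```

The `fit()` call blocks while the job runs, which is where those ~10 minutes (or hours, at scale) go.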

This becomes clear when you delve into the most important facet of training: tuning hyperparameters. What is a hyperparameter? It's just a param used to control how the model learns. So...why not just call it a parameter? Because "parameters" usually means the weights the model learns on its own during training, while hyperparameters are values you set beforehand to steer that learning. Regardless, HPO (hyperparameter optimization) is the difference between your AI model being useful and being a waste of time. At this point, most devs intuitively understand that tuning is among the most important facets of applied AI.

SageMaker has some neat tools for helping with HPO and can run the training multiple times, tweaking the params to try to obtain the best results. For our tiny "bees" example, this still takes several hours to run. For a model at scale? This might take days or weeks. There's no way around it: training models is expensive. It takes this long even though we can tell it to be smart and abort attempts when it is clear that the accuracy is poor. For as much time and money as that saves, this process can still be costly at scale.

There are more caveats, of course -- you have to have some basic understanding of these params (or at least reference the docs) to understand the min/max values to plug into SageMaker for HPO. Like many AWS products, it gives you a UI, but not much help via UX clues -- it expects you to know what's going on!
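Those min/max values end up as search ranges for the tuner. A sketch of what that might look like, with illustrative bounds you'd really pull from the algorithm's docs:

```python
# Sketch of hyperparameter search ranges for a SageMaker tuning job.
# The bounds are illustrative; sensible ranges come from the chosen
# algorithm's documentation, not from guessing.

hyperparameter_ranges = {
    "learning_rate":   {"min": 1e-5, "max": 1e-1, "scaling": "Logarithmic"},
    "mini_batch_size": {"min": 8,    "max": 64,   "scaling": "Linear"},
    "momentum":        {"min": 0.0,  "max": 0.99, "scaling": "Linear"},
}

# With the SageMaker Python SDK, these feed a HyperparameterTuner, roughly:
#
# tuner = sagemaker.tuner.HyperparameterTuner(
#     estimator=est,                        # a configured Estimator
#     objective_metric_name="validation:mAP",  # assumed metric name
#     hyperparameter_ranges={...},          # ContinuousParameter / IntegerParameter
#     max_jobs=20,                          # total training runs
#     max_parallel_jobs=2,                  # more parallel = faster but pricier
#     early_stopping_type="Auto",           # abort clearly-poor runs early
# )
# tuner.fit({"train": "s3://my-bucket/train/"})  # placeholder path
```

`max_jobs` times your per-run cost is the budget conversation to have before you hit "start".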

Deployment and Conclusion

Once the model is trained, it's very easy to deploy an endpoint using the training job name, allowing you to finally utilize the model in production. You'll probably be sad about the accuracy of the model, forcing you to run more training jobs and tune things further...but it really depends on your use case and how good your labelled data is.
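That deployment step is roughly one call with the SageMaker SDK. Instance type and endpoint name below are placeholders, and the response format shown is a simplified stand-in for what the built-in object-detection endpoint returns:

```python
# Sketch of deploying a trained model as a real-time endpoint and parsing
# a detection response. The deploy() call is shown as a comment since it
# needs AWS credentials; names and instance types are placeholders.

# predictor = est.deploy(
#     initial_instance_count=1,
#     instance_type="ml.m5.large",     # endpoints bill for every hour they run
#     endpoint_name="bees-detector",   # placeholder
# )
# result = predictor.predict(image_bytes)
# predictor.delete_endpoint()          # idle endpoints still cost money!

# A tiny local stand-in for a detection response: each entry is
# [class_id, confidence, xmin, ymin, xmax, ymax] in normalized coords.
fake_response = {"prediction": [[0, 0.92, 0.10, 0.20, 0.35, 0.55]]}
best = max(fake_response["prediction"], key=lambda d: d[1])
print(f"class={int(best[0])} confidence={best[1]:.2f}")
```

Filtering detections by a confidence threshold (and deleting endpoints you aren't using) is where most of the post-deploy work actually lives.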

Is all this effort really worth it? Well, if you want a bespoke, customizable model that doesn't have the limits of an API, yes...but that depends on your use case, and figuring out if AI fits your use case is beyond the scope of this article. The reality is that for most small businesses, it's more practical to utilize an API. Still, SageMaker (and other cloud-based tools on the horizon) empowers small firms to create their own bespoke AI implementations even with limited developer knowledge about AI.

Cloud-based tools will always be a priority for players like Amazon, and products like SageMaker will be critical for the next generation of cloud specialists to understand. With this platform, you can focus a lot less on the "nuts and bolts" of training and AI and more on the algorithm-specific details and implementation.


Reference: I Found Work on an Amazon Website. I Made 97 Cents an Hour.


Written By
Eric Koyanagi

I've been a software engineer for over 15 years, working in both startups and established companies in a range of industries from manufacturing to adtech to e-commerce. Although I love making software, I also enjoy playing video games (especially with my husband) and writing articles.
