
Processing Huge CSV files with AWS: Part I

By Eric Koyanagi

The Goal: Handling Large CSV Uploads in the Cloud

Here’s a very common use case: someone wants to upload a massive CSV and “do stuff” with its contents. Doing this as a one-off is simple enough…but what about doing it at scale? Not as fun. That’s a lot of data to chunk through and a lot of memory to burn. 

To be clear, when I say “massive”, I mean massive in both dimensions. The CSV itself might be 100GB or 1TB or more, and we might expect numerous uploads at the same time. 

Articles online might tell you that it's no big deal. Just crank the max execution time way up and let PHP use as much memory as it wants! This is risky business.

First, if you have a lot of people uploading CSVs at once, you’re not going to have enough processes to handle anything other than those long-running imports. That’s if the server has enough CPU and memory to keep up. 

Using workers is much safer, and you can always set up auto scaling so that it spins down to zero instances if there’s nothing in the queue to process. This at least prevents it from tearing down production and lets you scale the hardware dedicated to imports independently. 

However, now you've likely got a cluster of machines running whose only job is processing big CSVs, and even then this isn't as easy to scale as it might seem. You need long-running PHP processes and must be careful not to burn too much memory with any given process. You might even need to tune Apache or nginx to spin up fewer processes and scale horizontally instead, racking up more cost. 

There’s got to be a more “modern” way that leverages the power of the cloud to decouple the process from traditional servers. We might even save some money in the process. 


Let’s use AWS: Navigating Acronympocalypse

Back in 2010, AWS boasted five services on their main console: EC2, S3, EMR, RDS, and CloudFront. Now it has something like 809 services spanning every permutation of letters sometimes followed by numbers (although the trend now is giving products actual names). 

What’s especially annoying is that many of their “services” are wrappers around open-source software. EMR is backed by Hadoop. Elasticsearch is backed by Apache Lucene. Glue? Apache Spark. You get the idea. 

You can’t really fight Amazon’s ever-increasing services, though, because they are powerful tools. We just have to accept their evil ways and learn to use the tools as best we can! 

Let’s review some of the services we can use to build a robust upload pipeline that can handle big files and many concurrent uploads. No, I won’t be typing “AWS” in front of every product name. Don’t sue me…their brand is strong enough without all the zombie-typing:

  • API Gateway: exposes Lambda functions (or Step Functions workflows) to the web, allowing us to build an API that our front-end can utilize without writing any API boilerplate
  • Lambda: The foundation of serverless, Lambda lets us execute “some function” in the cloud without servers. We pay by usage. 
  • Step Functions: microservice orchestration, a fancy way of saying you can build no-code workflows with either ASL (Amazon States Language -- it’s just JSON…yes, AWS will even rebrand JSON to make it their own) or their WYSIWYG editor. Think “Klaviyo for microservices”. One handy feature is that they can map over a CSV sitting in S3, dispatching batches of rows to Lambda. They can do this with up to 10,000 parallel child workflows…so you can rip through a CSV very quickly if you want to pay for it. Doing this with a monolith is a lot more complex to scale.  
  • Glue: Apache Spark-based service for ETL (extract, transform, load). This is where we can process the data -- in a naive system, that might mean extra asynchronous processing or processing inline as the file uploads (which burns even more CPU in an already long-running upload process). 
  • Athena: We probably won’t use this, but Athena would allow us to query a CSV sitting in S3 directly -- it’s fairly expensive, but could fit some use cases. To leverage it efficiently, you’ll probably want to use Glue to reformat the CSV into a columnar data format, which reduces the amount of data Athena has to scan to execute queries. 
  • EMR: Elastic MapReduce, one of ye olde classic services from 2010. This is used for data processing (ETL) and analysis -- it’s backed by both Hadoop and Spark. It is more feature-rich than Glue and offers more fine-grained control. It isn’t serverless: you’ll need to manage an EMR cluster yourself, but for some workloads this will be more cost-effective than Glue. 
  • Kinesis: This is for working with real-time data. It doesn’t make a lot of sense for batch operations like processing CSVs, but maybe you have a reason to ingest and process data in real-time as rows are being processed. That said, it’s a powerful tool for a data pipeline that might consume data from a real-time source like a log. 
  • Batch: Batch is useful for running high-CPU workloads. It works by defining jobs that specify the Docker container to run, the command to execute, and any parameters. The order and schedule of jobs is controlled by a job queue, which spins up EC2 resources (perhaps many of them) to launch the container and execute the job. It handles resource allocation for you -- you can even run jobs on Spot instances to cut costs. This has some advantages over Lambda, especially for large or long-running pipelines. Of course you can use this in combination with Glue and/or EMR. 

I know it’s a lot of products, and some of their use cases definitely overlap! EMR, Batch, and Glue are all about processing or transforming data -- but work in different ways. 

Glue is serverless and well suited for straightforward, ETL-only tasks. Like most serverless products, it's most cost-effective for pipelines with unpredictable traffic -- for “always on” or scheduled processes, EMR or Batch might be cheaper. 

EMR is a hosted cluster, while Batch handles provisioning and spinning down hardware on its own. Batch can run any batch-processing job (e.g. simulations, AI training, scientific modeling) and isn’t specific to ETL or data pipelines, while EMR is aimed specifically at processing and analyzing data. 


Part 1: Uploading huge CSVs using AWS Serverless

First step: we need a way to dump a big CSV into an S3 bucket. Eh, let’s use a word other than “dump”? Upload. There we go. 

We need to understand that Lambda isn’t as simple as slapping some code into a function and letting it run in the cloud. We need to know how much memory to allocate to that function and understand that Lambda functions can’t run forever: they’re limited to 15 minutes.

If our goal is to handle an upload of any size, that 15 minute limit won’t cut it. What if the upload takes several hours? Even before hitting the limit, it’ll cost you -- that’s the price of Lambda, of having all that raw capacity at your fingertips instantly -- and the platform simply isn’t built for long-running processes. 

The often-suggested method for uploading a CSV using Lambda is therefore…not going to work for us. We can’t ignore that 15 minute limit for massive files, so we need a different approach. 

Fortunately, the people at AWS already thought about this: we can use a multipart upload to split the workload into “chunks” and upload a bunch of smaller pieces (which can go up in parallel to increase throughput). 

Unfortunately, this means we need to understand the process. This is powerful stuff and can greatly improve the efficiency and robustness of any large file upload, so I think it’s well worth it: 

  1. Initiate the multipart upload by calling the CreateMultipartUpload API, which returns an ID
  2. Slice the file into multiple parts and obtain a presigned upload URL for each part 
  3. Upload each part using the presigned URLs via the UploadPart API
  4. Complete the multipart upload by calling CompleteMultipartUpload

As you can see, we’ve got some work to do on both the client and server side. It seems we’ll need at least four functions to: start the upload, get a presigned URL, call UploadPart, and finally complete the upload. 

There’s another benefit to uploading with Amazon’s tech: S3 Transfer Acceleration, which uses edge locations. This reduces upload time even further by having the client upload to the edge location physically nearest to the user…we aren’t going to touch this yet, but it’s nice to know it’s an option if we want maximum upload performance. 
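For the curious, here’s roughly what opting in looks like on the SDK side -- a minimal sketch, assuming the bucket already has Transfer Acceleration enabled in its settings:

// Sketch only: with the v2 JavaScript SDK, pointing the client at the
// accelerate endpoint is a single flag. Presigned URLs generated from
// this client will use the accelerate endpoint automatically.
const AWS = require("aws-sdk")

const s3 = new AWS.S3({
    useAccelerateEndpoint: true,
})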


Starting the Upload: Getting Some Signed URLs

First, the Lambda function to initiate a new multipart upload is easy and provided by AWS: 

const multipartUpload = await s3.createMultipartUpload(multipartParams).promise()

return {
    statusCode: 200,
    body: JSON.stringify({
        fileId: multipartUpload.UploadId,
        fileKey: multipartUpload.Key,
    }),
    headers: {
        'Access-Control-Allow-Origin': '*'
    }
};

Of course, AWS provides this code without much further comment. What are multipartParams and how do we get them? We can peek into the createMultipartUpload method to see that it expects a bucket and key -- which makes sense: it needs to know where to put our file. 
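As a rough sketch (the BUCKET_NAME environment variable and the "name" field in the request body are my assumptions, not part of Amazon’s sample), the initialization Lambda might build them like this:

// Sketch: build multipartParams from the incoming request.
// BUCKET_NAME is assumed to be set as an environment variable, and
// "name" is a hypothetical field sent by the front end with the file name.
const { name } = JSON.parse(event.body)

const multipartParams = {
    Bucket: process.env.BUCKET_NAME,
    Key: `uploads/${Date.now()}-${name}`, // prefix the key to avoid collisions
}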

Now that we have our multipart upload ID, we need to split the data and create presigned URLs for as many parts as we need. This makes it easy to tune -- you can define how big you want each part to be and create as many presigned URLs as required. 

What about the front-end code that is responsible for starting the actual upload? First, let’s calculate the number of parts:

const numberOfparts = Math.ceil(this.file.size / this.chunkSize)

One note about chunkSize from Amazon’s sample repo: each part must be at least 5MB (S3’s minimum part size, except for the final part). Files smaller than 5MB are better served by the simpler single-request implementation. We will come back to this, but for now we’ll focus on the harder use case: large (hundreds of GBs or several TB) CSV files (or any file type, really). 
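For reference, that simpler path boils down to a single presigned PUT -- a minimal sketch in the same v2 SDK style (the bucket variable and expiration here are placeholders):

// Sketch: files under 5MB don't need multipart at all. One presigned
// putObject URL and a single PUT from the browser is enough.
const signedUrl = await s3.getSignedUrlPromise("putObject", {
    Bucket: process.env.BUCKET_NAME,
    Key: fileKey,
    Expires: 900, // seconds
})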

Now we need to fetch presigned URLs for each part. We’ll look at Amazon’s sample front end to see how this might work: 

const AWSMultipartFileDataInput = {
    fileId: this.fileId,
    fileKey: this.fileKey,
    parts: numberOfparts,
}

const urlsResponse = await api.request({
    url: this.useTransferAcceleration ? "/getPreSignedTAUrls" : "/getPreSignedUrls",
    method: "POST",
    data: AWSMultipartFileDataInput,
    baseURL: this.baseURL,
})

Here Amazon’s sample code calls different Lambdas depending on whether we want to use “transfer acceleration” (uploading to an edge location) -- I will not implement that option in my code just yet, but it’s good to know how easy it would be to enable! 

Now we can jump to the backend code that generates these presigned URLs. We use presigned URLs because they hide the underlying details of where these files are actually going, and because they expire, which limits how long a malicious user can keep uploading massive files just to screw up your system.

Be sure to set a reasonable expiration…it’s like whack-a-mole with all these expiration times. Imagine doing all this work to avoid Lambda’s 15 minute timeout just to hand out presigned URLs that themselves expire in 15 minutes…? 

This timeout is easier to deal with than Lambda’s, though, because we will split the file into parts. Since each individual part is smaller, these timeouts shouldn’t be a huge problem unless we opt for a very large chunk size. 

Here’s the relevant code from Amazon’s sample:

const multipartParams = {
    Bucket: bucket_name,
    Key: fileKey,
    UploadId: fileId,
}
const promises = []

for (let index = 0; index < parts; index++) {
    promises.push(
        s3.getSignedUrlPromise("uploadPart", {
            ...multipartParams,
            PartNumber: index + 1,
            Expires: parseInt(url_expiration)
        }),
    )
}
const signedUrls = await Promise.all(promises)

We will place this in a separate Lambda function, as the frontend code indicates. Now the frontend has a list of presigned URLs and is ready to finally upload this freakin’ file. 
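One small but useful touch: return each URL alongside its part number, so the client knows which URL belongs to which slice. A sketch of the response (the wrapper object is my assumption; the signedUrl/PartNumber field names match what the upload code below expects):

// Sketch: pair each presigned URL with its 1-based part number before
// returning, matching the part.signedUrl / part.PartNumber fields the
// front-end upload code uses.
const partSignedUrls = signedUrls.map((signedUrl, index) => ({
    signedUrl: signedUrl,
    PartNumber: index + 1,
}))

return {
    statusCode: 200,
    body: JSON.stringify({ parts: partSignedUrls }),
    headers: { 'Access-Control-Allow-Origin': '*' },
}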


Let’s Upload Something!

First, we use file.slice to obtain a “slice” of the data we’re going to upload using one of our signed URLs: 

const sentSize = (part.PartNumber - 1) * this.chunkSize

const chunk = this.file.slice(sentSize, sentSize + this.chunkSize)

In other words, the chunk is a “slice” of the file starting where the last slice ended and sized by our chunk size. The file is uploaded by creating a new promise for each chunk that constructs an HTTP PUT to our presigned URL. There is no backend logic for this piece; the parts just appear in S3:

upload(file, part, sendChunkStarted) {
    const xhr = (this.activeConnections[part.PartNumber - 1] = new XMLHttpRequest())
    xhr.open("PUT", part.signedUrl)
    xhr.onreadystatechange = () => {
        // … track progress and record the result for this part
    }
    xhr.send(file)
}

This is a highly edited version of the sample code to illustrate the general concept, which is to simply construct a PUT request for each chunk of the file. 
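One detail worth calling out: to finalize the upload, S3 needs the ETag of every part. The sample records these as each PUT completes, roughly like this sketch (simplified; note the bucket’s CORS configuration must expose the ETag header for the browser to read it):

// Sketch: when a part's PUT succeeds, S3 returns an ETag header.
// Store it with the part number -- CompleteMultipartUpload needs both.
xhr.onreadystatechange = () => {
    if (xhr.readyState === 4 && xhr.status === 200) {
        const ETag = xhr.getResponseHeader("ETag")
        if (ETag) {
            this.uploadedParts.push({
                PartNumber: part.PartNumber,
                ETag: ETag.replace(/"/g, ""), // strip the quotes S3 wraps around it
            })
        }
    }
}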

The final step is basically the same exact thing as the first step -- we’re confirming that the upload is done so that the file can be reassembled like some future robot regluing itself into a terrifying whole:

const finalizationMultiPartInput = {
    fileId: this.fileId,
    fileKey: this.fileKey,
    parts: this.uploadedParts,
}

await api.request({
    url: "/finalize",
    method: "POST",
    data: finalizationMultiPartInput,
    baseURL: this.baseURL,
})

We can look into the backend section to see the matching server-side Lambda:

const { fileKey, fileId, parts } = JSON.parse(event.body)

const multipartParams = {
    Bucket: BUCKET_NAME,
    Key: fileKey,
    UploadId: fileId,
    MultipartUpload: {
        Parts: parts
    },
}

const command = new CompleteMultipartUploadCommand(multipartParams);
await s3.send(command);

Now that we’ve gone through the code and see how all this works, let’s do it the easier way!


Using Amazon’s Upload Util

Since the sample source code carries a permissive license, we can save some effort and use their uploader on our own front end. We’ll need to copy their uploader file or implement a similar version (note the reference to the original source in their sample code); then we can do something like this: 

let percentage = 0

const uploaderOptions = {
    file: file,
    baseURL: baseUrl,
    chunkSize: partsize,
    threadsQuantity: numuploads,
    useTransferAcceleration: false
}

const uploader = new Uploader(uploaderOptions)

uploader
    .onProgress(({ percentage: newPercentage }) => {
        if (newPercentage === 100) {
            // done!
        }
        if (newPercentage !== percentage) {
            percentage = newPercentage
        }
    })
    .onError((error) => {
        console.error(error)
    })

uploader.start()

This not only handles the low-level details, it allows us to easily implement a progress bar. Note the Axios dependency, but otherwise this saves us a lot of time. 

Knowing how this code works, we can make our own edits -- for example, the backend function endpoint names are hardcoded into the example Uploader class. You can change or abstract them if needed. 

Or leave them as-is, but keep that in mind when you create your Lambdas; the Uploader expects these endpoint names (see the sketch after this list for one way to abstract them):

  • initialize to start the multipart upload
  • getPreSignedUrls to generate the presigned URLs
  • finalize to end the upload 
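If you do want to rename or abstract them, a minimal, purely hypothetical tweak is to lift the paths into one constant that the Uploader’s request calls read from:

// Hypothetical: centralize the endpoint paths so a renamed Lambda route
// only has to be changed in one place inside the Uploader class.
const ENDPOINTS = {
    initialize: "/initialize",
    getPreSignedUrls: "/getPreSignedUrls",
    finalize: "/finalize",
}

// e.g. inside the Uploader: api.request({ url: ENDPOINTS.initialize, method: "POST", ... })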

That’s it! Woo! 

Summary and Next Steps

This is a lot of information just to build a robust pipeline that can upload large files to S3 without any servers. But not only does it give us a more scalable upload pipeline, we can use multipart uploads to improve throughput and edge locations to improve upload speed.

Although more difficult than a simple upload using Lambda, it isn’t that bad when you break it down into steps. 

Two simple backend functions to start and end the upload, and one simple function to create presigned URLs. Most of the heavy lifting on the front end can be done with the code we found in Amazon’s samples, including the Uploader originally authored by Michael Pilov.

Unfortunately, this is just the start of our data pipeline. It’s an often overlooked part, but from here on out we don’t need to worry about front end code. 

Next we’ll talk about AWS Glue and ETL, everyone’s most favoritest chore!
