How to Upload Large Files up to 5TB to Amazon S3 with Multipart in Kotlin

Overview

  • When implementing file uploads in a client-server architecture where the final storage location is Amazon S3, the server can issue the client a time-limited, upload-only Presigned URL for the file upload. This delegates the upload itself to S3, giving the server both security and resource savings.

  • One issue is that the maximum file size that can be uploaded with a Presigned URL is 5GB. For files larger than this, AWS recommends using the Multipart feature, which splits the original large file into multiple smaller segments for uploading. This article introduces how to enable clients to upload large files up to 5TB using the Multipart feature of Amazon S3 in Kotlin.

S3 Multipart Upload Flow

  • The client provides the server with information about the file to be uploaded (e.g., file size, file name).

  • Based on the file size provided by the client, the server calculates the appropriate number of parts and responds to the client with a list of partNumber and uploadUrl for each part (where uploadUrl is a Presigned URL that allows uploads for a specific period). The server saves the uploadId that identifies this multipart upload.

  • The client uses the acquired list of parts to complete the upload of all parts, either sequentially or in parallel (if the client is a browser, it can logically divide the file into the number of parts using File#slice() and upload each chunk separately).

  • Once all parts are uploaded, the client requests the server to complete the final upload, mapping the ETag value obtained from the success response header of each part to the partNumber.

  • The server requests AWS to complete the multipart upload by providing all the ETag, partNumber values received from the client, and the uploadId obtained at the start of the multipart upload.
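
  • The request and response bodies exchanged in this flow might look like the hypothetical Kotlin DTOs below. The names are illustrative only and are not the article's actual API.

// Hypothetical shapes for the initiate/complete exchange in the flow above
data class InitiateMultipartUploadResponse(
    val uploadId: String,          // identifies this multipart upload
    val parts: List<PresignedPart> // one presigned URL per part
)

data class PresignedPart(
    val partNumber: Int,
    val uploadUrl: String          // time-limited, upload-only Presigned URL
)

// Sent by the client after all parts are uploaded: each partNumber paired with
// the ETag returned in that part's response header
data class CompleteMultipartUploadRequest(
    val uploadId: String,
    val parts: List<UploadedPart>
)

data class UploadedPart(
    val partNumber: Int,
    val eTag: String
)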

Reasons to Use S3 Multipart Upload

  • While single-file uploads using Presigned URLs are limited to a maximum of 5GB, multipart uploads can handle files up to 5TB.

  • Large files, such as videos over 100MB, can be split into up to 10,000 parts and uploaded in parallel, significantly speeding up the process compared to single-file uploads.

  • If an upload fails due to network issues, only the affected part needs to be re-uploaded, minimizing the impact of errors.

Considerations for S3 Multipart Upload

  • Amazon S3 recommends using Multipart for objects over 100MB. [Related Link]

  • The maximum allowed upload file size for an object combined through Multipart is 5TB = 5,497,558,138,880 bytes.

  • During a Multipart upload, the maximum allowed file size for an individual part is 5GB = 5,368,709,120 bytes, and the minimum is 5MB = 5,242,880 bytes (the final part may be smaller than the minimum).

  • The maximum number of parts for a Multipart upload is 10,000.

  • Most modern browsers do not limit the maximum upload size of an individual part. Exceptions include IE8 and older versions of Firefox, which are limited to 2GB = 2,147,483,648 bytes. With this in mind, capping the maximum upload size of an individual part at 2GB accommodates all scenarios.

  • When uploading individual parts at the browser level, the S3 bucket's CORS policy must include the following to ensure the browser can retrieve the ETag response header provided by S3 during part uploads.

// Amazon S3 Console Login > Bucket > Permissions > CORS (Cross-origin resource sharing) > Edit
[
  {
    "AllowedHeaders": [
      "*"
    ],
    "AllowedMethods": [
      "POST",
      "GET",
      "HEAD",
      "PUT"
    ],
    "AllowedOrigins": [
      "*"
    ],
    "ExposeHeaders": [
      "ETag"
    ]
  }
]
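
  • As a quick sanity check of the limits above, capping individual parts at 2GB for browser compatibility still leaves the maximum 5TB object well within the 10,000-part limit. A minimal Kotlin sketch of the arithmetic:

fun main() {
    val maxObjectSize = 5_497_558_138_880L   // 5TB
    val browserSafePartSize = 2_147_483_648L // 2GB
    // Ceiling division: number of 2GB parts needed for a 5TB object
    val partsNeeded = (maxObjectSize + browserSafePartSize - 1) / browserSafePartSize
    println(partsNeeded) // 2560, far below the 10,000-part maximum
}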

build.gradle.kts

  • Add the following content to the build.gradle.kts at the root of the project.
val awsSdkVersion by extra { "2.25.11" }

dependencies {
    implementation("software.amazon.awssdk:dynamodb-enhanced:$awsSdkVersion")
    implementation("software.amazon.awssdk:s3:$awsSdkVersion")
    // Needed for the UrlConnectionHttpClient used in the local (MinIO) profile below
    implementation("software.amazon.awssdk:url-connection-client:$awsSdkVersion")
}

S3 Local Setup

  • If you want to develop in a safe, isolated local environment separated from your actual S3 buckets, you can use MinIO, an S3-compatible object store, via its official Docker image. The setup with Docker Compose is as follows. (For it to work properly, the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables must be set on your operating system.)
# Set up S3 Local
$ nano docker-compose.yml
version: '3.8'
services:
  s3-local:
    command: "server /data --console-address ':9001'"
    image: "minio/minio:latest"
    container_name: s3-local
    ports:
      - "9000:9000"
      - "9001:9001"
    volumes:
      - "./docker/s3:/data"
    restart: always

# Start S3 Local
$ docker-compose up -d
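
  • Before running the examples below, the target bucket must exist on the local MinIO instance. A minimal sketch for creating one (the bucket name is a placeholder; credentials are assumed to match the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables mentioned above):

import software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient
import software.amazon.awssdk.regions.Region
import software.amazon.awssdk.services.s3.S3Client
import java.net.URI

fun main() {
    // Connect to the local MinIO endpoint started by Docker Compose
    val s3 = S3Client.builder()
        .region(Region.of("us-east-1")) // any region value works against MinIO
        .forcePathStyle(true)
        .endpointOverride(URI.create("http://localhost:9000"))
        .httpClient(UrlConnectionHttpClient.builder().build())
        .build()

    s3.createBucket { it.bucket("{bucket}") } // placeholder bucket name
}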

Writing AmazonS3Util

  • Below is how to write a utility class for executing Multipart Upload at the code level.
import software.amazon.awssdk.core.sync.RequestBody
import software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient
import software.amazon.awssdk.regions.Region
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.*
import software.amazon.awssdk.services.s3.presigner.S3Presigner
import software.amazon.awssdk.services.s3.presigner.model.GetObjectPresignRequest
import software.amazon.awssdk.services.s3.presigner.model.PutObjectPresignRequest
import software.amazon.awssdk.services.s3.presigner.model.UploadPartPresignRequest
import java.io.File
import java.io.InputStream
import java.net.URI
import java.time.Duration
object AmazonS3Util {

    private fun s3ClientV2(): S3Client {

        return when (System.getenv("SPRING_PROFILES_ACTIVE") == "local") {
            // Connect to S3 Local for local development environment
            true -> {
                S3Client
                    .builder()
                    .region(Region.of("{region}"))
                    .forcePathStyle(true)
                    .endpointOverride(URI.create("http://localhost:9000"))
                    .httpClient(UrlConnectionHttpClient.builder().build())
                    .build()
            }

            // Connect to S3 for remote deployment environment
            false -> {
                S3Client
                    .builder()
                    .region(Region.of("{region}"))
                    .build()
            }
        }
    }

    private fun s3PresignerV2(): S3Presigner {

        return S3Presigner
            .builder()
            .region(Region.of("{region}"))
            .build()
    }


    // Starts a multipart upload and returns the uploadId that identifies it
    fun createMultipartUpload(
        bucket: String,
        key: String
    ): String {

        return s3ClientV2().createMultipartUpload(
            CreateMultipartUploadRequest.builder().bucket(bucket).key(key).build()
        ).uploadId()
    }

    private fun generateWriteOnlyMultipartPresignedUrl(
        bucket: String,
        key: String,
        duration: Duration,
        uploadId: String,
        partNumber: Int
    ): String {

        return s3PresignerV2().presignUploadPart { request: UploadPartPresignRequest.Builder ->
            request.signatureDuration(duration)
                .uploadPartRequest { uploadPartRequest: UploadPartRequest.Builder ->
                    uploadPartRequest.bucket(bucket)
                        .key(key)
                        .partNumber(partNumber)
                        .uploadId(uploadId)
                }
        }.url().toString()
    }

    // Generates one presigned upload URL per part number, from 1 to partCount
    fun generateWriteOnlyMultipartPresignedUrls(
        bucket: String,
        key: String,
        duration: Duration,
        uploadId: String,
        partCount: Int
    ): List<FileMultipartUploadUrlDTO> {

        val multipartPresignedUrls = mutableListOf<FileMultipartUploadUrlDTO>()
        (1..partCount).forEach { partNumber ->
            multipartPresignedUrls.add(
                FileMultipartUploadUrlDTO(
                    partNumber, generateWriteOnlyMultipartPresignedUrl(bucket, key, duration, uploadId, partNumber)
                )
            )
        }

        return multipartPresignedUrls
    }

    // Completes the multipart upload by submitting every part's partNumber and ETag
    fun completeMultipartUpload(
        bucket: String,
        key: String,
        uploadId: String,
        parts: List<CompletedPart>
    ) {
        s3ClientV2().completeMultipartUpload { request ->
            request
                .bucket(bucket)
                .key(key)
                .uploadId(uploadId)
                .multipartUpload(CompletedMultipartUpload.builder().parts(parts).build())
        }
    }

    // Aborts an in-progress multipart upload so the already uploaded parts are discarded
    fun abortMultipartUpload(
        bucket: String,
        key: String,
        uploadId: String
    ) {
        s3ClientV2().abortMultipartUpload { request ->
            request
                .bucket(bucket)
                .key(key)
                .uploadId(uploadId)
        }
    }

    // Calculates a valid part count for the given file size, clamping the requested
    // count to the S3 part-size and part-count limits described above
    fun calculateMultipartCount(originalFileSize: Long, requestCount: Long): Int {

        val minPartSize: Long = 5242880                      // 5MB: minimum part size
        val maxPartSize: Long = 2147483648                   // 2GB: browser-safe maximum part size
        val recommendedMinOriginalFileSize: Long = 104857600 // 100MB: below this, use a single part
        val maxPartCount: Long = 10000                       // S3 limit on the number of parts
        val correctedPartCount: Long = if (requestCount > maxPartCount) {
            maxPartCount
        } else {
            requestCount
        }

        if (originalFileSize < recommendedMinOriginalFileSize) return 1
        if (originalFileSize / correctedPartCount < minPartSize) {
            return (originalFileSize / minPartSize).toInt()
        }
        if (originalFileSize / correctedPartCount > maxPartSize) {
            return (originalFileSize / maxPartSize).toInt()
        }

        return correctedPartCount.toInt()
    }
}

data class FileMultipartUploadUrlDTO(

    var partNumber: Int = 1,
    var uploadUrl: String = ""
)

1. Requesting Multipart Upload List

  • Using the previously written utility, a list of URLs for multipart upload targets can be generated as follows.
// Requesting uploadId value for multipart upload to a specific Bucket's Key
val uploadId = AmazonS3Util.createMultipartUpload("{bucket}", "{key}")

// Calculating the number of parts for multipart upload
val multipartCount = AmazonS3Util.calculateMultipartCount({fileSize}, {requestCount})

// Generating the list of multipart upload URLs
val multipartUploadUrls = AmazonS3Util.generateWriteOnlyMultipartPresignedUrls(
    "{bucket}",
    "{key}",
    Duration.ofMinutes(60),
    uploadId,
    multipartCount
)
  • The multipart upload URLs created above can be represented as the following JSON string: a partNumber paired with an uploadUrl is generated for each requested part.
[
    {
        "partNumber": 1,
        "uploadUrl": "{url}"
    },
    {
        "partNumber": 2,
        "uploadUrl": "{url}"
    }
]

2-1. Uploading Multipart: Linux Side

  • Using the list of uploadUrl values obtained earlier, each part can be uploaded with the curl command as follows.
# Physically splitting the target file into 100MB units
$ split -b 104857600 {filename}

# Uploading each part of the split file to the previously obtained uploadUrl
$ curl -v -k -T {partname} "{url}"
< HTTP/1.1 100 Continue
* We are completely uploaded and fine
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< x-amz-id-2: /lZyTJxaI8ZWoEUDNpY7AHzSUhaZegPw2/Fg3Riy2EZwQHNFbIOIuGfCuIufbwu0MAgLmrzx5Yw=
< x-amz-request-id: 9K40M0MSTVB0ZWDK
< Date: Wed, 21 Sep 2022 06:11:39 GMT
< ETag: "0b41b1b5c7228c08597fe7ae9ea06abc"
< Server: AmazonS3
< Content-Length: 0
  • When a part is uploaded successfully, S3 responds with 200 OK and returns a unique hash of the part as the ETag response header. To complete the multipart upload, each ETag must be paired with its partNumber and provided in the completion request.
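
  • For reference, the same part upload can also be driven from the JVM. Below is a minimal Kotlin sketch (not part of the original flow), assuming bytes holds the data of a single part; it issues a PUT to the presigned uploadUrl and reads the ETag from the response headers.

import java.net.HttpURLConnection
import java.net.URL

// Minimal sketch: PUT one part to its presigned uploadUrl and return its ETag
fun uploadPart(uploadUrl: String, bytes: ByteArray): String? {
    val connection = URL(uploadUrl).openConnection() as HttpURLConnection
    connection.requestMethod = "PUT"
    connection.doOutput = true
    connection.setFixedLengthStreamingMode(bytes.size.toLong())
    connection.outputStream.use { it.write(bytes) }
    check(connection.responseCode == 200) { "Part upload failed: ${connection.responseCode}" }
    // S3 returns the part's hash as the ETag response header; keep it for completion
    return connection.getHeaderField("ETag")?.replace("\"", "")
}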

2-2. Uploading Multipart: Browser Side

  • In the browser, File#slice() can be used to split a single file into n chunks and upload each part individually, as shown below. (This example was made with the help of a colleague, Frontend Engineer Jun. [Jun's GitHub Link])

  • The target bucket's CORS policy, configured earlier, allows the browser to read the ETag response header returned for each part's upload.

const chunkInterval = Math.floor(file.size / {number of parts responded by server});
let chunkedStart = 0;
let chunkEnd = 0;

const chunkWithUrlList = {part list responded by server}.map(({
    partNumber,
    uploadUrl
}, i) => {
    if (i === {part list responded by server}.length - 1) {
        chunkEnd = file.size;
    } else {
        chunkEnd = chunkedStart + chunkInterval;
    }

    const chunk = file.slice(chunkedStart, chunkEnd);
    chunkedStart = chunkEnd;

    return {
        uploadUrl,
        partNumber,
        chunk,
    }
});

const fulfilledList = [];
const rejectedList = [];

await Promise.allSettled(chunkWithUrlList.map(
    ({
        uploadUrl,
        partNumber,
        chunk
    }) => fetch(
        uploadUrl, {
            method: 'PUT',
            body: chunk,
        }).then((res) => {
        console.log(`partNumber : ${partNumber} / ETag : ${res.headers.get('ETag')}`)
        return {
            partNumber,
            eTag: res.headers.get('ETag').replace(/"/g, ''),
        }
    })
)).then((res) => {
    console.log(`upload result : ${res}`)
    res.forEach((el) => {
        if (el.status === 'fulfilled') {
            fulfilledList.push(el.value);
            return;
        }

        rejectedList.push(el.reason);
    });
});

// Each part's fetch records its partNumber and ETag; once all fetches have settled,
// the collected list is sent to the server to request completion of the upload.

3-1. Requesting Multipart Upload Completion

  • Finalizing the multipart upload can be done by pairing partNumber with the ETag obtained from individual part uploads and executing the upload completion request as follows.
// Requesting multipart upload completion
AmazonS3Util.completeMultipartUpload(
    "{bucket}",
    "{key}",
    "{uploadId}",
    parts = listOf(CompletedPart.builder().partNumber({partNumber}).eTag("{eTag}").build())
)
  • If a completion request is made before all parts have been uploaded, a software.amazon.awssdk.services.s3.model.S3Exception is thrown and should be handled appropriately.

  • What happens if an incorrect ETag value is sent? A software.amazon.awssdk.services.s3.model.NoSuchUploadException is thrown, which should also be handled appropriately, as in the sketch below.
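
  • For illustration, a minimal sketch of handling both failure cases (placeholder values follow the article's convention; whether to abort on failure is a design choice, not a requirement):

try {
    AmazonS3Util.completeMultipartUpload("{bucket}", "{key}", "{uploadId}", parts)
} catch (e: NoSuchUploadException) {
    // The uploadId is no longer known to S3 (e.g. already completed or aborted)
    throw IllegalStateException("Multipart upload not found", e)
} catch (e: S3Exception) {
    // e.g. a part is missing or an ETag does not match; abort so the already
    // uploaded parts do not keep accruing storage cost
    AmazonS3Util.abortMultipartUpload("{bucket}", "{key}", "{uploadId}")
    throw e
}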

3-2. Requesting Multipart Upload Cancellation

  • From the moment an uploadId is issued until every part has been uploaded to S3, nothing goes wrong as long as the upload is eventually completed. In the real world, however, uploads are often cancelled or abandoned partway through. The biggest issue is that the parts already uploaded keep occupying storage, and therefore keep incurring S3 costs, until the multipart upload is either completed or aborted. To prevent this, the upload can be cancelled as follows. Typically, a daily batch job that aborts stale, unfinished uploads is sufficient (a sketch of such a job follows the example below).
// Requesting multipart upload cancellation
AmazonS3Util.abortMultipartUpload(
    "{bucket}",
    "{key}",
    "{uploadId}"
)
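
  • Below is a minimal sketch of the daily cleanup job mentioned above (illustrative only; it assumes an S3Client configured the same way as in AmazonS3Util). It aborts any multipart upload that was initiated more than a day ago and never completed. Alternatively, a bucket lifecycle rule with AbortIncompleteMultipartUpload can remove stale parts automatically.

import software.amazon.awssdk.services.s3.S3Client
import java.time.Duration
import java.time.Instant

// Abort multipart uploads that were started more than a day ago and never finished
fun abortStaleMultipartUploads(s3: S3Client, bucket: String) {
    val cutoff = Instant.now().minus(Duration.ofDays(1))
    s3.listMultipartUploads { it.bucket(bucket) }
        .uploads()
        .filter { it.initiated().isBefore(cutoff) }
        .forEach { upload ->
            s3.abortMultipartUpload {
                it.bucket(bucket).key(upload.key()).uploadId(upload.uploadId())
            }
        }
}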

Reference Articles