S3 API

s3cmd for files up to 15 MB

# ~/.r2-s3cmd.cfg
[default]
access_key =
secret_key =
bucket_location = auto
host_base = [account-id].r2.cloudflarestorage.com
host_bucket = [account-id].r2.cloudflarestorage.com
enable_multipart = False

s3cmd -c ~/.r2-s3cmd.cfg --no-check-md5 --multipart-chunk-size-mb=50 sync local-dir s3://bucket

s5cmd for files larger than 15 MB

https://github.com/peak/s5cmd

NOTE

 s5cmd is 32x faster than s3cmd and 12x faster than aws-cli. For downloads, s5cmd can saturate a 40Gbps link (~4.3 GB/s), whereas s3cmd and aws-cli can only reach 85 MB/s and 375 MB/s respectively.

brew install peak/tap/s5cmd

# ~/.r2-s5cmd.cfg
[default]  
aws_access_key_id =
aws_secret_access_key =
aws_region = auto  
host_base = [account-id].r2.cloudflarestorage.com  
host_bucket = [account-id].r2.cloudflarestorage.com  
enable_multipart = False
 
# Use the --profile flag to select a named profile and the --credentials-file flag to use a specific credentials file
 
# Use your company profile in AWS default credential file
s5cmd --profile my-work-profile ls s3://my-company-bucket/
 
# Use your company profile in your own credential file
s5cmd --credentials-file ~/.your-credentials-file --profile my-work-profile ls s3://my-company-bucket

Using environment variables

# Export your AWS access key and secret pair
export AWS_ACCESS_KEY_ID='<your-access-key-id>'
export AWS_SECRET_ACCESS_KEY='<your-secret-access-key>'
export AWS_PROFILE='<your-profile-name>'
export AWS_REGION='<your-bucket-region>'

s5cmd ls s3://your-bucket/

Region detection

While executing the commands, s5cmd detects the region according to the following order of priority:

  1. --source-region or --destination-region flags of cp command.
  2. AWS_REGION environment variable.
  3. Region section of AWS profile.
  4. Auto detection from bucket region (via HeadBucket API call).
  5. us-east-1 as default region.
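The priority order above can be modeled as a simple "first value that is set wins" lookup. The following is a minimal sketch of that idea only, not s5cmd's actual code; the function name `resolve_region` and its parameter names are hypothetical.

```python
from typing import Optional

DEFAULT_REGION = "us-east-1"  # the fallback named in step 5 above


def resolve_region(
    flag_region: Optional[str] = None,      # --source-region / --destination-region
    env_region: Optional[str] = None,       # AWS_REGION environment variable
    profile_region: Optional[str] = None,   # region section of the AWS profile
    detected_region: Optional[str] = None,  # result of a HeadBucket lookup
) -> str:
    """Return the first region that is set, in the priority order listed above."""
    for candidate in (flag_region, env_region, profile_region, detected_region):
        if candidate:
            return candidate
    return DEFAULT_REGION


# The environment variable outranks the profile; nothing set means the default.
print(resolve_region(env_region="eu-west-1", profile_region="us-west-2"))
print(resolve_region())
```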

Examples

Check if a bucket or object exists

s5cmd head s3://bucket/
s5cmd head s3://bucket/object.gz

Download a single S3 object

s5cmd cp s3://bucket/object.gz .

Download multiple S3 objects

Suppose we have the following objects:

s3://bucket/logs/2020/03/18/file1.gz
s3://bucket/logs/2020/03/19/file2.gz
s3://bucket/logs/2020/03/19/originals/file3.gz

s5cmd cp 's3://bucket/logs/2020/03/*' logs/

s5cmd will match the given wildcards and arguments by doing an efficient search against the given prefixes. All matching objects will be downloaded in parallel. s5cmd will create the destination directory if it is missing.

logs/ directory content will look like:

$ tree
.
└── logs
    ├── 18
    │   └── file1.gz
    └── 19
        ├── file2.gz
        └── originals
            └── file3.gz

4 directories, 3 files

ℹ️ s5cmd preserves the source directory structure by default. If you want to flatten the source directory structure, use the --flatten flag.

s5cmd cp --flatten 's3://bucket/logs/2020/03/*' logs/

logs/ directory content will look like:

$ tree
.
└── logs
    ├── file1.gz
    ├── file2.gz
    └── file3.gz

1 directory, 3 files
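To make the difference between the two layouts concrete, here is a small sketch that maps object keys to local paths. It assumes the part of the pattern before the first wildcard is a fixed prefix that gets stripped; that assumption and the helper name `destination_path` are illustrative, not s5cmd's implementation.

```python
import posixpath


def destination_path(key: str, pattern: str, dest_dir: str, flatten: bool = False) -> str:
    """Map an object key to a local path for a pattern such as 'logs/2020/03/*'."""
    # Everything before the first wildcard is treated as the fixed prefix.
    prefix = pattern.split("*", 1)[0]
    if not key.startswith(prefix):
        raise ValueError(f"{key!r} does not start with prefix {prefix!r}")
    relative = key[len(prefix):]
    if flatten:
        # Drop intermediate directories and keep only the file name.
        relative = posixpath.basename(relative)
    return posixpath.join(dest_dir, relative)


keys = [
    "logs/2020/03/18/file1.gz",
    "logs/2020/03/19/file2.gz",
    "logs/2020/03/19/originals/file3.gz",
]
for key in keys:
    print(destination_path(key, "logs/2020/03/*", "logs/"),
          destination_path(key, "logs/2020/03/*", "logs/", flatten=True))
```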

Upload a file to S3

s5cmd cp object.gz s3://bucket/

by setting server side encryption (aws kms) of the file:

s5cmd cp -sse aws:kms -sse-kms-key-id <your-kms-key-id> object.gz s3://bucket/

by setting Access Control List (acl) policy of the object:

s5cmd cp -acl bucket-owner-full-control object.gz s3://bucket/

Upload multiple files to S3

s5cmd cp directory/ s3://bucket/

This uploads all files in the given directory to S3 while keeping the folder hierarchy of the source.

Stream stdin to S3

You can upload remote objects by piping stdin to s5cmd:

curl https://github.com/peak/s5cmd/ | s5cmd pipe s3://bucket/s5cmd.html

Or you can compress the data before uploading:

gzip -c file | s5cmd pipe s3://bucket/file.gz

Delete an S3 object

s5cmd rm s3://bucket/logs/2020/03/18/file1.gz

Delete multiple S3 objects

s5cmd rm 's3://bucket/logs/2020/03/19/*'

Will remove all matching objects:

s3://bucket/logs/2020/03/19/file2.gz
s3://bucket/logs/2020/03/19/originals/file3.gz

s5cmd utilizes the S3 batch delete API. If there are up to 1000 matching objects, they’ll be deleted in a single request. However, it should be noted that commands such as

s5cmd rm s3://bucket-foo/object s3://bucket-bar/object

are not supported by s5cmd and result in an error (since there are 2 different buckets), as this is at odds with the benefit of performing batch delete requests. If needed, you can use s5cmd run mode for this case, i.e.,

$ s5cmd run
rm s3://bucket-foo/object
rm s3://bucket-bar/object

More details and examples on s5cmd run are presented in a later section.
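The batching constraint can be pictured as grouping object URLs per bucket and then splitting each group into chunks of at most 1000 keys. The sketch below is only a model of that idea under those assumptions; `plan_delete_batches` is a hypothetical helper, not part of s5cmd.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

BATCH_LIMIT = 1000  # maximum number of keys per batch delete request


def plan_delete_batches(urls: Iterable[str]) -> Dict[str, List[List[str]]]:
    """Group s3:// URLs by bucket, then chunk each bucket's keys into batches."""
    by_bucket: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        if not url.startswith("s3://"):
            raise ValueError(f"not an S3 URL: {url!r}")
        bucket, _, key = url[len("s3://"):].partition("/")
        if not bucket or not key:
            raise ValueError(f"URL must contain a bucket and a key: {url!r}")
        by_bucket[bucket].append(key)
    # One batch can only target one bucket, so batches are built per bucket.
    return {
        bucket: [keys[i:i + BATCH_LIMIT] for i in range(0, len(keys), BATCH_LIMIT)]
        for bucket, keys in by_bucket.items()
    }


plan = plan_delete_batches(["s3://bucket-foo/object", "s3://bucket-bar/object"])
print({bucket: len(batches) for bucket, batches in plan.items()})
```

Two URLs in two different buckets always end up in two separate batches, which is why a single rm invocation across buckets cannot benefit from batching.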

Copy objects from S3 to S3

s5cmd supports copying objects on the server side as well.

s5cmd cp 's3://bucket/logs/2020/*' s3://bucket/logs/backup/

Will copy all the matching objects to the given S3 prefix, respecting the source folder hierarchy.

⚠️ Copying objects (from S3 to S3) larger than 5GB is not supported yet. We have an open ticket to track the issue.

Using Exclude and Include Filters

s5cmd supports the --exclude and --include flags, which can be used to specify patterns for objects to be excluded or included in commands.

  • The --exclude flag specifies objects that should be excluded from the operation. Any object that matches the pattern will be skipped.
  • The --include flag specifies objects that should be included in the operation. Only objects that match the pattern will be handled.
  • If both flags are used, --exclude has precedence over --include. This means that if an object URL matches any of the --exclude patterns, the object will be skipped, even if it also matches one of the --include patterns.
  • The order of the flags does not affect the results (unlike aws-cli).

The command below will delete only objects that end with .log.

s5cmd rm --include "*.log" 's3://bucket/logs/2020/*'

The command below will delete all objects except those that end with .log or .txt.

s5cmd rm --exclude "*.log" --exclude "*.txt" 's3://bucket/logs/2020/*'

If you wish, you can use multiple flags, like below. It will download objects that start with request or end with .log.

s5cmd cp --include "*.log" --include "request*" 's3://bucket/logs/2020/*' .

Using a combination of --include and --exclude is also possible. The command below will only sync objects that end with .log or .txt but exclude those that start with access_. For example, request.log and license.txt will be included, while access_log.txt and readme.md are excluded.

s5cmd sync --include "*.log" --exclude "access_*" --include "*.txt" 's3://bucket/logs/*' .
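The precedence rules can be sketched as a small predicate. This is an illustrative model only: it assumes shell-style patterns matched against the object name with Python's fnmatch, which may differ from how s5cmd matches patterns internally, and `is_selected` is a hypothetical name.

```python
from fnmatch import fnmatchcase
from typing import Sequence


def is_selected(name: str, include: Sequence[str] = (), exclude: Sequence[str] = ()) -> bool:
    """Apply exclude-before-include precedence to a single object name."""
    # Any exclude match wins, regardless of include patterns or flag order.
    if any(fnmatchcase(name, pattern) for pattern in exclude):
        return False
    # With include patterns present, only matching objects are handled.
    if include:
        return any(fnmatchcase(name, pattern) for pattern in include)
    # No include patterns: everything not excluded is handled.
    return True


include = ["*.log", "*.txt"]
exclude = ["access_*"]
for name in ["request.log", "license.txt", "access_log.txt", "readme.md"]:
    print(name, is_selected(name, include, exclude))
```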

Select JSON object content using SQL

s5cmd supports the SelectObjectContent S3 operation, and will run your SQL query against objects matching normal wildcard syntax and emit matching JSON records via stdout. Records from multiple objects will be interleaved, and the order of the records is not guaranteed (though it’s likely that the records from a single object will arrive in order, even if interleaved with other records).

$ s5cmd select --compression GZIP \
  --query "SELECT s.timestamp, s.hostname FROM S3Object s WHERE s.ip_address LIKE '10.%' OR s.application='unprivileged'" \
  s3://bucket-foo/object/2021/*
{"timestamp":"2021-07-08T18:24:06.665Z","hostname":"application.internal"}
{"timestamp":"2021-07-08T18:24:16.095Z","hostname":"api.github.com"}

At the moment this operation only supports JSON records selected with SQL. S3 calls this lines-type JSON, but it seems to work even if the records aren’t line-delimited. YMMV.

Count objects and determine total size

$ s5cmd du --humanize 's3://bucket/2020/*'

30.8M bytes in 3 objects: s3://bucket/2020/*

Run multiple commands in parallel

The most powerful feature of s5cmd is the commands file. Thousands of S3 and filesystem commands are declared in a file (or simply piped in from another process) and they are executed using multiple parallel workers. Since only one program is launched, thousands of unnecessary fork-exec calls are avoided. This way S3 execution times can reach a few thousand operations per second.

s5cmd run commands.txt

or

cat commands.txt | s5cmd run

commands.txt content could look like:

cp s3://bucket/2020/03/* logs/2020/03/

# line comments are supported
rm s3://bucket/2020/03/19/file2.gz

# empty lines are OK too like above

# rename an S3 object
mv s3://bucket/2020/03/18/file1.gz s3://bucket/2020/03/18/original/file.gz
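A commands file like the one above is often generated by a script. The sketch below builds such lines in memory; the helper name `build_commands` is hypothetical, and it assumes keys contain no whitespace or characters that would need quoting.

```python
from typing import Iterable, List


def build_commands(keys: Iterable[str], bucket: str, local_dir: str) -> List[str]:
    """Build one 'cp' line per object key, in the commands-file format shown above."""
    lines = ["# generated commands file"]
    for key in keys:
        # Assumption: keys are plain (no spaces or quotes), so no escaping is done.
        lines.append(f"cp s3://{bucket}/{key} {local_dir}/{key}")
    return lines


commands = build_commands(["2020/03/18/file1.gz", "2020/03/19/file2.gz"], "bucket", "logs")
print("\n".join(commands))
```

The resulting text can be saved as commands.txt and passed to s5cmd run, or piped to it on stdin, as shown earlier.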

Sync

The sync command synchronizes S3 buckets, prefixes, directories and files. It also works between S3 buckets and prefixes. It compares files between source and destination, taking the source files as the source of truth. It:

  • copies files that do not exist in the destination
  • copies files that exist in both locations if the comparison made with the sync strategy allows it

It performs a one-way synchronization from source to destination without modifying any of the source files or deleting any of the destination files (unless the --delete flag is passed).

Suppose we have the following files:

   -  29 Sep 10:00 .
5000  29 Sep 11:00 ├── favicon.ico
 300  29 Sep 10:00 ├── index.html
  50  29 Sep 10:00 ├── readme.md
  80  29 Sep 11:30 └── styles.css

s5cmd ls s3://bucket/static/
2021/09/29 10:00:01               300 index.html
2021/09/29 11:10:01                10 readme.md
2021/09/29 10:00:01                90 styles.css
2021/09/29 11:10:01                10 test.html

Running the sync command below would:

  • copy favicon.ico
    • the file does not exist in the destination.
  • copy styles.css
    • the source file is newer than its remote counterpart.
  • copy readme.md
    • even though the source file is older, its size differs from the destination one; the source file is assumed to be the source of truth.

s5cmd sync . s3://bucket/static/

cp favicon.ico s3://bucket/static/favicon.ico
cp styles.css s3://bucket/static/styles.css
cp readme.md s3://bucket/static/readme.md

Running with the --delete flag would also delete destination files that do not exist in the source:

s5cmd sync --delete . s3://bucket/static/

rm s3://bucket/static/test.html
cp favicon.ico s3://bucket/static/favicon.ico
cp styles.css s3://bucket/static/styles.css
cp readme.md s3://bucket/static/readme.md

It’s also possible to use wildcards to sync only a subset of files.

To sync only the .html files in the S3 bucket above to the local file system:

s5cmd sync 's3://bucket/static/*.html' .

cp s3://bucket/static/index.html index.html
cp s3://bucket/static/test.html test.html

We don’t support syncing between 2 storage endpoints out of the box. The current solution is to sync remote objects to your local disk first, then sync your local files to the target remote storage. For example, if you’d like to sync S3 and Google Cloud Storage:

s5cmd sync 's3://s3-bucket/path/*' download_folder/

s5cmd --endpoint-url <gcs-endpoint> sync 'download_folder/*' s3://gcs-bucket/path/

Strategy

Default

By default, s5cmd compares both the size and the modification time of files, treating source files as the source of truth. A difference in size, or a source file that is newer than its destination counterpart, causes s5cmd to copy the source object to the destination.

mod time      size          should sync
src > dst     src != dst    yes
src > dst     src == dst    yes
src <= dst    src != dst    yes
src <= dst    src == dst    no

Size only

With the --size-only flag, s5cmd uses a strategy that compares only file sizes. The source is treated as the source of truth, and any difference in size causes s5cmd to copy the source object to the destination.

mod time      size          should sync
src > dst     src != dst    yes
src > dst     src == dst    no
src <= dst    src != dst    yes
src <= dst    src == dst    no
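The two strategies can be summarized as one decision function. This is a sketch under the assumption that a file is copied when it is missing at the destination, when sizes differ, or (default strategy only) when the source is strictly newer; `FileInfo` and `should_sync` are hypothetical names, not s5cmd internals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileInfo:
    size: int     # size in bytes
    mtime: float  # modification time; any comparable unit works


def should_sync(src: FileInfo, dst: Optional[FileInfo], size_only: bool = False) -> bool:
    """Decide whether the source should be copied over the destination."""
    if dst is None:
        return True                # missing at destination: always copy
    if src.size != dst.size:
        return True                # size differs: copy under both strategies
    if size_only:
        return False               # --size-only ignores modification times
    return src.mtime > dst.mtime   # default: copy only if the source is newer


# Files from the example above, with times expressed as minutes since midnight.
print(should_sync(FileInfo(5000, 660), None))               # favicon.ico
print(should_sync(FileInfo(300, 600), FileInfo(300, 600)))  # index.html
print(should_sync(FileInfo(50, 600), FileInfo(10, 670)))    # readme.md
print(should_sync(FileInfo(80, 690), FileInfo(90, 600)))    # styles.css
```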

Dry run

The --dry-run flag outputs the operations that would be performed without actually carrying them out. Given the following objects:

s3://bucket/pre/file1.gz
...
s3://bucket/pre/last.txt

running

s5cmd --dry-run cp 's3://bucket/pre/*' s3://another-bucket/

will output

cp s3://bucket/pre/file1.gz s3://another-bucket/file1.gz
...
cp s3://bucket/pre/last.txt s3://another-bucket/last.txt

However, those copy operations will not be performed. The output only shows what s5cmd would do when run without --dry-run.

Note that --dry-run can be used with any operation that has a side effect, i.e., cp, mv, rm, mb …

S3 ListObjects API Backward Compatibility

The --use-list-objects-v1 flag will force using S3 ListObjectsV1 API. This flag is useful for services that do not support ListObjectsV2 API.

s5cmd --use-list-objects-v1 ls s3://bucket/

Shell auto-completion

Shell completion is supported for bash, pwsh (PowerShell) and zsh.

Run s5cmd --install-completion to obtain the appropriate auto-completion script for your shell. Note that --install-completion does not install the auto-completion; it only prints the instructions for installing it. The name is kept as is for backward compatibility.

To actually enable auto-completion:

in bash and zsh:

you should add the auto-completion script to your .bashrc or .zshrc file, respectively.

in pwsh:

you should save the auto-completion script to a file named s5cmd.ps1 and add the full path of the s5cmd.ps1 file to your profile file (which you can locate with $profile).

Finally, restart your shell to activate the changes.

Note: The SHELL environment variable must be accurate for auto-completion to function properly. That is, it should point to the bash binary in bash, to the zsh binary in zsh, and to the pwsh binary in PowerShell.

Note: Auto-completion is tested with the following versions of the shells:
zsh 5.8.1 (x86_64-apple-darwin21.0) 
GNU bash, version 5.1.16(1)-release (x86_64-apple-darwin21.1.0) 
PowerShell 7.2.6

Google Cloud Storage support

s5cmd supports S3 API compatible services, such as GCS, Minio or your favorite object storage.

s5cmd --endpoint-url https://storage.googleapis.com ls

or an alternative with environment variable

S3_ENDPOINT_URL="https://storage.googleapis.com" s5cmd ls

# or

export S3_ENDPOINT_URL="https://storage.googleapis.com"
s5cmd ls

All variants will return your GCS buckets.

s5cmd reads .aws/credentials to access Google Cloud Storage. Populate the aws_access_key_id and aws_secret_access_key fields in .aws/credentials with an HMAC key created using this procedure.

s5cmd will use virtual-host style bucket resolving for S3, S3 transfer acceleration and GCS. If a custom endpoint is provided, it’ll fall back to path-style.

Retry logic

s5cmd uses an exponential backoff retry mechanism for transient or potential server-side throttling errors. Non-retriable errors, such as invalid credentials, authorization errors, etc., will not be retried. By default, s5cmd will retry 10 times for up to a minute. The number of retries is adjustable via the --retry-count flag.
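As a generic illustration of exponential backoff (not s5cmd's actual implementation), the sketch below retries only errors marked as transient and doubles the wait after each failed attempt. The names `TransientError` and `call_with_retry`, as well as the base and maximum delay values, are assumptions made for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for a retryable failure such as throttling."""


def call_with_retry(
    operation: Callable[[], T],
    retry_count: int = 10,       # mirrors the default retry count mentioned above
    base_delay: float = 0.1,     # hypothetical first delay in seconds
    max_delay: float = 10.0,     # hypothetical upper bound per delay
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying transient failures with doubling delays."""
    for attempt in range(retry_count + 1):
        try:
            return operation()
        except TransientError:
            if attempt == retry_count:
                raise  # retries exhausted
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise RuntimeError("unreachable")  # loop always returns or raises
```

Any exception other than TransientError propagates immediately, which models the "non-retriable errors will not be retried" behavior described above.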

ℹ️ Enable debug level logging for displaying retryable errors.

Integrity Verification

s5cmd verifies the integrity of files uploaded to Amazon S3 by checking the Content-MD5 and X-Amz-Content-Sha256 headers. These headers are added by the AWS SDK for both standard and multipart uploads.

  • Content-MD5 is a checksum of the file’s contents, calculated using the MD5 algorithm.
  • X-Amz-Content-Sha256 is a checksum of the file’s contents, calculated using the SHA256 algorithm.

If the checksums in these headers do not match the checksum of the file that was actually uploaded, then s5cmd will fail the upload. This helps to ensure that the file was not corrupted during transmission.
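For reference, the two header values can be computed locally with the standard library. This sketch shows the encodings only (Content-MD5 is the base64-encoded binary MD5 digest, X-Amz-Content-Sha256 is the hex-encoded SHA256 digest); the function names are hypothetical and this is not how s5cmd or the AWS SDK is structured internally.

```python
import base64
import hashlib


def content_md5(data: bytes) -> str:
    """Base64 encoding of the raw 16-byte MD5 digest, as used in Content-MD5."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def content_sha256(data: bytes) -> str:
    """Lowercase hex SHA256 digest, as used in X-Amz-Content-Sha256."""
    return hashlib.sha256(data).hexdigest()


payload = b"hello"
print("Content-MD5:", content_md5(payload))
print("X-Amz-Content-Sha256:", content_sha256(payload))
```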

If the checksum calculated by S3 does not match the checksums provided in the Content-MD5 and X-Amz-Content-Sha256 headers, S3 will not store the object. Instead, it will return an error message to s5cmd with the error code InvalidDigest for an MD5 mismatch or XAmzContentSHA256Mismatch for a SHA256 mismatch.

Error Code                   Description
InvalidDigest                The checksum provided in the Content-MD5 header does not match the checksum calculated by S3.
XAmzContentSHA256Mismatch    The checksum provided in the X-Amz-Content-Sha256 header does not match the checksum calculated by S3.

If s5cmd receives either of these error codes, it will not retry the upload and will exit with code 1.

If the MD5 checksum mismatches, you will see an error like the one below.

ERROR "cp file.log s3://bucket/file.log": InvalidDigest: The Content-MD5 you specified was invalid. status code: 400, request id: S3TR4P2E0A2K3JMH7, host id: XTeMYKd2KECOHWk5S

If the SHA256 checksum mismatches, you will see an error like the one below.

ERROR "cp file.log s3://bucket/file.log": XAmzContentSHA256Mismatch: The provided 'x-amz-content-sha256' header does not match what was computed. status code: 400, request id: S3TR4P2E0A2K3JMH7, host id: XTeMYKd2KECOHWk5S

aws-cli and s5cmd are both command-line tools that can be used to interact with Amazon S3. However, there are some differences between the two tools in terms of how they verify the integrity of data uploaded to S3.

  • Number of retries: aws-cli will retry up to five times to upload a file, while s5cmd will not retry.
  • Checksums: If you enable Signature Version 4 in your ~/.aws/config file, aws-cli will only check the SHA256 checksum of a file, while s5cmd will check both the MD5 and SHA256 checksums.

Using wildcards

On some shells, like zsh, the * character gets treated as a file globbing wildcard, which causes unexpected results for s5cmd. You might see an output like:

zsh: no matches found

If that happens, you need to wrap your wildcard expression in single quotes, like:

s5cmd cp '*.gz' s3://bucket/

Output

s5cmd supports both structured and unstructured outputs.

  • unstructured output
$ s5cmd cp s3://bucket/testfile .
 
cp s3://bucket/testfile testfile
$ s5cmd cp --no-clobber s3://somebucket/file.txt file.txt
 
ERROR "cp s3://somebucket/file.txt file.txt": object already exists
  • If --json flag is provided:
{
    "operation": "cp",
    "success": true,
    "source": "s3://bucket/testfile",
    "destination": "testfile",
    "object": "[object]"
}
{
    "operation": "cp",
    "job": "cp s3://somebucket/file.txt file.txt",
    "error": "'cp s3://somebucket/file.txt file.txt': object already exists"
}
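Structured output is convenient to post-process in a script. The sketch below separates successful records from failed ones; it assumes one JSON object per line and that failed operations carry an "error" field as in the sample above. `split_results` is a hypothetical helper.

```python
import json
from typing import Dict, Iterable, List, Tuple


def split_results(lines: Iterable[str]) -> Tuple[List[Dict], List[Dict]]:
    """Split JSON records into (succeeded, failed) based on an 'error' field."""
    succeeded: List[Dict] = []
    failed: List[Dict] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        (failed if "error" in record else succeeded).append(record)
    return succeeded, failed


sample = [
    '{"operation": "cp", "success": true, "source": "s3://bucket/testfile", "destination": "testfile"}',
    '{"operation": "cp", "job": "cp s3://somebucket/file.txt file.txt", "error": "object already exists"}',
]
ok, errors = split_results(sample)
print(len(ok), "succeeded,", len(errors), "failed")
```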

Configuring Concurrency

numworkers

numworkers is a global option that sets the size of the global worker pool. The default value of numworkers is 256. Commands such as cp, select and run, which can benefit from parallelism, use this worker pool to execute tasks. A task can be an upload, a download or anything in a run file.

For example, if you are uploading 100 files to an S3 bucket and the --numworkers is set to 10, then s5cmd will limit the number of files concurrently uploaded to 10.

s5cmd --numworkers 10 cp '/Users/foo/bar/*' s3://mybucket/foo/bar/

concurrency

concurrency is a cp command option. It sets the number of parts that will be uploaded or downloaded in parallel for a single file. This parameter is used by the AWS Go SDK. Default value of concurrency is 5.

numworkers and concurrency options can be used together:

s5cmd --numworkers 10 cp --concurrency 10 '/Users/foo/bar/*' s3://mybucket/foo/bar/

If you have a few, large files to download, setting --numworkers to a very high value will not affect download speed. In this scenario setting --concurrency to a higher value may have a better impact on the download speed.
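The effect of a bounded worker pool can be demonstrated with a thread pool: no matter how many tasks are queued, at most `numworkers` run at once. This is a generic illustration using Python's standard library, not s5cmd's Go implementation; `run_with_workers` is a hypothetical name and the sleep stands in for a transfer.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_workers(task_count: int, numworkers: int) -> int:
    """Run task_count dummy tasks and return the peak number running at once."""
    lock = threading.Lock()
    active = 0
    peak = 0

    def task(_: int) -> None:
        nonlocal active, peak
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)  # stand-in for an upload or download
        with lock:
            active -= 1

    with ThreadPoolExecutor(max_workers=numworkers) as pool:
        list(pool.map(task, range(task_count)))
    return peak


# 100 queued tasks, but never more than 10 in flight at the same time.
print(run_with_workers(100, 10))
```

With only a few large files, the pool is mostly idle, which is why raising the per-file part concurrency tends to matter more in that scenario.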