# S3 storage
You can read, write, or glob files hosted on object storage servers using the Amazon S3 API.
## Usage
S3 storage access is provided by the `httpfs` extension.
See Install an extension and Load an extension before getting started.
## Configure the connection
Once the httpfs extension is loaded, you can configure the S3 connection using
CALL statements in Cypher.
```cypher
CALL <option_name>=<option_value>
```

The following options are supported:
| Option name | Description |
|---|---|
| `s3_access_key_id` | S3 access key ID |
| `s3_secret_access_key` | S3 secret access key |
| `s3_endpoint` | S3 endpoint |
| `s3_region` | S3 region |
| `s3_url_style` | S3 URL style (either `vhost` or `path`) |
| `s3_uploader_max_num_parts_per_file` | Maximum number of parts per file; used for part size calculation |
| `s3_uploader_max_filesize` | Maximum file size; used for part size calculation |
| `s3_uploader_threads_limit` | Maximum number of uploader threads |
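For example, a connection to a bucket in the AWS `us-east-1` region might be configured as follows (the key values below are placeholders, not real credentials):

```cypher
CALL s3_access_key_id='<your-access-key-id>';
CALL s3_secret_access_key='<your-secret-access-key>';
CALL s3_region='us-east-1';
```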
You can alternatively set the following environment variables:
| Environment variable | Description |
|---|---|
| `S3_ACCESS_KEY_ID` | S3 access key ID |
| `S3_SECRET_ACCESS_KEY` | S3 secret access key |
| `S3_ENDPOINT` | S3 endpoint |
| `S3_REGION` | S3 region |
| `S3_URL_STYLE` | S3 URL style |
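As a sketch, the same settings could instead be exported from the shell before starting the process (the key and endpoint values below are placeholders):

```shell
# Example values; replace with your own credentials, endpoint, and region.
export S3_ACCESS_KEY_ID='<your-access-key-id>'
export S3_SECRET_ACCESS_KEY='<your-secret-access-key>'
export S3_ENDPOINT='<s3-endpoint>'
export S3_REGION='us-east-1'
export S3_URL_STYLE='vhost'
```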
### Using a non-AWS endpoint
To connect to an S3-compatible service (such as Cloudflare R2, MinIO, or Tigris), set `s3_endpoint` to the service's endpoint and provide credentials as usual:
```cypher
CALL s3_access_key_id='<your-key-id>';
CALL s3_secret_access_key='<your-secret>';
CALL s3_endpoint='<service-endpoint>';
CALL s3_region='auto';
```

A few notes:
- `s3_region` is required for request signing but is ignored by most non-AWS services. Any non-empty value works; `auto` is a common convention.
- `s3_url_style` defaults to `vhost` (virtual-hosted-style). Some services (for example, MinIO in its default configuration) require path-style URLs; set `s3_url_style='path'` in that case.
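As a minimal sketch, assuming a local MinIO server listening on `localhost:9000` with MinIO's default `minioadmin` credentials (the exact endpoint format may vary by deployment), the connection could look like:

```cypher
// Hypothetical local MinIO setup; adjust endpoint and credentials to your deployment.
CALL s3_endpoint='localhost:9000';
CALL s3_access_key_id='minioadmin';
CALL s3_secret_access_key='minioadmin';
CALL s3_region='auto';
CALL s3_url_style='path';
```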
## Scanning data from S3
The example below shows how to scan data from a Parquet file hosted on S3.
```cypher
LOAD FROM 's3://lbug-datasets/follows.parquet'
RETURN *;
```

## Glob data from S3
You can glob data from S3 just as you would from a local file system. Globbing is implemented using the S3 ListObjectsV2 API.
```cypher
CREATE NODE TABLE tableOfTypes (
    id INT64,
    int64Column INT64,
    doubleColumn DOUBLE,
    booleanColumn BOOLEAN,
    dateColumn DATE,
    stringColumn STRING,
    listOfInt64 INT64[],
    listOfString STRING[],
    listOfListOfInt64 INT64[][],
    structColumn STRUCT(ID int64, name STRING),
    PRIMARY KEY (id)
);
COPY tableOfTypes
FROM "s3://lbug-datasets/types/types_50k_*.parquet";
```

## Writing data to S3
Writing to S3 uses the AWS multipart upload API.
```cypher
COPY (
    MATCH (p:Location)
    RETURN p.*
)
TO 's3://lbug-datasets/saved/location.parquet';
```

## Additional configurations
### Requirements on the S3 server APIs
S3 offers a standard set of APIs for read and write operations. The httpfs extension uses these APIs to communicate with remote storage services and thus should also work
with other services that are compatible with the S3 API (such as Cloudflare R2 and Tigris).
The table below shows which parts of the S3 API are needed for each feature of the extension to work.
| Feature | Required S3 API |
|---|---|
| Public file reads | HTTP Range request |
| Private file reads | Secret key authentication |
| File glob | ListObjectsV2 |
| File writes | Multipart upload |
### Local cache
Scanning the same file multiple times can be slow and redundant. To avoid this, you can locally cache remote files to improve performance for repeated scans. See the local cache section for more details.