# S3 storage
You can read, write, or glob files hosted on object storage servers using the Amazon S3 API.
## Usage
S3 storage access is provided by the `httpfs` extension.
See Install an extension and Load an extension before getting started.
## Configure the connection
Once the httpfs extension is loaded, you can configure the S3 connection using
CALL statements in Cypher.
```cypher
CALL <option_name>=<option_value>
```

The following options are supported:
| Option name | Description |
|---|---|
| `s3_access_key_id` | S3 access key ID |
| `s3_secret_access_key` | S3 secret access key |
| `s3_endpoint` | S3 endpoint |
| `s3_region` | S3 region |
| `s3_url_style` | S3 URL style (either `vhost` or `path`) |
| `s3_uploader_max_num_parts_per_file` | Maximum number of parts per file; used for part size calculation |
| `s3_uploader_max_filesize` | Maximum file size; used for part size calculation |
| `s3_uploader_threads_limit` | Maximum number of uploader threads |
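For example, a connection to a bucket in the AWS `us-east-1` region might be configured as follows (the key values below are placeholders, not real credentials):

```cypher
CALL s3_access_key_id='<your-access-key-id>';
CALL s3_secret_access_key='<your-secret-access-key>';
CALL s3_region='us-east-1';
```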
You can alternatively set the following environment variables:
| Environment variable | Description |
|---|---|
| `S3_ACCESS_KEY_ID` | S3 access key ID |
| `S3_SECRET_ACCESS_KEY` | S3 secret access key |
| `S3_ENDPOINT` | S3 endpoint |
| `S3_REGION` | S3 region |
| `S3_URL_STYLE` | S3 URL style |
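As a sketch, the same settings could instead be exported from the shell before starting the process (the key and endpoint values below are placeholders):

```shell
# Example values; replace with your own credentials, endpoint, and region.
export S3_ACCESS_KEY_ID='<your-access-key-id>'
export S3_SECRET_ACCESS_KEY='<your-secret-access-key>'
export S3_ENDPOINT='<s3-endpoint>'
export S3_REGION='us-east-1'
export S3_URL_STYLE='vhost'
```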
### Using a non-AWS endpoint
To connect to an S3-compatible service (such as Cloudflare R2, MinIO, or Tigris), set `s3_endpoint` to the service's endpoint and provide credentials as usual:
```cypher
CALL s3_access_key_id='<your-key-id>';
CALL s3_secret_access_key='<your-secret>';
CALL s3_endpoint='<service-endpoint>';
CALL s3_region='auto';
```

A few notes:
- `s3_region` is required for request signing but is ignored by most non-AWS services. Any non-empty value works; `auto` is a common convention.
- `s3_url_style` defaults to `vhost` (virtual-hosted-style). Some services (for example, MinIO in its default configuration) require path-style URLs; set `s3_url_style='path'` in that case.
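As a minimal sketch, assuming a local MinIO server listening on `localhost:9000` with MinIO's default `minioadmin` credentials (the exact endpoint format may vary by deployment), the connection could look like:

```cypher
// Hypothetical local MinIO setup; adjust endpoint and credentials to your deployment.
CALL s3_endpoint='localhost:9000';
CALL s3_access_key_id='minioadmin';
CALL s3_secret_access_key='minioadmin';
CALL s3_region='auto';
CALL s3_url_style='path';
```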
## Scanning data from S3
The example below shows how to scan data from a Parquet file hosted on S3.
```cypher
LOAD FROM 's3://lbug-datasets/follows.parquet'
RETURN *;
```

## Glob data from S3
You can glob data from S3 just as you would from a local file system. Globbing is implemented using the S3 ListObjectsV2 API.
```cypher
CREATE NODE TABLE tableOfTypes (
    id INT64,
    int64Column INT64,
    doubleColumn DOUBLE,
    booleanColumn BOOLEAN,
    dateColumn DATE,
    stringColumn STRING,
    listOfInt64 INT64[],
    listOfString STRING[],
    listOfListOfInt64 INT64[][],
    structColumn STRUCT(ID int64, name STRING),
    PRIMARY KEY (id)
);
COPY tableOfTypes
FROM "s3://lbug-datasets/types/types_50k_*.parquet";
```

## Writing data to S3
Writing to S3 uses the AWS multipart upload API.
```cypher
COPY (
    MATCH (p:Location)
    RETURN p.*
)
TO 's3://lbug-datasets/saved/location.parquet';
```

## Additional configurations
### Requirements on the S3 server APIs
S3 offers a standard set of APIs for read and write operations. The httpfs extension uses these APIs to communicate with remote storage services and thus should also work
with other services that are compatible with the S3 API (such as Cloudflare R2 and Tigris).
The table below shows which parts of the S3 API are needed for each feature of the extension to work.
| Feature | Required S3 API |
|---|---|
| Public file reads | HTTP Range request |
| Private file reads | Secret key authentication |
| File glob | ListObjectsV2 |
| File writes | Multipart upload |
### Local cache
Scanning the same file multiple times can be slow and redundant. To avoid this, you can locally cache remote files to improve performance for repeated scans. See the local cache section for more details.