Did you know that you can use S3 to store data in Parquet files, wire up an external table in Redshift to point at those files, and then query S3 directly through that table using SQL? This is a great way to store and analyze large quantities of data on S3 (cheap) instead of on the SSD storage that comes with Redshift nodes (expensive).
However, it’s not super straightforward and it took me at least a day to work out all the kinks, but hopefully you come across this post within minutes of trying to find the solution.
Step 1: create an external schema
https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_SCHEMA.html
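The docs cover every possible variant, but a minimal version (this assumes you’re using the Glue data catalog; the schema, database, and role ARN below are made up, so swap in your own) looks something like:
CREATE EXTERNAL SCHEMA IF NOT EXISTS my_external_schema
FROM DATA CATALOG
DATABASE 'my_external_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-spectrum-role'
-- creates the Glue database if it doesn't exist yet
CREATE EXTERNAL DATABASE IF NOT EXISTS;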
Remember the IAM_ROLE you used here. This is what caused me a lot of grief. Details to follow.
Step 2: create an external table
https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html
There are always too many options and not enough examples in the AWS documentation, but all you need is:
PARTITIONED BY (<your partition, e.g. partition_date date>)
STORED AS PARQUET
LOCATION 's3://<your S3 bucket>/<your optional subdirectory>';
I don’t even think you need the partition, but if your data will grow to any sort of scale, I would highly recommend it.
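Putting those pieces together, a complete statement might look like this (the column names and bucket are made up; match them to whatever you’re unloading):
CREATE EXTERNAL TABLE my_external_schema.some_table (
    id     bigint,
    name   varchar(256),
    amount decimal(10,2)
)
PARTITIONED BY (partition_date date)
STORED AS PARQUET
LOCATION 's3://my-bucket/some_table/';
Note that the partition column only goes in PARTITIONED BY; don’t repeat it in the column list.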
Step 3: UNLOAD your data
This is where you will load your data to S3. The name of the command itself doesn’t make a lot of sense to me.
My UNLOAD statement looks something like:
UNLOAD ('SELECT * FROM some_table')
TO 's3://<your S3 bucket>/<your optional subdirectory>/'
IAM_ROLE '<the same IAM role you used to create the external schema!!!>'
FORMAT AS PARQUET
PARTITION BY (<your partition>) INCLUDE
;
The IAM_ROLE is super important here.
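Because of the PARTITION BY ... INCLUDE clause, UNLOAD writes the files into Hive-style partition folders under your prefix, which is what Step 4 relies on. With the made-up names from above, the layout ends up looking roughly like this (the exact file names will differ):
s3://my-bucket/some_table/partition_date=2022-10-13/0000_part_00.parquet
s3://my-bucket/some_table/partition_date=2022-10-13/0001_part_00.parquet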
Step 4: Add the new partition to your external table
ALTER TABLE <external schema name>.<external table name>
ADD IF NOT EXISTS PARTITION(<your partition>='2022-10-13')
LOCATION 's3://<your S3 bucket>/<your optional subdirectory>/<your partition>=2022-10-13/';
So this is what caused me a day of grief: even though your external schema was created with IAM_ROLE foo, you can add a partition to your table pointing to data in an S3 bucket administered by IAM_ROLE bar, and Redshift will not complain. But querying the table will result in the following super intuitive error:
error: Spectrum Scan Error: S3ServiceException:Access Denied,Status 403,Error AccessDenied,Rid V5TZ5J4DTPJ2Z3HC,ExtRid Yq6BJzE3/u1fZmgPIZzs8FUL0Fj9xQ7r4TocGxYuZ9f0xMyFUgyDd/me6S6vuhEWLf1ni3+HqZg=,CanRetry 1
The IAM_ROLE for the schema and the IAM_ROLE with access to the S3 bucket have to match in order for you to query your external table successfully.
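Once the roles line up, you can sanity-check that the partition is registered and the data is actually readable with something like this (using the made-up schema and table names from above):
-- confirm the partition was added
SELECT * FROM svv_external_partitions
WHERE schemaname = 'my_external_schema' AND tablename = 'some_table';
-- and actually scan the data in S3
SELECT partition_date, count(*)
FROM my_external_schema.some_table
GROUP BY partition_date;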
Cheers!