I recently took a role at a startup to build a solid data engineering foundation. For the past few years, the data engineering group has built out its event logging framework and implemented some critical pipelines in Airflow to pull in third-party data and build datasets that track high-level business objectives.

The data warehouse currently resides in Amazon Redshift, and I’m a total n00b here, so I will be cataloging the nuances that I discover over time as we build out a canonical data model.

The first one I’ve come across: although Amazon Redshift is based on PostgreSQL, it doesn’t support arrays! Crazy, right? Instead, the SUPER data type is the closest thing Redshift has to an array.
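
As a quick illustration of what that looks like, here’s a minimal sketch of a SUPER column holding an array (the table and column names are made up for this example):

create table super_example (
  tags super
);

-- json_parse converts JSON text into a SUPER value
insert into super_example (tags)
select json_parse('["foo", "bar"]');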

In PostgreSQL, say you want to split a comma-delimited string and unnest the values:

create table test_table (
  some_string varchar
);

insert into test_table (some_string) values ('foo,bar,foo2,bar2');

select some_string, t.x
from test_table
-- string_to_array splits on commas; unnest expands the array into one row per element
cross join unnest(string_to_array(some_string, ',')) as t(x);

|       some_string |    x |
|-------------------|------|
| foo,bar,foo2,bar2 |  foo |
| foo,bar,foo2,bar2 |  bar |
| foo,bar,foo2,bar2 | foo2 |
| foo,bar,foo2,bar2 | bar2 |

Amazon Redshift instead uses PartiQL syntax to iterate over SUPER arrays:

select a.some_string,
       b as token
from (
    select
        some_string,
        -- split_to_array defaults to a comma delimiter and returns a SUPER array
        split_to_array(some_string) tokens
    from test_table
) a,
-- PartiQL: referencing the array in the FROM clause unnests it, one row per element
a.tokens b;
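
One detail worth flagging: the unnested b values come back as SUPER, not varchar, so if a downstream step expects a plain string you may want an explicit cast. A minimal sketch of the same query with the cast added:

select a.some_string,
       b::varchar as token
from (
    select
        some_string,
        split_to_array(some_string) tokens
    from test_table
) a,
a.tokens b;

That should return the same four rows as the PostgreSQL output above.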