Typical options for incremental logic
- Timestamp-based incremental logic
- Version-based incremental logic
- MD5 uniqueness for each row
- Change Data Capture (CDC) using Talend
- Using AWS S3 as the file system with Redshift for computation
- Using AWS S3 as the file system with Talend as the processing engine
Timestamp-based incremental logic
- Use the lastModified timestamp column on the incoming SQL Server table to fetch records modified since the last run.
- Talend can track the last run timestamp; otherwise, query MAX(lastModified) from the Redshift target and use that as the cutoff.
- Example: SELECT * FROM source_table WHERE LastModified > @LastRunTime
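- If Talend does not hold the last run timestamp, a minimal sketch of the Redshift-side high-water-mark query (target_table here is a placeholder for the Redshift copy of the source table):
- SELECT MAX(lastModified) AS last_run_time FROM target_table;
- Substitute the returned value for @LastRunTime in the source query above.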
Version-based incremental logic
- Use a version or sequence column to fetch records with a higher version/sequence than the last one loaded.
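- A minimal sketch, assuming a hypothetical RowVersion column on the source and @LastMaxVersion holding the highest version already loaded:
- SELECT * FROM source_table WHERE RowVersion > @LastMaxVersion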
MD5 uniqueness for each row
- Calculate an MD5 hash for each row. On the first ingestion all data (with its hash) lands in Redshift; on later runs, compare the incoming MD5 values with the stored ones and update only the rows whose hash changed.
- Example (SQL Server side, adding a computed hash column): ALTER TABLE source_table ADD MD5Hash AS HASHBYTES('MD5', CONCAT(column1, column2, column3, ...))
- SELECT primary_key, MD5Hash FROM source_table WHERE LastModified > @LastRunTime
- On the Redshift side, something like the following MERGE performs the upsert (Redshift does not support PostgreSQL's ON CONFLICT syntax):
- MERGE INTO target_table USING staging_table s ON target_table.primary_key = s.primary_key WHEN MATCHED THEN UPDATE SET MD5Hash = s.MD5Hash, column1 = s.column1, column2 = s.column2, column3 = s.column3, ... WHEN NOT MATCHED THEN INSERT (primary_key, MD5Hash, column1, column2, column3, ...) VALUES (s.primary_key, s.MD5Hash, s.column1, s.column2, s.column3, ...);
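- To touch only rows whose content actually changed, the staged hashes can first be compared against those already stored in Redshift (a sketch reusing the staging_table/target_table names above):
- SELECT s.primary_key FROM staging_table s JOIN target_table t ON t.primary_key = s.primary_key WHERE t.MD5Hash <> s.MD5Hash;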
Change Data Capture (CDC)
- Utilize SQL Server's native CDC feature to identify changed records, or use Talend's CDC/SCD components (SCD Type 1 or Type 2); a minimal SQL Server example follows.
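- A minimal sketch of enabling native CDC on the SQL Server side (schema, table, and role names are placeholders):
- EXEC sys.sp_cdc_enable_db;
- EXEC sys.sp_cdc_enable_table @source_schema = N'dbo', @source_name = N'source_table', @role_name = NULL;
- Changed rows for a given LSN range can then be read through the generated cdc.fn_cdc_get_all_changes_dbo_source_table function.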
Using AWS S3 as the file system with Redshift for computation
- Using Amazon S3 files for ETL (Extract, Transform, Load) operations in Amazon Redshift is a common and efficient approach. S3 serves as a staging area where raw or transformed data is stored before loading it into Redshift. This process typically involves extracting data into S3, optionally transforming it, and then loading it into Redshift for analysis.
- Use the following COPY command to load the S3 file into a staging (temp) table:
- COPY staging_table FROM 's3://my-bucket/path/to/s3_file.csv' IAM_ROLE 'arn:aws:iam::account-id:role/MyRedshiftRole' CSV IGNOREHEADER 1 DELIMITER ',' TIMEFORMAT 'auto' TRUNCATECOLUMNS EMPTYASNULL DATEFORMAT 'auto';
- Once data is available in the staging table, insert the rows that are new to the target (table_a):
- INSERT INTO table_a (columns…) SELECT b.columns… FROM staging_table b LEFT JOIN table_a a ON a.primary_key = b.primary_key WHERE a.primary_key IS NULL;
- Update existing records
- UPDATE table_a SET column1 = b.column1, column2 = b.column2, ... FROM staging_table b WHERE table_a.primary_key = b.primary_key AND table_a.last_modified < b.last_modified;
- Here the compute happens entirely on Redshift.
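- If the COPY fails or loads fewer rows than expected, Redshift records the rejected rows in the stl_load_errors system table; a quick post-load check:
- SELECT query, filename, line_number, colname, err_reason FROM stl_load_errors ORDER BY starttime DESC LIMIT 10;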
Using AWS S3 as the file system with Talend as the processing engine
- Same S3 staging pattern as above: extract data to S3, then run the transformations on Talend remote engines, which insert or upsert the processed output into Redshift.
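- A sketch of the final upsert such a Talend job could issue (for example from a tRedshiftRow step) after bulk-loading its processed output into staging_table, assuming staging_table has the same column layout as table_a; this is the classic Redshift delete-then-insert pattern:
- BEGIN;
- DELETE FROM table_a USING staging_table b WHERE table_a.primary_key = b.primary_key;
- INSERT INTO table_a SELECT * FROM staging_table;
- COMMIT;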
- Using Amazon S3 files for ETL (Extract, Transform, Load) operations in Amazon Redshift is a common and efficient approach. S3 serves as a staging area where raw or transformed data is stored before loading it into Redshift. This process typically involves extracting data into S3, optionally transforming it, and then process data on talend remote engines and then upsert or insert.