Typical options for incremental logic
- Timestamp-based incremental logic
- Version-based incremental logic
- MD5 uniqueness for each row
- Change Data Capture (CDC) using Talend
- Using AWS S3 as the file system with Redshift for computation
- Using AWS S3 as the file system with Talend as the processing engine
Timestamp-based incremental logic
- Use the lastModified timestamp column on the incoming SQL Server table to fetch records modified since the last run.
- Talend can track the last run timestamp; otherwise, query MAX(lastModified) from the Redshift target and use that as the cutoff.
- Example: SELECT * FROM source_table WHERE LastModified > @LastRunTime
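- If Talend does not hold the last run timestamp, a minimal sketch of the Redshift-side high-water-mark query (target_table here is a placeholder for the Redshift copy of the source table):
- SELECT MAX(lastModified) AS last_run_time FROM target_table;
- Substitute the returned value for @LastRunTime in the source query above.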
Version-based incremental logic
- Use a version or sequence column to fetch records with a higher version/sequence than the last one loaded.
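- A minimal sketch, assuming a hypothetical RowVersion column on the source and @LastMaxVersion holding the highest version already loaded:
- SELECT * FROM source_table WHERE RowVersion > @LastMaxVersion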
MD5 uniqueness for each row
- Calculate an MD5 hash for each row. On the first ingestion all data (with its hash) lands in Redshift; on later runs, compare the incoming MD5 values with the stored ones and update only the rows whose hash changed.
- Example (SQL Server side, adding a computed hash column): ALTER TABLE source_table ADD MD5Hash AS HASHBYTES('MD5', CONCAT(column1, column2, column3, ...))
- SELECT primary_key, MD5Hash FROM source_table WHERE LastModified > @LastRunTime
- On the Redshift side, something like the following MERGE performs the upsert (Redshift does not support PostgreSQL's ON CONFLICT syntax):
- MERGE INTO target_table USING staging_table s ON target_table.primary_key = s.primary_key WHEN MATCHED THEN UPDATE SET MD5Hash = s.MD5Hash, column1 = s.column1, column2 = s.column2, column3 = s.column3, ... WHEN NOT MATCHED THEN INSERT (primary_key, MD5Hash, column1, column2, column3, ...) VALUES (s.primary_key, s.MD5Hash, s.column1, s.column2, s.column3, ...);
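- To touch only rows whose content actually changed, the staged hashes can first be compared against those already stored in Redshift (a sketch reusing the staging_table/target_table names above):
- SELECT s.primary_key FROM staging_table s JOIN target_table t ON t.primary_key = s.primary_key WHERE t.MD5Hash <> s.MD5Hash;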
Change Data Capture (CDC)
- Utilize SQL Server's native CDC feature to identify changed records, or use Talend's CDC/SCD components (SCD Type 1 or Type 2); a minimal SQL Server example follows.
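- A minimal sketch of enabling native CDC on the SQL Server side (schema, table, and role names are placeholders):
- EXEC sys.sp_cdc_enable_db;
- EXEC sys.sp_cdc_enable_table @source_schema = N'dbo', @source_name = N'source_table', @role_name = NULL;
- Changed rows for a given LSN range can then be read through the generated cdc.fn_cdc_get_all_changes_dbo_source_table function.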
Using AWS S3 as the file system with Redshift for computation
- Using Amazon S3 files for ETL (Extract, Transform, Load) operations in Amazon Redshift is a common and efficient approach. S3 serves as a staging area where raw or transformed data is stored before loading it into Redshift. This process typically involves extracting data into S3, optionally transforming it, and then loading it into Redshift for analysis.
- Use the following COPY command to load the S3 file into a staging (temp) table:
- COPY staging_table FROM 's3://my-bucket/path/to/s3_file.csv' IAM_ROLE 'arn:aws:iam::account-id:role/MyRedshiftRole' CSV IGNOREHEADER 1 DELIMITER ',' TIMEFORMAT 'auto' TRUNCATECOLUMNS EMPTYASNULL DATEFORMAT 'auto';
- Once data is available in the staging table, insert the rows that are new to the target (table_a):
- INSERT INTO table_a (columns…) SELECT b.columns… FROM staging_table b LEFT JOIN table_a a ON a.primary_key = b.primary_key WHERE a.primary_key IS NULL;
- Update existing records
- UPDATE table_a SET column1 = b.column1, column2 = b.column2, ... FROM staging_table b WHERE table_a.primary_key = b.primary_key AND table_a.last_modified < b.last_modified;
- Here the compute happens entirely on Redshift.
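- If the COPY fails or loads fewer rows than expected, Redshift records the rejected rows in the stl_load_errors system table; a quick post-load check:
- SELECT query, filename, line_number, colname, err_reason FROM stl_load_errors ORDER BY starttime DESC LIMIT 10;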
Using AWS S3 as the file system with Talend as the processing engine
- Same S3 staging pattern as above: extract data to S3, then run the transformations on Talend remote engines, which insert or upsert the processed output into Redshift.
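- A sketch of the final upsert such a Talend job could issue (for example from a tRedshiftRow step) after bulk-loading its processed output into staging_table, assuming staging_table has the same column layout as table_a; this is the classic Redshift delete-then-insert pattern:
- BEGIN;
- DELETE FROM table_a USING staging_table b WHERE table_a.primary_key = b.primary_key;
- INSERT INTO table_a SELECT * FROM staging_table;
- COMMIT;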
- Using Amazon S3 files for ETL (Extract, Transform, Load) operations in Amazon Redshift is a common and efficient approach. S3 serves as a staging area where raw or transformed data is stored before loading it into Redshift. This process typically involves extracting data into S3, optionally transforming it, and then process data on talend remote engines and then upsert or insert.