How to Move Data from Salesforce to AWS Redshift

As companies become more data-driven, there is a greater need to collect and analyze large volumes of data from on-premises and cloud applications. A popular IT request is to extract customer relationship management (CRM) data from Salesforce and load it into the Amazon Web Services (AWS) Redshift cloud data warehouse. There are different approaches for loading Salesforce data into Amazon Redshift, including extract/transform/load (ETL) and extract/load/transform (ELT) processes and bulk and real-time APIs. This article shares some important things to consider as you look for the best solution for your enterprise.

What is Salesforce? 

Delivered as a cloud-based software-as-a-service (SaaS), Salesforce is the leading CRM platform used by sales and marketing to manage interactions with customers and prospects. With Salesforce, you can store customer information, show marketing activity, track sales opportunities and record service issues. Providing a single view of all customer data, Salesforce is an extremely powerful resource that is often the “hub” of information for all your prospects and customers. As a cloud-based service, it is extremely easy for anyone, from any device, to access the latest information in Salesforce.

What is Amazon Redshift?

Amazon Redshift is a petabyte-scale, fully managed data warehouse service in the cloud. It runs on a collection of computing resources called nodes, which are organized into a group called a cluster. It is designed for building cloud data warehouses and performing extract/load/transform (ELT) use cases such as data migration and data consolidation, and it is part of the AWS cloud integration ecosystem. Much of Amazon Redshift’s appeal is its lower cost per terabyte to store and process data compared with traditional, on-premises enterprise data warehouses. Its high scalability and elasticity, combined with a pay-as-you-go pricing model, make it attractive for everything from startups and small projects to large enterprise deployments.

6 considerations for loading data from Salesforce into Amazon Redshift

Here are six things to think about as you look for the best Salesforce to Redshift connectivity option for your enterprise:

1. Do you know the data structure and APIs of each system?

  • You first need to understand the data structure or schema of each source and destination that you want to integrate. For example, Salesforce arranges data in objects such as Accounts, Contacts, and Leads, while relational databases and data warehouses like Redshift have defined tables and columns, each with its own data type (e.g. INTEGER, CHAR, DATE).
  • There are different methods and protocols to extract and load information, such as Bulk APIs (batch processing) or Real-time APIs (event-driven), each with their own level of performance. For one-time or daily updates, batch can work, but for real-time information sharing requirements, you may need a message queuing system.
  • On-premises and Software-as-a-Service (SaaS) applications and databases typically have standard APIs and REST APIs to access data in the system. Relational databases and data warehouses also support native and JDBC drivers for better performance in some scenarios.
  • You need to decide the format of the data sent between the two systems: CSV, XML and JSON are popular data formats.
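
To make the schema point concrete, here is a minimal Python sketch that turns a list of Salesforce field types into a Redshift CREATE TABLE statement. The type mapping, the table name salesforce_account, and the helper build_create_table are illustrative assumptions, not an official mapping. Column lengths and precision should be checked against your own Salesforce field metadata.

```python
# Hypothetical mapping from Salesforce field types to Redshift column types.
# The lengths and precision below are assumptions chosen for illustration.
SALESFORCE_TO_REDSHIFT = {
    "id": "CHAR(18)",
    "string": "VARCHAR(255)",
    "textarea": "VARCHAR(65535)",
    "boolean": "BOOLEAN",
    "int": "INTEGER",
    "double": "DOUBLE PRECISION",
    "currency": "DECIMAL(18, 2)",
    "date": "DATE",
    "datetime": "TIMESTAMP",
}


def build_create_table(table_name, fields):
    """Return a CREATE TABLE statement for a list of (name, sf_type) pairs."""
    columns = []
    for name, sf_type in fields:
        # Fall back to a wide VARCHAR for types missing from the mapping.
        redshift_type = SALESFORCE_TO_REDSHIFT.get(sf_type, "VARCHAR(65535)")
        columns.append(f"    {name} {redshift_type}")
    return f"CREATE TABLE {table_name} (\n" + ",\n".join(columns) + "\n);"


account_fields = [
    ("Id", "id"),
    ("Name", "string"),
    ("AnnualRevenue", "currency"),
    ("CreatedDate", "datetime"),
]
ddl = build_create_table("salesforce_account", account_fields)
print(ddl)
```

Keeping the mapping in one dictionary means a new Salesforce field type only needs one new entry, which helps when the source schema changes.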

2. What is the transformation complexity?

  • Moving data between systems often involves some type of transformation. There may be content validation or enrichment to ensure the data being imported has the proper syntax and semantics. Data sorting, deduplication, splitting and aggregation are other common tasks. When working with large (big data) volumes or complex data types like EDI or COBOL, your transformations can become processing-intensive, which also affects where you run your data processing.
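
As a small illustration of validation and deduplication, the following Python sketch drops contact records with malformed emails and keeps only the most recently modified record per Id. The record layout, the simple email pattern, and the function clean_contacts are assumptions made for this example. Real validation rules are usually stricter.

```python
import re

# Deliberately simple email check; an assumption for illustration only.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def clean_contacts(records):
    """Drop records with invalid emails, then keep the newest record per Id."""
    latest = {}
    for record in records:
        email = (record.get("Email") or "").strip().lower()
        if not EMAIL_PATTERN.match(email):
            continue  # validation: skip rows whose email is malformed
        candidate = {**record, "Email": email}
        current = latest.get(record["Id"])
        # ISO 8601 timestamps in the same format compare correctly as strings.
        if current is None or candidate["LastModifiedDate"] > current["LastModifiedDate"]:
            latest[record["Id"]] = candidate
    # Sort by Id so the load order is deterministic.
    return sorted(latest.values(), key=lambda r: r["Id"])


contacts = [
    {"Id": "003A", "Email": "Ann@Example.com ", "LastModifiedDate": "2024-01-01T10:00:00Z"},
    {"Id": "003A", "Email": "ann@example.com", "LastModifiedDate": "2024-02-01T10:00:00Z"},
    {"Id": "003B", "Email": "not-an-email", "LastModifiedDate": "2024-01-15T08:30:00Z"},
]
cleaned = clean_contacts(contacts)
print(cleaned)
```

With the sample data, only the February version of contact 003A survives: the January duplicate is superseded and 003B fails validation.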

3. Is this a one-off or recurring project, and how likely is change?

  • If it is a one-off project, it can be quicker to manually dump and load data between systems. However, change is inevitable: the schema of your database changes, Salesforce or Redshift releases new updates or APIs, and users want different data. Every change requires you to adjust your integrations.
  • If you need something that runs on a recurring basis (e.g. continuously in real time, or hourly, daily, or weekly), then you will need to investigate automating integration tasks like scheduling, management, and monitoring.
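
For recurring loads, one common pattern is incremental extraction: each run asks Salesforce only for records changed since the previous run. The sketch below builds such a SOQL query in Python. The function build_incremental_query and the idea of storing last_run between runs are assumptions for illustration. SystemModstamp is the standard Salesforce audit field used here as the watermark.

```python
from datetime import datetime, timezone


def build_incremental_query(object_name, fields, last_run):
    """Build a SOQL query for records modified since the previous run."""
    # SOQL datetime literals are written unquoted in ISO 8601 form.
    watermark = last_run.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return (
        f"SELECT {', '.join(fields)} FROM {object_name} "
        f"WHERE SystemModstamp > {watermark} ORDER BY SystemModstamp"
    )


# In a scheduled job, last_run would be read from wherever the previous
# run stored it; it is hard-coded here to keep the example self-contained.
last_run = datetime(2024, 3, 1, 6, 0, 0, tzinfo=timezone.utc)
query = build_incremental_query("Account", ["Id", "Name", "SystemModstamp"], last_run)
print(query)
```

A scheduler (cron, an orchestration tool, or an integration platform) would run this on the chosen interval and record the new watermark after each successful load.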

4. Where does the data processing occur?

  • Data processing should be designed for the best performance (meet SLAs) at the optimum cost. There are different cost and performance considerations whether doing ETL in the Cloud (e.g. AWS, Azure, Google Cloud Platform) or ELT with the data warehouse (Redshift).
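
To show what ELT inside the warehouse can look like, this Python sketch generates the SQL for loading a CSV extract from Amazon S3 into a staging table with Redshift's COPY command, then merging it into the target table with a delete-and-insert inside one transaction. The table names, S3 path, and IAM role ARN are placeholders, and the helper functions are hypothetical. The statements would still need to be executed through a Redshift connection.

```python
def build_copy_statement(table, s3_uri, iam_role_arn):
    """Return a Redshift COPY statement that loads a CSV file from S3."""
    return (
        f"COPY {table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "FORMAT AS CSV IGNOREHEADER 1;"
    )


def build_merge_statements(target, staging, key):
    """Return SQL that replaces matching rows in target with staged rows."""
    return [
        "BEGIN;",
        # Remove rows that are about to be replaced by newer staged versions.
        f"DELETE FROM {target} USING {staging} WHERE {target}.{key} = {staging}.{key};",
        f"INSERT INTO {target} SELECT * FROM {staging};",
        "COMMIT;",
    ]


# Placeholder values; substitute your own bucket, role, and table names.
copy_sql = build_copy_statement(
    "staging_account",
    "s3://example-bucket/salesforce/account.csv",
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
merge_sql = build_merge_statements("account", "staging_account", "id")
print(copy_sql)
print("\n".join(merge_sql))
```

In this approach the heavy lifting (the merge) runs on the Redshift cluster, so cost and performance depend on cluster size rather than on a separate ETL server.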

5. Will you need to connect to other systems?

  • Popular applications like Salesforce and Redshift are often integrated with several other applications, messaging systems, and data sources – they are not an island unto themselves. As the number of Cloud applications increases and data pipelines move from on-premises to cloud to multi-cloud, there is an increasing requirement to support more data sources, applications and messaging systems across Amazon Web Services, Google Cloud Platform, and Microsoft Azure. Designing for change shortens maintenance time.

6. Does the data pipeline have adequate security, privacy and data quality?

  • As data is moved from one system to another, it is important to make sure you can protect and trust the data. Sometimes data needs to be encrypted, masked or accessible by only a few users. Data should be checked for completeness and consistency so that it supports accurate analytics.
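
Masking and completeness checks can be sketched in a few lines of Python. Here, emails are replaced by a SHA-256 digest so records can still be matched without exposing the address, and a simple report counts missing values per required field. The function names and record layout are assumptions for illustration. A plain unsalted hash is only a basic form of masking and may not satisfy your privacy requirements.

```python
import hashlib


def mask_email(email):
    """Replace an email with a SHA-256 digest so it can still be joined on."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


def completeness_report(records, required_fields):
    """Count records that are missing a value for each required field."""
    return {
        field: sum(1 for r in records if r.get(field) in (None, ""))
        for field in required_fields
    }


leads = [
    {"Id": "00Q1", "Email": "ann@example.com"},
    {"Id": "00Q2", "Email": ""},
]
report = completeness_report(leads, ["Id", "Email"])
# Mask only the records that actually have an email value.
masked = [
    {**lead, "Email": mask_email(lead["Email"])} if lead["Email"] else lead
    for lead in leads
]
print(report)
```

A report like this can be used to fail or flag a load when too many required values are missing.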

Hand-coding vs Graphical ETL Tools 

“Hand-coding” ETL (or ELT) scripts can be a quick way to move data between Salesforce and Redshift. If the considerations above do not apply to you, or you can address them with hand-coding, it can be an easy and cheap solution.

On the other hand, there are easy-to-use, graphical integration tools like Talend Cloud that are purpose-built to move, transform, cleanse and protect data (real-time and batch) between systems and address all the considerations above, so you do not have to be a Spark, Java, Salesforce, Redshift, or data quality expert. Talend has 900 up-to-date, optimized connectors for all your data sources, plus management and monitoring tools to automate operations, providing peace of mind. Just drag and drop Salesforce and Redshift components onto the canvas, configure (not code) the parameters and transformations, push a button to deploy to the cloud, and you are done.

Ready to get started with Talend?