Using Talend to ingest real-time data for data analytics
Forward-thinking businesses understand the value and potential of streaming data analytics. The challenge is to break down existing data silos, integrate real-time and static data from diverse data points, and ensure the result is structured to enable self-service access by Business Intelligence (BI) tools.
Modernizing data integration capabilities is an excellent way to overcome this challenge. Traditional batch-processing ETL technologies do not have enough bandwidth to handle the high volume of real-time data that is generated and expected with zero latency. Here, latency refers to the time-lapse between a data-generating event and its arrival at the Data Warehouse (DW). With quick business decisions to be made and implemented, enterprises must get data-driven operational and analytical reports on time.
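To make the definition of latency concrete, here is a minimal sketch. The function name and the example timestamps are illustrative assumptions and are not taken from the Client's systems.

```python
from datetime import datetime, timezone


def latency_seconds(event_time: datetime, arrival_time: datetime) -> float:
    """Return the time-lapse, in seconds, between a data-generating event
    and the arrival of that data at the data warehouse."""
    if arrival_time < event_time:
        raise ValueError("arrival_time precedes event_time")
    return (arrival_time - event_time).total_seconds()


# Hypothetical example: an event occurs at 09:00:00 UTC and its record
# reaches the warehouse 2.5 seconds later.
event = datetime(2023, 1, 1, 9, 0, 0, tzinfo=timezone.utc)
arrival = datetime(2023, 1, 1, 9, 0, 2, 500000, tzinfo=timezone.utc)
print(latency_seconds(event, arrival))
```

A batch job that runs once a night would show a latency of hours under this measure, while a streaming pipeline aims to keep it within seconds.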
Streaming ETL technologies help deal with this by responding immediately to newly generated data. All business transaction data is streamed to ETL engines, where it is processed and transformed into analyzable formats for the BI tool.
This whitepaper examines how real-time and static data can be integrated, and how that data is subsequently cleansed and formatted into analyzable formats suitable for operational and analytical report generation.
A Synopsis Of The Client's Requirement
The Client is a leading insurance and home warranty provider in the USA. They required a process capable of:
- Continuously acquiring data from different data points
- Restructuring it to generate operational and analytical reports
- Doing both within a pre-approved budget
Challenges Associated With This Requirement
Data presentation is time-consuming, expensive, and difficult to achieve. Typically it accounts for about 60% to 80% of the total project cost. This cost is compounded by the fragility of ETL processes, which inflates costs and adds complexity throughout the project's lifetime. Yet this project had to be executed within a pre-decided budget.
The data to be collected was also generated continuously across fragmented and dispersed data sources. Consequently, it lacked semantic or structural consistency, compounding the challenges associated with the project.
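As an illustration of what restoring structural consistency can involve, the sketch below maps differently named source fields onto one canonical schema. The field names and aliases are hypothetical and were chosen only for the example. They do not describe the Client's actual data.

```python
# Hypothetical mapping from source-specific field names to canonical names.
FIELD_ALIASES = {
    "policy_id": "policy_id",
    "PolicyNo": "policy_id",
    "customer": "customer_name",
    "cust_name": "customer_name",
    "amount": "claim_amount",
    "claim_amt": "claim_amount",
}


def normalize(record: dict) -> dict:
    """Rename known fields, trim strings, and coerce the amount to float."""
    clean = {}
    for key, value in record.items():
        canonical = FIELD_ALIASES.get(key)
        if canonical is None:
            continue  # drop fields the canonical schema does not know
        if isinstance(value, str):
            value = value.strip()
        clean[canonical] = value
    if "claim_amount" in clean:
        # Accept both numeric values and strings such as "1,250.50".
        clean["claim_amount"] = float(str(clean["claim_amount"]).replace(",", ""))
    return clean
```

Two sources that describe the same claim as `{"PolicyNo": "P-1", "claim_amt": "1,250.50"}` and `{"policy_id": "P-1", "amount": 1250.5}` would both normalize to the same record, which is what a BI tool needs in order to aggregate them.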
Summarizing the challenges & complexity of this project:
- Implementing a secure Data Integration process to align the data and cohesively bring it together
- Quick data cleansing and formatting to enhance its usability by BI tools
- Demonstrating to the Client the power of the collected data
- Educating them to maximize the utility of formatted data by generating reports in diverse formats
- Budgetary constraints that necessitated the use of appropriate open-source tools capable of fulfilling the Client's reporting needs
- Maintaining data security and integrity across the different aspects of the project
A Mix of Batch and Streaming Data Architecture
Generally, ETL tools focus only on batch processing, in which scheduled jobs process and transform accumulated data using available computing resources, without user interaction. The advantages of batch processing include the following:
- Quick data processing of non-continuous data
- Improved job processing efficiency
- Round-the-clock processing
- Ideal for time-consuming, repetitive tasks
Alternatively, a streaming data architecture can consume data immediately upon its generation, transform it, and persist it to storage.
It is a framework consisting of software components proficient in ingesting and processing large volumes of data streaming from diverse sources like IoT, Cloud, and APIs.
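The consume, transform, and persist stages described above can be sketched with Python generators, which handle one event at a time instead of waiting for a full batch. This is a simplified illustration and not how Talend implements streaming. The event fields (`id`, `amount_cents`) and the in-memory list used as storage are assumptions made for the example.

```python
from typing import Iterable, Iterator


def consume(source: Iterable[dict]) -> Iterator[dict]:
    """Yield events one at a time, as soon as the source produces them."""
    for event in source:
        yield event


def transform(events: Iterable[dict]) -> Iterator[dict]:
    """Apply a per-event transformation without waiting for a full batch."""
    for event in events:
        yield {**event, "amount_usd": round(event["amount_cents"] / 100, 2)}


def persist(events: Iterable[dict], sink: list) -> int:
    """Append each transformed event to the sink and return the count."""
    count = 0
    for event in events:
        sink.append(event)
        count += 1
    return count


# Hypothetical usage: an in-memory list stands in for real storage.
storage: list = []
incoming = iter([{"id": 1, "amount_cents": 1999}, {"id": 2, "amount_cents": 500}])
written = persist(transform(consume(incoming)), storage)
```

Because every stage is lazy, the first event is written to storage before the second one is read from the source. The same property lets a streaming architecture cope with sources that never end.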
Some benefits of implementing a Streaming Data architecture include the following:
- The ability to deal with seemingly never-ending data streams
- Real-time processing to enable real-time data analytics
- Efficiency in detecting time-series data patterns
- Ease of data scalability
Being an insurance and home warranty provider, the Client needed to process large volumes of data, with and without any attached timestamps. This data had to be collected from diverse data points and used to generate various operational and analytical BI reports.
Thus, the Client's requirement could be serviced only by using a mix of batch processing and stream processing because:
Batch processing is ideal:
- For processing large volumes of data that come without a timestamp
- For processing data meant for in-depth data analysis
Stream processing is beneficial:
- When data processing requires speed and agility
- For generating critical insights in real time
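One way to picture this split is a small router that sends each record down one of the two paths. The rule used here, the presence of a `timestamp` field, is a deliberate simplification assumed for the example. A real pipeline would likely weigh other factors as well, such as how urgently a report needs the data.

```python
def route(records):
    """Split records into a stream queue and a batch buffer.

    Assumption for illustration: records that carry a timestamp go to the
    streaming path, and all others are held for batch processing.
    """
    stream_queue, batch_buffer = [], []
    for record in records:
        if record.get("timestamp") is not None:
            stream_queue.append(record)
        else:
            batch_buffer.append(record)
    return stream_queue, batch_buffer
```

The stream queue would feed the real-time reports, while the batch buffer would be processed on a schedule for in-depth analysis.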
The challenge was choosing an appropriate open-source ETL tool that supported both batch and stream processing.
Reasons For Choosing Talend
Sundew had two choices, Talend and Pentaho. Sundew chose to go with Talend because of its:
- Quick development and fast deployment
- Open scalable architecture
- Pre-built widgets for robust data integration
- Reduced data handling time
- Faster streaming data integration
- Ability to simplify complex streaming technologies
- Enhanced data accessibility
- High dependability
- Simple learning curve
Additionally, Talend offered a comprehensive, unified platform for data integrity, integration, and governance.
Talend also had a rich toolset and live-streaming capabilities. It could handle huge data volumes from multiple sources, efficiently clean and format the data, and keep updating it incrementally.
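Incremental updating can be pictured as an upsert: new rows are inserted and changed rows are merged into the existing ones, so the full data set never has to be reloaded. The sketch below is a generic illustration rather than Talend's own mechanism. The `id` key and the dictionary used as the target store are assumptions.

```python
def upsert(target: dict, changes, key: str = "id"):
    """Merge changed rows into target, keyed by `key`.

    Returns a tuple (inserted, updated) with the number of rows affected.
    """
    inserted = updated = 0
    for row in changes:
        row_key = row[key]
        if row_key in target:
            # Existing row: overwrite only the fields present in the change.
            target[row_key] = {**target[row_key], **row}
            updated += 1
        else:
            target[row_key] = dict(row)
            inserted += 1
    return inserted, updated
```

Only the rows that changed since the last run need to be passed in, which keeps each update small even when the target holds a large volume of data.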
The most significant advantage gained by selecting Talend as the ETL tool was its ability to invoke REST APIs to pull data from different transactional systems.
This capability made the data integration process flexible, scalable, and portable. Using REST APIs to GET data also helped maintain data integrity across the participating platforms.
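For readers unfamiliar with the pattern, pulling data over REST amounts to issuing a GET request to each transactional system and collecting the JSON responses. The sketch below uses only the Python standard library and is not Talend code. It assumes each endpoint returns a JSON array of records, and any URLs passed to it would be placeholders.

```python
import json
from typing import Callable, Iterable
from urllib.request import Request, urlopen


def http_get_json(url: str, timeout: float = 10.0):
    """GET a URL and decode the JSON body using only the standard library."""
    request = Request(url, headers={"Accept": "application/json"}, method="GET")
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def pull_all(endpoints: Iterable[str],
             get: Callable[[str], list] = http_get_json) -> list:
    """Pull records from each system and tag every row with its origin."""
    records = []
    for url in endpoints:
        for row in get(url):
            # "_source" is a hypothetical lineage field added for traceability.
            records.append({**row, "_source": url})
    return records
```

Because GET requests only read data, the source systems are never modified by the ingestion step. Passing the fetch function as a parameter also makes it possible to test `pull_all` without network access.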
Talend orchestrated these tasks so that the system executed every function in the proper order, under logically correct conditions, and at the right time.
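The idea of running tasks in the proper order and under the right conditions can be sketched with a dependency graph. The example below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). It is a generic illustration and not a description of Talend's orchestration engine. The task names and the `should_run` condition are hypothetical.

```python
from graphlib import TopologicalSorter


def run_in_order(tasks, dependencies, should_run=lambda name: True):
    """Run tasks so that every prerequisite executes before its dependents.

    tasks: dict mapping a task name to a zero-argument callable.
    dependencies: dict mapping a task name to the set of names it depends on.
    should_run: predicate deciding whether a given task runs at all.
    Returns the list of task names that were executed, in order.
    """
    executed = []
    for name in TopologicalSorter(dependencies).static_order():
        if should_run(name):
            tasks[name]()
            executed.append(name)
    return executed
```

With `dependencies = {"transform": {"extract"}, "load": {"transform"}}`, the tasks run as extract, transform, load. Note that this sketch does not skip the dependents of a skipped task. A production orchestrator would need a rule for that case.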
Sundew's choice of Talend as a tool to create real-time streaming and batch-processing data pipelines was justified because it could:
- Aggregate and ingest data from different Online Transaction Processing (OLTP) systems using REST APIs
- Organize the structured and unstructured data using ETL tools
- Seamlessly transfer the organized data to the Data Warehouse for further analysis
- Meet budgetary constraints
- Handle a large volume of static and real-time data and convert them into insights
By using Talend, Sundew achieved three purposes:
- It modernized the Client's data integration, cleansing, and formatting processes to make them future-ready.
- It ensured the scalability of the entire process.
- It established a seamless flow of transformed data from Talend to Power BI, which made real-time reports tailored to individual requirements easier to generate and more relevant.