Saturday, July 2, 2022

How Paytm modernized their data pipeline using Amazon EMR

This post was co-written by Rajat Bhardwaj, Senior Technical Account Manager at AWS, and Kunal Upadhyay, General Manager at Paytm.

Paytm is India’s leading payment platform, pioneering the digital payment era in India with 130 million active users. Paytm operates multiple lines of business, including banking, digital payments, bill recharges, e-wallet, stocks, insurance, lending, and mobile gaming. At Paytm, the Central Data Platform team is responsible for turning disparate data from multiple business units into insights and actions for their executive management and merchants, who are small, medium, or large business entities accepting payments from the Paytm platforms.

The Data Platform team modernized their legacy data pipeline with AWS services. The data pipeline collects data from different sources and runs analytical jobs, generating approximately 250,000 reports per day, which are consumed by Paytm executives and merchants. The legacy data pipeline was set up on premises using a proprietary solution and didn’t utilize open-source Hadoop stack components such as Spark or Hive. This legacy setup was resource-intensive, with high CPU and I/O requirements. Analytical jobs took approximately 8–10 hours to complete, which often led to Service Level Agreement (SLA) breaches. The legacy solution was also prone to outages due to higher-than-anticipated hardware resource consumption. Its hardware and software limitations impacted the ability of the system to scale during peak load. Data models used in the legacy setup processed the entire dataset every time, which led to increased processing time.

In this post, we demonstrate how the Paytm Central Data Platform team migrated their data pipeline to AWS and modernized it using Amazon EMR, Amazon Simple Storage Service (Amazon S3), and underlying AWS Cloud infrastructure together with Apache Spark. We optimized hardware utilization and reduced the analytical processing of data, resulting in a shorter turnaround time to generate insightful reports, all while maintaining operational stability and scale regardless of the size of daily ingested data.

Overview of solution

The key to modernizing a data pipeline is to adopt an optimal incremental approach, which helps reduce the end-to-end cycle to analyze the data and get meaningful insights from it. To achieve this state, it’s essential to ingest incremental data in the pipeline, process delta records, and reduce the analytical processing time. We configured the data sources to inherit the unchanged records and tuned the Spark jobs to only analyze the newly inserted or updated records. We used temporal data columns to store the incremental datasets until they are processed. Data-intensive Spark jobs are configured in an incremental, on-demand deduplicating mode to process the data. This helps eliminate redundant data tuples from the data lake and reduces the total data volume, which saves compute and storage capacity. We also optimized the scanning of raw tables to restrict the scans to only the changed record set, which reduced scanning time by approximately 90%. Incremental data processing also helps reduce the total processing time.
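As an illustrative sketch of this incremental approach (shown here in Python for brevity; Paytm’s production jobs are written in Scala on Spark), the logic below filters a batch down to records newer than the last processed watermark, then keeps only the latest version of each key. The record and field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Record:
    key: str          # business key, e.g., a transaction ID (assumed name)
    updated_at: int   # temporal column used to track increments
    payload: str

def incremental_dedup(records, watermark):
    """Keep only records newer than the last processed watermark,
    then retain the latest version per key (deduplication)."""
    fresh = [r for r in records if r.updated_at > watermark]
    latest = {}
    for r in fresh:
        if r.key not in latest or r.updated_at > latest[r.key].updated_at:
            latest[r.key] = r
    return list(latest.values())

# Example batch: txn-1 was updated twice; txn-2 was already processed.
batch = [
    Record("txn-1", 100, "v1"),
    Record("txn-1", 120, "v2"),
    Record("txn-2", 90,  "v1"),   # at or below the watermark, skipped
]
result = incremental_dedup(batch, watermark=95)
```

Only the newest version of `txn-1` survives; in the real pipeline the same filter-then-rank pattern runs as a distributed Spark job rather than over an in-memory list.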

At the time of this writing, the existing data pipeline has been operationally stable for 2 years. Although this modernization was essential, there was a risk of an operational outage while the changes were being implemented. Data skewing needed to be handled in the new system with an appropriate scaling strategy. Zero downtime was expected by the stakeholders, because the reports generated from this system are essential for Paytm’s CXOs, executive management, and merchants every day.

The following diagram illustrates the data pipeline architecture.

Benefits of the solution

The Paytm Central Data Office team, comprised of 10 engineers, worked with the AWS team to modernize the data pipeline. The team worked for approximately 4 months to complete this modernization and migration project.

Modernizing the data pipeline with Amazon EMR 6.3 helped efficiently scale the system at a lower cost. Amazon EMR managed scaling helped reduce the scale-in and scale-out time and increase the utilization of Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances for running the Spark jobs. Paytm is now able to utilize a Spot to On-Demand ratio of 80:20, resulting in higher cost savings. Amazon EMR managed scaling also helped automatically scale the EMR cluster based on YARN memory utilization with the desired types of EC2 instances. This approach eliminates the need to configure multiple Amazon EMR scaling policies tied to specific types of EC2 instances per the compute requirements for running the Spark jobs.
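A managed scaling policy of this kind can be attached to a cluster with a single AWS CLI call. The following is a minimal sketch, not Paytm’s actual configuration: the cluster ID and capacity numbers are placeholders, with the on-demand cap set to roughly 20% of maximum capacity to mirror the 80:20 Spot to On-Demand ratio mentioned above:

```shell
# Attach an EMR managed scaling policy (placeholder cluster ID and limits).
# MaximumOnDemandCapacityUnits caps On-Demand usage at 40 of 200 units,
# so the remaining capacity scales out on Spot Instances.
aws emr put-managed-scaling-policy \
  --cluster-id j-XXXXXXXXXXXXX \
  --managed-scaling-policy '{
    "ComputeLimits": {
      "UnitType": "InstanceFleetUnits",
      "MinimumCapacityUnits": 20,
      "MaximumCapacityUnits": 200,
      "MaximumOnDemandCapacityUnits": 40
    }
  }'
```

With this in place, EMR resizes the cluster between the minimum and maximum limits based on workload metrics such as YARN memory utilization, with no per-instance-type scaling rules to maintain.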

In the following sections, we walk through the key tasks to modernize the data pipeline.

Migrate over 400 TB of data from the legacy storage to Amazon S3

The Paytm team built a proprietary data migration application with the open-source AWS SDK for Java for Amazon S3, using the Scala programming language. This application can connect with multiple cloud providers and on-premises data centers and migrate the data to a central data lake built on Amazon S3.

Modernize the transformation jobs for over 40 data flows

Data flows are defined in the system for ingesting raw data, preprocessing the data, and aggregating the data that is used by the analytical jobs for report generation. Data flows are developed using the Scala programming language on Apache Spark. We use the Azkaban batch workflow job scheduler for ordering and tracking the Spark job runs. Workflows are created on Amazon EMR to schedule these Spark jobs multiple times during a day. We also implemented Spark optimizations to improve the operational efficiency of these jobs. We use Spark broadcast joins to handle the data skewness, which can otherwise lead to data spillage, resulting in extra storage needs. We also tuned the Spark jobs to avoid a large number of small files, which is a known problem with Spark if not handled effectively. This is mainly because Spark is a parallel processing system, and data loading is done through multiple tasks where each task can load into multiple partitions. Data-intensive jobs are run using Spark stages.
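One simple way to reason about the small-files tuning is to derive the coalesce() partition count from the expected output size, so that each output file lands near a target size. The sketch below shows the arithmetic in Python (the 128 MB target is an assumption, not a figure from Paytm); in the actual Scala jobs this count would be passed to coalesce() before the write, and skewed joins would use a broadcast() hint on the smaller side:

```python
import math

def target_partitions(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Choose a coalesce() partition count so each output file is close
    to the target size, avoiding a large number of small files."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

# e.g., ~10 GiB of output at ~128 MiB per file
n = target_partitions(10 * 1024**3)
```

The same calculation can be driven from the size of the input stage at runtime, so the file count tracks the daily data volume instead of being hardcoded.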

The following is a sample Azkaban flow definition (YAML), which declares the dependencies between the scheduled jobs:

nodes:
  - name: jobC
    type: noop
    # jobC depends on jobA and jobB
    dependsOn:
      - jobA
      - jobB

  - name: jobA
    type: command
    config:
      command: echo "This is an echoed text."

  - name: jobB
    type: command
    config:
      command: pwd

Validate the data

Accuracy of the data reports is essential for the modern data pipeline. The modernized pipeline has additional data reconciliation steps to improve the correctness of data across the platform. This is achieved by having greater programmatic control over the processed data. In the legacy pipeline, we could only reconcile data after the entire data processing was complete. However, the modern data pipeline enables all the transactions to be reconciled at every step of the transaction, which gives granular control for data validation. It also helps isolate the cause of any data processing errors. Automated tests were run before go-live to compare the data records generated by the legacy vs. the modern system to ensure data sanity. These steps helped ensure the overall sanity of the data processed by the new system. Deduplication of data is done frequently via on-demand queries to eliminate redundant data, thereby reducing the processing time. For example, if there are transactions that are already consumed by the end consumers but are still part of the dataset, they can be eliminated by the deduplication, resulting in processing of only the newer transactions for end-consumer consumption.
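The step-level reconciliation can be sketched as a totals check between the input and output of each transformation: if the totals diverge, that step is flagged, which is what isolates the cause of a processing error. The following Python fragment is illustrative only (the field name `amount` and the record shapes are assumptions, not Paytm’s schema):

```python
def reconcile(stage_in, stage_out, field="amount"):
    """Step-level reconciliation: the monetary total must be preserved
    across a transformation step; otherwise the step is flagged."""
    total_in = sum(r[field] for r in stage_in)
    total_out = sum(r[field] for r in stage_out)
    return {"in": total_in, "out": total_out, "matches": total_in == total_out}

# An aggregation step that merges two transactions into one rollup record:
raw = [{"amount": 100}, {"amount": 250}]
rollup = [{"amount": 350}]
check = reconcile(raw, rollup)
```

Running a check like this after every step, rather than once at the end as in the legacy pipeline, narrows any discrepancy down to the single transformation that introduced it.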

The following sample query uses Spark SQL for on-demand deduplication of raw data at the reporting layer:

INSERT OVERWRITE TABLE <<table>>
SELECT col1, col2, col3, ..., coln
FROM (SELECT t.*,
             -- partition by the business key so one row is kept per key
             row_number() OVER (PARTITION BY <<key columns>> ORDER BY col) AS rn
      FROM <<table>> t
     ) t
WHERE rn = 1

What we achieved as part of the modernization

With the new data pipeline, we reduced the compute infrastructure to roughly a quarter of its original footprint, which helps save compute cost. The earlier legacy stack was running on over 6,000 virtual cores. Optimization techniques helped run the same system at an improved scale, with approximately 1,500 virtual cores. We were able to reduce the compute and storage capacity for 400 TB of data and 40 data flows after migrating to Amazon EMR. We also achieved Spark optimizations, which helped reduce the runtime of the jobs by 95% (from 8–10 hours to 20–30 minutes), CPU consumption by 95%, I/O by 98%, and overall computation time by 80%. The incremental data processing approach helped scale the system despite data skewness, which wasn’t the case with the legacy solution.


Conclusion

In this post, we showed how Paytm modernized their data lake and data pipeline using Amazon EMR, Amazon S3, underlying AWS Cloud infrastructure, and Apache Spark. The choice of these cloud and big data technologies helped address the challenges of operating a big data pipeline, because the type and volume of data from disparate sources adds complexity to the analytical processing.

By partnering with AWS, the Paytm Central Data Platform team created a modern data pipeline in a short period of time. It provides reduced data analytical times with astute scaling capabilities, generating high-quality reports for the executive management and merchants on a daily basis.

As next steps, do a deep dive bifurcating the data collection and data processing stages for your own data pipeline system. Each stage of the data pipeline should be appropriately designed and scaled to reduce the processing time while maintaining the integrity of the reports generated as output.

If you have feedback about this post, submit comments in the Comments section below.

About the Authors

Rajat Bhardwaj is a Senior Technical Account Manager with Amazon Web Services based in India, with 23 years of work experience across multiple roles in software development, telecom, and cloud technologies. He works with AWS Enterprise customers, providing advocacy and strategic technical guidance to help them plan and build solutions using AWS services and best practices. Rajat is an avid runner, having completed several half and full marathons in recent years.

Kunal Upadhyay is a General Manager with the Paytm Central Data Platform team, based out of Bengaluru, India. Kunal has 16 years of experience in big data, distributed computing, and data intelligence. When not building software, Kunal enjoys travel, exploring the world, and spending time with friends and family.


