Like most large enterprises, PayPal operates many data centers, regions, and zones, each with its own security levels and rules for protecting data. This makes moving data difficult and complicates compliance with security and data protection requirements.
The Enterprise Data Platform (EDP) Data Movement Platform (DMP) team needed to build a fast, reliable data movement channel for its customers inside and outside of EDP, and that channel had to be fully InfoSec compliant and secure. We started building our next-generation data movement platform last year; this post focuses on the platform's security in particular.
At PayPal, security is our number one priority, and a key part of that is keeping your data safe.
In the past, teams moved data between zones using either DropZone, an in-house Secure-FTP-based platform, or Kafka, a self-service offering from PayPal's Kafka team. DropZone is a secure and reliable staging store for offline use cases, while Kafka is a fast data highway for more real-time use cases. Both have limits when moving large datasets, and both demand extra work from producers and consumers. Neither platform is well suited to batch data movement, and both struggle with burst data ingestion and require an intermediate data storage hop. Apache Pulsar tries to address this by exposing its storage layer interface for direct analytics.
How do we move data securely?
Moving data securely requires complying with the following InfoSec requirements:
All connections must be secured with TLS 1.2 (a minimal client-side sketch follows this list).
Connections may only be initiated from a higher security zone to a lower security zone, never the other way around.
Data at rest must be encrypted so it cannot be accessed without authorization and an audit trail.
All authorization must be handled through either IAM or Kerberos.
All secrets must be managed by KeyMaker, a key management service.
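As a rough illustration of the first requirement, here is a minimal Java sketch of forcing a client connection to negotiate TLS 1.2. The host name is hypothetical, and production code would also configure key stores, trust stores, and approved cipher suites; this is not the DMP's actual implementation.

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

public class Tls12Client {
  public static void main(String[] args) throws Exception {
    // Build an SSL context and socket factory using the JVM's default
    // key and trust material (illustrative only).
    SSLContext ctx = SSLContext.getInstance("TLSv1.2");
    ctx.init(null, null, null);
    SSLSocketFactory factory = ctx.getSocketFactory();

    // "secure-service.example.internal" is a hypothetical endpoint.
    try (SSLSocket socket =
        (SSLSocket) factory.createSocket("secure-service.example.internal", 443)) {
      // Restrict the handshake to TLS 1.2 explicitly.
      socket.setEnabledProtocols(new String[] {"TLSv1.2"});
      socket.startHandshake();
      System.out.println("Negotiated protocol: " + socket.getSession().getProtocol());
    }
  }
}
```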
How we incorporated these requirements into the DMP:
The DMP primarily uses Hadoop as its execution platform so that both compute and storage remain highly available, reliable, and resilient. This means both Apache Gobblin and Hadoop need to support all of the security measures above. Let's look at the deployment architecture at a high level.
High-level deployment of Hadoop clusters across multiple zones with a key distribution center (KDC), a key management service (KMS), and Apache Ranger.
Securing communication over HTTPS and sockets
Hadoop exposes both regular and secure socket addresses for each service to communicate with the others; with TLS 1.2 in place, those connections are encrypted. More details about secure ports can be found here.
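As a sketch of what secure endpoints look like in configuration terms, the snippet below sets the standard Hadoop properties that switch the web UIs to HTTPS only and protect RPC and block data transfer. The property names and values are assumptions based on stock Hadoop and may differ by version; the TLS details (certificates, protocols, cipher suites) normally live in ssl-server.xml and ssl-client.xml.

```java
import org.apache.hadoop.conf.Configuration;

public class SecureEndpointsConfig {
  public static Configuration secureConf() {
    Configuration conf = new Configuration();

    // Serve HDFS and YARN web endpoints over HTTPS only.
    conf.set("dfs.http.policy", "HTTPS_ONLY");
    conf.set("yarn.http.policy", "HTTPS_ONLY");

    // Encrypt Hadoop RPC traffic and DataNode block transfers.
    conf.set("hadoop.rpc.protection", "privacy");
    conf.set("dfs.data.transfer.protection", "privacy");

    return conf;
  }
}
```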
Securing authorization with Kerberos
Securing a group of Hadoop clusters is considerably more complicated than securing a single cluster and requires several architectural decisions. There are two options:
1. Establish trust between the KDC servers in the different zones so they honor each other's ticket-granting tickets (TGTs) and delegation tokens (DTs) under the same communication security standards.
2. Use a single Key Distribution Center (KDC) for the Hadoop clusters, keeping authentication and authorization in one place.
We chose the second option: rather than managing a separate KDC server per cluster, we manage a single central KDC that provides the required security services to all Hadoop clusters.
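With a central KDC, every client and service authenticates against the same realm. Below is a minimal sketch of how a Java client logs in to Hadoop with Kerberos; the principal and keytab path are hypothetical, and the KDC itself is resolved from krb5.conf.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLogin {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Tell Hadoop to use Kerberos; the realm and KDC host come from krb5.conf,
    // which points at the single central KDC in this design.
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);

    // Hypothetical service principal and keytab path, for illustration only.
    UserGroupInformation.loginUserFromKeytab(
        "dmp-service@EXAMPLE.REALM", "/etc/security/keytabs/dmp.service.keytab");

    System.out.println("Logged in as: " + UserGroupInformation.getLoginUser());
  }
}
```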
We run Apache Gobblin in YARN mode. At the time, Apache Gobblin had no built-in token management for a Hadoop environment with multiple remote clusters; GOBBLIN-1308 introduced this capability.
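The core idea behind multi-cluster token management is simple even though the Gobblin-on-YARN plumbing is not: before a job runs, collect delegation tokens from every remote cluster it will touch so that containers can authenticate without shipping keytabs around. The sketch below is a simplified illustration of that idea, not the GOBBLIN-1308 implementation; the cluster URIs are hypothetical.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.Credentials;

public class MultiClusterTokens {
  // Gather HDFS delegation tokens for each remote cluster a job touches.
  public static Credentials collectTokens(Configuration conf, String renewer,
                                          String... namenodeUris) throws Exception {
    Credentials creds = new Credentials();
    for (String uri : namenodeUris) {
      FileSystem fs = FileSystem.get(URI.create(uri), conf);
      fs.addDelegationTokens(renewer, creds);
    }
    return creds;
  }
}
```

The resulting Credentials object would then be handed to the YARN application (for example, through the container launch context) so tasks running in other zones can read and write securely.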
Securing data with Transparent Data Encryption (TDE)
Transparent Data Encryption (TDE) is a built-in data encryption feature of the Hadoop platform. Once TDE is enabled, the backend services encrypt and decrypt data without any changes to client code; with the right security configuration, it works seamlessly.
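To make "transparent" concrete, here is a minimal sketch that writes and then reads a file under a path assumed to already be an encryption zone (the /secure/zone path is hypothetical). The client uses the ordinary FileSystem API; encryption and decryption happen in the HDFS client and KMS behind the scenes.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class TdeTransparency {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // /secure/zone is assumed to be an existing TDE encryption zone.
    Path file = new Path("/secure/zone/example.txt");

    // Plain HDFS write: blocks are encrypted with keys fetched from the KMS.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello, encrypted world".getBytes(StandardCharsets.UTF_8));
    }

    // Plain HDFS read: decryption is just as transparent.
    try (FSDataInputStream in = fs.open(file)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}
```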
TDE also brings plenty of challenges, especially in a multi-cluster Hadoop environment where zones are separated by firewalls: it complicates token management, KMS configuration, WebHDFS client API calls, and more.
Managing tokens—delegation token management for authentication and authorization
TDE is typically enabled alongside Kerberos, which makes application token management even more involved: in addition to the usual tokens, TDE requires a KMS delegation token. The KMS is a cryptographic key management server that supplies the keys used to encrypt and decrypt data when reading from and writing to HDFS, the underlying storage. Apache Ranger policies can also be used to define the encryption zones where TDE applies.
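For applications that manage their own credentials, the KMS delegation token can be fetched explicitly. The sketch below assumes the Hadoop 2.x KeyProviderDelegationTokenExtension API and a hypothetical KMS URI; the DMP's real token plumbing inside Gobblin and YARN is more involved.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.crypto.key.KeyProvider;
import org.apache.hadoop.crypto.key.KeyProviderDelegationTokenExtension;
import org.apache.hadoop.crypto.key.KeyProviderFactory;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.token.Token;

public class KmsTokenFetcher {
  // Fetch a KMS delegation token so tasks can obtain encryption keys for
  // TDE reads and writes without carrying Kerberos credentials themselves.
  public static Credentials fetchKmsToken(Configuration conf, String kmsUri,
                                          String renewer) throws Exception {
    // kmsUri is hypothetical, e.g. "kms://https@kms.example.internal:16000/kms".
    KeyProvider provider = KeyProviderFactory.get(URI.create(kmsUri), conf);
    KeyProviderDelegationTokenExtension ext =
        KeyProviderDelegationTokenExtension.createKeyProviderDelegationTokenExtension(provider);

    Credentials creds = new Credentials();
    Token<?>[] tokens = ext.addDelegationTokens(renewer, creds);
    System.out.println("Obtained " + (tokens == null ? 0 : tokens.length) + " KMS token(s)");
    return creds;
  }
}
```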
Hadoop 2.7.x is known to have several bugs:
HADOOP-14441: The KMS delegation token does not reach out to all KMS servers, so a renewal request sent to a KMS server that is not the issuer fails. This is something the KMS service should resolve on its own.
HADOOP-14445: The KMS does not provide a high availability (HA) service, so tokens issued by one KMS server cannot be verified by another KMS instance in the same HA pool. Since the fix is only available in Hadoop versions newer than 2.8.4, we had to renew the KMS token against every KMS server and ignore the failed operations, assuming the right one is among the available KMS servers (see the sketch after this list).
HADOOP-15997 & HADOOP-16199: Token renewal does not always reach the correct issuer, so we had to add all issuers to the local Hadoop configuration to make it work. This stems from the design of the token class, which does not hold and use the correct issuer or UGI when token.renew() is called, as described in the JIRA tickets.
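Here is a rough sketch of the renewal workaround from the second item above: renew the token against each configured KMS endpoint and tolerate failures from the instances that did not issue it. It assumes the renewer resolves the KMS endpoint from the client-side key provider property, whose name varies by Hadoop version, so treat the property and URIs below as placeholders rather than the DMP's exact code.

```java
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.token.Token;

public class KmsTokenRenewer {
  // Work around missing KMS HA awareness: try every KMS endpoint and
  // ignore the ones that reject the renewal.
  public static void renewAgainstAllKms(Token<?> kmsToken, Configuration baseConf,
                                        List<String> kmsUris) {
    for (String uri : kmsUris) {
      try {
        Configuration conf = new Configuration(baseConf);
        // Hadoop 2.7.x clients read the provider from this property;
        // later versions use hadoop.security.key.provider.path instead.
        conf.set("dfs.encryption.key.provider.uri", uri);
        long expiry = kmsToken.renew(conf);
        System.out.println("Renewed against " + uri + "; new expiry " + expiry);
      } catch (Exception e) {
        // A non-issuer KMS instance may refuse the renewal; skip it.
        System.out.println("Renewal failed against " + uri + ": " + e.getMessage());
      }
    }
  }
}
```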
An alternative to the WebHDFS API: In this highly secure environment, we learned that using the WebHDFS API is not advisable, especially when both Kerberos and TDE are enabled on Hadoop 2.7.x. Hadoop 2.7.x has a design flaw that an HDFS superuser could exploit to access these encrypted zones, and a known WebHDFS issue can also hang the NameNode. Since these could lead to breaches or other problems, we built a replacement API that behaves like WebHDFS. We expect these issues to be addressed in Hadoop 3.x.
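PayPal's replacement API is internal and not shown here. As a rough illustration of the general direction, the sketch below accesses HDFS through the native RPC client (the hdfs:// scheme) instead of the webhdfs:// REST endpoint; host names and ports are hypothetical.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RpcInsteadOfWebHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Avoided: the WebHDFS REST path through the NameNode/DataNode HTTP layer.
    // FileSystem webhdfs = FileSystem.get(URI.create("webhdfs://nn.example.internal:50070"), conf);

    // Preferred here: the native RPC client, which avoids the WebHDFS issues noted above.
    FileSystem fs = FileSystem.get(URI.create("hdfs://nn.example.internal:8020"), conf);
    System.out.println(fs.exists(new Path("/secure/zone")));
  }
}
```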
Overall, this setup enables data movement across multiple Hadoop clusters in different zones in the most secure way possible. It forms the foundation of the batch DMP and supports a wide range of use cases for PayPal's internal customers.
The effort was well worth it for the insight we gained into the internals of Hadoop and network security. Kudos to the PayPal Hadoop SRE and InfoSec teams for their collaboration throughout this journey.