Hi Shows Le, we implemented the loader using PySpark for several reasons. One of them is standardization: we don't want to use multiple packages to perform data extraction. In particular, PySpark lets us extract data from different data engines by changing just a couple of lines of code, so switching from a MySQL engine to a MongoDB engine is easy. Furthermore, PySpark gives us some flexibility when dealing with large dataframes.
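
To make that concrete, here's a minimal sketch of what that swap looks like. The hostnames, credentials, and table/collection names are placeholders, and it assumes the MySQL JDBC driver and the MongoDB Spark connector (v10.x) are on the classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("loader").getOrCreate()

# Reading from MySQL via JDBC (connection details are placeholders).
mysql_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/sales")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("dbtable", "orders")
    .option("user", "loader")
    .option("password", "***")
    .load()
)

# Switching to MongoDB only changes the format and connection options:
mongo_df = (
    spark.read.format("mongodb")
    .option("spark.mongodb.read.connection.uri", "mongodb://db-host:27017")
    .option("spark.mongodb.read.database", "sales")
    .option("spark.mongodb.read.collection", "orders")
    .load()
)
```

Downstream code works on a Spark DataFrame either way, which is the whole point of the standardization.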

At this point, we haven't measured anything related to performance. But we experienced some problems at the beginning when using the MySQLdb package and Pandas; most of them were related to data types and nulls. Everything worked better after switching to PySpark.
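
A classic example of that kind of mismatch (a minimal sketch with made-up data, not our actual tables): a missing value in an integer column makes pandas silently promote the whole column to float, while Spark keeps the declared type and represents the gap as a proper null.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("types-demo").getOrCreate()

# pandas: one None in an integer column promotes it to float64 (42 -> 42.0).
pdf = pd.DataFrame({"order_id": [1, 2, 3], "qty": [42, None, 7]})
print(pdf["qty"].dtype)  # float64

# Spark: the column keeps its integer type and the missing value is a null.
sdf = spark.createDataFrame([(1, 42), (2, None), (3, 7)], "order_id INT, qty INT")
sdf.printSchema()  # qty: integer (nullable = true)
```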

On the other hand, I'm not really sure about your question on writing to AWS S3. Everything I've done so far has used Boto3. Can you elaborate on this particular question?
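
In case it helps, this is the kind of Boto3 write I mean (a minimal sketch; the bucket and key names are placeholders, and credentials are resolved from the environment or an IAM role):

```python
import boto3

s3 = boto3.client("s3")

# Upload an in-memory payload to S3; bucket/key are illustrative only.
payload = b"order_id,qty\n1,42\n"
s3.put_object(Bucket="my-data-bucket", Key="exports/orders.csv", Body=payload)
```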

Writing to learn! | LinkedIn profile: https://www.linkedin.com/in/ajhenaor | Buy me a coffee: https://www.buymeacoffee.com/ajhenaor
