Push-Down-Predicates in Parquet and how to use them to reduce IOPS while reading from S3

Working with datasets in pandas will almost inevitably bring you to the point where your dataset doesn’t fit into memory. Parquet is especially notorious for this: it compresses so well that the data tends to explode in size when read into a DataFrame. Today we’ll explore ways to limit and filter the data you read using push-down-predicates. Additionally, we’ll see how you can do that efficiently with data stored in S3 and why using pure pyarrow can be several orders of magnitude more I/O-efficient than the plain pandas version.

HIVE_CURSOR_ERROR in Athena when reading parquet files written by pandas

In a recent project, a colleague asked me to look at a HIVE_CURSOR_ERROR in Athena that they weren’t able to get rid of. Since the error message was not incredibly helpful and the way this error appeared is not that uncommon, I thought writing this

Sound of Silence - Lift your heavy Workloads to AWS Batch with Docker

Statistical Computing on Your Local Workstation: Recently, a customer told me about his problems running statistical computing workloads on his local workstations. First, you need to know that statistical computing languages like R and Python by defau

How ALIAS records can reduce initial load times for your website

DNS is a core component of the internet. In this post we’ll briefly take a look at how it works and what the difference between CNAME and ALIAS records in Amazon Route 53 is.

SQL Server Optimization on AWS

In December, as part of a consulting engagement, we tried out various SQL Server EBS volume combinations at a customer's site to achieve optimal performance. Summary: r4.4xlarge delivers almost double the performance compared to

Analyzing CloudWatch Costs

Amazon CloudWatch is a managed service for storing, visualizing and analyzing logs and metrics data of applications and AWS infrastructure. The service is simple to configure and use and is priced based on usage. Thus, adoption is generally both easy and
