Adventures in ETL: Building a Pipeline from the Ground Up

2:00pm to 3:00pm | Virtual, via Zoom

Researchers need data. Vendors have data. In an ideal world, researchers could use vendor data right out of the box. In reality, raw data is rarely delivered in a format that can be imported quickly and correctly into a database researchers can query, and huge datasets compound the problem. In this talk, Douglas King and Donna St. Louis of Wharton Research Programming will describe their ETL pipeline for cleaning and loading several terabytes of raw real estate data into a columnar datastore, using MariaDB ColumnStore to provide a familiar SQL interface backed by S3 storage.

For additional information, please contact dev-sig@lists.upenn.edu

Sponsored by Developer SIG