In an earlier blog post I discussed DynamoDB and the fact that it does not seem to be particularly well suited for storing time-based data. In part one of this series I discussed the overall architecture of applications based on microservices and the functionality I am trying to build. I considered using a very small MySQL-based RDS instance, which would translate directly from the existing code, but I felt that using a database would probably be more expensive than simply using S3. I frequently use the AWS Simple Monthly Calculator, and in this case it indicated that without any redundancy and with about 5 GB of storage, I'd spend about $15/month on the database. This compares to $0.15/month for 10 GB of S3 storage.

Given that, it was worthwhile to see if an S3-based solution could work. I decided to use a very simple scheme with a single bucket in which I'd store the logs, each log representing one month of data and using YYYYMM as its name (e.g. 201607). I knew that those logs would be between 700 and 800 kB, a size that can easily be manipulated as a single text object in Python. Since any data I need from the logs is always accessed by a specific day, it made sense to simply store the logs as CSV, with a timestamp as the first field, as follows:

event_timestamp,field1,field2,field3,…
20160701:00:03:00,data1,data2,data3,…
20160701:00:08:00,data1,data2,data3,…
…
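
Because every row begins with its timestamp, pulling a single day's worth of rows out of a month's log is straightforward. Here is a minimal sketch (the helper name and the YYYYMMDD day argument are mine, purely for illustration):

def rows_for_day(log, day):
    # Return the rows whose timestamp starts with the given day,
    # e.g. day='20160701'. The header row is skipped because it
    # does not start with a date. (Illustrative helper, not from the post.)
    return [line for line in log.splitlines() if line.startswith(day)]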

This has the added advantage that AWS provides a very convenient call to filter objects in S3, which I can then use to very quickly find a specific month of data. So, to get the log for the current month, the code looks similar to this:

#!/usr/bin/python
import boto3
from datetime import datetime

# The log for the current month is named YYYYMM, e.g. 201607.
key = datetime.now().strftime("%Y%m")
s3 = boto3.resource('s3')
bucket = s3.Bucket('the_bucket')
# List objects under the logs/ prefix and fetch the first (and only) match.
for o in bucket.objects.filter(Prefix='logs/' + key):
    content = o.get()
    log = content['Body'].read(content['ContentLength'])
    break
print(log)
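
As an aside, since the key is fully known up front, the same object could also be fetched directly instead of listing by prefix. A sketch of that alternative (note that get() raises a ClientError if the month's log does not exist yet, whereas the prefix filter above simply finds nothing):

# Direct fetch; assumes the month's log already exists.
obj = s3.Object('the_bucket', 'logs/' + key).get()
log = obj['Body'].read()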

In the interlude I discussed how to get the configuration data from S3, allowing us to fetch the CSV data from the facilities FTP server. Since a Lambda function has access to the internet, and assuming the FTP server is reachable (possibly over a VPN for privacy reasons), it is now relatively trivial to write a simple function that fetches the data from the FTP server, as follows (no batteries included; that is, error checking is left to the reader):

import io
from ftplib import FTP

def get_file_from_ftpserver(config):
    # Connection details come from the configuration stored in S3.
    server = config.get('ftp-server', 'server')
    username = config.get('ftp-server', 'username')
    password = config.get('ftp-server', 'password')
    filename = config.get('ftp-server', 'filename')
    # Download the file into an in-memory buffer.
    result = io.BytesIO()
    ftp = FTP(server)
    ftp.login(username, password)
    ftp.retrbinary('RETR ' + filename, result.write)
    return result
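
For completeness, here is a minimal sketch of how the config object used above might be loaded, following the interlude's approach of keeping the configuration in S3. The bucket and key names are placeholders, and I'm assuming an INI-style file, since the config.get('ftp-server', ...) calls match Python's ConfigParser API:

import io
import boto3
import ConfigParser

def get_config_from_s3(bucket_name, key):
    # Download the INI-style configuration file from S3 and parse it in memory.
    obj = boto3.resource('s3').Object(bucket_name, key).get()
    config = ConfigParser.ConfigParser()
    config.readfp(io.BytesIO(obj['Body'].read()))
    return config

# Placeholder bucket and key names, for illustration only.
config = get_config_from_s3('the_bucket', 'config/ftp.ini')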

Once I retrieve the data, I can re-format it into the format of the S3 log. I'll leave the re-formatting code out, since it's not very relevant to this blog post. However, it is worth noting that the data retrieved is a point-in-time snapshot and may duplicate data already retrieved. Since I don't want to store the same data point multiple times, I have to examine the content of the current S3 log to make sure the data is not already present. S3 does not allow in-place updates (any object has to be put back in its entirety), so this is not too much of a problem; I have to read the object before updating it anyway. After retrieving the log using the code above, I can simply check whether the log entry already exists and add it if it does not. Though the code below is rather crude (it checks every line rather than relying on the entries being sorted), it runs in memory and has the advantage of being impervious to an unsorted log. So the code looks something like this, with the new log entry in a Python dict called data (note that the actual code is different; this is only for illustration purposes):

import sys

# Check every line for the new entry's timestamp; bail out if it is already there.
for line in log.splitlines():
    if data['event_timestamp'] in line:
        print("data already exists")
        sys.exit()
# Append the new entry and write the whole log back to S3.
log += ",".join(data[x] for x in columns) + "\n"
bucket.put_object(Key='logs/' + key, Body=log)
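
One edge case worth noting: for the first entry of a new month, the prefix filter finds no existing object, so there is no log to append to. Since an S3 PUT simply creates the object when it is missing, the new month's log can be started from scratch with the header row. A sketch, assuming columns lists the field names in order with event_timestamp first, as the append code above implies:

# No existing log for this month: start a fresh one with a header row.
log = ",".join(columns) + "\n"
log += ",".join(data[x] for x in columns) + "\n"
bucket.put_object(Key='logs/' + key, Body=log)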

So, now I have code that can retrieve data from an FTP server, using configuration data stored in S3, and append it reliably to an object in S3, indexed by the month it was retrieved. In future blog posts, I will describe how to turn this code into a Lambda function and use AWS Events to run it at regular intervals.

If you have any comments, questions or other observations, please contact me directly via email: vmic@isc.upenn.edu.