Jafet Rubi


How I built a micro data lake on S3 using AWS, DuckDB, and Streamlit

Apr 10 2024

All of that, but on a tiny budget!


For my 24th birthday earlier this year, my girlfriend gifted me a BirdBuddy. I instantly fell in love, heading to my local bird store to buy bird seed, a pole, and the adapter for it. The really cool bit of all of this is that the BirdBuddy app produces a lot of data about your feeder and the birds it captures. So, naturally, I started to think about all the interesting things I could do with this data: I could build my own data lake, create visualizations and graphs, or maybe even deploy my own bird watching companion app with all of the above!

Now don’t get me wrong, the BirdBuddy app is amazing, intuitive, and really user friendly, but the way it is set up is kind of wonky. The app captures a sighting of a bird as an event, but doesn’t always save this event so that you can store it forever. “Sightings” are ephemeral and disappear after a few days if they aren’t saved into what BirdBuddy calls your “Collection”. That means if you don’t save all of your Sightings into your Collection as they occur, chances are you’ll lose that data. For obvious reasons, I wanted to store my sightings data somewhere permanent, which meant extracting it from the BirdBuddy API into my own data environment and then visualizing it in an application. That’s where the fun part starts.

BirdBuddy API and pybirdbuddy

The BirdBuddy app uses a GraphQL API to pass data around, which means we can use the same API to make GraphQL calls and extract the data. Huge kudos and thanks to jhansche on GitHub for pybirdbuddy, which provided a lot of code that made authenticating and interacting with the BirdBuddy API simple and easy. You can read more about the Python library in the GitHub repository.

My goal in using the API was to extract all sightings data and store it permanently so it wouldn’t be deleted, building a raw data layer in S3 full of bird sighting data and images. Here’s an example of what the API response for a sighting looks like:

"sightingReport": {
"reportToken": "string",
"sightings": [
    {
    "id": "65b43d02-e159-4ef1-a61a-b37a37b29ecd",
    "matchTokens": [
        "string"
    ],
    "__typename": "SightingRecognizedBirdUnlocked",
    "color": "PINK",
    "text": "Knock, knock! Looks like you've got a new first-time visitor!",
    "shareableMatchTokens": [
        "string"
    ],
    "species": {
        "id": "string",
        "iconUrl": "https://assets.cms-api-graphql.cms-api.prod.aws.mybirdbuddy.com/asset/icon/bird/a80b2a00-9a49-46ed-b5f1-5e7bde4b8267_house%20finch.svg",
        "name": "House Finch",
        "__typename": "SpeciesBird",
        "isUnofficialName": false,
        "mapUrl": "https://assets.cms-api-graphql.cms-api.prod.aws.mybirdbuddy.com/asset/map/bird/d8c20684-d366-4933-8ceb-4955ec9459a1_House%20Finch.svg"
    }
    }
],
"__typename": "SightingReport"
}

Once sighting data was received using the pybirdbuddy library, you would then have to post it to a Collection request in order to add it to your Collection in the app. For my use case, though, I only really cared about the sighting report, as it’s the most granular form of the data. Posting it to your Collection only returns less data than the sighting report itself provides, so I didn’t see a need to capture both sighting and collection data.
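
To make the raw layer concrete, here’s a minimal sketch of how a single sighting response could be landed in S3 untouched. The helper name and timestamped key are illustrative, and the bucket and prefix simply mirror the placeholder names used in the transform snippet later in the post:

import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def land_raw_sighting(feed_item: dict, bucket: str = "raw-bucket") -> str:
    # Write one sighting response (a dict shaped like the JSON above) as-is
    # to the raw layer; the key layout here is just an example.
    key = f"sightings/{datetime.now(timezone.utc):%Y-%m-%dT%H%M%S}.json"
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(feed_item).encode("utf-8"))
    return key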

AWS Extract, Load, and Transform Architecture

If you’ve worked in the cloud before, you know how easy it is to deploy services. The tricky part is getting the configuration and security right. I won’t go into that here, but 80 to 90 percent of my debugging time on this project had to do with permissions, configuration, and security in AWS.

Here’s a simplified visual of my AWS ELT architecture for this project:

[Image: Back end architecture]

Not shown in the visual (to keep it from getting redundant), but I actually have two S3 buckets: one that stores the data in its raw form (JSON), and one that holds the transformed data in parquet format. I use the transformed parquet data to feed DuckDB and do some pretty powerful analytics on a very tiny budget (free!). Both buckets are fed by the exact same services in the diagram, just with slightly different code. The pipeline in the diagram is scheduled to kick off once a day as a batch process.
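
If you’re wondering how a daily kickoff like that could be wired up, here’s one way to do it with an EventBridge schedule. This is just a sketch: the rule name and Lambda ARN are placeholders, and it assumes the Lambda already has a resource policy allowing EventBridge to invoke it.

import boto3

events = boto3.client("events")

# Create (or update) a rule that fires once a day.
events.put_rule(
    Name="birdbuddy-daily-extract",        # placeholder rule name
    ScheduleExpression="rate(1 day)",      # once-a-day batch kickoff
    State="ENABLED",
)

# Point the rule at the extract Lambda (placeholder ARN).
events.put_targets(
    Rule="birdbuddy-daily-extract",
    Targets=[{
        "Id": "extract-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:birdbuddy-extract",
    }],
)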

AWS Lambda and Elastic Container Registry

I used AWS Lambda and AWS Elastic Container Registry (ECR) as the compute for this project. I had to deploy my Lambdas as Docker images through ECR because the Python libraries the extraction depends on had grown past the size limit for layers in standard zip-based Lambdas; container images get a much higher limit. This works out well, since you can containerize your Lambda code, standardize the process, and test the code a bit more efficiently. ECR is also easy to use and has a free tier component.
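
To make the packaging a little more concrete, here’s a rough sketch of what a container-packaged Lambda entry point looks like. The module and handler names are placeholders, not my actual extraction code:

# extract.py - placeholder module baked into the Docker image pushed to ECR.
# With an AWS Lambda Python base image, the image's CMD points at "extract.handler".
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def handler(event, context):
    # Packaging as a container image doesn't change the handler contract:
    # Lambda still calls handler(event, context) on every invocation.
    logger.info("Received event: %s", json.dumps(event))
    # ... authenticate with pybirdbuddy, pull new sightings, hand them off ...
    return {"status": "ok"}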

AWS Simple Queue Service

I used AWS Simple Queue Service (SQS) to hold data in transit from Lambda to Lambda. This let me decouple my compute and set up dead-letter queues in case something went wrong. My extraction pipeline is relatively simple (powered by only two Lambdas), but I wanted to decouple them so I could run them asynchronously and hold the data in temporary storage before transforming it further and dumping it into S3. This works really well when you have a varying number of producers and consumers of data, but for a batch process like mine it might be overkill.
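
Concretely, the decoupling looks something like this: the extract Lambda sends each raw sighting to the queue, and the transform Lambda receives it through the standard SQS event shape. The queue URL and function names below are placeholders:

import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/birdbuddy-sightings"  # placeholder

def send_sighting(feed_item: dict) -> None:
    # Extract Lambda side: drop one raw sighting payload onto the queue.
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(feed_item))

def transform_handler(event, context):
    # Transform Lambda side: triggered by SQS, each invoke carries a batch of records.
    for record in event["Records"]:
        feed_item = json.loads(record["body"])
        # ... flatten to rows and write parquet to the reporting bucket ...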

AWS S3

If you haven’t heard of S3 already, I’d be very surprised. It’s a really awesome storage tool that can store pretty much anything. I use S3 to store both JSON data and images from my BirdBuddy. As mentioned earlier, I have two S3 buckets: one to store my raw data (both JSON and images), and one to store my clean data for analytics in parquet format. Not quite medallion architecture since I skipped over the silver layer, but for my simple use case, it works out perfectly and helps keep everything within the free tier!
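
For a sense of how the two buckets end up organized, here’s a quick sketch. The key layout is illustrative, and the listing just uses plain boto3:

# raw-bucket/                        reporting-bucket/
#   sightings/<id>.json                parquet file(s) written by the transform Lambda
#   images/<id>.jpg
import boto3

s3 = boto3.client("s3")
# Page through everything that has landed in the raw sightings prefix so far
for page in s3.get_paginator("list_objects_v2").paginate(Bucket="raw-bucket", Prefix="sightings/"):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])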

Front-end Architecture

For the front end of the application, I used Streamlit in a Docker container hosted on AWS EC2. Here’s a simplified visual of what that looks like for my application:

[Image: Front end architecture]

It’s hard to properly visualize the front end since Streamlit runs within Docker rather than as a sequential step in a pipeline. Also, DuckDB is used within Streamlit to run queries on the parquet files in S3, so it’s technically not its own service, but to make sure I’m on board the hype train, I included all of that in the title. 😁
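
To give a sense of how little glue code the Streamlit side needs, here’s a stripped-down sketch (not my actual app code; the bucket path and column names are the same illustrative ones used in the snippets below):

import duckdb
import pandas as pd
import streamlit as st

st.title("BirdBuddy sightings")

# Load the reporting-layer parquet data and let DuckDB query it in memory
df = pd.read_parquet("s3://reporting-bucket")
conn = duckdb.connect(database=":memory:")
conn.execute("CREATE TABLE sightings AS SELECT * FROM df")
visits = conn.execute(
    "SELECT name, COUNT(*) AS visits FROM sightings GROUP BY name ORDER BY visits DESC"
).df()

st.dataframe(visits)                    # sortable table of species counts
st.bar_chart(visits.set_index("name"))  # quick visual of who shows up most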

DuckDB

I use DuckDB as my OLAP engine instead of a conventional database or data warehouse. Basically, I use Pandas to append newly transformed data to my parquet files as it comes in, and then use DuckDB to query them.

Here’s a snippet of code from my transformation Lambda (the “T” in ELT) to give you an idea of how it’s done:

import json
import logging
from datetime import datetime as dt
import boto3
import pandas as pd

s3 = boto3.client('s3')
# `m` is the raw file's identifier and `df` an (initially empty) DataFrame,
# both set up earlier in the Lambda before this snippet runs.
data = s3.get_object(Bucket='raw-bucket', Key='sightings/'+f'{m}.json')
content = json.loads(data.get("Body").read())
for medias in content['medias']:  # ends up keeping the last media entry's timestamp
    ts = dt.strptime(medias['createdAt'], "%Y-%m-%dT%H:%M:%S.%fZ").strftime('%Y-%m-%d %H:%M:%S.%f')
if content['sightingReport']['sightings'][0]['__typename'].lower() != 'sightingcantdecidewhichbird':
    for sighting in content['sightingReport']['sightings']:
        new_row = {"name": sighting['species']['name'], "text": sighting['text'], "activityDate": ts, "source": f'{m}.json'}
        logging.info(f'Bird data: {new_row}')
        df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
    df.to_parquet(path='s3://reporting-bucket')

And here’s a snippet showing how I use DuckDB within the Streamlit application once the data has been transformed and dumped into my parquet files:

import duckdb
import pandas as pd

df = pd.read_parquet('s3://reporting-bucket')  # the reporting-layer parquet data
conn = duckdb.connect(database=':memory:', read_only=False)
conn.execute("CREATE TABLE sightings AS SELECT * FROM df")
# Per-species visit count and most recent visit, newest first
count = conn.execute("""SELECT name, MAX(activityDate) as lastVisited, COUNT(*) as count FROM sightings GROUP BY name ORDER BY lastVisited DESC""").df()

Streamlit’s visuals revolve a lot around Pandas dataframes, so using DuckDB to write SQL queries that produce dataframes is way easier and faster (in my opinion) than using the Pandas dataframe API to achieve the same thing. Using DuckDB also saves me from provisioning and maintaining a database in the cloud, since all of my data lives in parquet, making reads very fast and efficient.
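
For comparison, here’s roughly the same rollup (visit count and last visit per species) written against the Pandas API alone, using the same assumed columns as above:

import pandas as pd

df = pd.read_parquet("s3://reporting-bucket")
# Same result as the DuckDB query above: per-species visit count and last visit
count = (
    df.groupby("name")
      .agg(lastVisited=("activityDate", "max"), count=("name", "size"))
      .reset_index()
      .sort_values("lastVisited", ascending=False)
)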