Why REST APIs are not ideal for analytics
Have you ever thought that too much of a good thing can be a bad thing?
I have this thought a lot — with chocolate especially, but I rarely ever have this thought with data. In today’s world, the more data, the better. But recently I’ve been doing a lot with social media APIs, and most of these APIs are surprisingly difficult to use for analytical purposes.
If you’re not familiar with APIs, REST is the most prevalent style of web API today. Gone are the days when SOAP and other XML-heavy approaches ruled; REST is now king.
Now I’m no expert in APIs nor an API developer, so forgive my rudimentary explanation, but generally REST APIs work by accepting different HTTP methods; the most commonly used are GET and POST. You make these requests to predefined endpoints, sometimes passing parameters such as an offset or a limit. Each endpoint will almost always return a predefined response, too.
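To make that a little more concrete, here is a minimal sketch of how a GET request with offset and limit parameters might be put together in Python. The host name and the parameter names are assumptions for illustration only, and the request is built but never actually sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical host, used only for illustration.
BASE_URL = "https://api.example.com"


def build_get_request(endpoint, **params):
    """Build (but do not send) a GET request for an endpoint.

    Any keyword arguments become query-string parameters,
    e.g. offset=0, limit=100 (assumed parameter names).
    """
    query = urlencode(params)
    url = f"{BASE_URL}{endpoint}" + (f"?{query}" if query else "")
    return Request(url, method="GET")


req = build_get_request("/api/v2/orders", offset=0, limit=100)
print(req.get_method(), req.full_url)
# GET https://api.example.com/api/v2/orders?offset=0&limit=100
```

Notice that the caller can choose which page of results to ask for, but not which fields come back. That part is fixed by the endpoint.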
Take this made-up example of two endpoints: one for customers that always returns the customer’s name, email, and address, and one for orders that always returns the order details, the product details, and the ID of the customer who placed the order.
GET /api/v2/customers
Response:
{
  "customerId": 9,
  "firstName": "John",
  "lastName": "Doe",
  "email": "john.doe@me.com",
  "address": "123 Neighbor Rd, Made-up Land, USA 12345"
}
GET /api/v2/orders
Response:
{
  "orderId": 1,
  "customerId": 9,
  "product": "Chamois Butt'r",
  "productId": 1
}
What if you wanted only the customer’s email and the product ID for a specific order, without making two separate API calls and receiving irrelevant bytes of data? Well, you can’t really do that with most REST APIs. You would have to call both endpoints, parse out the fields you wanted, and discard the rest (get it?).
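Here is a rough sketch of that two-call dance in Python. The two fetch functions are hypothetical stand-ins that simply return the made-up payloads from this example instead of calling a real API.

```python
def get_order(order_id):
    """Stand-in for a GET on the orders endpoint (hypothetical helper)."""
    return {
        "orderId": order_id,
        "customerId": 9,
        "product": "Chamois Butt'r",
        "productId": 1,
    }


def get_customer(customer_id):
    """Stand-in for a GET on the customers endpoint (hypothetical helper)."""
    return {
        "customerId": customer_id,
        "firstName": "John",
        "lastName": "Doe",
        "email": "john.doe@me.com",
        "address": "123 Neighbor Rd, Made-up Land, USA 12345",
    }


def email_and_product_for_order(order_id):
    """Two calls, two full payloads, and only two fields kept."""
    order = get_order(order_id)                   # call 1
    customer = get_customer(order["customerId"])  # call 2
    return {"email": customer["email"], "productId": order["productId"]}


print(email_and_product_for_order(1))
# {'email': 'john.doe@me.com', 'productId': 1}
```

Nine fields come back across the two responses, and seven of them are thrown away.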
This is problematic when you are trying to use an API designed for transactions for analytical purposes, and even more so when rate limits and quotas apply. Imagine you have to pull 3 million customer records every day, but the endpoint returns so much data per record that you have to paginate to get through all of them. If one call returns only 10,000 customers at a time, you need at least 3,000,000 / 10,000 = 300 API calls to get the entire batch, and even when you do reach the end of the pagination, you might discard 90% of the data from each request! Too much waste, if you ask me.
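The arithmetic is simple enough to sketch in a few lines of Python. The totals come from the scenario above; the idea of paging by offset is an assumption, since real APIs may use cursors or page tokens instead.

```python
import math

TOTAL_CUSTOMERS = 3_000_000
PAGE_SIZE = 10_000


def calls_needed(total, page_size):
    """Minimum number of paginated calls needed to fetch every record."""
    return math.ceil(total / page_size)


def page_offsets(total, page_size):
    """Offsets to request, assuming offset-based pagination."""
    return range(0, total, page_size)


print(calls_needed(TOTAL_CUSTOMERS, PAGE_SIZE))  # 300
offsets = page_offsets(TOTAL_CUSTOMERS, PAGE_SIZE)
print(offsets[0], offsets[-1])  # 0 2990000
```

That is 300 calls per day against your quota before a single retry, and every one of them carries fields you never asked for.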
GraphQL, on the other hand, lets you state very specifically which fields you want in a single query against the API’s schema, rather than requesting from one endpoint at a time. So, if we wanted to grab the customer’s email and the product ID within an order, we could do so in one request instead of two. Here’s what the request might look like:
GraphQL query:
{
  orders {
    id
    products {
      id
    }
    customers {
      email
    }
  }
}
Response:
{
  "data": {
    "orders": [
      {
        "id": 1,
        "products": [
          { "id": 1 }
        ],
        "customers": [
          { "email": "john.doe@me.com" }
        ]
      }
    ]
  }
}
Much more efficient! You get all the data you need, without any of the extra fluff, in fewer API calls. You may need to do a bit more parsing of the nested response, but you will definitely save yourself tons of traversing through multiple endpoints and multiple pages to get the data you need.
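As for that extra parsing, here is a minimal sketch of what it might involve in Python. It assumes the common convention of sending the query as JSON in a POST body, and it assumes the response mirrors the shape of the query with each plural field returned as a list. The sample response is hypothetical and nothing is actually sent over the network.

```python
import json

QUERY = """
{
  orders {
    id
    products { id }
    customers { email }
  }
}
"""


def build_payload(query):
    """Wrap the query in the JSON body a GraphQL server typically expects."""
    return json.dumps({"query": query})


# Hypothetical response, shaped like the query (lists are an assumption).
SAMPLE_RESPONSE = {
    "data": {
        "orders": [
            {
                "id": 1,
                "products": [{"id": 1}],
                "customers": [{"email": "john.doe@me.com"}],
            }
        ]
    }
}


def flatten_orders(response):
    """Turn the nested response into flat rows that are easier to analyze."""
    rows = []
    for order in response["data"]["orders"]:
        for product in order["products"]:
            for customer in order["customers"]:
                rows.append(
                    {
                        "orderId": order["id"],
                        "productId": product["id"],
                        "email": customer["email"],
                    }
                )
    return rows


print(flatten_orders(SAMPLE_RESPONSE))
# [{'orderId': 1, 'productId': 1, 'email': 'john.doe@me.com'}]
```

The nested loops are the price of getting everything in one response, but every field that arrives is one you asked for.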
Now what can be done to get a social media company to develop a GraphQL API rather than a REST API? Little to nothing, but if every API were GraphQL-like, the world would be a much better place… and you would have to write fewer requests.