this is a subsequent articles after i wrote the previous one in medium
Before we jumped into N+1 problem in Django object relational mapping (ORM), let’s talk a bit about N+1 problem. The N+1 problem was anti-pattern performance when some applications (let’s say a web blog) make a request to access the database using N+1 calls, where N is defined for how many a numbers / objects when fetching a data from database and it can be growth depending with your database request. In ORM theorem, the problem occurs when it comes to accessing the data between parent-child relationship, or you can say it as “one-to-many relationship”. For several cases, this problem isn’t necessarily, but under some circumstances, it can be growth into worsen situation and lead your application into slower performance. For example, if you have a blog and inside the blog there’s a comment and article, assuming you have 5 articles and there’s a 10 comments in each article. So, if we want to get information about 5 articles, the database will process 1 query to fetch all articles and additional 10 queries to get all comments in each article. This problem called N+1.
In plain implementation in SQL, if you want to get all articles like the example above, you can iterate through like this :
SELECT * FROM Article
And then executing for each query to get all comments in each article :
SELECT * FROM Comment WHERE article_id = ?
Isn’t all that bad? Not necessary, in certain case, it’s not bad at all cause its only perform about 11-16 queries, but it can be growth into worsen situation if you have iterate more than thousands request or you build some e-commerce with thousands of products, items and several descriptions.
For brief explanation about thousand request, assuming your database consume 1ms to return 1 query, so calculate it went issuing thousand request. There you go, right now you probably understand a little bit.
How To Notice the N+1?
Actually, the database will probably look normal even it has N+1 problem, and the CPU load may look normal either (based on my experience, could it be wrong tho). The easiest way is probably we could monitoring our database activities, but what if we have thousands of tables? You can do profiling to access your database according to the method of application / framework that you are using, or you can integrate your application with the others third-party. In this example, we will exemplify how to do database profiling and find the N+1 problem and optimize the database in Django. Before do that, we should understand about how queryset in django works.
As general programming practice, this goes without saying, find out what queries you are doing and what they are costing you. If you find it difficult to do queryset profiling on django, you can use help with django-debug-toolbar. Basically, you must understand too, that you might be optimizing for increase the speed, reduce the memory usage or depend from your requirements or situation. It is up to you to decide what your priorities are, where the balance must lie, and profile all of these as required since this will depend on your application and server. Database Profiling
Before we get started into finding N+1 problem, like i said before, we must understand about how queryset works in Django. Django queryset has a several criteria :
- Queryset doesn’t affect the database until you evaluate itself. It means, no database activity actually occurs until you do something to evaluate the queryset. Assuming you create a function related to your database model, but you don’t use constructor such as the
len()
function,repr()
function, or iterate over the functions that you create, your database performance is not affected. - Queryset are lazy. According to the django documentation the act of creating a queryset doesn’t involve any database activity. You can stack filters together all day long, and Django won’t actually run the query until the queryset is evaluated, this example same as with first criteria.
- Queryest contains a cache. When you created a newly queryset, it has a empty cache until the queryset is evaluate and django saves the query results in the queryset’s cache and returns the results.
To do database profiling in django, besides you use django-debug-toolbar, you can also check with raw SQL using connection, prompt into python shell and write some command like this :
Before prompting into python shell, make sure your
DEBUG
in settings is setTrue
connection.queries
includes all SQL statements something like insert, update, select, etc. Each time your app hits the database, the query will be recorded and it’s a list of dictionaries in order of query execution. Each dictionary has the following :
statement | function |
---|---|
sql | raw SQL statement |
time | how long SQL statement took to execute |
For another example, let say, we have Article model in our database, than we want to recorded how long to execute to get all articles in model Article, we can use it something like this :
Another example to counted a numbers of queries that has been hinted or recorded is using dispatch()
method. We can override dispatch()
and profiling it into our django views. By using the dispatch()
method, at least it is more “human-readable” because we only see the number of queries that are counted in our database through the terminal. Some example :
In here, we can already doing database profiling using raw SQL and at least we can see where there are N+1 problems, but it has a drawbacks this will be lead into troublesome and overhead because the information obtained is difficult for us to understand (cause it’s raw SQL) besides, it’s also difficult to monitoring when our application is already running on the server. For production only, personally i recommended that we can use middleware with the help of django-debug-toolbar as i told before, you can see the documentation directly here. Fix the Problem
If we have finished profiling and find N+1 problems in our database, we can jumping into to optimize our program code with help select_related()
and prefetch_related()
. select_related()
and prefetch_related()
is the default queryset API references from django. Basically we can drop down into :
select_related()
is helpful against foreign-key relationships or selecting additional related-object data when it executes its query. This is a performance booster which results in a single more complex query but means later use of foreign-key relationships won’t require database queries. If you finding difficult to understanding it, use thisselect_related()
if in your database model have a ForeignKey and One-To-One relationship. The plain example is like this :
-
if we want to get detailed about whose person it’s comment, and let’s say there are around 300 comments, if in django views we don’t use
select_related()
the number of queries that we get will be generated around 301 queries. You may ask wondering why we got 301 queries, like explanation in above, when we want to get information about who person is comment in Comment model and there’s around 300 comments are available, then ‘that’ 300 comments are N, and 1 is the value of Person model (regarding how many people that commented, because Person is parent and Comment is child so, that’s the N+1 problem) -
However, if we use
select_related()
like this, then the possibility will decrease to 1-2 queries :
prefetch_related()
on the other hand, does a separate lookup for each relationship, and does the joint them all. This allows it to prefetch like Many-To-Many and Many-To-One relationship, which can’t be done usingselect_related()
, in addition,prefetch_related()
also support ForeignKey and One-To-One like select_related. The example is the same as the example inselect_related()
, but with differences in the Comment model there are ManyToManyField. Just do something like this, then the possibility also decrease to 1-2 queries :
- When we using either
select_related()
andprefetch_related()
, instead of having to go to the database for fetchin each objects, it will find them in a prefetched queryset cache that was populated in a single query. Plain example, if we want to get information about Comment it will return all value, author and comment, whereas if usingselect_related()
, Comment object only returned author value because it has been prepopulated in previous query.
Conclusion
All of this explanation is very long and sometime very wide, but at least we can understand the N+1 problem in the database. Also, keep in mind, regarding solutions using select_related()
and prefetch_related()
is only a recommended idea (caveat approach), actually according to django documentation itself there is no general principle or elegant solution to improve and optimize our database if there is an anti-pattern problem like this kind. Although we prefer databases with high computational performance, it still allows other problems to emerge, because in one case, relationships such as child-parents and Many-To-Many that bring up N+1 can also benefit some processes (cause there are dependent and thus we cannot optimize it). Last but not least, the things to do we can only try to code a program efficiently and effectively, eliminate iteration in SQL rather we simplify or rewrite SQL, understand how ORM works and understand when to evaluate our queryset.