# SQL29: Calculate the Average Next-Day Retention Rate of Users


### SQL29: Calculate the Average Next-Day Retention Rate of Users

Difficulty: hard · Pass rate: 48.58% · Time limit: 1 second · Memory limit: 256M

### Description

Problem: the operations team wants to know the average probability that a user who practices questions on one day will come back and practice again the next day. Please write a query that returns this number.

According to the example, your query should return the following results:

### Example 1

Input:

``````sql
drop table if exists `user_profile`;
drop table if exists `question_practice_detail`;
drop table if exists `question_detail`;
CREATE TABLE `user_profile` (
`id` int NOT NULL,
`device_id` int NOT NULL,
`gender` varchar(14) NOT NULL,
`age` int,
`university` varchar(32) NOT NULL,
`gpa` float,
`active_days_within_30` int,
`question_cnt` int,
`answer_cnt` int
);
CREATE TABLE `question_practice_detail` (
`id` int NOT NULL,
`device_id` int NOT NULL,
`question_id` int NOT NULL,
`result` varchar(32) NOT NULL,
`date` date NOT NULL
);
CREATE TABLE `question_detail` (
`id` int NOT NULL,
`question_id` int NOT NULL,
`difficult_level` varchar(32) NOT NULL
);

INSERT INTO user_profile VALUES(1,2138,'male',21,'北京大学',3.4,7,2,12);
INSERT INTO user_profile VALUES(2,3214,'male',null,'复旦大学',4.0,15,5,25);
INSERT INTO user_profile VALUES(3,6543,'female',20,'北京大学',3.2,12,3,30);
INSERT INTO user_profile VALUES(4,2315,'female',23,'浙江大学',3.6,5,1,2);
INSERT INTO user_profile VALUES(5,5432,'male',25,'山东大学',3.8,20,15,70);
INSERT INTO user_profile VALUES(6,2131,'male',28,'山东大学',3.3,15,7,13);
INSERT INTO user_profile VALUES(7,4321,'male',28,'复旦大学',3.6,9,6,52);
INSERT INTO question_practice_detail VALUES(1,2138,111,'wrong','2021-05-03');
INSERT INTO question_practice_detail VALUES(2,3214,112,'wrong','2021-05-09');
INSERT INTO question_practice_detail VALUES(3,3214,113,'wrong','2021-06-15');
INSERT INTO question_practice_detail VALUES(4,6543,111,'right','2021-08-13');
INSERT INTO question_practice_detail VALUES(5,2315,115,'right','2021-08-13');
INSERT INTO question_practice_detail VALUES(6,2315,116,'right','2021-08-14');
INSERT INTO question_practice_detail VALUES(7,2315,117,'wrong','2021-08-15');
INSERT INTO question_practice_detail VALUES(8,3214,112,'wrong','2021-05-09');
INSERT INTO question_practice_detail VALUES(9,3214,113,'wrong','2021-08-15');
INSERT INTO question_practice_detail VALUES(10,6543,111,'right','2021-08-13');
INSERT INTO question_practice_detail VALUES(11,2315,115,'right','2021-08-13');
INSERT INTO question_practice_detail VALUES(12,2315,116,'right','2021-08-14');
INSERT INTO question_practice_detail VALUES(13,2315,117,'wrong','2021-08-15');
INSERT INTO question_practice_detail VALUES(14,3214,112,'wrong','2021-08-16');
INSERT INTO question_practice_detail VALUES(15,3214,113,'wrong','2021-08-18');
INSERT INTO question_practice_detail VALUES(16,6543,111,'right','2021-08-13');
INSERT INTO question_detail VALUES(1,111,'hard');
INSERT INTO question_detail VALUES(2,112,'medium');
INSERT INTO question_detail VALUES(3,113,'easy');
INSERT INTO question_detail VALUES(4,115,'easy');
INSERT INTO question_detail VALUES(5,116,'medium');
INSERT INTO question_detail VALUES(6,117,'easy');
``````


Output:

``````
0.3000
``````

### Solution

As a database beginner, I had no idea how to approach this problem at first — I never ran into this kind of question when studying databases at university — so I worked through the official solutions instead.

The official solution on niuke.com gives two approaches.

**Solution 1.** Every row in the table can be regarded as a "day one" practice record, so we need to construct a matching "day two" record. We can use a LEFT JOIN to attach the next-day record to each row, restricting the join so that the second date is exactly one day after the first and so that `device_id` identifies the same user:

``date_add(date1,interval 1 day)=date2``

In solution 1, we look at every row in the table independently and treat each one as a "day one" record — even when the same person practices on several consecutive days, each of those days is examined on its own. We then join the table to a deduplicated copy of itself, with the join condition `date_add(date1, interval 1 day) = date2`: add one day to `date1` and check whether it equals `date2` in the other copy. In effect, every practice date is shifted forward by one day and matched against the original practice records. Rows that find no match correspond to days with no practice on the following day, so `date2` stays NULL for them. Restricting the join on `device_id` ensures we only match records belonging to the same user. The idea, in short, is to check for each practice record whether that user also has a record on the following day.

The SQL script is written as follows:

``````sql
select count(date2) / count(date1) as avg_ret
from (
    select distinct
        qpd.device_id,
        qpd.date as date1,
        uniq_id_date.date as date2
    from question_practice_detail as qpd
    left join (
        select distinct device_id, date
        from question_practice_detail
    ) as uniq_id_date
        on qpd.device_id = uniq_id_date.device_id
        and date_add(qpd.date, interval 1 day) = uniq_id_date.date
) as id_last_next_date
``````

This script uses nested queries. The outermost query computes the ratio of non-NULL `date2` values to `date1` values in the inner result. The inner query left-joins the table to a deduplicated copy of itself (the `select distinct` subquery strips redundant rows first); since both sides come from the same table, aliases are used to tell them apart. The join condition requires the same `device_id` and a date in the right copy exactly one day after the date in the left copy, which pairs up a user's practice records on consecutive days. Crucially, a LEFT JOIN keeps left-side rows that have no match, with `date2` set to NULL. So `count(date1)` counts every practice record, while `count(date2)` counts only those records that have a next-day follow-up (COUNT ignores NULLs), and their ratio is exactly the retention rate we want.
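The logic above can be checked end to end with an in-memory SQLite database (a minimal sketch: SQLite has no `date_add`, so its own `date(d, '+1 day')` modifier stands in for it):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table question_practice_detail (device_id int, date text)")
# The (device_id, date) pairs from the sample inserts above.
rows = [(2138, "2021-05-03"), (3214, "2021-05-09"), (3214, "2021-06-15"),
        (6543, "2021-08-13"), (2315, "2021-08-13"), (2315, "2021-08-14"),
        (2315, "2021-08-15"), (3214, "2021-05-09"), (3214, "2021-08-15"),
        (6543, "2021-08-13"), (2315, "2021-08-13"), (2315, "2021-08-14"),
        (2315, "2021-08-15"), (3214, "2021-08-16"), (3214, "2021-08-18"),
        (6543, "2021-08-13")]
con.executemany("insert into question_practice_detail values (?, ?)", rows)

# Same structure as solution 1: self left join with the date shifted by one day.
avg_ret = con.execute("""
    select count(date2) * 1.0 / count(date1)
    from (
        select distinct qpd.device_id, qpd.date as date1, u.date as date2
        from question_practice_detail qpd
        left join (select distinct device_id, date from question_practice_detail) u
          on qpd.device_id = u.device_id
         and date(qpd.date, '+1 day') = u.date
    )
""").fetchone()[0]
print(avg_ret)  # 0.3, matching the expected output 0.3000
```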

A key question here is: **what exactly is the average next-day retention rate of users?** The problem statement describes it as coming back the next day after practicing on some day. The crucial point is that "the next day" is a relative notion: every practice day is examined independently, and for each one we ask whether that user also practiced on the following day. Suppose I practice three days in a row — the 10th, the 11th and the 12th. Then the 11th is "the next day" relative to the 10th, and the 12th is "the next day" relative to the 11th; we must not slip into thinking of the 12th as "day three of the 10th". Of these three days, two (the 10th and the 11th) have a next-day record and one (the 12th) does not, so my next-day retention over those three days is 2/3. In general, the rate is simply the fraction of all practice records that have a matching record on the following day.

``the average probability that a user who practices questions on one day comes back to practice the next day``
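The per-user arithmetic from the three-day example can be written down directly (a minimal sketch; `next_day_retention` is a hypothetical helper name, not part of the problem):

```python
from datetime import date, timedelta

def next_day_retention(days):
    """Fraction of distinct practice days followed by a practice day exactly one day later."""
    days = set(days)
    with_next = sum(1 for d in days if d + timedelta(days=1) in days)
    return with_next / len(days)

# Practicing on the 10th, 11th and 12th: the 10th and 11th have a next day, the 12th does not.
print(next_day_retention([date(2021, 8, 10), date(2021, 8, 11), date(2021, 8, 12)]))  # -> 2/3
```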

**Solution 2.** Use the LEAD window function to pair each record of a user with that same user's next record: first group by user, then sort each group in ascending date order, then pair each date with the following one (the last date in each group is paired with NULL), i.e.

- ``partition by device_id``
- ``order by date``
- ``lead(date) over (partition by device_id order by date)``

At first I didn't understand solution 2 at all, because I didn't know what the LEAD function does, so I looked it up. LEAD() is a window function that looks ahead a given number of rows from the current row and returns a value from that row; it is very useful for computing the difference between the current row and a following row in the same result set. Its syntax is:

``````sql
LEAD(expression [, offset [, default_value]]) OVER (
    PARTITION BY expr
    ORDER BY expr
)
``````

Here `expression` is the value the function returns: it is evaluated on the row that lies `offset` rows after the current row within the ordered partition. `offset` must be a non-negative integer; if it is zero, LEAD() evaluates `expression` on the current row itself, and if it is omitted it defaults to 1. When there is no following row (for example, the last row of a partition with offset 1), LEAD() returns `default_value`, or NULL if no `default_value` was specified. The PARTITION BY clause divides the rows of the result set into the partitions the function is applied to; if it is omitted, the whole result set is treated as a single partition. The ORDER BY clause determines the order of rows within each partition.
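The `offset` and `default_value` behavior can be seen in a toy example (a minimal sketch using SQLite, whose window functions follow the same semantics and are available since SQLite 3.25):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table t (grp text, v int);
    insert into t values ('a', 1), ('a', 2), ('a', 3), ('b', 10), ('b', 20);
""")
rows = con.execute("""
    select grp, v,
           lead(v) over (partition by grp order by v)        as next_v,   -- offset defaults to 1
           lead(v, 2, -1) over (partition by grp order by v) as v_plus_2  -- default_value is -1
    from t
    order by grp, v
""").fetchall()
for r in rows:
    print(r)
# ('a', 1, 2, 3)      two rows ahead of 1 within 'a' is 3
# ('a', 2, 3, -1)     no row two ahead: default_value -1
# ('a', 3, None, -1)  last row of partition: plain lead(v) gives NULL
# ('b', 10, 20, -1)
# ('b', 20, None, -1)
```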

``````sql
/*
For the demonstration we use the orders and customers tables from the sample database.
The following statement finds each customer's order date and next order date:
*/
SELECT
    customerName,
    orderDate,
    LEAD(orderDate) OVER (           /* look one row ahead */
        PARTITION BY customerNumber  /* one partition per customer */
        ORDER BY orderDate           /* sorted by order date */
    ) nextOrderDate                  /* aliased nextOrderDate */
FROM
    orders
INNER JOIN customers USING (customerNumber);
``````

We get a result like this:

``````+------------------------------------+------------+---------------+
| customerName                       | orderDate  | nextOrderDate |
+------------------------------------+------------+---------------+
| Atelier graphique                  | 2013-05-20 | 2014-09-27    |
| Atelier graphique                  | 2014-09-27 | 2014-11-25    |
| Atelier graphique                  | 2014-11-25 | NULL          |
| Signal Gift Stores                 | 2013-05-21 | 2014-08-06    |
| Signal Gift Stores                 | 2014-08-06 | 2014-11-29    |
| Signal Gift Stores                 | 2014-11-29 | NULL          |
| Australian Collectors, Co.         | 2013-04-29 | 2013-05-21    |
| Australian Collectors, Co.         | 2013-05-21 | 2014-02-20    |
| Australian Collectors, Co.         | 2014-02-20 | 2014-11-24    |
| Australian Collectors, Co.         | 2014-11-24 | 2014-11-29    |
| Australian Collectors, Co.         | 2014-11-29 | NULL          |
| La Rochelle Gifts                  | 2014-07-23 | 2014-10-29    |
| La Rochelle Gifts                  | 2014-10-29 | 2015-02-03    |
| La Rochelle Gifts                  | 2015-02-03 | 2015-05-31    |
| La Rochelle Gifts                  | 2015-05-31 | NULL          |
| Baane Mini Imports                 | 2013-01-29 | 2013-10-10    |
| Baane Mini Imports                 | 2013-10-10 | 2014-10-15    |
...
``````

Now that we understand LEAD(), the solution follows naturally: group by user, sort each group by date ascending, and pair each practice date with that user's next practice date. Then take the difference between the two dates. If the difference is exactly 1, the record has a next-day follow-up; otherwise (a larger gap, or NULL for the last record of a user) it does not. The probability we want is then simply the average of this 0/1 indicator over all records, i.e. the fraction of records whose date gap is 1.

``records with a date difference of 1 / all records``

SQL script:

``````sql
select avg(if(datediff(date2, date1) = 1, 1, 0)) as avg_ret  /* 1 if the gap is one day, else 0; then average */
from (
    select distinct
        device_id,
        date as date1,
        lead(date) over (partition by device_id order by date) as date2  /* next date within the same user's group */
    from (
        select distinct device_id, date  /* deduplicate first */
        from question_practice_detail
    ) as uniq_id_date
) as id_last_next_date
``````
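Solution 2 can be checked the same way (a minimal sketch: SQLite has no `datediff` or `if`, so a `julianday` difference and a CASE expression stand in for them):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table question_practice_detail (device_id int, date text)")
# The distinct (device_id, date) pairs from the sample data.
con.executemany(
    "insert into question_practice_detail values (?, ?)",
    [(2138, "2021-05-03"), (3214, "2021-05-09"), (3214, "2021-06-15"),
     (3214, "2021-08-15"), (3214, "2021-08-16"), (3214, "2021-08-18"),
     (6543, "2021-08-13"), (2315, "2021-08-13"), (2315, "2021-08-14"),
     (2315, "2021-08-15")])

# Same structure as solution 2: LEAD over (device_id, date), then average a 0/1 indicator.
avg_ret = con.execute("""
    select avg(case when julianday(date2) - julianday(date1) = 1 then 1 else 0 end)
    from (
        select device_id, date as date1,
               lead(date) over (partition by device_id order by date) as date2
        from (select distinct device_id, date from question_practice_detail)
    )
""").fetchone()[0]
print(avg_ret)  # 0.3, the same answer as solution 1
```

Note that a NULL `date2` (the last record of each user) makes the CASE fall through to 0, which is exactly the "no next day" case.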