[原创] 数据库求交集怎么做?([original] how to find intersection in database?)

场景

一个文件和标签的多对多关系表,求取Tag1,Tag2,Tag3下的文件有哪些?

id tag_id file_id
1 TAG_1 F1
2 TAG_2 F1
3 TAG_3 F1
4 TAG_1 F2
5 TAG_2 F2
6 TAG_3 F3
7 TAG_1 F3

例如,求取符合 TAG_1 和 TAG_2 下面的所有文件。通过观察可以发现是F1和F3符合要求。因为F3包含标签,TAG_1, TAG_2, TAG_3,F1包含TAG_1, TAG_2, TAG_3,F2包含TAG_2。因此F1,F3符合要求。

如何求交集?

求取交集的办法有很多,课本上讲过的求取交集的办法就是使用 intersect,然后求取交集。但是很多DBMS并不支持这种求取交集的办法,那该怎么办?下面细细分析一波。

求取交集的办法有很多,课本上讲过的求取交集的办法就是使用 intersect,然后求取交集。但是很多DBMS并不支持这种求取交集的办法,那该怎么办?下面细细分析一波。

直接使用intersect求取交集

按照直觉来看,求取TAG_1的文件:F1, F3。求取TAG_2的文件:F1, F2,F3。最后得到的结果求交集,得到文件ID结果F1, F3。

select file_id
from tag_label
where tag_id = 'TAG_1'
intersect
select file_id
from tag_label
where tag_id = 'TAG_1';

使用连接查询来代替intersect求交集

因为交集操作这个关键字并不是所有的DBMS都拥有的,所以,在MySQL可以使用连接查询的方式来达到求取交集的目的。

  • TAG_1和TAG_2连接操作
    SELECT TAG_1.`file_id` T1_F, TAG_2.`file_id` T2_F
    FROM (SELECT * FROM `tag_file` WHERE `tag_id` = ‘TAG_1’) TAG_1
    LEFT JOIN (SELECT * FROM `tag_file` WHERE `tag_id` = ‘TAG_2’) TAG_2
    ON TAG_1.`file_id` = TAG_2.`file_id`;

    T1_F
    T2_F

    F1
    F1

    F2
    F2

    F3
    NULL

T1_F T2_F
F1 F1
F2 F2
F3 NULL
  • TAG_1和TAG_2连接查询并且添加过滤操作
    SELECT TAG_1.`file_id`
    FROM (SELECT * FROM `tag_file` WHERE `tag_id` = ‘TAG_1’) TAG_1
    LEFT JOIN (SELECT * FROM `tag_file` WHERE `tag_id` = ‘TAG_2’) TAG_2
    ON TAG_1.`file_id` = TAG_2.`file_id`
    WHERE TAG_1.`file_id` IS NOT NULL AND TAG_2.`file_id` IS NOT NULL;

    file_id

    F1

    F2

file_id
F1
F2

使用连接查询是一件代价很大的查询操作,虽然可以这种方式可以查询,但是并不是最好的。第二点,如果有多个标签ID需要构建,那么就需要构造一个很长的SQL查询语句,这样是一件不优雅的查询操作。综上所述,就需要一种新的查询操作来解决这个问题。

使用分组来求取交集

因为这种求取交集的办法很多DBMS并不支持,所以这种方法并不可行,那么该怎么办啊?不如换一种思路,按照文件名(file_id)分组。然后根据标签的长度来判断文件是不是同时属于这几个标签,从而得到文件ID列表。多说无益,不如做几个实验。

  • 查看包含标签列表的文件有哪些?
    SELECT *
    FROM tag_file
    WHERE `tag_id` IN (‘TAG_1’, ‘TAG_2’);

    id
    tag_id
    file_id

    1
    TAG_1
    F1

    2
    TAG_2
    F1

    4
    TAG_1
    F2

    5
    TAG_2
    F2

    7
    TAG_1
    F3

    通过分析这些结果,发现,文件id一列的结果是,符合TAG_1或者符合TAG_2的结果,得到了并集的结果。但是想要求交集,又该如何进行?

id tag_id file_id
1 TAG_1 F1
2 TAG_2 F1
4 TAG_1 F2
5 TAG_2 F2
7 TAG_1 F3
  • 按照文件ID分组,查看tag的长度
    SELECT `file_id`, COUNT(`tag_id`) tag_count
    FROM tag_file
    WHERE `tag_id` IN (‘TAG_1’, ‘TAG_2’)
    GROUP BY `file_id`;

    file_id
    tag_count

    F1
    2

    F2
    2

    F3
    1

    根据 GROUP BY 的性质,可以知道,GROUP BY是按照查询到的结果集,进行分组,因此可以避免连接查询的性能损耗。再其次,可以发现一个规律,因为分组是按照搜索结果分组,所以 tag_count 的长度不可能超过,需要查询的标签列表的长度的。根据场景也就是,tag_count 的结果一定小于等于2,而且长度等于2的结果,一定就是符合所有标签的。因此可以通过这种方式来将交集求出。

file_id tag_count
F1 2
F2 2
F3 1
  • 求出符合条件的文件列表
    SELECT `file_id`
    FROM tag_file
    WHERE `tag_id` IN (‘TAG_1’, ‘TAG_2’)
    GROUP BY `file_id`
    HAVING COUNT(`tag_id`) = 2;

    file_id

    F1

    F2

    最终通过这种方式求出结果。

file_id
F1
F2

利用MyBatis构造一个动态SQL:

<select id="getFileIdListByTagIds" resultType="String" parameterType="java.lang.List">
    <bind name="tagListSize" value="tagList.size()" />
    SELECT `file_id`
    FROM tag_file
    WHERE `tag_id` IN
        <foreach collection="tagList" item="tagId" open="(" sparator="," close=")">
            #{tagId}
        </foreach>
    GROUP BY `file_id`
    HAVING COUNT(`tag_id`) = #{tagListSize};
</select>

方法评价

这种方式主要用于同表以内的交集查询,但是并不适用广泛的求交集操作。

————————

scene

What are the files under tag1, tag2 and tag3 in the many to many relationship table of a file and tag?

id tag_id file_id
1 TAG_1 F1
2 TAG_2 F1
3 TAG_3 F1
4 TAG_1 F2
5 TAG_2 F2
6 TAG_3 F3
7 TAG_1 F3

For example, find the compliance tag_ 1 and tag_ 2 all documents below. Through observation, it can be found that F1 and F3 meet the requirements. Because F3 contains tags, tag_ 1, TAG_ 2, TAG_ 3. F1 contains tag_ 1, TAG_ 2, TAG_ 3. F2 contains tag_ 2。 Therefore, F1 and F3 meet the requirements.

How to find intersection?

There are many ways to get the intersection. The way to get the intersection mentioned in the textbook is to use intersect and then get the intersection. However, many DBMS do not support this method of finding intersection. What should I do? Let’s analyze a wave in detail.

There are many ways to get the intersection. The way to get the intersection mentioned in the textbook is to use intersect and then get the intersection. However, many DBMS do not support this method of finding intersection. What should I do? Let’s analyze a wave in detail.

直接使用intersect求取交集

Intuitively, find tag_ 1 file: F1, F3. Get tag_ 2 files: F1, F2, F3. Finally, the intersection of the obtained results is obtained to obtain the file ID results F1 and F3.

select file_id
from tag_label
where tag_id = 'TAG_1'
intersect
select file_id
from tag_label
where tag_id = 'TAG_1';

Use join query instead of intersect to find intersection

Because the keyword intersection operation is not owned by all DBMS, connection query can be used in MySQL to obtain the intersection.

  • TAG_1和TAG_2连接操作
    SELECT TAG_1.`file_id` T1_F, TAG_2.`file_id` T2_F
    FROM (SELECT * FROM `tag_file` WHERE `tag_id` = ‘TAG_1’) TAG_1
    LEFT JOIN (SELECT * FROM `tag_file` WHERE `tag_id` = ‘TAG_2’) TAG_2
    ON TAG_1.`file_id` = TAG_2.`file_id`;

    T1_F
    T2_F

    F1
    F1

    F2
    F2

    F3
    NULL

T1_F T2_F
F1 F1
F2 F2
F3 NULL
  • TAG_1和TAG_2连接查询并且添加过滤操作
    SELECT TAG_1.`file_id`
    FROM (SELECT * FROM `tag_file` WHERE `tag_id` = ‘TAG_1’) TAG_1
    LEFT JOIN (SELECT * FROM `tag_file` WHERE `tag_id` = ‘TAG_2’) TAG_2
    ON TAG_1.`file_id` = TAG_2.`file_id`
    WHERE TAG_1.`file_id` IS NOT NULL AND TAG_2.`file_id` IS NOT NULL;

    file_id

    F1

    F2

file_id
F1
F2

Using join query is a costly query operation. Although it can be queried in this way, it is not the best. Second, if multiple tag IDS need to be built, a long SQL query statement needs to be constructed, which is an elegant query operation. To sum up, a new query operation is needed to solve this problem.

Use grouping to find intersections

Because this method of finding intersection is not supported by many DBMS, this method is not feasible. What should I do? It’s better to change the idea and group according to the file name (file_id). Then judge whether the file belongs to these tags at the same time according to the length of the tag, so as to obtain the file ID list. It’s useless to say more. It’s better to do a few experiments.

  • See what files contain a list of tags?
    SELECT *
    FROM tag_ file
    WHERE `tag_ id` IN (‘TAG_1’, ‘TAG_2’);
    id
    tag_ id
    file_ id
    one
    TAG_ one
    F1
    two
    TAG_ two
    F1
    four
    TAG_ one
    F2
    five
    TAG_ two
    F2
    seven
    TAG_ one
    F3
    By analyzing these results, it is found that the result in the file ID column is consistent with tag_ 1 or tag compliant_ 2, the result of union is obtained. But if you want to ask for intersection, how should you proceed?
id tag_id file_id
1 TAG_1 F1
2 TAG_2 F1
4 TAG_1 F2
5 TAG_2 F2
7 TAG_1 F3
  • Group by file ID and check the length of tag
    SELECT `file_ id`, COUNT(`tag_id`) tag_ count
    FROM tag_ file
    WHERE `tag_ id` IN (‘TAG_1’, ‘TAG_2’)
    GROUP BY `file_ id`;
    file_ id
    tag_ count
    F1
    two
    F2
    two
    F3
    one
    According to the nature of group by, you can know that group by is grouped according to the queried result set, so you can avoid the performance loss of connection query. Secondly, a rule can be found, because grouping is based on search results, so tag_ The length of count cannot exceed the length of the tag list to be queried. According to the scenario, that is, tag_ The result of count must be less than or equal to 2, and the result with length equal to 2 must meet all labels. Therefore, the intersection can be obtained in this way.
file_id tag_count
F1 2
F2 2
F3 1
  • 求出符合条件的文件列表
    SELECT `file_id`
    FROM tag_file
    WHERE `tag_id` IN (‘TAG_1’, ‘TAG_2’)
    GROUP BY `file_id`
    HAVING COUNT(`tag_id`) = 2;

    file_id

    F1

    F2

    最终通过这种方式求出结果。

file_id
F1
F2

Construct a dynamic SQL using mybatis:

<select id="getFileIdListByTagIds" resultType="String" parameterType="java.lang.List">
    <bind name="tagListSize" value="tagList.size()" />
    SELECT `file_id`
    FROM tag_file
    WHERE `tag_id` IN
        <foreach collection="tagList" item="tagId" open="(" sparator="," close=")">
            #{tagId}
        </foreach>
    GROUP BY `file_id`
    HAVING COUNT(`tag_id`) = #{tagListSize};
</select>

Method evaluation

This method is mainly used for intersection query within the same table, but it is not suitable for a wide range of intersection operations.