The Mystery of SQLAlchemy Single-Entity Query Filtering

2017-11-17

Background
A strange result
Digging deeper
- Source code analysis
  ~ The filtered flag
  ~ The single_entity flag
  ~ The proc function
  ~ The util.unique_list function
Summary

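Background

Suppose I have two tables, A and B.
Table A looks like this:

id	content
1	test 1
2	test 2

Table B looks like this:

id	aid	content
1	1	test 1
2	1	test 2
3	2	test 3
4	1	test 4
5	2	test 5

And I have two models for them: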

class A(model):
    ...
class B(model):
    ...

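A strange result

When I joined tables A and B, something odd happened.

results = A.query.join(
    B, B.aid == A.id
).all()
# Surprisingly, len(results) is 2
total = A.query.join(
    B, B.aid == A.id
).count()
# But total is 5

Digging deeper

From the SQL point of view the join must return 5 rows, so the only way to explain the difference was to read the SQLAlchemy source code.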
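To make the SQL side concrete, here is a minimal sketch that rebuilds the two tables with the standard-library sqlite3 module and runs the equivalent join. The lowercase table names a and b and the ORDER BY clause are assumptions made for this illustration; the article does not show the real schema or the SQL that SQLAlchemy emits.

```python
import sqlite3

# In-memory database with the sample data from the tables above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, content TEXT);
    CREATE TABLE b (id INTEGER PRIMARY KEY, aid INTEGER, content TEXT);
    INSERT INTO a VALUES (1, 'test 1'), (2, 'test 2');
    INSERT INTO b VALUES (1, 1, 'test 1'), (2, 1, 'test 2'),
                         (3, 2, 'test 3'), (4, 1, 'test 4'),
                         (5, 2, 'test 5');
""")

# Select only the columns of a, as a single-entity ORM query would.
joined_rows = conn.execute(
    "SELECT a.id, a.content FROM a JOIN b ON b.aid = a.id ORDER BY b.id"
).fetchall()

row_count = len(joined_rows)                             # rows returned by SQL
distinct_a_ids = sorted({row[0] for row in joined_rows})  # distinct A rows
print(row_count, distinct_a_ids)
```

The join yields one row per matching B row, so the same A row appears several times. That is the row count that count() reports, while all() ends up with only the distinct A objects.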

Source code analysis

Everything behind A.query derives from Query, so we start by reading the source of query.py.

def all(self):
    """Return the results represented by this ``Query`` as a list.
    This results in an execution of the underlying query.
    """
    return list(self)

The all method simply calls list(self), which means the query object is iterable. So the next thing to look at is the __iter__ method.
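The pattern is easy to reproduce outside SQLAlchemy. The following is a minimal sketch with a hypothetical TinyQuery class (not part of the library): list(self) triggers __iter__, so whatever __iter__ yields is exactly what all() returns.

```python
class TinyQuery:
    """Hypothetical stand-in for Query: all() delegates to __iter__."""

    def __init__(self, rows):
        self._rows = rows
        self.iter_calls = 0  # counts how often iteration was started

    def __iter__(self):
        # In the real Query this is where the SQL gets executed.
        self.iter_calls += 1
        return iter(self._rows)

    def all(self):
        # list() calls __iter__ and collects everything it yields.
        return list(self)


tiny = TinyQuery(["row1", "row2"])
result = tiny.all()
print(result, tiny.iter_calls)
```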

def __iter__(self):
    context = self._compile_context()
    context.statement.use_labels = True
    if self._autoflush and not self._populate_existing:
        self.session._autoflush()
    return self._execute_and_instances(context)

_compile_context() builds the SQL statement.
_execute_and_instances(context) is what actually executes the SQL statement.

def _execute_and_instances(self, querycontext):
    conn = self._get_bind_args(
        querycontext,
        self._connection_from_session,
        close_with_result=True)
    result = conn.execute(querycontext.statement, self._params)
    return loading.instances(querycontext.query, result, querycontext)

Debugging shows that result still contains 5 rows at this point.
So the rows must be reduced inside loading.instances.
We therefore continue into loading.py.

def instances(query, cursor, context):
    """Return an ORM result as an iterator."""
    context.runid = _new_runid()
    filtered = query._has_mapper_entities
    single_entity = len(query._entities) == 1 and \
        query._entities[0].supports_single_entity
    if filtered:
        if single_entity:
            filter_fn = id
        else:
            def filter_fn(row):
                return tuple(
                    id(item)
                    if ent.use_id_for_hash
                    else item
                    for ent, item in zip(query._entities, row)
                )
    try:
        (process, labels) = \
            list(zip(*[
                query_entity.row_processor(query,
                                           context, cursor)
                for query_entity in query._entities
            ]))
        if not single_entity:
            keyed_tuple = util.lightweight_named_tuple('result', labels)
        while True:
            context.partials = {}
            if query._yield_per:
                fetch = cursor.fetchmany(query._yield_per)
                if not fetch:
                    break
            else:
                fetch = cursor.fetchall()
            if single_entity:
                proc = process[0]
                rows = [proc(row) for row in fetch]
            else:
                rows = [keyed_tuple([proc(row) for proc in process])
                        for row in fetch]
            if filtered:
                rows = util.unique_list(rows, filter_fn)
            for row in rows:
                yield row
            if not query._yield_per:
                break
    except Exception as err:
        cursor.close()
        util.raise_from_cause(err)

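In this function we need to pay attention to the following:

The filtered flag.
The single_entity flag.
The proc function.
The util.unique_list function.

The filtered flag

Where does query._has_mapper_entities get set? It is an attribute of the query object, so we go and look at the Query class.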

def __init__(self, entities, session=None):
    self.session = session
    self._polymorphic_adapters = {}
    self._set_entities(entities)
def _set_entities(self, entities, entity_wrapper=None):
    if entity_wrapper is None:
        entity_wrapper = _QueryEntity
    self._entities = []
    self._primary_entity = None
    self._has_mapper_entities = False
    for ent in util.to_list(entities):
        entity_wrapper(self, ent)
    self._set_entity_selectables(self._entities)

Query._set_entities sets _has_mapper_entities to False.
So the only remaining candidate is _QueryEntity.

class _QueryEntity(object):
    """represent an entity column returned within a Query result."""
    def __new__(cls, *args, **kwargs):
        if cls is _QueryEntity:
            entity = args[1]
            if not isinstance(entity, util.string_types) and \
                    _is_mapped_class(entity):
                cls = _MapperEntity
            elif isinstance(entity, Bundle):
                cls = _BundleEntity
            else:
                cls = _ColumnEntity
        return object.__new__(cls)
class _MapperEntity(_QueryEntity):
    """mapper/class/AliasedClass entity"""
    def __init__(self, query, entity):
        if not query._primary_entity:
            query._primary_entity = self
        query._entities.append(self)
        query._has_mapper_entities = True
        self.entities = [entity]
        self.expr = entity
    supports_single_entity = True

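The __new__ method of _QueryEntity makes it clear: because A is a mapped class, it creates a _MapperEntity object, and the __init__ of that object sets query._has_mapper_entities to True.
At this point we know that filtered is True when instances processes our query.

The single_entity flag

This flag is true when query._entities contains exactly one entity and that entity supports single-entity results.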
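The dispatch in __new__ can be sketched in isolation. The classes below (Entity, MappedEntity, ColumnEntity, FakeQuery, Model) are hypothetical stand-ins, and the isinstance(target, type) test is a simplification assumed here in place of _is_mapped_class. The point is that Python calls the __init__ of whichever subclass __new__ returns, and that __init__ flips the flag on the query.

```python
class Entity:
    """Hypothetical stand-in for _QueryEntity."""

    def __new__(cls, *args, **kwargs):
        if cls is Entity:
            target = args[1]
            # Simplified check: treat any class as "mapped".
            cls = MappedEntity if isinstance(target, type) else ColumnEntity
        return object.__new__(cls)


class MappedEntity(Entity):
    supports_single_entity = True

    def __init__(self, query, target):
        query.entities.append(self)
        query.has_mapper_entities = True  # the flag read later as "filtered"


class ColumnEntity(Entity):
    supports_single_entity = False

    def __init__(self, query, target):
        query.entities.append(self)


class FakeQuery:
    def __init__(self, targets):
        self.entities = []
        self.has_mapper_entities = False
        for target in targets:
            Entity(self, target)  # __new__ picks the subclass, then __init__ runs


class Model:
    pass


mapped_query = FakeQuery([Model])
column_query = FakeQuery(["some_column"])
print(mapped_query.has_mapper_entities, column_query.has_mapper_entities)
```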

def __init__(self, entities, session=None):
    self.session = session
    self._polymorphic_adapters = {}
    self._set_entities(entities)

To analyze this flag, we again start from the __init__ method of Query.

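A.query.join(
    B, B.aid == A.id
)

Stepping through the code above in a debugger, we find that entities receives only a single object, A.
And supports_single_entity is already True on _MapperEntity by the time the object is created.

From this analysis we can conclude that our case is a single-entity query. The filter in use is therefore: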

if filtered:
    if single_entity:
        filter_fn = id  # our filter is the built-in function id
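To see why the choice of filter matters, here is a small sketch that compares the two kinds of keys from instances: id(row) for a single entity, and a tuple of ids for several entities. Obj and the sample rows are hypothetical, and the multi-entity rows are modelled as plain tuples; this was not run against SQLAlchemy itself.

```python
class Obj:
    """Hypothetical loaded object."""

    def __init__(self, name):
        self.name = name


a1, a2 = Obj("A1"), Obj("A2")
b = [Obj("B%d" % i) for i in range(1, 6)]

# Rows as a single-entity query (only A) would see them.
single_rows = [a1, a1, a2, a1, a2]
# Rows as a two-entity query (A and B) would see them.
multi_rows = [(a1, b[0]), (a1, b[1]), (a2, b[2]), (a1, b[3]), (a2, b[4])]

# Single entity: the key is id(row).
single_keys = {id(row) for row in single_rows}
# Several entities: the key is a tuple of ids, one per entity.
multi_keys = {tuple(id(item) for item in row) for row in multi_rows}

print(len(single_keys), len(multi_keys))
```

With a single entity only two distinct keys exist, while the tuple keys stay distinct for all five rows. Judging from the source above, a query over both A and B should therefore keep all five rows.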

The proc function

From instances in loading.py we can see that proc comes from the row_processor method of the _MapperEntity object.

def row_processor(self, query, context, result):
    ...
    _instance = loading._instance_processor(
        self.mapper,
        context,
        result,
        self.path,
        adapter,
        only_load_props=only_load_props,
        refresh_state=refresh_state,
        polymorphic_discriminator=self._polymorphic_discriminator
    )
    return _instance, self._label_name

This shows that _instance comes from loading._instance_processor.

def _instance_processor(
        mapper, context, result, path, adapter,
        only_load_props=None, refresh_state=None,
        polymorphic_discriminator=None,
        _polymorphic_from=None):
    """Produce a mapper level row processor callable
    which processes rows into mapped instances."""
...
identity_class = mapper._identity_class
...
def _instance(row):
    # determine the state that we'll be populating
    if refresh_identity_key:
        # fixed state that we're refreshing
        state = refresh_state
        instance = state.obj()
        dict_ = instance_dict(instance)
        isnew = state.runid != runid
        currentload = True
        loaded_instance = False
    else:
        # look at the row, see if that identity is in the
        # session, or we have to create a new one
        identitykey = (
            identity_class,
            tuple([row[column] for column in pk_cols])
        )
        instance = session_identity_map.get(identitykey)
        if instance is not None:
            # existing instance
            state = instance_state(instance)
            dict_ = instance_dict(instance)
            isnew = state.runid != runid
            currentload = not isnew
            loaded_instance = False
            if version_check and not currentload:
                _validate_version_id(mapper, state, dict_, row, adapter)
        else:
            # create a new instance
            # check for non-NULL values in the primary key columns,
            # else no entity is returned for the row
            if is_not_primary_key(identitykey[1]):
                return None
            isnew = True
            currentload = True
            loaded_instance = True
            instance = mapper.class_manager.new_instance()
            dict_ = instance_dict(instance)
            state = instance_state(instance)
            state.key = identitykey
            # attach instance to session.
            state.session_id = session_id
            session_identity_map._add_unpresent(state, identitykey)
    # populate.  this looks at whether this state is new
    # for this load or was existing, and whether or not this
    # row is the first row with this identity.
    if currentload or populate_existing:
        # full population routines.  Objects here are either
        # just created, or we are doing a populate_existing
        # be conservative about setting load_path when populate_existing
        # is in effect; want to maintain options from the original
        # load.  see test_expire->test_refresh_maintains_deferred_options
        if isnew and (propagate_options or not populate_existing):
            state.load_options = propagate_options
            state.load_path = load_path
        _populate_full(
            context, row, state, dict_, isnew, load_path,
            loaded_instance, populate_existing, populators)
        if isnew:
            if loaded_instance:
                if load_evt:
                    state.manager.dispatch.load(state, context)
                if persistent_evt:
                    loaded_as_persistent(context.session, state.obj())
            elif refresh_evt:
                state.manager.dispatch.refresh(
                    state, context, only_load_props)
            if populate_existing or state.modified:
                if refresh_state and only_load_props:
                    state._commit(dict_, only_load_props)
                else:
                    state._commit_all(dict_, session_identity_map)
    else:
        # partial population routines, for objects that were already
        # in the Session, but a row matches them; apply eager loaders
        # on existing objects, etc.
        unloaded = state.unloaded
        isnew = state not in context.partials
        if not isnew or unloaded or populators["eager"]:
            # state is having a partial set of its attributes
            # refreshed.  Populate those attributes,
            # and add to the "context.partials" collection.
            to_load = _populate_partial(
                context, row, state, dict_, isnew, load_path,
                unloaded, populators)
            if isnew:
                if refresh_evt:
                    state.manager.dispatch.refresh(
                        state, context, to_load)
                state._commit(dict_, to_load)
    return instance
...
return _instance

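note: We leave the other branches of _instance aside here and follow only the path that our test case takes.
In the end, the proc function is the _instance function shown above.

The session_identity_map and identitykey above show that loaded objects are cached under a key made of the identity class and the primary key.
Rows with the same class and the same primary key reuse the cached object.

In our example, A(1) matches three rows and A(2) matches two rows, so the result is a list with 5 items:

[A1, A1, A2, A1, A2]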
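The caching step can be sketched on its own. This is a simplified illustration, not SQLAlchemy's implementation: identity_map is a plain dict, and load_instance is a hypothetical helper that keeps only the lookup-or-create part of _instance.

```python
class A:
    """Hypothetical model with a single primary-key column."""

    def __init__(self, pk):
        self.id = pk


identity_map = {}  # (class, primary key tuple) -> instance


def load_instance(cls, row):
    # Same key shape as identitykey above: (class, tuple of pk values).
    key = (cls, (row["id"],))
    instance = identity_map.get(key)
    if instance is None:
        # First row with this identity: create and cache the object.
        instance = cls(row["id"])
        identity_map[key] = instance
    return instance


# The five joined rows, reduced to the primary key of A.
fetched = [{"id": 1}, {"id": 1}, {"id": 2}, {"id": 1}, {"id": 2}]
instances = [load_instance(A, row) for row in fetched]
print(len(instances), len(identity_map))
```

The list still has five entries, but only two distinct objects stand behind them.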

The util.unique_list function

unique_list is easy to find in _collections.py.

def unique_list(seq, hashfunc=None):
    seen = set()
    seen_add = seen.add
    if not hashfunc:
        return [x for x in seq
                if x not in seen
                and not seen_add(x)]
    else:
        return [x for x in seq
                if hashfunc(x) not in seen
                and not seen_add(hashfunc(x))]

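From the analysis above we know that:

seq is the final list of cached objects;
hashfunc is the built-in id function;
id returns the same value every time it is called on the same object.

In conclusion, the final list contains only two objects.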
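As a quick check, the function can be copied out and applied to a list shaped like ours. The unique_list body below is the one quoted in this article; the class A and the sample objects are hypothetical.

```python
def unique_list(seq, hashfunc=None):
    # Copied from the listing above.
    seen = set()
    seen_add = seen.add
    if not hashfunc:
        return [x for x in seq
                if x not in seen
                and not seen_add(x)]
    else:
        return [x for x in seq
                if hashfunc(x) not in seen
                and not seen_add(hashfunc(x))]


class A:
    def __init__(self, name):
        self.name = name


a1, a2 = A("A1"), A("A2")
rows = [a1, a1, a2, a1, a2]

# Deduplicate by object identity, keeping first-seen order.
unique_rows = unique_list(rows, id)
print([row.name for row in unique_rows])
```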

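Summary

This behavior of SQLAlchemy can be confusing when compared with what the raw SQL returns.
The analysis shows that in a single-entity query SQLAlchemy reuses the same object for rows with the same identity and then removes duplicates by object identity. That is why all() returns 2 objects while count() reports 5 rows.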