An Introduction to Python's AST (Abstract Syntax Tree) and Its Applications

Posted on 2022-07-28 by WalkonNet

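Introduction

AST stands for Abstract Syntax Tree. An AST is an intermediate product on the way from Python source code to bytecode, and the ast module lets you analyze the structure of source code from the point of view of its syntax tree.

Beyond that, a syntax tree can be modified and executed, and a tree generated from source can be unparsed back into Python source. This gives ast plenty of room for source checking, syntax analysis, code rewriting, and debugging.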

1. Introduction to AST

CPython, the official Python interpreter, processes Python source code in the following steps:

Parse source code into a parse tree (Parser/pgen.c)

Transform parse tree into an Abstract Syntax Tree (Python/ast.c)

Transform AST into a Control Flow Graph (Python/compile.c)

Emit bytecode based on the Control Flow Graph (Python/compile.c)

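In other words, Python code is actually processed as follows:

Source code -> parse tree -> abstract syntax tree (AST) -> control flow graph -> bytecode

This pipeline has been in use since Python 2.5. The source is first parsed into a parse tree, which is then transformed into an abstract syntax tree. The AST exposes the Python syntactic structure of the source file.

Most day-to-day programming never needs the AST, but under particular conditions and requirements it is uniquely convenient.

Below is a simple example: the Python 2 AST for the statement print 1 + 2.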

Module(body=[
    Print(
          dest=None,
          values=[BinOp( left=Num(n=1),op=Add(),right=Num(n=2))],
          nl=True,
 )])
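The dump above has the Python 2 shape, where print is a statement with its own Print node. As a minimal sketch for readers on Python 3 (assuming Python 3.8 or later, where numeric literals are Constant nodes), the equivalent source print(1 + 2) parses to an expression statement wrapping a Call:

```python
import ast

# Python 3 sketch: print is an ordinary function call, so there is no Print node.
tree = ast.parse("print(1 + 2)")

stmt = tree.body[0]     # the only statement in the module
call = stmt.value       # the call to print(...)
binop = call.args[0]    # the expression 1 + 2

print(type(stmt).__name__)                   # Expr
print(type(call).__name__)                   # Call
print(type(binop.op).__name__)               # Add
print(binop.left.value, binop.right.value)   # 1 2
```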

2. Creating an AST

2.1 The compile Function

First, a quick look at the compile function.

compile(source, filename, mode[, flags[, dont_inherit]])

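source: a string or an AST (Abstract Syntax Tree) object. Typically the entire contents of a .py file, as returned by file.read(), is passed in.
filename: the name of the file the code came from; if the code was not read from a file, pass some recognizable value such as '<string>'.
mode: the kind of code to compile. It can be 'exec', 'eval', or 'single'.
flags, dont_inherit: optional flags that control how the source is compiled, for example which future statements take effect or, with ast.PyCF_ONLY_AST, whether an AST is returned instead of a code object.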
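To make the mode parameter concrete, here is a small sketch of the three modes. It is written for Python 3, where exec is a function, and the variable names are only illustrative:

```python
# Python 3 sketch of the three compile() modes.
namespace = {}

# "exec": a sequence of statements; run it with exec().
exec(compile("x = 6 * 7", "<string>", "exec"), namespace)

# "eval": a single expression; eval() returns its value.
value = eval(compile("x + 1", "<string>", "eval"), namespace)

# "single": one interactive statement; the value of an expression is
# echoed through sys.displayhook, as it would be at the >>> prompt.
exec(compile("x", "<string>", "single"), namespace)

print(namespace["x"], value)
```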

func_def = \
"""
def add(x, y):
    return x + y
print add(3, 5)
"""

Compile and execute it with compile:

>>> cm = compile(func_def, '<string>', 'exec')
>>> exec cm
8

Compiling func_def with compile produces bytecode; cm is a code object, so

isinstance(cm, types.CodeType) is True.

compile(source, filename, mode, ast.PyCF_ONLY_AST) <==> ast.parse(source, filename='<unknown>', mode='exec')
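The equivalence above can be checked directly. The sketch below assumes Python 3 and compares the two trees through ast.dump; it also passes one tree back to compile to obtain a code object:

```python
import ast
import types

source = "def add(x, y):\n    return x + y\n"

# Both calls return an ast.Module for the same source.
tree_a = compile(source, "<string>", "exec", ast.PyCF_ONLY_AST)
tree_b = ast.parse(source, filename="<string>", mode="exec")

print(type(tree_a).__name__, type(tree_b).__name__)
print(ast.dump(tree_a) == ast.dump(tree_b))

# An AST object can itself be handed to compile() to get a code object.
code = compile(tree_a, "<string>", "exec")
print(isinstance(code, types.CodeType))
```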

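2.2 Generating the AST

Generate the AST from the func_def above.

import ast
import astunparse    # third-party package
r_node = ast.parse(func_def)
print astunparse.dump(r_node)    # or: print ast.dump(r_node)

Below is the AST structure for func_def: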

Module(body=[
    FunctionDef(
        name='add',
        args=arguments(
            args=[Name(id='x',ctx=Param()),Name(id='y',ctx=Param())],
            vararg=None,
            kwarg=None,
            defaults=[]),
        body=[Return(value=BinOp(
            left=Name(id='x',ctx=Load()),
            op=Add(),
            right=Name(id='y',ctx=Load())))],
        decorator_list=[]),
    Print(
        dest=None,
        values=[Call(
                func=Name(id='add',ctx=Load()),
                args=[Num(n=3),Num(n=5)],
                keywords=[],
                starargs=None,
                kwargs=None)],
        nl=True)
  ])

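Besides ast.dump, there are many third-party libraries for dumping an AST, such as astunparse, codegen, and unparse. They not only present the AST structure more readably but can also turn an AST back into Python source code.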
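On Python 3.9 and later the standard library can do the reverse conversion itself through ast.unparse, so a third-party package is not strictly needed there. A minimal round-trip sketch, assuming Python 3.9+:

```python
import ast

source = "def add(x, y):\n    return x + y\n"
tree = ast.parse(source)

# ast.unparse is part of the standard library from Python 3.9 onward.
regenerated = ast.unparse(tree)
print(regenerated)

# Formatting and comments are not preserved, but the regenerated
# source parses back to a structurally identical tree.
same = ast.dump(ast.parse(regenerated)) == ast.dump(tree)
print(same)
```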

module Python version "$Revision$"
{
  mod = Module(stmt* body)
      | Expression(expr body)

  stmt = FunctionDef(identifier name, arguments args, stmt* body, expr* decorator_list)
       | ClassDef(identifier name, expr* bases, stmt* body, expr* decorator_list)
       | Return(expr? value)
       | Print(expr? dest, expr* values, bool nl)
       | For(expr target, expr iter, stmt* body, stmt* orelse)

  expr = BoolOp(boolop op, expr* values)
       | BinOp(expr left, operator op, expr right)
       | Lambda(arguments args, expr body)
       | Dict(expr* keys, expr* values)
       | Num(object n) -- a number as a PyObject.
       | Str(string s) -- need to specify raw, unicode, etc?
       | Name(identifier id, expr_context ctx)
       | List(expr* elts, expr_context ctx)

        -- col_offset is the byte offset in the utf8 string the parser uses
        attributes (int lineno, int col_offset)

  expr_context = Load | Store | Del | AugLoad | AugStore | Param
  boolop = And | Or
  operator = Add | Sub | Mult | Div | Mod | Pow | LShift | RShift | BitOr | BitXor | BitAnd | FloorDiv
  arguments = (expr* args, identifier? vararg, identifier? kwarg, expr* defaults)
}

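The above is a partial excerpt of the Abstract Grammar from the official documentation. When traversing AST nodes in practice, you access each node's attributes according to its type.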
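As a rough illustration of accessing attributes by node type, every node class exposes the field names from the grammar through _fields, and ast.iter_fields yields them together with their values. The sketch below assumes Python 3, so the exact field lists differ slightly from the Python 2 grammar quoted above:

```python
import ast

tree = ast.parse("def add(x, y):\n    return x + y\n")
func = tree.body[0]

# _fields lists the attribute names the grammar defines for this node type.
print(type(func).__name__, func._fields)

# iter_fields yields (name, value) pairs for the fields of one node,
# here the BinOp inside the return statement.
ret = func.body[0]
binop_fields = [name for name, _ in ast.iter_fields(ret.value)]
for name, value in ast.iter_fields(ret.value):
    print(name, type(value).__name__)
```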

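3. Traversing the AST

Python provides two ways to traverse an entire abstract syntax tree.

3.1 ast.NodeVisitor

Change the addition in the add function of func_def to subtraction, and add a call log to the function body.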

import ast
import astunparse    # third-party package

class CodeVisitor(ast.NodeVisitor):
    def visit_BinOp(self, node):
        # Replace the + operator with - in place.
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        self.generic_visit(node)

    def visit_FunctionDef(self, node):
        print 'Function Name:%s' % node.name
        self.generic_visit(node)
        # Build a print statement and insert it as the first statement of the body.
        func_log_stmt = ast.Print(
            dest = None,
            values = [ast.Str(s = 'calling func: %s' % node.name, lineno = 0, col_offset = 0)],
            nl = True,
            lineno = 0,
            col_offset = 0,
        )
        node.body.insert(0, func_log_stmt)

r_node = ast.parse(func_def)
visitor = CodeVisitor()
visitor.visit(r_node)
# print astunparse.dump(r_node)
print astunparse.unparse(r_node)
exec compile(r_node, '<string>', 'exec')

Output:

Function Name:add
def add(x, y):
    print 'calling func: add'
    return (x - y)
print add(3, 5)
calling func: add
-2
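For readers on Python 3, where the Print node no longer exists, the same idea can be sketched as follows. This is an adaptation rather than the original code: the log statement is built by parsing a one-line snippet, and the last line of func_def stores the result in a variable (result, an illustrative name) instead of printing it:

```python
import ast

func_def = """
def add(x, y):
    return x + y
result = add(3, 5)
"""

class CodeVisitor(ast.NodeVisitor):
    def visit_BinOp(self, node):
        # Swap the operator in place: x + y becomes x - y.
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        self.generic_visit(node)

    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        # Parse print('calling func: <name>') and take its single statement.
        log_source = "print(%r)" % ("calling func: %s" % node.name)
        node.body.insert(0, ast.parse(log_source).body[0])

tree = ast.parse(func_def)
CodeVisitor().visit(tree)
# Fills in lineno/col_offset on any node that lacks them; harmless otherwise.
ast.fix_missing_locations(tree)

namespace = {}
exec(compile(tree, "<string>", "exec"), namespace)
print(namespace["result"])
```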

3.2 ast.NodeTransformer

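NodeVisitor, as used above, changes the AST mainly by modifying nodes in place. NodeTransformer is mainly used to replace nodes: the value returned by each visit method takes the place of the visited node.

Since the add defined in func_def has already been turned into a subtraction function, let's go all the way: rename the function, its parameters, and the call site in the AST, and make the inserted call log more elaborate, until the code is barely recognizable :-)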

import ast
import astunparse    # third-party package

class CodeTransformer(ast.NodeTransformer):
    def visit_BinOp(self, node):
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        self.generic_visit(node)
        return node

    def visit_FunctionDef(self, node):
        # Children are visited first, so the parameters are already renamed here.
        self.generic_visit(node)
        if node.name == 'add':
            node.name = 'sub'
        args_num = len(node.args.args)
        args = tuple([arg.id for arg in node.args.args])
        func_log_stmt = ''.join(["print 'calling func: %s', " % node.name, "'args:'", ", %s" * args_num % args])
        node.body.insert(0, ast.parse(func_log_stmt))
        return node

    def visit_Name(self, node):
        replace = {'add': 'sub', 'x': 'a', 'y': 'b'}
        re_id = replace.get(node.id, None)
        node.id = re_id or node.id
        self.generic_visit(node)
        return node

r_node = ast.parse(func_def)
transformer = CodeTransformer()
r_node = transformer.visit(r_node)
# print astunparse.dump(r_node)
source = astunparse.unparse(r_node)
print source
# exec compile(r_node, '<string>', 'exec')        # fails: the inserted func_log_stmt node (the Module returned by ast.parse) has no lineno and col_offset attributes
exec compile(source, '<string>', 'exec')
exec compile(ast.parse(source), '<string>', 'exec')

Output:

def sub(a, b):
    print 'calling func: sub', 'args:', a, b
    return (a - b)
print sub(3, 5)
calling func: sub args: 3 5
-2
calling func: sub args: 3 5
-2
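A Python 3 adaptation of the same transformation might look like the sketch below. It differs from the code above in a few labeled ways: parameters are ast.arg nodes in Python 3 and need their own visit_arg method, visit_BinOp returns a newly built node to show replacement rather than in-place mutation, and the call log is left out for brevity. RENAMES and result are illustrative names:

```python
import ast

func_def = """
def add(x, y):
    return x + y
result = add(3, 5)
"""

RENAMES = {"add": "sub", "x": "a", "y": "b"}  # same mapping as the example above

class CodeTransformer(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            # Return a brand-new node; NodeTransformer puts it in place of the old one.
            return ast.BinOp(left=node.left, op=ast.Sub(), right=node.right)
        return node

    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        node.name = RENAMES.get(node.name, node.name)
        return node

    def visit_arg(self, node):
        # In Python 3, parameters are ast.arg nodes rather than Name nodes.
        node.arg = RENAMES.get(node.arg, node.arg)
        return node

    def visit_Name(self, node):
        node.id = RENAMES.get(node.id, node.id)
        return node

tree = CodeTransformer().visit(ast.parse(func_def))
ast.fix_missing_locations(tree)  # the new BinOp node has no lineno/col_offset yet

namespace = {}
exec(compile(tree, "<string>", "exec"), namespace)
print(namespace["result"])
```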

The code makes the difference between the two clear, so it is not elaborated further here.

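4. AST Applications

The ast module is rarely used in everyday programming, but it is valuable as an auxiliary means of checking source code: syntax checks, debugging, detection of particular fields, and so on.

Adding call logs to functions, as above, is one way of debugging Python source. In practice, though, we traverse and modify the source by parsing entire Python files.

4.1 Detecting Chinese Characters

Below is the Unicode range for CJK (Chinese, Japanese, Korean) characters.

CJK Unified Ideographs

Range: 4E00-9FFF

Number of characters: 20992

Languages: Chinese, Japanese, Korean, Vietnamese

We use the Unicode range \u4e00 to \u9fff to identify Chinese characters. Note that this range does not include Chinese punctuation (e.g. u'；' == u'\uff1b').

Below is a class, CNCheckHelper, that determines whether a string contains Chinese characters: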

class CNCheckHelper(object):
    # Encodings the text under test may be in, tried in order
    VALID_ENCODING = ('utf-8', 'gbk')

    def _get_unicode_imp(self, value, idx = 0):
        # Try each candidate encoding in turn; returns None if none of them decodes.
        if idx < len(self.VALID_ENCODING):
            try:
                return value.decode(self.VALID_ENCODING[idx])
            except UnicodeDecodeError:
                return self._get_unicode_imp(value, idx + 1)

    def _get_unicode(self, from_str):
        if isinstance(from_str, unicode):
            return from_str    # already unicode, nothing to decode
        return self._get_unicode_imp(from_str)

    def is_any_chinese(self, check_str, is_strict = True):
        unicode_str = self._get_unicode(check_str)
        if unicode_str:
            c_func = any if is_strict else all
            return c_func(u'\u4e00' <= char <= u'\u9fff' for char in unicode_str)
        return False

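is_any_chinese has two modes: in strict mode (the default) a string is flagged as soon as it contains any Chinese character; in non-strict mode every character must be Chinese.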
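On Python 3 every str is already Unicode, so the decoding step disappears. A minimal sketch of the same range test, with contains_chinese and require_all as hypothetical names standing in for is_any_chinese and its is_strict flag (note that require_all=True corresponds to is_strict=False):

```python
def contains_chinese(text, require_all=False):
    """Return True if text contains a CJK Unified Ideograph.

    With require_all=True, every character must be in the range instead.
    """
    if not text:
        return False
    check = all if require_all else any
    return check("\u4e00" <= ch <= "\u9fff" for ch in text)

print(contains_chinese("hello 世界"))                    # True
print(contains_chinese("hello 世界", require_all=True))  # False
print(contains_chinese("世界", require_all=True))        # True
print(contains_chinese("；"))  # False: full-width punctuation is outside the range
```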

Next we use ast to walk the abstract syntax tree of each source file and check whether its string literals contain Chinese characters.

import ast
import os

class CodeCheck(ast.NodeVisitor):
    def __init__(self):
        self.cn_checker = CNCheckHelper()

    def visit_Str(self, node):
        self.generic_visit(node)
        # if node.s and any(u'\u4e00' <= char <= u'\u9fff' for char in node.s.decode('utf-8')):
        if self.cn_checker.is_any_chinese(node.s, True):
            print 'line no: %d, column offset: %d, CN_Str: %s' % (node.lineno, node.col_offset, node.s)

project_dir = './your_project/script'
for root, dirs, files in os.walk(project_dir):
    print root, dirs, files
    py_files = filter(lambda file: file.endswith('.py'), files)
    checker = CodeCheck()
    for file in py_files:
        file_path = os.path.join(root, file)
        print 'Checking: %s' % file_path
        with open(file_path, 'r') as f:
            root_node = ast.parse(f.read())
            checker.visit(root_node)

This example is fairly simple, but it conveys the general idea.
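In Python 3.8 and later, string literals are Constant nodes rather than Str nodes, so an equivalent visitor hooks visit_Constant. The following sketch checks an in-memory sample instead of walking a directory; ChineseStringFinder and sample are made-up names for illustration:

```python
import ast

class ChineseStringFinder(ast.NodeVisitor):
    """Collect (lineno, col_offset, text) for string literals with CJK ideographs."""

    def __init__(self):
        self.hits = []

    def visit_Constant(self, node):
        # Constant also covers numbers, None, etc., so filter on str first.
        if isinstance(node.value, str) and any(
            "\u4e00" <= ch <= "\u9fff" for ch in node.value
        ):
            self.hits.append((node.lineno, node.col_offset, node.value))

sample = 'greeting = "你好"\nname = "world"\n'
finder = ChineseStringFinder()
finder.visit(ast.parse(sample))
print(finder.hits)
```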

For how the CPython interpreter executes source code, see the official description: PEP 339

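4.2 Closure Checks

A closure arises when a function or lambda defined inside another function references a local variable of the enclosing function and is then returned. Closures are very useful in certain scenarios, but they are also easy to misuse.

For the concept of Python closures, see my other article: Understanding Python Closures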
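One common way closures are misused is late binding: a lambda looks up an enclosing variable when it is called, not when it is created. A short Python 3 illustration, using hypothetical helper names:

```python
def make_late():
    callbacks = []
    for i in range(3):
        callbacks.append(lambda: i)      # refers to the variable i, read at call time
    return callbacks

def make_bound():
    callbacks = []
    for i in range(3):
        callbacks.append(lambda i=i: i)  # default argument captures the current value
    return callbacks

late_results = [f() for f in make_late()]
bound_results = [f() for f in make_bound()]
print(late_results)    # [2, 2, 2]
print(bound_results)   # [0, 1, 2]
```

The first form is exactly the kind of lambda a closure check would flag, because its body uses a name that is not one of its own parameters.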

Here is a brief look at how to use ast to detect closure references inside a lambda. The code is as follows:

import ast
import os
import astunparse    # third-party package

class LambdaCheck(ast.NodeVisitor):
    def __init__(self):
        self.illegal_args_list = []
        self._cur_file = None
        self._cur_lambda_args = []

    def set_cur_file(self, cur_file):
        assert os.path.isfile(cur_file), cur_file
        self._cur_file = os.path.realpath(cur_file)

    def visit_Lambda(self, node):
        """
        Closure check for a lambda:
        only check whether the names used in the lambda body refer to
        anything outside the lambda's own argument list.
        """
        self._cur_lambda_args = [a.id for a in node.args.args]
        print astunparse.unparse(node)
        # print astunparse.dump(node)
        self.get_lambda_body_args(node.body)
        self.generic_visit(node)

    def record_args(self, name_node):
        # Record a Name that is not one of the current lambda's parameters.
        if isinstance(name_node, ast.Name) and name_node.id not in self._cur_lambda_args:
            self.illegal_args_list.append((self._cur_file, 'line no:%s' % name_node.lineno, 'var:%s' % name_node.id))

    def _is_args(self, node):
        if isinstance(node, ast.Name):
            self.record_args(node)
            return True
        if isinstance(node, ast.Call):
            map(self.record_args, node.args)
            return True
        return False

    def get_lambda_body_args(self, node):
        if self._is_args(node): return
        # for cnode in ast.walk(node):
        for cnode in ast.iter_child_nodes(node):
            if not self._is_args(cnode):
                self.get_lambda_body_args(cnode)

Traverse the project files:

project_dir = './your project/script'
for root, dirs, files in os.walk(project_dir):
    py_files = filter(lambda file: file.endswith('.py'), files)
    checker = LambdaCheck()
    for file in py_files:
        file_path = os.path.join(root, file)
        checker.set_cur_file(file_path)
        with open(file_path, 'r') as f:
            root_node = ast.parse(f.read())
            checker.visit(root_node)
    res = '\n'.join([' ## '.join(info) for info in checker.illegal_args_list])
    print res

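Because the body expression in Lambda(arguments args, expr body) can be very complex, the example above handles only fairly simple body expressions. Adapt and extend the rules to suit your own project. For a more general solution, you could write a separate visitor class dedicated to traversing lambda nodes.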
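One possible generalization, sketched here for Python 3.8+ under simplifying assumptions, uses ast.walk to collect every name read in a lambda body that is neither one of the lambda's parameters nor a builtin. Skipping builtins is a choice made for this sketch, not something the original checker does, and nested lambdas are not treated specially. LambdaFreeNames is a hypothetical name:

```python
import ast
import builtins

class LambdaFreeNames(ast.NodeVisitor):
    """Record names read in a lambda body that are not the lambda's own parameters."""

    def __init__(self):
        self.found = []  # (lineno, name) pairs

    def visit_Lambda(self, node):
        spec = node.args
        params = {p.arg for p in spec.posonlyargs + spec.args + spec.kwonlyargs}
        for extra in (spec.vararg, spec.kwarg):
            if extra is not None:
                params.add(extra.arg)
        for child in ast.walk(node.body):
            if not (isinstance(child, ast.Name) and isinstance(child.ctx, ast.Load)):
                continue
            # Assumption of this sketch: builtins such as len are not reported.
            if child.id not in params and not hasattr(builtins, child.id):
                self.found.append((child.lineno, child.id))
        self.generic_visit(node)

sample = """
def make_adder(n):
    return lambda x: x + n

square = lambda v: v * v
longest = lambda items: max(items, key=len)
"""

checker = LambdaFreeNames()
checker.visit(ast.parse(sample))
print(checker.found)
```

Only the lambda inside make_adder is reported, since n comes from the enclosing function; the other two lambdas use nothing but their own parameters and builtins.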

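The uses of ast are not limited to the examples above; for reasons of space we stop here. Hopefully ast will help you solve some of your trickier problems.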
