Device Name and Version Extraction Approach

Hao Wei (郝伟), 2020/12/27

[TOC]

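1. Requirements

2. Solutions

2.1. Option 1: A library of regular expressions

Build a library of regular expressions, with patterns tailored to each brand's products; see the source code in Appendix 1. A preliminary estimate is that two to three hundred patterns would be enough. Note, however, that regular expressions cover only the majority of cases, and a small number of special cases will still slip through. That remainder can be handled by custom code combined with further regular expressions, and the very few lines left after that can be processed manually.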
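The tiered flow described above (regex first, then custom code, then manual review) could be sketched as follows. This is only an illustration under stated assumptions: the two sample patterns, the `by_code` rule, and the sample input lines are hypothetical and are not taken from the real data set.

```python
import re

# Hypothetical sample patterns: (description, compiled regex with name and version groups).
PATTERNS = [
    ("Apple iOS x.y", re.compile(r"(Apple iOS) (\d+\.\d+)\b")),
    ("FreeBSD x.y-TAG", re.compile(r"(FreeBSD)\s(\d{1,3}\.\d{0,3}-\w+)")),
]

def by_regex(text):
    """Tier 1: return (name, version) from the first pattern that matches."""
    for _, regex in PATTERNS:
        m = regex.search(text)
        if m:
            return m.group(1), m.group(2)
    return None

def by_code(text):
    """Tier 2 (hypothetical rule): treat the last token as the version if it starts with a digit."""
    head, _, tail = text.strip().rpartition(" ")
    if head and tail[:1].isdigit():
        return head, tail
    return None

def extract_all(lines):
    """Run the tiers in order; lines that no tier handles are queued for manual review."""
    results, manual = [], []
    for line in lines:
        hit = by_regex(line) or by_code(line)
        if hit:
            results.append(hit)
        else:
            manual.append(line)
    return results, manual

# Hypothetical input lines, one per tier.
results, manual = extract_all(["Apple iOS 5.1 device", "Linksys WRT54G 4.21", "unknown gadget"])
print(results)
print(manual)
```

Keeping each tier as a separate function makes it easy to measure how many lines fall through to the next tier, which is the number that decides how much manual work remains.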

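2.2. Option 2: Natural language processing (NLP)

Current NLP tooling targets general-purpose language and offers no ready-made capability for classifying brands and versions. Providing it would require building a corpus and training on it. That process is essentially the same work as Option 1, except that the corpus would need to be larger than the regex library, and the quality of the trained result is unknown in advance.

3. Output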

------------------------------ Matched Items (top 20 of 272) ------------------------------
('Apple AirPort', 'Express WAP')
('Apple AirPort', 'Express WAP')
('Apple AirPort', 'Express WAP')
('Apple', 'iPad')
('Apple', 'iPad')
('Apple', 'iPad')
('Apple', 'iPad')
('Apple Mac OS X', '10.7.3')
('Apple Mac OS X', '10.8.0')
('Apple iOS', '4.1')
('Apple iOS', '4.3.1')
('Apple', 'iPad')
('Apple Mac OS X', '10.5')
('Apple iOS', '5.0.1')
('Apple iOS', '5.0.1')
('Apple iOS', '5.1.1')
('iPhone mobile phone', '2.0.2')
('iPhone mobile phone', '2.2.1')
('Apple', 'iPad')
('Apple Mac OS X', '10.3.9')
------------------------------ Results ------------------------------
   4   matches on   "Apple AirPort Express WAP"
   6   matches on   "Apple iPad"
  79   matches on   "Apple Mac OS X x.y"
  12   matches on   "Apple Mac OS X x.y.z"
   2   matches on   "iPhone mobile phone (iPhone OS 2.0.2)"
   1   matches on   "Apple iOS 5.0"
   4   matches on   "Apple iOS 5.0.2"
  27   matches on   "Brother 000-00000 printer"
 137   matches on   "FreeBSD 000.00-W"
------------------------------ Summary ------------------------------
4121   Total number of lines.
 272   Matched lines.
3849   Remaining number of lines.

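4. Appendix

4.1. Appendix 1: Processing code

[Purpose] Extracts the product name and version, the two main pieces of information, from each line of the input file pro.txt.

[Usage notes]

  1. Set the path to pro.txt on line 35 (the `readLines(...)` call).
  2. Lines 38-48 hold the regex list (`patterns`); each entry consists of two items, a name and a regular expression. If the list grows too long, it can be moved into a configuration file and read from there, with error handling added so that a malformed regex is caught when the patterns are validated.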
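The configuration-file idea in note 2 might look roughly like the sketch below. It assumes a JSON file holding a list of objects with `name` and `pattern` keys; that layout, the `load_patterns` helper, and the sample entries are all hypothetical. Each pattern is compiled up front so that a malformed regex is reported and skipped instead of crashing the run.

```python
import json
import re

def load_patterns(config_text):
    """Parse a JSON list of {"name": ..., "pattern": ...} and keep only valid regexes."""
    valid, rejected = [], []
    for entry in json.loads(config_text):
        name, pattern = entry.get("name"), entry.get("pattern")
        try:
            valid.append((name, re.compile(pattern)))
        except (re.error, TypeError) as exc:
            # re.error: malformed regex; TypeError: missing or non-string pattern.
            rejected.append((name, str(exc)))
    return valid, rejected

# Hypothetical config content; in practice this would be read from a file.
config_text = r'''
[
  {"name": "Apple iPad", "pattern": "(Apple)\\s(iPad)"},
  {"name": "Broken entry", "pattern": "(FreeBSD"}
]
'''

valid, rejected = load_patterns(config_text)
for name, reason in rejected:
    print('skipped "{0}": {1}'.format(name, reason))
```

Returning the rejected entries alongside the valid ones lets the caller decide whether a bad pattern should abort the run or merely be logged.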

```python{.line-numbers, highlight=[35,38-48]}
# -*- coding: utf-8 -*-
"""
Created on Thu Dec 25 16:24:22 2020

@author: Administrator
"""
import os, re

def readLines(filepath='./content.txt'):
    '''
    Reads content in lines from a text file filepath, supporting utf-8 and gbk.
    '''
    lines = []
    if not os.path.exists(filepath):
        return lines
    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            for line in f:
                lines.append(line.strip())
    except UnicodeDecodeError:
        with open(filepath, 'r', encoding='gbk') as f1:
            # Discard any partially decoded lines from the utf-8 attempt.
            lines = [line.strip() for line in f1]
    return lines

def find(content, pattern):
    '''
    Extracts matched content by the specified regex pattern.
    '''
    regex = re.compile(pattern)
    res = regex.findall(content)
    return res

# Lines of text read from the specified file.
texts = readLines(r"d:/data/pro.txt")

# A list of regex patterns in format: (description, pattern)
patterns = [
    ('Apple AirPort Express WAP', r'(Apple) (AirPort Express WAP)'),
    ('Apple iPad', r'(Apple)\s(iPad)'),
    ('Apple Mac OS X x.y', r'(Apple Mac OS X) (\d{1,2}\.\d{1,2}\.\d{1,2})\s'),
    ('Apple Mac OS X x.y.z', r'(Apple Mac OS X) (\d{1,2}\.\d{1,2})\s'),
    ('iPhone mobile phone (iPhone OS 2.0.2)', r'(iPhone mobile phone) \(iPhone OS (\d\.\d\.\d)\)'),
    ('Apple iOS 5.0', r'(Apple iOS) (\w\.\w)\s'),
    ('Apple iOS 5.0.2', r'(Apple iOS) (\w\.\w\.\w)\s'),
    ('Brother 000-00000 printer', r'(Brother) (\w{1,3}-\d{1,5}\w) printer'),
    ('FreeBSD 000.00-W', r'(FreeBSD)\s(\d{1,3}\.\d{0,3}-\w+)'),
]

# Used for counting the number of matches for each pattern.
count_list = [0] * len(patterns)
result_list = []  # matches
remain_list = []  # text that failed to match
for text in texts:
    found = 0
    for i in range(len(patterns)):
        res = find(text, patterns[i][1])
        if res:
            # Only the first pattern that matches is recorded for each line.
            result_list.append(res[0])
            count_list[i] = count_list[i] + 1
            found = 1
            break
    if found == 0:
        remain_list.append(text)

# Show the first matched items.
top_n = 20
print('{0} Matched Items (top {1} of {2}) {0}'.format('-' * 30, top_n, len(result_list)))
for item in result_list[:top_n]:
    print(item)

# Statistics of the matching results.
print('{0} Results {0}'.format('-' * 30))
for item in zip(count_list, patterns):
    print('{0:>4}   matches on   "{1}"'.format(item[0], item[1][0]))
print('{0} Summary {0}'.format('-' * 30))
print('{0:>4}   Total number of lines.'.format(len(texts)))
print('{0:>4}   Matched lines.'.format(len(texts) - len(remain_list)))
print('{0:>4}   Remaining number of lines.'.format(len(remain_list)))
```

4.2. Appendix 2: Test code

Test code that swaps the two sides of an assignment, turning `a=b;` into `b = a;`.

```python
import re

def swap(content):
    # Groups 1 and 3 capture the left- and right-hand sides of the assignment.
    pattern = r'(.+)(\s*=\s*)(.+);'
    substitute = r'\g<3> = \g<1>;'
    return re.sub(pattern, substitute, content)
```
