Partially Parsing ＋メッセージ (+Message) SMS Backups

Background

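I recently got the chance to replace my au-branded Xperia XZ3 with a Redmi Note 12 Turbo (marble). That brought such extremely advanced headline features as a 3.5mm jack, NFC with an eSE, an IR blaster, multiple cameras, a flat punch-hole display, and HyperOS 2, at the cost of the TF card slot, the shutter button, the curved screen, and the 2K display. The main point, of course, is that RAM went from 4 GB to 12 GB and the SoC from the Snapdragon 845 to the 7+ Gen 2, so in the summer heat I no longer have to put up with a phone that runs hot, lags, and often even shuts itself down. Getting all of these modern and ancient features in a single device is probably only possible with a Japanese phone. But now that even Xperia has left the Chinese market, localization, whether system-level tuning or assorted convenience features, is probably out of reach.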

Backup

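Mi Mover (小米换机) can use a hotspot to migrate contacts, the pictures and other data in the phone's storage, and most apps themselves fairly quickly. App data, on the other hand, generally has to be moved with whatever each app provides: TIM and WeChat have built-in LAN data transfer, while open-source apps such as Mihon usually rely on exporting and importing their data.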

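The biggest headache is ＋メッセージ (+Message), the SMS app used on Japanese phones. I remember it once being unable to send SMS because it has no way to change the SMS center number, the RCS service it offers does not work in China, and it even carries banner ads. If you use a Japanese phone outside Japan, I recommend switching to Google Messages or another SMS app right away. Worst of all, it does not store messages directly in mmssms.db, so neither phone-migration tools nor dedicated SMS backup apps can read them directly.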

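Fortunately it does provide a backup feature, but the backup uses a binary format that is very hard to inspect directly. Going by Android's SMS storage format, we only need to extract the address, date, body, and type fields.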

Parsing the backup file

Since I lack the skills to analyze Android apps, I analyzed the backup file statically instead and tried to work out its format directly.

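The first thing that stands out is the wclBackup string that marks both the head and the tail of the file. Besides that, the header contains further fields, up to backup_owner.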
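As a quick sanity check before any deeper parsing, one can confirm that a file carries the wclBackup marker near both ends. This is only a minimal sketch: the function name and the 64-byte search window are my own assumptions, since the exact offsets of the marker are not pinned down here, and the sample bytes are synthetic.

```python
MAGIC = b"wclBackup"


def looks_like_wcl_backup(data: bytes, window: int = 64) -> bool:
    """Return True if MAGIC appears within `window` bytes of both ends."""
    head = data[:window]
    tail = data[-window:]
    return MAGIC in head and MAGIC in tail


# Synthetic stand-in for a real backup (assumed layout, not real data)
sample = b"\x00wclBackup\x00backup_owner" + b"\x00" * 100 + b"wclBackup\x00"
print(looks_like_wcl_backup(sample))
print(looks_like_wcl_backup(b"not a backup"))
```

For a real file, read it in binary mode and pass the bytes to the function.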

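The way in turned out to be the ASCII string text/plain$, which is clearly a MIME type. Luckily, every message I had received was plain text with no other content types, which greatly reduced the work, so we use this string to locate all the messages.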

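# Marker in hex: 00 00, the ASCII bytes of "text/plain$", then 00 00 00
PREFIX_HEX = "0000746578742F706C61696E24000000"
PREFIX_BYTES = bytes.fromhex(PREFIX_HEX)

def split_file_by_prefix(input_file):
    """Split the backup file into the chunks that lie between markers."""
    with open(input_file, 'rb') as f:
        data = f.read()

    prefix_len = len(PREFIX_BYTES)
    start = 0
    results = []

    # Find every occurrence of the marker
    while True:
        pos = data.find(PREFIX_BYTES, start)
        if pos == -1:
            break
        # Keep the data between the previous marker and this one
        if pos > start:
            results.append(data[start:pos])
        start = pos + prefix_len

    # Keep whatever follows the last marker
    if start < len(data):
        results.append(data[start:])

    return results

# The later snippets read this module-level list.
# "backup.bin" is a placeholder path; point it at your own backup file.
results = split_file_by_prefix("backup.bin")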
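When exploring an unknown format like this, it helps to look at the bytes around every marker. The following is a small sketch for that purpose; the helper names are hypothetical and the sample buffer is synthetic, not taken from a real backup.

```python
MARKER = bytes.fromhex("0000746578742F706C61696E24000000")


def find_marker_offsets(data: bytes, marker: bytes = MARKER) -> list:
    """Return the offset of every non-overlapping occurrence of marker."""
    offsets = []
    pos = data.find(marker)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(marker, pos + len(marker))
    return offsets


def hex_context(data: bytes, offset: int, before: int = 8, after: int = 24) -> str:
    """Hex view of the bytes from offset - before up to offset + after."""
    lo = max(0, offset - before)
    return data[lo:offset + after].hex(" ")


# Synthetic buffer containing two markers
sample = b"head" + MARKER + b"A" * 10 + b"\n" + MARKER + b"B" * 10
for off in find_marker_offsets(sample):
    print(off, hex_context(sample, off))
```

Comparing these hex views across many messages is how fixed and variable parts of a record become visible.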

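The text/plain$ string is followed by three \0 bytes and then by a 36-character UUID string. Next comes an integer (read as 4 little-endian bytes in the code), followed by that many bytes of characters of the form {"protocalId":"x"}. Right after that is a 64-bit integer holding the message's Unix timestamp in milliseconds, which makes extracting the time straightforward.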

import struct
import time

def fun(x, if_print=True):
    # Chunk 0 is the file header, so record x starts in results[x + 1]
    x += 1
    # The first 36 bytes are the UUID as ASCII
    first_36_bytes = results[x][:36]
    ascii_string = first_36_bytes.decode('ascii', errors='ignore')
    if if_print: print(ascii_string)
    # results[x][36:40] is the little-endian length of the string that follows
    four_bytes = struct.unpack('<I', results[x][36:40])[0]
    if if_print: print(f"Length: {four_bytes}")

    timestamp_offset = 40 + four_bytes

    # The next four_bytes bytes (typically 18) are ASCII, e.g. {"protocalId":"x"}
    next_18_bytes = results[x][40:timestamp_offset]
    next_ascii_string = next_18_bytes.decode('ascii', errors='ignore')
    if if_print: print(next_ascii_string)
    # Then a 64-bit integer: the Unix timestamp in milliseconds
    timestamp_bytes = results[x][timestamp_offset:timestamp_offset + 8]
    timestamp = struct.unpack('<Q', timestamp_bytes)[0]
    # Convert milliseconds to seconds and format as local time
    readable_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(timestamp / 1000))
    if if_print: print(f"Timestamp: {timestamp}, readable time: {readable_time}")

    return timestamp
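The layout just described can be exercised without a real backup by packing a synthetic record and parsing it back. In this sketch the names RecordHead, parse_record_head, and build_record_head are hypothetical, the length field is assumed to be a 32-bit little-endian integer, and the timestamp is converted in UTC rather than local time so the result does not depend on the machine.

```python
import struct
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class RecordHead:
    uuid: str          # 36-character UUID string
    meta: str          # e.g. {"protocalId":"x"}
    timestamp_ms: int  # Unix time in milliseconds

    def as_utc(self) -> datetime:
        """Timestamp as an aware datetime in UTC."""
        return datetime.fromtimestamp(self.timestamp_ms / 1000, tz=timezone.utc)


def parse_record_head(chunk: bytes) -> RecordHead:
    """Parse the bytes that directly follow a text/plain$ marker."""
    if len(chunk) < 40:
        raise ValueError("chunk too short for UUID and length field")
    record_uuid = chunk[:36].decode("ascii")
    (meta_len,) = struct.unpack_from("<I", chunk, 36)
    meta_end = 40 + meta_len
    if len(chunk) < meta_end + 8:
        raise ValueError("chunk too short for meta string and timestamp")
    meta = chunk[40:meta_end].decode("ascii")
    (timestamp_ms,) = struct.unpack_from("<Q", chunk, meta_end)
    return RecordHead(record_uuid, meta, timestamp_ms)


def build_record_head(record_uuid: str, meta: str, timestamp_ms: int) -> bytes:
    """Pack a synthetic record head in the assumed layout (testing only)."""
    meta_bytes = meta.encode("ascii")
    return (record_uuid.encode("ascii")
            + struct.pack("<I", len(meta_bytes))
            + meta_bytes
            + struct.pack("<Q", timestamp_ms))


sample = build_record_head("123e4567-e89b-12d3-a456-426614174000",
                           '{"protocalId":"x"}', 1_700_000_000_000)
head = parse_record_head(sample + b"trailing bytes")
print(head.uuid, head.meta, head.as_utc().isoformat())
```

Raising on short chunks, instead of decoding with errors ignored, makes malformed records show up early.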

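As for the message body, note that the text/plain$ string is always preceded by U+000A, i.e. a line feed, and right before that is the message's UTF-8 text. The text itself is preceded by a certain prefix; see the code for details. One byte in that prefix (offset 21 of the 25-byte pattern, followed by three zero bytes) indicates the message status: \x00 means received, \x08 means sent, and so on. Before this prefix comes the phone number data. With the non-printable characters removed it reads sms (+86)num num slot x (+86)num, where num is the phone number and x is the SIM slot number, and the gap between x and the final (+86)num is filled with \x1d\x1e. This is what we use to extract the phone number.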

def txt(x):
    # Search backwards for the 25-byte prefix that precedes the text. Observed variants:
    #   1E 00 00 00 00 04 00 00 00 02 00 00 00 00 00 00 00 05 00 00 00 00 00 00 00
    #   1E 00 00 00 00 06 00 00 00 02 00 00 00 00 00 00 00 04 00 00 00 08 00 00 00
    #   1E 00 00 00 00 05 00 00 00 02 00 00 00 00 00 00 00 04 00 00 00 08 00 00 00
    # All of them share the first five bytes, so those are what we search for.
    TEXT_PREFIX = b'\x1e\x00\x00\x00\x00'
    text_pos = results[x].rfind(TEXT_PREFIX)
    num = ""
    text = ""
    is_send = False
    if text_pos != -1:
        # Search backwards from text_pos for 30 1D 1D 1D, which precedes the number
        num_pos = results[x].rfind(b'\x30\x1d\x1d\x1d', 0, text_pos)
        if num_pos != -1:
            num_field = results[x][num_pos + 4:text_pos]
            plus_index = num_field.rfind(b'\x1d\x1e')
            # The number follows the 1D 1E filler. If 1D 1E is not found,
            # rfind returns -1 and the slice simply starts at index 1.
            num = num_field[plus_index + 2:].decode('utf-8', errors='ignore')
        text_start = text_pos + 25
        text_end = results[x].find(b'\x0a\x00', text_start)

        if text_end != -1:
            # Skip 12 more bytes after the prefix; the text runs up to 0A 00
            text_bytes = results[x][text_start+12:text_end]
            text = text_bytes.decode('utf-8', errors='ignore')
            # Byte 21 of the prefix is the status: 0x08 means sent
            if results[x][text_pos+21] == 0x08:
                is_send = True
    return num, text, is_send

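At this point we can extract each message's time, text, and sender number for further processing and storage. The remaining binary segments are left for further analysis. My guess is that they include fields related to the RCS service, which makes them hard to analyze with samples collected in China.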

table = []
for i in range(len(results) - 1):
    timestamp = fun(i, if_print=False)
    num, text, is_send = txt(i)
    # Report records where a field is missing
    if timestamp is None or num == "" or text == "":
        if not (timestamp is None and num == "" and text == ""):
            # The UUID of record i sits at the start of results[i + 1]
            uuid = results[i + 1][:36].decode('ascii', errors='ignore')
            print(f"Record {i} is incomplete, uuid: {uuid}, timestamp: {timestamp}, num: {num}, text: {text}")
        else:
            print(f"Record {i} is empty")
    table.append([timestamp, num, text, is_send])
for i in range(len(table)):
    # Milliseconds to seconds, then to a readable local time
    table[i][0] = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(table[i][0] / 1000))
    # Drop the +86 country code
    if table[i][1].startswith('+86'):
        table[i][1] = table[i][1][3:]
    # Escape line breaks and tabs so each message stays on one line
    table[i][2] = table[i][2].replace('\n', '\\n').replace('\r', '\\r').replace('\t', '\\t')
table.sort()
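For plain storage, one simple option is a CSV file. The sketch below assumes rows shaped like the entries of table, that is [time string, number, text, is_send]; the function name, the column names, and the sample rows are illustrative only.

```python
import csv
import io


def rows_to_csv(rows) -> str:
    """Serialize [time, number, text, is_send] rows as CSV text."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf)
    writer.writerow(["time", "number", "text", "is_send"])
    for when, number, text, is_send in rows:
        # Store the direction flag as 0/1 for easy filtering later
        writer.writerow([when, number, text, int(is_send)])
    return buf.getvalue()


# Made-up sample rows, not taken from a real backup
sample_rows = [
    ["2024-01-01 08:00:00", "10086", "SOME_TEXT", False],
    ["2024-01-01 08:05:00", "10086", "reply, with a comma", True],
]
print(rows_to_csv(sample_rows))
```

The csv module takes care of quoting, so commas and quotes inside a message do not break the columns.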

Import

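All that is left is to import the extracted messages into the new phone with an app such as 简单短信 (Simple SMS Messenger) or the more specialized SMS Import / Export. The former uses JSON, the latter uses NDJSON compressed into a zip file, and the two differ slightly in their fields.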

For the former, an example looks like this:

[
  {
    "subscriptionId": 3,
    "address": "10086",
    "body": "SOME_TEXT",
    "date": $UNIX_TIMESTAMP,
    "dateSent": 0,
    "locked": 0,
    "protocol": null,
    "read": 1,
    "status": -1,
    "type": 1,
    "serviceCenter": null,
    "backupType": "sms"
  },
  ...
]

And for the latter:

{"_id": "1", "thread_id": "6", "address": "10086", "date": "$UNIX_TIMESTAMP", "date_sent": "0", "read": "1", "status": "-1", "type": "1", "subject": "", "body": "SOME_TEXT", "locked": "0", "sub_id": "3", "error_code": "0", "creator": "$YOUR_SMS_APP_PACKAGE_NAME", "seen": "1"}
...

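As can be seen, for the former we only need to fill in the address, body, date, and type fields, where type 1 means received and 4 is used here for sent (strictly speaking, in Android's Telephony constants 4 is the outbox type and MESSAGE_TYPE_SENT is 2). For the latter, _id has to be filled in sequentially, while thread_id has no effect.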

From here it is easy to write a script that converts the backup into the formats above. Below is a Python example that produces the former:

import json
import time

json_list = []
for row in table:
    address = row[1]
    body = row[2]
    # Convert the time string back to a timestamp in milliseconds
    time_struct = time.strptime(row[0], "%Y-%m-%d %H:%M:%S")
    date = int(time.mktime(time_struct)) * 1000
    # 1 = received, 4 = sent as used in this article
    type_ = 4 if row[3] else 1
    json_list.append({
        "subscriptionId": 3,
        "address": address,
        "body": body,
        "date": date,
        "dateSent": 0,
        "locked": 0,
        "protocol": None,
        "read": 1,
        "status": -1,
        "type": type_,
        "serviceCenter": None,
        "backupType": "sms"
    })

with open("output.json", "w", encoding="utf-8") as f:
    json.dump(json_list, f, ensure_ascii=False, indent=2)
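The latter format can be produced along the same lines. The sketch below writes one JSON object per line with every value as a string, mirroring the sample record shown earlier, and zips the result in memory. The entry name messages.ndjson, the function names, the constant thread_id, and the sample rows are assumptions of mine; check what the app actually expects before importing.

```python
import io
import json
import zipfile


def rows_to_ndjson(rows, creator: str = "$YOUR_SMS_APP_PACKAGE_NAME") -> str:
    """rows: iterable of (timestamp_ms, address, body, is_send)."""
    lines = []
    for index, (timestamp_ms, address, body, is_send) in enumerate(rows, start=1):
        record = {
            "_id": str(index),   # filled in sequentially
            "thread_id": "0",    # reported to have no effect; constant here
            "address": address,
            "date": str(timestamp_ms),
            "date_sent": "0",
            "read": "1",
            "status": "-1",
            "type": "4" if is_send else "1",  # same values as the JSON variant
            "subject": "",
            "body": body,
            "locked": "0",
            "sub_id": "3",       # copied from the sample record
            "error_code": "0",
            "creator": creator,  # placeholder, as in the sample record
            "seen": "1",
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


def ndjson_to_zip(ndjson_text: str, entry_name: str = "messages.ndjson") -> bytes:
    """Return zip archive bytes holding the NDJSON text under entry_name."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(entry_name, ndjson_text.encode("utf-8"))
    return buf.getvalue()


# Made-up sample rows, not taken from a real backup
sample_rows = [
    (1_700_000_000_000, "10086", "SOME_TEXT", False),
    (1_700_000_060_000, "10086", "reply", True),
]
ndjson_text = rows_to_ndjson(sample_rows)
archive_bytes = ndjson_to_zip(ndjson_text)
print(ndjson_text)
```

Writing archive_bytes to a .zip file in binary mode then gives a file to try importing.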

Closing thoughts

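As Chapter 5 of TAOUP (The Art of UNIX Programming), "Textuality: Good Protocols Make Good Practice", puts it, text-based protocols make data easier to process and may even perform better. In everyday practice, avoiding raw binary formats benefits both the programmer and the user.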

Good old 🗾 IT is 💊 (done for).