Method & Data Caveats

How it was built and where its limits are: human-paced scraping, sample bounds, cross-platform traps, keyword pitfalls

Updated 2026-06-24

Before you trust a single "hot-take topic" later in this handbook, finish this chapter. The value of a topic playbook is not how clever its conclusions look — it's whether it dares to lay its own data limits bare. This chapter hands you the knife: where you can cut, and where you must not.

The Method in One Sentence

We bought no paid data API from any platform. The entire dataset was collected with a real, logged-in browser moving at a "browse like a human" pace: slow and clumsy, but designed to stay under anti-bot thresholds (Xiaohongshu's soft defense, covered below, was the exception), and what it sees is exactly the ranking a normal user sees when searching. Conclusions rest on publicly visible data from two platforms, Bilibili and Douyin. Xiaohongshu failed this round and is deferred to v2. Details below.

Collection Method: Why a Browser, Not an API

Dimension	What we did	Why
Driver	Chrome DevTools Protocol (CDP) driving a logged-in browser	Captures the real user-perspective ranking and visible data, not an API's "idealized" feed
Pace	Human-like: Bézier-curve mouse paths, variable-speed scroll, reading pauses, 6–13s gaps between keywords	Constant high-frequency requests are the #1 ban trigger; a human cadence hides the behavior inside the normal-user distribution
Account safety	Single account, low frequency, serial	Better slow than banned; once flagged, data continuity breaks — worse than slow
Bilibili add-on	In addition to browser scraping, one call to a public API per item	To recover the triple-interaction rate (like + coin + favorite) and the favorite rate, which the list page does not show

The core trade-off in one line: we accept slowness in exchange for realism and an unbanned account. That means a modest sample size (see below), but every item is content that was genuinely visible in a genuine ranking, not polluted by API-side weighting.
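The pacing described in the table can be illustrated with a minimal sketch. It is not the handbook's actual scraper: the function names, the uniform distribution for pauses, and the way control points are jittered are all assumptions made for illustration. Only the 6–13 s window and the use of Bézier curves come from the text.

```python
import random

def keyword_pause(rng, low=6.0, high=13.0):
    """Pick a pause in seconds between keywords, inside the 6-13 s window.
    A uniform draw is an assumption; the source only gives the range."""
    return rng.uniform(low, high)

def bezier_path(start, end, rng, steps=20, jitter=80.0):
    """Return points along a cubic Bezier curve from start to end.
    Control points sit near 1/3 and 2/3 of the straight line and are
    randomly offset (hypothetical choice) so paths are never identical."""
    (x0, y0), (x3, y3) = start, end
    x1 = x0 + (x3 - x0) / 3 + rng.uniform(-jitter, jitter)
    y1 = y0 + (y3 - y0) / 3 + rng.uniform(-jitter, jitter)
    x2 = x0 + 2 * (x3 - x0) / 3 + rng.uniform(-jitter, jitter)
    y2 = y0 + 2 * (y3 - y0) / 3 + rng.uniform(-jitter, jitter)
    points = []
    for i in range(steps + 1):
        t = i / steps
        # Bernstein weights of a cubic Bezier curve
        a = (1 - t) ** 3
        b = 3 * (1 - t) ** 2 * t
        c = 3 * (1 - t) * t ** 2
        d = t ** 3
        points.append((a * x0 + b * x1 + c * x2 + d * x3,
                       a * y0 + b * y1 + c * y2 + d * y3))
    return points

rng = random.Random(7)  # seeded only so the sketch is repeatable
path = bezier_path((0.0, 0.0), (100.0, 50.0), rng)
pause = keyword_pause(rng)
```

The path always starts and ends on the requested coordinates; only the bend in between varies, which is the property that makes a cursor trace look hand-drawn rather than ruler-straight.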

Pipeline: Serial Scraping + Parallel Analysis

The whole flow is split into a "touches-the-browser" half and a "doesn't" half, scheduled deliberately apart:

Stage	Touches browser?	Scheduling	Reason
Collection (search keywords, scroll lists, grab details)	Yes	Strictly serial	Browser concurrency = anomalous concurrent requests = a ban signal; it must queue
Cleaning / dedup / interaction-rate calc / topic clustering	No	Multiple subagents in parallel	Pure text computation has no ban risk; parallelism only saves time
Synthesis (cross-validation, chapter writing)	No	One workflow consolidates	Funnels parallel outputs into one consistent conclusion, so agents don't contradict each other

One thing to remember: the slow part (scraping) is forced to be slow; the fast part (computing) we've already pushed to the limit. Long collection time does not mean sloppy analysis.
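The serial/parallel split can be sketched as follows. This is an illustration under assumptions, not the handbook's orchestration code: `collect_serial`, `analyze_parallel`, and `fake_fetch` are hypothetical names, and a thread pool stands in for the "multiple subagents" the table mentions.

```python
from concurrent.futures import ThreadPoolExecutor

def collect_serial(keywords, fetch):
    """Collection stage: one keyword at a time, never concurrently.
    In a real run a human-paced pause would follow each fetch."""
    results = {}
    for kw in keywords:
        results[kw] = fetch(kw)
    return results

def analyze_parallel(raw, analyze, max_workers=4):
    """Analysis stage: pure computation, so fanning out is safe."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {kw: pool.submit(analyze, items) for kw, items in raw.items()}
        return {kw: fut.result() for kw, fut in futures.items()}

def fake_fetch(keyword):
    """Stand-in for the browser step; returns three dummy items per keyword."""
    return [f"{keyword}-video-{i}" for i in range(3)]

raw = collect_serial(["Zhuangzi", "Tao Te Ching"], fake_fetch)
summary = analyze_parallel(raw, len)  # item count per keyword
```

Keeping the browser-touching function behind a plain loop makes it hard to parallelize collection by accident, while the analysis side can scale its worker count freely.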

Data Scope: What We Scraped, and How Much

Platform	Keywords	Per keyword	After dedup	Deep-dive sample	Status
Bilibili	Zhuangzi / Tao Te Ching / Laozi / Liezi	84 each	167 items	12 items (full interaction rates)	✅ Valid
Douyin	Zhuangzi philosophy / Tao Te Ching readings / wu-wei	—	41 items after cleaning	—	✅ Valid
Xiaohongshu	Daoism-related	—	—	—	⚠️ Failed this round, downgraded to v2

Why Xiaohongshu failed (honest disclosure): the platform applies a soft anti-bot defense — it doesn't ban the account, it just refuses to render note content (the page frame loads, the body stays empty). What comes back is a hollow shell. We refused to force conclusions from broken data, so it's flagged v2 to-do rather than padded out.
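A check for this "frame loads, body empty" failure could look like the sketch below. The function name, its inputs, and the 20-character threshold are assumptions for illustration; the source does not say how the empty shells were detected.

```python
def is_hollow_shell(page_title, body_text, min_chars=20):
    """Return True when the page frame loaded (a title is present)
    but the note body is effectively empty.
    min_chars is a hypothetical threshold, not a value from the source."""
    frame_loaded = bool(page_title.strip())
    body_empty = len(body_text.strip()) < min_chars
    return frame_loaded and body_empty

# A rendered frame with a blank body is flagged; a full note is not.
flagged = is_hollow_shell("Some note title", "   ")
normal = is_hollow_shell("Some note title", "x" * 200)
```

Flagging such records at capture time, rather than during analysis, keeps empty shells from silently inflating the sample count.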

Honest sample-size statement: 167 Bilibili items, only 12 deep-dived; 41 Douyin items — this is a qualitative / trend-level sample, not a statistically significant large one. The topic calls in this handbook are "strong directional signals," not "iron laws validated by the law of large numbers." Wherever a specific number appears, we tag its source and capture basis.
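The gap between "84 each" and "167 after dedup" comes from the same video surfacing under several keywords. A minimal sketch of that merge, plus a per-view rate, is shown below. The record fields (`id`, `views`, `favorites`) and all numbers are made up for illustration, and the exact formula behind the handbook's interaction rates is not stated in the source, so only a plain count-over-views ratio is shown.

```python
def dedup_by_id(batches):
    """Merge per-keyword result lists, keeping the first record seen per id."""
    unique = {}
    for batch in batches:
        for item in batch:
            unique.setdefault(item["id"], item)
    return list(unique.values())

def rate(item, field):
    """Return item[field] / views, or None when views is missing or zero."""
    views = item.get("views", 0)
    if views <= 0:
        return None
    return item[field] / views

# Dummy records: "v2" appears under both keywords and is kept once.
zhuangzi = [{"id": "v1", "views": 1000, "favorites": 50},
            {"id": "v2", "views": 400, "favorites": 10}]
laozi = [{"id": "v2", "views": 400, "favorites": 10},
         {"id": "v3", "views": 0, "favorites": 0}]

merged = dedup_by_id([zhuangzi, laozi])
favorite_rate = rate(merged[0], "favorites")
```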

Critical Data Boundary: Never Compare Absolute Numbers Across Platforms (the most important rule)

This is the one trap in the entire handbook that can make you misread the whole picture. Burn it in:

Platform	What the list number is	Default sort
Bilibili	View count (comparable across Bilibili videos)	Search results sorted by views
Douyin	Like count — NOT views	Composite ranking (not by likes; folds in recommendation weight, recency, completion rate, etc.)

Two hard rules follow:

A Douyin "like" ≠ a Bilibili "view"; they differ by one to two orders of magnitude by nature — never compare them directly. A Douyin video with 100k likes and a Bilibili video with 100k views carry completely different content energy; charting their absolute values side by side is misleading.
Douyin is composite-ranked; ranking high ≠ most-liked. Don't read "first Douyin search result" as "highest-liked in this niche" — the recommender may have pushed it up based on your account profile plus recency.

Wherever this handbook later shows numbers from both platforms, we compare only within each platform (who beats whom), never absolute values across platforms. Treat any cross-platform absolute-number comparison as wrong by default.
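One way to enforce the within-platform rule in code is to convert each raw number into a rank computed only against items from the same platform. The sketch below assumes hypothetical records with `platform` and `metric` fields and uses a simple percentile rank; it is an illustration, not the handbook's analysis code.

```python
def within_platform_rank(items):
    """Attach a percentile rank computed only against same-platform items.
    'metric' is views on Bilibili and likes on Douyin, so raw values
    are never compared across platforms."""
    by_platform = {}
    for item in items:
        by_platform.setdefault(item["platform"], []).append(item)
    ranked = []
    for group in by_platform.values():
        values = [g["metric"] for g in group]
        for g in group:
            # share of same-platform items with a metric <= this one
            pct = sum(v <= g["metric"] for v in values) / len(values)
            ranked.append({**g, "pct_rank": pct})
    return ranked

# Dummy data for illustration only.
items = [
    {"platform": "bilibili", "metric": 100},
    {"platform": "bilibili", "metric": 1000},
    {"platform": "douyin", "metric": 50},
    {"platform": "douyin", "metric": 500},
    {"platform": "douyin", "metric": 5000},
]
ranked = within_platform_rank(items)
```

After this step, a Bilibili item and a Douyin item can each be described as "top of its own platform's sample" without their raw numbers ever sharing an axis.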

The Keyword Trap: Single-Character Keywords Collapse Across the Guoxue Niche

This is the most counterintuitive, and most valuable, finding from the collection phase. The core terms of the Daoism / metaphysics / guoxue niche are often bare one- or two-character words, and such short Chinese terms get buried under noise in search. Measured results:

What you want	The junk you actually get	Source of noise
Laozi (the sage)	"Laozi wo…" (slang) dominates	"Laozi" is a high-frequency slang first-person ("yours truly / I, your daddy"); its volume buries the philosophy entirely
Liezi (the philosopher)	Floods of "lìzi" (= "example") content	Users treat "Liezi" as a typo for "example"; search engines conflate them
Wu-wei (Daoist concept)	"Wuwei City" (Anhui) place-name content	"Wuwei" is a real county-level city in Anhui; local/news content dilutes the philosophical meaning

Iron rule of operation: in the guoxue niche, never use a single character / single concept as your keyword. You must upgrade to two-word + intent-word combinations:

Broken query	Fixed query	Principle
Laozi	Laozi + Tao Te Ching / Laozi + philosophy / Laozi + wisdom	Two words lock the meaning, exclude the slang
Liezi	Liezi + fable / Liezi + "riding the wind"	Add a signature image, exclude the "example" typo
Wu-wei	Wu-wei + Daoism / "wu-wei er zhi" (govern by non-action)	Add a domain word or complete the idiom, exclude the place name

This directly dictates what titles / tags / search terms you should use when making content — single-char terms don't just trap you at collection time; users can't find you and the platform can't disambiguate you. It's a double failure.
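The rewrite rule in the table can be sketched as a small lookup that expands an ambiguous bare term into its qualified queries. The dictionary and function names are hypothetical; the Chinese strings are the original-language forms of the queries listed in the table (Laozi, Liezi, wu-wei and their qualifiers).

```python
# Ambiguous bare terms mapped to the qualifiers from the table above.
QUALIFIERS = {
    "老子": ["道德经", "哲学", "智慧"],  # Laozi + Tao Te Ching / philosophy / wisdom
    "列子": ["寓言", "御风"],            # Liezi + fable / riding the wind
    "无为": ["道家", "而治"],            # wu-wei + Daoism / "wu-wei er zhi"
}

def safe_queries(term, qualifiers=QUALIFIERS):
    """Expand a known-ambiguous bare term into qualified queries.
    Terms not in the table are returned unchanged."""
    if term not in qualifiers:
        return [term]
    return [term + q for q in qualifiers[term]]

laozi_queries = safe_queries("老子")
untouched = safe_queries("庄子哲学")  # already qualified, passes through
```

The same table can double as a checklist for titles and tags: any title whose only niche term is a key of `QUALIFIERS` is a candidate for rewording.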

How to Use This Chapter

Walk away with three things: (1) this handbook's data is browser-scraped, skews qualitative, and has explicit limits; (2) trust no cross-platform absolute-number comparison between Douyin and Bilibili — only the relative structure within each platform; (3) in the guoxue niche, single-char keywords are dead on arrival — titles and targeting must always be two-word + intent-word. Every "hot take" in later chapters is built on this calibration — get the calibration wrong, and the hottest take is a castle in the air.

Occasionally someone asks what the point of studying this even is. Honestly, anyone who spends the long haul making content about loneliness and "the moments you can't hold it together" eventually needs a place that can hold their own feelings too — that's one reason AnXinShe exists. But that's another topic; this handbook is about one thing only: how to make Daoism worth watching.