Louis Rossmann 820b0a1f77 Generalize banned-words exclusion zones (drop wiki-specific markup refs)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-24 19:28:54 -05:00

AI Writing Detection

Words, phrases, punctuation patterns, structural signals, and statistical measures commonly associated with AI-generated text. Avoid these to ensure writing sounds natural and human.

Sources: Grammarly (2025), Microsoft 365 Life Hacks (2025), GPTHuman (2025), Walter Writes (2025), Textero (2025), Plagiarism Today (2025), Rolling Stone (2025), MDPI Blog (2025), isgpt.org corpus analysis (2025), ACL hedging study (2024), Wikipedia AI content detection project (2025), Segmental entropy research (arxiv, 2025)

Em Dashes: The Primary AI Tell
Overused Verbs
Overused Adjectives
Overused Metaphorical Nouns (2025-2026)
Overused Transitions and Connectors
Phrases That Signal AI Writing (Opening, Transitional, Concluding, Structural, Inflated Symbolism)
Heading Anti-Patterns
Filler Words and Empty Intensifiers
Academic-Specific AI Tells
Hallucinated Markup Artifacts
Hedging and Epistemic Modality Overload
Structural and Statistical Patterns
Model-Family-Specific Tells
False Positive Prevention
How to Self-Check

Em Dashes: The Primary AI Tell

The em dash (—) has become one of the most reliable markers of AI-generated content.

Em dashes are longer than hyphens (-) and are used for emphasis, interruptions, or parenthetical information. While they have legitimate uses in writing, AI models drastically overuse them.

Why Em Dashes Signal AI Writing

AI models were trained on edited books, academic papers, and style guides where em dashes appear frequently
AI uses em dashes as a shortcut for sentence variety instead of commas, colons, or parentheses
Most human writers rarely use em dashes because the character has no dedicated key on a standard keyboard
The overuse is so consistent that it has become the unofficial signature of ChatGPT writing

What To Do Instead

Instead of	Use
The results—which were surprising—showed...	The results, which were surprising, showed...
This approach—unlike traditional methods—allows...	This approach, unlike traditional methods, allows...
The study found—as expected—that...	The study found, as expected, that...
Communication skills—both written and verbal—are essential	Communication skills (both written and verbal) are essential

Guidelines

Use commas for most parenthetical information
Use colons to introduce explanations or lists
Use parentheses for supplementary information
Reserve em dashes for rare, deliberate emphasis only
If you find yourself using more than one em dash per page, revise
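
The one-per-page guideline can be checked mechanically. The sketch below is a minimal illustration, not part of the original guide: the function names are hypothetical, and it assumes the caller has already split the document into pages, since the guide does not define a page length.

```python
EM_DASH = "\u2014"  # the em dash character, written as an escape


def em_dash_count(text: str) -> int:
    """Count em dash characters in the given text."""
    return text.count(EM_DASH)


def needs_em_dash_revision(page_text: str, limit: int = 1) -> bool:
    """Return True when a page holds more em dashes than the limit.

    The default limit of 1 mirrors the guideline above; what counts
    as a "page" is left to the caller (an assumption of this sketch).
    """
    return em_dash_count(page_text) > limit
```

Hyphens and en dashes are deliberately ignored, because the guideline targets the em dash only.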

Overused Verbs

Avoid	Use Instead
delve (into)	explore, examine, investigate, look at
leverage	use, apply, draw on
optimise	improve, refine, tune
utilise	use
facilitate	help, enable, support
foster	encourage, support, develop, nurture
bolster	strengthen, support, reinforce
underscore	emphasise, highlight, stress
unveil	reveal, show, introduce, present
navigate	manage, handle, work through
streamline	simplify, make more efficient
enhance	improve, strengthen
endeavour	try, attempt, strive
ascertain	find out, determine, establish
elucidate	explain, clarify, make clear

Overused Adjectives

Avoid	Use Instead
robust	strong, reliable, thorough, solid
comprehensive	complete, thorough, full, detailed
pivotal	key, critical, central, important
crucial	important, key, essential, critical
vital	important, essential, necessary
transformative	significant, important, major
cutting-edge	new, advanced, recent, modern
groundbreaking	new, original, significant
innovative	new, original, creative
seamless	smooth, easy, effortless
intricate	complex, detailed, complicated
nuanced	subtle, complex, detailed
multifaceted	complex, varied, diverse
holistic	complete, whole, overall

Overused Metaphorical Nouns (2025-2026)

AI models use these nouns metaphorically to inject false gravitas. Literal uses are fine.

Avoid (metaphorical)	Acceptable (literal)
tapestry ("a tapestry of regulations")	tapestry (actual woven fabric)
symphony ("a symphony of features")	symphony (actual musical composition)
beacon ("a beacon of hope")	beacon (actual light or signal device)
realm ("in the realm of cybersecurity")	realm (actual kingdom or territory)
testament ("a testament to innovation")	testament (actual legal document, e.g., last will and testament)

Overused Transitions and Connectors

Avoid	Use Instead
furthermore	also, in addition, and
moreover	also, and, besides
notwithstanding	despite, even so, still
that being said	however, but, still
at its core	(cut the phrase and state the point directly)
to put it simply	in short, simply put
it is worth noting that	note that, importantly
in the realm of	in, within, regarding
in the landscape of	in, within
in today's [anything]	currently, now, today

Phrases That Signal AI Writing

Opening Phrases to Avoid

"In today's fast-paced world..."
"In today's digital age..."
"In an era of..."
"In the ever-evolving landscape of..."
"In the realm of..."
"It's important to note that..."
"Let's delve into..."
"Imagine a world where..."

Transitional Phrases to Avoid

"That being said..."
"With that in mind..."
"It's worth mentioning that..."
"At its core..."
"To put it simply..."
"In essence..."
"This begs the question..."

Concluding Phrases to Avoid

"In conclusion..."
"To sum up..."
"By [doing X], you can [achieve Y]..."
"In the final analysis..."
"All things considered..."
"At the end of the day..."

Structural Patterns to Avoid

"Whether you're a [X], [Y], or [Z]..." (listing three examples after "whether")
"It's not just [X], it's also [Y]..."
"Think of [X] as [elaborate metaphor]..."
Starting sentences with "By" followed by a gerund: "By understanding X, you can Y..."
Contrasting parallelisms: "It's not X. It's Y." or "It's not about X, it's about Y." More than two of these in a 500-word block is a high-confidence AI indicator.

Inflated Symbolism Phrases (2025-2026 AI Tells)

These multi-word phrases appear hundreds of times more frequently in AI-generated text than in human baselines (corpus analysis, isgpt.org 2025):

"provide a valuable insight" (468x more frequent in AI text)
"left an indelible mark" (317x)
"play a significant role in shaping" (207x)
"an unwavering commitment" (202x)
"open a new avenue" (174x)
"a stark reminder" (166x)
"gain a comprehensive understanding" (120x)
"serves as a testament"
"watershed moment"
"deeply rooted"

Heading Anti-Patterns

AI-generated content frequently uses narrative, dramatic, or clickbait heading structures that read like thriller chapter titles. These patterns signal low-effort AI writing even when the body text is clean. All headings must describe the section content directly and technically.

Banned Heading Structures

Pattern	Bad Example	Good Replacement
"The [Concept] Trap"	"The Initialization Trap"	"Import vs. Initialize: DDF Metadata Destruction Risk"
"The [Adjective] [Noun]" drama	"The Hidden Danger"	"Firmware Corruption After Sudden Power Loss"
"The [Noun] [Dramatic Noun]"	"The Silent Killer"	"Gradual Bad Sector Growth on Aging Platters"
"Why [Action] [Dramatic Verb] [Object]"	"Why Rebuilding Destroys Everything"	"How Forced Rebuilds Overwrite Parity on Degraded Arrays"
"[Noun]: The [Adjective] [Noun]"	"Encryption: The Hidden Trap"	"Hardware AES-256 Encryption on WD Passport Bridge Boards"
"The [Noun] You [Emotion Verb]"	"The Risk You Overlook"	"Unmonitored SMART Threshold Warnings"

How to Self-Check Headings

Could this heading serve as a thriller chapter title or YouTube clickbait thumbnail? If yes, rewrite it.
Does the heading describe what the section contains, or does it tease it? Headings describe; they do not tease.
Remove "The" from the beginning of any heading and check if it still uses a dramatic noun pairing. If so, rewrite.
A good heading reads like an entry in a technical manual index: specific, descriptive, and boring to non-specialists.

Filler Words and Empty Intensifiers

These words often add nothing to meaning. Remove them or find specific alternatives:

absolutely
actually
basically
certainly
clearly
definitely
essentially
extremely
fundamentally
incredibly
interestingly
naturally
obviously
quite
really
significantly
simply
surely
truly
ultimately
undoubtedly
very

Academic-Specific AI Tells

Avoid	Use Instead
shed light on	clarify, explain, reveal
pave the way for	enable, allow, make possible
a myriad of	many, numerous, various
a plethora of	many, numerous, several
paramount	most important, essential, critical
pertaining to	about, regarding, concerning
prior to	before
subsequent to	after
in light of	because of, given, considering
with respect to	about, regarding, for
in terms of	regarding, for, about
the fact that	that (or rewrite sentence)

Hallucinated Markup Artifacts

When AI generates wikitext, it sometimes hallucinates citation markup from its training data. These are 100% confidence indicators of unedited AI output:

Artifact	Origin
`oaicite`	OpenAI ChatGPT citation placeholder
`contentReference`	OpenAI internal reference tag
`grok_card`	xAI Grok citation tag
`attributableIndex`	AI attribution tracking artifact
`turn0search0`	ChatGPT search result placeholder

Any occurrence of these strings in wikitext means the text was pasted from an AI tool without editing. Zero tolerance.
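
A scan for these strings is a plain substring search. The following is a minimal sketch under that assumption; `ARTIFACTS` copies the table above and `find_artifacts` is a hypothetical helper name.

```python
# Artifact strings copied from the table above.
ARTIFACTS = (
    "oaicite",
    "contentReference",
    "grok_card",
    "attributableIndex",
    "turn0search0",
)


def find_artifacts(text: str) -> dict[str, int]:
    """Map each artifact found in the text to its occurrence count.

    Matching is exact and case-sensitive; artifacts that do not
    appear are left out of the result.
    """
    return {a: text.count(a) for a in ARTIFACTS if a in text}
```

An empty dictionary means the text is clean with respect to this list. Variants not listed in the table would not be caught.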

Hedging and Epistemic Modality Overload

AI models hedge 4-7x more than human writers (ACL 2024 study, 12,000 technical documents). Because models are trained to avoid stating hallucinations as facts, they default to blanket hedging even for established facts.

Hedging Markers

Epistemic modals (45% of AI hedges): may, might, could, potentially
Cognitive verbs (25%): I think, I believe, it seems, it appears
Adverbs of limitation (20%): probably, generally, usually, arguably, likely
Explicit uncertainty markers: unclear, remains to be seen, further research is needed

Thresholds

Per-paragraph: More than 3 hedging instances in a single paragraph warrants scrutiny
Per-1000-words: More than 8 hedging markers per 1,000 words in declarative sections (Background, History, Timeline) indicates AI generation. These sections state established facts.
Appropriate hedging: Sections discussing pending legislation, ongoing litigation, or genuinely disputed facts should hedge. Do not flag hedging in those contexts.
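
Both thresholds can be approximated with a word-list count. The sketch below assumes the marker list under Hedging Markers is the full vocabulary and that paragraphs are separated by blank lines. All names are hypothetical. It cannot tell appropriate hedging from blanket hedging, and it will count the month "May" as a modal, so hits need human review.

```python
import re

# Markers copied from the Hedging Markers list.
HEDGES = [
    "may", "might", "could", "potentially",
    "I think", "I believe", "it seems", "it appears",
    "probably", "generally", "usually", "arguably", "likely",
    "unclear", "remains to be seen", "further research is needed",
]

# Longest markers first so multi-word phrases win over shorter ones.
_HEDGE_RE = re.compile(
    r"\b(?:"
    + "|".join(re.escape(h) for h in sorted(HEDGES, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def hedge_count(text: str) -> int:
    """Count hedging markers in the text (case-insensitive)."""
    return len(_HEDGE_RE.findall(text))


def hedges_per_1000_words(text: str) -> float:
    """Hedging markers per 1,000 whitespace-separated words."""
    words = len(text.split())
    return 1000 * hedge_count(text) / words if words else 0.0


def flag_paragraphs(text: str, limit: int = 3) -> list[int]:
    """Return indexes of paragraphs with more than `limit` markers."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [i for i, p in enumerate(paragraphs) if hedge_count(p) > limit]
```

A declarative section would be flagged under the second threshold when `hedges_per_1000_words` returns more than 8.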

AI Hedging Phrases to Flag

"It is worth noting that..."
"It should be noted that..."
"One could argue that..."
"While X, Y remains..."
"Though precise thresholds can vary depending on..."
"It is widely acknowledged that..."

Human vs. AI Hedging

Humans hedge contextually, grounding uncertainty in specific evidence: "The FTC's 2024 enforcement data suggests a 12% increase."
AI hedges with blanket qualifiers on established facts: "It is widely acknowledged that repair restrictions may potentially impact consumers."

Structural and Statistical Patterns

Beyond lexical tells, AI text exhibits measurable structural uniformity that human writing does not.

Paragraph Length Uniformity

AI aims for visual symmetry. Paragraphs tend toward identical sentence counts (typically 3-4 sentences each). Human writing varies paragraph length based on sub-topic complexity.

Threshold: If all paragraphs in a section are within 15% of each other in word count, the section is likely AI-generated.
Exception: Bulleted lists, tables, and template fields are structurally uniform by design.
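
"Within 15% of each other" can be read more than one way. The sketch below assumes one reading: the gap between the longest and shortest paragraph is at most 15% of the longest. The function name is hypothetical, and the caller is expected to leave out lists, tables, and template fields per the exception above.

```python
def paragraphs_are_uniform(paragraphs: list[str], tolerance: float = 0.15) -> bool:
    """Return True when paragraph word counts sit within the tolerance.

    Assumed interpretation: (longest - shortest) / longest <= tolerance.
    Sections with fewer than two paragraphs are never flagged.
    """
    counts = [len(p.split()) for p in paragraphs if p.strip()]
    if len(counts) < 2:
        return False
    longest = max(counts)
    return (longest - min(counts)) / longest <= tolerance
```

For example, paragraphs of 100 and 90 words are flagged (a 10% gap), while 100 and 50 words are not.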

Sentence Length Uniformity (Burstiness)

Human writing alternates between short, punchy sentences and long, clause-heavy ones. AI sentences cluster uniformly around 15-20 words.

Threshold: If a 500-word block contains no sentences under 8 words or over 30 words, it lacks human burstiness.
Human baseline: Human text exhibits 3+ distinct syntactic patterns per 100 words. AI text shows 1.5 or fewer.
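
The 8-word and 30-word bounds can be tested with a simple sentence splitter. This is a rough sketch with hypothetical names: it splits on whitespace that follows `.`, `!`, or `?`, so abbreviations such as "e.g." will be split incorrectly, and it does not attempt the syntactic-pattern baseline.

```python
import re


def sentence_lengths(text: str) -> list[int]:
    """Word count of each sentence, using a naive punctuation split."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [len(s.split()) for s in sentences if s]


def lacks_burstiness(block: str, short: int = 8, long: int = 30) -> bool:
    """Return True when no sentence is under `short` or over `long` words.

    The caller is assumed to pass a block of roughly 500 words.
    """
    lengths = sentence_lengths(block)
    if not lengths:
        return False
    return all(short <= n <= long for n in lengths)
```

A single short sentence such as "Stop." is enough to clear the flag, which matches the threshold as written.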

Transition Density

AI over-relies on transition words and adverbial clauses to maintain flow between paragraphs.

Threshold: If more than 30% of paragraphs in an article begin with a transition word or adverbial clause, the text is structurally artificial.
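
A partial check for this threshold is sketched below. It only recognises a fixed list of transition phrases taken from the tables earlier in this guide; detecting adverbial clauses would need a parser and is out of scope. The names `TRANSITIONS`, `transition_density`, and `is_transition_heavy` are hypothetical.

```python
import re

# Transition phrases drawn from the lists earlier in this guide.
TRANSITIONS = (
    "furthermore", "moreover", "notwithstanding", "that being said",
    "with that in mind", "at its core", "to put it simply",
    "in essence", "in conclusion", "to sum up",
)

_OPENER = re.compile(
    r"(?:" + "|".join(re.escape(t) for t in TRANSITIONS) + r")\b",
    re.IGNORECASE,
)


def transition_density(paragraphs: list[str]) -> float:
    """Fraction of non-empty paragraphs that open with a listed transition."""
    paras = [p.strip() for p in paragraphs if p.strip()]
    if not paras:
        return 0.0
    hits = sum(1 for p in paras if _OPENER.match(p))
    return hits / len(paras)


def is_transition_heavy(paragraphs: list[str], limit: float = 0.30) -> bool:
    """Apply the 30% threshold stated above."""
    return transition_density(paragraphs) > limit
```

Because adverbial clauses are not counted, this sketch underestimates the true density.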

Opening-Word Repetition

Three or more consecutive paragraphs starting with the same word or phrase pattern indicates mechanical generation. Vary opening words.
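
The same-word case can be detected by comparing the first word of each paragraph. The sketch below uses a hypothetical function name and covers only identical first words, not the looser "phrase pattern" case.

```python
def repeated_openers(paragraphs: list[str], run: int = 3) -> bool:
    """Return True when `run` consecutive paragraphs share a first word."""
    openers = [
        p.split()[0].strip('.,;:"\'').lower()
        for p in paragraphs
        if p.strip()
    ]
    streak = 1
    for prev, cur in zip(openers, openers[1:]):
        # Extend the streak on a match, otherwise start over.
        streak = streak + 1 if cur == prev else 1
        if streak >= run:
            return True
    return False
```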

Segmental Entropy

AI maintains flat stylistic consistency from introduction through conclusion. Human writers naturally vary pacing, complexity, and sentence structure between sections.

Threshold: Calculate sentence length variance separately for the introduction, body, and conclusion. If variance differs by less than 10% across all three segments, the text was likely generated as a single pass by AI.
Why this matters: Human introductions tend to be tighter and more declarative. Human body sections are denser with longer sentences. Human conclusions shift register. AI maintains a monotone throughout.
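
The variance comparison might be sketched as follows. Several things here are assumptions rather than part of the guide: the caller supplies the three segments already split, sentences are split naively on end punctuation, population variance is used, and "differs by less than 10%" is read as the gap between the highest and lowest variance relative to the highest. Function names are hypothetical.

```python
import re
from statistics import pvariance


def length_variance(segment: str) -> float:
    """Population variance of sentence lengths (in words) for a segment."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", segment.strip()) if s]
    lengths = [len(s.split()) for s in sentences]
    return pvariance(lengths) if lengths else 0.0


def is_stylistically_flat(
    intro: str, body: str, conclusion: str, tolerance: float = 0.10
) -> bool:
    """Return True when the three variances differ by less than the tolerance.

    Assumed reading: (highest - lowest) / highest < tolerance.
    Three zero variances count as flat.
    """
    variances = [length_variance(s) for s in (intro, body, conclusion)]
    highest = max(variances)
    if highest == 0:
        return True
    return (highest - min(variances)) / highest < tolerance
```

Short segments give unstable variances, so the result is more meaningful on segments with many sentences.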

Contrasting Parallelism Overuse

2025-era models overuse sequential contrasting structures to simulate punchy emphasis:

"It's not X, it's Y."
"It's not about X, it's about Y."
"The issue isn't X. The issue is Y."
Threshold: More than two contrasting parallelisms in a 500-word block.
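
The three example structures can be counted with regular expressions. The patterns below are a hypothetical approximation built only from the examples above; other phrasings of the same device will be missed. The caller is assumed to pass a block of about 500 words.

```python
import re

_PATTERNS = [
    # "It's not X, it's Y." and "It's not X. It's Y." (straight or curly apostrophe)
    re.compile(r"\bit['\u2019]s not\b[^.!?]*?[,.]\s+it['\u2019]s\b", re.IGNORECASE),
    # "The issue isn't X. The issue is Y." for any repeated noun
    re.compile(r"\bthe (\w+) isn['\u2019]t\b[^.!?]*\.\s+the \1 is\b", re.IGNORECASE),
]


def count_contrasting_parallelisms(block: str) -> int:
    """Count contrasting parallelisms matching the example shapes."""
    return sum(len(p.findall(block)) for p in _PATTERNS)


def exceeds_parallelism_threshold(block: str, limit: int = 2) -> bool:
    """Apply the more-than-two threshold stated above."""
    return count_contrasting_parallelisms(block) > limit
```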

Model-Family-Specific Tells

Different AI model families produce distinct stylistic fingerprints based on their training and RLHF tuning.

GPT-4o / GPT-4.5 (OpenAI)

Heavy use of bullet-point formatting and structured lists
Staccato short-sentence contrasting: "It's not X. It's Y." used to simulate punchy copy
Rhetorical colon abuse: "Here's the thing:", "Think about it:", "The bottom line:", "The reality:"
Over-structures arguments into numbered steps

Claude 3.5 / Claude 4 (Anthropic)

Better sentence length variation than GPT, but still exhibits flat segmental entropy
Overly polite and conciliatory transitions: "It's worth considering that", "To be fair", "That said"
Leans toward poetic and metaphorical prose with words like "nuanced," "complexities"
Loses thread in long documents and resorts to increasingly generic transitions
Tends toward diplomatic hedging even when stating documented facts

Common Across All Models

Uniform paragraph lengths
Predictable section ordering (Background > Details > Impact > Response)
Citation clustering at paragraph ends rather than distributed throughout sentences
Excessive boldface on concepts, product names, and inline headers

False Positive Prevention

Exclusion Zones

Lexical scans must NOT flag text inside:

Direct quotes ("...") from cited sources
Titles, names, and other verbatim values taken from a source
Code, configuration, or markup that is being shown as an example
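
One way to honour these zones is to blank them out before running any lexical scan. The sketch below is an assumption-laden illustration: it treats double-quoted spans as direct quotes and backtick-delimited spans as code, and it does not attempt to recognise titles or names. `scannable_text` is a hypothetical helper.

```python
import re

# Built indirectly so this example can itself sit inside a fenced block.
_FENCE = "`" * 3

_EXCLUDED = re.compile(
    _FENCE + r".*?" + _FENCE          # fenced code blocks
    + r"|`[^`\n]*`"                   # inline code spans
    + r'|"[^"\n]*"'                   # straight double quotes
    + "|\u201c[^\u201d\n]*\u201d",    # curly double quotes
    re.DOTALL,
)


def scannable_text(text: str) -> str:
    """Replace quoted and code spans with a space before lexical scans."""
    return _EXCLUDED.sub(" ", text)
```

Running the word-list checks on `scannable_text(text)` instead of `text` keeps quoted sources and code examples from being flagged.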

Context-Aware Severity

If a banned word appears immediately adjacent to specific named entities (proper nouns, statute numbers, dates, dollar amounts), it is more likely being used with technical meaning than as AI filler. Reduce flag severity.

Higher severity: "a comprehensive examination of the issues" (abstract nouns, no specifics)
Lower severity: "comprehensive audit by the FTC in 2024" (specific entity, specific date)
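
The adjacency rule could be approximated with a token window. In the sketch below, a capitalised token that is not the first word of the sentence stands in for a proper noun, and a numeric token stands in for a date, statute number, or amount. The four-token window is an assumption chosen so that the two examples above come out as stated; the function name and return labels are hypothetical.

```python
import re

_NUMBER = re.compile(r"\$?\d[\d,.]*%?")
_PUNCT = '.,;:()"'


def severity(sentence: str, banned: str, window: int = 4) -> str:
    """Return "lower", "higher", or "none" for a banned word in a sentence.

    "lower" means a number or capitalised token appears within
    `window` tokens of the first occurrence of the banned word.
    """
    tokens = sentence.split()
    lowered = [t.strip(_PUNCT).lower() for t in tokens]
    if banned not in lowered:
        return "none"
    i = lowered.index(banned)
    start, stop = max(0, i - window), min(len(tokens), i + window + 1)
    for j in range(start, stop):
        if j == i:
            continue
        tok = tokens[j].strip(_PUNCT)
        is_number = bool(_NUMBER.fullmatch(tok))
        # Skip index 0: a sentence-initial capital is not evidence of a name.
        is_proper = j > 0 and tok[:1].isupper()
        if is_number or is_proper:
            return "lower"
    return "higher"
```

Capitalisation is a crude proxy for named entities, so this heuristic should adjust severity, not suppress flags outright.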

Metaphorical vs. Literal Distinction

These words require bigram context checking. Only flag metaphorical uses:

ecosystem: "Apple's software ecosystem" (OK) vs. "the repair ecosystem" (flag)
landscape: "Arizona landscape" (OK) vs. "the regulatory landscape" (flag)
navigate: "navigate the website" (OK) vs. "navigate the regulatory process" (flag)
tapestry: "medieval tapestry" (OK) vs. "a tapestry of regulations" (flag)
symphony: "Beethoven's symphony" (OK) vs. "a symphony of features" (flag)
beacon: "lighthouse beacon" (OK) vs. "a beacon of hope" (flag)
testament: "last will and testament" (OK) vs. "a testament to innovation" (flag)

How to Self-Check

Read your text aloud. If phrases sound unnatural in speech, revise them
Ask: "Would I say this in a conversation with a colleague?"
Check for repetitive sentence structures
Look for clusters of the words listed above
Ensure varied sentence lengths (not all similar length)
Verify each intensifier adds genuine meaning
Count hedging markers per paragraph. More than 3 in a single paragraph is a red flag.
Check paragraph word counts within each section. If they are all similar, vary them.
Search for hallucinated markup: oaicite, contentReference, grok_card, attributableIndex, turn0search0
Check if your introduction, body, and conclusion have different pacing and sentence complexity
