Datasets

American Stories

historical newspaper

A billion-scale dataset of structured texts and layouts from U.S. public domain newspapers.
Huggingface · Paper · Github

Headlines

historical newspaper · semantic similarity

A massive-scale semantic similarity dataset of historical newspaper headlines.
Huggingface · Paper · Github

Wire Clusters

historical newspaper · deduplication

A topic-tagged, entity-tagged, and georeferenced dataset of 2.7 million unique public domain U.S. newswire articles written between 1878 and 1977.
Huggingface · Paper · Github