Article boundary detection
We extract article headlines based on the font size and perform topic modeling based on the TF-IDF score.
Our multilingual diachronic corpus contains digitized issues of the Credit Suisse Bulletin magazine spanning more than 120 years. To extract the wealth of linguistic information from this data, we need proper alignment: for each article in German, we need to be able to automatically find the same article in French, English, Italian, and Spanish. The first requirement is a reliable article boundary detection algorithm. This is the challenge that we present to the public.
Our team decided to tackle the problem from two angles. First, we are writing an algorithm to detect article titles. Our starting point is the font size, which we find in the rich layout XML files that ABBYY reader produced when the original scans were processed with Optical Character Recognition software.
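The font-size idea can be illustrated with a minimal sketch. It is not our actual script. It assumes that each text line in the layout XML contains `formatting` elements that carry the font size in an `fs` attribute, and that a line counts as a headline candidate when its font size is at least 1.5 times the most common (body text) size. The sample XML, the function names, and the `ratio` threshold are all hypothetical and would need tuning on the real files.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, heavily simplified stand-in for one page of OCR layout XML.
SAMPLE_XML = """<document>
  <page>
    <block blockType="Text">
      <text>
        <par><line><formatting fs="24.">Annual Report</formatting></line></par>
        <par><line><formatting fs="9.">The bank grew this year.</formatting></line></par>
        <par><line><formatting fs="9.">Profits rose in all regions.</formatting></line></par>
      </text>
    </block>
  </page>
</document>"""


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://...}line'."""
    return tag.rsplit("}", 1)[-1]


def extract_lines(xml_text):
    """Return a list of (font_size, text) pairs, one per text line."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "line":
            continue
        sizes, parts = [], []
        for fmt in elem.iter():
            if local_name(fmt.tag) != "formatting":
                continue
            text = "".join(fmt.itertext()).strip()
            if text:
                parts.append(text)
            # Assumption: the font size is stored in the 'fs' attribute.
            if "fs" in fmt.attrib:
                sizes.append(float(fmt.attrib["fs"]))
        if parts and sizes:
            # Use the largest font size found on the line.
            lines.append((max(sizes), " ".join(parts)))
    return lines


def find_headlines(lines, ratio=1.5):
    """Return the text of lines whose font is much larger than body text."""
    if not lines:
        return []
    # The most frequent font size is taken to be the body text size.
    body_size = Counter(size for size, _ in lines).most_common(1)[0][0]
    return [text for size, text in lines if size >= ratio * body_size]


headlines = find_headlines(extract_lines(SAMPLE_XML))
```

Comparing each line against the most frequent size rather than a fixed point size keeps the rule usable across issues whose layouts changed over the decades, although subheadings and pull quotes would probably still need extra filtering.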
Sometimes articles are published out of order, which creates additional challenges for alignment. Our second approach is an attempt to overcome this difficulty: we implement topic modeling so that articles can be aligned by theme rather than by position. We extract the textual data from the files and then calculate each word's relevance to the topic with the TF-IDF score. The initial evaluation will be carried out on manually separated articles. If the idea proves itself, the algorithm could be extended to automate article boundary detection.
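As a rough illustration of the scoring step, the sketch below computes TF-IDF weights over a handful of already separated articles and lists the highest-weighted words for each one. It uses the plain textbook formulation (term frequency times the logarithm of the inverse document frequency) and a simple regular-expression tokenizer. The function names and the two sample texts are hypothetical and do not come from the corpus.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def tfidf(articles):
    """Return one {word: tf-idf score} dictionary per article."""
    token_lists = [tokenize(article) for article in articles]
    n_docs = len(token_lists)

    # Document frequency: in how many articles does each word occur?
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


def top_terms(score, k=3):
    """Return the k highest-scoring words, ties broken alphabetically."""
    ranked = sorted(score.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


# Hypothetical sample articles.
articles = [
    "the bank raised interest rates",
    "the ski race in the alps",
]
scores = tfidf(articles)
keywords = [top_terms(score) for score in scores]
```

Words that occur in every article, such as "the" in the sample, receive a score of zero, so the top-ranked words are the ones that distinguish an article from the rest of the collection.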
Once both scripts produce the desired results, we can combine them into one robust algorithm for automatic article boundary detection.