July 2, 2020
Since the early days of the coronavirus pandemic, the German public broadcaster NDR Info has aired a popular podcast with the virologist Christian Drosten.
In addition to the podcast itself, the transcript of each episode is published on the broadcaster’s website. This makes it easy to do some text processing and analysis on the podcast’s content.
First, I wrote a script to scrape and transform the transcripts of all podcast episodes into a dataframe with the following columns: episode title, date, link to the transcript, episode number, interviewer, speaker of the transcript section, and the transcript section itself.
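
As a rough illustration of the transformation step (not the actual script), the sketch below turns one already-downloaded transcript into one row per speaker section. It assumes, purely for illustration, that each section starts with a line of the form `Speaker Name: text`; the real transcript layout may differ. The function name `split_transcript`, the column names, and all sample values (including the interviewer name `Jane Doe` and the `example.org` link) are hypothetical placeholders.

```python
def split_transcript(raw, meta, speakers):
    """Split a raw transcript into one row (dict) per speaker section.

    raw      -- transcript text; ASSUMED format: "Speaker Name: text"
    meta     -- episode-level columns copied into every row
    speakers -- set of known speaker names used to detect section starts
    """
    rows, current, parts = [], None, []

    def flush():
        # Store the section collected so far, if there is one.
        if current is not None and parts:
            rows.append({**meta, "speaker": current, "text": " ".join(parts)})

    for line in raw.splitlines():
        line = line.strip()
        name, sep, rest = line.partition(":")
        if sep and name.strip() in speakers:
            # A known speaker label starts a new section.
            flush()
            current = name.strip()
            parts = [rest.strip()] if rest.strip() else []
        elif line and current is not None:
            # Continuation line of the current section.
            parts.append(line)
    flush()
    return rows


# Hypothetical sample data, only to show the resulting shape.
sample = """Jane Doe: How are you?
Christian Drosten: Fine.
Still fine.

Jane Doe: Good."""

meta = {
    "title": "Example episode",
    "date": "2020-01-01",
    "link": "https://example.org/transcript-1",
    "episode_no": 1,
    "interviewer": "Jane Doe",
}
rows = split_transcript(sample, meta, {"Jane Doe", "Christian Drosten"})
```

A list of dictionaries like `rows` can be passed directly to `pandas.DataFrame(rows)` to obtain a dataframe with the columns listed above. Matching against a set of known speaker names, rather than any text before a colon, keeps ordinary sentences that contain a colon from being mistaken for a new section.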