When heavy metal meets data science | Episode I

Introduction and first basic exploration of the dataset

Luca Ballore
9 min read · Feb 11, 2020

Introduction: what is heavy metal?

The Encyclopædia Britannica defines heavy metal as follows:

Genre of rock music that includes a group of related styles that are intense, virtuosic, and powerful.

According to Wikipedia, heavy metal is:

Genre of rock music that developed in the late 1960s and early 1970s, largely in the United Kingdom and the United States. With roots in blues rock, psychedelic rock, and acid rock the bands that created heavy metal developed a thick, massive sound, characterised by highly amplified distortion, extended guitar solos, emphatic beats, and overall loudness.

Technically speaking, both definitions are acceptable, but what is missing is an emotional one. What is heavy metal to a fan or to a musician? What made this genre capable of leaving its mark on the lives of almost three generations of people?
As a metalhead, I am convinced that much of this power lies in the lyrics, even though massive riffs, distortion, solos, and beats play a very important role.

Natural language, with its variations and versatility, is the way we humans choose to communicate and share ideas and emotions. Natural language is also very complex to understand. Its definition applies to any language that has evolved naturally in humans through use and repetition without conscious planning or premeditation. Due to the many complex issues involved, the understanding of natural language is still an open problem for computers to resolve.

With this project I aim to get familiar with a range of techniques used today to tackle NLP-related problems. I want to see whether I can find patterns, styles, or anything else that can define heavy metal in other ways.

So, let’s start putting heavy metal into data science!

Creating a dataset

The first step of this project was gathering a dataset big and accurate enough to serve the purpose. Since I could not find anything ready-made on the Internet, I decided to build it myself. I created a Python library called metalparser (sources are available here), capable of scraping DarkLyrics.com, arguably one of the most accurate archives of metal lyrics on the Internet, second only to metal-archives.com.

After a few days of patiently waiting around, I managed to download a dataset of 230,376 songs, divided into 30,730 albums and 8,202 artists.

I want to clarify that the “metalness” of the bands in the dataset is very subjective. I know that people (including me) have strong opinions about what metal is and what it is not, but my aim was not to inspect every single band in the dataset or to apply my own personal taste. I will probably want to repeat this study in the future with another dataset built on a stricter definition of “metalness” (as metal-archives.com seems to do).

Every row in the dataset corresponds to a song and contains the following information:

  • Artist;
  • Album;
  • Album type;
  • Release year;
  • Song title;
  • Track number on the album;
  • Lyrics (where available);
  • Information about the language (ISO 639-3 code and name), obtained by processing the lyrics with a library called langdetect.

IMPORTANT: I will not release any copy of the dataset, since it would make it too easy to clone DarkLyrics. You can still use metalparser to download it yourself. DarkLyrics does not like robots, so go easy on the requests to avoid getting banned. Also, make sure your purpose complies with the site's disclaimer.
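As a rough illustration of what "going easy on the requests" can mean in practice, here is a minimal sketch of a throttle that enforces a pause between consecutive requests. It is not part of metalparser; the class name `Throttle` and its parameters are hypothetical, and a suitable interval is something you have to choose yourself.

```python
import time


class Throttle:
    """Enforce a minimum pause between consecutive calls to wait().

    Hypothetical helper for illustration only; not metalparser's API.
    """

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between two requests
        self._clock = clock               # injectable, which makes testing easy
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def wait(self):
        """Block until min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `wait()` right before every page download keeps the request rate below one per `min_interval` seconds. The first call returns immediately, and later calls only sleep for whatever part of the interval has not already elapsed.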

Exploring the dataset

As reported above, the dataset consists of:

  • 230,376 songs
  • 30,730 albums
  • 8,202 artists

With some basic math I calculated that, on average, a heavy metal album contains between 7 and 8 songs, and that each artist contributes almost 4 albums and about 28 songs to the dataset.
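The averages follow directly from the three totals reported above. A small sketch of the arithmetic, where the variable names are mine:

```python
# Totals reported for the dataset
songs, albums, artists = 230_376, 30_730, 8_202

songs_per_album = songs / albums      # average album length
albums_per_artist = albums / artists  # average discography size
songs_per_artist = songs / artists    # average contribution per artist

print(f"songs per album:   {songs_per_album:.2f}")
print(f"albums per artist: {albums_per_artist:.2f}")
print(f"songs per artist:  {songs_per_artist:.2f}")
```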

In order to obtain a simple but useful statistic, the first thing I wanted to do was to figure out how many of those songs had lyrics and how many did not. How much of what heavy metal musicians wanted to communicate was left only to the power of their instruments?

Songs with lyrics VS instrumental songs in the dataset

The pie chart above shows that almost 6% of the entire dataset consists of instrumental songs: 13,397 songs have no lyrics, definitely not a negligible quantity.
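The percentage can be checked from the reported counts. The sketch below also shows one possible way to flag a song as instrumental, assuming that missing lyrics are stored as `None` or as an empty string. That storage convention is my assumption, and the toy rows are invented for illustration.

```python
def is_instrumental(lyrics):
    """Treat None, empty, or whitespace-only lyrics as 'no lyrics' (assumed convention)."""
    return lyrics is None or not lyrics.strip()


# Invented toy rows, not taken from the real dataset
toy_lyrics = ["Some words here", None, "   ", "More words"]
toy_instrumental = sum(is_instrumental(text) for text in toy_lyrics)

# Share computed from the figures reported in the post
total_songs = 230_376
instrumental_songs = 13_397
share = instrumental_songs / total_songs
print(f"{share:.2%} of the songs have no lyrics")
```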

Another statistic I could obtain with a simple row count is the variety of publication types in the dataset. Looking at the pie chart below, it is no surprise that the great majority of the entries in the dataset come from studio albums.

Distribution of the types of publications in the dataset

The track list of a live album is usually a compilation of songs from studio albums, so live albums are very often not listed at all on DarkLyrics. It is interesting to see that more than 7% of the dataset is composed of EPs and demos. These usually contain more improvised and “authentic” lyrics, which have not gone through all the corrections and adaptations made to meet the market.

Popularity in the dataset

While running my pandas queries, one of the first things I was curious about was the “population” of the dataset. Which artists are the most represented, and to what extent? Here is what the data says:

Population of the dataset in terms of the number of songs.

In terms of the number of songs, the historic British band Judas Priest dominates the dataset with 351 songs. They are followed by other iconic artists like Alice Cooper, Rage, and Motorhead. Understandably, all of these bands have been on stage for a very long time and are still active (excluding Motorhead).

Predictably, the population based on the number of released albums follows the same pattern:

Population of the dataset in terms of the number of albums released.

The American rocker Alice Cooper tops the chart this time with 30 albums, followed by Melvins, Motorhead, and Judas Priest.
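Both rankings boil down to counting rows per artist. The post's own queries were run with pandas and are not shown, so the following is only a dependency-free sketch of the idea using `collections.Counter`. The field names and the toy rows are assumptions made for illustration. Because each row is a song, albums have to be deduplicated before they are counted.

```python
from collections import Counter

# Invented toy rows: one row per song, as in the dataset description
rows = [
    {"artist": "Band A", "album": "First", "title": "Song 1"},
    {"artist": "Band A", "album": "First", "title": "Song 2"},
    {"artist": "Band A", "album": "Second", "title": "Song 3"},
    {"artist": "Band B", "album": "Debut", "title": "Song 4"},
    {"artist": "Band B", "album": "Debut", "title": "Song 5"},
    {"artist": "Band C", "album": "Demo", "title": "Song 6"},
]

# Songs per artist: one row is one song, so counting rows is enough
songs_per_artist = Counter(row["artist"] for row in rows)

# Albums per artist: collapse the rows to unique (artist, album) pairs first
unique_albums = {(row["artist"], row["album"]) for row in rows}
albums_per_artist = Counter(artist for artist, _album in unique_albums)

print(songs_per_artist.most_common(3))
print(albums_per_artist.most_common(3))
```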

Heavy metal through time

Time is a very important topic when it comes to metal music. I will return to it later, perhaps in another post where I will engage in deeper language analysis. My aim, for now, is to get an idea of the distribution of heavy metal albums over the years.

Distribution of heavy metal albums over the years

According to the histogram above, all albums and songs in the dataset have been released within 52 years, between 1968 and 2020, touching 7 decades and 3 generations of people.
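The span and the number of decades can be verified with a couple of lines. A year is mapped to its decade by integer division, so that 1968 becomes 1960:

```python
first_year, last_year = 1968, 2020

# Length of the period covered by the dataset
span = last_year - first_year

# Every decade touched by at least one year in the range
decades = sorted({year // 10 * 10 for year in range(first_year, last_year + 1)})

print(span)
print(decades)
```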

Taking into consideration all the albums in the dataset, 2011 is the year in which the most albums were released (1,450), followed by 2012 with 1,304 releases and 2010 with 1,269.

It is interesting to see how the number of released albums increases massively after the end of the ’90s.
This trend has two main causes:

  1. In that period we witnessed the collapse of the music market (alongside others), and record producers tried to compensate for the fall in sales by producing more and broadening their offering;
  2. Thanks to digitalization, recording an album became much more affordable. Devices and instruments did not require large investments anymore, so more and more bands started releasing their work with acceptable quality standards in terms of production.

I have not found an entirely convincing reason as to why the releases have diminished during the decade that has just ended. I believe that the trend can be explained by a combination of these factors:

  • music revenue has been growing again since 2015 (for the first time this millennium);
  • metal music has lost fans to other genres, especially among the younger generations;
  • gaps in the dataset itself.

Heavy Metal Languages

Heavy metal is arguably one of the most commercially successful genres of rock music, which means that albums and songs have been composed and released in every part of the world. The songs in the dataset confirm this statement by displaying a very large set of languages:

Language distribution of metal music according to DarkLyrics dataset
Top 20 languages table

English is the de facto international language of music in general, and it dominates the lyrics set with more than 195,000 songs. Looking at the rest of the table, we notice (as expected) a heavy presence of Northern European languages (German above all, but also the Scandinavian languages). The map below helps explain this. Spanish dominates the Romance languages (being spoken in more than 20 countries), while Polish is the first Slavic language in the table, in 8th place.

Concentration of heavy metal bands per 100,000 people

Latin deserves a special mention, being the only dead language to rank among the 20 most common languages in the dataset.
What kind of lyrics can we expect in Latin?

Let us have a look at some examples:

Penumbra — Pie Jesu

Pie Jesu domine
Dona eis requiem
Dies irae dies illa
Solvet saeclum in favilla
Teste David cum Sybilla

Tuba mirum spargen sonum
Per sepulcra regionum
Coget omnes ante thronum
Liber scriptus proferetur
In quo totum continetur
Unde mundus Judicetur
Rex Tremendae majestetis
Flammis accribu addictis
Voca me cum benedictis
Oro supplex et acclinis
Cor contritum quasi cinis
Gere curan mei finis
Lacrimosa dies illa
Qua resurget ex favilla
Judicandos homo reus
Huic ergo perce deus
Pie jesu dona eis requiem

In this text, it is easy to identify parts of a requiem (a mass for the dead). Its dark, gothic vibes make Latin an evocative language for these lyrics.

Here is another example:

Blackmass — Invocatio Mallum

[…]
re verbum ex odium affinis invocare dei
…sopire Inferna…
Satanas, Luciferi, Leviatan ac Belial
Audire nostrum precum
Nobis invocare prave
Nobis invocare prave
Nobis invocare prave
Nobis invocare ad regnare
Luciferi archangelux dei Inferni
Nobis tuus suplicare
Nobis tuus ordinare
Qui advenire, qui revocatus
Erga nobis oculus et mostrare tuus horribilis fácies
Advenire, venire, ego vocare
Venire tenebrae Princeps
Revocatus tuus exercitus, cinis uti is magnum rex
Penetralium in nostrum animus uti nobis
Tuus glorificare creare nobis corpus
Tuus summus domicilium
Invocatio erga umbra Dominus

The title translates as “Call upon the evil”, which already gives a clear idea of what this song is about. The text is essentially an invocation of Satan and other demons (Leviatan, Belial). Latin reinforces the dark, evil undertones and, being among other things the official language of the Catholic Church, adds an element of blasphemy.

Conclusions

In this post I have attempted to make a simple analysis of quantities in the dataset, trying also to give an idea about how these quantities have changed through time.
For the next step, I would like to explore methods and techniques in order to find a definition of metal, focusing on part-of-speech analysis, readability, and word frequency.

Software Engineer @EA_DICE. AI enthusiast, music addicted, languages lover, football maniac, NERD.