Vega-Lite for data exploration
Posted on April 19, 2020
4 min read
This is a follow-up to my previous post on Vega-Lite.
I want to practice Vega-Lite using the Sochi Olympic Games dataset, which contains information on all the athletes.
Making a scatterplot with Vega-Lite is, after a bit of practice, dead simple.
Let's refine it further to make it useful for data exploration.
Athletes weight/height
I want to show the distribution of athletes by height and weight, a perfect task for a scatterplot:
{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "width": 400,
  "data": {
    "url": "https://fabiofranchino.com/disk/datasets/athletes_sochi.csv",
    "format": {
      "type": "csv"
    }
  },
  "mark": "point",
  "encoding": {
    "x": {
      "type": "quantitative", "field": "weight"
    },
    "y": {
      "type": "quantitative", "field": "height"
    }
  }
}
Since there are some missing values, I want to filter out the data points with a missing height or weight by adding a filter to the transform array:
"transform": [
  {
    "filter": "datum.weight > 0 && datum.height > 0"
  }
]
A filter can also be split into a list of filters, which is much more readable:
"transform": [
  {
    "filter": "datum.weight > 0"
  },
  {
    "filter": "datum.height > 0"
  }
]
By adding one more filter we can show only a specific country:
{
  "filter": "datum.country === 'Italy'"
}
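Putting the pieces together, the full spec might look like the sketch below. It assumes the transform array sits at the top level of the spec, next to data and mark, and that the three filters are applied in sequence. The description text is only an annotation added for this sketch.

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "description": "Sketch: athletes scatterplot with the three filters applied in sequence.",
  "width": 400,
  "data": {
    "url": "https://fabiofranchino.com/disk/datasets/athletes_sochi.csv",
    "format": {"type": "csv"}
  },
  "transform": [
    {"filter": "datum.weight > 0"},
    {"filter": "datum.height > 0"},
    {"filter": "datum.country === 'Italy'"}
  ],
  "mark": "point",
  "encoding": {
    "x": {"type": "quantitative", "field": "weight"},
    "y": {"type": "quantitative", "field": "height"}
  }
}
```

Removing the last filter object brings back the athletes from every country.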
Now I want to adjust the scales of the axes, because a lot of space is left unused by the scatterplot. We can set a different domain for each encoding:
"x": {
  "type": "quantitative",
  "field": "weight",
  "scale": {
    "type": "linear",
    "domain": [45, 105]
  }
},
"y": {
  "type": "quantitative", "field": "height",
  "scale": {
    "type": "linear",
    "domain": [1.5, 2]
  }
}
So far so good.
I've tried to compute the domain dynamically, without success. Let's see whether it becomes feasible in future attempts.
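A possible approach to the dynamic domain, offered as an untested sketch rather than a confirmed fix, is the zero property of a scale. Setting it to false tells Vega-Lite not to force the axis to include zero, so the domain should follow the extent of the filtered data instead of a hard-coded range. The description text is an annotation added for this sketch, and the result has not been checked against this dataset.

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "description": "Sketch (assumption, not verified on this dataset): scale.zero=false lets the domain follow the data extent.",
  "width": 400,
  "data": {
    "url": "https://fabiofranchino.com/disk/datasets/athletes_sochi.csv",
    "format": {"type": "csv"}
  },
  "transform": [
    {"filter": "datum.weight > 0"},
    {"filter": "datum.height > 0"}
  ],
  "mark": "point",
  "encoding": {
    "x": {
      "type": "quantitative",
      "field": "weight",
      "scale": {"zero": false}
    },
    "y": {
      "type": "quantitative",
      "field": "height",
      "scale": {"zero": false}
    }
  }
}
```

The trade-off is less control: the axis limits are picked by Vega-Lite and may be rounded to nice values, whereas an explicit domain pins them exactly.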
Gender
Now, I want to show the gender comparison using a donut chart.
The first step is to transform the dataset so that it provides the values needed by the encoding, so here is an aggregate transform that does that:
"transform": [{
  "aggregate": [{
    "op": "count",
    "field": "name",
    "as": "num"
  }],
  "groupby": ["gender"]
}]
Then, let's add the mark that matches the graphic element we want to create:
"mark": {
  "type": "arc",
  "innerRadius": 40
}
And finally, the encoding, which uses the newly computed property num:
"encoding": {
  "theta": {
    "type": "quantitative",
    "field": "num"
  },
  "color": {
    "type": "nominal",
    "field": "gender"
  }
}
Full source code here.
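For reference, the three fragments above could be assembled into a single spec roughly as follows. This is a sketch that reuses the same dataset URL as the scatterplot and assumes no other top-level properties are needed. The description text is an annotation added here.

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "description": "Sketch: donut chart of the athlete count per gender, assembled from the fragments in the post.",
  "data": {
    "url": "https://fabiofranchino.com/disk/datasets/athletes_sochi.csv",
    "format": {"type": "csv"}
  },
  "transform": [{
    "aggregate": [{
      "op": "count",
      "field": "name",
      "as": "num"
    }],
    "groupby": ["gender"]
  }],
  "mark": {
    "type": "arc",
    "innerRadius": 40
  },
  "encoding": {
    "theta": {"type": "quantitative", "field": "num"},
    "color": {"type": "nominal", "field": "gender"}
  }
}
```

After the aggregate, the data contains one row per gender value, with num holding the number of athletes in that group.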
There is still something I want to add: a label close to each slice. Since it's an additional element drawn on top of the chart, let's use the layer capability to achieve the desired result by adding an additional mark:
"layer": [
  {"mark": {"type": "arc", "innerRadius": 40}},
  {
    "mark": {"type": "text", "radius": 120},
    "encoding": {"text": {"field": "gender", "type": "nominal"}}
  }
]
And we need to move the encoding of the arc outside the layer:
"encoding": {
  "theta": {"type": "quantitative", "field": "num", "stack": true},
  "color": {"type": "nominal", "field": "gender"}
},
"layer": [
  {"mark": {"type": "arc", "innerRadius": 40}},
  {
    "mark": {"type": "text", "radius": 120, "color": "black"},
    "encoding": {"text": {"field": "gender", "type": "nominal"}}
  }
]
A couple of notes here. I got it to work by following this example from the official website, where I learned two mandatory things:
- Moving the encoding outside the layer
- Using the stack property in the theta channel
I'm not sure I really understand the logic behind it; maybe it's something I'll figure out later.
However, I wasn't able to change the text color, even using the valid property in the mark definition. Again, I'm not sure whether it's a bug or something that needs to be done in a different way.
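One possible explanation for the text color, stated as an assumption rather than a confirmed diagnosis: an encoding channel takes precedence over the matching mark property, and because the color encoding is shared by both layers, the text mark inherits it and its own color is ignored. The sketch below keeps theta shared, so both layers are stacked the same way, but moves color into the arc layer only. The description text is an annotation added here, and the spec has not been verified against this dataset.

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "description": "Sketch (assumption): color is encoded only in the arc layer so the text mark's own color can apply.",
  "data": {
    "url": "https://fabiofranchino.com/disk/datasets/athletes_sochi.csv",
    "format": {"type": "csv"}
  },
  "transform": [{
    "aggregate": [{"op": "count", "field": "name", "as": "num"}],
    "groupby": ["gender"]
  }],
  "encoding": {
    "theta": {"type": "quantitative", "field": "num", "stack": true}
  },
  "layer": [
    {
      "mark": {"type": "arc", "innerRadius": 40},
      "encoding": {"color": {"type": "nominal", "field": "gender"}}
    },
    {
      "mark": {"type": "text", "radius": 120, "color": "black"},
      "encoding": {"text": {"field": "gender", "type": "nominal"}}
    }
  ]
}
```

If the labels still pick up the slice colors with this layout, the cause is likely something else and the bug hypothesis remains open.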
Again, so far, so good.