In itinere

An open data project on international students in Italian universities


RESEARCH QUESTION & METHODOLOGY

Are the amounts of paid fees and of available scholarships
correlated with the enrollment of international students in Italian universities?

Source datasets

Choice and technical, legal, and ethical analysis of the source datasets

LOD datasets

Alignment and creation of LOD datasets and their metadata

Data processing

Extract most relevant trends and visualize possible correlations

Documentation

The process lifecycle is described in dedicated documentation

Main results

Average amount of paid fees (in €)
Average amount of funded scholarships (in thousands of €)
Average number of students enrolled in Italian universities
Average number of international students in Italian universities

Please select the reference academic year (default 2015/16):

The data show that no correlation can be found between the amount of fees and scholarships, and the percentage of enrolled international students:

# | 2016 | 2017 | 2018 | 2019
1 | Perugia Stranieri (39.57%) | Perugia Stranieri (38.70%) | Perugia Stranieri (36.68%) | Perugia Stranieri (36.63%)
2 | Rozzano (MI) Humanitas Univ. (25.43%) | Rozzano (MI) Humanitas Univ. (27.19%) | Bra Scienze Gastronomiche (28.24%) | Bra Scienze Gastronomiche (27.98%)
3 | Bra Scienze Gastronomiche (23.91%) | Bra Scienze Gastronomiche (26.65%) | Rozzano (MI) Humanitas Univ. (26.68%) | Roma Saint Camillus (27.50%)
4 | Bolzano (15.67%) | Reggio Calabria - Dante Alighieri (17.31%) | Reggio Calabria - Dante Alighieri (18.12%) | Rozzano (MI) Humanitas Univ. (25.05%)
5 | Torino Politecnico (14.12%) | Bolzano (14.64%) | Milano Politecnico (15.28%) | Reggio Calabria - Dante Alighieri (18.69%)
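
A minimal sketch of how such a correlation check could be run, assuming each row of a mash-up dataset holds an average fee and a percentage of international students for one university. The function name pearson and all the figures below are hypothetical and invented for illustration; they are not taken from the project's notebooks or from the MIUR data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Co-variation of the two variables around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Invented values, NOT taken from the MIUR datasets
avg_fees = [900.0, 1200.0, 1500.0, 2100.0, 3000.0]  # in €
intl_share = [4.0, 12.5, 3.1, 9.8, 5.2]             # in %

r = pearson(avg_fees, intl_share)
print(f"Pearson r = {r:.2f}")
```

A coefficient close to 0 suggests that there is no linear relationship between the two variables, while values close to 1 or -1 suggest a strong one.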

Source datasets

MIUR

Contribuzione e interventi atenei

MIUR

Spesa per interventi

MIUR

Iscritti per ateneo

MIUR

Iscritti stranieri per anno accademico


Paid Fees 2016


ID: fees2016

Provenance: MIUR (USTAT)

Creation Date: more than 4 years ago

Format: .csv .tsv .json .xml
Metadata: Provided
URI: 2016ContribuzioneMedia

Paid Fees 2017


ID: fees2017

Provenance: MIUR (USTAT)

Creation Date: more than 4 years ago

Format: .csv .tsv .json .xml
Metadata: Provided
URI: 2017ContribuzioneMedia

Paid Fees 2018


ID: fees2018

Provenance: MIUR (USTAT)

Creation Date: more than 3 years ago

Format: .csv .tsv .json .xml
Metadata: Provided
URI: 2018ContribuzioneMedia

Paid Fees 2019


ID: fees2019

Provenance: MIUR (USTAT)

Creation Date: more than 2 years ago

Format: .csv .tsv .json .xml
Metadata: Provided
URI: 2019ContribuzioneMedia

Scholarships 2016


ID: dsu2016

Provenance: MIUR (USTAT)

Creation Date: more than 5 years ago

Format: .csv .tsv .json .xml
Metadata: Provided
URI: 2016SpesaInterventi

Scholarships 2017


ID: dsu2017

Provenance: MIUR (USTAT)

Creation Date: more than 4 years ago

Format: .csv .tsv .json .xml
Metadata: Provided
URI: 2017SpesaInterventi

Scholarships 2018


ID: dsu2018

Provenance: MIUR (USTAT)

Creation Date: more than 3 years ago

Format: .csv .tsv .json .xml
Metadata: Provided
URI: 2018SpesaInterventi

Scholarships 2019


ID: dsu2019

Provenance: MIUR (USTAT)

Creation Date: more than 2 years ago

Format: .csv .tsv .json .xml
Metadata: Provided
URI: 2019SpesaInterventi

Total Students 2016-2019


ID: student

Provenance: MIUR (USTAT)

Creation Date: more than 3 years ago

Format: .csv .tsv .json .xml
Metadata: Provided
URI: IscrittiAteneo

International Students 2016-2019


ID: intstudent

Provenance: MIUR (USTAT)

Creation Date: more than 1 year ago

Format: .csv .tsv .json .xml
Metadata: Provided
URI: IscrittiStranieriAA

Mashup dataset

Initinere 2016


ID: 2016

Creation Date: 13 November 2022

Format: .csv
Metadata: Provided
URI: Initinere2016

Initinere 2017


ID: 2017

Creation Date: 13 November 2022

Format: .csv
Metadata: Provided
URI: Initinere2017

Initinere 2018


ID: 2018

Creation Date: 13 November 2022

Format: .csv
Metadata: Provided
URI: Initinere2018

Initinere 2019


ID: 2019

Creation Date: 13 November 2022

Format: .csv
Metadata: Provided
URI: Initinere2019

Documentation

Table of contents

  1. Introduction & scenario
  2. Original & mash-up datasets
  3. Quality analysis
  4. Legal analysis
  5. Ethical analysis
  6. Technical analysis
  7. Sustainability of the update
  8. Visualization
  9. RDF assertion of the metadata

Introduction & scenario

In itinere is an open data project analyzing the enrollment of international students in Italian universities over a reference period of four academic years (a.y. 2015/2016 - 2018/2019). In recent decades, Italy has stressed the importance of opening academia to international students and researchers. To this end, exchange projects have been funded and international and English-taught degree courses have been opened. Moreover, dedicated institutions have been founded, namely the "Università per stranieri" in Perugia (as early as 1925) and Siena (1992), and the "Dante Alighieri" in Reggio Calabria (2007).

As mentioned in the previous sections of this webpage, the main goal of the project is to analyze possible factors that influence the choice of university from the perspective of an international student. After analyzing the available data, we chose to narrow this goal down to a more circumscribed research question: are scholarships and fees reliable indicators to describe the presence of international students in Italian universities? For the sake of simplicity (especially in the implementation of the algorithms), we use "scholarships" to translate the Italian term "diritto allo studio", which groups several expenditures made by universities (including dorms, canteens, international mobility etc.).

As regards the availability of open data, we relied on the catalogue dati.gov.it. This platform is part of Italy's ongoing process of digital transition, the "Piano di Crescita Digitale": it includes 56,893 datasets, mainly from local and regional administrations. National datasets are scarce (almost 8%), which makes it difficult to retrieve exhaustive data for our project. All the source datasets have been taken from USTAT, the data portal of tertiary education. Its "Open Data" section is the most up to date, yet some datasets do not provide information after a.y. 2018/19. We also considered integrating them with other sources, such as ISTAT: unfortunately, it either provides information and useful observations referring to different academic years (often older than the ministerial ones) or groups information geographically, which is incompatible with the subdivision per institution adopted by the MIUR datasets.

To investigate the research question further, access to more detailed data is necessary. In particular, further research would benefit from datasets including observations about:

  • number of international (mainly English-taught) degree courses and teachings
  • provenance of international students (from EU, non-EU European country, other continents)
  • number and width of international partnerships per university
  • average financial background (which can be described through ISEE) of Italian and international students per university

Statement of responsibility:

The project was carried out by:

  • Federica Bonifazi
  • Manuele Veggi

The workload was divided as follows:

Team member | Project | Documentation
Federica Bonifazi | Project ideation; Data retrieval; Mashup Datasets; Metadata creation; Data legal analysis | 4. Legal analysis; 5. Ethical analysis; 6. Technical analysis; 7. Sustainability of the update; 9. RDF assertion of the metadata
Manuele Veggi | Project ideation; Data retrieval; Mashup Datasets; Data visualization; Web infrastructure | 1. Introduction & scenario; 2. Original & mash-up datasets; 3. Quality analysis; 8. Visualization; Creation of the Jupyter Notebooks

Original datasets

As mentioned, the ten source datasets (downloaded as .csv files) come from the same ministerial portal, USTAT. From now on it will be cited as MIUR, after the domain of the URL, even though the former "Ministero dell'Istruzione, dell'Università e della Ricerca" has been split into the "Ministero dell'Università e della Ricerca" and the Ministry of Education (whose name has changed considerably under recent governments). As listed in the previous sections, the source datasets are:

Topic | Portal | Dataset(s)
Contribuzione e interventi atenei. Contribuzione media | MIUR | 2016, 2017, 2018, 2019
Diritto allo Studio Universitario (DSU) Regionale. Spesa per interventi | MIUR | 2016, 2017, 2018, 2019
Iscritti. Iscritti per ateneo | MIUR | 2016-2019
Iscritti. Iscritti stranieri per ateneo | MIUR | 2016-2019

Before the mashup, the datasets are filtered: both telematic universities and AFAM institutions (fine and performing arts academies) have been excluded. Moreover, especially in the first two topics, not all universities are present in all datasets (which explains considerable differences in some of the visualizations): this is the case for three Roman universities,

  • Roma Europea, missing from the 2016 dataset on scholarships
  • Roma Link Campus, missing from the 2016 and 2017 datasets on scholarships
  • Roma Saint Camillus, present exclusively in 2019 (probably because it was founded only in 2017)
These differences can be easily computed through the function distElement(dframe1, dframe2) in the Python script used to create a temporary dataset for the linechart visualization (see §8).
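
The project's own distElement implementation is in the script mentioned above and is not reproduced here. What follows is only a hedged sketch of what a function with that signature might do, assuming that both tables expose a column with the university name. The column parameter, its default value and the returned dictionary are assumptions made for illustration.

```python
def distElement(dframe1, dframe2, column="university"):
    """Return the values of `column` found in only one of the two tables.

    Works with any object supporting `table[column]`, e.g. a pandas
    DataFrame or a plain dict of lists. The default column name is an
    assumption made for this sketch.
    """
    names1 = set(dframe1[column])
    names2 = set(dframe2[column])
    return {
        "only_in_first": sorted(names1 - names2),
        "only_in_second": sorted(names2 - names1),
    }


# Stand-in tables (dicts of lists) mimicking two scholarship datasets
dsu2016 = {"university": ["Bologna", "Pavia"]}
dsu2019 = {"university": ["Bologna", "Pavia", "Roma Saint Camillus"]}

diff = distElement(dsu2016, dsu2019)
print(diff)
```

With the stand-in tables above, the only difference reported is the university that appears exclusively in the second table.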

Mash-up datasets

The four output datasets are obtained using the algorithms described in this Jupyter Notebook: please refer to that document for a more precise description of the rationale and of the computational steps behind this phase. We decided to keep the diachronic distinction in the output datasets as well: as a consequence, the association between source and output datasets is the following (indicated by reference year).

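Output | Scholarships | Paid fees | Students | International
2016 | 2016 | 2016 | 2016-2019 | 2016-2019
2017 | 2017 | 2017 | 2016-2019 | 2016-2019
2018 | 2018 | 2018 | 2016-2019 | 2016-2019
2019 | 2019 | 2019 | 2016-2019 | 2016-2019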
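
The association above can be sketched as a join on the university name, with the two multi-year tables filtered by reference year. This is not the code of the notebook: the function mash_up, the data layout, the field names and all the numbers are assumptions made for illustration.

```python
def mash_up(year, fees, scholarships, students, international):
    """Join four per-university tables for one reference year.

    `fees` and `scholarships` map a university name to a value, while
    `students` and `international` map (university, year) to a head
    count. This layout is an assumption made for the sketch.
    """
    rows = []
    # Keep only universities that appear in all four sources for this year
    for name in sorted(fees.keys() & scholarships.keys()):
        key = (name, year)
        if key not in students or key not in international:
            continue
        total = students[key]
        foreign = international[key]
        rows.append({
            "university": name,
            "year": year,
            "avg_fee": fees[name],
            "scholarships": scholarships[name],
            "students": total,
            "international": foreign,
            # Share of international students, in percent
            "international_pct": round(100 * foreign / total, 2) if total else None,
        })
    return rows


# Invented values, NOT taken from the MIUR datasets
fees_2016 = {"Bologna": 1500.0, "Pavia": 1400.0}
dsu_2016 = {"Bologna": 30000.0}
students = {("Bologna", 2016): 80000, ("Pavia", 2016): 20000}
international = {("Bologna", 2016): 4800, ("Pavia", 2016): 1500}

rows = mash_up(2016, fees_2016, dsu_2016, students, international)
print(rows)
```

In this example the second university is dropped because it has no scholarship record for the year, in the same way that missing institutions lead to differences between the yearly outputs.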

Quality analysis

The quality analysis of the datasets follows the guidelines available on the official Italian government portal "Docs Italia". The four reference categories are accuracy, consistency, completeness and timeliness. They are drawn from AgID Determinazione Commissariale n. 68/2013, which states (the text is translated into English here):

In relation to the specific context of use and to the purposes pursued by the regulation, critical databases must ensure the intrinsic value of the data, so that the attributes of the data are adequate with respect to the "inherent" characteristics defined in the aforementioned ISO/IEC 25012 standard, summarized below:

  • accuracy: the datum, and its attributes, correctly represents the real value of the concept or event to which it refers;
  • currentness (or timeliness of update): the datum, and its attributes, is of the "right time" (it is up to date) with respect to the procedure to which it refers;
  • consistency: the datum, and its attributes, shows no contradiction with other data in the context of use of the owning administration;
  • completeness: the datum is exhaustive for all its expected values and with respect to the related entities (sources) that contribute to the definition of the procedure

Agenzia per l'Italia Digitale, Determinazione Commissariale n. 68/2013 DIG, article 4

As stated on the aforementioned portal, all the provided data are guaranteed to respect these four criteria. Indeed, the datasets are generally of good quality, yet minor flaws can be highlighted. The following table provides a concise description (datasets are analyzed by type and the chronological distinction is not reflected, as data concerning the same field appear to be consistent):

Dataset Accuracy Consistency Completeness Timeliness
Scholarships Only in some cases are detached seats considered separately from the main campus Missing information about Roma - Link Campus University (2016, 2017) and Roma - Università Europea (2016) No information available after a.y. 2018/2019
Fee No distinction between detached seats of a university from the main campus No information available after a.y. 2018/2019
Total students
International students

Legal Analysis

To analyse the legal aspects of the original datasets used in this project, the following checklist was adopted to evaluate specific topics: privacy issues, IPR policy, licenses, limitations on public access, economic conditions, and temporal aspects of the datasets.

To check | Paid Fees | Scholarships | Total Students | International Students
Free of any personal data as defined in the Regulation (EU) 2016/679? | Yes | Yes | Yes | Yes
Free of any indirect personal data that could be used for identifying the natural person? | Yes | Yes | Yes | Yes
Free of any particular personal data (art. 9 GDPR)? | Yes | Yes | Yes | Yes
Free of any information that, combined with common data available on the web, could identify the person? | Yes | Yes | Yes | Yes
Free of any information related to human rights? | Yes | Yes | Yes | Yes
Do you use a tool for calculating the range of the risk of de-anonymization? | Not needed | Not needed | Not needed | Not needed
Are you using geolocalization capabilities? | Yes | Yes | Yes | Yes
Did you check that the open data platform respects all the privacy regulations? | Yes | Yes | Yes | Yes
Do you know who the Controller and Processor of the privacy data are in your open data platform? | Yes | Yes | Yes | Yes
Have you checked the privacy regulation of the country where the datasets are physically stored? | Yes | Yes | Yes | Yes
Do you have non-personal data? | Yes | Yes | Yes | Yes

To check | Paid Fees | Scholarships | Total Students | International Students
Have you created and generated the dataset? | No | No | No | No
Are you the owner of the dataset? | No | No | No | No
Is the dataset free from third party licenses or patents? | Yes | Yes | Yes | Yes
Have you checked whether there are limitations in your national legal system on releasing certain kinds of datasets under an open license? | Yes | Yes | Yes | Yes

To check | Paid Fees | Scholarships | Total Students | International Students
Do you release the dataset with an open data license? | Yes | Yes | Yes | Yes
Do you include the clause: "In any case the dataset can’t be used for re-identifying the person"? | No | No | No | No
Do you release the API (in case you have one) with an open source license? | Not needed | Not needed | Not needed | Not needed
Do you check that the open data/API platform license regime is compliant with your IPR policy? | Not needed | Not needed | Not needed | Not needed

To check | Paid Fees | Scholarships | Total Students | International Students
Do you check that the dataset concerns your institutional competences, scope and finality? | Yes | Yes | Yes | Yes
Do you check the limitations for the publication stated by your national legislation or by the EU directives? | Yes | Yes | Yes | Yes
Do you check if there are some limitations connected to international relations, public security or national defence? | Yes | Yes | Yes | Yes
Do you check if there are some limitations concerning the public interest? | Yes | Yes | Yes | Yes
Do you check the international law limitations? | Yes | Yes | Yes | Yes
Do you check the INSPIRE law limitations for the spatial data? | Yes | Yes | Yes | Yes

To check | Paid Fees | Scholarships | Total Students | International Students
Do you check that the dataset could be released for free? | Yes | Yes | Yes | Yes
Do you check if there are some agreements with some other partners in order to release the dataset at a reasonable price? | Not needed | Not needed | Not needed | Not needed
Do you check if the open data platform terms of service include a clause of “non liability agreement” regarding the dataset and API provided? | Yes | Yes | Yes | Yes
In case you decide to release the dataset at a reasonable price, do you check if the limitations imposed by the new directive 2019/1024/EU are respected? | Not needed | Not needed | Not needed | Not needed
In case you decide to release the dataset at a reasonable price, do you check the e-Commerce directive and regulation? | Not needed | Not needed | Not needed | Not needed

To check | Paid Fees | Scholarships | Total Students | International Students
Do you have a temporal policy for updating the dataset? | No | No | No | No
Do you have some mechanism for informing the end-user that the dataset is updated at a given time to avoid mis-usage and so potential risk of damage? | No | No | No | No
Did you check if the dataset for some reason can’t be indexed by search engines (e.g. Google, Yahoo, etc.)? | Yes | Yes | Yes | Yes
In case of personal data, do you have a reasonable technical mechanism for collecting requests for deletion (e.g. right to be forgotten)? | Not needed | Not needed | Not needed | Not needed

License Comparison

Lastly, a fundamental aspect of the legal analysis concerned the choice of the license under which to publish the project and its data. The original licenses of the analysed datasets were therefore considered first. Although all datasets were published by the same organization, two different licenses are employed: IODL v2.0 for the datasets on Paid Fees and Scholarships, and A1 Public Domain for the Students and International Students datasets.

Since both are equally (IODL v2.0) or less (A1 Public Domain) restrictive, we decided to publish the output datasets under a CC-BY 4.0 license, which requires only attribution.

To compare the different possible licenses, we referred to the specific documentation of each and to the Licensing Assistant tool provided by data.europa.eu.

Dataset | Original License | Output License
MIUR-Paid Fees (2016-19) | IODL v2.0 | CC-BY 4.0
MIUR-Scholarships (2016-19) | IODL v2.0 | CC-BY 4.0
MIUR-Total Students | A1 Public Domain | CC-BY 4.0
MIUR-International Students | A1 Public Domain | CC-BY 4.0

Ethical Analysis

As seen in the previous sections, each dataset analysed and re-used in this project has been collected from the MIUR-USTAT Open Data platform, which gathers data from Italian universities and publishes them in compliance with Legislative Decree n. 33 of March 14th, 2013 on Publicity, Transparency and Diffusion of Information for Public Administration. As declared on their webpage, these data are undoubtedly extremely sensitive, covering aspects related to the gender, age, residence and citizenship of university students, and the ethical aspects of their handling have to be carefully considered.

As far as we could see, no specific ethical issue was encountered, yet it is worth analysing each dataset in detail:

  • Paid Fees: The data collected in this first set of datasets concern the Academic Year, the Name of the Academic Institution, the Average Fees per paying student and the Average Fees over the total number of students. None of this information can be traced back to single individuals, and it has therefore been considered ethical and usable by the team.
  • Scholarships: The second set of datasets contains information about the scholarships offered by each university, their specific description and amount, divided by the academic level they are intended for (Bachelors and Masters, PhDs, specializations). These data are potentially very sensitive, since they concern the economic means of the enrolled students; they have, however, been aggregated before being published. As a result, no connection to single individuals can be found, so the second set of datasets can also be considered acceptable.
  • Students: The third dataset contains information on the number of students enrolled in each university per Academic Year, divided by gender but again correctly anonymized to avoid any connection to single individuals. However, the distinction by gender was not considered helpful to the aims of this project and could also be perceived as discriminatory. The figures have hence been merged in the creation of the "In itinere" mashed-up datasets.
  • International Students: The last dataset concerns only the number of international students enrolled in each university per Academic Year. Although this is also potentially sensitive information, again all data were anonymized and no connection to individuals could be drawn.

Technical Analysis

All original datasets considered for this project have been provided by the MIUR-USTAT Open Data Portal and are all described following the same metadata schema, which displays notable temporal information (publication date and subsequent modifications). Nonetheless, other pieces of information were missing, such as license and creator: they need to be retrieved elsewhere on the website, which hinders accessibility for users.

An overview of the outcome of the metadata analysis is shown below, highlighting weaknesses and differences between the datasets.

MIUR-Paid Fees MIUR-Scholarships MIUR-Total Students MIUR-International Students
2016 2017 2018 2019 2016 2017 2018 2019 2016-19 2016-19
distribution format
license
last modify
creator
format
creation time
media type
datastore active
has views
id
last modified
license type
on same domain
package id
position
revision id
state
url type

Sustainability

All data and information available on this website have been published under a CC-BY 4.0 license and are compliant with the FAIR principles:

  • Findability: The website is easily findable and correctly indexed, and each mashed-up dataset produced for the performed analysis is uniquely identified.
  • Accessibility: All data collected are accessible both here and in the related GitHub repository, and will remain available and downloadable even if the original source removes them in the future.
  • Interoperability: The mashed-up datasets are described following the DCAT AP version 2.0.0 guidelines for metadata, that are created to be interoperable with other well-known and widely used formats (SKOS, FOAF, PROV-O, DCTerms).
  • Reusability: All data published on the website can be freely reused in accordance with the output license.

As this resource was developed as a final project for the course "Open Access and Digital Ethics" (MA "Digital Humanities and Digital Knowledge", University of Bologna, a.y. 2022/2023), there is currently no intention of updating it in the future. By contrast, the source of the original datasets analysed here does provide annual updates and is the most reliable source of information in this field. Consequently, the team would like to direct any interested users to that platform for further reference on how these data may change in the coming years.

Visualization

Even though the research hypothesis is not confirmed, the data can be used to provide further visualizations. The following section proposes different graphic interpretations of the data, which allow macro- and microtrends to be analysed on both a chronological and a geographical basis. Unless otherwise stated, they have been realized with the Google Developers Charts library: the implementation of algorithms based on this library and on D3.js allows the user to interactively change the displayed information. Please note that most of the provided charts merge variables with different units of measurement (euros and percentages): for the sake of readability, this made it necessary to create a secondary vertical axis. Unfortunately (especially in time series), this could lead to visualization biases: apparently sharp trends may simply result from the different reference values on the second y axis.
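
One possible way to check whether a sharp trend is only an effect of the secondary axis is to index both series to their first year, so that they share a single scale. This is a suggestion and not the method used by the charts on this page. The function index_to_first and the values below are hypothetical.

```python
def index_to_first(values):
    """Rescale a series so that its first observation equals 100."""
    base = values[0]
    if base == 0:
        raise ValueError("cannot index a series whose first value is 0")
    return [round(100 * v / base, 1) for v in values]


# Invented values for one university over the four reference years
avg_fee_eur = [1000.0, 1050.0, 1100.0, 1210.0]  # in €
intl_pct = [10.0, 10.2, 10.1, 10.4]             # in %

print(index_to_first(avg_fee_eur))
print(index_to_first(intl_pct))
```

Once both series are expressed relative to the first year, their growth can be compared directly, regardless of the original unit of measurement.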

Scatterplot:
Select the variables and the reference year. Then click on the button to visualize the result.

X Axis:
Y Axis:
Year:

Timeseries:
Select a university and visualize the change of the observations during the reference period.

The creation of the source data set is implemented in this Python file. The code is also responsible for the creation of parts of the option HTML tags.


Cluster:
A new dataset was created, averaging the values of the timeseries. This clustering is interesting, especially if we compare the composition of the first cluster with that of the other two: public universities tend to populate the former, while private ones populate the latter.
Note that only the institutions present in all four datasets have been considered for clustering (Roma Europea, Roma Link Campus and Roma Saint Camillus are hence excluded).

  • Torino
  • Torino Politecnico
  • Piemonte Orientale
  • Aosta
  • Genova
  • Insubria
  • Milano
  • Milano Politecnico
  • Milano Bicocca
  • Bergamo
  • Brescia
  • Pavia
  • Bolzano
  • Trento
  • Verona
  • Venezia Ca' Foscari
  • Venezia Iuav
  • Padova
  • Udine
  • Trieste
  • Parma
  • Modena e Reggio Emilia
  • Bologna
  • Ferrara
  • Urbino
  • Marche
  • Macerata
  • Camerino
  • Firenze
  • Pisa
  • Siena
  • Siena Stranieri
  • Perugia
  • Perugia Stranieri
  • Tuscia
  • Roma La Sapienza
  • Roma Tor Vergata
  • Roma Foro Italico
  • Roma Tre
  • Cassino
  • Sannio
  • Napoli Federico II
  • Napoli Parthenope
  • Napoli L'Orientale
  • Napoli Benincasa
  • Napoli Vanvitelli
  • Salerno
  • L'Aquila
  • Teramo
  • Chieti e Pescara
  • Molise
  • Foggia
  • Bari
  • Bari Politecnico
  • Salento
  • Basilicata
  • Calabria
  • Catanzaro
  • Reggio Calabria
  • Reggio Calabria - Dante Alighieri
  • Palermo
  • Messina
  • Enna KORE
  • Catania
  • Sassari
  • Cagliari
    • Bra Scienze Gastronomiche
    • Milano Bocconi
    • Milano San Raffaele
    • Rozzano (MI) Humanitas University
    • Roma LUISS
    • Roma Biomedico

    • Castellanza LIUC
    • Milano Cattolica
    • Milano IULM
    • Roma LUMSA
    • Roma UNINT
    • Casamassima - G.Degennaro

Please read the following Jupyter Notebook for the reference code and the 3D visualization of the clusters.
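
The preparation step described above (averaging the time series and keeping only the institutions present in every year) can be sketched as follows. The clustering algorithm itself is not shown, since it is documented in the notebook. The function average_over_years, the data layout and the values are assumptions made for illustration.

```python
def average_over_years(yearly_tables):
    """Average each university's values over all reference years.

    `yearly_tables` maps a year to {university: value}. Universities
    missing from at least one year are dropped, mirroring the exclusion
    described above. The data layout is an assumption.
    """
    tables = list(yearly_tables.values())
    if not tables:
        return {}
    # Universities present in every yearly table
    common = set(tables[0])
    for table in tables[1:]:
        common &= set(table)
    return {
        name: sum(table[name] for table in tables) / len(tables)
        for name in sorted(common)
    }


# Illustrative percentages of international students, not the project's data
pct_by_year = {
    2016: {"Bologna": 6.0, "Pavia": 7.0},
    2017: {"Bologna": 6.5, "Pavia": 7.5},
    2018: {"Bologna": 7.0, "Pavia": 8.0},
    2019: {"Bologna": 7.5, "Pavia": 8.5, "Roma Saint Camillus": 20.0},
}

averaged = average_over_years(pct_by_year)
print(averaged)
```

The institution that appears only in the last year is left out of the averaged table, so every remaining row is based on the same four observations.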


Heatmap:
These maps can be visualized at the following links (they will open in a new window):

  • Scholarships
  • Paid fees
  • International students

Please read the following Jupyter Notebook for the reference code and a customizable visualization of the heatmaps.

Output Metadata

The metadata of the mashed-up datasets follow DCAT AP version 2.0.0, an RDF vocabulary designed to facilitate interoperability between data catalogs published on the Web. The resulting Turtle serialization can be accessed and downloaded by selecting the item of interest in the following list.

The datasets have been described both individually, each as a dcat:Dataset, and as a whole, as a dcat:Catalog gathering them all. The main metadata properties used are summarized in the next table, with specific attention to including both the mandatory elements and as many recommended and optional properties as possible, to meaningfully enrich the collection.

Category | Catalog | Datasets
Identifiers | dcterms:identifier, dcterms:title | dcterms:identifier, dcterms:title
Description | dcterms:description, dcat:keyword | dcterms:description, dcat:keyword
Temporal | dcterms:issued, dcterms:modified | dcterms:issued, dcterms:modified, dcterms:temporal
Spatial |  | dcterms:spatial, prov:wasDerivedFrom
Composition | dcat:datasets |
Agents | dcterms:publisher, dcterms:creator | dcterms:publisher, dcterms:creator
Legal | dcterms:rights, dcterms:license | dcterms:license, dcterms:rightsHolder
Distribution |  | dcat:distribution
Language | dcterms:language | dcterms:language
Web | foaf:homepage |
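
As an illustration of what a dataset description using some of these properties might look like, the following sketch builds a minimal Turtle snippet as plain text. It is not the project's actual metadata file: the base URI is hypothetical, only a few of the listed properties are included, and a dedicated RDF library would be a more robust choice for real serialization.

```python
PREFIXES = """@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
"""


def dataset_to_turtle(base_uri, identifier, title, issued, license_uri):
    """Serialize a minimal dcat:Dataset description as Turtle text.

    Note: string values are inserted as-is, so they must not contain
    double quotes or backslashes (no escaping is performed here).
    """
    subject = f"<{base_uri}{identifier}>"
    lines = [
        f"{subject} a dcat:Dataset ;",
        f'    dcterms:identifier "{identifier}" ;',
        f'    dcterms:title "{title}"@en ;',
        f'    dcterms:issued "{issued}"^^xsd:date ;',
        f"    dcterms:license <{license_uri}> .",
    ]
    return PREFIXES + "\n" + "\n".join(lines) + "\n"


turtle = dataset_to_turtle(
    base_uri="https://example.org/dataset/",  # hypothetical base URI
    identifier="2016",
    title="Initinere 2016",
    issued="2022-11-13",
    license_uri="https://creativecommons.org/licenses/by/4.0/",
)
print(turtle)
```

The identifier, title, creation date and license used in the call are those listed for the first mash-up dataset on this page.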