In Itinere Project

#	2016	2017	2018	2019
1	Perugia Stranieri (39.57%)	Perugia Stranieri (38.70%)	Perugia Stranieri (36.68%)	Perugia Stranieri (36.63%)
2	Rozzano (MI) Humanitas Univ. (25.43%)	Rozzano (MI) Humanitas Univ. (27.19%)	Bra Scienze Gastronomiche (28.24%)	Bra Scienze Gastronomiche (27.98%)
3	Bra Scienze Gastronomiche (23.91%)	Bra Scienze Gastronomiche (26.65%)	Rozzano (MI) Humanitas Univ. (26.68%)	Roma Saint Camillus (27.50%)
4	Bolzano (15.67%)	Reggio Calabria - Dante Alighieri (17.31%)	Reggio Calabria - Dante Alighieri (18.12%)	Rozzano (MI) Humanitas Univ. (25.05%)
5	Torino Politecnico (14.12%)	Bolzano (14.64%)	Milano Politecnico (15.28%)	Reggio Calabria - Dante Alighieri (18.69%)

Source datasets

MIUR

Contribuzione e interventi atenei

MIUR

Spesa per interventi

MIUR

Iscritti per ateneo

MIUR

Iscritti stranieri per anno accademico

Paid Fees 2016

ID: fees2016
Provenance: MIUR (USTAT)
Creation Date: more than 4 years ago
Format: .csv .tsv .json .xml
Metadata: Provided
URI: 2016ContribuzioneMedia

Paid Fees 2017

ID: fees2017
Provenance: MIUR (USTAT)
Creation Date: more than 4 years ago
Format: .csv .tsv .json .xml
Metadata: Provided
URI: 2017ContribuzioneMedia

Paid Fees 2018

ID: fees2018
Provenance: MIUR (USTAT)
Creation Date: more than 3 years ago
Format: .csv .tsv .json .xml
Metadata: Provided
URI: 2018ContribuzioneMedia

Paid Fees 2019

ID: fees2019
Provenance: MIUR (USTAT)
Creation Date: more than 2 years ago
Format: .csv .tsv .json .xml
Metadata: Provided
URI: 2019ContribuzioneMedia

Scholarships 2016

ID: dsu2016
Provenance: MIUR (USTAT)
Creation Date: more than 5 years ago
Format: .csv .tsv .json .xml
Metadata: Provided
URI: 2016SpesaInterventi

Scholarships 2017

ID: dsu2017
Provenance: MIUR (USTAT)
Creation Date: more than 4 years ago
Format: .csv .tsv .json .xml
Metadata: Provided
URI: 2017SpesaInterventi

Scholarships 2018

ID: dsu2018
Provenance: MIUR (USTAT)
Creation Date: more than 3 years ago
Format: .csv .tsv .json .xml
Metadata: Provided
URI: 2018SpesaInterventi

Scholarships 2019

ID: dsu2019
Provenance: MIUR (USTAT)
Creation Date: more than 2 years ago
Format: .csv .tsv .json .xml
Metadata: Provided
URI: 2019SpesaInterventi

Total Students 2016-2019

ID: student
Provenance: MIUR (USTAT)
Creation Date: more than 3 years ago
Format: .csv .tsv .json .xml
Metadata: Provided
URI: IscrittiAteneo

International Students 2016-2019

ID: intstudent
Provenance: MIUR (USTAT)
Creation Date: more than 1 year ago
Format: .csv .tsv .json .xml
Metadata: Provided
URI: IscrittiStranieriAA

Documentation

Table of contents

Introduction & scenario
Original & mash-up datasets
Quality analysis
Legal analysis
Ethical analysis
Technical analysis
Sustainability of the update
Visualization
RDF assertion of the metadata

Introduction & scenario

In itinere is an open data project analyzing the enrollment of international students in Italian universities over a reference period of four academic years (a.y. 2015/2016 - 2018/2019). In recent decades, Italy has stressed the importance of opening academia to international students and researchers. To this end, exchange projects have been funded and international, English-taught degree courses have been opened. Moreover, dedicated institutions have been founded, namely the "Università per stranieri" in Perugia (as early as 1925) and in Siena (1992), and the "Dante Alighieri" in Reggio Calabria (2007).

As mentioned in the previous sections of this webpage, the main goal of the project is to analyze the factors that may influence the choice of a university from the perspective of an international student. After examining the available data, we narrowed this goal down to a more circumscribed research question: are scholarships and fees reliable indicators of the presence of international students in Italian universities? For the sake of simplicity (especially in the implementation of the algorithms), we use "scholarships" as a shorthand for several kinds of expenditure made by universities (including dorms, canteens, international mobility etc.), which are grouped under the Italian term "diritto allo studio".

As regards the availability of open data, we relied on the catalogue dati.gov.it. This platform is part of Italy's ongoing digital transition process, the "Piano di Crescita Digitale": it includes 56,893 datasets, mainly from local and regional administrations. National datasets are scarce (almost 8%), which made it difficult to retrieve exhaustive data for our project. All the source datasets were taken from USTAT, the data portal for tertiary education. Its "Open Data" section is the most up to date, yet some datasets do not provide information after a.y. 2018/19. We also considered integrating them with other sources, such as ISTAT: unfortunately, ISTAT either provides observations referring to different academic years (often older than the ministerial ones) or groups information geographically, which is incompatible with the per-institution subdivision adopted by the MIUR datasets.

To investigate the research question in greater depth, more detailed data would be necessary. In particular, further research would benefit from datasets including observations about:

number of international (mainly English-taught) degree courses and teachings
provenance of international students (from EU, non-EU European country, other continents)
number and breadth of international partnerships per university
average financial background (which can be described through ISEE) of Italian and international students per university

Statement of responsibility:

The project was carried out by:

Federica Bonifazi
Manuele Veggi

The workload was divided as follows:

	Project	Documentation
Federica Bonifazi	Project ideation, Data retrieval, Mashup datasets, Metadata creation, Data legal analysis	4. Legal analysis; 5. Ethical analysis; 6. Technical analysis; 7. Sustainability of the update; 9. RDF assertion of the metadata
Manuele Veggi	Project ideation, Data retrieval, Mashup datasets, Data visualization, Web infrastructure	1. Introduction & scenario; 2. Original & mash-up datasets; 3. Quality analysis; 8. Visualization; Creation of the Jupyter Notebooks

Original datasets

As mentioned, the ten source datasets (downloaded as .csv files) come from the same ministerial portal, USTAT. From now on it will be cited as MIUR, after the domain of its URL, even though the former "Ministero dell'Istruzione, dell'Università e della Ricerca" has since been split into the "Ministero dell'Università e della Ricerca" and the Ministry of Education (whose name has changed considerably under recent governments). As listed in the previous sections, the source datasets are:

Topic	Portal	Dataset(s)
Contribuzione e interventi atenei. Contribuzione media	MIUR	2016, 2017, 2018, 2019
Diritto allo Studio Universitario (DSU) Regionale. Spesa per interventi	MIUR	2016, 2017, 2018, 2019
Iscritti. Iscritti per ateneo	MIUR	2016-2019
Iscritti. Iscritti stranieri per ateneo	MIUR	2016-2019

Before the mashup, the datasets were filtered: both telematic universities and AFAM institutions (fine and performing arts academies) were excluded. Moreover, especially in the first two topics, not all universities are present in all the datasets (which explains considerable differences in some of the visualizations). This is the case for three Roman universities:

Roma Europea, missing in 2016 dataset about scholarships
Roma Link Campus, missing in 2016, 2017 datasets about scholarships
Roma Saint Camillus, present exclusively in 2019 (probably because it was founded only in 2017)

These differences can be easily computed through the function distElement(dframe1, dframe2) in the Python script used to create a temporary dataset for the line chart visualization (see §8).
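The project's own implementation of distElement is not reproduced on this page. The following is only a minimal sketch of how such a comparison could work, assuming pandas data frames and a hypothetical `university` column (the extra `column` parameter is also an assumption added for illustration):

```python
import pandas as pd


def distElement(dframe1, dframe2, column="university"):
    """Return the values of `column` found in exactly one of the two frames.

    The `column` parameter and its default name are assumptions made for
    this sketch; the project's function takes only the two data frames.
    """
    names1 = set(dframe1[column])
    names2 = set(dframe2[column])
    # Symmetric difference: universities missing from one of the two years.
    return sorted(names1 ^ names2)


# Toy data standing in for two yearly scholarship datasets.
dsu_2016 = pd.DataFrame({"university": ["Bologna", "Roma Tre"]})
dsu_2019 = pd.DataFrame(
    {"university": ["Bologna", "Roma Tre", "Roma Saint Camillus"]}
)

missing = distElement(dsu_2016, dsu_2019)
print(missing)
```

A symmetric difference is used here so that the order of the two arguments does not matter; a one-directional difference would work equally well if only newly added institutions are of interest.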

Mash-up datasets

The four output datasets are obtained using the algorithms described in this Jupyter Notebook: please refer to that document for a more precise description of the rationale and of the computational steps behind this phase. We decided to keep the diachronic distinction in the output datasets as well: as a consequence, the association between source and output datasets is the following (indicated by reference year).

Output	Scholarships	Paid fees	Students	International
2016	2016	2016	2016-2019	2016-2019
2017	2017	2017	2016-2019	2016-2019
2018	2018	2018	2016-2019	2016-2019
2019	2019	2019	2016-2019	2016-2019
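The actual algorithm is documented in the notebook. As a rough illustration of the association shown in the table, one yearly output dataset could be assembled along the following lines. All column names (`university`, `year`, `avg_fee`, `dsu_spending`, `students`, `int_students`) and the two derived indicators are assumptions made for this sketch, not the project's schema:

```python
import pandas as pd


def build_mashup(fees, dsu, students, intl, year):
    """Join the four sources for one reference year (hypothetical schema)."""
    # The student datasets cover 2016-2019, so keep only the requested year.
    stud = students.loc[students["year"] == year, ["university", "students"]]
    inter = intl.loc[intl["year"] == year, ["university", "int_students"]]

    # Inner joins keep only the universities present in every source.
    out = fees.merge(dsu, on="university").merge(stud, on="university")
    out = out.merge(inter, on="university")

    # Assumed definitions of the derived indicators.
    out["avg_scholarship"] = out["dsu_spending"] / out["students"]
    out["pct_international"] = 100 * out["int_students"] / out["students"]
    return out


# Toy data, not taken from the MIUR datasets.
fees_2016 = pd.DataFrame({"university": ["A", "B"], "avg_fee": [1000.0, 1500.0]})
dsu_2016 = pd.DataFrame({"university": ["A", "B"], "dsu_spending": [20000.0, 5000.0]})
students = pd.DataFrame({
    "university": ["A", "B", "A", "B"],
    "year": [2016, 2016, 2017, 2017],
    "students": [100, 50, 110, 55],
})
intl = pd.DataFrame({
    "university": ["A", "B", "A", "B"],
    "year": [2016, 2016, 2017, 2017],
    "int_students": [10, 5, 12, 6],
})

mashup_2016 = build_mashup(fees_2016, dsu_2016, students, intl, 2016)
print(mashup_2016)
```

With inner joins, an institution missing from any source for the given year silently drops out of the output, which is one way the gaps described above could propagate to the visualizations.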

Quality analysis

The quality analysis of the datasets follows the guidelines available on the Italian official governmental portal "Docs Italia". The four reference categories are accuracy, consistency, completeness and timeliness. They are drawn from AgID Determinazione Commissariale n. 68/2013, which states (the passage is given here in English translation):

In relation to the specific context of use and to the purposes pursued by the regulation, critical databases must ensure the intrinsic value of the data, so that the attributes of the data are adequate with respect to the "inherent" characteristics defined in the aforementioned ISO/IEC 25012 standard, summarized below:

accuracy: the datum, and its attributes, correctly represents the real value of the concept or event it refers to;

currentness (or timeliness of update): the datum, and its attributes, is of the "right time" (it is up to date) with respect to the procedure it refers to;

consistency: the datum, and its attributes, does not contradict other data in the context of use of the administration that owns it;

completeness: the datum is exhaustive for all its expected values and with respect to the related entities (sources) that contribute to the definition of the procedure

Agenzia per l'Italia Digitale, Determinazione Commissariale n. 68/2013 DIG, article 4

As stated on the aforementioned portal, all the provided data are guaranteed to respect these four criteria. Indeed, the datasets are generally of good quality, yet minor flaws can be highlighted. The following table provides a concise description (datasets are analyzed by typology and the chronological distinction is not reflected, as data concerning the same field appear to be consistent):

Dataset	Accuracy	Completeness	Timeliness
Scholarships	Only in some cases are branch campuses considered separately from the main campus	Missing information about Roma - Link Campus University (2016, 2017) and Roma - Università Europea (2016)	No information available after a.y. 2018/2019
Paid fees	No distinction between a university's branch campuses and its main campus		No information available after a.y. 2018/2019
Total students
International students

Legal Analysis

To analyse the legal aspects of the original datasets used for this project, the following checklist was adopted to evaluate a number of specific topics: privacy issues, IPR policy, licenses, limitations on public access, economic conditions and temporal aspects of the datasets.

To check	Paid Fees	Scholarships	Total Students	International Students
Free of any personal data as defined in the Regulation (EU) 2016/679?	Yes	Yes	Yes	Yes
Free of any indirect personal data that could be used for identifying the natural person?	Yes	Yes	Yes	Yes
Free of any particular personal data (art. 9 GDPR)?	Yes	Yes	Yes	Yes
Free of any information that combined with common data available in the web, could identify the person?	Yes	Yes	Yes	Yes
Free of any information related to human rights?	Yes	Yes	Yes	Yes
Do you use a tool for calculating the range of the risk of de-anonymization?	Not needed	Not needed	Not needed	Not needed
Are you using geolocalization capabilities?	Yes	Yes	Yes	Yes
Did you check that the open data platform respect all the privacy regulations?	Yes	Yes	Yes	Yes
Do you know who the Controller and the Processor of the personal data are in your open data platform?	Yes	Yes	Yes	Yes
Have you checked the privacy regulation of the country where the datasets are physically stored?	Yes	Yes	Yes	Yes
Do you have non-personal data?	Yes	Yes	Yes	Yes

To check	Paid Fees	Scholarships	Total Students	International Students
Have you created and generated the dataset?	No	No	No	No
Are you the owner of the dataset?	No	No	No	No
Are the datasets free from third-party licenses or patents?	Yes	Yes	Yes	Yes
Have you checked whether your national legal system places limitations on releasing some kinds of datasets with an open license?	Yes	Yes	Yes	Yes

To check	Paid Fees	Scholarships	Total Students	International Students
Do you release the dataset with an open data license?	Yes	Yes	Yes	Yes
Do you include the clause: "In any case the dataset can’t be used for re-identifying the person"?	No	No	No	No
Do you release the API (in case you have one) with an open source license?	Not needed	Not needed	Not needed	Not needed
Do you check that the open data/API platform license regime is compliant with your IPR policy?	Not needed	Not needed	Not needed	Not needed

To check	Paid Fees	Scholarships	Total Students	International Students
Do you check that the dataset concerns your institutional competences, scope and finality?	Yes	Yes	Yes	Yes
Do you check the limitations for the publication stated by your national legislation or by the EU directives?	Yes	Yes	Yes	Yes
Do you check if there are some limitations connected to the international relations, public security or national defence?	Yes	Yes	Yes	Yes
Do you check if there are some limitations concerning the public interest?	Yes	Yes	Yes	Yes
Do you check the international law limitations?	Yes	Yes	Yes	Yes
Do you check the INSPIRE law limitations for the spatial data?	Yes	Yes	Yes	Yes

To check	Paid Fees	Scholarships	Total Students	International Students
Do you check that the dataset could be released for free?	Yes	Yes	Yes	Yes
Do you check if there are agreements with other partners in order to release the dataset at a reasonable price?	Not needed	Not needed	Not needed	Not needed
Do you check if the open data platform terms of service include a clause of “non liability agreement” regarding the dataset and API provided?	Yes	Yes	Yes	Yes
In case you decide to release the dataset at a reasonable price, do you check if the limitations imposed by the new directive 2019/1024/EU are respected?	Not needed	Not needed	Not needed	Not needed
In case you decide to release the dataset at a reasonable price, do you check the e-Commerce directive and regulation?	Not needed	Not needed	Not needed	Not needed

To check	Paid Fees	Scholarships	Total Students	International Students
Do you have a temporal policy for updating the dataset?	No	No	No	No
Do you have some mechanism for informing the end-user that the dataset is updated at a given time, to avoid misuse and so potential risk of damage?	No	No	No	No
Did you check if the dataset for some reason can’t be indexed by search engines (e.g. Google, Yahoo, etc.)?	Yes	Yes	Yes	Yes
In case of personal data, do you have a reasonable technical mechanism for collecting request of deletion (e.g. right to be forgotten)?	Not needed	Not needed	Not needed	Not needed

License Comparison

Lastly, a fundamental aspect of the legal analysis concerned the choice of the license under which to publish the project and its data. We therefore first considered the original licenses of the analysed datasets. Despite all having been published by the same organization, they carry two different licenses: IODL v2.0 for the datasets on Paid Fees and Scholarships, and A1 Public Domain for Total Students and International Students.

Since both are equally (IODL v2.0) or less (A1 Public Domain) restrictive, we decided to publish the output datasets under a CC-BY 4.0 license, which requires only attribution.

To compare the possible licenses, we referred to the documentation of each license and to the Licensing Assistant tool provided by data.europa.eu.

	Original License	Output License
MIUR-Paid Fees (2016-19)	IODL v2.0	CC-BY 4.0
MIUR-Scholarships (2016-19)	IODL v2.0	CC-BY 4.0
MIUR-Total Students	A1 Public Domain	CC-BY 4.0
MIUR-International Students	A1 Public Domain	CC-BY 4.0

Ethical Analysis

As seen in the previous sections, each dataset analysed and re-used in this project was collected from the MIUR-USTAT Open Data platform, which gathers data from Italian universities and publishes them in compliance with Legislative Decree n. 33 of 14 March 2013 on publicity, transparency and diffusion of information by public administrations. As declared on their webpage, these data are undoubtedly extremely sensitive, covering aspects related to the gender, age, residence and citizenship of university students, and the ethical aspects of their handling have to be carefully considered.

As far as we could see, no specific ethical issue was encountered, yet it is worth analysing each dataset in detail:

Paid Fees: The data collected in this first set of datasets cover Academic Year, Name of the Academic Institution, Average Fees on paying students and Average Fees on total number of students. None of this information can be traced back to single individuals, and the datasets were therefore considered ethical and usable by the team.
Scholarships: The second set of datasets contains information about the scholarships offered by each university, their specific description and amount, divided by the academic level they are intended for (Bachelor's and Master's degrees, PhDs, specializations). This kind of data is potentially very sensitive, since it concerns the economic means of the enrolled students, but it was aggregated before being published. As a result, again no connection to single individuals can be found, so the second set of datasets can also be considered acceptable.
Students: The third dataset contains information on the number of students enrolled in each university per Academic Year, divided by gender but again correctly anonymized to avoid any connection to single individuals. However, the distinction by gender was not considered helpful for the aims of this project and could also be perceived as discriminatory. The figures were hence merged in the creation of the "In itinere" mashed-up datasets.
International Students: The last dataset concerns only the number of international students enrolled in each university per Academic Year. Although this is also potentially sensitive information, again all data were anonymized and no link to individuals could be established.

Technical Analysis

All the original datasets considered for this project were provided by the MIUR-USTAT Open Data Portal and are all described following the same metadata schema, which displays notable temporal information (publication date and subsequent modifications). Nonetheless, other pieces of information were missing, such as license and creator: they need to be retrieved elsewhere on the website, which hampers accessibility for users.

An overview of the results of the metadata analysis is shown below, highlighting weaknesses and differences between the datasets.

	MIUR-Paid Fees				MIUR-Scholarships				MIUR-Total Students	MIUR-International Students
	2016	2017	2018	2019	2016	2017	2018	2019	2016-19	2016-19
distribution format
license
last modify
creator
format
creation time
media type
datastore active
has views
id
last modified
license type
on same domain
package id
position
revision id
state
url type

Sustainability

All data and information available on this website have been published under a CC-BY 4.0 license and are compliant with the FAIR principles:

Findability: The website is easily findable and correctly indexed, and each mashed-up dataset produced for the performed analysis is uniquely identified.
Accessibility: All data collected are accessible both here and in the related GitHub repository, and will remain available and downloadable even if the original source removes them in the future.
Interoperability: The mashed-up datasets are described following the DCAT AP version 2.0.0 guidelines for metadata, that are created to be interoperable with other well-known and widely used formats (SKOS, FOAF, PROV-O, DCTerms).
Reusability: All data published on the website can be freely reused in accordance with the output license.

Since the project was developed as a final project for the course "Open Access and Digital Ethics" (MA "Digital Humanities and Digital Knowledge", University of Bologna, a.y. 2022/2023), there is currently no intention of updating the resource in the future. By contrast, the source of the original datasets analysed here does provide annual updates and is the most reliable source of information in this field. Consequently, the team would like to direct any interested users to that platform for further reference on how these data may change in the coming years.

Visualization

Even though the hypothesis underlying the research question was not confirmed, the data can be used to provide further visualizations. The following section proposes different graphic interpretations of the data, which make it possible to analyse macro- and microtrends on both a chronological and a geographical basis. They were built (unless stated otherwise) with the Google Developers Charts library: algorithms based on this library and on D3.js allow the user to interactively change the displayed information. Please note that most of the charts combine variables with different units of measurement (euros and percentages): for the sake of readability, this made it necessary to create a secondary vertical axis. Unfortunately (especially in time series), this can lead to visualization biases: apparently sharp trends may simply result from the different reference values on the second y axis.
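One possible way to check whether a trend seen on a dual-axis chart is substantial is to re-express each series as an index relative to its first year, so that euros and percentages share a single scale. This is not part of the project's code; the sketch below uses made-up values and a hypothetical helper `to_index`:

```python
def to_index(series, base=100.0):
    """Rescale a series so that its first value equals `base`."""
    first = series[0]
    return [round(base * value / first, 1) for value in series]


# Illustrative values only, not taken from the project's datasets.
avg_fee_eur = [1200.0, 1230.0, 1260.0, 1290.0]
pct_international = [4.0, 4.2, 4.1, 4.4]

# After indexing, both series start at 100 and can be compared directly.
print(to_index(avg_fee_eur))
print(to_index(pct_international))
```

On a shared index scale, a line that looked steep only because of the range chosen for the secondary axis shows up as a change of a few index points.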

Scatterplot:
Select the variables and the reference year. Then click on the button to visualize the result.

X Axis:	Paid fee Average scholarship per student % of international students
Y Axis:	Paid fee Average scholarship per student % of international students
Year:	2015/16 2016/17 2017/18 2018/19

Timeseries:
Select a university and visualize the change of the observations during the reference period.

The creation of the source data set is implemented in this Python file. The code is also responsible for the creation of parts of the option HTML tags.

Cluster:
A new dataset was created by averaging the values of the time series. The resulting clustering is interesting, especially if we compare the composition of the first cluster with that of the other two: public universities tend to populate the former, private ones the latter.
Note that only the institutions present in all four datasets have been considered for clustering (Roma Europea, Roma Link Campus and Roma Saint Camillus are hence excluded).

Cluster group #1

Torino
Torino Politecnico
Piemonte Orientale
Aosta
Genova
Insubria
Milano
Milano Politecnico
Milano Bicocca
Bergamo
Brescia
Pavia
Bolzano
Trento
Verona
Venezia Ca' Foscari
Venezia Iuav
Padova
Udine
Trieste
Parma
Modena e Reggio Emilia
Bologna
Ferrara
Urbino
Marche
Macerata
Camerino
Firenze
Pisa
Siena
Siena Stranieri
Perugia
Perugia Stranieri
Tuscia
Roma La Sapienza
Roma Tor Vergata
Roma Foro Italico
Roma Tre
Cassino
Sannio
Napoli Federico II
Napoli Parthenope
Napoli L'Orientale
Napoli Benincasa
Napoli Vanvitelli
Salerno
L'Aquila
Teramo
Chieti e Pescara
Molise
Foggia
Bari
Bari Politecnico
Salento
Basilicata
Calabria
Catanzaro
Reggio Calabria
Reggio Calabria - Dante Alighieri
Palermo
Messina
Enna KORE
Catania
Sassari
Cagliari

Cluster group #2

Bra Scienze Gastronomiche
Milano Bocconi
Milano San Raffaele
Rozzano (MI) Humanitas University
Roma LUISS
Roma Biomedico

Cluster group #3

Castellanza LIUC
Milano Cattolica
Milano IULM
Roma LUMSA
Roma UNINT
Casamassima - G.Degennaro

Please read the following Jupyter Notebook for the reference code and the 3D visualization of the clusters
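The reference code lives in the notebook. As a rough illustration of the averaging step that precedes the clustering, the sketch below averages each institution's values over the yearly datasets and drops institutions that are missing from any year. It assumes pandas, a hypothetical `university` column and at most one row per institution per year; the clustering itself is not shown, since the algorithm used is not stated on this page.

```python
import pandas as pd


def average_over_years(yearly_frames, value_columns):
    """Average each university's values over all the yearly datasets.

    Universities missing from at least one year are dropped, mirroring the
    exclusion rule stated for the clusters. Assumes one row per university
    in each yearly frame.
    """
    stacked = pd.concat(yearly_frames, ignore_index=True)
    counts = stacked.groupby("university").size()
    complete = counts[counts == len(yearly_frames)].index
    kept = stacked[stacked["university"].isin(complete)]
    return kept.groupby("university")[value_columns].mean()


# Toy data, not taken from the project's datasets.
y2018 = pd.DataFrame({"university": ["A", "B"], "avg_fee": [1000.0, 2000.0]})
y2019 = pd.DataFrame(
    {"university": ["A", "B", "C"], "avg_fee": [1200.0, 2200.0, 900.0]}
)

averaged = average_over_years([y2018, y2019], ["avg_fee"])
print(averaged)
```

The resulting table, with one row per institution, is the kind of input a clustering routine would then receive.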

Heatmap:
These maps can be visualized at the following links (they will redirect you to a new window):

Scholarships
Paid fees
International students

Please read the following Jupyter Notebook for the reference code and customizable visualization of the heatmaps

Output Metadata

The metadata of the mashed-up datasets are specified following DCAT AP version 2.0.0, an RDF vocabulary designed to facilitate interoperability between data catalogs published on the Web. The resulting Turtle serialization can be accessed and downloaded by selecting the item of interest in the following list.

The datasets have been described both individually, each as a dcat:Dataset, and as a whole, as a dcat:Catalog gathering them all. The main metadata properties used are summarized in the next table; particular attention was paid to including both the mandatory elements and as many recommended and optional properties as possible, in order to meaningfully enrich the collection.

	Catalog	Datasets
Identifiers	dcterms:identifier, dcterms:title	dcterms:identifier, dcterms:title
Description	dcterms:description, dcat:keyword	dcterms:description, dcat:keyword
Temporal	dcterms:issued, dcterms:modified	dcterms:issued, dcterms:modified, dcterms:temporal
Spatial		dcterms:spatial, prov:wasDerivedFrom
Composition	dcat:dataset
Agents	dcterms:publisher, dcterms:creator	dcterms:publisher, dcterms:creator
Legal	dcterms:rights, dcterms:license	dcterms:license, dcterms:rightsHolder
Distribution		dcat:distribution
Language	dcterms:language	dcterms:language
Web	foaf:homepage
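To give an idea of what such a description looks like, the sketch below builds a small Turtle snippet for one dcat:Dataset using a few of the properties from the table. It is not the project's actual metadata: the dataset URI, identifier, title and issue date are placeholders, and the helper `dataset_to_turtle` is hypothetical.

```python
PREFIXES = """@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
"""


def dataset_to_turtle(uri, identifier, title, issued, license_uri, keywords):
    """Serialize a few dcat:Dataset properties as a Turtle snippet.

    Values are inserted verbatim, so this sketch assumes they contain no
    double quotes or other characters that would need escaping.
    """
    keyword_list = ", ".join(f'"{k}"@en' for k in keywords)
    return (
        f"{PREFIXES}\n"
        f"<{uri}> a dcat:Dataset ;\n"
        f'    dcterms:identifier "{identifier}" ;\n'
        f'    dcterms:title "{title}"@en ;\n'
        f'    dcterms:issued "{issued}"^^xsd:date ;\n'
        f"    dcterms:license <{license_uri}> ;\n"
        f"    dcat:keyword {keyword_list} .\n"
    )


turtle = dataset_to_turtle(
    uri="https://example.org/dataset/mashup2016",  # placeholder URI
    identifier="mashup2016",  # hypothetical identifier
    title="In itinere mash-up dataset 2016",  # hypothetical title
    issued="2023-01-01",  # placeholder date
    license_uri="https://creativecommons.org/licenses/by/4.0/",
    keywords=["international students", "scholarships", "fees"],
)
print(turtle)
```

For real metadata, a dedicated RDF library would be a safer choice than string formatting, since it handles escaping and validation of the serialization.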

In itinere

An open data project on international students in Italian universities

RESEARCH QUESTION & METHODOLOGY

Are the amounts of paid fees and of available scholarships
correlated with the enrolment of international students in Italian universities?
