Data linkage can sound a bit daunting and perhaps boring for those who have not come across it in their research experience. I thought I would go over some of the basics so you can start to think more about the potential for data linkage when planning your own research. Data linkage is slowly becoming more common and can be extremely useful to enrich the data used in your research.
What is data linkage?
Data linkage, as it sounds, is the joining of two or more data sets that hold information on the same participant. The information could come from a similar time, or it could let you look at different time points in the life course of the participant. For example, we are currently looking at data from a child’s School Entrant Health Questionnaire (SEHQ), when they are five years old, linked to NAPLAN results when they are nine. This has made a wealth of information available on a child’s demographics, health and development and their later education outcomes. We can now ask ‘which health conditions have a significant impact on education outcomes?’ or ‘does attendance at early childhood services improve education outcomes?’ These questions can have important implications for policy, service delivery and service evaluation.
The simplest way to link data is when both data sets contain the same unique identifier, such as a student identification number or a medical record number. It is then very easy to match the data on this identifier. Unfortunately, in Victoria this is not a common occurrence. Legislation in different states and countries often dictates whether data linkage is possible and how easy it is. Researchers such as Fiona Stanley have advocated for and pioneered data linkage at a population level to significantly enhance the power of research.
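To make the idea of matching on a unique identifier concrete, here is a minimal sketch in Python. The field names (student_id, asthma, reading_score) and all of the records are invented for illustration, and a real linkage would be carried out on much larger files by a linkage agency rather than by hand.

```python
# Hypothetical records: field names and values are invented for illustration.
health = [
    {"student_id": "S001", "asthma": True},
    {"student_id": "S002", "asthma": False},
    {"student_id": "S003", "asthma": False},
]
results = [
    {"student_id": "S001", "reading_score": 412},
    {"student_id": "S003", "reading_score": 455},
    {"student_id": "S004", "reading_score": 390},
]


def link_on_identifier(left, right, key):
    """Return one merged record for every identifier present in both data sets."""
    # Index the second data set by identifier so each lookup is a single step.
    right_by_key = {row[key]: row for row in right}
    linked = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Combine the fields from both records into one linked record.
            linked.append({**row, **match})
    return linked


linked = link_on_identifier(health, results, "student_id")
for record in linked:
    print(record)
```

Only S001 and S003 appear in both data sets, so only those two children end up in the linked data. S002 and S004 drop out, which is exactly the kind of loss that needs to be checked for bias later on.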
When there is no unique identifier, linking must rely on common variables within both data sets that help to ensure the participant in one data set is the same person being linked to in the other. Such variables are usually name, sex, age and postcode. A Statistical Linkage Key (SLK) combines letters and numbers drawn from several of these variables to create an identifier for each participant, and is a common system for linkage. The letters used are the 2nd, 3rd and 5th letters of the family name and the 2nd and 3rd letters of the given name.
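The letter rule described above can be sketched in a few lines of Python. The letter positions come from the description above; the other details are assumptions on my part, following the SLK-581 convention as I understand it: a missing letter is padded with "2", and the key is completed with the date of birth (ddmmyyyy) and a numeric sex code. The function names and the example person are hypothetical.

```python
import re
from datetime import date


def pick_letters(name, positions):
    """Take the letters at the given 1-based positions, ignoring non-letters."""
    letters = re.sub(r"[^A-Za-z]", "", name).upper()
    # Assumption: a name too short to supply a letter is padded with "2".
    return "".join(letters[p - 1] if p <= len(letters) else "2" for p in positions)


def statistical_linkage_key(family_name, given_name, date_of_birth, sex_code):
    """Build a linkage key from name letters, date of birth and sex code."""
    return (
        pick_letters(family_name, (2, 3, 5))   # 2nd, 3rd and 5th letters of family name
        + pick_letters(given_name, (2, 3))     # 2nd and 3rd letters of given name
        + date_of_birth.strftime("%d%m%Y")     # assumption: ddmmyyyy
        + str(sex_code)                        # assumption: numeric sex code
    )


# Invented example person, not real data.
key = statistical_linkage_key("Smith", "Jane", date(2010, 2, 1), 2)
print(key)  # MIHAN010220102
```

Two different people can occasionally produce the same key, and a misspelt name or wrong birth date produces a different key for the same person, which is one reason linked data always needs checking.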
Currently, data linkage for research is commonly carried out by an independent data linkage agency that takes the data from the data custodians, links it and then gives the de-identified linked data to the researchers. This is an extremely simplified description of a costly and lengthy process.
When looking at a linked data set it is important to account for any bias that may have been introduced. For example, one of the data sets may be missing some of the cohort. The variables used for linking can also introduce additional bias. In the SEHQ-NAPLAN data, one data set only has information on children attending government schools. Additionally, the data was linked on where the children lived, so any child who moved may not be included. It is important to look at who within both data sets was and was not linked and how these children may differ. As a guide, a data linkage rate of around 60% is quite good.
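One simple way to start looking at who was and was not linked is to compare the linkage rate across groups. The sketch below does this in Python for a made-up cohort; the field names (child_id, moved) and the records are hypothetical and are only there to show the calculation.

```python
def linkage_rate_by_group(records, linked_ids, id_field, group_field):
    """Return the share of records that were linked, for each value of group_field."""
    counts = {}
    for row in records:
        group = counts.setdefault(row[group_field], {"linked": 0, "total": 0})
        group["total"] += 1
        if row[id_field] in linked_ids:
            group["linked"] += 1
    return {name: c["linked"] / c["total"] for name, c in counts.items()}


# Invented cohort: whether each child moved house between the two collections.
cohort = [
    {"child_id": 1, "moved": False},
    {"child_id": 2, "moved": False},
    {"child_id": 3, "moved": False},
    {"child_id": 4, "moved": True},
    {"child_id": 5, "moved": True},
]
linked_ids = {1, 2, 3, 4}  # identifiers that were found in both data sets

overall_rate = len(linked_ids) / len(cohort)
rates = linkage_rate_by_group(cohort, linked_ids, "child_id", "moved")
print(overall_rate)
print(rates)
```

In this toy cohort every child who stayed put was linked but only half of those who moved were, so any analysis of the linked data would under-represent children who moved.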
Lastly, I will touch on the ethics of data linkage. Participants may have agreed to take part in the initial study you have undertaken, but it is likely that they have not agreed to that data being linked to other information about them. If you can include the potential for data linkage in the initial participant consent, headaches along the way can be avoided.
I hope this brief introduction to the ins and outs of data linkage will be of use in your future research. I imagine that the more researchers advocate for and use data linkage, the more accessible it will become.
Written by Shae Johnson
Cohort and Platform Data Programs Coordinator
Jack Brockhoff Child Health & Wellbeing Program