Educational Big Data Analysis – Digital Education Summit 2021

By Donggil Song

The Digital Education Summit (DES) is a virtual, one-day teaching and learning conference, which promotes best practices and methodologies for online educators in K-12 and higher education. Fortunately, I was invited as a presenter at the 2021 DES.

The topic was one of my ongoing projects, Educational Big Data Analysis of Student Behaviors in a Learning Management System (LMS) (Funded by SHSU* and supported by SHSU Online).

Because the complete results will be published at some point, I can’t post everything here, but let me share a few things that might be helpful for those who are interested in educational big data analysis. The data was extracted from a learning management system, Blackboard Online (BB), which is used at the university. Every LMS has pros and cons, but I feel like BB is somewhat (or quite) outdated and not user-friendly (I’m using it for my online courses).

The dataset is huge and keeps growing, so for this pilot analysis I focused only on the Spring 2021 data at the university level. Even so, it’s not small: the student activity dataset alone is 42.7GB. It covers all sorts of student activities, recorded whenever a student does “something” on Blackboard, for example, posting to a discussion, submitting an assignment, or reading materials. The dataset is purely a textual record, or footprint, and does not include any videos or files. Yes, a purely textual student activity dataset for one semester at one university can take up 42.7GB.
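With a text log of that size, one common approach is to stream the file row by row instead of loading it all into memory. Below is a minimal sketch of that idea, not the pipeline used in this project. The column names (student_id, timestamp, event_type) and the sample rows are made up for illustration; a real Blackboard export will look different.

```python
import csv
import io
from collections import Counter

def count_events(lines, type_column="event_type"):
    """Stream rows one at a time and tally how often each event type occurs."""
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row[type_column]] += 1
    return counts

# Tiny in-memory stand-in for the real export (hypothetical columns and rows).
# With a file on disk you could pass open(path, newline="") instead, which is
# also read lazily, one line at a time.
sample = io.StringIO(
    "student_id,timestamp,event_type\n"
    "s1,2021-01-06T09:00:00,syllabus\n"
    "s1,2021-01-06T09:05:00,discussion\n"
    "s2,2021-01-07T14:30:00,syllabus\n"
)
event_counts = count_events(sample)
print(event_counts.most_common())
```

Because only one row is held at a time, memory use stays roughly constant no matter how large the file is.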

I followed the regular process of big data analysis. Since the data is very messy, data preprocessing (or data cleaning) took almost 80% of the project time. Data types, specifically student activity types, are not carefully categorized by the BB system. Is it intentional, to make the data difficult for us to trace or understand? If you look at the BB database schema (https://help.blackboard.com/Learn/Administrator/Hosting/Databases/Open_Database_Schema), you’ll understand what I mean.

Although the preprocessing was tough and extremely time-consuming, it guided my feature selection, which is somewhat consistent with the current literature. The 10 features are Discussion, Grade, Group, Module, Announcement, Syllabus, External, Guide, Task, and Material. “Feature” is just a machine learning term; you can think of features as “variables.” These 10 variables might be related to student performance (their final grades).
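To give a feel for what categorizing messy activity types can look like, here is a minimal sketch that maps raw activity labels onto the 10 features with simple keyword rules. The keywords and the sample labels are my own assumptions for illustration (for instance, treating “assignment” as a Task); they are not the actual rules or the actual Blackboard labels from this project.

```python
FEATURES = ["Discussion", "Grade", "Group", "Module", "Announcement",
            "Syllabus", "External", "Guide", "Task", "Material"]

# Hypothetical keyword rules, checked in order; real labels will differ.
KEYWORD_RULES = [
    ("discussion", "Discussion"),
    ("forum", "Discussion"),
    ("grade", "Grade"),
    ("group", "Group"),
    ("module", "Module"),
    ("announcement", "Announcement"),
    ("syllabus", "Syllabus"),
    ("external", "External"),
    ("guide", "Guide"),
    ("assignment", "Task"),
    ("task", "Task"),
    ("content", "Material"),
    ("material", "Material"),
]

def categorize(raw_label):
    """Return the first feature whose keyword appears in the label, else None."""
    text = raw_label.strip().lower()
    for keyword, feature in KEYWORD_RULES:
        if keyword in text:
            return feature
    return None  # uncategorized; worth reviewing by hand

for label in ["  Discussion Board/Forum ", "Assignment Upload", "Study Guide", "login"]:
    print(repr(label), "->", categorize(label))
```

Returning None for anything unmatched makes it easy to list the leftover labels and refine the rules, which is where much of the cleaning time tends to go.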

My focus was on students’ behavioral patterns in terms of these 10 features. For example, one group of students (and I would be among them if I were a student) might check the syllabus and tasks a lot in the first week, while another group might review study guides a lot in the second week, something like that. I assume that there are different types of learners who show different behavioral patterns in an LMS.
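One way to make “behavioral patterns per week” concrete is to count, for each student and each week, how many events fall under each feature. The sketch below does that under some assumptions of mine: events arrive as (student_id, date, feature) tuples, and SEMESTER_START is a placeholder date, not the real academic calendar. Week 0 is taken to be the seven days before the semester starts.

```python
from collections import defaultdict
from datetime import date

FEATURES = ["Discussion", "Grade", "Group", "Module", "Announcement",
            "Syllabus", "External", "Guide", "Task", "Material"]

SEMESTER_START = date(2021, 1, 13)  # placeholder, not the actual start date

def week_index(day, semester_start=SEMESTER_START):
    """Week 1 begins on semester_start; Week 0 is the seven days before it."""
    return (day - semester_start).days // 7 + 1

def weekly_counts(events):
    """Map (student_id, week) to a list of event counts, one per feature."""
    table = defaultdict(lambda: [0] * len(FEATURES))
    for student_id, day, feature in events:
        table[(student_id, week_index(day))][FEATURES.index(feature)] += 1
    return dict(table)

# Hypothetical events for two students.
events = [
    ("s1", date(2021, 1, 8), "Syllabus"),
    ("s1", date(2021, 1, 9), "Syllabus"),
    ("s1", date(2021, 1, 9), "Task"),
    ("s2", date(2021, 1, 20), "Guide"),
]
vectors = weekly_counts(events)
for key in sorted(vectors):
    print(key, vectors[key])
```

Each (student, week) pair ends up with a 10-number vector, which is the kind of input a pattern-finding method can work with.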

Since I had the final grade dataset, I could have done a classification analysis. However, I was not interested in creating a prediction model in this pilot project. Simply, I just wanted to find some patterns first.

After figuring out some distinct groups of students, I wanted to check the final grades of each group. In the machine learning field, we call such a group a cluster, and the process of finding the groups is called clustering.
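For readers new to clustering, here is a minimal sketch of k-means, one common clustering algorithm. This post does not say which algorithm was used in the project, so treat k-means as my stand-in for illustration only. The sample points are invented two-feature vectors (say, Syllabus and Guide counts), and the starting centers are picked with a simple farthest-first rule so the result is deterministic.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the center closest to point (first one wins on ties)."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))

def kmeans(points, k, max_iter=100):
    # Farthest-first initialization: start from the first point, then keep
    # adding the point that is farthest from all centers chosen so far.
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        centers.append(list(farthest))
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(p, centers) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical weekly vectors: [Syllabus count, Guide count] per student.
points = [[9, 1], [8, 2], [1, 7], [0, 9], [10, 0]]
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

In practice a library implementation would be the more usual choice; the point here is only to show the two alternating steps, assigning each student to the nearest center and then moving each center to the mean of its members.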

To make a long story short, in Week 0 (one week prior to Spring 2021), 4 groups were identified based on their activity patterns across the 10 features above. After finding these 4 clusters, I checked their final grades. A normalized result can be seen below. Group 1 (Blue) is the highest-performing group, and Group 4 is the lowest. The result shows that the activity levels of the mid-low (Group 3) and lowest (Group 4) groups are impressive in Week 0. Interestingly, Group 3 is much more active than the mid-high and highest groups in Week 0.
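The “cluster first, then check grades” step can be sketched as follows. Everything here is assumed for illustration: the cluster labels and grade points are invented, grades are treated as numbers on a 4-point scale, and min-max scaling is just one possible normalization, since the method behind the normalized result is not specified in this post.

```python
from collections import defaultdict
from statistics import mean

def rank_clusters_by_grade(labels, grades):
    """Return (ranks, means): rank 1 is the cluster with the highest mean grade."""
    by_cluster = defaultdict(list)
    for label, grade in zip(labels, grades):
        by_cluster[label].append(grade)
    means = {label: mean(values) for label, values in by_cluster.items()}
    ordered = sorted(means, key=means.get, reverse=True)
    ranks = {label: rank for rank, label in enumerate(ordered, start=1)}
    return ranks, means

def min_max(values):
    """Rescale values to the range 0 to 1; a constant list maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

# Hypothetical cluster labels (one per student) and final grade points.
labels = [0, 0, 1, 1, 0]
grades = [2.0, 3.0, 4.0, 3.0, 1.0]
ranks, means = rank_clusters_by_grade(labels, grades)
print(ranks, means)

# Hypothetical per-cluster activity totals for one feature, rescaled so that
# features with very different raw counts can share one chart.
print(min_max([4, 10, 7]))
```

Note that the grades are never used to form the clusters; they are only looked at afterwards, which is what makes a gap between activity and performance visible.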

The analysis does not tell us why. It could be that they are trying to look at the course as early as possible to decide whether to drop it, or that they are very passionate learners even before the semester begins. More interestingly, the following weeks told different stories, and Week 15 (the final week of Spring 2021) revealed an obvious but shocking result, which will be published in my upcoming paper. Because of the results, there was a somewhat heated discussion at the end of my presentation. Some professors were even frustrated by the results, which differed from what they had believed about the students in their courses. This might be the beauty of data analytics 🙂

In conclusion, educational big data analysis can be extremely time-consuming, but we can find some exciting stories that have been buried in the huge data mountain. I hope I can find some time to write more technical tutorials on computational data analysis techniques at some point.

*Analyzing Educational Big Data of Student Behavior in Online Courses for Dropout Prevention. Interdisciplinary Collaborations Program, Sam Houston State University. Song, D. (PI), Angrove, B. (Co-PI), & Price, D. (Co-PI). Contract Amount: $14,347. June 2020 – August 2021.