In this guide, I'll walk you through analyzing New York City's high school data to identify relationships between various factors and SAT scores. This guided project, Analyzing NYC High School Data, will help you develop hands-on experience in data analysis, visualization, and statistical correlation using Python.
We'll take on the role of a data analyst investigating the potential relationships between SAT scores and factors like school safety, demographic makeup, and academic programs in NYC high schools. The ultimate question we'll explore: Is the SAT a fair test, or do certain demographic factors correlate strongly with performance?
What You'll Learn:
- How to combine and clean multiple datasets to create a comprehensive analysis
- How to identify and visualize correlations between different variables
- How to analyze demographic factors like race, gender, and socioeconomic status in relation to test scores
- How to draw meaningful insights from data visualizations
- How to organize an exploratory data analysis workflow in Python
Before getting into this project, you should be fairly fluent with basic Python skills like lists, dictionaries, loops, and conditional logic. You should also have some familiarity with pandas, matplotlib, and basic statistical concepts like correlation. If you need to brush up on these skills, check out our Python for Data Science learning path.
Now, let's dive into our analysis!
Step 1: Understanding the Data
For this project, we'll be working with several datasets related to New York City high schools:
- SAT scores: Contains average scores for each high school
- School demographics: Information about race, gender, and other demographic factors
- AP test results: Data on Advanced Placement test participation
- Graduation outcomes: Graduation rates for each school
- Class size: Information about class sizes
- School surveys: Results from surveys given to students, teachers, and parents
- School directory: Additional information about each school
Each dataset contains information about NYC high schools, but the data is split across different files. Our first task will be to combine these datasets into one comprehensive dataset for analysis.
Step 2: Setting Up the Environment
If you're working on this project within the Dataquest platform, your environment is already set up. If you're working locally, you'll need:
- Python environment: Make sure you have Python 3.x installed with the pandas, numpy, matplotlib, and re (regex) libraries.
- Jupyter Notebook: Install Jupyter Notebook or JupyterLab to work with the provided .ipynb file.
- Data files: Download the dataset files from the project page.
Let's start by importing the required libraries and loading our datasets:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
%matplotlib inline
data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"
]
data = {}
for f in data_files:
    d = pd.read_csv("schools/{0}".format(f))
    data[f.replace(".csv", "")] = d

all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
survey = pd.concat([all_survey, d75_survey], axis=0)
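Note what `pd.concat` with `axis=0` does here: it stacks the two survey files row-wise, and any column present in only one file becomes `NaN` in the rows coming from the other. A minimal sketch with made-up one-row frames (the values are illustrative, not real survey data):

```python
import pandas as pd

# Two toy "survey" frames with overlapping but not identical columns
a = pd.DataFrame({"dbn": ["01M292"], "rr_s": [89]})
b = pd.DataFrame({"dbn": ["75K004"], "rr_t": [62]})

# axis=0 stacks rows; columns missing from one frame become NaN
stacked = pd.concat([a, b], axis=0)
print(stacked.shape)  # (2, 3)
```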
The code above loads each dataset into a dictionary called data, where the keys are the file names (without the .csv extension) and the values are the pandas DataFrames containing the data.
The survey data is stored in tab-delimited ("\t") text files, so we need to specify the delimiter to load it correctly. For the survey data, we also need to standardize the dbn (District Borough Number) column, which is the unique identifier for each school:
survey["DBN"] = survey["dbn"]
survey_fields = [
    "DBN",
    "rr_s",
    "rr_t",
    "rr_p",
    "N_s",
    "N_t",
    "N_p",
    "saf_p_11",
    "com_p_11",
    "eng_p_11",
    "aca_p_11",
    "saf_t_11",
    "com_t_11",
    "eng_t_11",
    "aca_t_11",
    "saf_s_11",
    "com_s_11",
    "eng_s_11",
    "aca_s_11",
    "saf_tot_11",
    "com_tot_11",
    "eng_tot_11",
    "aca_tot_11",
]
survey = survey.loc[:, survey_fields]
data["survey"] = survey
Step 3: Data Cleaning and Preparation
Before we can analyze the data, we need to clean and prepare it. This involves standardizing column names, converting data types, and extracting geographic information:
# Standardize the DBN column name in hs_directory
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]

# Helper function to zero-pad single-digit CSD values
def pad_csd(num):
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    else:
        return "0" + string_representation

# Create a DBN column in class_size
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]
# Convert the SAT score columns to numeric and compute the total SAT score
cols = ["SAT Critical Reading Avg. Score", "SAT Math Avg. Score", "SAT Writing Avg. Score"]
for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")
data["sat_results"]["sat_score"] = data["sat_results"][cols[0]] + data["sat_results"][cols[1]] + data["sat_results"][cols[2]]
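The `errors="coerce"` argument matters here: the raw score columns contain non-numeric placeholder entries, and coercion turns those into `NaN` instead of raising an exception. A quick illustration on a made-up series (the "s" placeholder is illustrative):

```python
import pandas as pd

# "s" stands in for a non-numeric placeholder in a raw score column
scores = pd.Series(["404", "423", "s"])
numeric = pd.to_numeric(scores, errors="coerce")
print(numeric.isna().sum())  # 1 — the non-numeric entry becomes NaN
```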
This kind of data cleaning is critical in real-world data analysis. If you want to learn more about data cleaning techniques, check out Dataquest's Data Cleaning in Python course.
We also need to extract geographic coordinates from the location field for mapping purposes:
# Extract latitude and longitude from the Location 1 field
def find_lat(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    lat = re.findall(r"[+-]?\d+\.\d+", coords[0])[0]
    return lat

def find_lon(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    lon = re.findall(r"[+-]?\d+\.\d+", coords[0])[1]
    return lon

data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)
data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")
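To see what these regexes do, here's a self-contained sketch run on a hypothetical Location 1 value (the address and coordinates are made up, but the layout matches the directory's "address, then a (lat, lon) pair" format):

```python
import re

def find_lat(loc):
    # Grab the "(lat, lon)" portion, then the first signed decimal number in it
    coords = re.findall(r"\(.+, .+\)", loc)
    return re.findall(r"[+-]?\d+\.\d+", coords[0])[0]

# Hypothetical Location 1 value in the directory's format
loc = "883 Classon Avenue\nBrooklyn, NY 11225\n(40.670299, -73.961648)"
print(find_lat(loc))  # 40.670299
```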
Now we can combine all the datasets into a single DataFrame, using DBN as the key to merge on:
# Combine the datasets
combined = data["sat_results"]
combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")
combined = combined.merge(data["class_size"], on="DBN", how="left")
combined = combined.merge(data["demographics"], on="DBN", how="left")
combined = combined.merge(data["survey"], on="DBN", how="left")
combined = combined.merge(data["hs_directory"], on="DBN", how="left")

# Fill missing values: column means first, then 0 for anything still missing
combined = combined.fillna(combined.mean(numeric_only=True))
combined = combined.fillna(0)

# Add a school district column for mapping
def get_first_two_chars(dbn):
    return dbn[0:2]

combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)
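As a side note, the `apply` call for the district column can also be written with pandas' vectorized string accessor, which does the same two-character slice without a helper function:

```python
import pandas as pd

dbn = pd.Series(["01M292", "32K556"])
# .str[:2] slices every string in the Series at once
print(dbn.str[:2].tolist())  # ['01', '32']
```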
When you run this code, you'll have a single DataFrame called combined that contains all the data from the separate datasets. We've filled missing values with the mean of each column, and any values still missing after that (in columns with no numeric values) with 0.
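The two-step fill can be seen on a tiny made-up frame. Note that recent pandas versions require `numeric_only=True` when taking the mean of a frame that also has string columns:

```python
import pandas as pd

df = pd.DataFrame({"sat_score": [1200.0, None, 1400.0], "name": [None, "A", "B"]})

# Step 1: numeric gaps get the column mean
df = df.fillna(df.mean(numeric_only=True))
# Step 2: anything still missing (non-numeric columns) becomes 0
df = df.fillna(0)
print(df["sat_score"].tolist())  # [1200.0, 1300.0, 1400.0]
```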
If we examine the first few rows of our combined dataset using combined.head(), we'll see that it has over 160 columns! That's a lot of data to process, so we'll need to be strategic about how we explore it.
Step 4: Finding Relationships – Correlation Analysis
Since we have so many variables, a good place to start is to identify which ones have the strongest relationships with SAT scores. We can calculate the correlation between each variable and the total SAT score:
correlations = combined.corr(numeric_only=True)["sat_score"]

# Display the strongest negative correlations
print("Strongest negative correlations:")
correlations_ascending = correlations.sort_values()
print(correlations_ascending.head(10))

# Display the strongest positive correlations
print("\nStrongest positive correlations:")
correlations_descending = correlations.sort_values(ascending=False)
print(correlations_descending.head(10))
Running this code produces the following output:
Strongest negative correlations:
frl_percent        -0.722225
ell_percent        -0.663531
sped_percent       -0.448065
hispanic_per       -0.394415
black_per          -0.363847
total_enrollment   -0.308963
female_per         -0.213514
male_per           -0.213514
Asian_per          -0.152571
grade9_perc         0.034778
Name: sat_score, dtype: float64

Strongest positive correlations:
sat_score             1.000000
sat_w                 0.986296
sat_m                 0.972415
sat_r                 0.968731
white_per             0.661642
asian_per             0.570730
total_exams_taken     0.448799
AP Test Takers        0.429749
high_score_percent    0.394919
selfselect_percent    0.369467
Name: sat_score, dtype: float64
This gives us a good starting point for our investigation. We can see that SAT scores have:
- Strong negative correlations with:
  - frl_percent (percentage of students on free/reduced lunch, an indicator of poverty)
  - ell_percent (percentage of English Language Learners)
  - sped_percent (percentage of students in special education)
  - hispanic_per (percentage of Hispanic students)
  - black_per (percentage of Black students)
- Strong positive correlations with:
  - The individual SAT section scores (which makes sense)
  - white_per (percentage of white students)
  - asian_per (percentage of Asian students)
  - Various AP test measures
These correlations suggest that demographic factors like race, socioeconomic status, and English language proficiency may have significant relationships with SAT scores. Let's explore these relationships in more detail.
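If correlation coefficients are new to you, here's the idea on a made-up four-school frame: a coefficient near -1 means the two columns move in opposite directions, near +1 means they move together, and near 0 means no linear relationship:

```python
import pandas as pd

df = pd.DataFrame({
    "sat_score":   [1100, 1200, 1300, 1400],
    "frl_percent": [80, 60, 40, 20],  # made-up values, perfectly anticorrelated
})
r = df.corr(numeric_only=True)["sat_score"]["frl_percent"]
print(round(r, 1))  # -1.0
```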
Step 5: Analyzing Survey Results
As a former high school teacher myself, I've seen firsthand how factors like school safety and academic engagement can affect student performance. Let's look at how survey responses correlate with SAT scores. The survey covers academics, safety, communication, and engagement, as rated by students, teachers, and parents:
# Explore correlations between the survey fields and SAT scores
# (skip "DBN", the first entry in survey_fields, since it isn't numeric)
survey_correlations = combined.corr(numeric_only=True)["sat_score"][survey_fields[1:]]
survey_correlations = survey_correlations.sort_values(ascending=False)

# Visualize the correlations
survey_correlations.plot.barh(figsize=(10, 10))
plt.axvline(x=0.25, linestyle='--')
plt.title("Correlation between Survey Data and SAT Scores")
plt.xlabel("Correlation Coefficient")
When you run this code, you'll see a horizontal bar chart showing the correlation between each survey field and SAT scores. The dashed line at 0.25 helps us identify the stronger correlations (those extending beyond it).
The strongest correlations appear to be with:
- The number of student and parent survey responses (N_s and N_p) ― suggesting that more engaged school communities have higher SAT scores
- Student perceptions of academics (aca_s_11) ― when students view academics positively, SAT scores tend to be higher
- Safety perceptions (saf_s_11, saf_t_11, saf_tot_11) ― schools perceived as safer tend to have higher SAT scores
Instructor Insight: The strong correlation between survey response rates and SAT scores isn't surprising ― schools where parents and students take the time to complete surveys are likely communities where education is highly valued. That engagement often translates into better academic preparation and support for standardized tests like the SAT.
Let's explore the relationship between safety perceptions and SAT scores in more detail:
combined.plot.scatter(x="saf_s_11", y="sat_score")
plt.title("Safety vs. SAT Scores")
plt.xlabel("Student Safety Perception")
plt.ylabel("Average SAT Score")
The resulting scatter plot shows a general upward trend – as safety perception increases, SAT scores tend to increase as well. But there's considerable variation, suggesting that safety isn't the only factor at play.
Let's check whether safety perceptions differ across NYC boroughs:
borough_safety = combined.groupby("borough")[["sat_score", "saf_s_11", "saf_t_11", "saf_p_11", "saf_tot_11"]].mean()
borough_safety
This produces a table showing average safety perceptions and SAT scores by borough:
sat_score saf_s_11 saf_t_11 saf_p_11 saf_tot_11
borough
Bronx 1126.692308 6.606087 6.539855 7.585507 6.837678
Brooklyn 1230.068966 6.836087 7.259420 7.942391 7.230797
Manhattan 1340.052632 7.454412 7.548529 8.603676 7.790441
Queens 1317.279412 7.149338 7.565476 8.193878 7.533921
Staten Island 1376.600000 6.450000 7.325000 7.966667 7.175000
Interestingly, while there are significant differences in SAT scores across boroughs (with the Bronx having the lowest average and Staten Island the highest), the differences in safety perceptions are less pronounced. This suggests that while safety may correlate with SAT scores overall, borough-level safety differences aren't necessarily driving borough-level SAT score differences.
Instructor Insight: One pattern I found particularly interesting when exploring this data is that parent perceptions of safety are consistently about a point higher than student perceptions across all boroughs. From my experience in schools, this makes sense – parents aren't in the building every day and may not be aware of the day-to-day dynamics that students experience. Students have a much more immediate and granular understanding of safety issues in their schools.
If you're interested in learning more about data aggregation techniques like the groupby() method we used above, check out Dataquest's Pandas Fundamentals course.
Step 6: Race Demographics and SAT Scores
The correlation analysis showed that race demographics have strong relationships with SAT scores. Let's visualize these correlations:
race_fields = [
    "white_per",
    "asian_per",
    "black_per",
    "hispanic_per"
]

# Correlation between race demographics and SAT scores
race_correlations = combined.corr(numeric_only=True)["sat_score"][race_fields]
race_correlations.plot.barh()
plt.title("Correlation between Race Demographics and SAT Scores")
plt.xlabel("Correlation Coefficient")
When you run this code, you'll see a bar chart showing that white and Asian percentages have strong positive correlations with SAT scores, while Black and Hispanic percentages have negative correlations.
Let's take a closer look at the relationship between Hispanic percentage and SAT scores, which showed one of the strongest negative correlations:
combined.plot.scatter(x="hispanic_per", y="sat_score")

# Add a regression line (numpy was imported earlier as np)
x = combined["hispanic_per"]
y = combined["sat_score"]
m, b = np.polyfit(x, y, 1)
plt.plot(x, m*x + b, '-', color='red')
plt.title("Hispanic Population Percentage vs. SAT Scores")
plt.xlabel("Hispanic Population Percentage")
plt.ylabel("Average SAT Score")
The scatter plot with the regression line clearly shows the negative relationship – as the percentage of Hispanic students increases, the average SAT score tends to decrease.
Instructor Insight: When creating this visualization, I chose to add a regression line to make the trend clearer. It's a small but powerful enhancement to a scatter plot – while the naked eye can detect a general pattern in the data points, the regression line gives a more precise representation of the relationship.
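For readers who haven't used np.polyfit before: with degree 1 it returns the least-squares slope and intercept. On noise-free toy points taken from y = 2x + 1, it recovers the coefficients exactly (up to floating-point error):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1  # points on a known line

# A degree-1 fit returns (slope, intercept)
m, b = np.polyfit(x, y, 1)
print(round(m, 6), round(b, 6))  # 2.0 1.0
```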
To understand this relationship better, let's look at the schools with very high Hispanic populations:
selected_columns = ["school_name", "num_of_sat_test_takers", "hispanic_per", "sat_score", "sped_per", "ell_per", "school_dist", "borough", "saf_s_11"]
high_hispanic = combined.sort_values("hispanic_per", ascending=False)[selected_columns].head(5)
high_hispanic
This outputs the 5 schools with the highest Hispanic percentages:
school_name num_of_sat_test_takers hispanic_per sat_score sped_per ell_per school_dist borough saf_s_11
58 WASHINGTON HEIGHTS EXPEDITIONARY LEARNING SCHOOL 66 93.918919 1126.515152 14.779270 12.018348 06 Manhattan 6.816092
62 GREGORIO LUPERON HIGH SCHOOL 53 96.164384 1021.245283 15.963855 51.807229 06 Manhattan 6.885291
66 MULTICULTURAL HIGH SCHOOL 23 97.297297 1023.217391 16.447368 34.210526 09 Brooklyn 7.656250
68 PAN AMERICAN INTERNATIONAL HIGH SCHOOL 44 97.917847 1045.750000 10.363636 75.454545 24 Queens 7.976744
69 INTERNATIONAL HIGH SCHOOL AT UNION SQUARE 49 94.871795 1172.653061 10.638298 84.042553 02 Manhattan 7.767857
A pattern emerges: these schools have very high percentages of English Language Learners (ELL), ranging from 12% to 84%. This suggests that language barriers may be a significant factor behind the correlation between Hispanic percentage and SAT scores, since the SAT is administered in English and demands strong English language skills.
Instructor Insight: To better understand these schools, I did some additional research (as we would in a real data analysis project) by searching for information about them online. What I discovered was fascinating ― all of these schools are part of New York City's Outward Bound program, which specifically focuses on serving English language learners. This provides important context for our numerical findings. Although these schools have lower SAT scores, they are specialized programs designed to support a specific student population with unique challenges.
For comparison, let's look at schools with very low Hispanic percentages and high SAT scores:
low_hispanic_high_sat = combined[(combined["hispanic_per"] < 10) & (combined["sat_score"] > 1800)].sort_values("sat_score", ascending=False)[selected_columns].head(5)
low_hispanic_high_sat
This shows us the top-performing schools with low Hispanic populations:
school_name num_of_sat_test_takers hispanic_per sat_score sped_per ell_per school_dist borough saf_s_11
37 STUYVESANT HIGH SCHOOL 832 2.699055 2096.105769 0.321812 0.000000 02 Manhattan 8.092486
86 BRONX HIGH SCHOOL OF SCIENCE, THE 731 7.132868 1941.606018 0.395778 0.197889 10 Bronx 7.058323
83 STATEN ISLAND TECHNICAL HIGH SCHOOL 226 7.563025 1933.893805 0.883392 0.000000 31 Staten Island 8.333333
51 HIGH SCHOOL OF AMERICAN STUDIES AT LEHMAN COLLEGE 114 9.322034 1925.657895 0.855856 0.000000 10 Bronx 8.545455
32 QUEENS HIGH SCHOOL FOR THE SCIENCES 78 9.230769 1877.307692 0.000000 0.000000 28 Queens 8.592593
These schools have very different profiles ― they have virtually no English Language Learners and very low percentages of special education students.
Instructor Insight: On further investigation, I found that all five of these are specialized high schools in NYC that select students based on academic performance through competitive admissions processes. What's particularly noteworthy is that despite New York City having a diversity initiative for specialized high schools, none of these top-performing schools have elected to participate in it. This raises important questions about educational equity and access to high-quality education across different demographic groups.
This analysis suggests that the SAT may disadvantage English Language Learners, who can struggle with the test's language demands regardless of their academic abilities.
Step 7: Gender and SAT Scores
Let's examine how gender demographics relate to SAT scores:
gender_fields = ["male_per", "female_per"]
gender_correlations = combined.corr(numeric_only=True)["sat_score"][gender_fields]
gender_correlations.plot.barh()
plt.title("Correlation between Gender Demographics and SAT Scores")
plt.xlabel("Correlation Coefficient")
The correlation between gender and SAT scores is relatively weak but interesting – there's a positive correlation with female percentage and a negative correlation with male percentage.
Let's visualize this relationship:
combined.plot.scatter(x="female_per", y="sat_score")
plt.axvspan(40, 60, alpha=0.2, color='red')  # shade the 40-60% female range
plt.title("Female Population Percentage vs. SAT Scores")
plt.xlabel("Female Population Percentage")
plt.ylabel("Average SAT Score")
The highlighted band marks what might be considered a typical gender balance (40-60% female). Schools with high SAT scores appear across a range of gender compositions, but let's check whether average SAT scores differ for schools with very high or very low female percentages:
print("Average SAT score for schools with >60% female students:", combined[combined["female_per"] > 60]["sat_score"].mean())
print("Average SAT score for schools with <40% female students:", combined[combined["female_per"] < 40]["sat_score"].mean())
This gives us:
Average SAT score for schools with >60% female students: 1301.8308823529412
Average SAT score for schools with <40% female students: 1204.4375
That's a gap of nearly 100 points! Schools with predominantly female populations have higher average SAT scores than schools with predominantly male populations.
Let's look at some of these high-female, high-SAT schools:
high_female_columns = ["school_name", "female_per", "sat_score", "sat_m", "sat_r", "sat_w"]
combined.sort_values(["female_per", "sat_score"], ascending=False)[high_female_columns].head(5)
The result shows specialized schools focused on the arts, which have historically been popular with female students. As with the race analysis, we're seeing that specialized schools play a role in these demographic patterns.
Instructor Insight: This finding particularly resonates with my personal experience. Before becoming a data analyst, I attended a performing-arts-focused high school with a similar gender distribution ― over 60% female. These specialized schools often have strong academic programs alongside their focus areas, which may help explain the correlation between female percentage and SAT scores. It's a reminder that behind every data point is a complex institutional story that the numbers alone can't fully capture.
Step 8: AP Test Participation and SAT Scores
Finally, let's look at Advanced Placement (AP) test participation and its relationship with SAT scores. AP courses are college-level classes that high school students can take, potentially earning college credit:
# Compute the share of AP test takers relative to total enrollment
combined["ap_per"] = combined["AP Test Takers "] / combined["total_enrollment"]

# Visualize the relationship between AP test share and SAT scores
combined.plot.scatter(x="ap_per", y="sat_score")
plt.title("AP Test Takers Percentage vs. SAT Scores")
plt.xlabel("AP Test Takers Percentage")
plt.ylabel("Average SAT Score")
The scatter plot doesn't show a very clear trend, and there appears to be a vertical line of points around the 1200 mark ― that's actually an artifact of our data processing. When we filled missing values with the mean, it created this vertical pattern.
Instructor Insight: When analyzing real-world data, it's important to recognize patterns that result from data cleaning decisions. I always recommend adding notes about such artifacts to your analysis to prevent misinterpretation. If this were a professional report, I would explicitly mention that this vertical line doesn't represent a natural clustering in the data.
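One way to confirm this kind of artifact, sketched on a toy series: after mean-filling, every filled row holds exactly the column's original mean, so a spike of identical values at the mean is the giveaway:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, None, None, 6.0])
filled = s.fillna(s.mean())  # mean of the observed values is 3.0

# The two filled rows now sit exactly at the mean
n_at_mean = (filled == s.mean()).sum()
print(n_at_mean)  # 2
```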
Let's calculate the correlation to get a more precise measure:
correlation = combined["ap_per"].corr(combined["sat_score"])
print(f"Correlation between AP test percentage and SAT scores: {correlation:.4f}")
This gives:
Correlation between AP test percentage and SAT scores: 0.0566
The correlation is very weak (less than 0.06), suggesting that the share of students taking AP tests at a school is not strongly related to the school's average SAT score. That's somewhat surprising, since we might expect schools with more AP participants to have higher SAT scores.
Results
Looking beyond the code and the outputs, this project demonstrates the power of exploratory data analysis to uncover social patterns and potential inequities in educational systems. As someone who worked in education before transitioning to data analysis, I find these kinds of investigations particularly meaningful.
Our analysis reveals several important insights about NYC high schools and SAT scores:
- Socioeconomic factors have some of the strongest correlations with SAT scores. Schools with higher percentages of students on free/reduced lunch tend to have lower SAT scores.
- English language learning appears to be a significant factor. Schools with high percentages of ELL students (many of which also have high Hispanic populations) tend to have lower SAT scores, which raises questions about the fairness of using the SAT for college admissions.
- Specialized schools show a different pattern. NYC's specialized high schools, which select students based on academic performance, have very high SAT scores and different demographic compositions from other schools.
- Safety and academic engagement correlate with higher SAT scores, though the relationship is complex and varies across boroughs.
- Gender composition shows interesting patterns, with predominantly female schools having higher average SAT scores than predominantly male schools.
These findings suggest that the SAT may not be a fully fair measure of student aptitude, since scores correlate strongly with factors like English language proficiency and socioeconomic status. This aligns with broader criticisms of standardized testing, which have led many colleges to move away from requiring SAT scores in recent years.
Next Steps
This analysis barely scratches the surface of what could be done with this rich dataset. With over 160 columns of data, there are many more relationships to explore. Based on my experience with educational data, here are some compelling directions for further analysis:
- Analyze class size: As a former teacher, I've seen how class size affects learning. I'd be curious to see whether smaller classes correlate with higher SAT scores, particularly in schools serving disadvantaged populations.
- Explore school districts rather than boroughs: Our analysis looked at borough-level patterns, but NYC has dozens of school districts that may show more nuanced patterns. Grouping by district instead of borough could reveal more localized relationships between demographics, safety, and test scores.
- Consider property values: Education and housing are closely linked. Combining this dataset with property value information could reveal interesting patterns about economic segregation and educational outcomes.
- Create a more equitable school ranking system: Most ranking systems heavily prioritize test scores, which our analysis suggests may disadvantage certain populations. A more holistic ranking system that accounts for factors like English language learner support, improvement over time, and socioeconomic context could provide a fairer assessment of school quality.
To take your data analysis skills further, consider exploring Dataquest's other courses and guided projects in the Data Scientist in Python career path.
I personally made the transition from teaching to data analysis through platforms like Dataquest, and working on projects like this one was instrumental in building my skills and portfolio. Don't be afraid to spend several hours on a project like this – the deeper you dig, the more insights you'll uncover, and the more you'll learn.
If you have questions or want to share your work on this project, feel free to join the discussion on our Community forums. You can even tag me (@Anna_Strahl) if you'd like specific feedback on your approach!
On this information, I am going to stroll you thru analyzing New York Metropolis’s highschool information to establish relationships between numerous elements and SAT scores. This guided undertaking, Analyzing NYC Excessive College Knowledge, will provide help to develop hands-on expertise in information evaluation, visualization, and statistical correlation utilizing Python.
We’ll assume the position of an information analyst investigating the potential relationships between SAT scores and elements like college security, demographic make-up, and educational applications in NYC excessive faculties. The last word query we’ll discover: Is the SAT a good check, or do sure demographic elements correlate strongly with efficiency?
What You will Study:
- Tips on how to mix and clear a number of datasets to create a complete evaluation
- Tips on how to establish and visualize correlations between totally different variables
- Tips on how to analyze demographic elements like race, gender, and socioeconomic standing in relation to check scores
- How to attract significant insights from information visualizations
- Tips on how to set up an exploratory information evaluation workflow in Python
Earlier than stepping into this undertaking, you have to be pretty fluent with fundamental Python expertise like lists, dictionaries, loops, and conditional logic. You also needs to have some familiarity with pandas
, matplotlib
, and fundamental statistical ideas like correlation. If you have to brush up on these expertise, take a look at our Python for Knowledge Science studying path.
Now, let’s dive into our evaluation!
Step 1: Understanding the Knowledge
For this undertaking, we’ll be working with a number of datasets associated to New York Metropolis excessive faculties:
- SAT scores: Incorporates common scores for every highschool
- College demographics: Details about race, gender, and different demographic elements
- AP check outcomes: Knowledge on Superior Placement check participation
- Commencement outcomes: Commencement charges for every college
- Class measurement: Details about class sizes
- College surveys: Outcomes from surveys given to college students, lecturers, and oldsters
- College listing: Further details about every college
Every dataset incorporates details about NYC excessive faculties, however they’re separated throughout totally different information. Our first process will likely be to mix these datasets into one complete dataset for evaluation.
Step 2: Setting Up the Environment
If you're working on this project within the Dataquest platform, your environment is already set up. If you're working locally, you'll need:
- Python environment: Ensure you have Python 3.x installed with the pandas, numpy, matplotlib, and re (regex) libraries.
- Jupyter Notebook: Install Jupyter Notebook or JupyterLab to work with the provided .ipynb file.
- Data files: Download the dataset files from the project page.
Let's start by importing the required libraries and loading our datasets:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
%matplotlib inline

data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"
]

data = {}
for f in data_files:
    d = pd.read_csv("schools/{0}".format(f))
    data[f.replace(".csv", "")] = d

all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding="windows-1252")
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding="windows-1252")
survey = pd.concat([all_survey, d75_survey], axis=0)
The code above loads each dataset into a dictionary called data, where the keys are the file names (without the .csv extension) and the values are the pandas DataFrames containing the data.
The survey data is stored in tab-delimited ("\t") text files, so we need to specify the delimiter to load it correctly. For the survey data, we also need to standardize the DBN (District Borough Number) column, which is the unique identifier for each school:
survey["DBN"] = survey["dbn"]
survey_fields = [
"DBN",
"rr_s",
"rr_t",
"rr_p",
"N_s",
"N_t",
"N_p",
"saf_p_11",
"com_p_11",
"eng_p_11",
"aca_p_11",
"saf_t_11",
"com_t_11",
"eng_t_11",
"aca_t_11",
"saf_s_11",
"com_s_11",
"eng_s_11",
"aca_s_11",
"saf_tot_11",
"com_tot_11",
"eng_tot_11",
"aca_tot_11",
]
survey = survey.loc[:, survey_fields]
data["survey"] = survey
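If pd.concat is new to you, here's a minimal sketch with toy survey-like DataFrames (the values below are made up for illustration) showing how axis=0 stacks the rows of the second frame under the first:

```python
import pandas as pd

# Two small survey-like frames with the same columns (toy data)
all_survey = pd.DataFrame({"dbn": ["01M015", "01M019"], "rr_s": [89, 84]})
d75_survey = pd.DataFrame({"dbn": ["75K004"], "rr_s": [73]})

# axis=0 concatenates row-wise, so the result has 2 + 1 = 3 rows
survey = pd.concat([all_survey, d75_survey], axis=0)
print(len(survey))  # 3
```

Note that the row index values from each source frame are kept as-is, which is why you'll sometimes see `ignore_index=True` used to renumber them.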
Step 3: Data Cleaning and Preparation
Before we can analyze the data, we need to clean and prepare it. This involves standardizing column names, converting data types, and extracting geographic information:
# Standardize the DBN column name in hs_directory
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]
# Helper function to pad CSD values to two characters
def pad_csd(num):
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    else:
        return "0" + string_representation
# Create the DBN column in class_size
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]
# Convert SAT score columns to numeric and calculate the total SAT score
cols = ["SAT Critical Reading Avg. Score", "SAT Math Avg. Score", "SAT Writing Avg. Score"]
for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")
data["sat_results"]["sat_score"] = data["sat_results"][cols[0]] + data["sat_results"][cols[1]] + data["sat_results"][cols[2]]
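As a side note, pandas string methods can do the padding that pad_csd does by hand. This sketch (on made-up CSD values) shows str.zfill producing the same two-character strings:

```python
import pandas as pd

# Toy CSD values: a mix of single- and double-digit district numbers
csd = pd.Series([1, 13, 2])

# zfill left-pads each string with zeros to the given width, just like pad_csd
padded = csd.astype(str).str.zfill(2)
print(padded.tolist())  # ['01', '13', '02']
```

Either approach works; the explicit helper function in the project is arguably easier to read for beginners.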
This kind of data cleaning is essential in real-world data analysis. If you want to learn more about data cleaning techniques, check out Dataquest's Data Cleaning in Python course.
We also need to extract geographic coordinates from the location field for mapping purposes:
# Extract latitude and longitude from the Location 1 field
def find_lat(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    lat = re.findall(r"[+-]?\d+\.\d+", coords[0])[0]
    return lat
def find_lon(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    lon = re.findall(r"[+-]?\d+\.\d+", coords[0])[1]
    return lon
data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)
data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")
Now we can combine all the datasets into a single DataFrame, using the DBN as the key to merge them:
# Combine the datasets
combined = data["sat_results"]
combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")
combined = combined.merge(data["class_size"], on="DBN", how="left")
combined = combined.merge(data["demographics"], on="DBN", how="left")
combined = combined.merge(data["survey"], on="DBN", how="left")
combined = combined.merge(data["hs_directory"], on="DBN", how="left")
# Fill missing values with column means, then 0 for anything still missing
combined = combined.fillna(combined.mean(numeric_only=True))
combined = combined.fillna(0)
# Add a school district column for mapping
def get_first_two_chars(dbn):
    return dbn[0:2]
combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)
When you run this code, you'll have a single DataFrame called combined that contains all the data from the separate datasets. We've filled missing values with the mean of each column, and where a column has no numeric values, we've filled it with 0.
If we examine the first few rows of our combined dataset using combined.head(), we'll see that it has over 160 columns! That's a lot of data to process, so we'll need to be strategic about how we explore it.
Step 4: Finding Relationships – Correlation Analysis
Since we have so many variables, a good place to start is by identifying which ones have the strongest relationships with SAT scores. We can calculate the correlation between each variable and the total SAT score:
correlations = combined.corr(numeric_only=True)["sat_score"]
# Display the strongest negative correlations
print("Strongest negative correlations:")
correlations_ascending = correlations.sort_values()
print(correlations_ascending.head(10))
# Display the strongest positive correlations
print("\nStrongest positive correlations:")
correlations_descending = correlations.sort_values(ascending=False)
print(correlations_descending.head(10))
Running this code produces the following output:
Strongest negative correlations:
frl_percent         -0.722225
ell_percent         -0.663531
sped_percent        -0.448065
hispanic_per        -0.394415
black_per           -0.363847
total_enrollment    -0.308963
female_per          -0.213514
male_per            -0.213514
Asian_per           -0.152571
grade9_perc          0.034778
Name: sat_score, dtype: float64

Strongest positive correlations:
sat_score             1.000000
sat_w                 0.986296
sat_m                 0.972415
sat_r                 0.968731
white_per             0.661642
asian_per             0.570730
total_exams_taken     0.448799
AP Test Takers        0.429749
high_score_percent    0.394919
selfselect_percent    0.369467
Name: sat_score, dtype: float64
This gives us a good starting point for our investigation. We can see that SAT scores have:
- Strong negative correlations with:
  - frl_percent (percentage of students on free/reduced lunch, an indicator of poverty)
  - ell_percent (percentage of English Language Learners)
  - sped_percent (percentage of students in special education)
  - hispanic_per (percentage of Hispanic students)
  - black_per (percentage of Black students)
- Strong positive correlations with:
  - The individual SAT section scores (which makes sense)
  - white_per (percentage of white students)
  - asian_per (percentage of Asian students)
  - Various AP test measures
These correlations suggest that demographic factors like race, socioeconomic status, and English language proficiency may have significant relationships with SAT scores. Let's explore these relationships in more detail.
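The pattern used above – compute .corr(), index into the column of interest, then sort – works on any DataFrame. Here it is on a toy frame (made-up numbers) with one column that rises with the target and one that falls:

```python
import pandas as pd

# Toy data: "falls" is perfectly negatively related to sat_score,
# "rises" increases with it but not perfectly
df = pd.DataFrame({
    "sat_score": [1100, 1200, 1300, 1400],
    "rises":     [10, 22, 29, 41],
    "falls":     [40, 30, 20, 10],
})

# One column of the correlation matrix, sorted ascending:
# most negative correlations first, the target itself (1.0) last
correlations = df.corr(numeric_only=True)["sat_score"].sort_values()
print(correlations.index.tolist())  # ['falls', 'rises', 'sat_score']
```

The same series sorted with ascending=False puts the strongest positive correlations first, which is exactly what the two print statements in the project do.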
Step 5: Analyzing Survey Results
As a former high school teacher myself, I've seen firsthand how factors like school safety and academic engagement can affect student performance. Let's look at how survey responses correlate with SAT scores. The survey includes questions about academics, safety, communication, and engagement, answered by students, teachers, and parents:
# Explore correlations between survey fields and SAT scores
# (sat_score isn't in survey_fields, so include it in the selection;
# numeric_only=True drops the non-numeric DBN column)
survey_correlations = combined[survey_fields + ["sat_score"]].corr(numeric_only=True)["sat_score"]
survey_correlations = survey_correlations.drop("sat_score").sort_values(ascending=False)
# Visualize the correlations
survey_correlations.plot.barh(figsize=(10, 10))
plt.axvline(x=0.25, linestyle='--')
plt.title("Correlation between Survey Data and SAT Scores")
plt.xlabel("Correlation Coefficient")
When you run this code, you'll see a horizontal bar chart showing the correlation between each survey field and SAT scores. The dashed line at 0.25 helps us identify the stronger correlations (those extending past it):
The strongest correlations appear to be with:
- Number of student and parent survey responses (N_s and N_p) ― suggesting that more engaged school communities have higher SAT scores
- Student perceptions of academics (aca_s_11) ― when students view academics positively, SAT scores tend to be higher
- Safety perceptions (saf_s_11, saf_t_11, saf_tot_11) ― schools perceived as safer tend to have higher SAT scores
Instructor Insight: The strong correlation between survey response rates and SAT scores isn't surprising ― schools where parents and students take the time to complete surveys are likely communities where education is highly valued. That engagement often translates into better academic preparation and support for standardized tests like the SAT.
Let's explore the relationship between safety perceptions and SAT scores in more detail:
combined.plot.scatter(x="saf_s_11", y="sat_score")
plt.title("Safety vs. SAT Scores")
plt.xlabel("Student Safety Perception")
plt.ylabel("Average SAT Score")
The resulting scatter plot shows a general upward trend – as safety perception increases, SAT scores tend to increase as well. But there's considerable variation, suggesting that safety isn't the only factor at play.
Let's check whether there are differences in safety perceptions across NYC boroughs:
borough_safety = combined.groupby("borough")[["sat_score", "saf_s_11", "saf_t_11", "saf_p_11", "saf_tot_11"]].mean()
borough_safety
This produces a table showing average safety perceptions and SAT scores by borough:
sat_score saf_s_11 saf_t_11 saf_p_11 saf_tot_11
borough
Bronx 1126.692308 6.606087 6.539855 7.585507 6.837678
Brooklyn 1230.068966 6.836087 7.259420 7.942391 7.230797
Manhattan 1340.052632 7.454412 7.548529 8.603676 7.790441
Queens 1317.279412 7.149338 7.565476 8.193878 7.533921
Staten Island 1376.600000 6.450000 7.325000 7.966667 7.175000
Interestingly, while there are significant differences in SAT scores across boroughs (with the Bronx having the lowest average and Staten Island the highest), the differences in safety perceptions are less pronounced. This suggests that while safety may correlate with SAT scores overall, borough-level safety differences aren't necessarily driving borough-level SAT score differences.
Instructor Insight: One pattern I found particularly interesting when exploring this data is that parent perceptions of safety are consistently about 1.5 points higher than student perceptions across all boroughs. From my experience in schools, this makes sense – parents aren't in the building every day and may not be aware of the day-to-day dynamics that students experience. Students have a much more immediate and granular understanding of safety issues in their schools.
If you're interested in learning more about data aggregation techniques like the groupby() method we used above, check out Dataquest's Pandas Fundamentals course.
Step 6: Race Demographics and SAT Scores
The correlation analysis showed that race demographics have strong relationships with SAT scores. Let's visualize these correlations:
race_fields = [
"white_per",
"asian_per",
"black_per",
"hispanic_per"
]
# Correlation between race demographics and SAT scores
# (include sat_score in the selection so .corr() can measure against it)
race_correlations = combined[race_fields + ["sat_score"]].corr(numeric_only=True)["sat_score"].drop("sat_score")
race_correlations.plot.barh()
plt.title("Correlation between Race Demographics and SAT Scores")
plt.xlabel("Correlation Coefficient")
When you run this code, you'll see a bar chart showing that white and Asian percentages have strong positive correlations with SAT scores, while Black and Hispanic percentages have negative correlations.
Let's take a closer look at the relationship between Hispanic percentage and SAT scores, which had one of the strongest negative correlations:
combined.plot.scatter(x="hispanic_per", y="sat_score")
# Add a regression line (numpy is already imported as np)
x = combined["hispanic_per"]
y = combined["sat_score"]
m, b = np.polyfit(x, y, 1)
plt.plot(x, m*x + b, '-', color='red')
plt.title("Hispanic Population Percentage vs. SAT Scores")
plt.xlabel("Hispanic Population Percentage")
plt.ylabel("Average SAT Score")
The scatter plot with the regression line clearly shows the negative relationship – as the percentage of Hispanic students increases, the average SAT score tends to decrease.
Instructor Insight: When creating this visualization, I chose to add a regression line to make the trend clearer. It's a small but powerful enhancement to our scatter plot – while the naked eye can detect a general pattern in the data points, the regression line provides a more precise representation of the relationship.
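To see what np.polyfit returns, here's a sketch on toy points that lie exactly on a line – the degree-1 fit recovers the slope and intercept:

```python
import numpy as np

# Toy points lying exactly on y = -5x + 1500
x = np.array([0.0, 20.0, 40.0, 60.0])
y = -5 * x + 1500

# Degree-1 polynomial fit returns the coefficients (slope, intercept)
m, b = np.polyfit(x, y, 1)
print(round(m, 6), round(b, 6))  # -5.0 1500.0
```

With real, noisy data the line is the least-squares best fit rather than an exact match, which is exactly what we want overlaid on the scatter plot.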
To understand this relationship better, let's look at the schools with very high Hispanic populations:
selected_columns = ["school_name", "num_of_sat_test_takers", "hispanic_per", "sat_score", "sped_per", "ell_per", "school_dist", "borough", "saf_s_11"]
high_hispanic = combined.sort_values("hispanic_per", ascending=False)[selected_columns].head(5)
high_hispanic
This outputs the five schools with the highest Hispanic percentages:
school_name num_of_sat_test_takers hispanic_per sat_score sped_per ell_per school_dist borough saf_s_11
58 WASHINGTON HEIGHTS EXPEDITIONARY LEARNING SCHOOL 66 93.918919 1126.515152 14.779270 12.018348 06 Manhattan 6.816092
62 GREGORIO LUPERON HIGH SCHOOL 53 96.164384 1021.245283 15.963855 51.807229 06 Manhattan 6.885291
66 MULTICULTURAL HIGH SCHOOL 23 97.297297 1023.217391 16.447368 34.210526 09 Brooklyn 7.656250
68 PAN AMERICAN INTERNATIONAL HIGH SCHOOL 44 97.917847 1045.750000 10.363636 75.454545 24 Queens 7.976744
69 INTERNATIONAL HIGH SCHOOL AT UNION SQUARE 49 94.871795 1172.653061 10.638298 84.042553 02 Manhattan 7.767857
A pattern emerges: these schools have very high percentages of English Language Learners (ELL), ranging from 12% to 84%. This suggests that language barriers may be a significant factor in the correlation between Hispanic percentage and SAT scores, since the SAT is administered in English and requires strong English language skills.
Instructor Insight: To better understand these schools, I did some additional research (as we would in a real data analysis project) by searching for information about them online. What I found was fascinating ― all of these schools are part of New York City's Outward Bound program, which specifically focuses on serving English language learners. This provides important context for our numerical findings. Although these schools have lower SAT scores, they're actually specialized programs designed to support a specific student population with unique challenges.
For comparison, let's look at schools with very low Hispanic percentages and high SAT scores:
low_hispanic_high_sat = combined[(combined["hispanic_per"] < 10) & (combined["sat_score"] > 1800)].sort_values("sat_score", ascending=False)[selected_columns].head(5)
low_hispanic_high_sat
This shows us the top-performing schools with low Hispanic populations:
school_name num_of_sat_test_takers hispanic_per sat_score sped_per ell_per school_dist borough saf_s_11
37 STUYVESANT HIGH SCHOOL 832 2.699055 2096.105769 0.321812 0.000000 02 Manhattan 8.092486
86 BRONX HIGH SCHOOL OF SCIENCE, THE 731 7.132868 1941.606018 0.395778 0.197889 10 Bronx 7.058323
83 STATEN ISLAND TECHNICAL HIGH SCHOOL 226 7.563025 1933.893805 0.883392 0.000000 31 Staten Island 8.333333
51 HIGH SCHOOL OF AMERICAN STUDIES AT LEHMAN COLLEGE 114 9.322034 1925.657895 0.855856 0.000000 10 Bronx 8.545455
32 QUEENS HIGH SCHOOL FOR THE SCIENCES 78 9.230769 1877.307692 0.000000 0.000000 28 Queens 8.592593
These schools have very different profiles ― they have virtually no English Language Learners and very low percentages of special education students.
Instructor Insight: Upon further investigation, I found that all five of these are specialized high schools in NYC that select students based on academic performance through competitive admissions processes. What's particularly noteworthy is that despite New York City having a diversity initiative for specialized high schools, none of these top-performing schools have elected to participate in it. This raises important questions about educational equity and access to high-quality education across different demographic groups.
This analysis suggests that the SAT may disadvantage English Language Learners, who may struggle with the test's language demands regardless of their academic abilities.
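The filter used above combines two boolean masks with &; here's a minimal sketch of the same technique on toy data:

```python
import pandas as pd

# Toy school data (made-up values)
df = pd.DataFrame({
    "hispanic_per": [5.0, 50.0, 8.0],
    "sat_score": [1900, 1200, 1500],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# Each side must be wrapped in parentheses because & binds tighter than <.
mask = (df["hispanic_per"] < 10) & (df["sat_score"] > 1800)
print(df[mask].index.tolist())  # [0] — only the first row passes both tests
```

Indexing the DataFrame with the combined mask keeps only the rows where both conditions hold.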
Step 7: Gender and SAT Scores
Let's examine how gender demographics relate to SAT scores:
gender_fields = ["male_per", "female_per"]
# Include sat_score in the selection so .corr() can measure against it
gender_correlations = combined[gender_fields + ["sat_score"]].corr(numeric_only=True)["sat_score"].drop("sat_score")
gender_correlations.plot.barh()
plt.title("Correlation between Gender Demographics and SAT Scores")
plt.xlabel("Correlation Coefficient")
The correlation between gender and SAT scores is relatively weak but interesting – there's a positive correlation with female percentage and a negative correlation with male percentage.
Let's visualize this relationship:
combined.plot.scatter(x="female_per", y="sat_score")
# axvspan shades a vertical band: the 40-60% range lives on the x-axis
plt.axvspan(40, 60, alpha=0.2, color='red')
plt.title("Female Population Percentage vs. SAT Scores")
plt.xlabel("Female Population Percentage")
plt.ylabel("Average SAT Score")
The highlighted band represents what might be considered a "normal" gender balance (40-60% female). We can see schools with high SAT scores across various gender compositions, but let's check whether there's a difference in average SAT scores for schools with very high or low female percentages:
print("Average SAT score for schools with >60% female students:", combined[combined["female_per"] > 60]["sat_score"].mean())
print("Average SAT score for schools with <40% female students:", combined[combined["female_per"] < 40]["sat_score"].mean())
This gives us:
Average SAT score for schools with >60% female students: 1301.8308823529412
Average SAT score for schools with <40% female students: 1204.4375
There's a gap of nearly 100 points! Schools with predominantly female populations have higher average SAT scores than schools with predominantly male populations.
Let's look at some of these high-female, high-SAT schools:
high_female_columns = ["school_name", "female_per", "sat_score", "sat_m", "sat_r", "sat_w"]
combined.sort_values(["female_per", "sat_score"], ascending=False)[high_female_columns].head(5)
The result shows specialized schools focused on the arts, which are historically popular with female students. As with the race analysis, we're seeing that specialized schools play a role in these demographic patterns.
Instructor Insight: This finding particularly resonates with my personal experience. Before becoming a data analyst, I attended a performing arts-focused high school that had a similar gender distribution ― over 60% female. These specialized schools often have strong academic programs alongside their focus areas, which may help explain the correlation between female percentage and SAT scores. It's a reminder that behind every data point is a complex institutional story that raw numbers can't fully capture.
Step 8: AP Test Participation and SAT Scores
Finally, let's look at Advanced Placement (AP) test participation and its relationship with SAT scores. AP courses are college-level classes that high school students can take, potentially earning college credit:
# Create the percentage of AP test takers relative to total enrollment
# (note the trailing space in the "AP Test Takers " column name)
combined["ap_per"] = combined["AP Test Takers "] / combined["total_enrollment"]
# Visualize the relationship between AP test percentage and SAT scores
combined.plot.scatter(x="ap_per", y="sat_score")
plt.title("AP Test Takers Percentage vs. SAT Scores")
plt.xlabel("AP Test Takers Percentage")
plt.ylabel("Average SAT Score")
The scatter plot doesn't show a very clear trend. There appears to be a vertical line of points around the 1200 mark ― that's actually an artifact of our data processing. When we filled missing values with the mean, it created this vertical pattern.
Instructor Insight: When analyzing real-world data, it's important to recognize these kinds of patterns that result from data cleaning decisions. I always recommend adding notes about such artifacts in your analysis to prevent misinterpretation. If this were a professional report, I would explicitly mention that this vertical line doesn't represent a natural clustering in the data.
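You can reproduce this artifact in miniature: after filling NaNs with the mean, the mean value suddenly accounts for a large share of the column, which shows up as a vertical line on a scatter plot:

```python
import numpy as np
import pandas as pd

# Toy SAT-like column where most values are missing
s = pd.Series([1000.0, np.nan, np.nan, np.nan, 1400.0])

# All three NaNs collapse onto the mean, 1200.0
filled = s.fillna(s.mean())
print(filled.value_counts().loc[1200.0])  # 3
```

Checking value_counts() for a suspicious spike at the column mean is a quick way to confirm that a visual pattern is a fill artifact rather than real structure.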
Let's calculate the correlation to get a more precise measure:
correlation = combined["ap_per"].corr(combined["sat_score"])
print(f"Correlation between AP test percentage and SAT scores: {correlation:.4f}")
This gives:
Correlation between AP test percentage and SAT scores: 0.0566
The correlation is very weak (less than 0.06), suggesting that the percentage of students taking AP tests at a school is not strongly related to the school's average SAT score. That's somewhat surprising, as we might expect schools with more AP participants to have higher SAT scores.
Results
Looking beyond the code and the outputs, this project demonstrates the power of exploratory data analysis to uncover social patterns and potential inequities in educational systems. As someone who worked in education before transitioning to data analysis, I find these kinds of investigations particularly meaningful.
Our analysis reveals several important insights about NYC high schools and SAT scores:
- Socioeconomic factors have some of the strongest correlations with SAT scores. Schools with higher percentages of students on free/reduced lunch tend to have lower SAT scores.
- English Language Learning appears to be a significant factor. Schools with high percentages of ELL students (many of which have high Hispanic populations) tend to have lower SAT scores, which raises questions about the fairness of using the SAT for college admissions.
- Specialized schools show a different pattern. NYC's specialized high schools, which select students based on academic performance, have very high SAT scores and different demographic compositions than other schools.
- Safety and academic engagement correlate with higher SAT scores, though the relationship is complex and varies across boroughs.
- Gender composition shows interesting patterns, with predominantly female schools having higher average SAT scores than predominantly male schools.
These findings suggest that the SAT may not be an entirely fair measure of student aptitude, as scores correlate strongly with factors like English language proficiency and socioeconomic status. This aligns with broader criticisms of standardized testing, which have led many colleges to move away from requiring SAT scores in recent years.
Next Steps
This analysis barely scratches the surface of what could be done with this rich dataset. With over 160 columns of data, there are many more relationships to explore. Based on my experience with educational data, here are some compelling directions for further analysis:
- Analyze class size: As a former teacher, I've seen how class size affects learning. I'd be curious to see whether smaller classes correlate with higher SAT scores, particularly in schools serving disadvantaged populations.
- Explore school districts rather than boroughs: Our analysis looked at borough-level patterns, but NYC has numerous school districts that may show more nuanced patterns. Grouping by district instead of borough could reveal more localized relationships between demographics, safety, and test scores.
- Consider property values: Education and housing are closely linked. An analysis combining this dataset with property value information could reveal interesting patterns about economic segregation and educational outcomes.
- Create a more equitable school ranking system: Most ranking systems heavily prioritize test scores, which our analysis suggests may disadvantage certain populations. Developing a more holistic ranking system that accounts for factors like English language learner support, improvement over time, and socioeconomic context could provide a fairer assessment of school quality.
To take your data analysis skills further, consider exploring Dataquest's other courses and guided projects in the Data Scientist in Python career path.
I personally made the transition from teaching to data analysis through platforms like Dataquest, and working on projects like this one was instrumental in building my skills and portfolio. Don't be afraid to spend several hours on a project like this – the deeper you dig, the more insights you'll uncover, and the more you'll learn.
If you have questions or want to share your work on this project, feel free to join the discussion on our Community forums. You can even tag me (@Anna_Strahl) if you'd like specific feedback on your approach!