Skip to content

Comments

Ensured Sanitized Column Names are Unique in AutoML CLI#5177

Merged
mstfbl merged 6 commits intodotnet:masterfrom
mstfbl:EnsureSanitizedCNsinAutoMLCLI
Jun 29, 2020
Merged

Ensured Sanitized Column Names are Unique in AutoML CLI#5177
mstfbl merged 6 commits intodotnet:masterfrom
mstfbl:EnsureSanitizedCNsinAutoMLCLI

Conversation

@mstfbl
Copy link
Contributor

@mstfbl mstfbl commented May 29, 2020

Fix #3902. This PR ensures that column names that are pulled from datasets and used to generate C# model input are unique. As a result, naming collisions are prevented and are valid C# variables. This is done by adding the column naming-collision fix in ML.CodeGenerator.Utilities.Utils.GenerateClassLabels.

This naming-collision fix consists of appending the type of the variable to the name of the second duplicate. For example, let us have 3 columns in a sample dataset:

int id;
int Country;
string country;

This is represented with TextLoader.Columns[] as:

new TextLoader.Column[]
{
    new TextLoader.Column(){ Name = "id", Source = new TextLoader.Range[]{new TextLoader.Range(0) }, DataKind = DataKind.Int32 },
    new TextLoader.Column(){ Name = "Country", Source = new TextLoader.Range[]{new TextLoader.Range(1) }, DataKind = DataKind.Int32 },
    new TextLoader.Column(){ Name = "country", Source = new TextLoader.Range[]{new TextLoader.Range(2) }, DataKind = DataKind.String },
}

For string country to be a valid class variable, its first letter has to be capitalized, which would result in a naming collision. This PR ensures that this doesn't happen, and the generated code becomes:

public class ModelInput
{
    [ColumnName("id"), LoadColumn(0)]
    public float Id_col_0 { get; set; }


    [ColumnName("Country"), LoadColumn(1)]
    public float Country_col_1 { get; set; }


    [ColumnName("country"), LoadColumn(2)]
    public string Country_col_2 { get; set; }
}

In addition, this PR ensures non-unique column names are properly named and differentiated, as shown below:

new TextLoader.Column[]
{
    new TextLoader.Column(){ Name = "column1", Source = new TextLoader.Range[]{new TextLoader.Range(0) }, DataKind = DataKind.String },
    new TextLoader.Column(){ Name = "column1_string", Source = new TextLoader.Range[]{new TextLoader.Range(1) }, DataKind = DataKind.String },
    new TextLoader.Column(){ Name = "column1_string_1", Source = new TextLoader.Range[]{new TextLoader.Range(2) }, DataKind = DataKind.String },
    new TextLoader.Column(){ Name = "column1_string_2", Source = new TextLoader.Range[]{new TextLoader.Range(3) }, DataKind = DataKind.String },
    new TextLoader.Column(){ Name = "Column1", Source = new TextLoader.Range[]{new TextLoader.Range(4) }, DataKind = DataKind.String },
    new TextLoader.Column(){ Name = "column1_int", Source = new TextLoader.Range[]{new TextLoader.Range(5) }, DataKind = DataKind.Int32 },
    new TextLoader.Column(){ Name = "column1_string", Source = new TextLoader.Range[]{new TextLoader.Range(6) }, DataKind = DataKind.String },
    new TextLoader.Column(){ Name = "column1", Source = new TextLoader.Range[]{new TextLoader.Range(7) }, DataKind = DataKind.Int32 }
}
public class ModelInput
{
    [ColumnName("column1"), LoadColumn(0)]
    public string Column1_col_0{get; set;}

    [ColumnName("column1_string"), LoadColumn(1)]
    public string Column1__string_col_1{get; set;}

    [ColumnName("column1_string_1"), LoadColumn(2)]
    public string Column1_string_1_col_2{get; set;}

    [ColumnName("column1_string_2"), LoadColumn(3)]
    public string Column1_string_2_col_3{get; set;}

    [ColumnName("Column1"), LoadColumn(4)]
    public string Column1_col_4{get; set;}

    [ColumnName("column1_int"), LoadColumn(5)]
    public int Column1_int_col_5{get; set;}

    [ColumnName("column1_string"), LoadColumn(6)]
    public string Column1_string_col_6{get; set;}

    [ColumnName("column1"), LoadColumn(7)]
    public int Column1_col_7{get; set;}
}

This PR also adds unit tests to test naming collision scenarios in both CodeGenTests and ConsoleCodeGeneratorTests.

In addition, this PR also cleans up duplicate code with respect to the ML.CodeGenerator.Utilities.Utils.GenerateClassLabels function, by linking its duplicate function ML.CodeGenerator.CodeGenerator.GenerateClassLabels to the former.

@mstfbl mstfbl requested a review from justinormont May 29, 2020 05:35
/// <summary>
/// Utilize <see cref="ML.CodeGenerator.Utilities.Utils.GenerateClassLabels(ColumnInferenceResults, IDictionary{string, CodeGeneratorSettings.ColumnMapping})"/>
/// </summary>
internal IList<string> GenerateClassLabels(IDictionary<string, CodeGeneratorSettings.ColumnMapping> columnMapping = default)
Copy link
Contributor Author

@mstfbl mstfbl May 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've decided to keep this GenerateClassLabels as it is possible that a CodeGenerator instance can call this function, as it does below:

var actualLabels = codeGenerator.GenerateClassLabels();

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Name = "ModelInput.cs",
};
NamerFactory.AdditionalInformation = "null_map";
NamerFactory.AdditionalInformation = info + "_null_map";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this change to produce and check for 4 "approved" files within 1 unit test, instead of making a new unit test consisting of mostly duplicate code from this test.

Name = "ModelInput.cs",
};
NamerFactory.AdditionalInformation = "map";
NamerFactory.AdditionalInformation = info + "_map";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same logic as above.



}
public class CodeGenTestData
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the data being fed to the unit test ClassLabelGenerationTest. I went this way as it is more efficient to iterate through test data in one ClassLabelGenerationTest unit test, than to have multiple unit tests that consist of duplicate code.

@mstfbl mstfbl marked this pull request as ready for review May 29, 2020 06:19
@mstfbl mstfbl requested a review from a team as a code owner May 29, 2020 06:19
@mstfbl mstfbl marked this pull request as draft June 10, 2020 22:06
@codecov
Copy link

codecov bot commented Jun 17, 2020

Codecov Report

Merging #5177 into master will decrease coverage by 0.00%.
The diff coverage is 98.97%.

@@            Coverage Diff             @@
##           master    #5177      +/-   ##
==========================================
- Coverage   73.54%   73.53%   -0.01%     
==========================================
  Files        1010     1014       +4     
  Lines      188029   188879     +850     
  Branches    20266    20329      +63     
==========================================
+ Hits       138288   138901     +613     
- Misses      44272    44472     +200     
- Partials     5469     5506      +37     
Flag Coverage Δ
#Debug 73.53% <98.97%> (-0.01%) ⬇️
#production 69.30% <89.28%> (-0.08%) ⬇️
#test 87.50% <100.00%> (+0.12%) ⬆️
Impacted Files Coverage Δ
src/Microsoft.ML.CodeGenerator/Utils.cs 59.20% <88.46%> (+4.84%) ⬆️
...odeGenerator/CodeGenerator/CSharp/CodeGenerator.cs 67.31% <100.00%> (+0.64%) ⬆️
...r.Tests/ApprovalTests/ConsoleCodeGeneratorTests.cs 100.00% <100.00%> (ø)
...t/Microsoft.ML.CodeGenerator.Tests/CodeGenTests.cs 100.00% <100.00%> (ø)
...osoft.ML.KMeansClustering/KMeansPlusPlusTrainer.cs 83.60% <0.00%> (-7.27%) ⬇️
src/Microsoft.ML.FastTree/Training/StepSearch.cs 57.42% <0.00%> (-4.96%) ⬇️
src/Microsoft.ML.Data/Training/TrainerUtils.cs 65.86% <0.00%> (-3.82%) ⬇️
src/Microsoft.ML.AutoML/Utils/BestResultUtil.cs 53.84% <0.00%> (-3.69%) ⬇️
...crosoft.ML.StandardTrainers/Standard/SdcaBinary.cs 85.06% <0.00%> (-3.25%) ⬇️
...rosoft.ML.AutoML/ColumnInference/TextFileSample.cs 59.60% <0.00%> (-2.65%) ⬇️
... and 43 more

@mstfbl mstfbl requested a review from LittleLittleCloud June 18, 2020 01:55
@mstfbl mstfbl marked this pull request as ready for review June 18, 2020 02:27
@mstfbl mstfbl marked this pull request as draft June 18, 2020 19:34
@mstfbl mstfbl marked this pull request as ready for review June 19, 2020 02:12
@mstfbl mstfbl changed the title Ensure Sanitized Column Names are Unique in AutoML CLI Ensured Sanitized Column Names are Unique in AutoML CLI Jun 19, 2020
@mstfbl mstfbl requested review from JakeRadMSFT and harishsk June 19, 2020 03:00
/// <summary>
/// Utilize <see cref="ML.CodeGenerator.Utilities.Utils.GenerateClassLabels(ColumnInferenceResults, IDictionary{string, CodeGeneratorSettings.ColumnMapping})"/>
/// </summary>
internal IList<string> GenerateClassLabels(IDictionary<string, CodeGeneratorSettings.ColumnMapping> columnMapping = default)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

result.Add(sb.ToString());
result.Add("\r\n");
}
for (int i = 1; i < result.Count; i+=3)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, these generated labels are generated using Utils.Normalize in GenerateSampleData, and hence are not added the _col_x suffix for conflicting column names. I am making an issue to track this change. Thanks!

@mstfbl mstfbl requested a review from LittleLittleCloud June 29, 2020 21:12
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ensure Sanitized Column Names are Unique in AutoML

2 participants